blog.gkaindl.com » Outsmarting Spam-Bots

blog.gkaindl.com

nerd nouveau

Outsmarting Spam-Bots

Comment SPAM has already become a significant problem on my blog. While Akismet detects SPAM nicely, it uses a statistical approach. On the one hand, that means SPAM comments occasionally get through, so I have to catch them during comment moderation. On the other hand, I do not trust it enough to simply let it delete potential SPAM: instead, it collects the marked comments so that I can review and delete them manually if they actually are SPAM. However, during the last 2 weeks, I’ve had at least 200 new SPAM comments every day… The review process has become quite tedious.

What I want is something that weeds out obvious SPAM comments immediately, with 100-percent precision. I don’t want anybody’s comment to end up as a deleted false positive. Akismet should only have a look at those comments that still get through somehow, and at Trackback SPAM that cannot be caught so easily.

Obviously, I want some sort of CAPTCHA. Then again, the most common ones (you know, those using distorted text in little pictures that you have to type into an input field) are totally unsuitable here.

  • They are overkill: The point is not to weed out all machine-generated comments, but only those generated by SPAM-bots. It doesn’t make sense for a spammer to target my blog specifically, so I do not need full protection from machine-generated comments, only protection from mass-commenting programs. There’s a difference.
  • Filling out a CAPTCHA test sucks. It sucks even more if you are, for example, visually impaired, so I would need to provide an audio version. That is way too much work (and server load when auto-generated)!

On my hunt for such a solution, the first thing I did was to read through the “spam” section of the official WordPress plugin repository. I stumbled upon the excellent JSSpamBlock by Paul Butler.

It’s based on the observation that comment SPAM-bots do not interpret JavaScript (yet). All you need to do to stop the SPAM is to put an additional verification field into the comment form. The clever part is that JSSpamBlock fills out this field automatically and hides it if JavaScript is enabled in a visitor’s browser. Only if JavaScript is disabled does the user have to fill in the field manually. Visitors with JavaScript-enabled browsers won’t even notice that the SPAM protection is there! Brilliant!

What bothers me, however, is how it works internally: A hidden field containing an MD5 hash of the randomly generated, 4-digit verification number and an input field for the number itself get added to the comment form. The input field gets auto-filled and hidden if JavaScript is enabled. The number/hash pairs even have their own expiration date, so they are only valid for one day.

To keep track of these pairs, JSSpamBlock uses its own table in the WordPress database. Every time somebody opens a page with a comment form, a new entry is inserted into the table, even if the visitor doesn’t have the intention to post a comment. Conversely, when a comment is posted, the database entry for the current verification hash and number is retrieved and then deleted from the database. Expired pairs get deleted, too.
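
To make the cost of this bookkeeping concrete, here is a minimal sketch of the behaviour described above. It is not JSSpamBlock's actual code: the `PairTable` class and its method names are hypothetical, a Python dict stands in for the WordPress database table, and the random nonce mixed into the hash is an assumption added so that two visitors who draw the same number get different tokens.

```python
import hashlib
import secrets
import time

ONE_DAY = 24 * 60 * 60  # the post describes pairs as valid for one day


class PairTable:
    """In-memory stand-in for the plugin's database table (hypothetical)."""

    def __init__(self):
        self._rows = {}  # token -> (number, expires_at)

    def issue(self, now=None):
        """Runs on every render of the comment form: one new row per page view."""
        now = time.time() if now is None else now
        number = f"{secrets.randbelow(10_000):04d}"  # 4-digit verification number
        nonce = secrets.token_hex(8)  # assumption: avoids token collisions
        token = hashlib.md5((number + nonce).encode()).hexdigest()
        self._rows[token] = (number, now + ONE_DAY)
        return token, number

    def redeem(self, token, submitted, now=None):
        """Runs when a comment is posted: purge, look up, delete, compare."""
        now = time.time() if now is None else now
        # Expired pairs get deleted.
        self._rows = {t: r for t, r in self._rows.items() if r[1] > now}
        # The row is removed on lookup, so each pair works only once.
        row = self._rows.pop(token, None)
        return row is not None and row[0] == submitted
```

Note that `issue` writes a row for every page view, whether or not the visitor ever comments, which is the overhead discussed in the surrounding paragraphs.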

I think that this is very inefficient. Also, it makes it hard to use the plugin together with the excellent WP-Cache (which I’m using, too), since the code generating the key/hash pair must be called for every visitor. The comment form has to stay dynamic in a WP-Cached environment.

The question is: Why? Using the hashed verification numbers in the comment form (and thus needing a database) would only make sense if the plain text number did not appear in the generated page at all. But of course it does! It has to be visible in the “Enter XXXX here” message for users that do not have JavaScript enabled. It also has to be in the JavaScript code that auto-fills the input field! Consequently, using the hashes doesn’t create any security benefit. At least none that I can see. Maybe I’m overlooking something, but I don’t think so.

So what I did was to rewrite the plugin to run without an extra database table and without needing to stay dynamic in a WP-Cache environment. My version of the plugin simply generates a random 4-digit number, stores it in a hidden input field in the comment form and adds a text field where the user can enter this number. If JavaScript is enabled, the number is entered automatically, and the additional input field is hidden. When a comment is posted, two extra variables are sent to the server: the hidden input field containing the reference number, and the user input field. If the two numbers do not match, the comment is considered SPAM. This is basically the same functionality as JSSpamBlock, but without any additional database transactions or WP-Cache problems.
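
The stateless check can be sketched roughly as follows. This is an illustration in Python rather than the plugin's own code: the two field names are the ones quoted in the first comment below, while the function names, the HTML markup and the element IDs are assumptions made for the example.

```python
import secrets

HIDDEN_FIELD = "simpleAntiSpamHiddenField"   # carries the reference number
INPUT_FIELD = "simpleAntiSpamInputField_ID"  # filled by JavaScript or by hand


def new_challenge():
    """Random 4-digit verification number, kept as a string."""
    return f"{secrets.randbelow(10_000):04d}"


def render_fields(number):
    """HTML fragment for the comment form (markup is hypothetical)."""
    return f"""
<input type="hidden" name="{HIDDEN_FIELD}" value="{number}">
<p id="antispam-row">Enter {number} here:
  <input type="text" name="{INPUT_FIELD}" id="antispam-input"></p>
<script>
  // With JavaScript enabled, fill in the number and hide the row.
  document.getElementById("antispam-input").value = "{number}";
  document.getElementById("antispam-row").style.display = "none";
</script>
"""


def is_spam(form):
    """A comment counts as SPAM unless both submitted numbers match."""
    reference = form.get(HIDDEN_FIELD, "")
    typed = form.get(INPUT_FIELD, "").strip()
    return reference == "" or reference != typed
```

Because nothing is stored on the server, the rendered form can be cached as-is. The flip side is that the server has no way of telling whether it ever issued the submitted number.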

Again, I don’t think my solution is less secure than the “official” JSSpamBlock: While the latter has the verification number in plain text both in the JavaScript code and the “Enter XXXX here” user message, my rendition additionally has the plain text number in a hidden form input field (rather than a hash). The point is, we both have the number in plain text, so a bot could work around it even if it is not JavaScript-aware, simply by scraping the number from the generated page. Both solutions rely on being “slightly different than everybody else”, so that dumb mass-commenting doesn’t work, rather than on a truly secure CAPTCHA approach.

Don’t get me wrong, though: I still think that the idea behind JSSpamBlock is absolutely awesome. I’m just somebody who likes to cut down database access to the absolute minimum.

I’ve deployed my JSSpamBlock-based anti-SPAM solution today, so I’ll just lean back now and hope that there will be significantly fewer SPAM comments in my Akismet inbox tomorrow…

7 Comments


  1.
    Paul
    wrote on Apr 19, 2007 at 2:49

    Hi Georg, thanks for the comments. I enjoyed reading this post because I can see that you really get the idea behind JSSpamBlock more than a lot of the users I have talked to.

    I can see your point about the database, but my choices weren’t just arbitrary and I can explain the reasoning behind storing the information in a database.

    Although I don’t make mention of it, JSSpamBlock actually stops spam in two ways. The first way is the most obvious: it makes sure the client is executing JavaScript, or failing that, that a human is on the other end to type in the code.

    The second way is less obvious: the commenter is forced to make one request to the comment form for every comment posted. This is not a problem for a human poster, since the comment always has to be entered through the form, but spam bots tend to skip that step and post right to the wp-comments-post.php file. Over half of the bots (in my test) would be blocked just by verifying that the hash was sent by the server and has not been used. By forcing the code to be checked against server-side records, the plugin ensures that each comment corresponds to one request of the comment form. Otherwise, it would be trivial to write a bot that gets around this by sending “simpleAntiSpamHiddenField=1234&simpleAntiSpamInputField_ID=1234” on each POST request.

    It would still be trivial to write a bot to defeat JSSpamBlock, but it would require some text parsing and, more importantly, one GET request per comment POST. On the average spam bot, the first way would require changing a few lines of code, while the second would require a small but significant addition to fetch and parse the forms.

    Indeed, the point behind the plugin was to be different enough that no one would bother to write the trivial code required to get around it, so your plugin is actually more in line with my original vision than mine is. The problem was that once I decided to distribute it, I realized that if enough people use it, it could become worth it for a spammer to write a few lines to get around it. The only way to stop this was to verify that the numbers sent to the server had been generated on the server, which required a database.

    The reason I used a hash is not all that important, a numeric ID would do just as well. The only problem is that the ID would give away how many hits your blog has had since adding the plugin. For most blogs this is not an issue, but some people would rather keep that private so I use a hash instead.

    I look forward to seeing how it goes. If you are going to release your code, let me know and I may send a link your way.


  2.
    Paul
    wrote on Apr 19, 2007 at 2:50

    Btw, sorry for the long comment… I didn’t realize how long it was until it was posted.


  3.
    georg
    wrote on Apr 19, 2007 at 11:34

    Paul, that’s a very valid point, and I hadn’t thought about that (I knew I must’ve been missing something :-) ).

    However, how about taking a hidden (i.e. user-specifiable) secret string, some post-specific number and the randomly generated 4-digit verification code, computing the MD5 hash of all three and sticking it into the hidden form field?

    In the comment verification function, we could simply recompute the hash since we know the secret string and the post-specific number without accessing the database, using the verification number supplied by the user.

    This way, a SPAM-bot would have to do at least one GET request for every post they want to comment on (granted, once they’ve done that, they could send unlimited SPAM comments to that one post, but usually, SPAM-bots only send one comment per post), but we could still get around the database access.

    I’ll hack up some code later today and send it your way!


  4.
    Paul Butler
    wrote on Apr 19, 2007 at 13:10

    Good idea. I found that the most active bot on my test site was making multiple comments to the same post, but that bot did not even make a single request for the form (or if it did, it didn’t parse it), so it would have been stopped anyway.

    The only difference is that if someone wanted to write a custom script to attack the blog, they would have to make fewer page requests. I think this is a small enough problem that the benefits outweigh it.

    I think a future version should work both ways, with a toggle in the options panel.


  5.
    georg
    wrote on Apr 19, 2007 at 13:59

    I’ve hacked together a version implementing this method. You’ve got mail!


  6.
    Paul Butler.org » JSSpamBlock Modifications
    wrote on May 21, 2007 at 19:20

    [...] Georg Kaindl and I had a discussion on whether a database was really necessary (he made some excellent points on why this is not the case, though I still maintain that the extra protection is worth the cost), [...]


  7.
    Paul Butler.org » JSSpamBlock-like protection for any website
    wrote on Oct 11, 2007 at 15:59

    [...] comments about the database being unnecessary, and I wrote a more lengthy (compared to this post) response explaining why it was. He then came up with a clever solution - by including the post’s ID in [...]
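
The database-free variant proposed in comment 3 could look roughly like the sketch below. It only illustrates the idea as described there (MD5 over a secret string, a post-specific number and the verification code). The function names, the `:` separator and the example secret are assumptions, and a keyed construction such as HMAC would be the more conventional choice for this kind of token.

```python
import hashlib


def make_token(secret, post_id, code):
    """Hash that goes into the hidden form field when the page is rendered."""
    data = f"{secret}:{post_id}:{code}".encode()
    return hashlib.md5(data).hexdigest()


def verify(secret, post_id, token, typed_code):
    """Recompute the hash from the typed code; no database lookup needed."""
    return make_token(secret, post_id, typed_code.strip()) == token


# Example: the form for post 42 is rendered with code "1234".
token = make_token("s3cret", 42, "1234")
print(verify("s3cret", 42, token, "1234"))  # matching code on the same post
print(verify("s3cret", 43, token, "1234"))  # same pair replayed on another post
```

As noted in the thread, a token/code pair scraped from one post stays valid for repeated comments on that same post, but it cannot be reused on a different post and it cannot be forged without the secret.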

About

Hi, how are you? My name is Georg Kaindl, and I'm a twenty-something from Vienna, Austria. During the day, I'm a CS student at the Vienna University of Technology, but at night, I turn into an independent software developer for the Macintosh platform, social nerd, lazy entrepreneur and intuitive researcher.

I like to write about everything that matters to considerate technology enthusiasts, but humbly retain the right to go off-topic from time to time.

My posts are licensed under a Creative Commons Attribution 3.0 License.

Contact

You can reach me by email if you have something to say that's not related to a blog post or that you don't want to have publicly available as a comment to a post.

However, you'll have to prove that you are human! Even though I personally like robots very much, I'm less of a fan of SPAM. It's just a simple riddle to solve, but a SPAM bot won't cut it!

To get my email address, take the word before the .com in my domain name first (Hint: The word you are looking for starts with a "g" and ends with an "l"). Next, simply attach @mac.com to this word.

Boom, there's my email address. Simple, isn't it?
