How Spam Comments Work

One of my research interests in my day job is natural language generation (NLG) using generative grammars. Today the following was posted as a comment to this blog. Due to a bug somewhere, the comment-spamming system posted the raw source it uses to build its comments rather than a finished comment, which I find interesting. There are hundreds of lines of this, but here are a few. I hope it is clear how the final text is meant to be formed: each group in braces lists alternatives separated by |, and one alternative is picked from each group.

I {couldn’t|could not} {resist|refrain from} commenting. {Very well|Perfectly|Well|Exceptionally well} written!

{I will|I’ll} {right away|immediately} {take hold of|grab|clutch|grasp|seize|snatch} your {rss|rss feed} as I {can not|can’t} {in finding|find|to find} your {email|e-mail} subscription {link|hyperlink} or {newsletter|e-newsletter} service. Do {you have|you’ve} any? {Please|Kindly} {allow|permit|let} me {realize|recognize|understand|recognise|know} {so that|in order that} I {may just|may|could} subscribe.

Thanks.

or

These are {really|actually|in fact|truly|genuinely} {great|enormous|impressive|wonderful|fantastic} ideas in {regarding|concerning|about|on the topic of} blogging. You have touched some {nice|pleasant|good|fastidious} {points|factors|things} here. Any way keep up wrinting.

which presumably could lead to great comments like:

These are in fact enormous ideas in about blogging. You have touched some fastidious factors here. Any way keep up wrinting. [sic]
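To make the mechanism concrete, here is a minimal sketch of how a template like the ones above might be expanded. The spammers' actual code is unknown to me; the function name `spin` and the regular expression are my own, and I am assuming that braces and | carry no other meaning in the template.

```python
import random
import re

# Matches an innermost {a|b|c} group, i.e. one with no braces inside it.
GROUP = re.compile(r"\{([^{}]*)\}")

def spin(template, rng=random):
    """Replace every {a|b|c} group with one option chosen at random."""
    while True:
        # Substitute all innermost groups, then loop in case groups were nested.
        expanded, count = GROUP.subn(
            lambda m: rng.choice(m.group(1).split("|")), template
        )
        if count == 0:
            return expanded
        template = expanded

template = "You have touched some {nice|pleasant|good|fastidious} {points|factors|things} here."
print(spin(template))
```

Each run prints one of the twelve possible sentences, including "You have touched some fastidious factors here."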

The point of the randomization is to try to fool comment-spam filters that work on (Bayesian!) probabilities of seeing particular word combinations. Unfortunately, the result is terrible.
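For readers who have not met such filters, here is a rough sketch of the idea, under heavy assumptions: it is a textbook naive Bayes classifier over single words (real filters may look at word combinations and much else), and the four training comments are invented purely for illustration. It shows why swapping in a synonym the filter has not seen lowers the spam score.

```python
import math
from collections import Counter

def tokens(text):
    return text.lower().split()

class NaiveBayes:
    """Toy word-level naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokens(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        counts = self.word_counts[label]
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(counts.values())
        # Log prior: fraction of training comments carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokens(text):
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))

# Invented training data, for illustration only.
nb = NaiveBayes()
nb.train("great ideas keep up writing", "spam")
nb.train("wonderful ideas grab your rss feed", "spam")
nb.train("i disagree with your second argument", "ham")
nb.train("the grammar example needs a citation", "ham")

# A synonym the filter has never seen scores lower as spam than the seen word.
print(nb.log_score("great ideas", "spam") > nb.log_score("fantastic ideas", "spam"))
```

In this toy setup the unseen synonym does lower the spam score, but the rest of the comment is still enough to give it away.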

The poor quality is a consequence of the algorithm. It is what we call ‘context-free’: choices made at one point have no effect on choices at others, so the only replacements that can be made are those you’d see in a thesaurus. As a result, it is almost impossible to produce good text (certainly nothing beyond a few lines) with sufficient variation. The need for synonyms also encourages the person creating the underlying text to be careless with what they add (as here), and you end up with word combinations that are a dead giveaway of a non-human writer. That in turn means spam filters have an easier time catching the comment. Add the fact that the comment algorithm takes no account of the blog, and you’ve got a very primitive attempt. On top of that, I assume whoever is behind it is technically deficient in any case, since the raw source was uploaded, not the generated text.
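The independence of the choices can be seen by enumerating every expansion of one template. The sketch below is my own illustration, not the spammers' code, and it assumes a flat template with no nested braces. Because no choice constrains any other, the number of variants is simply the product of the option counts.

```python
import itertools
import re

GROUP = re.compile(r"\{([^{}]*)\}")

def all_variants(template):
    """Yield every expansion of a flat (non-nested) {a|b|c} template."""
    # Splitting on a pattern with one capturing group alternates
    # literal text and group bodies: [lit, body, lit, body, ..., lit].
    parts = GROUP.split(template)
    literals = parts[0::2]
    choices = [body.split("|") for body in parts[1::2]]
    # Every combination is allowed, since the choices are independent.
    for picks in itertools.product(*choices):
        out = [literals[0]]
        for pick, literal in zip(picks, literals[1:]):
            out.append(pick)
            out.append(literal)
        yield "".join(out)

template = (
    "These are {really|actually|in fact|truly|genuinely} "
    "{great|enormous|impressive|wonderful|fantastic} ideas in "
    "{regarding|concerning|about|on the topic of} blogging."
)
variants = list(all_variants(template))
print(len(variants))  # 5 * 5 * 4 = 100
print("These are in fact enormous ideas in about blogging." in variants)  # True
```

Nothing in the template can rule out "in about blogging", because the fixed word "in" sits outside the braces and the generator cannot see it when choosing what follows.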

It wouldn’t be difficult to scan the blog and figure out some key phrases, then incorporate them into a slightly more complex language generator. Perhaps someone has already done that. Perhaps some of you, while appearing to be esteemed commenters, are merely state-of-the-art NLG systems ;)
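As a rough idea of what the "scan the blog" step could look like, here is a deliberately crude sketch. It just counts frequent words after dropping a tiny stop-word list and slots the top one into a fixed sentence. The stop-word list, the function names and the sample post are all hypothetical, and a serious system would need proper phrase extraction rather than single-word counts.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "it", "and",
              "that", "on", "for", "as", "are", "this", "with"}

def key_terms(post_text, n=3):
    """Return the n most frequent non-stop-words, most frequent first."""
    words = re.findall(r"[a-z']+", post_text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

def topical_comment(post_text):
    """Slot the top key term into a fixed sentence (no grammar involved)."""
    terms = key_terms(post_text, n=1)
    topic = terms[0] if terms else "this"
    return f"Interesting point about {topic}. Do you have more posts on it?"

# Invented sample post, for illustration only.
post = ("Spam filters score comments. Spam generators randomize comments "
        "to dodge filters. Spam is still spam.")
print(key_terms(post))
print(topical_comment(post))
```

Even this much would address one weakness noted above, namely that the comment takes no account of the blog it is posted on.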

4 responses to “How Spam Comments Work”

  1. I received that spam yesterday. No, not the raw source, but after selections had been made.

    It’s deleted, so I can’t go back and check.

    Yes, I agree, most of the spam that I see is pretty obviously boilerplate, with at most a crude attempt to adapt it to the post it is allegedly commenting on.

    One of the tricks of these spammers is to use flattery. You did not specifically mention that, but it shows in the final paragraph. Presumably, that is intended to make it more likely that the blog owner will decide that it is not spam, and will allow the comment to appear.

  2. Ian

    Yes, so you might say “well, that makes no sense, but at least they like me”…:D

    I wonder if a slightly antagonistic version might be more profitable, as the blog owner may respond. Or something like:

    “This post looked interesting when it came up in google, but the site doesn’t work with my version of Internet Explorer. Maybe you could check it out. I’m sure I’m not the only person with that browser. Cheers!”

    I suspect the latter would fool me.

  3. Personally, I am not fooled by this ironic post. It is clear to me that the post itself is an NLG algorithmic product.

  4. TWF

    Are you kidding me? My generation of language is anything but natural! I have gotten a laugh out of several of these types of spam comments. At first I was thinking they were just bad translations, like the garbage you get sometimes from Google Translate. But if you got one with raw code, then that sure suggests some un-NLG at work.
