Twitter, spam, and blackhole lists
[I drafted this about a year ago (2008 May 08), and then forgot about it. I came back today and made a couple of small edits, but surprisingly, the changes in the past year are mostly details. The Twitter Blacklist has closed, hashtags have become ubiquitous, and Quotably has gone away. But the Twitter spam problem still looms, and is definitely not solved. So I decided to post it mostly as-is, because I think it’s still mostly right. And because I don’t have time to edit it a lot more. :) - rew]
[About a year ago] I was in a twitterscussion with @greywolf about twitterblacklist.com [now closed] that reminded me of the oft-rehashed discussions about Real-Time Blackhole Lists for email. People are making the same arguments almost verbatim about emerging anti-Twitter-spam tools now. I think, though, that the frothing indignation over the problems these tools cause for a relatively small number of legitimate users likely stems from a widely-shared ignorance of the spam storm that is coming to Twitter.
Many Twitter users think the service is mostly immune to serious spam. They frequently say things like greywolf did: “twitter has the easiest self regulating mechanism you think someone is spam stop following”.
But it’s not nearly that simple, nor is Twitter safe.
One reason is that keyword tracking (“track oranges” and you’ll see every tweet containing the word ‘oranges’, even from users you don’t follow), hashtags, and tag channels (the last two closely linked) are three very useful tools for Twitter-savvy users, and all three are extremely susceptible to spam abuse.
The spam slime include keywords and hashtags in their tweets so that their spam links (usually with a deceptive description) show up in the twitstream of every user tracking those terms. Since those users are tracking words, not the spammer, the fact that they aren’t following the spammer’s ID is irrelevant; the spam still gets delivered.
Since Twitter has essentially no tools (more on this in a moment) to combat this, the only way to protect yourself is to not use keyword-tracking at all, which sort of throws the baby out with the bathwater. Thus, this spam is parasitic on its host (keyword tracking), and very difficult to extract without killing the host.
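To make the attack concrete, here is a minimal sketch of the kind of client-side heuristic that could screen tracked-term results before showing them. Everything here is an assumption for illustration: the tweet field names, the thresholds, even the idea that a client sees account-creation dates. This is not Twitter’s API or anyone’s shipping filter.

```python
import re
from datetime import datetime, timedelta

URL_RE = re.compile(r"https?://\S+")

def looks_like_track_spam(tweet: dict, tracked_terms: set[str]) -> bool:
    """Flag a tweet that matched a tracked term but shows spammy traits."""
    text = tweet["text"].lower()
    term_hits = sum(1 for term in tracked_terms if term in text)
    has_link = bool(URL_RE.search(text))
    account_age = datetime.utcnow() - tweet["author_created_at"]
    # Keyword stuffing + a link + a days-old account is the classic pattern.
    return has_link and term_hits >= 2 and account_age < timedelta(days=7)
```

A static heuristic like this decays quickly as spammers adapt; the point is that today even this much screening is left to each individual user.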
At least one spammer uses the term ‘Destin’ and sends out multiple links daily under automated usernames (vick33728d or the like), purporting to offer a beach-news headline and a link to a local beach newspaper. Each of these links redirected (last I checked) to some “learn at home” scam site. Pretty normal stuff. But they’re obviously just testing the waters: there were several a day for a week or so, and then it went quiet again. As much as I respect the Twitter team’s work, I don’t think it’s because they found a way to keep spammers from signing up.
You see this same pattern from email spammers. They try out new modes of eluding spam filters in small numbers at first, and once they find something effective, then cometh the deluge.
The insidious spammer has another attack vector. Since twitter accounts are free and easy to create, the spammer can create as many as he likes. Then with each account he can follow 20,000, or 40,000, or however many user ids he can identify, using an automated tool. The cost to the spammer of doing this approaches zero.
Twitter made this harder by placing limits on the rate at which you can follow people, and by flagging suspicious ratios of followed to followers. Still, this only means the spammer has to create more Twitter accounts, not that the practice will stop. And just as with spammers using temporary email accounts for sending, it matters not at all if an account is blocked or deleted within hours. There is no practical limit on the creation of these accounts; the supply is inexhaustible.
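In spirit, that ratio flag might look something like the sketch below. The thresholds are invented for illustration; Twitter’s actual limits are not public, and this is not their implementation.

```python
def follow_ratio_flag(following: int, followers: int,
                      max_following: int = 2000, max_ratio: float = 10.0) -> bool:
    """Return True if an account's follow behavior looks spammy."""
    if following <= max_following:
        return False              # small accounts get the benefit of the doubt
    if followers == 0:
        return True               # follows thousands, followed by no one
    return following / followers > max_ratio
```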
But how does it work for the spammer? Well, when one user follows another, the second user receives an email notice that someone has followed them. That provides two possible benefit paths for the spammer, and two costs for the victim.
The first benefit path for the spammer is that the victim will just follow anyone who follows them, “until they have some reason not to”. Many people, especially new Twitter users, are so glad to have a follower that they’ll just click the link and follow them back. (Note that phishing attacks are coming, too, though I haven’t yet seen any here). Once the victim has followed the spammer, the spammer gets at least one free shot where their spam tweets will be seen by the victim before being un-followed. More likely, they’ll get several before the user manages to find the time to block them (or figures out how, because Twitter doesn’t make it real obvious for n00bs how this is done).
The second benefit path for the spammer is that the victim, seeing the “follow” notice, will at the very least visit the spammer’s Twitter page, and with any luck, click on the spammer’s web link. Voila! A visit to a web trap, there to do whatever it is spammers always do when they can lure a visitor onto one of their landing pages.
In each individual case, the victim will un-follow the spammer, limiting the amount of spam that can be inflicted. But in the aggregate - among all Twitter users - this is not a negligible effect. The fact that it might not cost me very much doesn’t mean that the total cost of it isn’t large. It’s essentially without bound, because the cost to do it is roughly zero.
The problem that concerns me more is that each follow by a spammer creates a small, but non-zero, burden of action on the recipient.
If a spammer creates an account and follows 20,000 users, each of those 20,000 users will receive an email message, and will have to stop for a moment and make a decision: visit the link, ignore the follow, follow blindly, etc. Even if they just delete the email, they have to take action.
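A back-of-the-envelope calculation shows why this matters in aggregate. All the numbers below are illustrative assumptions, not measurements:

```python
# Aggregate cost of follow-spam, using made-up but plausible figures.
users_followed = 20_000      # users one spam account follows
seconds_per_decision = 5     # read the notice, decide, delete or block
accounts_per_day = 100       # trivial for a spammer to create with automation

wasted_hours_per_day = users_followed * seconds_per_decision * accounts_per_day / 3600
print(f"{wasted_hours_per_day:,.0f} person-hours burned per day")
# => roughly 2,778 person-hours/day, at near-zero cost to the spammer
```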
This isn’t a big deal if one or two spammers do it; but just like with email spam, it doesn’t scale. People used to make the same argument about email spam: “Oh, spam’s not a big deal, quit fussing about it. If you get a message and you don’t want it, just hit the delete button.”
But most people have learned, too late and to their chagrin, that this only works when the spammer cloud is nascent. When an infinite number of people can at zero cost send me an infinite number of such messages, I don’t have time to hit the delete button that many times.
The cognitive burden placed on the aggregate group of users, or victims, grows larger and larger; that is the inherent threat of the spam problem. If spam were constrained to grow no faster than linearly with the number of spammers, it would not be such an issue.
The reason spam is such a terrible problem is that the power of automation lies almost entirely on the side of the offenders, not the victims. Gmail and other email spam systems have helped to reverse that with hybrid systems that include content filtering, which remains the most accurate approach.
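For readers who haven’t looked inside such filters, here is a toy sketch of token-based content filtering in the spirit of naive Bayes classifiers. Real filters do far more (better smoothing, header analysis, continuous retraining); this is only the core idea, with invented function names:

```python
from collections import Counter

def train(messages: list[tuple[str, bool]]) -> tuple[Counter, Counter]:
    """Count token frequencies in spam and ham training messages."""
    spam, ham = Counter(), Counter()
    for text, is_spam in messages:
        (spam if is_spam else ham).update(text.lower().split())
    return spam, ham

def spam_score(text: str, spam: Counter, ham: Counter) -> float:
    """Crude per-token likelihood ratio; scores above 1.0 lean spam."""
    score = 1.0
    for token in text.lower().split():
        # Add-one smoothing so unseen tokens don't zero out the score.
        score *= (spam[token] + 1) / (ham[token] + 1)
    return score
```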
But for instant messaging-based systems like Twitter, there is no point in the system at which it is cost-effective or time-effective to inject content filtering prior to the user seeing it.
Thus the architecture of Twitter itself (and any similar system) lends itself to abuse, by making it difficult to produce automated tools that could tip the balance of power back in favor of the victims.
This is exacerbated by the fact that many Twitter users don’t yet see this problem emerging. This may be because most people have no in-depth knowledge of spam fighting, and don’t recognize the parameters and characteristics of the problem. It’s somewhat analogous to a beginning programmer who doesn’t know anything about big-O notation complaining that other people keep writing confusing code, and why don’t they just bubble-sort everything.
When people lack the tools or the knowledge or the experience to estimate the danger of the emerging spam threat on Twitter, they will react badly – sometimes overlaying an inappropriate ideological model on the problem – to the emergence of the first crude tools and attempts to deal with the problem by those who do recognize it.
Not to become too political, but it’s also analogous to the way people who don’t believe “terrorism” is a legitimate threat react to governmental measures intended to catch terrorists. To be fair, if you don’t believe terrorists are really after you, then measures that inconvenience you or threaten your ideals will certainly seem half-assed at best, and openly dangerous at worst. So your perception of the problem shapes your perception of potential solutions.
All that being said, just as with real-time black hole lists and other anti-spam mechanisms (like the odious “fill-out-this-form-I’m-blocking-unknown-email-addresses”), there are good and bad solutions. Good solutions, like greylisting, do some good without imposing an undue burden on legitimate users. They are effective while limiting collateral damage. Other “solutions” more or less indiscriminately destroy everything in sight that matches some limited set of criteria.
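Greylisting is a good example of the balance: legitimate mail servers retry after a temporary failure, while most spam cannons fire once and move on. A minimal sketch, with illustrative timing constants:

```python
import time

GREYLIST: dict[tuple[str, str, str], float] = {}  # triplet -> first-seen time
RETRY_WINDOW = 300  # seconds a sender must wait before retrying

def check(sender: str, recipient: str, ip: str) -> str:
    """Greylist an unknown (sender, recipient, IP) triplet on first contact."""
    triplet = (sender, recipient, ip)
    now = time.time()
    first_seen = GREYLIST.setdefault(triplet, now)
    if now - first_seen < RETRY_WINDOW:
        return "451 try again later"   # temporary failure; real MTAs retry
    return "250 ok"                    # retried after the window: accept
```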
Real time black hole lists that lack a timely and usable mechanism to get an address removed are part of the spam problem and for the same reason: the ability to automate damage-dealing outstrips the ability of humans to manually undo it.
If http://twitterblacklist.com provides no mechanism to periodically refresh its listings (if listing is automated), or if its only criterion is one that does not correspond strongly to spamming (solely a follower/following ratio), then it is itself part of the problem, and I think I would share part of greywolf’s frustration. Likewise, if there is no easy way for a listed user to say, “I’m a real human, check my content, remove me from this list,” then it’s a bad attempt at a good service.
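The fix isn’t complicated in principle. Here is a sketch of the list hygiene I’m arguing for: entries expire unless re-confirmed, and a listed user can request human review. Every name and constant here is hypothetical, not a description of twitterblacklist.com’s actual design:

```python
import time

TTL = 7 * 24 * 3600  # re-confirm a listing weekly or drop it automatically

blacklist: dict[str, float] = {}   # username -> time last confirmed spamming
review_queue: list[str] = []       # entries awaiting a human look

def is_listed(user: str) -> bool:
    listed_at = blacklist.get(user)
    if listed_at is None:
        return False
    if time.time() - listed_at > TTL:
        del blacklist[user]        # stale evidence: delist automatically
        return False
    return True

def request_removal(user: str) -> None:
    """'I'm a real human, check my content': queue for manual review."""
    if user in blacklist:
        review_queue.append(user)
```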
Unfortunately, bad tools don’t usually help, and the process of figuring out which are bad and good can be long and painful. On the spam front, Twitter’s got a lot of nasty work ahead.