When I first saw Google’s announcement about rel="nofollow" I was laughing out loud. No, really, I was. Let me explain myself.
Google bred this monstrosity called PageRank (PR). Basically, if you search something on the web, you get a gazillion results. Google takes pages with higher ranking—which, by assumption, are the most relevant—and bumps them to the top of search results.
Needless to say, spammers and junk peddlers of all walks of life always tried to cheat Google into raising their PageRanks. For instance, they’d register hundreds and thousands of domains and have them point to each other (a “link farm”) to jack up PR. They did that because Google computed PR based on the number of incoming links (among a host of other criteria).
When it became obvious spammers managed to exploit Google’s algorithm, Google decided to change the algo. The notorious update of this algo is known as the “Florida Update”. The transition didn’t go that well. Everything went upside down. Junk sites were at the top once again, while sites of legit businesses plummeted in search results. People got really pissed, as could be witnessed in the alt.internet.search-engines newsgroup.
Things have gotten better since the Florida Update, but I imagine any thought of another change in Google’s algo sends shivers down people’s spines.
With the rise of blog spam, Google realized yet again that their PR algo was still vulnerable, which prompted them to introduce the rel="nofollow" attribute. People need to tack it on to their hyperlinks to let Googlebot know it shouldn’t follow those links. The funny thing is, Google is using us to fix the damn PR algorithm instead of adjusting the logic on their end.
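For the record, here is roughly what "tacking it on" looks like if your blog software post-processes comment HTML. This is a minimal sketch in Python. The add_nofollow helper is my own made-up name, not something Google or any blog engine ships, and the regex approach is naive: it assumes reasonably well-formed anchor tags and leaves any link that already has a rel attribute alone.

```python
import re

# Matches an opening <a ...> tag; \b keeps it from matching <abbr>, <address>, etc.
_A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
_REL_ATTR = re.compile(r"\brel\s*=", re.IGNORECASE)


def add_nofollow(html: str) -> str:
    """Add rel="nofollow" to every <a> tag that has no rel attribute yet."""

    def fix(match: re.Match) -> str:
        attrs = match.group(1)
        if _REL_ATTR.search(attrs):
            return match.group(0)  # existing rel value is left untouched
        return f'<a{attrs} rel="nofollow">'

    return _A_TAG.sub(fix, html)


comment = 'Nice post! <a href="http://example.com/">my site</a>'
print(add_nofollow(comment))
```

A real implementation would probably use an HTML parser rather than a regex, but the point stands: the work lands on our side of the fence.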
What happens if MSN and Yahoo! decide to introduce their own “measures” for combating comment and referral spam? Will we need to decorate hyperlinks like Christmas trees because search engines can’t deal with the problem?
I think the idea that rel="nofollow" is the way to stop comment spammers is hilarious. What will Google do once spammers figure out a workaround for this attribute? I think the right thing for Google to do is cut the bulls*** and deal with it.
Oh, Yeah, About the War
This post is about combating link/blog spam, not Google. There’s a lot of discussion about it going on online, and I think it’s great that people come up with various ways to counter spammers. The more pressure we put on spammers, and the less cost-efficient we make it for them to take advantage of our hard work, the better.
Unfortunately, there’s no silver bullet solution. Yet. Spam is a technical problem as much as it is a social one. The difficulty of the social aspect is that spammers realize perfectly well that what they do is immoral, but since it’s not against the law, they say, we spam (see Interview with a link spammer). Whether it’s immoral or not is not a concern to them. “Not against the law” is good enough for them. Never mind the fact that bloggers put in after-work hours maintaining blogs and publishing their research.
This is the kind of thinking that is worthy of a Neanderthal. Or better yet—that of Bill Clinton. He took Monica on occasional rides on the F-train “because I could”, in his own words. Not against the law, right? (Thank you, Mr. Clinton. We’ll always remember you ordered bombings of schools and hospitals in Yugoslavia.)
You see that spammers give themselves a green light. Hoping they would have the virtues normal humans possess is futile. Let’s accept it and move on.
The Dangers of Blacklisting IPs and Domains
There are plenty of blacklisting services out there. They would blacklist your IP or a whole domain. I’ve heard plenty of people complain they ended up blacklisted without justification.
Maybe those people are clueless that their home PC has become a “zombie” and someone is using it as a spam cannon. Or maybe they host their site with some cheap “provider” operating from a basement.
I’ve been poking around sites that peddle medications, gambling and stuff like that (yeah, that stuff too). Some sites are hosted on one IP with a gazillion domain names pointing to it. Blocking this one IP would render those 40,217 domains useless. However, some legit sites are hosted on the same IP as junk sites because their provider isn’t sophisticated enough to run a data center. You block this IP and everyone is locked out. It’s like Rumsfeld’s “surgical strikes”, when more civilians die from blasts than militants.
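To make the collateral damage concrete, here is a small Python sketch that groups domains by the IP they resolve to and lists who else would be locked out if you blocked an IP because of one known junk domain. The function names (group_by_ip, collateral) and the example domains and addresses are all made up for illustration; the real lookup would go through socket.gethostbyname, which I swap for a hard-coded table here so the example runs without touching the network.

```python
import socket
from collections import defaultdict


def group_by_ip(domains, resolve=socket.gethostbyname):
    """Map each IP address to the set of domains that resolve to it."""
    groups = defaultdict(set)
    for domain in domains:
        try:
            groups[resolve(domain)].add(domain)
        except OSError:
            continue  # domain does not resolve; skip it
    return dict(groups)


def collateral(groups, known_spam):
    """For every IP hosting a known spam domain, list the other domains on it."""
    return {
        ip: sorted(names - known_spam)
        for ip, names in groups.items()
        if names & known_spam
    }


# Hypothetical data standing in for real DNS lookups.
table = {
    "pills.example": "192.0.2.1",
    "casino.example": "192.0.2.1",
    "knitting-club.example": "192.0.2.1",
    "someblog.example": "192.0.2.2",
}


def fake_resolve(domain):
    if domain not in table:
        raise OSError("does not resolve")
    return table[domain]


groups = group_by_ip(table, resolve=fake_resolve)
print(collateral(groups, known_spam={"pills.example", "casino.example"}))
```

Anything that shows up in the output is a site you would take down along with the spammer, which is exactly the problem with blocking the IP blindly.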
This kind of blacklisting would work only if maintained by human operators. Automatic blacklisting just doesn’t work. All it does effectively is piss people off. Read Paul Graham’s essay for more on this.
Captchas
Captchas (completely automated public Turing test to tell computers and humans apart) aren’t the ultimate solution, but they are an efficient first line of defense. A spammer would need to crack a Captcha, which is doable, but not cost- and resource-efficient. Doing their dirty work doesn’t come for free either, and making them waste CPU cycles on cracking a Captcha is a good thing.
Remember also that Captchas come at the price of hurting accessibility.
Bring it to ‘em
I know this is not a novel idea, but I’d like to recommend actually visiting the links spammers post! You read it right: visit them. Except you send a bot to do it and harvest their content.
If you analyze the code on their sites it’s obvious they use cookie-cutter sites because many of those scumbags aren’t even smart enough to code on their own. They set up “affiliates” and redirect you to their mother ship. Their incentive is to bring traffic and lure pathetic suckers to buy from them.
While spammers obscure emails by replacing “I” with “1”, “e” with “3”, adding dashes, pipes, spaces, etc., they don’t do it on their web sites. On the contrary, their sites need to look respectable enough for losers to place orders.
If you actually take time and visit them, the content is easy enough to harvest and run through a statistical filter. Almost any of their pages will score very high and can be discarded right away. They stuff their pages with “relevant keywords” (which they erroneously call “Search Engine Optimization”, or “SEO”) to get indexed well. We need to take advantage of it. A page full of offending content is a perfect candidate to have the whole site rejected. Coupled with a Bayesian filter, I imagine it’s feasible to identify spammers’ sites with almost perfect certainty.
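Here is a rough idea of what scoring harvested page text could look like. It is a bare-bones naive Bayes classifier with add-one smoothing, written in Python as an illustration only; the class name, the tokenizer and the tiny training samples are all my assumptions, and a real filter would need a large corpus of both junk and legit pages.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    return TOKEN.findall(text.lower())


class NaiveBayes:
    """Toy spam/ham text classifier with add-one (Laplace) smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the smoothed prior odds, then add per-token evidence.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokenize(text):
            p_spam = (self.counts["spam"][tok] + 1) / (spam_total + vocab + 1)
            p_ham = (self.counts["ham"][tok] + 1) / (ham_total + vocab + 1)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on very long pages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


nb = NaiveBayes()
nb.train("cheap pills buy pills online casino", "spam")
nb.train("I wrote a post about python and search engines", "ham")
print(nb.spam_probability("buy cheap pills"))
print(nb.spam_probability("a post about python"))
```

The keyword stuffing works in our favor here: the more a page repeats its “products”, the harder each of those tokens pushes the score toward spam.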
If they implement funky redirects, it’s doable to crack that, too. Eventually the bot gets to a page it can analyze. They can’t make all-image sites because there still will be HTML. And if a page doesn’t have enough meat to it, crawl available links until you hit something juicy. The nice thing about this approach is they can’t hide because they need to run an online store usable by suckers.
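As a sketch of the crawling part, the snippet below pulls out the links on a page plus the target of a meta refresh redirect, using only Python's standard html.parser and urllib.parse. The LinkExtractor class, the extract helper and the sample page are hypothetical, and this covers only one kind of "funky redirect"; JavaScript redirects and HTTP-level redirects would need separate handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link URLs and any <meta http-equiv="refresh"> target."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.redirect = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # content looks like "0; url=http://..." ; keep what follows the first "="
            _, _, target = (attrs.get("content") or "").partition("=")
            if target:
                self.redirect = urljoin(self.base_url, target.strip(" '\""))


def extract(html, base_url):
    """Return (redirect_target_or_None, list_of_links) for one page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.redirect, parser.links


page = (
    '<meta http-equiv="Refresh" content="0; url=/store">'
    '<a href="buy.html">Buy now</a><a>no link here</a>'
)
print(extract(page, "http://shop.example/landing/"))
```

The bot would follow the redirect if there is one, otherwise walk the links, until it lands on a page with enough text to score.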
What do you do if a link doesn’t resolve to anything or they are smart enough to sniff a bot? Just don’t accept the whole comment. What if this is a well-wishing visitor who posts a trackback and talks about the V-drug on his/her site? I’m sure your statistical filter will score it low because, at the other extreme, junk sites go nuts listing their “products” and describing benefits. You can even analyze what payment service they use and take that into account too. Those who choose to give scumbags business deserve to be banned, and we need to put pressure on them.
The Legal Side
I think litigation, if ruthlessly enforced, would work to some degree too. If large companies go after junk peddlers, that will cool them down a bit. Ironically, the lion’s share of spam originates here, in the US, the country that overproduces lawyers like there’s no tomorrow. Funny, eh?
Conclusion
Solving the spam problem takes a combination of measures, not just one. Blocking IPs and whole domains works only if you verify they belong to spammers. I think analyzing their content is a sure way to tell who they are. They too use patterns, which makes it even easier to single them out.
Yours truly has been scouting on this front and is already working on something. ;)