Despite denials from Google, a security researcher continues to assert that the Search King’s reCAPTCHA system for protecting Web sites from spammers can be successfully exploited by Internet junk mail panderers.
Researcher Jonathan Wilkins published a paper recently that included an analysis of reCAPTCHA’s security. In automated attacks he conducted against the system, he reported he had an alarming success rate of 17.5 percent.
CAPTCHA, which stands for Completely Automated Public Turing test to tell Computers and Humans Apart, is a method for foiling automated attacks by spammers on Web sites. Before a Net surfer can perform a task at a site, such as setting up an email account or adding comments to a blog posting, he or she is presented with the image of a word or phrase that has been distorted in some way. The warped image is intended to thwart the scanners and optical character recognition software that spammers use to automate the compromising of Web sites. The idea is that humans can read the characters in the image and type them into a form while machines can’t.
Some simple math reveals just how alarming Wilkins’ findings are. The operator of even a modest botnet of 10,000 machines would be perfectly happy with a success rate of 0.01 percent. If each machine made 10 attempts a second (the rate these figures imply), that would mean 10 new Gmail accounts could be created every second, or 864,000 new accounts a day, from which spam could be launched.
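The arithmetic can be checked with a short Python sketch. The botnet size and the 0.01 percent rate come from the figures above; the rate of 10 attempts per machine per second is an assumption, chosen because it is the rate that makes the quoted totals work out, and the function names are purely illustrative.

```python
from fractions import Fraction

MACHINES = 10_000
ATTEMPTS_PER_MACHINE_PER_SECOND = 10  # ASSUMPTION: not stated in the article
SECONDS_PER_DAY = 24 * 60 * 60


def accounts_per_second(success_rate: Fraction) -> Fraction:
    """Successful CAPTCHA solves per second across the whole botnet."""
    return MACHINES * ATTEMPTS_PER_MACHINE_PER_SECOND * success_rate


def accounts_per_day(success_rate: Fraction) -> Fraction:
    """Successful CAPTCHA solves per day across the whole botnet."""
    return accounts_per_second(success_rate) * SECONDS_PER_DAY


threshold = Fraction(1, 10_000)  # 0.01 percent, written as an exact fraction
reported = Fraction(175, 1_000)  # 17.5 percent, the rate Wilkins reported

print(accounts_per_second(threshold))  # 10
print(accounts_per_day(threshold))     # 864000
print(accounts_per_day(reported))      # 1512000000
```

Exact fractions are used instead of floating-point numbers so the totals come out as whole numbers. Under the same assumed attempt rate, the 17.5 percent figure would work out to roughly 1.5 billion solved challenges a day.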
Google counters that Wilkins’ test targeted an old form of reCAPTCHA from 2008 that has since been changed. “[T]his study does not reflect the effectiveness of reCAPTCHA’s current technology against machine solvers,” a Google spokesperson told The Register. “We’ve found reCAPTCHA to be far more resilient while also striking a good balance with human usability, and we’ve received very positive feedback from customers.”
Wilkins acknowledged that his initial tests were on an older version of reCAPTCHA, but since that time, he has conducted tests on the new images produced by the system and found them to be even weaker than the older ones. In one of his original tests on the system, his success rate was five in 200, or 2.5 percent. When that test was run on the new reCAPTCHA, the rate was 23 in 100, or 23 percent.
The major difference between the old and new versions of reCAPTCHA, according to Wilkins, is that the old version used horizontal lines to obscure the characters in the image. The lines make it harder for machines to recognize a reCAPTCHA phrase, although Wilkins asserts that spammers can easily get around them, but they also make the phrase harder for humans to read. New reCAPTCHA images drop the lines but add distortion to the image. They’re easier for humans to read, but, alas, they’re also easier for machines to crack.
Unlike most CAPTCHA systems, Google’s uses images with two words. That’s because Google uses reCAPTCHA for two purposes. Like other CAPTCHA systems, it’s designed to frustrate spammers, but it’s also incorporated into Google’s efforts to digitize books. When a word in a book scan can’t be recognized by Google’s OCR software, it’s sent to the reCAPTCHA pool. So when a person enters a reCAPTCHA phrase into a form, Google can discover what its OCR program couldn’t, without having to hire human editors to review scanning results.
One weakness of CAPTCHA schemes, though, is that they use words that can be found in a dictionary. This makes it easier for machines to crack the phrases because they have something to check their guesses against.
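To illustrate why dictionary words help an attacker, here is a minimal Python sketch that snaps a garbled machine reading to the nearest word in a list. It is not Wilkins’ method; the word list, the sample misreading and the function name are all hypothetical, and the matching relies on the standard library’s difflib module.

```python
import difflib

# Hypothetical stand-in for a real dictionary file.
WORD_LIST = ["spammer", "scanner", "hammer", "meat", "peat"]


def snap_to_dictionary(ocr_guess, words=WORD_LIST):
    """Return the dictionary word most similar to a noisy OCR guess, or None."""
    matches = difflib.get_close_matches(ocr_guess, words, n=1, cutoff=0.6)
    return matches[0] if matches else None


# A reading with one wrong character is pulled back to a real word.
print(snap_to_dictionary("spamner"))  # spammer
```

A string of random characters would give the attacker nothing to snap to, which is why dictionary words make the machine’s job easier.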
In addition, reCAPTCHA uses a “one-off” system. That means a letter in a word can be incorrect, and it will still be accepted by the system.
So if the reCAPTCHA phrase contains the word “meat” and a Net surfer enters “peat,” the response will still be accepted as valid.
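A tolerance rule of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how reCAPTCHA implements the check, so the sketch assumes the typed word must be the same length as the expected one and may differ in at most one position. The function name is hypothetical.

```python
def accepts_one_off(expected: str, typed: str) -> bool:
    """Accept the typed word if it differs from the expected word
    in at most one character position (ASSUMED rule, same length only)."""
    if len(expected) != len(typed):
        return False
    mismatches = sum(a != b for a, b in zip(expected, typed))
    return mismatches <= 1


print(accepts_one_off("meat", "peat"))  # True: one wrong letter
print(accepts_one_off("meat", "pelt"))  # False: two wrong letters
```

The leniency spares humans who make a typo, but it also means a machine needs to get only three of the four letters in “meat” right.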
Some alternatives to CAPTCHA avoid words entirely. Microsoft, for instance, has developed a scheme called Asirra that is based entirely on images of cats and dogs. To perform a task protected by Asirra, a netizen is presented with an array of 12 pictures and asked to identify each as either a canine or a feline. This kind of test is called a Human Interactive Proof, or HIP.
To be effective, HIP systems need to be supported by large databases that tax the computational power of an attacking spammer. Microsoft does that by using the picture database at Petfinder.com, which contains some three million photos.
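For a sense of scale, the odds of passing a 12-picture cat-or-dog challenge by blind guessing can be worked out in Python. The sketch assumes that all 12 pictures must be labeled correctly and that each guess is an independent coin flip; the function name is illustrative.

```python
from fractions import Fraction


def blind_guess_probability(pictures: int) -> Fraction:
    """Chance of labeling every picture correctly by coin flip
    (ASSUMES two equally likely labels and independent guesses)."""
    return Fraction(1, 2 ** pictures)


p = blind_guess_probability(12)
print(p)                     # 1/4096
print(f"{float(p):.4%}")     # about 0.0244%
print(1 / p)                 # 4096 attempts needed on average
```

Under those assumptions, a machine that guesses at random would pass about once in every 4,096 attempts.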