Google’s Latest Acquisition
I have just been reading up on some of the details of Google’s latest acquisition: reCAPTCHA.
reCAPTCHA is a company with a very neat product. It is based around the verification tests you see on various websites, which in themselves have been around for some time.

However, what’s clever about these guys is what they use for the words you have to verify. Previous products have just used random words, their only purpose being to verify that you’re a person and not a robot. What reCAPTCHA does is take big OCR tasks (such as digitizing the archives of the New York Times), pick out the words that the OCR software struggles with (it actually picks out any word that two separate OCR programs disagree over) and feed those words into its huge network for human identification. The human users then help the software establish what each word actually is. (For more info on their scoring system, see the Wikipedia entry.)
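To make the idea concrete, here is a rough sketch in Python of the two steps described above: flag the words where two OCR readings disagree, then accept a human answer once enough people agree on it. The function names, the vote thresholds and the sample words are all my own assumptions for illustration; reCAPTCHA’s real scoring system is more involved than this.

```python
from collections import Counter


def disputed_words(ocr_a, ocr_b):
    """Return the positions where two OCR readings of the same scan disagree."""
    return [i for i, (a, b) in enumerate(zip(ocr_a, ocr_b)) if a != b]


def resolve(answers, min_votes=3, min_share=0.6):
    """Accept the most common human answer once enough people agree.

    min_votes and min_share are made-up thresholds, not reCAPTCHA's own.
    Returns None while the word is still undecided.
    """
    if len(answers) < min_votes:
        return None
    normalised = [a.strip().lower() for a in answers]
    word, count = Counter(normalised).most_common(1)[0]
    return word if count / len(normalised) >= min_share else None


# Hypothetical readings of the same scanned line from two OCR programs.
ocr_a = ["the", "quick", "brovvn", "fox"]
ocr_b = ["the", "quick", "brown", "fox"]

print(disputed_words(ocr_a, ocr_b))            # [2]
print(resolve(["brown", "Brown ", "brovvn"]))  # brown
```

Only the disputed word (position 2 here) would be shown to people, and it stays undecided until enough matching answers come in.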
What I like about their idea is that it takes a technology which has been around for a while and uses it to solve another problem in an innovative way. Plenty of technologies work well in theory and test well on a small data set, but run into problems when they are rolled out for real use; this is a great way of filling that gap. And with almost 11 billion words going through the system a year (30 million a day, with sites like Facebook and Twitter already subscribers, and Google now to be added), that’s a decent contribution to solving the problem. Google clearly have their Books project in mind with this acquisition.
What isn’t yet clear is whether this tool can be used to improve OCR tools in the future. At present, it seems to be used just to help out on specific words when the software is not sure. However, if the OCR uses artificial intelligence, through ideas like Markov models, then the human answers could potentially be fed back in to make the tool more accurate going forward. This could be a very interesting future development. (Of course, as computers get better at reading this kind of text without human help, the anti-spam tests will need to work harder!)
Aside from the technology aspect, this is another acquisition in the web market (Google are rumoured to be discussing another, around the $500+ million mark) and a further suggestion that things may be picking up in terms of M&A and finance in the tech industry, but I’ll leave that for another day.
AB
