Post by Luc Vincent, Uber Tech
      We wanted to let you all know that a few months ago
      we quietly released (or, more precisely, re-released) an Optical
      Character Recognition (OCR) engine into open source. You might wonder why Google is interested
      in OCR? In a nutshell, we are all about making information available to users, and when this
      information is in a paper document, OCR is the process by which we can convert the pages of
      this document into text that can then be used for indexing.
This particular OCR engine, called Tesseract, was in fact
      not originally developed at Google! It was developed at Hewlett Packard Laboratories between
      1985 and 1995. In 1995 it was one of the top three performers at the OCR accuracy contest
      organized by the University of Nevada, Las Vegas (UNLV).
      However, shortly thereafter, HP decided to get out of the OCR business and Tesseract has been
      collecting dust in an HP warehouse ever since. Fortunately some of our esteemed HP colleagues
      realized a year or two ago that rather than sit on this engine, it would be better for the
      world if they brought it back to life by open sourcing it, with the help of the
      Information Science Research Institute (ISRI) at UNLV.
      UNLV was happy to oblige, but they in turn asked for our help in fixing a few bugs that had
      crept in since 1995 (ever heard of bit rot?)... We tracked down the most obvious ones and
      decided a couple of months ago that Tesseract OCR was stable
      enough to be re-released as open source.
A few things to know about Tesseract OCR: for
      now it only supports the English language, and does not include a page layout analysis module
      (yet), so it will perform poorly on multi-column material. It also doesn't do well on
      grayscale and color documents, and it's not nearly as accurate as the best commercial
      OCR packages. Still, as far as we know, Tesseract is, despite its shortcomings, far
      more accurate than any other open source OCR package. If you know of one that is
      more accurate, please do tell us!
We are grateful to all the people at
      HP who made it possible to release Tesseract into open source, and especially John Burns, who
      championed and babysat the project. We would also like to thank the original Tesseract
      development team, a partial list of whom is here.
      Last but not least, many thanks to our friends at UNLV's ISRI, including Tom Nartker, Kazem
      Taghva, Julie Borsack and Steve Lumos, for all their help with this project.
By the way, we are also hiring top-notch OCR engineers! See this job posting for
      more information.