Announcing the OCRopus Open Source OCR System
Posted by Thomas Breuel, OCRopus Project Leader
We're happy to announce the OCRopus OCR Project, a Google-sponsored project to develop advanced
OCR technologies in the IUPR research group, headed by Prof. Thomas Breuel at the DFKI (German
Research Center for Artificial Intelligence, Kaiserslautern, Germany).
The goal of the project is to advance the state of the art in optical character recognition
and related technologies, and to deliver a high-quality OCR system suitable for document
conversions, electronic libraries, vision-impaired users, historical document analysis, and
general desktop use. In addition, we are structuring the system so that it will be easy for
other researchers in the field to reuse.
The OCRopus engine is based on two research projects: a high-performance handwriting
recognizer developed in the mid-1990s and deployed by the US Census Bureau, and novel
high-performance layout analysis methods.
The project is expected to run for three years and support three Ph.D. students or postdocs.
We are announcing a technology preview release of the software under the Apache license
(English-only, combining the Tesseract character recognizer with IUPR layout analysis and
language modeling tools); additional recognizers and functionality will follow in future
releases.
The IUPR research group has extensive experience in OCR and related technologies, and will be
basing the work on previous research and existing software in the area. Existing software
components include high-performance handwriting recognition software that has received top
evaluations by NIST and was deployed by the US Census Bureau; the recently open-sourced
Tesseract OCR system; a separate Google project for probabilistic natural language modeling;
and software for layout analysis and character recognition. The IUPR research group gratefully
acknowledges funding by the German BMBF, the state of Rhineland-Palatinate, and other public
and private partners (please see www.iupr.org for more details).
We are hoping for contributions from the open source community in areas such as adapting the
system to additional languages, creating a GNOME desktop application, integration with GNOME
desktop search, web-based tools for proofing and training, language modeling, additional
character recognition engines, and other useful tools and add-ons.
The project web page can be found at ocropus.org.