OCRopus

OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0. It has a highly modular design built around command-line interfaces.

OCRopus was developed under the lead of Thomas Breuel at the German Research Centre for Artificial Intelligence in Kaiserslautern, Germany, and was sponsored by Google.

OCRopus was designed especially for high-volume book digitization projects, such as those run by Google Books, the Internet Archive, or libraries. It is intended to support a large number of languages and fonts. However, it can also be used for desktop and office applications or in applications for visually impaired people.

OCRopus is built from several main components, each of which performs a separate processing task.

One or more scripts are available for each of these components. This modular approach allows custom workflows to be assembled and individual steps to be replaced.
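The idea of a workflow assembled from exchangeable command-line steps can be sketched in Python. This is only an illustration: the step names (binarize, segment, recognize), the command names and their flags are assumptions modelled on the ocropy distribution and are not taken from this article, and `build_pipeline` is a hypothetical helper. The sketch only builds the argument lists and does not run anything.

```python
import shlex

# Assumed step table. The command names and flags follow the ocropy
# distribution as far as known, but they are NOT documented in this article;
# check them against your installation before use.
DEFAULT_STEPS = {
    "binarize": ["ocropus-nlbin", "{image}", "-o", "{workdir}"],
    "segment": ["ocropus-gpageseg", "{workdir}/????.bin.png"],
    "recognize": ["ocropus-rpred", "-m", "{model}",
                  "{workdir}/????/??????.bin.png"],
}


def build_pipeline(image, workdir, model, overrides=None):
    """Return the ordered argument vectors for processing one page.

    `overrides` maps a step name to a replacement argument template,
    which is how a single step is exchanged without touching the others.
    """
    steps = dict(DEFAULT_STEPS)      # copy so the defaults stay untouched
    steps.update(overrides or {})    # replacing a key keeps its position
    values = {"image": image, "workdir": workdir, "model": model}
    return [[arg.format(**values) for arg in argv] for argv in steps.values()]


if __name__ == "__main__":
    # Swap in a hypothetical custom segmenter; the other steps are unchanged.
    custom = {"segment": ["my-segmenter", "{workdir}"]}
    for argv in build_pipeline("page.png", "temp", "model.pyrnn.gz", custom):
        print(" ".join(shlex.quote(a) for a in argv))
```

Each returned list could be passed to `subprocess.run` once the commands have been verified locally. Keeping the steps in a plain dictionary makes it easy to replace one of them while leaving the rest of the workflow as it is.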

By default, OCRopus comes with a model for English texts and a model for text in Fraktur. These models refer to the script and are largely independent of the actual language. New characters or language variants can be trained from scratch or added later.

In recent versions, text recognition is based on recurrent neural networks (LSTM) and does not require a language model. This makes it possible to train language-independent models, for which good recognition results in English, German, and French have been shown at the same time. In addition to the Latin script, there are results for other scripts and languages such as Sanskrit, Urdu, Devanagari, and Greek.
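To illustrate how a recognizer's output can be turned into text without a language model, here is a minimal sketch of greedy decoding as used with connectionist temporal classification (CTC). That the network output is decoded this way is an assumption for illustration, not something stated in this article; the function name, the toy alphabet and the scores are all hypothetical.

```python
def greedy_decode(frames, alphabet, blank=0):
    """Decode per-timestep class scores into a string.

    frames   -- list of score lists, one per timestep
    alphabet -- alphabet[i] is the character for class i
    blank    -- index of the special "no character" class
    """
    # Pick the highest-scoring class at every timestep.
    best = [max(range(len(f)), key=f.__getitem__) for f in frames]
    out = []
    prev = blank
    for k in best:
        # Emit a character only when it is not blank and not a repeat
        # of the class seen at the previous timestep.
        if k != blank and k != prev:
            out.append(alphabet[k])
        prev = k
    return "".join(out)


if __name__ == "__main__":
    alphabet = ["", "a", "b"]        # class 0 is the blank
    frames = [
        [0.1, 0.8, 0.1],             # a
        [0.1, 0.7, 0.2],             # a (repeat, collapsed)
        [0.9, 0.05, 0.05],           # blank separates the two a's
        [0.2, 0.7, 0.1],             # a
        [0.1, 0.2, 0.7],             # b
    ]
    print(greedy_decode(frames, alphabet))
```

Nothing in this decoding step depends on a dictionary or on word statistics, which is why a model of this kind can in principle be applied to several languages that share a script.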

Very good recognition rates can be achieved with appropriate training. This extra effort is particularly worthwhile for difficult documents or for scripts that are no longer in common use and are therefore not a focus of other OCR software.
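One common way to quantify recognition quality, for example before and after training, is the character error rate: the edit distance between the ground-truth text and the recognized text, divided by the length of the ground truth. The sketch below assumes this metric for illustration; the article does not say which measure was used, and the function names and sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(truth, recognized):
    """Edit distance normalized by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, recognized) / len(truth)


if __name__ == "__main__":
    # One wrong character out of seven.
    print(character_error_rate("Fraktur", "Frakfur"))
```

Comparing this value on a held-out set of lines before and after training gives a simple indication of whether the extra training effort paid off.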
