About the Project

The coordination project OCR-D aims to develop methods of Optical Character Recognition (OCR) for printed historical material.
To this end, existing workflows and methods of automatic text recognition are examined, described and optimized. An important goal is to prepare, at a conceptual level, the transformation of German-language prints from the 16th to the 19th century into machine-readable full text.
The Herzog August Bibliothek in Wolfenbüttel, the Berlin-Brandenburg Academy of Sciences and Humanities, the State Library Berlin and the Karlsruhe Institute of Technology take part in this project. Leading experts, scholars and libraries in the fields of digitization and digital humanities will also support the project.

In recent years, academic libraries in particular have digitized large stocks of historical materials and made the images available online. Through an OCR process, searchable full texts can be generated automatically from these image data. The value added by such digital text documents is indispensable today in many scientific disciplines, particularly in the humanities.
To date, however, access to the electronic full text is often impossible or insufficient, although many historical documents are accessible online through the "Bibliography of Books Printed in the German Speaking Countries from the 16th-18th century".

The results from established OCR methods have so far been insufficient when it comes to recognizing old printing types, especially Gothic types. This is where the work of OCR-D comes in. The aim is to examine existing tools and recent studies in order to describe whether and how the latest research results are used in established OCR processes, and how these tools and findings can be used to develop the OCR process for the mass digitization of printed historical material.

The project is funded by the German Research Foundation (DFG) and has a term of three years. In the first phase, requirements are identified and the second phase is prepared conceptually. In 2017, a call for module projects was published to address specific challenges of the OCR process. The DFG subsequently approved eight module projects, which started their work at the beginning of 2018:

"Scalable Methods of Text and Structure Recognition for the Full-Text Digitization of Historical Prints" Part 1: Image Optimization
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)

"Scalable Methods of Text and Structure Recognition for the Full-Text Digitization of Historical Prints" Part 2: Layout Analysis
DFKI

Development of a semi-automatic open source tool for layout analysis and region extraction and region classification (LAREX) of early prints
Universität Würzburg

NN/FST - Unsupervised OCR-Postcorrection based on Neural Networks and Finite-state Transducers
Universität Leipzig

Optimized use of OCR methods – Tesseract as a component of the OCR-D workflow
Universität Mannheim

Automated postcorrection of OCRed historical printings with integrated optional interactive postcorrection
Universität München

Development of a Repository for OCR Models and an Automatic Font Recognition tool in OCR-D
Universität Leipzig, Universität Erlangen, Universität Mainz

DPO-HP - Digital Preservation of OCR-D data for historical printings
SUB Göttingen, GWDG Göttingen

The module projects have a term of 18 months. The results will be presented to the public at a final workshop in June 2019.

At the end of the overall project, we will present software and an accompanying concept for the OCR processing of digitized printed heritage from the 16th to the 19th century.