Historic Transylvanian Hungarian Newspapers Made Searchable in New Digital Initiative

Rimetea, Romania
Gábor Kiss/MTI
A joint project led by ELTE’s Digital Heritage National Laboratory and Digitéka has made more than 330,000 pages of historic Transylvanian Hungarian newspapers searchable and digitally preserved using advanced OCR and layout recognition technologies.

A major digitization project led by Eötvös Loránd University (ELTE) has successfully concluded, resulting in the processing and long-term preservation of hundreds of thousands of pages of historic Hungarian-language newspapers from Transylvania.

The initiative was carried out by the Digital Heritage National Laboratory (DH-LAB), operating under ELTE’s consortium leadership, in cooperation with the Transylvanian Digital Repository, Digitéka. According to the university’s statement, the goal was to elevate the digital processing of historical Transylvanian press sources to a new level and to improve modern research access to Hungarian-language cultural heritage.

In the first phase of the project, optical character recognition (OCR) was performed on approximately 273,000 scanned pages from 26 historic Transylvanian newspapers. Subsequently, the partner institution contributed more than 60,000 additional pages, bringing the total volume of processed Hungarian-language press material to 333,492 pages.

The completed files were delivered to Digitéka in two-layer searchable PDF format, each bearing a unified watermark to ensure consistency and authenticity.

To enhance the efficiency and accuracy of the OCR process, the partners also jointly developed a layout analysis system capable of recognizing the structural elements of historical documents. Drawing on ELTE’s research and development expertise and infrastructure, the collaboration focused on improving document structure detection, a key factor in boosting OCR precision.

As part of this effort, Digitéka’s annotators processed 1,007 pages. Combined with material prepared by DH-LAB’s annotators, this resulted in a training database of 4,078 annotated pages.

According to the statement, this dataset lays the groundwork for a layout recognition system specifically optimized for Transylvanian and Hungarian historical documents, significantly improving the accuracy of OCR results and facilitating more reliable text search and research.

