Impact Analysis of OCR Quality on Research Tasks in Digital Archives

We presented our paper on “Impact Analysis of OCR Quality on Research Tasks in Digital Archives” at this year’s International Conference on Theory and Practice of Digital Libraries (TPDL2015).

We describe how humanities scholars currently use digital archives and the challenges they face in adapting research methods developed for physical archives. This shift in methods comes at a cost: scholars have to work with digitally processed historical documents. A major concern for them is therefore the question of how much trust they can place in analyses based on noisy representations of the source texts.

Based on interviews with humanities scholars and a literature study, we classify scholarly research tasks according to their susceptibility to errors originating from OCR-induced biases. Search results for “Amsterdam”, for example, are likely to be influenced by the confusion of the letters “s” and “f”, especially for material that was created before 1800, when the “long s” was still used.
To reduce the impact of such errors, we investigated what kind of data would be required and whether it is available in the archive.
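
The “s”/“f” confusion can be made concrete with a small sketch. The Python snippet below is only an illustration and not the method used in the paper: it expands a search term into the spellings that would result if each lowercase “s” had been recognized as “f”. The function name `long_s_variants` is hypothetical, and the sketch deliberately ignores the historical rules for where the long s was actually printed.

```python
from itertools import product


def long_s_variants(term: str) -> list[str]:
    """Return all spellings of `term` where each lowercase 's' is kept or read as 'f'.

    Simplifying assumption: every lowercase 's' may have been a long s.
    Capital 'S' is left untouched.
    """
    # For each character, list the forms it may take in the OCR output.
    options = [("s", "f") if ch == "s" else (ch,) for ch in term]
    # Combine the per-character options into complete spellings.
    return ["".join(combo) for combo in product(*options)]


print(long_s_variants("Amsterdam"))  # ['Amsterdam', 'Amfterdam']
```

Searching for all variants instead of only the original spelling is one way to estimate how many hits a plain query might miss. The number of variants doubles with every “s” in the term, so this only suits short queries.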

We describe our study of example research tasks performed on the digital newspaper archive of the National Library of the Netherlands. In this study, we tried to reduce the uncertainty of the results as much as possible using the data publicly available in the archive.

We conclude that the current state of knowledge is insufficient on the side of the scholars as well as on the side of the tool makers and data providers, and that it needs to be improved.
