We presented our paper on “Impact Analysis of OCR Quality on Research Tasks in Digital Archives” at this year’s International Conference on Theory and Practice of Digital Libraries (TPDL2015).
We describe how humanities scholars currently use digital archives and the challenges they face in adapting the research methods they use in a physical archive. This shift in research methods comes at a cost: scholars have to work with digitally processed historical documents. A major concern for them is therefore how much trust they can place in analyses based on noisy representations of the source texts.
Based on interviews with humanities scholars and a literature study, we classify scholarly research tasks according to their susceptibility to errors originating from OCR-induced biases. Search results for “Amsterdam”, for example, are likely to be influenced by the confusion of the letters “s” and “f”, especially for material that was created before 1800, when the “long s” was still used.
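The effect on search can be illustrated with a minimal sketch. It is not the method used in the paper; the function names and the sample sentence are invented for illustration, and it assumes the only OCR error is an "s" read as "f". It compares how many hits an exact query finds with how many it finds once those variant spellings are included.

```python
from itertools import product


def long_s_variants(term):
    """Return every spelling of `term` in which any 's' may have been read as 'f'.

    Simplifying assumption: each 's' can be confused independently,
    regardless of its position in the word.
    """
    options = [("s", "f") if ch == "s" else (ch,) for ch in term.lower()]
    return {"".join(combo) for combo in product(*options)}


def count_hits(term, ocr_text):
    """Count exact matches and matches that also allow s/f variants."""
    variants = long_s_variants(term)
    tokens = ocr_text.lower().split()
    exact = sum(1 for token in tokens if token == term.lower())
    with_variants = sum(1 for token in tokens if token in variants)
    return exact, with_variants


# Invented sample line, standing in for OCR output of an old newspaper.
sample = "Te Amfterdam gedrukt en te Amsterdam verkocht"

print(sorted(long_s_variants("Amsterdam")))  # ['amfterdam', 'amsterdam']
print(count_hits("Amsterdam", sample))       # (1, 2)
```

In this toy case the exact query misses half of the occurrences. The number of variants doubles with every "s" in the query, so a realistic approach would need data on how often each confusion actually occurs.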
To reduce the impact of such errors, we investigated what kind of data would be required and whether it is available in the archive.
We describe our study of example research tasks performed on the digital newspaper archive of the National Library of the Netherlands. In this study, we tried to reduce the uncertainty of the results as far as possible using only the data that is publicly available in the archive.
We conclude that the current state of knowledge, on the side of the scholars as well as on the side of the tool makers and data providers, is insufficient and needs to be improved.