On March 17, we were invited by the National Library of the Netherlands to present the results of our study on retrievability bias in the Dutch historic newspaper archive.
Summary of the talk:
Search engines are not “objective” pieces of technology, and bias in Delpher’s search engine may or may not harm user access to certain type of documents in the collection. In the worst case, systematic favoritism for a certain type can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the “retrievability” for all documents in a collection. We explain the ideas underlying the retrievability metric, and how we measured it on the KB Newspaper collection. We describe and quantify the retrievability bias imposed on the newspaper collection by three different commonly used Information Retrieval models. For this, we investigated how document features such as length, type, or date of publishing influence the retrievability.
We also investigate the effectiveness of the retrievability measure, featuring two characteristics that set our experiments apart from previous studies: (1) the newspaper collection contains noise originating from OCR processing, and historical spelling and use of language; and (2) rather than the simulated queries used in other studies, we use real user query logs including click data. We show how simulated queries differ from real user queries regarding term frequency and prevalence of named entities, and how this affects the results of a retrieval task.
Together with the eHumanities group of KNAW and the Amsterdam Data Science Center, we organize a workshop on Tool Criticism in the Digital Humanities.
The workshop will take place in Amsterdam on May 22, 2015.
For more information please have a look at the website.
[Update:] A report that summarizes the discussions and results from the workshop is now available.