Category Archives: Use Cases

Querylog-based Assessment of Retrievability Bias in Delpher

On March 17, we were invited by the National Library of the Netherlands to present the results of our study on retrievability bias in the Dutch historic newspaper archive.
The research was conducted in collaboration with the WebART project and will be presented at the Joint Conference on Digital Libraries (JCDL) 2016 in Newark, USA, in June 2016.

Summary of the talk:

Search engines are not “objective” pieces of technology, and bias in Delpher’s search engine may or may not harm user access to certain types of documents in the collection. In the worst case, systematic favoritism for a certain type can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the “retrievability” of all documents in a collection. We explain the ideas underlying the retrievability metric, and how we measured it on the KB Newspaper collection. We describe and quantify the retrievability bias imposed on the newspaper collection by three commonly used Information Retrieval models. For this, we investigated how document features such as length, type, or date of publication influence retrievability.
We also investigate the effectiveness of the retrievability measure, featuring two characteristics that set our experiments apart from previous studies: (1) the newspaper collection contains noise originating from OCR processing, and historical spelling and use of language; and (2) rather than the simulated queries used in other studies, we use real user query logs including click data. We show how simulated queries differ from real user queries regarding term frequency and prevalence of named entities, and how this affects the results of a retrieval task.
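
To make the idea of retrievability more concrete, here is a minimal sketch in Python. It assumes the cumulative variant of the measure known from the retrievability literature, where a document's score is the number of queries that rank it within a cutoff. Summarising the resulting inequality with a Gini coefficient is a common choice, but it is an assumption here and not something stated in the summary above. The queries, document ids, and cutoff are hypothetical toy values.

```python
from collections import Counter


def retrievability(ranked_results, cutoff=10):
    """Count, per document, the queries that rank it within the top `cutoff`."""
    scores = Counter()
    for ranking in ranked_results.values():
        for doc_id in ranking[:cutoff]:
            scores[doc_id] += 1
    return scores


def gini(values):
    """Gini coefficient of non-negative values (0.0 means perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical toy data: query -> ranked list of document ids.
results = {
    "amsterdam": ["d1", "d2", "d3"],
    "rotterdam": ["d1", "d3", "d2"],
    "utrecht": ["d1", "d2", "d4"],
}
collection = ["d1", "d2", "d3", "d4"]

r = retrievability(results, cutoff=2)
# Documents that are never retrieved within the cutoff get a score of 0.
scores = [r[doc_id] for doc_id in collection]
bias = gini(scores)
```

A higher Gini value means that a few documents receive most of the retrieval opportunities, which is the kind of bias the talk is concerned with.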



Historical Newspapers as “Big Data”

On Tuesday, March 24, 2015, the National Library of The Netherlands organized a symposium on the use of digitized newspapers in the Digital Humanities. The goal of the symposium was to engage information specialists and end users in a discussion with the KB on future possibilities of using the (data in the) digital newspaper archive.

We presented our research ideas on estimating the impact of OCR errors on research tasks.

For more information, please see the report on the event on the KB website.

 

SEALINCMedia @WebSci2014

Large datasets such as Cultural Heritage collections require detailed annotations when digitised and made available online. Annotating different aspects of such collections requires a variety of knowledge and expertise which is not always possessed by the collection curators. Artwork annotation is an example of a knowledge-intensive image annotation task, i.e. a task that requires annotators to have domain-specific knowledge in order to complete it successfully. Today, Lora Aroyo will present at the WebSci2014 conference the results of a study investigating the applicability of crowdsourcing techniques to knowledge-intensive image annotation tasks. We observed a clear relationship between the annotation difficulty of an image, in terms of the number of items to identify and annotate, and the performance of the recruited workers. Here you can see the poster and the slides of the presentation.

ArtTagger: Labeling Works of Art by a Crowdsourcing Game

Games can be a powerful way to motivate people to participate in crowdsourcing. Cultural heritage institutes are eager to adopt crowdsourcing to let the public participate in cataloguing and curation, and allow for collection exploration and information discovery by non-professionals.

I, Dick de Leeuw, am a student of Information Sciences and am doing my Master Project at the Web & Media group at the VU University Amsterdam. The ArtTagger crowdsourcing game is part of my Master Project, carried out during spring 2014. I am supervised by Chris Dijkshoorn, who focuses on personalized semantic search in a linked cultural heritage environment.

The ArtTagger website, based on previous work by the SEALINCMedia project, aims to obtain high-quality labels for works of art through crowdsourcing. Users play a game in which they tag paintings with their respective category. Each round consists of a query image (i.e., the painting to label) and six candidate labels placed below the image. The candidate labels are accompanied by prototypical images and a description. Users score one point for every processed query image and ten bonus points when their choice agrees with the choices of art experts.
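
The scoring rule can be written down as a small Python sketch. This is only an illustration of the arithmetic described above, not the game's actual code: it simplifies the experts' choices to a single agreed label per painting, and the category names are hypothetical.

```python
POINTS_PER_IMAGE = 1   # awarded for every processed query image
EXPERT_BONUS = 10      # awarded when the player agrees with the experts


def score_round(player_label, expert_label):
    """Return the points earned for one processed query image."""
    bonus = EXPERT_BONUS if player_label == expert_label else 0
    return POINTS_PER_IMAGE + bonus


# Hypothetical session: (player choice, expert choice) per painting.
session = [
    ("portrait", "portrait"),
    ("landscape", "still life"),
    ("still life", "still life"),
]
total = sum(score_round(player, expert) for player, expert in session)
```

In this toy session the player agrees with the experts twice and disagrees once, so two rounds earn eleven points each and one round earns a single point.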

The aim of my research is to investigate the effects of an aesthetically pleasing and usable crowdsourcing website in the cultural heritage domain on people’s motivation. At the time of writing, I am still looking for people who are willing to play the ArtTagger game. By doing so, you help museums and improve your art knowledge.

Linking Birds Part 1: Converting the IOC World Bird List to RDF

In SEALINCMedia presentations about Accurator we often use the example of a print described as “bird near red leaf”. Although this description captures what is seen in the print, it could be much more precise: what sort of bird is depicted, and what type of leaf is the red leaf?

This is an ideal case for the Accurator framework. We engage the appropriate niche (bird enthusiasts) to help annotate the bird prints of the Rijksmuseum with bird names from a structured vocabulary. The only problem was that we did not have such a structured vocabulary at hand.

This is where the experts at Naturalis came in. They pointed us to the IOC World Bird List and the Dutch Species Register (Nederlands Soortenregister), and provided us with data from their own specimen collection. Since we aim to integrate these different datasets to create a comprehensive list of birds, we turned to RDF. In this blog post I describe the conversion of the IOC list.

The IOC World Bird List is available in multiple file formats. Using the ClioPatria server extended with the xmlrdf package, I started the conversion process by loading the available XML file. Xmlrdf automatically turns the hierarchy embedded in the XML into a graph structure. The graph can then be refined using rewrite rules such as the one below.


common_name_property @@
{ A, birds:englishName, B }
<=>
{ A, txn:commonName, B@en }.

As you can see, the rule above replaces the property created by xmlrdf with one from the TaxonConcept ontology. This ontology contains many concepts useful for modelling species data, and I reused as many of these concepts as possible. Initially all the concepts in the graph are blank nodes. Using the same sort of rewrite rules, I created IRIs of the form: http://purl.org/collections/birds/species-phoenicurus_auroreus. The IRIs consist of the namespace, the level in the hierarchy (e.g. genus or species) and the scientific name.
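
The IRI pattern described above can be illustrated outside of xmlrdf with a few lines of Python. This is a sketch of the naming scheme only, not the rewrite rule used in the conversion; the lowercasing and the replacement of spaces by underscores are inferred from the example IRI, and the function name is hypothetical.

```python
NAMESPACE = "http://purl.org/collections/birds/"


def make_iri(level, scientific_name):
    """Build an IRI from the namespace, hierarchy level and scientific name."""
    # Lowercase the name and join its parts with underscores (inferred rule).
    local = "_".join(scientific_name.lower().split())
    return f"{NAMESPACE}{level}-{local}"


species_iri = make_iri("species", "Phoenicurus auroreus")
genus_iri = make_iri("genus", "Phoenicurus")
```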

Another useful resource is available on the IOC website: a spreadsheet with bird names in 19 different languages. Using the scientific names I found the corresponding species IRI in the graph and added the different commonNames with the corresponding language tags. An example of information linked to the birds:species-phoenicurus_auroreus resource:

Predicate Value
rdf:type txn:SpeciesConcept
txn:authority “(Pallas, 1776)”
birds:breedingRegions “EU”
birds:breedingSubregions “c,e”
txn:commonName “rehek mongolský”@cs, “Amurrødstjert”@da,
“Spiegelrotschwanz”@de, “Daurian Redstart”@en,
“Colirrojo Dáurico”@es, “mustselglepalind”@et,
“laaksoleppälintu”@fi, “Rougequeue aurore”@fr,
“tükrös rozsdafarkú”@hu, “Codirosso daurico”@it,
“ジョウビタキ”@ja, “Spiegelroodstaart”@nl,
“Aurorarødstjert”@no, “pleszka chińska”@pl,
“Сибирская горихвостка”@ru, “žltochvost zrkadlový”@sk,
“Svartryggad rödstjärt”@sv, “北红尾鸲”@zh
txn:inGenus birds:genus-phoenicurus
birds:nonbreedingRegions “s China, ne India”
txn:scientificName “Phoenicurus auroreus”
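
As a rough illustration of how language-tagged common names like those above could be derived from the multilingual spreadsheet, here is a minimal Python sketch. It is not the procedure used in ClioPatria: the CSV layout (one `scientific_name` column followed by one column per language code) is an assumption, the helper names are hypothetical, and the output uses prefixed names in Turtle style with the prefix declarations left out.

```python
import csv
import io

# Hypothetical excerpt of the spreadsheet, exported as CSV (layout assumed).
SHEET = """scientific_name,en,nl,de
Phoenicurus auroreus,Daurian Redstart,Spiegelroodstaart,Spiegelrotschwanz
"""


def species_name(scientific_name):
    """Prefixed name of the species resource, following the IRI pattern."""
    return "birds:species-" + "_".join(scientific_name.lower().split())


def common_name_triples(csv_text):
    """Return one Turtle-style triple per non-empty common name."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = species_name(row.pop("scientific_name"))
        # The remaining columns are assumed to be keyed by language code.
        for lang, name in row.items():
            if name:
                escaped = name.replace("\\", "\\\\").replace('"', '\\"')
                triples.append(f'{subject} txn:commonName "{escaped}"@{lang} .')
    return triples


triples = common_name_triples(SHEET)
```

Matching on the scientific name works here because the same name is used to build the species IRI, so no separate lookup table is needed.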

Many of the objects are currently literals, while some of them could be linked to external vocabularies. Linking the regions to GeoNames is something I will look into in the future, although parsing the more specific regions will be troublesome (e.g. “w slope of the e Andes in c Colombia”).

In a following blog post I will describe the conversion of the Naturalis collection data to RDF and how I link that information to the IOC World Bird List. This conversion was done at the Web & Media group at the VU University Amsterdam. If the work sparked your interest, have a look at my site.