Historical Newspapers as “Big Data”

On Tuesday, March 24, 2015, the National Library of the Netherlands (Koninklijke Bibliotheek, KB) organized a symposium on the use of digitized newspapers in the Digital Humanities. The goal of the symposium was to engage information specialists and end users in a discussion with the KB on future uses of the digital newspaper archive and the data in it.

We presented our research ideas on estimating the impact of OCR errors on research tasks.

For more information, see the report on the event at the KB website.


Presentation @Rijksmuseum: Trusting User-contributed data for Cultural Heritage Domain

The cultural heritage domain has opened up to contributions from users on the web. These contributions are mainly tags that describe certain aspects of a cultural heritage object. With such a wide range of users, it becomes important to determine the quality of user-contributed content before it is published online. However, manually evaluating that quality is resource-intensive for cultural heritage institutions. In this talk, I will describe methods that can semi-automatically predict the quality of tags. These methods address three research questions: How can we trust an online contributor? How can we assess the quality of the annotation process? How can we trust the contributed data? The slides for the presentation can be found here.

SEALINCMedia @WebSci2014

Large datasets such as cultural heritage collections require detailed annotations when they are digitised and made available online. Annotating different aspects of such collections requires a variety of knowledge and expertise that collection curators do not always possess. Artwork annotation is an example of a knowledge-intensive image annotation task, i.e. a task that requires annotators to have domain-specific knowledge in order to complete it successfully. Today at the WebSci2014 conference, Lora Aroyo will present the results of a study investigating the applicability of crowdsourcing techniques to knowledge-intensive image annotation tasks. We observed a clear relationship between the annotation difficulty of an image, in terms of the number of items to identify and annotate, and the performance of the recruited workers. Here you can see the poster and the slides of the presentation.

ArtTagger: Labeling Works of Art by a Crowdsourcing Game

Games can be a powerful way to motivate people to participate in crowdsourcing. Cultural heritage institutes are eager to adopt crowdsourcing to let the public participate in cataloguing and curation, and allow for collection exploration and information discovery by non-professionals.

I, Dick de Leeuw, am a student of Information Sciences and am doing my Master Project in the Web & Media group at the VU University Amsterdam. The ArtTagger crowdsourcing game is part of that project, which runs during spring 2014. I am supervised by Chris Dijkshoorn, who focuses on personalized semantic search in a linked cultural heritage environment.

The ArtTagger website, based on previous work by the SEALINCMedia project, aims to obtain high-quality labels for works of art through crowdsourcing. Users play a game in which they tag paintings with their category. Each round shows a query image (i.e., the painting to label) with six candidate labels placed below it. Each candidate label is accompanied by prototypical images and a description. Users score one point for every processed query image and ten bonus points when their choice agrees with the choices of art experts.
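The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the ArtTagger implementation: the function name `score_round` is hypothetical, and it assumes the expert opinion for a painting is available as a single agreed label.

```python
def score_round(player_choice: str, expert_choice: str) -> int:
    """Return the points earned for one query image.

    Assumption: the experts' judgement is reduced to one label per painting.
    """
    points = 1  # one point for every processed query image
    if player_choice == expert_choice:
        points += 10  # bonus for agreeing with the art experts
    return points


# A player who labels three paintings and agrees with the experts twice.
rounds = [("landscape", "landscape"), ("portrait", "still life"), ("seascape", "seascape")]
total = sum(score_round(player, expert) for player, expert in rounds)
print(total)
```

Because every processed image earns a point regardless of agreement, a player is rewarded for continuing to play, while the bonus rewards careful choices.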

The aim of my research is to investigate the effects of an aesthetically pleasing and usable crowdsourcing website in the cultural heritage domain on people’s motivation. At the time of writing, I am still looking for people who are willing to play the ArtTagger game. By doing so, you help museums and improve your art knowledge.

Linking Birds Part 1: Converting the IOC World Bird List to RDF

In SEALINCMedia presentations about Accurator we often use the example of a print described as “bird near red leaf”. Although this description captures what is seen in the print, it could be much more precise: what sort of bird is depicted, what type of leaf is the red leaf, and so on.

This is an ideal case for the Accurator framework. We engage the appropriate niche (bird enthusiasts) to help annotate the bird prints of the Rijksmuseum with bird names from a structured vocabulary. The only problem was that we did not have such a structured vocabulary at hand.

This is where the experts at Naturalis came in. They pointed us to the IOC World Bird List and the Nederlands Soortenregister (the Dutch Species Register), and provided us with data from their own specimen collection. Since we aim to integrate these different datasets to create a comprehensive list of birds, we turned to RDF. In this blog post I describe the conversion of the IOC list.

The IOC World Bird List is available in multiple file formats. Using the ClioPatria server extended with the xmlrdf package, I started the conversion process by loading the available XML file. Xmlrdf automatically turns the hierarchy embedded in the XML into a graph structure. Using rewrite rules such as the one below, the graph can be refined.

common_name_property @@
{ A, birds:englishName, B }
<=>
{ A, txn:commonName, B@en }.

As you can see, the rule above replaces the property created by xmlrdf with one from the TaxonConcept ontology. This ontology contains many concepts useful for modelling species data, and I reused as many of them as possible. Initially all the concepts in the graph are blank nodes. Using the same sort of rewrite rules, I created IRIs of the form: http://purl.org/collections/birds/species-phoenicurus_auroreus. The IRIs consist of the namespace, the level in the hierarchy (e.g. genus or species) and the scientific name.
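The IRI pattern described above can be illustrated with a short Python sketch. It is not the rewrite rule used in the conversion: the function name `mint_iri` is hypothetical, and the normalisation (lowercasing the scientific name and joining its parts with underscores) is an assumption inferred from the example IRI.

```python
BIRDS_NS = "http://purl.org/collections/birds/"


def mint_iri(level: str, scientific_name: str) -> str:
    """Combine namespace, hierarchy level and scientific name into an IRI.

    Assumption: names are lowercased and whitespace becomes an underscore,
    as suggested by the species-phoenicurus_auroreus example.
    """
    slug = "_".join(scientific_name.lower().split())
    return f"{BIRDS_NS}{level}-{slug}"


print(mint_iri("species", "Phoenicurus auroreus"))
# → http://purl.org/collections/birds/species-phoenicurus_auroreus
print(mint_iri("genus", "Phoenicurus"))
# → http://purl.org/collections/birds/genus-phoenicurus
```

Deriving the IRI from the scientific name means the same species always gets the same identifier, which makes it possible to find it again when other sources are merged in.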

Another useful resource is available on the IOC website: a spreadsheet with bird names in 19 different languages. Using the scientific names I found the corresponding species IRI in the graph and added the different commonNames with the corresponding language tags. An example of information linked to the birds:species-phoenicurus_auroreus resource:

Predicate                  Value
rdf:type                   txn:SpeciesConcept
txn:authority              “(Pallas, 1776)”
birds:breedingRegions      “EU”
birds:breedingSubregions   “c,e”
txn:commonName             “rehek mongolský”@cs, “Amurrødstjert”@da,
                           “Spiegelrotschwanz”@de, “Daurian Redstart”@en,
                           “Colirrojo Dáurico”@es, “mustselglepalind”@et,
                           “laaksoleppälintu”@fi, “Rougequeue aurore”@fr,
                           “tükrös rozsdafarkú”@hu, “Codirosso daurico”@it,
                           “ジョウビタキ”@ja, “Spiegelroodstaart”@nl,
                           “Aurorarødstjert”@no, “pleszka chińska”@pl,
                           “Сибирская горихвостка”@ru, “žltochvost zrkadlový”@sk,
                           “Svartryggad rödstjärt”@sv, “北红尾鸲”@zh
txn:inGenus                birds:genus-phoenicurus
birds:nonbreedingRegions   “s China, ne India”
txn:scientificName         “Phoenicurus auroreus”
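The step of attaching multilingual names by scientific name can be sketched roughly as follows. This is an assumption-laden illustration rather than the code used for the conversion: the function `common_name_triples`, the row layout and the plain-tuple representation of triples are all hypothetical.

```python
from typing import Dict, List, Tuple

# A triple whose object is a (text, language tag) pair.
Triple = Tuple[str, str, Tuple[str, str]]


def common_name_triples(
    iri_by_scientific_name: Dict[str, str],
    rows: List[dict],
) -> List[Triple]:
    """Create txn:commonName triples for spreadsheet rows that match a species."""
    triples: List[Triple] = []
    for row in rows:
        iri = iri_by_scientific_name.get(row["scientific_name"])
        if iri is None:
            continue  # no species with this scientific name in the graph
        for lang, name in row["names"].items():
            if name:  # skip empty spreadsheet cells
                triples.append((iri, "txn:commonName", (name, lang)))
    return triples


species = {"Phoenicurus auroreus": "birds:species-phoenicurus_auroreus"}
rows = [
    {"scientific_name": "Phoenicurus auroreus",
     "names": {"en": "Daurian Redstart", "nl": "Spiegelroodstaart", "xx": ""}},
    {"scientific_name": "Unknown species", "names": {"en": "Mystery Bird"}},
]
for triple in common_name_triples(species, rows):
    print(triple)
```

Rows whose scientific name is not found are skipped silently here; in practice it would be worth logging them, since they may point to spelling differences between the two sources.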

Many of the objects are currently literals, while some of them could be linked to external vocabularies. Linking the regions to GeoNames is something I will look into in the future, although parsing the more specific regions will be troublesome (e.g. “w slope of the e Andes in c Colombia”).

In the next blog post I will describe the conversion of the Naturalis collection data to RDF and how I link that information to the IOC World Bird List. This conversion was done at the Web & Media group at the VU University Amsterdam. If the work sparked your interest, have a look at my site.