Category Archives: Media

Workshop summary – Two birds, one stone: Bridging cultural heritage collections with crowds and niches

On Monday the 31st of October, the workshop “Two birds, one stone: Bridging cultural heritage collections with crowds and niches” was held at the Netherlands Institute for Sound and Vision. The workshop was divided into two sessions: presentations and a practical session. In the first session, cultural heritage institutions presented their experiences with crowdsourcing, while the practical session involved testing the systems presented.

The first presentation was by Maarten Brinkerink from the Netherlands Institute for Sound and Vision and was titled “What’s That? Video Tagging Games for Audiovisual Heritage Collections”. In his presentation, Maarten stressed the importance of enriching one of the vast audiovisual collections in Europe, held by the Netherlands Institute for Sound and Vision, not only with professional annotations but also with the help of the crowd. One such crowdsourcing initiative is the “Waisda?” game, where users annotate videos online and the goal is consensus between players. Slides and talk.

Next, Sander Pieterse from Naturalis and the Xeno-canto Foundation for Nature Sounds, in his presentation “Every Feather and Song: Crowdsourcing and Co-curation from a Natural History Perspective”, emphasized crowdsourcing as a significant tool for “building a big collection together” and for enriching existing natural history collections. Moreover, he showed how crowdsourcing can be used as a way of networking between amateurs and professionals and how it can build a sense of connectedness within the communities it addresses. Slides and talk.

Saskia Scheltjens from Rijksmuseum Amsterdam, in “Accurator: Consolidation and Integration of Annotations”, presented the Accurator system, a project done in collaboration with VU University Amsterdam. Accurator is used for annotating artworks at the Rijksmuseum, but despite its usefulness, the results have yet to be integrated into the museum's collections because of the restructuring currently taking place within the Rijksmuseum. Slides and talk.

The last presentation was given by Chris Dijkshoorn from VU University Amsterdam and was titled “DigiBird: on the fly collection integration using crowdsourcing”. He presented the results of DigiBird, a project that reinforces crowdsourcing initiatives and integrates four distinct nature-related collections. He described how crowdsourcing is evolving into a valuable approach for collecting data, but faces challenges regarding sustainability and the use of results. Slides and talk.

The presentations were followed by a practical session in which the participants tried out the crowdsourcing systems presented, Accurator and Waisda?, together with the DigiBird platform. Within the scope of the DigiBird project, an instance of the Waisda? game was created with a selection of videos that contain birds, while for the Accurator system a set of artworks from the bird domain was selected. On the DigiBird platform, the participants could see not only on-the-fly integration of results from the crowdsourcing systems presented, but also results from platforms like Xeno-canto and Naturalis, general statistics of the integrated platforms, and real-time updates of annotations for artworks and videos containing birds.

DigiBird prototype

We extended the DigiBird website with new functionality showcasing the DigiBird integration pipeline. The main additions:

  • The annotation wall shows objects from crowdsourcing initiatives to which information has recently been added
  • The species page allows searching through multiple collections at once by entering the common name of a bird

Check it out at http://digibird.org/. The code of the prototype is available on GitHub.

Querylog-based Assessment of Retrievability Bias in Delpher

On March 17, we were invited by the National Library of the Netherlands to present the results of our study on retrievability bias in the Dutch historic newspaper archive.
The research was conducted in collaboration with the WebART project and will be presented at the Joint Conference on Digital Libraries (JCDL) 2016 in Newark, USA, in June 2016.

Summary of the talk:

Search engines are not “objective” pieces of technology, and bias in Delpher’s search engine may or may not harm user access to certain types of documents in the collection. In the worst case, systematic favoritism for a certain type can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the “retrievability” of all documents in a collection. We explain the ideas underlying the retrievability metric and how we measured it on the KB Newspaper collection. We describe and quantify the retrievability bias imposed on the newspaper collection by three different commonly used Information Retrieval models. For this, we investigated how document features such as length, type, or date of publication influence retrievability.
We also investigate the effectiveness of the retrievability measure, featuring two characteristics that set our experiments apart from previous studies: (1) the newspaper collection contains noise originating from OCR processing, and historical spelling and use of language; and (2) rather than the simulated queries used in other studies, we use real user query logs including click data. We show how simulated queries differ from real user queries regarding term frequency and prevalence of named entities, and how this affects the results of a retrieval task.
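
The idea behind the retrievability metric can be illustrated with a small sketch. It assumes the cumulative variant common in the retrievability literature: a document scores one point each time it appears in the top `cutoff` results of a query, and the inequality of the resulting scores is summarized with a Gini coefficient. The function names, the toy rankings and the cutoff are hypothetical, and the talk summary does not state which variant or inequality measure the study used.

```python
from collections import Counter


def retrievability(rankings, all_docs, cutoff=10):
    """Count, per document, how many queries return it within the top `cutoff`.

    `rankings` maps each query to its ranked list of document ids
    (hypothetical toy input; a real study would take these from a search engine).
    """
    counts = Counter({doc: 0 for doc in all_docs})
    for ranked in rankings.values():
        for doc in ranked[:cutoff]:
            counts[doc] += 1
    return dict(counts)


def gini(values):
    """Gini coefficient of non-negative scores: 0 means all scores are equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Toy example: three queries over a four-document collection.
rankings = {
    "amsterdam": ["d1", "d2", "d3"],
    "rotterdam": ["d1", "d3", "d2"],
    "krant": ["d1", "d2", "d4"],
}
scores = retrievability(rankings, ["d1", "d2", "d3", "d4"], cutoff=2)
print(scores)
print(round(gini(scores.values()), 3))
```

In this toy collection, `d4` is never returned within the cutoff, which is the kind of "invisible" document the summary warns about. A higher Gini value indicates that retrievability is concentrated in fewer documents.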

DigiBird kickoff meeting

On the 5th of February 2016 the kickoff meeting for the COMMIT/ valorisation project DigiBird took place. The meeting was hosted by the Netherlands Institute for Sound and Vision (Nederlands Instituut voor Beeld en Geluid). During the meeting, the people who will work on the project were introduced, together with the partners involved.

The DigiBird project builds on the results of the SEALINCMedia project, aiming to use crowdsourcing results to integrate three different media types: images, sounds and videos – all related to birds. The various datasets that belong to these different media types are provided by the partners involved in the project. Most of these platforms already use crowdsourcing as a means of annotating the bird media, but there is no single point of access for all of them and no means of crossover access. Thus, the goal of DigiBird is to achieve this integration by creating cross-links between collections and designing user-friendly interfaces. These will not only help to enable access to the various bird collections, but will also motivate people to contribute more knowledge by means of annotations.

The people who will work on developing this project are Chris Dijkshoorn, a PhD student, and Cristina-Iulia Bucur, a student assistant, both affiliated with VU University Amsterdam.

The partners involved in DigiBird are:

The meeting included a hands-on breakout session in which participants from the various partners sketched their own views of how the interfaces could look and, by building scenarios, how user interaction could be handled.

Impact Analysis of OCR Quality on Research Tasks in Digital Archives

We presented our paper on “Impact Analysis of OCR Quality on Research Tasks in Digital Archives” at this year’s International Conference on Theory and Practice of Digital Libraries (TPDL2015).

We describe how humanities scholars currently use digital archives and the challenges they face in adapting their research methods compared to using a physical archive. The required shift in research methods comes at the cost of working with digitally processed historical documents. A major concern for scholars is therefore the question of how much trust they can place in analyses based on noisy representations of source texts.

Based on interviews with humanities scholars and a literature study, we classify scholarly research tasks according to their susceptibility to errors originating from OCR-induced biases. Search results for “Amsterdam”, for example, are likely to be influenced by the confusion of the letters “s” and “f”, especially for material that was created before 1800, when the “long s” was still used.
We investigated which kind of data would be required to reduce the impact of such errors, and whether that data is available in the archive.
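
To make the “s”/“f” confusion concrete, the following minimal sketch lists the spellings a search term could take if any lowercase “s” had been read as “f” by the OCR software. It is an illustration only, not the method used in the paper. The function name is hypothetical, and the sketch ignores the historical rules for where the long s actually occurred.

```python
from itertools import product


def long_s_variants(term):
    """Return all spellings of `term` in which each lowercase 's' may appear as 'f'.

    Simplification (assumption): every lowercase 's' is treated as a possible
    long s, regardless of its position in the word.
    """
    options = [("s", "f") if ch == "s" else (ch,) for ch in term]
    return ["".join(chars) for chars in product(*options)]


print(long_s_variants("Amsterdam"))
```

A query for “Amsterdam” alone would miss documents in which the OCR output reads “Amfterdam”. The number of variants doubles with every additional “s” in the term.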

We describe our study of example research tasks performed on the digital newspaper archive of the National Library of The Netherlands. In this study, we tried to reduce the uncertainty of the results as much as possible with the data publicly available in the archive.

We conclude that the current knowledge situation on the scholars’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved.

Historical Newspapers as “Big Data”

On Tuesday, March 24, 2015, the National Library of The Netherlands organized a symposium on the use of digitized newspapers in the Digital Humanities. The goal of the symposium was to engage information specialists and end users in a discussion with the KB on future possibilities of using the (data in the) digital newspaper archive.

We presented our research ideas on estimating the impact of OCR errors on research tasks.

For more information about the event, please see the report on the KB website.

Presentation @Rijksmuseum: Trusting User-contributed data for Cultural Heritage Domain

The cultural heritage domain has opened up to contributions from users on the web. These contributions are mainly tags that describe certain aspects of a cultural heritage object. With such a wide range of users on the web, it becomes important to determine the quality of user-contributed content before it is published online. However, manually evaluating the quality of these contributions is resource-intensive for cultural heritage institutions. In this talk, I will describe methods which can semi-automatically predict the quality of tags. These methods address three research questions: How can we trust an online contributor? How can we assess the quality of the annotation process? How can we trust the contributed data? The slides for the presentation can be found here.

SEALINCMedia @WebSci2014

Large datasets such as Cultural Heritage collections require detailed annotations when digitised and made available online. Annotating different aspects of such collections requires a variety of knowledge and expertise which is not always possessed by the collection curators. Artwork annotation is an example of a knowledge-intensive image annotation task, i.e. a task that requires annotators to have domain-specific knowledge in order to complete it successfully. Today, Lora Aroyo will present at the WebSci2014 conference the results of a study investigating the applicability of crowdsourcing techniques to knowledge-intensive image annotation tasks. We observed a clear relationship between the annotation difficulty of an image, in terms of the number of items to identify and annotate, and the performance of the recruited workers. Here you can see the poster and the slides of the presentation.