GSCL Arbeitskreis Hypermedia

Choosing an XML database for linguistically annotated corpora

Richard Eckart, Technische Universität Darmstadt

Date: 30.09.2008, 11:00 - 11:40

Venue: Berlin-Brandenburgische Akademie der Wissenschaften, Room 1, Jägerstr. 22/23, D-10117 Berlin

XML has become the de facto standard for representing linguistically annotated corpora. It therefore seems safe to assume that storing and querying annotated corpora in XML databases is a straightforward procedure. In reality, however, it is not. The goal of this paper is to provide a guideline for deciding whether to use an XML database and how to choose a suitable product. To this end, we examine the following questions: What should be considered when keeping an XML-encoded annotated corpus in an XML database? What facilities do databases need to provide in order to be suitable for storing and querying annotated corpora? Do current XML databases offer these facilities, and if not, can they be added? The database products taken into account are eXist with the AnnoLab extensions, MonetDB/XQuery, and the Sedna XML database. All of these databases support the XQuery standard. XML-enabled databases that use proprietary query languages have been excluded from this analysis. Where applicable, the alternative of using a relational database or a graph database is considered for comparison.