Establishing Gold Standards for Web Corpora

Felix Bildhauer, Roland Schäfer (Berlin)

Date: 27.09.2012, 14:40 - 15:10

Workshop: Webkorpora in Computerlinguistik und Sprachforschung (27.-28.09.2012)

Venue: Institut für Deutsche Sprache (IDS), R5 6-13, D-68161 Mannheim

In our work on the texrex software and the COW corpora, we have so far focused on developing tools for processing huge web crawl data in order to derive clean and adequately homogeneous general-purpose web corpora, aiming to approximate random samples from the WWW. Crawling was done with existing crawler software. First, we briefly present the design of our own crawler software, which is intended to operate independently of search engine results and to implement measures that improve the randomness of the final corpora. At the same time, it reduces storage requirements for extremely large web crawls (several hundred million documents) by applying cleansing and post-processing on the fly, while the crawl is still running.
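As a rough illustration of such on-the-fly processing, the sketch below cleans each fetched page immediately and stores only the extracted text, so raw HTML never accumulates on disk. The cleaning step, queue handling, and output layout are simplified assumptions made for this example only; they do not reproduce the actual texrex or crawler implementation.

```python
import hashlib
import os
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, out_dir="corpus", max_docs=1000):
    """Breadth-first crawl that cleans every page as soon as it is fetched and
    stores only the cleaned text, so the raw HTML never has to be kept."""
    os.makedirs(out_dir, exist_ok=True)
    queue = deque(seed_urls)
    seen = set()
    while queue and len(seen) < max_docs:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        # Cleansing happens on the fly, during the crawl: only visible text is kept.
        text = soup.get_text(separator=" ", strip=True)
        doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
        with open(os.path.join(out_dir, doc_id + ".txt"), "w", encoding="utf-8") as f:
            f.write(text)  # only the post-processed text is written to disk
        # Enqueue outgoing links for the breadth-first traversal.
        for link in soup.find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))
```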

In the ongoing evaluation of the corpora, which contain up to approximately 10 billion tokens, we are trying to assess the quality of such huge corpora sampled from a largely unknown population. The central question (for example, when selecting training documents for an automatic cleansing algorithm to be applied to the crawl data) seems to be what defines "good" and "bad" documents in a web corpus. Since web corpora, in our view, do not necessarily need to resemble traditional or even balanced corpora in their composition, the customary method of comparing web corpora to such corpora is intrinsically limited.
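One way to make the notion of "good" and "bad" documents concrete is to start from simple surface features, such as the share of function words or the amount of sentence punctuation, as in the sketch below. The feature set, the tiny function-word list, and the threshold are purely illustrative assumptions; fixing such criteria is precisely what a gold standard would have to do.

```python
import re

# Tiny, purely illustrative list of German function words; a real feature set
# would be much larger and language-specific.
FUNCTION_WORDS = {"der", "die", "das", "und", "in", "von", "zu", "mit", "ist", "nicht"}


def surface_features(text):
    """Compute simple surface features that correlate with connected running text."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {"func_ratio": 0.0, "avg_token_len": 0.0, "punct_per_token": 0.0}
    return {
        "func_ratio": sum(t in FUNCTION_WORDS for t in tokens) / len(tokens),
        "avg_token_len": sum(len(t) for t in tokens) / len(tokens),
        "punct_per_token": len(re.findall(r"[.!?]", text)) / len(tokens),
    }


def looks_like_connected_text(text, min_func_ratio=0.15):
    """Crude decision rule: pages with a reasonable share of function words tend to
    contain running text rather than link lists, tables, or tag clouds.
    The threshold is an arbitrary assumption for this sketch."""
    return surface_features(text)["func_ratio"] >= min_func_ratio
```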

We present evaluation results for corpora from several European top-level domains with respect to the coverage of the respective segment of the WWW, the genre and text type distribution, lexical coverage and homogeneity, suitability for linguistic research, etc. Based on these results, we suggest iteratively approaching a definition of gold standards that answer, among other things, the following questions: (a) What is a good genre distribution in a web corpus? (b) What makes part of a web page boilerplate (beyond the question of which machine learning method to apply to remove it)? (c) What defines a web document that contains "predominantly connected text"? (d) What is an acceptable amount of duplication in a web corpus? Related to (d) is the question of how to represent the structure of the growing number of pages with strong in-document duplication through quotation (in blogs and forum threads).
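Question (d) can be made operational with standard near-duplicate detection. The sketch below uses word n-gram shingles and Jaccard similarity; the shingle size and similarity threshold are assumptions chosen for illustration, and a gold standard would have to specify which values count as acceptable duplication.

```python
def shingles(text, n=5):
    """Return the set of word n-grams (shingles) of a document."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, n=5):
    """Naive pairwise near-duplicate detection; for corpora with hundreds of
    millions of documents one would use hashing (e.g. MinHash) instead."""
    sets = [shingles(d, n) for d in docs]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```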
