Title: Domain Specific Corpora from the Web
Authors: P.V.S, Avinesh/McCarthy, Diana/Glennon, Dominic/Pomikálek, Jan
Year: 2012
Type: Article
Publisher: Universitetet i Oslo, Institutt for lingvistiske og nordiske studier
Place of publication: Oslo
In: Fjeld, Ruth V./Torjusen, Julie M. (eds.): Proceedings of the 15th EURALEX International Congress 2012, Oslo, Norway, 7-11 August 2012
Pages: 336-342
Languages studied: English
Keywords: database
specialised lexicography/LSP lexicography
internet lexicography/online lexicography
corpus-based lexicography
Medium: Online
URI: http://euralex.org/category/publications/euralex-oslo-2012/
Last accessed: 17.09.2018
Abstract: Language usage is dependent on domain and, as a consequence, domain specific corpora are extremely useful for language learning and lexicography. It is possible to label heterogeneous data for domain either manually or automatically using human knowledge or machine learning. State-of-the-art text classification uses supervised techniques whereby a system learns from previously annotated data. This works well when such data is available in sufficient quantities for supervised machine learning, though often that is not the case depending on the domain and language required. Moreover, this approach assumes that the heterogeneous data in the available corpus covers the required domains. In this paper we present the results of an approach using WebBootCat to retrieve data from the web in eight specific domains. A key component of this work was the use of the DANTE database for generating seed words for initial web data retrieval. To tailor the corpus to the nuances of the domain categorisation that we required, we used some of our own corpus data already annotated with subject codes (domain codes) to help refine the seed words used at the start of the iterative web retrieval process. Human effort was needed to refine a whitelist of words for each domain to reduce the chance of irrelevant data due to ambiguous terms in the seeds and extracted keywords used for subsequent retrieval. The domain corpora retrieved are loaded in the Sketch Engine. The word sketches and sketch difference functionality help reveal appropriate domain specific behaviour of words in the respective corpora.