Title: Probabilistic Explicit Topic Modeling Using Wikipedia
Authors: Hansen, Joshua A./Ringger, Eric K./Seppi, Kevin D.
Year: 2013
Type: Article
Publisher: Springer
Place: Heidelberg/Berlin
In: Gurevych, Iryna/Biemann, Chris/Zesch, Torsten (eds.): Language Processing and Knowledge in the Web. Proceedings of the 25th International Conference, GSCL 2013, Darmstadt, Germany, 25-27 September 2013
Pages: 69-82
Languages examined: English
Keywords: data modelling
internet lexicography/online lexicography
corpus-based lexicography
user contribution
Abstract: Despite popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections with relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topic analysis suffers from a lack of identifiability between topics not only across independently analyzed corpora but also across distinct runs of the algorithm on the same corpus. This paper introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD), and Explicit Dirichlet Allocation (EDA). Both of these methods estimate topic-word distributions a priori from Wikipedia articles, with each article corresponding to one topic and the article title serving as a topic label. LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess their effectiveness by means of crowd-sourced user studies on two tasks: topic label generation and document label generation. We find that LDA-STWD improves substantially upon the performance of the state-of-the-art on the document labeling task, and that both methods otherwise perform on par with a state-of-the-art post hoc method.
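Note: The abstract's core idea is that each Wikipedia article defines one explicit topic, labeled by the article title, with its topic-word distribution estimated a priori from the article text. The sketch below is not the paper's actual estimator; it illustrates one plausible reading of "estimating topic-word distributions from Wikipedia articles" using smoothed relative word frequencies. The function name, toy articles, and smoothing constant are illustrative assumptions.

```python
from collections import Counter

def estimate_topic_word_distributions(articles, smoothing=0.01):
    """Estimate one explicit topic per Wikipedia article.

    `articles` maps an article title (the topic label) to its raw text.
    Returns a dict: title -> {word: P(word | topic)}, with additive
    smoothing so every vocabulary word has nonzero probability.
    """
    # Build per-article word counts and a shared vocabulary.
    vocab = set()
    counts = {}
    for title, text in articles.items():
        tokens = text.lower().split()
        counts[title] = Counter(tokens)
        vocab.update(tokens)

    # Normalize counts into a probability distribution per topic.
    distributions = {}
    for title, word_counts in counts.items():
        total = sum(word_counts.values()) + smoothing * len(vocab)
        distributions[title] = {
            word: (word_counts.get(word, 0) + smoothing) / total
            for word in vocab
        }
    return distributions

# Toy usage: two "articles", each becoming one labeled topic.
articles = {
    "Topic model": "topic model latent dirichlet allocation corpus",
    "Wikipedia": "wikipedia free encyclopedia article title",
}
topics = estimate_topic_word_distributions(articles)
print(sorted(topics["Topic model"].items(), key=lambda kv: -kv[1])[:3])
```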