Web search results clustering in Polish : experimental evaluation of Carrot
[ 1 ] Instytut Informatyki (II), Wydział Informatyki i Zarządzania, Politechnika Poznańska | [ P ] pracownik
2003
referat
angielski
EN In this paper we consider the problem of web search results clustering in the Polish language, supporting our analysis with results acquired from an experimental system named Carrot. The algorithm we put into consideration — Suffix Tree Clustering has been acknowledged as being very efficient when applied to English. We present conclusions from its experimental application to Polish, demonstrating fragile areas of the algorithm related to rich inflection and certain properties of the input language. Our results indicate that the characteristics of produced clusters (number, distinctiveness), strongly depend on pre-processing phase. We also attempt to investigate the influence of two primary STC parameters: merge threshold and minimum base cluster score on the number and quality of results. Finally, we introduce two approaches to efficient, approximate conflation of Polish words: quasi-stemmer and an automaton-based lemmatization method.
209 - 219