Our work deals with methods for constructing a training corpus for the task of classifying information retrieval (IR) queries. We propose two methods for building a training corpus for machine-learning query classification. The first is based on a real-time experiment, and the second is performed on daily query logs. Building a suitable testing corpus is essential to the resulting quality of machine learning applications. For the first method, we asked human annotators to solve the reverse problem: instead of tagging queries made by someone else, they generated queries in order to satisfy a given goal. To enrich a small annotated corpus, we used existing logs, specifically information about clicks, that is, the search results that users clicked on. If the clicks of two queries overlap substantially, we treat their intents as similar. Overlap is measured by Pearson's correlation between the two click sets. Using this technique, we were able to automatically increase the corpus size 100-fold using only a single day of logs. The proposed methods can be adapted to other fields of computational linguistics where direct approaches to corpus acquisition show poor results.
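The click-based enrichment step can be illustrated with a minimal sketch. The abstract does not say how click sets are represented or how a correlation value is turned into a label, so everything here is an assumption made for illustration: each query is represented as a vector of click counts over all URLs seen in the log, an unlabeled query inherits the label of its most correlated labeled query, and the `threshold` value, the function names, and the sample data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector; treat as none
    return cov / sqrt(var_x * var_y)


def click_correlation(clicks_a, clicks_b, all_urls):
    """Correlate two queries' click counts over a shared URL vocabulary.

    Assumption: a "click set" is a dict mapping URL -> click count.
    """
    xs = [clicks_a.get(url, 0) for url in all_urls]
    ys = [clicks_b.get(url, 0) for url in all_urls]
    return pearson(xs, ys)


def propagate_labels(labeled, click_log, threshold=0.8):
    """Give each unlabeled query the label of its best-correlated labeled query.

    `threshold` is a hypothetical cut-off, not a value from the source.
    """
    all_urls = sorted({url for clicks in click_log.values() for url in clicks})
    new_labels = {}
    for query, clicks in click_log.items():
        if query in labeled:
            continue
        best_label, best_r = None, threshold
        for seed, label in labeled.items():
            if seed not in click_log:
                continue
            r = click_correlation(clicks, click_log[seed], all_urls)
            if r >= best_r:
                best_label, best_r = label, r
        if best_label is not None:
            new_labels[query] = best_label
    return new_labels


# Invented sample data: one annotated seed query and one day of click logs.
click_log = {
    "buy iphone": {"shop.example/iphone": 10, "store.example/phones": 5},
    "iphone price": {"shop.example/iphone": 8, "store.example/phones": 4},
    "iphone history": {"wiki.example/iphone": 7},
}
labeled = {"buy iphone": "transactional"}

result = propagate_labels(labeled, click_log)
print(result)  # {'iphone price': 'transactional'}
```

In this toy log, "iphone price" shares its clicked results with the seed query and inherits its label, while "iphone history" leads to a different result and stays unlabeled. A real log would likely need sparse vectors and some filtering of rarely clicked queries, neither of which is covered here.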