Corpora Acquisition for Machine Learning Web Query Intent Classification - conference talk | ISTINA – Intelligent System for Thematic Investigation of Scientometric Data

Authors: Svetlana Toldova, Max Ionov
International conference: 6th International Conference on Corpus Linguistics
Conference dates: 2014
Talk date: 24 May 2014
Talk type: Oral
Speaker: Max Ionov
Venue: Las Palmas de Gran Canaria
Abstract:
Our work deals with methods for constructing a training corpus for the task of IR query classification. We propose two such methods: the first is based on a real-time experiment, and the second is performed on daily query logs. Building a suitable testing corpus is essential to the resulting quality of machine learning applications. For the first method, we asked human annotators to solve the reverse problem: instead of tagging queries made by someone else, they generated queries in order to satisfy some goal. To enrich a small annotated corpus, we used existing logs, specifically information about clicks, that is, the search results that users clicked on. If the clicks of two queries overlap substantially, their intents are similar. Overlap is measured by Pearson's correlation of the two click sets. Using this technique, we were able to automatically increase the corpus size 100 times using only one day of logs. The proposed methods can be adapted to other fields of computational linguistics where direct approaches to corpora acquisition show poor results.
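The click-overlap idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how click sets are turned into vectors or how a label is transferred, so everything here is an assumption made for illustration: each query is represented by its click counts over a shared URL vocabulary, an unlabeled query inherits the intent of its most correlated labeled query, and the names `click_vector`, `propagate_labels`, the threshold of 0.8, and the sample log are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)


def click_vector(clicks, vocabulary):
    """Click counts of one query, ordered by a shared URL vocabulary."""
    return [clicks.get(url, 0) for url in vocabulary]


def propagate_labels(log, seed_labels, threshold=0.8):
    """Give each unlabeled query the intent of its most similar seed query.

    The threshold is an illustrative assumption, not a value from the talk.
    """
    vocabulary = sorted({url for clicks in log.values() for url in clicks})
    vectors = {q: click_vector(c, vocabulary) for q, c in log.items()}
    new_labels = {}
    for query, vector in vectors.items():
        if query in seed_labels:
            continue
        best_seed = max(seed_labels, key=lambda s: pearson(vector, vectors[s]))
        if pearson(vector, vectors[best_seed]) >= threshold:
            new_labels[query] = seed_labels[best_seed]
    return new_labels


# Hypothetical one-day log: query -> {clicked URL: number of clicks}.
sample_log = {
    "buy iphone": {"shop.example/iphone": 5, "store.example/phones": 3},
    "iphone price": {"shop.example/iphone": 4, "store.example/phones": 2},
    "iphone history": {"wiki.example/iphone": 6},
}
seed = {"buy iphone": "transactional"}  # small manually annotated corpus
expanded = propagate_labels(sample_log, seed)
```

In this toy log, "iphone price" shares its clicked results with the seed query and receives its label, while "iphone history" correlates negatively with it and stays unlabeled.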
Added to the system by: Toldova Svetlana Yurievna
