Generating training data for word sense disambiguation in Russian | [АВТОМАТИЧЕСКИЙ СБОР И РАЗМЕТКА ОБУЧАЮЩЕЙ КОЛЛЕКЦИИ ДЛЯ ЗАДАЧИ РАЗРЕШЕНИЯ ЛЕКСИЧЕСКОЙ НЕОДНОЗНАЧНОСТИ НА РУССКОМ ЯЗЫКЕ] (article)
Citation information for this article was obtained from Scopus.
Date of the last search for this article in external sources: 20 January 2021.
Abstract: The best approaches to Word Sense Disambiguation (WSD) are supervised and rely on large amounts of hand-labelled data, which is costly to create and not always available. For Russian, there is no sense-tagged resource large enough to train supervised word sense disambiguation algorithms. In our work we describe an approach for creating an automatically labelled collection based on monosemous relatives (related unambiguous entries). The main contribution of our work is that we extracted monosemous relatives that can be located at relatively long distances from a target ambiguous word and ranked them by their similarity to the target sense. The selected candidates are then used to extract training samples from a news corpus. We evaluated word sense disambiguation models based on nearest neighbor classification over BERT and ELMo embeddings. Our work relies on the Russian wordnet RuWordNet.
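To make the nearest neighbor step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each training context has already been turned into a fixed-size vector (in the paper these would come from BERT or ELMo; toy 3-dimensional vectors are used here) and labelled with the sense of the monosemous relative it was extracted for. The names `cosine` and `knn_sense`, the choice of cosine similarity, the value of `k`, and the sense labels are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_sense(target_vec, labelled, k=3):
    """Return the majority sense among the k training vectors most similar to target_vec.

    `labelled` is a list of (vector, sense_label) pairs.
    """
    ranked = sorted(labelled, key=lambda item: cosine(target_vec, item[0]), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy vectors standing in for contextual embeddings of automatically labelled samples.
train = [
    ([0.9, 0.1, 0.0], "sense_A"),
    ([0.8, 0.2, 0.1], "sense_A"),
    ([0.1, 0.9, 0.2], "sense_B"),
    ([0.0, 0.8, 0.3], "sense_B"),
]

# Embedding of a new occurrence of the ambiguous word (hypothetical values).
predicted = knn_sense([0.85, 0.15, 0.05], train, k=3)
print(predicted)
```

In a realistic setting the vectors would have hundreds of dimensions and the training set would hold many samples per sense, so a vectorised similarity computation would be preferable to this pure-Python loop.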