Extraction of Data from Mass Media Web Sites

Yatskov, A.K.; Varlamov, M.I.; Turdakov, D.Y.

Citation information for this article was obtained from Web of Science, Scopus
The article is published in a journal indexed in Web of Science and/or Scopus
Date of the last search for the article in external sources: October 27, 2018

Authors: Yatskov A.K., Varlamov M.I., Turdakov D.Yu
Journal: Programming and Computer Software
Volume: 44
Issue: 5
Year of publication: 2018
Publisher: Pleiades Publishing, Ltd
Publisher location: Road Town, United Kingdom
First page: 344
Last page: 352
DOI: 10.1134/s0361768818050092
Abstract: Understanding the current state and dynamics of the Internet information space requires fast, broad-coverage tools for extracting data from mass media sites. However, by no means all sites provide data syndication in the RSS format, and developing a specialized extraction tool for each Web site is costly. In this paper, methods for automatic extraction of news texts from arbitrary mass media sites are proposed. Classifying Web page types and then grouping their URLs improves the quality of news text extraction. A strategy for traversing a site and detecting the pages that contain hyperlinks to news pages is also proposed. This strategy decreases the number of requests and reduces the load on the site.
Added to the system by: Kornykhin Evgeny Valerievich
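
The abstract mentions grouping the URLs of classified pages but gives no details of how the grouping is done. The sketch below is only an illustration of the general idea, not the authors' method: it assumes a simple heuristic in which numeric path segments and slug-like segments are replaced by placeholders, so that URLs sharing a structure fall into one group. The function names `url_pattern` and `group_urls`, the placeholder tokens, and the length threshold are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse structural pattern (assumed heuristic)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    generalized = []
    for seg in segments:
        if seg.isdigit():
            # Purely numeric segments (dates, numeric ids) become <num>.
            generalized.append("<num>")
        elif re.search(r"\d", seg) or len(seg) > 20:
            # Segments containing digits or very long segments are treated
            # as article slugs; the threshold of 20 is an arbitrary assumption.
            generalized.append("<slug>")
        else:
            # Short alphabetic segments (section names) are kept verbatim.
            generalized.append(seg)
    return "/" + "/".join(generalized)


def group_urls(urls):
    """Group URLs by their structural pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "https://example.com/news/2018/10/27/some-long-article-title-here-12345",
        "https://example.com/news/2018/10/26/another-story-987",
        "https://example.com/about",
    ]
    for pattern, members in group_urls(sample).items():
        print(pattern, len(members))
```

Under this assumed heuristic, the two example news URLs share one pattern while the "about" page forms its own group; a real system would likely combine such patterns with the page-type classifier described in the abstract.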
