Moscow Digital Herbarium (https://plant.depo.msu.ru/) is the sixth largest library of imaged herbarium specimens in the world. We are constantly adding new images of recently collected specimens, label transcriptions, and manual georeferences. As of 4 May 2018, Moscow Digital Herbarium hosts 911,543 images of 914,324 specimens, 101,415 captured labels, and 109,375 georeferences. It is fully indexed in GBIF and delivers ca. 75% of the occurrences published by Russian institutions to the wider community.

A herbarium specimen, with both a dry plant and a label mounted on a sheet of paper, is a convenient object for 2D imaging and subsequent OCR mining of text data. Since March 2018, we have run an OCR procedure for automatic label capture, following the best practices of the New York, Paris, and Edinburgh herbaria (NY, P, E). After some tests we decided to use the open-source Tesseract software. The raw OCR output was cleaned by ca. 50 sophisticated quality checks and error-removal rules. Finally, we kept only "long" results with at least 100 characters and 5 words. We combined up to four languages in the OCR procedure, with English + Russian as the default combination. Labels containing words characteristic of the German language were reprocessed in German mode (ca. 2,000 labels of bryophytes).

We used the OCR transcriptions of labels for the following activities. (1) Searching for the collections of particular collectors, to link the images with the existing tables from which the labels were originally produced. (2) Searching for collections with printed coordinates, for quick georeferencing of these specimens. (3) Country-tagging of the extra-Russian collections, for further processing of labels using a country filter. (4) Region-tagging of the Russian collections, for further processing of labels using a regional filter. (5) Searching for mistakes in earlier manual attribution of a country or a herbarium area and in earlier label capture.

Thus, we do not use the text data mined by OCR as a substitute for thorough label capture. We regard it as a powerful tool for pre-selecting specimens, which makes database management more efficient. MW digitisation was supported in 2015–2018 by grant #14-50-00029 from the Russian Science Foundation (RNF).
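The post-OCR selection steps described above can be illustrated with a minimal Python sketch. It is not the project's actual code: only the thresholds (100 characters, 5 words) and the English + Russian default come from the text. The function names, the short list of German marker words, the coordinate pattern, and the Tesseract language codes `eng+rus` and `deu` are assumptions made for illustration. The OCR call itself is left out; the functions operate on text that Tesseract has already produced.

```python
import re

# Thresholds stated in the abstract.
MIN_CHARS = 100
MIN_WORDS = 5

# Tesseract language strings (assumed codes; "+" combines languages).
DEFAULT_LANGS = "eng+rus"
GERMAN_LANGS = "deu"

# Hypothetical marker words; the real list is not given in the abstract.
GERMAN_MARKERS = {"und", "bei", "auf", "der", "zwischen"}

# Hypothetical pattern for printed coordinates such as 54°50' or 37° 37.5′.
COORD_RE = re.compile(r"\d{1,3}\s*°\s*\d{1,2}(?:[.,]\d+)?\s*['′]")


def is_long_result(text: str) -> bool:
    """Keep only 'long' OCR results: at least 100 characters and 5 words."""
    cleaned = " ".join(text.split())  # collapse runs of whitespace
    return len(cleaned) >= MIN_CHARS and len(cleaned.split()) >= MIN_WORDS


def choose_langs(text: str) -> str:
    """Return the language string for a second OCR pass, if German is likely."""
    words = set(re.findall(r"[a-zäöüß]+", text.lower()))
    return GERMAN_LANGS if words & GERMAN_MARKERS else DEFAULT_LANGS


def has_printed_coordinates(text: str) -> bool:
    """Flag labels that appear to carry printed coordinates."""
    return COORD_RE.search(text) is not None
```

A label that passes `is_long_result` and `has_printed_coordinates` would be a candidate for quick georeferencing, while a label for which `choose_langs` returns `deu` would be queued for a repeated OCR pass in German mode.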
No. | Name | Description | File name | Size | Added |
---|---|---|---|---|---|
1. | Presentation | | Irkutsk_2018.pdf | 6.8 MB | 9 September 2018 [Allium] |