Large-scale scientific experiments produce vast volumes of data. These data are stored, processed, and analyzed in a distributed computing environment, and the life cycle of an experiment is managed by specialized software such as Distributed Data Management (DDM) and Workload Management (WMS) systems. To be interpreted and mined, experimental data must be accompanied by auxiliary metadata, which are recorded at each phase of data handling. Processing and analyzing this constantly growing body of auxiliary metadata is no less challenging a task than managing and processing the experimental data themselves. Modern LHC collaborations operate at the petabyte scale in production and distributed analysis processing.

Exploring the metadata storage and processing techniques used in ATLAS, we have identified potential bottlenecks and propose methods of improvement based on a non-relational approach:

1) The ATLAS PanDA WMS runs more than 1.5 million jobs per day, and the full jobs archive now holds over a billion records. As the metadata volume grows, the underlying software and hardware stack encounters limits that degrade processing speed and restrict the possibilities of metadata analysis. To improve the scalability and performance of analytical and reporting applications built on computational job metadata, we are developing a Hybrid Metadata Storage that segments the metadata between relational and non-relational database back-ends (a minimal routing sketch is given below).

2) LHC experiments have a set of metadata sources recording information at each phase of the experiment lifecycle: the Metadata Database, a publications search engine, the DDM system, the WMS, JIRA, Indico, the Document Server, and TWikis. These sources are loosely coupled and may present an end user with inconsistent answers to a request. To aggregate and synthesize this range of primary metadata sources, and to enhance them with flexible, schema-less additions of aggregated data, we are developing the Data Knowledge Catalog, which serves as the intelligence behind GUIs and APIs (see the aggregation sketch below).

We will present our current accomplishments in evaluating NoSQL technologies, such as Spark+Cassandra and HBase, for aggregating and storing LHC experiment meta-information and as a back-end for a future knowledge data storage, and we will address questions of performance, data volume, and reliability of such a system (an illustrative Spark-to-Cassandra read is sketched below).
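The segmentation behind the Hybrid Metadata Storage can be illustrated with a minimal sketch. The class below is a hypothetical facade, not PanDA's actual schema or code: sqlite3 stands in for the relational back-end and a plain dict for the non-relational archive (e.g. Cassandra or HBase); recent jobs are served from the relational side, while jobs past an assumed 90-day window migrate to the schema-less side.

```python
import sqlite3
from datetime import datetime, timedelta

class HybridJobStore:
    """Hypothetical facade over two back-ends: a relational store for
    recent, frequently updated jobs and a schema-less archive for the
    historical long tail."""

    def __init__(self, active_window=timedelta(days=90)):
        self.active_window = active_window
        self.rdb = sqlite3.connect(":memory:")  # stand-in for the relational DB
        self.rdb.execute("CREATE TABLE jobs (pandaid INTEGER PRIMARY KEY, "
                         "status TEXT, modtime TEXT)")
        self.archive = {}  # stand-in for Cassandra/HBase: id -> document

    def put_job(self, pandaid, status, modtime):
        """Record or update an active job on the relational side."""
        self.rdb.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
                         (pandaid, status, modtime.isoformat()))

    def get_job(self, pandaid):
        """Read path: try the relational store, fall back to the archive."""
        row = self.rdb.execute("SELECT pandaid, status, modtime FROM jobs "
                               "WHERE pandaid = ?", (pandaid,)).fetchone()
        if row:
            return {"pandaid": row[0], "status": row[1], "modtime": row[2]}
        return self.archive.get(pandaid)

    def migrate_inactive(self, now=None):
        """Move jobs older than the active window to the archive, keeping
        the relational tables small enough for fast analytical queries."""
        cutoff = ((now or datetime.utcnow()) - self.active_window).isoformat()
        rows = self.rdb.execute("SELECT pandaid, status, modtime FROM jobs "
                                "WHERE modtime < ?", (cutoff,)).fetchall()
        for pandaid, status, modtime in rows:
            self.archive[pandaid] = {"pandaid": pandaid, "status": status,
                                     "modtime": modtime}
            self.rdb.execute("DELETE FROM jobs WHERE pandaid = ?", (pandaid,))
        return len(rows)
```

A production system would route by job state rather than a fixed window, but the read/migrate split is the essential point: hot metadata stays relational, while the billion-record archive moves to a horizontally scalable store.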
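For the Data Knowledge Catalog, the core operation is merging records about one entity from several loosely coupled sources into a single schema-less document while keeping disagreements visible rather than silently picking a winner. The function, field names, and example dataset identifier below are hypothetical stand-ins, not the catalog's actual schema:

```python
from collections import defaultdict

def build_catalog_entry(entity_id, sources):
    """Aggregate per-source records about one entity (a dataset, task,
    or publication) into one schema-less catalog document.  Fields on
    which all sources agree are promoted to the top level; conflicting
    fields are kept per source under "_conflicts" so inconsistencies
    between primary sources stay visible to GUIs and APIs."""
    entry = {"_id": entity_id, "_sources": sorted(sources)}
    per_field = defaultdict(dict)            # field -> {source: value}
    for source, record in sources.items():
        for field, value in record.items():
            per_field[field][source] = value
    conflicts = {}
    for field, values in per_field.items():
        if len(set(map(repr, values.values()))) == 1:
            entry[field] = next(iter(values.values()))
        else:
            conflicts[field] = values
    if conflicts:
        entry["_conflicts"] = conflicts
    return entry

# Two sources describe the same (hypothetical) dataset and disagree on size:
doc = build_catalog_entry("data15_13TeV.00266904.physics_Main", {
    "ddm": {"nfiles": 120, "size_tb": 1.4},
    "wms": {"nfiles": 120, "size_tb": 1.5, "taskid": 700123},
})
print(doc["nfiles"])        # 120 -- consistent across sources, promoted
print(doc["_conflicts"])    # {'size_tb': {'ddm': 1.4, 'wms': 1.5}}
```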
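Finally, a sketch of what the Spark+Cassandra evaluation path looks like: reading archived job metadata from Cassandra into Spark via the DataStax spark-cassandra-connector, which must be supplied on the classpath (e.g. with --packages). The connection host, keyspace, table, and column names are assumed placeholders, not the actual archive schema:

```python
from pyspark.sql import SparkSession

# Assumes the connector is on the classpath, e.g.:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 scan.py
spark = (SparkSession.builder
         .appName("panda-archive-scan")
         .config("spark.cassandra.connection.host", "cassandra.example.org")
         .getOrCreate())

# Load the (hypothetical) billion-row jobs archive as a DataFrame.
jobs = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="panda_archive", table="jobs_archived")
        .load())

# A typical reporting query: daily job counts broken down by final status.
(jobs.groupBy("modificationdate", "jobstatus")
     .count()
     .orderBy("modificationdate")
     .show(20, truncate=False))
```

The same query against HBase would go through a different connector; the evaluation compares exactly such full-archive scans across back-ends for throughput, data volume, and reliability.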