RCC MSU is developing the Octotron system [1], which is designed to automatically detect and eliminate the consequences of emergency situations in supercomputers in order to maximize equipment safety and minimize resource downtime. The system is based on a supercomputer model that describes the components of the computing system and their interconnections. Octotron constantly compares the “theory” (the model description) with the “practice” (real monitoring data) and reacts accordingly if they differ. The quality of the Octotron service depends heavily on the completeness of the model. Since it is very difficult to create such a model manually, this process should be automated. One of the main components of a supercomputer is the communication network, so it was decided to develop a tool that automatically detects and describes the model of InfiniBand and Ethernet networks in supercomputing systems. The tool automatically detects nodes and switches, collects the required description data and identifies the connections between them. ARP data, forwarding tables and LLDP results are used to build the Ethernet network topology; data from the subnet manager is used for the InfiniBand network. The distinctive features of the tool are: detection of both switches and computing nodes; flexibility in the choice of data to be collected; intelligent identification of objects and interconnects that were not identified directly; open source code. At present the tool is successfully used as part of the Octotron system installed on the largest Russian supercomputer systems.

Acknowledgments. This work is supported by RFBR, research project No. 16-07-01199.

References
[1] Antonov A., Nikitenko D., Shvets P., Sobolev S., Stefanov K., Voevodin Vad., Voevodin V. and Zhumatiy S.: An Approach for Ensuring Reliable Functioning of a Supercomputer Based on a Formal Model. LNCS 9573 (accepted).
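As an illustration of how ARP data, forwarding tables and LLDP results might be combined into an Ethernet topology, here is a minimal Python sketch. It is not the tool's actual implementation: the function name `build_topology`, the input data shapes and the rule "a MAC address learned on a port with no LLDP neighbor belongs to a directly attached node" are all assumptions made for this example.

```python
def build_topology(lldp_neighbors, forwarding_tables, arp_table):
    """Return a set of links, each a sorted pair of (device, interface) endpoints.

    All input formats are hypothetical:
      lldp_neighbors    -- iterable of (local_dev, local_port, remote_dev, remote_port)
      forwarding_tables -- {switch: {mac: port}}
      arp_table         -- {mac: hostname}
    """
    links = set()
    inter_switch_ports = set()

    # Step 1: LLDP reports switch-to-switch links directly.
    # Sorting the endpoints removes duplicates reported from both sides.
    for local_dev, local_port, remote_dev, remote_port in lldp_neighbors:
        a, b = (local_dev, local_port), (remote_dev, remote_port)
        links.add(tuple(sorted((a, b))))
        inter_switch_ports.update((a, b))

    # Step 2: infer node attachments that were not reported directly.
    # Assumed heuristic: a MAC seen on a port without an LLDP neighbor
    # belongs to a node plugged into that port.
    for switch, table in forwarding_tables.items():
        for mac, port in table.items():
            if (switch, port) in inter_switch_ports:
                continue  # MAC was learned through another switch
            host = arp_table.get(mac)
            if host is not None:
                # The node's port name is unknown, so its MAC stands in for it.
                links.add(tuple(sorted(((switch, port), (host, mac)))))
    return links


# Made-up example: two switches joined by one link, one node on each.
example_links = build_topology(
    lldp_neighbors=[("sw1", "p24", "sw2", "p1"), ("sw2", "p1", "sw1", "p24")],
    forwarding_tables={
        "sw1": {"aa:01": "p1", "aa:02": "p24"},
        "sw2": {"aa:02": "p2", "aa:01": "p1"},
    },
    arp_table={"aa:01": "node1", "aa:02": "node2"},
)
for link in sorted(example_links):
    print(link)
```

In this sketch the heuristic fails if an unmanaged switch without LLDP support sits between a node and a managed switch, since several MAC addresses would then appear on the same edge port. A real tool would need additional rules for such cases.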