Any modern supercomputer has an extremely complex architecture, and using its resources efficiently is often very difficult, even for experienced users. At the same time, demand for high-performance computing keeps growing, so the efficient utilization of supercomputers is an urgent issue. Users therefore need to know everything important about the performance of their jobs running on a supercomputer in order to optimize them, and administrators need to be able to monitor and analyze all aspects of how efficiently such systems function. However, there is currently no complete understanding of which data should be studied (and how they should be analyzed) in order to obtain a full picture of the state of the supercomputer and the processes taking place on it. In this paper, we make a first attempt to answer this question. To do this, we are developing a model that describes all the potential factors that may be important when analyzing the performance of supercomputer applications and of the HPC system as a whole. The paper provides a detailed description of this model for users and administrators, together with several real-life examples discovered on the Lomonosov-2 supercomputer using a software implementation based on the proposed model.
In modern high-performance computing, a paradox of supercomputer efficiency can be observed: supercomputers seem to work efficiently, but in reality this is not entirely true. Let us consider this problem in more detail.
Supercomputing is increasingly in demand [1]. The reason is that solving a growing number of scientific tasks requires computationally expensive experiments. Cloud computing or ordinary servers are often not suitable for these purposes, and that is when supercomputers come to the fore. As a result, more and more specialists from various scientific fields (such as astrophysics, genomic research, nanotechnology, big data analysis, artificial intelligence, and many others) are beginning to use supercomputer resources, and the demand for supercomputers is growing. Consequently, all available HPC resources are constantly occupied, and users often have to wait in the queue for a long time before their jobs start. For example, the average waiting time in the main queue of the Lomonosov-2 supercomputer [2] in 2020 (until mid-December) was 20 hours. During the same period, the average utilization of the supercomputer, defined as the average share of occupied compute nodes (those on which jobs are running) among all available ones, was very high, at 97%. From this point of view, we can say that the available supercomputer resources are used very efficiently.
