Abstract: Problems of high-dimensional data analysis are discussed. We consider a binary response variable $Y$ that depends on some factors $X_1,\ldots,X_n$. In a number of medical and biological studies, e.g., in genetics, such a $Y$ can describe the state of a patient's health. For example, $Y=-1$ and $Y=1$ mean “sick” and “healthy”, respectively. One can assume that there are genetic and non-genetic risk factors provoking specified complex diseases such as diabetes, hypertension, myocardial infarction and others. Many researchers share the paradigm that the impact of any single factor can be rather small (non-dangerous), whereas certain combinations of these factors can lead to a significant effect. Moreover, one usually assumes that the response variable depends only on some of the factors. A challenging problem in modern genetics is to identify the collection of factors responsible for increasing the risk of a specified complex disease. Progress in reading the human genome (especially microchip techniques) has made it possible to collect genetic datasets for
analysis by means of various complementary statistical tools. Theoretical contributions are provided along with various simulation procedures. A review of investigations in genome-wide association studies (GWAS) during the last five years is given, e.g., in Visscher et al. (2012). Here we concentrate on the multifactor dimensionality reduction (MDR) method introduced by M. D. Ritchie et al. (2001) and its further development. Following Bulinski (2012), we study the estimate of the prediction error of $Y$ by means of a function of discrete random variables $X_1,\ldots,X_n$. To this end we employ a penalty function and use the $K$-fold cross-validation procedure. In this way it is possible to justify the choice of a significant collection of factors. We also tackle applications of this approach to the analysis of risks of complex diseases, started in Bulinski et al. (2012).
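
The cross-validated error estimate mentioned above can be illustrated with a minimal Python sketch. It is not the estimator of the cited papers: the cell-labelling rule is a simplified MDR-style rule, and the penalty is assumed to weight each observation by the inverse of its class frequency, so that the error becomes the average of the two per-class misclassification rates. The function names `mdr_fit`, `balanced_error` and `cv_error` are hypothetical.

```python
import random
from collections import defaultdict


def mdr_fit(rows, labels, factors):
    """Label every observed cell of the chosen factors as +1 or -1.

    Simplified MDR-style rule (an assumption of this sketch): a cell gets +1
    when its share of the +1 cases exceeds its share of the -1 cases.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    counts = defaultdict(lambda: [0, 0])  # cell -> [number of -1, number of +1]
    for x, y in zip(rows, labels):
        cell = tuple(x[j] for j in factors)
        counts[cell][1 if y == 1 else 0] += 1
    # pos / n_pos > neg / n_neg, written without division
    return {cell: (1 if pos * n_neg > neg * n_pos else -1)
            for cell, (neg, pos) in counts.items()}


def balanced_error(rule, rows, labels, factors, default=-1):
    """Average of the per-class misclassification rates.

    This corresponds to an inverse-class-frequency penalty, which is an
    assumption here. Cells unseen in training are predicted as `default`.
    """
    per_class = []
    for cls in (-1, 1):
        members = [x for x, y in zip(rows, labels) if y == cls]
        if not members:
            continue  # class absent from this fold
        wrong = sum(1 for x in members
                    if rule.get(tuple(x[j] for j in factors), default) != cls)
        per_class.append(wrong / len(members))
    return sum(per_class) / len(per_class) if per_class else 0.0


def cv_error(rows, labels, factors, k=5, seed=0):
    """K-fold cross-validation estimate of the penalized prediction error."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        rule = mdr_fit([rows[i] for i in train],
                       [labels[i] for i in train], factors)
        errors.append(balanced_error(rule, [rows[i] for i in test],
                                     [labels[i] for i in test], factors))
    return sum(errors) / len(folds)
```

Under these assumptions, one would compare `cv_error` across candidate collections of factors (for example `(0, 1)` against `(2,)`) and regard the collection with the smallest estimated error as the significant one.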