Please use this identifier to cite or link to this resource: http://repository.hneu.edu.ua/handle/123456789/23444
Full metadata record
DC Field | Value | Language
dc.contributor.author | Minukhin S. V. | -
dc.contributor.author | Novikov M. | -
dc.contributor.author | Brynza N. O. | -
dc.contributor.author | Sitnikov D. E. | -
dc.date.accessioned | 2020-08-25T11:12:06Z | -
dc.date.available | 2020-08-25T11:12:06Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | Minukhin S. Experimental research of optimizing the Apache Spark tuning: RDD vs Data Frames / S. Minukhin, M. Novikov, N. Brynza, D. Sitnikov // Proceedings of The Third International Workshop on Computer Modeling and Intelligent Systems (CMIS-2020), April 27-May 1. - Zaporizhzhia, 2020. - PP. 409-425. | ru_RU
dc.identifier.uri | http://repository.hneu.edu.ua/handle/123456789/23444 | -
dc.description.abstract | This paper presents the results and analysis of experimental research into how changing the Apache Spark tuning parameters from their standard values affects application execution time, with the aim of minimizing it. A test dataset structure has been developed, using RDD and Data Frames, that makes it possible to generate, in minimal time, text files larger than 4 GB with properties (characteristics) configured for testing. A distinctive feature of the test data is that they often reflect basic properties of real-world problems. The investigation has two stages: the first is a comparative analysis of RDD and Data Frames under the standard Apache Spark settings; the second consists of experiments on input test datasets of different sizes to assess the influence of the level of parallelism, the HDFS block size, and the spark.sql.shuffle.partitions parameter in Spark Data Frames. The results substantiate the influence of the spark.sql.shuffle.partitions value on the execution performance of the test task; ranges and trends of change have been found for this parameter. The levels of parallelism that most strongly influence execution time have also been determined. It has been proven that for certain input test file sizes the HDFS block size can be left at its default value. The results of the computational experiments are presented in tables and graphs. They confirm the effectiveness of the suggested changes to the Apache Spark settings, compared with the standard ones, for different sizes of tested files. | ru_RU
dc.language.iso | en | ru_RU
dc.subject | Apache Spark | ru_RU
dc.subject | resilient distributed dataset | ru_RU
dc.subject | Data Frames | ru_RU
dc.subject | HDFS | ru_RU
dc.subject | shuffling | ru_RU
dc.subject | level of parallelism | ru_RU
dc.subject | data processing | ru_RU
dc.subject | data set | ru_RU
dc.subject | application | ru_RU
dc.subject | execution time | ru_RU
dc.title | Experimental research of optimizing the Apache Spark tuning: RDD vs data frames | ru_RU
dc.type | Article | ru_RU
Appears in Collections: Статті (ІКТ)
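
The abstract names three tuning knobs: the level of parallelism, the HDFS block size, and spark.sql.shuffle.partitions. The following is a minimal sketch of how such settings could be collected and passed to spark-submit; it is not the authors' experimental setup. The helper names `build_spark_conf` and `to_submit_args` are hypothetical, the mapping of "level of parallelism" to `spark.default.parallelism` is an assumption, and the numeric values in the usage lines are illustrative only, not results from the paper.

```python
from typing import Dict, List, Optional

MB = 1024 * 1024  # bytes per mebibyte


def build_spark_conf(shuffle_partitions: int = 200,
                     default_parallelism: Optional[int] = None,
                     hdfs_block_size_mb: Optional[int] = None) -> Dict[str, str]:
    """Collect tuning settings as Spark configuration key/value strings.

    200 is Spark's documented default for spark.sql.shuffle.partitions.
    """
    if shuffle_partitions < 1:
        raise ValueError("shuffle_partitions must be a positive integer")
    conf = {"spark.sql.shuffle.partitions": str(shuffle_partitions)}
    if default_parallelism is not None:
        # Assumption: the paper's "level of parallelism" is modeled here
        # by spark.default.parallelism.
        conf["spark.default.parallelism"] = str(default_parallelism)
    if hdfs_block_size_mb is not None:
        # Properties prefixed with "spark.hadoop." are forwarded to the
        # Hadoop configuration; dfs.blocksize is given in bytes.
        conf["spark.hadoop.dfs.blocksize"] = str(hdfs_block_size_mb * MB)
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the settings as repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only; the paper's own ranges are in its tables.
tuned = build_spark_conf(shuffle_partitions=64,
                         default_parallelism=16,
                         hdfs_block_size_mb=128)
print(" ".join(["spark-submit"] + to_submit_args(tuned) + ["app.py"]))
```

Keeping the settings in a plain dictionary makes it straightforward to sweep one parameter at a time across input sizes, which is the kind of comparison the abstract describes.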

Files in This Item:
File | Description | Size | Format
paper31.pdf | | 503.34 kB | Adobe PDF


All items in the electronic archive are protected by copyright, with all rights reserved.