Experimental research of optimizing the Apache Spark tuning: RDD vs data frames

Minukhin S. V.; Novikov M.; Brynza N. O.; Sitnikov D. E.

Please use this identifier to cite or link to this item: http://repository.hneu.edu.ua/handle/123456789/23444

Full metadata record

DC Field	Value	Language
dc.contributor.author	Minukhin S. V.	-
dc.contributor.author	Novikov M.	-
dc.contributor.author	Brynza N. O.	-
dc.contributor.author	Sitnikov D. E.	-
dc.date.accessioned	2020-08-25T11:12:06Z	-
dc.date.available	2020-08-25T11:12:06Z	-
dc.date.issued	2020	-
dc.identifier.citation	Minukhin S. Experimental research of optimizing the Apache Spark tuning: RDD vs Data Frames / S. Minukhin, M. Novikov, N. Brynza, D. Sitnikov // Proceedings of The Third International Workshop on Computer Modeling and Intelligent Systems (CMIS-2020), April 27-May 1. - Zaporizhzhia, 2020. - PP. 409-425.	ru_RU
dc.identifier.uri	http://repository.hneu.edu.ua/handle/123456789/23444	-
dc.description.abstract	This paper presents the results and analysis of experiments that assess how changing Apache Spark tuning parameters from their standard values affects application execution time, with the goal of minimizing it. A test dataset structure has been developed using RDD and Data Frames, which makes it possible to quickly generate text files larger than 4 GB with properties (characteristics) configured for testing. A distinctive feature of the test data is that they often reflect basic properties of real-world problems. The investigation has two stages: in the first, RDD and Data Frames are compared under the standard Apache Spark settings; in the second, experiments with different input test dataset sizes assess the influence of the level of parallelism, the HDFS block size, and the spark.sql.shuffle.partitions parameter in Spark Data Frames. The results substantiate the influence of the spark.sql.shuffle.partitions value on the execution performance of the test task; ranges and trends of change have been found for this parameter. The levels of parallelism that have the greatest influence on execution time have also been determined. It has been shown that for certain input test file sizes the HDFS block size can be left at its default value. The results of the computational experiments are presented in tables and graphs, and they confirm the effectiveness of the suggested changes to the Apache Spark settings compared with the standard ones for different tested file sizes.	ru_RU
dc.language.iso	en	ru_RU
dc.subject	Apache Spark	ru_RU
dc.subject	resilient distributed dataset	ru_RU
dc.subject	Data Frames	ru_RU
dc.subject	HDFS	ru_RU
dc.subject	shuffling	ru_RU
dc.subject	level of parallelism	ru_RU
dc.subject	data processing	ru_RU
dc.subject	data set	ru_RU
dc.subject	application	ru_RU
dc.subject	execution time	ru_RU
dc.title	Experimental research of optimizing the Apache Spark tuning: RDD vs data frames	ru_RU
dc.type	Article	ru_RU
Appears in collections:	Статті (ІКТ)
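
The abstract names three tuning knobs: the level of parallelism, the HDFS block size, and spark.sql.shuffle.partitions. The sketch below is only an illustration of how such settings could be assembled and how the block size relates to the number of input partitions; it is not the authors' experimental setup. The helper names (`estimate_input_partitions`, `build_spark_conf`, `to_submit_args`) are hypothetical, the mapping of "level of parallelism" to `spark.default.parallelism` is an assumption, and the defaults used (128 MiB HDFS block, 200 shuffle partitions) are the stock Hadoop and Spark values rather than figures from the paper.

```python
import math

# Assumed stock defaults (not taken from the paper): 128 MiB HDFS block size
# and 200 for spark.sql.shuffle.partitions.
DEFAULT_HDFS_BLOCK_BYTES = 128 * 1024 * 1024
DEFAULT_SHUFFLE_PARTITIONS = 200


def estimate_input_partitions(file_bytes, block_bytes=DEFAULT_HDFS_BLOCK_BYTES):
    """Rough estimate: one input partition per HDFS block of a splittable text file."""
    if file_bytes <= 0 or block_bytes <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(file_bytes / block_bytes)


def build_spark_conf(shuffle_partitions=DEFAULT_SHUFFLE_PARTITIONS,
                     default_parallelism=None,
                     block_bytes=DEFAULT_HDFS_BLOCK_BYTES):
    """Collect the tuned settings as Spark property strings (hypothetical helper)."""
    conf = {
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        # spark.hadoop.* properties are forwarded to the Hadoop configuration.
        "spark.hadoop.dfs.blocksize": str(block_bytes),
    }
    if default_parallelism is not None:
        # Assumption: "level of parallelism" corresponds to this property.
        conf["spark.default.parallelism"] = str(default_parallelism)
    return conf


def to_submit_args(conf):
    """Turn the settings into `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    four_gib = 4 * 1024 ** 3
    print("estimated input partitions:", estimate_input_partitions(four_gib))
    print(" ".join(to_submit_args(build_spark_conf(shuffle_partitions=64,
                                                   default_parallelism=16))))
```

Under these assumptions, a 4 GiB text file spans 32 blocks of 128 MiB, so it is read as roughly 32 input partitions, while a Data Frame shuffle still produces 200 partitions unless spark.sql.shuffle.partitions is changed.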

Files in this item:

File	Description	Size	Format
paper31.pdf		503.34 kB	Adobe PDF


All resources in the electronic archive are protected by copyright; all rights reserved.