Sparkmach: A distributed data processing system based on automated machine learning for big data

Gusseppe Bravo-Rocca, Piero Torres-Robatty, Jose Fiestas-Iquira

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Scopus citation

Abstract

This work proposes a semi-automated analysis and modeling package for Machine Learning problems. The library's goal is to reduce the number of steps involved in a traditional data science roadmap. To do so, Sparkmach uses Machine Learning techniques to build base models for both classification and regression problems. Building these models covers exploratory data analysis, data preprocessing, feature engineering, and modeling. The project is based on Pymach, a similar library that handles those steps for small and medium-sized datasets (about ten million rows and a few columns). Sparkmach's central task is to scale Pymach to big datasets by using Apache Spark, a distributed engine for large-scale data processing that tackles several data science problems in a cluster environment. Although it is built for distributed execution, Sparkmach can also be used in local environments, where it still benefits from the distributed processing tools.

Original language: English
Title of host publication: Information Management and Big Data - 5th International Conference, SIMBig 2018, Proceedings
Editors: Hugo Alatrista-Salas, Denisse Muñante, Juan Antonio Lossio-Ventura
Publisher: Springer Verlag
Pages: 121-128
Number of pages: 8
ISBN (Print): 9783030116798
State: Published - 2019
Externally published: Yes
Event: 5th International Conference on Information Management and Big Data, SIMBig 2018 - Lima, Peru
Duration: 3 Sep 2018 to 5 Sep 2018

Publication series

Name: Communications in Computer and Information Science
Volume: 898
ISSN (Print): 1865-0929

Conference

Conference: 5th International Conference on Information Management and Big Data, SIMBig 2018
Country/Territory: Peru
City: Lima
Period: 3/09/18 to 5/09/18

Keywords

  • Big data
  • Data engineering
  • Data mining
  • Data science
  • Semi-automated machine learning
  • Statistics
