TY - GEN
T1 - Sparkmach
T2 - 5th International Conference on Information Management and Big Data, SIMBig 2018
AU - Bravo-Rocca, Gusseppe
AU - Torres-Robatty, Piero
AU - Fiestas-Iquira, Jose
N1 - Publisher Copyright: © 2019, Springer Nature Switzerland AG.
PY - 2019
Y1 - 2019
AB - This work proposes a semi-automated analysis and modeling package for machine learning problems. The library's goal is to reduce the number of steps involved in a traditional data science roadmap. To do so, Sparkmach uses machine learning techniques to build base models for both classification and regression problems. Building these models involves exploratory data analysis, data preprocessing, feature engineering, and modeling. The project is based on Pymach, a similar library that handles those steps for small and medium-sized datasets (about ten million rows and a few columns). Sparkmach's central task is to scale Pymach to big datasets by using Apache Spark, a distributed engine for large-scale data processing that tackles several data science problems in a cluster environment. Despite its distributed nature, Sparkmach can also be used in local environments, where it still benefits from the distributed processing tools.
KW - Big data
KW - Data engineering
KW - Data mining
KW - Data science
KW - Semi-automated machine learning
KW - Statistics
UR - http://www.scopus.com/inward/record.url?scp=85063475416&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-11680-4_13
DO - 10.1007/978-3-030-11680-4_13
M3 - Conference contribution
AN - SCOPUS:85063475416
SN - 9783030116798
T3 - Communications in Computer and Information Science
SP - 121
EP - 128
BT - Information Management and Big Data - 5th International Conference, SIMBig 2018, Proceedings
A2 - Alatrista-Salas, Hugo
A2 - Muñante, Denisse
A2 - Lossio-Ventura, Juan Antonio
PB - Springer Verlag
Y2 - 3 September 2018 through 5 September 2018
ER -