TY - GEN
T1 - Prediction of Lung Cancer Risk Through Machine Learning Based on Lifestyle Questionnaire Data
AU - Chavez-Caceres, Samir
AU - Hincho-Jove, Angel
AU - Castro-Gutierrez, Eveling
AU - Soriano-Vargas, Aurea
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Lung cancer remains the leading cause of cancer-related deaths worldwide, and early detection is crucial for improving treatment outcomes and survival rates. Hence, the pursuit of alternative methods for early identification of lung cancer should be a significant consideration, especially in low-resource settings. This work presents an alternative for lung cancer risk recognition based on machine learning. We leveraged two publicly available datasets of lifestyle questionnaires from Kaggle, which were collected, processed, and analyzed to develop predictive models. The data was balanced using the SMOTE technique, and seven classifiers were evaluated: Support Vector Classifier, Logistic Regression, Decision Tree, Random Forest, XGBoost, Stochastic Gradient Descent, and Artificial Neural Networks. The best results were obtained with XGBoost, optimized using the GridSearchCV method. We allocated 70% of the data for training and reserved the remaining 30% for testing. The results obtained by the model for the first dataset included accuracy and F1 score of 96.50%, with precision and sensitivity of 96.51%. For the second dataset, an accuracy of 95.83%, F1 score of 95.83%, precision of 96.27%, and sensitivity of 95.83% were achieved. Moreover, using LIME for local interpretability, we were able to identify the primary influence of unhealthy behaviors such as alcohol consumption, smoking, and obesity on the model's predictions, enhancing our understanding of the factors driving the risk of lung cancer in these datasets.
AB - Lung cancer remains the leading cause of cancer-related deaths worldwide, and early detection is crucial for improving treatment outcomes and survival rates. Hence, the pursuit of alternative methods for early identification of lung cancer should be a significant consideration, especially in low-resource settings. This work presents an alternative for lung cancer risk recognition based on machine learning. We leveraged two publicly available datasets of lifestyle questionnaires from Kaggle, which were collected, processed, and analyzed to develop predictive models. The data was balanced using the SMOTE technique, and seven classifiers were evaluated: Support Vector Classifier, Logistic Regression, Decision Tree, Random Forest, XGBoost, Stochastic Gradient Descent, and Artificial Neural Networks. The best results were obtained with XGBoost, optimized using the GridSearchCV method. We allocated 70% of the data for training and reserved the remaining 30% for testing. The results obtained by the model for the first dataset included accuracy and F1 score of 96.50%, with precision and sensitivity of 96.51%. For the second dataset, an accuracy of 95.83%, F1 score of 95.83%, precision of 96.27%, and sensitivity of 95.83% were achieved. Moreover, using LIME for local interpretability, we were able to identify the primary influence of unhealthy behaviors such as alcohol consumption, smoking, and obesity on the model's predictions, enhancing our understanding of the factors driving the risk of lung cancer in these datasets.
KW - classification
KW - explainability
KW - exploratory data analysis
KW - lifestyle
KW - lung cancer
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85213321837&partnerID=8YFLogxK
U2 - 10.1109/ICA-ACCA62622.2024.10766818
DO - 10.1109/ICA-ACCA62622.2024.10766818
M3 - Conference contribution
AN - SCOPUS:85213321837
T3 - 2024 IEEE International Conference on Automation/26th Congress of the Chilean Association of Automatic Control, ICA-ACCA 2024
BT - 2024 IEEE International Conference on Automation/26th Congress of the Chilean Association of Automatic Control, ICA-ACCA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Automation/26th Congress of the Chilean Association of Automatic Control, ICA-ACCA 2024
Y2 - 20 October 2024 through 23 October 2024
ER -