TY - JOUR
T1 - Feature selection influence on machine-learning-based classifiers of the shear failure mode of PC girders
AU - Tovar, Jhon
AU - Bedriñana, Luis Alberto
AU - Málaga-Chuquitaype, Christian
N1 - Publisher Copyright: © 2025 Institution of Structural Engineers
PY - 2025/10
Y1 - 2025/10
N2 - The shear failure of prestressed concrete (PC) girders is a complex problem due to the numerous influencing parameters. The shear failure mode is directly related to the shear capacity of PC girders, yet conventional shear models cannot directly predict it. Recently, Machine Learning (ML) methods have been applied to such problems in Structural Engineering; however, there is no clear consensus on the optimal quantity and type of input features required to develop an efficient ML classifier. This paper examines the influence of different feature selection techniques on the performance of ML classifiers in predicting the shear failure mode of PC girders. In addition, this paper presents and discusses a framework for developing an explainable data-driven model for the shear failure mode classification of PC girders through optimal feature selection. To this end, a comprehensive dataset of 668 experimental tests of PC girders is assembled. Wrapped (Forward, Backward, Recursive Feature Elimination, and Exhaustive Selection) and Filter (ANOVA F-test and Correlation Clustering) methods are applied to identify the most relevant subset of input features. The selected features from each method are then used to train different ML models (based on Random Forest, XGBoost, and AdaBoost) to obtain an efficient ML classifier with an optimal number of input features. A classifier trained with the full set of available features is also used for comparison. Most of the evaluated methods required around 5–10 features to maintain adequate performance. Moreover, all the ML models trained with the optimal number and combinations of features, as produced by the different feature selection methods, achieved higher performance (F1_score above 0.83) than the classifier trained with the full set of features (F1_score = 0.82). However, Filter methods showed better performance than Wrapped methods, with less computational expense.
It was also noted that the feature selection methods that provided the best performance were those that not only reduced irrelevant features but also chose features that represent important aspects of the problem. Among the evaluated models, Correlation Clustering (CC) provided the most accurate ML classifier (Accuracy = 0.851 and F1_score = 0.851) using just 8 input features (around 50 % of the total available features). Lastly, an explainability analysis of the selected ML model (based on the CC method) highlighted the importance of identifying the most important variables before training ML classifiers. This work provides a reference for engineers to select, compare, and validate feature selection methods for classification problems in Structural Engineering.
KW - Ensemble learning
KW - Feature selection
KW - Machine learning
KW - Prestressed beams
KW - Prestressed concrete
KW - Shear failure mode
UR - https://www.scopus.com/pages/publications/105011750438
U2 - 10.1016/j.istruc.2025.109746
DO - 10.1016/j.istruc.2025.109746
M3 - Article
AN - SCOPUS:105011750438
SN - 2352-0124
VL - 80
JO - Structures
JF - Structures
M1 - 109746
ER -