A SHAP-DRIVEN FEATURE SELECTION APPROACH FOR PREDICTIVE MODELING OF ACUTE ARTERIAL DISEASE
Abstract
Limited data availability is a long-standing challenge in developing predictive models, particularly in the medical domain, where technological constraints and strict regulatory requirements restrict the large-scale data acquisition needed to train robust artificial intelligence models. Models trained under these conditions are prone to overfitting and may generalize poorly. Several studies have addressed this issue with dimensionality reduction techniques such as Principal Component Analysis (PCA); however, such approaches are less effective when the feature space is already relatively small and feature interpretability is essential, because the derived components replace the original features with linear combinations of them. To address this problem, this study proposes SHapley Additive exPlanations (SHAP) as a feature selection method for classification tasks, using a medical dataset related to acute arterial disease as the study context. SHAP-based feature selection suits data-constrained scenarios because it quantifies each feature's individual contribution to the model's predictions, so the selected subset retains both interpretability and predictive relevance. In this study, the 11 features whose SHAP values exceeded the baseline were retained, and classification was then performed with a random forest algorithm. The resulting SHAP-Random Forest (SHAP-RF) model achieved a ROC-AUC of 0.96 and an AUPRC of 0.90 on the coronary artery disease dataset, outperforming conventional feature selection approaches, including PCA, correlation-based feature selection, and domain-expert-driven feature selection.
Overall, the findings indicate that SHAP-based feature selection improves both the accuracy and the efficiency of random forest classifiers, making it an effective approach to feature selection in medical predictive modeling, particularly for coronary artery disease.
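The selection rule summarized above (rank features by their Shapley attributions and keep those above a baseline) can be illustrated with a minimal, standard-library-only sketch. It is not the study's pipeline: exact Shapley values are computed by subset enumeration rather than with the SHAP library, a hypothetical scoring function `toy_model` stands in for the random forest, the feature names and data are invented for illustration, and the baseline is assumed to be the mean of the per-feature mean absolute Shapley values, since the abstract does not define it.

```python
import math
from itertools import combinations


def shapley_values(model, x, background):
    """Exact Shapley values for one row x; absent features are drawn from background rows."""
    d = len(x)

    def value(subset):
        # Expected model output when only the features in `subset` take x's values.
        total = 0.0
        for b in background:
            mixed = [x[j] if j in subset else b[j] for j in range(d)]
            total += model(mixed)
        return total / len(background)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Shapley weight |S|! (d - |S| - 1)! / d! for coalitions of this size.
            weight = math.factorial(size) * math.factorial(d - size - 1) / math.factorial(d)
            for combo in combinations(others, size):
                s = frozenset(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def select_features(model, rows, background, names):
    """Rank features by mean |Shapley value| and keep those above the assumed baseline."""
    per_row = [shapley_values(model, x, background) for x in rows]
    importance = [sum(abs(p[i]) for p in per_row) / len(per_row) for i in range(len(names))]
    threshold = sum(importance) / len(importance)  # assumption: baseline = mean importance
    selected = [n for n, imp in zip(names, importance) if imp > threshold]
    return importance, selected


def toy_model(row):
    """Hypothetical risk score standing in for a trained classifier; ignores `noise`."""
    age, pressure, cholesterol, noise = row
    z = 0.08 * (age - 50) + 0.05 * (pressure - 120) + 0.002 * (cholesterol - 200)
    return 1 / (1 + math.exp(-z))


# Hypothetical feature names and records, for illustration only.
names = ["age", "pressure", "cholesterol", "noise"]
rows = [
    [45, 118, 190, 0.3],
    [62, 140, 240, 0.9],
    [55, 130, 210, 0.1],
    [70, 150, 260, 0.5],
]

importance, selected = select_features(toy_model, rows, rows, names)
print(dict(zip(names, importance)))
print(selected)
```

Enumerating all feature subsets is exponential in the number of features, so this form is practical only for a handful of features; with a tree ensemble, a tree-specific estimator would normally replace `shapley_values`, while the ranking and thresholding step would stay the same.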

