FEATURE SELECTION STRATEGIES: A COMPARATIVE ANALYSIS OF SHAP-XAI AND IMPORTANCE-BASED METHODS IN CUSTOMER CHURN DATA


Parmaksız H., Öztürk Ö.

International Conference on Applied Economics and Finance (ICOAEF XII), Berlin, Germany, 16 - 18 October 2024, pp.316-342, (Full Text)

  • Publication Type: Conference Paper / Full Text
  • City: Berlin
  • Country: Germany
  • Page Numbers: pp.316-342
  • Bilecik Şeyh Edebali University Affiliated: Yes

Abstract

This study compares several machine learning models and feature selection strategies for customer churn prediction. Churn prediction models were developed with widely used tree-based algorithms (XGBoost, CatBoost, Random Forest, Decision Tree, and Extra Trees), and their performance was evaluated. The focus of the study is a comparison between the models' built-in feature importance rankings and feature selection based on Shapley Additive exPlanations (SHAP) values. A hybrid model called SHAP Explainable Artificial Intelligence (SHAP-XAI) was adapted to this study to examine how features selected by SHAP values affect performance. Performance was evaluated with the Area Under the Precision-Recall Curve (AUPRC), the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), the F1 Score, and the Matthews Correlation Coefficient (MCC). The results showed that XGBoost performed best and that the SHAP-XAI hybrid models performed similarly to or better than the traditional models. The hybrid models used fewer features, which increased explainability while optimizing performance. In the feature importance rankings, attributes such as "Complaints," "Status," "Frequency of Use," and "Seconds of Use" stood out. The study demonstrates that SHAP-XAI hybrid models can enhance both the performance and the explainability of machine learning models. The SHAP-XAI hybrid approach is recommended for building more efficient and understandable models, especially when working with large and complex datasets.
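The idea of ranking features by SHAP values and keeping only the top ones can be illustrated with a minimal Python sketch. This is not the authors' SHAP-XAI pipeline. It assumes a hypothetical linear scoring function in place of a trained XGBoost model, computes exact Shapley values by enumerating feature coalitions rather than calling the shap library, and uses a small made-up table with generic feature indices. The names predict, shapley_values, rank_by_mean_abs_shap, and the value k = 2 are all illustrative assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row x.

    A feature that is "absent" from a coalition is replaced by its
    baseline value (an assumption; other conventions exist).
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def rank_by_mean_abs_shap(predict, rows, baseline):
    """Rank feature indices by mean absolute Shapley value over all rows."""
    n = len(baseline)
    totals = [0.0] * n
    for x in rows:
        for i, p in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(p)
    means = [t / len(rows) for t in totals]
    order = sorted(range(n), key=lambda i: means[i], reverse=True)
    return order, means


# Hypothetical scoring function standing in for a trained model.
def predict(z):
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[2] + 0.0 * z[3]


# Made-up toy rows with four generic features (not the paper's data).
rows = [
    [1, 0, 2, 5],
    [0, 1, 4, 1],
    [1, 1, 0, 3],
]
# Baseline = column means of the toy rows.
baseline = [sum(col) / len(rows) for col in zip(*rows)]

order, means = rank_by_mean_abs_shap(predict, rows, baseline)

k = 2  # illustrative number of features to keep
selected = order[:k]
print("ranking:", order)
print("selected:", selected)
```

In this toy case the SHAP-based ranking differs from a ranking by coefficient size alone, because the third feature varies more across rows than the second. A reduced model would then be retrained on the selected columns and compared with the full model on metrics such as AUPRC or MCC. Exact enumeration grows exponentially with the number of features, so it is only practical for very small examples like this one.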