Heart Disease Prediction Using Ensemble Methods, Genetic Algorithms, and Data Augmentation: A Preliminary Study

Deepali Yewale; Swati Patil; Archana Rajesh Date; Aziz Nanthaamornphong

doi:10.18196/jrc.v6i3.25144

Authors

Deepali Yewale, AISSMS Institute of Information Technology, https://orcid.org/0000-0001-8084-8279
Swati Patil, Vishwakarma Institute of Technology
Archana Rajesh Date, HSBPVTS Faculty of Engineering
Aziz Nanthaamornphong, Prince of Songkla University

Keywords:

Heart Disease, Ensemble Classifier, Genetic Algorithm, Data Balancing, Outlier Removal, Random Noise

Abstract

Heart disease (HD) accounted for 1 in 5 fatalities in 2022, which makes affordable and accurate diagnosis a pressing need. Traditional prediction methods are accurate but expensive, creating demand for more efficient technologies. Machine learning (ML) is one of the most popular methods that researchers use to forecast diseases. This work aims to improve HD prediction accuracy with ensemble approaches, specifically Random Forest (RF), XGBoost, Voting, and Stacking, which combine multiple models to capture complex patterns. Genetic algorithms (GA) are used to prioritize features. Combining data balancing, outlier removal, and data augmentation yields a model whose performance is comparable to state-of-the-art research. Random oversampling addresses class imbalance, and an isolation forest identifies anomalies. After anomaly removal, random noise is added to increase the dataset size and improve model performance. Cross-validation and robustness checks were performed on both the augmented and non-augmented datasets to confirm that adding random noise did not excessively affect generalizability or cause overfitting. The proposed model's effectiveness is evaluated using several performance metrics. It achieves 99.36% accuracy, 98% sensitivity, 100% specificity, 100% PPV, 97% NPV, an F-score of 0.99, and an AUC of 1, which shows promise as a cost-effective, accurate, and efficient diagnostic tool for heart disease. The model's short training time and high performance suggest potential for practical use in clinical settings as a reliable and affordable option for early heart disease detection.
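As an illustration of three steps named in the abstract (random oversampling, random-noise augmentation, and the reported evaluation metrics), the following is a minimal sketch using only the Python standard library. It is not the authors' implementation. The function names, the toy data, and the noise scale `sigma` are hypothetical, and the isolation forest, genetic algorithm, and ensemble classifiers are deliberately left out.

```python
import random
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(list(rng.choice(pool)))
            out_labels.append(cls)
    return out_rows, out_labels


def augment_with_noise(rows, labels, rng, sigma=0.01):
    """Append one Gaussian-noised copy of every row (sigma is an assumed scale)."""
    noisy = [[v + rng.gauss(0.0, sigma) for v in r] for r in rows]
    return list(rows) + noisy, list(labels) + list(labels)


def _ratio(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for labels where 1 = disease, 0 = no disease."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": _ratio(tn, tn + fn),
        "f_score": _ratio(2 * ppv * sensitivity, ppv + sensitivity),
    }


# Toy data (hypothetical): three negative rows and one positive row.
rng = random.Random(0)
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.9, 0.8]]
y = [0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y, rng)           # balance the classes
X_aug, y_aug = augment_with_noise(X_bal, y_bal, rng)  # double the dataset size

# Metrics for a made-up set of predictions.
metrics = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 0])
print(Counter(y_bal), len(X_aug), metrics)
```

In a fuller pipeline, balancing and augmentation would normally be applied to the training split only, so that duplicated or noised copies of a record do not appear in the evaluation data.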
