Analysis of Diabetes Classification Performance Improvement Using Ensemble Bagging and K-Fold
Abstract
Diabetes mellitus represents a long-term metabolic disorder whose global incidence continues to rise, making precise early identification essential to minimize severe complications. Machine learning techniques have been extensively utilized for diabetes classification; however, single-model approaches often suffer from performance constraints, such as susceptibility to overfitting and high variability in prediction outcomes. To address these challenges, this research introduces a bagging-based ensemble learning strategy integrated with K-Fold Cross-Validation to enhance both predictive accuracy and model robustness. The study employs the Pima Indians Diabetes Dataset, which contains 768 patient records described by eight clinical features and one outcome variable. Eight classification methods—Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Naïve Bayes, Gradient Boosting, and XGBoost—were assessed individually and within the proposed ensemble framework. Model effectiveness was measured using accuracy, precision, recall, and F1-score derived from the confusion matrix. The findings indicate that the ensemble bagging approach generally strengthens model stability and yields improvements in accuracy and precision across most algorithms. Notably, K-Nearest Neighbors and XGBoost demonstrated the most stable gains following ensemble integration. Nevertheless, gains in precision were frequently accompanied by a reduction in recall, reflecting a precision–recall trade-off in identifying positive cases. In summary, the integration of bagging and K-Fold Cross-Validation provides a more resilient and dependable classification model, offering strong potential for supporting clinical decision-making in early diabetes detection.





