World Scientific

PERIODONTITIS RISK FACTOR ANALYSIS AND MACHINE LEARNING PREDICTION MODEL CONSTRUCTION BASED ON MULTIDIMENSIONAL DATA

    https://doi.org/10.1142/S0219519424400864

    Abstract

    This study leveraged a large-scale dataset from NHANES 2013–2014 to gain insights into periodontitis pathogenesis and develop predictive tools. After cleaning and preprocessing the data, 15 crucial factors were identified from over 100 potential risk factors and used as input features for four machine learning algorithms: support vector machine (SVM), random forest (RF), neural network and XGBoost. The models were evaluated for periodontitis prediction performance through internal validation metrics: specificity, accuracy, precision, recall and area under the curve (AUC). Notably, education level, household income and smoking status emerged as key risk factors, aligning with medical literature. While SVM and RF excelled in specificity and accuracy, the neural network achieved the highest precision and recall for periodontitis patients. XGBoost offered a balanced performance, making it a versatile choice. The feature importance analysis underscored the profound influence of socioeconomic factors and unhealthy lifestyle habits on periodontal health. This study contributes novel approaches and insights for periodontitis prevention and treatment, demonstrating clinical and societal significance. Future research should focus on optimizing and externally validating the model to enhance its generalizability and accuracy.

    1. Introduction

    Periodontitis, a prevalent chronic oral inflammation, has effects that reach far beyond oral health. Triggered by dental plaque, a sticky biofilm enriched with bacteria, it primarily attacks the gums and periodontal tissues, leading to tooth loosening and even loss, and significantly impairing patients’ chewing function and facial aesthetics. Moreover, through the circulatory system, it is linked with systemic diseases such as cardiovascular disorders, diabetes and adverse pregnancy outcomes, forming a complex pathological network. Recent research has shown that periodontitis patients face a notably elevated risk of developing these chronic conditions, which threatens their physical health and imposes mental and financial burdens.

    According to the World Health Organization, over half of the global adult population suffers from oral diseases, with periodontal issues standing out as a prominent concern.1 The Fourth National Oral Health Epidemiology Survey Report points out that the periodontal health of adults in China is worrying: only 9.1% of adults have healthy periodontal tissues, which implies a large population of periodontal disease patients. In China, the reported incidence of periodontitis ranges from 7% to 40%, which is high compared with developed countries. These data highlight the urgency of periodontitis prevention and control, calling for more efficient and precise means of screening and intervention.2

    Currently, periodontitis diagnosis is limited by subjectivity and low sensitivity, especially in resource-scarce areas. To enhance prevention and control, we aimed to develop an efficient, accurate periodontitis prediction model using SVM, RF, XGBoost and multilayer perceptron (MLP) algorithms. These algorithms offer complementary strengths in handling diverse data challenges. After rigorous data preprocessing and feature engineering, we trained and optimized the models to achieve the best possible prediction performance. Our goal is to provide clinicians with a reliable diagnostic tool, advance machine learning in dentistry and contribute to periodontitis prevention. Looking ahead, technological advancements will likely further improve periodontitis management, enhancing patient quality of life.

    2. Methodology

    2.1. Study design and data preprocessing

    The NHANES dataset between 2013 and 2014 was used in this study for constructing a periodontitis prediction model based on data segmentation.

    The National Health and Nutrition Examination Survey (NHANES) is a long-term, ongoing national health survey conducted by the National Center for Health Statistics (NCHS), part of the Centers for Disease Control and Prevention (CDC). Since its inception in the early 1960s, NHANES has become a key source of data for assessing the health status of the U.S. population, its nutritional status and its risk factors for disease, and has had a profound impact on the development and implementation of public health policy.

    NHANES is unique in its comprehensiveness and breadth. The survey not only collects participants’ physical examination data (e.g., height, weight, blood pressure, etc.), but also covers detailed nutritional intake information, laboratory test results (e.g., blood and urine sample analysis), health questionnaires and in-depth assessments of specific health issues. This multidimensional approach to data collection allows NHANES to provide a comprehensive view of the health of the U.S. population, providing a valuable reference point for researchers, policymakers and public health experts.

    In the field of public health research, NHANES data are widely used in epidemiological studies, nutritional assessments, chronic disease prevention and control and health inequality analysis. Through NHANES, researchers can track changes in health trends, identify the distributional characteristics of health risk factors, and assess the effectiveness of public health interventions, thus providing a scientific basis for formulating more accurate and effective health policies.

    The dataset covered comprehensive information on 10,175 subjects, including more than 100 potential risk factors for periodontitis. These risk factors were widely distributed across multiple dimensions such as data information, basic personal information, lifestyle, education, oral health status, disease and treatment history, health checkups and screenings, blood test results and nutritional intake. To ensure data quality, we first eliminated cases with incomplete information, and then standardized the data and finally selected 1201 cases with complete data (n=1201) as the modeling basis. Figure 1 visualizes the workflow of the entire study. Table 1, on the other hand, summarizes in detail the demographic characteristics and overview of the clinical information of the selected patients.

    Fig. 1. Overall workflow table.

    Table 1. Demographic and clinical characteristics of the patients.

    Feature                                   All (n=1201)    p-value
    Age (years), mean (standard deviation)    55.09 (14.9)    <0.001
    Gender, n (%)                                             0.0549
    • Male                                    620 (51.6%)
    • Female                                  581 (48.4%)
    Household income level, n (%)                             <0.001
    • Special hardship                        255 (21.2%)
    • Poverty                                 331 (27.6%)
    • Normal                                  161 (13.4%)
    • Well-off                                145 (12.0%)
    • Affluent                                309 (25.7%)
    Educational level, n (%)                                  <0.001
    • Primary                                 90 (7.5%)
    • Secondary                               446 (37.1%)
    • University/College                      665 (55.4%)
    Periodontitis, n (%)                      185 (15%)       <0.001

    2.2. Statistical analysis and feature selection

    Based on data preprocessing, we performed an in-depth statistical analysis to identify significant risk factors for periodontitis.3 First, features that were not significantly associated with periodontitis (e.g., patient unique identifiers) and factors that affected the stability of the model due to too many missing values were excluded, and this step eliminated more than 60 potential risks, leaving 38 as candidate features.4

    Previous studies have reported a significant correlation between low HDL and high LDL cholesterol levels and periodontitis in women.5 In addition, patients with periodontitis exhibit a significantly higher risk of developing BD.6 Compared with subjects without periodontitis, patients with periodontitis had a higher prevalence of diabetes mellitus, hyperlipidemia,7 hypertension, ischemic heart disease, stroke, head injury, major depressive disorder, chronic obstructive pulmonary disease (COPD),8 liver disease9 and asthma10 (p<0.001). In a survey of adults over 30 years of age, alcohol consumption was associated with an increased prevalence of periodontitis, especially severe periodontitis, although the prevalence did not increase in those who consumed alcohol on average once a week compared with those who did not consume alcohol.11,12

    Subsequently, we conducted an initial correlation analysis to screen the pool of candidate features for factors significantly associated with periodontitis. The Pearson correlation coefficient was used to assess the relevance of continuous variables, and the chi-square test was used to determine the statistical significance of categorical variables with respect to the disease. Based on the outcomes, the candidate features were ranked, and redundant, highly correlated features were eliminated to prevent multicollinearity.13,14
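
    The two screening statistics named above can be sketched without external libraries. The example below is only an illustration under stated assumptions: the function names, the fasting-glucose values and the 2x2 smoking table are all made up and are not NHANES data, and the p-value helper applies only to tables with one degree of freedom.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def chi_square_statistic(table):
    """Pearson chi-square statistic (no continuity correction) and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi2_p_value_1dof(stat):
    """Upper-tail p-value of a chi-square statistic; valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical continuous feature (fasting glucose) against a 0/1 periodontitis label.
fbg = [90.0, 95.0, 110.0, 130.0, 100.0, 150.0]
perio = [0, 0, 1, 1, 0, 1]
print("Pearson r:", round(pearson_r(fbg, perio), 3))

# Hypothetical 2x2 table: rows = smoker / non-smoker, columns = periodontitis yes / no.
stat, dof = chi_square_statistic([[30, 70], [20, 180]])
print("chi-square:", round(stat, 2), "dof:", dof, "p:", chi2_p_value_1dof(stat))
```

    With a binary outcome, the Pearson coefficient computed this way is the point-biserial correlation, which is one plausible reading of how a continuous feature was related to periodontitis status.
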

    Ultimately, leveraging the feature importance scores derived from tree-based models, such as random forest (RF) and XGBoost, in conjunction with the aforementioned statistical methods, we pinpointed 15 pivotal features that were deemed most contributory to periodontitis prediction.

    2.3. Data-driven model development and optimization

    To evaluate the performance of different machine learning algorithms in periodontitis prediction, we randomly divided the processed dataset into a training set (70%) and a test set (30%) for model training and performance evaluation, respectively. Based on Pearson’s coefficient ranking, we selected the top 15 risk factors as predictor variables and periodontitis status (PERIO) as the target variable. Four mainstream machine learning algorithms, namely support vector machine (SVM), RF, XGBoost and multilayer perceptron (MLP) neural network, were compared on the same dataset to find the optimal prediction model.15

    In the model training phase, we adopted stratified five-fold cross-validation combined with grid search to optimize the hyperparameters of each model. Specifically, the optimal parameter combination for each model was found through five-fold cross-validation, and the model was then retrained on the entire training set to improve its generalization ability.16
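
    The tuning procedure described here can be sketched in plain Python. This is a minimal illustration, not the authors' code: the helper names, the toy one-feature dataset and the stand-in centroid classifier with a `shift` parameter are all hypothetical. In practice, library routines such as scikit-learn's `StratifiedKFold` and `GridSearchCV` provide the same functionality.

```python
import itertools
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return k (train_idx, val_idx) pairs that preserve the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    splits = []
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(idx for j in range(k) if j != i for idx in folds[j])
        splits.append((train_idx, val_idx))
    return splits


def grid_search(param_grid, splits, cv_score):
    """Try every parameter combination; return the best one and its mean score.

    cv_score(params, train_idx, val_idx) is supplied by the caller: it fits a
    model on train_idx and returns its score on val_idx.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [cv_score(params, train, val) for train, val in splits]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy, made-up data: one feature, 40 negatives and 10 positives.
features = [0.1 * i for i in range(40)] + [3.0 + 0.2 * i for i in range(10)]
labels = [0] * 40 + [1] * 10


def centroid_accuracy(params, train_idx, val_idx):
    """Hypothetical stand-in model: threshold midway between class means, plus a shift."""
    pos = [features[i] for i in train_idx if labels[i] == 1]
    neg = [features[i] for i in train_idx if labels[i] == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2 + params["shift"]
    hits = sum((features[i] > threshold) == (labels[i] == 1) for i in val_idx)
    return hits / len(val_idx)


splits = stratified_kfold(labels, k=5, seed=0)
best_params, best_score = grid_search({"shift": [-0.5, 0.0, 0.5]}, splits, centroid_accuracy)
print(best_params, round(best_score, 3))
```

    Stratification matters for this dataset because only about 15% of cases are positive (Table 1); an unstratified split could leave a validation fold with very few periodontitis cases. After the search, the chosen parameters would be used to refit the model on the whole training set.
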

    2.3.1. SVM

    SVM is a widely used supervised learning model for classification and regression tasks. Its core idea lies in mapping data from a low-dimensional space to a high-dimensional space through the selection of an appropriate kernel function, enabling the identification of a hyperplane that effectively distinguishes between different classes. SVM exhibits robustness in handling high-dimensional data, particularly suited for complex classification problems. Given the prediction of periodontitis involves multiple variables and potential nonlinear relationships, SVM is capable of transforming these intricate relationships into linear problems in high-dimensional space, thereby better capturing nonlinear patterns within the data.

    The choice of SVM in this study stems from its remarkable performance in dealing with small sample sizes and high-dimensional features. Furthermore, it flexibly tackles nonlinear issues by adjusting the kernel function, such as the radial basis function (RBF) kernel. Additionally, SVM boasts a solid theoretical foundation with a globally optimal solution to its optimization problem, rendering it exceptional in solving classification problems within the medical domain.

    Data normalization was performed using the StandardScaler, which transforms the data to have a mean of 0 and a standard deviation of 1. This choice was made because SVM is sensitive to the scale of input data. Features with different scales could distort the decision boundary of the SVM, making it less effective. Normalization ensures that all features contribute equally to the model, enhancing its ability to capture underlying patterns and avoid bias toward features with larger scales.

    We also ensured the proper handling of missing values through imputation techniques, replacing missing numerical values with the median of respective features. Additionally, categorical variables, if present, were handled through one-hot encoding to convert them into a numerical format compatible with the model.
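
    The three preprocessing steps described above (median imputation, standardization and one-hot encoding) can be sketched as follows. This is a dependency-free illustration under stated assumptions: the function names and sample values are made up, and fitting the statistics on the training split only is our assumption of common practice rather than something the paper states. Like `StandardScaler`, the sketch uses the population standard deviation.

```python
from statistics import fmean, median, pstdev


def fit_numeric(train_column):
    """Learn the imputation median and scaling statistics; None marks a missing value."""
    observed = [v for v in train_column if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in train_column]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    return {"fill": fill, "mean": fmean(filled), "std": pstdev(filled) or 1.0}


def transform_numeric(column, params):
    """Impute with the training median, then scale to zero mean and unit variance."""
    filled = [params["fill"] if v is None else v for v in column]
    return [(v - params["mean"]) / params["std"] for v in filled]


def one_hot(column, categories):
    """Encode each value as a 0/1 vector with one slot per known category."""
    return [[1.0 if value == cat else 0.0 for cat in categories] for value in column]


# Hypothetical fasting-glucose values with one missing entry.
train_fbg = [90.0, None, 110.0, 130.0]
params = fit_numeric(train_fbg)
print(transform_numeric(train_fbg, params))

# New data reuses the statistics learned from the training column.
print(transform_numeric([100.0, None], params))

# Category labels taken from Table 1.
levels = ["Primary", "Secondary", "University/College"]
print(one_hot(["Primary", "University/College"], levels))
```
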

    The grid search selected C=10 and gamma=1 as the best parameter combination. The relatively large C reflects a lower tolerance for misclassification and a preference for reducing the training error, while the gamma value makes the model pay closer attention to each training sample. Together, these settings can produce a model that performs well on the training data but risks overfitting. The flowchart of the algorithm is shown in Fig. 2.

    Fig. 2. SVM algorithm flowchart.

    2.3.2. RF

    RF is an ensemble learning method that improves the accuracy and stability of the model by constructing multiple decision trees and combining their predictions.17 Each decision tree is trained on a bootstrap sample of the data using randomly selected subsets of features, and the final prediction is obtained through voting or averaging the results of all decision trees. This ensemble method effectively mitigates the risk of overfitting in individual decision trees, enhancing the model’s generalization ability.18

    The rationale behind selecting RF in this study lies in its proficiency in handling high-dimensional data and intricate relationships among features. RF captures nonlinear relationships within data by constructing multiple decision trees and reduces model variance through ensemble learning. Particularly when dealing with datasets like periodontitis prediction, which involve numerous features, RF’s feature selection and evaluation mechanisms efficiently sieve out the most predictive features, thereby boosting the model’s prediction accuracy.

    After the grid search, the best model contained 200 trees, each with a maximum depth of 20. More trees reduce the variance of the model and improve its stability, but they also increase the computational cost; a larger depth allows each tree to capture more features of the data and improves the model’s fitting ability, but it may also lead to overfitting. The flowchart of the algorithm is shown in Fig. 3.

    Fig. 3. RF algorithm flowchart.

    2.3.3. XGBoost

    XGBoost is a boosted tree model based on gradient boosting, an iterative training method that improves predictive ability by gradually adding base learners (typically decision trees), with each new tree attempting to correct the errors of the trees before it. XGBoost also controls the complexity of the model by adding regularization terms to prevent overfitting.19

    Selecting XGBoost as the machine learning algorithm for this study stems from several pivotal reasons:

    Efficiency and Accuracy: XGBoost is renowned for its swift training speed and exceptional predictive performance. Its capability to handle large-scale datasets and its outstanding performance in numerous machine learning competitions make it particularly suitable for handling datasets involving numerous features, such as periodontitis prediction.

    Regularization Capabilities: XGBoost’s regularization mechanisms (L1 and L2 regularization) effectively control model complexity, guarding against overfitting. For medical prediction problems like periodontitis, regularization enhances the model’s generalization ability on unseen data, thereby improving prediction reliability.

    Feature Importance Evaluation: XGBoost offers a feature importance evaluation function, invaluable for understanding the contribution of different features to prediction outcomes. In this study, we can leverage this feature to analyze which factors are most crucial for periodontitis prediction, thereby providing deeper medical insights.

    The best model we found contains 100 trees, each with a maximum depth of 3, balancing the complexity and computational cost of the model, and also reducing the risk of model overfitting. The flowchart of the algorithm is shown in Fig. 4.

    Fig. 4. XGBoost algorithm flowchart.

    2.3.4. MLP

    A multilayer perceptron is a feedforward artificial neural network that includes an input layer, an output layer and one or more hidden layers. Each layer consists of interconnected nodes (neurons), each of which receives inputs and generates outputs. A simple illustration is shown in Fig. 5.

    Fig. 5. Illustration of MLP neural network layering.

    The core concepts of MLP include forward propagation, activation function, backpropagation and regularization. The input layer receives the input features and passes them to the hidden layer through weighting and biasing, and the neurons in each hidden layer compute the nonlinear transform of the weighted sum (through the activation function) and pass the result to the next layer. The activation function introduces a nonlinear transformation that allows the neural network to learn and represent complex patterns. The backpropagation algorithm updates the weights and biases to minimize the prediction error by calculating the gradient of the loss function, which is computed layer-by-layer using the chain rule, and updating the parameters through an optimization algorithm. Also, neural networks can prevent overfitting and improve the generalization ability of the model through regularization techniques.20

    After hyperparameter optimization with grid search, the best model had a single hidden layer of 100 neurons, used the ReLU activation function, and was trained with the Adam optimization algorithm, which combines the advantages of momentum and adaptive learning rates to speed up convergence. We set the maximum number of iterations to 1500 to ensure that the model converged sufficiently during training. The flowchart of the algorithm is shown in Fig. 6.

    Fig. 6. MLP algorithm flowchart.

    2.4. Model evaluation

    The predictions output from the model can be categorized into four categories: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). These classification indices directly reflect the accuracy of the model in identifying periodontitis patients (positive) and nonperiodontitis patients (negative). Based on these classification results, we calculated the following key performance metrics to fully assess the predictive ability of the model.

    Sensitivity: Also known as the true positive rate (TPR) or recall, it measures the ability of the model to correctly identify patients with periodontitis.

    $$\text{Sensitivity} = \frac{TP}{TP + FN}. \tag{1}$$

    Specificity: A measure of the model’s ability to correctly identify nonperiodontitis patients.

    $$\text{Specificity} = \frac{TN}{TN + FP}. \tag{2}$$

    Precision: The proportion of samples predicted by the model to be periodontitis patients that actually are periodontitis patients.

    $$\text{Precision} = \frac{TP}{TP + FP}. \tag{3}$$

    Accuracy: A measure of the overall predictive accuracy of the model over all samples.

    $$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}. \tag{4}$$
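
    Equations (1) to (4) translate directly into code. The sketch below is illustrative only: the confusion-matrix counts are made up (the paper reports no confusion matrices), and returning 0 when a denominator is zero is a convention we assume for the case where a model never predicts the positive class.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Eqs. (1)-(4) from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),               # Eq. (1)
        "specificity": ratio(tn, tn + fp),               # Eq. (2)
        "precision": ratio(tp, tp + fp),                 # Eq. (3)
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),   # Eq. (4)
    }


# Hypothetical counts for a test set of 361 cases with 56 positives.
metrics = classification_metrics(tp=18, tn=280, fp=25, fn=38)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```
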
    In addition, we plotted ROC curves using these predictions and calculated the area under the curve (AUC). The AUC is a comprehensive metric for evaluating the performance of a classification model, and the closer its value is to 1, the better the predictive performance of the model.
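
    The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting as one half. The sketch below uses that rank-based definition on made-up scores; the function name and data are illustrative assumptions, not the authors' implementation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

    Because the AUC is computed from continuous scores rather than hard class labels, a model can have an AUC above 0.5 even if, at its default decision threshold, it labels every case as negative.
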

    By analyzing and comparing the above performance metrics, we can get a comprehensive understanding of the performance of different models in the task of periodontitis prediction. In particular, by comparing the differences in sensitivity, specificity, precision and accuracy of different models, we can identify which models are more advantageous in specific application scenarios. In addition, the high and low AUC values provide us with a visual assessment of the overall performance of the models.

    3. Results

    3.1. Correlation analysis

    In the course of an in-depth study of the pathogenesis of periodontitis, we successfully extracted a series of risk factors significantly associated with periodontitis and conducted an exhaustive correlation analysis. The top 15 risk factors with the highest correlations are listed in Table 2. These data not only reveal the close relationship between each factor and periodontitis but also provide an important basis for feature selection in our subsequent modeling work.

    Table 2. Top 15 risk factors ranked by relevance.

    Features                  Relevance    p
    smoking_status            7.729        <0.001
    education_level           7.656        <0.001
    family_status             6.338        <0.001
    GHB                       2.791        0.0016
    Calcium                   2.143        0.0071
    FBG                       2.043        0.009
    Potassium                 1.732        0.018
    uric_acid                 1.680        0.021
    total_protein             1.578        0.026
    Osmosis                   1.079        0.083
    Age                       0.963        0.108
    high_blood_pressure       0.905        0.124
    liver_diseases            0.882        0.131
    Albumin                   0.746        0.179
    coronary_heart_disease    0.745        0.179
    high_cholesterol          0.714        0.193
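
    The paper does not define the Relevance column. One possible reading, which is our assumption and not something the authors state, is that it equals −log10(p). The short check below compares the two columns for every row whose p-value is given numerically; small differences would be expected from rounding of the printed p-values.

```python
import math

# (relevance, p) pairs copied from Table 2; rows reported only as "<0.001" are omitted.
reported = {
    "GHB": (2.791, 0.0016),
    "Calcium": (2.143, 0.0071),
    "FBG": (2.043, 0.009),
    "Potassium": (1.732, 0.018),
    "uric_acid": (1.680, 0.021),
    "total_protein": (1.578, 0.026),
    "Osmosis": (1.079, 0.083),
    "Age": (0.963, 0.108),
    "high_blood_pressure": (0.905, 0.124),
    "liver_diseases": (0.882, 0.131),
    "Albumin": (0.746, 0.179),
    "coronary_heart_disease": (0.745, 0.179),
    "high_cholesterol": (0.714, 0.193),
}

# Largest gap between the printed relevance and -log10 of the printed p-value.
max_gap = max(abs(rel + math.log10(p)) for rel, p in reported.values())

for name, (rel, p) in reported.items():
    print(f"{name:24s} relevance={rel:.3f}  -log10(p)={-math.log10(p):.3f}")
print("largest gap:", round(max_gap, 3))
```

    Under this reading, a relevance above about 1.301 would correspond to p < 0.05.
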

    As can be seen from the table, education level, family income and smoking status had the strongest associations with periodontitis, with p-values below 0.001, which further confirms their important roles in the development of periodontitis. In addition, biochemical indicators such as fasting blood glucose (FBG) and uric acid also showed a significant correlation with periodontitis, suggesting that both socioeconomic status and biochemical indicators should be considered when preventing and treating periodontitis.

    3.2. Model performance

    The results of the performance analysis of each model are as follows:

    Accuracy: The proportion of all samples that the model predicts correctly. SVM has the highest accuracy (0.8448), slightly higher than RF (0.8421) and higher than XGBoost and MLP (both 0.8254).

    Precision: The proportion of samples predicted as positive that are actually positive. SVM and RF have a precision of 0, indicating that they perform poorly in predicting the positive class, whereas MLP has a relatively high precision of 0.4186.

    Recall: The proportion of actually positive samples that the model correctly predicts as positive. Among all the models, MLP has the highest recall (0.3214), indicating that it performs relatively well in recognizing positive-class samples.

    Specificity: The proportion of actually negative samples that the model correctly predicts as negative. SVM has the best specificity (1.0000), indicating that it identifies negative-class samples very accurately.21

    AUC: An important measure of model performance that reflects the model’s ability to distinguish between positive and negative samples. RF has the highest AUC (0.6943), indicating that its overall discriminative ability is relatively good.

    Comparing the four models, SVM and RF perform well in terms of accuracy and specificity, suggesting that they are better able to identify negative samples. However, their precision and recall of 0 show that they correctly identified no periodontitis cases in the test set, an extreme bias toward the negative class that is possibly due to data imbalance or model overfitting. Nonetheless, their high specificity makes them potentially useful in scenarios where nonperiodontitis patients are screened. XGBoost’s overall performance was relatively balanced, with improvements in precision and recall, although its accuracy was slightly lower than that of SVM and RF. The neural network performed best in identifying positive-class samples, with significantly higher precision and recall than the other models, but its comparatively lower specificity and AUC indicate that it needs improvement in overall discrimination between positive and negative samples.
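
    The effect of class imbalance on these metrics can be illustrated with a trivial baseline that labels every case as negative. The sketch below applies it to the class counts in Table 1 (185 periodontitis cases out of 1201); the function name is hypothetical, the baseline is not one of the paper's models, and reporting an undefined precision as 0 is an assumed convention.

```python
def all_negative_baseline(n_positive, n_negative):
    """Metrics for a classifier that predicts 'no periodontitis' for every case."""
    tp, fp = 0, 0
    fn, tn = n_positive, n_negative
    return {
        "accuracy": (tp + tn) / (n_positive + n_negative),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # No positive predictions are made, so precision is undefined; report 0 by convention.
        "precision": 0.0,
    }


# Class counts from Table 1: 185 positives among 1201 cases.
baseline = all_negative_baseline(n_positive=185, n_negative=1201 - 185)
for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

    On these counts the baseline reaches an accuracy of about 0.846 with a specificity of 1 and a recall of 0, which is the same pattern Table 3 shows for SVM. High accuracy on an imbalanced dataset therefore says little by itself about a model's ability to detect periodontitis.
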

    If the application scenario focuses more on identifying positive-class samples (e.g., the need to minimize missed diagnoses when diagnosing diseases), neural networks are a better choice. If high specificity is required (e.g., when false positives must be kept to a minimum), SVM and RF perform better. By comprehensively considering the various performance metrics of each model and the needs of the application scenario, the most appropriate model can be selected to achieve the best prediction results. Indicators for each model are shown in Table 3 and the ROC curves for each model are plotted in Fig. 7.

    Fig. 7. ROC curves for each model.

    Table 3. Performance indicators of each model.

    Performance indicator    SVM       RF        XGB       MLP
    Accuracy                 0.8448    0.8421    0.8254    0.8254
    Precision                0.0000    0.0000    0.1111    0.4186
    Recall                   0.0000    0.0000    0.0178    0.3214
    Specificity              1.0000    0.9967    0.9737    0.9180
    AUC                      0.6490    0.6943    0.6783    0.6206

    We also studied the feature importance of RF and XGBoost models. Figures 8 and 9 show the order of feature importance of the two models, respectively.

    Fig. 8. Importance ranking of features for the RF model.

    Fig. 9. Ranking of feature importance for the XGB model.

    In the RF model, FBG and age emerged as the top two most important predictive features. This result aligns with expectations, as elevated blood glucose levels are intimately linked to various health issues, particularly in diabetics who often face a heightened risk of periodontitis. Similarly, the increasing prevalence of chronic diseases with age corresponds to age being a significant feature. Furthermore, oral hygiene habits, body mass index (BMI) and biochemical indicators like C-reactive protein also scored highly in the RF model, underscoring the pivotal role of biochemical factors in disease prediction.

    However, social and environmental factors were relatively less significant in the RF model, possibly due to its structure favoring the capture of complex nonlinear relationships among biochemical indicators while overlooking some lifestyle or socioeconomic impacts.

    In contrast, the XGBoost model assigned high importance scores to smoking status, education level and family situation, revealing a strong association between these social and environmental factors and periodontitis. The detrimental effects of smoking on oral health are well-documented, while education level and family situation often influence health behaviors, health awareness and access to medical resources. The XGBoost model’s findings of a higher correlation between lower education levels, unhealthy lifestyles (such as smoking) and periodontitis predictions align with existing public health research.

    Notably, while biochemical factors dominate in the RF model, social and environmental factors contribute more significantly in the XGBoost model. This disparity suggests that while biochemical factors are undeniably important in disease prediction, social and environmental factors may wield greater influence in certain contexts. Model preferences vary with algorithm choices, focusing on different aspects of features.

    Additionally, hypertension, coronary heart disease and liver diseases ranked low in both models, indicating weaker associations with periodontitis prediction. While these diseases may not be direct causes of periodontitis, they could still hold potential significance in specific populations or health contexts.

    By comparing and analyzing feature importance across different models, we gain a more comprehensive understanding of periodontitis’s potential risk factors and insights into the relative importance of biochemical indicators versus socio-environmental factors in disease prediction. This discovery provides a scientific basis for the development of personalized prevention and treatment strategies.

    4. Discussion

    4.1. In-depth analysis of results

    Relying on the rich dataset of NHANES 2013–2014, this study systematically mined the potential risk factors of periodontitis and constructed a variety of prediction models based on advanced algorithms. One of the core findings is that education level, household income and smoking habit are highly significant risk factors for periodontitis. This not only corroborates established medical research but also reveals the far-reaching impact of socioeconomic status and poor lifestyle habits on individual periodontal health.22 This finding provides an important basis for public health policy formulation and emphasizes the need to pay special attention to health interventions for socioeconomically disadvantaged groups when improving oral health for the whole population.

    At the level of model evaluation, although SVM and RF models performed well in terms of specificity and overall accuracy, their shortcomings in precision and recall revealed the limitations of these models in complex clinical scenarios, especially in accurately identifying individual cases. In contrast, neural network models demonstrate superior precision and recall in capturing patients with periodontitis (positive-class samples), an advantage that is invaluable in reducing clinical underdiagnosis. However, the relative disadvantages of neural networks in terms of specificity and AUC also suggest that future research should focus on further optimizing the model to achieve more precise differentiation between positive and negative samples.

    The XGBoost model is one of the highlights of this study with its balanced performance, which not only maintains a good balance on multiple assessment indicators but also further strengthens the perception of socioeconomic factors, such as education level and family income, as key risk factors for periodontitis through feature importance analysis. This finding not only highlights the deep imprint of socioeconomic inequality in the field of oral health but also points to the direction of efforts to improve the oral health status of specific groups.

    In addition, the prevalence and harmfulness of cigarette smoking, the most prominent risk factor in all models, was re-emphasized. This serves as a strong warning to the public and points to the need for public health education to increase efforts to promote a healthy lifestyle, especially through the in-depth implementation of smoking cessation campaigns. Meanwhile, the high ranking of age and fasting glucose in the RF model reveals the general phenomenon of increasing risk of periodontitis with age and the development of chronic diseases, further emphasizing the importance of regular oral health checkups and comprehensive management of chronic diseases.

    In summary, this study not only deepens our understanding of periodontitis risk factors and their predictive modeling but also provides solid scientific support and valuable practical insights for the development of future oral health promotion strategies.

    4.2. Strengths and weaknesses

    The significant strength of this study lies in the richness and comprehensiveness of its data resources. By using the NHANES dataset, we were able to cover a wide range of population characteristics and detailed health information, which greatly enhanced the representativeness and reliability of the study results. In addition, by constructing and comparing four different machine learning algorithm models, we not only comprehensively evaluated the performance of each algorithm in the task of periodontitis prediction, but also provided a diverse selection of models for practical application scenarios. Especially importantly, we deeply analyzed the key factors affecting the onset of periodontitis, and these scientific insights provide a solid foundation for developing more accurate and effective prevention and treatment strategies.

    Despite these positive results, the study has limitations that cannot be ignored. First, data issues are a central challenge. Although the NHANES data are broadly representative, the requirement for complete data meant that only 1201 participants were included, which may have introduced selection bias, in particular through the exclusion of populations with extreme or specific health conditions. Missing data also limited our exploration of certain potentially important variables, which affected the comprehensiveness and accuracy of the model. In addition, the data date from 2013–2014 and may not fully reflect current periodontitis risk factors.
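
    The effect of a completeness requirement on sample size can be illustrated with a minimal sketch of complete-case filtering. The records and field names below are hypothetical and chosen only for illustration; they are not NHANES variable codes, and this is not necessarily the procedure used in the study.

```python
# Hypothetical records; field names are illustrative, not NHANES variable codes.
records = [
    {"age": 52, "education": 3, "income_ratio": 2.1, "smoker": 1},
    {"age": 47, "education": None, "income_ratio": 1.4, "smoker": 0},
    {"age": 63, "education": 2, "income_ratio": None, "smoker": 1},
    {"age": 35, "education": 4, "income_ratio": 3.8, "smoker": 0},
]


def complete_cases(rows, required):
    """Keep only rows in which every required field is present (not None)."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def missing_rate(rows, field):
    """Fraction of rows in which `field` is missing."""
    return sum(r.get(field) is None for r in rows) / len(rows)


required = ["age", "education", "income_ratio", "smoker"]
kept = complete_cases(records, required)

print(len(kept), "of", len(records), "records retained")
for field in required:
    print(field, "missing rate:", missing_rate(records, field))
```

    Comparing the retained and excluded records on variables such as age or income is one simple way to check whether the filtering step shifts the sample composition.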

    Model construction also posed challenges. Limited model complexity may restrict the ability to capture complex interactions between variables, reducing prediction accuracy. Noise and random errors in the data may also have degraded model training. Although we tried to eliminate irrelevant or redundant features during feature selection, some remaining features may still interfere with the models' predictive ability.

    Finally, the models differ greatly on specific performance indicators, so an appropriate model must be selected according to the scenario and needs of each practical application. Moreover, this study relies mainly on internal validation; validation with external datasets is needed to confirm the generalizability and stability of the models before they can be widely applied in clinical and public health settings.
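
    To make the trade-off between indicators concrete, the following sketch computes accuracy, precision, recall, specificity and AUC for a single binary classifier. The labels and scores are hypothetical, the 0.5 decision threshold is an assumption, and the AUC is computed by the pairwise ranking definition rather than by any particular library routine.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = periodontitis, 0 = healthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),        # sensitivity
        "specificity": ratio(tn, tn + fp),
    }


def auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth and predicted probabilities for ten participants.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.05]
y_pred = [int(s >= 0.5) for s in y_score]  # assumed threshold of 0.5

metrics = binary_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

    Lowering the threshold would typically raise recall at the cost of specificity, which is one reason a model that suits screening may differ from one that suits confirmation.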

    4.3. Relevance

    This study has practical significance for the prevention and treatment of periodontitis. It deepens our understanding of the risk factors of periodontitis and provides a scientific basis for clinical decision-making through the prediction models. By combining the model predictions with a patient's socioeconomic background, lifestyle habits and specific clinical characteristics, doctors will be able to tailor preventive and therapeutic programs and deliver more personalized and efficient healthcare. In addition, the results provide a reference for public health policymakers: they emphasize the importance of improving public education, improving family economic conditions and promoting healthy lifestyles in promoting oral health and preventing periodontitis, which will help to improve the overall health of society.

    4.4. Future prospects

    Looking forward, further work in this research area will focus on the following aspects. First, more diverse and comprehensive data sources should be collected and integrated, especially those covering different socioeconomic backgrounds and geographic regions, to enhance the models' generalization ability and adaptability. Second, more advanced data preprocessing techniques should be introduced to reduce the adverse effects of missing data and noise on model performance. Third, more potential influencing factors, such as genetic information and environmental factors, should be included to construct a more comprehensive and accurate prediction model. Fourth, more complex machine learning algorithms, such as deep learning, should be explored to capture the nonlinear relationships and interactions among variables and further improve prediction accuracy. Fifth, an ensemble learning approach should be adopted to combine the advantages of different models so that they complement and strengthen one another. Finally, a model feedback and updating mechanism should be established so that the model can be continuously optimized with new data and feedback from clinical practice, maintaining the timeliness and accuracy of its predictions.
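
    As one possible form of the ensemble approach mentioned above, the sketch below applies soft voting, in which the predicted probabilities of several models are averaged before thresholding. The probability values, the equal default weights and the 0.5 threshold are all assumptions for illustration, not outputs or settings from this study.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model positive-class probabilities for each sample."""
    if weights is None:
        weights = [1.0] * len(prob_lists)  # equal weights by default
    total = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_samples)
    ]


# Hypothetical predicted probabilities of periodontitis for three participants.
svm_probs = [0.2, 0.7, 0.4]
rf_probs = [0.3, 0.6, 0.5]
nn_probs = [0.1, 0.9, 0.8]
xgb_probs = [0.2, 0.8, 0.4]

avg_probs = soft_vote([svm_probs, rf_probs, nn_probs, xgb_probs])
labels = [int(p >= 0.5) for p in avg_probs]  # assumed threshold of 0.5

print("averaged probabilities:", avg_probs)
print("ensemble labels:", labels)
```

    Passing `weights` lets a better-validated model count for more; choosing those weights on held-out data rather than on the training set avoids overstating the ensemble's benefit.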

    5. Conclusion

    In this study, based on the NHANES 2013–2014 dataset, we constructed periodontitis prediction models using several machine learning algorithms and analyzed the key risk factors for periodontitis in depth. The results showed that education level, household income and smoking habits were significant predictors of periodontitis. In terms of model performance, each algorithm had its own merits: neural networks excelled in identifying patients with periodontitis, while SVM and RF had advantages in specificity and accuracy. XGBoost offered the most balanced performance overall. This study provides new perspectives and tools for the prevention and treatment of periodontitis and has clinical application value and social significance. With further research and technological advances, the prevention and treatment of periodontitis can be expected to become more accurate and efficient, contributing to oral health.

    Acknowledgment

    This work was supported by the Xiamen health care and medical guidance project (Grant No. 3502Z20244ZD1337) and 2021 Xiamen Medical and Health Guiding Project of Xiamen Science and Technology Bureau (Grant No. 3502Z20214ZD1270). Qi Lan and Zhiqiang Xu contributed equally to this work.

    ORCID

    Di Jin  https://orcid.org/0009-0002-5534-8365

    Qi Lan  https://orcid.org/0009-0003-2572-1250

    Zhiqiang Xu  https://orcid.org/0009-0009-9432-9694