Construction and validation of machine learning predictive models for the risk of metabolic associated fatty liver disease
DOI: 10.12449/JCH260413
Abstract:
Objective To investigate the value of predictive models built with machine learning methods in predicting the risk of metabolic associated fatty liver disease (MAFLD), and to analyze its key risk factors.
Methods Data on 50 variables, including body composition, past history, and laboratory tests, were retrospectively collected from 2 168 individuals who underwent physical examination at the Department of Health Assessment, Xiyuan Hospital, China Academy of Chinese Medical Sciences, from January 2021 to December 2024. According to whether they were diagnosed with MAFLD, the individuals were divided into an MAFLD group (n=265) and a non-MAFLD group (n=1 903). The Mann-Whitney U test was used for comparison of continuous data between the two groups, and the chi-square test was used for comparison of categorical data between the two groups. The data were randomly split into a training set and a validation set at a ratio of 7:3. Predictive factors were screened in the training set using univariate analysis, LASSO regression, and multivariate Logistic regression analysis. Predictive models were then constructed using seven machine learning methods: Logistic regression, decision tree, random forest (RF), eXtreme gradient boosting, light gradient boosting machine, support vector machine, and artificial neural network. Model performance was evaluated by plotting the receiver operating characteristic curve for the validation set and calculating the area under the curve (AUC), sensitivity, specificity, and Youden index for each model. The SHapley Additive exPlanation (SHAP) method was then used to analyze the contribution of each variable in the optimal model.
Results The prevalence rate of MAFLD among the 2 168 subjects was 12.22% (265/2 168). Smoking, diastolic blood pressure, phase angle, visceral fat area, muscle-to-fat ratio, waist-to-hip ratio, aspartate aminotransferase, non-HDL-C/HDL-C ratio, triglyceride-glucose index, and gallstones were independent influencing factors for the risk of MAFLD (all P<0.05). In the validation set, the seven predictive models (support vector machine, eXtreme gradient boosting, decision tree, light gradient boosting machine, artificial neural network, RF, and Logistic regression) had an AUC of 0.738, 0.754, 0.757, 0.786, 0.795, 0.796, and 0.815, respectively, among which the RF model had the best discriminatory ability (AUC=0.796, 95% confidence interval: 0.754-0.839), with a sensitivity of 81.01%, a specificity of 63.16%, and a Youden index of 44.17%. The SHAP analysis showed that visceral fat area, waist-to-hip ratio, and diastolic blood pressure were the top three predictive factors in terms of importance.
Conclusion The RF model, constructed based on body composition and clinical indicators, performs well in predicting the risk of MAFLD, and its interpretability can help to identify high-risk individuals at an early stage in clinical practice.
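The evaluation metrics named in the abstract can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the labels and scores are invented toy values, the function names (`auc`, `sens_spec`, `best_youden_threshold`) are hypothetical, and choosing the cut-off by maximizing the Youden index is an assumption, since the abstract does not state how cut-offs were set.

```python
# Toy illustration only: labels and scores are invented, not study data.

def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count concordant pairs; ties contribute one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_youden_threshold(labels, scores):
    """Return (threshold, Youden index) maximizing sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens, spec = sens_spec(labels, scores, t)
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# 1 = MAFLD, 0 = non-MAFLD (hypothetical predicted risks)
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.45, 0.60, 0.70, 0.90]

toy_auc = auc(labels, scores)
toy_threshold, toy_j = best_youden_threshold(labels, scores)
print(f"AUC={toy_auc:.3f}, threshold={toy_threshold}, Youden index={toy_j:.3f}")
```

The pairwise definition of the AUC used here is quadratic in the sample size, which is fine for a sketch but would normally be replaced by a rank-based computation on a real validation set.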
Table 1. Comparison of baseline characteristics between the non-MAFLD group and the MAFLD group

| Indicator | Non-MAFLD group (n=1 903) | MAFLD group (n=265) | Statistic | P value |
| --- | --- | --- | --- | --- |
| Sex [n (%)] | | | χ²=8.51 | 0.004 |
| Female | 1 176 (61.80) | 139 (52.45) | | |
| Male | 727 (38.20) | 126 (47.55) | | |
| Smoking [n (%)] | 23 (1.21) | 12 (4.53) | χ²=14.12 | <0.001 |
| Age (years) | 46.00 (32.00~61.00) | 50.00 (38.00~62.00) | Z=-2.62 | 0.009 |
| Height (cm) | 163.50 (158.50~170.00) | 167.00 (159.00~173.50) | Z=-3.34 | <0.001 |
| Weight (kg) | 62.60 (55.70~71.45) | 73.70 (64.40~83.30) | Z=-11.71 | <0.001 |
| BMI (kg/m²) | 23.50 (21.40~25.80) | 26.60 (24.60~29.00) | Z=-13.65 | <0.001 |
| Waist circumference (cm) | 83.40 (77.30~90.50) | 93.60 (87.40~100.40) | Z=-14.38 | <0.001 |
| Hip circumference (cm) | 94.20 (90.60~98.05) | 99.30 (96.00~103.50) | Z=-12.69 | <0.001 |
| WHR | 0.89 (0.85~0.93) | 0.95 (0.92~0.99) | Z=-15.84 | <0.001 |
| Systolic blood pressure (mmHg) | 122.00 (111.00~134.00) | 131.00 (120.00~140.00) | Z=-7.38 | <0.001 |
| DBP (mmHg) | 70.00 (64.00~78.00) | 76.00 (69.00~83.00) | Z=-7.41 | <0.001 |
| Neutrophil count (×10⁹/L) | 3.23 (2.61~3.99) | 3.63 (2.99~4.65) | Z=-5.63 | <0.001 |
| Platelet count (×10⁹/L) | 240.00 (205.00~279.00) | 249.00 (223.00~294.00) | Z=-4.14 | <0.001 |
| Monocyte count (×10⁹/L) | 0.29 (0.24~0.35) | 0.34 (0.28~0.42) | Z=-6.97 | <0.001 |
| Lymphocyte count (×10⁹/L) | 1.83 (1.50~2.21) | 2.08 (1.73~2.47) | Z=-6.70 | <0.001 |
| SII | 420.38 (314.13~548.78) | 457.96 (327.28~598.05) | Z=-2.39 | 0.017 |
| SIRI | 0.50 (0.36~0.73) | 0.57 (0.43~0.86) | Z=-4.12 | <0.001 |
| AISI | 122.40 (83.34~177.68) | 148.09 (97.73~229.27) | Z=-5.05 | <0.001 |
| ALT (U/L) | 14.20 (10.50~21.00) | 22.90 (16.20~35.10) | Z=-12.71 | <0.001 |
| AST (U/L) | 18.30 (15.50~22.00) | 20.50 (17.20~25.90) | Z=-6.78 | <0.001 |
| ALT/AST | 0.79 (0.63~1.02) | 1.12 (0.89~1.45) | Z=-13.36 | <0.001 |
| Total bilirubin (μmol/L) | 10.80 (8.60~14.00) | 11.30 (8.30~14.20) | Z=-0.30 | 0.762 |
| TC (mmol/L) | 4.75 (4.15~5.41) | 4.88 (4.17~5.49) | Z=-0.95 | 0.340 |
| TG (mmol/L) | 1.04 (0.74~1.47) | 1.37 (1.03~1.89) | Z=-8.22 | <0.001 |
| HDL-C (mmol/L) | 1.33 (1.13~1.56) | 1.22 (1.02~1.38) | Z=6.50 | <0.001 |
| LDL-C (mmol/L) | 2.96 (2.41~3.59) | 3.11 (2.56~3.73) | Z=-2.37 | 0.018 |
| Creatinine (μmol/L) | 68.00 (59.00~79.00) | 71.00 (61.00~83.00) | Z=-2.39 | 0.017 |
| FPG (mmol/L) | 5.29 (4.99~5.69) | 5.53 (5.17~6.20) | Z=-6.40 | <0.001 |
| UA (μmol/L) | 304.00 (254.00~361.00) | 364.00 (303.00~422.00) | Z=-9.10 | <0.001 |
| NHHR | 2.55 (1.91~3.33) | 3.02 (2.38~3.62) | Z=-6.37 | <0.001 |
| UHR | 0.14 (0.11~0.19) | 0.19 (0.14~0.23) | Z=-9.40 | <0.001 |
| TyG | 6.82 (6.45~7.23) | 7.12 (6.86~7.50) | Z=-8.97 | <0.001 |
| TyG-BMI | 159.95 (140.95~182.54) | 191.61 (174.37~211.98) | Z=-14.42 | <0.001 |
| HSI | 31.31 (28.55~34.56) | 36.67 (34.26~40.74) | Z=-15.96 | <0.001 |
| Body fat mass (kg) | 19.10 (15.40~23.00) | 24.70 (21.40~29.50) | Z=-14.12 | <0.001 |
| Muscle mass (kg) | 39.80 (35.50~48.75) | 44.80 (38.30~54.20) | Z=-6.72 | <0.001 |
| MFR | 1.24 (0.98~1.62) | 1.05 (0.84~1.32) | Z=-7.88 | <0.001 |
| FMR | 0.81 (0.62~1.02) | 0.96 (0.76~1.19) | Z=-7.88 | <0.001 |
| Fat mass index | 7.10 (5.70~8.70) | 9.10 (7.70~10.90) | Z=-12.40 | <0.001 |
| Fat-free mass index | 16.10 (14.80~17.90) | 17.50 (15.90~19.30) | Z=-8.16 | <0.001 |
| Fat-free mass (kg) | 42.30 (37.65~51.60) | 47.60 (40.50~57.50) | Z=-6.68 | <0.001 |
| Skeletal muscle (kg) | 22.90 (20.10~28.70) | 26.10 (21.80~32.40) | Z=-6.76 | <0.001 |
| Body fat percentage (%) | 30.40 (25.40~35.40) | 34.40 (30.00~39.10) | Z=-8.68 | <0.001 |
| Basal metabolic rate (kJ) | 5 368.07 (4 949.67~6 209.06) | 5 882.70 (5 234.18~6 773.90) | Z=-7.11 | <0.001 |
| Visceral fat area (cm²) | 86.70 (66.10~113.10) | 120.00 (103.90~148.10) | Z=-15.28 | <0.001 |
| Phase angle (°) | 4.80 (4.30~5.30) | 4.60 (4.10~5.20) | Z=-1.75 | 0.033 |
| Skeletal muscle index | 6.60 (6.00~7.60) | 7.40 (6.50~8.20) | Z=-7.81 | <0.001 |
| Hypertension [n (%)] | 451 (23.70) | 115 (43.40) | χ²=46.78 | <0.001 |
| Diabetes [n (%)] | 186 (9.77) | 50 (18.87) | χ²=19.83 | <0.001 |
| Gallstones [n (%)] | 26 (1.37) | 23 (8.68) | χ²=56.31 | <0.001 |

Note: MAFLD, metabolic associated fatty liver disease; BMI, body mass index; WHR, waist-to-hip ratio; DBP, diastolic blood pressure; SII, systemic immune-inflammation index; SIRI, systemic inflammation response index; AISI, aggregate index of systemic inflammation; ALT, alanine aminotransferase; AST, aspartate aminotransferase; TC, total cholesterol; TG, triglyceride; HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; FPG, fasting plasma glucose; UA, uric acid; NHHR, non-HDL-C to HDL-C ratio; UHR, serum uric acid to HDL-C ratio; TyG, triglyceride-glucose index; TyG-BMI, triglyceride glucose-body mass index; HSI, hepatic steatosis index; MFR, muscle-to-fat ratio; FMR, fat-to-muscle ratio.
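The χ² statistics for the categorical rows of Table 1 can be recomputed from the reported counts. The sketch below is only a cross-check under stated assumptions: the function name `chi_square_2x2` is hypothetical, non-smoker counts are derived from the group sizes, and the table does not say whether a continuity correction was used. As an inference from the numbers alone, the value for sex matches the uncorrected Pearson statistic, while the value for smoking matches the statistic with Yates' continuity correction.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction shrinks |ad - bc| by n/2 (never below 0).
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rows are groups (non-MAFLD, MAFLD); counts copied from Table 1.
# Sex: columns are (female, male).
chi2_sex = chi_square_2x2(1176, 727, 139, 126)

# Smoking: columns are (smoker, non-smoker); non-smokers derived from n=1 903 and n=265.
chi2_smoking = chi_square_2x2(23, 1903 - 23, 12, 265 - 12, yates=True)

print(f"sex: {chi2_sex:.2f}, smoking (Yates): {chi2_smoking:.2f}")
```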
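Table 2. Multivariate Logistic regression analysis of influencing factors for the risk of MAFLD

| Variable | OR | 95%CI | Partial regression coefficient | Standard error | Wald χ² | P value |
| --- | --- | --- | --- | --- | --- | --- |
| Smoking | 4.374 | 1.471~13.010 | 1.655 | 0.760 | 2.178 | 0.029 |
| DBP (mmHg) | 1.016 | 1.002~1.039 | 0.028 | 0.013 | 2.049 | 0.040 |
| Phase angle (°) | 0.392 | 0.286~0.536 | -1.064 | 0.208 | -5.110 | <0.001 |
| Visceral fat area (cm²) | 1.123 | 1.087~1.160 | 0.126 | 0.021 | 5.937 | <0.001 |
| MFR | 0.074 | 0.010~0.562 | -3.059 | 1.327 | -2.306 | 0.021 |
| WHR | 7.382 | 2.964~16.337 | 4.132 | 1.275 | 3.367 | <0.001 |
| AST (U/L) | 1.030 | 1.005~1.056 | 0.032 | 0.016 | 2.032 | 0.042 |
| NHHR | 0.672 | 0.517~0.875 | -0.528 | 0.164 | -3.218 | 0.001 |
| TyG | 8.670 | 4.964~15.495 | 6.615 | 1.497 | 4.418 | <0.001 |
| Gallstones | 5.974 | 2.700~13.218 | 1.805 | 0.479 | 3.767 | <0.001 |

Note: MAFLD, metabolic associated fatty liver disease; DBP, diastolic blood pressure; MFR, muscle-to-fat ratio; WHR, waist-to-hip ratio; AST, aspartate aminotransferase; NHHR, non-HDL-C to HDL-C ratio; TyG, triglyceride-glucose index; OR, odds ratio; 95%CI, 95% confidence interval.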
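The P values in Table 2 can be related to the Wald column with a short calculation. The sketch below assumes that the signed values in that column behave like standard normal z statistics (coefficient divided by standard error) and that the P values are two-sided; this is an inference from the numbers, not something the table states, and the helper name `two_sided_p` is hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# (Wald statistic, reported P value) copied from Table 2 for rows with exact P values.
wald_rows = {
    "Smoking": (2.178, 0.029),
    "DBP": (2.049, 0.040),
    "MFR": (-2.306, 0.021),
    "AST": (2.032, 0.042),
    "NHHR": (-3.218, 0.001),
}

p_values = {name: two_sided_p(z) for name, (z, _) in wald_rows.items()}

for name, (z, reported) in wald_rows.items():
    print(f"{name}: z={z:+.3f}, recomputed P={p_values[name]:.3f}, reported P={reported:.3f}")
```

Under this assumption the recomputed values agree with the reported P values to three decimal places for the rows shown.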
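Table 3. Performance of the seven machine learning models in the validation set

| Machine learning method | AUC (95%CI) | Sensitivity (%) | Specificity (%) | Youden index (%) |
| --- | --- | --- | --- | --- |
| Logistic regression | 0.815 (0.773~0.854) | 18.99 | 97.54 | 16.53 |
| DT | 0.757 (0.701~0.807) | 83.54 | 59.47 | 43.02 |
| RF | 0.796 (0.754~0.839) | 81.01 | 63.16 | 44.17 |
| XGBoost | 0.754 (0.703~0.801) | 81.01 | 59.47 | 40.49 |
| LightGBM | 0.786 (0.743~0.826) | 81.01 | 59.47 | 40.49 |
| SVM | 0.738 (0.684~0.793) | 64.56 | 72.81 | 37.36 |
| ANN | 0.795 (0.747~0.839) | 60.76 | 79.47 | 40.23 |

Note: DT, decision tree; RF, random forest; XGBoost, eXtreme gradient boosting; LightGBM, light gradient boosting machine; SVM, support vector machine; ANN, artificial neural network; AUC, area under the curve; 95%CI, 95% confidence interval.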
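The Youden index column of Table 3 follows from sensitivity and specificity as sensitivity + specificity - 100 (in percent). The sketch below recomputes it from the tabulated values and ranks the models by Youden index and by AUC; the variable names are illustrative, and recomputed values can differ from the table by 0.01 because the inputs are already rounded.

```python
# (AUC, sensitivity %, specificity %) copied from Table 3.
table3 = {
    "Logistic regression": (0.815, 18.99, 97.54),
    "DT": (0.757, 83.54, 59.47),
    "RF": (0.796, 81.01, 63.16),
    "XGBoost": (0.754, 81.01, 59.47),
    "LightGBM": (0.786, 81.01, 59.47),
    "SVM": (0.738, 64.56, 72.81),
    "ANN": (0.795, 60.76, 79.47),
}

# Youden index in percent: sensitivity + specificity - 100.
youden = {model: round(sens + spec - 100.0, 2) for model, (_, sens, spec) in table3.items()}

best_by_youden = max(youden, key=youden.get)
best_by_auc = max(table3, key=lambda model: table3[model][0])

for model, j in sorted(youden.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: Youden index = {j:.2f}%")
print(f"Highest Youden index: {best_by_youden}; highest AUC: {best_by_auc}")
```

On the tabulated figures, the two rankings select different models: RF has the highest Youden index, while Logistic regression has the highest AUC but a sensitivity of only 18.99% at its reported operating point.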

