Necrotizing enterocolitis (NEC) occurs in 5–10% of very low birth weight (VLBW) infants and is one of the leading causes of death among them1,2,3. Even survivors of NEC are known to be at risk of long-term growth failure and neurodevelopmental impairments4,5,6,7.

The pathogenesis of NEC is multifactorial. Traditionally, immaturity, hyperosmolar formula, rapid feeding advancement, infection, and bowel ischemia have been recognized as risk factors for NEC8,9,10,11,12,13,14. In addition, several studies have investigated the association between seasonal variations and NEC15,16,17,18. A previous multicenter study in the US showed a bimodal peak in the occurrence of NEC in May/June and October/November15. Javidi et al. reported a similar bimodal peak and a higher number of NEC cases in April/May16. In that study, gestational age (GA), birth weight (BW), and birth month were associated with NEC. Another multicenter study in England showed that the incidence of surgical NEC was higher in late spring17. However, a study in Sweden found a peak incidence in November and a low incidence in May18. These studies, although inconsistent in their results, suggest that environmental factors such as seasonal variation and birth month could influence the incidence of NEC. Studies on the association between NEC and other environmental factors, such as ambient temperature and air pollution, are lacking. Furthermore, no attempts have been made to use machine learning for the prediction of NEC among VLBW infants.

In this context, this study employed machine learning and a national prospective cohort registry database to examine the main predictors of NEC in VLBW infants, including environmental factors such as ambient temperature, air pollution, and seasonal variation in birth year. This study presents the most comprehensive machine learning analysis on this topic, using a rich collection of 74 predictors and bringing new results concerning their associations with NEC.


Descriptive statistics for NEC and its categorical predictors are presented in Table 1. Among 10,353 VLBW infants, the proportion of NEC was 6.8% (n = 704). The results of the univariate analysis (chi-square test for the equality of proportions “Yes” or t test for the equality of means) are presented in Table 2. The P values were smaller than 0.05 for the following variables: GA, BW, small-for-GA, sex (male), birth year, multipara, gestational diabetes mellitus, chorioamnionitis, pre-labor rupture of membrane, antenatal steroid use, cesarean section, oligohydramnios, polyhydramnios, Apgar score, intensive neonatal resuscitation, initial blood gas analysis, pulmonary hemorrhage, respiratory distress syndrome, treated patent ductus arteriosus (PDA), air leak syndrome, and sepsis.
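The chi-square test for the equality of proportions used in the univariate analysis can be sketched for a single binary predictor as follows. This is a minimal standard-library Python illustration, not the authors' R code. The column totals (704 infants with NEC and 9,649 without) follow from the text, but the split by predictor status is hypothetical and chosen only to make the example run.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the 2x2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: NEC yes / NEC no. Columns: predictor present / predictor absent.
# HYPOTHETICAL cell counts; only the row totals (704 and 9,649) come from the text.
nec_with, nec_without = 250, 454
no_nec_with, no_nec_without = 1900, 7749

stat, p_value = chi_square_2x2(nec_with, nec_without, no_nec_with, no_nec_without)
print(f"chi-square = {stat:.2f}, p = {p_value:.3g}")
```

A P value below 0.05 from such a table corresponds to the criterion used above for flagging a categorical predictor; continuous predictors such as GA and BW would go through a t test instead.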

Table 1 Descriptive Statistics: Necrotizing Enterocolitis and Categorical Predictors.
Table 2 Univariate Analysis.

The performance measures for the six prediction models for NEC are listed in Table 3. The random split and analysis were repeated 50 times, and the average was taken for external validation. The performance results were similar irrespective of the inclusion of the average ambient temperature for each of the 10, 9, 8, …, 2, 1, and 0 months before birth. With the inclusion of sepsis, the area under the receiver-operating-characteristic curve for the random forest increased from 0.70 to 0.72. Among the six prediction models for NEC, logistic regression and the random forest with 1000 trees had the best performance (accuracy: 0.93 and 0.93; area under the receiver-operating-characteristic curve: 0.73 and 0.72, respectively). The findings of hyper-parameter tuning in the last box of Table 3 show that the random forests with 500, 400, 300, 200, and 100 trees did not perform as well as the random forest with 1000 trees. The areas under the receiver-operating-characteristic curves for the six prediction models in one of the 50 runs are presented in Fig. 1. The results in Fig. 1 came from one particular run (i.e., the 50th run), whereas the results in Table 3 are the averages of the 50 runs, which explains why they differ. The values and ranks of random forest variable importance are presented in Table 4. The importance rank of the temperature average for each of the 10, 9, 8, …, 2, 1, and 0 months before birth was below the top 30, whereas sepsis ranked within the top 10 (9th). According to the random forest variable importance in Table 4 and Fig. 2, the major predictors of NEC were BW (0.0910), BW Z-score (0.0907), maternal age (0.0712), GA (0.0476), average birth year temperature (0.0250), birth year (0.0245), minimum birth year temperature (0.0244), maximum birth year temperature (0.0239), sepsis (0.0237), sex (male) (0.0198), multipara (0.0189), surfactant use ≥ 2 (0.0168), multiple pregnancy (0.0166), treated PDA (0.0165), and chorioamnionitis (0.0163). Based on logistic regression variable importance (the absolute value of the optimized coefficient) in Table 5, the major predictors of NEC were sepsis, BW Z-score, gestational diabetes mellitus, PDA ligation, unmarried, pulmonary hemorrhage, sex (male), maximum birth year temperature, air leak syndrome, chorioamnionitis, small-for-GA, blood gas base excess, GA, in vitro fertilization, and antenatal steroid. Note that the results in Tables 4 and 5 came from one particular run (i.e., the 50th run).

Table 3 Model performance for predicting necrotizing enterocolitis: Means and confidence intervals over 50 runs.
Figure 1

Area Under the Receiver-Operating-Characteristic Curves for Necrotizing Enterocolitis. Legend: The receiver-operating-characteristic curve is the plot of the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings, and the area under the receiver-operating-characteristic curve (AUC) is the area under this curve. Abbreviations: ANN Artificial neural network, AUC Area under the receiver-operating-characteristic curve, DT Decision tree, LR Logistic regression, NB Naïve Bayes, RF Random forest, SVM Support vector machine.
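The quantities in this legend can be sketched in a few lines of standard-library Python. This is an illustration only: the labels and scores are hypothetical toy values, and the function names `roc_points` and `auc_trapezoid` are introduced here rather than taken from the study's software.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per
    distinct score threshold, starting from the point (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a case is predicted positive
    # when its score is at or above the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) pairs."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# HYPOTHETICAL toy data: 1 = NEC, 0 = no NEC; scores are predicted risks.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

auc = auc_trapezoid(roc_points(labels, scores))
print(auc)  # 0.75 for this toy data
```

An AUC of 0.5 corresponds to ranking cases at random and 1.0 to perfect ranking, which is the scale on which the reported values of 0.72 and 0.73 should be read.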

Table 4 Random Forest Variable Importance: Temperature Average for Birth Month, Sepsis Included.
Figure 2

Random Forest Variable Importance Plots for Necrotizing Enterocolitis. Legend: Random forest variable importance measures the decrease in node impurity (GINI) from creating a branch on a given predictor. It is averaged over all trees in the random forest and ranges between 0 and 1. Abbreviations: PM, Particulate matter; PROM, Pre-labor rupture of membranes.
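The impurity decrease behind this importance measure can be illustrated for one split of one tree. The sketch below uses standard-library Python and hypothetical toy labels. It shows only the per-split quantity; a full random forest additionally accumulates these decreases per predictor and averages them over all trees, which is not reproduced here.

```python
def gini(labels):
    """Gini impurity of a node with binary labels (1 = NEC, 0 = no NEC)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    # For two classes, 1 - p^2 - (1 - p)^2 simplifies to 2p(1 - p).
    return 2 * p * (1 - p)


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity
    of the two child nodes created by a split."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# HYPOTHETICAL toy node: eight infants, split on some predictor.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

decrease = gini_decrease(parent, left, right)
print(decrease)  # 0.125 for this toy split
```

A predictor that produces purer child nodes yields a larger decrease, so predictors used in many informative splits accumulate higher importance.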

Table 5 Logistic Regression Variable Importance: Temperature Average for Birth Month, Sepsis Included.


Among the six prediction models for NEC, logistic regression and random forest had the best performances. According to random forest variable importance, major predictors of NEC included environmental factors (ambient birth year temperature), maternal factors (maternal age, multipara, multiple pregnancy, chorioamnionitis), and neonatal factors (GA, BW, male sex, sepsis, PDA).

This study confirmed that BW and GA were the main predictors of NEC. Our findings are consistent with the results of previous studies showing that lower BW and GA were the main risk factors for NEC19,20. Prematurity is well known to be the main cause of NEC. This can be explained by ischemic mucosal injury in the immature gut of preterm infants21. Recently, NEC has been considered to develop from multifactorial hits to the immature gut by both prenatal and postnatal factors. In addition, the gut microbiota in preterm infants differs from that in healthy term infants and shows decreased diversity22,23. Moreover, prematurity reflects developmental changes in several organs other than the gut, which increases the incidence of neonatal morbidity.

A unique finding of this study was that ambient temperature was associated with the incidence of NEC. The higher ambient temperature associated with NEC incidence may be influenced by environmental factors. Previous studies have reported that a high ambient temperature increases the risk of preterm birth24,25,26. Heat induces the production of proinflammatory cytokines such as interleukin (IL)-1, IL-6, and tumor necrosis factor, causing inflammatory processes at the maternal–fetal interface27. Furthermore, heat stress increases the production of oxytocin and prostaglandin, which are associated with uterine contractions and induce preterm labor28,29. Heat also causes dehydration, which decreases maternal fluid levels, subsequently reducing fetal blood volume and leading to the production of pituitary hormones that provoke labor30.

Sepsis is one of the main predictors of NEC. Infection triggers inflammation in the immature gastrointestinal tract, which may contribute to NEC pathogenesis31. Recent findings have shown that preterm infants are exposed to a bacteria-rich environment in the neonatal intensive care unit and to antibiotics that reduce the diversity of the gut microbiome32. Toll-like receptor 4 (TLR4) is a pathogen recognition molecule that recognizes bacterial endotoxins such as lipopolysaccharides and induces inflammation33. This TLR4-mediated bacterial signaling leads to increased mucosal injury and reduced mucosal repair, resulting in mucosal defects through which bacteria can translocate into the circulation34,35,36. At this stage, bacteria inhibit vasodilator expression, thus decreasing intestinal perfusion, which results in tissue necrosis of the gut37.

In this study, chorioamnionitis was found to be a predictor of NEC. There have been debates regarding prenatal infection or inflammation and its effects on NEC. Some studies reported no association, but others demonstrated that chorioamnionitis was associated with preterm birth, and it was also associated with inflammation and infection in infants during perinatal periods38,39,40. A meta-analysis by Been et al. revealed that chorioamnionitis is significantly associated with NEC41. Our findings are consistent with the results of these studies. Gastrointestinal inflammatory markers were increased in preterm infants exposed to chorioamnionitis, reflecting the proinflammatory state of the gut after birth42. The gut microbiome reflects amniotic fluid with chorioamnionitis43. In this condition, preterm infants may have disturbed barrier function, which would increase the susceptibility of the gut to secondary hits, such as sepsis and circulatory instability, leading to an increased incidence of NEC41.

In this study, multiparity was significantly associated with NEC. Lee et al. reported similar results in VLBW infants40. This finding may be explained by the effects of maternal parity on the infant through exposure to maternal stress factors from recurrent pregnancy, oxidative stress, and passive transfer of immunomodulators that change the gut microbiota of neonates.

There are some limitations to this study. First, address information was not provided in the Korean Neonatal Network (KNN) database; hence, national averages were used for the PM10 and temperature variables in this study. More specific information on these predictors would improve the validity of research in this direction. Second, this study did not consider the possible mediating effects of the various predictors. Third, this study did not focus on examining the possible mechanisms between major predictors and NEC. Fourth, this study did not include indoor factors that could be major predictors of NEC. Fifth, it was beyond the scope of this study to compare various re-sampling approaches to class imbalance, given that the proportion of NEC was only 6.8%. Under-sampling reduces the majority class to achieve balance, whereas over-sampling expands the minority class for the same purpose. For example, a recent study compared the performance measures of four machine learning models with under-sampling and over-sampling for the prediction of cardiovascular disease44. Few studies are available, and further investigation is needed on this topic. Sixth, maternal age, GA, BW, BW Z-score, and environmental predictors were not normalized in order to keep their full information. Using different rescaling methods for these continuous predictors (e.g., normalization) and comparing their results would make a valuable contribution to this line of research. Seventh, this study followed existing literature49,53,54 to focus on the top-10 predictors in terms of random forest variable importance. However, it should be noted that there is no consensus on the threshold for major predictors in terms of random forest variable importance. Eighth, this study focused on random forest variable importance instead of logistic regression variable importance. Logistic regression performed as well as the random forest in this study, but logistic regression requires an unrealistic assumption of ceteris paribus, i.e., “all the other variables staying constant.” For this reason, we used random forest variable importance for evaluating the importance ranking of a major predictor and univariate analysis for testing the direction of association between NEC and the predictor. Some predictors ranked within the top 15 in the random forest but outside the top 30 in logistic regression, i.e., BW (1st vs. 63rd), maternal age (3rd vs. 52nd), average birth year temperature (5th vs. 56th), birth year (6th vs. 65th), primipara (11th vs. 33rd), and surfactant use (12th vs. 40th). Little literature is available, and more examination is needed on comparing the variable importance of various statistical approaches.
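The two re-sampling approaches named in the fifth limitation can be sketched as follows. This is a minimal standard-library Python illustration of plain random under-sampling and over-sampling, which this study did not apply. The toy data set and the function names are hypothetical, and in practice such re-sampling would normally be applied to the training set only.

```python
import random


def split_by_class(rows):
    """Separate (features, label) rows into minority and majority classes."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def random_undersample(rows, seed=0):
    """Shrink the majority class to the size of the minority class."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows)
    return minority + rng.sample(majority, len(minority))


def random_oversample(rows, seed=0):
    """Grow the minority class to the size of the majority class by
    drawing minority rows with replacement."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


# HYPOTHETICAL toy training set: 7 NEC cases among 100 infants,
# roughly mirroring the 6.8% proportion reported in the text.
rows = [((i,), 1) for i in range(7)] + [((i,), 0) for i in range(7, 100)]

under = random_undersample(rows)
over = random_oversample(rows)
print(len(under), sum(label for _, label in under))  # 14 rows, 7 positive
print(len(over), sum(label for _, label in over))    # 186 rows, 93 positive
```

Under-sampling discards most of the majority class, while over-sampling repeats minority rows, so the two approaches trade information loss against duplicated observations.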

To the best of our knowledge, the performance of the random forest in this study (the area under the receiver operating characteristic curve of 0.72) is among the highest in this line of research. NEC is strongly associated with birth year temperature, as well as maternal and neonatal predictors.


Participants and variables

The data consisted of 10,353 VLBW infants from the KNN database from January 2013 to December 2017. The KNN started in April 2013 as a national prospective cohort registry of VLBW infants admitted or transferred to neonatal intensive care units across South Korea (it covers 74 neonatal intensive care units now). It collects perinatal and neonatal data of VLBW infants based on a standardized operating procedure45.

The dependent variable was NEC, with binary categories (no, yes). The following 47 perinatal predictors were considered (43 of them had binary categories): sex, birth-year (categorical: 2013, 2014, 2015, 2016, 2017), birth-month, birth-season (spring, summer, autumn, winter), multiple pregnancy, in vitro fertilization, gestational diabetes mellitus, overt diabetes mellitus, pregnancy-induced hypertension, chronic hypertension, histologic chorioamnionitis, pre-labor rupture of membranes > 18 h, antenatal steroid, cesarean section, oligohydramnios, polyhydramnios, maternal age (years), primipara, maternal education (categorical: elementary, junior high, senior high, college or higher), maternal citizenship, paternal education (categorical: elementary, junior high, senior high, college or higher), paternal citizenship, marital status, congenital infection, 1-min Apgar score ≤ 3, 5-min Apgar score < 7, neonatal resuscitation program, intensive neonatal resuscitation (intubation, chest compression or medications), initial blood gas pH < 7.0, initial blood gas base excess < -15, pulmonary hemorrhage, respiratory distress syndrome, surfactant use ≥ 2, PDA treatment (medical or surgical), PDA ligation, air leak syndrome, GA, GA < 28 weeks, GA < 26 weeks, BW, BW Z-score, BW < 1,000 g, BW < 750 g, SGA, and sepsis. The following 26 environmental predictors were also included: PM10 for birth year, PM10 for each month during pregnancy, average ambient temperature for birth year, minimum ambient temperature for birth year, maximum ambient temperature for birth year, and average ambient temperature for each month during pregnancy. PM10 and ambient temperature data were obtained from the Korea Meteorological Administration (KMA). According to the KMA, PM10 denotes the concentration of particles with diameters of 10 µm or less, whereas ambient temperature represents the overall temperature of the outdoor air surrounding people.

NEC was diagnosed according to the modified Bell’s staging criteria (≥ Stage II)46. Gestational diabetes mellitus was defined as any degree of glucose intolerance with the onset or first recognition during pregnancy. Pregnancy-induced hypertension was defined as hypertension with onset in the latter part of pregnancy (> 20 weeks’ gestation), followed by normalization of blood pressure postpartum. Chorioamnionitis was defined as histologic chorioamnionitis47. Oligohydramnios (or polyhydramnios) was defined as an amniotic fluid index of < 5 cm (or > 24 cm). Small-for-GA was defined as BW below the 10th percentile, according to the Fenton growth chart48.

Statistical analysis

Artificial neural networks, decision trees, logistic regression, naïve Bayes, random forests, and support vector machines were used for predicting NEC49,50,51,52,53,54. The following default parameters were adopted for convenience: the splitting criterion was GINI, the maximum depth was not limited, and the number of trees was 1000 in the random forest; the radial basis function kernel was employed in the support vector machine; and the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm was used to optimize the artificial neural network. Data on 10,353 observations with full information were divided into training and validation sets in a 70:30 ratio. Accuracy, the proportion of correct predictions among the 3,106 validation observations, was employed as the standard for validating the models. Random forest variable importance, the contribution of a certain variable to the performance (GINI) of the random forest, was used to examine the major predictors of NEC in VLBW infants, including environmental factors. The random split and analysis were repeated 50 times, and the average was used for external validation55,56. Different seed numbers were used for different runs, but the default parameters stayed the same throughout the random splits and analyses. R-Studio 1.3.959 (R-Studio Inc.: Boston, United States) was employed for the analysis from August 1, 2021 to September 30, 2021.
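The split-and-average protocol described above can be sketched as follows, in standard-library Python rather than the R environment used for the analysis. The sample size, the number of NEC cases, the 70:30 ratio, and the 50 runs come from the text. Everything else is an assumption made for illustration: a majority-class placeholder stands in for the six models, seeds 0 to 49 are arbitrary, and the normal-approximation confidence interval is one possible choice, since the text does not state how the intervals in Table 3 were computed.

```python
import random
import statistics

N_TOTAL, N_NEC = 10_353, 704      # from the text
N_RUNS, TRAIN_FRACTION = 50, 0.7  # from the text

# Synthetic labels with the reported class proportions (1 = NEC).
all_labels = [1] * N_NEC + [0] * (N_TOTAL - N_NEC)


def one_run(labels, seed):
    """One random 70:30 split followed by validation accuracy of a
    PLACEHOLDER model that always predicts the training majority class."""
    rng = random.Random(seed)  # a different seed for each run
    shuffled = labels[:]
    rng.shuffle(shuffled)
    n_train = int(TRAIN_FRACTION * len(shuffled))
    train, valid = shuffled[:n_train], shuffled[n_train:]
    majority = 1 if 2 * sum(train) > len(train) else 0
    correct = sum(1 for y in valid if y == majority)
    return correct / len(valid), len(valid)


results = [one_run(all_labels, seed) for seed in range(N_RUNS)]
accuracies = [acc for acc, _ in results]
n_valid = results[0][1]

mean_acc = statistics.mean(accuracies)
# ASSUMPTION: normal-approximation 95% interval for the mean over runs.
half_width = 1.96 * statistics.stdev(accuracies) / len(accuracies) ** 0.5
print(n_valid, round(mean_acc, 3), round(half_width, 4))
```

With this split the validation set contains 3,106 observations, matching the figure quoted above. Because only about 6.8% of infants had NEC, even this placeholder reaches a high accuracy, so accuracy is best read together with the area under the receiver-operating-characteristic curve.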

Ethical statement

The KNN registry was approved by the institutional review board (IRB) at each participating hospital (IRB No. of Korea University Anam Hospital: 2013AN0115). Informed consent was obtained from the parent(s) of each infant registered in the KNN. All methods were carried out in accordance with the IRB-approved protocol and in compliance with relevant guidelines and regulations.

The names of the IRBs of the KNN participating hospitals are as follows: The Institutional Review Board of Gachon University Gil Medical Center, The Catholic University of Korea Bucheon St. Mary’s Hospital, The Catholic University of Korea Seoul St. Mary’s Hospital, The Catholic University of Korea St. Vincent’s Hospital, The Catholic University of Korea Yeouido St. Mary’s Hospital, The Catholic University of Korea Uijeongbu St. Mary’s Hospital, Gangnam Severance Hospital, Kyung Hee University Hospital at Gangdong, GangNeung Asan Hospital, Kangbuk Samsung Hospital, Kangwon National University Hospital, Konkuk University Medical Center, Konyang University Hospital, Kyungpook National University Hospital, Gyeongsang National University Hospital, Kyung Hee University Medical Center, Keimyung University Dongsan Medical Center, Korea University Guro Hospital, Korea University Ansan Hospital, Korea University Anam Hospital, Kosin University Gospel Hospital, National Health Insurance Service Ilsan Hospital, Daegu Catholic University Medical Center, Dongguk University Ilsan Hospital, Dong-A University Hospital, Seoul Metropolitan Government-Seoul National University Boramae Medical Center, Pusan National University Hospital, Busan St. Mary’s Hospital, Seoul National University Bundang Hospital, Samsung Medical Center, Samsung Changwon Medical Center, Seoul National University Hospital, Asan Medical Center, Sungae Hospital, Severance Hospital, Soonchunhyang University Hospital Bucheon, Soonchunhyang University Hospital Seoul, Soonchunhyang University Hospital Cheonan, Ajou University Hospital, Pusan National University Children’s Hospital, Yeungnam University Hospital, Ulsan University Hospital, Wonkwang University School of Medicine & Hospital, Wonju Severance Christian Hospital, Eulji University Hospital, Eulji General Hospital, Ewha Womans University Medical Center, Inje University Busan Paik Hospital, Inje University Sanggye Paik Hospital, Inje University Ilsan Paik Hospital, Inje University Haeundae Paik Hospital, Inha University Hospital, Chonnam National University Hospital, Chonbuk National University Hospital, Cheil General Hospital & Women’s Healthcare Center, Jeju National University Hospital, Chosun University Hospital, Chung-Ang University Hospital, CHA Gangnam Medical Center, CHA University, CHA Bundang Medical Center, CHA University, Chungnam National University Hospital, Chungbuk National University, Kyungpook National University Chilgok Hospital, Kangnam Sacred Heart Hospital, Kangdong Sacred Heart Hospital, Hanyang University Guri Hospital, and Hanyang University Medical Center.

Ethics approval and consent to participate

Data collection was approved by the institutional review board of each hospital participating in the KNN (2013AN0115). Informed consent was obtained from the parent(s) of each infant registered in the KNN.