Explainable AI reveals key changes in skin microbiome associated with menopause, smoking, aging and skin hydration

1 IBM Research, Sci-Tech Daresbury, Daresbury, WA4 4AD (UK) 2 IBM Research, T.J. Watson Research Center, Yorktown Heights, NY, 10598 (USA) 3 Unilever Research & Development, Port Sunlight, CH63 3JW (UK) 4 Unilever Research and Development, Sharnbrook, UK, MK44 1LQ 5 STFC Daresbury Lab., Scientific Computing Department, Daresbury, WA4 4AD (UK) 6 Unilever Research & Development, Trumbull, CT, 06611 (USA) 7 The University of Liverpool, Institute of Integrative Biology, The Bioscience Building, Liverpool, L697ZB (UK) 8 Manchester Metropolitan University (MUU), Department of Computing and Mathematics, M15 6BH, Manchester 9 University of Manchester (UoM), Department of Computer Science, M13 9LP, Manchester


Introduction
The associations between the human microbiome and individual health-related phenotypes are increasingly studied, with applications such as use of intervention strategies including prebiotics and probiotics (1), and personalized therapies (2). Improved understanding of how microbial taxa contribute to health and wellbeing, coupled with the increasing ability to measure the microbiome, can drive and accelerate the development of personalised microbiome-based treatments. As microbiome research expands and as sequencing technologies continue to advance, the volume and complexity of the data collected inexorably increases. Therefore, the necessity to develop sophisticated methods for analysing microbiome data to derive actionable insights becomes increasingly important. Machine learning (ML) has the potential for building predictive models that can provide powerful comprehensions into the complex interaction between microbial communities and their host organisms (3). The application of ML to microbiome data has answered several important questions regarding the clustering of microbial species, taxonomic assignment, comparative metagenomics, gene prediction. Recently, a number of studies have been published on phenotype prediction from microbiome data (4)(5)(6)(7)(8)(9)(10), where some particularly focus on identifying discriminatory microbial taxa (11) or microbial signatures (12). As ML is becoming increasingly deployed in patient-relevant settings, it is essential to consider both the predictive power of the models and the transparency of the recommendations by providing an explanation for the predictions (13). This is often referred to as the explainability of the models. Recent work in this direction has demonstrated that interpretable AI/ML for microbiome data is increasingly studied as explanations for the predictions are needed (14) (15) (16). However, more can be done in terms of explaining the mechanisms of the ML algorithm predictions for microbiome data. We propose an explainable artificial intelligence (EAI) approach to identify key contributing taxa and to investigate how the distributions of these microbial taxa drive the prediction of different phenotypes. The novelty of our approach is given by the methods used to provide explanations for the predictions. The explanations are expressed in terms of key variations in microbiome composition that drive the prediction of different phenotype values. Our streamlined EAI approach includes three fine-tuned machine learning models -random forest (RF) (17), XGBoost (18) and neural network (NN) (19) -to predict the phenotype. We refer to the three models as a bag of models. Our approach explains the predictions using an explainable AI algorithm called SHapley Additive exPlanations (SHAP) (20). Obtaining explanations for samples in which the model is unreliable could lead to poor quality explanations -which is unacceptable in a healthcare setting. In order to achieve high-quality explanations, we repeatedly cross-validated predictions and explanations; we extracted and examined a sub-set of samples for which the ML model reliably predicted phenotypes. We refer to these samples as exemplars. Finally, our EAI approach creates consensus explanations from the three models. The consensus explanations constitute microbial signatures linked to the phenotype of interest. This process can be thought of as using the bag of models as a representation from which explanations are generated, while encouraging general explanations over those that are specific to one model. A full discussion of our exemplar selection process, the methods used, and their optimized architectures can be found in Methods. To exemplify the power of our EAI approach as a means to derive actionable insights from complex interactions between microbes and their hosts, this study is focused on the human skin microbiome. The technical ease of acquiring skin microbiome data sets -it is reliant only upon swabbing or scrubbing of skin to collect samplesand the recent increase in interest in analysing the human skin microbiome (21)(22)(23)(24)(25) makes skin an appealing target. Human skin is in fact a large, heterogeneous organ that protects the body from pathogens while sustaining microorganisms that influence human health and disease (26). For this study, a total of 1234 time-series leg skin microbiome samples (bacterial 16S rRNA gene sequencing) as well as associated skin hydration measures (visual assessment, pH, conductance, capacitance) were collected from 63 Canadian women (21-65 years of age) displaying healthy (non-dry) or moderately dry skin. Additional phenotype data (i.e., age, smoking and menopausal status) were obtained from subject questionnaires. Table 1 summarises the distributions of the four phenotypes for the Canada study. The microbial features used by our EAI approach are derived from the observed genera, i.e., Operational Taxonomy Units (OTUs) at genus level. Therefore, we use the terms genera and features interchangeably. Relative abundances of 186 genera were obtained from taxonomic analysis of the sequenced reads of both studies. Our EAI approach uses these features to predict skin hydration, age, menopausal and smoking status for the Canada study and examines the most impactful microbial genera when predicting the phenotype for the exemplars by applying SHAP. To transform the biological insights provided by our EAI approach into actionable insights, we ranked the 186 genera by average SHAP impact and report the consensus top impactful genera for our bag of models. Finally, to validate and investigate the generalizability of the approach when predicting skin hydration, we used a second study of 278 samples from 102 UK women. Through the application of our carefully calibrated EAI approach we were able to accurately predict skin hydration, age, menopausal and smoking status and we identified skin microbial signatures associated with diverging values of each phenotype. Our findings demonstrate the potential of explainable artificial intelligence in contributing to the growing body of knowledge of microbes and their host-interactions, and its future impact in research fields from cosmetic and medical research to forensic science and personal health monitoring (27)(28)(29)(30).

Results
In the following sections we present the result obtained when predicting skin hydration, age, menopausal status and smoking status using the Canada study. To take into account confounding factors when predicting the four phenotypes for the Canada study, we examined the connections between the skin hydration (corneometer), age, menopause and smoking phenotypes by computing their pairwise correlations (Figure 1(a)). Only age and menopause were found to be correlated (~0.72 Spearman's correlation), as expected. As such, in the section "Menopause prediction" we highlight the differences in predictive genera between the age and menopause models, and we show that we were able to predict age within the two peri and post-menopausal groups separately. The latter results, together with the overlap in age span between pre and post menopause classes indicate that, although strongly correlated, the age and menopausal models differ from each other and may offer insights specific to the phenotypes. We further checked if multidimensional projection of pairwise sample distances showed biases or separability of samples by phenotype (Figure 1(b)). The results indicate the samples are centred with no clear trends in either dimension of the two-dimensional scaling and they are not separable by phenotype. With these checks in place, we proceeded with phenotype prediction. Note that only for skin hydration prediction we compared performance results and insights obtained with the Canada and UK studies.

Explainable skin hydration prediction model generalises across cohorts (Canada and UK)
We investigated the prediction of multiple individual and combined measures of skin hydration, and found that corneometer, a widely used and robust method for assessing skin hydration (30), was the most appropriate measure for the Canada study (see section Phenotype analysis in Supplementary Information and Supplementary Figures  1-3). Efforts to enrich the corneometer score with other measures did not provide sufficient additional improvement in performance as to merit their inclusion. Higher corneometer score values correspond to higher skin hydration, thus we adopt the terms corneometer score and skin hydration interchangeably. Note that corneometer assumes values in a range between [2.4,78.4]. When examining the corneometer score vs. the skin microbiome, all the optimized machine learning models in our bag successfully predicted skin hydration levels from the measured microbiome (see Supplementary Figure 5). The highest Pearson's correlation between predicted and true values, 0.71, was obtained with RF. However, XGBoost (~0.65 Pearson's correlation) and NN (~0.53) were better at modelling real distribution of true corneometer values (see Figure 1(c) and Supplementary  Figures 4-6). The explanations for the corneometer predictions were provided in terms of impactful genera as computed by SHAP. A positive SHAP impact for a genus means that when the genus is more abundant in the microbiome, it contributes to an increase in the predicted score, while a negative SHAP value means the genus contributes to a decrease in the predicted score when more abundant. As the number of the most relevant features cannot be known a priori, we utilised the kneed algorithm (31) to discover plateauing of the SHAP explanation impact curve, indicating a cut-off for the number of features with the highest impact. In this case, the plateau was observed at 23 features. Figure 2(a) summarizes the most significant biological insights obtained while explaining the predictions of corneometer values. We found that 10 genera, when more abundant in the exemplars, are responsible for predicting higher skin hydration, while 13 genera are responsible for predicting lower skin hydration (see Figure 2(a)). The right-hand side of Figure 2(a) shows the average abundance of the impactful genera across the entire set of samples. Note that genera with high SHAP impact differ substantially in terms of their relative abundance in the source microbiome data (see for instance Cutibacterium and Chryseobacterium in Figure 2(a)). We observed a similar trend of most impactful genera not necessarily being the most abundant across all the samples when predicting the other three phenotypes, as presented in the following sections.
The genera most determinant of better skin hydration were Lactobacillus, and Cutibacterium (formerly Propionibacterium) both of which have previously been reported as being associated with skin hydration (23,24), whereas the genera Actinobaculum, Streptococcus, Acinetobacter, Turicella and Micrococcus were associated with decreased skin hydration. As the insights shown in Figure 2(a) can be extended to include the entire set of genera, our streamlined approach offers an opportunity to investigate how skin genera impact skin health. The increasing recognition of microbiome dysbiosis as a factor in atopic skin that is linked to psoriasis, eczema, acne, dermatitis and other disorders (25,29,32,33). Previous studies characterised microbiome signatures of different sites highlighting topographical and temporal variance in microbiome composition across dry, moist and sebaceous skin sites (26,(34)(35)(36). However, there is still a limited understanding of changes in the microbiome associated with cosmetic dry skin. Our EAI approach has the potential to provide insights on the effect of skin care and hygiene products on the molecular and microbiome composition of the skin (28).
To investigate the generalizability of our approach we extended the skin hydration analysis to include an UK study composed of 278 samples from 102 subjects and the same set of 186 OTUs or features as for the Canada study. Genus level classification was carried out for both datasets in a closed reference fashion against the same database, HOMDEXTGG version 14.51 as described in (37 Figure 3). On the other hand, when combining Canada and UK samples for training and testing the models, we obtain a MAE closer to the one using the Canada study only. Finally, we applied our EAI approach to explore the difference between the impactful genera for predicting skin hydration for the Canada cohort and the UK cohort separately. We found common impactful genera in both UK and Canada models. Lactobacillus and Cutibacterium are among the top 20 impactful genera driving the prediction of more hydrated skin (see Supplementary Figure 9). Micrococcus and Actinobaculum are another example of top impactful genera driving the prediction of less hydrated skin for both UK and Canada models. In addition, there are some genera in the UK model, such as Kocuria, Gordonia, Staphylococcus, that do not appear in the list of the top impactful genera for the Canada model and vice versa (see Supplementary Figure 9 and Figure  2(a)).

Accurate model identifies key microbes predictive of age -Canada cohort
Being able to infer key changes in the skin microbiome associated with age may offer insights into the aging process, that could be used to develop products that counteract some effects of skin aging. Following a recent study that found the skin microbiome of the hand and forehead to be the best in predicting age compared to the gut and the oral microbiome yielding predictions within 4 years of chronological age (38), we applied our EAI approach to predict age for the Canada cohort. Our EAI approach was able to predict the subjects' age from the same set of skin microbiome samples achieving a high Pearson's correlation (r≈0.88) between predicted and true values of age with XGBoost (see Figure 1(f) and Supplementary Figure 10-11) and an average error within 4 years of chronological age, similarly to the accuracy reached in (38).
Analogously to the skin hydration model, we applied the exemplar selection process and SHAP explainability analysis to understand the link between microbial communities on skin and age. Our EAI approach revealed that higher abundance of 13 genera drives the prediction of lower values of age, while higher abundance of 5 genera drives the prediction of higher values of age (see Figure 2(b)). In the age model, the knee point was observed at 20 features. For two of the genera, Corynebacterium and Eubacterium, among the top 20 most impactful we did not find a consistent direction of impact according to our bag of models. These features are shown as grey bars in Figure 2(b). We found some overlap between genera identified as being important for both skin hydration and age models as well as differences in the impact of genera for the age model in comparison to the skin hydration model ( Figures  2 and 4). The age model indicated a greater importance of Chryseobacterium and key genera, such as Staphylococcus and Bergeyella_1, that were not seen among the most impactful genera in the skin hydration model. This enabled the disambiguation of microbes whose effects mainly relate to either age or hydration. In fact, perhaps contrary to expected, skin hydration and age are not inexorably linked. This can be observed in Figure 1(a), which shows a very weak link between age and corneometer score (Pearson's correlation score -0.16). This is a clear indication of the potential for an EAI approach to drive deeper understanding of observed relationships and phenomena. Our EAI approach indicated Cutibacterium as a key genus for predicting age. To further investigate this observation, we produced a stacked force plot (see Supplementary Figure 12), filtered on Cutibacterium. The figure indicates that high abundance of Cutibacterium seen in younger people falls dramatically after the age of 50; a similar trend has also been reported previously (39). This coincides with the well-known reduction in sebum after 50 years of age (40). It is encouraging that our EAI approach identifies this known relationship and validates the findings of this methodology.

Leg skin microbiome accurately predicts menopausal status -Canada cohort
Although a natural part of aging, the onset of menopause in women is a highly significant life event, especially for those women for whom the onset occurs earlier than expected. To examine microbial changes associated with the menopause and investigate how these changes are also related to aging, we applied our EAI approach to classify samples of Canada cohort into pre-menopausal and post-menopausal status. The dataset includes 324 samples in pre-menopausal status and 876 in post-menopausal status. Only one subject, for a total of 34 samples, was in peri-menopausal status therefore was excluded from pre/post-menopausal prediction. The best F1-score of 93% on the test set was achieved by XGBoost (see Supplementary Figures 5). Figure 1(g) shows the confusion matrix as computed by XGBoost on the test set, while Supplementary Figure 13 displays the confusion matrices for the three models, demonstrating how each model is able to accurately predict each class. Our EAI approach revealed that higher abundance of 12 genera drives the prediction of the class pre-menopause, while 11 genera were identified as most impactful in predicting post-menopause when higher in abundance (Figure 2(c)).
As menopause and age are strongly correlated as expected we investigated similarities and differences between the two models taking into possible confounding factors when predicting age or menopause. The model considered important a number of genera (Lactobacillus, Cutibacterium, Enhydrobacter, Dermacoccus, Enhydrobacter) already seen as key indicators in the age model, along with a key indicator of skin hydration (Lactobacillus). This is perhaps not surprising as menopause usually occurs in middle age and the skin is significantly affected by the aging process and menopause (41). Genera such as Peptoniphilaceae_G-3 Finegoldia, Ruminococcaceae_G-3, Peptostreptococcus, Corynebacterium appear among the most impactful for predicting age, while they do not appear in the menopausal model. On the other hand, Actinomycetales, Moraxella, Acinetobacter, Dorea, etc. are considered impactful for predicting menopausal classes and do not appear in the ranked list of most impactful genera of the age model. Acinetobacter is important in the age model to predict younger age, while it predicts post-menopause when higher in abundance in the menopausal model. It is also important to point out that the age range of women in the cohort in the pre-and post-menopausal status overlaps significantly, from 21 to 55 years old and 21 to 65 years old respectively. We then trained our ML models for predicting age separately for pre-menopause and post-menopause samples. We were able to accurately predict age for the two sub-groups, showing that age prediction is robust (See Supplementary Figures 14-15). The latter results, together with the overlap in age span between pre and post menopause classes, and the differences observed in the ranked lists of impactful genera for the two models, indicate that the age and menopausal models differ from each other and may offer insights specific to the phenotypes. Furthermore, we examined the local explanation of a 41-year-old subject, who was correctly classified in post-menopausal status. This classification was made correctly showing that our model is not confounded by age since according to age the likely classification of the subject would be as pre-menopausal, with the majority of women reaching the menopause between the ages of 45 and 55. We found that the presence of genera such as Prevotella and Acinetobacter in this sample drove the correct prediction of the class post-menopause, while lower abundance of Cutibacterium had a small negative impact in predicting post-menopausal status (see Supplementary Figure 16). This is consistent with the explanations in Figure 2(c). Moreover, we found that higher abundance of genera that do not appear among the 23 most impactful for menopausal classification (i.e., Micrococcus, Gordonia and Brachybacterium) drove the correct prediction of post-menopause for this sample. This suggests it may be useful to extend the analysis to investigate local explanations of the predictions for specific samples of interest as well as the impact of the total set of genera in predicting the menopause.

Key microbes on the leg discriminate smokers accurately -Canada cohort
With the aim of investigating microbial differences on the skin associated with smoking, we applied our EAI approach to predict smoker vs non-smoker subjects from the leg skin microbiome of Canada cohort. The dataset includes 286 samples from smokers and 948 samples from non-smokers. We were able to accurately predict the two classes with the all three ML models reaching an F1-score of ~95% on the test set with XGBoost (see Supplementary Figure 5). Figure 1(h) shows the confusion matrix based on the predictions on the test set computed by XGBoost, while Supplementary Figure 13 shows the confusion matrices for the three ML models. We found that predicting the class smoker is driven by higher abundance of 19 genera (including Acinetobacter, Peptoniphilus, Anaerococcus, Klebsiella), while 6 genera (including Brevibacterium, Cutibacterium, Actinomycelates) were the most impactful for predicting the class non-smoker (Figure 2(d)). The knee point was found at 29 genera. Finally, we looked at the split of smoker vs non-smoker against age. The age of smokers varies from 21 to 65 years old, while the subject's age of non-smokers varies from 40 to 64 years old. We were able to accurately predict age for the two sub-groups (smokers and non-smokers) separately, showing that age prediction is robust despite confounding factors like smoking (See Supplementary Figures 17 and 18).

Discussion
To summarize the explainability results across phenotypes, we compiled Table 2 illustrating the overlapping and uniquely impactful skin microbes per phenotype. Our study indicated that some members of the microbiome play a general role in wellbeing as overlapping sets of genera were involved in consistently predicting skin hydration, age, smoking and menopausal status when more abundant in the microbiome (see agreement in the top half of Table 2). The genus at the top Cutibacterium is impactful for predicting all four phenotypes in a consistent direction when more abundant. Lactobacillus is impactful for predicting higher skin hydration, younger age and pre-menopausal status, while Streptococcus is impactful for predicting lower skin hydration, post-menopausal and smoking status. The following fourteen genera are impactful for predicting at least two phenotypes each. Nevertheless, accurate predictions of each phenotype required distinct combinations of impactful genera (see differences in the bottom half of table of Table 2). For example, abundance of Turicella and Veillonella are uniquely driving the prediction of lower skin hydration, while Staphylococcus and Coryneobacterium are most impactful only in the age prediction model. Higher abundance of Neisseria uniquely drives the prediction of postmenopausal status, while higher abundance of Rothia identifies smokers. Peptoniphilacae is an example of a genus having opposite impacts on two phenotypes (lower skin hydration and younger age). Additionally, we visualized the average relative abundance per subject of the impactful genera for each phenotype (Figure 4(a)). By comparing the microbial community profiles of the subjects, it is not discernible how the impactful genera differentiate between phenotype groups or values. Hence, our explainable artificial intelligence approach has the potential to uncover the subtle interplay of signals present in the microbiome that is far from obvious when viewing the input data. We compared the impactful genera from machine learning to the results from statistical tests that independently associate each genus with the phenotype (see Figure 4(b)). We observed overlap between the top genera from ML and from statistical testing, however, there were notable exceptions also. For corneometer the 2 nd most impactful genus Cutibacterium, was only ranked 90 th when the ML sensibly links it with skin hydration due to its association with acne and propensity for wet skin. The 3 rd most impactful genus Actinobaculum, that has previously been associated with skin infection but is typically observed at low abundance (42), was ranked 110 th by correlation. For age Dermacoccus, Peptostreptococcus, Staphilococcus, Bergeylla are among the top impactful genera in the age model but only ranked below 100. Similarly, Moraxella, Neisseria, Actinomycetales are among the most impactful genera in the menopausal status model but ranked below 100 in the t-test. For smoking, the top ML feature Acinetobacter was only at rank 91 th according to the t-test and other 6 genera were ranked below 100. Species within the Acinetobacter genus have been linked with respiratory and bloodstream infections which could explain their usefulness in smoking prediction with this being reflected in the leg microbiome (43). This indicates the set of most predictive genera inferred by our EAI approach can differ from the ones highlighted by a standard statistical approach that consider each genus independently. As such, our approach can be used in addition to a standard approach to investigate predictive genera that could be missed if at the bottom of the ranked list produced by a statistical test. Through our explainability analysis we demonstrated how the age and menopausal models share common impactful genera while still differing from each other. Similarly, the skin hydration model and the age model are not inevitably connected. In fact, skin hydration and age showed rather a weak link. Hence, the individual models can offer insights that are specific to the phenotype and a deeper understanding of the observed microbial community. We showed that genera with a relatively high SHAP impact can differ substantially in terms of their relative abundance in the source microbiome data. Therefore, our EAI approach identifies the most impactful genera across the samples even when these are not the most abundant genera. For instance, although Chryseobacterium is one of the least abundant genera across the entire set of samples, it is the most impactful in the age prediction model. We propose that all genera with a relatively high SHAP impact are likely to either play a biologically causative role or offer value as a non-causative indicator organism and thus are worthy of additional study.
Our findings on the predictive power of the leg skin microbiome point to the possibility of using more accessible microbiome samples to investigate phenotypes (e.g. smoking) that are not believed to be directly associated with the microbiome of the sampled body site. Although smoking is considered a contributing factor to skin aging and systemic health (44), its potential to influence the microbial community of skin has not been fully investigated. Previous studies have highlighted the strong impact of smoking on gastrointestinal microbiota(45) , (46), and the differences in microbial community composition of the upper respiratory tract between healthy smokers and nonsmokers (47). The increase of Actinobacteria previously observed in the gut microbiome of non-smokers is unexpectedly in agreement with the fact that we found Brevibacterium, Cutibacterium, Actinomycelates, Clavibacter, all belonging to the phylum Actinobacteria, drive the prediction of non-smokers (see Figure 2(d)). Streptococcus, Actinomyces, Megasphaera, Atopobium were previously identified as indicators of smokers in the respiratory tract (47), which is also in agreement with our findings. Increased levels of Megasphaera have been found in the oral cavity of smokers (48), with Megasphaera being among the impactful genera for predicting smokers in Figure 2(d). Smoking related increases in the anaerobic Actinobacteria OTUs from Actinomyces spp., Rothia mucilaginosa and Atopobium spp. were found in the oral microbiome (49). Similarly, our EAI approach found that Rothia and Atopobium predict smokers when more abundant in the microbiome samples. Cigarettes themselves harbour a broad range of potential pathogens, including Acinetobacter, Clostridium, and Klebsiella, that have been found to be predictors of smokers when abundant in the leg skin microbiome (see Figure 2(d)). The similarity between the genera that discriminate smokers from non-smokers in our skin model and the genera identified as discriminators of smokers in previous studies of the oral and gut microbiome is encouraging. Despite the fact that smoke is not in direct contact with skin on the leg, we found the leg skin microbiome changes in response to smoking, and in a manner that is consistent with previous findings from different body sites. This further confirms systematic changes occurring across the human body in cigarette smokers. It also suggested the microbiome of distant body sites may be affected by smoking, hence pointing to the fact that extremely complex interactions between body locations or tissues are yet to be examined.
Menopause is currently diagnosed based on the symptoms, but a blood test to measure the hormone levels may be carried out for younger women. The power of predicting the onset of menopause through a simple scrub of the leg could be transformational to many women, as it points to the potential for a non-invasive diagnostic tool used as a condition monitor. Microbiome sequencing from vaginal swab samples found a similar composition of major vaginal bacteria at the genus level in pre-and post-menopausal women but with altered proportion (50). The previous study observed high levels of Lactobacillus (64.4%) in pre-menopausal women that were lower postmenopause (24.4%), replaced by Prevotella (11.4%) and Streptococcus (5.1%) among others. This reflects the observations here where Lactobacillus is most impactful when more abundant for predicting pre-menopause while Prevotella and Streptococcus are most impactful when more abundant for predicting post-menopause. Here the skin microbiome appears to reflect the vaginal microbiome closely which could explain its ability to explain menopausal status.
Furthermore, we investigated the possibility of overfitting when predicting the four phenotypes, as the Canada cohort includes time series samples from same subjects (1234 samples from 63 subjects). We mitigate this by performing both 5-fold cross validation and cross-validation by subject for each phenotype. The results from the two cross validation methods show that the models are not overfitting (see Supplementary Figure 5 In this study we investigated microbial taxa at genus level, however our EAI approach can be similarly applied to the investigation of changes at species or strain level. Moreover, while we performed a particular analysis with one bioinformatics pipeline (to identify genera and abundances from bacterial 16S rRNA gene sequencing), this approach extends to exploring how microbial features from shotgun metagenomic sequencing are impactful in predicting phenotypes. This would involve taking into account the entire genetic material of a microbial community (including fungi and viruses), providing insights related to genes and their associated biological functions. In general, the ability to predict specific phenotypes from non-invasive microbiome sampling can perform a similar revolutionary role as the collection and sequencing of human DNA from cheek swabs, which has powered the gathering of massive data sets and thus the genomic revolution in personalized medicine.

Conclusion
We developed an explainable AI (EAI) approach to predict phenotypes and explain the predictions in terms of changes in microbiome composition related to diverging phenotype values. The results presented in this study demonstrate the power of explainable AI in accurately predicting diverse phenotypes such as skin hydration, age and, surprisingly, menopausal and smoking status from the leg skin microbiome and in inferring microbial signatures associated with each phenotype. In addition, we investigated the application of the skin hydration model to a second cohort and compared predictive performance and explainability. Our study focused on personalised care and wellbeing. However, it is straightforward to appreciate how this approach is broadly applicable in healthcare, representing an advancement in the deployment of microbiomebased and non-invasive diagnostics for practitioners (e.g. dermatologists) and in the design of personalised treatments. Impactful future work will likely involve additional independent studies to investigate health-related phenotypes and how the explanations generalise across more populations. Our EAI approach will enable the community to expand and build on this initial work as the explanations may offer new insights into the complex interactions between microbes and their host, possibly leading to new interventions to adjust the microbiome for improved health outcomes.

Clinical design and data collection
Study participants A total of 63 Canadian women between 21-65 years of age (average 52.71 years) took part in this double blind randomised, balanced application study, between April and July 2017. In addition, 102 women aged between 19-55 years (average 38.03 years) took part in a UK based study. Key inclusion criteria included subjects being in good general health, not being pregnant or lactating and having intact non-diseased skin with minimal hair (on test sites). Key exclusion criteria included pregnancy or lactating mothers and use of moisturisers on the proposed test site 3 or more times per week. A full list of the inclusion and exclusion criteria can be found in Supplementary Notes.

Study Design
The study was carried out by an independent third party. Study visits include visual and instrumental assessments, as well as buffer scrub sampling. Additionally, subjects were asked to confirm their age, smoking and menopausal status at the start of the study. There was no conditioning (wash-out) phase required for this study.

Assessment of skin condition
Visual assessments of leg skin include dryness and erythema grading. Visual dryness assessments were made by an expert assessor with dryness scored on comparable scales from 0-6 (Canada) and 0-4 (UK) both with ½ point increments allowed. The descriptors for 0-4 skin condition in terms of severity are similar across scales. Instrumental assessments (Skicon TM , corneometer, and pH) were taken following the visual assessments and a total combined acclimation period of a minimum 25 minutes to an environmental controlled room. The Skicon-200EX (IBS Co./AcaDerm) measures skin hydration by skin surface conductance, and for the Canada study a Corneometer CM 825 Courage+Khazaka) was used to measures skin hydration by capacitance (the skin's ability to conduct a small electrical current). For logistical reasons the UK study used a Courage and Khazaka Multi-Probe Adaptor Corneometer (MPA 6) for the same purpose both instruments are widely used for this purpose and offer comparable function, performance and sampling area. Skin surface pH in the Canada study was measured using a pH meter (Courage+Khazaka). The corneometer measurement was the most amenable for machine learning models (see Phenotype analysis in Supplementary Notes), therefore we focused on it in this study. Between 3 and 5 corneometer readings per test site were obtained and the probe was moved slightly (overlapping may occur) with each reading, but still within the site. Pressure on the skin surface was measured by means of a probe spring and was 3.5 Newtons onto the area measured. The area of skin in contact with the probe was 49 mm 2 .

Microbiome sampling
After all instrumental measures, cup scrubs buffer washes were collected using the method of Williamson and Kligman (51). Using a sterile plastic disposable pipette 2.5 ml of buffer wash solution (PBS+0.1% Triton X-100) was aliquoted into the sampling ring and gently agitated for one minute with a sterile rod. The sampling fluid was collected using a sterile disposable pipette and placed into a sterile tube. The sampling procedure was repeated with another 2 ml aliquot of buffer wash material. After agitation, this aliquot was added to the first. Samples were frozen at -20 ºC within 10 minutes until DNA extraction. Skin microbiome samples were taken by the cup scrub method at four different points, with either one or two samples being taken on each leg, as described above. Samples were taken from the same individual at different time points giving an overall total of 1234 microbial samples from the 63 subjects.

DNA extraction
DNA extractions were carried out by QIAGEN (Germany). Frozen buffer scrub samples were allowed to thaw at room temperature and cell pellets sedimented by centrifugation at 13,000 rpm, 4 o C for 10 minutes. Following removal of the supernatant the cell pellet was resuspended in 500µl TE buffer and transferred to a 96-well Lysing Matrix Plate B (MP Biomedicals, USA). Addition of 3µl of Ready-Lyse lysozyme (Epicentre, 250U/µl) was followed by incubation with agitation at 300rpm, 37 o C for 18 hours. Following lysis DNA was extracted and purified using the QIAmp UCP DNA Micro Kit (Qiagen, Germany) following the manufacturer's instructions. DNA was eluted into AUE buffer and frozen until processing for sequencing library preparation.

NGS library preparation and sequencing
Oligonucleotide primers targeting the V1-V2 hypervariable region of the 16S rRNA gene were selected. PCR primers (details in Supplementary Information) were a modified version of the standard 28F and 338R primers which contain additional recognition sequences to facilitate nested PCR to add Illumina sequencing adapters and index sequences to resulting amplicons. PCRs consisted of 0.25 μl (10 μM) of each primer, 10 μl of HotStar Taq Plus Mastermix (Qiagen), 5 μl of template DNA and 4.5 μl molecular grade water (Ambion, Thermofisher). Samples were amplified using the following parameters: 95°C for 5 minutes, then 10 cycles of: 94°C for 45 seconds, 65°C for 30 seconds, and 72°C for 60 seconds, with a final extension of 10 minutes at 72°C using a Dyad Thermocycler (MJ Research). PCR products were purified using Ampure SPRI Beads (Beckman Coulter, California, USA). A second round PCR incorporated Illumina adapters containing indexes (i5 and i7) for sample identification utilising eight forward primers and twelve reverse primers each of which contained a separate barcode allowing up to 96 different combinations. Second round PCRs consisted of 0.5 μl (10 μM) of each primer, 10 μl of 2 x Kapa Mastermix (Roche, Switzerland) and 9 μl of purified sample from the first PCR reaction. Samples were amplified using the following parameters: 98°C for 2 minutes, then 15 cycles of; 20 seconds at 95°C, 15 seconds at 65°C, 30 seconds at 70°C with a final extension of 5 minutes at 72°C. Samples were purified using AMPure SPRI Beads before being quantified using Qubit fluorimeter (Invitrogen, California, USA) and assessed using the Fragment Analyzer (Advanced Analytical Technologies, Iowa, USA). Resulting amplicon libraries were taken forward and pooled in equimolar amounts using the Qubit and Fragment Analyzer data and size selected on the Pippin prep (Sage Science, Massachusetts, USA) using a size range of 300-700 bps. The quantity and quality of each pool was assessed by Bioanalyzer (Agilent Technologies, California, USA) and subsequently by qPCR using the Illumina Library Quantification Kit (Kapa) on a Light Cycler LC480II according to manufacturer's instructions (Roche, Switzerland). Each pool of libraries was sequenced on a flowcell of an Illumina HiSeq with 2 × 300 bp paired-end sequencing using v3 chemistry (Illumina, California, USA).

Informatics processing
The sequence data were processed as follows. Illumina adapters and PCR primers used for initial 16S rRNA gene amplification were removed from each fragment using Cutadapt version 1.14(52). Sickle version 1.33 (53) was used to quality trim DNA reads using a minimum quality value of 28. Reads shorter than 100 bp following quality trimming were discarded. If a read was discarded during this process its corresponding pair was also discarded. Reads passing filtering were merged using PANDAseq(54) version 2.9 to generate overlapping contigs with a minimum overlap of 20 bp and a minimum amplicon length of 200 bp. The resulting overlapped reads were dereplicated using VSEARCH version v1.9.6(55) and searched against a BLAST database composed of the HOMD, HOMD extended and Greengenes sequences (HOMDEXTGG) described by Al-Hebshi et al (37). Taxonomic classification was then performed as previously described (56) at 99 % identity across 98% of the read length. Reads not classified by this process were discarded. Reads were marked unresolved when there was no agreement between the databases they were compared against. This process resulted in 1260 taxonomically classified operational taxonomic units (OTUs). The resulting classification table and associated representative sequences, selected as the most abundant sequence for each classified taxon, were used as inputs for QIIME(36)(57) (Quantitate Insights into Microbial Ecology) version 1.9.1. The microbial sequence count table and metadata, in biom file format, was loaded into the calour library (58). The loading process filtered out samples with fewer than 1000 reads and then rescaled each sample to have its counts sum up to 1000 (by dividing each feature frequency by the total number of reads in the sample and multiplying by 1000). After loading, the data underwent two rounds of filtering and the remaining features were collapsed at the genus level. The first round of filtering removed low-abundance features (OTUs with total count less than 10 across all samples). The second filter removed OTUs with low prevalence (features occurring in <1% of the samples). The values for three bacterial genera that had been identified as contaminants, Burkholderia, Dietzia, Mycobacterium, were removed. After filtering and collapsing the OTUs at genus level we obtained a total of 186 OTUs and 1234 samples that were used in subsequent analyses.

Phenotype correlations, sample visualization, and feature ranking by statistical tests
Pairwise phenotype correlations were computed by Pearson's correlation. Each subject only contributed once for each correlation measure, for corneometer the median of measured values per subject was used. Pairwise sample distances were computed with Spearman distance between feature (genus) vectors per sample. Multidimensional scaling was applied to project the distances into two-dimensional sample representations for visualization.
Statistical tests were applied to rank features as follows. For the binary phenotypes (smoking and menopause status), two-sample t-test was applied independently to each feature vector vs. the phenotype vector, and features were ordered by p-value from most to least significant. For age and corneometer, Pearson's correlation between each feature and phenotype was applied, and features were ordered by absolute value of the correlation. When there were multiple measures of corneometer per subject, the median value was used.

Explainable AI approach
The first step in our EAI approach, shown in Figure 5, consists of pre-processing and filtering the OTU table as described in the section above. Next, the ML models are defined depending on the prediction task: regression for predicting skin hydration and age; classification for predicting menopausal and smoking status. We split the dataset in two parts, 80% for training and 20% for testing. For each prediction task, the hyperparameters of each ML model are tuned performing 5-fold cross validation (CV) on the training dataset. Once the best hyperparameters have been selected, each tuned model is trained and tested. The last two steps of our approach consist of selecting a subset of samples, exemplars, in the test set and applying explainability methods to interpret and explain the predictions made by the trained models on those exemplars.

Explainability of the predictions and exemplar selection process
Examining the ML models for explanations of their predictions is an important active field of research. As treebased methods are more directly interpretable, recent research is focused on inspecting neural networks to try to ascertain the importance of each feature's contribution to the prediction. Work such as DeepLIFT (59), DeepExplain(60), and SHapley Additive exPlanations (SHAP) (20) inspect the gradients of the models to determine the impact of different features. To investigate the mechanisms underlying the predictions, we used SHAP for its ability to work with any machine learning model, such as XGBoost and RF, not just neural networks. SHAP combines game theory with local explanation enabling accurate interpretations on how the model predicted a particular value for a given sample. The explanations are called local explanations and reveal subtle changes and interrelations that are otherwise missed when these differences are averaged out. Local explanations allow the inspection of samples that have extreme phenotypes values, e.g. very hydrated skin or very dry skin. In order to ensure robust explanations, we selected exemplar samples or examplars i.e., samples in the test set for which the predictions lie very close to the actual measured values. More precisely, for regression tasks (corneometer and age prediction) we selected as exemplars the samples in the test dataset for which predictions were < 10% different from the actual values. For classification tasks (smoking and menopausal status prediction) we selected samples that were correctly classified. The exemplars were then examined using SHAP to assess how well the genera considered most impactful by the ML models matched existing biological knowledge. More precisely, we applied SHAP to the set of exemplars and obtained the list of features (microbial genera) in decreasing order of impact (i.e., SHAP absolute mean value), so that genera at the top were the most impactful in predicting the given phenotype. As the set of exemplars differs depending on the chosen test set, we performed 5-fold cross validation for each ML model. For each genus, we computed the average SHAP impact over the 5 folds (corresponding to 5 different exemplar sets) and we re-ordered the list of genera based on their average SHAP impact in descending order. This gave us a robust ordered list of impactful genera for each model (RF, XGBoost and NN). Finally, we took the average SHAP impact of each genus over the three lists computed at the previous step and re-ordered once again all the genera based on their resulting impact. In this way we cross-validated SHAP results obtaining a single and robust consensus ranked list of impactful genera representing an agreement of the three models (see Figure 5). Note that the average number of samples in the test set was 247 across 5-fold cross validation, while the average number of exemplars in the age model was 190 and 66 in the corneometer model. The average number of exemplars in the menopausal classification was 225, while the average number of exemplars in the smoking model was 223. Finally, we applied the kneedle algorithm(31)(https://github.com/arvkevi/kneed -with the parameters curve='convex', direction='decreasing', interp_method='polynomial') to find a plateau in the SHAP impact curve for the consensus list of impactful genera. We then examined the top x impactful features, where x is the knee point representing the beginning of the plateau. To assign a direction of impact for each genus, we computed the average SHAP impact (without taking its absolute value) of the top 20% of exemplar samples for which the genus was most abundant. For each genus and each ML model in our bag of models, we examined the sign of the SHAP impact averaged across 5 folds. A positive sign indicates that the genus had a positive impact in predicting a phenotype value when it was more abundant in the microbiome, while a negative sign indicated that the genus has a negative impact on the prediction when higher in abundance in the microbiome. We then assigned a direction to the impact of the genus under analysis if it was consistent across the three models. The results of this process for the four phenotypes are visualized in Figures 2(a-d). All the insights obtained across phenotypes are summarised in Table 2.

Machine learning models
We evaluated the application of three machine learning models -random forest (RF) (17), XGBoost (18) and neural network (NN) (19) -for predicting the four different phenotypes from microbial genus count tables. We developed our neural network using TensorFlow (61) to construct a multilayer perceptron (MLP). The MLP architecture is shown in Supplementary Figure 19. We optimised the parameters of the neural network's architecture using a Bayesian optimization algorithm (62). The input layer contains the same number of neurons as the number of features in the input data (186 neurons). The input layer is followed by a batch normalisation layer with rectifying linear unit (ReLU) activation function. This is followed by 4 blocks, where each block contains a batch normalisation layer, a fully connected layer with 100 neurons, followed by ReLU activation function. The penultimate layer is a dropout layer (we used 40% dropout). The final layer is either one neuron in the case of regression for corneometer and age prediction, or a softmax layer in the case of classification tasks, with as many neurons as there are classes. The model was trained in batches of 80 samples. To further aid generalization of the neural network we applied early stopping, i.e. training is stopped when the performance on validation data does not improve, to avoid overfitting. To evaluate the results on the validation set, for regression we used the mean absolute error (MAE) and for classification we used the F1-score. We also used plateau detection to incrementally reduce the learning rate to enhance the optimisation. The initial learning rate was 0.001 and RMSprop (rmsprop) optimizer was used to optimize the network. Moreover, RF and XGBoost were both tuned using the training dataset, for the different classification and regression tasks, by performing 200 iterations of a random search to select the best hyperparameters. Once tuned, the three models (RF, XGBoost and NN) were trained on training data and the applied to test data. Performances on the test set were compared by looking at the lowest MAE for regression tasks (corneometer and age prediction) and the highest F1-score for classification tasks (smoking and menopausal status). Supplementary Figures 5 and 7 show a summary of the prediction performances of the three models on the test set and after 5-fold cross validation (CV). To examine the stability and the generalisability of our models on unseen data we implemented cross validation by subject. We performed this for the Canada cohort that includes 1234 samples from 63 subjects. The cross validation consists of training each model with the samples from 62 subjects and subsequently making predictions with the trained model on the unseen samples of the 63 rd subject. This process is repeated 63 times. See Supplementary Figure 8 for performance results.