A machine learning approach to predict E. coli antibacterial resistance using whole-genome sequencing data

Mike, Nsubuga; Nsubuga, Mike

dc.contributor.author	Mike, Nsubuga
dc.contributor.author	Nsubuga, Mike
dc.date.accessioned	2024-02-28T08:06:49Z
dc.date.available	2024-02-28T08:06:49Z
dc.date.issued	2023
dc.identifier.uri	http://hdl.handle.net/10570/13162
dc.description.abstract	Background: Antimicrobial resistance (AMR) is a significant global health threat, particularly in low- and middle-income countries (LMICs) such as Uganda, where reliable and rapid methods for detecting AMR in E. coli and other pathogens are scarce. This gap can lead to inappropriate treatment and the spread of drug-resistant infections. In this thesis, various machine learning models were trained to predict AMR in E. coli for ciprofloxacin (CIP), ampicillin (AMP), and cefotaxime (CTX) using whole-genome sequencing (WGS) data from England, where such data is more readily available. A separate Ugandan dataset was used for validation, further demonstrating the generalizability and effectiveness of the models in LMICs. Methods: A total of 1496 (CIP), 1428 (CTX), and 1396 (AMP) sequences from England were divided into training and testing sets; 42 sequences from Uganda were used for validation. Eight machine learning models were trained and tested: Logistic Regression (LR), Random Forest (RF), Gradient Boosting (GB), XGBoost (XGB), LightGBM (LGBM), CatBoost (CB), Feed-Forward Neural Network (FFNN), and Support Vector Machine (SVM). The models were evaluated on precision, recall, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). Upsampling techniques were applied to address class imbalance in the data. Results: Predictive performance varied significantly across antibiotics, underlining the critical role of model selection and dataset characteristics. Notably, during testing the FFNN model performed best for CIP (accuracy 84%; F1 0.55; AUC 91%), LR for CTX (accuracy 91%; F1 0.37; AUC 83%), and GB for AMP (accuracy 57%; F1 0.62; AUC 53%), while the LGBM and RF models outperformed the others in some scenarios (p < 0.001). Upsampling did not significantly improve the models' performance, underscoring the complexity and high dimensionality of SNP data.
Despite high accuracy scores on the Ugandan validation dataset (FFNN with CIP accuracy 95%, LR with AMP accuracy 98%, and GB with CTX accuracy 65%), the models struggled on the recall metric because of severe class imbalance. Key mutations associated with antimicrobial resistance were identified for these antibiotics. Conclusion: As the threat of AMR continues to rise, the successful application of these models, particularly on the Ugandan dataset, signals a promising avenue for improving AMR detection and treatment strategies in LMICs, where genomic data is scarce. This work not only expands our current understanding of the genetic underpinnings of AMR but also provides a robust methodological framework that can guide future research and applications in the fight against antimicrobial resistance.	en_US
dc.description.sponsorship	The author was funded as a Master's scholar by the East African Network for Bioinformatics Training (EANBIT) under the Fogarty International Center at the U.S. National Institutes of Health (NIH), award number U2RTW010677. The author would also like to acknowledge the Open Science Grid (OSG) consortium, which provided the computational resources to carry out this study. The OSG is supported by National Science Foundation award numbers 2030508 and 1836650.	en_US
dc.language.iso	en	en_US
dc.publisher	Makerere University	en_US
dc.subject	Antimicrobial Resistance	en_US
dc.subject	AMR	en_US
dc.subject	Machine Learning	en_US
dc.subject	Genomics	en_US
dc.subject	ML	en_US
dc.subject	E. coli	en_US
dc.subject	Escherichia Coli	en_US
dc.subject	Antibiotics drugs	en_US
dc.title	A machine learning approach to predict E. coli antibacterial resistance using whole-genome sequencing data	en_US
dc.type	Thesis	en_US

Files in this item

Name: Thesis - Nsubuga Mike.pdf
Size: 4.470Mb
Format: PDF
Description: Master's dissertation

This item appears in the following Collection(s)

School of Bio-Medical Sciences (Bio-Medical) Collections