
dc.contributor.author: Babirye, Sandra Ruth
dc.date.accessioned: 2024-11-06T13:35:40Z
dc.date.available: 2024-11-06T13:35:40Z
dc.date.issued: 2024-06
dc.identifier.citation: Babirye, S.R. (2024). Understanding genetic diversity and rapid drug resistance prediction in mycobacterium tuberculosis from whole-genome sequence and other epidemiological data (Unpublished master's dissertation). Makerere University, Kampala, Uganda. (en_US)
dc.identifier.uri: http://hdl.handle.net/10570/13654
dc.description: A dissertation submitted to the Directorate of Research and Graduate Training in partial fulfillment of the requirements for the award of the degree of Master of Science in Bioinformatics of Makerere University. (en_US)
dc.description.abstract: Tuberculosis (TB) remains one of the major global health problems, with an estimated 1.6 million deaths worldwide. The availability of whole-genome sequence (WGS) data offers a good avenue for understanding genetic diversity and drug resistance (DR) mutations. We aimed to investigate the genetic diversity and relatedness of Mycobacterium tuberculosis (MTB) isolates among individuals with different CD4 cell counts and to leverage machine learning (ML) algorithms to predict DR using WGS and epidemiological data from Uganda. Methods: This was a cross-sectional study utilizing 226 WGS samples of MTB isolates in Uganda between 2013 and 2023. Associated patient demographic data and phenotypic drug information were obtained. We used TB-Profiler for lineage and drug resistance prediction, and Snippy for variant calling and annotation. Phylogenetic analysis was performed on the core genome alignment file in MEGA. For ML model development, we split the data into training (80%) and testing (20%) datasets. The SMOTE technique was applied to handle class imbalance. We evaluated various ML algorithms, including random forest (RF), logistic regression (LR), and boosting classifiers such as AdaBoost, CatBoost, Gradient Boosting, and XGBoost, for prediction of drug resistance to the antibiotics rifampicin, ethambutol, isoniazid, and streptomycin. Key metrics such as recall, precision, the receiver operating characteristic (ROC) curve, and the Matthews correlation coefficient (MCC) were used to assess model performance. Results: Across the 203 MTB isolates, we observed 5 distinct phylogenetic lineages (L1-4, L3&L4), with L4 being the most prevalent at 149/203 (73.40%), followed by L3 at 46/203 (22.66%), among others. The most common sub-lineage was L4.6.1.1/Uganda II. There was a statistical association between MTB lineages and CD4 cell count group (low or high).
Overall, all ML algorithms were able to predict drug resistance; however, the boosting classifiers had the highest AUC values. Age, sex, and HIV status proved to be significant features, in addition to the SNP positions, for ML model development. Conclusion: Our findings on the circulating lineages, sub-lineages, and drug resistance profiles play a crucial role in understanding the genetic diversity of MTB. Additionally, our ML approach can robustly predict drug resistance and also inform on the underlying gene mutations while utilizing both WGS (SNP) and epidemiological data. (en_US)
dc.language.iso: en (en_US)
dc.publisher: Makerere University (en_US)
dc.subject: Genetic diversity (en_US)
dc.subject: Rapid drug resistance (en_US)
dc.subject: Mycobacterium tuberculosis (en_US)
dc.subject: Whole-genome sequence (en_US)
dc.subject: Epidemiological data (en_US)
dc.title: Understanding genetic diversity and rapid drug resistance prediction in mycobacterium tuberculosis from whole-genome sequence and other epidemiological data (en_US)
dc.type: Thesis (en_US)

