Predicting infectious disease density in urban settings using Convolutional Neural Networks
Abstract
Rapid and unplanned urbanization is said to pose serious public health challenges to developing countries because of inequalities in socio-economic wellbeing, access to decent housing, and related factors. Consequently, disease risk differs even within the same city. For example, overcrowded housing in high-density neighborhoods not only
provides fertile ground for airborne infectious diseases to thrive but also facilitates their rapid spread through increased human contact. The close association observed between urban settings and infectious diseases raises important questions
that have not received adequate research attention. For example, what is the nature of this association? What methods are available and suitable for investigating it? Would existing methods for characterizing settlements as
urban or rural be suitable for studying it? Furthermore, what can neighborhood characteristics tell us about disease occurrence in a population? With advances in deep learning and big data projected to shape the future of epidemiology and public health, this thesis attempts to answer some of these questions by leveraging Convolutional Neural Networks (CNNs), using tuberculosis (TB), an airborne infectious disease, as a case study. The specific objectives are to: 1) determine the potential of socio-economic data as a predictor of infectious disease density, 2) determine the potential of urban density data as a predictor of infectious disease density, 3) build and evaluate a CNN model for identifying patterns in urban housing from satellite imagery, 4) build and evaluate a multimodal CNN model for predicting disease density from socio-economic and housing data, and 5) build and evaluate a Siamese CNN model for predicting infectious disease density from housing image data. We developed a linear regression model for each of objectives 1 and 2. CNN models were developed with a variety of input modalities
and architecture designs, in both regression and classification formulations, to fulfill objectives 3, 4, and 5. The TB data was obtained from Uganda’s Health Management Information System, the satellite imagery from the Google Static Maps API,
and the socio-economic data from WorldPop. Socio-economic data was found to possess predictive power for estimating disease density. However, inherent limitations of data derived using current methods for quantifying urban density
produced misleading results when such data was used for the same purpose. In contrast, CNNs performed reasonably well at detecting patterns in urban housing density: we achieved 80% accuracy on a housing density detection task. Results from using CNNs to infer TB density from neighborhood characteristics were promising: a single-input CNN model trained on housing data attained 81.6% accuracy on a TB density prediction task.
The architecture of this best-performing model was extended in a novel way inspired by the idea of Siamese twins, an approach we call learning deep features over neighbor scenes. The proposed architecture yielded a moderate improvement in prediction performance.
Despite these promising results, the potential of CNNs for inferring the occurrence of a disease in a population requires further investigation. An interesting research direction would be to explore the performance of deeper and larger multimodal network architectures trained on larger datasets. We expect deep neural networks (DNNs) to play an important role in the epidemiology of human infectious diseases in the future.