Machine Learning on Genomic Data

In my fifth and final year at NUS, I did a year-long research project in the Department of Mathematics. I asked Prof. Vincent Tan, who had taught me MA4270, to supervise my thesis. He proposed a project, in collaboration with his wife, Dr. Huili Guo from IMCB, A*Star, on applying machine learning to genomic data.

Background

  • The genomic data in question are RNA assays from research on the EV-A71 virus (hand-foot-mouth disease).
  • After a cell is infected with a virus, overall translation of genes into proteins decreases.
  • The objective was to identify, through unsupervised learning techniques, the genes that are not affected by the virus.

Data

  • Raw data retrieved from RNASeq and RBF.
  • Contained gene sequences and their numerical concentrations at different time periods.

[Figure: EV-A71 (chrE) infection over time]

Methods

Firstly, I had to clean the data (e.g. extract the DNA values from the RNASeq sequences). Huili made this super easy for me by providing scripts she had written, which I was able to modify for my own use.

Pro-tip: Python’s GIL prevents threads from running Python code in parallel, so the multi-threaded code I wrote for CPU-intensive tasks turned out to be useless. Don’t repeat my mistake.
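To make the tip concrete, here is a minimal sketch of the usual workaround, which is to use processes instead of threads. It assumes the CPU-heavy step is a per-sequence computation; `gc_content` and `gc_content_parallel` are hypothetical stand-ins, not the actual thesis code. Each worker process has its own interpreter and its own GIL, so the work can be spread across cores.

```python
from concurrent.futures import ProcessPoolExecutor


def gc_content(sequence):
    """Fraction of G/C bases in a sequence (hypothetical CPU-bound task)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


def gc_content_parallel(sequences, workers=2):
    """Apply gc_content to every sequence using separate processes.

    Threads would take turns holding the GIL; processes do not share it,
    so they can run on several cores at the same time.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches the items to reduce inter-process overhead.
        return list(pool.map(gc_content, sequences, chunksize=64))
```

Call `gc_content_parallel(reads)` from under an `if __name__ == "__main__":` guard, since on some platforms the worker processes re-import the main module. For small inputs the cost of starting processes and pickling data can outweigh the gain, so it is worth timing both versions.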

Instead of re-inventing the wheel, I decided to start simple with k-means clustering. When the results from that proved unsatisfactory, I did a further literature review to determine which methods were best suited to genetic data. Prof Vincent’s MA4270 notes, Daphne Koller and Nir Friedman’s book on Probabilistic Graphical Models, and the scikit-learn library were a big help to me.

Eventually, I identified several unsupervised learning methods I could apply, listed below.

Another pro-tip: LEARN EXPECTATION MAXIMIZATION BY HEART. It took me almost a week to understand it; it is a difficult concept imo. But once I did, it made everything else much easier.
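For anyone stuck on the same concept, here is a minimal sketch of EM on the simplest case I know of: a mixture of two one-dimensional Gaussians. It is a toy illustration, not the thesis code; the function names, the initialisation scheme and the synthetic data are all assumptions made for this example. The E-step computes how responsible each component is for each point, and the M-step re-estimates the parameters from those soft assignments.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances), each a list of length 2.
    """
    # Crude initialisation (an assumption of this sketch): means at the
    # extremes of the data, both components sharing the overall variance.
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    overall_var = max(overall_var, 1e-6)
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])

        # M-step: weighted re-estimation of each component's parameters.
        for k in range(2):
            n_k = sum(r[k] for r in resp)
            weights[k] = n_k / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / n_k
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / n_k
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing component

    return weights, means, variances


# Synthetic data: two well-separated clusters centred near 0 and 5.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(300)]
data += [random.gauss(5.0, 1.0) for _ in range(300)]

weights, means, variances = em_two_gaussians(data)
print("weights:", weights)
print("means:", means)
```

The same two-step pattern shows up in Gaussian Mixture Models and in the Baum-Welch training of Hidden Markov Models, which is why understanding it once pays off across several of the methods below.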

Machine Learning Methods
  • Expectation Maximization literature review
  • K-Means Clustering
  • Hidden Markov Models
    • HMM based clustering
  • Bayesian Networks
  • Hierarchical Clustering
  • Gaussian Mixture Models

[Figure: Hidden Markov Model]

Evaluation Metrics
  • Information Criteria
    • BIC and AIC
  • Elbow Method
  • Misclassification Error (ME) Distance
  • Gene Ontology
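
As a rough illustration of how the information criteria in the list above can be used to pick the number of clusters, here is a small sketch using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, n the number of samples and L the maximised likelihood. The log-likelihood values below are made up purely for illustration and are not results from the thesis; `best_by` is a hypothetical helper.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def best_by(scores):
    """Return the key (here, the number of clusters) with the lowest score."""
    return min(scores, key=scores.get)


# Hypothetical fits of 1-D mixtures with 2, 3 and 4 clusters on 500 samples.
# A k-component 1-D Gaussian mixture has 3k - 1 free parameters.
candidates = {
    2: {"log_likelihood": -1250.0, "n_params": 5},
    3: {"log_likelihood": -1180.0, "n_params": 8},
    4: {"log_likelihood": -1176.0, "n_params": 11},
}
n_samples = 500

bic_scores = {
    k: bic(m["log_likelihood"], m["n_params"], n_samples)
    for k, m in candidates.items()
}
aic_scores = {
    k: aic(m["log_likelihood"], m["n_params"])
    for k, m in candidates.items()
}

print("best by BIC:", best_by(bic_scores))
print("best by AIC:", best_by(aic_scores))
```

BIC penalises extra parameters more heavily than AIC once n is larger than about 8, so the two criteria can disagree; in that case the heavier penalty of BIC favours the smaller model.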

Thesis

I also performed feature engineering before writing the models that would produce the results. I plotted heat maps so that Dr. Huili could evaluate the significance of the results, and we identified GTPases as the genes that play a key role in propagating this infection.

My entire thesis is available on arXiv. In case you are unable to access it, I have also made it available on GitHub.

Thoughts

Before attempting this project, I honestly didn’t believe I would be able to contribute anything useful to the work already being done at Dr. Huili’s lab. However, Vincent and Huili were both patient and explained things to me really well. We had weekly catch-ups where I would update them on the work I was doing, and I was able to contribute significantly by converting algorithms into code (e.g. no library offered HMM-based clustering out of the box, so I had to write my own code for it).

In the end, we were also pleasantly surprised by the results, as we were able to identify GTPases as playing a key role in propagating this infection. This finding tallies with research being done in other labs worldwide, and lab experiments at A*Star further corroborated it.

There were definitely a couple of times I got frustrated and wanted to give up; since this was an unsupervised learning task, I had no idea if the results I was deriving were actually biologically significant. However, Huili constantly motivated me and backed up my results, so I persevered; I was lucky to have such a supportive mentor.

All in all, this made me interested in the field of biocomputation, and specifically how machine learning can be applied to medicine. I think the implications are far reaching, and we can definitely make positive contributions to it if we find people who are polymaths in these fields. I definitely aspire to be one of these polymaths.
