Firstly, I had to clean the data (e.g. extract the DNA values from the RNASeq sequences). Huili made this super easy for me by providing scripts she had written, which I was able to modify for my own use.
Pro-tip: Python’s GIL stops threads from running Python bytecode in parallel, so the multi-threaded code I wrote for CPU-intensive tasks gave no speed-up and turned out to be useless. Don’t repeat my mistake.
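For anyone hitting the same wall, the usual workaround is to use processes instead of threads, since each process gets its own interpreter and its own GIL. Here is a minimal sketch using the standard library's `multiprocessing.Pool`. The `gc_fraction` task and the toy reads are hypothetical stand-ins, not code or data from my thesis.

```python
from multiprocessing import Pool


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in one sequence (a stand-in CPU-bound task)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def gc_fractions_parallel(seqs, workers=2):
    """Spread the work over separate processes, each with its own GIL."""
    with Pool(processes=workers) as pool:
        return pool.map(gc_fraction, seqs)


if __name__ == "__main__":
    reads = ["ACGT", "GGCC", "ATAT"]  # toy data, not from the thesis
    print(gc_fractions_parallel(reads))
```

Processes have a start-up and data-copying cost, so this only pays off when each task does a meaningful amount of work.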
Instead of re-inventing the wheel, I decided to start simple with k-means clustering. When the results proved unsatisfactory, I did a further literature review to determine which methods were best suited to genetic data. Prof Vincent’s MA4270 notes, Daphne Koller and Nir Friedman’s book on Probabilistic Graphical Models, and the scikit-learn library were a big help to me.
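To give a flavour of the "start simple" step, here is a minimal k-means sketch with scikit-learn. The expression matrix is randomly generated toy data, and the choice of two clusters is an assumption for illustration only; it is not the setup I used in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy expression matrix: 20 "genes" x 4 "samples", in two well-separated groups
low = rng.normal(loc=0.0, scale=0.3, size=(10, 4))
high = rng.normal(loc=5.0, scale=0.3, size=(10, 4))
expression = np.vstack([low, high])

# n_clusters=2 is an assumption; on real data it has to be chosen with care
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(expression)

print(labels)
```

On toy data like this the two groups separate cleanly. Real data is far messier, which is why k-means alone was not enough for me.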
Eventually, I identified several unsupervised learning methods I could apply, elaborated below.
Another pro-tip: LEARN EXPECTATION MAXIMIZATION BY HEART. It took me almost a week to understand it; it is a difficult concept imo. But once I did, it made everything else much easier.
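If it helps, this is roughly the smallest EM example I can think of: fitting a two-component 1-D Gaussian mixture with NumPy. It is a sketch on synthetic data, not the model from my thesis, and the initialisation (means at the data extremes) is a simplifying assumption.

```python
import numpy as np


def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    # Crude initialisation: means at the data extremes, shared variance
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
        dens = dens / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the
        # responsibility-weighted data
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var


rng = np.random.default_rng(0)
# Synthetic data: two clusters centred at 0 and 6
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
pi, mu, var = em_gmm_1d(data)
print(pi, mu, var)
```

The E-step guesses which component each point belongs to given the current parameters, and the M-step updates the parameters given those guesses. Once that loop clicks, the same pattern shows up in many other latent-variable models.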
I performed feature engineering as well before writing the models that would determine the results. I plotted heat maps for Dr. Huili to evaluate the significance of the results - and we identified GTPases as the genes that play a key role in propagating this infection.
My entire thesis is available on arXiv. In case you are unable to access it, I have also made it available on GitHub.
Before attempting this project, I honestly didn’t believe I would be able to contribute anything useful to the work already being done at Dr. Huili’s lab. However, Vincent and Huili were both patient and explained things to me really well. We had weekly catch-ups where I would update them on the work I was doing, and I was able to contribute significantly in terms of converting algorithms to code (e.g. there are no libraries offering HMM-based clustering right off the bat; I had to write my own code for it).
In the end, we were also pleasantly surprised by the results, as we were able to identify GTPases as playing a key role in propagating this infection. This work tallies with research being done in other labs worldwide, and lab experiments at A*Star further corroborated this.
There were definitely a couple of times I got frustrated and wanted to give up; since this was an unsupervised learning task, I had no idea if the results I was deriving were actually biologically significant. However, Huili constantly motivated me and backed up my results, so I persevered; I was lucky to have such a supportive mentor.
All in all, this made me interested in the field of biocomputation, and specifically in how machine learning can be applied to medicine. I think the implications are far-reaching, and we can definitely make positive contributions to it if we find people who are polymaths in these fields. I definitely aspire to be one of these polymaths.