From computer viruses to coronavirus
UMAP’s 2D representation of a dataset from number theory with over 1.7 million dimensions.
A technique developed by researchers at CSE’s Tutte Institute for Mathematics and Computing (TIMC), originally designed to analyze malware, is now being used to answer questions about COVID-19.
The technique, called UMAP (short for Uniform Manifold Approximation and Projection), was originally developed by TIMC researchers Dr. Leland McInnes and John Healy to help analyze different strains of computer viruses. Realizing its potential to advance the field of data science, the TIMC released the algorithm and software to the open-source community. Since then, UMAP has taken on a life of its own.
So what exactly does UMAP do?
UMAP is a dimension reduction technique: it takes complex datasets with many attributes (or dimensions) and condenses them into far fewer dimensions, stripping out redundancy to make the data easier to work with. Crucially, UMAP does this decluttering while still preserving the latent features of the data.
The other key feature that sets UMAP apart is its speed. Rather than taking hours, UMAP can embed data with hundreds of dimensions in a colourful 2D or 3D chart in a matter of seconds. Researchers can then literally see the underlying patterns and pick out common threads for further analysis.
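To make the idea of dimension reduction concrete, here is a minimal sketch in Python. It does not implement UMAP, which is a nonlinear method built on different mathematics. It uses principal component analysis, a much simpler linear technique, swapped in purely for illustration. The toy dataset, the function names and the choice of five dimensions are all assumptions made for this example. Each sample has five attributes, but three of them are just combinations of the other two, so a 2D picture should lose almost nothing.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so the data is centred on the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of rows that are already centred."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_direction(matrix, iterations=100):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j]
                     for i in range(d) for j in range(d))
    return eigenvalue, v


def reduce_to_2d(rows):
    """Project rows onto their two main directions of variation (PCA)."""
    centred = center(rows)
    cov = covariance(centred)
    d = len(cov)
    val1, dir1 = dominant_direction(cov)
    # Remove the first direction from the matrix, then find the next one.
    deflated = [[cov[i][j] - val1 * dir1[i] * dir1[j] for j in range(d)]
                for i in range(d)]
    val2, dir2 = dominant_direction(deflated)
    embedding = [[sum(r[j] * dir1[j] for j in range(d)),
                  sum(r[j] * dir2[j] for j in range(d))] for r in centred]
    total_variance = sum(cov[i][i] for i in range(d))
    return embedding, (val1 + val2) / total_variance


# Hypothetical toy data: only `a` and `b` carry independent information.
rng = random.Random(0)
samples = []
for _ in range(200):
    a = rng.uniform(-10, 10)
    b = rng.uniform(-1, 1)
    samples.append([a, b, a + b, 2 * a, a - b])

embedding, explained = reduce_to_2d(samples)
print(len(embedding), "points, each with", len(embedding[0]), "coordinates")
print("share of variance kept:", round(explained, 4))
```

Each of the 200 samples ends up as a pair of coordinates that could be drawn on a 2D chart. A linear method like this one works here only because the redundancy in the toy data is itself linear. Techniques such as UMAP are aimed at data where the structure is more tangled than that.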
Examples from a dataset known as MNIST, made up of handwritten digits from 0 to 9.
UMAP’s rendering of MNIST. Each point represents a sample. Each cluster corresponds to a digit, with similarly shaped digits located closer to one another.
And now, UMAP has been brought to bear on the most pressing scientific challenge facing humans today: COVID-19.
How is UMAP being used to study COVID-19?
In one notable example, a team of Canadian researchers harnessed UMAP to create a COVID-19 Genotyping Tool capable of spotting genetic variations in SARS-CoV-2 virus samples. If distinct sub-types exist, as early evidence suggests they do, the tool should make it easier for researchers to see how the different strains are connected.
UMAP is a good fit for this task because it can quickly process the mind-boggling number of data points from tens of thousands of genomes and organize them based on commonalities. The resulting charts show clusters of different sizes, colour-coded by region, country and sample collection date, corresponding to different outbreak events.
According to its creators, the COVID-19 Genotyping Tool has direct implications for vaccine research, as well as the search for drugs and other therapies to treat the disease. You can read more about it in the June edition of The Lancet Digital Health.
Why open source?
While nobody could have predicted the specific circumstances of the COVID-19 global pandemic, the TIMC researchers foresaw how broadly UMAP might be applied.
By releasing its research as open source, the TIMC advances its mission of delivering research results that have an impact on the most important scientific challenges facing the Canadian and Five Eyes security and intelligence communities, while also empowering and collaborating with Canadian researchers in other fields of study. In the case of UMAP and COVID-19, the impact could turn out to be even more far-reaching.
The latest release of the software can be found on GitHub under the account of Dr. Leland McInnes. Documentation is available online, as is a preprint of the paper describing the underlying mathematical foundations.
If you would like to know more, you can contact the Tutte Institute.