From computer viruses to coronavirus

UMAP: From computer viruses to coronavirus
UMAP’s 2D representation of a dataset from number theory with over 1.7 million dimensions.

A technique developed by researchers at CSE’s Tutte Institute for Mathematics and Computing, originally designed to analyze malware, is now being used to answer questions about COVID-19.

The technique, called UMAP (short for Uniform Manifold Approximation and Projection), was originally developed by TIMC researchers Dr. Leland McInnes and John Healy to help analyze different strains of computer viruses. But the TIMC, realizing its potential to advance the field of data science, released the algorithm and software to the open-source community. Since then, UMAP has taken on a life of its own.

So what exactly does UMAP do?

UMAP is a dimension reduction technique, which means it can take complex datasets with many attributes (or dimensions) and compress them into far fewer dimensions, stripping out redundancy to make the data easier to work with. Crucially, UMAP does this decluttering while still preserving the latent features of the data.
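To make the idea of redundant dimensions concrete, here is a minimal sketch. It does not use UMAP at all: it swaps in a much simpler linear method, principal component analysis (PCA), computed with NumPy, and runs it on made-up data. The variable names and the synthetic dataset are assumptions chosen purely for illustration. The data has 50 recorded attributes, but only 2 hidden factors actually drive it, so 2 dimensions are enough to describe it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 500 samples that truly
# vary along only 2 hidden factors...
latent = rng.normal(size=(500, 2))
# ...but are recorded as 50 correlated attributes (a random linear mix).
mixing = rng.normal(size=(2, 50))
data = latent @ mixing  # shape (500, 50)

# PCA via SVD, a linear stand-in for UMAP: centre the data, then measure
# how much of the total variance each direction carries.
centred = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Keep only the two strongest directions.
reduced = centred @ components[:2].T  # shape (500, 2)

print(reduced.shape)
print(explained[:2].sum())  # fraction of variance kept by 2 dimensions
```

Because the other 48 attributes are just re-mixes of the same 2 factors, almost nothing is lost by dropping them. UMAP pursues the same goal on data whose structure is curved rather than flat, where a linear method like PCA falls short.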

The other key feature that sets UMAP apart is its speed. Instead of taking hours, data possessing hundreds of dimensions can be embedded onto a colourful 2D or 3D chart in a matter of seconds. Researchers can then literally see the underlying patterns and pick out common threads for further analysis.

Examples from a dataset known as MNIST, made up of handwritten digits from 0 to 9.

UMAP’s rendering of MNIST. Each point represents a sample. Each cluster corresponds to a digit, with similarly-shaped digits located closer to one another.

Since its open-source release in 2018, UMAP has been used in a variety of fields never envisioned by the original researchers, from single-cell biology to AI to astronomy.

And now, UMAP has been brought to bear on the most pressing scientific challenge facing humanity today: COVID-19.

How is UMAP being used to study COVID-19?

To date, UMAP has been used in at least 15 studies related to COVID-19, ranging from the analysis of immunotypes to the search for potential drug treatment candidates.

In one notable example, a team of Canadian researchers harnessed UMAP to create a COVID-19 Genotyping Tool capable of spotting genetic variations in SARS-CoV-2 virus samples. If distinct sub-types exist, as early evidence suggests they do, the tool should make it easier for researchers to see how the different strains are connected.

UMAP is a good fit for this task because it can quickly process the mind-boggling number of data points from tens of thousands of genomes and organize them based on commonalities. The resulting charts show different-sized clusters, colour-coded by region, country and sample collection date, corresponding to different outbreak events.

According to its creators, the COVID-19 Genotyping Tool has direct implications for vaccine research, as well as the search for drugs and other therapies to treat the disease. You can read more about it in the June edition of The Lancet Digital Health.

Why open source?

While nobody could have predicted the specific circumstances of the COVID-19 global pandemic, the TIMC researchers foresaw how broadly UMAP might be applied.

By releasing its research as open source, the TIMC advances its mission to deliver research results that have an impact on the most important scientific challenges facing the Canadian and Five Eyes security and intelligence communities, while also empowering and collaborating with Canadian researchers in other fields of study. In the case of UMAP and COVID-19, the impact could turn out to be even more far-reaching.


About the Tutte Institute:

The Tutte Institute for Mathematics and Computing (TIMC) is a government research institute focused on research in fundamental mathematics and computer science. Our mission is to deliver research results that have an impact on the most important scientific challenges facing the Canadian and Five Eyes security and intelligence communities.

TIMC is sponsored and funded by the Communications Security Establishment (CSE), supporting CSE’s unique requirements in the areas of mathematics and computer science. The TIMC’s key research areas are cryptography and data science. We draw upon many mathematical and computational fields, including Algebra, Algebraic Geometry, Combinatorics, Data Science, Topology, Number Theory, and Quantum Computing. Our researchers are leaders in their fields and work collaboratively on interesting challenges found only within the TIMC.

Get UMAP

The latest release of the software can be found on GitHub under the account of Dr. Leland McInnes. Documentation is also available online, as is a preprint of the paper describing the underlying mathematical foundation.

If you would like to know more, you can contact the Tutte Institute.