HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)


Clustering is a data science technique for finding patterns or groupings in large data sets. It is a general-purpose technique with broad applicability across scientific domains.

Leland McInnes and John Healy, two CSE researchers at the Tutte Institute for Mathematics and Computing, refined the DBSCAN (density-based spatial clustering of applications with noise) algorithm. The redesigned algorithm is orders of magnitude more efficient, and McInnes and Healy have written a high-performance implementation of it. This implementation is now the de facto reference implementation of the algorithm.

The refined HDBSCAN algorithm, implemented in Python, is available for download on GitHub (a hosting service for code repositories) as part of the scikit-learn-contrib project. It is also available from PyPI and conda-forge, two popular package repositories for Python.

What is HDBSCAN used for?

HDBSCAN has been applied in a variety of fields. Some examples are listed below.


Malware analysis

Accounting anomaly detection

Molecular dynamics

Product defect detection

Bitcoin / blockchain analysis

Is HDBSCAN production ready?

HDBSCAN is currently in a resting state: the code is stable, and members of the open-source community continue to refine and adapt it.