kmeans1d: Globally Optimal Efficient 1D k‑means Clustering

I implemented kmeans1d, a Python library for performing k-means clustering on 1D data, based on the algorithm from Xiaolin (1991), as presented by Grønlund et al. (2017, Section 2.2).

Globally optimal k-means clustering is NP-hard for multi-dimensional data. Lloyd’s algorithm is a popular approach for finding a locally optimal solution. For 1-dimensional data, there are polynomial time algorithms.

kmeans1d contains an O(kn + n log n) dynamic programming algorithm for finding the globally optimal k clusters for n 1D data points. The code is written in C++—for faster execution than a pure Python implementation—and wrapped in Python.

The source code is available on GitHub:
https://github.com/dstein64/kmeans1d

The package is available on PyPI, the Python Package Index. It can be installed with pip.

$ pip3 install kmeans1d

The snippet below includes an example of how to use the library.

	import kmeans1d

	x = [4.0, 4.1, 4.2, –50, 200.2, 200.4, 200.9, 80, 100, 102]
	k = 4

	clusters, centroids = kmeans1d.cluster(x, k)

	print(clusters) # [1, 1, 1, 0, 3, 3, 3, 2, 2, 2]
	print(centroids) # [-50.0, 4.1, 94.0, 200.5]

view raw kmeans1d.py hosted with ❤ by GitHub

References

[1] Wu, Xiaolin. “Optimal Quantization by Matrix Searching.” Journal of Algorithms 12, no. 4 (December 1, 1991): 663

[2] Grønlund, Allan, Kasper Green Larsen, Alexander Mathiasen, Jesper Sindahl Nielsen, Stefan Schneider, and Mingzhou Song. “Fast Exact K-Means, k-Medians and Bregman Divergence Clustering in 1D.” ArXiv:1701.07204 [Cs], January 25, 2017. http://arxiv.org/abs/1701.07204.

Leave a Reply Cancel reply