Access from Python, R, MATLAB, etc.#

Python#

To facilitate the comparison of clustering algorithms in the Python environment, we have developed a dedicated package that is available for download from PyPI; see Using clustbench for more details.

However, as described in the File Format Specification section, all data files are neat and tidy. Therefore, we can access them easily using some more low-level functions from numpy. For example:

import numpy as np
import os.path
# to do: change to your local path
base_name = os.path.join("~", "Projects", "clustering-data-v1", "wut", "smile")
base_name = os.path.expanduser(base_name)
data    = np.loadtxt(base_name + ".data.gz", ndmin=2)
data[:6, :]  # preview
## array([[-1.545826,  2.471133],
##        [-5.001664,  1.066568],
##        [-5.434681,  1.333114],
##        [-4.384334,  1.176669],
##        [-3.950311,  6.94172 ],
##        [-4.23157 ,  7.156661]])
labels  = np.loadtxt(base_name + ".labels0.gz", dtype="int")
labels[:6]  # preview
## array([2, 2, 2, 2, 2, 2])

Note that cluster validity measures discussed in the Appendix are implemented in the genieclust package.

Note

To learn more about Python, check out Marek’s open-access (free!) textbook Minimalist Data Wrangling in Python [22].

R#

Following the File Format Specification, the datasets can be accessed using the built-in R functions:

# to do: change to your local path
base_name <- file.path("~", "Projects", "clustering-data-v1", "wut", "smile")
data    <- as.matrix(read.table(paste0(base_name, ".data.gz")))
head(data)  # preview
##           V1     V2
## [1,] -1.5458 2.4711
## [2,] -5.0017 1.0666
## [3,] -5.4347 1.3331
## [4,] -4.3843 1.1767
## [5,] -3.9503 6.9417
## [6,] -4.2316 7.1567
labels  <- scan(paste0(base_name, ".labels0.gz"), integer())
head(labels)  # preview
## [1] 2 2 2 2 2 2

Cluster validity measures are implemented in the R version of the genieclust package.

Algorithms implemented in R can be called from within Python using, for example, the rpy2 package.

Note

To learn more about R, check out Marek’s open-access (free!) textbook Deep R Programming [23].

MATLAB#

Unfortunately, MATLAB does not seem to be able to ungzip files on the fly, but these can be decompressed to a temporary folder manually.

base_name = "~/Projects/clustering-data-v1/wut/smile";
t = tempdir();
data = readmatrix(char(gunzip(base_name + ".data.gz", t)), FileType="text");
labels = readmatrix(char(gunzip(base_name + ".labels0.gz", t)), FileType="text");

Note that there is also a MATLAB interface for Python. This way, algorithms that have only been implemented in the former can be called from within the latter.

Unfortunately, MATLAB is not free software.

Todo

Contributions are welcome: Describe how to load the datasets and benchmark results in GNU Octave, Scilab, Julia, Mathematica, … (🚧 help needed 🚧)