# Access from Python, R, MATLAB, etc.

## Python

To facilitate the comparison of clustering algorithms in the Python environment, we have developed a dedicated package that is available for download from PyPI; see Using clustbench for more details.

However, as described in the File Format Specification section,
all data files follow a simple, regular layout. Therefore, we can
also read them directly with lower-level functions from *numpy*
or *pandas*. For example:

```
import numpy as np
import os.path
# to do: change to your local path
base_name = os.path.join("~", "Projects", "clustering-data-v1", "wut", "smile")
base_name = os.path.expanduser(base_name)
data = np.loadtxt(base_name + ".data.gz", ndmin=2)
data[:6, :] # preview
## array([[-1.545826, 2.471133],
## [-5.001664, 1.066568],
## [-5.434681, 1.333114],
## [-4.384334, 1.176669],
## [-3.950311, 6.94172 ],
## [-4.23157 , 7.156661]])
labels = np.loadtxt(base_name + ".labels0.gz", dtype="int")
labels[:6] # preview
## array([2, 2, 2, 2, 2, 2])
```
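The same files can also be read with *pandas*. The following is only a minimal sketch: the helper name `load_dataset` is hypothetical, and it assumes (as the *numpy* example above does implicitly) that values are separated by whitespace, that there is no header row, and that lines starting with `#` may be skipped.

```python
import os.path
import pandas as pd


def load_dataset(base_name):
    """Read <base_name>.data.gz and <base_name>.labels0.gz with pandas.

    Hypothetical helper, not part of any package mentioned in this section.
    """
    # Whitespace-separated values without a header row; the gzip
    # compression is inferred from the ".gz" file extension.
    data = pd.read_csv(
        base_name + ".data.gz", sep=r"\s+", header=None, comment="#"
    )
    # One label per line; keep the single column as a Series.
    labels = pd.read_csv(
        base_name + ".labels0.gz", sep=r"\s+", header=None, comment="#"
    ).iloc[:, 0]
    return data, labels
```

With `base_name` set as in the previous example, `data, labels = load_dataset(base_name)` should give a data frame with one row per point and a series of integer labels; `data.to_numpy()` converts the former back to a plain matrix.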

External cluster validity measures
discussed in the Appendix
are implemented in the *genieclust*
package.

## R

Following the File Format Specification, the datasets can be accessed easily using built-in R functions:

```
# to do: change to your local path
base_name <- file.path("~", "Projects", "clustering-data-v1", "wut", "smile")
data <- as.matrix(read.table(paste0(base_name, ".data.gz")))
head(data) # preview
## V1 V2
## [1,] -1.5458 2.4711
## [2,] -5.0017 1.0666
## [3,] -5.4347 1.3331
## [4,] -4.3843 1.1767
## [5,] -3.9503 6.9417
## [6,] -4.2316 7.1567
labels <- scan(paste0(base_name, ".labels0.gz"), integer())
head(labels) # preview
## [1] 2 2 2 2 2 2
```

The external cluster validity measures
are implemented in the R version of the
*genieclust* package.

Algorithms implemented in R
can be called from within Python
using, for example, the *rpy2* package.

## MATLAB

Unfortunately, MATLAB does not seem to be able to decompress *gzip* files
on the fly. They can, however, be extracted to a temporary folder
first:

```
base_name = "~/Projects/clustering-data-v1/wut/smile";
t = tempdir();
data = readmatrix(char(gunzip(base_name + ".data.gz", t)), FileType="text");
labels = readmatrix(char(gunzip(base_name + ".labels0.gz", t)), FileType="text");
```
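For tools that cannot read *gzip* files at all, the decompression step can also be carried out beforehand, for instance from Python. The sketch below uses only the standard library; the function name `gunzip_to_dir` is hypothetical, and it assumes the input file name ends with `.gz`.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_to_dir(gz_path, out_dir=None):
    """Decompress gz_path into out_dir and return the path of the new file.

    Hypothetical helper; if out_dir is None, a fresh temporary
    directory is created.
    """
    if out_dir is None:
        out_dir = tempfile.mkdtemp()
    # For example, "smile.data.gz" becomes "smile.data".
    out_name = os.path.basename(gz_path)
    if out_name.endswith(".gz"):
        out_name = out_name[:-3]
    out_path = os.path.join(out_dir, out_name)
    # Stream the decompressed bytes to the output file.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

The returned path points to a plain text file that any program able to read whitespace-separated numbers can open.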

Note that there is also a MATLAB interface for Python (the MATLAB Engine API for Python). This way, algorithms that have only been implemented in MATLAB can be called from within Python.

Unfortunately, MATLAB is not free software.

Todo: Contributions are welcome. Describe how to load the datasets and benchmark results in GNU Octave, Scilab, Julia, Mathematica, …