Documentation

clustbench.load_dataset(battery, dataset[, ...])

Load a benchmark dataset

clustbench.save_data(filename, data[, fmt, ...])

Write a data matrix for inclusion in the clustering benchmark suite

clustbench.save_labels(filename, labels[, ...])

Write a label vector for inclusion in the clustering benchmark suite

clustbench.preprocess_data(data[, ...])

Normalise a data matrix

clustbench.get_battery_names([path, ...])

Get the names of benchmark batteries in a given directory

clustbench.get_dataset_names(battery[, ...])

Get the names of datasets in a given benchmark battery

clustbench.load_results(method_group, ...[, ...])

Load benchmark results

clustbench.save_results(filename, results[, ...])

Write clustering results of many algorithms

clustbench.transpose_results(results)

"Transpose" a results dictionary

clustbench.labels_list_to_dict(labels)

Convert a list of labels to a dictionary indexed by n_clusters

clustbench.fit_predict_many(model, data, ...)

Determine many clusterings of the same dataset.

clustbench.get_score(labels, results[, ...])

Compute a similarity score between the reference and the predicted partitions

clustbench.Colouriser(data[, labels])

Interactive planar data editor

clustering-benchmarks Package

class clustbench.Colouriser(data, labels=None)

Interactive planar data editor

See the dedicated section on the package homepage for more details.

Parameters
data

An n-by-2 real matrix giving the coordinates of n planar points.

labels

Either a vector of n corresponding integer labels or None.

Examples

>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> clr = clustbench.Colouriser(wut_smile.data, wut_smile.labels[0])
>>> clr.print_help()
>>> clr.show()  # starts the interactive mode
>>> new_data = clr.get_data()
>>> new_labels = clr.get_labels()
Attributes
data

The data matrix.

labels

A vector of n integer labels; 0 denotes the noise cluster

get_data()

Get the current data matrix.

Returns
data
get_labels()

Get the current labels.

Returns
labels
normalise_labels()

Translate the current label vector (ignoring colour 0, which denotes the noise cluster) so that labels are assigned in decreasing order of occurrence.

print_help()

List the keyboard shortcuts available.

show()

Start the interactive Colouriser.

clustbench.fit_predict_many(model, data, n_clusters)

Determine many clusterings of the same dataset.

Ideally, for hierarchical methods, model should be implemented smartly enough that, for the same data matrix and different n_clusters, it does not recompute the whole hierarchy from scratch.

Parameters
model

An object equipped with fit_predict and set_param methods (e.g., a scikit-learn-like class).

data : array-like

Data matrix.

n_clusters : int or list of ints

Number of clusters.

Returns
results

A dictionary of label vectors, where results[K] gives the discovered K-partition.
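
The protocol described above can be illustrated with a toy model. This is a hypothetical sketch, not the library's implementation: ThresholdModel and fit_predict_many_sketch are illustrative names, and the toy "clustering" merely splits 1-D points into K equal-width bins so that the set_param/fit_predict loop can be demonstrated end to end.

```python
class ThresholdModel:
    """Toy 1-D 'clustering': assigns points to K equal-width bins."""
    def __init__(self):
        self.n_clusters = 2

    def set_param(self, name, value):
        setattr(self, name, value)

    def fit_predict(self, data):
        lo, hi = min(data), max(data)
        width = (hi - lo) / self.n_clusters or 1.0
        return [min(int((x - lo) / width), self.n_clusters - 1) + 1
                for x in data]


def fit_predict_many_sketch(model, data, n_clusters):
    # Conceptually what fit_predict_many returns: {K: K-partition labels}.
    if isinstance(n_clusters, int):
        n_clusters = [n_clusters]
    results = {}
    for k in n_clusters:
        model.set_param("n_clusters", k)   # the protocol the docs describe
        results[k] = model.fit_predict(data)
    return results


data = [0.0, 0.1, 0.2, 5.0, 5.1, 9.8, 9.9]
res = fit_predict_many_sketch(ThresholdModel(), data, [2, 3])
```

Each results[K] is a label vector of the same length as data, with labels in 1..K.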

clustbench.get_battery_names(path=None, expanduser=True, expandvars=True)

Get the names of benchmark batteries in a given directory

Parameters
path

Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns
batteries

A list of strings.

Examples

>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))
clustbench.get_dataset_names(battery, path=None, expanduser=True, expandvars=True)

Get the names of datasets in a given benchmark battery

Parameters
battery

Name of the battery, e.g., "wut" or "other". Can be an empty string or "." if all files are in a single directory as specified by path.

path

Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns
datasets

A list of strings.

Examples

>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))
clustbench.get_score(labels, results, metric=<built-in function adjusted_asymmetric_accuracy>, compute_max=True, warn_if_missing=True)

Compute a similarity score between the reference and the predicted partitions

Takes into account that there can be more than one ground truth partition and ignores the noise points (as explained in the Methodology section of the clustering benchmark framework’s website).

If labels is a single label vector, it will be wrapped inside a list. If results is not a dictionary, labels_list_to_dict will be called first.

Parameters
labels

A vector-like object or a list thereof.

results

A dictionary of clustering results, where results[K] gives a K-partition.

metric : function

An external cluster validity measure. It will be called like metric(y_true, y_pred).

compute_max : bool

Whether to apply max on the particular similarity scores.

warn_if_missing : bool

Warn if some results[K] is required, but missing.

Returns
score : float or array thereof

The computed similarity scores. Ultimately, it is a vector of metric(y_true[y_true>0], results[max(y_true)][y_true>0]) over all y_true in labels or the maximum thereof if compute_max is True.
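
The formula above (ignore noise points, pick results[max(y_true)], optionally take the maximum over all reference partitions) can be sketched in plain Python. This is illustrative only: toy_metric is a simple exact-match fraction standing in for a real external cluster validity measure such as adjusted_asymmetric_accuracy, and get_score_sketch is not the library's code.

```python
def toy_metric(y_true, y_pred):
    # Fraction of exact label matches -- a stand-in, NOT a real
    # external cluster validity measure.
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)


def get_score_sketch(labels, results, metric=toy_metric, compute_max=True):
    if labels and not isinstance(labels[0], (list, tuple)):
        labels = [labels]   # a single label vector -> list of one
    scores = []
    for y_true in labels:
        y_pred = results[max(y_true)]
        keep = [i for i, y in enumerate(y_true) if y > 0]  # drop noise (0)
        scores.append(metric([y_true[i] for i in keep],
                             [y_pred[i] for i in keep]))
    return max(scores) if compute_max else scores


y_true = [0, 1, 1, 2, 2]        # 0 marks a noise point
results = {2: [9, 1, 1, 2, 1]}  # some predicted 2-partition
score = get_score_sketch(y_true, results)
```

Here the noise point at index 0 is excluded, so the toy score is 3/4 regardless of what the prediction says there.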

clustbench.labels_list_to_dict(labels)

Convert a list of labels to a dictionary indexed by n_clusters

If labels is a single label vector, it will be wrapped inside a list.

Parameters
labels

A vector-like object or a list thereof.

Returns
ret : dict

ret[max(ll)] = ll for each ll in labels.
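
The mapping described above amounts to a one-line dictionary comprehension. A minimal sketch (labels_list_to_dict_sketch is an illustrative name, not the library function):

```python
def labels_list_to_dict_sketch(labels):
    if labels and not isinstance(labels[0], (list, tuple)):
        labels = [labels]            # single vector -> list of one
    return {max(ll): ll for ll in labels}


d = labels_list_to_dict_sketch([[1, 2, 2, 1], [1, 2, 3, 3]])
```

Each label vector is keyed by its largest label, i.e., the number of clusters it encodes.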

clustbench.load_dataset(battery, dataset, path=None, url=None, expanduser=True, expandvars=True, preprocess=True, random_state=None)

Load a benchmark dataset

Reads a dataset named battery/dataset.data.gz (relative to url or the directory path) as well as all the corresponding labels (battery/dataset.labels0.gz, battery/dataset.labels1.gz, …).

Parameters
battery

Name of the battery, e.g., "wut" or "other".

dataset

Dataset name, e.g., "x2" or "iris".

path

Mutually exclusive with url. Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.

url

Mutually exclusive with path. For example, "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0" gives access to <https://github.com/gagolews/clustering-data-v1>.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

preprocess

Whether to call preprocess_data on the data matrix.

random_state

Seed of the random number generator; passed to preprocess_data.

Returns
benchmark

A named tuple with the following elements:

battery

Same as the battery argument.

dataset

Same as the dataset argument.

description

Contents of the description file.

data : ndarray

Data matrix.

labels : list

A list consisting of the label vectors.

n_clusters : ndarray

The corresponding cluster counts: n_clusters[i] is equal to max(labels[i]).

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded suite)
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> wut_x2 = clustbench.load_dataset("wut", "x2", path=data_path)
>>> print(wut_x2.battery, wut_x2.dataset)
>>> print(wut_x2.description)
>>> print(wut_x2.data, wut_x2.labels)
>>> # load from GitHub (slow...):
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> print(wut_smile.data, wut_smile.labels)
clustbench.load_results(method_group, battery, dataset, n_clusters, path=None, expanduser=True, expandvars=True)

Load benchmark results

Reads the datasets named like method_group/battery/dataset.resultK.gz (relative to the directory path), for each K in n_clusters. method_group can be a wildcard like "*" if a lookup in multiple directories is required.

Parameters
method_group

Name of the method group, e.g., "Genie", ".", or "*".

battery

Name of the battery, e.g., "wut" or "other".

dataset

Dataset name, e.g., "x2" or "iris".

n_clustersint or list of ints

Number of clusters.

path

Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns
results

A dictionary of dictionaries of label vectors that can be accessed like results[method_name][n_clusters].

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())
clustbench.preprocess_data(data, noise_factor=1e-06, random_state=None)

Normalise a data matrix

Removes all columns of zero variance (i.e., constant columns). Centres the data around the centroid (so that each column mean is 0). Scales all columns proportionally (so that the total variance is 1; note that this is not the same as standardisation: the standard deviations in different columns may still differ). Adds a tiny amount of white noise to minimise the risk of duplicate points.

Parameters
data

Data matrix.

noise_factor

Standard deviation of the white noise added.

random_state

Seed of the random number generator; see scipy.stats.norm.rvs.

Returns
data

A modified data matrix.

Examples

>>> import numpy as np
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset(
...     "wut", "smile", url=data_url, preprocess=False)
>>> np.random.seed(123)  # ensure reproducibility
>>> X = clustbench.preprocess_data(wut_smile.data)
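
The normalisation steps described above can be sketched in NumPy. This is an illustration of the documented behaviour, not the library's implementation (which draws its noise via scipy.stats.norm.rvs); preprocess_data_sketch is a hypothetical name.

```python
import numpy as np


def preprocess_data_sketch(data, noise_factor=1e-6, random_state=None):
    X = np.asarray(data, dtype=float)
    X = X[:, X.var(axis=0) > 0]            # drop constant (zero-variance) columns
    X = X - X.mean(axis=0)                 # centre around the centroid
    X = X / np.sqrt(X.var(axis=0).sum())   # total variance becomes 1
    rng = np.random.default_rng(random_state)
    return X + rng.normal(0.0, noise_factor, X.shape)  # tiny anti-duplicate jitter


X = preprocess_data_sketch([[1.0, 5.0, 7.0],
                            [2.0, 5.0, 9.0],
                            [4.0, 5.0, 8.0]], random_state=123)
```

Here the constant middle column is removed, and the remaining two columns jointly have (approximately) unit total variance and zero mean.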
clustbench.save_data(filename, data, fmt='%g', expanduser=True, expandvars=True)

Write a data matrix for inclusion in the clustering benchmark suite

Parameters
filename : string or file handle

For example, path_to_suite/battery/dataset.data.gz.

data : 2D array_like

A matrix-like object

fmt

See numpy.savetxt.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

clustbench.save_labels(filename, labels, expanduser=True, expandvars=True)

Write a label vector for inclusion in the clustering benchmark suite

Parameters
filename : string or file handle

For example, path_to_suite/battery/dataset.labels0.gz.

labels : 1D array_like

A label vector.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.
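
The file format involved is plain numpy.savetxt output: one integer label per line, gzip-compressed when the filename ends in ".gz" (savetxt handles the compression automatically). The following is an illustration of those mechanics under that assumption, not a call to clustbench.save_labels itself:

```python
import numpy as np
import os
import tempfile

# One integer label per line; 0 denotes the noise cluster.
labels = np.array([1, 1, 2, 2, 0])
path = os.path.join(tempfile.mkdtemp(), "dataset.labels0.gz")
np.savetxt(path, labels, fmt="%d")       # .gz suffix -> gzip-compressed
round_trip = np.loadtxt(path, dtype=int)
```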

clustbench.save_results(filename, results, expanduser=True, expandvars=True)

Write clustering results of many algorithms

Parameters
filename : string or file handle

For example, method_group/battery/dataset.resultK.gz.

results : dict

A dictionary where each results[method_name] is a label vector.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())
>>> clustbench.save_results("x1.result3.gz", clustbench.transpose_results(res)[3])
clustbench.transpose_results(results)

“Transpose” a results dictionary

Parameters
results : dict

A dictionary of dictionaries or lists of objects.

Returns
ret : dict

A dictionary such that ret[b][a] is taken from results[a][b]. If results[a] is not a dictionary, labels_list_to_dict will be called first.
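
The ret[b][a] = results[a][b] relationship can be sketched as follows (transpose_results_sketch is an illustrative name, not the library code):

```python
def transpose_results_sketch(results):
    # ret[b][a] = results[a][b]; a non-dict inner value is first
    # re-keyed by max(ll), mirroring labels_list_to_dict.
    ret = {}
    for a, inner in results.items():
        if not isinstance(inner, dict):
            inner = {max(ll): ll for ll in inner}
        for b, v in inner.items():
            ret.setdefault(b, {})[a] = v
    return ret


res = {"Genie": {2: [1, 2], 3: [1, 2, 3]},
       "KMeans": {2: [2, 1], 3: [3, 2, 1]}}
by_k = transpose_results_sketch(res)
```

This turns a results[method_name][n_clusters] mapping into one indexed the other way round, e.g., by_k[3] collects every method's 3-partition, which is exactly the shape expected by save_results.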