Documentation

clustbench.load_dataset(battery, dataset[, ...])

Load a benchmark dataset

clustbench.save_data(filename, data[, fmt, ...])

Write a data matrix for inclusion in the clustering benchmark suite

clustbench.save_labels(filename, labels[, ...])

Write a label vector for inclusion in the clustering benchmark suite

clustbench.preprocess_data(data[, ...])

Normalise a data matrix

clustbench.get_battery_names([path, ...])

Get the names of benchmark batteries in a given directory

clustbench.get_dataset_names(battery[, ...])

Get the names of datasets in a given benchmark battery

clustbench.load_results(method_group, ...[, ...])

Load benchmark results

clustbench.save_results(filename, results[, ...])

Write results of many clustering algorithms

clustbench.transpose_results(results)

"Transpose" a results dictionary

clustbench.labels_list_to_dict(labels)

Convert a list of labels to a dictionary indexed by n_clusters

clustbench.fit_predict_many(model, data, ...)

Determine many clusterings of the same dataset

clustbench.get_score(labels, results[, ...])

Computes a similarity score between the reference and the predicted partitions

clustbench.Colouriser(data[, labels])

An interactive planar data editor

clustering-benchmarks Package

class clustbench.Colouriser(data, labels=None)

An interactive planar data editor

See the dedicated section on the package homepage for more details.

Parameters:
data

An n-by-2 real matrix giving the coordinates of n planar points.

labels

Either a vector of n corresponding integer labels or None.

Examples

>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> clr = clustbench.Colouriser(wut_smile.data, wut_smile.labels[0])
>>> clr.print_help()
>>> clr.show()  # starts the interactive mode
>>> new_data = clr.get_data()
>>> new_labels = clr.get_labels()

Attributes:
data

The data matrix.

labels

A vector of n integer labels; 0 denotes the noise cluster.

get_data()

Get the current data matrix.

Returns:
data
get_labels()

Get the current labels.

Returns:
labels
normalise_labels()

Translate the current label vector (ignoring colour 0, which denotes the noise cluster) so that labels are assigned in decreasing order of occurrence.

print_help()

List the keyboard shortcuts available.

show()

Start the interactive Colouriser.

clustbench.fit_predict_many(model, data, n_clusters)

Determine many clusterings of the same dataset.

For hierarchical methods, it is best if model is implemented smartly enough that, for the same data matrix and different values of n_clusters, it does not recompute the whole hierarchy from scratch.

Parameters:
model

An object equipped with the fit_predict and set_params methods (e.g., a scikit-learn-compatible estimator).

data : array-like

Data matrix.

n_clusters : int or list of ints

Number of clusters.

Returns:
results

A dictionary of label vectors, where results[K] gives the discovered K-partition.
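
Examples

A minimal usage sketch (it assumes scikit-learn is installed; its KMeans estimator provides the required fit_predict and set_params methods):

>>> import sklearn.cluster
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_x2 = clustbench.load_dataset("wut", "x2", url=data_url)
>>> model = sklearn.cluster.KMeans(n_init=10)
>>> res = clustbench.fit_predict_many(model, wut_x2.data, list(wut_x2.n_clusters))
>>> print(list(res.keys()))  # the requested numbers of clusters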

clustbench.get_battery_names(path=None, expanduser=True, expandvars=True)

Get the names of benchmark batteries in a given directory

Parameters:
path

Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns:
batteries

A list of strings.

Examples

>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))

clustbench.get_dataset_names(battery, path=None, expanduser=True, expandvars=True)

Get the names of datasets in a given benchmark battery

Parameters:
battery

Name of the battery (dataset collection), e.g., "wut" or "other". Can be an empty string or "." if all files are in a single directory as specified by path.

path

Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns:
datasets

A list of strings.

Examples

>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))

clustbench.get_score(labels, results, metric=genieclust.compare_partitions.normalized_clustering_accuracy, compute_max=True, warn_if_missing=True)

Computes a similarity score between the reference and the predicted partitions

Takes into account that there can be more than one ground truth partition and ignores the noise points (as explained in the Methodology section of the clustering benchmark framework’s website).

If labels is a single label vector, it will be wrapped inside a list. If results is not a dictionary, labels_list_to_dict will be called first.

Parameters:
labels

A vector-like object or a list thereof.

results

A dictionary of clustering results, where results[K] gives a K-partition.

metric : function

An external cluster validity measure; defaults to genieclust.compare_partitions.normalized_clustering_accuracy. It will be called like metric(y_true, y_pred).

compute_max : bool

Whether to return the maximum of the individual similarity scores (one per reference partition) rather than all of them.

warn_if_missing : bool

Warn if some results[K] is required, but missing.

Returns:
score : float or array thereof

The computed similarity scores: a vector of metric(y_true[y_true>0], results[max(y_true)][y_true>0]) over all y_true in labels, or the maximum thereof if compute_max is True.
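
Examples

A sketch that reuses the fit_predict_many example above (it assumes scikit-learn is installed and that genieclust is available for the default metric):

>>> import sklearn.cluster
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_x2 = clustbench.load_dataset("wut", "x2", url=data_url)
>>> model = sklearn.cluster.KMeans(n_init=10)
>>> res = clustbench.fit_predict_many(model, wut_x2.data, list(wut_x2.n_clusters))
>>> print(clustbench.get_score(wut_x2.labels, res))  # max over the reference partitions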

clustbench.labels_list_to_dict(labels)

Convert a list of labels to a dictionary indexed by n_clusters

If labels is a single label vector, it will be wrapped inside a list.

Parameters:
labels

A vector-like object or a list thereof.

Returns:
ret : dict

ret[max(ll)] = ll for each ll in labels.
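
Examples

A small illustrative sketch:

>>> import numpy as np
>>> import clustbench
>>> labels = [np.r_[1, 1, 2, 2], np.r_[1, 1, 2, 3]]
>>> d = clustbench.labels_list_to_dict(labels)
>>> print(sorted(d.keys()))  # the numbers of clusters: [2, 3]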

clustbench.load_dataset(battery, dataset, path=None, url=None, expanduser=True, expandvars=True, preprocess=True, random_state=None)

Load a benchmark dataset

Reads a dataset named battery/dataset.data.gz (relative to url or the directory path) as well as all the corresponding labels (battery/dataset.labels0.gz, battery/dataset.labels1.gz, …).

Parameters:
battery

Name of the battery (dataset collection), e.g., "wut" or "other".

dataset

Dataset name, e.g., "x2" or "iris".

path

Mutually exclusive with url. Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.

url

Mutually exclusive with path. For example, "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0" gives access to <https://github.com/gagolews/clustering-data-v1>.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

preprocess

Whether to call preprocess_data on the data matrix.

random_state

Seed of the random number generator; passed to preprocess_data.

Returns:
benchmark

A named tuple with the following elements:

battery

Same as the battery argument.

dataset

Same as the dataset argument.

description

Contents of the description file.

data : ndarray

Data matrix.

labels : list

A list consisting of the label vectors.

n_clusters : ndarray

The corresponding cluster counts: n_clusters[i] is equal to max(labels[i]).

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded suite)
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> wut_x2 = clustbench.load_dataset("wut", "x2", path=data_path)
>>> print(wut_x2.battery, wut_x2.dataset)
>>> print(wut_x2.description)
>>> print(wut_x2.data, wut_x2.labels)
>>> # load from GitHub (slow...):
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> print(wut_smile.data, wut_smile.labels)

clustbench.load_results(method_group, battery, dataset, n_clusters, path=None, expanduser=True, expandvars=True)

Load benchmark results

Reads the files named method_group/battery/dataset.resultK.gz (relative to the directory path) for each K in n_clusters. method_group can be a wildcard like "*" if results stored in multiple directories should be matched.

Parameters:
method_group

Name of the method group, e.g., "Genie", ".", or "*".

battery

Name of the battery (dataset collection), e.g., "wut" or "other".

dataset

Dataset name, e.g., "x2" or "iris".

n_clusters : int or list of ints

Number of clusters.

path

Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns:
results

A dictionary of dictionaries of label vectors that can be accessed like results[method_name][n_clusters].

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())

clustbench.preprocess_data(data, noise_factor=1e-06, random_state=None)

Normalise a data matrix

Removes all columns of zero variance (constant). Centres the data around the centroid (so that each column mean is 0). Scales all columns proportionally (so that the total variance is 1; note that this is not the same as standardisation: standard deviations in each column might still be different). Adds a tiny amount of noise to minimise the risk of having duplicate points.

Parameters:
data

Data matrix.

noise_factor

Standard deviation of the white noise added.

random_state

Seed of the random number generator; see scipy.stats.norm.rvs.

Returns:
data

A modified data matrix.

Examples

>>> import numpy as np
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset(
...     "wut", "smile", url=data_url, preprocess=False)
>>> np.random.seed(123)  # ensure reproducibility
>>> X = clustbench.preprocess_data(wut_smile.data)
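
The preprocessed matrix should be (up to the tiny amount of added noise) centred and of total variance 1; a quick sanity check:

>>> print(X.mean(axis=0))       # column means: approximately 0
>>> print(X.var(axis=0).sum())  # total variance: approximately 1
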
clustbench.save_data(filename, data, fmt='%g', expanduser=True, expandvars=True)

Write a data matrix for inclusion in the clustering benchmark suite

Parameters:
filename : string or file handle

For example, path_to_suite/battery/dataset.data.gz.

data : 2D array_like

A matrix-like object

fmt

See numpy.savetxt.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.
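
Examples

An illustrative sketch (the file name is hypothetical):

>>> import numpy as np
>>> import clustbench
>>> X = np.random.rand(100, 2)  # some data matrix
>>> clustbench.save_data("example.data.gz", X)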

clustbench.save_labels(filename, labels, expanduser=True, expandvars=True)

Write a label vector for inclusion in the clustering benchmark suite

Parameters:
filename : string or file handle

For example, path_to_suite/battery/dataset.labels0.gz.

labels : 1D array_like

A label vector.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.
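
Examples

An illustrative sketch (the file name is hypothetical):

>>> import numpy as np
>>> import clustbench
>>> y = np.r_[1, 1, 2, 2, 3, 3]  # a small label vector
>>> clustbench.save_labels("example.labels0.gz", y)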

clustbench.save_results(filename, results, expanduser=True, expandvars=True)

Write results of many clustering algorithms

Parameters:
filename : string or file handle

For example, method_group/battery/dataset.resultK.gz.

results : dict

A dictionary where each results[method_name] is a label vector.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())
>>> clustbench.save_results(
...     os.path.join(results_path, "method", "wut", "x2.result3.gz"),
...     clustbench.transpose_results(res)[3])

clustbench.transpose_results(results)

“Transpose” a results dictionary

Parameters:
results : dict

A dictionary of dictionaries or lists of objects.

Returns:
ret : dict

A dictionary such that ret[b][a] is taken from results[a][b]. If results[a] is not a dictionary, labels_list_to_dict will be called first.
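
Examples

A small illustrative sketch:

>>> import clustbench
>>> results = {
...     "methodA": {2: [1, 1, 2, 2], 3: [1, 2, 3, 3]},
...     "methodB": {2: [1, 2, 2, 2], 3: [1, 2, 2, 3]},
... }
>>> t = clustbench.transpose_results(results)
>>> print(t[2]["methodA"])  # the same as results["methodA"][2]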