Documentation#

clustbench.load_dataset(battery, dataset[, ...])

Load a benchmark dataset

clustbench.save_data(filename, data[, fmt, ...])

Write a data matrix for inclusion in the clustering benchmark suite

clustbench.save_labels(filename, labels[, ...])

Write a label vector for inclusion in the clustering benchmark suite

clustbench.preprocess_data(data[, ...])

Normalise a data matrix

clustbench.get_battery_names([path, ...])

Get the names of benchmark batteries in a given directory

clustbench.get_dataset_names(battery[, ...])

Get the names of datasets in a given benchmark battery

clustbench.load_results(method_group, ...[, ...])

Load benchmark results

clustbench.save_results(filename, results[, ...])

Write results of many clustering algorithms

clustbench.transpose_results(results)

"Transpose" a results dictionary

clustbench.labels_list_to_dict(labels)

Convert a list of labels to a dictionary indexed by n_clusters

clustbench.fit_predict_many(model, data, ...)

Determine many clusterings of the same dataset.

clustbench.get_score(labels, results[, ...])

Compute a similarity score between the reference and the predicted partitions

clustbench.Colouriser(data[, labels])

An interactive planar data editor

clustering-benchmarks Package

class clustbench.Colouriser(data, labels=None)#

An interactive planar data editor

See the dedicated section on the package homepage for more details.

Parameters:
data

An n-by-2 real matrix giving the coordinates of n planar points.

labels

Either a vector of n corresponding integer labels or None.

Examples

>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> clr = clustbench.Colouriser(wut_smile.data, wut_smile.labels[0])
>>> clr.print_help()
>>> clr.show()  # starts the interactive mode
>>> new_data = clr.get_data()
>>> new_labels = clr.get_labels()

Attributes:
data

The data matrix.

labels

A vector of n integer labels; 0 denotes the noise cluster

get_data()#

Get the current data matrix.

Returns:
data

get_labels()#

Get the current labels.

Returns:
labels

normalise_labels()#

Translate the current label vector (ignoring colour 0, which denotes the noise cluster) so that labels are assigned in decreasing order of occurrence.
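
The relabelling can be sketched as follows (a hypothetical stand-alone helper for illustration, not the Colouriser method itself): non-noise labels are renumbered 1, 2, … in decreasing order of occurrence, while noise points (label 0) are left untouched.

```python
from collections import Counter

# Hypothetical sketch of the documented relabelling; illustrative only.
def normalise_labels_sketch(labels):
    # count each non-noise label's occurrences
    counts = Counter(l for l in labels if l != 0)
    # most frequent label becomes 1, the next 2, and so on
    remap = {old: new for new, (old, _) in
             enumerate(counts.most_common(), start=1)}
    # noise points (0) are passed through unchanged
    return [remap.get(l, 0) for l in labels]

print(normalise_labels_sketch([0, 3, 3, 3, 7, 7, 5]))  # → [0, 1, 1, 1, 2, 2, 3]
```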

print_help()#

List the keyboard shortcuts available.

show()#

Start the interactive Colouriser.

clustbench.fit_predict_many(model, data, n_clusters)#

Determine many clusterings of the same dataset.

Ideally, for hierarchical methods, model would be implemented cleverly enough that, for the same data and different n_clusters, it does not recompute the whole hierarchy from scratch.

Parameters:
model

An object equipped with fit_predict and set_params methods (e.g., a scikit-learn-like class).

data : array-like

Data matrix.

n_clusters : int or list of ints

Number of clusters.

Returns:
results

A dictionary of label vectors, where results[K] gives the discovered K-partition.
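
The contract can be sketched as follows; ToyModel and fit_predict_many_sketch are hypothetical stand-ins for illustration only, not the clustbench implementation.

```python
# A minimal scikit-learn-like estimator with set_params and fit_predict.
class ToyModel:
    def set_params(self, n_clusters):
        self.n_clusters = n_clusters
        return self

    def fit_predict(self, X):
        # assign points to clusters round-robin (labels are 1-based here)
        return [i % self.n_clusters + 1 for i in range(len(X))]


def fit_predict_many_sketch(model, data, n_clusters):
    """For each K, reconfigure the model and collect the K-partition."""
    if isinstance(n_clusters, int):
        n_clusters = [n_clusters]
    return {
        k: model.set_params(n_clusters=k).fit_predict(data)
        for k in n_clusters
    }


X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
res = fit_predict_many_sketch(ToyModel(), X, [2, 3])
print(sorted(res.keys()))  # → [2, 3]
```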

clustbench.get_battery_names(path=None, expanduser=True, expandvars=True)#

Get the names of benchmark batteries in a given directory

Parameters:
path

Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns:
batteries

A list of strings.

Examples

>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))

clustbench.get_dataset_names(battery, path=None, expanduser=True, expandvars=True)#

Get the names of datasets in a given benchmark battery

Parameters:
battery

Name of the battery (dataset collection), e.g., "wut" or "other". Can be an empty string or "." if all files are in a single directory as specified by path.

path

Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns:
datasets

A list of strings.

Examples

>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))

clustbench.get_score(labels, results, metric=<built-in function normalized_clustering_accuracy>, compute_max=True, warn_if_missing=True)#

Compute a similarity score between the reference and the predicted partitions

Takes into account that there can be more than one ground truth partition and ignores the noise points (as explained in the Methodology section of the clustering benchmark framework’s website).

If labels is a single label vector, it will be wrapped inside a list. If results is not a dictionary, labels_list_to_dict will be called first.

Parameters:
labels

A vector-like object or a list thereof.

results

A dictionary of clustering results, where results[K] gives a K-partition.

metric : function

An external cluster validity measure; defaults to genieclust.compare_partitions.normalized_clustering_accuracy. It will be called like metric(y_true, y_pred).

compute_max : bool

Whether to take the maximum over the individual similarity scores.

warn_if_missing : bool

Whether to warn if some results[K] is required but missing.

Returns:
score : float or array thereof

The computed similarity scores. Ultimately, it is a vector of metric(y_true[y_true>0], results[max(y_true)][y_true>0]) over all y_true in labels or the maximum thereof if compute_max is True.
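
The scoring semantics can be sketched like this; the helper and the toy metric are hypothetical illustrations (clustbench defaults to genieclust's normalised clustering accuracy, which is not reproduced here).

```python
def toy_metric(y_true, y_pred):
    # fraction of exact label matches; a stand-in for a real partition
    # similarity measure (which would also optimise over label permutations)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def get_score_sketch(labels, results, metric=toy_metric, compute_max=True):
    if not isinstance(labels[0], (list, tuple)):  # single label vector?
        labels = [labels]
    scores = []
    for y_true in labels:
        k = max(y_true)          # the reference partition has k clusters
        y_pred = results[k]      # compare against the discovered k-partition
        # noise points (label 0) in the reference are ignored
        keep = [i for i, t in enumerate(y_true) if t > 0]
        scores.append(metric([y_true[i] for i in keep],
                             [y_pred[i] for i in keep]))
    return max(scores) if compute_max else scores

y_true = [0, 1, 1, 2, 2]          # 0 marks a noise point
results = {2: [9, 1, 1, 2, 1]}
print(get_score_sketch(y_true, results))  # → 0.75
```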

clustbench.labels_list_to_dict(labels)#

Convert a list of labels to a dictionary indexed by n_clusters

If labels is a single label vector, it will be wrapped inside a list.

Parameters:
labels

A vector-like object or a list thereof.

Returns:
ret : dict

ret[max(ll)] = ll for each ll in labels.
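
The documented behaviour amounts to the following one-liner (an illustrative sketch, not the clustbench source):

```python
# map each label vector to its cluster count, max(ll)
def labels_list_to_dict_sketch(labels):
    if not isinstance(labels[0], (list, tuple)):  # single vector → wrap it
        labels = [labels]
    return {max(ll): ll for ll in labels}

print(labels_list_to_dict_sketch([[1, 1, 2, 2], [1, 2, 3, 1]]))
# → {2: [1, 1, 2, 2], 3: [1, 2, 3, 1]}
```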

clustbench.load_dataset(battery, dataset, path=None, url=None, expanduser=True, expandvars=True, preprocess=True, random_state=None)#

Load a benchmark dataset

Reads a dataset named battery/dataset.data.gz (relative to url or the directory path) as well as all the corresponding labels (battery/dataset.labels0.gz, battery/dataset.labels1.gz, …).

Parameters:
battery

Name of the battery (dataset collection), e.g., "wut" or "other".

dataset

Dataset name, e.g., "x2" or "iris".

path

Mutually exclusive with url. Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.

url

Mutually exclusive with path. For example, "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0" to access <https://github.com/gagolews/clustering-data-v1>.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

preprocess

Whether to call preprocess_data on the data matrix.

random_state

Seed of the random number generator; passed to preprocess_data.

Returns:
benchmark

A named tuple with the following elements:

battery

Same as the battery argument.

dataset

Same as the dataset argument.

description

Contents of the description file.

data : ndarray

Data matrix.

labels : list

A list consisting of the label vectors.

n_clusters : ndarray

The corresponding cluster counts: n_clusters[i] is equal to max(labels[i]).

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded suite)
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> wut_x2 = clustbench.load_dataset("wut", "x2", path=data_path)
>>> print(wut_x2.battery, wut_x2.dataset)
>>> print(wut_x2.description)
>>> print(wut_x2.data, wut_x2.labels)
>>> # load from GitHub (slow...):
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> print(wut_smile.data, wut_smile.labels)

clustbench.load_results(method_group, battery, dataset, n_clusters, path=None, expanduser=True, expandvars=True)#

Load benchmark results

Reads the files named method_group/battery/dataset.resultK.gz (relative to the directory path) for each K in n_clusters. method_group can be a wildcard such as "*" if a lookup in multiple directories is required.

Parameters:
method_group

Name of the method group, e.g., "Genie", ".", or "*".

battery

Name of the battery (dataset collection), e.g., "wut" or "other".

dataset

Dataset name, e.g., "x2" or "iris".

n_clustersint or list of ints

Number of clusters.

path

Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Returns:
results

A dictionary of dictionaries of label vectors that can be accessed like results[method_name][n_clusters].

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())

clustbench.preprocess_data(data, noise_factor=1e-06, random_state=None)#

Normalise a data matrix

Removes all columns of zero variance (constant). Centres the data around the centroid (so that each column mean is 0). Scales all columns proportionally (so that the total variance is 1; note that this is not the same as standardisation: standard deviations in each column might still be different). Adds a tiny amount of noise to minimise the risk of having duplicate points.
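
The steps above can be sketched with NumPy as follows; preprocess_data_sketch is a hypothetical helper name, and the actual clustbench implementation may differ in details such as how the noise is generated.

```python
import numpy as np

def preprocess_data_sketch(data, noise_factor=1e-6, random_state=None):
    X = np.asarray(data, dtype=float)
    X = X[:, X.var(axis=0) > 0]                  # drop constant columns
    X = X - X.mean(axis=0)                       # centre around the centroid
    X = X / np.sqrt(X.var(axis=0).sum())         # total variance becomes 1
    rng = np.random.default_rng(random_state)
    X += rng.normal(0.0, noise_factor, X.shape)  # break exact duplicates
    return X

# second column is constant, hence removed
X = preprocess_data_sketch([[1.0, 5.0, 7.0], [2.0, 5.0, 9.0], [3.0, 5.0, 8.0]])
```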

Parameters:
data

Data matrix.

noise_factor

Standard deviation of the white noise added.

random_state

Seed of the random number generator; see scipy.stats.norm.rvs.

Returns:
data

A modified data matrix.

Examples

>>> import numpy as np
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset(
...     "wut", "smile", url=data_url, preprocess=False)
>>> np.random.seed(123)  # ensure reproducibility
>>> X = clustbench.preprocess_data(wut_smile.data)

clustbench.save_data(filename, data, fmt='%g', expanduser=True, expandvars=True)#

Write a data matrix for inclusion in the clustering benchmark suite

Parameters:
filename : string or file handle

For example, path_to_suite/battery/dataset.data.gz.

data : 2D array_like

A matrix-like object.

fmt

See numpy.savetxt.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

clustbench.save_labels(filename, labels, expanduser=True, expandvars=True)#

Write a label vector for inclusion in the clustering benchmark suite

Parameters:
filename : string or file handle

For example, path_to_suite/battery/dataset.labels0.gz.

labels : 1D array_like

A label vector.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

clustbench.save_results(filename, results, expanduser=True, expandvars=True)#

Write results of many clustering algorithms

Parameters:
filename : string or file handle

For example, method_group/battery/dataset.resultK.gz.

results : dict

A dictionary where each results[method_name] is a label vector.

expanduser

Whether to call os.path.expanduser on the file path.

expandvars

Whether to call os.path.expandvars on the file path.

Examples

>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())
>>> clustbench.save_results("x1.result3.gz", clustbench.transpose_results(res)[3])

clustbench.transpose_results(results)#

“Transpose” a results dictionary

Parameters:
results : dict

A dictionary of dictionaries or lists of objects.

Returns:
ret : dict

A dictionary such that ret[b][a] is taken from results[a][b]. If results[a] is not a dictionary, labels_list_to_dict will be called first.
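
The documented behaviour can be sketched as follows (an illustrative helper, not the clustbench source):

```python
def transpose_results_sketch(results):
    ret = {}
    for a, inner in results.items():
        if not isinstance(inner, dict):        # list of label vectors →
            inner = {max(ll): ll for ll in inner}  # dict indexed by max(ll)
        for b, obj in inner.items():
            ret.setdefault(b, {})[a] = obj     # swap the two index levels
    return ret

res = {"Genie": {2: [1, 1, 2], 3: [1, 2, 3]},
       "KMeans": {2: [2, 2, 1]}}
t = transpose_results_sketch(res)
print(sorted(t[2].keys()))  # → ['Genie', 'KMeans']
```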