Documentation¶

- Load a benchmark dataset
- Write a data matrix for inclusion in the clustering benchmark suite
- Write a label vector for inclusion in the clustering benchmark suite
- Normalise a data matrix
- Get the names of benchmark batteries in a given directory
- Get the names of datasets in a given benchmark battery
- Load benchmark results
- Write results of many clustering algorithms
- "Transpose" a results dictionary
- Convert a list of labels to a dictionary indexed by n_clusters
- Determine many clusterings of the same dataset
- Compute a similarity score between the reference and the predicted partitions
- An interactive planar data editor
clustering-benchmarks Package
- class clustbench.Colouriser(data, labels=None)¶
An interactive planar data editor
See the dedicated section on the package homepage for more details.
- Parameters:
- data
An n-by-2 real matrix giving the coordinates of n planar points.
- labels
Either a vector of n corresponding integer labels or None.
Examples
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> clr = clustbench.Colouriser(wut_smile.data, wut_smile.labels[0])
>>> clr.print_help()
>>> clr.show()  # starts the interactive mode
>>> new_data = clr.get_data()
>>> new_labels = clr.get_labels()
- Attributes:
- data
The data matrix.
- labels
A vector of n integer labels; 0 denotes the noise cluster.
- get_data()¶
Get the current data matrix.
- Returns:
- data
- get_labels()¶
Get the current labels.
- Returns:
- labels
- normalise_labels()¶
Translate the current label vector (ignoring colour 0, which denotes the noise cluster) so that labels are assigned in decreasing order of occurrence.
- print_help()¶
List the keyboard shortcuts available.
- show()¶
Start the interactive Colouriser.
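The relabelling performed by normalise_labels can be sketched in plain Python. This is only an illustration of the documented behaviour (the most frequent non-noise cluster becomes 1, the next 2, and so on), not the library's implementation; the helper name is hypothetical:

```python
from collections import Counter

# Sketch of the documented normalise_labels behaviour: relabel non-noise
# clusters so that the most frequent one becomes 1, the next 2, and so on;
# colour 0 (the noise cluster) is left untouched. Hypothetical helper name.
def normalise_labels_sketch(labels):
    counts = Counter(l for l in labels if l != 0)
    remap = {old: new for new, (old, _) in
             enumerate(counts.most_common(), start=1)}
    return [remap.get(l, 0) for l in labels]

print(normalise_labels_sketch([3, 3, 3, 7, 7, 0, 5]))  # [1, 1, 1, 2, 2, 0, 3]
```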
- clustbench.fit_predict_many(model, data, n_clusters)¶
Determine many clusterings of the same dataset.
Ideally, for hierarchical methods, model would be implemented smartly enough that, for the same X and different n_clusters, it does not recompute the whole hierarchy from scratch.
- Parameters:
- model
An object equipped with fit_predict and set_params methods (e.g., a scikit-learn-like class).
- data : array-like
Data matrix.
- n_clusters : int or list of ints
Number of clusters.
- Returns:
- results
A dictionary of label vectors, where results[K] gives the discovered K-partition.
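As a sketch of the semantics described above, the following is a minimal stand-alone implementation, assuming a scikit-learn-style set_params method; FakeModel is a hypothetical estimator used only for illustration, not a real clustering algorithm:

```python
# Minimal sketch of fit_predict_many: re-parameterise the model for each K
# and collect the resulting K-partitions in a dictionary keyed by K.
def fit_predict_many_sketch(model, data, n_clusters):
    if isinstance(n_clusters, int):
        n_clusters = [n_clusters]
    results = {}
    for k in n_clusters:
        model.set_params(n_clusters=k)
        results[k] = model.fit_predict(data)  # a K-partition of data
    return results

class FakeModel:
    # hypothetical stand-in: assigns points to clusters round-robin
    def set_params(self, n_clusters):
        self.k = n_clusters
    def fit_predict(self, data):
        return [i % self.k + 1 for i in range(len(data))]

res = fit_predict_many_sketch(FakeModel(), [[0.0, 0.0]] * 6, [2, 3])
print(sorted(res.keys()))  # [2, 3]
print(res[3])              # [1, 2, 3, 1, 2, 3]
```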
- clustbench.get_battery_names(path=None, expanduser=True, expandvars=True)¶
Get the names of benchmark batteries in a given directory
- Parameters:
- path
Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory.
- expanduser
Whether to call os.path.expanduser on the file path.
- expandvars
Whether to call os.path.expandvars on the file path.
- Returns:
- batteries
A list of strings.
Examples
>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))
- clustbench.get_dataset_names(battery, path=None, expanduser=True, expandvars=True)¶
Get the names of datasets in a given benchmark battery
- Parameters:
- battery
Name of the battery (dataset collection), e.g., "wut" or "other". Can be an empty string or "." if all files are in a single directory as specified by path.
- path
Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory.
- expanduser
Whether to call os.path.expanduser on the file path.
- expandvars
Whether to call os.path.expandvars on the file path.
- Returns:
- datasets
A list of strings.
Examples
>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))
- clustbench.get_score(labels, results, metric=<cyfunction normalized_clustering_accuracy>, compute_max=True, warn_if_missing=True)¶
Compute a similarity score between the reference and the predicted partitions
Takes into account that there can be more than one ground-truth partition and ignores the noise points (as explained in the Methodology section of the clustering benchmark framework's website).
If labels is a single label vector, it will be wrapped inside a list. If results is not a dictionary, labels_list_to_dict will be called first.
- Parameters:
- labels
A vector-like object or a list thereof.
- results
A dictionary of clustering results, where results[K] gives a K-partition.
- metric : function
An external cluster validity measure; defaults to genieclust.compare_partitions.normalized_clustering_accuracy. It will be called like metric(y_true, y_pred).
- compute_max : bool
Whether to apply max on the particular similarity scores.
- warn_if_missing : bool
Warn if some results[K] is required but missing.
- Returns:
- score : float or array thereof
The computed similarity scores. Ultimately, it is a vector of metric(y_true[y_true>0], results[max(y_true)][y_true>0]) over all y_true in labels, or the maximum thereof if compute_max is True.
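The scoring logic can be illustrated with a simplified stand-alone sketch. The matching-rate metric below is a toy stand-in for the library's default, genieclust.compare_partitions.normalized_clustering_accuracy, and the helper name is hypothetical:

```python
# Simplified sketch of the documented scoring logic: for each reference
# label vector, pick the partition with the matching number of clusters,
# drop the noise points (reference label 0), apply the metric, and
# optionally take the maximum over all reference partitions.
def get_score_sketch(labels, results, metric, compute_max=True):
    if not isinstance(labels[0], (list, tuple)):
        labels = [labels]  # wrap a single label vector in a list
    scores = []
    for y_true in labels:
        y_pred = results[max(y_true)]  # partition with the right K
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t > 0]
        scores.append(metric([t for t, _ in pairs],
                             [p for _, p in pairs]))
    return max(scores) if compute_max else scores

# toy metric: fraction of identical labels (illustration only)
match_rate = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)
s = get_score_sketch([0, 1, 1, 2], {2: [9, 1, 1, 2]}, match_rate)
print(s)  # 1.0 -- the noise point at index 0 is ignored
```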
- clustbench.labels_list_to_dict(labels)¶
Convert a list of labels to a dictionary indexed by n_clusters
If labels is a single label vector, it will be wrapped inside a list.
- Parameters:
- labels
A vector-like object or a list thereof.
- Returns:
- ret : dict
ret[max(ll)] = ll for each ll in labels.
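The documented behaviour amounts to indexing each label vector by its maximum. A minimal pure-Python sketch (hypothetical helper name, not the library's own code):

```python
# Sketch of the documented ret[max(ll)] = ll behaviour.
def labels_list_to_dict_sketch(labels):
    if not isinstance(labels[0], (list, tuple)):
        labels = [labels]  # a single label vector -> wrap in a list
    return {max(ll): ll for ll in labels}

d = labels_list_to_dict_sketch([[1, 1, 2], [1, 2, 3, 3]])
print(sorted(d.keys()))  # [2, 3]
print(d[2])              # [1, 1, 2]
```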
- clustbench.load_dataset(battery, dataset, path=None, url=None, expanduser=True, expandvars=True, preprocess=True, random_state=None)¶
Load a benchmark dataset
Reads a dataset named battery/dataset.data.gz (relative to url or the directory path) as well as all the corresponding labels (battery/dataset.labels0.gz, battery/dataset.labels1.gz, …).
- Parameters:
- battery
Name of the battery (dataset collection), e.g., "wut" or "other".
- dataset
Dataset name, e.g., "x2" or "iris".
- path
Mutually exclusive with url. Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory.
- url
Mutually exclusive with path. For example, "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0" gives access to <https://github.com/gagolews/clustering-data-v1>.
- expanduser
Whether to call os.path.expanduser on the file path.
- expandvars
Whether to call os.path.expandvars on the file path.
- preprocess
Whether to call preprocess_data on the data matrix.
- random_state
Seed of the random number generator; passed to preprocess_data.
- Returns:
- benchmark
A named tuple with the following elements:
- battery
Same as the battery argument.
- dataset
Same as the dataset argument.
- description
Contents of the description file.
- data : ndarray
Data matrix.
- labels : list
A list consisting of the label vectors.
- n_clusters : ndarray
The corresponding cluster counts: n_clusters[i] is equal to max(labels[i]).
Examples
>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded suite)
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> wut_x2 = clustbench.load_dataset("wut", "x2", path=data_path)
>>> print(wut_x2.battery, wut_x2.dataset)
>>> print(wut_x2.description)
>>> print(wut_x2.data, wut_x2.labels)
>>> # load from GitHub (slow...):
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> print(wut_smile.data, wut_smile.labels)
- clustbench.load_results(method_group, battery, dataset, n_clusters, path=None, expanduser=True, expandvars=True)¶
Load benchmark results
Reads the datasets named like method_group/battery/dataset.resultK.gz (relative to the directory path), for each K in n_clusters. method_group can be a wildcard like "*" if a lookup in multiple directories is required.
- Parameters:
- method_group
Name of the method group, e.g., "Genie", ".", or "*".
- battery
Name of the battery (dataset collection), e.g., "wut" or "other".
- dataset
Dataset name, e.g., "x2" or "iris".
- n_clusters : int or list of ints
Number of clusters.
Number of clusters.
- path
Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory.
- expanduser
Whether to call os.path.expanduser on the file path.
- expandvars
Whether to call os.path.expandvars on the file path.
- Returns:
- results
A dictionary of dictionaries of label vectors that can be accessed like results[method_name][n_clusters].
Examples
>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())
- clustbench.preprocess_data(data, noise_factor=1e-06, random_state=None)¶
Normalise a data matrix
Removes all columns of zero variance (constant). Centres the data around the centroid (so that each column mean is 0). Scales all columns proportionally (so that the total variance is 1; note that this is not the same as standardisation: standard deviations in each column might still be different). Adds a tiny amount of noise to minimise the risk of having duplicate points.
- Parameters:
- data
Data matrix.
- noise_factor
Standard deviation of the white noise added.
- random_state
Seed of the random number generator; see scipy.stats.norm.rvs.
- Returns:
- data
A modified data matrix.
Examples
>>> import os.path
>>> import numpy as np
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset(
...     "wut", "smile", url=data_url, preprocess=False)
>>> np.random.seed(123)  # assure reproducibility
>>> X = clustbench.preprocess_data(wut_smile.data)
- clustbench.save_data(filename, data, fmt='%g', expanduser=True, expandvars=True)¶
Write a data matrix for inclusion in the clustering benchmark suite
- Parameters:
- filename : string or file handle
For example, path_to_suite/battery/dataset.data.gz.
- data : 2D array_like
A matrix-like object.
- fmt
See numpy.savetxt.
- expanduser
Whether to call os.path.expanduser on the file path.
- expandvars
Whether to call os.path.expandvars on the file path.
- clustbench.save_labels(filename, labels, expanduser=True, expandvars=True)¶
Write a label vector for inclusion in the clustering benchmark suite
- Parameters:
- filename : string or file handle
For example, path_to_suite/battery/dataset.labels0.gz.
- labels : 1D array_like
A label vector.
- expanduser
Whether to call os.path.expanduser on the file path.
- expandvars
Whether to call os.path.expandvars on the file path.
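The label files in the suite are plain gzip-compressed text with one integer label per line (the library itself writes them via numpy.savetxt). A stdlib-only sketch of this format, with a hypothetical helper name:

```python
import gzip
import os
import tempfile

# Hypothetical sketch of the on-disk label format: one integer per line,
# gzip-compressed text. Illustration only; not the library's implementation.
def save_labels_sketch(filename, labels):
    filename = os.path.expanduser(filename)
    with gzip.open(filename, "wt") as f:
        f.writelines("%d\n" % int(l) for l in labels)

path = os.path.join(tempfile.mkdtemp(), "dataset.labels0.gz")
save_labels_sketch(path, [1, 1, 2, 0])
with gzip.open(path, "rt") as f:
    print(f.read().split())  # ['1', '1', '2', '0']
```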
- clustbench.save_results(filename, results, expanduser=True, expandvars=True)¶
Write results of many clustering algorithms
- Parameters:
- filename : string or file handle
For example, method_group/battery/dataset.resultK.gz.
- results : dict
A dictionary where each results[method_name] is a label vector.
- expanduser
Whether to call os.path.expanduser on the file path.
- expandvars
Whether to call os.path.expandvars on the file path.
Examples
>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())
>>> clustbench.save_results(
...     os.path.join(results_path, "method", "wut", "x2.result3.gz"),
...     clustbench.transpose_results(res)[3])
- clustbench.transpose_results(results)¶
“Transpose” a results dictionary
- Parameters:
- results : dict
A dictionary of dictionaries or lists of objects.
- Returns:
- ret : dict
A dictionary such that ret[b][a] is taken from results[a][b]. If results[a] is not a dictionary, labels_list_to_dict will be called first.
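The "transposition" simply swaps the two levels of indexing, so that a results[method_name][n_clusters] mapping becomes ret[n_clusters][method_name]. A minimal sketch (hypothetical helper name, not the library's code):

```python
# Swap the two levels of indexing in a dict of dicts:
# ret[b][a] is taken from results[a][b].
def transpose_results_sketch(results):
    ret = {}
    for a, inner in results.items():
        for b, v in inner.items():
            ret.setdefault(b, {})[a] = v
    return ret

res = {"Genie": {2: [1, 1, 2], 3: [1, 2, 3]},
       "KMeans": {2: [1, 2, 2]}}
t = transpose_results_sketch(res)
print(sorted(t.keys()))  # [2, 3]
print(t[2]["KMeans"])    # [1, 2, 2]
```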