Documentation
| load_dataset | Load a benchmark dataset |
| save_data | Write a data matrix for inclusion in the clustering benchmark suite |
| save_labels | Write a label vector for inclusion in the clustering benchmark suite |
| preprocess_data | Normalise a data matrix |
| get_battery_names | Get the names of benchmark batteries in a given directory |
| get_dataset_names | Get the names of datasets in a given benchmark battery |
| load_results | Load benchmark results |
| save_results | Write results of many clustering algorithms |
| transpose_results | "Transpose" a results dictionary |
| labels_list_to_dict | Convert a list of labels to a dictionary indexed by n_clusters |
| fit_predict_many | Determine many clusterings of the same dataset |
| get_score | Compute a similarity score between the reference and the predicted partitions |
| Colouriser | An interactive planar data editor |
clustering-benchmarks Package
- class clustbench.Colouriser(data, labels=None)
- An interactive planar data editor.
- See the dedicated section on the package homepage for more details.
- Parameters:
- data
- An n-by-2 real matrix giving the coordinates of n planar points. 
- labels
- Either a vector of n corresponding integer labels or None.
 
- Attributes:
- data
- The data matrix. 
- labels
- A vector of n integer labels; 0 denotes the noise cluster.
 
 - Examples
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> clr = clustbench.Colouriser(wut_smile.data, wut_smile.labels[0])
>>> clr.print_help()
>>> clr.show()  # starts the interactive mode
>>> new_data = clr.get_data()
>>> new_labels = clr.get_labels()
 - get_data()
- Get the current data matrix.
- Returns:
- data
 
 
 - get_labels()
- Get the current labels.
- Returns:
- labels
 
 
 - normalise_labels()
- Translate the current label vector (ignoring colour 0, which denotes the noise cluster) so that labels are assigned in decreasing order of occurrence. 
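The relabelling described above can be illustrated with a small standalone sketch. The helper name `relabel_by_frequency` is hypothetical and is not part of clustbench; breaking ties by order of first appearance is an assumption of this sketch, not documented behaviour.

```python
from collections import Counter


def relabel_by_frequency(labels):
    """Renumber positive labels so that 1 is the most frequent, 2 the next, etc.

    Hypothetical helper for illustration only. Label 0 (noise) is left as-is.
    """
    counts = Counter(l for l in labels if l > 0)
    # most_common() sorts by decreasing count; equal counts keep first-seen order
    mapping = {old: new
               for new, (old, _) in enumerate(counts.most_common(), start=1)}
    return [mapping.get(l, 0) for l in labels]


example = [3, 3, 0, 1, 3, 2, 2, 0]
relabelled = relabel_by_frequency(example)
print(relabelled)
```

Here the old label 3 occurs most often and becomes 1, while the noise points keep label 0.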
 - print_help()
- List the keyboard shortcuts available. 
 - show()
- Start the interactive Colouriser. 
 
- clustbench.fit_predict_many(model, data, n_clusters)
- Determine many clusterings of the same dataset.
- Ideally, for hierarchical methods, model should be implemented so that, for the same data matrix and different n_clusters, it does not recompute the whole hierarchy from scratch.
- Parameters:
- model
- An object equipped with fit_predict and set_params methods (e.g., a scikit-learn-like class).
- data : array-like
- Data matrix. 
- n_clusters : int or list of ints
- Number of clusters. 
 
- Returns:
- results
- A dictionary of label vectors, where results[K] gives the discovered K-partition.
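As a rough illustration of the contract described above, the following sketch loops over the requested cluster counts and collects one label vector per K. It assumes the model follows the scikit-learn convention of `set_params(**params)`; `ThresholdModel` and `fit_predict_many_sketch` are hypothetical names, and this is not the package's actual implementation.

```python
class ThresholdModel:
    """Toy stand-in for a scikit-learn-like estimator (hypothetical)."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit_predict(self, data):
        # Split the sorted 1-D values into n_clusters groups of near-equal size
        order = sorted(range(len(data)), key=lambda i: data[i])
        labels = [0] * len(data)
        for rank, i in enumerate(order):
            labels[i] = rank * self.n_clusters // len(data) + 1
        return labels


def fit_predict_many_sketch(model, data, n_clusters):
    """Return {K: labels} for each requested K (illustration only)."""
    if isinstance(n_clusters, int):
        n_clusters = [n_clusters]
    results = {}
    for k in sorted(set(n_clusters)):
        model.set_params(n_clusters=k)  # reuse the same model object for each K
        results[k] = model.fit_predict(data)
    return results


results = fit_predict_many_sketch(ThresholdModel(), [0.1, 0.9, 0.2, 0.8], [2, 4])
print(results)
```

A model that caches its hierarchy between calls would make each iteration of this loop cheap, which is the point of the remark about hierarchical methods.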
 
 
- clustbench.get_battery_names(path=None, expanduser=True, expandvars=True)
- Get the names of benchmark batteries in a given directory
- Parameters:
- path
- Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory. 
- expanduser
- Whether to call os.path.expanduser on the file path.
- expandvars
- Whether to call os.path.expandvars on the file path.
 
- Returns:
- batteries
- A list of strings. 
 
 - Examples
>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))
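The expanduser and expandvars flags refer to two standard-library functions. The short sketch below shows what each does to a path like the one in the example; the environment variable `CLUSTBENCH_DEMO_DIR` is a hypothetical name introduced only for this illustration.

```python
import os

# Hypothetical environment variable, set here only for the demonstration
os.environ["CLUSTBENCH_DEMO_DIR"] = "clustering-data-v1"

raw = os.path.join("~", "Projects", "$CLUSTBENCH_DEMO_DIR")

# expanduser replaces a leading "~" with the user's home directory;
# expandvars substitutes environment variables such as $CLUSTBENCH_DEMO_DIR
expanded = os.path.expandvars(os.path.expanduser(raw))
print(raw)
print(expanded)
```

Passing expanduser=False or expandvars=False is presumably useful when a directory name really contains a literal "~" or "$".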
- clustbench.get_dataset_names(battery, path=None, expanduser=True, expandvars=True)
- Get the names of datasets in a given benchmark battery
- Parameters:
- battery
- Name of the battery (dataset collection), e.g., "wut" or "other". Can be an empty string or "." if all files are in a single directory as specified by path.
- path
- Path to the directory containing the downloaded benchmark dataset suite. Defaults to the current working directory. 
- expanduser
- Whether to call os.path.expanduser on the file path.
- expandvars
- Whether to call os.path.expandvars on the file path.
 
- Returns:
- datasets
- A list of strings. 
 
 - Examples
>>> import os.path
>>> import clustbench
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> print(clustbench.get_battery_names(data_path))
>>> print(clustbench.get_dataset_names("wut", data_path))
- clustbench.get_score(labels, results, metric=<cyfunction normalized_clustering_accuracy>, compute_max=True, warn_if_missing=True)
- Compute a similarity score between the reference and the predicted partitions
- Takes into account that there can be more than one ground truth partition and ignores the noise points (as explained in the Methodology section of the clustering benchmark framework's website).
- If labels is a single label vector, it will be wrapped inside a list. If results is not a dictionary, labels_list_to_dict will be called first.
- Parameters:
- labels
- A vector-like object or a list thereof. 
- results
- A dictionary of clustering results, where results[K] gives a K-partition.
- metric : function
- An external cluster validity measure; defaults to genieclust.compare_partitions.normalized_clustering_accuracy. It will be called like metric(y_true, y_pred).
- compute_max : bool
- Whether to apply max on the particular similarity scores.
- warn_if_missing : bool
- Warn if some results[K] is required, but missing.
 
- Returns:
- score : float or array thereof
- The computed similarity scores. Ultimately, it is a vector of metric(y_true[y_true>0], results[max(y_true)][y_true>0]) over all y_true in labels, or the maximum thereof if compute_max is True.
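The formula in the Returns entry can be spelled out in plain Python. This is a sketch of the described computation, not the package's code: `get_score_sketch` and `simple_accuracy` are hypothetical names, and `simple_accuracy` is a deliberately naive stand-in that, unlike a real external validity measure, is not invariant to label permutations.

```python
def simple_accuracy(y_true, y_pred):
    """Hypothetical stand-in metric: fraction of positions with equal labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def get_score_sketch(labels, results, metric=simple_accuracy, compute_max=True):
    """Score results[K] against each reference vector, skipping noise points."""
    scores = []
    for y_true in labels:
        k = max(y_true)          # the reference vector defines a K-partition
        y_pred = results[k]      # compare with the predicted K-partition
        keep = [i for i, t in enumerate(y_true) if t > 0]  # drop noise (label 0)
        scores.append(metric([y_true[i] for i in keep],
                             [y_pred[i] for i in keep]))
    return max(scores) if compute_max else scores


# Two reference labellings of the same five points; the last point is noise
reference = [[1, 1, 2, 2, 0], [1, 2, 3, 3, 0]]
predicted = {2: [1, 1, 2, 1, 2], 3: [1, 2, 3, 3, 1]}

all_scores = get_score_sketch(reference, predicted, compute_max=False)
best_score = get_score_sketch(reference, predicted)
print(all_scores, best_score)
```

The predicted label of the noise point never enters the comparison, so it does not affect either score.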
 
 
- clustbench.labels_list_to_dict(labels)
- Convert a list of labels to a dictionary indexed by n_clusters
- If labels is a single label vector, it will be wrapped inside a list.
- Parameters:
- labels
- A vector-like object or a list thereof. 
 
- Returns:
- ret : dict
- ret[max(ll)] = ll for each ll in labels.
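The rule `ret[max(ll)] = ll` amounts to a one-line dictionary comprehension. The sketch below uses plain lists and a hypothetical function name; how the real function detects a single vector is not stated here, so the check on the first element is an assumption of this sketch.

```python
def labels_list_to_dict_sketch(labels):
    """Index each label vector by its largest label (illustration only)."""
    # Assumption: a single vector is recognised by its first element
    # not being a list or tuple itself
    if labels and not isinstance(labels[0], (list, tuple)):
        labels = [labels]
    return {max(ll): ll for ll in labels}


by_k = labels_list_to_dict_sketch([[1, 2, 2], [1, 2, 3]])
single = labels_list_to_dict_sketch([1, 1, 2])
print(by_k, single)
```

Note that two vectors with the same maximum label would map to the same key, so the later one would overwrite the earlier one in this sketch.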
 
 
- clustbench.load_dataset(battery, dataset, path=None, url=None, expanduser=True, expandvars=True, preprocess=True, random_state=None)
- Load a benchmark dataset
- Reads a dataset named battery/dataset.data.gz (relative to url or the directory path) as well as all the corresponding labels (battery/dataset.labels0.gz, battery/dataset.labels1.gz, …).
- Parameters:
- battery
- Name of the battery (dataset collection), e.g., "wut" or "other".
- dataset
- Dataset name, e.g., "x2" or "iris".
- path
- Mutually exclusive with url. Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory. 
- url
- Mutually exclusive with path. For example, "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0" to get access to <https://github.com/gagolews/clustering-data-v1>.
- expanduser
- Whether to call os.path.expanduser on the file path.
- expandvars
- Whether to call os.path.expandvars on the file path.
- preprocess
- Whether to call preprocess_data on the data matrix.
- random_state
- Seed of the random number generator; passed to preprocess_data.
 
- Returns:
- benchmark
- A named tuple with the following elements:
- battery
- Same as the battery argument. 
- dataset
- Same as the dataset argument. 
- description
- Contents of the description file. 
- data : ndarray
- Data matrix. 
- labels : list
- A list consisting of the label vectors. 
- n_clusters : ndarray
- The corresponding cluster counts: n_clusters[i] is equal to max(labels[i]).
 
 
 - Examples
>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded suite)
>>> data_path = os.path.join("~", "Projects", "clustering-data-v1")  # up to you
>>> wut_x2 = clustbench.load_dataset("wut", "x2", path=data_path)
>>> print(wut_x2.battery, wut_x2.dataset)
>>> print(wut_x2.description)
>>> print(wut_x2.data, wut_x2.labels)
>>> # load from GitHub (slow...):
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset("wut", "smile", url=data_url)
>>> print(wut_smile.data, wut_smile.labels)
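The file naming scheme described for load_dataset (battery/dataset.data.gz plus battery/dataset.labels0.gz, battery/dataset.labels1.gz, …) can be sketched as follows. The helper `list_dataset_files`, the battery name "toy" and the dataset name "demo" are hypothetical; the sketch only locates files in a temporary directory and assumes that label files are numbered consecutively from 0.

```python
import gzip
import os
import tempfile


def list_dataset_files(path, battery, dataset):
    """Return the data file path and the consecutive label file paths."""
    base = os.path.join(path, battery, dataset)
    data_file = base + ".data.gz"
    label_files = []
    i = 0
    # Assumption: labels0, labels1, ... are numbered without gaps
    while os.path.isfile(f"{base}.labels{i}.gz"):
        label_files.append(f"{base}.labels{i}.gz")
        i += 1
    return data_file, label_files


with tempfile.TemporaryDirectory() as root:
    # Build a tiny fake suite with one dataset and two label vectors
    os.makedirs(os.path.join(root, "toy"))
    for suffix in ("data", "labels0", "labels1"):
        with gzip.open(os.path.join(root, "toy", f"demo.{suffix}.gz"), "wt") as f:
            f.write("1\n")
    data_file, label_files = list_dataset_files(root, "toy", "demo")
    found = [os.path.basename(p) for p in [data_file] + label_files]

print(found)
```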
- clustbench.load_results(method_group, battery, dataset, n_clusters, path=None, expanduser=True, expandvars=True)
- Load benchmark results
- Reads the files named like method_group/battery/dataset.resultK.gz (relative to the directory path), for each K in n_clusters. method_group can be a wildcard like "*" if a lookup in multiple directories is required.
- Parameters:
- method_group
- Name of the method group, e.g., "Genie", ".", or "*".
- battery
- Name of the battery (dataset collection), e.g., "wut" or "other".
- dataset
- Dataset name, e.g., "x2" or "iris".
- n_clusters : int or list of ints
- Number of clusters. 
- path
- Path to the directory containing the downloaded benchmark datasets suite. Defaults to the current working directory. 
- expanduser
- Whether to call os.path.expanduser on the file path.
- expandvars
- Whether to call os.path.expandvars on the file path.
 
- Returns:
- results
- A dictionary of dictionaries of label vectors that can be accessed like results[method_name][n_clusters].
 
 - Examples
>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())
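To see how a wildcard method_group such as "*" can match several directories, the sketch below builds the method_group/battery/dataset.resultK.gz pattern and resolves it with the standard glob module. Whether the package itself uses glob is an assumption; `find_result_files`, "GroupA", "GroupB", "toy" and "demo" are hypothetical names.

```python
import glob
import os
import tempfile


def find_result_files(path, method_group, battery, dataset, k):
    """List files matching path/method_group/battery/dataset.result<k>.gz."""
    # Escape the root so that only method_group is treated as a pattern
    pattern = os.path.join(glob.escape(path), method_group, battery,
                           f"{dataset}.result{k}.gz")
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as root:
    # Two fake method groups, each holding a 3-cluster result for toy/demo
    for group in ("GroupA", "GroupB"):
        os.makedirs(os.path.join(root, group, "toy"))
        open(os.path.join(root, group, "toy", "demo.result3.gz"), "wb").close()
    matches = find_result_files(root, "*", "toy", "demo", 3)
    groups = [os.path.relpath(m, root).split(os.sep)[0] for m in matches]
    only_a = find_result_files(root, "GroupA", "toy", "demo", 3)

print(groups, len(only_a))
```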
- clustbench.preprocess_data(data, noise_factor=1e-06, random_state=None)
- Normalise a data matrix
- Removes all columns of zero variance (constant). Centres the data around the centroid (so that each column mean is 0). Scales all columns proportionally (so that the total variance is 1; note that this is not the same as standardisation: standard deviations in each column might still be different). Adds a tiny amount of noise to minimise the risk of having duplicate points.
- Parameters:
- data
- Data matrix. 
- noise_factor
- Standard deviation of the white noise added. 
- random_state
- Seed of the random number generator; see scipy.stats.norm.rvs.
 
- Returns:
- data
- A modified data matrix. 
 
 - Examples
>>> import numpy as np
>>> import clustbench
>>> data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
>>> wut_smile = clustbench.load_dataset(
...     "wut", "smile", url=data_url, preprocess=False)
>>> np.random.seed(123)  # ensure reproducibility
>>> X = clustbench.preprocess_data(wut_smile.data)
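The four normalisation steps listed in the description of preprocess_data can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's code: `preprocess_sketch` is a hypothetical name, and "total variance" is taken here to mean the sum of the per-column population variances.

```python
import random
import statistics


def preprocess_sketch(rows, noise_factor=1e-6, seed=None):
    """Drop constant columns, centre, scale by one common factor, add noise."""
    rng = random.Random(seed)
    columns = [list(c) for c in zip(*rows)]
    # 1. Remove columns of zero variance
    columns = [c for c in columns if statistics.pvariance(c) > 0]
    # 2. Centre each column so that its mean is 0
    centred = []
    for c in columns:
        m = statistics.fmean(c)
        centred.append([x - m for x in c])
    # 3. Divide everything by the same factor (assumed: total variance becomes 1)
    scale = sum(statistics.pvariance(c) for c in centred) ** 0.5
    scaled = [[x / scale for x in c] for c in centred]
    # 4. Add Gaussian noise with standard deviation noise_factor
    noisy = [[x + rng.gauss(0.0, noise_factor) for x in c] for c in scaled]
    return [list(r) for r in zip(*noisy)]


rows = [[0.0, 5.0, 1.0], [2.0, 5.0, 3.0], [4.0, 5.0, 8.0]]
result = preprocess_sketch(rows, noise_factor=0.0, seed=123)
result_columns = list(zip(*result))
print(result)
```

Because a single scale factor is applied to all columns, the two remaining columns keep their relative spread, which is the difference from column-wise standardisation noted in the description.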
- clustbench.save_data(filename, data, fmt='%g', expanduser=True, expandvars=True)
- Write a data matrix for inclusion in the clustering benchmark suite
- Parameters:
- filename : string or file handle
- For example, path_to_suite/battery/dataset.data.gz. 
- data : 2D array_like
- A matrix-like object.
- fmt
- See numpy.savetxt. 
- expanduser
- Whether to call os.path.expanduser on the file path.
- expandvars
- Whether to call os.path.expandvars on the file path.
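Since fmt is handed to numpy.savetxt, which formats each value with a printf-style specification, the effect of the default '%g' can be previewed with Python's own % operator. The sample values below are arbitrary.

```python
values = [0.5, 2.0, 1234567.0, 1e-06]

# '%g' keeps at most 6 significant digits and drops trailing zeros
formatted = ["%g" % v for v in values]

# A wider specification such as '%.17g' preserves more digits
precise = "%.17g" % 1234567.0

print(formatted, precise)
```

The default is compact, but values with more than six significant digits are rounded on output, so a more precise fmt may be preferable for such data.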
 
 
- clustbench.save_labels(filename, labels, expanduser=True, expandvars=True)
- Write a label vector for inclusion in the clustering benchmark suite
- Parameters:
- filename : string or file handle
- For example, path_to_suite/battery/dataset.labels0.gz. 
- labels : 1D array_like
- A label vector. 
- expanduser
- Whether to call os.path.expanduser on the file path.
- expandvars
- Whether to call os.path.expandvars on the file path.
 
 
- clustbench.save_results(filename, results, expanduser=True, expandvars=True)
- Write results of many clustering algorithms
- Parameters:
- filename : string or file handle
- For example, method_group/battery/dataset.resultK.gz. 
- results : dict
- A dictionary where each results[method_name] is a label vector.
- expanduser
- Whether to call os.path.expanduser on the file path.
- expandvars
- Whether to call os.path.expandvars on the file path.
 
 - Examples
>>> import os.path
>>> import clustbench
>>> # load from a local library (a manually downloaded repository)
>>> results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
>>> res = clustbench.load_results("*", "wut", "x2", 3, path=results_path)
>>> print(res.keys())
>>> clustbench.save_results(
...     os.path.join(results_path, "method", "wut", "x2.result3.gz"),
...     clustbench.transpose_results(res)[3])
- clustbench.transpose_results(results)
- "Transpose" a results dictionary
- Parameters:
- results : dict
- A dictionary of dictionaries or lists of objects. 
 
- Returns:
- ret : dict
- A dictionary such that ret[b][a] is taken from results[a][b]. If results[a] is not a dictionary, labels_list_to_dict will be called first.
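The relation between ret[b][a] and results[a][b] can be illustrated with a nested loop. The sketch below assumes every results[a] is already a dictionary (it does not reproduce the labels_list_to_dict fallback); `transpose_sketch` and the method names "MethodA" and "MethodB" are hypothetical.

```python
def transpose_sketch(results):
    """Swap the two levels of keys in a dictionary of dictionaries."""
    ret = {}
    for a, inner in results.items():
        for b, value in inner.items():
            ret.setdefault(b, {})[a] = value
    return ret


# results[method_name][n_clusters] -> transposed[n_clusters][method_name]
by_method = {
    "MethodA": {2: [1, 1, 2], 3: [1, 2, 3]},
    "MethodB": {2: [1, 2, 2]},
}
transposed = transpose_sketch(by_method)
print(transposed)
```

After transposing, transposed[K] maps each method name to its K-partition, which is the shape save_results expects for its results argument.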