CLUSTER

class almos.cluster.cluster(**kwargs)

Class containing the CLUSTER module workflow.

allocate_points_by_group_population(labels, n_points): Allocate selected points across natural groups proportionally to population.

auto_fill_knn(df): Fill missing values using KNNImputer.

build_chemical_space_viewer(descp_df, selected_indices, coverage_result, file_name): Build an interactive PCA/UMAP HTML viewer for the selected chemical space.

categorical_transform(df): Convert categorical columns into numeric descriptors.

checking_cluster(): Detect errors and update variables before the CLUSTER run.

choose_natural_selection_candidate(natural_result): Choose the simplest near-best natural clustering model for point allocation.

classify_visualization_local_quality(trustworthiness_score): User-facing recommendation based on local-neighborhood preservation in 2D.

clean_up_cluster(descp_file, csv, file_name): Prepare the descriptor matrix for clustering.

compute_2d_map_coverage_metrics(dataset_embedding, selected_embedding): Report 2D area coverage and grid filling for selected molecules.

compute_convex_hull_area_2d(points): Compute the 2D convex-hull polygon area with a monotonic chain hull.

compute_convex_hull_points_2d(points): Return the 2D convex-hull vertices with a monotonic chain hull.

compute_embedding_fidelity(original_data, embedding, selected_mask): Compute sampled user-facing fidelity metrics for an embedding.

compute_final_2d_coverage_metrics(selection_data, selected_indices, selected_points, budget_context): Compute PCA/UMAP area and occupied-cell filling diagnostics for final reporting.

compute_mean_nearest_selected_distance(coverage_data, selected_points): Compute mean distance from coverage points to the nearest selected point.

compute_nearest_selected_distances(coverage_data, selected_points): Compute distances from each row to the nearest selected representative.

compute_pca_2d_embedding(selection_data): Project the selection space to two dimensions with PCA.

compute_pca_2d_variance(selection_data): Fraction of selection-space variance retained by the first two PCA axes.

compute_pca_3d_embedding(selection_data): Project the selection space to three PCA coordinates.

compute_umap_2d_embedding(selection_data, selected_mask): Compute a single adaptive UMAP visualization.

compute_umap_area_coverage(selected_points, umap_area_context): Fraction of the dataset UMAP convex-hull area covered by selected molecules.

estimate_auto_coverage_budget(scaled_data, selectable_indices): Estimate n_points by scanning coverage improvement over candidate budgets.

evaluate_existing_selection(coverage_result, descp_df): Reuse an existing user-provided batch = 0 selection without recalculating representatives.

export_chemical_space_csvs(descp_df, display_indices, embeddings): Save reusable PCA/UMAP coordinate tables for the displayed chemical space.

fix_cols_names(df): Standardize code_name and SMILES column names.

format_fraction_percent(value): Format a fraction as a user-facing percentage.

format_visual_score(value): Format the 2D visual exploration score on a 1-10 scale.

get_2d_map_grid_details(dataset_embedding, selected_embedding): Return area/filling details used to plot 2D selection diagnostics.

get_auto_budget_candidates(n_selectable): Build candidate exploration budgets for automatic coverage scanning.

interpret_combined_2d_visualization_quality(pca_trustworthiness, pca_variance_retained, umap_trustworthiness): Provide one compact interpretation for PCA/UMAP visualization quality.

interpret_final_2d_coverage(map_metrics): Provide a compact user-facing interpretation of the 2D coverage diagnostics.

log_final_selection_quality(selection_data, selected_indices, selected_points, avg_gap_selected): Report compact raw diagnostics for the selected coverage batch.

molecule_svg_from_smiles(smiles): Return an SVG molecule drawing for a SMILES string, if RDKit can parse it.

point_inside_polygon_2d(point, polygon): Ray-casting point-in-polygon test for 2D coverage diagnostics.

prepare_auto_budget_umap_area_context(coverage_data): Build a lightweight UMAP map used only to report visual area coverage.

prepare_coverage_selection_input(descriptor_df): Scale descriptors and apply the PCA safeguard used for coverage selection.

prepare_coverage_selection_space(scaled_data, descriptor_count): Apply the high-dimensional PCA safeguard with selection-specific logging.

prepare_final_umap_map_context(selection_data, budget_context, selected_indices): Build the final UMAP map on the full selection dataset.

render_chemical_space_viewer_html(payload, plotly_js, file_name): Render the standalone chemical-space viewer HTML.

resolve_chemical_space_name_column(descp_df): Resolve the identifier column used in exported chemical-space tables.

run_aqme(csv, descp_file): Generate descriptors with AQME when requested.

run_natural_clustering_analysis(descriptor_df): Run the optional natural clustering model selection once and reuse the result.

save_auto_budget_plateau_plot(evaluated_budgets, coarse_budgets, fine_window_budgets, recommended_budget): Save an avg_gap plateau diagnostic plot for the automatic budget scan.

save_cluster_outputs(descp_file, csv, file_name, coverage_result): Save the coverage-based selected point outputs.

save_natural_clustering_report(descp_file, selection_result): Save optional natural clustering diagnostics.

save_selection_2d_diagnostic_images(selection_data, selected_indices, selected_points, budget_context): Save PCA/UMAP final-selection area and selected-dispersion diagnostic images.

save_single_2d_diagnostic_images(embedding_name, dataset_embedding, selected_embedding, output_folder, plt): Save separate PCA/UMAP area and selected-dispersion diagnostic images.

select_coverage_representatives(scaled_data, selectable_indices, n_points, log_selection=True): Select coverage representatives using either the centroid-representative strategy or diversity pruning.

select_embedding_fidelity_indices(total_points, selected_mask, max_points=2000): Select a deterministic fidelity sample while retaining selected rows.

select_natural_cluster_representatives(coverage_result, selectable_indices, n_points): Select molecules from natural clusters using proportional allocation.

select_points_within_natural_group(selection_data, group_indices, n_points): Pick one centroid representative, then farthest points within one natural group.

select_representative_points(coverage_result): Select representative rows by global descriptor-space coverage.

set_up_cluster(df_csv_name, file_name): Prepare the CSV file and working folders.
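
The 2D coverage helpers listed above (compute_convex_hull_points_2d, compute_convex_hull_area_2d and point_inside_polygon_2d) name standard geometric techniques: a monotonic chain hull, a polygon area, and a ray-casting test. The following is a minimal standalone sketch of those techniques, not the ALMOS implementation; the function names, signatures and the shoelace area formula are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def convex_hull_2d(points: Sequence[Point]) -> List[Point]:
    """Monotonic chain hull; vertices are returned in counter-clockwise order."""
    pts = sorted(set((float(x), float(y)) for x, y in points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        # Positive when o -> a -> b makes a left (counter-clockwise) turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def polygon_area(polygon: Sequence[Point]) -> float:
    """Polygon area from the shoelace formula (assumed here, not confirmed by the source)."""
    n = len(polygon)
    if n < 3:
        return 0.0
    twice_area = sum(
        polygon[i][0] * polygon[(i + 1) % n][1] - polygon[(i + 1) % n][0] * polygon[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def point_inside_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count polygon edges crossed by a horizontal ray going right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Unit square plus one interior point: the hull keeps only the four corners.
hull = convex_hull_2d([(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)])
area = polygon_area(hull)
```

An area-coverage fraction in the spirit of compute_umap_area_coverage could then be formed as the hull area of the selected points divided by the hull area of the whole dataset, although the source does not spell out that formula.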

CLUSTER workflow.

This module prepares descriptors, applies cleanup, and selects representative molecules that cover the cleaned chemical descriptor space. Natural clustering diagnostics are optional and are not used by the default point-selection method.
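
When the optional natural clustering is used for selection, allocate_points_by_group_population splits n_points across groups in proportion to their population. The sketch below shows one way such an allocation can be made to sum exactly to n_points, using largest-remainder rounding. The rounding rule, the tie-breaking and the function name are assumptions; the source only states that allocation is proportional.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def allocate_proportionally(labels: Sequence[Hashable], n_points: int) -> Dict[Hashable, int]:
    """Split n_points across groups proportionally to group size (largest remainder)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0 or n_points <= 0:
        return {group: 0 for group in counts}
    n_points = min(n_points, total)  # cannot select more rows than exist

    # Ideal (fractional) share of each group.
    quotas = {group: n_points * size / total for group, size in counts.items()}
    allocation = {group: int(quota) for group, quota in quotas.items()}

    # Hand out the leftover points to the largest fractional remainders;
    # ties go to the more populated group (an arbitrary illustrative choice).
    leftover = n_points - sum(allocation.values())
    order = sorted(
        counts,
        key=lambda group: (quotas[group] - int(quotas[group]), counts[group]),
        reverse=True,
    )
    for group in order[:leftover]:
        allocation[group] += 1
    return allocation


example_labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
example_allocation = allocate_proportionally(example_labels, 5)
# -> {"a": 3, "b": 2, "c": 0}
```

Note that strict proportionality can leave a small group with no points, as group "c" shows here.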

Main user-facing parameters

General:

input : str
    Input CSV or SDF file used by the coverage-selection workflow.
name : str
    Identifier column when descriptors are already present in the CSV.
ignore : list
    Columns excluded from descriptor cleanup and coverage selection.
aqme : bool
    Generate descriptors with AQME before coverage selection.
y : str
    Optional response column to ignore during coverage selection.
categorical : str
    Encoding mode for categorical descriptor columns.

Descriptor cleanup:

missing_threshold : float
    Remove descriptor columns with too many missing values.
near_constant_threshold : float
    Remove descriptors dominated almost entirely by a single value.
iqr_threshold : float
    Minimum absolute variability required for continuous descriptors.
rel_threshold : float
    Minimum relative variability required for continuous descriptors.
binary_threshold : float
    Minimum minority-class proportion required for binary descriptors.
correlation_threshold : float
    If two descriptors are too strongly correlated, one of them is removed.
min_descriptors : int
    Minimum number of descriptors required after cleanup.
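
To make three of the cleanup thresholds above concrete, here is a plain-Python sketch of a missing-value filter, a near-constant filter and a correlation filter. It is not the ALMOS cleanup code. The comparison directions, the choice of which correlated column to drop (the later one), the use of Pearson correlation and the numeric threshold values are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Optional

Column = List[Optional[float]]  # None marks a missing value


def missing_fraction(values: Column) -> float:
    return sum(v is None for v in values) / len(values)


def dominant_fraction(values: Column) -> float:
    """Share of non-missing entries taken by the most frequent value."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0
    return Counter(present).most_common(1)[0][1] / len(present)


def pearson(a: Column, b: Column) -> float:
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def clean_descriptors(
    columns: Dict[str, Column],
    missing_threshold: float,
    near_constant_threshold: float,
    correlation_threshold: float,
) -> List[str]:
    """Return the names of the descriptor columns that survive the three filters."""
    kept: Dict[str, Column] = {}
    for name, values in columns.items():
        if missing_fraction(values) > missing_threshold:
            continue
        if dominant_fraction(values) >= near_constant_threshold:
            continue
        if any(abs(pearson(values, other)) > correlation_threshold for other in kept.values()):
            continue  # too correlated with a column that was already kept
        kept[name] = values
    return list(kept)


descriptors = {
    "mw": [100.0, 150.0, 200.0, 250.0],
    "mw_copy": [200.0, 300.0, 400.0, 500.0],  # perfectly correlated with "mw"
    "flag": [1.0, 1.0, 1.0, 1.0],             # constant
    "sparse": [None, None, None, 1.0],        # mostly missing
    "logp": [1.0, 3.0, 2.0, 0.5],
}
surviving = clean_descriptors(descriptors, 0.5, 0.95, 0.9)  # hypothetical thresholds
```

With these illustrative values only "mw" and "logp" survive. A real run would then compare the surviving count against min_descriptors.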

Coverage selection:

n_points : int or None
    Number of representative molecules to select. If not provided, ALMOS estimates an automatic coverage budget.
evaluate : bool
    If True, skip representative reselection and only evaluate the existing user-provided selection stored as batch = 0 in the input CSV.
mode : str
    Selection strategy. "representative" selects one real molecule nearest to each prototype centroid; diversity-focused selection keeps extra distant candidates.
cluster_auto_budget_candidates : list
    Candidate budgets used as anchors for the automatic coverage scan.
cluster_auto_budget_marginal_gain_threshold : float
    Marginal coverage-improvement threshold used to stop the budget scan.
cluster_auto_budget_min_umap_area : float
    Minimum UMAP area fraction required for automatic recommendation.
cluster_auto_budget_lookahead : int
    Number of later tested budgets inspected for local-slowdown detection.
cluster_auto_budget_max_points : int
    Maximum automatic budget.
cluster_natural_report : bool
    Run optional KMeans/GMM/HDBSCAN natural clustering diagnostics.
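
The automatic budget described above scans candidate values of n_points and stops when coverage stops improving. The sketch below illustrates that idea with the mean distance from every row to its nearest selected row as the coverage measure (the quantity named in compute_mean_nearest_selected_distance). Everything else is an assumption: a farthest-first selection stands in for the real prototype-centroid strategy, the stopping rule is a simple relative-gain test, and the lookahead and UMAP-area checks are omitted.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def farthest_first(data: Sequence[Vector], n_points: int) -> List[int]:
    """Seed with the row nearest the global centroid, then add farthest rows (n_points >= 1)."""
    dim = len(data[0])
    centroid = [sum(row[j] for row in data) / len(data) for j in range(dim)]
    first = min(range(len(data)), key=lambda i: math.dist(data[i], centroid))
    selected = [first]
    # Distance from every row to its nearest selected row so far.
    nearest = [math.dist(row, data[first]) for row in data]
    while len(selected) < min(n_points, len(data)):
        candidate = max(range(len(data)), key=nearest.__getitem__)
        selected.append(candidate)
        nearest = [min(d, math.dist(row, data[candidate])) for d, row in zip(nearest, data)]
    return selected


def mean_gap(data: Sequence[Vector], selected: Sequence[int]) -> float:
    """Mean distance from each row to the nearest selected row."""
    distances = [min(math.dist(row, data[i]) for i in selected) for row in data]
    return sum(distances) / len(distances)


def scan_budgets(data: Sequence[Vector], candidates: Sequence[int], gain_threshold: float) -> int:
    """Return the last budget whose relative gap improvement reached gain_threshold."""
    recommended = candidates[0]
    previous_gap = None
    for budget in candidates:
        gap = mean_gap(data, farthest_first(data, budget))
        if previous_gap is not None and previous_gap > 0:
            gain = (previous_gap - gap) / previous_gap
            if gain < gain_threshold:
                break  # adding more points no longer pays off
        recommended = budget
        previous_gap = gap
    return recommended


# Three tight groups of three points each (toy data).
toy = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
    (0.0, 10.0), (0.1, 10.0), (0.0, 10.1),
]
budget = scan_budgets(toy, [1, 2, 3, 4, 5], gain_threshold=0.4)
```

On this toy data the scan settles on a budget of 3, one point per group, because a fourth point only reduces the gap by about 20%.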

PCA safeguard:

enable_pca : bool
    Disable PCA with --no_pca.
cluster_high_dimensionality_threshold : int
    Descriptor-count threshold above which PCA can be activated.
cluster_pca_explained_variance_threshold : float
    Target explained variance retained by PCA.
cluster_pca_min_acceptable_variance : float
    Minimum variance required to accept PCA instead of raw descriptor space.
cluster_pca_min_components : int
pca_max_components : int
pca_max_components_fraction : float
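
A rough sketch of how an explained-variance target and an acceptance floor can interact is given below. It takes the per-component explained-variance ratios as input (for example from a fitted PCA) and returns a component count, or None when PCA should be rejected in favour of the raw descriptor space. The roles given to the minimum and maximum component counts are guesses based on the parameter names, since the source does not describe them, and pca_max_components_fraction is left out.

```python
from typing import Optional, Sequence


def choose_pca_components(
    explained_variance_ratio: Sequence[float],
    target_variance: float,
    min_acceptable_variance: float,
    min_components: int,
    max_components: int,
) -> Optional[int]:
    """Smallest component count reaching target_variance, within [min, max] bounds.

    Returns None when even the chosen count retains less than
    min_acceptable_variance (assumed meaning: fall back to raw descriptors).
    """
    upper = min(max_components, len(explained_variance_ratio))
    lower = min(min_components, upper)
    chosen = upper  # used when the target is never reached
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio[:upper], start=1):
        cumulative += ratio
        if k >= lower and cumulative >= target_variance:
            chosen = k
            break
    retained = sum(explained_variance_ratio[:chosen])
    return chosen if retained >= min_acceptable_variance else None


ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # toy spectrum
reached = choose_pca_components(ratios, 0.85, 0.70, min_components=2, max_components=4)
capped = choose_pca_components(ratios, 0.85, 0.70, min_components=2, max_components=2)
rejected = choose_pca_components(ratios, 0.85, 0.80, min_components=2, max_components=2)
```

Here `reached` is 3 (0.875 of the variance), `capped` is 2 (0.75, below the target but above the floor), and `rejected` is None because 0.75 falls below the 0.80 floor.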

Large dataset mode:

large_dataset_mode : bool
    Disable with --no_large_dataset_mode.
cluster_standard_dataset_threshold : int
    Upper limit of the standard regime.
cluster_very_large_dataset_threshold : int
    Upper limit of the large regime.
cluster_ultra_large_dataset_threshold : int
    Above this, the workflow enters ultra-large mode.
cluster_large_silhouette_sample_size : int
cluster_very_large_silhouette_sample_size : int
cluster_ultra_large_silhouette_sample_size : int
cluster_large_dataset_stability_repeats : int
cluster_fast_screening_top_candidates : int
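
Read together, the three dataset thresholds above define four size regimes. The helper below is a hypothetical illustration of that mapping; whether each boundary is inclusive, the regime labels and the example threshold values are assumptions, not ALMOS defaults.

```python
def dataset_regime(
    n_rows: int,
    standard_threshold: int,
    very_large_threshold: int,
    ultra_large_threshold: int,
) -> str:
    """Map a row count to a size regime (boundaries assumed inclusive)."""
    if n_rows <= standard_threshold:
        return "standard"
    if n_rows <= very_large_threshold:
        return "large"
    if n_rows <= ultra_large_threshold:
        return "very_large"
    return "ultra_large"


# Hypothetical threshold values, chosen only for the example.
regimes = [dataset_regime(n, 5_000, 50_000, 500_000) for n in (800, 20_000, 200_000, 2_000_000)]
```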

Algorithm-specific search space:

cluster_kmeans_coarse_grid_size : int
cluster_kmeans_top_refinement_candidates : int
cluster_kmeans_refine_radius : int
cluster_kmeans_bo_fraction : float
cluster_kmeans_bo_max_evaluations : int
cluster_gmm_dimensionality_threshold : int
cluster_gmm_standard_coarse_grid_size : int
cluster_gmm_standard_refine_radius : int
cluster_gmm_large_coarse_grid_size : int
cluster_gmm_large_refine_radius : int
cluster_gmm_very_large_coarse_grid_size : int
cluster_gmm_very_large_refine_radius : int
cluster_gmm_bo_fraction : float
cluster_gmm_bo_max_evaluations : int
cluster_gmm_bic_shortlist_size : int
cluster_hdbscan_standard_min_cluster_ratios : list
cluster_hdbscan_large_min_cluster_ratios : list
cluster_hdbscan_very_large_min_cluster_ratios : list
cluster_hdbscan_standard_min_samples : list
cluster_hdbscan_large_min_samples : list
cluster_hdbscan_very_large_min_samples : list

Quality filters:

cluster_filter_max_noise_fraction : float
    Reject candidates with too many noise points, mainly for HDBSCAN.
cluster_filter_max_cluster_fraction : float
    Reject candidates dominated by one oversized cluster.
cluster_filter_max_imbalance_penalty : float
    Reject candidates with extreme cluster-size imbalance.
cluster_quality_warning_silhouette_threshold : float
cluster_quality_warning_stability_threshold : float
cluster_quality_warning_noise_threshold : float
cluster_quality_warning_imbalance_threshold : float
cluster_quality_warning_final_score_threshold : float
cluster_quality_good_silhouette_threshold : float
cluster_quality_good_stability_threshold : float
cluster_quality_good_noise_threshold : float
cluster_quality_good_imbalance_threshold : float
cluster_quality_good_final_score_threshold : float
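
The first two quality filters can be pictured as simple checks on a candidate's label vector. The sketch below assumes noise points carry the label -1 (the HDBSCAN convention) and measures both fractions against the total number of rows; those choices, the function name and the threshold values are illustrative assumptions. The imbalance penalty is not sketched because the source does not define it.

```python
from collections import Counter
from typing import Hashable, Sequence


def passes_quality_filters(
    labels: Sequence[Hashable],
    max_noise_fraction: float,
    max_cluster_fraction: float,
    noise_label: Hashable = -1,
) -> bool:
    """Accept a clustering candidate only if noise and the largest cluster stay bounded."""
    n_rows = len(labels)
    if n_rows == 0:
        return False
    counts = Counter(labels)
    noise_fraction = counts.pop(noise_label, 0) / n_rows
    if not counts:
        return False  # every row was labelled as noise
    largest_cluster_fraction = max(counts.values()) / n_rows
    return (
        noise_fraction <= max_noise_fraction
        and largest_cluster_fraction <= max_cluster_fraction
    )


balanced = [0] * 4 + [1] * 4 + [-1] * 2   # 20% noise, largest cluster 40%
dominated = [0] * 9 + [1]                 # one cluster holds 90% of rows
noisy = [-1] * 5 + [0] * 3 + [1] * 2      # 50% noise
results = [passes_quality_filters(c, 0.3, 0.6) for c in (balanced, dominated, noisy)]
```

With these hypothetical limits only the balanced candidate is accepted.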