Default parameters

This document details the default parameters used in the ALMOS program.

CLUSTER 

CLUSTER workflow.

This module prepares descriptors, applies cleanup, and selects representative molecules that cover the cleaned chemical descriptor space. Natural clustering diagnostics are optional and are not used by the default point-selection method.

Main user-facing parameters 

General:

input : str: Input CSV or SDF file used by the coverage-selection workflow.
name : str: Identifier column when descriptors are already present in the CSV.
ignore : list: Columns excluded from descriptor cleanup and coverage selection.
aqme : bool: Generate descriptors with AQME before coverage selection.
y : str: Optional response column to ignore during coverage selection.
categorical : str: Encoding mode for categorical descriptor columns.

Descriptor cleanup:

missing_threshold : float: Remove descriptor columns with too many missing values.
near_constant_threshold : float: Remove descriptors dominated almost entirely by a single value.
iqr_threshold : float: Minimum absolute variability required for continuous descriptors.
rel_threshold : float: Minimum relative variability required for continuous descriptors.
binary_threshold : float: Minimum minority-class proportion required for binary descriptors.
correlation_threshold : float: If two descriptors are too strongly correlated, one of them is removed.
min_descriptors : int: Minimum number of descriptors required after cleanup.
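
The first two cleanup rules can be illustrated with a minimal sketch. It is not the ALMOS implementation: the threshold values, the function names, and the use of None for missing entries are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical thresholds for illustration; these are not the ALMOS defaults.
MISSING_THRESHOLD = 0.5
NEAR_CONSTANT_THRESHOLD = 0.95


def missing_fraction(values):
    """Fraction of entries that are missing (None in this sketch)."""
    return sum(v is None for v in values) / len(values)


def dominant_fraction(values):
    """Share of the non-missing entries taken by the most common value."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0
    return Counter(present).most_common(1)[0][1] / len(present)


def clean_descriptors(table,
                      missing_threshold=MISSING_THRESHOLD,
                      near_constant_threshold=NEAR_CONSTANT_THRESHOLD):
    """Keep only columns that pass the missing-value and near-constant checks."""
    kept = {}
    for name, values in table.items():
        if missing_fraction(values) > missing_threshold:
            continue  # too many missing values
        if dominant_fraction(values) >= near_constant_threshold:
            continue  # dominated by a single value
        kept[name] = values
    return kept


descriptors = {
    "logP": [1.2, 0.7, 3.1, 2.4],
    "n_rings": [1, 1, 1, 1],             # constant column
    "dipole": [None, None, None, 0.4],   # 75% missing
}
cleaned = clean_descriptors(descriptors)
print(list(cleaned))
```

With these assumed thresholds only the logP column survives; the variability and correlation filters would be applied in the same pass-or-drop fashion.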

Coverage selection:

n_points : int or None: Number of representative molecules to select. If not provided, ALMOS estimates an automatic coverage budget.
evaluate : bool: If True, skip representative reselection and only evaluate the existing user-provided selection stored as batch = 0 in the input CSV.
mode : str: Selection strategy. "representative" selects one real molecule nearest to each prototype centroid; diversity-focused selection keeps extra distant candidates.
cluster_auto_budget_candidates : list: Candidate budgets used as anchors for the automatic coverage scan.
cluster_auto_budget_marginal_gain_threshold : float: Marginal coverage-improvement threshold used to stop the budget scan.
cluster_auto_budget_min_umap_area : float: Minimum UMAP area fraction required for automatic recommendation.
cluster_auto_budget_lookahead : int: Number of later tested budgets inspected for local-slowdown detection.
cluster_auto_budget_max_points : int: Maximum automatic budget.
cluster_natural_report : bool: Run optional KMeans/GMM/HDBSCAN natural clustering diagnostics.
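
The idea behind the "representative" mode, picking one real molecule nearest to each prototype centroid, can be sketched as follows. The function name, the Euclidean metric, and the rule that a molecule is not picked twice are assumptions of this example; how ALMOS computes the prototype centroids is not covered here, so they are passed in as given.

```python
import math


def nearest_real_points(points, centroids):
    """For each centroid, return the index of the closest point not chosen yet."""
    chosen = []
    for centroid in centroids:
        best = min(
            (i for i in range(len(points)) if i not in chosen),
            key=lambda i: math.dist(points[i], centroid),
        )
        chosen.append(best)
    return chosen


# Toy 2D descriptor space with three obvious groups.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (9.0, 0.5)]
centroids = [(0.1, 0.0), (5.0, 4.9), (8.0, 1.0)]

selected = nearest_real_points(points, centroids)
print(selected)
```

Returning indices of existing rows, rather than the centroids themselves, guarantees that every selected point is a molecule that actually exists in the dataset.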

PCA safeguard:

enable_pca : bool: Enable the PCA safeguard. Disable PCA with --no_pca.
cluster_high_dimensionality_threshold : int: Descriptor-count threshold above which PCA can be activated.
cluster_pca_explained_variance_threshold : float: Target explained variance retained by PCA.
cluster_pca_min_acceptable_variance : float: Minimum variance required to accept PCA instead of raw descriptor space.

cluster_pca_min_components : int
pca_max_components : int
pca_max_components_fraction : float
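
One way to read the safeguard parameters above is as a two-step gate: PCA is only considered above a descriptor-count threshold, and only accepted if it retains enough variance. The sketch below encodes that reading; the function name and the numeric values are hypothetical and are not taken from ALMOS.

```python
def accept_pca(n_descriptors, explained_variance,
               high_dim_threshold, min_acceptable_variance,
               enable_pca=True):
    """Return True if PCA space should replace the raw descriptor space."""
    if not enable_pca:
        return False  # corresponds to running with --no_pca
    if n_descriptors <= high_dim_threshold:
        return False  # not high-dimensional enough to need PCA
    return explained_variance >= min_acceptable_variance


# Hypothetical values for illustration only.
HIGH_DIM = 50
MIN_VARIANCE = 0.70

print(accept_pca(120, 0.85, HIGH_DIM, MIN_VARIANCE))  # many descriptors, good PCA
print(accept_pca(30, 0.95, HIGH_DIM, MIN_VARIANCE))   # too few descriptors
print(accept_pca(120, 0.55, HIGH_DIM, MIN_VARIANCE))  # PCA loses too much variance
```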

Large dataset mode:

large_dataset_mode : bool: Enable large dataset mode. Disable with --no_large_dataset_mode.
cluster_standard_dataset_threshold : int: Upper limit of the standard regime.
cluster_very_large_dataset_threshold : int: Upper limit of the large regime.
cluster_ultra_large_dataset_threshold : int: Above this, the workflow enters ultra-large mode.

cluster_large_silhouette_sample_size : int
cluster_very_large_silhouette_sample_size : int
cluster_ultra_large_silhouette_sample_size : int
cluster_large_dataset_stability_repeats : int
cluster_fast_screening_top_candidates : int
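
The three dataset thresholds split dataset sizes into four regimes. The sketch below shows that mapping under stated assumptions: the regime labels, the function name, the numeric thresholds, and the choice to fall back to the standard regime when large dataset mode is disabled are all illustrative and are not confirmed ALMOS behavior.

```python
def dataset_regime(n_molecules, standard_threshold, very_large_threshold,
                   ultra_large_threshold, large_dataset_mode=True):
    """Map a dataset size to a regime label using the three thresholds."""
    if not large_dataset_mode or n_molecules <= standard_threshold:
        return "standard"
    if n_molecules <= very_large_threshold:
        return "large"
    if n_molecules <= ultra_large_threshold:
        return "very_large"
    return "ultra_large"


# Hypothetical thresholds for illustration only.
THRESHOLDS = (5_000, 50_000, 200_000)

for size in (1_000, 20_000, 100_000, 1_000_000):
    print(size, dataset_regime(size, *THRESHOLDS))
```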

Algorithm-specific search space:

cluster_kmeans_coarse_grid_size : int
cluster_kmeans_top_refinement_candidates : int
cluster_kmeans_refine_radius : int
cluster_kmeans_bo_fraction : float
cluster_kmeans_bo_max_evaluations : int
cluster_gmm_dimensionality_threshold : int
cluster_gmm_standard_coarse_grid_size : int
cluster_gmm_standard_refine_radius : int
cluster_gmm_large_coarse_grid_size : int
cluster_gmm_large_refine_radius : int
cluster_gmm_very_large_coarse_grid_size : int
cluster_gmm_very_large_refine_radius : int
cluster_gmm_bo_fraction : float
cluster_gmm_bo_max_evaluations : int
cluster_gmm_bic_shortlist_size : int
cluster_hdbscan_standard_min_cluster_ratios : list
cluster_hdbscan_large_min_cluster_ratios : list
cluster_hdbscan_very_large_min_cluster_ratios : list
cluster_hdbscan_standard_min_samples : list
cluster_hdbscan_large_min_samples : list
cluster_hdbscan_very_large_min_samples : list

Quality filters:

cluster_filter_max_noise_fraction : float: Reject candidates with too many noise points, mainly for HDBSCAN.
cluster_filter_max_cluster_fraction : float: Reject candidates dominated by one oversized cluster.
cluster_filter_max_imbalance_penalty : float: Reject candidates with extreme cluster-size imbalance.
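
The noise and oversized-cluster filters can be sketched as simple checks on a list of cluster labels. This is an illustration, not the ALMOS code: it assumes noise points carry the label -1 (the usual HDBSCAN convention), that fractions are taken over all points, and the limits used are hypothetical. The imbalance penalty is left out because its definition is not given here.

```python
from collections import Counter


def passes_filters(labels, max_noise_fraction, max_cluster_fraction):
    """Return True if a clustering candidate survives both filters."""
    n = len(labels)
    noise = sum(1 for label in labels if label == -1)
    if noise / n > max_noise_fraction:
        return False  # too many noise points
    sizes = Counter(label for label in labels if label != -1)
    if not sizes:
        return False  # nothing but noise
    if max(sizes.values()) / n > max_cluster_fraction:
        return False  # one cluster dominates the dataset
    return True


# Hypothetical limits for illustration only.
MAX_NOISE = 0.3
MAX_CLUSTER = 0.6

balanced = [0, 0, 0, 1, 1, 1, 2, 2, -1, -1]
dominated = [0] * 9 + [1]
noisy = [-1] * 5 + [0, 0, 1, 1, 1]

for candidate in (balanced, dominated, noisy):
    print(passes_filters(candidate, MAX_NOISE, MAX_CLUSTER))
```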

cluster_quality_warning_silhouette_threshold : float
cluster_quality_warning_stability_threshold : float
cluster_quality_warning_noise_threshold : float
cluster_quality_warning_imbalance_threshold : float
cluster_quality_warning_final_score_threshold : float
cluster_quality_good_silhouette_threshold : float
cluster_quality_good_stability_threshold : float
cluster_quality_good_noise_threshold : float
cluster_quality_good_imbalance_threshold : float
cluster_quality_good_final_score_threshold : float

AL 

Parameters 

al : bool
Indicates whether the active learning process is enabled and should be performed. Defaults to False. This parameter is activated from the command line (i.e. --al).

csv_name : str
Name of the CSV file containing the database (e.g. 'FILE.csv').

y : str
Name of the column containing the response variable in the input CSV file (e.g. 'solubility').

name : str
Name of the column containing the molecule names in the input CSV file (e.g. 'names').

ignore : list, default=[]
List of the columns of the input CSV file that will be ignored during the ROBERT process (e.g. --ignore "[name,SMILES]"). The descriptors will be included in the final CSV file. The y value, name column and batch column are automatically ignored by ROBERT.

n_exps : int
Number of experiments to be selected for the new batch in the active learning process (e.g. '--n_exps 10'). If not provided or invalid, the program will request the value in the proper format.

tolerance : str, default='medium'
Indicates the tolerance level for the convergence process, defining the percentage change threshold required for convergence (e.g. '--tolerance tight'). Options:
1. 'tight': Strictest level, convergence occurs if the metric improves by ≤1% (threshold = 0.01).
2. 'medium': Balanced level, convergence occurs if the metric improves by ≤5% (threshold = 0.05).
3. 'wide': Least strict level, convergence occurs if the metric improves by ≤10% (threshold = 0.10).
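
The three tolerance levels map directly to thresholds, which a convergence check could use as in the sketch below. The threshold values come from the list above; the function name and the use of the absolute relative change between two consecutive metric values are assumptions, since the exact metric and comparison used by ALMOS are not specified here.

```python
TOLERANCE_THRESHOLDS = {"tight": 0.01, "medium": 0.05, "wide": 0.10}


def has_converged(previous_metric, current_metric, tolerance="medium"):
    """Return True if the relative change between batches is within tolerance."""
    if previous_metric == 0:
        raise ValueError("previous_metric must be non-zero")
    threshold = TOLERANCE_THRESHOLDS[tolerance]
    relative_change = abs(current_metric - previous_metric) / abs(previous_metric)
    return relative_change <= threshold


# A metric moving from 0.80 to 0.82 is a 2.5% change.
for level in ("tight", "medium", "wide"):
    print(level, has_converged(0.80, 0.82, level))
```

A 2.5% change is too large for 'tight' but counts as converged under 'medium' and 'wide'.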

robert_keywords : str, default=""
Additional keywords to be passed to the ROBERT model generation (e.g. --robert_keywords "--model RF --train [70] --seed [0]").

objective : str
Optimization direction for hit selection. Always required and must be 'max' or 'min' (e.g. '--objective max').

mode : str, optional
Optional manual override for the acquisition strategy. Use 'model' to rank by uncertainty or 'hit' to rank by prediction with uncertainty. If omitted, ALMOS selects the strategy automatically from the model score.

alpha : float, optional
Optional acquisition weight used in hit mode. It also overrides the automatic alpha when the strategy is auto and the selected score activates hit ranking (e.g. '--alpha 0.5' or '--alfa 0.5').
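
To make the roles of mode, objective, alpha and n_exps concrete, the sketch below ranks candidates in both modes. The scoring formula is an assumption: it uses an upper-confidence-bound style score (prediction plus alpha times uncertainty, with the prediction negated for 'min'), which is one common way to combine prediction and uncertainty and is not confirmed to be the formula ALMOS uses. The function name and the data are hypothetical.

```python
def rank_candidates(predictions, uncertainties, mode, objective="max", alpha=0.5):
    """Return candidate indices ordered from most to least attractive."""
    if mode == "model":
        # Model mode: rank by uncertainty only.
        scores = list(uncertainties)
    elif mode == "hit":
        # Hit mode (assumed UCB-style score): prediction plus weighted uncertainty.
        sign = 1.0 if objective == "max" else -1.0
        scores = [sign * p + alpha * u for p, u in zip(predictions, uncertainties)]
    else:
        raise ValueError("mode must be 'model' or 'hit'")
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


predictions = [2.0, 3.5, 3.0, 1.0]
uncertainties = [0.9, 0.1, 0.8, 0.2]
n_exps = 2  # size of the next batch

model_rank = rank_candidates(predictions, uncertainties, "model")
hit_max_rank = rank_candidates(predictions, uncertainties, "hit", "max", alpha=0.5)
hit_min_rank = rank_candidates(predictions, uncertainties, "hit", "min", alpha=0.5)

print(model_rank[:n_exps], hit_max_rank[:n_exps], hit_min_rank[:n_exps])
```

Under this assumed score, a larger alpha pushes hit mode toward uncertain candidates, while alpha = 0 reduces it to ranking by prediction alone.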
