CLUSTER

class almos.cluster.cluster(**kwargs)

Class containing all the functions from the CLUSTER module

auto_fill_knn(df)

Function to impute (or fill) null values in a dataset. KNNImputer uses the K-nearest neighbors method to estimate and fill each missing value based on the closest points in the dataset.
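The idea behind K-nearest-neighbors imputation can be sketched with the standard library alone. This is not scikit-learn's KNNImputer and not the ALMOS implementation: `knn_fill`, the choice of k = 2, and the toy rows are assumptions made for illustration. A missing value is replaced by the mean of that column over the k rows that are closest on the columns both rows have observed.

```python
import math


def knn_fill(rows, k=2):
    """Fill None entries with the mean of the k nearest rows (hypothetical helper)."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                # Distance is computed only over columns observed in both rows.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy data (made up): the last row is missing its third value.
data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 8.0, 50.0],
    [1.05, 2.05, None],
]
result = knn_fill(data, k=2)
```

In this toy case the two closest rows are the first two, so the gap is filled with the mean of 10.0 and 12.0.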

categorical_transform(df)

Categorical transform function taken from ROBERT. It is applied to the df without the columns in the ignore list. Converts all columns containing strings into categorical values (one-hot encoding by default; can be set to numerical 1, 2, 3... with categorical = 'numbers'). Troubleshooting: for one-hot encoding, do not use variable names that are also column headers. For example, the descriptor "C_atom" contains C2 as a value, but C2 is already the header of a different column in the database. The same applies to multiple columns containing the same variable names.
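The header clash described in the troubleshooting note can be checked before encoding. The helper below is a hypothetical sketch and is not part of ALMOS or ROBERT; it assumes that one-hot columns are named after the category values, as the C2 example suggests, and that the table is held as a dict of column header to values.

```python
def onehot_name_collisions(table):
    """Return category values that would clash with existing column headers.

    `table` maps column header -> list of cell values (hypothetical layout).
    """
    headers = set(table)
    collisions = {}
    for header, values in table.items():
        # Only string values would be turned into one-hot column names.
        categories = {v for v in values if isinstance(v, str)}
        clashing = sorted(categories & (headers - {header}))
        if clashing:
            collisions[header] = clashing
    return collisions


# Toy table (made up): "C_atom" holds the value "C2", which is also a header.
table = {
    "C_atom": ["C1", "C2", "C2"],
    "C2": [0.12, 0.30, 0.25],
    "x": [1.0, 2.0, 3.0],
}
collisions = onehot_name_collisions(table)
```

Renaming either the category value or the clashing header before the run avoids the problem.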

checking_cluster()

Detects errors and updates variables before the CLUSTER run

clean_up_cluster(descp_file, csv, file_name)

Prepares the CSV (descp_file) of both paths for the clustering

cluster_workflow(filled_array, descp_file, csv, file_name)

Runs the clustering

elbow_method_nclusters(filled_array)

Defines the optimal n_clusters with the Elbow Method if the user does not define it
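One common way to automate the Elbow Method is to compute the k-means inertia for a range of cluster counts and pick the point that lies farthest from the straight line joining the first and last points. The source does not state which criterion ALMOS uses, so the sketch below is an assumption; `pick_elbow` and the inertia values are hypothetical.

```python
def pick_elbow(ks, inertias):
    """Return the k whose (k, inertia) point lies farthest from the end-to-end chord."""
    x1, y1 = ks[0], inertias[0]
    x2, y2 = ks[-1], inertias[-1]
    best_k, best_dist = ks[0], -1.0
    for k, inertia in zip(ks, inertias):
        # Unnormalised point-to-line distance; the constant denominator
        # does not change which point is farthest.
        dist = abs((y2 - y1) * k - (x2 - x1) * inertia + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


ks = [1, 2, 3, 4, 5, 6]
inertias = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0]  # made-up values
n_clusters = pick_elbow(ks, inertias)
```

With these made-up inertias the curve flattens after three clusters, which is the value the helper returns.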

fix_cols_names(df)

Function to unify the column names. Sets code_name and SMILES using the right format.

k_means(X_scaled, seed_clustered, n_clusters)

Returns the data points that will be used as molecules to test to generate experimental data (k-means clustering)

pca_control(df_pca, pc_total_val, pc1_var, pc2_var, pc3_var)

Generates and saves the 3D representation of the PCA

run_aqme(csv, descp_file)

Generates the descriptors if the user needs them

set_up_cluster(df_csv_name, file_name)

Prepare the CSV file

Parameters

input: str, default = ''

Input file, including its extension: .csv or .sdf (e.g. example.csv). An SDF file can only be used with AQME (--aqme). If the descriptors are to be obtained through AQME from a CSV file, two columns are required: 'code_name' with the names and 'SMILES' with the SMILES strings. If the CSV already contains descriptors, it must contain at least 3, and the variable --name must be defined. In both cases, the CSV file cannot contain a column named 'batch'.
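The CSV requirements above can be expressed as a small pre-flight check. This is a hypothetical sketch and not the ALMOS code: `check_input_header` is an invented name, and counting descriptors as "every column except name, y, and the ignored ones" is an assumption.

```python
import csv
import io


def check_input_header(header, use_aqme, name="", y="", ignore=()):
    """Return a list of problems found in a CSV header (hypothetical helper)."""
    problems = []
    if "batch" in header:
        problems.append("a column named 'batch' is not allowed")
    if use_aqme:
        for required in ("code_name", "SMILES"):
            if required not in header:
                problems.append(f"missing required column '{required}'")
    else:
        if not name:
            problems.append("'name' must be defined when descriptors are supplied")
        # Assumption: descriptors are the columns left after removing
        # the name column, the y column and the ignored columns.
        skipped = set(ignore) | {name, y}
        descriptors = [col for col in header if col not in skipped]
        if len(descriptors) < 3:
            problems.append("at least 3 descriptor columns are required")
    return problems


# AQME path: only code_name and SMILES are needed.
aqme_header = next(csv.reader(io.StringIO("code_name,SMILES\nmol_1,CCO\n")))
aqme_problems = check_input_header(aqme_header, use_aqme=True)

# Descriptor path with two mistakes: a 'batch' column and too few descriptors.
bad_header = next(csv.reader(io.StringIO("name,d1,batch\nmol_1,0.1,0\n")))
bad_problems = check_input_header(bad_header, use_aqme=False, name="name")
```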

n_clusters: int, default = None

Number of clusters for the clustering. If not defined by the user, it is calculated using the Elbow Method.

seed_clustered: int, default = 0

Random seed used during KMeans (in k_means function).

descp_level: str, default = 'interpret'

Type of descriptor to be used in the ALMOS workflow. Options are 'interpret', 'denovo' or 'full'.

ignore: list, default = []

List containing the columns of the input CSV file that will be ignored during the clustering process (e.g. ['code_name','SMILES']). The descriptors will be included in the clustered CSV file. The y value is automatically ignored.

aqme: bool, default = False

Enables the AQME workflow to generate descriptors.

name: str, default = ''

Mandatory when the clustering is to be done with descriptors already provided by the user. When the descriptors are generated by the program (using AQME), 'name' is left undefined.

y: str, default = ''

Name of the column containing the response variable in the input CSV file (e.g. 'yield').

auto_fill: bool, default = True

If the CSV contains empty cells (less than 30 % of NaN per column), KNNImputer is applied, using the K-nearest neighbors method to estimate and fill missing values based on the closest values to each point in the dataset. If auto_fill is False, KNNImputer is not applied (if there are still empty cells, the program stops).
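The 30 % condition can be checked per column before any imputation. The sketch below only reports which columns reach the limit; the source does not say what ALMOS does with such columns, so nothing is dropped or filled here. `columns_over_nan_limit` and the table layout are hypothetical.

```python
import math


def columns_over_nan_limit(table, limit=0.30):
    """Return headers of columns whose fraction of missing cells is not below `limit`."""
    over = []
    for header, values in table.items():
        missing = sum(
            1 for v in values
            if v is None or (isinstance(v, float) and math.isnan(v))
        )
        if values and missing / len(values) >= limit:
            over.append(header)
    return over


nan = float("nan")
# Toy table (made up): column "a" has 20 % missing cells, column "b" has 40 %.
table = {
    "a": [1.0, nan, 3.0, 4.0, 5.0, 6.0, nan, 8.0, 9.0, 10.0],
    "b": [nan, nan, 3.0, nan, 5.0, 6.0, nan, 8.0, 9.0, 10.0],
}
over = columns_over_nan_limit(table)
```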

categorical: str, default = 'onehot'

It can be used when the user provides their own descriptors. Mode used to convert data from columns with categorical variables. As an example, a variable containing 4 types of C atoms (primary, secondary, tertiary, quaternary) will be converted into categorical variables. Options:

  1. 'onehot' (for one-hot encoding, ROBERT will create a descriptor for each type of C atom using 0s and 1s to indicate whether the C type is present)

  2. 'numbers' (to describe the C atoms with numbers: 1, 2, 3, 4).
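The difference between the two modes can be illustrated on the C-atom example. This is a minimal standard-library sketch, not the ROBERT routine; `encode_onehot` and `encode_numbers` are hypothetical names, and assigning the numbers in alphabetical order of the categories is an assumption.

```python
def encode_onehot(values):
    """One column of 0s and 1s per category, keyed by the category name."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


def encode_numbers(values):
    """A single column of integer codes 1, 2, 3... (alphabetical order assumed)."""
    categories = sorted(set(values))
    codes = {cat: i + 1 for i, cat in enumerate(categories)}
    return [codes[v] for v in values]


c_type = ["primary", "tertiary", "primary", "quaternary", "secondary"]
onehot = encode_onehot(c_type)    # four new columns, one per C type
numbers = encode_numbers(c_type)  # one column of integer codes
```

One-hot encoding adds one descriptor per category and implies no ordering, while the numeric mode keeps a single column but imposes an order on the categories.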

aqme_keywords: str, default = ''

It can be used to call specific functions from AQME. The entire argument must be in quotation marks, as in the example (e.g. --aqme_keywords "--qdescp_atoms [1,2]").

varfile: str, default=None

Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).

nprocs: int, default=8

Number of processors used in AQME for the clustering.