Default parameters

This document details the default parameters used in the ALMOS program.

CLUSTER

Parameters

input: str, default = ''

Input file. Accepted extensions: .csv or .sdf (e.g. example.csv). An SDF file can only be used together with AQME (--aqme). If the descriptors are to be obtained through AQME from a CSV file, two columns are required: 'code_name' with the names and 'SMILES' with the SMILES strings. If the CSV already contains descriptors, it must contain at least 3 of them, and the variable --name must be defined. In both cases, the CSV file cannot contain a column named 'batch'.
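The column rules above can be checked before launching a run. The following is a minimal sketch using only the standard library; the function name `check_cluster_header` is hypothetical, and counting every column other than the name column as a descriptor is an assumption, since this page does not say how ALMOS counts them.

```python
import csv
import io

def check_cluster_header(columns, use_aqme, name=""):
    """Return a list of problems found in the CSV header (empty list = OK).

    Hypothetical helper, not part of ALMOS.
    """
    problems = []
    # A 'batch' column is forbidden in both modes
    if "batch" in columns:
        problems.append("column 'batch' is not allowed")
    if use_aqme:
        # Descriptors generated through AQME: names and SMILES are required
        for required in ("code_name", "SMILES"):
            if required not in columns:
                problems.append(f"missing column '{required}'")
    else:
        # Descriptors supplied by the user: --name and at least 3 descriptors
        if not name:
            problems.append("--name must be defined")
        elif name not in columns:
            problems.append(f"name column '{name}' not found")
        # Assumption: every column except the name column is a descriptor
        descriptors = [c for c in columns if c != name]
        if len(descriptors) < 3:
            problems.append("at least 3 descriptor columns are required")
    return problems

# Read only the header row of an in-memory CSV
sample = io.StringIO("code_name,SMILES\nmol_1,CCO\n")
header = next(csv.reader(sample))
print(check_cluster_header(header, use_aqme=True))
```

An empty list means the header satisfies the rules listed above for the chosen mode.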

n_clusters: int, default = None

Number of clusters used for the clustering. If not defined by the user, it is calculated using the Elbow Method.
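As an illustration of the Elbow Method, the sketch below picks the number of clusters from a list of KMeans inertia values by finding the point farthest from the straight line that joins the first and last points of the curve. This is one common heuristic and is only an assumption here: this page does not state which elbow criterion ALMOS applies. The function name `pick_elbow` and the inertia values are made up for the example.

```python
def pick_elbow(ks, inertias):
    """Return the k whose (k, inertia) point lies farthest from the chord
    joining the first and last points of the curve."""
    x1, y1 = ks[0], inertias[0]
    x2, y2 = ks[-1], inertias[-1]
    best_k, best_dist = ks[0], -1.0
    for k, inertia in zip(ks, inertias):
        # Unnormalised distance from the point to the chord; the constant
        # denominator is dropped because only the ranking matters
        dist = abs((y2 - y1) * k - (x2 - x1) * inertia + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k

ks = [1, 2, 3, 4, 5, 6]
inertias = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0]  # made-up values
print(pick_elbow(ks, inertias))  # prints 3
```

The result depends on the relative scale of the two axes, so this heuristic is a rough guide rather than a definitive choice of n_clusters.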

seed_clustered: int, default = 0

Random seed used during KMeans (in the k_means function).

descp_level: str, default = 'interpret'

Type of descriptor to be used in the ALMOS workflow. Options are 'interpret', 'denovo' or 'full'.

ignore: list, default = []

List containing the columns of the input CSV file that will be ignored during the clustering process (e.g. ['code_name','SMILES']). The descriptors will be included in the clustered CSV file. The y value is automatically ignored.

aqme: bool, default = False

Enables the AQME workflow to generate descriptors.

name: str, default = ''

Mandatory when the clustering is performed with descriptors already provided by the user. If the descriptors are generated by the program (using AQME), 'name' is not defined.

y: str, default = ''

Name of the column containing the response variable in the input CSV file (e.g. 'yield').

auto_fill: bool, default = True

If the CSV contains empty cells (less than 30% of NaN values per column), KNNImputer is applied, which uses the k-nearest neighbors method to estimate and fill missing values based on the closest points in the dataset. If auto_fill is False, KNNImputer is not applied (if empty cells remain, the program stops).
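To see which columns fall under the 30% limit mentioned above, the missing fraction of each column can be computed beforehand. The sketch below uses plain Python; the function names and the sample table are hypothetical, and how ALMOS treats columns at or above the limit is not described on this page, so they are only reported here.

```python
import math

def missing_fraction(values):
    """Fraction of entries that are None or NaN."""
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)

def columns_to_impute(table, max_missing=0.30):
    """Split the columns that have gaps into those below the limit
    (candidates for imputation) and those at or above it."""
    imputable, too_sparse = [], []
    for name, values in table.items():
        frac = missing_fraction(values)
        if frac == 0:
            continue  # complete column, nothing to fill
        if frac < max_missing:
            imputable.append(name)
        else:
            too_sparse.append(name)
    return imputable, too_sparse

# Made-up descriptor table: column name -> list of values
table = {
    "d1": [1.0, None, 3.0, 4.0, 5.0],     # 20% missing
    "d2": [None, None, 2.0, 1.0, 0.5],    # 40% missing
    "d3": [1.0, 2.0, 3.0, 4.0, 5.0],      # complete
}
print(columns_to_impute(table))  # prints (['d1'], ['d2'])
```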

categorical: str, default = 'onehot'

Can be used when users provide their own descriptors. Mode used to convert data from columns with categorical variables. As an example, a variable containing 4 types of C atoms (i.e. primary, secondary, tertiary, quaternary) will be converted into categorical variables. Options:

  1. 'onehot' (for one-hot encoding, ROBERT will create a descriptor for each type of C atom using 0s and 1s to indicate whether the C type is present)

  2. 'numbers' (to describe the C atoms with numbers: 1, 2, 3, 4).
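The difference between the two options can be illustrated with a small pure-Python sketch. The function names are hypothetical, and the numbering order used for 'numbers' (order of first appearance) is an assumption; the encoding performed inside the program may order the categories differently.

```python
def encode_onehot(values):
    """One 0/1 column per category, marking where that category occurs."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

def encode_numbers(values):
    """One integer per category, numbered from 1 in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    return [codes[v] for v in values]

c_types = ["primary", "secondary", "tertiary", "quaternary", "primary"]

print(encode_onehot(c_types)["primary"])  # prints [1, 0, 0, 0, 1]
print(encode_numbers(c_types))            # prints [1, 2, 3, 4, 1]
```

One-hot encoding produces four descriptors from the single C-type column, whereas 'numbers' keeps a single descriptor but implies an ordering between the categories.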

aqme_keywords: str, default = ''

Can be used to call specific AQME functions. The entire argument must be in quotation marks, as in the example (e.g. --aqme_keywords "--qdescp_atoms [1,2]").

varfile: str, default=None

Option to parse the variables using a yaml file (specify the filename, e.g. varfile=FILE.yaml).

nprocs: int, default=8

Number of processors used by AQME during the clustering.

EL

Parameters

el: bool

Indicates whether the exploratory learning process is enabled and should be performed. Defaults to False. This parameter is activated from the command line with --el.

csv_name: str

Name of the CSV file containing the database (e.g. 'FILE.csv').

y: str

Name of the column containing the response variable in the input CSV file (e.g. 'solubility').

name: str

Name of the column containing the molecule names in the input CSV file (e.g. 'names').

ignore: list, default=[]

List containing the columns of the input CSV file that will be ignored during the ROBERT process (e.g. --ignore "[name,SMILES]"). The descriptors will be included in the final CSV file. The y value, name column and batch column are automatically ignored by ROBERT.

explore_rt: float, default=1

Specifies the exploration ratio for the exploratory learning process, determining how many points to explore in relation to the total number of experiments (e.g. '--explore_rt 0.5'). If not provided or invalid, the program will request the value in the proper format.

n_exps: int

Number of experiments to be selected in the exploratory learning process for the new batch (e.g. '--n_exps 10'). If not provided or invalid, the program will request the value in the proper format.
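One way to read the relationship between explore_rt and n_exps is as a split of the new batch into exploration and exploitation points. The sketch below is an assumption about that arithmetic, not the ALMOS implementation: the function name `split_batch`, the rounding rule and the 0 to 1 range check are all hypothetical.

```python
def split_batch(n_exps, explore_rt=1.0):
    """Split n_exps into (exploration points, exploitation points).

    Assumes explore_rt is a fraction between 0 and 1 of the total batch.
    """
    if n_exps < 1:
        raise ValueError("n_exps must be a positive integer")
    if not 0.0 <= explore_rt <= 1.0:
        raise ValueError("explore_rt must be between 0 and 1")
    n_explore = round(n_exps * explore_rt)
    return n_explore, n_exps - n_explore

print(split_batch(10, 0.5))  # prints (5, 5)
print(split_batch(10))       # prints (10, 0)
```

Under this reading, the default explore_rt of 1 devotes the whole batch to exploration.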

tolerance: str, default='medium'

Indicates the tolerance level for the convergence process, defining the percentage change threshold required for convergence (e.g. '--tolerance tight'). Options:

  1. 'tight': strictest level, convergence occurs if the metric improves by ≤1% (threshold = 0.01).

  2. 'medium': balanced level, convergence occurs if the metric improves by ≤5% (threshold = 0.05).

  3. 'wide': least strict level, convergence occurs if the metric improves by ≤10% (threshold = 0.10).
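The three tolerance levels map directly to the thresholds quoted above. The following sketch shows how such a check could look; the mapping values come from this page, while the function name `has_converged` and the use of the relative change between two consecutive metric values are assumptions about how the comparison is made.

```python
# Thresholds as listed for the tolerance parameter
THRESHOLDS = {"tight": 0.01, "medium": 0.05, "wide": 0.10}

def has_converged(previous, current, tolerance="medium"):
    """Return True if the relative change between two consecutive metric
    values is at or below the threshold of the chosen tolerance level."""
    if tolerance not in THRESHOLDS:
        raise ValueError(f"unknown tolerance '{tolerance}'")
    if previous == 0:
        raise ValueError("previous metric value must be non-zero")
    relative_change = abs(current - previous) / abs(previous)
    return relative_change <= THRESHOLDS[tolerance]

# A metric going from 0.80 to 0.82 changes by about 2.5%
print(has_converged(0.80, 0.82, "tight"))   # prints False
print(has_converged(0.80, 0.82, "medium"))  # prints True
```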

robert_keywords: str, default=""

Additional keywords to be passed to the ROBERT model generation (e.g. --robert_keywords "--model RF --train [70] --seed [0]").

reverse: bool, default=False

If set to True, the order of the points in the new batch is reversed, so that exploitation prioritizes lower values (e.g. --reverse).

intelex: bool, default=False

If set to True, the program will not need module scikit-learn-intelex to speed up the model update process.

BO

Parameters

csv_name: str

Name of the CSV file containing the database (e.g. 'FILE.csv').

y: str

Name of the column(s) containing the response variable(s) in the input CSV file.

- For a single column, provide the column name as a string (e.g. 'solubility').

- To optimize two columns simultaneously, provide a list in the format [y1,y2], where y1 and y2 are the names of the columns (e.g. '[yield,ee]').
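The two accepted forms of y can be told apart by the surrounding brackets. The sketch below shows one way to turn either form into a list of column names; the function name `parse_targets` is hypothetical and this is not necessarily how ALMOS parses the argument.

```python
def parse_targets(y):
    """Turn 'solubility' or '[yield,ee]' into a list of column names."""
    y = y.strip()
    if y.startswith("[") and y.endswith("]"):
        # List form: drop the brackets and split on commas
        return [name.strip() for name in y[1:-1].split(",") if name.strip()]
    # Single-column form
    return [y]

print(parse_targets("solubility"))  # prints ['solubility']
print(parse_targets("[yield,ee]"))  # prints ['yield', 'ee']
```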

name: str

Name of the column containing the molecule names in the input CSV file (e.g. 'names').

ignore: list, default=[]

List containing the columns of the input CSV file that will be ignored during the BO process (e.g. --ignore "[name,SMILES]"). The descriptors will be included in the final CSV file. The y value, name column and batch column are automatically ignored.

batch_number: int, default=0

Number of the batch to be processed. The CSV file is always taken from the specified batch folder, and a new folder named 'batch_{batch_number+1}' will be generated for the output.
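The folder naming described above can be sketched with pathlib. The output folder name 'batch_{batch_number+1}' comes from this page; the input folder name 'batch_{batch_number}', the `root` argument and the function name `batch_folders` are assumptions made for illustration.

```python
from pathlib import Path

def batch_folders(batch_number, root="."):
    """Return (input folder, output folder) for a given batch number.

    Assumes the input folder follows the same 'batch_N' pattern as the output.
    """
    root = Path(root)
    source = root / f"batch_{batch_number}"
    target = root / f"batch_{batch_number + 1}"
    return source, target

source, target = batch_folders(0)
print(source.name, target.name)  # prints batch_0 batch_1
```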

n_exps: int, default=1

Specifies the number of new points for exploration and exploitation in the next batch.

reverse: bool, default=False

If False (default), the target value (y) is maximized. If True, the target value is minimized.