el_utils

class almos.el_utils.EarlyStopping(patience=2, score_tolerance=0, rmse_min_delta=0.05, sd_min_delta=0.05, logger=None)

Monitors model performance to determine convergence based on specified tolerances for different metrics.

This class tracks metrics (e.g., RMSE, SD, and score) over iterations, marking convergence if improvements fall below specified thresholds for a set number of iterations (patience). Results are logged and saved for analysis.
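
The patience rule described above can be illustrated with a minimal sketch. It assumes that convergence must be flagged on consecutive iterations, which the description does not state explicitly; the helper name `converged_after_patience` is hypothetical and not part of almos.

```python
def converged_after_patience(flags, patience=2):
    """Return True once `patience` consecutive iterations are flagged as converged.

    `flags` holds one boolean per iteration (True = improvement below threshold).
    Requiring a consecutive streak is an assumption of this sketch.
    """
    streak = 0
    for flag in flags:
        streak = streak + 1 if flag else 0  # any non-converged iteration resets the streak
        if streak >= patience:
            return True
    return False


print(converged_after_patience([False, True, False, True, True]))  # True
print(converged_after_patience([True, False, True, False]))        # False
```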

check_convergence(results_plot_no_PFI, results_plot_PFI)

Check for convergence for both PFI and no_PFI models independently. This function processes batch metrics, updates CSV files, and ensures only new or updated batches are added.

Parameters:

results_plot_no_PFI (list of dicts): Batch metrics for the no_PFI model.
results_plot_PFI (list of dicts): Batch metrics for the PFI model.

check_convergence_model(df, model_type)

Check for convergence for either the PFI or no_PFI model separately.

Parameters:

df (pd.DataFrame): The DataFrame containing the batch results.
model_type (str): Either 'PFI' or 'no_PFI' to determine which set of metrics to check.

Returns:

pd.DataFrame: The updated DataFrame with convergence columns and status.

check_metric_convergence(previous_row, last_row, metric_name, tolerance)

Checks if a specific metric has converged. The metric is considered converged if:

- It has not worsened (i.e., no negative changes).
- It has improved, but by less than the specified tolerance.

Parameters:

previous_row (pd.Series): The metrics from the previous iteration.
last_row (pd.Series): The metrics from the current iteration.
metric_name (str): The name of the metric being checked.
tolerance (float): The minimum percentage change required for improvement.

Returns:

bool: True if the metric has converged (no worsening or minimal improvement), False if the metric has worsened or improved significantly.
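
As a rough illustration of the rule above, the sketch below works on plain floats instead of `pd.Series` rows. It assumes a lower-is-better metric (such as RMSE or SD) and treats the tolerance as a fractional change relative to the previous value; both are assumptions, and `metric_converged` is a hypothetical name rather than the almos implementation.

```python
def metric_converged(previous, current, tolerance):
    """Lower-is-better metric: converged if it did not worsen and the
    relative improvement is smaller than `tolerance` (assumed to be a fraction)."""
    if previous == 0:
        # No relative change can be computed; treat "unchanged" as converged.
        return current == previous
    improvement = (previous - current) / abs(previous)
    return 0 <= improvement < tolerance


print(metric_converged(1.00, 0.98, 0.05))  # True: improved by 2 %, below the 5 % tolerance
print(metric_converged(1.00, 0.80, 0.05))  # False: improved by 20 %
print(metric_converged(1.00, 1.10, 0.05))  # False: the metric worsened
```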

check_score_convergence(previous_row, last_row, score_column, score_tolerance): Checks whether the score has improved beyond the score tolerance. The score is considered converged, and the method returns True, if it has not worsened.

check_score_no_improvement(previous_row, last_row, score_column): Returns True if the score has not improved (i.e., stays the same or gets worse).

show_summary(df, model_type)

Displays a final summary for either PFI or no_PFI metrics.

Parameters:

df (pd.DataFrame): The DataFrame containing the batch results.
model_type (str): Either 'PFI' or 'no_PFI' to determine which set of metrics to display.

almos.el_utils.assign_values(df, exploit_points, explore_points, quartile_medians, size_counters, predictions_column, sd_column, reverse): Assigns points for exploration by quartile, prioritizing those with highest uncertainty (sd_column). Uses size_counters to always select the quartile with the fewest assigned points. If a quartile has no available points, selects the point closest to the quartile median. If there are no exploitation points, distribute among all four quartiles.
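
The exploration rule can be sketched roughly as follows. This is a simplified illustration, not the almos implementation: it works on a list of dicts instead of a DataFrame, ignores `exploit_points` and `reverse`, and the keys `name`, `quartile`, `pred` and `sd` as well as the function name are hypothetical.

```python
def pick_exploration_points(candidates, size_counters, quartile_medians, n_points):
    """Pick `n_points` candidates, always serving the least-populated quartile.

    Within that quartile the candidate with the highest uncertainty ('sd') wins;
    if the quartile has no candidates, the one whose prediction is closest to
    the quartile median is taken instead.
    """
    counters = dict(size_counters)  # copy so the caller's dict is untouched
    pool = list(candidates)
    chosen = []
    for _ in range(min(n_points, len(pool))):
        target = min(counters, key=counters.get)  # quartile with the fewest points
        in_target = [c for c in pool if c["quartile"] == target]
        if in_target:
            pick = max(in_target, key=lambda c: c["sd"])
        else:
            pick = min(pool, key=lambda c: abs(c["pred"] - quartile_medians[target]))
        pool.remove(pick)
        chosen.append(pick["name"])
        counters[target] += 1
    return chosen


candidates = [
    {"name": "a", "quartile": "q1", "pred": 10, "sd": 0.5},
    {"name": "b", "quartile": "q1", "pred": 12, "sd": 0.9},
    {"name": "c", "quartile": "q3", "pred": 60, "sd": 0.2},
    {"name": "d", "quartile": "q4", "pred": 90, "sd": 0.4},
]
counters = {"q1": 3, "q2": 0, "q3": 2, "q4": 1}
medians = {"q1": 12, "q2": 37, "q3": 62, "q4": 88}
print(pick_exploration_points(candidates, counters, medians, 3))  # ['c', 'b', 'd']
```

In this run q2 is empty twice in a row, so the two candidates closest to its median (37) are taken first; q4 then becomes the least-populated quartile and contributes its only candidate.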

almos.el_utils.check_missing_outputs(self)

Validates input parameters for exploratory learning.

This method:

- Loads default options and adds values for missing attributes ('target_column', 'name_column', 'ignore_list').
- Prompts for and locates the CSV file if not specified, loading it into a DataFrame.
- Ensures that required columns for molecule names and target values exist, prompting for values if necessary.
- Validates 'explore_rt' and 'tolerance', ensuring valid ranges.
- Validates 'n_exps', ensuring it is a positive integer.
- Manages the 'batch_column', adding or updating it as needed for data completeness.
- Updates 'ignore_list' and saves final options to a file.

Raises:

SystemExit: If any required input is missing, invalid, or the file is not found.

almos.el_utils.extract_points_from_csv(batch_number)

Extract training and test points from CSV files for both the PFI and No_PFI models.

Parameters:

batch_number (int): The batch number to process.

Returns:

dict: A dictionary with the number of training and test points for the No_PFI and PFI models.

almos.el_utils.extract_rmse_and_score_from_column(page, bbox)

Extract the RMSE and SCORE values from a specific column on a given page of a PDF. The function first tries to match 'Test' results; if none are found, it tries 'Valid' results.

Parameters:

page (pdfplumber Page object): The page of the PDF report from which to extract the data.
bbox (tuple): The bounding box (coordinates) that specifies the column area in the PDF (PFI model or non-PFI model).

Returns:

tuple: A tuple containing the extracted RMSE value and SCORE value, or (None, None) if no patterns match.
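
The 'Test' then 'Valid' fallback can be sketched on plain text, as would be obtained from a cropped page region. The line layout used here ("Test : ... RMSE = 0.31", "Score 7") is invented for illustration and may not match the actual ROBERT report; `parse_rmse_and_score` is a hypothetical helper.

```python
import re


def parse_rmse_and_score(text):
    """Return (rmse, score) from `text`, preferring a 'Test' line over a 'Valid' line.

    The regular expressions assume an invented layout, not the real report format.
    """
    score_match = re.search(r"Score\s*[:=]?\s*(\d+)", text, flags=re.IGNORECASE)
    score = int(score_match.group(1)) if score_match else None
    for label in ("Test", "Valid"):
        # `.` does not cross newlines, so the RMSE must sit on the same line as the label.
        match = re.search(rf"{label}\b.*?RMSE\s*=\s*([0-9]*\.?[0-9]+)", text)
        if match:
            return float(match.group(1)), score
    return None, None  # no RMSE pattern matched


with_test = "Valid : RMSE = 0.25\nTest : R2 = 0.90, RMSE = 0.31\nScore 7"
only_valid = "Train : RMSE = 0.10\nValid : RMSE = 0.25\nScore 6"
print(parse_rmse_and_score(with_test))   # (0.31, 7)
print(parse_rmse_and_score(only_valid))  # (0.25, 6)
print(parse_rmse_and_score("no metrics here"))  # (None, None)
```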

almos.el_utils.extract_sd_from_column(page, bbox)

Extract SD value from a specific column on a given page of a PDF.

Parameters:

page (pdfplumber Page object): The page of the PDF report from which to extract the data.
bbox (tuple): The bounding box (coordinates) that specifies the column area in the PDF (PFI model or non-PFI model).

Returns:

float or None: The extracted SD value, or None if no pattern matches.

almos.el_utils.find_closest_value(df, target_median, target_column)

Find the value in a specified column of a DataFrame that is closest to a target median value.

Parameters:

df (pd.DataFrame): The DataFrame containing the data to search through.
target_median (float): The target median value to compare against.
target_column (str): The name of the column in which to find the value closest to the target median.

Returns:

pd.Series: The row in the DataFrame where the value in target_column is closest to target_median.
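
A nearest-value lookup of this kind is commonly written with `abs()` and `idxmin()` in pandas. The sketch below shows one way to do it and is not necessarily how almos implements it; `closest_row` and the example column names are hypothetical.

```python
import pandas as pd


def closest_row(df, target_median, target_column):
    """Return the row whose `target_column` value is nearest to `target_median`."""
    idx = (df[target_column] - target_median).abs().idxmin()
    return df.loc[idx]


df = pd.DataFrame({"name": ["a", "b", "c"], "yield": [10.0, 42.0, 75.0]})
row = closest_row(df, 50.0, "yield")
print(row["name"])  # b
```

On ties, `idxmin()` returns the first matching index, so the earliest row wins.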

almos.el_utils.generate_quartile_medians_df(df_total, df_exp, values_column)

Assign quartiles (q1, q2, q3, q4) to values in a DataFrame column based on their range. Also, calculate the median value for each quartile.

Parameters:

df_total (pd.DataFrame): Experimental values and predictions, used to calculate the range of values for determining quartiles.
df_exp (pd.DataFrame): The experimental dataset where quartiles will be assigned.
values_column (str): The name of the column in df_total and df_exp that contains the target values.

Returns:

df_exp (pd.DataFrame): The experimental dataset with a new 'quartile' column, assigning each value to q1, q2, q3, or q4.
quartile_medians (dict): A dictionary containing the median value of each quartile (q1, q2, q3, q4).
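
A range-based assignment can be sketched as below, using plain lists instead of DataFrames. The sketch assumes that "based on their range" means four equal-width bins between the minimum and maximum of all values, and that an empty bin gets a median of None; both are assumptions, and `quartile_medians_sketch` is a hypothetical name.

```python
from statistics import median


def quartile_medians_sketch(all_values, exp_values):
    """Split the range of `all_values` into four equal-width bins, label each
    experimental value q1..q4, and return (labels, per-bin medians)."""
    lo, hi = min(all_values), max(all_values)
    width = (hi - lo) / 4
    labels = []
    for value in exp_values:
        # Clamp the bin index to 0..3 so that value == hi lands in q4.
        idx = max(0, min(int((value - lo) / width), 3)) if width else 0
        labels.append(f"q{idx + 1}")
    medians = {}
    for q in ("q1", "q2", "q3", "q4"):
        members = [v for v, label in zip(exp_values, labels) if label == q]
        medians[q] = median(members) if members else None
    return labels, medians


labels, medians = quartile_medians_sketch([0, 100], [5, 20, 30, 60, 100])
print(labels)   # ['q1', 'q1', 'q2', 'q3', 'q4']
print(medians)  # {'q1': 12.5, 'q2': 30, 'q3': 60, 'q4': 100}
```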

almos.el_utils.get_metrics_from_batches()

Generates metrics for plotting by processing each batch directory.

Iterates over directories named 'batch_*' (excluding 'batch_plots' and 'batch_0') and collects metrics with and without PFI for each batch by calling 'process_batch'.

Returns:

tuple: (results_plot_no_PFI, results_plot_PFI), lists of metrics without and with PFI for each batch.
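
The directory scan described above can be sketched with `pathlib`. The helper below only lists the batch numbers that would be processed; it does not call `process_batch`, and the name `list_batch_numbers` is hypothetical.

```python
import tempfile
from pathlib import Path


def list_batch_numbers(root="."):
    """Return sorted batch numbers for 'batch_<n>' directories under `root`,
    skipping 'batch_0' and non-numeric names such as 'batch_plots'."""
    numbers = []
    for path in Path(root).glob("batch_*"):
        suffix = path.name[len("batch_"):]
        if path.is_dir() and suffix.isdigit() and int(suffix) != 0:
            numbers.append(int(suffix))
    return sorted(numbers)


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("batch_0", "batch_1", "batch_2", "batch_10", "batch_plots"):
        (Path(tmp) / name).mkdir()
    found = list_batch_numbers(tmp)

print(found)  # [1, 2, 10]
```

Sorting on the integer suffix rather than on the directory name keeps batch_10 after batch_2.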

almos.el_utils.get_quartile(value, boundaries)

Determine the quartile a given value falls into based on specified boundaries.

Parameters:

value (float): The value to be classified into a quartile.
boundaries (list of float): A list of boundary values defining the quartile ranges.

Returns:

str: The quartile ('q1', 'q2', 'q3', 'q4') the value falls into.
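
One compact way to classify a value against sorted boundaries is `bisect`. The sketch assumes that `boundaries` holds three ascending interior cut points and that a value equal to a cut point belongs to the upper quartile; the real function may use a different layout. `quartile_label` is a hypothetical name.

```python
from bisect import bisect_right


def quartile_label(value, boundaries):
    """Map `value` to 'q1'..'q4' given three ascending interior cut points."""
    return f"q{bisect_right(boundaries, value) + 1}"


cuts = [25, 50, 75]
print([quartile_label(v, cuts) for v in (10, 25, 60, 90)])  # ['q1', 'q2', 'q3', 'q4']
```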

almos.el_utils.get_scores_from_robert_report(pdf_path)

Extract score values from both left (No_PFI) and right (PFI) columns in the first page of the PDF.

Parameters:

pdf_path (Path): Path to the ROBERT_report.pdf.

Returns:

tuple: A tuple (score_no_PFI, score_PFI), where either can be None if not found.

almos.el_utils.get_size_counters(df)

Count the number of points in each quartile (q1, q2, q3, q4).

Parameters:

df (pd.DataFrame): The DataFrame that contains a 'quartile' column, which categorizes values into quartiles (q1, q2, q3, q4).

Returns:

dict: A dictionary with keys 'q1', 'q2', 'q3' and 'q4', where each value is the number of points in that quartile.
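
The counting step can be sketched with `collections.Counter` on a plain list of labels (the contents of the 'quartile' column). Reporting zero for quartiles without points is an assumption of this sketch, and `size_counters_sketch` is a hypothetical name.

```python
from collections import Counter


def size_counters_sketch(quartile_labels):
    """Count labels per quartile, reporting 0 for quartiles with no points."""
    counts = Counter(quartile_labels)
    return {q: counts.get(q, 0) for q in ("q1", "q2", "q3", "q4")}


print(size_counters_sketch(["q1", "q1", "q3"]))  # {'q1': 2, 'q2': 0, 'q3': 1, 'q4': 0}
```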

almos.el_utils.load_options_from_csv(options_file)

Load default options from a CSV file if user inputs are not provided.

Parameters:

options_file (str): The path to the CSV file containing default options.

Returns:

dict or None: A dictionary containing the default values for 'y', 'ignore', and 'name' if the file is successfully read. Returns None if the file is not found.
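
The documented behavior (a dict with 'y', 'ignore' and 'name', or None when the file is missing) can be sketched with the standard `csv` module. The file layout assumed here, a header row followed by a single row of values, is a guess, as are the example values; `load_defaults` is a hypothetical name.

```python
import csv
import os
import tempfile


def load_defaults(options_file):
    """Return {'y', 'ignore', 'name'} from the first data row, or None if the
    file does not exist or has no data row (assumed layout: header + one row)."""
    try:
        with open(options_file, newline="") as fh:
            row = next(csv.DictReader(fh), None)
    except FileNotFoundError:
        return None
    if row is None:
        return None
    return {key: row.get(key) for key in ("y", "ignore", "name")}


# Demo with a throwaway options file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "options.csv")
    with open(path, "w", newline="") as fh:
        fh.write("y,ignore,name\nyield,batch,code_name\n")
    loaded = load_defaults(path)
    missing = load_defaults(os.path.join(tmp, "absent.csv"))

print(loaded)   # {'y': 'yield', 'ignore': 'batch', 'name': 'code_name'}
print(missing)  # None
```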

almos.el_utils.plot_metrics_subplots(data, model_type, output_dir='batch_plots', batch_count=0): Function to plot different metrics in a 4x1 subplot layout and save as a single image.

almos.el_utils.process_batch(batch_number)

Extract RMSE, SD, and score data from both the left and right columns of the PDF report for a specific batch (non-PFI model and PFI model). Also extract the number of points from CSV files for both the PFI and No_PFI models.

Parameters:

batch_number (int): The batch number to process (e.g., 1, 2, 3).

Returns:

dict: A dictionary containing the batch number, RMSE, and SD values for both columns (no_PFI and PFI).