el_utils
- class almos.el_utils.EarlyStopping(patience=2, score_tolerance=0, rmse_min_delta=0.05, sd_min_delta=0.05, logger=None)
Monitors model performance to determine convergence based on specified tolerances for different metrics.
This class tracks metrics (e.g., RMSE, SD, and score) over iterations, marking convergence if improvements fall below specified thresholds for a set number of iterations (patience). Results are logged and saved for analysis.
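As a rough illustration of the patience mechanism described above, the sketch below counts consecutive iterations flagged as converged and reports convergence once the count reaches `patience`. The `PatienceCounter` class and its `update` method are hypothetical names invented for this example; they are not part of `almos.el_utils`, and the real class may track its state differently.

```python
class PatienceCounter:
    """Hypothetical helper for illustration; not part of almos.el_utils."""

    def __init__(self, patience=2):
        self.patience = patience
        self.streak = 0  # consecutive iterations flagged as converged

    def update(self, iteration_converged):
        """Record one iteration; return True once the streak reaches patience."""
        if iteration_converged:
            self.streak += 1
        else:
            self.streak = 0  # a significant change resets the streak
        return self.streak >= self.patience


counter = PatienceCounter(patience=2)
history = [counter.update(flag) for flag in [True, False, True, True]]
```

With `patience=2`, only the final iteration in this example is reported as converged, because the streak is reset by the second entry.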
- check_convergence(results_plot_no_PFI, results_plot_PFI)
Check for convergence for both PFI and no_PFI models independently. This function processes batch metrics, updates CSV files, and ensures only new or updated batches are added.
Parameters:
- results_plot_no_PFI : list of dicts
Batch metrics for the no_PFI model.
- results_plot_PFI : list of dicts
Batch metrics for the PFI model.
- check_convergence_model(df, model_type)
Check for convergence for either the PFI or no_PFI model separately.
Parameters:
- df : pd.DataFrame
The DataFrame containing the batch results.
- model_type : str
Either 'PFI' or 'no_PFI' to determine which set of metrics to check.
Returns:
- pd.DataFrame
The updated DataFrame with convergence columns and status.
- check_metric_convergence(previous_row, last_row, metric_name, tolerance)
Checks if a specific metric has converged. The metric is considered converged if it has not worsened (i.e., no negative change) and any improvement is smaller than the specified tolerance.
Parameters:
- previous_row : pd.Series
The metrics from the previous iteration.
- last_row : pd.Series
The metrics from the current iteration.
- metric_name : str
The name of the metric being checked.
- tolerance : float
The minimum percentage change required for improvement.
Returns:
- bool
True if the metric has converged (no worsening or minimal improvement), False if the metric has worsened or improved significantly.
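The rule above can be sketched as follows. This is a minimal illustration, not the library's code: it assumes a lower-is-better metric such as RMSE, and it assumes that `tolerance` is a fractional relative change (0.05 meaning 5%). The function name `metric_converged` is hypothetical.

```python
def metric_converged(previous_row, last_row, metric_name, tolerance):
    """Hypothetical sketch; assumes a lower-is-better metric such as RMSE."""
    previous = previous_row[metric_name]
    last = last_row[metric_name]
    if previous == 0:
        # Relative change is undefined here; treat "unchanged" as converged.
        return last == 0
    # Positive when the metric improved (decreased), negative when it worsened.
    relative_improvement = (previous - last) / abs(previous)
    return 0 <= relative_improvement < tolerance


# Plain dicts stand in for pd.Series rows; both support row[metric_name].
small_gain = metric_converged({"rmse": 1.00}, {"rmse": 0.98}, "rmse", 0.05)
large_gain = metric_converged({"rmse": 1.00}, {"rmse": 0.80}, "rmse", 0.05)
worsened = metric_converged({"rmse": 1.00}, {"rmse": 1.10}, "rmse", 0.05)
```

Here only `small_gain` is True: a 2% improvement is below the 5% tolerance, while a 20% improvement or any worsening is not treated as convergence.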
- check_score_convergence(previous_row, last_row, score_column, score_tolerance)
Checks whether the score has improved beyond the score tolerance. If the score has not worsened, it is considered converged and the method returns True.
- check_score_no_improvement(previous_row, last_row, score_column)
Returns True if the score has not improved (i.e., stays the same or gets worse).
- almos.el_utils.assign_values(df, exploit_points, explore_points, quartile_medians, size_counters, predictions_column, sd_column, reverse)
Assigns points for exploration by quartile, prioritizing those with highest uncertainty (sd_column). Uses size_counters to always select the quartile with the fewest assigned points. If a quartile has no available points, selects the point closest to the quartile median. If there are no exploitation points, distribute among all four quartiles.
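The selection loop described above might look roughly like the sketch below. It is a simplified illustration under several assumptions: candidates are plain dicts with hypothetical keys `quartile`, `pred`, and `sd`; exploitation points and the `reverse` flag are ignored; and ties between quartiles are broken by dictionary order. The function name `pick_exploration_points` is not part of `almos.el_utils`.

```python
def pick_exploration_points(candidates, size_counters, quartile_medians, n_points):
    """Hypothetical sketch of quartile-balanced, uncertainty-driven selection."""
    counters = dict(size_counters)  # copy so the caller's dict is not modified
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < n_points:
        # Quartile with the fewest assigned points (first key wins on ties).
        target = min(counters, key=counters.get)
        in_quartile = [c for c in remaining if c["quartile"] == target]
        if in_quartile:
            # Prefer the most uncertain candidate inside the target quartile.
            best = max(in_quartile, key=lambda c: c["sd"])
        else:
            # No candidate in that quartile: take the one nearest its median.
            median = quartile_medians[target]
            best = min(remaining, key=lambda c: abs(c["pred"] - median))
        chosen.append(best)
        remaining.remove(best)
        counters[target] += 1
    return chosen


candidates = [
    {"name": "A", "quartile": "q2", "pred": 30.0, "sd": 0.5},
    {"name": "B", "quartile": "q2", "pred": 40.0, "sd": 1.5},
    {"name": "C", "quartile": "q3", "pred": 70.0, "sd": 0.9},
    {"name": "D", "quartile": "q1", "pred": 5.0, "sd": 2.0},
]
size_counters = {"q1": 3, "q2": 1, "q3": 2, "q4": 0}
quartile_medians = {"q1": 10.0, "q2": 35.0, "q3": 60.0, "q4": 90.0}
picked = [c["name"] for c in pick_exploration_points(
    candidates, size_counters, quartile_medians, n_points=3)]
```

In this example q4 is empty, so the first pick is the candidate whose prediction lies closest to the q4 median, after which the counters are updated and the loop continues.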
- almos.el_utils.check_missing_outputs(self)
Validates input parameters for exploratory learning.
This method:
  - Loads default options and adds values for missing attributes ('target_column', 'name_column', 'ignore_list').
  - Prompts for and locates the CSV file if not specified, loading it into a DataFrame.
  - Ensures that required columns for molecule names and target values exist, prompting for values if necessary.
  - Validates 'explore_rt' and 'tolerance', ensuring valid ranges.
  - Validates 'n_exps', ensuring it is a positive integer.
  - Manages the 'batch_column', adding or updating it as needed for data completeness.
  - Updates 'ignore_list' and saves final options to a file.
- Raises:
SystemExit: If any required input is missing, invalid, or the file is not found.
- almos.el_utils.extract_points_from_csv(batch_number)
Extract the number of training and test points from CSV files for both the PFI and No_PFI models.
- Args:
batch_number (int): The batch number to process.
- Returns:
dict: A dictionary with the number of training and test points for the No_PFI and PFI models.
- almos.el_utils.extract_rmse_and_score_from_column(page, bbox)
Extract the RMSE and SCORE values from a specific column on a given page of a PDF. First tries to match 'Test' results; if none are found, falls back to 'Valid' results.
Parameters:
- page : pdfplumber Page object
The page (of the PDF report) from which to extract the data.
- bbox : tuple
The bounding box (coordinates) to specify the column area in the PDF (PFI model or non PFI model).
Returns:
- tuple
A tuple containing the extracted RMSE value and SCORE value, or (None, None) if no patterns match.
- almos.el_utils.extract_sd_from_column(page, bbox)
Extract SD value from a specific column on a given page of a PDF.
Parameters:
- page : pdfplumber Page object
The page (of the PDF report) from which to extract the data.
- bbox : tuple
The bounding box (coordinates) to specify the column area in the PDF (PFI model or non PFI model).
Returns:
- float or None
The extracted SD value, or None if no pattern matches.
- almos.el_utils.find_closest_value(df, target_median, target_column)
Find the value in a specified column of a DataFrame that is closest to a target median value.
Parameters:
- df : pd.DataFrame
The DataFrame containing the data to search through.
- target_median : float
The target median value to compare against.
- target_column : str
The name of the column in which to find the value closest to the target median.
Returns:
- pd.Series
The row in the DataFrame where the value in the target_column is closest to target_median.
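A nearest-value lookup of this kind can be written in a few lines of pandas. The sketch below is one possible approach, not necessarily how `find_closest_value` is implemented; the function name `closest_row` and the sample data are made up for illustration, and it assumes the DataFrame index is unique.

```python
import pandas as pd


def closest_row(df, target_median, target_column):
    """Hypothetical sketch: return the row nearest to target_median."""
    # Absolute distance of every value in the column from the target.
    distances = (df[target_column] - target_median).abs()
    # idxmin gives the index label of the smallest distance.
    return df.loc[distances.idxmin()]


df = pd.DataFrame({"name": ["a", "b", "c"], "yield": [10.0, 42.0, 77.0]})
row = closest_row(df, 40.0, "yield")
```

If several rows are equally close, `idxmin` returns the first one in index order.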
- almos.el_utils.generate_quartile_medians_df(df_total, df_exp, values_column)
Assign quartiles (q1, q2, q3, q4) to values in a DataFrame column based on their range. Also, calculate the median value for each quartile.
Parameters:
- df_total : pd.DataFrame
Experimental values and predictions are used to calculate the range of values for determining quartiles.
- df_exp : pd.DataFrame
The experimental dataset where quartiles will be assigned.
- values_column : str
The name of the column in df_total and df_exp that contains the target values.
Returns:
- df_exp : pd.DataFrame
The experimental dataset with a new 'quartile' column, assigning each value to q1, q2, q3, or q4.
- quartile_medians : dict
A dictionary containing the median values for the four quartiles (q1, q2, q3, q4).
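To make the range-based assignment concrete, here is a minimal sketch that assumes the range of `df_total` is split into four equal-width bins and that all values in `df_exp` fall inside that range. Both assumptions, and the function name `quartiles_by_range`, are illustrative and not taken from the library.

```python
import pandas as pd


def quartiles_by_range(df_total, df_exp, values_column):
    """Hypothetical sketch; assumes equal-width bins and max > min."""
    low = df_total[values_column].min()
    high = df_total[values_column].max()
    step = (high - low) / 4
    # Five edges delimit four bins; the last edge is pinned to the maximum.
    edges = [low + i * step for i in range(4)] + [high]
    df_exp = df_exp.copy()  # leave the caller's DataFrame untouched
    df_exp["quartile"] = pd.cut(
        df_exp[values_column],
        bins=edges,
        labels=["q1", "q2", "q3", "q4"],
        include_lowest=True,
    ).astype(str)
    # Median per quartile; quartiles with no points are absent from the dict.
    medians = df_exp.groupby("quartile")[values_column].median().to_dict()
    return df_exp, medians


df_total = pd.DataFrame({"y": [0.0, 20.0, 55.0, 100.0]})
df_exp = pd.DataFrame({"y": [5.0, 15.0, 30.0, 60.0, 90.0]})
labelled, medians = quartiles_by_range(df_total, df_exp, "y")
```

Because the bins come from the value range rather than from the data distribution, a quartile can end up with very few points or none at all.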
- almos.el_utils.get_metrics_from_batches()
Generates metrics for plotting by processing each batch directory.
Iterates over directories named 'batch_*' (excluding 'batch_plots' and 'batch_0') and collects metrics with and without PFI for each batch by calling 'process_batch'.
- Returns:
- tuple: (results_plot_no_PFI, results_plot_PFI), lists of metrics without and with PFI for each batch.
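The directory filtering described above can be sketched with `pathlib`. This is an illustration only: the helper name `list_batch_dirs` is hypothetical, and the numeric sort and the restriction to purely numeric suffixes are assumptions of the example rather than documented behaviour.

```python
import tempfile
from pathlib import Path


def list_batch_dirs(root):
    """Hypothetical sketch: batch_* directories except batch_plots and batch_0."""
    excluded = {"batch_plots", "batch_0"}
    dirs = [
        p for p in Path(root).glob("batch_*")
        if p.is_dir()
        and p.name not in excluded
        and p.name.split("_", 1)[1].isdigit()  # assumption: numeric suffix only
    ]
    # Sort numerically so that batch_10 comes after batch_2.
    return sorted(dirs, key=lambda p: int(p.name.split("_", 1)[1]))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("batch_0", "batch_1", "batch_2", "batch_10", "batch_plots"):
        (Path(tmp) / name).mkdir()
    found = [p.name for p in list_batch_dirs(tmp)]
```

Each directory returned this way would then be passed to a per-batch routine such as `process_batch`.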
- almos.el_utils.get_quartile(value, boundaries)
Determine the quartile a given value falls into based on specified boundaries.
Parameters:
- value : float
The value to be classified into a quartile.
- boundaries : list of float
A list of boundary values defining the quartile ranges.
Returns:
- str
The quartile ('q1', 'q2', 'q3', 'q4') the value falls into.
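A boundary lookup like this can be expressed with the standard-library `bisect` module. The sketch below assumes `boundaries` holds three ascending interior cut points and that a value equal to a cut point belongs to the lower quartile; the actual layout of `boundaries` in `almos.el_utils` may differ. The name `quartile_label` is hypothetical.

```python
from bisect import bisect_left


def quartile_label(value, boundaries):
    """Hypothetical sketch; boundaries = three ascending interior cut points."""
    # bisect_left counts the cut points that lie strictly below the value.
    return f"q{bisect_left(boundaries, value) + 1}"


cuts = [25.0, 50.0, 75.0]
labels = [quartile_label(v, cuts) for v in (10.0, 25.0, 30.0, 90.0)]
```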
- almos.el_utils.get_scores_from_robert_report(pdf_path)
Extract score values from both the left (No_PFI) and right (PFI) columns on the first page of the PDF.
Parameters:
- pdf_path : Path
Path to the ROBERT_report.pdf.
Returns:
- tuple
A tuple (score_no_PFI, score_PFI), where either can be None if not found.
- almos.el_utils.get_size_counters(df)
Count the number of points in each quartile (q1, q2, q3, q4).
Parameters:
- df : pd.DataFrame
The DataFrame that contains a 'quartile' column, which categorizes values into quartiles (q1, q2, q3, q4).
Returns:
- dict
A dictionary with keys 'q1', 'q2', 'q3' and 'q4', where each value is the number of points in that quartile.
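Counting rows per quartile is a one-liner with pandas. The sketch below shows one way to do it while guaranteeing that all four keys are present, including quartiles with no rows; the function name `count_by_quartile` is hypothetical and the zero-fill behaviour is an assumption of the example.

```python
import pandas as pd


def count_by_quartile(df):
    """Hypothetical sketch: number of rows per quartile label."""
    counts = df["quartile"].value_counts()
    # Series.get falls back to 0 for quartiles that do not occur.
    return {q: int(counts.get(q, 0)) for q in ("q1", "q2", "q3", "q4")}


df = pd.DataFrame({"quartile": ["q1", "q1", "q3", "q4", "q1"]})
counters = count_by_quartile(df)
```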
- almos.el_utils.load_options_from_csv(options_file)
Load default options from a CSV file if user inputs are not provided.
Parameters:
- options_file : str
The path to the CSV file containing default options.
Returns:
- dict or None
A dictionary containing the default values for 'y', 'ignore', and 'name' if the file is successfully read. Returns None if the file is not found.
- almos.el_utils.plot_metrics_subplots(data, model_type, output_dir='batch_plots', batch_count=0)
Plots the metrics in a 4x1 subplot layout and saves the result as a single image.
- almos.el_utils.process_batch(batch_number)
Extract RMSE, SD, and score data from both the left and right columns of the PDF report for a specific batch (no_PFI and PFI models). Also extract the number of points from CSV files for both the PFI and No_PFI models.
Parameters:
- batch_number : int
The batch number to process (e.g., 1, 2, 3).
Returns:
- dict
A dictionary containing the batch number, RMSE, and SD values for both columns (no_PFI and PFI).