Folds

This submodule provides functions to split, save and load folds.

biom3d.utils.fold.get_folds_df(df: pandas.DataFrame, verbose: bool = True) → list[list[str]]

Extract folds from a DataFrame into a list of lists.

Parameters:

df (pandas.DataFrame) – DataFrame with a ‘fold’ column indicating fold assignment.
verbose (bool, default=True) – If True, prints the number and size of the folds.

Returns:

List of folds, each being a list of filenames (or sample IDs).

Return type:

list of list
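The grouping described above can be sketched in a few lines of pandas. This is an illustrative re-implementation, not the biom3d source: the helper name `folds_from_df` and the `filename` column are assumptions made for the example.

```python
import pandas as pd


def folds_from_df(df: pd.DataFrame) -> list[list[str]]:
    """Group sample names by the value of the 'fold' column.

    Hypothetical helper: assumes sample names live in a 'filename' column.
    """
    return [
        group["filename"].tolist()
        for _, group in df.groupby("fold", sort=True)
    ]


# Toy table: five samples assigned to three folds.
df = pd.DataFrame({
    "filename": ["a.tif", "b.tif", "c.tif", "d.tif", "e.tif"],
    "fold": [0, 1, 0, 2, 1],
})

folds = folds_from_df(df)
for index, fold in enumerate(folds):
    print(f"fold {index}: {len(fold)} samples")
```

Folds come back ordered by fold index, and the row order of the DataFrame is preserved within each fold.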

biom3d.utils.fold.get_folds_train_test_df(df: pandas.DataFrame, verbose: bool = True, merge_test: bool = True) → tuple[list[list[str]], list[list[str]] | list[str]]

Extract fold groups from both train and test sets.

Parameters:

df (pandas.DataFrame) – DataFrame with ‘hold_out’ and ‘fold’ columns.
verbose (bool, default=True) – If True, prints debug info.
merge_test (bool, default=True) – If True, test folds are merged into one list.

Returns:

train_folds (list of list) – List of training folds, each being a list of filenames.
test_folds (list or list of list) – Test set, as a single merged list if merge_test is True, otherwise as a list of folds.
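The effect of `merge_test` on the return shape can be illustrated with a small stand-alone sketch. The helper `train_test_folds` and the `filename` column are assumptions for illustration; the actual biom3d implementation may differ.

```python
import pandas as pd


def train_test_folds(df: pd.DataFrame, merge_test: bool = True):
    """Split rows by 'hold_out', then group each part by 'fold'.

    Hypothetical helper: assumes sample names live in a 'filename' column.
    """
    def group_by_fold(sub: pd.DataFrame) -> list[list[str]]:
        return [g["filename"].tolist() for _, g in sub.groupby("fold", sort=True)]

    train_folds = group_by_fold(df[df["hold_out"] == 0])
    test_folds = group_by_fold(df[df["hold_out"] == 1])
    if merge_test:
        # Flatten the test folds into one list of sample names.
        test_folds = [name for fold in test_folds for name in fold]
    return train_folds, test_folds


df = pd.DataFrame({
    "filename": ["a", "b", "c", "d", "e", "f"],
    "hold_out": [0, 0, 0, 0, 1, 1],
    "fold": [0, 1, 0, 1, 0, 1],
})

train_folds, test_merged = train_test_folds(df, merge_test=True)
_, test_per_fold = train_test_folds(df, merge_test=False)
```

With `merge_test=True` the second return value is a flat list of names; with `merge_test=False` it keeps one inner list per fold.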

biom3d.utils.fold.get_splits_train_val_test(df: pandas.DataFrame) → tuple[list[list[str]], list[str], list[str]]

Create dataset splits of different sizes, along with validation and test sets.

Assumes columns:

- ‘split’: indicates split index (e.g., 0=50%, 1=25%, etc.)
- ‘fold’: used to separate training and validation
- ‘hold_out’: 0=train/val, 1=test
- ‘filename’: sample identifier

The splits contain [100%, 50%, 25%, 10%, 5%, 2%, the rest] of the dataset.

Returns:

train_splits (list of list) – List of training splits (first is the full training set, followed by reduced ones).
valset (list) – List of filenames used for validation.
testset (list) – List of filenames used for testing.
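A rough sketch of how the four columns could combine into the three return values is given below. It rests on assumptions not stated in this page: that validation samples are those with `fold == 0`, and that each distinct `split` value yields one reduced training subset. The helper name `splits_train_val_test` and its `val_fold` parameter are hypothetical.

```python
import pandas as pd


def splits_train_val_test(df: pd.DataFrame, val_fold: int = 0):
    """Build [full train set, reduced subsets...], a validation set and a test set.

    ASSUMPTION: rows with fold == val_fold form the validation set.
    """
    test = df[df["hold_out"] == 1]
    pool = df[df["hold_out"] == 0]          # train + validation rows
    val = pool[pool["fold"] == val_fold]
    train = pool[pool["fold"] != val_fold]

    # First entry is the full training set, followed by one subset per split index.
    train_splits = [train["filename"].tolist()]
    for _, group in train.groupby("split", sort=True):
        train_splits.append(group["filename"].tolist())
    return train_splits, val["filename"].tolist(), test["filename"].tolist()


df = pd.DataFrame({
    "filename": ["f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7"],
    "hold_out": [0, 0, 0, 0, 0, 0, 1, 1],
    "fold":     [0, 1, 1, 1, 1, 0, 0, 0],
    "split":    [0, 0, 0, 1, 2, 1, 0, 0],
})

train_splits, valset, testset = splits_train_val_test(df)
```

In this toy table the reduced subsets are disjoint, so together they add up to the full training set in the first entry.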

biom3d.utils.fold.get_splits_train_val_test_overlapping(df: pandas.DataFrame) → tuple[list[list[str]], list[str], list[str]]

Create overlapping training splits plus validation and test sets.

Each smaller training subset is fully included in all larger ones. Used for dataset scaling experiments (e.g., 100%, 50%, 25%, etc.).

Parameters:

df (pandas.DataFrame) – DataFrame with ‘split’, ‘fold’, ‘hold_out’, and ‘filename’ columns.

Returns:

train_splits (list of list) – List of overlapping training subsets.
valset (list) – List of filenames used for validation.
testset (list) – List of filenames used for testing.

Notes

Only works if the splits follow descending powers of two.
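One way to see why powers of two matter: if disjoint splits hold 1/2, 1/4, 1/8, ... of the data, then taking each split together with all smaller ones gives nested subsets of 100%, 50%, 25%, and so on. The sketch below shows that construction in plain Python. Whether biom3d builds its overlapping subsets this way is an assumption, and `nested_subsets` is a hypothetical name.

```python
def nested_subsets(disjoint_splits: list[list[str]]) -> list[list[str]]:
    """Turn disjoint splits (largest first) into nested subsets.

    Subset k is the union of splits k, k+1, ..., so every smaller
    subset is contained in all larger ones.
    """
    nested = []
    running: list[str] = []
    for split in reversed(disjoint_splits):
        running = split + running      # prepend the next larger split
        nested.append(list(running))
    nested.reverse()                   # largest subset first
    return nested


# Disjoint splits of sizes 4, 2, 1, 1 (1/2, 1/4, 1/8 and the rest of 8 samples).
disjoint = [["a", "b", "c", "d"], ["e", "f"], ["g"], ["h"]]
nested = nested_subsets(disjoint)
sizes = [len(subset) for subset in nested]
```

Here the nested subsets contain 8, 4, 2 and 1 samples, so each one is half the size of the previous one and fully included in it.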

biom3d.utils.fold.get_train_test_df(df: pandas.DataFrame, verbose: bool = True) → tuple[ndarray, ndarray]

Extract train and test sets from a DataFrame based on the ‘hold_out’ column.

Parameters:

df (pandas.DataFrame) – The dataset containing a ‘hold_out’ column with 0 (train) and 1 (test) labels.
verbose (bool, default=True) – If True, enables debug printing (currently unused).

Returns:

train_set (numpy.ndarray) – Array of training filenames (or sample IDs).
test_set (numpy.ndarray) – Array of test filenames (or sample IDs).
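The hold-out selection can be reproduced with a boolean mask on the ‘hold_out’ column. The following is a minimal sketch, not the library code: `train_test_from_df` is a hypothetical helper and the `filename` column is assumed to hold the sample names.

```python
import numpy as np
import pandas as pd


def train_test_from_df(df: pd.DataFrame) -> tuple[np.ndarray, np.ndarray]:
    """Return (train, test) name arrays based on 'hold_out' (0=train, 1=test)."""
    train_set = df.loc[df["hold_out"] == 0, "filename"].to_numpy()
    test_set = df.loc[df["hold_out"] == 1, "filename"].to_numpy()
    return train_set, test_set


df = pd.DataFrame({
    "filename": ["a.tif", "b.tif", "c.tif", "d.tif"],
    "hold_out": [0, 1, 0, 1],
})

train_set, test_set = train_test_from_df(df)
```

Both results are NumPy arrays, so they can be indexed with integer or boolean arrays, for example when drawing a random subset of the training names.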