Preprocessing
Dataset preparation to speed up training.
Steps:
Normalization
Expand dims and one_hot encoding
Saving to numpy or tif file
- class biom3d.preprocess.Preprocessing(img_path: str, img_outpath: str | None = None, msk_path: str | None = None, msk_outpath: str | None = None, fg_outpath: str | None = None, num_classes: int | None = None, use_one_hot: bool = False, remove_bg: bool = False, median_size: Iterable[int] = [], median_spacing: list[float] = [], clipping_bounds: list[float] = [], intensity_moments: list[float] = [], use_tif: bool = False, split_rate_for_single_img: float = 0.25, num_kfolds: int = 5, is_2d: bool = False)
Preprocessing pipeline for 2D or 3D medical segmentation datasets.
Handles preprocessing of medical images and masks including:
File conversion (e.g., NIfTI to NumPy or TIFF)
Z-score normalization and intensity clipping
Resampling to median voxel spacing
One-hot encoding of labels
Optional background removal
Optional splitting of single image datasets
K-Fold CSV generation
Usage: Instantiate this class with all required parameters, then call run() to start preprocessing.
- Variables:
img_path (str) – Path to the collection containing input images.
msk_path (str) – Path to the collection containing input masks (can be None).
handler (DataHandler) – DataHandler used to load and save images.
img_outpath (str) – Output path for processed images.
msk_outpath (str) – Output path for processed masks.
fg_outpath (str) – Output path for foreground masks (used in training).
num_classes (int) – Number of classes in the segmentation masks.
use_one_hot (bool) – Whether to one-hot encode the labels.
remove_bg (bool) – Whether to remove the background class in the label.
median_size (list) – Median shape of the dataset (used to detect channel axis).
median_spacing (list) – Median voxel spacing of the dataset.
clipping_bounds (list) – Intensity clipping bounds [p0.5, p99.5] for normalization.
intensity_moments (list) – Mean and std intensity values for normalization.
use_tif (bool) – If True, save outputs as .tif instead of .npy.
split_rate_for_single_img (float) – Portion used to split a single image into train/val.
num_kfolds (int) – Number of folds to use for cross-validation.
is_2d (bool) – Whether the input data is 2D instead of 3D.
num_channels (int) – Number of channels in the images (inferred from median size).
channel_axis (int) – Axis corresponding to channel dimension.
img_len (int) – Total number of images (i.e., the size of the dataset).
csv_path (str) – Path to the CSV file used for K-Fold or holdout splitting.
- __init__(img_path: str, img_outpath: str | None = None, msk_path: str | None = None, msk_outpath: str | None = None, fg_outpath: str | None = None, num_classes: int | None = None, use_one_hot: bool = False, remove_bg: bool = False, median_size: Iterable[int] = [], median_spacing: list[float] = [], clipping_bounds: list[float] = [], intensity_moments: list[float] = [], use_tif: bool = False, split_rate_for_single_img: float = 0.25, num_kfolds: int = 5, is_2d: bool = False)
Initialize the Preprocessing class.
- Parameters:
img_path (str) – Path to the collection containing input images.
img_outpath (str, optional) – Path to the collection to save the preprocessed images.
msk_path (str, optional) – Path to the collection containing input masks.
msk_outpath (str, optional) – Path to the collection to save the preprocessed masks.
fg_outpath (str, optional) – Path to the collection to save the foreground mask.
num_classes (int, optional) – Number of classes in the masks (including background).
use_one_hot (bool, default=False) – Whether to one-hot encode the mask labels.
remove_bg (bool, default=False) – Whether to remove the background class in the masks.
median_size (list of int, optional) – Median shape of the dataset (used to infer channel axis).
median_spacing (list of float, optional) – Median voxel spacing of the dataset.
clipping_bounds (list of float, optional) – Intensity clipping bounds [p0.5, p99.5] for normalization.
intensity_moments (list of float, optional) – Mean and std intensity values for normalization.
use_tif (bool, default=False) – If True, save preprocessed outputs as TIFF instead of NumPy.
split_rate_for_single_img (float, default=0.25) – Split ratio for single image dataset (used for train/val split).
num_kfolds (int, default=5) – Number of folds to generate for K-Fold validation.
is_2d (bool, default=False) – Whether the dataset is 2D (True) or 3D (False).
- run(debug: bool = False) -> None
Execute the full preprocessing pipeline.
This method processes all images and masks in the dataset by:
Resampling to the target spacing
Intensity clipping
Z-score normalization
Optional one-hot encoding of masks
Saving preprocessed data to disk
Creating K-fold CSV split file
- Parameters:
debug (bool, default=False) – If True, prints filenames during preprocessing instead of showing a tqdm progress bar.
- Return type:
None
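The two intensity steps listed above (clipping, then z-score normalization) can be sketched in NumPy as follows. This is a minimal illustration and not biom3d's implementation: the helper name clip_and_zscore is hypothetical, and the small epsilon that guards against a zero standard deviation is an assumption.

```python
import numpy as np

def clip_and_zscore(img, clipping_bounds=(), intensity_moments=()):
    """Hypothetical helper: clip intensities, then apply z-score normalization."""
    img = np.asarray(img, dtype=np.float32)
    # Clip to [low, high] when bounds are given (e.g. the 0.5 and 99.5 percentiles).
    if len(clipping_bounds) == 2:
        img = np.clip(img, clipping_bounds[0], clipping_bounds[1])
    # Use dataset-level moments when given, otherwise per-image statistics.
    if len(intensity_moments) == 2:
        mean, std = intensity_moments
    else:
        mean, std = img.mean(), img.std()
    # The epsilon (an assumption) avoids division by zero on constant images.
    return (img - mean) / (std + 1e-8)

volume = np.random.default_rng(0).normal(100.0, 20.0, size=(4, 8, 8))
bounds = np.percentile(volume, [0.5, 99.5])
normalized = clip_and_zscore(volume, clipping_bounds=bounds)
```

With empty clipping_bounds the clipping step is skipped, and with empty intensity_moments the statistics come from the image itself, which mirrors the "if empty" behavior described for the corresponding parameters of seg_preprocessor.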
- biom3d.preprocess.auto_config_preprocess(img_path: str, msk_path: str, num_classes: int, config_dir: str, base_config: str, img_outpath: str | None = None, msk_outpath: str | None = None, use_one_hot: bool = False, ct_norm: bool = False, remove_bg: bool = False, use_tif: bool = False, desc: str = 'unet', max_dim: int = 128, num_epochs: int = 1000, num_workers: int = 6, skip_preprocessing: bool = False, no_auto_config: bool = False, logs_dir: str = 'logs/', print_param: bool = False, debug: bool = False, is_2d: bool = False)
Preprocess medical segmentation data and auto-generate a training configuration.
This helper function performs the following steps:
Computes dataset fingerprint (median shape, spacing, intensity stats).
Runs the preprocessing pipeline on the data (resampling, normalization, one-hot encoding, etc.).
Automatically determines optimal model parameters such as patch size and batch size.
Saves the configuration to a Python file for training use.
It supports both 2D and 3D datasets, optional background removal, and normalization tailored for CT images.
- Parameters:
img_path (str) – Path to the collection containing raw input images.
msk_path (str) – Path to the collection containing corresponding segmentation masks.
num_classes (int) – Number of segmentation classes (excluding background).
config_dir (str) – Directory where the auto-generated configuration file will be saved.
base_config (str) – Path to the base configuration template (Python file).
img_outpath (str, optional) – Output path for preprocessed images.
msk_outpath (str, optional) – Output path for preprocessed masks.
use_one_hot (bool, default=False) – Whether to convert the segmentation masks to one-hot encoded format.
ct_norm (bool, default=False) – If True, compute normalization statistics and intensity clipping based only on regions inside the masks.
remove_bg (bool, default=False) – Whether to exclude the background class during training (useful with sigmoid output).
use_tif (bool, default=False) – If True, save the processed files in .tif format instead of .npy.
desc (str, default="unet") – Descriptor string saved in the config to identify the model/config.
max_dim (int, default=128) – Maximum spatial size (in voxels) allowed for the input patch during training.
num_epochs (int, default=1000) – Number of training epochs to set in the config file.
num_workers (int, default=6) – Number of workers used for data loading during training.
skip_preprocessing (bool, default=False) – If True, skip the preprocessing step and only generate the config.
no_auto_config (bool, default=False) – If True, skip the config generation and only run preprocessing.
logs_dir (str, default='logs/') – Directory path to store logs (saved in the config).
print_param (bool, default=False) – If True, print computed auto-config parameters to stdout.
debug (bool, default=False) – If True, run preprocessing with verbose logging and no progress bar.
is_2d (bool, default=False) – Whether the dataset is 2D instead of 3D.
- biom3d.preprocess.correct_mask(mask: numpy.ndarray, num_classes: int, is_2d: bool = False, standardize_dims: bool = True, output_dtype: numpy.dtype = numpy.uint16, use_one_hot: bool = False, remove_bg: bool = False, encoding_type: Literal['auto', 'label', 'binary', 'onehot'] = 'auto', auto_correct: bool = True, binary_correction_strategy: Literal['majority_is_bg'] = 'majority_is_bg')
Perform a sanity check and automatic correction on a segmentation mask.
This function ensures consistency and correctness of segmentation masks, handling binary, label, or one-hot encodings. It can also automatically correct common labeling issues.
This function is designed to be highly automated to reduce user friction. It makes assumptions about the data and prints warnings about any corrections it performs. Expert users can override the automatic behavior.
- Parameters:
mask (numpy.ndarray) – The input segmentation mask.
Shape for 3D: (D, H, W) for label masks, or (C, D, H, W) for binary/one-hot masks.
Shape for 2D (if is_2d=True): (H, W) or (C, H, W).
num_classes (int) – Number of target classes. Must be ≥ 2.
is_2d (bool, default=False) – Whether the input is 2D (vs 3D). If True, expects (H, W) for label masks or (C, H, W) for binary/one-hot masks. Otherwise expects 3D data: (D, H, W) or (C, D, H, W).
standardize_dims (bool, default=True) – If True, ensures output is 4D. If False, retains input dimensionality.
output_dtype (numpy.dtype, default=numpy.uint16) – Desired data type for the output mask.
use_one_hot (bool, default=False) – If True and encoding_type='label', converts the label mask to one-hot encoding.
remove_bg (bool, default=False) – If use_one_hot=True, removes the background channel (assumed to be index 0).
encoding_type ({'auto', 'label', 'binary', 'onehot'}, default='auto') – Encoding of the input mask:
'auto': (Default) Automatically determine the type based on mask.ndim. 3D is assumed 'label', 4D is assumed 'binary'.
'label': A single-channel mask where pixel values are class indices (0, 1, 2…).
'binary': A multi-channel mask where each channel is an independent binary (0/1) segmentation. Used with sigmoid activations.
'onehot': A multi-channel mask where channels are mutually exclusive. Used with softmax activations.
auto_correct (bool, default=True) – Whether to attempt automatic correction of invalid masks.
binary_correction_strategy ({'majority_is_bg'}, default='majority_is_bg') – Heuristic to fix binary masks with unexpected values:
'majority_is_bg': Treat the most common value as background (0), others as foreground (1).
- Raises:
RuntimeError – If mask validation fails and cannot be corrected.
ValueError – If num_classes is invalid (not an int or <2) or mask shape is incompatible with 2/3D.
- Returns:
Corrected and standardized segmentation mask.
- Return type:
numpy.ndarray
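To make the use_one_hot and remove_bg options concrete, here is a minimal NumPy sketch of converting a 'label' mask into a one-hot mask and dropping the background channel. The helper name one_hot_sketch is hypothetical, and the code only illustrates the encoding, not the validation or correction logic of correct_mask.

```python
import numpy as np

def one_hot_sketch(label_mask, num_classes, remove_bg=False, dtype=np.uint16):
    """Hypothetical helper: convert a (D, H, W) label mask to (C, D, H, W)."""
    # Compare every voxel against each class index; channel c is 1 where label == c.
    classes = np.arange(num_classes).reshape(-1, 1, 1, 1)
    encoded = (label_mask[np.newaxis] == classes).astype(dtype)
    # Optionally drop channel 0, assumed to be the background.
    return encoded[1:] if remove_bg else encoded

labels = np.array([[[0, 1], [2, 1]]])                          # shape (1, 2, 2)
full = one_hot_sketch(labels, num_classes=3)                   # shape (3, 1, 2, 2)
no_bg = one_hot_sketch(labels, num_classes=3, remove_bg=True)  # shape (2, 1, 2, 2)
```

In the full encoding exactly one channel is set per voxel, which is the mutual exclusivity that the 'onehot' encoding type refers to.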
- biom3d.preprocess.generate_kfold_csv(filenames: list[str], csv_path: str, hold_out_rate: float = 0.0, kfold: int = 5, seed: int = 42) -> None
Generate a CSV file that maps image filenames to cross-validation folds and hold-out flags.
From a list of filenames create a CSV containing three columns:
‘filename’: image filename,
‘hold_out’: 1 for test/hold-out set, 0 otherwise,
‘fold’: fold index (0 to kfold - 1).
- Parameters:
filenames (list of str) – List of image filenames, relative to a dataset root.
csv_path (str) – Path to the output CSV file.
hold_out_rate (float, default=0.0) – Proportion of samples to assign to the hold-out (test) set.
kfold (int, default=5) – Number of folds for stratified k-fold cross-validation.
seed (int, default=42) – Random seed for reproducibility.
- Return type:
None
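The three-column layout described above can be reproduced with the standard library alone. The sketch below is a simplified stand-in, not the biom3d implementation: kfold_rows is a hypothetical helper, folds are assigned round-robin rather than by stratification, and the CSV is written to an in-memory buffer instead of csv_path.

```python
import csv
import io
import random

def kfold_rows(filenames, hold_out_rate=0.0, kfold=5, seed=42):
    """Hypothetical helper: build (filename, hold_out, fold) rows."""
    rng = random.Random(seed)
    shuffled = list(filenames)
    rng.shuffle(shuffled)
    n_hold_out = int(len(shuffled) * hold_out_rate)
    rows = []
    for i, name in enumerate(shuffled):
        hold_out = 1 if i < n_hold_out else 0
        # Round-robin assignment keeps fold sizes within one sample of each other.
        rows.append({"filename": name, "hold_out": hold_out, "fold": i % kfold})
    return sorted(rows, key=lambda r: r["filename"])

names = [f"img_{i:02d}.tif" for i in range(10)]
rows = kfold_rows(names, hold_out_rate=0.2, kfold=5)

# Written to an in-memory buffer here; a real call would open csv_path instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["filename", "hold_out", "fold"])
writer.writeheader()
writer.writerows(rows)
```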
- biom3d.preprocess.get_resample_shape(input_shape: tuple[int] | list[int] | ndarray, spacing: list[float], median_spacing: list[float]) -> ndarray
Compute the new shape of a volume after resampling based on spacing information.
- Parameters:
input_shape (tuple, list or numpy.ndarray of int) – Shape of the input volume. Can be (C, D, H, W) or (D, H, W).
spacing (list of float) – Original voxel spacing for each axis.
median_spacing (list of float) – Target voxel spacing for each axis.
- Returns:
New shape after resampling, as integers (Dx, Dy, Dz).
- Return type:
numpy.ndarray
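As a rough illustration, resampling to a new spacing conventionally scales each spatial dimension by the ratio of original to target spacing. The sketch below assumes that conventional formula and rounding to the nearest integer; the source does not state the exact rule used by get_resample_shape, and resample_shape_sketch is a hypothetical name.

```python
def resample_shape_sketch(input_shape, spacing, median_spacing):
    """Hypothetical helper: scale spatial dims by spacing / median_spacing."""
    # Keep only the trailing spatial axes, so (C, D, H, W) and (D, H, W) both work.
    spatial = list(input_shape)[-len(spacing):]
    return [
        int(round(size * old / new))
        for size, old, new in zip(spatial, spacing, median_spacing)
    ]

# Halving the spacing along the first spatial axis doubles the number of slices.
new_shape = resample_shape_sketch((1, 40, 256, 256), [2.5, 0.8, 0.8], [1.25, 0.8, 0.8])
# new_shape == [80, 256, 256]
```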
- biom3d.preprocess.hold_out(df: pandas.DataFrame, ratio: float = 0.1, seed: int = 42) -> pandas.DataFrame
Randomly select a subset of elements from the first column of the DataFrame.
This function adds a binary column ‘hold_out’ to df, marking randomly selected elements with 1 and the rest with 0, based on the specified ratio.
The size of the held-out set is len(set) * ratio, where set denotes the elements of the first column.
- Parameters:
df (pandas.DataFrame) – Input DataFrame with at least one column. Selection is based on the first column.
ratio (float, default=0.1) – Proportion of elements to mark as held out.
seed (int, default=42) – Random seed for reproducibility.
- Returns:
df – DataFrame with an added ‘hold_out’ column.
- Return type:
pandas.DataFrame
- biom3d.preprocess.resize_img_msk(img: ndarray, output_shape: tuple[int] | list[int] | ndarray, msk: ndarray | None = None) -> ndarray | tuple[ndarray, ndarray]
Resize a 3D image and optionally its mask.
- Parameters:
img (numpy.ndarray) – Input 3D image array to resize.
output_shape (tuple, list or numpy.ndarray of int) – Desired output shape (Dx, Dy, Dz).
msk (numpy.ndarray, optional) – Corresponding mask to resize.
- Returns:
new_img (numpy.ndarray) – The resized image.
new_msk (numpy.ndarray, optional) – The resized mask, if msk is provided.
- biom3d.preprocess.sanity_check(msk: ndarray, num_classes: int | None = None) -> ndarray
Sanity check for segmentation masks.
Verifies if the mask contains valid class labels and attempts to fix common issues automatically.
- Parameters:
msk (numpy.ndarray) – Segmentation mask. Can be 3D or 4D (if one-hot encoded).
num_classes (int, optional) – Expected number of classes. If not provided, inferred from unique values.
- Raises:
RuntimeError – If automatic correction is not possible due to ambiguous label values.
AssertionError – If num_classes is invalid (not an int or < 2).
- Returns:
Validated and possibly corrected segmentation mask.
- Return type:
numpy.ndarray
- biom3d.preprocess.seg_preprocessor(img: ndarray, img_meta: dict[str, Any], num_classes: int, msk: ndarray = None, use_one_hot: bool = False, remove_bg: bool = False, median_spacing: list[float] | ndarray = [], clipping_bounds: list[float] | tuple[float, float] = [], intensity_moments: list[float] | tuple[float, float] = [], channel_axis: int = 0, num_channels: int = 1, seed: int = 42, is_2d: bool = False) -> tuple[ndarray, ndarray, dict[int, list[int]]] | tuple[ndarray, dict[str, Any]]
Perform a full preprocessing pipeline for segmentation images and masks.
This function orchestrates a series of steps:
Standardizes image and mask dimensions.
Validates and corrects the mask using robust heuristics.
Optionally one-hot encodes the mask.
Applies intensity transformations (clipping, normalization).
Resamples the data to a target spacing.
Computes foreground coordinates for patch sampling.
- Parameters:
img (numpy.ndarray) – The input image array. Can be 2D or 3D, with or without channel dimension.
img_meta (dict of str to any) – Dictionary containing image metadata, including the spacing field.
num_classes (int) – Number of segmentation classes. Required if msk is provided.
msk (numpy.ndarray, optional) – Segmentation mask corresponding to the image. Can be 2D or 3D.
use_one_hot (bool, default=False) – If True, the mask will be converted to one-hot encoding.
remove_bg (bool, default=False) – If True and use_one_hot is True, the background channel (0) is removed.
median_spacing (list or numpy.ndarray of float, optional) – Target spacing for resampling. If empty, resampling is skipped.
clipping_bounds (list or tuple of float, optional) – Tuple (min, max) to clip intensity values. If empty, no clipping is applied.
intensity_moments (list or tuple of float, optional) – Tuple (mean, std) for intensity normalization. If empty, stats are computed from image.
channel_axis (int, default=0) – Index of the channel axis in the input image.
num_channels (int, default=1) – Expected number of image channels after standardization.
seed (int, default=42) – Random seed for reproducibility in foreground sampling.
is_2d (bool, default=False) – If True, assumes the image and mask are 2D rather than 3D.
- Raises:
RuntimeError – If the mask format is invalid and cannot be corrected.
ValueError – If input dimensions are inconsistent with expected format.
- Returns:
If msk is provided, returns (img, msk, fg):
img (numpy.ndarray) – Preprocessed image.
msk (numpy.ndarray) – Preprocessed segmentation mask.
fg (dict) – Mapping from class index to an array of sampled voxel coordinates.
If msk is None, returns (img, img_meta):
img (numpy.ndarray) – Preprocessed image.
img_meta (dict) – Original metadata, with an added original_shape entry.
Notes
Foreground sampling is capped at 10,000 voxels per class.
Designed for use in biology and medical image segmentation pipelines.
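The foreground sampling mentioned in the notes can be pictured as collecting the coordinates of the non-zero voxels of each mask channel and keeping a random subset when there are too many. The following is a minimal NumPy sketch under that assumption; sample_foreground is a hypothetical helper, and the exact sampling strategy and return format of seg_preprocessor may differ.

```python
import numpy as np

def sample_foreground(one_hot_msk, max_per_class=10_000, seed=42):
    """Hypothetical helper: sample up to max_per_class voxel coordinates per channel."""
    rng = np.random.default_rng(seed)
    fg = {}
    for c in range(one_hot_msk.shape[0]):
        coords = np.argwhere(one_hot_msk[c] > 0)  # shape (n_voxels, 3)
        if len(coords) > max_per_class:
            # Keep a random subset without replacement to bound memory use.
            keep = rng.choice(len(coords), size=max_per_class, replace=False)
            coords = coords[keep]
        fg[c] = coords
    return fg

msk = np.zeros((2, 4, 30, 30), dtype=np.uint16)
msk[0, :, :10, :] = 1   # 4 * 10 * 30 = 1200 foreground voxels
msk[1, 0, 0, :5] = 1    # 5 foreground voxels
fg = sample_foreground(msk, max_per_class=1000)
```

Capping the number of stored coordinates keeps the saved foreground file small while still giving the patch sampler enough locations to draw from.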
- biom3d.preprocess.standardize_img_dims(img: ndarray, num_channels: int, channel_axis: int, is_2d: bool) -> ndarray
Standardizes an image to 4D format: (C, D, H, W) for 3D, or (C, 1, H, W) for 2D.
This function ensures compatibility with the rest of the pipeline. If channel_axis is inconsistent with num_channels and exactly one axis has size num_channels, the function corrects the channel axis automatically. E.g., (8, 32, 32, 1) with channel_axis=0 and num_channels=1 becomes (1, 8, 32, 32).
- Parameters:
img (numpy.ndarray) – Input image array. Expected shape:
2D image: (H, W) or (C, H, W)
3D image: (D, H, W) or (C, D, H, W)
num_channels (int) – Expected number of channels after formatting.
channel_axis (int) – Axis where the channel is located in the input image (before standardization). Only used if img has 4 dimensions.
is_2d (bool) – Whether the input is 2D (vs 3D).
- Raises:
ValueError – If the input shape is incompatible with 2/3D or if the number of channels does not match.
- Returns:
img (numpy.ndarray) – Standardized image with shape:
2D mode: (C, 1, H, W)
3D mode: (C, D, H, W)
original_shape (tuple) – Original shape of the input image.
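The shape conventions above can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions: to_4d_sketch is a hypothetical helper, it trusts channel_axis as given, and it omits both the num_channels check and the automatic axis correction performed by standardize_img_dims.

```python
import numpy as np

def to_4d_sketch(img, channel_axis=0, is_2d=False):
    """Hypothetical helper: return (C, D, H, W), or (C, 1, H, W) in 2D mode."""
    spatial_ndim = 2 if is_2d else 3
    if img.ndim == spatial_ndim:
        img = img[np.newaxis]                      # add a channel axis: C = 1
    elif img.ndim == spatial_ndim + 1:
        img = np.moveaxis(img, channel_axis, 0)    # bring channels to the front
    else:
        raise ValueError(f"unexpected shape {img.shape} for is_2d={is_2d}")
    if is_2d:
        img = img[:, np.newaxis]                   # insert a singleton depth axis
    return img

vol = to_4d_sketch(np.zeros((8, 32, 32)))                               # (1, 8, 32, 32)
rgb = to_4d_sketch(np.zeros((32, 32, 3)), channel_axis=-1, is_2d=True)  # (3, 1, 32, 32)
```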
- biom3d.preprocess.strat_kfold(df: pandas.DataFrame, k: int = 4, seed: int = 43) -> pandas.DataFrame
Assign each row of the DataFrame to one of k stratified folds.
Stratification is done to maintain balance between hold-out and non-hold-out samples across the folds. Requires a ‘hold_out’ column previously added with hold_out().
- Parameters:
df (pandas.DataFrame) – Input DataFrame with a ‘hold_out’ column (0 or 1).
k (int, default=4) – Number of folds.
seed (int, default=43) – Random seed for reproducibility.
- Returns:
df – DataFrame with an added ‘fold’ column containing fold assignments (0 to k-1).
- Return type:
pandas.DataFrame
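One simple way to obtain folds that are balanced with respect to the hold_out flag is to shuffle each group separately and deal its members out round-robin. The sketch below uses that approach on plain Python lists; it is an assumption about how such stratification can be done, not a description of strat_kfold's internals, and stratified_folds is a hypothetical name.

```python
import random
from collections import Counter

def stratified_folds(hold_out_flags, k=4, seed=43):
    """Hypothetical helper: return one fold index per sample, balanced per stratum."""
    rng = random.Random(seed)
    folds = [None] * len(hold_out_flags)
    for flag in sorted(set(hold_out_flags)):
        # Shuffle the samples of this stratum, then deal them out round-robin.
        indices = [i for i, f in enumerate(hold_out_flags) if f == flag]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[index] = position % k
    return folds

flags = [1, 1, 1, 1] + [0] * 8
folds = stratified_folds(flags, k=4)
per_fold = Counter((fold, flag) for fold, flag in zip(folds, flags))
```

With 4 held-out and 8 regular samples split into 4 folds, every fold receives one held-out sample and two regular ones.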