DataHandler

DataHandler is the class used to load, save and iterate over images, masks and foregrounds.

Instantiating a DataHandler with the DataHandlerFactory

If you want to use a DataHandler, we strongly recommend using our factory.

class biom3d.utils.DataHandlerFactory[source]

Class to instantiate a DataHandler depending on the input and output type.

static get(input: str, read_only: bool = False, preprocess: bool = False, output: str | None = None, **kwargs) DataHandler[source]

Create a handler whose type depends on the input extension.

Parameters:
  • input (str) – Path to input (Folder path, archive path, url,…). This path will be used as the image path.

  • read_only (bool, default = False) – (Optional) Whether the handler is read-only.

  • output (str, default = None) – (Optional) Path to output; used if the output type is different from the input type.

  • preprocess (bool, default = False) – (Optional) Whether it is a preprocessing handler (which creates more outputs).

  • **kwargs

    All existing parameters to existing handlers, currently:

    msk_path:str, default=None

    Generic : mask input path

    fg_path:str, default = None

    Generic : foreground input path

    eval: "label" | "pred" | None, default=None

    HDF5Handler (and all others that use keys) : Tells the handler that it is used for evaluation and that it should search for the label or prediction key among the dataset keys.

    img_inner_paths_list, default=None

    Generic : A list of paths coming from a specific root (eg: the paths inside a .h5 file), used in data/batch loaders.

    msk_inner_paths_list, default=None

    Generic : A list of paths coming from a specific root (eg: the paths inside a .h5 file), used in data/batch loaders.

    fg_inner_paths_list, default=None

    Generic : A list of paths coming from a specific root (eg: the paths inside a .h5 file), used in data/batch loaders.

    img_outpath:str, default = None,

    Generic : images output path

    msk_outpath:str, default = None

    Generic : mask output path

    fg_outpath:str, default = None

    Generic : foreground output path

    model_name:str, default = None

    Generic : Used for prediction; if not None, it is appended to the end of the path (eg: predictions/MyModelName, predictions.h5["MyModelName"])

    use_tif:bool, default = False

    FileHandler : Whether outputs should be saved as tif instead of npy.

Raises:

ValueError: – If parameters read_only and preprocess are both True.

Returns:

A DataHandler specific to input and output type

Return type:

DataHandler
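As a rough usage sketch (not an official recipe), the helper below opens a read-only handler through the factory, loads every sample and closes the handler. It only relies on the calls documented on this page (`get`, iteration, `load`, `close`); the helper name `load_dataset` and the paths in the final comment are hypothetical.

```python
def load_dataset(factory, img_path, msk_path=None):
    """Open a read-only handler through the factory and load every sample.

    `factory` is expected to expose the documented static method
    `get(input, read_only=..., **kwargs)`.
    """
    handler = factory.get(img_path, read_only=True, msk_path=msk_path)
    samples = []
    try:
        # Iterating a handler yields (img_path, msk_path, fg_path) tuples.
        for img_p, msk_p, _fg_p in handler:
            img, meta = handler.load(img_p)
            msk = handler.load(msk_p)[0] if msk_p is not None else None
            samples.append((img, msk, meta))
    finally:
        # close() should be called once the handler is no longer used.
        handler.close()
    return samples


# Typical call, assuming biom3d is installed (paths are placeholders):
# from biom3d.utils import DataHandlerFactory
# samples = load_dataset(DataHandlerFactory, "data/raw", msk_path="data/masks")
```

Passing the factory as an argument keeps the helper independent of the concrete handler type, which is the point of going through `DataHandlerFactory.get`.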

The DataHandler contract

Here we describe the abstract class DataHandler, so that you can use it or implement a new one.

Public attributes

DataHandler.images: list

A list of image paths.

DataHandler.masks: list | None

A list of mask paths.

DataHandler.fg: list | None

A list of foreground paths.

DataHandler.msk_outpath: str | None

A path to the masks output (preprocessed masks or predictions).

Private attributes

DataHandler._images_path_root: str

The root of the image paths (eg: the path to the .h5 file, the folder where all images are,…).

DataHandler._masks_path_root: str | None

The root of the mask paths (eg: the path to the .h5 file, the folder where all masks are,…).

DataHandler._fg_path_root: str | None

The root of the foreground paths (eg: the path to the .h5 file, the folder where all foregrounds are,…).

DataHandler._image_index: int

The current index in images, masks and fg (all at the same time). It is equal to _iterator - 1.

DataHandler._iterator: int

Used to implement iterator.

DataHandler._size: int

Used to implement len, is defined by len(images).

DataHandler._saver: type[DataHandler] | None

DataHandler used to save, can be another DataHandler for different output format, self or None (read_only).

Public methods

abstractmethod DataHandler.open(**kwargs)[source]

Used to open other inputs. It is essentially _input_parse(), but it should first close the current input (if relevant).

Has a default implementation.

Parameters:

**kwargs (Same as _input_parse())

Raises:

PermissionError, ConnectionError, HttpError, TimeoutError, ... – Other exceptions related to the format may be raised; this is not an exhaustive list. We recommend not trying to catch these specifically in generic code.

abstractmethod DataHandler.close()[source]

Close connections, open files,…

Should always be called once the handler is no longer used. By default, this function is called on object destruction. It should not raise errors.

Note

Be careful to also close the _saver, and to avoid a RecursionError in case self._saver = self.
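The note above can be sketched as follows. This is a stand-in class written for illustration, not biom3d code: the `_closed` flag and the `close_calls` counter are assumptions added to make the behaviour visible. The identity check on `_saver` is what prevents the infinite recursion when a handler saves through itself.

```python
class HandlerCloseSketch:
    """Stand-in showing only the close() logic; not a biom3d class."""

    def __init__(self, saver=None, own_saver=False):
        # own_saver=True mimics the documented case self._saver = self.
        self._saver = self if own_saver else saver
        self._closed = False
        self.close_calls = 0  # how many times resources were actually released

    def close(self):
        if self._closed:
            return  # already closed: makes close() safe to call twice
        self._closed = True
        self.close_calls += 1  # a real handler would release files/connections here
        saver = self._saver
        # Only forward to the saver when it is a *different* object,
        # otherwise close() would call itself.
        if saver is not None and saver is not self:
            try:
                saver.close()
            except Exception:
                pass  # close() is documented as non-raising
```

With this guard, a handler that is its own saver releases its resources exactly once, and a handler with a separate saver closes both.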

abstractmethod DataHandler.get_output() tuple[str, str, str][source]

Return a tuple of three elements: the paths of the image output, the mask output and the foreground output (eg: paths to folders, archives, URLs).

Has a default implementation.

Raises:

NotImplementedError: – If DataHandler has no _saver (is in read_only)

Returns:

  • img_outpath (str | None) – Path to images output collection.

  • msk_outpath (str) – Path to masks output collection.

  • fg_outpath (str | None) – Path to foregrounds output collection.

abstractmethod DataHandler.load(path: str) tuple[ndarray, dict][source]

Load a resource at the given path. It is not necessary to check whether the resource is in images, masks or foregrounds.

Note

It is assumed that all the inputs are in the same format (treatable with the same handler). It is also assumed that foregrounds are a blob.

Parameters:

path (str) – The path to the resource to load, generally given by the iterator (or self.images[i])

Example

>>> for img_path, msk_path, fg_path in handler:
...     img, metadata = handler.load(img_path)
...     msk, _ = handler.load(msk_path)
...     fg, _ = handler.load(fg_path)
Raises:

PermissionError, ConnectionError, HttpError, TimeoutError, ... – Other exceptions related to the format may be raised; this is not an exhaustive list. We recommend not trying to catch these specifically in generic code.

Returns:

  • img (numpy.ndarray) – The image as a numpy ndarray

  • metadata (dict) – Image metadata in a dictionary.

DataHandler.save(fname, img, out_type: OutputType | str, **kwargs) str[source]

Public interface of _save.

It does basic checks, then delegates to self._saver._save().

Parameters:
  • fname (str) – The path of the loaded resource, generally given by the iterator (or self.images[i]); it is used to determine the path to save to.

  • img (numpy.ndarray) – The image to save.

  • out_type (OutputType | "msk" | "pred" | "raw" | "fg") – Determines the output type; the save path is the output root for this type + fname.

  • **kwargs

    All existing parameters to existing handlers, currently

    overwrite: boolean, default=False

    HDF5Handler: Forces overwriting existing data. Is used only in preprocessing._split_single()

Raises:
  • ValueError : – If out_type is ‘img’ or ‘fg’ while not in preprocessing (so the corresponding output has not been initialized), or if the value does not exist in the enum (incorrect OutputType).

  • NotImplementedError : – If _saver is None (is in read_only)

  • Others – All error raised by _save()

Returns:

path – The path to the saved resource.

Return type:

str
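The checks described for save() can be sketched as below. This is an illustration under assumptions, not the library's implementation: `checked_save` is a hypothetical free function, `preprocess` is passed explicitly, and the enum is a local copy of the documented OutputType values. The string "raw" that appears in the parameter list is not among the documented enum values, so the sketch treats it like any other unknown value.

```python
from enum import Enum


class OutputType(Enum):
    """Local copy of the documented enum values, for illustration only."""
    IMG = "img"
    MSK = "msk"
    FG = "fg"
    PRED = "pred"


def checked_save(saver, preprocess, fname, img, out_type, **kwargs):
    """Run the documented checks, then delegate to saver._save()."""
    if saver is None:
        # Read-only handlers have no saver.
        raise NotImplementedError("Handler is read-only: no saver available.")
    try:
        # Accepts either an OutputType member or its string value.
        out_type = OutputType(out_type)
    except ValueError:
        raise ValueError(f"Unknown output type: {out_type!r}") from None
    if out_type in (OutputType.IMG, OutputType.FG) and not preprocess:
        # Image and foreground outputs are only initialized for preprocessing.
        raise ValueError(
            f"Output {out_type.value!r} is only available when preprocessing."
        )
    return saver._save(fname, img, out_type, **kwargs)
```

Normalizing `out_type` once at the public entry point means the private `_save()` of each implementation only has to deal with enum members.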

abstractmethod DataHandler.insert_prefix_to_name(fname: str, prefix: str)[source]

Insert a prefix to a name to create unique variation for the same name (it is used by Preprocess._split_single).

Example

>>> handler.insert_prefix_to_name('Raw/1.tif', '0_')
'Raw/0_1.tif'
Returns:

path – A new path including the prefix.

Return type:

str
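For a handler whose paths use forward slashes, one possible implementation looks like the sketch below. It is written as a free function for illustration; key-based handlers (such as .h5) may need a different rule, and this is not necessarily how the existing handlers do it.

```python
import posixpath


def insert_prefix_to_name(fname, prefix):
    """Insert `prefix` in front of the file name, keeping the parent path."""
    parent, name = posixpath.split(fname)
    return posixpath.join(parent, prefix + name)


print(insert_prefix_to_name("Raw/1.tif", "0_"))  # Raw/0_1.tif
```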

DataHandler.reset_iterator() None[source]

Reset the _iterator value to 0.

Private methods

abstractmethod DataHandler._input_parse(msk_path: str | None = None, fg_path: str | None = None, eval: Literal['label', 'pred'] | None = None, img_inner_paths_list: list | None = None, msk_inner_paths_list: list | None = None, fg_inner_paths_list: list | None = None, **kwargs) None[source]

Parse and initialize the inputs. If you want to open files, establish connections, etc., check whether this is compatible with multiprocessing and picklable, or the data/batch loaders will not work.

Parameters:
  • img_path (str) –

    Path to input images collection (folder, archive,…).

    Note

    It is not necessarily images, for example in eval(), we use a handler for predictions and another for ground truth.

  • msk_path (str, default=None) – Path to input masks collection (folder, archive,…).

  • fg_path (str, default=None) – Path to input foregrounds collection (folder, archive,…).

  • eval ("label" | "pred" | None, default=None) –

    Tells the handler that it is used for evaluation and that it should search for the mask or prediction key in the dataset.

    Note

    It is not used in FileHandler as it doesn’t use keys, however in .h5, it will make it load images from label or prediction key instead of image key.

  • img_inner_paths_list (list, default=None) – Paths to input images relative to image collection (paths in .h5 file, subfolders,…).

  • msk_inner_paths_list (list, default=None) – Paths to input masks relative to image collection (paths in .h5 file, subfolders,…).

  • fg_inner_paths_list (list, default=None) – Paths to input foregrounds relative to image collection (paths in .h5 file, subfolders,…).

  • **kwargs (Compatibility with other implementations.)

Raises:
  • ValueError – If parameters are incorrect (eg: non existing resource,…).

  • ConnectionError, HttpError, TimeoutError, ... – Other exceptions related to the format may be raised; this is not an exhaustive list. We recommend not trying to catch these specifically in generic code.

Return type:

None

abstractmethod DataHandler._output_parse(msk_outpath: str, model_name: str | None = None, **kwargs) None[source]

Parse and initialize the outputs.

Parameters:
  • msk_outpath (str, default=None) – Path to output masks collection (folder, archive,…), is created if not existing.

  • model_name (str, default=None) – Is used to create sub collection specific to model in predictions (eg : predictions/MyModelName) to avoid overwrite.

  • **kwargs (Compatibility with other implementations.)

Raises:

PermissionError, ConnectionError, HttpError, TimeoutError, ... – Other exceptions related to the format may be raised; this is not an exhaustive list. We recommend not trying to catch these specifically in generic code.

Return type:

None

abstractmethod DataHandler._output_parse_preprocess(img_path: str, msk_path: str | None = None, img_outpath: str | None = None, msk_outpath: str | None = None, fg_outpath: str | None = None, **kwargs) None[source]

Parse and initialize the outputs for preprocessing.

Parameters:
  • img_path (str) – Path to input images collection (folder, archive,…).

  • msk_path (str, default=None) – Path to input masks collection (folder, archive,…).

  • img_outpath (str, default=None) – Path to output images collection (folder, archive,…), is created if not existing.

  • msk_outpath (str, default=None) – Path to output masks collection (folder, archive,…).

  • fg_outpath (str, default=None) – Path to output foregrounds collection (folder, archive,…).

  • **kwargs (Compatibility with other implementations.)

Raises:

PermissionError, ConnectionError, HttpError, TimeoutError, ... – Other exceptions related to the format may be raised; this is not an exhaustive list. We recommend not trying to catch these specifically in generic code.

Return type:

None

abstractmethod DataHandler._save(fname: str, img: ndarray, out_type: OutputType | str, **kwargs) str[source]

Save an image, mask or foreground. To differentiate between the three, we use OutputType.

The resource will be saved in the output path corresponding to out_type, following the same inner path.

Example

'Raw/Dataset1/1.tif' will be saved in 'Raw_out/Dataset1/1.tif' if called with out_type = 'img'.
Parameters:
  • fname (str) – The path of the loaded resource, generally given by the iterator (or self.images[i]); it is used to determine the path to save to.

  • img (numpy.ndarray) – The image to save.

  • out_type (OutputType | "msk" | "pred" | "raw" | "fg") – Determines the output type; the save path is the output root for this type + fname.

  • **kwargs

    All existing parameters to existing handlers, currently

    overwrite: boolean, default=False

    HDF5Handler: Forces overwriting existing data. Is used only in preprocessing._split_single()

Raises:

PermissionError, ConnectionError, HttpError, TimeoutError, ... – Other exceptions related to the format may be raised; this is not an exhaustive list. We recommend not trying to catch these specifically in generic code.

Returns:

path – The path of the resource saved.

Return type:

str
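The path mapping from the example above (same inner path, different root) can be sketched with pathlib. The function name `output_path` and its explicit root arguments are hypothetical; a real handler would take the roots from its own attributes and then write the data.

```python
from pathlib import PurePosixPath


def output_path(fname, input_root, output_root):
    """Re-root `fname` from `input_root` to `output_root`, keeping the inner path."""
    # relative_to() raises ValueError if fname is not located under input_root.
    inner = PurePosixPath(fname).relative_to(input_root)
    return str(PurePosixPath(output_root) / inner)


print(output_path("Raw/Dataset1/1.tif", "Raw", "Raw_out"))  # Raw_out/Dataset1/1.tif
```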

Special methods

DataHandler.__init__()[source]

Set default values for the attributes; never call it outside a child class. All implementations shall call this one AND set default values for their specific attributes.

DataHandler.__iter__() None[source]

Return a new iterator (by calling reset_iterator).

DataHandler.__next__() tuple[str, str, str][source]

Increments _iterator and _image_index and returns a tuple of paths.

Raises:

StopIteration

Returns:

  • img_path (str) – The path to current image.

  • msk_path (str | None) – The path to current mask.

  • fg_path (str | None) – The path to current foreground

DataHandler.__len__() int[source]

Return the handler’s size, i.e. the number of images.

DataHandler.__del__() None[source]

Will try to call self.close() on destruction.
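To make the iteration contract concrete, here is a minimal stand-in that follows the documented attributes (`images`, `masks`, `fg`, `_iterator`, `_image_index`, `_size`). It is a sketch under assumptions, not the actual base class: the constructor signature is invented for the example, and `__iter__` returns `self` so that the object works in a `for` loop.

```python
class PathIteratorSketch:
    """Stand-in reproducing only the iteration part of the contract."""

    def __init__(self, images, masks=None, fg=None):
        self.images = list(images)
        self.masks = list(masks) if masks is not None else None
        self.fg = list(fg) if fg is not None else None
        self._size = len(self.images)  # __len__ is defined by len(images)
        self._iterator = 0
        self._image_index = -1  # kept equal to _iterator - 1

    def reset_iterator(self):
        self._iterator = 0
        self._image_index = -1

    def __iter__(self):
        self.reset_iterator()
        return self

    def __next__(self):
        if self._iterator >= self._size:
            raise StopIteration
        self._image_index = self._iterator
        self._iterator += 1
        i = self._image_index
        # Missing collections yield None in the returned tuple.
        msk = self.masks[i] if self.masks is not None else None
        fg = self.fg[i] if self.fg is not None else None
        return self.images[i], msk, fg

    def __len__(self):
        return self._size
```

Because `__iter__` resets the counter, the same handler can be looped over several times, for instance once per epoch.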

OutputType

class biom3d.utils.data_handler.data_handler_abstract.OutputType(value)[source]

Possible save type.

OutputType.IMG = 'img'

Saving an image.

OutputType.MSK = 'msk'

Saving a mask.

OutputType.FG = 'fg'

Saving a foreground.

OutputType.PRED = 'pred'

Saving a prediction.

Adding a new dataset format

To add a new format, only three things are required:

  • Create a new implementation of DataHandler.

    Note

    Redefine all abstract methods. If you think one of the other methods needs to be redefined, do it, as long as you respect the contract. You can use existing implementations as a base.

  • Add some code to the DataHandlerFactory to allow it to recognize your new implementation.

  • Document your format in docs/tuto/dataset.md, especially if your implementation needs a specific dataset structure.

Note

When testing, be sure to also test with a dataset of only 1 image, to check that preprocessing._split_image works well.
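The second step (teaching the factory about a new implementation) depends on how DataHandlerFactory is written, which is not shown here. The toy dispatcher below only illustrates the general idea of choosing a handler class from the input extension. Every name in it is hypothetical, including the `.zarr` example format and the fallback to a folder handler for paths without a known extension.

```python
import os


class FolderHandlerSketch:
    """Placeholder for a folder-based handler."""


class H5HandlerSketch:
    """Placeholder for a .h5-based handler."""


class ZarrHandlerSketch:
    """Placeholder for the newly added (hypothetical) format."""


# Adding a format amounts to adding one entry to this table.
_HANDLERS_BY_EXTENSION = {
    ".h5": H5HandlerSketch,
    ".zarr": ZarrHandlerSketch,
}


def pick_handler_class(input_path, read_only=False, preprocess=False):
    """Return the handler class matching the extension of `input_path`."""
    if read_only and preprocess:
        # Mirrors the documented ValueError of DataHandlerFactory.get.
        raise ValueError("read_only and preprocess cannot both be True.")
    ext = os.path.splitext(input_path.rstrip("/"))[1].lower()
    return _HANDLERS_BY_EXTENSION.get(ext, FolderHandlerSketch)
```

Whatever the real mechanism is, keeping the recognition logic in one place means the rest of the code keeps calling the factory and never names the new class directly.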

Adding a new image format

If you work on files with FileHandler and need a format other than Numpy, TIFF or NIfTI, you can easily implement it.

In the module biom3d.utils.data_handler.file_handler, there is a static class named ImageManager. This class implements the methods to read and save a single image as a file.

Two functions are of interest here:

class biom3d.utils.data_handler.file_handler.ImageManager[source]

Static class to treat different image format.

For the moment, the following formats are supported:

  • Numpy

  • NIfTI

  • TIFF

static adaptive_imread(img_path: str) tuple[ndarray, dict[str, Any]][source]

Load an image file.

Use skimage imread or sitk imread depending on the file extension:

  • .tif | .tiff → skimage.io.imread

  • .nii.gz → SimpleITK.imread

  • .npy → numpy.load

Parameters:

img_path (str) – Path to image file, must contain extension.

Returns:

  • img (numpy.ndarray) – The image contained in the file.

  • meta (dictionary from str to any) – The image metadata as a dict. Can be empty

static adaptive_imsave(img_path: str, img: ndarray, img_meta: dict[str, Any] = {}) None[source]

Save an image.

Use skimage or sitk depending on the file extension:

  • .tif | .tiff → ImageManager._tif_write_imagej

  • .nii.gz → ImageManager._sitk_imsave

  • .npy → numpy.save

Parameters:
  • img_path (str) – Path to the output file.

  • img (numpy.ndarray) – Image array.

  • img_meta (dictionary from str to any, default={}) – Image metadata.

Return type:

None

To implement a new image file format (for example png, because why not), you simply have to make those two functions able to handle the new format; everything else is then automatic.

Note

We strongly advise creating two separate private functions, one for reading and another one for saving, and calling them in the adaptive functions.