dctools.data.datasets.dataset_manager.MultiSourceDatasetManager

class dctools.data.datasets.dataset_manager.MultiSourceDatasetManager(dataset_processor, target_dimensions, time_tolerance=None, list_references=None, max_cache_files=100)

Manager for handling multiple data sources with common processing operations.

Parameters:
  • dataset_processor (oceanbench.core.distributed.DatasetProcessor)

  • target_dimensions (Dict[str, Tuple[float, float]])

  • time_tolerance (pandas.Timedelta | None)

  • list_references (list[str] | None)

  • max_cache_files (int)

__init__(dataset_processor, target_dimensions, time_tolerance=None, list_references=None, max_cache_files=100)

Initializes the multi-source manager.

Parameters:
  • dataset_processor (oceanbench.core.distributed.DatasetProcessor)

  • target_dimensions (Dict[str, Tuple[float, float]])

  • time_tolerance (pandas.Timedelta | None)

  • list_references (list[str] | None)

  • max_cache_files (int)

Methods

__init__(dataset_processor, target_dimensions[, ...])

Initializes the multi-source manager.

add_dataset(alias, dataset)

Adds a dataset to the manager with an alias.

add_to_file_cache(filepath)

Add a file to the file cache.

all_to_json(output_dir)

Exports information of all datasets in JSON format.

build_catalogs()

Builds catalogs for all datasets.

build_forecast_index(alias, init_date, ...)

Build forecast index for a dataset.

filter_all_by_date(start, end)

Filters all datasets managed by this class by time range.

filter_all_by_region(region)

Filters all datasets managed by this class by bounding box.

filter_all_by_variable(variables)

Filters all datasets managed by this class by variable.

filter_attrs(filters)

Filters datasets based on attributes.

filter_by_date(alias, start, end)

Filters the catalog by time range.

filter_by_region(alias, region)

Filters the catalog by bounding box.

filter_by_variable(alias, variables)

Filters the catalog by variable.

get_catalog(alias)

Returns the catalog of a dataset.

get_config()

Get configuration for all reference datasets.

get_data(alias, path)

Loads data from a dataset.

get_dataloader(pred_alias[, ref_aliases, ...])

Creates an EvaluationDataloader from dataset aliases.

get_keep_variables_dict()

Get dictionary of variables to keep for each dataset.

get_metadata_dict()

Get global metadata dictionary for all datasets.

get_transform(*args, **kwargs)

Factory function to create a transform based on the given name and parameters.

standardize_names(alias, coords_rename_dict, ...)

Standardizes variable names of a dataset based on a mapping dictionary.

to_json(alias[, path])

Exports dataset information in JSON format.

add_dataset(alias, dataset)

Adds a dataset to the manager with an alias.

Parameters:
  • alias (str) – Unique alias for the dataset.

  • dataset (BaseDataset) – Dataset instance.

add_to_file_cache(filepath)

Add a file to the file cache.

Parameters:

filepath (str)
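The constructor's max_cache_files argument suggests that the file cache is bounded, but this page does not say how entries are evicted. The following is a minimal sketch of one possible policy (oldest entry evicted first), not the manager's actual implementation; the BoundedFileCache class and its add method are hypothetical names used for illustration only.

```python
from collections import OrderedDict


class BoundedFileCache:
    """Hypothetical stand-in for a bounded file cache (oldest entry evicted first)."""

    def __init__(self, max_files):
        self.max_files = max_files
        self._files = OrderedDict()

    def add(self, filepath):
        """Register a file and return the list of paths evicted to make room."""
        evicted = []
        # Re-adding a known path moves it to the most recent position.
        self._files.pop(filepath, None)
        self._files[filepath] = True
        while len(self._files) > self.max_files:
            oldest, _ = self._files.popitem(last=False)
            evicted.append(oldest)
        return evicted


cache = BoundedFileCache(max_files=2)
cache.add("a.nc")
cache.add("b.nc")
evicted = cache.add("c.nc")  # the cache is full, so the oldest path is dropped
remaining = list(cache._files)
```

Returning the evicted paths lets the caller decide whether to delete the underlying files, which keeps the cache bookkeeping separate from any disk cleanup.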

all_to_json(output_dir)

Exports information of all datasets in JSON format.

Parameters:

output_dir (str) – Directory where to save the JSON files.

Raises:

ValueError – If the specified directory does not exist or is not accessible.
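As an illustration of the behaviour described above (one export per dataset, and a ValueError when the directory is missing), here is a minimal sketch. The export_all helper, the one-file-per-alias naming scheme, and the sample metadata are assumptions made for this example and are not taken from the library.

```python
import json
import os
import tempfile


def export_all(info_by_alias, output_dir):
    """Hypothetical helper: write one JSON file per alias into output_dir."""
    if not os.path.isdir(output_dir):
        raise ValueError(f"Directory does not exist: {output_dir}")
    paths = []
    for alias, info in info_by_alias.items():
        # Assumed naming scheme: <alias>.json
        path = os.path.join(output_dir, f"{alias}.json")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(info, fh, indent=2)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = export_all({"model_a": {"variables": ["ssh"]}}, tmp)
    # Read the file back to confirm the export round-trips.
    with open(written[0], encoding="utf-8") as fh:
        roundtrip = json.load(fh)
```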

build_catalogs()

Builds catalogs for all datasets.

build_forecast_index(alias, init_date, end_date, n_days_forecast, n_days_interval)

Build forecast index for a dataset.

Parameters:
  • alias (str)

  • init_date (str)

  • end_date (str)

  • n_days_forecast (int)

  • n_days_interval (int)
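The parameters are not described on this page, so the sketch below shows only one plausible reading: forecasts are initialized every n_days_interval days between init_date and end_date, and each one covers n_days_forecast lead days. The sketch_forecast_index function and the ISO date strings are assumptions for illustration; the index actually built by the manager may be structured differently.

```python
from datetime import date, timedelta


def sketch_forecast_index(init_date, end_date, n_days_forecast, n_days_interval):
    """Hypothetical stand-in: list (init_date, valid_date, lead_day) triples."""
    if n_days_interval <= 0:
        raise ValueError("n_days_interval must be positive")
    start = date.fromisoformat(init_date)
    stop = date.fromisoformat(end_date)
    entries = []
    current = start
    while current <= stop:
        # One entry per lead day of the forecast initialized on `current`.
        for lead in range(n_days_forecast):
            entries.append((current, current + timedelta(days=lead), lead))
        current += timedelta(days=n_days_interval)
    return entries


index = sketch_forecast_index(
    "2024-01-01", "2024-01-15", n_days_forecast=3, n_days_interval=7
)
```

With these inputs the sketch yields three initialization dates (1, 8 and 15 January), each with lead days 0 to 2.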

filter_all_by_date(start, end)

Filters all datasets managed by this class by time range.

Parameters:
  • start (datetime) – Start date(s).

  • end (datetime) – End date(s).

filter_all_by_region(region)

Filters all datasets managed by this class by bounding box.

Parameters:
  • region (Any) – Bounding box; the docstring describes it as a Tuple[float, float, float, float] of (lon_min, lat_min, lon_max, lat_max).
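To make the (lon_min, lat_min, lon_max, lat_max) ordering concrete, here is a minimal containment test. The in_bbox helper and the sample coordinates are hypothetical, and the sketch does not handle boxes that cross the antimeridian; how the manager itself interprets the region is not specified on this page.

```python
def in_bbox(lon, lat, bbox):
    """Return True if the point lies inside (lon_min, lat_min, lon_max, lat_max)."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


bbox = (-10.0, 35.0, 5.0, 50.0)
points = [(-3.0, 48.0), (12.0, 44.0), (0.0, 30.0)]  # (lon, lat) pairs
inside = [p for p in points if in_bbox(p[0], p[1], bbox)]
```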

filter_all_by_variable(variables)

Filters all datasets managed by this class by variable.

Parameters:

variables (List[str]) – List of variables to filter.

filter_attrs(filters)

Filters datasets based on attributes.

Parameters:

filters (dict[str, Callable[[Any], bool]]) – Dictionary of filters.

Return type:

None
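The filters argument maps an attribute name to a predicate that receives the attribute value. A minimal sketch of how such a dictionary could be evaluated is shown below; the matches helper, the attribute names, and the rule that every predicate must pass are assumptions for illustration, not the documented behaviour of filter_attrs.

```python
from typing import Any, Callable, Dict

# Hypothetical attribute records, one per dataset.
records = [
    {"alias": "model_a", "keep": True, "n_files": 120},
    {"alias": "obs_b", "keep": True, "n_files": 3},
    {"alias": "obs_c", "keep": False, "n_files": 50},
]

filters: Dict[str, Callable[[Any], bool]] = {
    "keep": lambda value: value is True,
    "n_files": lambda value: value >= 10,
}


def matches(attrs: Dict[str, Any], filters: Dict[str, Callable[[Any], bool]]) -> bool:
    """Assumed semantics: every named attribute must exist and satisfy its predicate."""
    return all(name in attrs and pred(attrs[name]) for name, pred in filters.items())


selected = [r["alias"] for r in records if matches(r, filters)]
```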

filter_by_date(alias, start, end)

Filters the catalog by time range.

Parameters:
  • alias (str) – Dataset alias.

  • start (datetime) – Start date.

  • end (datetime) – End date.
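A time-range filter over catalog entries can be sketched as an interval-overlap test. Whether the manager keeps entries that overlap the range or only those fully contained in it is not stated here, so the overlap rule, the overlaps helper, and the sample entries below are all assumptions.

```python
from datetime import datetime


def overlaps(entry_start, entry_end, start, end):
    """True if [entry_start, entry_end] intersects [start, end]."""
    return entry_start <= end and entry_end >= start


# Hypothetical catalog entries: (file name, first timestamp, last timestamp).
entries = [
    ("a.nc", datetime(2024, 1, 1), datetime(2024, 1, 31)),
    ("b.nc", datetime(2024, 2, 1), datetime(2024, 2, 29)),
    ("c.nc", datetime(2024, 3, 1), datetime(2024, 3, 31)),
]

start, end = datetime(2024, 1, 20), datetime(2024, 2, 10)
kept = [name for name, s, e in entries if overlaps(s, e, start, end)]
```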

filter_by_region(alias, region)

Filters the catalog by bounding box.

Parameters:
  • alias (str) – Dataset alias.

  • region (Any) – Bounding box; the docstring describes it as a Tuple[float, float, float, float] of (lon_min, lat_min, lon_max, lat_max).

filter_by_variable(alias, variables)

Filters the catalog by variable.

Parameters:
  • alias (str) – Dataset alias.

  • variables (List[str]) – List of variables to filter.

get_catalog(alias)

Returns the catalog of a dataset.

Parameters:

alias (str) – Dataset alias.

Returns:

The dataset catalog.

Return type:

DatasetCatalog

get_config()

Get configuration for all reference datasets.

get_data(alias, path)

Loads data from a dataset.

Parameters:
  • alias (str) – Dataset alias.

  • path (str) – File path.

Returns:

Loaded dataset.

Return type:

xr.Dataset

get_dataloader(pred_alias, ref_aliases=None, batch_size=8, obs_batch_size=None, gridded_batch_size=None, pred_transform=None, ref_transforms=None, forecast_mode=False, n_days_forecast=0, lead_time_unit='days')

Creates an EvaluationDataloader from dataset aliases.

Parameters:
  • pred_alias (str) – Alias of the prediction dataset.

  • ref_aliases (Optional[List[str]]) – Aliases of the reference datasets.

  • batch_size (int) – Batch size.

  • obs_batch_size (dict or int, optional) – Batch size for observation references. Can be a {ref_alias: int} dict to specify per-dataset sizes, or a single int applied to all observation references.

  • gridded_batch_size (dict or int, optional) – Batch size for gridded (non-observation) references such as GLORYS. Same formats as obs_batch_size.

  • pred_transform (Optional[CustomTransforms]) – Transform for predictions.

  • ref_transforms (Optional[List[CustomTransforms]]) – Transforms for references.

  • forecast_mode (bool) – Enable forecast mode.

  • n_days_forecast (int) – Number of forecast days to consider.

  • lead_time_unit (str) – Lead time unit ('days' or 'hours').

Returns:

Dataloader instance.

Return type:

EvaluationDataloader
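The obs_batch_size and gridded_batch_size arguments accept either a single int or a {ref_alias: int} dict. The sketch below shows one way such a specification could be resolved per reference alias. The resolve_batch_size helper, the sample aliases, and the fallback to the general batch_size when an alias is missing or the argument is None are assumptions, not documented behaviour.

```python
from typing import Dict, Optional, Union

BatchSpec = Optional[Union[int, Dict[str, int]]]


def resolve_batch_size(alias: str, spec: BatchSpec, default: int) -> int:
    """Hypothetical helper: pick the batch size for one reference alias."""
    if spec is None:
        return default  # assumed fallback to the general batch_size
    if isinstance(spec, dict):
        return spec.get(alias, default)  # per-dataset value, else the default
    return int(spec)  # a single int applies to every reference


per_alias = {"obs_a": 4}
sizes = {
    alias: resolve_batch_size(alias, per_alias, default=8)
    for alias in ("obs_a", "obs_b")
}
```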

get_keep_variables_dict()

Get dictionary of variables to keep for each dataset.

get_metadata_dict()

Get global metadata dictionary for all datasets.

get_transform(*args, **kwargs)

Factory function to create a transform based on the given name and parameters.

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

Any

standardize_names(alias, coords_rename_dict, vars_rename_dict)

Standardizes variable names of a dataset based on a mapping dictionary.

Parameters:
  • alias (str) – Alias of the dataset to standardize.

  • coords_rename_dict (Dict[str, str]) – Dictionary mapping old coordinate names to new names.

  • vars_rename_dict (Dict[str, str]) – Dictionary mapping old variable names to new names.

Return type:

None
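Both rename dictionaries map an existing name to its standardized replacement. The following sketch applies such mappings to plain lists of names; the rename_keys helper and the example names are hypothetical, and the real method operates on the dataset registered under the given alias rather than on lists.

```python
from typing import Dict, List


def rename_keys(names: List[str], rename_dict: Dict[str, str]) -> List[str]:
    """Replace each name found in rename_dict and leave the others unchanged."""
    return [rename_dict.get(name, name) for name in names]


coords_rename_dict = {"latitude": "lat", "longitude": "lon"}
vars_rename_dict = {"sea_surface_height": "ssh"}

new_coords = rename_keys(["time", "latitude", "longitude"], coords_rename_dict)
new_vars = rename_keys(["sea_surface_height", "salinity"], vars_rename_dict)
```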

to_json(alias, path=None)

Exports dataset information in JSON format.

Parameters:
  • alias (str) – Alias of the dataset to export.

  • path (Optional[str]) – Path to save the JSON file.

Returns:

JSON representation of the dataset information.

Return type:

str