dctools.data.datasets.dataset_manager.MultiSourceDatasetManager
- class dctools.data.datasets.dataset_manager.MultiSourceDatasetManager(dataset_processor, target_dimensions, time_tolerance=None, list_references=None, max_cache_files=100)
Manager for handling multiple data sources with common processing operations.
- Parameters:
dataset_processor (oceanbench.core.distributed.DatasetProcessor)
target_dimensions (Dict[str, Tuple[float, float]])
time_tolerance (pandas.Timedelta | None)
list_references (list[str] | None)
max_cache_files (int)
- __init__(dataset_processor, target_dimensions, time_tolerance=None, list_references=None, max_cache_files=100)
Initializes the multi-source manager.
- Parameters:
dataset_processor (oceanbench.core.distributed.DatasetProcessor)
target_dimensions (Dict[str, Tuple[float, float]])
time_tolerance (pandas.Timedelta | None)
list_references (list[str] | None)
max_cache_files (int)
Methods
__init__(dataset_processor, target_dimensions) – Initializes the multi-source manager.
add_dataset(alias, dataset) – Adds a dataset to the manager with an alias.
add_to_file_cache(filepath) – Add a file to the file cache.
all_to_json(output_dir) – Exports information of all datasets in JSON format.
build_catalogs() – Builds catalogs for all datasets.
build_forecast_index(alias, init_date, ...) – Build forecast index for a dataset.
filter_all_by_date(start, end) – Filters all datasets managed by this class by time range.
filter_all_by_region(region) – Filters all datasets managed by this class by bounding box.
filter_all_by_variable(variables) – Filters all datasets managed by this class by variable.
filter_attrs(filters) – Filters datasets based on attributes.
filter_by_date(alias, start, end) – Filters the catalog by time range.
filter_by_region(alias, region) – Filters the catalog by bounding box.
filter_by_variable(alias, variables) – Filters the catalog by variable.
get_catalog(alias) – Returns the catalog of a dataset.
get_config() – Get configuration for all reference datasets.
get_data(alias, path) – Loads data from a dataset.
get_dataloader(pred_alias[, ref_aliases, ...]) – Creates an EvaluationDataloader from dataset aliases.
get_keep_variables_dict() – Get dictionary of variables to keep for each dataset.
get_metadata_dict() – Get global metadata dictionary for all datasets.
get_transform(*args, **kwargs) – Factory function to create a transform based on the given name and parameters.
standardize_names(alias, coords_rename_dict, ...) – Standardizes variable names of a dataset based on a mapping dictionary.
to_json(alias[, path]) – Exports dataset information in JSON format.
- add_dataset(alias, dataset)
Adds a dataset to the manager with an alias.
- Parameters:
alias (str) – Unique alias for the dataset.
dataset (BaseDataset) – Dataset instance.
- add_to_file_cache(filepath)
Add a file to the file cache.
- Parameters:
filepath (str)
- all_to_json(output_dir)
Exports information of all datasets in JSON format.
- Parameters:
output_dir (str) – Directory where to save the JSON files.
- Raises:
ValueError – If the specified directory does not exist or is not accessible.
- build_catalogs()
Builds catalogs for all datasets.
- build_forecast_index(alias, init_date, end_date, n_days_forecast, n_days_interval)
Build forecast index for a dataset.
- Parameters:
alias (str)
init_date (str)
end_date (str)
n_days_forecast (int)
n_days_interval (int)
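The parameter names suggest a series of forecast windows: initialization dates spaced n_days_interval days apart between init_date and end_date, each covering n_days_forecast days. The sketch below illustrates that reading with the standard library only. The helper name forecast_windows, the ISO date strings, and the rule that a window must end on or before end_date are all assumptions, not the library's implementation.

```python
from datetime import date, timedelta


def forecast_windows(init_date, end_date, n_days_forecast, n_days_interval):
    """Hypothetical helper: list (init, valid_dates) forecast windows."""
    if n_days_forecast < 1 or n_days_interval < 1:
        raise ValueError("n_days_forecast and n_days_interval must be >= 1")
    start = date.fromisoformat(init_date)
    stop = date.fromisoformat(end_date)
    windows = []
    current = start
    # Assumption: a window is kept only if its last valid day is <= end_date.
    while current + timedelta(days=n_days_forecast - 1) <= stop:
        valid = [current + timedelta(days=lead) for lead in range(n_days_forecast)]
        windows.append((current, valid))
        current += timedelta(days=n_days_interval)
    return windows


windows = forecast_windows(
    "2024-01-01", "2024-01-10", n_days_forecast=3, n_days_interval=7
)
for init, valid in windows:
    print(init, "->", valid[-1])
```

Whether the actual index keeps or drops a partial window at the end of the range is not stated in this reference, so check the behavior before relying on it.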
- filter_all_by_date(start, end)
Filters all datasets managed by this class by time range.
- Parameters:
start (datetime) – Start date(s).
end (datetime) – End date(s).
- filter_all_by_region(region)
Filters all datasets managed by this class by bounding box.
- Parameters:
region (Any) – Bounding box, documented as Tuple[float, float, float, float]: (lon_min, lat_min, lon_max, lat_max).
- filter_all_by_variable(variables)
Filters all datasets managed by this class by variable.
- Parameters:
variables (List[str]) – List of variables to filter.
- filter_attrs(filters)
Filters datasets based on attributes.
- Parameters:
filters (dict[str, Callable[[Any], bool]]) – Dictionary of filters.
- Return type:
None
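The filters argument maps an attribute name to a predicate that receives the attribute value and returns a bool. The sketch below shows how such a dictionary could be evaluated against plain attribute records. The helper matches, the attribute names, and the rule that a missing attribute fails the filter are assumptions made for illustration.

```python
from typing import Any, Callable


def matches(attrs: dict[str, Any], filters: dict[str, Callable[[Any], bool]]) -> bool:
    """Hypothetical helper: True if every filtered attribute passes its predicate."""
    # Assumption: a record that lacks a filtered attribute does not match.
    return all(
        key in attrs and predicate(attrs[key]) for key, predicate in filters.items()
    )


filters = {
    "resolution": lambda value: value <= 0.25,
    "source": lambda value: value in {"model", "reanalysis"},
}
records = [
    {"name": "a", "resolution": 0.083, "source": "model"},
    {"name": "b", "resolution": 1.0, "source": "model"},
    {"name": "c", "source": "reanalysis"},
]
kept = [record["name"] for record in records if matches(record, filters)]
print(kept)
```

Predicates are combined with a logical AND here; the reference does not say how the manager combines multiple filters.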
- filter_by_date(alias, start, end)
Filters the catalog by time range.
- Parameters:
start (datetime) – Start date.
end (datetime) – End date.
alias (str)
- filter_by_region(alias, region)
Filters the catalog by bounding box.
- Parameters:
alias (str)
region (Any) – Bounding box, documented as Tuple[float, float, float, float]: (lon_min, lat_min, lon_max, lat_max).
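The (lon_min, lat_min, lon_max, lat_max) ordering is easy to mix up with (lat, lon) conventions. The sketch below unpacks a box in that order and tests point containment. The helper in_bbox and the sample coordinates are hypothetical, and the check assumes the box does not cross the antimeridian.

```python
def in_bbox(lon, lat, bbox):
    """Hypothetical helper: True if (lon, lat) lies inside bbox, edges included."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


# Example box in degrees, ordered (lon_min, lat_min, lon_max, lat_max).
bbox = (-6.0, 30.0, 10.0, 45.0)
points = {"inside": (3.5, 40.0), "outside": (20.0, 40.0)}
selected = [name for name, (lon, lat) in points.items() if in_bbox(lon, lat, bbox)]
print(selected)
```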
- filter_by_variable(alias, variables)
Filters the catalog by variable.
- Parameters:
variables (List[str]) – List of variables to filter.
alias (str)
- get_catalog(alias)
Returns the catalog of a dataset.
- Parameters:
alias (str) – Dataset alias.
- Returns:
The dataset catalog.
- Return type:
- get_config()
Get configuration for all reference datasets.
- get_data(alias, path)
Loads data from a dataset.
- Parameters:
alias (str) – Dataset alias.
path (str) – File path.
- Returns:
Loaded dataset.
- Return type:
xr.Dataset
- get_dataloader(pred_alias, ref_aliases=None, batch_size=8, obs_batch_size=None, gridded_batch_size=None, pred_transform=None, ref_transforms=None, forecast_mode=False, n_days_forecast=0, lead_time_unit='days')
Creates an EvaluationDataloader from dataset aliases.
- Parameters:
pred_alias (str) – Alias of the prediction dataset.
ref_aliases (Optional[List[str]]) – Aliases of the reference datasets.
batch_size (int) – Batch size.
obs_batch_size (dict or int, optional) – Batch size for observation references. Can be a {ref_alias: int} dict to specify per-dataset sizes, or a single int applied to all observation references.
gridded_batch_size (dict or int, optional) – Batch size for gridded (non-observation) references such as GLORYS. Same formats as obs_batch_size.
pred_transform (Optional[CustomTransforms]) – Transform for predictions.
ref_transforms (Optional[List[CustomTransforms]]) – Transforms for references.
forecast_mode (bool) – Enable forecast mode.
n_days_forecast (int) – Number of forecast days to consider.
lead_time_unit (str) – Lead time unit (“days” or “hours”).
- Returns:
Dataloader instance.
- Return type:
EvaluationDataloader
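Because obs_batch_size and gridded_batch_size accept either a {ref_alias: int} dict or a single int, a caller may want to see which size applies to each reference. The sketch below resolves one value per alias. The helper resolve_batch_size, the alias names, and the fallback to batch_size when the argument is None or an alias is missing from the dict are assumptions, not documented behavior.

```python
def resolve_batch_size(ref_alias, specific, default):
    """Hypothetical helper: pick the batch size for one reference alias."""
    if specific is None:
        # Assumption: None falls back to the general batch_size.
        return default
    if isinstance(specific, dict):
        # Assumption: aliases missing from the dict use the default.
        return specific.get(ref_alias, default)
    return int(specific)


batch_size = 8
obs_batch_size = {"argo": 32}  # per-dataset form
sizes = {
    alias: resolve_batch_size(alias, obs_batch_size, batch_size)
    for alias in ("argo", "drifters")
}
print(sizes)
```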
- get_keep_variables_dict()
Get dictionary of variables to keep for each dataset.
- get_metadata_dict()
Get global metadata dictionary for all datasets.
- get_transform(*args, **kwargs)
Factory function to create a transform based on the given name and parameters.
- Parameters:
args (Any)
kwargs (Any)
- Return type:
Any
- standardize_names(alias, coords_rename_dict, vars_rename_dict)
Standardizes variable names of a dataset based on a mapping dictionary.
- Parameters:
alias (str) – Alias of the dataset to standardize.
coords_rename_dict (Dict[str, str]) – Dictionary mapping old coordinate names to new names.
vars_rename_dict (Dict[str, str]) – Dictionary mapping old variable names to new names.
- Return type:
None
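Both rename dictionaries map an old name to a new name. The sketch below applies such mappings to plain lists of names to show the expected shape of the arguments. The helper rename_all and the example names are hypothetical, and leaving unmapped names unchanged is an assumption about the intended behavior.

```python
def rename_all(names, rename_dict):
    """Hypothetical helper: apply an old-name -> new-name mapping to a list."""
    # Assumption: names absent from the mapping are left unchanged.
    return [rename_dict.get(name, name) for name in names]


coords_rename_dict = {"latitude": "lat", "longitude": "lon"}
vars_rename_dict = {"thetao": "temperature"}

coords = rename_all(["time", "latitude", "longitude"], coords_rename_dict)
variables = rename_all(["thetao", "so"], vars_rename_dict)
print(coords, variables)
```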
- to_json(alias, path=None)
Exports dataset information in JSON format.
- Parameters:
alias (str) – Alias of the dataset to export.
path (Optional[str]) – Path to save the JSON file.
- Returns:
JSON representation of the dataset information.
- Return type:
str