dctools.data.connection.argo_data.ArgoInterface
- class dctools.data.connection.argo_data.ArgoInterface(base_path, variables=None, s3_storage_options=None, chunks=None, max_fetch_retries=4, retry_backoff_seconds=0.8, n_download_workers=4)
Ultra-scalable ARGO interface with monthly partitioning and S3/local access.
Features: monthly partitioning; Zstd-compressed JSON; an int64 epoch index for O(log n) lookups; lazy Dask / S3 reads; multi-variable support with interpolation onto depth levels.
Works with either an S3/Wasabi configuration or local storage.
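The O(log n) claim for the int64 epoch index can be illustrated with a binary search over a sorted array. This is a minimal sketch rather than the class's actual code: the helper name `select_time_window`, the choice of seconds as the epoch unit, and the sample index are all assumptions made for illustration.

```python
import numpy as np


def select_time_window(epoch_index: np.ndarray, start: str, end: str) -> slice:
    """Return the slice of a sorted int64 epoch index covering [start, end]."""
    # Convert ISO timestamps to int64 seconds since 1970-01-01 (assumed unit).
    lo = np.datetime64(start, "s").astype(np.int64)
    hi = np.datetime64(end, "s").astype(np.int64)
    # Two binary searches on the sorted index, each O(log n).
    i0 = int(np.searchsorted(epoch_index, lo, side="left"))
    i1 = int(np.searchsorted(epoch_index, hi, side="right"))
    return slice(i0, i1)


# Hypothetical index: one profile per day in January 2020.
times = np.arange("2020-01-01", "2020-02-01", dtype="datetime64[D]")
epoch_index = times.astype("datetime64[s]").astype(np.int64)

window = select_time_window(epoch_index, "2020-01-10", "2020-01-12")
selected = epoch_index[window]  # profiles dated 10, 11 and 12 January
```

Storing timestamps as plain int64 keeps the index sortable and cheap to serialise to JSON, which is presumably why an epoch representation is used here.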
- Parameters:
base_path (str)
variables (List[str] | None)
s3_storage_options (Dict | None)
chunks (Dict[str, int] | None)
max_fetch_retries (int)
retry_backoff_seconds (float)
n_download_workers (int)
- __init__(base_path, variables=None, s3_storage_options=None, chunks=None, max_fetch_retries=4, retry_backoff_seconds=0.8, n_download_workers=4)
Initialise the ARGO interface.
- Parameters:
base_path (str) – Base path for the index files (S3 or local)
variables (List[str] | None) – List of variables to extract
s3_storage_options (Dict | None) – Options for fsspec S3 (key, secret, endpoint_url, etc.)
chunks (Dict[str, int] | None) – Dask chunk configuration (default {"N_PROF": 2000})
n_download_workers (int) – Concurrent download threads for batch profile loading (download_workers key in YAML). Keep low (≤ 4) when a Dask cluster is running to avoid GIL starvation of the scheduler asyncio loop.
max_fetch_retries (int)
retry_backoff_seconds (float)
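The reference does not describe how max_fetch_retries and retry_backoff_seconds interact. The sketch below shows one common reading, offered purely as an assumption: max_fetch_retries counts retries after the first attempt, and the delay doubles on each retry. The helper `fetch_with_retries` and the `flaky` callable are hypothetical and not part of dctools.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_fetch_retries: int = 4,
    retry_backoff_seconds: float = 0.8,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on OSError with exponentially growing delays."""
    attempt = 0
    while True:
        try:
            return fetch()
        except OSError:
            # Give up once the retry budget is spent (assumed semantics).
            if attempt >= max_fetch_retries:
                raise
            # Assumed schedule: 0.8 s, 1.6 s, 3.2 s, ... with the defaults.
            sleep(retry_backoff_seconds * 2 ** attempt)
            attempt += 1


# Hypothetical fetch that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return b"profile-bytes"


delays: List[float] = []
result = fetch_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
```

Passing `sleep` as a parameter keeps the example testable without real waiting.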
Methods
__init__(base_path[, variables, ...]) – Initialise the ARGO interface.
build_month(year, month[, temp_dir, n_workers]) – Build the compressed Kerchunk JSON for one month.
build_multi_year_monthly(start_year, end_year) – Build all monthly JSON indexes for a range of years.
build_time_window_monthly(start, end[, ...]) – Build monthly ARGO index only for months intersecting [start, end].
from_config(config) – Create an ArgoInterface instance from an ARGOConnectionConfig.
open_time_window(start, end, depth_levels[, ...]) – Open ARGO data for a time window, loading monthly indexes in parallel.
- build_month(year, month, temp_dir='tmp_refs', n_workers=8)
Build the compressed Kerchunk JSON for one month.
Uses a two-phase approach for speed:
1. Batch download: all missing profiles are fetched in parallel using requests.Session (HTTP connection pooling: a single TCP+TLS handshake per GDAC mirror instead of one per profile).
2. Local indexing: NetCDF3ToZarr runs on the locally cached .nc files (no network latency). Refs are patched so they still point to the GDAC URLs.
After indexing, the raw .nc cache is deleted to save disk space.
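The ref-patching step can be sketched on a Kerchunk-style reference dict, where each chunk entry is a `[url, offset, length]` list and inline metadata is stored as a string. This is an illustrative sketch, not the method's actual code: the helper name `patch_refs`, the local cache path, and the example URL are all hypothetical.

```python
from typing import Any, Dict


def patch_refs(refs: Dict[str, Any], local_path: str, remote_url: str) -> Dict[str, Any]:
    """Return a copy of refs whose chunk entries point at remote_url instead of local_path."""
    patched: Dict[str, Any] = {}
    for key, value in refs["refs"].items():
        # Chunk references are [url, offset, length]; inline metadata stays untouched.
        if isinstance(value, list) and value and value[0] == local_path:
            patched[key] = [remote_url, *value[1:]]
        else:
            patched[key] = value
    return {**refs, "refs": patched}


# Hypothetical paths, for illustration only.
local = "tmp_refs/cache/R1900000_001.nc"
remote = "https://gdac.example/dac/coriolis/1900000/profiles/R1900000_001.nc"

refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "TEMP/0": [local, 1024, 512],
        "PSAL/0": [local, 1536, 512],
    },
}
patched = patch_refs(refs, local, remote)
```

Because byte offsets and lengths are unchanged, the patched references remain valid for byte-range requests against the remote copy once the local cache is deleted.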
- build_multi_year_monthly(start_year, end_year, temp_dir='tmp_refs', n_workers=8)
Build all monthly JSON indexes for a range of years.
- build_time_window_monthly(start, end, temp_dir='tmp_refs', n_workers=8)
Build monthly ARGO index only for months intersecting [start, end].
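Selecting the months that intersect [start, end] amounts to enumerating every (year, month) pair between the two dates, both ends included. A minimal sketch follows; the helper `months_in_window` is hypothetical, and how the method handles time zones or an inverted window is not documented.

```python
from datetime import date
from typing import List, Tuple


def months_in_window(start: date, end: date) -> List[Tuple[int, int]]:
    """List the (year, month) pairs that intersect the closed window [start, end]."""
    if end < start:
        return []  # assumed behaviour for an inverted window
    months: List[Tuple[int, int]] = []
    year, month = start.year, start.month
    # Tuple comparison orders pairs chronologically.
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


months = months_in_window(date(2019, 11, 20), date(2020, 2, 3))
# months == [(2019, 11), (2019, 12), (2020, 1), (2020, 2)]
```

Each pair could then be handed to build_month(year, month), so partial months at either end of the window are still indexed in full.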
- classmethod from_config(config)
Create an ArgoInterface instance from an ARGOConnectionConfig.
- Parameters:
config – An ARGOConnectionConfig instance, or a SimpleNamespace carrying the same parameters
- Returns:
Configured instance
- Return type:
ArgoInterface
- open_time_window(start, end, depth_levels, variables=None, master_index=None, max_profiles=None)
Open ARGO data for a time window, loading monthly indexes in parallel.
Monthly Kerchunk JSON indexes are read from S3/Wasabi (not GDAC), so parallel loading is safe and does not add pressure on ARGO GDAC servers. Actual GDAC byte-range requests only happen later when Dask materialises the lazy dataset.
- Parameters:
start (str or pd.Timestamp) – Time window boundaries.
end (str or pd.Timestamp) – Time window boundaries.
depth_levels (array-like) – Target depth levels for interpolation.
variables (list[str] or None) – Subset of data variables to keep.
master_index (dict or None) – Pre-loaded master index dict (from ArgoManager._master_index). If None the master index is read from S3/local on every call.
max_profiles (int or None) – Maximum number of profiles to load across all months. When set, loading stops once the cap is reached. Useful for metadata-only access to avoid loading thousands of profiles.
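The depth_levels argument implies that each profile is resampled onto a common vertical grid. The reference does not say which interpolation scheme is used; the sketch below assumes linear interpolation with NaN outside the sampled depth range. The helper `interpolate_profile` and the sample values are hypothetical.

```python
import numpy as np


def interpolate_profile(depth, values, depth_levels) -> np.ndarray:
    """Linearly interpolate one profile onto depth_levels, NaN outside its range."""
    depth = np.asarray(depth, dtype=float)
    values = np.asarray(values, dtype=float)
    depth_levels = np.asarray(depth_levels, dtype=float)
    # Drop missing samples before interpolating.
    valid = np.isfinite(depth) & np.isfinite(values)
    depth, values = depth[valid], values[valid]
    if depth.size < 2:
        return np.full(depth_levels.shape, np.nan)
    # np.interp requires increasing x coordinates.
    order = np.argsort(depth)
    return np.interp(depth_levels, depth[order], values[order], left=np.nan, right=np.nan)


# Hypothetical temperature profile (depth in metres, temperature in °C).
depth = [5.0, 10.0, 20.0, 50.0]
temp = [20.0, 19.0, 17.0, 12.0]
levels = [0.0, 10.0, 15.0, 100.0]

on_levels = interpolate_profile(depth, temp, levels)
# 0 m and 100 m lie outside the sampled range, so they come back as NaN.
```

Returning NaN rather than extrapolating keeps profiles of different vertical extent comparable once stacked along the profile dimension.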