dctools.data.connection.argo_data.ArgoInterface

class dctools.data.connection.argo_data.ArgoInterface(base_path, variables=None, s3_storage_options=None, chunks=None, max_fetch_retries=4, retry_backoff_seconds=0.8, n_download_workers=4)

Ultra-scalable ARGO interface with monthly partitioning and S3/local access.

Ultra-scalable ARGO class providing:

  • Monthly partitioning

  • Zstd-compressed JSON

  • int64 epoch index for O(log n) lookups

  • Lazy reads via Dask / S3

  • Multiple variables and interpolation over depth

Can use either an S3/Wasabi configuration or local storage.
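The O(log n) lookup attributed to the int64 epoch index can be illustrated with a binary search over sorted timestamps. This is a minimal sketch using only the standard library; `to_epoch_seconds` and `profiles_in_window` are hypothetical helper names, and the actual index layout used by ArgoInterface is not described in this page.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timezone


def to_epoch_seconds(ts: str) -> int:
    """Convert an ISO 8601 timestamp (assumed to be UTC) to integer epoch seconds."""
    return int(datetime.fromisoformat(ts).replace(tzinfo=timezone.utc).timestamp())


def profiles_in_window(sorted_epochs: list[int], start: str, end: str) -> range:
    """Return the positions of profiles whose time falls within [start, end].

    `sorted_epochs` must be sorted in ascending order. Two binary searches
    locate the window, so the cost is O(log n) in the number of profiles.
    """
    lo = bisect_left(sorted_epochs, to_epoch_seconds(start))
    hi = bisect_right(sorted_epochs, to_epoch_seconds(end))
    return range(lo, hi)
```

Storing times as plain integers keeps the comparison cheap and avoids parsing date strings during the search.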

Parameters:
  • base_path (str)

  • variables (List[str] | None)

  • s3_storage_options (Dict | None)

  • chunks (Dict[str, int] | None)

  • max_fetch_retries (int)

  • retry_backoff_seconds (float)

  • n_download_workers (int)

__init__(base_path, variables=None, s3_storage_options=None, chunks=None, max_fetch_retries=4, retry_backoff_seconds=0.8, n_download_workers=4)

Initialise the ARGO interface.

Parameters:
  • base_path (str) – Base path for the index files (S3 or local)

  • variables (List[str] | None) – List of variables to extract

  • s3_storage_options (Dict | None) – Options passed to fsspec for S3 (key, secret, endpoint_url, etc.)

  • chunks (Dict[str, int] | None) – Dask chunk configuration (defaults to {"N_PROF": 2000})

  • n_download_workers (int) – Concurrent download threads for batch profile loading (download_workers key in YAML). Keep low (≤ 4) when a Dask cluster is running to avoid GIL starvation of the scheduler asyncio loop.

  • max_fetch_retries (int)

  • retry_backoff_seconds (float)
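This page does not describe how max_fetch_retries and retry_backoff_seconds are applied. One common reading of such a pair is a bounded retry loop with a growing pause, sketched below. The function `fetch_with_retries`, the linear backoff schedule, and the interpretation of max_fetch_retries as the total number of attempts are all assumptions made for illustration, not the library's documented behaviour.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_fetch_retries: int = 4,
    retry_backoff_seconds: float = 0.8,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, retrying on failure with a linearly growing pause.

    Assumption: `max_fetch_retries` counts total attempts, and the pause
    before attempt k+1 is `retry_backoff_seconds * k`.
    """
    last_error = None
    for attempt in range(1, max_fetch_retries + 1):
        try:
            return fetch()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < max_fetch_retries:
                sleep(retry_backoff_seconds * attempt)
    raise RuntimeError(f"fetch failed after {max_fetch_retries} attempts") from last_error
```

The `sleep` argument exists only so the pause can be replaced in tests.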

Methods

__init__(base_path[, variables, ...])

Initialise the ARGO interface.

build_month(year, month[, temp_dir, n_workers])

Build the compressed Kerchunk JSON for one month.

build_multi_year_monthly(start_year, end_year)

Build all monthly JSON files for multiple years.

build_time_window_monthly(start, end[, ...])

Build monthly ARGO index only for months intersecting [start, end].

from_config(config)

Create an ArgoInterface instance from an ARGOConnectionConfig.

open_time_window(start, end, depth_levels[, ...])

Open ARGO data for a time window, loading monthly indexes in parallel.

build_month(year, month, temp_dir='tmp_refs', n_workers=8)

Build the compressed Kerchunk JSON for one month.

Uses a two-phase approach for speed:

  1. Batch download: all missing profiles are fetched in parallel using requests.Session, whose HTTP connection pooling needs a single TCP+TLS handshake per GDAC mirror instead of one per profile.

  2. Local indexing: NetCDF3ToZarr runs on the locally cached .nc files (no network latency). The references are then patched so that they still point to the GDAC URLs.

After indexing, the raw .nc cache is deleted to save disk space.
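The reference-patching step can be sketched as follows. In the Kerchunk version 1 reference layout, each entry under "refs" is either an inline string or a list of the form [path, offset, length], so pointing an index built from a local file back at its remote source amounts to rewriting the first element of each list. The helper `patch_refs_to_remote` is a hypothetical name, and this is an illustration of the idea rather than the code used by build_month.

```python
def patch_refs_to_remote(refs: dict, local_path: str, remote_url: str) -> dict:
    """Return a copy of a Kerchunk reference set with a local path swapped for a URL.

    Inline entries (plain strings) carry no path and are copied unchanged;
    only list entries whose first element equals `local_path` are rewritten.
    """
    patched = {}
    for key, value in refs["refs"].items():
        if isinstance(value, list) and value and value[0] == local_path:
            # Keep the byte offset and length; only the location changes.
            patched[key] = [remote_url, *value[1:]]
        else:
            patched[key] = value
    return {**refs, "refs": patched}
```

Because byte offsets within the file are identical locally and remotely, the patched references remain valid for byte-range requests against the original URL.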

build_multi_year_monthly(start_year, end_year, temp_dir='tmp_refs', n_workers=8)

Build all monthly JSON files for multiple years.

build_time_window_monthly(start, end, temp_dir='tmp_refs', n_workers=8)

Build monthly ARGO index only for months intersecting [start, end].
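Selecting the months that intersect [start, end] reduces to enumerating calendar months from the month of start to the month of end. A minimal sketch, assuming both bounds are given as `datetime.date` values; `months_intersecting` is a hypothetical helper, not part of the documented API.

```python
from datetime import date


def months_intersecting(start: date, end: date) -> list[tuple[int, int]]:
    """List the (year, month) pairs of every calendar month touching [start, end]."""
    if end < start:
        return []
    months = []
    year, month = start.year, start.month
    # Tuple comparison orders by year first, then month.
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months
```

Each pair returned could then be passed to build_month(year, month).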

classmethod from_config(config)

Create an ArgoInterface instance from an ARGOConnectionConfig.

Parameters:

config – ARGOConnectionConfig instance, or a SimpleNamespace carrying the same parameters

Returns:

Configured instance

Return type:

ArgoInterface

open_time_window(start, end, depth_levels, variables=None, master_index=None, max_profiles=None)

Open ARGO data for a time window, loading monthly indexes in parallel.

Monthly Kerchunk JSON indexes are read from S3/Wasabi (not GDAC), so parallel loading is safe and does not add pressure on ARGO GDAC servers. Actual GDAC byte-range requests only happen later when Dask materialises the lazy dataset.

Parameters:
  • start (str or pd.Timestamp) – Start of the time window.

  • end (str or pd.Timestamp) – End of the time window.

  • depth_levels (array-like) – Target depth levels for interpolation.

  • variables (list[str] or None) – Subset of data variables to keep.

  • master_index (dict or None) – Pre-loaded master index dict (from ArgoManager._master_index). If None, the master index is read from S3/local on every call.

  • max_profiles (int or None) – Maximum number of profiles to load across all months. When set, loading stops once the cap is reached. Useful for metadata-only access to avoid loading thousands of profiles.
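To make the role of depth_levels concrete, the sketch below linearly interpolates a single profile onto a set of target depths using only the standard library. The function name `interpolate_to_depths`, the choice of linear interpolation, and the NaN result for targets outside the sampled range are assumptions; the interpolation scheme actually applied by open_time_window is not specified in this page.

```python
import math
from bisect import bisect_right


def interpolate_to_depths(depths, values, depth_levels):
    """Linearly interpolate one profile onto target depth levels.

    `depths` must be strictly increasing. Targets outside the sampled
    range yield NaN instead of an extrapolated value (an assumption).
    """
    result = []
    for target in depth_levels:
        if target < depths[0] or target > depths[-1]:
            result.append(math.nan)
            continue
        i = bisect_right(depths, target) - 1  # last sample at or above the target depth
        if i == len(depths) - 1:  # target sits exactly on the deepest sample
            result.append(values[i])
            continue
        weight = (target - depths[i]) / (depths[i + 1] - depths[i])
        result.append(values[i] + weight * (values[i + 1] - values[i]))
    return result
```

For array inputs, `numpy.interp` performs the same linear interpolation, although it clamps out-of-range targets to the end values by default rather than returning NaN.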