Data

Data

Dataset modules for loading HDF5 simulation trajectories.

class lagrangebench.data.data.H5Dataset(split: str, dataset_path: str, name: str | None = None, input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Dataset for loading HDF5 simulation trajectories.

For a reference on parallel loading of h5 samples, see: https://github.com/pytorch/pytorch/issues/11929

Implementation inspired by: https://github.com/Open-Catalyst-Project/ocp/blob/main/ocpmodels/datasets/lmdb_dataset.py

__init__(split: str, dataset_path: str, name: str | None = None, input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Initialize the dataset. If the dataset is not present, it is downloaded.

Parameters:
  • split – “train”, “valid”, or “test”

  • dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.

  • name – Name of the dataset. If None, it is inferred from the path.

  • input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, the number of returned past positions is input_seq_length + 1, so that the target acceleration can be computed.

  • extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.

  • nl_backend – Which backend to use for the neighbor list

download(name: str, path: str) str[source]

Download the dataset.

Parameters:
  • name – Name of the dataset

  • path – Destination path to the downloaded dataset

get_trajectory(idx: int)[source]

Get a (full) trajectory at index idx.

get_window(idx: int)[source]

Get a window of the trajectory at index idx.
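As a rough illustration of what a window is, the sketch below slices a fixed-length run of frames out of one trajectory. It is not the library's implementation: the (num_frames, num_particles, dim) storage layout, the helper name slice_window, and the window length input_seq_length + 1 + extra_seq_length are assumptions made for this example.

```python
import numpy as np


def slice_window(trajectory, start, input_seq_length=6, extra_seq_length=0):
    """Cut one window out of a trajectory (hypothetical helper).

    Assumes `trajectory` has shape (num_frames, num_particles, dim).
    """
    # Assumed window length: inputs, one target frame, plus any extra frames.
    window_length = input_seq_length + 1 + extra_seq_length
    if start < 0 or start + window_length > trajectory.shape[0]:
        raise IndexError("window does not fit inside the trajectory")
    window = trajectory[start : start + window_length]
    # Move particles to the leading axis: (num_particles, window_length, dim).
    return np.transpose(window, (1, 0, 2))


# Toy trajectory: 20 frames, 4 particles, 2 spatial dimensions.
trajectory = np.arange(20 * 4 * 2, dtype=np.float32).reshape(20, 4, 2)
window = slice_window(trajectory, start=3)
print(window.shape)  # (4, 7, 2)
```

With the default input_seq_length of 6 and no extra steps, each window in this sketch spans seven consecutive frames.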

__getitem__(idx: int)[source]

Get a sequence of positions (one window) from the dataset at index idx.

Returns:

Array of shape (num_particles_max, input_seq_length + 1, dim). Axis 1 holds the position sequence (length input_seq_length) followed by the last position, which is used to compute the target acceleration.
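The relation between the returned positions, the historic velocities, and the target acceleration can be sketched with finite differences. This is a minimal illustration, assuming unit time steps and no periodic boundary handling; the helper name velocities_and_target is hypothetical and not part of lagrangebench.

```python
import numpy as np


def velocities_and_target(positions):
    """Split a (num_particles, input_seq_length + 1, dim) array (hypothetical helper)."""
    # First-order differences along the time axis give input_seq_length velocities.
    velocities = np.diff(positions, axis=1)
    # Velocities computed only from the input positions: input_seq_length - 1 of them.
    historic_velocities = velocities[:, :-1]
    # Target acceleration: change between the last two velocities.
    target_acceleration = velocities[:, -1] - velocities[:, -2]
    return historic_velocities, target_acceleration


input_seq_length, num_particles, dim = 6, 5, 2
rng = np.random.default_rng(0)
positions = rng.normal(size=(num_particles, input_seq_length + 1, dim))
historic_velocities, target_acceleration = velocities_and_target(positions)
print(historic_velocities.shape)  # (5, 5, 2)
print(target_acceleration.shape)  # (5, 2)
```

The extra position at the end of axis 1 is what makes the last velocity, and therefore the target acceleration, computable.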

lagrangebench.data.data.get_dataset_name_from_path(path: str) str[source]

Infer the dataset name from the provided path.

Variant 1:

If the dataset directory contains {2|3}D_{ABC}, the name is inferred as {abc2d|abc3d}. These names are based on the lagrangebench dataset directories:

{2D|3D}_{TGV|RPF|LDC|DAM}_{num_particles_max}_{num_steps}every{sampling_rate}

The shorter dataset names then become one of the following:

{tgv2d|tgv3d|rpf2d|rpf3d|ldc2d|ldc3d|dam2d}

Variant 2:

If the condition {2|3}D_{ABC} is not met, the name of the dataset directory is used.
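The two variants can be sketched with a regular expression. This is an illustrative re-implementation under the rules stated above, not the library's source; the function name infer_dataset_name and the custom directory name are hypothetical.

```python
import os
import re


def infer_dataset_name(path):
    """Illustrative re-implementation of the two naming variants."""
    # Variant 1: look for a "2D_ABC" or "3D_ABC" token in the path.
    match = re.search(r"([23]D)_([A-Z]{3})", path)
    if match is not None:
        dim, case = match.groups()
        return f"{case}{dim}".lower()
    # Variant 2: fall back to the name of the dataset directory.
    return os.path.basename(os.path.normpath(path))


print(infer_dataset_name("datasets/2D_TGV_2500_10kevery100"))  # tgv2d
print(infer_dataset_name("datasets/3D_LDC_8160_10kevery100"))  # ldc3d
print(infer_dataset_name("datasets/my_custom_run/"))  # my_custom_run
```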

class lagrangebench.data.data.TGV2D(split: str, dataset_path: str = 'datasets/2D_TGV_2500_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Taylor-Green Vortex 2D dataset. 2.5K particles.

__init__(split: str, dataset_path: str = 'datasets/2D_TGV_2500_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Initialize the dataset. If the dataset is not present, it is downloaded.

Parameters:
  • split – “train”, “valid”, or “test”

  • dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.

  • name – Name of the dataset. If None, it is inferred from the path.

  • input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, the number of returned past positions is input_seq_length + 1, so that the target acceleration can be computed.

  • extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.

  • nl_backend – Which backend to use for the neighbor list

class lagrangebench.data.data.TGV3D(split: str, dataset_path: str = 'datasets/3D_TGV_8000_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Taylor-Green Vortex 3D dataset. 8K particles.

__init__(split: str, dataset_path: str = 'datasets/3D_TGV_8000_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Initialize the dataset. If the dataset is not present, it is downloaded.

Parameters:
  • split – “train”, “valid”, or “test”

  • dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.

  • name – Name of the dataset. If None, it is inferred from the path.

  • input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, the number of returned past positions is input_seq_length + 1, so that the target acceleration can be computed.

  • extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.

  • nl_backend – Which backend to use for the neighbor list

class lagrangebench.data.data.RPF2D(split: str, dataset_path: str = 'datasets/2D_RPF_3200_20kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Reverse Poiseuille Flow 2D dataset. 3.2K particles.

__init__(split: str, dataset_path: str = 'datasets/2D_RPF_3200_20kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Initialize the dataset. If the dataset is not present, it is downloaded.

Parameters:
  • split – “train”, “valid”, or “test”

  • dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.

  • name – Name of the dataset. If None, it is inferred from the path.

  • input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, the number of returned past positions is input_seq_length + 1, so that the target acceleration can be computed.

  • extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.

  • nl_backend – Which backend to use for the neighbor list

class lagrangebench.data.data.RPF3D(split: str, dataset_path: str = 'datasets/3D_RPF_8000_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Reverse Poiseuille Flow 3D dataset. 8K particles.

__init__(split: str, dataset_path: str = 'datasets/3D_RPF_8000_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Initialize the dataset. If the dataset is not present, it is downloaded.

Parameters:
  • split – “train”, “valid”, or “test”

  • dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.

  • name – Name of the dataset. If None, it is inferred from the path.

  • input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, the number of returned past positions is input_seq_length + 1, so that the target acceleration can be computed.

  • extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.

  • nl_backend – Which backend to use for the neighbor list

class lagrangebench.data.data.LDC2D(split: str, dataset_path: str = 'datasets/2D_LDC_2500_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Lid-Driven Cavity 2D dataset. 2.5K particles.

__init__(split: str, dataset_path: str = 'datasets/2D_LDC_2500_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Initialize the dataset. If the dataset is not present, it is downloaded.

Parameters:
  • split – “train”, “valid”, or “test”

  • dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.

  • name – Name of the dataset. If None, it is inferred from the path.

  • input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, the number of returned past positions is input_seq_length + 1, so that the target acceleration can be computed.

  • extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.

  • nl_backend – Which backend to use for the neighbor list

class lagrangebench.data.data.LDC3D(split: str, dataset_path: str = 'datasets/3D_LDC_8160_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Lid-Driven Cavity 3D dataset. 8.2K particles.

__init__(split: str, dataset_path: str = 'datasets/3D_LDC_8160_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Initialize the dataset. If the dataset is not present, it is downloaded.

Parameters:
  • split – “train”, “valid”, or “test”

  • dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.

  • name – Name of the dataset. If None, it is inferred from the path.

  • input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, the number of returned past positions is input_seq_length + 1, so that the target acceleration can be computed.

  • extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.

  • nl_backend – Which backend to use for the neighbor list

class lagrangebench.data.data.DAM2D(split: str, dataset_path: str = 'datasets/2D_DB_5740_20kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Dam break 2D dataset. 5.7K particles.

__init__(split: str, dataset_path: str = 'datasets/2D_DB_5740_20kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]

Initialize the dataset. If the dataset is not present, it is downloaded.

Parameters:
  • split – “train”, “valid”, or “test”

  • dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.

  • name – Name of the dataset. If None, it is inferred from the path.

  • input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, the number of returned past positions is input_seq_length + 1, so that the target acceleration can be computed.

  • extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.

  • nl_backend – Which backend to use for the neighbor list

Utils

Data utils.

lagrangebench.data.utils.get_dataset_stats(metadata: Dict[str, List[float]], is_isotropic_norm: bool, noise_std: float) Dict[str, Dict[str, Array]][source]

Return the dataset statistics based on the metadata dictionary.

Parameters:
  • metadata – Dataset metadata dictionary.

  • is_isotropic_norm – Whether to shift/scale dimensions equally instead of dimension-wise.

  • noise_std – Standard deviation of the GNS-style noise.

Returns:

Dictionary with the dataset statistics.
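To make the role of is_isotropic_norm and noise_std concrete, here is a minimal NumPy sketch of how such statistics might be assembled. It is not the library's code: the metadata keys (vel_mean, vel_std, acc_mean, acc_std), the layout of the returned dictionary, and the choice to fold the noise into the scale by adding variances are all assumptions modeled on GNS-style normalization.

```python
import numpy as np


def sketch_dataset_stats(metadata, is_isotropic_norm, noise_std):
    """Build normalization statistics from metadata (hypothetical sketch)."""
    stats = {}
    # Assumed metadata keys: "<prefix>_mean" and "<prefix>_std" per quantity.
    for name, prefix in (("velocity", "vel"), ("acceleration", "acc")):
        mean = np.asarray(metadata[f"{prefix}_mean"], dtype=np.float64)
        std = np.asarray(metadata[f"{prefix}_std"], dtype=np.float64)
        if is_isotropic_norm:
            # Use one shared shift and scale for every spatial dimension.
            mean = np.full_like(mean, mean.mean())
            std = np.full_like(std, np.sqrt(np.mean(std**2)))
        # Widen the scale by the injected noise (variances add).
        std = np.sqrt(std**2 + noise_std**2)
        stats[name] = {"mean": mean, "std": std}
    return stats


# Made-up metadata values, for illustration only.
metadata = {
    "vel_mean": [0.0, 0.2],
    "vel_std": [3.0, 4.0],
    "acc_mean": [0.0, 0.0],
    "acc_std": [1.0, 1.0],
}
stats = sketch_dataset_stats(metadata, is_isotropic_norm=True, noise_std=0.0)
print(stats["velocity"]["mean"])
print(stats["velocity"]["std"])
```

With isotropic normalization enabled, both spatial dimensions in this sketch end up with the same mean and the same standard deviation.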

lagrangebench.data.utils.numpy_collate(batch) ndarray[source]

Collate helper for torch dataloaders.
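A collate helper of this kind typically stacks samples into NumPy arrays rather than torch tensors, so that batches can be handed to JAX directly. The sketch below follows the pattern commonly shown in JAX data-loading tutorials; it is an assumption about the general approach, not the body of numpy_collate, and the name collate_to_numpy is hypothetical.

```python
import numpy as np


def collate_to_numpy(batch):
    """Stack a list of samples into NumPy arrays (hypothetical sketch)."""
    first = batch[0]
    if isinstance(first, np.ndarray):
        # Plain arrays: add a leading batch axis.
        return np.stack(batch)
    if isinstance(first, (tuple, list)):
        # Tuples of fields: collate each field separately.
        return [collate_to_numpy(list(field)) for field in zip(*batch)]
    # Scalars and other leaves.
    return np.asarray(batch)


samples = [(np.zeros((4, 7, 2)), 0), (np.ones((4, 7, 2)), 1)]
positions, labels = collate_to_numpy(samples)
print(positions.shape)  # (2, 4, 7, 2)
print(labels)  # [0 1]
```

A function like this could be passed as the collate_fn argument of a torch.utils.data.DataLoader.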