Data
Data
Dataset modules for loading HDF5 simulation trajectories.
- class lagrangebench.data.data.H5Dataset(split: str, dataset_path: str, name: str | None = None, input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Dataset for loading HDF5 simulation trajectories.
For a reference on parallel loading of HDF5 samples, see: https://github.com/pytorch/pytorch/issues/11929
Implementation inspired by: https://github.com/Open-Catalyst-Project/ocp/blob/main/ocpmodels/datasets/lmdb_dataset.py
- __init__(split: str, dataset_path: str, name: str | None = None, input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Initialize the dataset. If the dataset is not present, it is downloaded.
- Parameters:
split – “train”, “valid”, or “test”
dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.
name – Name of the dataset. If None, it is inferred from the path.
input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, input_seq_length + 1 past positions are returned so that the target acceleration can be computed.
extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.
nl_backend – Which backend to use for the neighbor list
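The relationship between input_seq_length, the number of historic velocities, and the extra position needed for the target acceleration can be illustrated with plain finite differences. This is a minimal sketch and not the library's implementation: the helper names `finite_difference` and `split_training_window` are hypothetical, and a unit time step with simple differencing is assumed.

```python
from typing import List, Tuple

Vector = List[float]


def finite_difference(seq: List[Vector]) -> List[Vector]:
    """Return consecutive differences seq[i + 1] - seq[i] (one entry fewer)."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(seq[:-1], seq[1:])
    ]


def split_training_window(
    positions: List[Vector], input_seq_length: int
) -> Tuple[List[Vector], List[Vector], Vector]:
    """Split input_seq_length + 1 positions into inputs and a target.

    Assumption: unit time step, so velocity is a position difference and
    acceleration is a velocity difference.
    """
    assert len(positions) == input_seq_length + 1
    inputs = positions[:input_seq_length]
    # input_seq_length positions give input_seq_length - 1 velocities.
    velocities = finite_difference(inputs)
    # The extra position yields one more velocity, hence the target acceleration.
    next_velocity = finite_difference(positions[-2:])[0]
    target_acc = [v1 - v0 for v0, v1 in zip(velocities[-1], next_velocity)]
    return inputs, velocities, target_acc


# One particle in 1D with constant acceleration 2: x(t) = t**2.
positions = [[float(t * t)] for t in range(7)]  # 6 + 1 positions
inputs, velocities, target_acc = split_training_window(positions, input_seq_length=6)
print(len(inputs), len(velocities), target_acc)  # 6 5 [2.0]
```

With input_seq_length = 6, the sketch uses six positions as input, derives five velocities from them, and needs the seventh position only to form the target.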
- lagrangebench.data.data.get_dataset_name_from_path(path: str) → str[source]
Infer the dataset name from the provided path.
- Variant 1:
If the dataset directory contains {2|3}D_{ABC}, then the name is inferred as {abc2d|abc3d}. These names are based on the lagrangebench dataset directories:
{2D|3D}_{TGV|RPF|LDC|DAM}_{num_particles_max}_{num_steps}every{sampling_rate}
The shorter dataset names then become one of the following:
{tgv2d|tgv3d|rpf2d|rpf3d|ldc2d|ldc3d|dam2d}
- Variant 2:
If the condition {2|3}D_{ABC} is not met, the name is the name of the dataset directory.
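The two variants can be sketched with a regular expression and a fallback to the directory name. This is an illustrative re-implementation under stated assumptions, not the library's code: the function name `infer_dataset_name` is hypothetical, and {ABC} is assumed to mean exactly three uppercase letters.

```python
import os
import re


def infer_dataset_name(path: str) -> str:
    """Hypothetical sketch of the two naming variants."""
    # Variant 1: a "{2|3}D_{ABC}" token, e.g. "2D_TGV" becomes "tgv2d".
    match = re.search(r"([23]D)_([A-Z]{3})", path)
    if match:
        dim, case = match.groups()
        return f"{case}{dim}".lower()
    # Variant 2: fall back to the name of the dataset directory.
    return os.path.basename(os.path.normpath(path))


print(infer_dataset_name("datasets/2D_TGV_2500_10kevery100"))  # tgv2d
print(infer_dataset_name("datasets/3D_RPF_8000_10kevery100"))  # rpf3d
print(infer_dataset_name("datasets/my_custom_run/"))           # my_custom_run
```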
- class lagrangebench.data.data.TGV2D(split: str, dataset_path: str = 'datasets/2D_TGV_2500_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Taylor-Green Vortex 2D dataset. 2.5K particles.
- __init__(split: str, dataset_path: str = 'datasets/2D_TGV_2500_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Initialize the dataset. If the dataset is not present, it is downloaded.
- Parameters:
split – “train”, “valid”, or “test”
dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.
name – Name of the dataset. If None, it is inferred from the path.
input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, input_seq_length + 1 past positions are returned so that the target acceleration can be computed.
extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.
nl_backend – Which backend to use for the neighbor list
- class lagrangebench.data.data.TGV3D(split: str, dataset_path: str = 'datasets/3D_TGV_8000_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Taylor-Green Vortex 3D dataset. 8K particles.
- __init__(split: str, dataset_path: str = 'datasets/3D_TGV_8000_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Initialize the dataset. If the dataset is not present, it is downloaded.
- Parameters:
split – “train”, “valid”, or “test”
dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.
name – Name of the dataset. If None, it is inferred from the path.
input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, input_seq_length + 1 past positions are returned so that the target acceleration can be computed.
extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.
nl_backend – Which backend to use for the neighbor list
- class lagrangebench.data.data.RPF2D(split: str, dataset_path: str = 'datasets/2D_RPF_3200_20kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Reverse Poiseuille Flow 2D dataset. 3.2K particles.
- __init__(split: str, dataset_path: str = 'datasets/2D_RPF_3200_20kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Initialize the dataset. If the dataset is not present, it is downloaded.
- Parameters:
split – “train”, “valid”, or “test”
dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.
name – Name of the dataset. If None, it is inferred from the path.
input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, input_seq_length + 1 past positions are returned so that the target acceleration can be computed.
extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.
nl_backend – Which backend to use for the neighbor list
- class lagrangebench.data.data.RPF3D(split: str, dataset_path: str = 'datasets/3D_RPF_8000_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Reverse Poiseuille Flow 3D dataset. 8K particles.
- __init__(split: str, dataset_path: str = 'datasets/3D_RPF_8000_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Initialize the dataset. If the dataset is not present, it is downloaded.
- Parameters:
split – “train”, “valid”, or “test”
dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.
name – Name of the dataset. If None, it is inferred from the path.
input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, input_seq_length + 1 past positions are returned so that the target acceleration can be computed.
extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.
nl_backend – Which backend to use for the neighbor list
- class lagrangebench.data.data.LDC2D(split: str, dataset_path: str = 'datasets/2D_LDC_2500_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Lid-Driven Cavity 2D dataset. 2.5K particles.
- __init__(split: str, dataset_path: str = 'datasets/2D_LDC_2500_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Initialize the dataset. If the dataset is not present, it is downloaded.
- Parameters:
split – “train”, “valid”, or “test”
dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.
name – Name of the dataset. If None, it is inferred from the path.
input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, input_seq_length + 1 past positions are returned so that the target acceleration can be computed.
extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.
nl_backend – Which backend to use for the neighbor list
- class lagrangebench.data.data.LDC3D(split: str, dataset_path: str = 'datasets/3D_LDC_8160_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Lid-Driven Cavity 3D dataset. 8.2K particles.
- __init__(split: str, dataset_path: str = 'datasets/3D_LDC_8160_10kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Initialize the dataset. If the dataset is not present, it is downloaded.
- Parameters:
split – “train”, “valid”, or “test”
dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.
name – Name of the dataset. If None, it is inferred from the path.
input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, input_seq_length + 1 past positions are returned so that the target acceleration can be computed.
extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.
nl_backend – Which backend to use for the neighbor list
- class lagrangebench.data.data.DAM2D(split: str, dataset_path: str = 'datasets/2D_DB_5740_20kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Dam break 2D dataset. 5.7K particles.
- __init__(split: str, dataset_path: str = 'datasets/2D_DB_5740_20kevery100', input_seq_length: int = 6, extra_seq_length: int = 0, nl_backend: str = 'jaxmd_vmap')[source]
Initialize the dataset. If the dataset is not present, it is downloaded.
- Parameters:
split – “train”, “valid”, or “test”
dataset_path – Path to the dataset. Download will start automatically if dataset_path does not exist.
name – Name of the dataset. If None, it is inferred from the path.
input_seq_length – Length of the input sequence. The number of historic velocities is input_seq_length - 1. During training, input_seq_length + 1 past positions are returned so that the target acceleration can be computed.
extra_seq_length – During training, this is the maximum number of pushforward unroll steps. During validation/testing, this specifies the largest N-step MSE loss we are interested in, e.g. for best model checkpointing.
nl_backend – Which backend to use for the neighbor list
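The default dataset_path values of the classes above follow the directory convention {2D|3D}_{CASE}_{num_particles_max}_{num_steps}every{sampling_rate}. As a rough sketch of how such a directory name could be decomposed, the following uses a hypothetical helper `parse_dataset_dir` that is not part of lagrangebench. It assumes the case label is a run of uppercase letters and keeps num_steps as text (for example "10k") rather than guessing its numeric meaning.

```python
import re
from typing import Dict, Optional, Union

# Pattern derived from the documented directory convention.
_DIR_PATTERN = re.compile(
    r"(?P<dim>[23])D_(?P<case>[A-Z]+)_(?P<num_particles_max>\d+)"
    r"_(?P<num_steps>\w+?)every(?P<sampling_rate>\d+)$"
)


def parse_dataset_dir(path: str) -> Optional[Dict[str, Union[int, str]]]:
    """Hypothetical helper: split a dataset directory name into its fields."""
    match = _DIR_PATTERN.search(path.rstrip("/"))
    if match is None:
        return None  # The path does not follow the convention.
    return {
        "dim": int(match["dim"]),
        "case": match["case"],
        "num_particles_max": int(match["num_particles_max"]),
        "num_steps": match["num_steps"],  # kept as text, e.g. "10k"
        "sampling_rate": int(match["sampling_rate"]),
    }


info = parse_dataset_dir("datasets/2D_TGV_2500_10kevery100")
print(info)
```

For the TGV2D default path this yields a 2D case labelled TGV with at most 2500 particles, matching the "2.5K particles" in the class description.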
Utils
Data utils.
- lagrangebench.data.utils.get_dataset_stats(metadata: Dict[str, List[float]], is_isotropic_norm: bool, noise_std: float) → Dict[str, Dict[str, Array]][source]
Return the dataset statistics based on the metadata dictionary.
- Parameters:
metadata – Dataset metadata dictionary.
is_isotropic_norm – Whether to shift/scale dimensions equally instead of dimension-wise.
noise_std – Standard deviation of the GNS-style noise.
- Returns:
Dictionary with the dataset statistics.
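To make the difference between isotropic and dimension-wise normalization concrete, here is a small sketch in plain Python. It is not the library's implementation. The function name `sketch_dataset_stats`, the metadata keys (`vel_mean`, `vel_std`, `acc_mean`, `acc_std`), the root-mean-square pooling of the standard deviations, and the way noise_std is combined in quadrature are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def sketch_dataset_stats(
    metadata: Dict[str, List[float]], is_isotropic_norm: bool, noise_std: float
) -> Dict[str, Dict[str, List[float]]]:
    """Illustrative only: key names and formulas are assumptions."""
    stats = {}
    for quantity in ("vel", "acc"):
        mean = list(metadata[f"{quantity}_mean"])
        std = list(metadata[f"{quantity}_std"])
        if is_isotropic_norm:
            # Share one mean and one std across all spatial dimensions.
            shared_mean = sum(mean) / len(mean)
            shared_std = math.sqrt(sum(s * s for s in std) / len(std))
            mean = [shared_mean] * len(mean)
            std = [shared_std] * len(std)
        # Assumption: injected noise widens the spread in quadrature.
        std = [math.sqrt(s * s + noise_std**2) for s in std]
        stats[quantity] = {"mean": mean, "std": std}
    return stats


metadata = {
    "vel_mean": [0.0, 2.0],
    "vel_std": [3.0, 4.0],
    "acc_mean": [0.0, 0.0],
    "acc_std": [1.0, 1.0],
}
per_dim = sketch_dataset_stats(metadata, is_isotropic_norm=False, noise_std=0.0)
isotropic = sketch_dataset_stats(metadata, is_isotropic_norm=True, noise_std=0.0)
print(per_dim["vel"]["mean"])    # [0.0, 2.0]
print(isotropic["vel"]["mean"])  # [1.0, 1.0]
```

In the dimension-wise case each axis keeps its own shift and scale, while the isotropic case applies the same values to every axis.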