Datasets

JetNet

class jetnet.datasets.JetNet(*args: Any, **kwargs: Any)

PyTorch torch.utils.data.Dataset class for the JetNet dataset.

If the HDF5 files are not found in the data_dir directory, the dataset will be downloaded from Zenodo (https://zenodo.org/record/6975118 or https://zenodo.org/record/6975117).

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon), ‘q’ (light quarks), ‘t’ (top quarks), ‘w’ (W bosons), or ‘z’ (Z bosons). “all” will get all types. Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["etarel", "phirel", "ptrel", "mask"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "pt", "eta", "mass", "num_particles"].

  • particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.

  • jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.

  • particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.

  • jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.

  • num_particles (int, optional) – number of particles to retain per jet, max of 150. Defaults to 30.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.

  • split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].

  • seed (int, optional) – PyTorch manual seed; it is important to use the same seed for all dataset splits. Defaults to 42.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.
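For illustration, a minimal sketch of using this class with a PyTorch DataLoader follows; the data directory, jet types, and batch size are arbitrary example choices, not library defaults.

```python
# Sketch of wrapping the JetNet Dataset in a PyTorch DataLoader.
# Requires jetnet and torch to be installed; with download=True the
# first call fetches the HDF5 files from Zenodo into data_dir.
from torch.utils.data import DataLoader

from jetnet.datasets import JetNet

dataset = JetNet(
    jet_type={"g", "t"},            # gluon and top-quark jets only
    data_dir="./datasets/jetnet",   # illustrative path
    num_particles=30,
    split="train",
    download=True,
)

loader = DataLoader(dataset, batch_size=128, shuffle=True)

for particle_data, jet_data in loader:
    # With the default features, particle_data should have shape
    # [batch, 30, 4] (etarel, phirel, ptrel, mask) and jet_data
    # [batch, 5] (type, pt, eta, mass, num_particles).
    break
```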

Methods:

getData([jet_type, data_dir, ...])

Downloads, if needed, and loads and returns JetNet data.

classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 30, split: str = 'all', split_fraction: list[float] | None = None, seed: int = 42, download: bool = False) → tuple[numpy.ndarray | None, numpy.ndarray | None]

Downloads, if needed, and loads and returns JetNet data.

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon), ‘t’ (top quarks), ‘q’ (light quarks), ‘w’ (W bosons), or ‘z’ (Z bosons). “all” will get all types. Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["etarel", "phirel", "ptrel", "mask"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "pt", "eta", "mass", "num_particles"].

  • num_particles (int, optional) – number of particles to retain per jet, max of 150. Defaults to 30.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.

  • split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].

  • seed (int, optional) – PyTorch manual seed; it is important to use the same seed for all dataset splits. Defaults to 42.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Returns

particle data, jet data

Return type

tuple[np.ndarray | None, np.ndarray | None]
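A hedged sketch of calling getData directly, for cases where numpy arrays are preferred over a Dataset object; the path and feature selections here are illustrative, not defaults.

```python
# Sketch of loading gluon jets as numpy arrays via JetNet.getData.
# Requires jetnet to be installed; download=True fetches the data
# from Zenodo on first use.
from jetnet.datasets import JetNet

particle_data, jet_data = JetNet.getData(
    jet_type="g",
    data_dir="./datasets/jetnet",
    particle_features=["etarel", "phirel", "ptrel"],
    jet_features=["pt", "mass"],
    num_particles=30,
    split="train",
    download=True,
)
# Per the parameters above, particle_data should have shape
# (num_jets, 30, 3) and jet_data shape (num_jets, 2).
```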

TopTagging

class jetnet.datasets.TopTagging(*args: Any, **kwargs: Any)

PyTorch torch.utils.data.Dataset class for the Top Quark Tagging Reference dataset.

If the HDF5 files are not found in the data_dir directory, the dataset will be downloaded from Zenodo (https://zenodo.org/record/2603256).

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘qcd’ and ‘top’. Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["E", "px", "py", "pz"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "E", "px", "py", "pz"].

  • particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.

  • jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.

  • particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.

  • jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.

  • num_particles (int, optional) – number of particles to retain per jet, max of 200. Defaults to 200.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Methods:

getData([jet_type, data_dir, ...])

Downloads, if needed, and loads and returns Top Quark Tagging data.

classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 200, split: str = 'all', download: bool = False) → tuple[numpy.ndarray | None, numpy.ndarray | None]

Downloads, if needed, and loads and returns Top Quark Tagging data.

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘qcd’ and ‘top’. Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["E", "px", "py", "pz"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "E", "px", "py", "pz"].

  • num_particles (int, optional) – number of particles to retain per jet, max of 200. Defaults to 200.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Returns

particle data, jet data

Return type

tuple[np.ndarray | None, np.ndarray | None]
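As with JetNet, the data can be loaded directly as numpy arrays; this is an illustrative sketch with an arbitrary directory path.

```python
# Sketch of loading the top-quark test split of the Top Quark Tagging
# dataset as numpy arrays. Requires jetnet to be installed;
# download=True fetches the data from Zenodo on first use.
from jetnet.datasets import TopTagging

particle_data, jet_data = TopTagging.getData(
    jet_type="top",
    data_dir="./datasets/toptagging",
    num_particles=200,
    split="test",
    download=True,
)
# With the default features, particle_data should have shape
# (num_jets, 200, 4) (E, px, py, pz).
```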

QuarkGluon

class jetnet.datasets.QuarkGluon(*args: Any, **kwargs: Any)

PyTorch torch.utils.data.Dataset class for the Quark Gluon Jets dataset. Either jets with or without bottom and charm quark jets can be selected (with_bc flag).

If the npz files are not found in the data_dir directory, the dataset will be automatically downloaded from Zenodo (https://zenodo.org/record/3164691).

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon) and ‘q’ (light quarks). Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • with_bc (bool, optional) – with or without bottom and charm quark jets. Defaults to True.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["pt", "eta", "phi", "pdgid"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type"].

  • particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.

  • jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.

  • particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.

  • jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.

  • num_particles (int, optional) – number of particles to retain per jet, max of 153. Defaults to 153.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.

  • split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].

  • seed (int, optional) – PyTorch manual seed; it is important to use the same seed for all dataset splits. Defaults to 42.

  • file_list (List[str], optional) – list of files to load, if full dataset is not required. Defaults to None (will load all files).

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Methods:

getData([jet_type, data_dir, with_bc, ...])

Downloads, if needed, and loads and returns Quark Gluon data.

classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', with_bc: bool = True, particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 153, split: str = 'all', split_fraction: list[float] | None = None, seed: int = 42, file_list: list[str] | None = None, download: bool = False) → tuple[numpy.ndarray | None, numpy.ndarray | None]

Downloads, if needed, and loads and returns Quark Gluon data.

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon) and ‘q’ (light quarks). Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • with_bc (bool, optional) – with or without bottom and charm quark jets. Defaults to True.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["pt", "eta", "phi", "pdgid"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type"].

  • num_particles (int, optional) – number of particles to retain per jet, max of 153. Defaults to 153.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.

  • split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].

  • seed (int, optional) – PyTorch manual seed; it is important to use the same seed for all dataset splits. Defaults to 42.

  • file_list (List[str], optional) – list of files to load, if full dataset is not required. Defaults to None (will load all files).

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Returns

particle data, jet data

Return type

tuple[np.ndarray | None, np.ndarray | None]
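A hedged usage sketch for this dataset as well; the path and num_particles value below are arbitrary example choices.

```python
# Sketch of loading light-quark jets from the Quark Gluon dataset,
# excluding bottom and charm quark jets via with_bc=False.
# Requires jetnet to be installed; download=True fetches the npz
# files from Zenodo on first use.
from jetnet.datasets import QuarkGluon

particle_data, jet_data = QuarkGluon.getData(
    jet_type="q",
    with_bc=False,
    data_dir="./datasets/qg",
    num_particles=100,
    split="train",
    download=True,
)
# With the default features, particle_data should have shape
# (num_jets, 100, 4) (pt, eta, phi, pdgid).
```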

Normalisations

Suite of common ways to normalise data.

Classes:

FeaturewiseLinear([feature_shifts, ...])

Shifts features by feature_shifts then multiplies by feature_scales.

FeaturewiseLinearBounded([feature_norms, ...])

Normalizes dataset features by scaling each to an (absolute) max of feature_norms and shifting by feature_shifts.

NormaliseABC()

ABC for generalised normalisation class.

class jetnet.datasets.normalisations.FeaturewiseLinear(feature_shifts: float | list[float] = 0.0, feature_scales: float | list[float] = 1.0, normalise_features: list[bool] | None = None, normal: bool = False)

Shifts features by feature_shifts then multiplies by feature_scales.

If using the normal option, feature_shifts and feature_scales can be derived from the dataset (by calling derive_dataset_features) to normalise the data to have 0 mean and unit standard deviation per feature.

Parameters
  • feature_shifts (Union[float, List[float]], optional) – value to shift features by. Can either be a single float for all features, or a list of length num_features. Defaults to 0.0.

  • feature_scales (Union[float, List[float]], optional) – after shifting, value to multiply features by. Can either be a single float for all features, or a list of length num_features. Defaults to 1.0.

  • normalise_features (Optional[List[bool]], optional) – if only some features need to be normalised, can input here a list of booleans of length num_features with True meaning normalise and False meaning to ignore. Defaults to None i.e. normalise all.

  • normal (bool, optional) – derive feature_shifts and feature_scales to have 0 mean and unit standard deviation per feature after normalisation (derive_dataset_features method must be called before normalising).

Methods:

derive_dataset_features(x)

If using the normal option, this will derive the means and standard deviations per feature, and save and return them.

features_need_deriving()

Checks if any dataset values or features need to be derived

derive_dataset_features(x: ArrayLike) → tuple[numpy.ndarray, numpy.ndarray] | None

If using the normal option, this will derive the means and standard deviations per feature, and save and return them. If not, will do nothing.

Parameters

x (ArrayLike) – dataset of shape […, num_features].

Returns

if the normal option is used, the means and standard deviations of each feature; otherwise None.

Return type

(Optional[Tuple[np.ndarray, np.ndarray]])

features_need_deriving() → bool

Checks if any dataset values or features need to be derived
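The transform this class applies can be sketched in plain numpy. This is an illustrative re-implementation of the documented behaviour (shift, then scale), not the library's own code; the function name is hypothetical.

```python
import numpy as np

def featurewise_linear(x, feature_shifts=0.0, feature_scales=1.0):
    """Numpy sketch of the documented transform: shift, then scale."""
    return (np.asarray(x, dtype=float) + feature_shifts) * feature_scales

# The `normal` option derives shifts and scales so each feature ends up
# with zero mean and unit standard deviation, analogous to what
# derive_dataset_features computes:
x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(1000, 3))
shifts = -x.mean(axis=0)       # shifting removes the per-feature mean
scales = 1.0 / x.std(axis=0)   # scaling divides by the per-feature std
x_norm = featurewise_linear(x, shifts, scales)
```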

class jetnet.datasets.normalisations.FeaturewiseLinearBounded(feature_norms: float | list[float] = 1.0, feature_shifts: float | list[float] = 0.0, feature_maxes: list[float] | None = None, normalise_features: list[bool] | None = None)

Normalizes dataset features by scaling each to an (absolute) max of feature_norms and shifting by feature_shifts.

If the value in the list for a feature is None, it won’t be scaled or shifted.

Parameters
  • feature_norms (Union[float, List[float]], optional) – max value to scale each feature to. Can either be a single float for all features, or a list of length num_features. Defaults to 1.0.

  • feature_shifts (Union[float, List[float]], optional) – after scaling, value to shift feature by. Can either be a single float for all features, or a list of length num_features. Defaults to 0.0.

  • feature_maxes (List[float], optional) – max pre-scaling absolute value of each feature, used for scaling to the norm and inverting.

  • normalise_features (Optional[List[bool]], optional) – if only some features need to be normalised, can input here a list of booleans of length num_features with True meaning normalise and False meaning to ignore. Defaults to None i.e. normalise all.

Methods:

derive_dataset_features(x)

Derives, saves, and returns absolute feature maxes of dataset x.

features_need_deriving()

Checks if any dataset values or features need to be derived

derive_dataset_features(x: ArrayLike) → ndarray

Derives, saves, and returns absolute feature maxes of dataset x.

Parameters

x (ArrayLike) – dataset of shape […, num_features].

Returns

feature maxes

Return type

np.ndarray

features_need_deriving() → bool

Checks if any dataset values or features need to be derived
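The bounded variant can likewise be sketched in numpy: each feature is divided by its absolute maximum, scaled to feature_norms, then shifted. Again, this is an illustrative re-implementation of the documented behaviour, with a hypothetical function name.

```python
import numpy as np

def featurewise_linear_bounded(
    x, feature_norms=1.0, feature_shifts=0.0, feature_maxes=None
):
    """Numpy sketch: scale each feature to an absolute max of
    feature_norms, then shift by feature_shifts."""
    x = np.asarray(x, dtype=float)
    if feature_maxes is None:
        # analogous to derive_dataset_features: per-feature absolute maxima
        feature_maxes = np.abs(x).max(axis=tuple(range(x.ndim - 1)))
    return x / feature_maxes * feature_norms + feature_shifts

x = np.array([[2.0, -10.0], [-4.0, 5.0]])
x_norm = featurewise_linear_bounded(x)  # each feature now within [-1, 1]
```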

class jetnet.datasets.normalisations.NormaliseABC

ABC for generalised normalisation class.

Methods:

derive_dataset_features(x)

Derive features from dataset needed for normalisation if needed

features_need_deriving()

Checks if any dataset values or features need to be derived

derive_dataset_features(x: ArrayLike)

Derive features from dataset needed for normalisation if needed

features_need_deriving() → bool

Checks if any dataset values or features need to be derived

Utility Functions

Utility methods for datasets.

Functions:

checkConvertElements(elem, valid_types[, ntype])

Checks if elem(s) are valid and if needed converts into a list

checkDownloadZenodoDataset(data_dir, ...)

Checks if dataset exists and md5 hash matches; if not and download = True, downloads it from Zenodo, and returns the file path.

checkListNotEmpty(*inputs)

Checks that list inputs are not None or empty

checkStrToList(*inputs[, to_set])

Converts str inputs to a list or set

download_progress_bar(file_url, file_dest)

Download while outputting a progress bar.

firstNotNoneElement(*inputs)

Returns the first element out of all inputs which isn't None

getOrderedFeatures(data, features, ...)

Returns data with features in the order specified by features.

getSplitting(length, split, splits, ...)

Returns starting and ending index for splitting a dataset of length length according to the input split out of the total possible splits and a given split_fraction.

jetnet.datasets.utils.checkConvertElements(elem: str | list[str], valid_types: list[str], ntype: str = 'element')

Checks if elem(s) are valid and if needed converts into a list

jetnet.datasets.utils.checkDownloadZenodoDataset(data_dir: str, dataset_name: str, record_id: int, key: str, download: bool) → str

Checks if the dataset exists and its MD5 hash matches; if not and download = True, downloads it from Zenodo and returns the file path; if not and download = False, raises an error.

jetnet.datasets.utils.checkListNotEmpty(*inputs: list[list]) → list[bool]

Checks that list inputs are not None or empty

jetnet.datasets.utils.checkStrToList(*inputs: list[str | list[str] | set[str]], to_set: bool = False) → list[list[str]] | list[set[str]] | list

Converts str inputs to a list or set

jetnet.datasets.utils.download_progress_bar(file_url: str, file_dest: str)

Download while outputting a progress bar. Modified from https://sumit-ghosh.com/articles/python-download-progress-bar/

Parameters
  • file_url (str) – url to download from

  • file_dest (str) – path at which to save downloaded file

jetnet.datasets.utils.firstNotNoneElement(*inputs: list[Any]) → Any

Returns the first element out of all inputs which isn’t None

jetnet.datasets.utils.getOrderedFeatures(data: ArrayLike, features: list[str], features_order: list[str]) → ndarray

Returns data with features in the order specified by features.

Parameters
  • data (ArrayLike) – input data

  • features (List[str]) – desired features in order

  • features_order (List[str]) – name and ordering of features in input data

Returns

data with features in specified order

Return type

(np.ndarray)
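The documented behaviour amounts to selecting and reordering the last axis of the data; a numpy sketch (an illustrative re-implementation, not the library's code):

```python
import numpy as np

def get_ordered_features(data, features, features_order):
    """Numpy sketch: select and reorder the last axis of `data` so its
    columns match `features`."""
    idx = [features_order.index(f) for f in features]
    return np.asarray(data)[..., idx]

data = np.array([[1.0, 2.0, 3.0]])  # columns ordered as: pt, eta, phi
out = get_ordered_features(data, ["phi", "pt"], ["pt", "eta", "phi"])
# out -> [[3.0, 1.0]]
```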

jetnet.datasets.utils.getSplitting(length: int, split: str, splits: list[str], split_fraction: list[float]) → tuple[int, int]

Returns starting and ending index for splitting a dataset of length length according to the input split out of the total possible splits and a given split_fraction.

“all” is considered a special keyword to mean the entire dataset - it cannot be used to define a normal splitting, and if it is a possible splitting it must be the last entry in splits.

e.g. for length = 100, split = "valid", splits = ["train", "valid", "test"], split_fraction = [0.7, 0.15, 0.15], this will return (70, 85).
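The index arithmetic can be sketched in pure Python; this is an illustrative re-implementation of the documented behaviour (the library's exact rounding may differ), reproducing the worked example above.

```python
def get_splitting(length, split, splits, split_fraction):
    """Sketch of the documented splitting arithmetic: cumulative
    fractions give each split's [start, end) indices."""
    # "all" is a special keyword meaning the entire dataset
    if split == "all":
        return 0, length
    start = 0.0
    for name, frac in zip(splits, split_fraction):
        if name == split:
            return round(start * length), round((start + frac) * length)
        start += frac
    raise ValueError(f"unknown split {split!r}")

result = get_splitting(
    100, "valid", ["train", "valid", "test"], [0.7, 0.15, 0.15]
)
# result -> (70, 85), matching the worked example above
```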