Datasets

JetNet

class jetnet.datasets.JetNet(*args: Any, **kwargs: Any)

PyTorch torch.utils.data.Dataset class for the JetNet dataset.

If the HDF5 files are not found in the data_dir directory, the dataset will be downloaded from Zenodo (https://zenodo.org/record/6975118 or https://zenodo.org/record/6975117).

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon), ‘q’ (light quarks), ‘t’ (top quarks), ‘w’ (W bosons), or ‘z’ (Z bosons). “all” will get all types. Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["etarel", "phirel", "ptrel", "mask"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "pt", "eta", "mass", "num_particles"].

  • particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.

  • jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.

  • particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.

  • jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.

  • num_particles (int, optional) – number of particles to retain per jet, max of 150. Defaults to 30.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.

  • split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].

  • seed (int, optional) – PyTorch manual seed; it is important to use the same seed for all dataset splits. Defaults to 42.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.
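For illustration, a minimal sketch of using this class with a PyTorch DataLoader follows; the data directory, jet types, and batch size are arbitrary example choices, not library defaults.

```python
# Sketch of wrapping the JetNet Dataset in a PyTorch DataLoader.
# Requires jetnet and torch to be installed; with download=True the
# first call fetches the HDF5 files from Zenodo into data_dir.
from torch.utils.data import DataLoader

from jetnet.datasets import JetNet

dataset = JetNet(
    jet_type={"g", "t"},            # gluon and top-quark jets only
    data_dir="./datasets/jetnet",   # illustrative path
    num_particles=30,
    split="train",
    download=True,
)

loader = DataLoader(dataset, batch_size=128, shuffle=True)

for particle_data, jet_data in loader:
    # With the default features, particle_data should have shape
    # [batch, 30, 4] (etarel, phirel, ptrel, mask) and jet_data
    # [batch, 5] (type, pt, eta, mass, num_particles).
    break
```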

Methods:

getData([jet_type, data_dir, ...])

Downloads, if needed, and loads and returns JetNet data.

classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 30, split: str = 'all', split_fraction: list[float] | None = None, seed: int = 42, download: bool = False) → tuple[numpy.ndarray | None, numpy.ndarray | None]

Downloads, if needed, and loads and returns JetNet data.

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon), ‘t’ (top quarks), ‘q’ (light quarks), ‘w’ (W bosons), or ‘z’ (Z bosons). “all” will get all types. Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["etarel", "phirel", "ptrel", "mask"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "pt", "eta", "mass", "num_particles"].

  • num_particles (int, optional) – number of particles to retain per jet, max of 150. Defaults to 30.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.

  • split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].

  • seed (int, optional) – PyTorch manual seed; it is important to use the same seed for all dataset splits. Defaults to 42.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Returns

particle data, jet data

Return type

tuple[np.ndarray | None, np.ndarray | None]
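A hedged sketch of calling getData directly, for cases where numpy arrays are preferred over a Dataset object; the path and feature selections here are illustrative, not defaults.

```python
# Sketch of loading gluon jets as numpy arrays via JetNet.getData.
# Requires jetnet to be installed; download=True fetches the data
# from Zenodo on first use.
from jetnet.datasets import JetNet

particle_data, jet_data = JetNet.getData(
    jet_type="g",
    data_dir="./datasets/jetnet",
    particle_features=["etarel", "phirel", "ptrel"],
    jet_features=["pt", "mass"],
    num_particles=30,
    split="train",
    download=True,
)
# Per the parameters above, particle_data should have shape
# (num_jets, 30, 3) and jet_data shape (num_jets, 2).
```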

TopTagging

class jetnet.datasets.TopTagging(*args: Any, **kwargs: Any)

PyTorch torch.utils.data.Dataset class for the Top Quark Tagging Reference dataset.

If the HDF5 files are not found in the data_dir directory, the dataset will be downloaded from Zenodo (https://zenodo.org/record/2603256).

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘qcd’ and ‘top’. Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["E", "px", "py", "pz"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "E", "px", "py", "pz"].

  • particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.

  • jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.

  • particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.

  • jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.

  • num_particles (int, optional) – number of particles to retain per jet, max of 200. Defaults to 200.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Methods:

getData([jet_type, data_dir, ...])

Downloads, if needed, and loads and returns Top Quark Tagging data.

classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 200, split: str = 'all', download: bool = False) → tuple[numpy.ndarray | None, numpy.ndarray | None]

Downloads, if needed, and loads and returns Top Quark Tagging data.

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘qcd’ and ‘top’. Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["E", "px", "py", "pz"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "E", "px", "py", "pz"].

  • num_particles (int, optional) – number of particles to retain per jet, max of 200. Defaults to 200.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Returns

particle data, jet data

Return type

tuple[np.ndarray | None, np.ndarray | None]
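As with JetNet, the data can be loaded directly as numpy arrays; this is an illustrative sketch with an arbitrary directory path.

```python
# Sketch of loading the top-quark test split of the Top Quark Tagging
# dataset as numpy arrays. Requires jetnet to be installed;
# download=True fetches the data from Zenodo on first use.
from jetnet.datasets import TopTagging

particle_data, jet_data = TopTagging.getData(
    jet_type="top",
    data_dir="./datasets/toptagging",
    num_particles=200,
    split="test",
    download=True,
)
# With the default features, particle_data should have shape
# (num_jets, 200, 4) (E, px, py, pz).
```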

QuarkGluon

class jetnet.datasets.QuarkGluon(*args: Any, **kwargs: Any)

PyTorch torch.utils.data.Dataset class for the Quark Gluon Jets dataset. Either jets with or without bottom and charm quark jets can be selected (with_bc flag).

If the npz files are not found in the data_dir directory, the dataset will be automatically downloaded from Zenodo (https://zenodo.org/record/3164691).

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon) and ‘q’ (light quarks). Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • with_bc (bool, optional) – with or without bottom and charm quark jets. Defaults to True.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["pt", "eta", "phi", "pdgid"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type"].

  • particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.

  • jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.

  • particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.

  • jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.

  • num_particles (int, optional) – number of particles to retain per jet, max of 153. Defaults to 153.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.

  • split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].

  • seed (int, optional) – PyTorch manual seed; it is important to use the same seed for all dataset splits. Defaults to 42.

  • file_list (List[str], optional) – list of files to load, if full dataset is not required. Defaults to None (will load all files).

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Methods:

getData([jet_type, data_dir, with_bc, ...])

Downloads, if needed, and loads and returns Quark Gluon data.

classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', with_bc: bool = True, particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 153, split: str = 'all', split_fraction: list[float] | None = None, seed: int = 42, file_list: list[str] | None = None, download: bool = False) → tuple[numpy.ndarray | None, numpy.ndarray | None]

Downloads, if needed, and loads and returns Quark Gluon data.

Parameters
  • jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon) and ‘q’ (light quarks). Defaults to “all”.

  • data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.

  • with_bc (bool, optional) – with or without bottom and charm quark jets. Defaults to True.

  • particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["pt", "eta", "phi", "pdgid"].

  • jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type"].

  • num_particles (int, optional) – number of particles to retain per jet, max of 153. Defaults to 153.

  • split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.

  • split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].

  • seed (int, optional) – PyTorch manual seed; it is important to use the same seed for all dataset splits. Defaults to 42.

  • file_list (List[str], optional) – list of files to load, if full dataset is not required. Defaults to None (will load all files).

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If dataset is already downloaded, it is not downloaded again. Defaults to False.

Returns

particle data, jet data

Return type

tuple[np.ndarray | None, np.ndarray | None]
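A hedged usage sketch for this dataset as well; the path and num_particles value below are arbitrary example choices.

```python
# Sketch of loading light-quark jets from the Quark Gluon dataset,
# excluding bottom and charm quark jets via with_bc=False.
# Requires jetnet to be installed; download=True fetches the npz
# files from Zenodo on first use.
from jetnet.datasets import QuarkGluon

particle_data, jet_data = QuarkGluon.getData(
    jet_type="q",
    with_bc=False,
    data_dir="./datasets/qg",
    num_particles=100,
    split="train",
    download=True,
)
# With the default features, particle_data should have shape
# (num_jets, 100, 4) (pt, eta, phi, pdgid).
```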

Normalisations

Suite of common ways to normalise data.

Classes:

FeaturewiseLinear([feature_shifts, ...])

Shifts features by feature_shifts then multiplies by feature_scales.

FeaturewiseLinearBounded([feature_norms, ...])

Normalizes dataset features by scaling each to an (absolute) max of feature_norms and shifting by feature_shifts.

NormaliseABC()

ABC for generalised normalisation class.

class jetnet.datasets.normalisations.FeaturewiseLinear(feature_shifts: float | list[float] = 0.0, feature_scales: float | list[float] = 1.0, normalise_features: list[bool] | None = None, normal: bool = False)

Shifts features by feature_shifts then multiplies by feature_scales.

If using the normal option, feature_shifts and feature_scales can be derived from the dataset (by calling derive_dataset_features) to normalise the data to have 0 mean and unit standard deviation per feature.

Parameters
  • feature_shifts (Union[float, List[float]], optional) – value to shift features by. Can either be a single float for all features, or a list of length num_features. Defaults to 0.0.

  • feature_scales (Union[float, List[float]], optional) – after shifting, value to multiply features by. Can either be a single float for all features, or a list of length num_features. Defaults to 1.0.

  • normalise_features (Optional[List[bool]], optional) – if only some features need to be normalised, can input here a list of booleans of length num_features with True meaning normalise and False meaning to ignore. Defaults to None i.e. normalise all.

  • normal (bool, optional) – derive feature_shifts and feature_scales to have 0 mean and unit standard deviation per feature after normalisation (derive_dataset_features method must be called before normalising).

Methods:

derive_dataset_features(x)

If using the normal option, this will derive the means and standard deviations per feature, and save and return them.

features_need_deriving()

Checks if any dataset values or features need to be derived

derive_dataset_features(x: ArrayLike) → tuple[numpy.ndarray, numpy.ndarray] | None

If using the normal option, this will derive the means and standard deviations per feature, and save and return them. If not, will do nothing.

Parameters

x (ArrayLike) – dataset of shape […, num_features].

Returns

if the normal option is used, the means and standard deviations of each feature; otherwise None.

Return type

(Optional[Tuple[np.ndarray, np.ndarray]])

features_need_deriving() → bool

Checks if any dataset values or features need to be derived
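The transform this class applies can be sketched in plain numpy. This is an illustrative re-implementation of the documented behaviour (shift, then scale), not the library's own code; the function name is hypothetical.

```python
import numpy as np

def featurewise_linear(x, feature_shifts=0.0, feature_scales=1.0):
    """Numpy sketch of the documented transform: shift, then scale."""
    return (np.asarray(x, dtype=float) + feature_shifts) * feature_scales

# The `normal` option derives shifts and scales so each feature ends up
# with zero mean and unit standard deviation, analogous to what
# derive_dataset_features computes:
x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(1000, 3))
shifts = -x.mean(axis=0)       # shifting removes the per-feature mean
scales = 1.0 / x.std(axis=0)   # scaling divides by the per-feature std
x_norm = featurewise_linear(x, shifts, scales)
```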

class jetnet.datasets.normalisations.FeaturewiseLinearBounded(feature_norms: float | list[float] = 1.0, feature_shifts: float | list[float] = 0.0, feature_maxes: list[float] | None = None, normalise_features: list[bool] | None = None)

Normalizes dataset features by scaling each to an (absolute) max of feature_norms and shifting by feature_shifts.

If the value in the list for a feature is None, it won’t be scaled or shifted.

Parameters
  • feature_norms (Union[float, List[float]], optional) – max value to scale each feature to. Can either be a single float for all features, or a list of length num_features. Defaults to 1.0.

  • feature_shifts (Union[float, List[float]], optional) – after scaling, value to shift feature by. Can either be a single float for all features, or a list of length num_features. Defaults to 0.0.

  • feature_maxes (List[float], optional) – max pre-scaling absolute value of each feature, used for scaling to the norm and inverting.

  • normalise_features (Optional[List[bool]], optional) – if only some features need to be normalised, can input here a list of booleans of length num_features with True meaning normalise and False meaning to ignore. Defaults to None i.e. normalise all.

Methods:

derive_dataset_features(x)

Derives, saves, and returns absolute feature maxes of dataset x.

features_need_deriving()

Checks if any dataset values or features need to be derived

derive_dataset_features(x: ArrayLike) → ndarray

Derives, saves, and returns absolute feature maxes of dataset x.

Parameters

x (ArrayLike) – dataset of shape […, num_features].

Returns

feature maxes

Return type

np.ndarray

features_need_deriving() → bool

Checks if any dataset values or features need to be derived
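The bounded variant can likewise be sketched in numpy: each feature is divided by its absolute maximum, scaled to feature_norms, then shifted. Again, this is an illustrative re-implementation of the documented behaviour, with a hypothetical function name.

```python
import numpy as np

def featurewise_linear_bounded(
    x, feature_norms=1.0, feature_shifts=0.0, feature_maxes=None
):
    """Numpy sketch: scale each feature to an absolute max of
    feature_norms, then shift by feature_shifts."""
    x = np.asarray(x, dtype=float)
    if feature_maxes is None:
        # analogous to derive_dataset_features: per-feature absolute maxima
        feature_maxes = np.abs(x).max(axis=tuple(range(x.ndim - 1)))
    return x / feature_maxes * feature_norms + feature_shifts

x = np.array([[2.0, -10.0], [-4.0, 5.0]])
x_norm = featurewise_linear_bounded(x)  # each feature now within [-1, 1]
```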

class jetnet.datasets.normalisations.NormaliseABC

ABC for generalised normalisation class.

Methods:

derive_dataset_features(x)

Derive features from dataset needed for normalisation if needed

features_need_deriving()

Checks if any dataset values or features need to be derived

derive_dataset_features(x: ArrayLike)

Derive features from dataset needed for normalisation if needed

features_need_deriving() → bool

Checks if any dataset values or features need to be derived

Utility Functions

Utility methods for datasets.

Functions:

checkConvertElements(elem, valid_types[, ntype])

Checks if elem(s) are valid and if needed converts into a list

checkDownloadZenodoDataset(data_dir, ...)

Checks if dataset exists and md5 hash matches; if not and download = True, downloads it from Zenodo, and returns the file path.

checkListNotEmpty(*inputs)

Checks that list inputs are not None or empty

checkStrToList(*inputs[, to_set])

Converts str inputs to a list or set

download_progress_bar(file_url, file_dest)

Download while outputting a progress bar.

firstNotNoneElement(*inputs)

Returns the first element out of all inputs which isn't None

getOrderedFeatures(data, features, ...)

Returns data with features in the order specified by features.

getSplitting(length, split, splits, ...)

Returns starting and ending index for splitting a dataset of length length according to the input split out of the total possible splits and a given split_fraction.

jetnet.datasets.utils.checkConvertElements(elem: str | list[str], valid_types: list[str], ntype: str = 'element')

Checks if elem(s) are valid and if needed converts into a list

jetnet.datasets.utils.checkDownloadZenodoDataset(data_dir: str, dataset_name: str, record_id: int, key: str, download: bool) → str

Checks if the dataset exists and its MD5 hash matches; if not and download = True, downloads it from Zenodo and returns the file path; if not and download = False, raises an error.

jetnet.datasets.utils.checkListNotEmpty(*inputs: list[list]) → list[bool]

Checks that list inputs are not None or empty

jetnet.datasets.utils.checkStrToList(*inputs: list[str | list[str] | set[str]], to_set: bool = False) → list[list[str]] | list[set[str]] | list

Converts str inputs to a list or set

jetnet.datasets.utils.download_progress_bar(file_url: str, file_dest: str)

Download while outputting a progress bar. Modified from https://sumit-ghosh.com/articles/python-download-progress-bar/

Parameters
  • file_url (str) – url to download from

  • file_dest (str) – path at which to save downloaded file

jetnet.datasets.utils.firstNotNoneElement(*inputs: list[Any]) → Any

Returns the first element out of all inputs which isn’t None

jetnet.datasets.utils.getOrderedFeatures(data: ArrayLike, features: list[str], features_order: list[str]) → ndarray

Returns data with features in the order specified by features.

Parameters
  • data (ArrayLike) – input data

  • features (List[str]) – desired features in order

  • features_order (List[str]) – name and ordering of features in input data

Returns

data with features in specified order

Return type

(np.ndarray)
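The documented behaviour amounts to selecting and reordering the last axis of the data; a numpy sketch (an illustrative re-implementation, not the library's code):

```python
import numpy as np

def get_ordered_features(data, features, features_order):
    """Numpy sketch: select and reorder the last axis of `data` so its
    columns match `features`."""
    idx = [features_order.index(f) for f in features]
    return np.asarray(data)[..., idx]

data = np.array([[1.0, 2.0, 3.0]])  # columns ordered as: pt, eta, phi
out = get_ordered_features(data, ["phi", "pt"], ["pt", "eta", "phi"])
# out -> [[3.0, 1.0]]
```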

jetnet.datasets.utils.getSplitting(length: int, split: str, splits: list[str], split_fraction: list[float]) → tuple[int, int]

Returns starting and ending index for splitting a dataset of length length according to the input split out of the total possible splits and a given split_fraction.

“all” is considered a special keyword to mean the entire dataset - it cannot be used to define a normal splitting, and if it is a possible splitting it must be the last entry in splits.

e.g. for length = 100, split = "valid", splits = ["train", "valid", "test"], split_fraction = [0.7, 0.15, 0.15], this will return (70, 85).
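The index arithmetic can be sketched in pure Python; this is an illustrative re-implementation of the documented behaviour (the library's exact rounding may differ), reproducing the worked example above.

```python
def get_splitting(length, split, splits, split_fraction):
    """Sketch of the documented splitting arithmetic: cumulative
    fractions give each split's [start, end) indices."""
    # "all" is a special keyword meaning the entire dataset
    if split == "all":
        return 0, length
    start = 0.0
    for name, frac in zip(splits, split_fraction):
        if name == split:
            return round(start * length), round((start + frac) * length)
        start += frac
    raise ValueError(f"unknown split {split!r}")

result = get_splitting(
    100, "valid", ["train", "valid", "test"], [0.7, 0.15, 0.15]
)
# result -> (70, 85), matching the worked example above
```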