Datasets¶
JetNet¶
- class jetnet.datasets.JetNet(*args: Any, **kwargs: Any)
PyTorch torch.utils.data.Dataset class for the JetNet dataset. If hdf5 files are not found in the data_dir directory, the dataset will be downloaded from Zenodo (https://zenodo.org/record/6975118 or https://zenodo.org/record/6975117).
- Parameters
jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon), ‘q’ (light quarks), ‘t’ (top quarks), ‘w’ (W bosons), or ‘z’ (Z bosons). “all” will get all types. Defaults to “all”.
data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.
particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["etarel", "phirel", "ptrel", "mask"].
jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "pt", "eta", "mass", "num_particles"].
particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.
jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.
particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.
jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.
num_particles (int, optional) – number of particles to retain per jet, max of 150. Defaults to 30.
split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.
split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].
seed (int, optional) – PyTorch manual seed - important to use the same seed for all dataset splittings. Defaults to 42.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If the dataset is already downloaded, it is not downloaded again. Defaults to False.
Methods:
getData([jet_type, data_dir, ...]) – Downloads, if needed, and loads and returns JetNet data.
- classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 30, split: str = 'all', split_fraction: list[float] | None = None, seed: int = 42, download: bool = False) tuple[numpy.ndarray | None, numpy.ndarray | None]
Downloads, if needed, and loads and returns JetNet data.
- Parameters
jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon), ‘t’ (top quarks), ‘q’ (light quarks), ‘w’ (W bosons), or ‘z’ (Z bosons). “all” will get all types. Defaults to “all”.
data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.
particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["etarel", "phirel", "ptrel", "mask"].
jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "pt", "eta", "mass", "num_particles"].
num_particles (int, optional) – number of particles to retain per jet, max of 150. Defaults to 30.
split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.
split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].
seed (int, optional) – PyTorch manual seed - important to use the same seed for all dataset splittings. Defaults to 42.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If the dataset is already downloaded, it is not downloaded again. Defaults to False.
- Returns
particle data, jet data
- Return type
tuple[np.ndarray | None, np.ndarray | None]
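Once loaded, the returned arrays can be sliced by the documented feature order. The sketch below uses random stand-in data (real usage would call `JetNet.getData`, which downloads several GB), with shapes matching the documented defaults of 30 particles per jet and the default particle/jet feature lists; the mask feature flags real versus zero-padded particles.

```python
import numpy as np

# Stand-in for JetNet.getData output; real usage would be e.g.
#   particle_data, jet_data = JetNet.getData(jet_type="g", download=True)
# Shapes follow the documented defaults: 30 particles per jet, particle
# features ["etarel", "phirel", "ptrel", "mask"] and jet features
# ["type", "pt", "eta", "mass", "num_particles"].
rng = np.random.default_rng(42)
num_jets = 5
particle_data = rng.random((num_jets, 30, 4))
particle_data[..., 3] = (particle_data[..., 3] > 0.3).astype(float)  # binary mask
jet_data = rng.random((num_jets, 5))

# Select the ptrel feature and zero out padded particles using the mask.
ptrel = particle_data[..., 2]
mask = particle_data[..., 3].astype(bool)
masked_ptrel = np.where(mask, ptrel, 0.0)

# Per-jet relative-pT sum over real (unmasked) particles only.
ptrel_sums = masked_ptrel.sum(axis=1)
print(particle_data.shape, jet_data.shape, ptrel_sums.shape)
```

Indexing by position assumes the features come back in the order they were requested, which is the documented behaviour of the feature lists.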
TopTagging¶
- class jetnet.datasets.TopTagging(*args: Any, **kwargs: Any)
PyTorch torch.utils.data.Dataset class for the Top Quark Tagging Reference dataset. If hdf5 files are not found in the data_dir directory, the dataset will be downloaded from Zenodo (https://zenodo.org/record/2603256).
- Parameters
jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘qcd’ and ‘top’. Defaults to “all”.
data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.
particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["E", "px", "py", "pz"].
jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "E", "px", "py", "pz"].
particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.
jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.
particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.
jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.
num_particles (int, optional) – number of particles to retain per jet, max of 200. Defaults to 200.
split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If the dataset is already downloaded, it is not downloaded again. Defaults to False.
Methods:
getData([jet_type, data_dir, ...]) – Downloads, if needed, and loads and returns Top Quark Tagging data.
- classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 200, split: str = 'all', download: bool = False) tuple[numpy.ndarray | None, numpy.ndarray | None]
Downloads, if needed, and loads and returns Top Quark Tagging data.
- Parameters
jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘qcd’ and ‘top’. Defaults to “all”.
data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.
particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["E", "px", "py", "pz"].
jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type", "E", "px", "py", "pz"].
num_particles (int, optional) – number of particles to retain per jet, max of 200. Defaults to 200.
split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If the dataset is already downloaded, it is not downloaded again. Defaults to False.
- Returns
particle data, jet data
- Return type
tuple[np.ndarray | None, np.ndarray | None]
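The documented features here are Cartesian four-momenta, so a common downstream step is computing the invariant mass m = √(E² − px² − py² − pz²). This is a small numpy sketch, not part of jetnet, using hand-picked four-vectors in the documented jet feature order ["type", "E", "px", "py", "pz"].

```python
import numpy as np

# Two example jets in the documented TopTagging jet feature order
# ["type", "E", "px", "py", "pz"]: type column first, then the 4-vector.
jet_data = np.array(
    [
        [1.0, 5.0, 3.0, 0.0, 0.0],    # m = sqrt(25 - 9) = 4
        [0.0, 13.0, 3.0, 4.0, 12.0],  # m = sqrt(169 - 169) = 0
    ]
)

E, px, py, pz = jet_data[:, 1], jet_data[:, 2], jet_data[:, 3], jet_data[:, 4]
# Invariant mass; the clip guards against tiny negative values from rounding.
mass = np.sqrt(np.clip(E**2 - px**2 - py**2 - pz**2, 0.0, None))
print(mass)  # [4. 0.]
```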
QuarkGluon¶
- class jetnet.datasets.QuarkGluon(*args: Any, **kwargs: Any)
PyTorch torch.utils.data.Dataset class for the Quark Gluon Jets dataset. Jets with or without bottom and charm quark jets can be selected via the with_bc flag. If npz files are not found in the data_dir directory, the dataset will be automatically downloaded from Zenodo (https://zenodo.org/record/3164691).
- Parameters
jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon) and ‘q’ (light quarks). Defaults to “all”.
data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.
with_bc (bool, optional) – with or without bottom and charm quark jets. Defaults to True.
particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["pt", "eta", "phi", "pdgid"].
jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type"].
particle_normalisation (NormaliseABC, optional) – optional normalisation to apply to particle data. Defaults to None.
jet_normalisation (NormaliseABC, optional) – optional normalisation to apply to jet data. Defaults to None.
particle_transform (callable, optional) – A function/transform that takes in the particle data tensor and transforms it. Defaults to None.
jet_transform (callable, optional) – A function/transform that takes in the jet data tensor and transforms it. Defaults to None.
num_particles (int, optional) – number of particles to retain per jet, max of 153. Defaults to 153.
split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “train”.
split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].
seed (int, optional) – PyTorch manual seed - important to use the same seed for all dataset splittings. Defaults to 42.
file_list (List[str], optional) – list of files to load, if full dataset is not required. Defaults to None (will load all files).
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If the dataset is already downloaded, it is not downloaded again. Defaults to False.
Methods:
getData([jet_type, data_dir, with_bc, ...]) – Downloads, if needed, and loads and returns Quark Gluon data.
- classmethod getData(jet_type: str | set[str] = 'all', data_dir: str = './', with_bc: bool = True, particle_features: list[str] | None = 'all', jet_features: list[str] | None = 'all', num_particles: int = 153, split: str = 'all', split_fraction: list[float] | None = None, seed: int = 42, file_list: list[str] | None = None, download: bool = False) tuple[numpy.ndarray | None, numpy.ndarray | None]
Downloads, if needed, and loads and returns Quark Gluon data.
- Parameters
jet_type (Union[str, Set[str]], optional) – individual type or set of types out of ‘g’ (gluon) and ‘q’ (light quarks). Defaults to “all”.
data_dir (str, optional) – directory in which data is (to be) stored. Defaults to “./”.
with_bc (bool, optional) – with or without bottom and charm quark jets. Defaults to True.
particle_features (List[str], optional) – list of particle features to retrieve. If empty or None, gets no particle features. Defaults to ["pt", "eta", "phi", "pdgid"].
jet_features (List[str], optional) – list of jet features to retrieve. If empty or None, gets no jet features. Defaults to ["type"].
num_particles (int, optional) – number of particles to retain per jet, max of 153. Defaults to 153.
split (str, optional) – dataset split, out of {“train”, “valid”, “test”, “all”}. Defaults to “all”.
split_fraction (List[float], optional) – splitting fraction of training, validation, testing data respectively. Defaults to [0.7, 0.15, 0.15].
seed (int, optional) – PyTorch manual seed - important to use the same seed for all dataset splittings. Defaults to 42.
file_list (List[str], optional) – list of files to load, if full dataset is not required. Defaults to None (will load all files).
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the data_dir directory. If the dataset is already downloaded, it is not downloaded again. Defaults to False.
- Returns
particle data, jet data
- Return type
tuple[np.ndarray | None, np.ndarray | None]
Normalisations¶
Suite of common ways to normalise data.
Classes:
FeaturewiseLinear – Shifts features by feature_shifts then multiplies by feature_scales.
FeaturewiseLinearBounded – Normalizes dataset features by scaling each to an (absolute) max of feature_norms and shifting by feature_shifts.
NormaliseABC – ABC for generalised normalisation class.
- class jetnet.datasets.normalisations.FeaturewiseLinear(feature_shifts: float | list[float] = 0.0, feature_scales: float | list[float] = 1.0, normalise_features: list[bool] | None = None, normal: bool = False)
Shifts features by feature_shifts then multiplies by feature_scales.
If using the normal option, feature_shifts and feature_scales can be derived from the dataset (by calling derive_dataset_features) to normalise the data to have 0 mean and unit standard deviation per feature.
- Parameters
feature_shifts (Union[float, List[float]], optional) – value to shift features by. Can either be a single float for all features, or a list of length num_features. Defaults to 0.0.
feature_scales (Union[float, List[float]], optional) – after shifting, value to multiply features by. Can either be a single float for all features, or a list of length num_features. Defaults to 1.0.
normalise_features (Optional[List[bool]], optional) – if only some features need to be normalised, can input here a list of booleans of length num_features, with True meaning normalise and False meaning ignore. Defaults to None, i.e. normalise all.
normal (bool, optional) – derive feature_shifts and feature_scales to have 0 mean and unit standard deviation per feature after normalisation (the derive_dataset_features method must be called before normalising).
Methods:
derive_dataset_features(x) – If using the normal option, this will derive the means and standard deviations per feature, and save and return them.
features_need_deriving() – Checks if any dataset values or features need to be derived.
- derive_dataset_features(x: ArrayLike) tuple[numpy.ndarray, numpy.ndarray] | None
If using the normal option, this will derive the means and standard deviations per feature, and save and return them. If not, will do nothing.
- Parameters
x (ArrayLike) – dataset of shape [..., num_features].
- Returns
if using the normal option, the means and stds of each feature; otherwise None.
- Return type
Optional[Tuple[np.ndarray, np.ndarray]]
- features_need_deriving() bool
Checks if any dataset values or features need to be derived
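The transform described above can be sketched in a few lines of numpy. This is a hypothetical re-implementation for illustration, not the library class itself: the shift-then-scale order follows the docstring, and deriving shift = −mean and scale = 1/std (my assumption of how the `normal` option maps onto shifts and scales) yields zero mean and unit standard deviation per feature.

```python
import numpy as np

rng = np.random.default_rng(0)
# Dataset of shape [..., num_features] with nonzero mean and non-unit std.
x = rng.normal(loc=3.0, scale=2.0, size=(1000, 4))

def featurewise_linear(x, feature_shifts, feature_scales):
    """Documented transform: shift each feature, then multiply."""
    return (x + np.asarray(feature_shifts)) * np.asarray(feature_scales)

# The `normal` option derives shifts/scales from the dataset so the result
# has 0 mean and unit std per feature: shift = -mean, scale = 1 / std
# (an assumption consistent with the documented behaviour).
means, stds = x.mean(axis=0), x.std(axis=0)
y = featurewise_linear(x, feature_shifts=-means, feature_scales=1.0 / stds)
print(y.mean(axis=0).round(6), y.std(axis=0).round(6))
```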
- class jetnet.datasets.normalisations.FeaturewiseLinearBounded(feature_norms: float | list[float] = 1.0, feature_shifts: float | list[float] = 0.0, feature_maxes: list[float] | None = None, normalise_features: list[bool] | None = None)
Normalizes dataset features by scaling each to an (absolute) max of feature_norms and shifting by feature_shifts.
If the value in the list for a feature is None, it won’t be scaled or shifted.
- Parameters
feature_norms (Union[float, List[float]], optional) – max value to scale each feature to. Can either be a single float for all features, or a list of length num_features. Defaults to 1.0.
feature_shifts (Union[float, List[float]], optional) – after scaling, value to shift feature by. Can either be a single float for all features, or a list of length num_features. Defaults to 0.0.
feature_maxes (List[float], optional) – max pre-scaling absolute value of each feature, used for scaling to the norm and inverting.
normalise_features (Optional[List[bool]], optional) – if only some features need to be normalised, can input here a list of booleans of length num_features, with True meaning normalise and False meaning ignore. Defaults to None, i.e. normalise all.
Methods:
derive_dataset_features(x) – Derives, saves, and returns absolute feature maxes of dataset x.
features_need_deriving() – Checks if any dataset values or features need to be derived.
- derive_dataset_features(x: ArrayLike) ndarray
Derives, saves, and returns absolute feature maxes of dataset x.
- Parameters
x (ArrayLike) – dataset of shape [..., num_features].
- Returns
feature maxes
- Return type
np.ndarray
- features_need_deriving() bool
Checks if any dataset values or features need to be derived
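A numpy sketch of this bounded normalisation, again a hypothetical re-implementation for illustration: derive the absolute max of each feature (as derive_dataset_features is documented to do), scale each feature to an absolute max of feature_norms, then shift by feature_shifts.

```python
import numpy as np

x = np.array([[2.0, -10.0], [-4.0, 5.0]])  # shape [..., num_features]

# derive_dataset_features: absolute max of each feature -> [4., 10.]
feature_maxes = np.abs(x).max(axis=0)

# Scale each feature to an (absolute) max of feature_norms, then shift by
# feature_shifts (scale-then-shift order per the documented description).
feature_norms, feature_shifts = 1.0, 0.0
y = x / feature_maxes * feature_norms + feature_shifts
print(y)
```

With the default norms of 1.0, every feature column of `y` has absolute max exactly 1, which is what makes the transform invertible given the stored feature_maxes.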
- class jetnet.datasets.normalisations.NormaliseABC
ABC for generalised normalisation class.
Methods:
derive_dataset_features(x) – Derive features from dataset needed for normalisation, if needed.
features_need_deriving() – Checks if any dataset values or features need to be derived.
- derive_dataset_features(x: ArrayLike)
Derive features from dataset needed for normalisation if needed
- features_need_deriving() bool
Checks if any dataset values or features need to be derived
Utility Functions¶
Utility methods for datasets.
Functions:
checkConvertElements – Checks if elem(s) are valid and if needed converts into a list.
checkDownloadZenodoDataset – Checks if dataset exists and md5 hash matches; if not and download = True, downloads it from Zenodo, and returns the file path.
checkListNotEmpty – Checks that list inputs are not None or empty.
checkStrToList – Converts str inputs to a list or set.
download_progress_bar – Download while outputting a progress bar.
firstNotNoneElement – Returns the first element out of all inputs which isn’t None.
getOrderedFeatures – Returns data with features in the order specified by features.
getSplitting – Returns starting and ending index for splitting a dataset of length length.
- jetnet.datasets.utils.checkConvertElements(elem: str | list[str], valid_types: list[str], ntype: str = 'element')
Checks if elem(s) are valid and if needed converts into a list
- jetnet.datasets.utils.checkDownloadZenodoDataset(data_dir: str, dataset_name: str, record_id: int, key: str, download: bool) str
Checks if the dataset exists and its md5 hash matches; if not and download = True, downloads it from Zenodo and returns the file path; if not and download = False, raises an error.
- jetnet.datasets.utils.checkListNotEmpty(*inputs: list[list]) list[bool]
Checks that list inputs are not None or empty
- jetnet.datasets.utils.checkStrToList(*inputs: list[str | list[str] | set[str]], to_set: bool = False) list[list[str]] | list[set[str]] | list
Converts str inputs to a list or set
- jetnet.datasets.utils.download_progress_bar(file_url: str, file_dest: str)
Download while outputting a progress bar. Modified from https://sumit-ghosh.com/articles/python-download-progress-bar/
- Parameters
file_url (str) – url to download from
file_dest (str) – path at which to save downloaded file
- jetnet.datasets.utils.firstNotNoneElement(*inputs: list[Any]) Any
Returns the first element out of all inputs which isn’t None
- jetnet.datasets.utils.getOrderedFeatures(data: ArrayLike, features: list[str], features_order: list[str]) ndarray
Returns data with features in the order specified by features.
- Parameters
data (ArrayLike) – input data
features (List[str]) – desired features in order
features_order (List[str]) – name and ordering of features in input data
- Returns
data with features in specified order
- Return type
np.ndarray
- jetnet.datasets.utils.getSplitting(length: int, split: str, splits: list[str], split_fraction: list[float]) tuple[int, int]
Returns starting and ending index for splitting a dataset of length length according to the input split, out of the total possible splits and a given split_fraction.
“all” is considered a special keyword to mean the entire dataset: it cannot be used to define a normal splitting, and if it is a possible splitting it must be the last entry in splits.
e.g. for length = 100, split = "valid", splits = ["train", "valid", "test"], split_fraction = [0.7, 0.15, 0.15], this will return (70, 85).
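The documented example reduces to a few lines of cumulative-fraction arithmetic. The function below is a hypothetical sketch of the described behaviour, not the library implementation: cumulative split fractions give each split's [start, end) indices, and "all" is special-cased to mean the whole dataset.

```python
import numpy as np

def get_splitting(length, split, splits, split_fraction):
    """Sketch of the documented behaviour: cumulative split fractions give
    each split's [start, end) indices; "all" means the entire dataset."""
    if split == "all":
        return 0, length
    # Cumulative boundaries, e.g. [0.7, 0.15, 0.15] -> [0, 0.7, 0.85, 1.0].
    cuts = np.concatenate(([0.0], np.cumsum(split_fraction)))
    i = splits.index(split)
    return int(length * cuts[i]), int(length * cuts[i + 1])

# The example from the docs:
print(get_splitting(100, "valid", ["train", "valid", "test"], [0.7, 0.15, 0.15]))
# -> (70, 85)
```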