JetNet Demo

Raghav Kansal, UC San Diego

PyHEP 2022 Workshop, Online, 12-16 September 2022

JetNet: For developing and reproducing ML + HEP projects.

Repo: github.com/jet-net/JetNet

Docs: jetnet.readthedocs.io

Paper: arXiv:2106.11535

Introduction

Problems:

  • How do I get started with machine learning in high energy physics?

  • How do I evaluate my results?

  • How do we reproduce and compare results?

Solution:

JetNet: Python package with easy-to-access datasets, standardised evaluation metrics, and more utilities for improving accessibility and reproducibility in ML + HEP.

Note: JetNet is still under development and currently offers a limited number of datasets and metrics. Feedback and contributions are welcome!

Today

  • Loading and looking at the JetNet dataset

  • Preparing the dataset for training a model

Data loading

We’ll use the jetnet.datasets.JetNet.getData function to download and directly access the dataset.

First, we can check which particle and jet features are available in this dataset:

[1]:
from jetnet.datasets import JetNet

print(f"Particle features: {JetNet.ALL_PARTICLE_FEATURES}")
print(f"Jet features: {JetNet.ALL_JET_FEATURES}")
Particle features: ['etarel', 'phirel', 'ptrel', 'mask']
Jet features: ['type', 'pt', 'eta', 'mass', 'num_particles']

Next, let’s load the data:

[2]:
data_args = {
    "jet_type": ["g", "t", "w"],  # gluon, top quark, and W boson jets
    "data_dir": "datasets/jetnet",
    # only selecting the kinematic features
    "particle_features": ["etarel", "phirel", "ptrel"],
    "num_particles": 30,
    "jet_features": ["type", "pt", "eta", "mass"],
    "download": True,
}

particle_data, jet_data = JetNet.getData(**data_args)

Let’s look at some of the data:

[3]:
print(
    f"Particle features of the 10 highest pT particles in the first jet\n{data_args['particle_features']}\n{particle_data[0, :10]}"
)
print(f"\nJet features of first jet\n{data_args['jet_features']}\n{jet_data[0]}")
Particle features of the 10 highest pT particles in the first jet
['etarel', 'phirel', 'ptrel']
[[-0.04361616 -0.00706771  0.29305124]
 [-0.04611618 -0.00956919  0.06966697]
 [-0.04163383 -0.00890653  0.05733829]
 [ 0.13638385 -0.00706771  0.04643776]
 [-0.04111616 -0.0045667   0.04290354]
 [-0.04223531  0.00299934  0.03603047]
 [ 0.10638386  0.01294228  0.03550573]
 [-0.0461162  -0.01457169  0.03525265]
 [-0.04251299 -0.00919492  0.02895915]
 [-0.04227024 -0.01043073  0.02826967]]

Jet features of first jet
['type', 'pt', 'eta', 'mass']
[3.00000000e+00 1.13473572e+03 6.48616195e-01 8.08584366e+01]
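
The particle features are stored relative to the jet they belong to. As a rough illustration, assuming the usual JetNet definitions (ptrel is the particle pT divided by the jet pT, and etarel is the particle eta minus the jet eta), the leading particle of the first jet can be converted back to absolute quantities. The numbers are copied from the output above; the variable names are only illustrative.

```python
# Values copied from the printed output: first jet and its leading particle.
jet_pt, jet_eta = 1134.73572, 0.648616195
etarel, ptrel = -0.04361616, 0.29305124

# Assumed definitions (not shown in this tutorial):
#   ptrel  = pT_particle / pT_jet
#   etarel = eta_particle - eta_jet
particle_pt = ptrel * jet_pt
particle_eta = etarel + jet_eta

print(f"particle pT  = {particle_pt:.1f}")
print(f"particle eta = {particle_eta:.4f}")
```

The jet phi is not among the selected jet features, so phirel cannot be inverted in the same way here.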

We can also visualise these jets as images:

[4]:
from jetnet.utils import to_image
import matplotlib.pyplot as plt

num_images = 5
num_types = len(data_args["jet_type"])
im_size = 25  # number of pixels in height and width
maxR = 0.4  # max radius in (eta, phi) away from the jet axis

cm = plt.cm.jet.copy()
cm.set_under(color="white")
plt.rcParams.update({"font.size": 16})

fig, axes = plt.subplots(
    nrows=num_types,
    ncols=num_images,
    figsize=(40, 8 * num_types),
    gridspec_kw={"wspace": 0.25},
)

# get the index of each jet type using the JetNet.JET_TYPES array
type_indices = {jet_type: JetNet.JET_TYPES.index(jet_type) for jet_type in data_args["jet_type"]}

for j in range(num_types):
    jet_type = data_args["jet_type"][j]
    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat

    axes[j][0].annotate(
        jet_type,
        xy=(0, -1),
        xytext=(-axes[j][0].yaxis.labelpad - 15, 0),
        xycoords=axes[j][0].yaxis.label,
        textcoords="offset points",
        ha="right",
        va="center",
        fontsize=24,
    )

    for i in range(num_images):
        im = axes[j][i].imshow(
            to_image(particle_data[type_selector][i], im_size, maxR=maxR),
            cmap=cm,
            interpolation="nearest",
            vmin=1e-8,
            extent=[-maxR, maxR, -maxR, maxR],
            vmax=0.05,
        )
        axes[j][i].tick_params(which="both", bottom=False, top=False, left=False, right=False)
        # raw strings avoid invalid-escape warnings for "\p" and "\e"
        axes[j][i].set_xlabel(r"$\phi^{rel}$")
        axes[j][i].set_ylabel(r"$\eta^{rel}$")
        axes[j][i].set_title(f"Jet {i + 1}")

cbar = fig.colorbar(im, ax=axes.ravel().tolist(), fraction=0.01)
cbar.set_label("$p_T^{rel}$")
[Figure: images of five jets for each jet type (g, t, w), with phi^rel on the x-axis, eta^rel on the y-axis, and pixel colour showing p_T^rel.]
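
A jet image of this kind is essentially a 2D histogram of the particles in (etarel, phirel), weighted by ptrel. The sketch below illustrates the idea with np.histogram2d. It is not necessarily how jetnet.utils.to_image is implemented, the axis orientation is an assumption, and to_image_sketch is a hypothetical name.

```python
import numpy as np


def to_image_sketch(particles, im_size, max_r=0.4):
    """Bin (etarel, phirel) into an im_size x im_size grid, weighted by ptrel."""
    eta, phi, pt = particles[:, 0], particles[:, 1], particles[:, 2]
    edges = np.linspace(-max_r, max_r, im_size + 1)
    image, _, _ = np.histogram2d(eta, phi, bins=[edges, edges], weights=pt)
    return image


# Three illustrative particles: (etarel, phirel, ptrel)
particles = np.array(
    [
        [-0.0436, -0.0071, 0.2931],
        [0.1364, -0.0071, 0.0464],
        [0.0, 0.0, 0.0],  # zero-padded particle, contributes no pT
    ]
)
image = to_image_sketch(particles, im_size=25)
print(image.shape, image.sum())
```

Particles outside the square of half-width max_r are simply dropped by the histogram, which is why the choice of maxR matters for wide jets.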

And calculate and plot their overall features:

[5]:
from jetnet.utils import jet_features
import numpy as np

fig = plt.figure(figsize=(12, 12))
plt.ticklabel_format(axis="y", scilimits=(0, 0), useMathText=True)

for j in range(num_types):
    jet_type = data_args["jet_type"][j]
    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat

    jet_masses = jet_features(particle_data[type_selector][:50000])["mass"]
    _ = plt.hist(jet_masses, bins=np.linspace(0, 0.2, 100), histtype="step", label=jet_type)

plt.xlabel("Jet $m/p_{T}$")
plt.ylabel("# Jets")
plt.legend(loc=1, prop={"size": 18})
plt.title("Relative Jet Masses")
plt.show()
[Figure: histogram of relative jet masses (number of jets against jet m/p_T) for g, t, and w jets.]
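
To make the mass calculation concrete, here is a minimal NumPy sketch of the invariant mass of a single jet computed from its constituents, assuming massless particles. It illustrates the physics only; jetnet.utils.jet_features may differ in detail, and relative_jet_mass is a hypothetical helper.

```python
import numpy as np


def relative_jet_mass(particles):
    """Invariant mass of one jet from (etarel, phirel, ptrel) rows, massless particles."""
    eta, phi, pt = particles[:, 0], particles[:, 1], particles[:, 2]
    px, py, pz = pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)
    energy = pt * np.cosh(eta)
    m2 = energy.sum() ** 2 - px.sum() ** 2 - py.sum() ** 2 - pz.sum() ** 2
    return float(np.sqrt(max(m2, 0.0)))  # clip tiny negative values from rounding


# A single massless particle has zero mass; two back-to-back particles do not.
one = np.array([[0.1, 0.2, 1.0]])
two = np.array([[0.0, 0.0, 0.5], [0.0, np.pi, 0.5]])
print(relative_jet_mass(one), relative_jet_mass(two))
```

Because the inputs use ptrel rather than the absolute pT, the result is expressed in units of the jet pT, which is why the histogram's x-axis is labelled m/p_T.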

Dataset preparation

To prepare the dataset for machine learning applications, we can use the jetnet.datasets.JetNet class itself, which inherits from PyTorch's torch.utils.data.Dataset class.

We’ll also use the class to normalise the features to have zero means and unit standard deviations, and transform the jet type feature to be one-hot-encoded.

[6]:
from jetnet.datasets import JetNet
from jetnet.datasets.normalisations import FeaturewiseLinear

import numpy as np
from sklearn.preprocessing import OneHotEncoder


# function to one hot encode the jet type and leave the rest of the features as is
def OneHotEncodeType(x: np.ndarray):
    # gluon and top quark jets have type indices 0 and 2 in JetNet.JET_TYPES
    enc = OneHotEncoder(categories=[[0, 2]])
    type_encoded = enc.fit_transform(x[..., 0].reshape(-1, 1)).toarray()
    other_features = x[..., 1:].reshape(-1, 3)
    return np.concatenate((type_encoded, other_features), axis=-1).reshape(*x.shape[:-1], -1)


data_args = {
    "jet_type": ["g", "t"],  # gluon and top quark jets
    "data_dir": "datasets/jetnet",
    # these are the default particle features, written here to be explicit
    "particle_features": ["etarel", "phirel", "ptrel", "mask"],
    "num_particles": 10,  # we retain only the 10 highest pT particles for this demo
    "jet_features": ["type", "pt", "eta", "mass"],
    # we don't want to normalise the 'mask' feature so we set that to False
    "particle_normalisation": FeaturewiseLinear(
        normal=True, normalise_features=[True, True, True, False]
    ),
    # pass our function as a transform to be applied to the jet features
    "jet_transform": OneHotEncodeType,
    "download": True,
}

jets_train = JetNet(**data_args, split="train")
jets_valid = JetNet(**data_args, split="valid")

We can look at one of our datasets to confirm everything is as we expect:

[7]:
jets_train
[7]:
Dataset JetNet
    Number of datapoints: 248637
    Data location: datasets/jetnet
    Including ['g', 't'] jets
    Split into train data out of ['train', 'valid', 'test', 'all'] possible splits, with splitting fractions [0.7, 0.15, 0.15]
    Particle features: ['etarel', 'phirel', 'ptrel', 'mask'], max 10 particles per jet
    Jet features: ['type', 'pt', 'eta', 'mass']
    Particle normalisation: Normalising features to zero mean and unit standard deviation, normalising features: [True, True, True, False]
    Jet transform: <function OneHotEncodeType at 0x163cd32e0>
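
The splitting fractions reported above determine how many jets end up in each split. A minimal sketch of that arithmetic follows. How JetNet assigns jets to splits internally is not shown in this tutorial, so the flooring used here is an assumption, and split_sizes is a hypothetical helper.

```python
def split_sizes(num_jets, fractions=(0.7, 0.15, 0.15)):
    """Sizes of consecutive train/valid/test splits for the given fractions."""
    n_train = int(fractions[0] * num_jets)
    n_valid = int(fractions[1] * num_jets)
    n_test = num_jets - n_train - n_valid  # remainder, so the sizes always add up
    return {"train": n_train, "valid": n_valid, "test": n_test}


sizes = split_sizes(1000)
print(sizes)
```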

We can also look directly at the data itself. Note that the features have been normalised and the jet type has been one-hot-encoded:

[8]:
particle_features, jet_features = jets_train[0]
print(f"Particle features ({data_args['particle_features']}):\n\t{particle_features}")
print(f"\nJet features ({data_args['jet_features']}):\n\t{jet_features}")
Particle features (['etarel', 'phirel', 'ptrel', 'mask']):
        tensor([[-1.5952e-03, -9.4181e-04,  6.7592e-01,  1.0000e+00],
        [ 1.3819e-03,  6.9232e-03,  9.6110e-02,  1.0000e+00],
        [ 5.9048e-03, -3.4432e-03,  9.1700e-02,  1.0000e+00],
        [ 1.4783e-02, -1.0506e-02,  2.8433e-02,  1.0000e+00],
        [ 1.3316e-03, -8.1813e-03,  2.6264e-02,  1.0000e+00],
        [ 9.0482e-04,  1.1564e-02,  1.6956e-02,  1.0000e+00],
        [-1.4095e-02,  1.1564e-02,  1.3759e-02,  1.0000e+00],
        [ 2.5905e-02, -3.4432e-03,  1.1798e-02,  1.0000e+00],
        [-4.0952e-03,  6.5619e-03,  9.9370e-03,  1.0000e+00],
        [-9.0952e-03,  3.1575e-02,  8.2691e-03,  1.0000e+00]])

Jet features (['type', 'pt', 'eta', 'mass']):
        tensor([ 1.0000e+00,  0.0000e+00,  1.2301e+03, -1.7340e-01,  2.2097e+01])
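
The two transformations can be sketched in plain NumPy on stand-in data: feature-wise normalisation to zero mean and unit standard deviation that skips the mask column, and one-hot encoding of the jet type. This is only an illustration of the idea, not the FeaturewiseLinear or OneHotEncoder implementation; the random data and the type indices (g = 0, t = 2) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 100 jets x 10 particles x (etarel, phirel, ptrel, mask)
particles = rng.normal(loc=2.0, scale=3.0, size=(100, 10, 4))
particles[..., 3] = 1.0  # mask column, left untouched below

# Normalise only the features flagged True, mirroring normalise_features
normalise = np.array([True, True, True, False])
flat = particles.reshape(-1, 4)
mean, std = flat.mean(axis=0), flat.std(axis=0)
normed = particles.copy()
normed[..., normalise] = (particles[..., normalise] - mean[normalise]) / std[normalise]

# One-hot encode the jet type (assumed indices: g = 0, t = 2)
jet_types = np.array([0, 2, 2, 0])
categories = np.array([0, 2])
one_hot = (jet_types[:, None] == categories[None, :]).astype(float)
print(one_hot)
```

The statistics are computed over all particles of all jets at once, so each feature gets a single mean and standard deviation.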

We can now feed this into a PyTorch DataLoader and start training!
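
A minimal sketch of that step follows. To keep it runnable without downloading anything, it uses a TensorDataset of random tensors with the same per-item shapes as above (particle features of shape (10, 4), jet features of shape (5,)) as a stand-in for jets_train; the batch size is an arbitrary choice.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for jets_train: 64 jets with the same per-item shapes as above
stand_in = TensorDataset(torch.randn(64, 10, 4), torch.randn(64, 5))

loader = DataLoader(stand_in, batch_size=16, shuffle=True)

for particle_batch, jet_batch in loader:
    # a training step would go here
    print(particle_batch.shape, jet_batch.shape)
    break
```

With the real dataset, jets_train and jets_valid would be passed to DataLoader in place of stand_in.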

Next things you can try are:

  • Repeating this with the Top Quark Tagging (jetnet.datasets.TopTagging) and Quark Gluon (jetnet.datasets.QuarkGluon) datasets

  • Training an ML model (tutorial coming soon…)

  • Evaluating generative models (jetnet.evaluation)