Demo

In this notebook we demonstrate how to use the swem package for text classification and introduce some of its functionality along the way.

Dataset

We start by producing a dummy dataset to train our model on. The dataset is constructed as follows:

The dataset has three possible labels “A”, “B” and “C”.
Each sample is a short ‘text’ of length between 5 and 25 tokens over the vocabulary [“a”, “b”, “c”, “d”, “e”].
A text with label “A” may contain any token except “a”; the same holds analogously for “B” and “C”.
For simplicity’s sake we pad every text to length 25 with the padding token “p”, so that the full vocabulary is [“p”, “a”, “b”, “c”, “d”, “e”].
We sample 3000 texts, of which we use the first 2500 for training and the last 500 for testing.

[1]:

import numpy
import random
import torch

[2]:

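target_names = ["A", "B", "C"]
vocab_nopad = ["a", "b", "c", "d", "e"]
vocab = ["p"] + vocab_nopad  # "p" is the padding token
max_len = 25


def make_sample():
    label = random.choice(target_names)
    # A text with label "A" never contains "a" (analogously for "B" and "C").
    allowed = [t for t in vocab_nopad if t != label.lower()]
    length = random.randint(5, max_len)
    tokens = random.choices(allowed, k=length) + ["p"] * (max_len - length)
    return {"label": label, "tokens": tokens}


data = [make_sample() for _ in range(3000)]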

[3]:

print(data[0])

{'label': 'B', 'tokens': ['d', 'a', 'c', 'e', 'd', 'a', 'c', 'a', 'a', 'e', 'd', 'd', 'e', 'd', 'p', 'p', 'p', 'p', 'p', 'p', 'p', 'p', 'p', 'p', 'p']}
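
The construction rules above are easy to check programmatically. The following sketch regenerates a small dataset with the same rules and verifies them; the helper name `check_sample` is ours and not part of the package.

```python
import random

target_names = ["A", "B", "C"]
vocab_nopad = ["a", "b", "c", "d", "e"]
max_len = 25  # every text is padded to this length


def check_sample(sample):
    """Return True if a sample obeys the construction rules of the dummy dataset."""
    tokens = sample["tokens"]
    text = [t for t in tokens if t != "p"]
    n_pad = len(tokens) - len(text)
    return (
        len(tokens) == max_len
        and 5 <= len(text) <= max_len
        # padding only appears at the end of the text
        and tokens[len(text):] == ["p"] * n_pad
        # the token matching the label never occurs
        and sample["label"].lower() not in text
        and all(t in vocab_nopad for t in text)
    )


# Regenerate a small dataset following the same rules as above.
samples = []
for _ in range(100):
    label = random.choice(target_names)
    allowed = [t for t in vocab_nopad if t != label.lower()]
    length = random.randint(5, max_len)
    tokens = random.choices(allowed, k=length) + ["p"] * (max_len - length)
    samples.append({"label": label, "tokens": tokens})

print(all(check_sample(s) for s in samples))  # True
```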

We need to turn our data into tensors that our model can handle.

[4]:

token2idx = {t: i for i,t in enumerate(vocab)}
label2idx = {l: i for i,l in enumerate(target_names)}

[5]:

text_tensor = torch.tensor([[token2idx[t] for t in sample["tokens"]] for sample in data], dtype=torch.int64)
label_tensor = torch.tensor([label2idx[sample["label"]]  for sample in data], dtype=torch.int64)
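
When debugging it is handy to map indices back to tokens. Below is a minimal sketch using an inverse dictionary; the names `idx2token`, `encode` and `decode` are illustrative and not part of the package. Since “p” comes first in the vocabulary its index is 0, which matches the `padding_idx` used in the model configuration further down.

```python
vocab = ["p", "a", "b", "c", "d", "e"]
token2idx = {t: i for i, t in enumerate(vocab)}
idx2token = {i: t for t, i in token2idx.items()}


def encode(tokens, max_len=25, pad="p"):
    """Pad a token list to max_len and map it to vocabulary indices."""
    if len(tokens) > max_len:
        raise ValueError(f"text longer than {max_len} tokens")
    padded = tokens + [pad] * (max_len - len(tokens))
    return [token2idx[t] for t in padded]


def decode(indices, pad="p"):
    """Map indices back to tokens and drop the padding."""
    return [idx2token[i] for i in indices if idx2token[i] != pad]


ids = encode(["d", "a", "c", "e", "d"])
print(ids[:7])      # [4, 1, 3, 5, 4, 0, 0]
print(decode(ids))  # ['d', 'a', 'c', 'e', 'd']
```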

For ease of use we wrap everything in a PyTorch dataset.

[6]:

traindata = torch.utils.data.TensorDataset(text_tensor[:2500], label_tensor[:2500])
testdata = torch.utils.data.TensorDataset(text_tensor[2500:], label_tensor[2500:])

The model

We build a small SWEM model to train on the data. The model has the following structure:

A WordDropEmbedding with 6 embeddings (one per vocabulary entry) and an embedding dimension of 3.
A linear layer of size 3.
A hierarchical pooling layer with window size 4.
Another linear layer of size 3, followed by a final linear layer of size 3 whose outputs are the class logits for the classification task.

We could construct the model directly via its __init__ method, but we can also specify the configuration first and let the from_config method do the rest for us.

[7]:

from swem.models.swem import SwemConfig, Swem

[8]:

config = SwemConfig.from_dict({
    "embedding": {
        "type": "WordDropEmbedding",
        "num_embeddings": 6,
        "embedding_dim": 3,
        "padding_idx": 0,
        "p": 0.2
    },
    "pooling": {
        "type": "HierarchicalPooling",
        "window_size": 4
    },
    "pre_pooling_dims": (3, ),
    "post_pooling_dims": (3, 3),
    "dropout": 0.2
})

[9]:

model = Swem.from_config(config)

[10]:

model

[10]:

Swem(
  (embedding): WordDropEmbedding(6, 3, padding_idx=0)
  (pooling_layer): HierarchicalPooling(
    (avg_pooling): AvgPool2d(kernel_size=(4, 1), stride=1, padding=0)
  )
  (pre_pooling_trafo): Sequential(
    (0): Linear(in_features=3, out_features=3, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.2, inplace=False)
  )
  (post_pooling_trafo): Sequential(
    (0): Linear(in_features=3, out_features=3, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.2, inplace=False)
    (3): Linear(in_features=3, out_features=3, bias=True)
  )
)
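
In the SWEM literature, hierarchical pooling means averaging the embeddings inside each sliding window and then taking the maximum over all windows. The sketch below illustrates this idea with plain PyTorch operations. It is our reading of the concept and makes no claim about how `HierarchicalPooling` is implemented internally (for instance, how padding is treated); the function name `hierarchical_pool` is hypothetical.

```python
import torch


def hierarchical_pool(x, window_size):
    """Average over sliding windows along the sequence axis, then max over windows.

    x has shape (batch, seq_len, dim); the result has shape (batch, dim).
    """
    # unfold -> (batch, num_windows, dim, window_size)
    windows = x.unfold(1, window_size, 1)
    window_means = windows.mean(dim=-1)
    return window_means.max(dim=1).values


# One text with six "tokens" 0..5 and embedding dimension 1.
x = torch.arange(6, dtype=torch.float32).reshape(1, 6, 1)
print(hierarchical_pool(x, window_size=4))  # tensor([[3.5000]])
```

The three windows have means 1.5, 2.5 and 3.5, so the maximum is 3.5.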

Training

We train the model for 20 epochs with a batch size of 8, using the Adam optimizer with a learning rate of 3e-4 (Karpathy’s constant 😉). If a GPU is available we use it; otherwise we fall back to the CPU. The to_device function transfers the tokens and labels to the specified device in one call.

[11]:

from swem.utils.torch_utils import to_device
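
To illustrate what such a helper does, here is a hypothetical re-implementation called `move_to_device`. It is only a sketch of the general pattern (recursively moving every tensor in a nested batch) and not the code of `swem.utils.torch_utils.to_device`.

```python
import torch


def move_to_device(obj, device):
    """Recursively move all tensors inside lists, tuples and dicts to a device."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(v, device) for v in obj)
    return obj  # leave non-tensor values untouched


device = "cuda:0" if torch.cuda.is_available() else "cpu"
dummy_batch = [
    torch.zeros(8, 25, dtype=torch.int64),  # tokens
    torch.zeros(8, dtype=torch.int64),      # labels
]
tokens, labels = move_to_device(dummy_batch, device)
print(tokens.device, labels.device)
```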

[12]:

epochs = 20
batch_size = 8
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# The model has to live on the same device as the batches it receives.
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = torch.nn.CrossEntropyLoss()

[13]:

train_dataloader = torch.utils.data.DataLoader(
    traindata,
    batch_size=batch_size,
    shuffle=True
)

for i in range(epochs):
    print(f"Starting epoch {i+1}")
    model.train()
    for batch in train_dataloader:
        batch = to_device(batch, device=device)
        tokens, labels = batch
        output = model(tokens)
        loss = loss_fn(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Starting epoch 1
Starting epoch 2
Starting epoch 3
Starting epoch 4
Starting epoch 5
Starting epoch 6
Starting epoch 7
Starting epoch 8
Starting epoch 9
Starting epoch 10
Starting epoch 11
Starting epoch 12
Starting epoch 13
Starting epoch 14
Starting epoch 15
Starting epoch 16
Starting epoch 17
Starting epoch 18
Starting epoch 19
Starting epoch 20
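
The loop above only prints the epoch number. To monitor progress one could also track the mean loss per epoch. The following sketch shows the bookkeeping with a stand-in linear model and random data; all names prefixed with `toy_` are made up for this illustration.

```python
import torch

torch.manual_seed(0)
# Stand-in data and model, only for illustration.
features = torch.randn(64, 4)
targets = (features[:, 0] > 0).long()
toy_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(features, targets), batch_size=8, shuffle=True
)
toy_model = torch.nn.Linear(4, 2)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)
toy_loss_fn = torch.nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(5):
    toy_model.train()
    running_loss, num_samples = 0.0, 0
    for x, y in toy_loader:
        loss = toy_loss_fn(toy_model(x), y)
        toy_optimizer.zero_grad()
        loss.backward()
        toy_optimizer.step()
        # CrossEntropyLoss averages over the batch, so re-weight by batch size.
        running_loss += loss.item() * len(y)
        num_samples += len(y)
    epoch_losses.append(running_loss / num_samples)
    print(f"epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```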

Evaluation

Let us now evaluate our model on the test set. To that end we iterate over the test set batch by batch and aggregate the metrics along the way, using the ClassificationReport.

[14]:

from swem.metrics import ClassificationReport

[15]:

report = ClassificationReport(target_names=target_names)

[16]:

test_dataloader = torch.utils.data.DataLoader(
    testdata,
    batch_size=batch_size,
    shuffle=False
)

model.eval()
for batch in test_dataloader:
    with torch.no_grad():
        batch = to_device(batch, device=device)
        tokens, labels = batch
        logits = model(tokens)
        report.update(logits, labels)
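
Batchwise aggregation works because precision and recall can be derived from a confusion matrix, and the confusion matrices of separate batches simply add up. The sketch below shows this with plain Python lists; the class `ConfusionAccumulator` is hypothetical and not necessarily how `ClassificationReport` works internally.

```python
class ConfusionAccumulator:
    """Accumulate a confusion matrix over batches of (predicted, true) label indices."""

    def __init__(self, num_classes):
        # counts[true][pred] = number of samples
        self.counts = [[0] * num_classes for _ in range(num_classes)]

    def update(self, predictions, labels):
        for pred, true in zip(predictions, labels):
            self.counts[true][pred] += 1

    def recall(self, cls):
        support = sum(self.counts[cls])
        return self.counts[cls][cls] / support if support else 0.0

    def precision(self, cls):
        predicted = sum(row[cls] for row in self.counts)
        return self.counts[cls][cls] / predicted if predicted else 0.0


acc = ConfusionAccumulator(num_classes=3)
acc.update(predictions=[0, 1, 2, 2], labels=[0, 1, 2, 0])  # first batch
acc.update(predictions=[1, 2], labels=[1, 1])              # second batch
print(acc.recall(0), acc.precision(2))  # 0.5 0.3333333333333333
```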

[17]:

report

[17]:

{
  "num_samples": 500,
  "accuracy": 0.67,
  "recall_macro_avg": 0.6682970051391104,
  "recall_weighted_avg": 0.67,
  "precision_macro_avg": 0.6742675303341055,
  "precision_weighted_avg": 0.6718713765068685,
  "f1_score_macro_avg": 0.614279326358254,
  "f1_score_weighted_avg": 0.6139851783765151,
  "class_metrics": {
    "A": {
      "support": 165,
      "recall": 0.1393939393939394,
      "precision": 0.5609756097560976,
      "f1_score": 0.22330097087378645
    },
    "B": {
      "support": 164,
      "recall": 1.0,
      "precision": 0.9425287356321839,
      "f1_score": 0.9704142011834319
    },
    "C": {
      "support": 171,
      "recall": 0.8654970760233918,
      "precision": 0.519298245614035,
      "f1_score": 0.6491228070175438
    }
  }
}
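
The aggregate numbers follow directly from the per-class metrics: a macro average is the plain mean over the classes, whereas a weighted average weights each class by its support. The following check recomputes some of the values from the report above; the per-class numbers are copied from that output, and the helper names are ours.

```python
class_metrics = {
    "A": {"support": 165, "recall": 0.1393939393939394, "precision": 0.5609756097560976},
    "B": {"support": 164, "recall": 1.0, "precision": 0.9425287356321839},
    "C": {"support": 171, "recall": 0.8654970760233918, "precision": 0.519298245614035},
}


def macro_avg(metric):
    """Unweighted mean of a metric over all classes."""
    return sum(m[metric] for m in class_metrics.values()) / len(class_metrics)


def weighted_avg(metric):
    """Mean of a metric over all classes, weighted by class support."""
    total = sum(m["support"] for m in class_metrics.values())
    return sum(m[metric] * m["support"] for m in class_metrics.values()) / total


def f1(m):
    """Harmonic mean of precision and recall."""
    return 2 * m["precision"] * m["recall"] / (m["precision"] + m["recall"])


print(round(macro_avg("recall"), 4))     # 0.6683
print(round(weighted_avg("recall"), 4))  # 0.67
print(round(f1(class_metrics["A"]), 4))  # 0.2233
```

Note that the support-weighted recall coincides with the accuracy: both count the correctly classified samples (here 335) and divide by the total number of samples (500).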

Saving the model

Let us save our model to disk for later use.

[18]:

from pathlib import Path

[19]:

model_path = Path("./model")

[20]:

model.save(model_path)

The save method stores both the config and the weights in the given directory:

[21]:

print(list(model_path.iterdir()))

[PosixPath('model/weights.pt'), PosixPath('model/config.json')]

If we want to use the model later on we can simply load it:

[22]:

model_loaded = Swem.load(model_path)
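
A directory holding a JSON config next to a weights file is a common layout for persisting PyTorch models. As an illustration, here is a sketch of that pattern with a stand-in `torch.nn.Linear` model. The functions `save_model` and `load_model` are hypothetical and do not reproduce the implementation of `Swem.save` or `Swem.load`.

```python
import json
import tempfile
from pathlib import Path

import torch


def save_model(model, config, path):
    """Write config.json and weights.pt into the directory `path`."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    (path / "config.json").write_text(json.dumps(config))
    torch.save(model.state_dict(), path / "weights.pt")


def load_model(path):
    """Rebuild the model from the stored config and load its weights."""
    path = Path(path)
    config = json.loads((path / "config.json").read_text())
    model = torch.nn.Linear(**config)
    model.load_state_dict(torch.load(path / "weights.pt", map_location="cpu"))
    return model


toy_config = {"in_features": 3, "out_features": 3}
original = torch.nn.Linear(**toy_config)
with tempfile.TemporaryDirectory() as tmp:
    save_model(original, toy_config, tmp)
    restored = load_model(tmp)
print(torch.equal(original.weight, restored.weight))  # True
```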

Let’s make sure that the config and the weights of the loaded model are indeed the same as those of the original model.

[23]:

model_loaded.config == model.config

[23]:

True

[24]:

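all(
    # Compare on the CPU so the check also works if the two models live on different devices.
    torch.allclose(model_param.cpu(), model_loaded.state_dict()[param_name].cpu())
    for param_name, model_param in model.state_dict().items()
)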

[24]:

True