Building a Large Language Model from scratch, part 4
This is the fourth part of a series where I work through Sebastian Raschka’s Build a Large Language Model from Scratch. Previously, I reviewed Chapter 4, which covers the complete model architecture for a GPT-style large language model. The next step is pretraining: taking a freshly initialized model instance and preparing it for actual use!
The previous installments are here: Part 3, Part 2, Part 1.
Chapter 5: Pretraining on unlabeled data
Loss function
Before anything else, let’s discuss the objective of the pretraining. We want the model to generate “high quality” text output; what does that mean operationally? Defining this quantitatively will not only help us compare model performance across different training epochs - or hyperparameter settings - but also give us a loss function we can use to update the model weights e.g. through gradient descent. The intuition behind the loss metric for pretraining is that:
- The loss should be lower when the “true” next token in a training example has a high output probability
- The loss should be higher when the “true” next token in a training example has a low output probability
Formally, this is the cross entropy loss metric: the negative average log probability of the true target tokens in the output sequence. PyTorch has a cross_entropy function that takes the raw prediction logits with shape (N, V), where V is the vocabulary size and N is the number of token positions (batch and sequence dimensions flattened together), along with the true token IDs with shape (N,). The logit-to-probability conversion and probability-to-token-ID mapping are all conveniently handled internally. That said, it is still useful to write utility functions to calculate the cross entropy loss across batches and an entire DataLoader; Raschka walks through example implementations of this.
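As a concrete illustration (the shapes and variable names here are my own, not necessarily the book’s exact utility code), the loss for one batch can be computed like this:

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, 3 tokens each, vocabulary of 8 tokens.
batch_size, num_tokens, vocab_size = 2, 3, 8

logits = torch.randn(batch_size, num_tokens, vocab_size)           # raw model outputs
targets = torch.randint(0, vocab_size, (batch_size, num_tokens))   # true next-token IDs

# cross_entropy expects (N, V) logits and (N,) targets, so flatten
# the batch and sequence dimensions together before calling it.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(loss)  # scalar: negative average log probability of the target tokens
```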
One topic I’d like to explore further is whether there are benefits to a loss function that also incorporates, say, the distance in token-embedding space between the predicted and actual tokens. I wonder if this would improve convergence, even if it is expensive to compute. From my basic literature review, this often seems to fall under the category of contrastive learning, and is an area of active research for self-supervised learning, but I haven’t yet found any research on multi-objective loss functions for pretraining. Separately, I’ve also found weighted variations on cross-entropy loss, such as this from Polat et al (2022), but haven’t yet found one that explicitly incorporates embedding-vector distance. I should explore this in a future blog post!
Training and validation
With the loss function defined, the rough strategy for training the GPT is straightforward, and similar to plenty of other machine learning model training workflows (a minimal code sketch follows the list):
- For each training epoch:
    - For each batch:
        - Reset loss gradients
        - Calculate loss
        - Backward pass to calculate loss gradients
        - Update model weights with the loss gradients
    - Print training and validation losses
    - Print sample generated text
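Here’s a minimal sketch of that loop in PyTorch. The function and variable names are my own placeholders (the evaluation and text-generation helpers are omitted), not the book’s exact implementation:

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, optimizer, device, num_epochs):
    for epoch in range(num_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)

            optimizer.zero_grad()                  # reset loss gradients
            logits = model(input_batch)            # forward pass
            loss = F.cross_entropy(                # calculate loss
                logits.flatten(0, 1), target_batch.flatten()
            )
            loss.backward()                        # backward pass: compute gradients
            optimizer.step()                       # update model weights

        # After each epoch: report losses and generate a text sample
        # (validation-loss and sampling helpers omitted for brevity).
        print(f"epoch {epoch}: last batch training loss {loss.item():.3f}")
```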
Raschka mentions that better-optimized training workflows often incorporate learning rate warmup, cosine annealing, and gradient clipping. Since these are not essential for demonstrating the basics of GPT pretraining, he reserves them for an appendix, which is also worth reviewing.
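For reference, here is a rough sketch of how those three refinements commonly fit together in PyTorch; this is my own illustration (with a stand-in model), not the appendix code:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 8)  # stand-in model; substitute the GPT model instance
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

# Learning rate warmup for the first 100 steps, then cosine annealing.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, total_iters=100),
        CosineAnnealingLR(optimizer, T_max=1000),
    ],
    milestones=[100],
)

# Inside the batch loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()
scheduler.step()  # advance the learning-rate schedule once per step
```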
The demonstration of model pretraining here relies on a single, short training text. This keeps the exercise accessible to readers who don’t have access to powerful compute resources, and the qualitative difference in generated text is very apparent within just a few epochs. However, it necessarily leads to quite a bit of overfitting, so the resulting model isn’t useful in practice, of course.
Controlling randomness
Recall that the model’s final output layer produces logits, which become a probability distribution over the next token after applying softmax. Instead of always picking the token with the single highest probability as the prediction, we can increase the variety and novelty of predictions by sampling from that distribution, i.e. drawing from a multinomial distribution parameterized by the output probabilities at each step!
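As a toy illustration (not the book’s generate function), the difference between greedy decoding and multinomial sampling comes down to a single call:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])           # next-token logits for a tiny vocab
probs = torch.softmax(logits, dim=-1)                   # convert logits to probabilities

greedy_id = torch.argmax(probs)                         # always the single most likely token
sampled_id = torch.multinomial(probs, num_samples=1)    # sampled in proportion to probability
```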
One approach for this is temperature scaling. The idea is to divide the logits by a temperature value before applying softmax, such that:
- Temperature > 1 → the distribution becomes more uniform (more varied samples)
- Temperature < 1 → the distribution becomes more peaked (more deterministic samples)
In this way, the variability of the sampled output is tunable.
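Concretely (with made-up logit values):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, probs)
# Temperature < 1 sharpens the distribution toward the top token;
# temperature > 1 flattens it toward uniform, so samples are more varied.
```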
Another technique discussed, which can be used with or without temperature scaling, is top-k sampling. In short, this restricts sampling to the k highest-probability tokens and excludes everything else. It can be implemented straightforwardly by masking the logits outside of the top k with -inf before the softmax. The net effect is to prevent truly unlikely tokens from being emitted.
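A sketch of that masking trick (variable names are mine):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1, -1.0, -3.0])
k = 3

top_logits, _ = torch.topk(logits, k)               # the k largest logits
min_kept = top_logits[-1]                           # smallest logit that survives
masked = torch.where(
    logits < min_kept,
    torch.tensor(float("-inf")),                    # excluded tokens get -inf ...
    logits,
)
probs = torch.softmax(masked, dim=-1)               # ... which softmax maps to probability 0
sample = torch.multinomial(probs, num_samples=1)    # masked tokens can never be drawn
```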
Loading and saving model weights in PyTorch
Raschka wraps up the chapter with a discussion of model persistence. The approach here is standard for any PyTorch model - it’s not specific to LLMs - but it’s still good to include for completeness’ sake.
One consideration to keep in mind is that the model and optimizer classes must be handled separately, even if they are saved in the same file. The provided pattern is to use torch.save to write a dict containing both the model’s and the optimizer’s state dicts to disk. Then, to load the model, initialize the model class, read the saved dict from the file with torch.load, and restore the weights by passing the model’s state dict to the instance’s load_state_dict method. (The optimizer state can be restored through similar steps.)
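The pattern looks roughly like this; the file name, model class name, config, and optimizer settings below are placeholders of my own, not the book’s exact code:

```python
import torch

# Saving: bundle both state dicts into a single file.
# (Assumes `model` and `optimizer` already exist from training.)
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth",
)

# Loading: re-create the objects, then restore their saved states.
checkpoint = torch.load("model_and_optimizer.pth")
model = GPTModel(cfg)  # placeholder model class and config
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```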
The upshot is that this allows us to load the publicly released pretrained weights from OpenAI for the GPT-2 reference model architecture. These weights are available in TensorFlow format, but after applying a script to convert them, we can load the weights into a model instance and, as expected, get much better quality outputs from it. That said, the reference architecture and weights are for the GPT-2 small variant, and it’s striking how much worse its generated text is than what I get from free proprietary models in mid-2025! It’s a stark reminder of how shockingly fast progress in LLM research has been over the last five years.