
Building a Large Language Model from scratch, part 5

At last, here is the fifth part of a blog post series where I work through Sebastian Raschka’s Build a Large Language Model from Scratch. Having just covered pretraining for a GPT-2 clone running on my local desktop, let’s dive into fine-tuning for specific tasks!

At this stage in the text, our LLM is set up to, given an initial sequence of words, emit the next words that are most likely to follow. This may be too generic for the desired use case, and it also doesn’t necessarily reflect the desired way of interacting with the model. For example, we might expect to give the LLM instructions or ask it a question, rather than provide a sentence for it to complete. The purpose of fine-tuning is to augment, and further train, the existing model in order to reach a more focused end state. Raschka covers three ways of doing this:

  1. Classification fine-tuning (Chapter 6).
  2. Instruction fine-tuning (Chapter 7).
  3. LoRA (Appendix E).

The previous installments are here: Part 4, Part 3, Part 2, Part 1.

Chapter 6: Fine-tuning for classification

I would characterize this as the “traditional” ML approach. Suppose the end goal of the LLM is to classify the input sequence: for example, spam vs. not spam, who the speaker is, or sentiment analysis. This requires that the last layer of the model outputs target class probabilities, from which the most likely class is chosen. Note that this is very similar to the last layer of the GPT; the difference is that the output dimension maps to the target classes instead of the token vocabulary. The advantage of this approach is that it constrains the model’s output to only the pre-defined classes and therefore can be narrowly focused. Evaluating the correctness of predictions is also much more straightforward: as with pretraining, either the true label is predicted or it isn’t!

To implement classification fine-tuning, the pretrained model must be initialized, modified, and then trained on a more specialized dataset.

  • Initialization: instantiate the desired model with its pretrained weights from earlier training on unlabeled data, as in Chapter 5.
  • Modify model structure: replace the last output layer (i.e. predict the next token) with an output layer for classification (i.e. predict the class). This layer should have one node per class; even though this is less concise than a single output node would be for binary classification, it allows reuse of the cross entropy loss function from pretraining. Then, freeze all layers which need not be retrained, meaning that before replacing the output layer,
    for param in model.parameters():
        param.requires_grad = False
    In practice, it also can be valuable to retrain at least some earlier layers (e.g. the final LayerNorm and the last transformer block), so those can be left unfrozen as well; a minimal sketch follows this list. After this is performed, model inference output will return one label per token in the input sequence; we only care about the very last one, since it has access to context from all the previous tokens in the sequence.
  • Specialized dataset: unlike for pretraining, the dataset should be structured with the input being the complete text sequence, and the target being the true output class label. Note that Raschka recommends padding all messages with the end-of-text token (repeated!) to the length of the longest example.
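
Putting the initialization and modification steps together, here is a minimal sketch of what this looks like in PyTorch. The attribute names (out_head, trf_blocks, final_norm) are my assumptions about how the book’s GPTModel is laid out; adjust them to match your own model class:

    import torch

    num_classes = 2   # e.g. spam vs. not spam
    emb_dim = 768     # GPT-2 small's embedding dimension

    # 1. Freeze everything first...
    for param in model.parameters():
        param.requires_grad = False

    # 2. ...then swap the (emb_dim -> vocab_size) head for an
    #    (emb_dim -> num_classes) head. New layers are trainable by default.
    model.out_head = torch.nn.Linear(in_features=emb_dim, out_features=num_classes)

    # 3. Optionally unfreeze the last transformer block and the final LayerNorm.
    for param in model.trf_blocks[-1].parameters():
        param.requires_grad = True
    for param in model.final_norm.parameters():
        param.requires_grad = True

    # At inference time, classify using only the last token's logits:
    # logits = model(input_token_ids)[:, -1, :]
    # predicted_class = torch.argmax(logits, dim=-1)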

As always, the text presents a worked example that runs pretty efficiently. I can see myself using this approach for many potential future projects, since my use cases tend to be pretty tightly defined in terms of expected output. I can also clearly see the extension of this approach to regression problems, e.g. predict a quantity or score given some input media. Of course in that case the loss function would be different, but that’s another straightforward change to the training loop for fine-tuning.

Chapter 7: Fine-tuning to follow instructions

Now this approach is a more interesting departure from traditional machine learning practices! The key concept here is that if the model is to respond to user instructions, say to perform some task, then we augment the model input with those instructions in natural language. That also means crafting a very deliberate training set reflecting this. The advantage of the instruction fine-tuning method is that it allows the same model to perform a broader range of tasks.

The key insight in preparing the dataset for instruction fine-tuning is that the training data are instruction-response pairs. This is like a sequence-to-sequence ML problem! The instruction-response pairs may be formatted in various prompt styles, according to common standards like Alpaca or Phi-3; what’s important is that the format is consistent. After converting the instruction-response pairs into an appropriate prompt style, each one should form a single sequence of tokens, containing the input then the output. (The conversion may need to add instructions at the start of the prompt, such as "Below is an instruction that describes a task. Write a response that appropriately completes the request.") Then the general approach is like that with pretraining: given the subsequence up to the end of the instruction part of the text, attempt to predict the remainder of the combined text (the response part).
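
To make the formatting step concrete, here is a small sketch of an Alpaca-style prompt builder. The dictionary keys ('instruction', 'input', 'output') are my assumption about how the dataset’s JSON entries are laid out:

    def format_prompt(entry):
        # Standard Alpaca-style preamble, followed by the instruction itself
        instruction_text = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request."
            f"\n\n### Instruction:\n{entry['instruction']}"
        )
        # Some entries carry an optional extra input (e.g. text to operate on)
        input_text = f"\n\n### Input:\n{entry['input']}" if entry.get("input") else ""
        # During training, the target response is appended to the same sequence
        response_text = f"\n\n### Response:\n{entry['output']}"
        return instruction_text + input_text + response_text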

The nuances in implementing this relate to the dataloaders for the instruction-response pairs. As before, the PyTorch Dataset subclass’s init method will format each entry and pretokenize the input. (Note that unlike the dataloaders for pretraining, the text doesn’t include an explicit stride parameter.) This again yields a 1:1 mapping between instruction-response pairs and training records. However, the PyTorch DataLoader may require a custom collation function, which processes individual Dataset records into batches. Locating critical processing logic in the collation function is useful because the DataLoader can run it in parallel with the model training loop (in worker processes, when num_workers > 0). This collation function (sketched after the list below):

  • Pads sequences to the longest sequence length in each batch.
  • Splits sequences into inputs and targets, so that the target is the input sequence (again, including the instruction and the response) shifted by one token.
  • In the target list only, replaces all but the first end-of-sequence token for each record with a null token (e.g. -100) to exclude them from the loss calculation. For example, a target sequence after this step may look like [318, 281, 12064, 326, 8477, 257, 4876, 13, ..., 50256, -100, -100, -100]. The null token index must match the ignore_index argument of PyTorch’s cross_entropy function; its default value is ignore_index = -100. Note that apparently, whether also masking out the instructions with -100 is beneficial or not is still an open research question!
  • Optionally, clips the sequences according to an allowed max length parameter.
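
Here’s a sketch of such a collation function implementing the steps above. It assumes each Dataset record is already a list of token IDs and that 50256 is the <|endoftext|> ID from the GPT-2 tokenizer; it is meant to illustrate the logic, not to reproduce the book’s code verbatim:

    import torch

    def custom_collate_fn(batch, pad_token_id=50256, ignore_index=-100,
                          allowed_max_length=None, device="cpu"):
        # Pad to the longest sequence in the batch, plus one extra token so
        # every record ends with at least one <|endoftext|>
        batch_max_length = max(len(item) + 1 for item in batch)
        inputs_lst, targets_lst = [], []

        for item in batch:
            new_item = list(item) + [pad_token_id]
            padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
            inputs = torch.tensor(padded[:-1])   # drop the last token
            targets = torch.tensor(padded[1:])   # shift left by one token

            # Mask every padding token in the targets except the first one,
            # so padding doesn't contribute to the cross-entropy loss
            pad_positions = torch.nonzero(targets == pad_token_id).squeeze(-1)
            if pad_positions.numel() > 1:
                targets[pad_positions[1:]] = ignore_index

            # Optionally clip to an allowed maximum context length
            if allowed_max_length is not None:
                inputs = inputs[:allowed_max_length]
                targets = targets[:allowed_max_length]

            inputs_lst.append(inputs)
            targets_lst.append(targets)

        return torch.stack(inputs_lst).to(device), torch.stack(targets_lst).to(device)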

One question that immediately jumped out at me is why the dataloader shouldn’t emit only records where the input/target split falls at the last instruction token. Is dynamically locating that boundary for each record too hard or costly at scale? Are there advantages to keeping training records during fine-tuning where the next target token is still inside the instruction subsequence? Or would splitting at the boundary simply be more efficient in terms of dataset size? It’s still unclear to me.

Next, the model to fine-tune must be initialized as with classification fine-tuning, but the model architecture need not be modified at all. Note that instruction fine-tuning does not work well with smaller models such as GPT-2 small - they lack the richness of representation needed for the more complex inputs and contexts required here. By default, the model will return the entire input and output (prediction) combined, so a wrapper around the initialized model will need to subtract the length of the input instruction subsequence from the returned prediction.
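
In code, that wrapper can be as small as slicing the prompt off the front of the generated sequence. A sketch, assuming a generate helper like the one used for pretraining and a tiktoken-style tokenizer (both names are assumptions):

    # Token IDs for the formatted instruction prompt
    input_ids = tokenizer.encode(prompt_text)
    # generate() returns the prompt plus the newly generated continuation
    output_ids = generate(model, torch.tensor(input_ids).unsqueeze(0), max_new_tokens=256)
    # Keep only the tokens after the instruction subsequence
    response_text = tokenizer.decode(output_ids.squeeze(0).tolist()[len(input_ids):])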

Unlike with classification fine-tuning, and due to the wider variety in supported outputs, evaluating response quality is much more subjective. The fine-tuning training loop still uses cross-entropy loss for the literal weights optimization, but this isn’t very useful for measuring how well the LLM follows the provided instructions. Some commonly used options for evaluating performance holistically include running the model against benchmark data sets (e.g. MMLU), human preference voting, and automated benchmarks using another, more advanced LLM to evaluate the responses.

Raschka works through a detailed example of this last option, and this was where I had another eureka moment. Maybe I could better describe it as a facepalm moment? My immediate reaction was that, at least as described, this is simultaneously so clever and so crude that I laughed out loud. This technique is very much in the spirit of generative adversarial networks (GANs), but not even so sophisticated as to feed the response from the discriminator directly back into the training loop. The concrete example procedure in the text is:

  • Self-host a more advanced, general purpose LLM such as Llama 3 or Phi-3 using Ollama.
  • Serve the advanced model in a separate Ollama server process. This is going to be analogous to the discriminator model in a GAN.
  • For each prediction in the test set emitted by the fine-tuned model:
    • Make a REST API call to the local Ollama server, with the instruction to assign a quality score (0-100) to the fine-tuned model’s response to the test instruction.
    • Parse the responses from Ollama to extract the scores, and analyze (e.g. compute average score)!

So…

…outsource evaluating the response quality to a fancier LLM.

Honestly, I think this is hilarious! But I can totally see it working!
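
For concreteness, the REST call at the heart of this is tiny. Here’s a sketch using Ollama’s /api/chat endpoint; the judging prompt is my own paraphrase of the idea rather than the book’s exact wording:

    import json
    import urllib.request

    def query_ollama(prompt, model="llama3", url="http://localhost:11434/api/chat"):
        # Send one non-streaming chat request to the local Ollama server
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        }
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())["message"]["content"]

    # For each test entry, ask the judge model for a 0-100 score, e.g.:
    # score = query_ollama(
    #     f"Given the input `{instruction}` and the correct output `{expected}`, "
    #     f"score the model response `{predicted}` on a scale from 0 to 100, "
    #     "where 100 is best. Respond with the integer number only."
    # )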

Appendix E: Low-rank adaptation (LoRA)

Now, I had already encountered this when playing with StableDiffusion and similar latent diffusion image generation models on my local desktop. From a practical perspective, I wasn’t very clear on how LoRA works behind the scenes, just that augmenting a base model with a LoRA model allowed for fine-tuning output at a fraction of the memory requirements (or training time requirements, if developing from scratch). This section in Build a Large Language Model from Scratch effectively resolved my confusion.

The key insight for LoRA is that instead of retraining some chunk of the original model, one can add a very simple submodel in parallel with that chunk and “cheaply” train that new component; if the original chunk and the new submodel share the same input and their outputs are added together, then the trainable new submodel learns how to adjust the original output in order to accurately handle the new training data for fine-tuning!

More explicitly, instead of updating the weights matrix during fine-tuning ($\mathbf{W}_{finetuned} = \mathbf{W}_{original} + \Delta \mathbf{W}$), train a pair of matrices whose product approximates $\Delta \mathbf{W}$ and add that product to $\mathbf{W}_{original}$ at inference time. This works because while only a complex architecture like a transformer block can capture the nuances of the generic pretraining data, the relatively tiny adjustments needed to represent how the fine-tuning data differs from the original dataset don’t need anything nearly as sophisticated. This has some valuable advantages:

  • It’s highly memory efficient: $\text{dim}(\mathbf{AB}) = \text{dim}(\mathbf{W})$ even though $\text{dim}(\mathbf{A})$ and $\text{dim}(\mathbf{B})$ are much smaller than $\text{dim}(\mathbf{W})$.
  • The fine-tuned updates are separable from the original model: just persist the modifications separately, and they can be plugged in on demand.
  • While there is a tradeoff between the additional computation steps in the forward pass vs. the reduced cost of backpropagation, LoRA is vastly more efficient for large models than full-parameter classification or instruction fine-tuning.
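
To put rough numbers on the memory savings: for a single 768×768 projection matrix in GPT-2 small, a full $\Delta \mathbf{W}$ update would mean 768 × 768 = 589,824 trainable values, while a rank $r = 16$ LoRA pair needs only 768 × 16 + 16 × 768 = 24,576, roughly 4% as many (the choice of $r = 16$ here is just illustrative).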

This was another eureka moment, but now I was grinning. Now I get the hype.

How is this implemented in the text? As referenced above, the LoRA submodel’s architecture consists of a pair of matrices $\mathbf{A}$ and $\mathbf{B}$ with dimensions $\mathbf{A}_{(in, r)}$ and $\mathbf{B}_{(r, out)}$. Its scaled product in the forward pass, $\alpha (\mathbf{xAB})$, is added to the output of the submodel $M(x)$ which it runs alongside. Note that $r$ is a hyperparameter that can be thought of as controlling how compressed the internal state representation is. Similarly, the scaling hyperparameter $\alpha$ determines the influence of the LoRA relative to the original submodel. To substitute this into the existing model architecture, we can define a new LinearWithLoRA layer consisting of the original linear layer, in parallel with the LoRA layer, with a forward pass method that adds their respective outputs. As with classification fine-tuning, we can then define a helper function that replaces all (or just some, if desired) linear layers in the original model with LinearWithLoRA layers. The replacement operation should first freeze the original model’s parameters, then copy over the linear layers’ weights to their respective LinearWithLoRAs’ linear components before finalizing the substitution.
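
Here’s a compact sketch of those pieces. Passing the original Linear layer into the wrapper keeps its pretrained weights, which has the same effect as copying them over; the initialization choices below are common practice rather than necessarily the book’s exact ones:

    import math
    import torch

    class LoRALayer(torch.nn.Module):
        # The low-rank pair: alpha * (x @ A @ B), with A of shape (in, r) and B of shape (r, out)
        def __init__(self, in_dim, out_dim, rank, alpha):
            super().__init__()
            self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
            torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
            self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))  # zeros: starts as a no-op
            self.alpha = alpha

        def forward(self, x):
            return self.alpha * (x @ self.A @ self.B)

    class LinearWithLoRA(torch.nn.Module):
        # Frozen original linear layer in parallel with a trainable LoRA adjustment
        def __init__(self, linear, rank, alpha):
            super().__init__()
            self.linear = linear
            self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

        def forward(self, x):
            return self.linear(x) + self.lora(x)

    def replace_linear_with_lora(module, rank, alpha):
        # Recursively swap every torch.nn.Linear for a LinearWithLoRA wrapper
        # (freeze the original model's parameters before calling this)
        for name, child in module.named_children():
            if isinstance(child, torch.nn.Linear):
                setattr(module, name, LinearWithLoRA(child, rank, alpha))
            else:
                replace_linear_with_lora(child, rank, alpha)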

Final thoughts

Even if I never pick up Build a Large Language Model from Scratch again after finishing this blog post series, it will have been well worth the purchase price. I benefited greatly from this lucid and systematic walkthrough of the core concepts behind GPT-style LLMs. What sets it apart from other resources I’ve found, such as online guides or research papers, is the seamless pairing of conceptual explanation with hands-on, tutorial-style walkthroughs at multiple levels of detail. For someone like me who learns best by doing, this was invaluable!

I’m glad I went through the exercise of writing this blog post series, too. Just like with my blog writ large, this forced me to make sure I really understood the material - and think about where I still had lingering questions and gaps in understanding - before I was comfortable publishing my thoughts. While I did read through Giles Thomas’ own blog post series on the topic beforehand, and it inspired me to take this on, I deliberately did not reference it once I started reading through the book on my own. I wanted to approach it with an independent view, without too much influence from his own journey through the text. Plus, whereas his blog posts do a very thorough job walking through the details of the text, I wanted to stay at a higher level of summary and focus more on my own commentary. If you want the full treatment from Raschka, buy the book! I hope it is clear by now that I think it is well worth it.

As for where this takes me next: I definitely have topics I’m interested in exploring further now. I’ve already mentioned a few in these past five installments, and they’ll likely become future posts on this blog. I’ll certainly have more confidence when trying to understand more recent advances in LLMs; after all, the GPT-2 reference architecture in the text was chosen for its accessibility, not for being on the cutting edge! Plus, I now have practice with some techniques that I can bring to other project work, even if I’m more interested in regression or classification applications than traditional NLP ones.

I may not be an expert on LLM AI after reading this book, but… I can say that yes, I have built my own large language model from scratch.