Fastbook NLP Deep Dive: RNNs Q&A

Deep Learning
fastai
NLP
fastbook
Author

Harish

Published

April 26, 2024

In this post, I answer all the questions from Chapter 10 of Fastbook, “NLP Deep Dive: RNNs”.

1. What is “self-supervised learning”?

Training a model using labels that are embedded in the independent variable, rather than requiring external labels. For example, training a language model to predict the next word in a text.

2. What is a “language model”?

A language model is a model that has been trained to guess the next word in a given passage of text.

3. Why is a language model considered self-supervised?

A language model is considered self-supervised because no external labels are provided during training: the labels (the next words) are embedded in the text itself. The model learns to predict the next word by reading lots of text.

4. What are self-supervised models usually used for?

  • Problems where labeled data is not adequate
  • Language models
  • Pre-training models for transfer learning

5. Why do we fine-tune language models?

A language model might be trained on a corpus that is different from the task at hand. Fine-tuning adapts it to the task-specific corpus.

6. What are the three steps to create a state-of-the-art text classifier?

  1. Train a language model on a large corpus of text (such as Wikipedia), or use a pre-trained language model.
  2. Fine-tune the language model on the text classification dataset (the target corpus).
  3. Fine-tune the classifier, using the fine-tuned language model as its encoder (a fastai sketch of all three steps follows below).
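A minimal fastai sketch of the three steps, assuming the IMDb data layout from the chapter (the epoch counts, learning rates, and encoder name are placeholder assumptions):

from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Steps 1 + 2: start from the Wikipedia-pretrained AWD-LSTM and fine-tune it
# as a language model on the target corpus (all IMDb reviews, incl. unsup)
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
learn_lm.fine_tune(1, 1e-2)
learn_lm.save_encoder('finetuned_encoder')

# Step 3: build the classifier on top of the fine-tuned encoder
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('finetuned_encoder')
learn_clas.fine_tune(1, 2e-2)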

7. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?

We can use the unlabelled movie reviews to fine-tune the language model so that it learns the language of movie reviews; this requires no labelling. We can then use this fine-tuned language model (which knows how to predict the next word in a movie review!) as the base for our text classifier.

8. What are the three steps to prepare your data for a language model?

  1. Tokenization: text to tokens
  2. Numericalization: tokens to integers
  3. Batching: the stream of documents is concatenated and cut into fixed-size inputs and outputs (the targets are the inputs offset by one token). This is handled by LMDataLoader (see the sketch below).
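A compact sketch of the three steps using fastai’s lower-level pieces, assuming txts is an L list of raw texts (e.g. from get_text_files), as in the chapter:

from fastai.text.all import *

# 1. Tokenization: text -> tokens
tok = Tokenizer(WordTokenizer())
toks = txts.map(tok)

# 2. Numericalization: tokens -> integers (setup builds the vocab)
num = Numericalize()
num.setup(toks)
nums = toks.map(num)

# 3. Batching: concatenate everything into one stream and cut it into
#    fixed-size inputs and targets (targets are the inputs shifted by one)
dl = LMDataLoader(nums)
x, y = first(dl)
x.shape, y.shape   # e.g. (64, 72): batch size x sequence length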

9. What is “tokenization”? Why do we need it?

Tokenization is the process of converting raw text into a list of tokens (words, characters, or substrings, depending on the granularity of the model). It lets us represent each token numerically, which is what models can work with (as opposed to raw text).

10. Name three different approaches to tokenization.

  1. Word-based: Split a sentence on spaces (or on language-specific rules that define what a word is). E.g.: don’t -> do, n’t
  2. Subword-based: Split words into smaller parts, based on the most commonly occurring substrings.
  3. Character-based: Split a sentence into its individual characters (the sketch below illustrates all three approaches).
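A rough sketch of the three approaches with fastai (txts, a list of raw training texts for the subword tokenizer, is assumed to exist; outputs are approximate):

from fastai.text.all import *

txt = "I don't like tokenization bugs."

# 1. Word-based (spaCy rules under the hood)
word_tok = WordTokenizer()
print(first(word_tok([txt])))     # roughly ['I', 'do', "n't", 'like', ...]

# 2. Subword-based: first learn the most common substrings from a corpus
sub_tok = SubwordTokenizer(vocab_sz=1000)
sub_tok.setup(txts)               # txts: assumed list of raw texts
print(first(sub_tok([txt])))

# 3. Character-based: every character becomes a token
print(list(txt))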

11. What is xxbos?

xxbos is a special token added by the fastai tokenizer that indicates the beginning of a stream (text). With it, the model can learn that it needs to forget what was said previously and focus on the upcoming words.

12. List four rules that fastai applies to text during tokenization.

  • replace_wrep: Replaces any word repeated three times or more with a special token for word repetition (xxwrep), the number of times it’s repeated, then the word.
  • rm_useless_spaces: Removes all repetitions of the space character.
  • replace_maj: Lowercases a capitalized word and adds a special token for capitalization (xxmaj) in front of it.
  • lowercase: Lowercases all text and adds a special token at the beginning (xxbos) and/or the end (xxeos).
from fastai.text.core import lowercase, replace_wrep, rm_useless_spaces, replace_maj

lowercase('My name is Harish.')
# 'xxbos my name is harish.'

replace_maj("My name is Harish.")
# 'xxmaj my name is xxmaj harish.'

rm_useless_spaces("My    name  is      Harish.")
# 'My name is Harish.'

replace_wrep("My name is harish harish harish")
# 'My name is xxwrep 3 harish '

13. Why are repeated characters replaced with a token showing the number of repetitions and the character that’s repeated?

In this way, the model’s embedding matrix can encode information about general concepts such as repeated punctuation (or any repeated character) rather than requiring a separate token for every possible number of repetitions.
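fastai implements this with replace_rep (the example string is my own; the output shown is approximate):

from fastai.text.core import replace_rep

replace_rep('I am sooooo happy')
# roughly 'I am s xxrep 5 o  happy' -- the five repeated o's become one xxrep token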

14. What is “numericalization”?

Numericalization is the process of mapping tokens to integers.
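A toy illustration of the idea in plain Python (the tokens and the vocabulary here are made up):

toks  = ['xxbos', 'xxmaj', 'my', 'name', 'is', 'xxmaj', 'harish', '.']

# Build a vocab (token -> integer index) and map every token through it
vocab = {tok: idx for idx, tok in enumerate(sorted(set(toks)))}
ids   = [vocab[tok] for tok in toks]

print(ids)    # [5, 6, 3, 4, 2, 6, 1, 0]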

15. Why might there be words that are replaced with the “unknown word” token?

The corpus will contain many rare words for which there isn’t enough data to learn useful representations. These words can be replaced with the xxunk token. This avoids an overly large embedding matrix full of rare words, which would slow down training and use up too much memory.

16. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book’s website.)

  • \(Batch\ Size = 64\): the dataset is split into \(64\) streams of text.
  • \(Sequence\ Length = 64\): each row \(i\) of an individual batch holds \(64\) consecutive tokens from the \(i^{th}\) stream of text.
  • The \(1^{st}\) row of the \(1^{st}\) batch contains \(64\) tokens from the \(1^{st}\) stream, starting at the \(0^{th}\) token.
  • The \(2^{nd}\) row of the \(1^{st}\) batch contains \(64\) tokens from the \(2^{nd}\) stream, starting at the \(0^{th}\) token.
  • The \(1^{st}\) row of the \(2^{nd}\) batch contains \(64\) tokens from the \(1^{st}\) stream, starting at the \(64^{th}\) token (the toy example below makes the indexing concrete).
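A toy illustration with deliberately small numbers (6 streams of 15 tokens and a sequence length of 5 stand in for 64/64, just to make the arithmetic visible):

import torch

stream = torch.arange(90)       # the whole corpus as one stream of token ids
bs, seq_len = 6, 5

# Split the stream into `bs` contiguous sub-streams, one per row
rows = stream.view(bs, -1)      # shape (6, 15)

batch0 = rows[:, 0*seq_len:1*seq_len]
batch1 = rows[:, 1*seq_len:2*seq_len]

print(batch0[0])   # 1st row, 1st batch: tensor([0, 1, 2, 3, 4])
print(batch0[1])   # 2nd row, 1st batch: tensor([15, 16, 17, 18, 19])
print(batch1[0])   # 1st row, 2nd batch: tensor([5, 6, 7, 8, 9])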

17. Why do we need padding for text classification? Why don’t we need it for language modeling?

PyTorch DataLoaders need to collate all the items in a batch into a single tensor, and a single tensor has a fixed shape. We can’t crop text the way we crop images to bring all inputs to a fixed size, so we pad instead, using a special padding token that the model will ignore.

How?

We won’t pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size. Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same lengths (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend to be of similar lengths.

Why don’t we need padding for language modelling?

In language modelling, the input to the model is one big corpus of text concatenated into a single stream. There is no explicit notion of a first sample and a second sample of differing sizes; every row of an individual batch is simply a fixed-length, in-order slice of that big corpus.
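For the classification case, a minimal sketch of what padding-based collation looks like in raw PyTorch (fastai handles this internally; the documents and padding index 1 are made-up assumptions):

import torch
from torch.nn.utils.rnn import pad_sequence

docs = [torch.tensor([5, 8, 2]),            # three "documents" of different lengths
        torch.tensor([7, 1, 9, 4, 3]),
        torch.tensor([6, 2])]

# Pad every document to the length of the longest one in this batch,
# using a dedicated padding index the model will learn to ignore
batch = pad_sequence(docs, batch_first=True, padding_value=1)
print(batch.shape)   # torch.Size([3, 5]) -- a single fixed-shape tensor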

18. What does an embedding matrix for NLP contain? What is its shape?

The embedding matrix contains a vector representation for every token in the vocabulary: it encodes each token as a vector. Its shape is |vocab| rows by x columns, where x is the embedding size (400 for the model used in the chapter; it varies depending on how the embedding layer is defined).
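A minimal sketch of the shapes (the 60,000-token vocab size is a made-up assumption; 400 matches the embedding size mentioned above):

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=60_000, embedding_dim=400)
print(emb.weight.shape)       # torch.Size([60000, 400]): one row per token

token_ids = torch.tensor([[2, 15, 933]])   # a batch of numericalized tokens
print(emb(token_ids).shape)   # torch.Size([1, 3, 400]): each id looks up its row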

19. What is “perplexity”?

Perplexity is the exponential of the loss function used for language models (cross-entropy): torch.exp(cross_entropy).

In general, perplexity is a measurement of how well a probability model predicts a sample. It quantifies how uncertain the model is in predicting the next word in a sequence. Lower perplexity values indicate better performance, as the model is less “perplexed” (puzzled) by the data.
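A tiny sketch of the computation (random logits and targets, just for illustration; fastai exposes the same quantity as the Perplexity() metric):

import torch
import torch.nn.functional as F

logits  = torch.randn(10, 100)           # 10 predictions over a 100-word vocab
targets = torch.randint(0, 100, (10,))   # the actual next words

loss = F.cross_entropy(logits, targets)
perplexity = torch.exp(loss)             # perplexity = exp(cross-entropy)
print(loss.item(), perplexity.item())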

20. Why do we have to pass the vocabulary of the language model to the classifier data block?

The reason that we pass the vocab of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won’t make any sense to this model, and the fine-tuning step won’t be of any use.

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

21. What is “gradual unfreezing”?

It’s a fine-tuning technique where we unfreeze a few layers at a time until the whole model is unfrozen. For NLP classifiers it makes a real difference, and it is done using the learn.freeze_to method.

learn.fit_one_cycle(1, 2e-2) # by default only the last layer group is unfrozen for pretrained models
learn.freeze_to(-2) # unfreeze last two param groups
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.freeze_to(-3) # unfreeze last three param groups
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
learn.unfreeze() # unfreeze all layers
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

Here discriminative learning rates are used: the slice(...) spreads smaller learning rates over the earlier layers and larger ones over the later layers.

22. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?

Classification algorithms can be used to automatically recognise auto-generated content. The problem, however, is that this will always be an arms race: a better classification (discriminator) algorithm can in turn be used to train a better generation algorithm.