BERT: Sentence Classification on CoLA Dataset

GitHub: https://github.com/rukshar69/Transformers/blob/master/SentenceClassificationTransformersCoLA.ipynb

Here, we use BERT (Bidirectional Encoder Representations from Transformers) with the Hugging Face PyTorch library to efficiently fine-tune a model and reach approximately state-of-the-art performance on a single-sentence classification task: the CoLA (Corpus of Linguistic Acceptability) dataset. In other words, we apply transfer learning in NLP to build high-performance models with greater ease.

Before training on the CoLA dataset, we load a pre-trained BERT model for the following reasons:

  1. Faster Model Creation:
    The pre-trained BERT model weights already encode a lot of information about language, so training our model takes less time and requires fewer epochs for training and validation.
  2. Less Data:
    Since we’re using pre-trained weights, we need a smaller dataset than a model built from scratch. NLP models built from scratch may require very large datasets to reach reasonable accuracy.
  3. Better Results:
    Building on pre-trained weights achieves state-of-the-art results with fewer adjustments and greater ease.
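
As a rough sketch of what loading the pre-trained model and its tokenizer looks like with the Hugging Face Transformers library (the model name `bert-base-uncased` and the two-label setup are assumptions based on the standard CoLA fine-tuning recipe):

```python
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained tokenizer and a BERT model with a classification head on top.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,               # CoLA: grammatically correct vs. incorrect
    output_attentions=False,
    output_hidden_states=False,
)
```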

The CoLA dataset is a set of sentences labeled as either grammatically correct (1) or incorrect (0). The data comes in `tokenized` and `raw` versions. We don’t use the tokenized version. Instead, we use the raw data and apply the pre-trained BERT tokenizer, because:

  1. the model has a specific, fixed vocabulary, and
  2. the BERT tokenizer has a particular way of handling out-of-vocabulary words.

There are approximately 8.5k training samples, of which 10 random examples are shown above. The two features/attributes we use for training are `sentence` and `label` (1 = grammatically correct, 0 = incorrect).
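
To make the column layout concrete, here is a minimal sketch of loading the raw training file with pandas (the file path and column names follow the public CoLA release and are assumptions about the local setup):

```python
import pandas as pd

# The raw CoLA files are headerless TSVs; the path below is an assumed local location.
df = pd.read_csv(
    "./cola_public/raw/in_domain_train.tsv",
    delimiter="\t",
    header=None,
    names=["sentence_source", "label", "label_notes", "sentence"],
)

sentences = df.sentence.values  # the text we classify
labels = df.label.values        # 1 = grammatically correct, 0 = incorrect
```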

For training, the data needs to be split into tokens, and the generated tokens mapped to their indices in the tokenizer vocabulary. We use the pre-trained BERT tokenizer for this. Here it is applied to one sentence from the training data:
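
A quick sketch, assuming the `tokenizer` and `sentences` objects from the snippets above:

```python
# Tokenize one sentence and map the tokens to their vocabulary indices.
sentence = sentences[0]
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Original :", sentence)
print("Tokenized:", tokens)
print("Token IDs:", token_ids)
```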

For tokenizing our data:

  1. Add special tokens to the start and end of each sentence.
  2. Pad or truncate all sentences to a single constant length (BERT accepts at most 512 tokens).
  3. Differentiate between real and padding tokens by creating an “attention mask” (an array with 1s for real tokens and 0s for padding).

So, we determine a maximum sentence length for padding/truncating our sentences; the longest sentence in our training dataset has 47 tokens, but we set the `max_len` attribute of the encoder/tokenizer to 64 to handle longer test sentences.
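
A sketch of encoding every sentence with these settings; `encode_plus` applies all three tokenization steps in one call (argument names such as `padding="max_length"` vary slightly between Transformers versions):

```python
import torch

input_ids = []
attention_masks = []

for sent in sentences:
    encoded = tokenizer.encode_plus(
        sent,
        add_special_tokens=True,     # prepend [CLS], append [SEP]
        max_length=64,               # pad / truncate to 64 tokens
        padding="max_length",
        truncation=True,
        return_attention_mask=True,  # 1 for real tokens, 0 for padding
        return_tensors="pt",
    )
    input_ids.append(encoded["input_ids"])
    attention_masks.append(encoded["attention_mask"])

input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)
```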

A histogram shows the number of tokens generated from the training sentences. After tokenization, we split the data into training and validation sets, with 90% of the data used for training and the rest reserved for validation. A DataLoader iterator helps save memory during training, since it doesn’t load the entire dataset into memory at once.
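
A minimal sketch of the split and the DataLoaders, assuming the tensors built in the previous snippet (the batch size of 32 is an assumption):

```python
from torch.utils.data import (TensorDataset, random_split,
                              DataLoader, RandomSampler, SequentialSampler)

# Wrap the encoded inputs and labels, then split 90% / 10%.
dataset = TensorDataset(input_ids, attention_masks, labels)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# DataLoaders yield one batch at a time, so the whole dataset never sits in memory.
batch_size = 32  # assumed value
train_dataloader = DataLoader(train_dataset,
                              sampler=RandomSampler(train_dataset),
                              batch_size=batch_size)
validation_dataloader = DataLoader(val_dataset,
                                   sampler=SequentialSampler(val_dataset),
                                   batch_size=batch_size)
```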

For training we use `BertForSequenceClassification`, the normal BERT model with a single linear layer on top that serves as a sentence classifier. As training samples are fed in, both the pre-trained BERT model and the additional, untrained classification layer are fine-tuned on the CoLA dataset. Since we’re starting from a pre-trained model, we don’t need many epochs; we confine training to 4 epochs. We use the AdamW optimizer from the Transformers library with a learning rate of 2e-5, and the learning rate is also updated as training progresses. Summary of the training process:
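
A minimal sketch of this fine-tuning loop, assuming the `model` loaded earlier and the `train_dataloader` built above (API details such as `outputs.loss` and the `AdamW` import vary across Transformers versions):

```python
import torch
from transformers import AdamW, get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

epochs = 4
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

for epoch in range(epochs):
    model.train()
    for batch in train_dataloader:
        b_input_ids, b_masks, b_labels = (t.to(device) for t in batch)
        model.zero_grad()
        outputs = model(b_input_ids, attention_mask=b_masks, labels=b_labels)
        loss = outputs.loss  # older versions return a tuple: loss = outputs[0]
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # avoid exploding gradients
        optimizer.step()
        scheduler.step()     # update the learning rate
    # (validation pass over validation_dataloader omitted for brevity)
```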

Validation loss is a more precise measure than accuracy: accuracy ignores the exact output value and only checks whether it crosses a certain threshold, whereas the loss also penalizes correct predictions made with low confidence. Accuracy can’t give us that information.

The validation loss decreases with each epoch, much like the training loss.

We evaluate predictions on the test dataset using the Matthews correlation coefficient (MCC), the standard metric for CoLA. Here, +1 is the best score and -1 is the worst. We use MCC because the classes are imbalanced: in our test set of 516 samples, 354 are positive (68.60%).

We take a look at the score of each batch to understand the variability of the metric between batches. The final score, computed over the entire test set, turns out to be 0.566.
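
A sketch of computing the overall MCC with scikit-learn, assuming `predictions` holds the per-batch logits and `true_labels` the matching label arrays collected during the test-set prediction loop (both names are assumptions):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Flatten per-batch logits into a single array of class predictions.
flat_predictions = np.concatenate(predictions, axis=0)
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = np.concatenate(true_labels, axis=0)

mcc = matthews_corrcoef(flat_true_labels, flat_predictions)
print("Total MCC: %.3f" % mcc)
```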

In a short time, and without much tuning (learning rate, epochs, batch size, Adam parameters, etc.), we are able to get a good score.
