A few years ago, transfer learning was largely absent from natural language processing; it was mostly confined to computer vision, where it was already prevalent. NLP models were trained for one particular task and evaluated on that same task, with no scope for reusing the trained model anywhere else.
But over the years, research by big tech and academia made transfer learning possible for NLP models as well. This happened with the rise of the Transformer architecture, which paved the way for large language models (LLMs).
With the advent of LLMs, it became possible to use a single model for multiple tasks. LLMs are first trained on a large corpus of text (pre-training); the usual pre-training objective is to predict the next word in a sentence. Once the model is good at this, it is assumed to have built a useful general understanding of the language in its training data. The pre-trained model can then be fine-tuned for many other NLP tasks (commonly referred to as downstream tasks) such as translation, sentiment analysis, named entity recognition (NER), text classification, and text generation. In short, LLMs made transfer learning possible for NLP.
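To make this concrete, here is a minimal sketch of reusing pre-trained models for different downstream tasks. It uses the Hugging Face transformers library, which is one popular way to do this rather than anything prescribed by GLUE; the pipelines shown download default pre-trained checkpoints, so treat the exact models and outputs as illustrative assumptions.

```python
# A minimal sketch of transfer learning in NLP with the Hugging Face
# transformers library (assumed installed: pip install transformers torch).
# The same idea — a pre-trained model plus a task-specific head — powers
# each of these ready-made pipelines.
from transformers import pipeline

# Sentiment analysis (text classification) with a default pre-trained model
sentiment = pipeline("sentiment-analysis")
print(sentiment("The GLUE benchmark made model comparison much easier."))

# Named entity recognition with a default pre-trained model
ner = pipeline("ner")
print(ner("GLUE was created by researchers at NYU and the University of Washington."))
```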
Now, with transfer learning in NLP, the same language model can be used for multiple tasks, so the old practice of evaluating a model on a single task is no longer sufficient. A different approach was needed: evaluate the model across many tasks to find out how well it understands natural language overall.
This need gave rise to the General Language Understanding Evaluation (GLUE) benchmark. Researchers from NYU, the University of Washington, and DeepMind put together nine datasets, one per task, for evaluating natural language understanding (NLU) systems, the category most LLMs fall into. The datasets were selected by expert researchers in the field. GLUE also has a website that hosts the datasets, provides starter code for running models on the tasks, and maintains a leaderboard of the models submitted for evaluation. The average score across all nine tasks is used to rank a model on the leaderboard. Because the nine datasets probe models from different perspectives, they help us understand how well a model understands natural language.
Now, let me introduce the nine tasks for which datasets are provided. They fall into three groups:
Single-Sentence Tasks (CoLA and SST-2)
Similarity and Paraphrase Tasks (MRPC, QQP, and STS-B)
Inference Tasks (MNLI, QNLI, RTE, and WNLI)
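Each of these datasets can be loaded programmatically. The sketch below uses the Hugging Face datasets library, which hosts the GLUE tasks; this is one convenient way to get the data and is my own assumption, not the benchmark's official starter code.

```python
# A hedged sketch: loading one of the nine GLUE datasets with the
# Hugging Face datasets library (assumed installed: pip install datasets).
from datasets import load_dataset

# "sst2" is a single-sentence sentiment task; the other configs are
# "cola", "mrpc", "qqp", "stsb", "mnli", "qnli", "rte", and "wnli".
sst2 = load_dataset("glue", "sst2")

print(sst2)              # train / validation / test splits
print(sst2["train"][0])  # one example: a sentence, its label, and an index
```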
A score is calculated for each of these nine tasks, and the average of those scores is what ranks the model on the leaderboard.
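The ranking itself is essentially a macro-average of the per-task scores. The sketch below illustrates the idea with made-up numbers; the official leaderboard combines more than one metric for some tasks (accuracy, F1, Matthews or Pearson correlation), so this is a simplification.

```python
# Rough sketch of a leaderboard-style average over per-task scores.
# The numbers here are invented for illustration only.
task_scores = {
    "cola": 60.0, "sst2": 94.0, "mrpc": 88.0, "stsb": 87.0, "qqp": 72.0,
    "mnli": 86.0, "qnli": 91.0, "rte": 70.0, "wnli": 65.0,
}
glue_average = sum(task_scores.values()) / len(task_scores)
print(f"GLUE average: {glue_average:.1f}")
```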
Apart from models, the leaderboard also includes a human baseline, where human performance on the same tasks is reported. Currently, many models have surpassed this human baseline.
To evaluate your model on the various GLUE tasks, you need to adapt its input and output to each task. Tasks that take a pair of sentences as input require the input pipeline to handle two sentences at once, and a task-specific output layer (head) needs to be added for each task.
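For example, a sentence-pair task like MRPC (paraphrase detection) can be handled by letting the tokenizer pack both sentences into one input and by placing a small classification head on top of the pre-trained encoder. The sketch below uses the Hugging Face transformers library; the checkpoint and the two-label setup are illustrative assumptions, not a setup prescribed by GLUE.

```python
# A minimal sketch of adapting one pre-trained encoder to a GLUE
# sentence-pair task, using an assumed BERT checkpoint for illustration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# num_labels=2 attaches a fresh two-way classification head on top of the
# pre-trained encoder; this head is what gets trained during fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Sentence-pair tasks are handled on the input side: the tokenizer joins
# the two sentences with the model's separator token.
inputs = tokenizer(
    "The company bought the startup.",
    "The startup was acquired by the company.",
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.logits)  # raw scores for the two classes (not paraphrase / paraphrase)
```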
That was a small introduction to the GLUE benchmark. Feel free to explore other blogs on our website.
If you want to