Top Language Models Released In 2021


  • The most recent advances in language modelling are described in research papers.


Language models are essential when developing natural language processing (NLP) applications. However, developing complicated NLP language models from scratch is a time-consuming process.

The top language models for the year 2021 are listed below.


AfriBERTa is a multilingual language model pre-trained on data from 11 African languages totalling less than 1 GB. The researchers demonstrate that this model is competitive with models pre-trained on larger datasets and even outperforms them for certain languages. Their extensive experiments also highlight critical considerations when pretraining multilingual language models on limited data, and they discuss the practical advantages of viable language models trained on small datasets. Finally, the researchers make available the source code, pretrained models, and dataset to spur additional research on multilingual language models for low-resource languages.


For additional details, refer to the article here.


ByT5, a token-free variant of multilingual T5, streamlines the natural language processing (NLP) pipeline by obviating the need for vocabulary generation, text preprocessing, and tokenisation. Despite this, ByT5 is competitive on downstream task quality with parameter-matched mT5 models that use the SentencePiece vocabulary.

ByT5 outperforms mT5 in four distinct scenarios: 

(1) for model sizes less than 1 billion parameters, 

(2) for generative tasks, 

(3) for multilingual tasks using in-language labels, and 

(4) for tasks including various sources of noise.
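Working directly on bytes is what makes ByT5 "token-free". The sketch below illustrates the idea, assuming ByT5's published convention of reserving ids 0–2 for pad/eos/unk and offsetting each raw UTF-8 byte value by 3; it is an illustration of the scheme, not the library implementation.

```python
# Minimal sketch of ByT5-style byte-level "tokenisation": text maps straight
# to UTF-8 bytes, so no vocabulary or tokeniser model is needed.
# Assumed convention: 3 reserved special ids (pad=0, eos=1, unk=2),
# each raw byte value offset by 3.

SPECIAL_IDS = 3  # pad=0, eos=1, unk=2
EOS_ID = 1

def encode(text: str) -> list[int]:
    """Convert text to token ids: one id per UTF-8 byte, plus EOS."""
    return [b + SPECIAL_IDS for b in text.encode("utf-8")] + [EOS_ID]

def decode(ids: list[int]) -> str:
    """Invert encode(), skipping the special ids."""
    data = bytes(i - SPECIAL_IDS for i in ids if i >= SPECIAL_IDS)
    return data.decode("utf-8", errors="ignore")

ids = encode("hi")   # 'h' = byte 104, 'i' = byte 105
print(ids)           # [107, 108, 1]
print(decode(ids))   # hi
```

Because every possible byte gets a fixed id, the same encoder handles any language or script without a learned vocabulary, which is why vocabulary generation drops out of the pipeline.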

For additional details, refer to the article here.


The researchers demonstrated how to pre-train an object detection (OD) model for vision-language (VL) problems using a novel recipe. The new model is larger and better optimised for VL tasks.

The researchers validate the new model through a large-scale empirical investigation. The results demonstrate that the new OD model can significantly improve on state-of-the-art (SoTA) results. Furthermore, the study demonstrates that the improvement is primarily due to the new design choices.

For additional details, refer to the article here.


Google AI released their new NLP model, known as Fine-tuned LAnguage Net (FLAN), which examines a simple technique called instruction fine-tuning, or instruction tuning for short.



FLAN is fine-tuned on a large collection of varied instructions that employ a simple and intuitive description of the task. Creating such a collection of instructions from scratch to fine-tune the model would require a large amount of resources. Instead, templates are used to convert existing datasets into an instructional format.
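The template trick can be sketched in a few lines. The templates and the NLI example below are illustrative stand-ins, not FLAN's actual templates or data:

```python
# Sketch of template-based instruction data creation, in the spirit of FLAN:
# existing labelled examples are rephrased as natural-language instructions.
# The templates below are illustrative, not FLAN's actual ones.

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer: {label}",
    'Read this: {premise}\nCan we conclude that "{hypothesis}"? {label}',
]

def to_instructions(example: dict) -> list[str]:
    """Render one dataset example with every template, multiplying the data."""
    return [t.format(**example) for t in NLI_TEMPLATES]

example = {"premise": "A dog runs in the park.",
           "hypothesis": "An animal is outside.",
           "label": "yes"}
for text in to_instructions(example):
    print(text)
```

Each existing labelled example yields several instruction-formatted training strings, so a modest number of templates turns many standard datasets into a large instruction-tuning corpus.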

For additional details, refer to the article here.


LEXFIT is a procedure for fine-tuning pretrained language models such as BERT into effective decontextualised word encoders via dual-encoder architectures. Experiments showed that LEXFIT can supplement the linguistic knowledge already stored in pretrained LMs with (even small amounts of) external lexical knowledge through further affordable LEXFITing. Furthermore, the researchers successfully applied LEXFIT to languages that lack human-curated external lexical knowledge. In controlled evaluations, the LEXFIT word embeddings (WEs) outperform “conventional” static WEs (e.g., fastText) across a spectrum of lexical tasks in a variety of languages, directly calling into question the practical utility of standard WE models in modern NLP.
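The point of a decontextualised word encoder is that each word gets a single fixed vector, so lexical tasks such as word similarity reduce to plain vector comparisons. A minimal sketch, with toy three-dimensional vectors standing in for real LEXFIT (or fastText) embeddings:

```python
import math

# Toy static word embeddings (illustrative values, not LEXFIT's output).
# Once a LM has been fine-tuned into a decontextualised word encoder,
# each word maps to one fixed vector, and lexical similarity becomes a
# simple cosine between vectors.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similarity(w1: str, w2: str) -> float:
    return cosine(EMBEDDINGS[w1], EMBEDDINGS[w2])

print(similarity("cat", "dog"))  # high: related words
print(similarity("cat", "car"))  # low: unrelated words
```

Benchmarks like word-similarity datasets score such embeddings by how well these cosine values correlate with human judgements, which is the kind of lexical task on which LEXFIT WEs outperform standard static WEs.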

For additional details, refer to the article here.


A Baidu research team published a report on the 3.0 version of Enhanced Representation through kNowledge IntEgration (ERNIE), a deep-learning model for NLP. The model comprises 10B parameters and outperformed the human baseline score on the SuperGLUE benchmark.

Unlike most other deep-learning NLP models, which are trained exclusively on unstructured text, ERNIE’s training data also incorporates structured knowledge graph data. In addition, the model comprises a Transformer-XL “backbone” for encoding the input to a latent representation and two distinct decoder networks. As a result, along with establishing a new top score on SuperGLUE, ERNIE established new state-of-the-art scores on 54 Chinese-language NLP tasks.
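The shared-backbone layout described above can be sketched structurally: the input is encoded once, and both decoder networks read the same latent representation. The functions below are dummy stand-ins for illustration, not the real Transformer-XL or decoder implementations:

```python
# Structural sketch of the layout described above: one shared backbone
# encodes the input once, and two separate task-specific decoders consume
# the same latent representation. All functions are toy stand-ins.

def backbone(tokens: list[str]) -> list[float]:
    """Stand-in for the shared Transformer-XL encoder."""
    return [float(len(t)) for t in tokens]  # dummy "latent representation"

def understanding_decoder(latent: list[float]) -> float:
    """Stand-in comprehension head: pools the latent representation."""
    return sum(latent) / len(latent)

def generation_decoder(latent: list[float]) -> list[float]:
    """Stand-in generation head: transforms each latent position."""
    return [x * 2 for x in latent]

tokens = ["ERNIE", "reads", "text"]
latent = backbone(tokens)                  # encoded once...
score = understanding_decoder(latent)      # ...then consumed by
generated = generation_decoder(latent)     # both decoder networks
```

The design choice this illustrates is weight sharing: both tasks reuse one expensive encoder, and only the lightweight task heads differ.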
