In this article, we’ll be discussing the renowned GPT-3 model proposed in the paper “Language Models are Few-Shot Learners” by OpenAI. It is the successor of GPT-2, and its architecture is very similar to that of GPT-2.
If you’re unfamiliar with GPT-2, consider giving my article on GPT-2 a read first; since most of GPT-3 builds on it, that background will help in understanding the model better.
Going back to GPT-2, it is essentially an autoregressive model based on the Transformer architecture (Vaswani et al.). But the novelty of GPT-2 lies in its pre-training approach.
The pre-training leverages multi-task learning at the dataset level: the task to be performed is specified as part of the input itself, so a single language model can be conditioned to perform different NLP tasks. …
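To make “the input tells the model which task to perform” a little more concrete, here is a minimal sketch using Hugging Face’s text-generation pipeline with the publicly available gpt2 checkpoint. The checkpoint and the prompts are illustrative choices of mine, not the exact setup from the GPT-2/GPT-3 papers, and the small gpt2 model won’t do these tasks reliably; the point is only how the task is encoded in the input.

```python
from transformers import pipeline

# Minimal sketch: the task is encoded in the prompt itself, so the same
# language model can be steered towards different NLP tasks.
# `gpt2` and these prompts are illustrative, not the papers' exact setup.
generator = pipeline("text-generation", model="gpt2")

prompts = [
    "Translate English to French: cheese =>",                 # translation as LM
    "The movie was dull and predictable. Sentiment:",         # sentiment as LM
    "A long news article would go here. TL;DR:",              # summarization as LM
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    print(out[0]["generated_text"])
```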
Dynamic Programming is a mathematical optimization approach typically used to improve recursive algorithms. It basically involves breaking a large problem down into smaller sub-problems. There are two properties that a problem must exhibit to be solved using dynamic programming: optimal substructure (the optimal solution can be composed from optimal solutions to its sub-problems) and overlapping sub-problems (the same sub-problems recur, so their solutions can be computed once and reused).
We’ll be discussing ‘Planning in RL’ using dynamic programming. Planning requires complete knowledge of the environment (usually as an MDP), or a model of the environment, in advance. Using this knowledge, we can solve for the optimal policy.
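To make “planning with full knowledge of the MDP” concrete, here is a minimal sketch of value iteration on a tiny, made-up 3-state MDP. The states, transition probabilities, and rewards below are invented purely for illustration; only the Bellman backup itself is the real idea.

```python
import numpy as np

# Tiny, made-up MDP purely for illustration: 3 states, 2 actions.
# P[a][s, s'] = probability of moving from s to s' under action a.
# R[a][s]     = expected immediate reward for taking action a in state s.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([
    [0.0, 1.0, 0.0],   # action 0
    [0.5, 2.0, 0.0],   # action 1
])

# Value iteration: repeatedly apply the Bellman optimality backup
# V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * V(s') ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum("ast,t->as", P, V)  # Q[a, s]
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)  # greedy policy w.r.t. the converged values
print("State values:", V)
print("Greedy policy (action per state):", policy)
```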
Transformer-based language models have been leading the NLP benchmarks lately. Models like BERT and RoBERTa have been state-of-the-art for a while. However, one major drawback of these models is that they cannot “attend” to long sequences, since the cost of full self-attention grows quadratically with sequence length. For example, BERT is limited to a maximum of 512 tokens at a time.
To overcome these long-sequence issues, several approaches burgeoned. Models like Transformer-XL and Reformer propose ways to reduce the computation and memory cost of attention, and hence the complexity. I have already covered Transformer-XL and the Reformer in separate articles; consider giving them a read if you’re interested.
In this article, we’ll be discussing the Longformer model proposed by Allen AI in the paper, “Longformer: The Long-Document Transformer.” It is a transformer-based architecture that reformulates the self-attention computation so that it scales linearly with sequence length, reducing the computational complexity. …
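To get a feel for how restricting attention to a local window reduces the cost, here is a rough NumPy sketch of a sliding-window attention mask. This only illustrates the core idea; the actual Longformer also adds dilated and global attention patterns, and the sequence length and window size below are arbitrary.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may only attend to positions
    within window // 2 tokens on either side of it."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

n, w = 4096, 512
mask = sliding_window_mask(n, w)

# Full self-attention touches n*n pairs; the windowed pattern only about n*w,
# which is what makes the cost linear in sequence length for a fixed window.
print("full attention entries:    ", n * n)
print("windowed attention entries:", int(mask.sum()))
```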
We want our models to train really fast. We use GPUs to make operations execute faster. However, even after speeding up the computations, the pipeline itself may have inefficiencies, and the model may still train slowly. In such cases, it becomes really difficult to debug the code, or even to tell which part is slow.
This can be addressed by using the TensorFlow Profiler. The Profiler ‘profiles’ the execution of TensorFlow code. In this article, we’ll be discussing the Profiler, how to use it, the best practices, and how to optimize GPU performance.
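As a quick preview, here is roughly how profiling is switched on in TensorFlow 2.x, either programmatically or through the Keras TensorBoard callback. The log directories and the batch range are placeholder values of my choosing.

```python
import tensorflow as tf

# Option 1: profile an arbitrary block of code programmatically.
# "logs/profile" is just a placeholder log directory.
tf.profiler.experimental.start("logs/profile")
# ... run the training steps you want to inspect ...
tf.profiler.experimental.stop()

# Option 2: let Keras capture a profile for a range of batches
# (here batches 10-20, an arbitrary choice) during model.fit().
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs/fit",
    profile_batch=(10, 20),
)
# model.fit(x_train, y_train, epochs=2, callbacks=[tb_callback])
```

The captured profile can then be inspected in TensorBoard’s Profile tab.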
Multilingual Language Models are one of the recent milestones of NLP research and a step towards generalizing NLP algorithms across languages. Masked language models (MLMs) like multilingual BERT (mBERT) and XLM (Cross-lingual Language Model) have achieved state-of-the-art results on cross-lingual benchmarks.
In this article, we’ll discuss the XLM-RoBERTa (or XLM-R) model proposed in “Unsupervised Cross-lingual Representation Learning at Scale.” This paper essentially analyses how training a cross-lingual model at scale can significantly boost performance, and proposes a new model that achieves state-of-the-art results on cross-lingual tasks.
XLM-RoBERTa is trained on a multilingual masked language modeling objective using only monolingual data. This basically means streams of text are sampled from each of the languages, and the model is trained to predict the masked tokens in the input. …
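To see the masked-token objective in action across languages, here is a minimal sketch using the pretrained xlm-roberta-base checkpoint from Hugging Face. The sentences are arbitrary examples I picked; the only point is that a single model fills in the mask regardless of the input language.

```python
from transformers import pipeline

# xlm-roberta-base is the publicly released pretrained checkpoint;
# the sentences below are arbitrary examples in different languages.
unmasker = pipeline("fill-mask", model="xlm-roberta-base")

print(unmasker("The capital of France is <mask>."))
print(unmasker("La capitale de la France est <mask>."))
print(unmasker("Die Hauptstadt von Frankreich ist <mask>."))
```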
BERT’s pre-training pioneered the use of self-supervised language modeling in NLP, and the state of the art has been evolving ever since. The convention says larger models perform better. But large models hinder scaling: they are difficult and expensive to train, and the training speed decreases as the model size grows.
In this article, we’ll be discussing the ALBERT model by Google AI proposed in the paper, “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.” This paper essentially proposes 2 techniques for parameter reduction (to overcome the above issues) within the original BERT architecture: factorized embedding parameterization and cross-layer parameter sharing. A rough parameter-count sketch of the first technique follows below.
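Here is that back-of-the-envelope comparison between a BERT-style embedding matrix (vocab × hidden) and an ALBERT-style factorized one (vocab × E plus E × hidden). The vocabulary size and hidden size roughly follow the BERT-base configuration, and E = 128 is the embedding size used by ALBERT-base; treat the numbers as indicative rather than exact.

```python
# Rough parameter-count sketch for factorized embedding parameterization.
# V and H roughly follow BERT-base; E = 128 is ALBERT-base's embedding size.
V, H, E = 30_000, 768, 128

bert_style   = V * H            # one big vocab-to-hidden embedding matrix
albert_style = V * E + E * H    # small embedding matrix + projection up to H

print(f"BERT-style embedding params:   {bert_style:,}")    # 23,040,000
print(f"ALBERT-style embedding params: {albert_style:,}")  #  3,938,304
```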
Recurrent Neural Networks (RNNs) have been in the sequence modeling business for a long time. But RNNs are slow; they process one token at a time. Moreover, the recurrent architecture imposes the limitation of a fixed-length encoding vector for the complete sequence. To overcome these issues, architectures like CNN-LSTMs, Transformers, and QRNNs burgeoned.
In this article, we’ll be discussing the QRNN model proposed in the paper, “Quasi-Recurrent Neural Networks.” It essentially brings convolution into recurrence and recurrence into convolution: convolutions do the heavy, parallelizable computation across timesteps, while a lightweight recurrent pooling step carries information along the sequence. You will get a better sense of this as you proceed through the article.
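As a rough preview of that idea, here is a minimal NumPy sketch of a single QRNN-style layer: a causal width-k “convolution” produces candidate, forget, and output gates for all timesteps in parallel, and only a cheap element-wise pooling loop is sequential. Shapes, initialization, and the pooling variant (fo-pooling) are illustrative choices, not the paper’s implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrnn_layer(x, Wz, Wf, Wo, k=2):
    """x: (T, d_in); W*: (k * d_in, d_out). Returns hidden states (T, d_out)."""
    T, d_in = x.shape
    # Causal width-k "convolution": each timestep sees only the k most recent
    # inputs. All three gate sequences are computed in parallel over time --
    # this is the fast, convolutional part of the QRNN.
    xp = np.vstack([np.zeros((k - 1, d_in)), x])
    windows = np.stack([xp[t:t + k].reshape(-1) for t in range(T)])  # (T, k*d_in)
    Z = np.tanh(windows @ Wz)    # candidates
    F = sigmoid(windows @ Wf)    # forget gates
    O = sigmoid(windows @ Wo)    # output gates

    # fo-pooling: the only sequential part, and it is just element-wise ops.
    h = np.zeros_like(Z)
    c = np.zeros(Z.shape[1])
    for t in range(T):
        c = F[t] * c + (1.0 - F[t]) * Z[t]
        h[t] = O[t] * c
    return h

# Arbitrary shapes purely for demonstration.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))                       # 10 timesteps, input dim 4
Wz, Wf, Wo = [rng.normal(size=(2 * 4, 8)) for _ in range(3)]
print(qrnn_layer(x, Wz, Wf, Wo).shape)             # (10, 8)
```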
In this article, we’ll be discussing the framework using which most Reinforcement Learning (RL) problems can be formalized: a Markov Decision Process (MDP) is a mathematical framework used for modeling decision-making problems where the outcomes are partly random and partly under the agent’s control. We’ll discuss MDPs in greater detail as we walk through the article.
We are essentially going to describe the RL problem in a broad sense. Moreover, we’ll try to build an intuition for it using real-life examples framed as RL tasks.
This article is inspired by David Silver’s lecture on MDPs, and the equations used in this article are taken from the same. …
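As a tiny, made-up example of framing a task as an MDP, here is a “study vs. relax” toy problem written out as explicit states, actions, transition probabilities, rewards, and a discount factor. All the numbers are invented purely for illustration.

```python
# A toy, made-up MDP: states, actions, transition probabilities P(s'|s, a),
# rewards R(s, a), and a discount factor gamma. All numbers are invented.
states = ["rested", "tired"]
actions = ["study", "relax"]

#            (state,    action): {next_state: probability}
transitions = {
    ("rested", "study"): {"rested": 0.3, "tired": 0.7},
    ("rested", "relax"): {"rested": 0.9, "tired": 0.1},
    ("tired",  "study"): {"rested": 0.1, "tired": 0.9},
    ("tired",  "relax"): {"rested": 0.8, "tired": 0.2},
}

rewards = {
    ("rested", "study"): 2.0,   # productive work
    ("rested", "relax"): 0.5,
    ("tired",  "study"): 0.5,   # studying while tired pays off less
    ("tired",  "relax"): 1.0,
}

gamma = 0.9  # how much we value future rewards relative to immediate ones

# The Markov property: the distribution over next states depends only on the
# current state and action, never on the earlier history.
print(transitions[("rested", "study")])
```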
Models like BERT (Devlin et al.) or GPT (Radford et al.) have achieved the state of the art in language understanding. However, these models are pre-trained only on one language. Recently, efforts have been made to move beyond monolingual representations and build universal cross-lingual models capable of encoding any sentence into a shared embedding space.
In this article, we will be discussing the paper, Cross-lingual Language Model Pretraining, proposed by Facebook AI. The authors propose 2 approaches for cross-lingual language modeling: an unsupervised approach that relies only on monolingual data, and a supervised approach that leverages parallel data with a new cross-lingual objective called Translation Language Modeling (TLM).
In this section, we will discuss the approaches proposed for training the XLM. …
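To hint at what the supervised (TLM) objective looks like in practice, here is a rough sketch of how a training example could be built from a parallel sentence pair: the two sentences are concatenated and tokens are masked in both, so the model can draw on context from either language. The tokenization, special tokens, and masking rate are simplified placeholders of mine, not the exact preprocessing used in the paper (which also resets position ids for the second sentence and adds language embeddings).

```python
import random

# Simplified illustration of constructing a TLM-style example from a
# parallel English-French sentence pair. Special tokens, word-level
# "tokenization", and the 15% masking rate are placeholder choices.
random.seed(0)

en = "the cat sat on the mat".split()
fr = "le chat était assis sur le tapis".split()

tokens = ["</s>"] + en + ["</s>", "</s>"] + fr + ["</s>"]
masked = [
    "[MASK]" if tok != "</s>" and random.random() < 0.15 else tok
    for tok in tokens
]

print(" ".join(masked))
```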
The recent advances in NLP suggest training language models mainly with either a causal language modeling (CLM) objective or a denoising autoencoding objective (e.g., the masked language modeling objective). The framework proceeds with self-supervised pre-training of the model on one of the aforementioned objectives, followed by fine-tuning the model on specific downstream tasks. Models like BERT, RoBERTa, XLNet, ALBERT, T5, etc. are trained with such objectives and have achieved the state of the art on the respective benchmarks.
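To make the two pre-training objectives concrete, here is a tiny sketch of how inputs and targets differ between causal language modeling and masked language modeling. Word-level “tokens” and a hand-picked mask position are used purely for illustration.

```python
# Tiny illustration of the two self-supervised objectives mentioned above,
# using word-level "tokens" and a fixed mask position for simplicity.
tokens = ["the", "movie", "was", "surprisingly", "good"]

# Causal LM: predict each next token from everything to its left.
clm_inputs  = tokens[:-1]
clm_targets = tokens[1:]
print(list(zip(clm_inputs, clm_targets)))

# Masked LM (denoising autoencoding): corrupt some tokens, predict the originals.
masked_position = 3
mlm_inputs = tokens.copy()
mlm_inputs[masked_position] = "[MASK]"
mlm_target = tokens[masked_position]
print(mlm_inputs, "->", mlm_target)
```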
In this article, we are going to discuss a rather unique approach to language model pre-training proposed by Google AI in the paper, ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. …