In the long (context) run

It’s not the quadratic attention; it’s the lack of long pre-training data

One intriguing trend in the field of Large Language Models (LLMs) is the growing context length: the number of tokens we can feed to the Transformer before it predicts the next token. Especially in the past year we have seen a remarkable push towards long-context LLMs; see the Figure below.