Interactive syllabus
[Michal Štefánik] Attention sparsification: Look into the future and the past (behind the context window) 8. 10. 2020
Abstract
You might have heard about how cool Transformers are, so it's easy to forget their notorious flaws. One of the main ones is the quadratic complexity with respect to the size of the "attended" text window. This prevents them from efficiently handling more complex tasks, such as document classification or document question answering.
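To make the quadratic part concrete, here is a minimal sketch (plain NumPy, written for this outline rather than taken from any of the papers) of vanilla single-head self-attention; the (n, n) score matrix is exactly what blows up with the window size:

```python
import numpy as np

def full_self_attention(x):
    """x: (n, d) token embeddings; returns (n, d) attended outputs."""
    n, d = x.shape
    q, k, v = x, x, x                      # single head, no projections, for brevity
    scores = q @ k.T / np.sqrt(d)          # (n, n): memory and compute grow as O(n^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

x = np.random.randn(4096, 64)              # a "long document": 4096 tokens
out = full_self_attention(x)               # the intermediate (4096, 4096) matrix is the bottleneck
print(out.shape)                           # (4096, 64)
```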
In recent months, there has been a surge of research addressing the inefficient nature of the attention mechanism with some quite interesting tricks: selecting the relevant subsections, propagating information through globally-visible tokens, or traversing the text randomly in a graph-like manner, with a theory grounded in random graphs.
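Roughly speaking, these tricks replace the dense (n, n) attention pattern with a sparse mask. The sketch below is illustrative only: the window size, number of global tokens, and number of random links are made-up values for this outline, not any paper's actual configuration.

```python
import numpy as np

def sparse_attention_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Boolean (n, n) mask: True where token i may attend to token j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                    # local sliding window
        mask[i, rng.choice(n, n_random)] = True  # random long-range links
    mask[:n_global, :] = True                    # global tokens attend everywhere...
    mask[:, :n_global] = True                    # ...and everything attends to them
    return mask

mask = sparse_attention_mask(16)
print(mask.sum(), "of", mask.size, "entries kept")  # O(n) nonzeros instead of O(n^2)
```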
The results of those improvements already seem to help, but there's still a long way to go...
We'll take a look at the methods, from Sparse Attention and Longformer to Big Bird. Feel free to look at the literature in advance, to get even more out of the discussion.
Presentation slides in Google Docs (including animations)
Readings and Literature
Based on your interest, pick one piece to read in advance (or more, if you like, of course):
If you are not familiar with the famous Transformer architecture, see the visuals and possibly the original paper:
- Attention Is All You Need: Illustrated blog: http://jalammar.github.io/illustrated-transformer/
- Attention Is All You Need: original paper: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Here are (pretty much all) the papers concerning Attention sparsification:
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context: https://arxiv.org/pdf/1901.02860.pdf
- Generating Long Sequences with Sparse Transformers: https://arxiv.org/pdf/1904.10509.pdf
- Longformer: The Long-Document Transformer: https://arxiv.org/pdf/2004.05150.pdf
- Big Bird: Transformers for Longer Sequences: https://arxiv.org/pdf/2007.14062v1.pdf
2. Sparse Transformers shows how attention can be used instead of convolution on image processing tasks, using 2D attention.
3. Longformer provides a nice overview of the sparsification approaches (a minimal usage sketch follows below).
4. Big Bird combines all previously introduced approaches and shows experiments applying attention to genomics data.
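If you want to play with one of these models before the talk, here is a minimal sketch assuming the Hugging Face transformers library and its public allenai/longformer-base-4096 checkpoint (both are assumptions made for this outline, not part of the talk materials):

```python
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "A long document ... " * 500
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Mark the first token as globally attended; the rest use sliding-window attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```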