Latte: Latent attention for linear time Transformers
Transformers are one of the most popular architectures used in both sequence modeling and computer vision. At the center of Transformers is the attention mechanism, which compares each element of a sequence with every other element. These pairwise similarity scores determine how much each other token contributes to the new representation of a given element. While the approach gives state-of-the-art results, it comes at the cost of time complexity that is quadratic in the sequence length. Additionally, for language generation, each next-token prediction costs time linear in the context length, compared to the constant per-token cost of approaches such as Structured State Space Models (SSMs).
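For concreteness, below is a minimal sketch of standard scaled dot-product attention in NumPy (the function name, shapes, and toy sizes are illustrative assumptions, not taken from the paper). The explicit T x T score matrix is what makes both time and memory scale quadratically with the sequence length T.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Standard (quadratic) scaled dot-product attention.

    Q, K, V have shape (T, d). Building the (T, T) score matrix is the
    source of the quadratic cost in the sequence length T.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)  # contribution of each token to each output
    return weights @ V                  # new representation of every token

# Toy usage: T = 8 tokens, d = 4 feature dimensions.
rng = np.random.default_rng(0)
T, d = 8, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out = standard_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

In autoregressive generation with a cached K and V, each new token still attends to all previous positions, which is why per-token cost grows linearly with the context length.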