May 3, 2024
Based on Griffin, it uses a combination of linear recurrences and local attention instead of global attention to model long sequences efficiently.
Introduces the Real-Gated Linear Recurrent Unit (RG-LRU) layer, which forms the core of the new recurrent block, replacing Multi-Query Attention for better efficiency and scalability.
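The gated linear recurrence above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the gate projections `W_r`, `W_i`, the parameter `log_lambda`, and the constant `c` are assumed names, and the update follows the general RG-LRU form (a learnable per-channel real decay raised to a gated power, with a variance-preserving input scale).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rg_lru(x, W_r, W_i, log_lambda, c=8.0):
    """Hypothetical sketch of an RG-LRU-style layer.

    x:          (seq_len, dim) input sequence
    W_r, W_i:   (dim, dim) recurrence/input gate projections (assumed shapes)
    log_lambda: (dim,) learnable recurrence parameter
    """
    seq_len, dim = x.shape
    h = np.zeros(dim)
    outputs = np.empty_like(x)
    base = sigmoid(log_lambda)           # per-channel decay in (0, 1)
    for t in range(seq_len):
        r_t = sigmoid(x[t] @ W_r)        # recurrence gate
        i_t = sigmoid(x[t] @ W_i)        # input gate
        a_t = base ** (c * r_t)          # gated, real-valued decay
        # linear recurrence: no nonlinearity wraps h itself,
        # and sqrt(1 - a_t^2) keeps the state variance bounded
        h = a_t * h + np.sqrt(1.0 - a_t ** 2) * (i_t * x[t])
        outputs[t] = h
    return outputs
```

Because the state `h` is a fixed-size vector updated linearly, inference cost per token is constant in sequence length, which is the efficiency argument for replacing global attention.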