Jun 7, 2024
3 stories
Enables context extension for large language models, achieving significant computation savings through sparse local attention and parameter-efficient fine-tuning.
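Below is a minimal sketch of the sparse local attention idea: each token attends only within its own fixed-size block, so attention cost grows roughly linearly with sequence length rather than quadratically. This is illustrative only; the function name, block size, and tensor shapes are assumptions, not the exact attention scheme used by any specific paper.

```python
import torch
import torch.nn.functional as F


def block_local_attention(q, k, v, block_size=64):
    """q, k, v: (batch, seq_len, dim); seq_len must be divisible by block_size."""
    b, n, d = q.shape
    nb = n // block_size
    # Reshape so attention is computed independently inside each block.
    q = q.view(b, nb, block_size, d)
    k = k.view(b, nb, block_size, d)
    v = v.view(b, nb, block_size, d)
    scores = torch.matmul(q, k.transpose(-1, -2)) / d ** 0.5   # (b, nb, bs, bs)
    weights = F.softmax(scores, dim=-1)
    out = torch.matmul(weights, v)                              # (b, nb, bs, d)
    return out.view(b, n, d)


if __name__ == "__main__":
    x = torch.randn(2, 256, 32)
    print(block_local_attention(x, x, x).shape)  # torch.Size([2, 256, 32])
```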
Allows efficient fine-tuning of large models on limited GPU memory through innovations such as the 4-bit NormalFloat (NF4) data type, double quantization, and paged optimisers.
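The sketch below shows the block-wise absmax quantization that underlies 4-bit schemes of this kind: weights are split into blocks, each block is scaled by its absolute maximum, and values are rounded to a small set of levels. It uses uniform 4-bit levels rather than the NormalFloat code, and the block size and level count are illustrative assumptions; "double quantization" then refers to further compressing the per-block scales themselves.

```python
import torch


def blockwise_quantize(w, block_size=64, n_levels=16):
    # Flatten and pad so the tensor splits evenly into blocks.
    flat = w.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    # One scale (absolute maximum) per block.
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)
    # Map each value to one of n_levels signed integer levels.
    q = torch.round(blocks / absmax * (n_levels // 2 - 1)).to(torch.int8)
    return q, absmax


def blockwise_dequantize(q, absmax, n_levels=16):
    return q.float() / (n_levels // 2 - 1) * absmax


if __name__ == "__main__":
    w = torch.randn(1000)
    q, scale = blockwise_quantize(w)
    w_hat = blockwise_dequantize(q, scale).flatten()[: w.numel()]
    print("max reconstruction error:", (w - w_hat).abs().max().item())
```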
Introduces trainable rank decomposition matrices into each layer of a pre-trained Transformer model, significantly reducing the number of trainable parameters for downstream tasks.
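A minimal sketch of such a low-rank adapter on a single linear layer is shown below: the pre-trained weight W is frozen and only a rank-r update B·A is trained, cutting trainable parameters from d_out × d_in to r × (d_in + d_out). The class name, rank, initialization, and scaling factor are illustrative assumptions, not a particular library's API.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight W.
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Trainable low-rank factors: A is small random, B starts at zero so the
        # update is zero at initialization.
        self.lora_a = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):
        base = x @ self.weight.T
        update = (x @ self.lora_a.T) @ self.lora_b.T * self.scale
        return base + update


if __name__ == "__main__":
    layer = LoRALinear(512, 512)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print("trainable params:", trainable)  # 8192, versus 262144 for the full weight
```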