Introducing quantized Llama models with increased speed and a reduced memory footprint
# Takeaways

- Today, we're releasing our first lightweight quantized Llama models, which are small and performant enough to run on many popular mobile devices. At Meta, we're uniquely positioned to provide quantized models because of our access to compute resources, training data, full evaluations, and safety.
- As our first quantized models in this Llama category, these instruction-tuned models meet the same quality and safety requirements as the original 1B and 3B models while achieving a 2-4x speedup. We also achieve an average 56% reduction in model size and an average 41% reduction in memory usage compared to the original BF16 format.
- We used two techniques to quantize the Llama 3.2 1B and 3B models: Quantization-Aware Training with LoRA adaptors, which prioritizes accuracy, and SpinQuant, a state-of-the-art post-training quantization method that prioritizes portability (illustrative sketches of both approaches appear at the end of this post).
- Inference with both quantized model variants is supported in the Llama Stack reference implementation via PyTorch's ExecuTorch framework.
- We built these quantized models in close collaboration with our industry-leading partners and are making them available on Qualcomm and MediaTek SoCs with Arm CPUs.

At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B, our smallest models yet, to address the demand for on-device and edge deployments. Since their release, we've seen not only how the community has adopted our lightweight models, but also how grassroots developers are quantizing them to reduce model size and memory footprint, often at a cost to performance and accuracy. As we've shared before, we want to make it easier for more developers to build with Llama without needing significant compute resources and expertise.

Today, we're sharing quantized versions of the Llama 3.2 1B and 3B models. These models offer a reduced memory footprint, faster on-device inference, strong accuracy, and portability, all while maintaining the quality and safety developers need to deploy on resource-constrained devices. Given the limited runtime memory available on mobile devices, we prioritized short-context applications of up to 8K tokens for these new quantized models. Our results show that we can achieve superior accuracy by training with quantization rather than applying quantization as a post-processing step.

The models we are sharing today deliver a 2-4x speedup and an average 56% reduction in model size compared to the original format, based on testing with Android OnePlus 12 devices. They also reduce memory usage by an average of 41%. Starting today, the community can deploy our quantized models onto more mobile CPUs, giving them the opportunity to build unique experiences that are fast and offer more privacy, since interactions stay entirely on device.
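To make the Quantization-Aware Training with LoRA adaptors approach more concrete, below is a minimal, self-contained PyTorch sketch of the general idea: the base weight is fake-quantized (quantize-dequantize with a straight-through estimator) inside the forward pass so training sees quantization error, while a small low-rank adaptor stays in higher precision. The class and function names (`QATLoRALinear`, `fake_quantize`), the 4-bit setting, and the per-channel scheme are illustrative assumptions, not Meta's actual training recipe.

```python
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-output-channel quantize-dequantize with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass uses the quantized weight,
    # the backward pass treats quantization as the identity function.
    return w + (w_q - w).detach()


class QATLoRALinear(nn.Module):
    """Illustrative linear layer: fake-quantized base weight plus a high-precision LoRA adaptor."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.bits = bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quantize(self.weight, self.bits)  # base path sees quantization error during training
        return x @ w_q.t() + (x @ self.lora_a.t()) @ self.lora_b.t()  # adaptor path stays high precision


# Smoke test: the layer trains end to end despite the quantization in the forward pass.
layer = QATLoRALinear(64, 64)
x = torch.randn(4, 64)
layer(x).pow(2).mean().backward()
```

The point being illustrated is that gradients flow through the quantization step via the straight-through estimator, so the network can adapt to low-precision weights during training instead of having precision removed after the fact.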
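SpinQuant, by contrast, is a post-training method built around learned rotations that redistribute outliers before quantization. The toy snippet below uses a random orthogonal matrix rather than SpinQuant's learned rotations and is only meant to demonstrate the underlying invariance: rotating the weights and counter-rotating the activations leaves a linear layer's full-precision output unchanged, while reshaping the value distribution the quantizer sees.

```python
import torch

torch.manual_seed(0)

d = 8
W = torch.randn(16, d)   # toy weight matrix (out_features x in_features)
x = torch.randn(d)       # toy activation vector

# Random orthogonal rotation (QR factor Q); SpinQuant learns these rotations instead.
R, _ = torch.linalg.qr(torch.randn(d, d))

# Rotating the weights and counter-rotating the activations leaves the output unchanged...
y_ref = W @ x
y_rot = (W @ R) @ (R.t() @ x)
print(torch.allclose(y_ref, y_rot, atol=1e-5))  # True: the rotation is mathematically transparent

# ...but it reshapes the distribution the quantizer sees, e.g. spreading out an outlier.
x_outlier = x.clone()
x_outlier[0] = 50.0  # inject a large activation outlier
print(x_outlier.abs().max() / x_outlier.abs().mean())                      # large dynamic range
print((R.t() @ x_outlier).abs().max() / (R.t() @ x_outlier).abs().mean())  # typically much smaller
```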