Instella 3B: AMD’s Ambitious Leap in Open-Source Language Models
The landscape of open-source AI just got more competitive with the introduction of Instella, AMD’s fully open 3-billion-parameter language model (LM). Trained from scratch on Instinct MI300X GPUs, Instella is designed to outperform existing fully open models of comparable size and to rival leading open-weight models in both language understanding and instruction-following tasks. The release marks a significant step for accessible, high-performance AI and demonstrates AMD’s commitment to open research and hardware innovation.
Key Features and Innovations
- Fully Open-Source: Instella’s model weights, training configurations, datasets, and code are openly available, encouraging collaboration and transparency within the AI community.
- Cutting-Edge Performance: Instella consistently beats other fully open models of similar size and delivers competitive results versus top-tier open-weight models like Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B.
- Advanced Training Pipeline: Instella was trained on 4.15 trillion tokens using 128 Instinct MI300X GPUs, leveraging FlashAttention-2, torch.compile, mixed-precision training, and Fully Sharded Data Parallelism (FSDP) for efficiency and scalability (see the training-stack sketch after this list).
- Multi-Stage Training: The model’s capabilities were refined through a four-stage pipeline: two pre-training stages followed by supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for alignment with human instructions.
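As a rough illustration of the optimizations named above (not AMD’s actual training code), here is a self-contained PyTorch sketch combining torch.compile, bfloat16 autocast, and fused scaled-dot-product attention, which is the kernel-level idea FlashAttention implements; all model sizes are toy values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalBlock(nn.Module):
    """Toy causal self-attention block, stand-in for a real decoder layer."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.heads, d // self.heads).transpose(1, 2)
                   for z in (q, k, v))
        # dispatches to a fused (FlashAttention-style) kernel when available
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(b, t, d))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.compile(TinyCausalBlock().to(device))  # graph-compile the module
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(4, 128, 256, device=device)
with torch.autocast(device_type=device, dtype=torch.bfloat16):  # mixed precision
    loss = model(x).pow(2).mean()  # dummy loss, for illustration only
loss.backward()
opt.step()
```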
Instella Model Family and Architecture
The Instella release includes several checkpoints, each representing a stage in the training process:
- Instella-3B-Stage1: First-stage pre-training on over 4 trillion tokens for foundational language skills.
- Instella-3B: Second-stage pre-training on diverse datasets targeting reasoning, mathematics, and conversational tasks, bringing the cumulative total to 4.15 trillion tokens.
- Instella-3B-SFT: Supervised fine-tuning with high-quality instruction-response pairs to enhance instruction following.
- Instella-3B-Instruct: Alignment with human preferences via Direct Preference Optimization (DPO), boosting chat and multi-turn QA performance (a minimal loss sketch follows this list).
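For reference, here is a minimal sketch of the standard DPO objective used in that final stage. The inputs are summed token log-probabilities of the chosen and rejected responses under the policy and a frozen reference model; the values are toy numbers, not Instella’s training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # log-ratio of policy vs. reference for preferred and dispreferred answers
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    # push the policy to prefer the chosen response relative to the reference
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# toy example: the policy already slightly prefers the chosen response
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # small positive scalar; shrinks as the preference margin grows
```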
Technically, Instella is a transformer-based, autoregressive, text-only LM with 36 decoder layers, 32 attention heads per layer, a 4,096-token context window, and a vocabulary of roughly 50,000 tokens based on the OLMo tokenizer.
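Given those specifications, a hedged sketch of loading a released checkpoint for inference with Hugging Face transformers follows; the Hub id "amd/Instella-3B" is an assumption inferred from the checkpoint names above, and `trust_remote_code` is assumed to be required for the custom architecture:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/Instella-3B"  # assumed Hub id; adjust if the published id differs
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tok("AMD's Instella is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```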
Training Pipeline and Data Strategy
Instella’s training pipeline is built on the open-source OLMo codebase, extensively optimized for AMD hardware. The two-stage pre-training drew on diverse, high-quality datasets, including DCLM-baseline, Dolma, Dolmino-Mix, SmolLM-Corpus, DeepMind Mathematics, and several conversational and code-focused corpora. The pipeline also incorporated a synthetic dataset for mathematical reasoning, generated with advanced prompting and program-synthesis techniques (an illustrative mixing sketch appears below).
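As a rough illustration of weighted corpus mixing in the spirit of this data strategy (not Instella’s actual recipe, whose mixture ratios aren’t given here), a self-contained sketch:

```python
import random

def mix_corpora(corpora, weights, seed=0):
    """Yield documents by sampling a source corpus per step with given probabilities."""
    rng = random.Random(seed)
    names = list(corpora)
    iters = [iter(corpora[n]) for n in names]
    weights = list(weights)
    while iters:
        i = rng.choices(range(len(iters)), weights=weights)[0]
        try:
            yield names[i], next(iters[i])
        except StopIteration:
            # drop an exhausted corpus and keep sampling from the rest
            del iters[i], names[i], weights[i]

# toy corpora and hypothetical weights; real mixtures are tuned per stage
corpora = {
    "web": ["web doc 1", "web doc 2", "web doc 3"],
    "math": ["math doc 1"],
}
for source, doc in mix_corpora(corpora, weights=[0.75, 0.25]):
    print(source, doc)
```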
To maximize efficiency, model parameters, gradients, and optimizer states were sharded within each node and replicated across nodes, a hybrid-sharding layout that balances memory savings against inter-node communication overhead (see the sketch below).
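In PyTorch FSDP terms, this shard-within-node, replicate-across-nodes layout corresponds to the HYBRID_SHARD strategy. A minimal sketch, assuming a multi-node launch via torchrun handles rank and world-size setup (an illustration, not AMD’s training code):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

dist.init_process_group("nccl")  # env vars supplied by the torchrun launcher
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# toy stand-in for a 3B-parameter decoder stack
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()

fsdp_model = FSDP(
    model,
    # shard params/grads/optimizer state inside each node,
    # replicate the shards across nodes (memory savings vs. less inter-node traffic)
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
```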
Benchmark Results and Performance Highlights
- Pre-Training: Instella-3B-Stage1 and Instella-3B outperformed all fully open models of comparable size on benchmarks such as ARC Challenge, MMLU, and GSM8K. Instella-3B led other fully open models by more than 8% on average and narrowed the gap with open-weight competitors.
- Instruction Tuning: Instella-3B-Instruct led fully open models on instruction-following and QA tasks, with an average advantage of 14% over the nearest fully open competitor across evaluated benchmarks. It also matched or exceeded leading open-weight models in several categories, despite being trained on fewer tokens (a reproduction sketch follows this list).
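For readers who want to reproduce comparable numbers, a hedged sketch using EleutherAI’s lm-evaluation-harness (`pip install lm-eval`); the Hub id is an assumption, and exact task configurations and few-shot settings may differ from AMD’s reported setup:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=amd/Instella-3B-Instruct",  # assumed Hub id
    tasks=["arc_challenge", "mmlu", "gsm8k"],
    batch_size=8,
)
print(results["results"])  # per-task metrics
```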
Open Access and Community Collaboration
Instella’s full open-source release—including model weights, configurations, datasets, and code—underscores AMD’s drive for transparency and reproducibility. The project invites researchers, developers, and enthusiasts to experiment with, evaluate, and extend Instella, accelerating AI innovation for all.