This post is based on a technical report written by Kazuki Fujii, who led the Llama 3.3 Swallow model development.
The Institute of Science Tokyo has successfully trained Llama 3.3 Swallow, a 70-billion-parameter large language model (LLM) with enhanced Japanese capabilities, using Amazon SageMaker HyperPod. The model demonstrates superior performance in Japanese language tasks, outperforming GPT-4o-mini and other leading models. This technical report details the training infrastructure, optimizations, and best practices developed during the project.
This post is organized as follows:
- Overview of Llama 3.3 Swallow
- Architecture for Llama 3.3 Swallow training
- Software stack and optimizations employed in Llama 3.3 Swallow training
- Experiment management
We discuss topics relevant to machine learning (ML) researchers and engineers with experience in distributed LLM training and familiarity with cloud infrastructure and AWS services. We welcome readers who understand model parallelism and optimization techniques, especially those interested in continual pre-training and supervised fine-tuning approaches.
Overview of Llama 3.3 Swallow
Llama 3.3 Swallow is a 70-billion-parameter LLM that builds upon Meta's Llama 3.3 architecture with specialized enhancements for Japanese language processing. The model was developed through a collaboration between the Okazaki Laboratory and Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST).
The model is available in two variants on Hugging Face: a base model and an instruction-tuned model.
Both variants are available through the tokyotech-llm organization on Hugging Face, giving researchers and developers flexible options for different application needs.
Training methodology
The base model was developed through continual pre-training from Meta Llama 3.3 70B Instruct, maintaining the original vocabulary without expansion. The training data primarily consisted of the Swallow Corpus Version 2, a carefully curated Japanese web corpus derived from Common Crawl. To secure high-quality training data, the team employed the Swallow Education Classifier to extract educationally valuable content from the corpus. The following table summarizes the training data used for base model training, which comprised approximately 314 billion tokens. For compute, the team used 32 ml.p5.48xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances (H100, 80 GB, 256 GPUs) for continual pre-training, which took 16 days and 6 hours.
For the instruction-tuned variant, the team focused exclusively on Japanese dialogue and code generation tasks. This version was created through supervised fine-tuning of the base model, using the same Japanese dialogue data that proved successful in the earlier Llama 3.1 Swallow v0.3 release. Notably, the team made a deliberate choice to exclude English dialogue data from the fine-tuning process to maintain focus on Japanese language capabilities. The following table summarizes the instruction-tuning data used for the instruction-tuned model.
Performance and benchmarks
The base model has demonstrated remarkable performance in Japanese language tasks, consistently outperforming several industry-leading models. In comprehensive evaluations, it has shown superior capabilities compared to OpenAI's GPT-4o (gpt-4o-2024-08-06), GPT-4o-mini (gpt-4o-mini-2024-07-18), GPT-3.5 (gpt-3.5-turbo-0125), and Qwen2.5-72B. These benchmarks reflect the model's enhanced ability to understand and generate Japanese text. The following graph illustrates the base model's performance comparison across these different benchmarks (original image).
The instruction-tuned model has shown particularly strong performance on the Japanese MT-Bench, as evaluated by GPT-4o-2024-08-06, demonstrating its effectiveness in practical applications. The following graph presents the performance metrics (original image).
Licensing and usage
The model weights are publicly available on Hugging Face and can be used for both research and commercial purposes. Users must comply with both the Meta Llama 3.3 license and the Gemma Terms of Use. This open availability aims to foster innovation and advancement in Japanese language AI applications while enforcing responsible usage through appropriate licensing requirements.
Training infrastructure architecture
The training infrastructure for Llama 3.3 Swallow was built on SageMaker HyperPod, with a focus on high performance, scalability, and observability. The architecture combines compute, network, storage, and monitoring components to enable efficient large-scale model training. The base infrastructure stack is available as an AWS CloudFormation template for seamless deployment and replication. This template provisions a comprehensive foundation by creating a dedicated virtual private cloud (VPC). The networking layer is complemented by a high-performance Amazon FSx for Lustre file system, alongside an Amazon Simple Storage Service (Amazon S3) bucket configured to store the lifecycle scripts used to configure the SageMaker HyperPod cluster.
Before deploying this infrastructure, it is important to verify that the AWS account has the appropriate service quotas. The deployment of SageMaker HyperPod requires specific quota values that often exceed default limits. You should check your current quotas against the requirements detailed in SageMaker HyperPod quotas and submit a quota increase request as needed.
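As an illustration only, current SageMaker quotas can be inspected from the AWS CLI before requesting an increase; the filter expression below is an assumption and should be adjusted to the instance type you plan to use:

```bash
# List SageMaker service quotas and filter for ml.p5.48xlarge entries
# (substring match on the quota name; verify the results in the console).
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --query "Quotas[?contains(QuotaName, 'ml.p5.48xlarge')].[QuotaName,Value]" \
  --output table
```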
The following diagram illustrates the high-level architecture of the training infrastructure.
Compute and network configuration
The compute infrastructure is based on SageMaker HyperPod using a cluster of 32 EC2 P5 instances, each equipped with 8 NVIDIA H100 GPUs. The deployment uses a single spine configuration to provide minimal latency between instances. All communication between GPUs is handled through NCCL over Elastic Fabric Adapter (EFA), providing the high-throughput, low-latency networking essential for distributed training. The SageMaker HyperPod Slurm configuration manages the deployment and orchestration of these resources effectively.
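The report does not show the exact job environment, but EFA-backed NCCL jobs on P5 instances typically export a small set of environment variables in the Slurm batch script. The following is a minimal sketch under that assumption; the values are illustrative and should be tuned for your cluster:

```bash
# Illustrative environment for NCCL over EFA in a Slurm job script;
# these settings are assumptions, not the project's published configuration.
export FI_PROVIDER=efa                 # use the EFA libfabric provider
export FI_EFA_USE_DEVICE_RDMA=1        # enable GPUDirect RDMA where supported
export NCCL_DEBUG=INFO                 # log NCCL initialization for troubleshooting
export NCCL_SOCKET_IFNAME=^lo,docker0  # exclude loopback and Docker interfaces
```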
Storage architecture
The project implements a hierarchical storage approach that balances performance and cost-effectiveness. At the foundation is Amazon S3, providing long-term storage for training data and checkpoints. To prevent storage bottlenecks during training, the team deployed FSx for Lustre as a high-performance parallel file system. This configuration enables efficient data access patterns across all training nodes, which is crucial for handling the massive datasets required for the 70-billion-parameter model.
The integration between Amazon S3 and FSx for Lustre is managed through a data repository association, configured using the following AWS Command Line Interface (AWS CLI) command:
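The exact command from the report is not reproduced here; a representative sketch of creating a data repository association looks like the following, where the file system ID, path, and bucket name are placeholders:

```bash
# Link an FSx for Lustre path to an S3 prefix so objects are lazily loaded
# on first access and new files can be exported back to S3; identifiers and
# bucket names below are placeholders.
aws fsx create-data-repository-association \
  --file-system-id fs-0123456789abcdef0 \
  --file-system-path /data \
  --data-repository-path s3://example-training-bucket/data \
  --s3 "AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}"
```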
Observability stack
The monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana to provide comprehensive observability. The team integrated DCGM Exporter for GPU metrics and EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous monitoring of GPU health, network performance, and training progress, with automated alerting for any anomalies through Grafana dashboards. The following screenshot shows an example of a GPU health dashboard.
Software stack and training optimizations
The training environment is built on the SageMaker HyperPod DLAMI, which provides a preconfigured Ubuntu base Amazon Machine Image (AMI) with essential components for distributed training. The software stack includes CUDA drivers and libraries (such as cuDNN and cuBLAS), NCCL for multi-GPU communication, and AWS-OFI-NCCL for EFA support. On top of this foundation, the team deployed Megatron-LM as the primary framework for model training. The following diagram illustrates the software stack architecture.
Distributed training implementation
The training implementation uses Megatron-LM's advanced features for scaling LLM training. The framework provides sophisticated model parallelism capabilities, including both tensor and pipeline parallelism, along with efficient data parallelism that supports communication overlap. These features are essential for managing the computational demands of training a 70-billion-parameter model.
Advanced parallelism and communication
The team used Megatron-LM's comprehensive 4D parallelism strategy, which maximizes GPU utilization through careful optimization of communication patterns across multiple dimensions: data, tensor, pipeline, and sequence parallelism. Data parallelism splits the training batch across GPUs, tensor parallelism divides individual model layers, pipeline parallelism splits the model into stages across GPUs, and sequence parallelism partitions the sequence length dimension. Together, these techniques enable efficient training of very large models.
The implementation overlaps communication across the data-parallel, tensor-parallel, and pipeline-parallel domains, significantly reducing blocking time during computation. This optimized configuration enables efficient scaling across the full cluster of GPUs while maintaining consistently high utilization rates. The following diagram illustrates this communication and computation overlap in distributed training (original image).
Megatron-LM enables fine-grained communication overlapping through several configuration flags: --overlap-grad-reduce and --overlap-param-gather for data-parallel operations, --tp-comm-overlap for tensor-parallel operations, and built-in pipeline-parallel communication overlap (enabled by default). These optimizations work together to improve training scalability.
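As an illustration only (the report does not publish the full launch command), these flags would typically appear alongside the parallelism sizes in a Megatron-LM pretraining invocation; the parallel sizes in the sketch below are placeholders rather than the values used for Llama 3.3 Swallow:

```bash
# Sketch of a Megatron-LM launch fragment combining 4D parallelism with
# communication overlap flags; the sizes shown are illustrative only.
torchrun --nproc_per_node 8 --nnodes 32 pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8 \
  --sequence-parallel \
  --use-distributed-optimizer \
  --overlap-grad-reduce \
  --overlap-param-gather \
  --tp-comm-overlap
  # remaining model, data, and optimizer arguments are omitted in this sketch
```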
Checkpointing strategy
The training infrastructure implements an optimized checkpointing strategy using Distributed Checkpoint (DCP) and asynchronous I/O operations. DCP parallelizes checkpoint operations across all available GPUs, rather than being constrained by the tensor- and pipeline-parallel dimensions as in traditional Megatron-LM implementations. This parallelization, combined with asynchronous I/O, allows the system to:
- Save checkpoints up to 10 times faster compared to synchronous approaches
- Minimize training interruption by offloading I/O operations
- Scale checkpoint performance with the total number of GPUs
- Maintain consistency through coordinated distributed saves
The checkpointing system automatically saves model states to the FSx for Lustre file system at configurable intervals, with metadata tracked in Amazon S3. For redundancy, checkpoints are asynchronously replicated to Amazon S3 storage.
For implementation details on asynchronous DCP, see Asynchronous Saving with Distributed Checkpoint (DCP).
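The replication mechanism itself is not detailed in the report; one simple way to mirror completed checkpoints from the FSx for Lustre mount to S3 is a periodic sync such as the following sketch, where the paths and bucket name are placeholder assumptions:

```bash
# Mirror finished checkpoint directories from the shared file system to S3;
# paths and bucket name are illustrative, not the project's actual values.
aws s3 sync /fsx/checkpoints/llama-3.3-swallow \
  s3://example-training-bucket/checkpoints/llama-3.3-swallow \
  --exclude "*.tmp" --only-show-errors
```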
Experiment management
In November 2024, the team introduced a systematic approach to resource optimization through the development of a sophisticated memory prediction tool. This tool accurately predicts per-GPU memory usage during training and semi-automatically determines optimal training settings by analyzing all possible 4D parallelism configurations. Based on proven algorithmic research, this tool has become instrumental in maximizing resource utilization across the training infrastructure. The team plans to open source this tool with comprehensive documentation to benefit the broader AI research community.
The following screenshot shows an example of the memory consumption prediction tool interface (original image).
Training pipeline management
The success of the training process relied heavily on maintaining high-quality data pipelines. The team implemented rigorous data curation processes and robust cleaning pipelines, maintaining a careful balance in dataset composition across different languages and domains. For experiment planning, version control was critical. The team first fixed the versions of the pre-training and instruction-tuning libraries to be used in the next experiment cycle. For libraries without formal version releases, the team managed versions using Git branches or tags to ensure reproducibility (see the sketch after this list). After the versions were locked, the team performed short-duration training runs to:
- Measure throughput with different numbers of GPU nodes
- Search for optimal configurations among the distributed training settings identified by the memory prediction library
- Establish accurate training time estimates for scheduling
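As a minimal sketch of the version-pinning step described above, pinning a library without formal releases to an exact tag or commit might look like the following; the repository URL, tag, and recorded file name are placeholders:

```bash
# Pin a training library to a known commit so every experiment cycle
# rebuilds the same code; the URL, tag, and output file are placeholders.
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.8.0                      # or: git checkout <commit-hash>
git rev-parse HEAD > ../megatron-lm.version   # record the exact commit used
```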
The following screenshot shows an example experiment schedule with GPU node allocation, expected training duration, and key milestones across different training phases (original image).
To optimize storage performance before beginning experiments, training data was preloaded from Amazon S3 to the FSx for Lustre file system to prevent I/O bottlenecks during training. This preloading process used parallel transfers to maximize throughput:
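The command used in the report is not reproduced here; a common pattern for preloading lazily imported files into FSx for Lustre in parallel looks like the following sketch, where the mount path and degree of parallelism are assumptions:

```bash
# Trigger parallel restores so the Lustre file system fetches the training
# data from S3 before jobs start; the mount point and parallelism are
# placeholder assumptions.
find /fsx/training-data -type f -print0 \
  | xargs -0 -n 50 -P 16 sudo lfs hsm_restore
```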
Monitoring and performance management
The team implemented a comprehensive monitoring system focused on real-time performance tracking and proactive issue detection. By integrating with Weights & Biases, the system continuously monitors training progress and delivers automated notifications for key events such as job completion or failure and performance anomalies. Weights & Biases provides a suite of tools that enable customized alerting through Slack channels. The following screenshot shows an example of a training monitoring dashboard in Slack (original image).
The monitoring infrastructure excels at identifying both job failures and performance bottlenecks such as stragglers. The following figure presents an example of straggler detection showing training throughput degradation.
Conclusion
The successful training of Llama 3.3 Swallow represents a significant milestone in the development of LLMs using cloud infrastructure. Through this project, the team has demonstrated the effectiveness of combining advanced distributed training techniques with carefully orchestrated cloud resources. The implementation of efficient 4D parallelism and asynchronous checkpointing has established new benchmarks for training efficiency, and the comprehensive monitoring and optimization tools have provided consistent performance throughout the training process.
The project's success is built on several foundational elements: a systematic approach to resource planning and optimization, robust data pipeline management, and a comprehensive monitoring and alerting system. The efficient storage hierarchy implementation has proven particularly crucial in managing the massive datasets required for training a 70-billion-parameter model. Looking ahead, the project opens several promising directions for future development. The team plans to open source the memory prediction tools so that other researchers can benefit from the optimizations developed during this project. Further improvements to the training pipelines are under development, along with continued enhancement of Japanese language capabilities. The project's success also paves the way for expanded model applications across various domains.
Resources and references
This section provides key resources and references for understanding and replicating the work described in this post. The resources are organized into documentation for the infrastructure and tools used, as well as model-specific resources for accessing and working with Llama 3.3 Swallow.
Documentation
The following resources provide detailed information about the technologies and frameworks used in this project:
Model resources
For more information about Llama 3.3 Swallow and access to the model, refer to the following resources:
About the Authors
Kazuki Fujii graduated with a bachelor's degree in Computer Science from Tokyo Institute of Technology in 2024 and is currently a master's student there (2024–2026). Kazuki is responsible for the pre-training and fine-tuning of the Swallow model series, a state-of-the-art multilingual LLM specializing in Japanese and English as of December 2023. Kazuki focuses on distributed training and building scalable training systems to enhance the model's performance and infrastructure efficiency.
Daisuke Miyamoto is a Senior Specialist Solutions Architect for HPC at Amazon Web Services. He mainly supports HPC customers in drug discovery, numerical weather prediction, electronic design automation, and ML training.
Kei Sasaki is a Senior Solutions Architect on the Japan Public Sector team at Amazon Web Services, where he helps Japanese universities and research institutions navigate their cloud migration journeys. With a background as a systems engineer specializing in high-performance computing, Kei supports these academic institutions in their large language model development initiatives and advanced computing projects.
Keita Watanabe is a Senior GenAI World Wide Specialist Solutions Architect at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.