NPU Training¶
This document describes how to train models with LLaMA-Factory on Huawei Ascend NPUs.
Supported Devices¶
LLaMA-Factory currently supports the following Ascend NPU devices:
Atlas A2 Training Series
Atlas A3 Training Series
Supported Features¶
| Feature | Item | Support Status |
|---|---|---|
| Training Paradigm | PT | Supported |
| | SFT | Supported |
| | RM | Supported |
| | DPO | Supported |
| Parameter Paradigm | Full | Supported |
| | Freeze | Supported |
| | LoRA | Supported |
| Model Merging | LoRA Weight Merging | Supported |
| Distributed | DDP | Supported |
| | FSDP | Supported |
| | FSDP2 | Supported |
| | DeepSpeed | Supported |
| Acceleration | Fused Operators | Currently supports NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, NpuFusedMoE |
Note
Most NPU usage is consistent with GPU usage. For general installation steps, refer to NPU Installation and Configuration; for general distributed training configurations (such as FSDP, FSDP2, DeepSpeed), refer to Distributed Training.
Quick Start¶
To get started quickly, it is recommended to use the Docker image provided by LLaMA-Factory.
Start the container (modify the device mapping as needed):
docker run -itd \
    --net=host \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    --shm-size=1200g \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --name llama_factory_npu \
    hiyouga/llamafactory:latest-npu-a2 \
    /bin/bash
Configure environment variables:
After entering the container, be sure to load the Ascend environment configuration first; otherwise, NPU devices will not be recognized.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
Start training:
llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
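Before launching a longer run, it can help to confirm that PyTorch can see the NPU once the environment has been sourced. The sketch below assumes the container ships the `torch` and `torch_npu` packages; the helper name `npu_available` is illustrative and not part of LLaMA-Factory.

```python
def npu_available() -> bool:
    """Return True if PyTorch reports at least one usable Ascend NPU."""
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers the torch.npu backend)
    except (ImportError, OSError):
        # Packages are missing, or the Ascend runtime libraries could not be
        # loaded (for example because set_env.sh was not sourced).
        return False
    return bool(torch.npu.is_available())


if __name__ == "__main__":
    print("NPU available:", npu_available())
```

If this prints `False` inside the container, re-check the device mappings passed to `docker run` and the `source` step above before starting training.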
Common Tuning Tips¶
If you want a simple trade-off between memory usage and training throughput, start by focusing on the following parameters:
per_device_train_batch_size: per-device batch size. Increasing it usually improves throughput, but directly increases memory usage. If OOM occurs, reduce it first.
gradient_accumulation_steps: number of gradient accumulation steps. After reducing per_device_train_batch_size, you can increase this parameter to keep the original effective batch size as much as possible. However, the larger the accumulation steps, the slower each parameter update becomes.
cutoff_len: sample truncation length. This is usually one of the most important parameters affecting memory usage, especially for long-context training. If the current task does not depend on very long inputs, reduce it first.
gradient_checkpointing: gradient checkpointing. Enabling it usually significantly reduces memory usage, but introduces some speed loss. It is suitable for memory-constrained scenarios.
The following example is a more conservative starting point that prioritizes getting the run working while using less memory:
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
cutoff_len: 4096
gradient_checkpointing: true
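To see how the batch-size parameters interact, note that in data-parallel training the number of samples behind one optimizer step is the per-device batch size times the accumulation steps times the number of devices. The following is a minimal sketch of that arithmetic; the helper name `effective_batch_size` and the 8-card node are illustrative assumptions.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int) -> int:
    """Samples contributing to a single optimizer step in data-parallel training."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Conservative settings from above on a hypothetical 8-card node.
print(effective_batch_size(1, 8, 8))  # 64

# A larger per-device batch with fewer accumulation steps keeps the same total.
print(effective_batch_size(2, 4, 8))  # 64
```

Both configurations update the model with the same number of samples per step; the first uses less memory per card at the cost of more forward/backward passes per update.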
Distributed Training¶
NPU distributed training configuration is generally consistent with the Distributed Training document. This section introduces NPU-specific configurations, including device selection and multi-node communication settings.
Key Environment Variables¶
Before starting training, pay attention to the following environment variable settings:
ASCEND_RT_VISIBLE_DEVICES (applies to both single-node and multi-node training)
Used to specify NPU devices participating in training.
Default behavior: If this variable is not set, the program will attempt to use all NPU devices on the current node.
Specify devices: If you need to limit training to specific NPU cards (e.g., only card 0 and card 1), you must explicitly set this variable:
export ASCEND_RT_VISIBLE_DEVICES=0,1
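When wrapping the launch in a script, it can be handy to check which cards the variable selects before training starts. The sketch below only parses the comma-separated value shown above; the helper name `visible_npu_ids` is hypothetical and is not a LLaMA-Factory or CANN API.

```python
import os


def visible_npu_ids(env=None):
    """Parse ASCEND_RT_VISIBLE_DEVICES into a list of card indices.

    Returns None when the variable is unset or empty, which corresponds to
    the default behavior of using all NPU devices on the node.
    """
    env = os.environ if env is None else env
    raw = env.get("ASCEND_RT_VISIBLE_DEVICES", "").strip()
    if not raw:
        return None
    return [int(part) for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    ids = visible_npu_ids()
    print("all devices on this node" if ids is None else f"cards {ids}")
```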
HCCL_SOCKET_IFNAME (required for multi-node training only)
Specifies the network interface name used for HCCL collective communication.
How to obtain: Run ifconfig in the terminal to view the network interface list, and choose the interface used for communication (e.g., eth0, enp1s0).
Example setting:
export HCCL_SOCKET_IFNAME=eth0
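If `ifconfig` is not installed in the container, the interface names can also be listed from Python's standard library. This is a minimal sketch, not a replacement for checking which interface actually carries inter-node traffic; the helper name `candidate_interfaces` is illustrative.

```python
import socket


def candidate_interfaces() -> list:
    """List network interface names on this machine, skipping loopback ("lo")."""
    return [name for _, name in socket.if_nameindex() if name != "lo"]


if __name__ == "__main__":
    for name in candidate_interfaces():
        print(name)
```

Pick the name that corresponds to the network shared by all nodes and export it as `HCCL_SOCKET_IFNAME`.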
Single-Node Training¶
Single-node training (single card or multiple cards) follows the standard procedure.
Single-node multi-card example:
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
Multi-node Training¶
In NPU environments, it is recommended to use accelerate launch with FSDP or FSDP2 for multi-node training; this approach provides better communication and computation efficiency on NPU.
Note
For other launch methods (such as torchrun/deepspeed) and more detailed configurations, refer to the Distributed Training document.
1. Prepare Accelerate configuration file
Create or modify examples/accelerate/fsdp_config_multiple_nodes.yaml (the file passed to accelerate launch in step 2); key parameters are as follows (adjust them to the actual number of nodes and the main node IP):
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
main_process_ip: 192.168.0.1
main_process_port: 29500
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false
Note
Explanation of key multi-node parameters:
num_machines: Total number of nodes
num_processes: Total number of processes (total number of cards) = num_machines * number of cards per machine
main_process_ip: IP address of the main node (must be consistent across all nodes)
main_process_port: Port of the main node (must be consistent across all nodes)
machine_rank: Current node index (main node is 0, and the subsequent nodes increase in order)
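The relationship between these parameters can be sketched in a few lines: every node shares the same num_machines, num_processes, main_process_ip, and main_process_port, and only machine_rank differs. The helper `node_settings` below is a hypothetical illustration (it is not part of Accelerate or LLaMA-Factory) and assumes every node has the same number of cards.

```python
def node_settings(num_machines: int, cards_per_node: int,
                  main_process_ip: str, main_process_port: int = 29500) -> list:
    """Build the multi-node fields of the Accelerate config, one dict per node."""
    # Total processes = one process per card across all nodes.
    num_processes = num_machines * cards_per_node
    return [
        {
            "machine_rank": rank,  # 0 for the main node, then 1, 2, ...
            "num_machines": num_machines,
            "num_processes": num_processes,
            "main_process_ip": main_process_ip,
            "main_process_port": main_process_port,
        }
        for rank in range(num_machines)
    ]


# Two nodes with 8 cards each, matching the example configuration above.
for settings in node_settings(2, 8, "192.168.0.1"):
    print(settings)
```

Each printed dict corresponds to the values one node should carry in its own copy of the YAML file.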
2. Launch training
Execute the same launch command on all nodes (ensure machine_rank is correctly configured in the YAML):
export HCCL_SOCKET_IFNAME=eth0
accelerate launch --config_file examples/accelerate/fsdp_config_multiple_nodes.yaml \
src/train.py examples/train_lora/qwen3_lora_sft.yaml
Training Modes¶
The following are reference launch commands for common training scenarios; adjust the configuration files according to your needs.
Pretraining (PT)¶
llamafactory-cli train examples/train_lora/qwen3_lora_pretrain.yaml
Supervised Fine-tuning (SFT)¶
llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
Reward Model (RM)¶
llamafactory-cli train examples/train_lora/qwen3_lora_reward.yaml
DPO Training¶
llamafactory-cli train examples/train_lora/qwen3_lora_dpo.yaml
Full-parameter Fine-tuning (Full)¶
llamafactory-cli train examples/train_full/qwen3_full_sft.yaml
Performance Optimization¶
Fused Operators¶
LLaMA-Factory supports the following fused operators on Ascend NPU: FlashAttention (FA), NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, and NpuFusedMoE.
Set the following parameter in the training configuration file to enable the NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, and NpuFusedMoE fused operators. After the model is loaded, the code automatically checks whether each model structure meets the replacement requirements and, if it does, replaces that structure with its fused operator form to improve training efficiency.
use_v1_kernels: true
LLaMA-Factory also supports the FA fused operator on Ascend NPU. The code automatically checks whether the model structure meets the replacement requirements and, if it does, replaces the structure with the fused operator form. To enable it, set the following parameter in the training configuration file:
flash_attn: fa2
Fused operators currently support only a limited set of models. The feature is under continuous development to broaden its generality and applicability.
| Fused Operator | Supported Model Series |
|---|---|
| FA | Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE |
| NpuFusedRMSNorm | Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE |
| NpuFusedSwiGlu | Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE |
| NpuFusedRoPE | Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE |
| NpuFusedMoE | Qwen3-MOE, Qwen3-VL-MOE |
Operator Dispatch Optimization¶
Optimize operator dispatch performance by setting the TASK_QUEUE_ENABLE environment variable (Level 2 recommended):
export TASK_QUEUE_ENABLE=2
For model saving, checkpoint resumption, and subsequent merging and export of LoRA adapters, refer to Model Saving, LoRA Merging, and Quantization.