NPU Training¶

This document describes how to perform LLaMA-Factory model training on Huawei Ascend NPU.

Supported Devices¶

LLaMA-Factory currently supports the following Ascend NPU devices:

Atlas A2 Training Series
Atlas A3 Training Series

Supported Features¶

Category	Feature	Support Status
Training Paradigm	PT	Supported
Training Paradigm	SFT	Supported
Training Paradigm	RM	Supported
Training Paradigm	DPO	Supported
Parameter Paradigm	Full	Supported
Parameter Paradigm	Freeze	Supported
Parameter Paradigm	LoRA	Supported
Model Merging	LoRA Weight Merging	Supported
Distributed	DDP	Supported
Distributed	FSDP	Supported
Distributed	FSDP2	Supported
Distributed	DeepSpeed	Supported
Acceleration	Fused Operators	Currently supports NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, NpuFusedMoE

Note

Most NPU usage is consistent with GPU usage. For general installation steps, refer to NPU Installation and Configuration; for general distributed training configurations (such as FSDP, FSDP2, DeepSpeed), refer to Distributed Training.

Quick Start¶

To get started quickly, it is recommended to use the Docker image provided by LLaMA-Factory.

Start the container (modify the device mapping as needed):

```
docker run -itd \
    --net=host \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    --shm-size=1200g \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --name llama_factory_npu \
    hiyouga/llamafactory:latest-npu-a2 \
    /bin/bash
```

Configure environment variables:

After entering the container, be sure to load the Ascend environment configuration first; otherwise, NPU devices will not be recognized.
```
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
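
To confirm that the environment was loaded, one option is a quick check from Python. The sketch below assumes the container ships `torch` together with the `torch_npu` plugin; the helper name `npu_status` is hypothetical. It reports whether PyTorch can see any NPU and returns an error message instead of crashing when the packages cannot be imported.

```python
def npu_status():
    """Report whether PyTorch can see Ascend NPUs (assumes torch_npu is installed)."""
    try:
        import torch
        import torch_npu  # noqa: F401  # importing registers the torch.npu namespace
    except (ImportError, OSError) as exc:
        # May mean the packages are missing or set_env.sh was not sourced.
        return {"available": False, "count": 0, "error": str(exc)}
    available = bool(torch.npu.is_available())
    count = torch.npu.device_count() if available else 0
    return {"available": available, "count": count, "error": None}


if __name__ == "__main__":
    print(npu_status())
```

If `available` is false inside the container, re-check the `--device` mappings and source `set_env.sh` again before starting training.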
Start training:

Note

If you cannot reliably reach Hugging Face when downloading models, we recommend downloading from ModelScope instead, or configuring the following as needed:
- Set the USE_MODELSCOPE_HUB environment variable to download models and datasets from ModelScope first, or to use models and datasets from the cache path:
  export USE_MODELSCOPE_HUB=1
- To access restricted or private ModelScope Hub resources, configure ms_hub_token.

For more parameter details, refer to Arguments. Before downloading, check that the files to be downloaded are correct and come from a trusted source.

```
llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
```

Common Tuning Tips¶

If you want a simple trade-off between memory usage and training throughput, start by focusing on the following parameters:

- per_device_train_batch_size: per-device batch size. Increasing it usually improves throughput but directly increases memory usage. If OOM occurs, reduce it first.
- gradient_accumulation_steps: number of gradient accumulation steps. After reducing per_device_train_batch_size, you can increase this parameter to keep the effective batch size close to its original value. However, each optimizer step then spans more forward and backward passes, so each parameter update takes longer.
- cutoff_len: sample truncation length. This is usually one of the parameters with the largest effect on memory usage, especially for long-context training. If the current task does not depend on very long inputs, lowering it is a good first step.
- gradient_checkpointing: gradient checkpointing. Enabling it usually reduces memory usage significantly at the cost of some training speed. It is suitable for memory-constrained scenarios.

The following example is a more conservative starting point that prioritizes getting the run working while using less memory:

```
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
cutoff_len: 4096
gradient_checkpointing: true
```
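
The relationship between per_device_train_batch_size, gradient_accumulation_steps, and the effective batch size can be sketched as below. This is a minimal illustration, not LLaMA-Factory code: the helper names are hypothetical, and the device count of 2 is an assumption chosen for the example.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices):
    """Samples contributing to one optimizer step across all devices."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def accumulation_steps_for(target_effective_batch, per_device_train_batch_size, num_devices):
    """Accumulation steps needed to reach a target effective batch size."""
    per_pass = per_device_train_batch_size * num_devices
    if target_effective_batch % per_pass != 0:
        raise ValueError("target must be a multiple of per-device batch size * device count")
    return target_effective_batch // per_pass


if __name__ == "__main__":
    # Conservative settings from the example above, on an assumed 2 NPUs.
    print(effective_batch_size(1, 8, 2))    # 16
    # Halving the per-device batch from 2 to 1 means doubling accumulation from 4 to 8.
    print(accumulation_steps_for(16, 1, 2))  # 8
```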

Distributed Training¶

NPU distributed training configuration is generally consistent with the Distributed Training document. This section introduces NPU-specific configurations, including device selection and multi-node communication settings.

Key Environment Variables¶

Before starting training, pay attention to the following environment variable settings:

ASCEND_RT_VISIBLE_DEVICES (applies to single-node and multi-node training)

Specifies which NPU devices participate in training.
- Default behavior: If this variable is not set, the program attempts to use all NPU devices on the current node.
- Specify devices: To limit training to specific NPU cards (e.g., only card 0 and card 1), set this variable explicitly:
```
export ASCEND_RT_VISIBLE_DEVICES=0,1
```
HCCL_SOCKET_IFNAME (required for multi-node training only)

Specifies the network interface name used for HCCL collective communication.
- How to obtain: Run ifconfig in the terminal to view the network interface list, and choose the interface used for communication (e.g., eth0, enp1s0).
- Example setting:
```
export HCCL_SOCKET_IFNAME=eth0
```
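
As an alternative to reading ifconfig output by hand, the interface names can be listed from Python with the standard library. The sketch below is illustrative only: `list_interfaces` and `pick_interface` are hypothetical helpers, and the preferred names are just the examples mentioned above. Which interface actually carries inter-node traffic still has to be confirmed for your cluster.

```python
import socket


def list_interfaces():
    """Return the network interface names known to the OS."""
    return [name for _index, name in socket.if_nameindex()]


def pick_interface(preferred, available):
    """Return the first preferred name that exists on this machine, else None."""
    for name in preferred:
        if name in available:
            return name
    return None


if __name__ == "__main__":
    names = list_interfaces()
    print("interfaces:", names)
    print("candidate for HCCL_SOCKET_IFNAME:", pick_interface(["eth0", "enp1s0"], names))
```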

Single-Node Training¶

Single-node training (single card or multiple cards) uses the same llamafactory-cli train command as on GPU; select the cards to use with ASCEND_RT_VISIBLE_DEVICES.

Single-node multi-card example:

```
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
```
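
When the same launch is driven from a Python script, for example a job wrapper, the device selection can be passed through the child process environment. The following is a minimal sketch under that assumption; `build_launch` is a hypothetical helper, and the actual `subprocess.run` call is left commented out so the snippet can be inspected without starting a training run.

```python
import os
import shlex


def build_launch(devices, config_path):
    """Build the environment and argv for a single-node run on the given cards."""
    env = dict(os.environ)  # copy, so the current process environment is untouched
    env["ASCEND_RT_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    cmd = ["llamafactory-cli", "train", config_path]
    return env, cmd


if __name__ == "__main__":
    env, cmd = build_launch([0, 1, 2, 3], "examples/train_lora/qwen3_lora_sft.yaml")
    print(f"ASCEND_RT_VISIBLE_DEVICES={env['ASCEND_RT_VISIBLE_DEVICES']} {shlex.join(cmd)}")
    # To actually start training:
    # import subprocess
    # subprocess.run(cmd, env=env, check=True)
```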

Multi-node Training¶

In NPU environments, we recommend using accelerate launch with FSDP or FSDP2 for multi-node training, as this combination generally gives better communication and computation efficiency on NPU.

Note

For other launch methods (such as torchrun/deepspeed) and more detailed configurations, refer to the Distributed Training document.

1. Prepare Accelerate configuration file

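Create or modify the Accelerate configuration file passed to --config_file in the launch step, for example examples/accelerate/fsdp_config_multiple_nodes.yaml. The key parameters are as follows (adjust them to the actual number of nodes and the main node IP):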

```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
main_process_ip: 192.168.0.1
main_process_port: 29500
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false
```

Note

Explanation of key multi-node parameters:

- num_machines: total number of nodes.
- num_processes: total number of processes across all nodes, one per card, so num_processes = num_machines * number of cards per machine.
- main_process_ip: IP address of the main node (must be the same value on all nodes).
- main_process_port: port of the main node (must be the same value on all nodes).
- machine_rank: index of the current node (0 on the main node, increasing by one for each subsequent node).
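
The rules above can be checked mechanically before launching. The sketch below is a hypothetical helper, not part of LLaMA-Factory or Accelerate; it takes one node's settings as a plain dict and assumes 8 cards per machine, which is what the example values (2 machines, 16 processes) imply.

```python
def expected_num_processes(num_machines, cards_per_machine):
    """Total process count: one process per card on every node."""
    return num_machines * cards_per_machine


def check_node_settings(settings, cards_per_machine):
    """Return a list of problems found in one node's multi-node settings."""
    problems = []
    expected = expected_num_processes(settings["num_machines"], cards_per_machine)
    if settings["num_processes"] != expected:
        problems.append(
            f"num_processes should be {expected}, got {settings['num_processes']}"
        )
    if not 0 <= settings["machine_rank"] < settings["num_machines"]:
        problems.append("machine_rank must be in the range [0, num_machines)")
    return problems


if __name__ == "__main__":
    node0 = {"num_machines": 2, "num_processes": 16, "machine_rank": 0}
    print(check_node_settings(node0, cards_per_machine=8))  # []
```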

2. Launch training

Run the launch command on every node. The command itself is identical across nodes; only the machine_rank value in each node's YAML differs, so make sure it is set correctly on each node:

```
export HCCL_SOCKET_IFNAME=eth0

accelerate launch --config_file examples/accelerate/fsdp_config_multiple_nodes.yaml \
    src/train.py examples/train_lora/qwen3_lora_sft.yaml
```
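
If you prefer to keep a single YAML file shared by all nodes, accelerate launch also accepts a --machine_rank option on the command line, which to my understanding takes precedence over the value in the config file. The sketch below only builds and prints the per-node command; `accelerate_command` is a hypothetical helper, and the two-node loop is an assumption matching the example configuration.

```python
import shlex


def accelerate_command(machine_rank, config_file, train_config):
    """Build the accelerate launch argv for one node, setting its rank on the CLI."""
    return [
        "accelerate", "launch",
        "--config_file", config_file,
        "--machine_rank", str(machine_rank),
        "src/train.py", train_config,
    ]


if __name__ == "__main__":
    for rank in range(2):  # assumed: 2 nodes, as in the example configuration
        cmd = accelerate_command(
            rank,
            "examples/accelerate/fsdp_config_multiple_nodes.yaml",
            "examples/train_lora/qwen3_lora_sft.yaml",
        )
        print(f"node {rank}: {shlex.join(cmd)}")
```

Each node then runs only the line printed for its own rank, with HCCL_SOCKET_IFNAME exported beforehand.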

Training Modes¶

The following are reference launch commands for common training scenarios; adjust the configuration files according to your needs.

Pretraining (PT)¶

```
llamafactory-cli train examples/train_lora/qwen3_lora_pretrain.yaml
```

Supervised Fine-tuning (SFT)¶

```
llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
```

Reward Model (RM)¶

```
llamafactory-cli train examples/train_lora/qwen3_lora_reward.yaml
```

DPO Training¶

```
llamafactory-cli train examples/train_lora/qwen3_lora_dpo.yaml
```

Full-parameter Fine-tuning (Full)¶

```
llamafactory-cli train examples/train_full/qwen3_full_sft.yaml
```

Performance Optimization¶

Fused Operators¶

LLaMA-Factory supports the following fused operators on Ascend NPU: FlashAttention (FA), NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, and NpuFusedMoE.

Set the following parameter in the training configuration to enable the NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, and NpuFusedMoE fused operators. When this option is enabled, the code checks after the model is loaded whether each model structure meets the replacement requirements and, if so, replaces it with the fused-operator form to improve training efficiency.

```
use_v1_kernels: true
```

LLaMA-Factory also supports the FA fused operator on Ascend NPU. The code automatically checks whether the model structure meets the replacement requirements and, if so, replaces it with the fused-operator form. Enable it by setting the following parameter in the training configuration file:

```
flash_attn: fa2
```

Fused operators currently support only a limited set of model series, listed in the table below. This feature is under continuous development to improve generality and applicability.

Fused Operators	Supported Model Series
FA	Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE
NpuFusedRMSNorm	Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE
NpuFusedSwiGlu	Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE
NpuFusedRoPE	Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE
NpuFusedMoE	Qwen3-MOE, Qwen3-VL-MOE
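
For scripting purposes, the support table above can be encoded as a simple lookup. This is an illustrative sketch only: the dictionary and the `fused_operators_for` helper are hypothetical, the data is copied from the table, and it will go stale as support is extended.

```python
# Copied from the support table above; not read from LLaMA-Factory itself.
ALL_SERIES = {"Qwen3", "Qwen3-MOE", "Qwen3-VL", "Qwen3-VL-MOE"}
FUSED_OPERATOR_SUPPORT = {
    "FA": set(ALL_SERIES),
    "NpuFusedRMSNorm": set(ALL_SERIES),
    "NpuFusedSwiGlu": set(ALL_SERIES),
    "NpuFusedRoPE": set(ALL_SERIES),
    "NpuFusedMoE": {"Qwen3-MOE", "Qwen3-VL-MOE"},
}


def fused_operators_for(model_series):
    """Return the fused operators listed as supported for a model series, sorted by name."""
    return sorted(
        op for op, series in FUSED_OPERATOR_SUPPORT.items() if model_series in series
    )


if __name__ == "__main__":
    for name in sorted(ALL_SERIES):
        print(name, fused_operators_for(name))
```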

Operator Dispatch Optimization¶

Optimize operator dispatch performance by setting the TASK_QUEUE_ENABLE environment variable (Level 2 recommended):

```
export TASK_QUEUE_ENABLE=2
```

For model saving, checkpoint resumption, and subsequent merging and export of LoRA adapters, refer to Model Saving, LoRA Merging, and Quantization.