NPU Training

This document describes how to train models with LLaMA-Factory on Huawei Ascend NPUs.

Supported Devices

LLaMA-Factory currently supports the following Ascend NPU devices:

  • Atlas A2 Training Series

  • Atlas A3 Training Series

Supported Features

Feature                                      Support Status

Training Paradigm     PT                     Supported
                      SFT                    Supported
                      RM                     Supported
                      DPO                    Supported
Parameter Paradigm    Full                   Supported
                      Freeze                 Supported
                      LoRA                   Supported
Model Merging         LoRA Weight Merging    Supported
Distributed           DDP                    Supported
                      FSDP                   Supported
                      FSDP2                  Supported
                      DeepSpeed              Supported
Acceleration          Fused Operators        Currently supports NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, NpuFusedMoE

Note

Most NPU usage is consistent with GPU usage. For general installation steps, refer to NPU Installation and Configuration; for general distributed training configurations (such as FSDP, FSDP2, DeepSpeed), refer to Distributed Training.

Quick Start

To get started quickly, it is recommended to use the Docker image provided by LLaMA-Factory.

  1. Start the container (modify the device mapping as needed):

    docker run -itd \
        --net=host \
        --device=/dev/davinci0 \
        --device=/dev/davinci1 \
        --device=/dev/davinci_manager \
        --device=/dev/devmm_svm \
        --device=/dev/hisi_hdc \
        --shm-size=1200g \
        -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
        --name llama_factory_npu \
        hiyouga/llamafactory:latest-npu-a2 \
        /bin/bash
    
  2. Configure environment variables:

    After entering the container, be sure to load the Ascend environment configuration first; otherwise, NPU devices will not be recognized.

    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    
  3. Start training:

    llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
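
If training does not see any NPU, it can help to first confirm that the device mapping from step 1 actually reached the container. The following is a minimal sketch, not part of LLaMA-Factory: the helper name count_davinci_devices is hypothetical, and it only counts /dev/davinciN nodes of the kind mapped above. It does not verify that the driver or toolkit is working.

```shell
# Hypothetical helper: count davinciN device nodes in a directory (default: /dev).
# Only names of the form "davinci" followed by digits are counted, so
# davinci_manager and other control devices are ignored.
count_davinci_devices() {
    dev_dir="${1:-/dev}"
    count=0
    for path in "$dev_dir"/davinci*; do
        name="${path##*/}"
        suffix="${name#davinci}"
        case "$suffix" in
            ''|*[!0-9]*) ;;   # empty or non-numeric suffix: skip
            *) [ -e "$path" ] && count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

echo "NPU device nodes visible: $(count_davinci_devices)"
```

A count of 0 inside the container suggests that the --device options in the docker run command need to be revisited.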
    

Common Tuning Tips

If you want a simple trade-off between memory usage and training throughput, start by focusing on the following parameters:

  • per_device_train_batch_size: per-device batch size. Increasing it usually improves throughput, but directly increases memory usage. If an out-of-memory (OOM) error occurs, reduce it first.

  • gradient_accumulation_steps: number of gradient accumulation steps. After reducing per_device_train_batch_size, you can increase this parameter to keep the effective batch size as close to the original as possible. However, more accumulation steps mean each parameter update takes longer, because more micro-batches are processed before the optimizer steps.

  • cutoff_len: sample truncation length. This is usually one of the most important parameters affecting memory usage, especially for long-context training. If the current task does not depend on very long inputs, reduce it first.

  • gradient_checkpointing: gradient checkpointing. Enabling it usually significantly reduces memory usage, but introduces some speed loss. It is suitable for memory-constrained scenarios.

The following example is a more conservative starting point that prioritizes getting the run working while using less memory:

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
cutoff_len: 4096
gradient_checkpointing: true
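
When trading per_device_train_batch_size against gradient_accumulation_steps, the quantity to hold constant is the effective batch size. For data-parallel training this is the product of the per-device batch size, the accumulation steps, and the number of devices. A small sketch of that arithmetic follows; the function name is illustrative only and the device count of 8 is an assumption.

```shell
# Usage: effective_batch_size PER_DEVICE_BATCH GRAD_ACCUM_STEPS NUM_DEVICES
# Illustrative helper, not part of LLaMA-Factory.
effective_batch_size() {
    echo $(( $1 * $2 * $3 ))
}

# The conservative settings above on an assumed 8 NPUs:
effective_batch_size 1 8 8    # prints 64

# Same effective batch size with a larger per-device batch and fewer accumulation steps:
effective_batch_size 4 2 8    # prints 64
```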

Distributed Training

NPU distributed training configuration is generally consistent with the Distributed Training document. This section introduces NPU-specific configurations, including device selection and multi-node communication settings.

Key Environment Variables

Before starting training, pay attention to the following environment variable settings:

  • ASCEND_RT_VISIBLE_DEVICES (applies to both single-node and multi-node training)

    Used to specify NPU devices participating in training.

    • Default behavior: If this variable is not set, the program will attempt to use all NPU devices on the current node.

    • Specify devices: If you need to limit training to specific NPU cards (e.g., only card 0 and card 1), you must explicitly set this variable:

      export ASCEND_RT_VISIBLE_DEVICES=0,1
      
  • HCCL_SOCKET_IFNAME (required for multi-node training only)

    Specifies the network interface name used for HCCL collective communication.

    • How to obtain: Run ifconfig in the terminal to view the network interface list, and choose the interface used for communication (e.g., eth0, enp1s0).

    • Example setting:

      export HCCL_SOCKET_IFNAME=eth0
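
If ifconfig is not installed, interface names can also be read from /sys/class/net on Linux. The sketch below lists every interface other than the loopback device; the helper name list_candidate_ifaces is hypothetical. Which interface actually carries inter-node traffic still has to be decided from your own network layout.

```shell
# Hypothetical helper: print network interface names, excluding loopback.
# Reads entries from /sys/class/net unless another directory is given.
list_candidate_ifaces() {
    net_dir="${1:-/sys/class/net}"
    for path in "$net_dir"/*; do
        [ -e "$path" ] || continue          # directory empty or missing
        name="${path##*/}"
        [ "$name" = "lo" ] && continue      # skip the loopback interface
        echo "$name"
    done
}

list_candidate_ifaces
```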
      

Single-Node Training

Single-node training (single card or multiple cards) follows the standard procedure.

Single-node multi-card example:

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
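
A malformed device list, such as one with a stray comma, is easy to miss. The sketch below checks that a value is a comma-separated list of non-negative integers and prints how many devices it names. The function name count_visible_devices is hypothetical, and the check is purely syntactic: it does not confirm that the listed cards exist.

```shell
# Hypothetical helper: validate a device list such as "0,1,2,3" and print its length.
# Returns non-zero for empty values, non-digit characters, or misplaced commas.
count_visible_devices() {
    list="$1"
    case "$list" in
        ''|*[!0-9,]*|,*|*,|*,,*) return 1 ;;
    esac
    old_ifs="$IFS"
    IFS=','
    set -- $list        # split on commas into positional parameters
    IFS="$old_ifs"
    echo "$#"
}

count_visible_devices "0,1,2,3"    # prints 4
```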

Multi-node Training

In NPU environments, it is recommended to use accelerate launch with FSDP or FSDP2 for multi-node training; this approach provides better communication and computation efficiency on NPU.

Note

For other launch methods (such as torchrun/deepspeed) and more detailed configurations, refer to the Distributed Training document.

1. Prepare Accelerate configuration file

Create or modify examples/accelerate/fsdp_config_multiple_nodes.yaml, the file passed to accelerate launch in step 2. The key parameters are as follows (adjust them to match your actual number of nodes and the main node's IP address):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
main_process_ip: 192.168.0.1
main_process_port: 29500
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false

Note

Explanation of key multi-node parameters:

  • num_machines: Total number of nodes

  • num_processes: Total number of processes (total number of cards) = num_machines * number of cards per machine

  • main_process_ip: IP address of the main node (must be consistent across all nodes)

  • main_process_port: Port of the main node (must be consistent across all nodes)

  • machine_rank: Current node index (main node is 0, and the subsequent nodes increase in order)
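
The relationship between num_machines and num_processes can be sanity-checked before launching. The sketch below assumes every node has the same number of cards and that both keys appear at the top level of the YAML file, as in the example above. The function name check_process_count is hypothetical, and the awk parsing is deliberately simple rather than a full YAML parser.

```shell
# Hypothetical helper: verify num_processes divides evenly across num_machines
# and print the implied number of cards per node.
# Exit status: 0 = consistent, 1 = uneven split, 2 = key missing or invalid.
check_process_count() {
    cfg="$1"
    machines="$(awk '$1 == "num_machines:" {print $2}' "$cfg")"
    processes="$(awk '$1 == "num_processes:" {print $2}' "$cfg")"
    [ -n "$processes" ] || return 2
    [ "${machines:-0}" -gt 0 ] 2>/dev/null || return 2
    [ $(( processes % machines )) -eq 0 ] || return 1
    echo $(( processes / machines ))
}

# Example usage (with the configuration above this would print 8):
#   check_process_count examples/accelerate/fsdp_config_multiple_nodes.yaml
```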

2. Launch training

Run the same launch command on every node, making sure that machine_rank in each node's YAML file is set to that node's index:

export HCCL_SOCKET_IFNAME=eth0

accelerate launch --config_file examples/accelerate/fsdp_config_multiple_nodes.yaml \
    src/train.py examples/train_lora/qwen3_lora_sft.yaml
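
Because machine_rank is the only value that differs between nodes, one option is to keep a single template and derive each node's file from it. The following is a minimal sketch using sed; the function name make_node_config and the output file name node1.yaml are hypothetical.

```shell
# Usage: make_node_config TEMPLATE RANK OUTPUT
# Hypothetical helper: copy TEMPLATE to OUTPUT with machine_rank set to RANK.
# The template itself is left unchanged.
make_node_config() {
    sed "s/^machine_rank:.*/machine_rank: $2/" "$1" > "$3"
}

# Example usage on the second node (rank 1):
#   make_node_config examples/accelerate/fsdp_config_multiple_nodes.yaml 1 node1.yaml
#   accelerate launch --config_file node1.yaml \
#       src/train.py examples/train_lora/qwen3_lora_sft.yaml
```

Depending on your accelerate version, passing --machine_rank on the accelerate launch command line may be a simpler alternative; check accelerate launch --help before relying on it.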

Training Modes

The following are reference launch commands for common training scenarios; adjust the configuration files according to your needs.

Pretraining (PT)

llamafactory-cli train examples/train_lora/qwen3_lora_pretrain.yaml

Supervised Fine-tuning (SFT)

llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml

Reward Model (RM)

llamafactory-cli train examples/train_lora/qwen3_lora_reward.yaml

DPO Training

llamafactory-cli train examples/train_lora/qwen3_lora_dpo.yaml

Full-parameter Fine-tuning (Full)

llamafactory-cli train examples/train_full/qwen3_full_sft.yaml

Performance Optimization

Fused Operators

LLaMA-Factory supports the FlashAttention (FA), NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, and NpuFusedMoE fused operators.

Set the following parameter in the training configuration file to enable the NpuFusedRMSNorm, NpuFusedSwiGlu, NpuFusedRoPE, and NpuFusedMoE fused operators. Once this option is enabled, the code checks after model loading whether each model structure meets the replacement requirements and, if it does, replaces that structure with its fused-operator form to improve training efficiency.

use_v1_kernels: true

LLaMA-Factory also supports FA fused operators on Ascend NPU. The code automatically determines whether the model structure meets the replacement requirements, and will replace the structure with the fused operator form if satisfied. Enable it by setting the following parameter in the training configuration file:

flash_attn: fa2

Fused operators currently support only the model series listed below. This feature is under continuous development to broaden its coverage.

Fused Operators     Supported Model Series

FA                  Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE
NpuFusedRMSNorm     Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE
NpuFusedSwiGlu      Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE
NpuFusedRoPE        Qwen3, Qwen3-MOE, Qwen3-VL, Qwen3-VL-MOE
NpuFusedMoE         Qwen3-MOE, Qwen3-VL-MOE

Operator Dispatch Optimization

Optimize operator dispatch performance by setting the TASK_QUEUE_ENABLE environment variable (Level 2 recommended):

export TASK_QUEUE_ENABLE=2

For model saving, checkpoint resumption, and subsequent merging and export of LoRA adapters, refer to Model Saving, LoRA Merging, and Quantization.