Model Saving, LoRA Merging, and Quantization

Model Saving

Model Save Path

Training artifacts are organized and saved under output_dir by default. This directory usually contains the final model, checkpoints created during training, the Trainer state, logs, and loss plots.

output_dir: saves/qwen3_8b_lora_sft

It is recommended to use a separate output_dir for each experiment so that training results under different configurations are easy to distinguish and later resumption or model export is more convenient.

Checkpoint Saving Strategy

Checkpoint saving behavior in LLaMA-Factory is mainly controlled by the following parameters:

save_strategy: Controls the save strategy. Common values are steps, epoch, or no.
save_steps: When save_strategy: steps is used, this controls how many training steps elapse before each checkpoint is saved.
save_total_limit: The maximum number of checkpoints to keep. Older checkpoints are deleted automatically when the limit is exceeded.

An example of saving checkpoints periodically by step is shown below:

output_dir: saves/qwen3_8b_lora_sft
save_strategy: steps
save_steps: 200
save_total_limit: 3

The configuration above saves one checkpoint every 200 training steps and keeps at most the 3 most recent checkpoints.
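To make the interaction between save_steps and save_total_limit concrete, here is a minimal sketch that simulates which checkpoint directories remain on disk. The function name retained_checkpoints is hypothetical, and the model is simplified: it ignores any extra checkpoint the Trainer may write at the final step and any protection of a best checkpoint.

```python
from typing import List, Optional


def retained_checkpoints(
    total_steps: int, save_steps: int, save_total_limit: Optional[int]
) -> List[str]:
    """Return the checkpoint directory names left on disk after training.

    Simplified model: one checkpoint every `save_steps` steps, and only the
    most recent `save_total_limit` checkpoints are kept.
    """
    saved = list(range(save_steps, total_steps + 1, save_steps))
    if save_total_limit is not None and save_total_limit > 0:
        saved = saved[-save_total_limit:]
    return [f"checkpoint-{step}" for step in saved]


# 1000 training steps with the configuration shown above.
print(retained_checkpoints(total_steps=1000, save_steps=200, save_total_limit=3))
# → ['checkpoint-600', 'checkpoint-800', 'checkpoint-1000']
```

With these settings, checkpoint-200 and checkpoint-400 are written during training but removed once newer checkpoints push them past the limit.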

If you want to save by epoch instead, use:

save_strategy: epoch

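If you do not want to save checkpoints during training, use the setting below. Quoting the value is the safer form, because YAML 1.1 parsers such as PyYAML read a bare no as the boolean false rather than the string "no".

save_strategy: "no"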

Save Model Weights Only

In scenarios where only the final exported weights matter, you can set save_only_model: true. In this case, usually only the model weights are saved, while optimizer, scheduler, and other training-state files are not.

save_only_model: true

This reduces disk usage, but it is usually not suitable when you need to resume training midway. Without the full training state, a strict full resume is generally not possible later.
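As a rough illustration of how much disk space the optimizer state can take, the sketch below estimates its size. It assumes an AdamW optimizer that stores two float32 moment estimates per trainable parameter. The parameter counts are hypothetical round numbers, and 8-bit or sharded optimizers would give different results.

```python
def adamw_state_bytes(trainable_params: int, bytes_per_value: int = 4) -> int:
    """Approximate size of AdamW state: two moment estimates per trainable parameter."""
    return 2 * trainable_params * bytes_per_value


# Hypothetical parameter counts, chosen only for illustration.
scenarios = {
    "LoRA adapter (20M trainable parameters)": 20_000_000,
    "full fine-tuning (8B trainable parameters)": 8_000_000_000,
}

for label, count in scenarios.items():
    print(f"{label}: {adamw_state_bytes(count) / 1e9:.2f} GB of optimizer state")
```

Under these assumptions the saving is small for a LoRA run and large for full-parameter training, which is where save_only_model matters most.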

Resume Training from Checkpoint

If training is interrupted, or if you want to continue training from an existing checkpoint, you can use resume_from_checkpoint to specify the recovery path.

output_dir: saves/qwen3_8b_lora_sft
resume_from_checkpoint: saves/qwen3_8b_lora_sft/checkpoint-1000

It is generally recommended that resume_from_checkpoint point to a specific checkpoint directory, such as checkpoint-1000. When resuming, Trainer attempts to load the model parameters and training state stored in that checkpoint and continue training from that point.

Note

If save_only_model: true was enabled earlier, a complete checkpoint resume is usually not possible because optimizer, scheduler, and other state information were not saved. In this case, the saved result is better suited for inference, evaluation, or later weight conversion, rather than strictly continuing the previous training progress.
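Before setting resume_from_checkpoint, it can help to locate the newest checkpoint and check whether it still carries training state. The sketch below does this with the standard library only. The helper names latest_checkpoint and missing_state_files are hypothetical, and the file names in TRAINING_STATE_FILES are an assumption based on a plain Hugging Face Trainer run; DeepSpeed or FSDP checkpoints use a different layout.

```python
import re
from pathlib import Path
from typing import List, Optional

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")
# Assumed file names for a plain Trainer checkpoint (not DeepSpeed/FSDP).
TRAINING_STATE_FILES = ("optimizer.pt", "scheduler.pt", "trainer_state.json")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, if any."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def missing_state_files(checkpoint: Path) -> List[str]:
    """List the training-state files that are absent from a checkpoint."""
    return [name for name in TRAINING_STATE_FILES if not (checkpoint / name).exists()]


checkpoint = latest_checkpoint("saves/qwen3_8b_lora_sft")
if checkpoint is None:
    print("No checkpoint found.")
else:
    missing = missing_state_files(checkpoint)
    status = "full resume looks possible" if not missing else f"missing {missing}"
    print(f"{checkpoint}: {status}")
```

If the newest checkpoint is missing optimizer or scheduler files, it was most likely written with save_only_model: true and is better treated as a set of weights than as a resume point.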

LoRA Merge

After training a LoRA adapter on top of a pre-trained model, loading the pre-trained model and the adapter separately for every inference is inconvenient. Instead, merge the pre-trained model and the LoRA adapter and export them as a single model, optionally quantizing the result. The export configuration differs depending on whether quantization is used and which quantization algorithm is selected.

You can merge models with the command llamafactory-cli export merge_config.yaml. The merge_config.yaml file should be configured according to your specific scenario.

examples/merge_lora/qwen3_lora_sft.yaml provides a configuration example for merging.

### examples/merge_lora/qwen3_lora_sft.yaml
### model
model_name_or_path: Qwen/Qwen3-4B-Instruct-2507
adapter_name_or_path: saves/qwen3-4b/lora/sft
template: qwen3_nothink
trust_remote_code: true

### export
export_dir: saves/qwen3_sft_merged
export_size: 5
export_device: cpu
export_legacy_format: false

Note

The model given in model_name_or_path must exist and must correspond to the chosen template. adapter_name_or_path must point to the adapter output path (output_dir) used during fine-tuning.
When merging LoRA adapters, do not use a quantized model or specify quantization bits. You can merge with a local or downloaded unquantized pre-trained model.
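To show what merging does to a single weight matrix, the sketch below applies the standard LoRA update W + (alpha / r) * B A to a tiny example and checks that the merged matrix gives the same output as the base weight plus the adapter path. It is an illustration in pure Python with made-up values, not LLaMA-Factory's export code, and variants that scale differently (for example rsLoRA) are not covered.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for W (out x in), B (out x r), A (r x in)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Made-up values: 2 outputs, 3 inputs, rank 1.
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]
lora_b = [[0.5], [-1.0]]
alpha, rank = 2, 1

merged = merge_lora(weight, lora_a, lora_b, alpha, rank)
x = [[1.0], [1.0], [1.0]]  # column vector

# Output with the base weight and the adapter applied as a separate path.
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
separate = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged)                         # → [[2.0, 2.0, 5.0], [-2.0, -3.0, -6.0]]
print(matmul(merged, x) == separate)  # → True
```

Because the update is added directly to the stored weights, the base weights need to be available at full precision, which is consistent with the note above about not merging into a quantized model.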

Quantization

After merging, the complete model is usually quantized for deployment to balance memory usage, cost, and inference speed.

Quantization reduces memory usage and can speed up inference by lowering the numerical precision of the weights. LLaMA-Factory supports multiple quantization methods, including:

AQLM
AWQ
GPTQ
QLoRA
…

Post-training quantization methods such as GPTQ quantize a pre-trained model after training. Quantization converts a high-precision pre-trained model into a lower-precision one so that memory usage is reduced and inference is accelerated while minimizing performance loss. To keep low-precision values as close as possible to high-precision representations within a limited range, you need to specify the quantization bits export_quantization_bit and the calibration dataset export_quantization_dataset.
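The effect of the bit width can be seen with a small round trip. The sketch below uses plain round-to-nearest uniform quantization, which is a much simpler technique than GPTQ and is swapped in here only to show how fewer bits mean a coarser grid and a larger reconstruction error. The function names and weight values are made up for illustration.

```python
def quantize(values, bits):
    """Map floats to unsigned integer codes on a uniform grid with 2**bits levels."""
    levels = 2 ** bits - 1
    low, high = min(values), max(values)
    scale = (high - low) / levels
    if scale == 0.0:  # constant input: any positive scale works
        scale = 1.0
    codes = [min(levels, max(0, round((v - low) / scale))) for v in values]
    return codes, scale, low


def dequantize(codes, scale, low):
    """Map integer codes back to approximate float values."""
    return [code * scale + low for code in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.7]  # made-up weight values

for bits in (8, 4, 2):
    codes, scale, low = quantize(weights, bits)
    restored = dequantize(codes, scale, low)
    max_error = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: codes={codes} max error={max_error:.4f}")
```

Round-to-nearest keeps each value within half a grid step of the original. Calibration-based methods such as GPTQ aim to do better than this baseline by using a calibration dataset to choose the quantized values.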

Note

When exporting a quantized model, specify the following parameters:

model_name_or_path: Name or path of the pre-trained model
template: Model template
export_dir: Export path
export_quantization_bit: Quantization bit width
export_quantization_dataset: Quantization calibration dataset
export_size: Maximum exported model shard size
export_device: Export device
export_legacy_format: Whether to export in the legacy format

A sample configuration is shown below:

### examples/merge_lora/qwen3_gptq.yaml
### model
model_name_or_path: Qwen/Qwen3-4B-Instruct-2507
template: qwen3_nothink
trust_remote_code: true

### export
export_dir: saves/qwen3_gptq
export_quantization_bit: 4
export_quantization_dataset: data/c4_demo.json
export_size: 2
export_device: cpu
export_legacy_format: false
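For a sense of scale, the sketch below estimates the weight size at different bit widths and the number of shards implied by export_size. It assumes a round 4 billion parameters, treats export_size as gigabytes of 10^9 bytes, and ignores quantization metadata (scales, zero points) and any layers left unquantized, so real exports will be somewhat larger.

```python
import math


def weight_size_gb(num_params: int, bits: int) -> float:
    """Approximate weight size in GB (10**9 bytes) at the given bit width."""
    return num_params * bits / 8 / 1e9


def shard_count(size_gb: float, export_size_gb: int) -> int:
    """Number of shard files if each shard holds at most export_size_gb."""
    return max(1, math.ceil(size_gb / export_size_gb))


num_params = 4_000_000_000  # assumed round figure for a 4B model

for bits, export_size in ((16, 5), (4, 2)):
    size = weight_size_gb(num_params, bits)
    shards = shard_count(size, export_size)
    print(f"{bits}-bit: about {size:.1f} GB, {shards} shard(s) at export_size={export_size}")
```

Under these assumptions, 4-bit weights take roughly a quarter of the space of 16-bit weights, which is why a smaller export_size is workable for the quantized export.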

QLoRA fine-tunes LoRA adapters on top of a 4-bit quantized base model. It greatly reduces memory usage during fine-tuning while preserving model performance to a large extent.

Warning

When merging an adapter trained with QLoRA, do not use a quantized model or set the quantization bit width quantization_bit. Merge into the unquantized pre-trained model instead.

A sample configuration is shown below:

### examples/merge_lora/qwen3_lora_sft.yaml
### model
model_name_or_path: Qwen/Qwen3-4B-Instruct-2507
adapter_name_or_path: saves/qwen3-4b/lora/sft
template: qwen3_nothink
trust_remote_code: true

### export
export_dir: saves/qwen3_sft_merged
export_size: 5
export_device: cpu
export_legacy_format: false
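The warning above can be checked mechanically before running the export. The sketch below inspects a merge configuration that has already been loaded into a dictionary. The function name merge_config_problems is hypothetical, and it only covers the two rules stated in this section, not every check LLaMA-Factory itself may perform.

```python
from typing import Any, Dict, List


def merge_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a LoRA merge configuration."""
    problems = []
    if not config.get("adapter_name_or_path"):
        problems.append("adapter_name_or_path is missing, so there is nothing to merge")
    if config.get("quantization_bit") is not None:
        problems.append("quantization_bit is set; merge into the unquantized model")
    return problems


# Mirrors the sample merge configuration above.
merge_config = {
    "model_name_or_path": "Qwen/Qwen3-4B-Instruct-2507",
    "adapter_name_or_path": "saves/qwen3-4b/lora/sft",
    "template": "qwen3_nothink",
    "export_dir": "saves/qwen3_sft_merged",
    "export_size": 5,
    "export_device": "cpu",
    "export_legacy_format": False,
}

print(merge_config_problems(merge_config))  # → []
```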