Qwen2.5-7B-Instruct-GPTQ-Int4

This repository contains the AX650 deployment of Qwen2.5-7B-Instruct-GPTQ-Int4. The model assets have been flattened to match the axllm model directory convention, so the repository root can be used directly as an axllm model directory.

Compatible with Pulsar2 version: 3.4 (not yet released)

Conversion tool links:

If you are interested in model conversion, you can export the axmodel yourself from the original repository: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4

Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel

AXera NPU LLM Runtime

Support Platform

| Chips | w8a16          | w4a16          |
|-------|----------------|----------------|
| AX650 | 2.6 tokens/sec | 4.8 tokens/sec |
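As a rough sanity check, the measured rates above translate into generation time as follows (a sketch using only the table's numbers):

```python
# Estimate wall-clock decode time for n tokens at the measured AX650 rates above.
def decode_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    return n_tokens / tokens_per_sec

# 100 tokens at the w4a16 rate (4.8 tokens/sec) vs. the w8a16 rate (2.6 tokens/sec).
print(round(decode_seconds(100, 4.8), 1))  # 20.8
print(round(decode_seconds(100, 2.6), 1))  # 38.5
```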

Repository Layout

.
├── config.json
├── post_config.json
├── qwen2_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── qwen2_p128_l0_together.axmodel
├── ...
├── qwen2_p128_l27_together.axmodel
├── qwen2_post.axmodel
├── qwen2.5_tokenizer/
├── qwen2.5_tokenizer.py
├── main_prefill
├── main_axcl_aarch64
├── main_axcl_x86
└── run_qwen2.5_7b_gptq_int4_*.sh

All text-model files required by axllm now live at the repository root. The original tokenizer service and legacy demo binaries are kept for compatibility.
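Before pointing axllm at a directory, a quick check like the following can confirm it matches the layout above (the file names come from the listing; the script itself is illustrative, not part of axllm):

```python
import os

# Files the flattened axllm model directory is expected to contain,
# per the repository layout above (per-layer files are checked by pattern).
REQUIRED = [
    "config.json",
    "post_config.json",
    "qwen2_tokenizer.txt",
    "model.embed_tokens.weight.bfloat16.bin",
    "qwen2_post.axmodel",
]

def missing_files(model_dir: str) -> list:
    """Return required files absent from model_dir; also flag the case
    where no qwen2_p128_l*_together.axmodel layer files are present."""
    missing = [f for f in REQUIRED
               if not os.path.isfile(os.path.join(model_dir, f))]
    layers = [f for f in os.listdir(model_dir)
              if f.startswith("qwen2_p128_l") and f.endswith("_together.axmodel")]
    if not layers:
        missing.append("qwen2_p128_l*_together.axmodel")
    return missing
```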

Direct Inference with axllm

Note: the axllm inference flow is still being finalized, so the instructions and scripts below may be updated in the future. Please check back for the latest updates.

Installation

ๆ–นๅผไธ€: ๅ…‹้š†ไป“ๅบ“ๅŽๆ‰ง่กŒๅฎ‰่ฃ…่„šๆœฌ:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

ๆ–นๅผไบŒ: ไธ€่กŒๅ‘ฝไปคๅฎ‰่ฃ… (้ป˜่ฎคๅˆ†ๆ”ฏ axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

ๆ–นๅผไธ‰: ไธ‹่ฝฝ Github Actions CI ๅฏผๅ‡บ็š„ๅฏๆ‰ง่กŒ็จ‹ๅบ:

If you do not have a build environment, download the latest CI-exported axllm from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Download Model from Hugging Face

mkdir -p AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
cd AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
hf download AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650 --local-dir .

Run

$ axllm run AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
---
17:34:29.520 INF Init:218 | LLM init start
tokenizer_type = 1
 96% | ##############################   |  30 /  31 [6.43s<6.64s, 4.67 count/s] init post axmodel ok,remain_cmm(3560 MB)
17:34:35.946 INF Init:368 | max_token_len : 1024
17:34:35.946 INF Init:371 | kv_cache_size : 512, kv_cache_num: 1024
17:34:35.946 INF Init:374 | prefill_token_num : 128
17:34:35.946 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
17:34:35.946 INF Init:384 | prefill_max_token_num : 1
17:34:35.946 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  31 /  31 [6.43s<6.43s, 4.82 count/s] embed_selector init ok
17:34:35.951 INF load_config:282 | load config:
17:34:35.951 INF load_config:282 | {
17:34:35.951 INF load_config:282 |     "enable_repetition_penalty": false,
17:34:35.951 INF load_config:282 |     "enable_temperature": true,
17:34:35.951 INF load_config:282 |     "enable_top_k_sampling": true,
17:34:35.951 INF load_config:282 |     "enable_top_p_sampling": false,
17:34:35.951 INF load_config:282 |     "penalty_window": 20,
17:34:35.951 INF load_config:282 |     "repetition_penalty": 1.2,
17:34:35.951 INF load_config:282 |     "temperature": 0.9,
17:34:35.951 INF load_config:282 |     "top_k": 10,
17:34:35.951 INF load_config:282 |     "top_p": 0.8
17:34:35.951 INF load_config:282 | }
17:34:35.951 INF Init:448 | LLM init ok
Commands:
  /q, /exit  exit
  /reset     reset the kv cache
  /dd        delete the last round of dialogue
  /pp        print the conversation history
Ctrl+C: stop the current generation
----------------------------------------
prompt >> who are you?
17:34:47.524 INF SetKVCache:749 | prefill_grpid:1 kv_cache_num:1 precompute_len:0 input_num_token:33
17:34:47.524 INF SetKVCache:757 | current prefill_max_token_num:0
17:34:47.524 ERR SetKVCache:758 | precompute_len(0) + input_num_token(33) > kv_cache_num(1)
17:34:47.524 ERR Run:1213 | SetKVCache failed
prompt >>
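The failed run above can be read straight from the log: SetKVCache rejects a prefill whose tokens will not fit in the available KV cache. A minimal sketch of that admission check, mirroring the logged condition (the function name is illustrative, not from the runtime):

```python
def set_kv_cache_ok(precompute_len: int, input_num_token: int,
                    kv_cache_num: int) -> bool:
    # Mirrors the check in the log: SetKVCache fails when
    # precompute_len + input_num_token > kv_cache_num.
    return precompute_len + input_num_token <= kv_cache_num

# The failing run above: precompute_len(0) + input_num_token(33) > kv_cache_num(1).
assert not set_kv_cache_ok(0, 33, 1)
# With the kv_cache_num of 1024 reported at init, the same prompt would fit.
assert set_kv_cache_ok(0, 33, 1024)
```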

Legacy Demo Flow

The original AX650 demo path is preserved. After the repository was flattened, the scripts were updated to resolve model files relative to the script directory, so they no longer depend on the old nested qwen2.5-7b-gptq-int4-ax650/ path.

Download all files from this repository to the device

root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# tree -L 1
.
├── config.json
├── model.embed_tokens.weight.bfloat16.bin
├── post_config.json
├── qwen2_p128_l0_together.axmodel
├── ...
├── qwen2_p128_l27_together.axmodel
├── qwen2_post.axmodel
├── qwen2_tokenizer.txt
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer.py
├── main_axcl_aarch64
├── main_axcl_x86
├── main_prefill
├── run_qwen2.5_7b_gptq_int4_ax650.sh
├── run_qwen2.5_7b_gptq_int4_axcl_aarch64.sh
└── run_qwen2.5_7b_gptq_int4_axcl_x86.sh

Start the Tokenizer service

root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# python qwen2.5_tokenizer.py --port 12345
None None 151645 <|im_end|>
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
hello world<|im_end|>
<|im_start|>assistant

[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14990,
http://localhost:12345
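The prompt printed by the tokenizer service follows the standard Qwen2.5 chat template, which can be reproduced directly (a sketch; the real script also returns the token IDs, which requires the tokenizer files):

```python
def build_qwen_prompt(user_msg: str,
                      system_msg: str = "You are Qwen, created by Alibaba Cloud. "
                                        "You are a helpful assistant.") -> str:
    # Qwen2.5 chat template, exactly as shown in the tokenizer service output above.
    return (f"<|im_start|>system\n{system_msg}<|im_end|>\n"
            f"<|im_start|>user\n{user_msg}<|im_end|>\n"
            f"<|im_start|>assistant\n")

print(build_qwen_prompt("hello world"))
```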

Inference with AX650 Host, such as M4N-Dock (爱芯派Pro) or AX650N DEMO Board

Open another terminal and run run_qwen2.5_7b_gptq_int4_ax650.sh

root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# ./run_qwen2.5_7b_gptq_int4_ax650.sh
[I][                            Init][ 125]: LLM init start
bos_id: -1, eos_id: 151645
  3% | ██                               |   1 /  31 [0.00s<0.09s, 333.33 count/s] tokenizer init ok
100% | ████████████████████████████████ |  31 /  31 [45.25s<45.25s, 0.69 count/s] init post axmodel ok,remain_cmm(7664 MB)
[I][                            Init][ 246]: kv_cache_size : 512, kv_cache_num: 1024
[I][                            Init][ 254]: prefill_token_num : 128
[I][                     load_config][ 281]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 268]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> 1+1=?
[I][                             Run][ 466]: ttft: 1138.88 ms
1+1 equals 2.
[N][                             Run][ 605]: hit eos,avg 4.65 token/s

>> who are you
[I][                             Run][ 466]: ttft: 1137.90 ms
I'm Qwen, a large language model created by Alibaba Cloud. How can I assist you today?
[N][                             Run][ 605]: hit eos,avg 4.52 token/s

The flattened legacy path was revalidated on AX650 with a one-shot prompt:

cd Qwen2.5-7B-Instruct-GPTQ-Int4
python3 qwen2.5_tokenizer.py --port 12345
./run_qwen2.5_7b_gptq_int4_ax650.sh '1+1=?'

Observed output on the board:

[I][                            Init][ 268]: LLM init ok
[I][                             Run][ 466]: ttft: 1139.06 ms
1+1 equals 2.
[N][                             Run][ 605]: hit eos,avg 4.68 token/s
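When benchmarking repeatedly, the ttft and average throughput can be scraped from the runtime's log lines. The regexes below match the log format shown in the transcripts; the helper itself is illustrative, not part of the runtime:

```python
import re

TTFT_RE = re.compile(r"ttft:\s*([\d.]+)\s*ms")
TPS_RE = re.compile(r"avg\s*([\d.]+)\s*token/s")

def parse_metrics(log: str):
    """Extract (ttft_ms, tokens_per_sec) from one run's log text."""
    ttft = TTFT_RE.search(log)
    tps = TPS_RE.search(log)
    return (float(ttft.group(1)) if ttft else None,
            float(tps.group(1)) if tps else None)

log = ("[I][                             Run][ 466]: ttft: 1139.06 ms\n"
       "[N][                             Run][ 605]: hit eos,avg 4.68 token/s\n")
print(parse_metrics(log))  # (1139.06, 4.68)
```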

Inference with M.2 Accelerator card

What is the M.2 Accelerator card? This demo runs it on a Raspberry Pi 5.

(base) axera@raspberrypi:~/samples/qwen2.5-7b $ ./run_qwen2.5_7b_gptq_int4_axcl_aarch64.sh
build time: Feb 13 2025 15:15:07
[I][                            Init][ 111]: LLM init start
bos_id: -1, eos_id: 151645
100% | ████████████████████████████████ |  31 /  31 [67.43s<67.43s, 0.46 count/s] init post axmodel ok,remain_cmm(2739 MB)
[I][                            Init][ 226]: max_token_len : 1024
[I][                            Init][ 231]: kv_cache_size : 512, kv_cache_num: 1024
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 288]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you
I am Qwen, a large language model created by Alibaba Cloud. I'm here to help you with any questions or tasks you might have!
[N][                             Run][ 610]: hit eos,avg 4.33 token/s

>> 1+1=?
1+1 equals 2.
[N][                             Run][ 610]: hit eos,avg 4.54 token/s

>> q

(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI  V2.26.0_20250206225448                                Driver  V2.26.0_20250206225448 |
+-----------------------------------------+--------------+---------------------------------------+
| Card  Name                     Firmware | Bus-Id       |                          Memory-Usage |
| Fan   Temp                Pwr:Usage/Cap | CPU      NPU |                             CMM-Usage |
|=========================================+==============+=======================================|
+-----------------------------------------+--------------+---------------------------------------+
|    0  AX650N                    V2.26.0 | 0000:05:00.0 |                175 MiB /      945 MiB |
|   --   61C                      -- / -- | 0%        0% |               4301 MiB /     7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+

+------------------------------------------------------------------------------------------------+
| Processes:                                                                                     |
| Card      PID  Process Name                                                   NPU Memory Usage |
|================================================================================================|
|    0    63118  /home/axera/samples/qwen2.5-7b-gptq-int4/main_axcl_aarch64          4316448 KiB |
+------------------------------------------------------------------------------------------------+
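axcl-smi reports the per-process footprint in KiB; converting the 4316448 KiB above shows it roughly matches the 4301 MiB CMM usage in the card summary (a quick arithmetic sketch):

```python
# Convert the per-process NPU memory reported by axcl-smi (KiB) to MiB and GiB.
kib = 4316448
mib = kib / 1024
gib = mib / 1024
print(round(mib))     # 4215
print(round(gib, 2))  # 4.12
```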