Qwen2.5-7B-Instruct-GPTQ-Int4

This repository contains the AX650 deployment of Qwen2.5-7B-Instruct-GPTQ-Int4. The model assets have been flattened to match the axllm model directory convention, so the repository root can be used directly as an axllm model directory.

Compatible with Pulsar2 version: 3.4 (not yet released)

Conversion tool links:

If you are interested in model conversion, you can export the axmodel yourself from the original repository: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4

Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel

AXera NPU LLM Runtime

Support Platform

| Chips | w8a16          | w4a16          |
|-------|----------------|----------------|
| AX650 | 2.6 tokens/sec | 4.8 tokens/sec |
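As a rough sanity check, the measured rates above translate into generation time as follows (a sketch using only the table's numbers):

```python
# Estimate wall-clock decode time for n tokens at the measured AX650 rates above.
def decode_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    return n_tokens / tokens_per_sec

# 100 tokens at the w4a16 rate (4.8 tokens/sec) vs. the w8a16 rate (2.6 tokens/sec).
print(round(decode_seconds(100, 4.8), 1))  # 20.8
print(round(decode_seconds(100, 2.6), 1))  # 38.5
```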

Repository Layout

.
├── config.json
├── post_config.json
├── qwen2_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── qwen2_p128_l0_together.axmodel
├── ...
├── qwen2_p128_l27_together.axmodel
├── qwen2_post.axmodel
├── qwen2.5_tokenizer/
├── qwen2.5_tokenizer.py
├── main_prefill
├── main_axcl_aarch64
├── main_axcl_x86
└── run_qwen2.5_7b_gptq_int4_*.sh

All text-model files required by axllm now live at the repository root. The original tokenizer service and legacy demo binaries are kept for compatibility.
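Before pointing axllm at a directory, a quick check like the following can confirm it matches the layout above (the file names come from the listing; the script itself is illustrative, not part of axllm):

```python
import os

# Files the flattened axllm model directory is expected to contain,
# per the repository layout above (per-layer files are checked by pattern).
REQUIRED = [
    "config.json",
    "post_config.json",
    "qwen2_tokenizer.txt",
    "model.embed_tokens.weight.bfloat16.bin",
    "qwen2_post.axmodel",
]

def missing_files(model_dir: str) -> list:
    """Return required files absent from model_dir; also flag the case
    where no qwen2_p128_l*_together.axmodel layer files are present."""
    missing = [f for f in REQUIRED
               if not os.path.isfile(os.path.join(model_dir, f))]
    layers = [f for f in os.listdir(model_dir)
              if f.startswith("qwen2_p128_l") and f.endswith("_together.axmodel")]
    if not layers:
        missing.append("qwen2_p128_l*_together.axmodel")
    return missing
```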

Direct Inference with axllm

Note: the axllm inference flow is still being finalized, so the instructions and scripts below may be updated in the future. Please check back for the latest updates.

Installation

ๆ–นๅผไธ€: ๅ…‹้š†ไป“ๅบ“ๅŽๆ‰ง่กŒๅฎ‰่ฃ…่„šๆœฌ:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

ๆ–นๅผไบŒ: ไธ€่กŒๅ‘ฝไปคๅฎ‰่ฃ… (้ป˜่ฎคๅˆ†ๆ”ฏ axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

ๆ–นๅผไธ‰: ไธ‹่ฝฝ Github Actions CI ๅฏผๅ‡บ็š„ๅฏๆ‰ง่กŒ็จ‹ๅบ:

If you do not have a build environment, download the latest CI-exported axllm from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Download Model from Hugging Face

mkdir -p AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
cd AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
hf download AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650 --local-dir .

Run

$ axllm run AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
---
17:34:29.520 INF Init:218 | LLM init start
tokenizer_type = 1
 96% | ##############################   |  30 /  31 [6.43s<6.64s, 4.67 count/s] init post axmodel ok,remain_cmm(3560 MB)
17:34:35.946 INF Init:368 | max_token_len : 1024
17:34:35.946 INF Init:371 | kv_cache_size : 512, kv_cache_num: 1024
17:34:35.946 INF Init:374 | prefill_token_num : 128
17:34:35.946 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
17:34:35.946 INF Init:384 | prefill_max_token_num : 1
17:34:35.946 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  31 /  31 [6.43s<6.43s, 4.82 count/s] embed_selector init ok
17:34:35.951 INF load_config:282 | load config:
17:34:35.951 INF load_config:282 | {
17:34:35.951 INF load_config:282 |     "enable_repetition_penalty": false,
17:34:35.951 INF load_config:282 |     "enable_temperature": true,
17:34:35.951 INF load_config:282 |     "enable_top_k_sampling": true,
17:34:35.951 INF load_config:282 |     "enable_top_p_sampling": false,
17:34:35.951 INF load_config:282 |     "penalty_window": 20,
17:34:35.951 INF load_config:282 |     "repetition_penalty": 1.2,
17:34:35.951 INF load_config:282 |     "temperature": 0.9,
17:34:35.951 INF load_config:282 |     "top_k": 10,
17:34:35.951 INF load_config:282 |     "top_p": 0.8
17:34:35.951 INF load_config:282 | }
17:34:35.951 INF Init:448 | LLM init ok
Commands:
  /q, /exit  exit
  /reset     reset the kv cache
  /dd        delete the last round of dialogue
  /pp        print the conversation history
Ctrl+C: stop the current generation
----------------------------------------
prompt >> who are you?
17:34:47.524 INF SetKVCache:749 | prefill_grpid:1 kv_cache_num:1 precompute_len:0 input_num_token:33
17:34:47.524 INF SetKVCache:757 | current prefill_max_token_num:0
17:34:47.524 ERR SetKVCache:758 | precompute_len(0) + input_num_token(33) > kv_cache_num(1)
17:34:47.524 ERR Run:1213 | SetKVCache failed
prompt >>
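The failed run above can be read straight from the log: SetKVCache rejects a prefill whose tokens will not fit in the available KV cache. A minimal sketch of that admission check, mirroring the logged condition (the function name is illustrative, not from the runtime):

```python
def set_kv_cache_ok(precompute_len: int, input_num_token: int,
                    kv_cache_num: int) -> bool:
    # Mirrors the check in the log: SetKVCache fails when
    # precompute_len + input_num_token > kv_cache_num.
    return precompute_len + input_num_token <= kv_cache_num

# The failing run above: precompute_len(0) + input_num_token(33) > kv_cache_num(1).
assert not set_kv_cache_ok(0, 33, 1)
# With the kv_cache_num of 1024 reported at init, the same prompt would fit.
assert set_kv_cache_ok(0, 33, 1024)
```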

Legacy Demo Flow

The original AX650 demo path is preserved. After the repository was flattened, the scripts were updated to resolve model files relative to the script directory, so they no longer depend on the old nested qwen2.5-7b-gptq-int4-ax650/ path.

Download all files from this repository to the device

root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# tree -L 1
.
├── config.json
├── model.embed_tokens.weight.bfloat16.bin
├── post_config.json
├── qwen2_p128_l0_together.axmodel
├── ...
├── qwen2_p128_l27_together.axmodel
├── qwen2_post.axmodel
├── qwen2_tokenizer.txt
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer.py
├── main_axcl_aarch64
├── main_axcl_x86
├── main_prefill
├── run_qwen2.5_7b_gptq_int4_ax650.sh
├── run_qwen2.5_7b_gptq_int4_axcl_aarch64.sh
└── run_qwen2.5_7b_gptq_int4_axcl_x86.sh

Start the Tokenizer service

root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# python qwen2.5_tokenizer.py --port 12345
None None 151645 <|im_end|>
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
hello world<|im_end|>
<|im_start|>assistant

[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14990,
http://localhost:12345
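The prompt printed by the tokenizer service follows the standard Qwen2.5 chat template, which can be reproduced directly (a sketch; the real script also returns the token IDs, which requires the tokenizer files):

```python
def build_qwen_prompt(user_msg: str,
                      system_msg: str = "You are Qwen, created by Alibaba Cloud. "
                                        "You are a helpful assistant.") -> str:
    # Qwen2.5 chat template, exactly as shown in the tokenizer service output above.
    return (f"<|im_start|>system\n{system_msg}<|im_end|>\n"
            f"<|im_start|>user\n{user_msg}<|im_end|>\n"
            f"<|im_start|>assistant\n")

print(build_qwen_prompt("hello world"))
```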

Inference with AX650 Host, such as M4N-Dock (爱芯派Pro) or AX650N DEMO Board

Open another terminal and run run_qwen2.5_7b_gptq_int4_ax650.sh

root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# ./run_qwen2.5_7b_gptq_int4_ax650.sh
[I][                            Init][ 125]: LLM init start
bos_id: -1, eos_id: 151645
  3% | ██                               |   1 /  31 [0.00s<0.09s, 333.33 count/s] tokenizer init ok
100% | ████████████████████████████████ |  31 /  31 [45.25s<45.25s, 0.69 count/s] init post axmodel ok,remain_cmm(7664 MB)
[I][                            Init][ 246]: kv_cache_size : 512, kv_cache_num: 1024
[I][                            Init][ 254]: prefill_token_num : 128
[I][                     load_config][ 281]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 268]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> 1+1=?
[I][                             Run][ 466]: ttft: 1138.88 ms
1+1 equals 2.
[N][                             Run][ 605]: hit eos,avg 4.65 token/s

>> who are you
[I][                             Run][ 466]: ttft: 1137.90 ms
I'm Qwen, a large language model created by Alibaba Cloud. How can I assist you today?
[N][                             Run][ 605]: hit eos,avg 4.52 token/s

The flattened legacy path was revalidated on AX650 with a one-shot prompt:

cd Qwen2.5-7B-Instruct-GPTQ-Int4
python3 qwen2.5_tokenizer.py --port 12345
./run_qwen2.5_7b_gptq_int4_ax650.sh '1+1=?'

Observed output on the board:

[I][                            Init][ 268]: LLM init ok
[I][                             Run][ 466]: ttft: 1139.06 ms
1+1 equals 2.
[N][                             Run][ 605]: hit eos,avg 4.68 token/s
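When benchmarking repeatedly, the ttft and average throughput can be scraped from the runtime's log lines. The regexes below match the log format shown in the transcripts; the helper itself is illustrative, not part of the runtime:

```python
import re

TTFT_RE = re.compile(r"ttft:\s*([\d.]+)\s*ms")
TPS_RE = re.compile(r"avg\s*([\d.]+)\s*token/s")

def parse_metrics(log: str):
    """Extract (ttft_ms, tokens_per_sec) from one run's log text."""
    ttft = TTFT_RE.search(log)
    tps = TPS_RE.search(log)
    return (float(ttft.group(1)) if ttft else None,
            float(tps.group(1)) if tps else None)

log = ("[I][                             Run][ 466]: ttft: 1139.06 ms\n"
       "[N][                             Run][ 605]: hit eos,avg 4.68 token/s\n")
print(parse_metrics(log))  # (1139.06, 4.68)
```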

Inference with M.2 Accelerator card

What is the M.2 Accelerator card? This demo runs it on a Raspberry Pi 5.

(base) axera@raspberrypi:~/samples/qwen2.5-7b $ ./run_qwen2.5_7b_gptq_int4_axcl_aarch64.sh
build time: Feb 13 2025 15:15:07
[I][                            Init][ 111]: LLM init start
bos_id: -1, eos_id: 151645
100% | ████████████████████████████████ |  31 /  31 [67.43s<67.43s, 0.46 count/s] init post axmodel ok,remain_cmm(2739 MB)
[I][                            Init][ 226]: max_token_len : 1024
[I][                            Init][ 231]: kv_cache_size : 512, kv_cache_num: 1024
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 288]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you
I am Qwen, a large language model created by Alibaba Cloud. I'm here to help you with any questions or tasks you might have!
[N][                             Run][ 610]: hit eos,avg 4.33 token/s

>> 1+1=?
1+1 equals 2.
[N][                             Run][ 610]: hit eos,avg 4.54 token/s

>> q

(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI  V2.26.0_20250206225448                                Driver  V2.26.0_20250206225448 |
+-----------------------------------------+--------------+---------------------------------------+
| Card  Name                     Firmware | Bus-Id       |                          Memory-Usage |
| Fan   Temp                Pwr:Usage/Cap | CPU      NPU |                             CMM-Usage |
|=========================================+==============+=======================================|
+-----------------------------------------+--------------+---------------------------------------+
|    0  AX650N                    V2.26.0 | 0000:05:00.0 |                175 MiB /      945 MiB |
|   --   61C                      -- / -- | 0%        0% |               4301 MiB /     7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+

+------------------------------------------------------------------------------------------------+
| Processes:                                                                                     |
| Card      PID  Process Name                                                   NPU Memory Usage |
|================================================================================================|
|    0    63118  /home/axera/samples/qwen2.5-7b-gptq-int4/main_axcl_aarch64          4316448 KiB |
+------------------------------------------------------------------------------------------------+
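axcl-smi reports the per-process footprint in KiB; converting the 4316448 KiB above shows it roughly matches the 4301 MiB CMM usage in the card summary (a quick arithmetic sketch):

```python
# Convert the per-process NPU memory reported by axcl-smi (KiB) to MiB and GiB.
kib = 4316448
mib = kib / 1024
gib = mib / 1024
print(round(mib))     # 4215
print(round(gib, 2))  # 4.12
```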