Qwen2.5-7B-Instruct-GPTQ-Int4
This repository contains the AX650 deployment of Qwen2.5-7B-Instruct-GPTQ-Int4. The model assets have been flattened to match the axllm model directory convention, so the repository root can be used directly as an axllm model directory.
Compatible with Pulsar2 version: 3.4 (not yet released)
Conversion tool links:
If you are interested in model conversion, you can export the axmodel yourself from the original repository: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel
Supported Platforms
- AX650
- AX650N DEMO Board
- M4N-Dock (AXera-Pi Pro)
- M.2 Accelerator card
| Chips | w8a16 | w4a16 |
|---|---|---|
| AX650 | 2.6 tokens/sec | 4.8 tokens/sec |
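The decode throughput above can be read as per-token latency; a quick sketch of that conversion:

```python
# Per-token decode latency implied by the throughput table above.
for mode, tps in {"w8a16": 2.6, "w4a16": 4.8}.items():
    ms_per_token = 1000.0 / tps
    print(f"{mode}: {ms_per_token:.0f} ms/token")
# w8a16: 385 ms/token, w4a16: 208 ms/token
```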
Repository Layout
.
├── config.json
├── post_config.json
├── qwen2_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── qwen2_p128_l0_together.axmodel
├── ...
├── qwen2_p128_l27_together.axmodel
├── qwen2_post.axmodel
├── qwen2.5_tokenizer/
├── qwen2.5_tokenizer.py
├── main_prefill
├── main_axcl_aarch64
├── main_axcl_x86
└── run_qwen2.5_7b_gptq_int4_*.sh
The text model files required by axllm are all at the repository root. The original tokenizer service and legacy demo binaries are kept for compatibility.
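Before pointing axllm at the directory, it can help to verify the flattened layout is complete. A minimal sketch (the `missing_assets` helper is hypothetical; file names follow the layout listed above, with the 28 per-layer axmodel files spanning l0..l27):

```python
from pathlib import Path

def missing_assets(model_dir="."):
    """Return the expected model files that are absent from model_dir."""
    root = Path(model_dir)
    expected = [
        "config.json",
        "post_config.json",
        "qwen2_tokenizer.txt",
        "model.embed_tokens.weight.bfloat16.bin",
        "qwen2_post.axmodel",
    ]
    # One axmodel per decoder layer: qwen2_p128_l0 .. qwen2_p128_l27
    expected += [f"qwen2_p128_l{i}_together.axmodel" for i in range(28)]
    return [name for name in expected if not (root / name).exists()]

print(missing_assets())  # [] when the download is complete
```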
Direct Inference with axllm
Note: the axllm inference flow is still being finalized, so the instructions and scripts below are expected to change. Please check back for the latest updates.
Installation
Option 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: one-line install (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 3: download the executable built by GitHub Actions CI:
If you do not have a build environment, download the latest CI-built axllm from:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
then:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Download Model from Hugging Face
mkdir -p AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
cd AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
hf download AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650 --local-dir .
Run
$ axllm run AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4-AX650
---
17:34:29.520 INF Init:218 | LLM init start
tokenizer_type = 1
96% | ############################## | 30 / 31 [6.43s<6.64s, 4.67 count/s] init post axmodel ok,remain_cmm(3560 MB)
17:34:35.946 INF Init:368 | max_token_len : 1024
17:34:35.946 INF Init:371 | kv_cache_size : 512, kv_cache_num: 1024
17:34:35.946 INF Init:374 | prefill_token_num : 128
17:34:35.946 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
17:34:35.946 INF Init:384 | prefill_max_token_num : 1
17:34:35.946 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 31 / 31 [6.43s<6.43s, 4.82 count/s] embed_selector init ok
17:34:35.951 INF load_config:282 | load config:
17:34:35.951 INF load_config:282 | {
17:34:35.951 INF load_config:282 | "enable_repetition_penalty": false,
17:34:35.951 INF load_config:282 | "enable_temperature": true,
17:34:35.951 INF load_config:282 | "enable_top_k_sampling": true,
17:34:35.951 INF load_config:282 | "enable_top_p_sampling": false,
17:34:35.951 INF load_config:282 | "penalty_window": 20,
17:34:35.951 INF load_config:282 | "repetition_penalty": 1.2,
17:34:35.951 INF load_config:282 | "temperature": 0.9,
17:34:35.951 INF load_config:282 | "top_k": 10,
17:34:35.951 INF load_config:282 | "top_p": 0.8
17:34:35.951 INF load_config:282 | }
17:34:35.951 INF Init:448 | LLM init ok
Commands:
/q, /exit    quit
/reset       reset the kv cache
/dd          delete the last round of dialogue
/pp          print the dialogue history
Ctrl+C: stop the current generation
----------------------------------------
prompt >> who are you?
17:34:47.524 INF SetKVCache:749 | prefill_grpid:1 kv_cache_num:1 precompute_len:0 input_num_token:33
17:34:47.524 INF SetKVCache:757 | current prefill_max_token_num:0
17:34:47.524 ERR SetKVCache:758 | precompute_len(0) + input_num_token(33) > kv_cache_num(1)
17:34:47.524 ERR Run:1213 | SetKVCache failed
prompt >>
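The ERR lines above show why the current axllm flow fails: the prefill KV cache was allocated with only 1 slot (`prefill_max_kv_cache_num : 1`), so a 33-token prompt cannot fit. A minimal sketch of that bound, with names taken from the log fields (the real check lives in ax-llm's C++ code):

```python
def can_set_kv_cache(precompute_len, input_num_token, kv_cache_num):
    """Prefill fits only if already-cached tokens plus new prompt
    tokens do not exceed the allocated KV cache slots."""
    return precompute_len + input_num_token <= kv_cache_num

# Values from the failing log line: 0 + 33 > 1, so prefill is rejected.
print(can_set_kv_cache(0, 33, 1))     # False
# With the full kv_cache_num of 1024 the same prompt would fit.
print(can_set_kv_cache(0, 33, 1024))  # True
```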
Legacy Demo Flow
The original AX650 demo path is preserved. After the repository was flattened, the scripts were updated to resolve model files relative to the script directory, so they no longer depend on the old nested qwen2.5-7b-gptq-int4-ax650/ path.
Download all files from this repository to the device
root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# tree -L 1
.
├── config.json
├── model.embed_tokens.weight.bfloat16.bin
├── post_config.json
├── qwen2_p128_l0_together.axmodel
├── ...
├── qwen2_p128_l27_together.axmodel
├── qwen2_post.axmodel
├── qwen2_tokenizer.txt
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer.py
├── main_axcl_aarch64
├── main_axcl_x86
├── main_prefill
├── run_qwen2.5_7b_gptq_int4_ax650.sh
├── run_qwen2.5_7b_gptq_int4_axcl_aarch64.sh
└── run_qwen2.5_7b_gptq_int4_axcl_x86.sh
Start the Tokenizer service
root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# python qwen2.5_tokenizer.py --port 12345
None None 151645 <|im_end|>
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
hello world<|im_end|>
<|im_start|>assistant
[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14990,
http://localhost:12345
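At startup the service prints the rendered chat prompt, as shown above. For reference, a minimal sketch of the ChatML-style template Qwen2.5 uses (the `build_prompt` helper is hypothetical; the real rendering and the token IDs in the log come from the Hugging Face tokenizer inside qwen2.5_tokenizer.py):

```python
SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

def build_prompt(user_msg, system_msg=SYSTEM):
    """Render a single-turn ChatML prompt with Qwen's special tokens."""
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_prompt("hello world"))
```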
Inference on an AX650 host, such as the M4N-Dock (AXera-Pi Pro) or the AX650N DEMO board
Open another terminal and run run_qwen2.5_7b_gptq_int4_ax650.sh
root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# ./run_qwen2.5_7b_gptq_int4_ax650.sh
[I][ Init][ 125]: LLM init start
bos_id: -1, eos_id: 151645
3% | โโ | 1 / 31 [0.00s<0.09s, 333.33 count/s] tokenizer init ok
100% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 31 / 31 [45.25s<45.25s, 0.69 count/s] init post axmodel ok,remain_cmm(7664 MB)[I][
[I][ Init][ 246]: kv_cache_size : 512, kv_cache_num: 1024
[I][ Init][ 254]: prefill_token_num : 128
[I][ load_config][ 281]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 268]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> 1+1=?
[I][ Run][ 466]: ttft: 1138.88 ms
1+1 equals 2.
[N][ Run][ 605]: hit eos,avg 4.65 token/s
>> who are you
[I][ Run][ 466]: ttft: 1137.90 ms
I'm Qwen, a large language model created by Alibaba Cloud. How can I assist you today?
[N][ Run][ 605]: hit eos,avg 4.52 token/s
The flattened legacy path was revalidated on AX650 with a one-shot prompt:
cd Qwen2.5-7B-Instruct-GPTQ-Int4
python3 qwen2.5_tokenizer.py --port 12345
./run_qwen2.5_7b_gptq_int4_ax650.sh '1+1=?'
Observed output on the board:
[I][ Init][ 268]: LLM init ok
[I][ Run][ 466]: ttft: 1139.06 ms
1+1 equals 2.
[N][ Run][ 605]: hit eos,avg 4.68 token/s
Inference with M.2 Accelerator card
What is the M.2 accelerator card? This demo is shown running on a Raspberry Pi 5 host.
(base) axera@raspberrypi:~/samples/qwen2.5-7b $ ./run_qwen2.5_7b_gptq_int4_axcl_aarch64.sh
build time: Feb 13 2025 15:15:07
[I][ Init][ 111]: LLM init start
bos_id: -1, eos_id: 151645
100% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 31 / 31 [67.43s<67.43s, 0.46 count/s] init post axmodel okremain_cmm(2739 MB)
[I][ Init][ 226]: max_token_len : 1024
[I][ Init][ 231]: kv_cache_size : 512, kv_cache_num: 1024
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 288]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you
I am Qwen, a large language model created by Alibaba Cloud. I'm here to help you with any questions or tasks you might have!
[N][ Run][ 610]: hit eos,avg 4.33 token/s
>> 1+1=?
1+1 equals 2.
[N][ Run][ 610]: hit eos,avg 4.54 token/s
>> q
(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI V2.26.0_20250206225448 Driver V2.26.0_20250206225448 |
+-----------------------------------------+--------------+---------------------------------------+
| Card Name Firmware | Bus-Id | Memory-Usage |
| Fan Temp Pwr:Usage/Cap | CPU NPU | CMM-Usage |
|=========================================+==============+=======================================|
+-----------------------------------------+--------------+---------------------------------------+
| 0 AX650N V2.26.0 | 0000:05:00.0 | 175 MiB / 945 MiB |
| -- 61C -- / -- | 0% 0% | 4301 MiB / 7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+
+------------------------------------------------------------------------------------------------+
| Processes: |
| Card PID Process Name NPU Memory Usage |
|================================================================================================|
| 0 63118 /home/axera/samples/qwen2.5-7b-gptq-int4/main_axcl_aarch64 4316448 KiB |
+------------------------------------------------------------------------------------------------+
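A quick cross-check of the axcl-smi numbers above: the per-process NPU memory is reported in KiB, while the CMM-Usage column is in MiB.

```python
# Convert the per-process figure and compare against the CMM column.
proc_kib = 4316448
proc_mib = proc_kib / 1024
print(f"process: {proc_mib:.0f} MiB")  # ~4215 MiB, in line with the 4301 MiB CMM usage
cmm_used, cmm_total = 4301, 7040
print(f"CMM utilisation: {cmm_used / cmm_total:.0%}")  # 61%
```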