## Performance on BFCL Benchmark

Source: *The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models*
### Non-Live Evaluation (Overall: 85.25%)

| Task Category | Accuracy |
|---|---|
| AST Summary | 86.46% |
| Simple AST | 72.83% |
| Python Simple | 96.50% |
| Java Simple | 60.00% |
| JavaScript Simple | 62.00% |
| Multiple AST | 92.50% |
| Parallel AST | 92.00% |
| Parallel Multiple AST | 88.50% |
| Irrelevance Detection | 80.42% |
### Live Evaluation (Overall: 74.46%)

| Task Category | Accuracy |
|---|---|
| AST Summary | 75.87% |
| Python Simple AST | 76.36% |
| Python Multiple AST | 76.26% |
| Python Parallel AST | 56.25% |
| Python Parallel Multiple AST | 66.67% |
| Irrelevance Detection | 72.22% |
| Relevance Detection | 77.78% |
# Qwen2.5-14B-Instruct-APIGen-MT-5k

This model is a fine-tuned version of Qwen/Qwen2.5-14B-Instruct, tailored for tool-calling tasks. It was trained on the Salesforce/APIGen-MT-5k dataset to improve its ability to issue tool calls based on user instructions.
## Model Details

- Base Model: Qwen/Qwen2.5-14B-Instruct
- Model Size: 14B parameters
- Fine-tuning Method: Supervised fine-tuning (SFT) with LoRA (Low-Rank Adaptation)
## Training Configuration

| Setting | Value |
|---|---|
| Dataset | Salesforce/APIGen-MT-5k |
| Epochs | 3 |
| Batch Size | 64 |
| Learning Rate | 1e-5 |
| Weight Decay | 2e-6 |
| Scheduler | Cosine |
| LoRA Rank | 16 |
| Quantization | 4-bit during training |
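For reference, these settings map onto the `peft`/`trl` stack roughly as follows. This is a minimal sketch of how such a run could be configured, not the author's actual training script; the LoRA alpha, the per-device batch size vs. gradient-accumulation split behind the effective batch size of 64, and the dataset column handling are all assumptions.

```python
# Illustrative QLoRA-style SFT setup (a sketch, not the author's script).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base_model = "Qwen/Qwen2.5-14B-Instruct"

# 4-bit quantization during training, per the configuration table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

# LoRA rank 16 from the table; lora_alpha is an assumed value.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Effective batch size 64, assumed here as 4 per device x 16 accumulation steps.
train_args = SFTConfig(
    output_dir="qwen2.5-14b-apigen-mt-5k",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    weight_decay=2e-6,
    lr_scheduler_type="cosine",
)

# Note: the APIGen-MT-5k records may need mapping into the chat format the
# trainer expects before this runs end to end.
trainer = SFTTrainer(
    model=model,
    args=train_args,
    train_dataset=load_dataset("Salesforce/APIGen-MT-5k", split="train"),
    peft_config=peft_config,
)
trainer.train()
```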
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "miazhao/Qwen2.5-14B-Instruct-APIGen-MT-5k"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
)

# Conversation history; the last user turn should trigger a tool call.
messages = [
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "Thanks. I am doing well. How can I help you?"},
    {"role": "user", "content": "What's the weather like in London?"},
]

# JSON-schema descriptions of the tools the model is allowed to call.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit of temperature to return"},
            },
            "required": ["location"],
        },
    }
]

print("====== prompt after applying chat template ======")
print(tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False))

# Tokenize the templated prompt and generate a response.
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
input_ids_len = inputs["input_ids"].shape[-1]
inputs = {k: v.to(model.device) for k, v in inputs.items()}

print("====== model response ======")
outputs = model.generate(**inputs, max_new_tokens=256)
generated_tokens = outputs[:, input_ids_len:]  # strip the prompt, keep only new tokens
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```
**Expected response:**

```
Sure, let me check the current weather in London for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "London"}}
</tool_call>
```
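The text between the `<tool_call>` tags is plain JSON, so downstream code can extract it, execute the matching function, and feed the result back to the model. The sketch below shows one way to parse such output; the hard-coded `response` string and the stand-in weather payload are illustrative only, not part of the model or its API.

```python
import json
import re


def parse_tool_calls(text: str) -> list[dict]:
    """Extract every JSON object wrapped in <tool_call>...</tool_call> tags."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(match) for match in pattern.findall(text)]


# Stand-in for the decoded model output from the script above.
response = (
    "Sure, let me check the current weather in London for you.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"location": "London"}}\n'
    "</tool_call>"
)

for call in parse_tool_calls(response):
    # Dispatch to your own implementation of the declared tool; the weather
    # payload below is a hard-coded stand-in, not a real API result.
    if call["name"] == "get_weather":
        result = {"temperature": 18, "unit": "celsius"}
        print(call["name"], call["arguments"], "->", result)
```

From there, append the assistant turn and the tool result to `messages` (e.g. `{"role": "tool", "name": "get_weather", "content": json.dumps(result)}`), re-apply the chat template, and generate again so the model can phrase the final answer from the raw result.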