Model Overview

Model Summary

Qwen is the series of large language models and large multimodal models developed by the Qwen Team at Alibaba Group. Both the language models and the multimodal models are pretrained on large-scale multilingual and multimodal data and post-trained on high-quality data to align with human preferences. Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, acting as an AI agent, and more.

The latest version, Qwen3, has the following features:

  • Dense and Mixture-of-Experts (MoE) models, available in 0.6B, 1.7B, 4B, 8B, 14B, 30B, 32B, and 235B sizes.

  • Seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose chat) within a single model, ensuring optimal performance across various scenarios.

  • Significant enhancement of reasoning capabilities, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.

  • Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.

  • Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.

  • Support for 100+ languages and dialects, with strong capabilities for multilingual instruction following and translation.

For more details, please refer to Qwen Blog, GitHub, and Documentation.

Weights and Keras model code are released under the Apache 2 License.

Installation

Keras and KerasHub can be installed with:

pip install -U -q keras-hub
pip install -U -q keras

JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
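Keras 3 can run on any of these backends; the backend is selected via the `KERAS_BACKEND` environment variable, which must be set before `keras` is first imported:

```python
import os

# Select the backend before the first `import keras`;
# valid values are "jax", "tensorflow", and "torch".
os.environ["KERAS_BACKEND"] = "jax"
```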

Available Qwen3 MoE Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset | Parameters | Description |
|---|---|---|
| qwen3_moe_30b_a3b_en | 30.5B | Mixture-of-Experts (MoE) model with 30.5 billion total parameters (3.3 billion activated), 48 layers, 32 query and 4 key/value attention heads, and 128 experts (8 active). |
| qwen3_moe_235b_a22b_en | 235B | Mixture-of-Experts (MoE) model with 235 billion total parameters (22 billion activated), 94 layers, 64 query and 4 key/value attention heads, and 128 experts (8 active). |
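The "128 experts (8 active)" figures describe top-k expert routing: for each token, a small router scores all experts and only the 8 highest-scoring experts are evaluated, which is why only a fraction of the total parameters are active per token. A minimal numpy sketch of this routing idea (illustrative only, not the keras_hub implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, hidden = 128, 8, 16

# Router assigns a score to every expert for one token's hidden state.
x = rng.normal(size=hidden)
router_w = rng.normal(size=(hidden, num_experts))
scores = x @ router_w

# Only the top-k experts by score are evaluated for this token;
# their outputs are mixed with softmax-normalized weights.
active = np.argsort(scores)[-top_k:]
exp_scores = np.exp(scores[active] - scores[active].max())
weights = exp_scores / exp_scores.sum()

print(len(active))  # 8 experts run out of 128
```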

Example Usage

import keras
import keras_hub

# Load the pre-trained Qwen3 MoE model
qwen3_moe_lm = keras_hub.models.Qwen3MoeCausalLM.from_preset("qwen3_moe_30b_a3b_en")

# Generate text from prompt
response = qwen3_moe_lm.generate("I want to learn about", max_length=50)
print(response)

# Batch generation with multiple prompts
prompts = ["The future of AI is", "Machine learning helps us"]
responses = qwen3_moe_lm.generate(prompts, max_length=30)
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

Custom Sampling Strategies


# Greedy sampling
qwen3_moe_lm.compile(sampler="greedy")
response = qwen3_moe_lm.generate("Explain quantum computing", max_length=100)

# Top-k sampling
qwen3_moe_lm.compile(sampler="top_k")
response = qwen3_moe_lm.generate("Write a story about", max_length=80)

# Beam search
qwen3_moe_lm.compile(sampler=keras_hub.samplers.BeamSampler(num_beams=4))
response = qwen3_moe_lm.generate("The best way to learn programming is", max_length=60)
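The "top_k" strategy restricts sampling to the k highest-probability tokens at each step, trading some diversity for coherence. A toy numpy sketch of the idea (not the keras_hub sampler itself):

```python
import numpy as np

def top_k_sample(logits, k, rng):
    # Keep the k largest logits, renormalize, and sample among them.
    top = np.argsort(logits)[-k:]
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return rng.choice(top, p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
token = top_k_sample(logits, k=2, rng=rng)
print(token)  # always 0 or 1: only the two highest-logit tokens are eligible
```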

Fine-tuning with LoRA


# Enable LoRA for efficient fine-tuning
qwen3_moe_lm.backbone.enable_lora(rank=8)

# Prepare training data
training_texts = [
    "The quick brown fox jumped over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Deep learning models require large amounts of training data.",
    "Natural language processing helps computers understand human language."
]

# Compile for training
qwen3_moe_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(1e-4),
    metrics=["accuracy"]
)

# Fine-tune the model
qwen3_moe_lm.fit(x=training_texts, batch_size=2, epochs=3)

# Generate with fine-tuned model
response = qwen3_moe_lm.generate("The importance of", max_length=50)
print(response)
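`enable_lora(rank=8)` freezes the base weights and trains only small low-rank adapter matrices, which is what makes fine-tuning a model of this size tractable. The parameter savings can be sketched with hypothetical layer shapes (not Qwen3's actual dimensions):

```python
d_in, d_out, rank = 1024, 1024, 8

# Full fine-tuning would update the entire weight matrix W.
full_params = d_in * d_out

# LoRA trains only two small matrices A (d_in x r) and B (r x d_out);
# the effective weight is W + A @ B, with W frozen.
lora_params = d_in * rank + rank * d_out

print(full_params, lora_params)  # 1048576 vs 16384
print(round(lora_params / full_params * 100, 2))  # 1.56 (% of the parameters)
```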

Custom Backbone Configuration


# Create custom Qwen3 MoE backbone
backbone = keras_hub.models.Qwen3MoeBackbone(
    vocabulary_size=151936,
    num_layers=12,  # Smaller model for faster training
    num_query_heads=16,
    num_key_value_heads=8,
    head_dim=128,
    hidden_dim=1024,
    intermediate_dim=2048,
    layer_norm_epsilon=1e-6,
    dropout=0.1,
    dtype="float32"
)

# Create tokenizer first
tokenizer = keras_hub.models.Qwen3MoeTokenizer.from_preset("qwen3_moe_30b_a3b_en")

# Create preprocessor with tokenizer
preprocessor = keras_hub.models.Qwen3MoeCausalLMPreprocessor(
    tokenizer=tokenizer,
    sequence_length=512
)

# Create custom causal LM
custom_qwen3_moe = keras_hub.models.Qwen3MoeCausalLM(
    backbone=backbone,
    preprocessor=preprocessor
)

# Compile and train
custom_qwen3_moe.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(1e-4)
)

# Training data
texts = ["Hello world", "How are you", "Machine learning"]
custom_qwen3_moe.fit(x=texts, batch_size=2, epochs=1)
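Under the hood, the preprocessor turns each raw string into shifted (input, label) token pairs for next-token prediction, which is why plain strings can be passed to `fit`. A schematic sketch with toy token IDs (illustrative only, not the actual Qwen3 tokenizer output):

```python
# Toy token IDs standing in for one tokenized training sentence.
tokens = [101, 7, 42, 9, 102]

# For causal LM training, the model sees each prefix and must
# predict the token that follows it, so labels are inputs shifted by one.
inputs = tokens[:-1]  # [101, 7, 42, 9]
labels = tokens[1:]   # [7, 42, 9, 102]

print(inputs, labels)
```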

Example Usage with Hugging Face URI

import keras
import keras_hub

# Load the pre-trained Qwen3 MoE model from the Hugging Face Hub
qwen3_moe_lm = keras_hub.models.Qwen3MoeCausalLM.from_preset("hf://keras/qwen3_moe_30b_a3b_en")

# Generate text from prompt
response = qwen3_moe_lm.generate("I want to learn about", max_length=50)
print(response)

# Batch generation with multiple prompts
prompts = ["The future of AI is", "Machine learning helps us"]
responses = qwen3_moe_lm.generate(prompts, max_length=30)
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

Custom Sampling Strategies


# Greedy sampling
qwen3_moe_lm.compile(sampler="greedy")
response = qwen3_moe_lm.generate("Explain quantum computing", max_length=100)

# Top-k sampling
qwen3_moe_lm.compile(sampler="top_k")
response = qwen3_moe_lm.generate("Write a story about", max_length=80)

# Beam search
qwen3_moe_lm.compile(sampler=keras_hub.samplers.BeamSampler(num_beams=4))
response = qwen3_moe_lm.generate("The best way to learn programming is", max_length=60)

Fine-tuning with LoRA


# Enable LoRA for efficient fine-tuning
qwen3_moe_lm.backbone.enable_lora(rank=8)

# Prepare training data
training_texts = [
    "The quick brown fox jumped over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Deep learning models require large amounts of training data.",
    "Natural language processing helps computers understand human language."
]

# Compile for training
qwen3_moe_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(1e-4),
    metrics=["accuracy"]
)

# Fine-tune the model
qwen3_moe_lm.fit(x=training_texts, batch_size=2, epochs=3)

# Generate with fine-tuned model
response = qwen3_moe_lm.generate("The importance of", max_length=50)
print(response)

Custom Backbone Configuration


# Create custom Qwen3 MoE backbone
backbone = keras_hub.models.Qwen3MoeBackbone(
    vocabulary_size=151936,
    num_layers=12,  # Smaller model for faster training
    num_query_heads=16,
    num_key_value_heads=8,
    head_dim=128,
    hidden_dim=1024,
    intermediate_dim=2048,
    layer_norm_epsilon=1e-6,
    dropout=0.1,
    dtype="float32"
)

# Create tokenizer first
tokenizer = keras_hub.models.Qwen3MoeTokenizer.from_preset("hf://keras/qwen3_moe_30b_a3b_en")

# Create preprocessor with tokenizer
preprocessor = keras_hub.models.Qwen3MoeCausalLMPreprocessor(
    tokenizer=tokenizer,
    sequence_length=512
)

# Create custom causal LM
custom_qwen3_moe = keras_hub.models.Qwen3MoeCausalLM(
    backbone=backbone,
    preprocessor=preprocessor
)

# Compile and train
custom_qwen3_moe.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(1e-4)
)

# Training data
texts = ["Hello world", "How are you", "Machine learning"]
custom_qwen3_moe.fit(x=texts, batch_size=2, epochs=1)