QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Abstract
QwenLong-L1.5 enhances long-context reasoning through data synthesis, stabilized reinforcement learning, and a memory-augmented architecture, achieving superior performance on long-context benchmarks and in general domains.
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.
Community
Thanks so much for sharing our work on QwenLong-L1.5! We're excited to present our latest research on advancing long-context reasoning in LLMs to the community.
Our main goal was to move beyond simple information retrieval and enable genuine, deep reasoning over vast amounts of text. Our key contributions are:
- High-Quality Data Synthesis: We developed a pipeline to automatically generate complex reasoning tasks that require the model to connect evidence scattered across a document (multi-hop reasoning); a toy sketch of the fact-composition idea follows this list.
- Stabilized Reinforcement Learning: To overcome the notorious instability of RL on long sequences, we introduced task-balanced sampling with task-specific advantage estimation and a novel Adaptive Entropy-Controlled Policy Optimization (AEPO) method (a generic illustration follows the list). We've open-sourced our implementations for these methods in the repo, hoping they can help others in the field.
- Memory-Augmented Architecture: For ultra-long contexts (up to 4M tokens) that exceed any context window, we designed a memory framework that allows the model to process information iteratively, much like a human would read and take notes (a minimal loop is sketched below).
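To make the data-synthesis idea concrete, here is a toy sketch of composing a verifiable multi-hop question from atomic (subject, relation, object) facts: chain facts whose object is the next fact's subject, then phrase a question whose answer is the chain's final entity. The names here (`Fact`, `compose_multi_hop_question`, the example facts) are illustrative assumptions, not the actual pipeline from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str  # e.g. "founder", "alma mater", "city"
    obj: str

def compose_multi_hop_question(facts: list[Fact], hops: int = 3):
    """Return (question, answer) built from a chain of `hops` linked facts,
    or None if no such chain exists in the fact set."""
    by_subject: dict[str, list[Fact]] = {}
    for f in facts:
        by_subject.setdefault(f.subject, []).append(f)

    for start in facts:                       # try each fact as a chain start
        chain = [start]
        while len(chain) < hops:
            candidates = by_subject.get(chain[-1].obj)
            if not candidates:
                break                         # dead end; try another start
            chain.append(candidates[0])
        if len(chain) == hops:
            # Nest the relations so the question must be resolved hop by hop.
            nested = chain[0].subject
            for fact in chain:
                nested = f"the {fact.relation} of {nested}"
            return f"What is {nested}?", chain[-1].obj
    return None

facts = [
    Fact("Acme Corp", "founder", "J. Doe"),
    Fact("J. Doe", "alma mater", "Example University"),
    Fact("Example University", "city", "Springfield"),
]
print(compose_multi_hop_question(facts))
# ('What is the city of the alma mater of the founder of Acme Corp?', 'Springfield')
```

Because the answer is fixed by the chain's final entity, correctness can be checked automatically, which is what makes such questions usable as verifiable reward signals for RL.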
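For the RL stabilization, the sketch below illustrates two of the ideas in spirit only: normalizing advantages within groups of rollouts that share a task type (so no single task dominates the reward signal) and adapting an entropy-bonus coefficient toward a target band. This is a generic approximation with assumed update rules, not the actual AEPO objective; the real implementation is in the GitHub repo linked below.

```python
from collections import defaultdict
import statistics

def task_specific_advantages(rollouts):
    """rollouts: list of dicts like {"task": "multi_hop_qa", "reward": 0.7}.
    Advantages are normalized within each task group so that tasks with
    systematically higher rewards do not bias the policy update."""
    groups = defaultdict(list)
    for r in rollouts:
        groups[r["task"]].append(r["reward"])

    advantages = []
    for r in rollouts:
        rewards = groups[r["task"]]
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
        advantages.append((r["reward"] - mean) / std)
    return advantages

def update_entropy_coef(coef, policy_entropy, target=0.3, rate=0.01,
                        lo=1e-4, hi=1e-1):
    """Raise the entropy bonus when entropy collapses below the target
    (encourage exploration), lower it when the policy is too random."""
    coef = coef * (1.0 + rate) if policy_entropy < target else coef * (1.0 - rate)
    return min(max(coef, lo), hi)

rollouts = [
    {"task": "multi_hop_qa", "reward": 1.0},
    {"task": "multi_hop_qa", "reward": 0.0},
    {"task": "summarization", "reward": 0.6},
    {"task": "summarization", "reward": 0.4},
]
print(task_specific_advantages(rollouts))   # approx. [1.0, -1.0, 1.0, -1.0]
```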
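And for the memory framework, a minimal version of the iterative read-and-take-notes loop might look like the following, where `llm` stands in for any chat-completion call (an assumption for illustration); the actual memory agent and its fusion with single-pass reasoning are considerably more involved.

```python
def answer_with_memory(llm, question: str, document: str,
                       chunk_size: int = 100_000) -> str:
    """Read the document chunk by chunk, maintaining a running set of notes,
    then answer from the notes alone once the whole document has been seen."""
    notes = ""
    for start in range(0, len(document), chunk_size):
        chunk = document[start:start + chunk_size]
        notes = llm(
            "You are reading a long document in pieces.\n"
            f"Question: {question}\n"
            f"Notes so far: {notes}\n"
            f"New chunk: {chunk}\n"
            "Rewrite the notes, keeping only evidence relevant to the question."
        )
    return llm(
        f"Question: {question}\n"
        f"Notes gathered from the full document: {notes}\n"
        "Answer the question using only these notes."
    )
```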
As a result, QwenLong-L1.5 achieves highly competitive performance on long-context benchmarks. We also observed that this enhanced reasoning ability generalizes well to other domains like scientific reasoning and tool use.
The model, code, and paper are all publicly available. We welcome everyone to try it out and share your feedback!
- Hugging Face Model: https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1.5-30B-A3B
- GitHub (with RL implementations): https://github.com/Tongyi-Zhiwen/Qwen-Doc/tree/main/QwenLong-L1.5
- ModelScope: https://modelscope.cn/models/iic/QwenLong-L1.5-30B-A3B
Thanks again for the shout-out, and we look forward to the community's feedback!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts (2025)
- Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes (2025)
- Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model (2025)
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025)
- Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning (2025)
- PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning (2025)
- Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models (2025)
This is an important shift away from "just make the context window bigger" toward active memory management. The combination of synthetic multi-hop data, stabilized long-context RL, and a memory-augmented loop mirrors how humans actually reason over long documents: read, abstract, store, revisit. AEPO is especially interesting as a practical fix to long-horizon RL instability. This feels less like scaling tokens and more like teaching models how to think across time.