arxiv:2512.12967

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Published on Dec 15
Submitted by taesiri on Dec 16
#2 Paper of the day

Abstract

QwenLong-L1.5 enhances long-context reasoning through data synthesis, stabilized reinforcement learning, and a memory-augmented architecture, achieving superior performance on long-context benchmarks and in general domains.

AI-generated summary

We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M-4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.
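
To make the data-synthesis idea above concrete, here is a minimal, hypothetical sketch of composing a verifiable multi-hop question from atomic facts. This is not the authors' pipeline; the `Fact` schema, question template, and example facts below are invented purely for illustration.

```python
# Hypothetical illustration of multi-hop question composition from atomic facts.
# The schema, template, and data below are invented; they are not the paper's pipeline.
from dataclasses import dataclass
from typing import List


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str


def compose_two_hop_questions(facts: List[Fact]) -> List[dict]:
    """Chain facts A --r1--> B and B --r2--> C into a single question
    whose answer (C) is verifiable against the source facts."""
    samples = []
    for f1 in facts:
        for f2 in facts:
            if f1 is not f2 and f1.obj == f2.subject:  # bridge entity links the hops
                samples.append({
                    "question": f"What is the {f2.relation} of the {f1.relation} of {f1.subject}?",
                    "answer": f2.obj,          # verifiable gold answer
                    "evidence": [f1, f2],      # supporting facts to scatter in the document
                })
    return samples


facts = [
    Fact("Company X", "founder", "Person Y"),
    Fact("Person Y", "birthplace", "City Z"),
]
print(compose_two_hop_questions(facts))
# -> asks for the birthplace of the founder of Company X; gold answer "City Z"
```

In the paper, the evidence behind each synthesized question is distributed globally across a long document, so answering requires genuine multi-hop grounding rather than local retrieval.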

Community

Paper author

Thanks so much for sharing our work on QwenLong-L1.5! We're excited to present our latest research on advancing long-context reasoning in LLMs to the community.

Our main goal was to move beyond simple information retrieval and enable genuine, deep reasoning over vast amounts of text. Our key contributions are:

  • High-Quality Data Synthesis: We developed a pipeline to automatically generate complex reasoning tasks that require the model to connect evidence scattered across a document (multi-hop reasoning).
  • Stabilized Reinforcement Learning: To overcome the notorious instability of RL on long sequences, we introduced novel techniques like Adaptive Entropy-Controlled Policy Optimization (AEPO). We've open-sourced our implementations of these methods in the repo, hoping they can help others in the field; a rough illustration of the entropy-control idea is sketched after this list.
  • Memory-Augmented Architecture: For ultra-long contexts (up to 4M tokens) that exceed any context window, we designed a memory framework that allows the model to process information iteratively, much like a human would read and take notes; a minimal sketch of such a loop also follows the list.

As a result, QwenLong-L1.5 achieves highly competitive performance on long-context benchmarks. We also observed that this enhanced reasoning ability generalizes well to other domains like scientific reasoning and tool use.
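
The details of AEPO are in the paper and the open-sourced repo. As a rough illustration of the general idea of dynamically regulating exploration, the hypothetical sketch below adjusts an entropy-bonus coefficient to keep measured policy entropy inside a target band; the controller, band, and gain are assumptions, not the authors' algorithm.

```python
# Hypothetical sketch of adaptive entropy control in a policy-gradient loop.
# This is NOT the AEPO algorithm from the paper; the target band, gain, and
# update rule are invented to illustrate the exploration-exploitation idea.
import torch


def adapt_entropy_coef(coef, entropy, target_low=0.5, target_high=1.5,
                       gain=1.05, min_coef=1e-4, max_coef=1e-1):
    """Raise the entropy bonus when the policy collapses (entropy too low),
    lower it when the policy is too diffuse (entropy too high)."""
    if entropy < target_low:
        coef = min(coef * gain, max_coef)
    elif entropy > target_high:
        coef = max(coef / gain, min_coef)
    return coef


def policy_loss(logprobs, advantages, entropy, coef):
    # Standard policy-gradient surrogate plus an adaptively weighted entropy bonus.
    pg = -(logprobs * advantages).mean()
    return pg - coef * entropy


# Toy loop with random tensors standing in for rollout statistics.
coef = 1e-2
for step in range(3):
    logprobs = torch.randn(8)              # log-probs of sampled actions (toy)
    advantages = torch.randn(8)            # per-sample advantages (toy)
    entropy = torch.rand(1).item() * 2.0   # measured policy entropy (toy)
    coef = adapt_entropy_coef(coef, entropy)
    loss = policy_loss(logprobs, advantages, entropy, coef)
    print(f"step={step} entropy={entropy:.2f} coef={coef:.4f} loss={loss.item():.3f}")
```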
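
For the read-and-take-notes loop mentioned in the last bullet, here is a minimal, hypothetical sketch of iterative memory-based processing: split an ultra-long input into chunks, fold each chunk into a bounded set of notes, then answer from the notes alone. The chunk size, prompts, and `llm` callable are placeholders, not the released framework.

```python
# Hypothetical sketch of iterative memory-based processing for inputs that exceed
# the context window. The `llm` callable, prompts, and chunk size are placeholders.
from typing import Callable, List


def chunk_text(text: str, chunk_chars: int = 8000) -> List[str]:
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def answer_with_memory(document: str, question: str,
                       llm: Callable[[str], str]) -> str:
    """Read the document chunk by chunk, compressing question-relevant facts
    into a bounded memory, then answer from the memory alone."""
    memory = ""
    for chunk in chunk_text(document):
        memory = llm(
            "You are taking notes to answer a question later.\n"
            f"Question: {question}\n"
            f"Current notes:\n{memory}\n"
            f"New passage:\n{chunk}\n"
            "Rewrite the notes, keeping only facts relevant to the question."
        )
    return llm(
        f"Question: {question}\nNotes:\n{memory}\n"
        "Answer the question using only the notes."
    )


# Stub model so the sketch runs end to end; a real call would query the LLM.
def fake_llm(prompt: str) -> str:
    return prompt[-200:]


print(answer_with_memory("A very long document. " * 2000,
                         "What is the main finding?", fake_llm))
```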

The model, code, and paper are all publicly available. We welcome everyone to try it out and share your feedback!

Thanks again for the shout-out, and we look forward to the community's feedback!

This is an important shift away from “just make the context window bigger” toward active memory management. The combination of synthetic multi-hop data, stabilized long-context RL, and a memory-augmented loop mirrors how humans actually reason over long documents—read, abstract, store, revisit. AEPO is especially interesting as a practical fix to long-horizon RL instability. This feels less like scaling tokens and more like teaching models how to think across time.

Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 1