Trending Papers

by AK and the research community

Submitted by hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

  • 11 authors
· Nov 17, 2025
Submitted by unilm

VibeVoice Technical Report

VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.

MicrosoftResearch Microsoft Research · Aug 26, 2025
Submitted by taesiri

PersonaLive! Expressive Portrait Image Animation for Live Streaming

PersonaLive is a diffusion-based framework for real-time portrait animation that enhances speed and efficiency through multi-stage training, hybrid implicit signals, appearance distillation, and autoregressive micro-chunk streaming.

Submitted by taesiri

DeepCode: Open Agentic Coding

DeepCode, a fully autonomous framework, addresses the challenges of document-to-codebase synthesis by optimizing information flow through source compression, structured indexing, knowledge injection, and error correction, achieving state-of-the-art performance and surpassing human experts.

  • 5 authors
· Dec 8, 2025
Submitted by akhaliq

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

FunAudioLLM enhances voice interactions by integrating SenseVoice for multilingual speech recognition, emotion detection, and audio event detection with CosyVoice for natural speech generation across languages, timbres, and styles.

  • 1 author
· Jul 4, 2024
Submitted by amael-apple

Sharp Monocular View Synthesis in Less Than a Second

SHARP synthesizes photorealistic views from a single image using a 3D Gaussian representation, achieving state-of-the-art results with rapid processing.

apple Apple · Dec 11, 2025
Submitted by Paper99

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

Tongyi-MAI Tongyi-MAI · Nov 27, 2025
Submitted by Cxxs

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The study reveals that in text-to-image generation, CFG Augmentation is the primary driver of few-step distillation in Distribution Matching Distillation (DMD), while the distribution matching term acts as a regularizer.

Tongyi-MAI Tongyi-MAI · Nov 27, 2025
Submitted by akhaliq

Efficient Memory Management for Large Language Model Serving with PagedAttention

The PagedAttention algorithm and the vLLM system enhance the throughput of large language models by efficiently managing memory and reducing waste in the key-value cache.

  • 9 authors
· Sep 12, 2023
Submitted by andito

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.

  • 13 authors
· Mar 14, 2025
Submitted by MapleF9

Towards Scalable Pre-training of Visual Tokenizers for Generation

A unified visual tokenizer pre-training framework (VTP) improves generative performance by optimizing image-text contrastive, self-supervised, and reconstruction losses, leading to better scaling properties, higher zero-shot accuracy, and faster convergence.

MiniMaxAI MiniMax · Dec 15, 2025

From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production

IBM's CUGA, a generalist agent with a hierarchical planner-executor architecture, demonstrates state-of-the-art performance in academic benchmarks and shows potential for enterprise adoption in business-process-outsourcing, addressing scalability, auditability, safety, and governance.

  • 12 authors
· Oct 27, 2025
Submitted by taesiri

Fara-7B: An Efficient Agentic Model for Computer Use

FaraGen creates synthetic datasets for computer use agents, enabling the training of efficient and high-performing models like Fara-7B on diverse web tasks, outperforming larger models on benchmarks.

microsoft Microsoft · Nov 24, 2025

PDFMathTranslate: Scientific Document Translation Preserving Layouts

PDFMathTranslate is open-source software that translates scientific documents while maintaining layout integrity, utilizing advancements in large language models and layout detection.

  • 4 authors
· Jul 2, 2025
Submitted by taesiri

V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

V-RGBX is an end-to-end framework for intrinsic-aware video editing that combines video inverse rendering, photorealistic synthesis, and keyframe-based editing to produce consistent and physically plausible edits.

adobe Adobe · Dec 12, 2025
Submitted by taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

facebook AI at Meta · Nov 20, 2025
Submitted by wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

  • 18 authors
· Sep 27, 2024
Submitted by taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

  • 61 authors
· Sep 26, 2025
Submitted by rubenohana

The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning

A large-scale dataset collection, The Well, provides diverse numerical simulations for benchmarking machine learning models in physical systems simulation.

  • 26 authors
· Nov 30, 2024
Submitted by yuanwenyue

LitePT: Lighter Yet Stronger Point Transformer

A new 3D point cloud backbone model, LitePT, uses convolutions for early stages and attention for deeper layers, incorporating PointROPE for positional encoding, achieving efficient performance with fewer resources.

ethz ETH Zurich · Dec 15, 2025

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

  • 9 authors
· Feb 7, 2025

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

  • 9 authors
· Oct 23, 2024
Submitted by rmurthy

Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Promptomatix automates prompt optimization for Large Language Models, improving performance and efficiency across various tasks.

  • 9 authors
· Jul 17, 2025
Submitted by taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle PaddlePaddle · Oct 16, 2025
Submitted by akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

  • 5 authors
· Mar 20, 2024
Submitted by akhaliq

Very Large-Scale Multi-Agent Simulation in AgentScope

Enhancements to the AgentScope platform improve scalability, efficiency, and ease of use for large-scale multi-agent simulations through distributed mechanisms, flexible environments, and user-friendly tools.

  • 8 authors
· Jul 25, 2024
Submitted by AdinaY

SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

The SCAIL framework improves character animation by using a novel 3D pose representation and a diffusion-transformer architecture with full-context pose injection, achieving studio-grade quality and realism.

zai-org Z.ai · Dec 5, 2025

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

  • 5 authors
· Oct 8, 2024
Submitted by taesiri

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Wan-Move enhances motion control in video generative models by integrating motion-aware features into latent space, enabling high-quality and scalable video synthesis.

AlibabaTongyiLab TongyiLab · Dec 9, 2025

One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

A unified framework for high-fidelity character animation and image pose transfer handles misaligned and partially visible references, using self-supervised outpainting, hybrid attention, and identity-robust pose control.

  • 8 authors
· Nov 28, 2025
Submitted by akhaliq

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

OpenDevin is a platform for developing AI agents that interact with the world by writing code, using command lines, and browsing the web, with support for multiple agents and evaluation benchmarks.

  • 24 authors
· Jul 23, 2024

GlobalBuildingAtlas: An Open Global and Complete Dataset of Building Polygons, Heights and LoD1 3D Models

We introduce GlobalBuildingAtlas, a publicly available dataset providing global and complete coverage of building polygons, heights and Level of Detail 1 (LoD1) 3D building models. This is the first open dataset to offer high-quality, consistent, and complete building data in 2D and 3D form at the individual building level on a global scale. To build this dataset, we developed machine learning-based pipelines to derive building polygons and heights (the latter called GBA.Height) from global PlanetScope satellite data. A quality-based fusion strategy was also employed to generate higher-quality polygons (called GBA.Polygon) from existing open building polygons, including our own derived ones. With more than 2.75 billion buildings worldwide, GBA.Polygon surpasses the most comprehensive database to date by more than 1 billion buildings. GBA.Height offers the most detailed and accurate global 3D building height maps to date, achieving a spatial resolution of 3x3 meters, 30 times finer than previous global products (90 m), and enabling high-resolution and reliable analysis of building volumes at both local and global scales. Finally, we generated a global LoD1 building model (called GBA.LoD1) from the resulting GBA.Polygon and GBA.Height. GBA.LoD1 represents the first complete global LoD1 building models, including 2.68 billion building instances with predicted heights, i.e., with a height completeness of more than 97%, achieving RMSEs ranging from 1.5 m to 8.9 m across different continents. With its height accuracy, comprehensive global coverage and rich spatial details, GlobalBuildingAtlas offers novel insights on the status quo of global buildings, which unlocks unprecedented geospatial analysis possibilities, as showcased by a better illustration of where people live and a more comprehensive monitoring of the progress on the 11th Sustainable Development Goal of the United Nations.

  • 5 authors
· Jun 4, 2025
Submitted by Keh0t0

EgoX: Egocentric Video Generation from a Single Exocentric Video

The EgoX framework generates egocentric videos from exocentric inputs using video diffusion models with LoRA adaptation, unified conditioning, and geometry-guided self-attention for coherence and visual fidelity.

kaist-ai KAIST AI · Dec 9, 2025
Submitted by ShuaiBai623

Soft Adaptive Policy Optimization

Soft Adaptive Policy Optimization (SAPO) improves the training stability and performance of reinforcement learning for large language models by adaptively attenuating off-policy updates with a smooth, temperature-controlled gate.

Qwen Qwen · Nov 25, 2025
Submitted by taesiri

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

SVG-T2I, a scaled SVG framework, enables high-quality text-to-image synthesis directly in the Visual Foundation Model feature domain, achieving competitive performance in generative tasks.

KlingTeam Kling Team · Dec 12, 2025
Submitted by zhongwenxu

Single-stream Policy Optimization

Single-stream Policy Optimization (SPO) improves policy-gradient training for Large Language Models by eliminating group-based issues and providing a stable, low-variance learning signal, leading to better performance and efficiency.

tencent Tencent · Sep 16, 2025

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

  • 5 authors
· Feb 8, 2025
Submitted by Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Submitted by probejie

CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

CLaRa enhances retrieval-augmented generation by introducing unified embedding-based compression and joint optimization, achieving state-of-the-art performance in QA benchmarks.

apple Apple · Nov 24, 2025

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

E-RayZer is a self-supervised 3D Vision model that directly learns 3D-aware representations from unlabeled images, outperforming existing models in pose estimation and 3D reconstruction tasks.

  • 8 authors
· Dec 11, 2025
Submitted by akhaliq

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Live Avatar uses a 14-billion-parameter diffusion model with Timestep-forcing Pipeline Parallelism and Rolling Sink Frame Mechanism to achieve real-time, high-fidelity avatar generation.

Quark-LLM Quark · Dec 4, 2025

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

  • 4 authors
· Dec 28, 2024
Submitted by wenbowen

Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Fast-FoundationStereo achieves real-time zero-shot stereo generalization by combining knowledge distillation, blockwise neural architecture search, and structured pruning.

nvidia NVIDIA · Dec 11, 2025
Submitted by akhaliq

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

The Qwen2-VL Series uses Naive Dynamic Resolution and Multimodal Rotary Position Embedding to enhance visual processing and achieves competitive performance on multimodal benchmarks.

  • 19 authors
· Sep 18, 2024
Submitted by akhaliq

What matters for Representation Alignment: Global Information or Spatial Structure?

Representation alignment enhances generative training by transferring spatial structure from pretrained vision encoders to diffusion models; this spatial structure matters more than the encoder's global semantic performance.

  • 7 authors
· Dec 11, 2025
Submitted by Jeff-Wang

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0, a VLA foundation model, uses world model-generated data to enhance cross-task generalization and policy robustness, improving real-world performance on complex manipulation tasks.

open-gigaai GigaAI · Oct 22, 2025
Submitted by ShuaiBai623

Qwen3-VL Technical Report

Qwen3-VL, a vision-language model, excels in text and multimodal understanding through advanced architectures and larger contexts, achieving superior performance across benchmarks.

Qwen Qwen · Nov 26, 2025
Submitted by ShuaiBai623

Qwen2.5-VL Technical Report

Qwen2.5-VL, the latest vision-language model, advances visual recognition, document parsing, and video comprehension through dynamic resolution processing, Window Attention, and a native Vision Transformer.

  • 27 authors
· Feb 19, 2025

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch Fully Sharded Data Parallel (FSDP) enables efficient and scalable training of large models across hardware configurations.

  • 16 authors
· Apr 21, 2023
Submitted by akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

  • 5 authors
· Apr 28, 2025