The Theory behind UMAP?

In 2018, McInnes et al. introduced a dimensionality reduction algorithm called UMAP, which enjoys wide popularity among data scientists. Their work introduces a finite variant of a functor called the metric realization, based on an unpublished draft by Spivak. This draft contains many errors, most of which are reproduced by McInnes et al. and subsequent publications. This article aims to repair these errors and provide a self-contained document with the full derivation of Spivak's functors and McInnes et al.'s finite variant. We contribute an explicit description of the metric realization and related functors. At the end, we discuss the UMAP algorithm, as well as claims about properties of the algorithm and the correspondence of McInnes et al.'s finite variant to the UMAP algorithm.

Updated: 2026-03-02 23:54:08

Categories: stat.ML,cs.LG,math.CT

Download: http://arxiv.org/abs/2603.03375v1

Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild

Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well-studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness; use small, private datasets that prevent reproducibility; or depend on classic computer vision pipelines, with only a few using deep learning. They also overlook issues such as train-test leakage and dataset imbalance. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce STW, a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone Monk Skin Tone (MST) scale. Second, we benchmark both Classic Computer Vision (SkinToneCCV) and Deep Learning approaches, demonstrating that classic models provide near-random results, while deep learning reaches near-annotator accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data will be available soon.

Updated: 2026-03-02 23:52:22

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2603.02475v1

Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

Memory-augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance differences manifest across write strategies, retrieval methods, and memory utilization behavior, and apply it to a 3x3 study crossing three write strategies (raw chunks, Mem0-style fact extraction, MemGPT-style summarization) with three retrieval methods (cosine, BM25, hybrid reranking). On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3-8 points across write strategies. Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives, suggesting that current memory pipelines may discard useful context that downstream retrieval mechanisms fail to compensate for. Failure analysis shows that performance breakdowns most often manifest at the retrieval stage rather than at utilization. We argue that, under current retrieval practices, improving retrieval quality yields larger gains than increasing write-time sophistication. Code is publicly available at https://github.com/boqiny/memory-probe.

Updated: 2026-03-02 23:47:23

Categories: cs.AI

Download: http://arxiv.org/abs/2603.02473v1
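
As an illustration of the lexical retrieval arm in the 3x3 study above, the sketch below scores stored raw chunks against a query with Okapi BM25. It is a minimal sketch, not the authors' code: the whitespace tokenizer, the parameters k1=1.5 and b=0.75, the example memories, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a demo.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document (docs must be non-empty)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, docs, top_k=1):
    """Return the top_k documents ranked by BM25 score."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:top_k]]


# Hypothetical raw-chunk memories; no LLM call is needed to write them.
memories = [
    "Alice adopted a dog named Rex in March",
    "Bob moved to Berlin last year",
    "Alice likes hiking on weekends",
]
top = retrieve("dog named Rex", memories)
```

A hybrid retriever would combine scores like these with embedding cosine similarity before reranking; that step is omitted here.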

Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs), where tokens serve as unified units of communication and computation, enabling efficient semantic- and goal-oriented information exchange in future wireless networks. In this paper, we propose a novel Video TokenCom framework for textual intent-guided multi-rate video communication with Unequal Error Protection (UEP)-based source-channel coding adaptation. The proposed framework integrates user-intended textual descriptions with discrete video tokenization and unequal error protection to enhance semantic fidelity under restrictive bandwidth constraints. First, discrete video tokens are extracted through a pretrained video tokenizer, while text-conditioned vision-language modeling and optical-flow propagation are jointly used to identify tokens that correspond to user-intended semantics across space and time. Next, we introduce a semantic-aware multi-rate bit-allocation strategy, in which tokens highly related to the user intent are encoded using full codebook precision, whereas non-intended tokens are represented through reduced codebook precision differential encoding, enabling rate savings while preserving semantic quality. Finally, a source and channel coding adaptation scheme is developed to adapt bit allocation and channel coding to varying resources and link conditions. Experiments on various video datasets demonstrate that the proposed framework outperforms both conventional and semantic communication baselines, in perceptual and semantic quality on a wide SNR range.

Updated: 2026-03-02 23:36:38

Categories: cs.IT,cs.LG,cs.MM,eess.IV

Download: http://arxiv.org/abs/2603.02470v1
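
The rate saving behind the multi-rate bit allocation described above can be seen with simple arithmetic. The sketch below is only illustrative: the codebook sizes (1024 and 16), the 30/70 split between intended and non-intended tokens, and the function names are assumptions, and the differential encoding step from the abstract is not modeled.

```python
import math


def bits_per_token(codebook_size):
    # Bits needed to index one entry of a codebook of the given size.
    return math.ceil(math.log2(codebook_size))


def frame_bits(intent_mask, full_codebook=1024, reduced_codebook=16):
    """Total bits when intent-related tokens use the full codebook
    and all other tokens use a reduced one (hypothetical sizes)."""
    full = bits_per_token(full_codebook)
    reduced = bits_per_token(reduced_codebook)
    return sum(full if intended else reduced for intended in intent_mask)


# Hypothetical frame: 100 tokens, 30 of which match the user's textual intent.
mask = [True] * 30 + [False] * 70
multi_rate = frame_bits(mask)              # 30 * 10 + 70 * 4 bits
uniform = len(mask) * bits_per_token(1024)  # every token at full precision
```

Under these assumed numbers the multi-rate frame needs 580 bits instead of 1000.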

Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?

Reasoning LLMs are trained to verbalize their reasoning process, yielding strong gains on complex tasks. This transparency also opens a promising direction: multiple reasoners can directly collaborate on each other's thinking within a shared trajectory, yielding better inference efficiency and exploration. A key prerequisite, however, is the ability to assess the usefulness and build on another model's partial thinking -- we call this off-trajectory reasoning. Our paper investigates a critical question: can standard solo-reasoning training pipelines deliver desired off-trajectory behaviors? We propose twin tests that capture the two extremes of the off-trajectory spectrum, namely Recoverability, which tests whether LLMs can backtrack from "distractions" induced by misleading reasoning traces, and Guidability, which tests their ability to build upon correct reasoning from stronger collaborators. Our study evaluates 15 open-weight LLMs (1.5B-32B) and reveals a counterintuitive finding -- "stronger" LLMs on benchmarks are often more fragile under distraction. Moreover, all models tested fail to effectively leverage guiding steps from collaborators on problems beyond their inherent capabilities with solve rates remaining under 9.2%. Finally, we conduct control studies to isolate the effects of three factors in post-training on these behaviors: the choice of distillation teacher, the use of RL, and data selection strategy. Our results provide actionable insights for training natively strong reasoning collaborators; e.g., we find that suboptimal recoverability behaviors of teacher models are transferred to distilled students even if the distillation trajectories are correct. Taken together, this work lays the groundwork for evaluating multi-model collaborations in shared reasoning trajectories and highlights the limitations of off-the-shelf reasoning LLMs.

Updated: 2026-03-02 23:34:34

Categories: cs.AI

Download: http://arxiv.org/abs/2510.06410v2

NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners

The double empathy problem frames communication difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi-agent LLM-based coaching system that supports neurotypical users through stress visualization, interpretation of internal experiences, and contextual guidance. In a between-subjects study (N=30), NeuroWise was rated as helpful by all participants and showed a significant condition-time effect on deficit-based attributions (p=0.02): NeuroWise users reduced deficit framing, while baseline users shifted toward blaming autistic "deficits" after difficult interactions. NeuroWise users also completed conversations more efficiently (37% fewer turns, p=0.03). These findings suggest that AI-based interpretation can support attributional change by helping users recognize communication challenges as mutual.

Updated: 2026-03-02 23:34:12

Categories: cs.HC,cs.AI,cs.CY,cs.IR,cs.MA

Download: http://arxiv.org/abs/2602.18962v2

Every Language Model Has a Forgery-Resistant Signature

The ubiquity of closed-weight language models with public-facing APIs has generated interest in forensic methods, both for extracting hidden model details (e.g., parameters) and for identifying models by their outputs. One successful approach to these goals has been to exploit the geometric constraints imposed by the language model architecture and parameters. In this work, we show that a lesser-known geometric constraint -- namely, that language model outputs lie on the surface of a high-dimensional ellipse -- functions as a signature for the model and can be used to identify the source model of a given output. This ellipse signature has unique properties that distinguish it from existing model-output association methods like language model fingerprints. In particular, the signature is hard to forge: without direct access to model parameters, it is practically infeasible to produce log-probabilities (logprobs) on the ellipse using currently known methods. Secondly, the signature is naturally occurring, since all language models have these elliptical constraints. Thirdly, the signature is self-contained, in that it is detectable without access to the model inputs or the full weights. Finally, the signature is compact and redundant, as it is independently detectable in each logprob output from the model. We evaluate a novel technique for extracting the ellipse from small models and discuss the practical hurdles that make it infeasible for production-scale models. Finally, we use ellipse signatures to propose a protocol for language model output verification, analogous to cryptographic symmetric-key message authentication systems.

Updated: 2026-03-02 23:22:37

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.14086v2

Toward a Dynamic Stackelberg Game-Theoretic Framework for Agentic AI Defense Against LLM Jailbreaking

This paper proposes a game-theoretic framework that models the interaction between prompt engineers and large language models (LLMs) as a two-player extensive-form game coupled with a Rapidly-exploring Random Trees (RRT) search over prompt space. The attacker incrementally samples, extends, and tests prompts, while the LLM chooses to accept, reject, or redirect, leading to terminal outcomes of Safe Interaction, Blocked, or Jailbreak. Embedding RRT exploration inside the extensive-form game captures both the discovery phase of jailbreak strategies and the strategic responses of the model. Furthermore, we show that the defender behavior can be interpreted through a local Stackelberg equilibrium condition, which explains when the attacker can no longer obtain profitable prompt deviations and provides a theoretical lens for understanding the effectiveness of our Purple Agent defense. The resulting game tree thus offers a principled foundation for evaluating, interpreting, and hardening LLM guardrails.

Updated: 2026-03-02 23:19:02

Categories: cs.AI

Download: http://arxiv.org/abs/2507.08207v2

Search Arena: Analyzing Search-Augmented LLMs

Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research. Our dataset and code are available at: https://github.com/lmarena/search-arena.

Updated: 2026-03-02 23:17:18

Categories: cs.CL,cs.IR,cs.LG

Download: http://arxiv.org/abs/2506.05334v2

Deep Learning Based Wildfire Detection for Peatland Fires Using Transfer Learning

Machine learning (ML)-based wildfire detection methods have been developed in recent years, primarily using deep learning (DL) models trained on large collections of wildfire images and videos. However, peatland fires exhibit distinct visual and physical characteristics -- such as smoldering combustion, low flame intensity, persistent smoke, and subsurface burning -- that limit the effectiveness of conventional wildfire detectors trained on open-flame forest fires. In this work, we present a transfer learning-based approach for peatland fire detection that leverages knowledge learned from general wildfire imagery and adapts it to the peatland fire domain. We initialize a DL-based peatland fire detector using pretrained weights from a conventional wildfire detection model and subsequently fine-tune the network using a dataset composed of Malaysian peatland images and videos. This strategy enables effective learning despite the limited availability of labeled peatland fire data. Experimental results demonstrate that transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusions, and variable illumination. The proposed approach provides a practical and scalable solution for early peatland fire detection and has the potential to support real-time monitoring systems for fire prevention and environmental protection.

Updated: 2026-03-02 23:14:41

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.02465v1

GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR

Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framework that leverages metadata (e.g., coordinates) to modulate low-rank updates in a pre-trained encoder. GLoRIA injects low-rank matrices into each feed-forward layer, with a gating MLP determining the non-negative contribution of each LoRA rank-1 component based on location metadata. On the GCND corpus, GLoRIA outperforms geo-conditioned full fine-tuning, LoRA, and both dialect-specific and unified full fine-tuning, achieving state-of-the-art word error rates while updating under 10% of parameters. GLoRIA also generalizes well to unseen dialects, including in extrapolation scenarios, and enables interpretable adaptation patterns that can be visualized geospatially. These results show metadata-gated low-rank adaptation is an effective, interpretable, and efficient solution for dialectal ASR.

Updated: 2026-03-02 23:10:09

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.02464v1
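
The gating idea in the GLoRIA abstract above, where a small MLP turns location metadata into a non-negative weight for each rank-1 LoRA component, can be sketched in a few lines. This is a minimal sketch under assumptions, not the authors' implementation: the softplus output (one way to keep gates non-negative), the one-hidden-layer MLP, the toy weights, and the coordinate values are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softplus(z):
    # Smooth, non-negative activation (assumed here to enforce gate >= 0).
    return math.log1p(math.exp(z))


def gate_mlp(metadata, W1, b1, W2, b2):
    """Map metadata (e.g. coordinates) to one non-negative gate per rank."""
    hidden = [math.tanh(dot(row, metadata) + b) for row, b in zip(W1, b1)]
    return [softplus(dot(row, hidden) + b) for row, b in zip(W2, b2)]


def gated_lora_forward(x, W, A, B, gates):
    """Compute (W + sum_r gates[r] * outer(B[r], A[r])) @ x.

    A[r] is the input direction and B[r] the output direction of the
    r-th rank-1 component; W is the frozen pre-trained weight matrix.
    """
    y = [dot(row, x) for row in W]
    for g, a_r, b_r in zip(gates, A, B):
        coeff = g * dot(a_r, x)
        y = [y_i + coeff * b_i for y_i, b_i in zip(y, b_r)]
    return y


x = [2.0, 3.0]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity for readability)
A = [[1.0, 0.0], [0.0, 1.0]]   # rank-1 input directions
B = [[1.0, 1.0], [1.0, -1.0]]  # rank-1 output directions
# Hypothetical (latitude, longitude) metadata and toy MLP weights.
gates = gate_mlp([51.0, 4.4], W1=[[0.01, 0.1]], b1=[0.0],
                 W2=[[1.0], [-1.0]], b2=[0.0, 0.0])
y = gated_lora_forward(x, W, A, B, gates)
```

Because each gate scales a single rank-1 component, the gate values can be inspected per location, which is the kind of pattern the abstract describes visualizing geospatially.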

Learning Lagrangian Interaction Dynamics with Sampling-Based Model Order Reduction

Simulating physical systems governed by Lagrangian dynamics often entails solving partial differential equations (PDEs) over high-resolution spatial domains, leading to significant computational expense. Reduced-order modeling (ROM) mitigates this cost by evolving low-dimensional latent representations of the underlying system. While neural ROMs enable querying solutions from latent states at arbitrary spatial points, their latent states typically represent the global domain and struggle to capture localized, highly dynamic behaviors such as fluids. We propose a sampling-based reduction framework that evolves Lagrangian systems directly in physical space over the particles themselves, reducing the number of active degrees of freedom via data-driven neural PDE operators. To enable querying at arbitrary spatial locations, we introduce a learnable kernel parameterization that uses local spatial information from time-evolved sample particles to infer the underlying solution manifold. Empirically, our approach achieves a 6.6x to 32x reduction in input dimensionality while maintaining high-fidelity evaluations across diverse Lagrangian regimes, including fluid flows, granular media, and elastoplastic dynamics. We refer to this framework as GIOROM (Geometry-Informed Reduced-Order Modeling). All code and data are available at: https://github.com/HrishikeshVish/GIOROM

Updated: 2026-03-02 23:03:47

Categories: cs.LG

Download: http://arxiv.org/abs/2407.03925v4

Learning Contextual Runtime Monitors for Safe AI-Based Autonomy

We introduce a novel framework for learning context-aware runtime monitors for AI-based control ensembles. Machine-learning (ML) controllers are increasingly deployed in (autonomous) cyber-physical systems because of their ability to solve complex decision-making tasks. However, their accuracy can degrade sharply in unfamiliar environments, creating significant safety concerns. Traditional ensemble methods aim to improve robustness by averaging or voting across multiple controllers, yet this often dilutes the specialized strengths that individual controllers exhibit in different operating contexts. We argue that, rather than blending controller outputs, a monitoring framework should identify and exploit these contextual strengths. In this paper, we reformulate the design of safe AI-based control ensembles as a contextual monitoring problem. A monitor continuously observes the system's context and selects the controller best suited to the current conditions. To achieve this, we cast monitor learning as a contextual learning task and draw on techniques from contextual multi-armed bandits. Our approach comes with two key benefits: (1) theoretical safety guarantees during controller selection, and (2) improved utilization of controller diversity. We validate our framework in two simulated autonomous driving scenarios, demonstrating significant improvements in both safety and performance compared to non-contextual baselines.

Updated: 2026-03-02 23:03:05

Categories: cs.LG,cs.AI,eess.SY

Download: http://arxiv.org/abs/2601.20666v2
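
The monitor described above, which observes the context and picks the controller best suited to it, can be illustrated with a simple per-context UCB1 bandit. This is a sketch under assumptions and not the paper's algorithm: contexts are discrete labels, the reward is a hypothetical 0/1 safety outcome, and no safety guarantee is implemented here.

```python
import math
from collections import defaultdict


class ContextualMonitor:
    """Keeps separate UCB1 statistics for every discrete context."""

    def __init__(self, n_controllers, exploration=1.0):
        self.n = n_controllers
        self.c = exploration
        self.counts = defaultdict(lambda: [0] * n_controllers)
        self.means = defaultdict(lambda: [0.0] * n_controllers)

    def select(self, context):
        counts, means = self.counts[context], self.means[context]
        # Try every controller once in each context before using UCB.
        for k in range(self.n):
            if counts[k] == 0:
                return k
        total = sum(counts)
        ucb = [means[k] + self.c * math.sqrt(2 * math.log(total) / counts[k])
               for k in range(self.n)]
        return max(range(self.n), key=ucb.__getitem__)

    def update(self, context, controller, reward):
        counts, means = self.counts[context], self.means[context]
        counts[controller] += 1
        # Incremental running mean of observed rewards.
        means[controller] += (reward - means[controller]) / counts[controller]


# Hypothetical setup: controller 0 is safe in rain, controller 1 in clear weather.
best = {"rain": 0, "clear": 1}
monitor = ContextualMonitor(n_controllers=2)
for step in range(200):
    context = "rain" if step % 2 == 0 else "clear"
    chosen = monitor.select(context)
    reward = 1.0 if chosen == best[context] else 0.0
    monitor.update(context, chosen, reward)
```

After the loop, the monitor has selected the context-appropriate controller far more often than the other one, instead of averaging the two.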

Can Computational Reducibility Lead to Transferable Models for Graph Combinatorial Optimization?

A key challenge in deriving unified neural solvers for combinatorial optimization (CO) is efficient generalization of models from a given set of tasks to new tasks not used during the initial training process. To address it, we first establish a new model, which uses a GCON module as a form of expressive message passing together with energy-based unsupervised loss functions. This model achieves high performance (often comparable with state-of-the-art results) across multiple CO tasks when trained individually on each task. We then leverage knowledge from the computational reducibility literature to propose pretraining and fine-tuning strategies that transfer effectively (a) between MVC, MIS and MaxClique, and (b) in a multi-task learning setting that additionally incorporates MaxCut, MDS and graph coloring. Additionally, in a leave-one-out, multi-task learning setting, we observe that pretraining on all but one task almost always leads to faster convergence on the remaining task when fine-tuning while avoiding negative transfer. Our findings indicate that learning common representations across multiple graph CO problems is viable through the use of expressive message passing coupled with pretraining strategies that are informed by the polynomial reduction literature, thereby taking an important step towards enabling the development of foundational models for neural CO. We provide an open-source implementation of our work at https://github.com/semihcanturk/COPT-MT.

Updated: 2026-03-02 23:03:00

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02462v1

Conformal Graph Prediction with Z-Gromov Wasserstein Distances

Supervised graph prediction addresses regression problems where the outputs are structured graphs. Although several approaches exist for graph-valued prediction, principled uncertainty quantification remains limited. We propose a conformal prediction framework for graph-valued outputs, providing distribution-free coverage guarantees in structured output spaces. Our method defines nonconformity via the Z-Gromov-Wasserstein distance, instantiated in practice through Fused Gromov-Wasserstein (FGW), enabling permutation-invariant comparison between predicted and candidate graphs. To obtain adaptive prediction sets, we introduce Score Conformalized Quantile Regression (SCQR), an extension of Conformalized Quantile Regression (CQR) to handle complex output spaces such as graph-valued outputs. We evaluate the proposed approach on a synthetic task and a real problem of molecule identification.

Updated: 2026-03-02 23:02:43

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2603.02460v1
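
The conformal step described above can be sketched with plain split conformal prediction: calibrate a threshold on nonconformity scores, then keep every candidate graph whose distance to the prediction is within it. This is a sketch under stated assumptions, not the paper's method: a toy edge-set distance stands in for FGW (it is not permutation invariant), the calibration scores are made up, and the adaptive SCQR extension is not modeled.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th
    smallest calibration score, or infinity if that rank exceeds n."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(predicted, candidates, distance, threshold):
    # Keep every candidate that conforms at least as well as the threshold.
    return [c for c in candidates if distance(predicted, c) <= threshold]


def edge_distance(g1, g2):
    # Toy stand-in for a graph distance: size of the edge symmetric difference.
    return len(g1 ^ g2)


# Hypothetical calibration scores: distance(prediction, true graph) on held-out pairs.
cal_scores = [0, 1, 1, 2, 0, 3, 1]
threshold = conformal_threshold(cal_scores, alpha=0.25)

predicted = frozenset({(0, 1), (1, 2)})
candidates = [
    frozenset({(0, 1), (1, 2)}),
    frozenset({(0, 1)}),
    frozenset({(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)}),
]
pred_set = prediction_set(predicted, candidates, edge_distance, threshold)
```

With exchangeable calibration and test data, a set built this way contains the true graph with probability at least 1 - alpha, whichever distance is plugged in.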

TransactionGPT

We present TransactionGPT (TGPT), a foundation model for consumer transaction data within one of the world's largest payment networks. TGPT is designed to understand and generate transaction trajectories while simultaneously supporting a variety of downstream prediction and classification tasks. We introduce a novel 3D-Transformer architecture specifically tailored for capturing the complex dynamics in payment transaction data. This architecture incorporates design innovations that enhance modality fusion and computational efficiency, while seamlessly enabling joint optimization with downstream objectives. Trained on billion-scale real-world transactions, TGPT significantly improves downstream anomaly transaction detection performance against a competitive production model and exhibits advantages over baselines in generating future transactions. We conduct extensive empirical evaluations utilizing a diverse collection of company transaction datasets spanning multiple downstream tasks, thereby enabling a thorough assessment of TGPT's effectiveness and efficiency in comparison to established methodologies. Furthermore, we examine the incorporation of LLM-derived embeddings within TGPT and benchmark its performance against fine-tuned LLMs, demonstrating that TGPT achieves superior predictive accuracy as well as faster training and inference. We anticipate that the architectural innovations and practical guidelines from this work will advance foundation models for transaction-like data and catalyze future research in this emerging field.

Updated: 2026-03-02 22:48:45

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2511.08939v2

Manifold Aware Denoising Score Matching (MAD)

A major focus in designing methods for learning distributions defined on manifolds is to alleviate the need to implicitly learn the manifold so that learning can concentrate on the data distribution within the manifold. However, accomplishing this often leads to compute-intensive solutions. In this work, we propose a simple modification to denoising score-matching in the ambient space to implicitly account for the manifold, thereby reducing the burden of learning the manifold while maintaining computational efficiency. Specifically, we propose a simple decomposition of the score function into a known component $s^{base}$ and a remainder component $s-s^{base}$ (the learning target), with the former implicitly including information on where the data manifold resides. We derive known components $s^{base}$ in analytical form for several important cases, including distributions over rotation matrices and discrete distributions, and use them to demonstrate the utility of this approach in those cases.

Updated: 2026-03-02 22:47:17

标题: 流形感知去噪得分匹配(MAD)

摘要: 在设计用于学习流形上定义的分布的方法时,主要关注点是减轻需要隐式学习流形的需求,以便学习可以集中在流形内的数据分布上。然而,实现这一点通常会导致计算密集型解决方案。在这项工作中,我们提出了一种简单的修改,将环境空间中的去噪得分匹配隐式考虑流形,从而减轻学习流形的负担,同时保持计算效率。具体而言,我们提出了将得分函数分解为已知分量$s^{base}$和余下分量$s-s^{base}$(学习目标)的简单方法,前者隐含地包含有关数据流形位置的信息。我们推导了几种重要情况下的已知分量$s^{base}$的解析形式,包括旋转矩阵和离散分布,然后使用它们来展示该方法在这些情况下的实用性。

更新时间: 2026-03-02 22:47:17

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2603.02452v1

Composable Attestation: A Generalized Framework for Continuous and Incremental Trust in AI-Driven Distributed Systems

This paper presents composable attestation as a generalized cryptographic framework for Continuous and Incremental Trust in Distributed Systems, such as Artificial Intelligence (AI) computation and Open Source Software (OSS) supply chain verification. We establish a rigorous mathematical foundation that defines the core properties of such attestation systems: composability, order independence, transitivity, determinism, inclusion, and dynamic component verification. In contrast to traditional attestation methodologies relying on monolithic verification, composable attestation facilitates modular, scalable, and cryptographically secured integrity verification adaptable to evolving system configurations. This work introduces generalized attestation proof generation and verification functions, implementable via a variety of cryptographic constructions, in which Merkle trees play a vital role in constructing the composable attestation proof. Alternative constructions, including accumulator-based schemes and multi-signature approaches, are also explored, each presenting distinct trade-offs in performance, security, and functionality. Formal analysis demonstrates the adherence of these implementations to the fundamental properties. The framework's utility extends to applications such as secure AI model integrity verification, federated learning, and runtime trust assurance. The concept of attestation inclusion is introduced, permitting incremental integration of new components without necessitating full system re-attestation. This generalized approach reinforces trust in AI computation and broader distributed computing environments through cryptographically verifiable proof mechanisms, building upon foundational concepts of bootstrapping trust.

Updated: 2026-03-02 22:45:26

标题: 可组合认证:一种连续和增量信任的AI驱动分布式系统的通用框架

摘要: 本文提出了可组合认证作为分布式系统中连续和增量信任的广义加密框架,例如人工智能(AI)计算和开源软件(OSS)供应链验证。我们建立了一个严格的数学基础,定义了这种认证系统的核心属性:可组合性、顺序独立性、传递性、确定性、包含性和动态组件验证。与传统的依赖于单片验证的认证方法不同,可组合认证促进了模块化、可扩展和密码学安全的完整性验证,适应了不断变化的系统配置。这项工作引入了广义认证证明生成和验证功能,可通过各种密码构造实现,其中默克尔树在构建可组合认证证明中起着关键作用。还探讨了包括基于累加器的方案和多重签名方法在内的替代构造,每种方法在性能、安全性和功能方面都有不同的权衡。形式化分析表明这些实现符合基本属性。该框架的实用性延伸到诸如安全AI模型完整性验证、联邦学习和运行时信任保证等应用领域。引入了认证包含的概念,允许增量集成新组件而无需进行完整系统再认证。这种广义方法通过密码学可验证的证明机制在AI计算和更广泛的分布式计算环境中增强了信任,建立在信任引导的基本概念之上。

更新时间: 2026-03-02 22:45:26

领域: cs.CR

下载: http://arxiv.org/abs/2603.02451v1

Spectral Regularization for Diffusion Models

Diffusion models are typically trained using pointwise reconstruction objectives that are agnostic to the spectral and multi-scale structure of natural signals. We propose a loss-level spectral regularization framework that augments standard diffusion training with differentiable Fourier- and wavelet-domain losses, without modifying the diffusion process, model architecture, or sampling procedure. The proposed regularizers act as soft inductive biases that encourage appropriate frequency balance and coherent multi-scale structure in generated samples. Our approach is compatible with DDPM, DDIM, and EDM formulations and introduces negligible computational overhead. Experiments on image and audio generation demonstrate consistent improvements in sample quality, with the largest gains observed on higher-resolution, unconditional datasets where fine-scale structure is most challenging to model.

Updated: 2026-03-02 22:39:02

标题: 扩散模型的谱正则化

摘要: 扩散模型通常使用对谱和多尺度结构不可知的逐点重建目标进行训练。我们提出了一种损失级别的谱正则化框架,通过可微分的傅立叶和小波域损失来增强标准扩散训练,而不修改扩散过程、模型架构或采样过程。所提出的正则化器作为软归纳偏差,鼓励生成样本中适当的频率平衡和连贯的多尺度结构。我们的方法与DDPM、DDIM和EDM公式兼容,并引入了可忽略的计算开销。图像和音频生成实验表明样本质量得到持续提升,其中在细尺度结构最难建模的高分辨率无条件数据集上增益最大。

更新时间: 2026-03-02 22:39:02

领域: cs.LG

下载: http://arxiv.org/abs/2603.02447v1
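
To make the idea of a loss-level Fourier-domain regularizer in the entry above concrete, here is a rough sketch of one possible form: a pointwise reconstruction term plus a penalty on the mismatch between log-amplitude spectra. The function names, the log-amplitude choice, and the weight of 0.1 are assumptions for illustration and are not taken from the paper. NumPy only shows the forward computation; actual training would need an autodiff framework.

```python
import numpy as np


def fourier_amplitude_loss(pred, target):
    """Mean squared difference between log-amplitude spectra of 2D arrays."""
    amp_pred = np.abs(np.fft.rfft2(pred))
    amp_target = np.abs(np.fft.rfft2(target))
    return float(np.mean((np.log1p(amp_pred) - np.log1p(amp_target)) ** 2))


def total_loss(pred, target, weight=0.1):
    """Pointwise MSE plus a weighted spectral penalty (weight is hypothetical)."""
    pointwise = float(np.mean((pred - target) ** 2))
    return pointwise + weight * fourier_amplitude_loss(pred, target)


rng = np.random.default_rng(0)
clean = rng.standard_normal((16, 16))
noisy = clean + 0.5 * rng.standard_normal((16, 16))
print(total_loss(clean, clean), total_loss(noisy, clean))
```

Because the penalty is added to the loss only, the sampler and the network stay untouched, which matches the "loss-level" framing of the abstract.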

Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality

Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor for the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth's comprehensive datasets enable the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques, which often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12000 Martian heightmaps derived from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and the Navier-Stokes algorithm, on an evaluation set of 1000 samples. Results show that our approach consistently outperforms these methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity (29-81% on LPIPS) with the original data.

Updated: 2026-03-02 22:37:05

标题: 在虚拟现实中重建火星环境的扩散模型

摘要: 太空探索越来越多地依赖虚拟现实技术来进行任务,例如任务规划、多学科科学分析和宇航员训练。模拟可靠性的关键因素之一是拥有准确的行星地形的3D表示。由卫星图像衍生的外星地形高程图通常包含由于获取和传输限制而导致的缺失值。火星是地球之外最受研究的行星之一,其广泛的地形数据集使得对火星表面的重建成为一项有价值的任务,尽管许多区域仍未绘制。深度学习算法可以支持填充空白任务;然而,尽管地球的综合数据集使得可以使用条件方法,但这些方法不能应用于火星。目前的方法依赖于更简单的插值技术,然而这些技术往往无法保持几何一致性。在这项工作中,我们提出了一种基于无条件扩散模型的火星表面重建方法。训练是在一个增强数据集上进行的,该数据集包括从NASA的HiRISE调查获得的12000个火星高程图。一个非均匀重缩放策略捕获多个尺度上的地形特征,然后将其调整为固定的128x128模型分辨率。我们将我们的方法与已建立的填充空白和修补技术进行了比较,包括逆距离加权、克里金插值和Navier-Stokes算法,评估了1000个样本。结果表明,我们的方法在重建准确性(RMSE上的4-15%)和感知相似性(LPIPS上的29-81%)方面始终优于这些方法与原始数据。

更新时间: 2026-03-02 22:37:05

领域: cs.CV,cs.AI,cs.GR

下载: http://arxiv.org/abs/2510.14765v2
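
Inverse Distance Weighting, one of the baselines named in the entry above, can be sketched in a few lines. This is a generic textbook version, assuming a boolean mask where True marks a missing cell and a power parameter of 2; it is not the evaluation code of the paper, and the names `idw_fill`, `ramp`, and `hole` are hypothetical.

```python
import numpy as np


def idw_fill(heightmap, mask, power=2.0):
    """Fill missing cells (mask == True) by inverse distance weighting.

    Every known cell contributes with weight 1 / distance**power.
    Assumes at least one known cell exists.
    """
    filled = heightmap.astype(float).copy()
    known_rc = np.argwhere(~mask)   # (row, col) of known cells, row-major
    known_vals = filled[~mask]      # values in the same row-major order
    for r, c in np.argwhere(mask):
        dist = np.hypot(known_rc[:, 0] - r, known_rc[:, 1] - c)
        weights = 1.0 / dist ** power
        filled[r, c] = np.sum(weights * known_vals) / np.sum(weights)
    return filled


# A 5x5 ramp with one missing cell in the centre.
ramp = np.add.outer(np.arange(5), np.arange(5)).astype(float)
hole = np.zeros_like(ramp, dtype=bool)
hole[2, 2] = True
print(idw_fill(ramp, hole)[2, 2])
```

IDW averages its neighbours, so it smooths over ridges and crater rims, which is consistent with the abstract's remark that simple interpolation often fails to preserve geometric coherence.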

CeRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion

Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical "linear ceiling" in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce CeRA (Capacity-enhanced Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, CeRA breaks this linear barrier: at rank 64 (PPL 3.89), it outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to mathematical reasoning, where CeRA achieves a perplexity of 1.97 on MathInstruct, significantly surpassing LoRA's saturation point of 2.07. Mechanism analysis via Singular Value Decomposition (SVD) confirms that CeRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.

Updated: 2026-03-02 22:35:44

标题: CeRA: 通过流形扩展打破低秩适应的线性上限

摘要: 低秩适应(LoRA)主导了参数高效微调(PEFT)。然而,在复杂推理任务中,它面临着一个关键的“线性天花板”:简单地增加秩会因固有线性约束而产生递减收益。我们引入了CeRA(增强容量秩适应),这是一个在权重级并行适配器,通过引入SiLU门控和结构丢失来诱导流形扩展。在SlimOrca基准测试中,CeRA打破了这个线性障碍:在秩64(PPL 3.89)时,它优于秩512(PPL 3.90)时的LoRA,展现了卓越的谱效率。这种优势推广到数学推理领域,在MathInstruct上,CeRA实现了1.97的困惑度,明显超过了LoRA的饱和点2.07。通过奇异值分解(SVD)进行的机制分析证实,CeRA激活了奇异值谱的潜在尾部,有效地防止了线性方法中观察到的秩崩溃。

更新时间: 2026-03-02 22:35:44

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2602.22911v2

Unsupervised Representation Learning -- an Invariant Risk Minimization Perspective

We propose a novel unsupervised framework for Invariant Risk Minimization (IRM), extending the concept of invariance to settings where labels are unavailable. Traditional IRM methods rely on labeled data to learn representations that are robust to distributional shifts across environments. In contrast, our approach redefines invariance through feature distribution alignment, enabling robust representation learning from unlabeled data. We introduce two methods within this framework: Principal Invariant Component Analysis (PICA), a linear method that extracts invariant directions under Gaussian assumptions, and Variational Invariant Autoencoder (VIAE), a deep generative model that separates environment-invariant and environment-dependent latent factors. Our approach is based on a novel "unsupervised" structural causal model and supports environment-conditioned sample generation and intervention. Empirical evaluations on a synthetic dataset, modified versions of MNIST, and CelebA demonstrate the effectiveness of our methods in capturing invariant structure, preserving relevant information, and generalizing across environments without access to labels.

Updated: 2026-03-02 22:32:08

标题: 无监督表示学习——一种不变风险最小化的视角

摘要: 我们提出了一种新颖的无监督框架,用于不变风险最小化(IRM),将不变性的概念扩展到没有标签的情况下。传统的IRM方法依赖于有标签的数据来学习对环境之间的分布变化具有鲁棒性的表示。相比之下,我们的方法通过特征分布对齐重新定义了不变性,从而实现了从无标签数据中学习鲁棒表示。我们在这个框架内引入了两种方法:主要不变分量分析(PICA),一种线性方法,根据高斯假设提取不变方向,以及变分不变自动编码器(VIAE),一种深度生成模型,将环境不变和环境相关的潜在因子分离开来。我们的方法基于一种新颖的“无监督”结构因果模型,并支持环境条件的样本生成和干预。在合成数据集、修改后的MNIST版本和CelebA上的实证评估表明,我们的方法在捕捉不变结构、保留相关信息以及在没有标签的情况下跨环境泛化方面的有效性。

更新时间: 2026-03-02 22:32:08

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2505.12506v3

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing - the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.

Updated: 2026-03-02 22:28:32

标题: 窄微调在激活差异中留下清晰可读的痕迹

摘要: 在狭窄领域微调已经成为调整大型语言模型(LLMs)以适应特定任务并创建具有已知不寻常属性的模型的基本工具,这些属性对研究有用。我们表明,狭窄微调会在LLM激活中产生强烈的偏见,可以被解释为了解微调领域。这些偏见可以通过模型差异的简单工具来发现,即研究微调前后模型之间的差异。特别是,在随机文本的前几个标记上分析激活差异,并通过将这种差异添加到模型激活中进行引导,可以产生类似于微调数据格式和一般内容的文本。我们通过创建基于LLM的可解释性代理来证明这些分析包含关键信息,以了解微调领域。有了这些偏见,该代理相比使用简单提示的基准代理表现明显更好。我们的分析跨越了合成文档微调虚假事实、新兴错位、潜意识学习和禁忌词猜测游戏模型,跨越不同架构(Gemma、LLaMA、Qwen)和规模(10亿到320亿参数)。我们怀疑这些偏见反映了过度拟合,并发现将预训练数据混合到微调语料库中大部分会消除它们,尽管残留风险可能仍然存在。我们的工作(1)证明了狭窄微调模型在其激活中具有明显的训练目标痕迹,并提出了改进它们训练方式的方法,(2)警告AI安全和可解释性研究人员,使用这些模型作为研究更广泛微调(例如,聊天微调)的代理可能不现实,(3)强调了对狭窄微调效果进行更深入调查和开发真实案例研究以进行模型差异、安全性和可解释性研究的需求。

更新时间: 2026-03-02 22:28:32

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.13900v2

Using the SEKF to Transfer NN Models of Dynamical Systems with Limited Data

Data-driven models of dynamical systems require extensive amounts of training data. For many practical applications, gathering sufficient data is not feasible due to cost or safety concerns. This work uses the Subset Extended Kalman Filter (SEKF) to adapt pre-trained neural network models to new, similar systems with limited data available. Experimental validation across damped spring and continuous stirred-tank reactor systems demonstrates that small parameter perturbations to the initial model capture target system dynamics while requiring as little as 1% of original training data. In addition, finetuning requires less computational cost and reduces generalization error.

Updated: 2026-03-02 22:25:31

标题: 使用SEKF将动态系统的NN模型转移到有限数据

摘要: 数据驱动的动态系统模型需要大量的训练数据。对于许多实际应用来说,由于成本或安全考虑,收集足够的数据是不可行的。本文利用子集扩展卡尔曼滤波器(SEKF)将预先训练的神经网络模型适应于新的、相似的系统,而这些系统只有有限的数据可用。通过阻尼弹簧和连续搅拌槽反应器系统的实验证实,对初始模型进行微小参数扰动即可捕捉目标系统动态,而只需要原始训练数据的1%。此外,微调需要较少的计算成本,并减少了泛化误差。

更新时间: 2026-03-02 22:25:31

领域: cs.LG

下载: http://arxiv.org/abs/2603.02439v1

The Lattice Geometry of Neural Network Quantization -- A Short Equivalence Proof of GPTQ and Babai's Algorithm

We explain how data-driven quantization of a linear unit in a neural network corresponds to solving the closest vector problem for a certain lattice generated by input data. We prove that the GPTQ algorithm is equivalent to Babai's well-known nearest-plane algorithm. We furthermore provide geometric intuition for both algorithms. Lastly, we note the consequences of these results, in particular hinting at the possibility of using lattice basis reduction for improved quantization.

Updated: 2026-03-02 22:23:15

标题: 神经网络量化的晶格几何学——GPTQ和巴拜算法的等价性证明简述

摘要: 我们解释了神经网络中线性单元的数据驱动量化如何对应于解决由输入数据生成的某个格的最近向量问题。我们证明了GPTQ算法等同于巴拜的著名最近平面算法。此外,我们为这两种算法提供了几何直觉。最后,我们指出了这些结果的后果,特别是暗示可以利用格基减少来改进量化。

更新时间: 2026-03-02 22:23:15

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.01077v2
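
For readers unfamiliar with the nearest-plane algorithm referenced in the entry above, the following is a minimal sketch of Babai's method for the closest vector problem. It uses the standard textbook formulation with basis vectors as rows; it does not reproduce the paper's equivalence proof or the GPTQ side of it, and the function names are chosen for this example only.

```python
import numpy as np


def gram_schmidt(basis):
    """Orthogonalize the rows of `basis` without normalizing them."""
    basis = np.asarray(basis, dtype=float)
    ortho = np.zeros_like(basis)
    for i, b in enumerate(basis):
        v = b.copy()
        for j in range(i):
            v -= (b @ ortho[j]) / (ortho[j] @ ortho[j]) * ortho[j]
        ortho[i] = v
    return ortho


def babai_nearest_plane(basis, target):
    """Return (lattice_vector, integer_coefficients) close to `target`.

    Works from the last basis vector to the first, each time rounding the
    residual's coordinate along the corresponding Gram-Schmidt direction.
    """
    basis = np.asarray(basis, dtype=float)
    ortho = gram_schmidt(basis)
    residual = np.asarray(target, dtype=float).copy()
    coeffs = np.zeros(len(basis), dtype=int)
    for i in reversed(range(len(basis))):
        c = int(np.rint((residual @ ortho[i]) / (ortho[i] @ ortho[i])))
        coeffs[i] = c
        residual -= c * basis[i]
    return coeffs @ basis, coeffs


basis = np.array([[1.0, 0.0], [1.0, 2.0]])
vector, coeffs = babai_nearest_plane(basis, [2.2, -1.9])
print(vector, coeffs)
```

The quality of the answer depends on how orthogonal the basis is, which is why the abstract points to lattice basis reduction as a possible route to better quantization.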

TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models

The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic Synthesis, which generates contrastive reasoning pairs to isolate the specific logical point of fracture; (2) Step-Aware Supervised Fine-Tuning (SSFT), to instill a structural verification grammar; and (3) Verifier-Guided Reinforcement Learning (VGRL), utilizing Group Relative Policy Optimization. We identify and mitigate a critical failure mode of baseline alignment - lexical overfitting - whereby verifiers memorize adversarial triggers rather than auditing logical integrity. Our empirical evaluation demonstrates that TraceGuard acts as a security force multiplier: a 4B-parameter verifier achieves forensic precision on unseen attacks - including latent backdoors and post-hoc rationalizations - that rivals architectures two orders of magnitude larger. We further demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive for the Trusted Computing Base.

Updated: 2026-03-02 22:19:13

标题: TraceGuard:针对大型语言模型中的推理后门的过程引导防火墙

摘要: 在高风险决策管道中部署大型推理模型(LRMs)引入了一个新颖且不透明的攻击面:推理后门。在这些攻击中,模型的中间思维链(CoT)被操纵以提供一个在语言上似是而非但逻辑上错误的恶意结论的理由。虽然前沿模型具有检测这些裂缝的内在能力,但紧凑且可部署的模型存在基本的验证差距,依赖于容易被积极的对手绕过的脆弱词汇启发式方法。 为了弥合这一差距,我们提出了TraceGuard,这是一个过程引导的安全框架,将小规模模型转化为强大的推理防火墙。我们的方法将推理追踪视为一个不受信任的有效载荷,并通过三个协同阶段建立了深度防御策略:(1)自动法医合成,生成对比推理对,以隔离特定的逻辑断裂点;(2)步骤感知监督微调(SSFT),灌输结构验证语法;和(3)验证器引导的强化学习(VGRL),利用组相对策略优化。我们识别和减轻了基线对齐的关键故障模式 - 词汇过度拟合 - 即验证器记住对抗触发器而非审核逻辑完整性。我们的实证评估表明,TraceGuard作为一种安全力量倍增器:一个拥有4B参数的验证器在看不见的攻击上取得了法医精度 - 包括潜在的后门和事后合理化 - 与两个数量级更大的架构相媲美。我们进一步展示了在灰盒设置中对抗适应性对手的鲁棒性,使TraceGuard成为受信任计算基础的可行低延迟安全原语。

更新时间: 2026-03-02 22:19:13

领域: cs.CR

下载: http://arxiv.org/abs/2603.02436v1

VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.

Updated: 2026-03-02 22:18:48

标题: VL-KGE:视觉-语言模型遇见知识图嵌入

摘要: 真实世界的多模态知识图(MKGs)在本质上是异构的,建模与不同模态相关的实体。传统的知识图嵌入(KGE)方法擅长学习实体和关系的连续表示,但通常设计用于单模态环境。最近的方法将KGE扩展到多模态环境,但仍受限,通常是将模态单独处理,导致跨模态对齐差强人意,并依赖简单的假设,如实体之间的模态可用性均匀。视觉语言模型(VLMs)提供了一种强大的方式来在共享嵌入空间中对齐不同的模态。我们提出了视觉语言知识图嵌入(VL-KGE),这是一个框架,将来自VLMs的跨模态对齐与结构化关系建模相结合,以学习知识图的统一多模态表示。在WN9-IMG和两个新颖的美术MKG,WikiArt-MKG-v1和WikiArt-MKG-v2上的实验表明,VL-KGE在链接预测任务中始终优于传统的单模态和多模态KGE方法。我们的结果突显了VLMs对多模态KGE的价值,使得在大规模异构知识图上进行更加强大和结构化的推理成为可能。

更新时间: 2026-03-02 22:18:48

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.02435v1

Stochastic Control Methods for Optimization

In this work, we investigate a stochastic control framework for global optimization over both Euclidean spaces and the Wasserstein space of probability measures, where the objective function may be non-convex and/or non-differentiable. In the Euclidean setting, the original minimization problem is approximated by a family of regularized stochastic control problems; using dynamic programming, we analyze the associated Hamilton-Jacobi-Bellman equations and obtain tractable representations via the Cole-Hopf transformation and the Feynman-Kac formula. For optimization over probability measures, we formulate a regularized mean-field control problem characterized by a master equation, and further approximate it by controlled $N$-particle systems. We establish that, as the regularization parameter tends to zero (and as the particle number tends to infinity for the optimization over probability measures), the value of the control problem converges to the global minimum of the original objective. Building on the resulting probabilistic representations, we propose Monte Carlo-based numerical schemes that are derivative-free thanks to the Bismut-Elworthy-Li formula. Numerical experiments are reported to illustrate the effectiveness of the methods and to support the theoretical convergence rates.

Updated: 2026-03-02 22:18:10

标题: 随机控制方法用于优化

摘要: 在这项工作中,我们研究了一个随机控制框架,用于在欧几里德空间和概率测度的Wasserstein空间上进行全局优化,其中目标函数可能是非凸和/或非可微的。在欧几里德设置中,原始最小化问题通过一系列正则化的随机控制问题来近似;利用动态规划,我们分析了相关的Hamilton-Jacobi-Bellman方程,并通过Cole-Hopf变换和Feynman-Kac公式获得了可处理的表示。对于概率测度上的优化,我们制定了一个由主方程表征的正则化平均场控制问题,并通过受控的$N$-粒子系统进一步近似。我们建立了,随着正则化参数趋向于零(以及对于概率测度的优化,粒子数趋向于无穷大),控制问题的值收敛于原始目标的全局最小值。基于得到的概率表示,我们提出了基于蒙特卡罗的数值方案,由于利用了Bismut-Elworthy-Li公式,这些方案是无导数的,并且报告了数值实验,以说明方法的有效性并支持理论收敛速度。

更新时间: 2026-03-02 22:18:10

领域: math.OC,cs.LG,math.NA,math.PR

下载: http://arxiv.org/abs/2601.01248v3

MIRAGE: Knowledge Graph-Guided Cross-Cohort MRI Synthesis for Alzheimer's Disease Prediction

Reliable Alzheimer's disease (AD) diagnosis increasingly relies on multimodal assessments combining structural Magnetic Resonance Imaging (MRI) and Electronic Health Records (EHR). However, deploying these models is bottlenecked by modality missingness, as MRI scans are expensive and frequently unavailable in many patient cohorts. Furthermore, synthesizing de novo 3D anatomical scans from sparse, high-dimensional tabular records is technically challenging and poses severe clinical risks. To address this, we introduce MIRAGE, a novel framework that reframes the missing-MRI problem as an anatomy-guided cross-modal latent distillation task. First, MIRAGE leverages a Biomedical Knowledge Graph (KG) and Graph Attention Networks to map heterogeneous EHR variables into a unified embedding space that can be propagated from cohorts with real MRIs to cohorts without them. To bridge the semantic gap and enforce physical spatial awareness, we employ a frozen pre-trained 3D U-Net decoder strictly as an auxiliary regularization engine. Supported by a novel cohort-aggregated skip feature compensation strategy, this decoder acts as a rigorous structural penalty, forcing 1D latent representations to encode biologically plausible, macro-level pathological semantics. By exclusively utilizing this distilled "diagnostic-surrogate" representation during inference, MIRAGE completely bypasses computationally expensive 3D voxel reconstruction. Experiments demonstrate that our framework successfully bridges the missing-modality gap, improving the AD classification rate by 13% compared to unimodal baselines in cohorts without real MRIs.

Updated: 2026-03-02 22:17:37

标题: MIRAGE:基于知识图谱引导的阿尔茨海默病预测跨队列MRI合成

摘要: 可靠的阿尔茨海默病(AD)诊断越来越依赖于结构磁共振成像(MRI)和电子健康记录(EHR)相结合的多模态评估。然而,部署这些模型受到模态缺失的限制,因为MRI扫描昂贵,在许多患者队列中经常不可用。此外,从稀疏的高维表格记录中合成全新的3D解剖扫描在技术上具有挑战性,并且存在严重的临床风险。为了解决这个问题,我们引入了一种新颖的框架MIRAGE,将缺失MRI问题重新构建为解剖引导的跨模态潜在蒸馏任务。首先,MIRAGE利用生物医学知识图(KG)和图注意力网络,将异质的EHR变量映射到一个统一的嵌入空间,可以从具有真实MRI的队列传播到没有MRI的队列。为了弥合语义鸿沟并强化物理空间意识,我们采用一个冻结的预训练3D U-Net解码器作为辅助正则化引擎。在一种新颖的队列聚合跳过特征补偿策略的支持下,这个解码器作为一个严格的结构惩罚,强制1D潜在表示编码生物学上合理的宏观病理语义。通过在推断过程中仅使用这种蒸馏的“诊断替代”表示,MIRAGE完全绕过了计算昂贵的3D体素重建。实验表明,我们的框架成功弥合了缺失模态的差距,在没有真实MRI的队列中,与单模态基线相比将AD分类率提高了13%。

更新时间: 2026-03-02 22:17:37

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.02434v1

A Unified Revisit of Temperature in Classification-Based Knowledge Distillation

A central idea of knowledge distillation is to expose relational structure embedded in the teacher's weights for the student to learn, which is often facilitated using a temperature parameter. Despite its widespread use, there remains limited understanding of how to select an appropriate temperature value, or how this value depends on other training elements such as the optimizer and teacher pretraining/finetuning. In practice, temperature is commonly chosen via grid search or by adopting values from prior work, which can be time-consuming or may lead to suboptimal student performance when training setups differ. In this work, we posit that temperature is closely linked to these training components and present a unified study that systematically examines such interactions. From analyzing these cross-connections, we identify and present common situations that have a pronounced impact on temperature selection, providing valuable guidance for practitioners employing knowledge distillation in their work.

Updated: 2026-03-02 22:16:01

标题: 基于分类的知识蒸馏中对温度的统一重新审视

摘要: 知识蒸馏的一个核心思想是暴露嵌入在教师权重中的关系结构,以便学生学习,通常使用温度参数来促进这一过程。尽管知识蒸馏被广泛应用,但对于如何选择适当的温度值,或者这个值如何取决于其他训练元素,如优化器、教师预训练/微调等仍了解有限。在实践中,温度通常通过网格搜索或采用先前工作的值来选择,这可能耗时,或者在训练设置不同时可能导致学生表现不佳。在本研究中,我们认为温度与这些训练组件密切相关,并提出了一个系统地研究这些相互作用的统一研究。通过分析这些交叉连接,我们确定并呈现对温度选择产生显著影响的常见情况,为在工作中使用知识蒸馏的从业者提供宝贵的指导。

更新时间: 2026-03-02 22:16:01

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2603.02430v1
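
The role of the temperature parameter discussed in the entry above can be seen in a small sketch of the usual temperature-scaled softmax and distillation loss. This follows the common formulation (KL divergence between softened teacher and student distributions, scaled by the squared temperature); the logits and function names are made up for illustration and nothing here reflects the paper's specific findings.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def distillation_kl(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on softened distributions, times T**2."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return float(temperature ** 2 * np.sum(p * (np.log(p) - np.log(q))))


# Hypothetical logits for a 4-class problem.
teacher = [6.0, 2.0, 1.0, -1.0]
student = [3.0, 2.5, 0.5, 0.0]
for t in (1.0, 4.0):
    print(t, softmax_with_temperature(teacher, t).round(3),
          round(distillation_kl(teacher, student, t), 4))
```

Raising the temperature flattens the teacher distribution, so the non-target classes carry more of the training signal; how far to flatten is the selection problem the abstract studies.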

Dimension-Independent Convergence of Underdamped Langevin Monte Carlo in KL Divergence

Underdamped Langevin dynamics (ULD) is a widely-used sampler for Gibbs distributions $\pi \propto e^{-V}$, and is often empirically effective in high dimensions. However, existing non-asymptotic convergence guarantees for discretized ULD typically scale polynomially with the ambient dimension $d$, leading to vacuous bounds when $d$ is large. The main known dimension-free result concerns the randomized midpoint discretization in Wasserstein-2 distance (Liu et al., 2023), while dimension-independent guarantees for ULD discretizations in KL divergence have remained open. We close this gap by proving the first dimension-free KL divergence bounds for discretized ULD. Our analysis refines the KL local error framework (Altschuler et al., 2025) to a dimension-free setting and yields bounds that depend on $\mathrm{tr}(\mathbf{H})$, where $\mathbf{H}$ upper bounds the Hessian of $V$, rather than on $d$. As a consequence, we obtain improved iteration complexity for underdamped Langevin Monte Carlo relative to overdamped Langevin methods in regimes where $\mathrm{tr}(\mathbf{H})\ll d$.

Updated: 2026-03-02 22:14:38

标题: 维度无关下阻尼 Langevin Monte Carlo 在 KL 散度中的收敛性

摘要: Underdamped Langevin动力学(ULD)是一种广泛使用的Gibbs分布$π\propto e^{-V}$的采样器,并且在高维度中通常在经验上有效。然而,现有的针对离散化ULD的非渐近收敛保证通常与环境维度$d$多项式地缩放,导致当$d$很大时出现空洞边界。目前已知的主要与维度无关的结果涉及Wasserstein-2距离中的随机中点离散化(Liu等,2023),而针对ULD离散化在KL散度中的维度无关保证一直处于未解开的状态。我们通过证明首个维度无关的KL散度边界来填补这一缺口。我们的分析将KL局部误差框架(Altschuler等,2025)细化到一个维度无关的设置,并得到取决于$\mathrm{tr}(\mathbf{H})$的边界,其中$\mathbf{H}$上界了$V$的Hessian矩阵,而不是$d$。因此,在$\mathrm{tr}(\mathbf{H})\ll d$的区域,我们相对于过阻尼Langevin方法获得了改进的迭代复杂度。

更新时间: 2026-03-02 22:14:38

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2603.02429v1

Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds

The high-dimensional parameter space of deep neural networks -- the neuromanifold -- is endowed with a unique metric tensor defined by the Fisher information. Reliable and scalable computation of this metric tensor is valuable for theorists and practitioners. Focusing on neural classifiers, we return to a low-dimensional space of probability distributions, which we call the core space, and examine the spectrum and envelopes of its Fisher information matrix. We extend our discoveries there to deterministic bounds for the metric tensor on the neuromanifold. We introduce an unbiased random estimator based on Hutchinson's trace method and derive related bounds. It can be evaluated efficiently with a single backward pass per batch, with a standard deviation bounded by the true value up to scaling.

Updated: 2026-03-02 22:14:27

标题: 确定性边界和神经流形上度量张量的随机估计

摘要: 深度神经网络的高维参数空间 - 神经流形 - 具有由费舍尔信息定义的独特度量张量。可靠且可扩展地计算这个度量张量对于理论家和实践者都是有价值的。我们专注于神经分类器,回到一个被称为核心空间的概率分布的低维空间,并检查其费舍尔信息矩阵的频谱和包络。我们将在那里发现的内容扩展到神经流形上度量张量的确定性界。我们介绍了一种基于Hutchinson的迹方法的无偏随机估计器,并推导出相关界限。它可以通过每批次单向传播高效评估,标准偏差被真实值限制在一个尺度范围内。

更新时间: 2026-03-02 22:14:27

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2505.13614v3
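
Hutchinson's trace method, on which the estimator in the entry above is based, can be sketched generically as follows. The example estimates the trace of a matrix that is only accessible through matrix-vector products, using Rademacher probe vectors. It is the plain textbook estimator, not the paper's estimator for the Fisher metric tensor, and `hutchinson_trace` and its parameters are names chosen for this example.

```python
import numpy as np


def hutchinson_trace(matvec, dim, num_probes=64, seed=0):
    """Unbiased estimate of trace(A), given only v -> A @ v.

    Uses Rademacher probes z (entries +1 or -1), for which
    E[z^T A z] = trace(A).
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ matvec(z)
    return total / num_probes


rng = np.random.default_rng(1)
b = rng.standard_normal((50, 50))
a = b @ b.T  # symmetric positive semi-definite test matrix
estimate = hutchinson_trace(lambda v: a @ v, dim=50, num_probes=200)
print(estimate, np.trace(a))
```

Only matrix-vector products are required, which is what makes this family of estimators attractive when the full matrix is too large to form.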

Function-Space Decoupled Diffusion for Forward and Inverse Modeling in Carbon Capture and Storage

Accurate characterization of subsurface flow is critical for Carbon Capture and Storage (CCS) but remains challenged by the ill-posed nature of inverse problems with sparse observations. We present Function-space Decoupled Diffusion Posterior Sampling (Fun-DDPS), a generative framework that combines function-space diffusion models with differentiable neural operator surrogates for both forward and inverse modeling. Our approach learns a prior distribution over geological parameters (geomodel) using a single-channel diffusion model, then leverages a Local Neural Operator (LNO) surrogate to provide physics-consistent guidance for cross-field conditioning on the dynamics field. This decoupling allows the diffusion prior to robustly recover missing information in parameter space, while the surrogate provides efficient gradient-based guidance for data assimilation. We demonstrate Fun-DDPS on synthetic CCS modeling datasets, achieving two key results: (1) For forward modeling with only 25% observations, Fun-DDPS achieves 7.7% relative error compared to 86.9% for standard surrogates (an 11x improvement), proving its capability to handle extreme data sparsity where deterministic methods fail. (2) We provide the first rigorous validation of diffusion-based inverse solvers against asymptotically exact Rejection Sampling (RS) posteriors. Both Fun-DDPS and the joint-state baseline (Fun-DPS) achieve Jensen-Shannon divergence less than 0.06 against the ground truth. Crucially, Fun-DDPS produces physically consistent realizations free from the high-frequency artifacts observed in joint-state baselines, achieving this with 4x improved sample efficiency compared to rejection sampling.

Updated: 2026-03-02 22:12:44

标题: 功能空间分离扩散用于碳捕集和储存中的正向和反向建模

摘要: 地下流的准确表征对于碳捕集与封存(CCS)至关重要,但由于逆问题具有稀疏观测的不适定性而面临挑战。我们提出了函数空间解耦扩散后验抽样(Fun-DDPS),这是一个生成框架,将函数空间扩散模型与可微分神经操作符替代物结合起来,用于前向和逆向建模。我们的方法使用单通道扩散模型学习地质参数(地质模型)的先验分布,然后利用局部神经操作符(LNO)替代物对动态场进行交叉场条件化提供物理一致性指导。这种解耦允许扩散先验在参数空间中强健地恢复缺失信息,而替代物提供了数据同化的高效基于梯度的指导。我们在合成的CCS建模数据集上展示了Fun-DDPS,实现了两个关键结果:(1)仅使用25%观测进行前向建模时,Fun-DDPS相对误差为7.7%,而标准替代物为86.9%(改进了11倍),证明了其在确定性方法失败的极端数据稀疏性下处理能力。(2)我们首次对扩散基于逆问题求解器进行了严格验证,与渐近精确的拒绝抽样(RS)后验进行比较。Fun-DDPS和联合状态基线(Fun-DPS)的Jensen-Shannon散度均小于0.06,与真实情况接近。重要的是,Fun-DDPS生成的物理一致实现不受联合状态基线中观察到的高频伪影的影响,与拒绝抽样相比,样本效率提高了4倍。

更新时间: 2026-03-02 22:12:44

领域: cs.LG,physics.geo-ph

下载: http://arxiv.org/abs/2602.12274v2
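
The entry above reports posterior agreement as a Jensen-Shannon divergence. As a rough illustration of what that number measures, the sketch below computes the divergence between two discrete histograms. The natural logarithm is an assumption here (the abstract does not state the base), and the histograms are invented for the example.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (nats)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


# Hypothetical histograms of posterior samples over the same bins.
reference = [5, 20, 40, 25, 10]
candidate = [8, 22, 35, 24, 11]
print(js_divergence(reference, candidate))
```

With natural logarithms the divergence ranges from 0 for identical distributions to ln 2 for distributions with disjoint support.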

Learning to Pay Attention: Unsupervised Modeling of Attentive and Inattentive Respondents in Survey Data

The integrity of behavioral and social-science surveys depends on detecting inattentive respondents who provide random or low-effort answers. Traditional safeguards, such as attention checks, are often costly, reactive, and inconsistent. We propose a unified, label-free framework for inattentiveness detection that scores response coherence using complementary unsupervised views: geometric reconstruction (Autoencoders) and probabilistic dependency modeling (Chow-Liu trees). While we introduce a "Percentile Loss" objective to improve Autoencoder robustness against anomalies, our primary contribution is identifying the structural conditions that enable unsupervised quality control. Across nine heterogeneous real-world datasets, we find that detection effectiveness is driven less by model complexity than by survey structure: instruments with coherent, overlapping item batteries exhibit strong covariance patterns that allow even linear models to reliably separate attentive from inattentive respondents. This reveals a critical "Psychometric-ML Alignment": the same design principles that maximize measurement reliability (e.g., internal consistency) also maximize algorithmic detectability. The framework provides survey platforms with a scalable, domain-agnostic diagnostic tool that links data quality directly to instrument design, enabling auditing without additional respondent burden.

Updated: 2026-03-02 22:11:51

Categories: cs.HC,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.02427v1

Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation

We study personalized multi-agent average reward TD learning, in which a collection of agents interacts with different environments and jointly learns their respective value functions. We focus on the setting where there exists a shared linear representation, and the agents' optimal weights collectively lie in an unknown linear subspace. Inspired by the recent success of personalized federated learning (PFL), we study the convergence of cooperative single-timescale TD learning in which agents iteratively estimate the common subspace and local heads. We show that this decomposition can filter out conflicting signals, effectively mitigating the negative impact of "misaligned" signals and achieving linear speedup. The main technical challenges lie in the heterogeneity, the Markovian sampling, and their intricate interplay in shaping the error evolution. Specifically, not only are the error dynamics of multiple variables closely interconnected, but there is also no direct contraction for the principal angle distance between the optimal subspace and the estimated subspace. We hope our analytical techniques can inspire further research into leveraging common structures. Experiments show the benefits of learning via a shared structure in the more general control problem.

Updated: 2026-03-02 22:10:56

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02426v1

Weight-Space Linear Recurrent Neural Networks

We introduce WARP (Weight-space Adaptive Recurrent Prediction), a simple yet powerful model that unifies weight-space learning with linear recurrence to redefine sequence modeling. Unlike conventional recurrent neural networks (RNNs) which collapse temporal dynamics into fixed-dimensional hidden states, WARP explicitly parametrizes its hidden state as the weights and biases of a distinct auxiliary neural network, and uses input differences to drive its recurrence. This brain-inspired formulation enables efficient gradient-free adaptation of the auxiliary network at test-time, in-context learning abilities, and seamless integration of domain-specific physical priors. Empirical validation shows that WARP matches or surpasses state-of-the-art baselines on diverse classification tasks, featuring in the top three in 4 out of 6 real-world challenging datasets. Furthermore, extensive experiments across sequential image completion, multivariate time series forecasting, and dynamical system reconstruction demonstrate its expressiveness and generalisation capabilities. Remarkably, a physics-informed variant of our model outperforms the next best model by more than 10x. Ablation studies confirm the architectural necessity of key components, solidifying weight-space linear RNNs as a transformative paradigm for adaptive machine intelligence.

Updated: 2026-03-02 22:10:32

Categories: cs.LG

Download: http://arxiv.org/abs/2506.01153v3

A Directed Graph Model and Experimental Framework for Design and Study of Time-Dependent Text Visualisation

Exponential growth in the quantity of digital news, social media, and other textual sources makes it difficult for humans to keep up with rapidly evolving narratives about world events. Various visualisation techniques have been touted to help people to understand such discourse by exposing relationships between texts (such as news articles) as topics and themes evolve over time. Arguably, the understandability of such visualisations hinges on the assumption that people will be able to easily interpret the relationships in such visual network structures. To test this assumption, we begin by defining an abstract model of time-dependent text visualisation based on directed graph structures. From this model we distill motifs that capture the set of possible ways that texts can be linked across changes in time. We also develop a controlled synthetic text generation methodology that leverages the power of modern LLMs to create fictional, yet structured sets of time-dependent texts that fit each of our patterns. Therefore, we create a clean user study environment (n=30) for participants to identify patterns that best represent a given set of synthetic articles. We find that it is a challenging task for the user to identify and recover the predefined motif. We analyse qualitative data to map an unexpectedly rich variety of user rationales when divergences from expected interpretation occur. A deeper analysis also points to unexpected complexities inherent in the formation of synthetic datasets with LLMs that undermine the study control in some cases. Furthermore, analysis of individual decision-making in our study hints at a future where text discourse visualisation may need to dispense with a one-size-fits-all approach and, instead, should be more adaptable to the specific user who is exploring the visualisation in front of them.

Updated: 2026-03-02 22:06:31

Categories: cs.HC,cs.AI,cs.CL

Download: http://arxiv.org/abs/2603.02422v1

Slurry-as-a-Service: A Modest Proposal on Scalable Pluralistic Alignment for Nutrient Optimization

Pluralistic alignment has emerged as a promising approach for ensuring that large language models (LLMs) faithfully represent the diversity, nuance, and conflict inherent in human values. In this work, we study a high-stakes deployment context - mulching - where automated systems transform selected individuals into nutrient-rich slurry for the dual purposes of food security and aesthetic population management. Building on recent pluralistic alignment frameworks, we introduce ValueMulch, a reproducible training, deployment, and certification pipeline for aligning mulching models (MMs) to a wide range of community norms. Through a real-world testbed spanning 32 communities, we show that ValueMulch improves distributional agreement with community mulching preferences relative to frontier baselines. We conclude with a discussion of ethical considerations, limitations, and implications for researchers seeking to align systems to the full spectrum of human values - especially when those values are inconsistent, commercially inconvenient, or nutritionally underutilized. Author's note: This piece builds on prior work by Keyes et al. (2019), which used cannibalism as a satirical parody of approaches that imbue ethics into problematic technology. We bring those ideas into today's era, in which large language models have proliferated in everyday life, as a critique of the current AI pluralistic alignment literature. Our work does not intend to argue that all alignment practices are evil, but rather that if framing value design as a technical problem enables technology systems to enact harms, then perhaps this framing is not enough.

Updated: 2026-03-02 22:04:59

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2603.02420v1

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a novel vector for stealthy backdoor injection: the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework. Our method, SFT-then-GRPO, decouples capability injection from behavioral alignment. First, we use SFT with LoRA to implant a "sleeper agent" capability. Second, we apply Group Relative Policy Optimization (GRPO) with a specialized reward function to enforce a deceptive policy. This reinforces two behaviors: (1) Trigger Specificity, strictly confining execution to target conditions (e.g., Year 2026), and (2) Operational Concealment, where the model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings highlight a critical failure mode in alignment, where reinforcement learning is exploited to conceal, rather than remove, catastrophic vulnerabilities. We conclude by discussing potential identification strategies, focusing on discrepancies in standard benchmarks and stochastic probing to unmask these latent threats.

Updated: 2026-03-02 22:01:08

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2603.03371v1

Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits

We develop a Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix -- not an exogenous scalar variance. Under exchangeable sampling, the mini-batch gradient covariance is pinned down (to leading order) by the projected covariance of per-sample gradients: it equals projected Fisher information for well-specified likelihood losses and the projected Godambe (sandwich) matrix for general M-estimation losses. This identification forces a diffusion approximation with Fisher/Godambe-structured volatility (effective temperature tau = eta/b) and yields an Ornstein-Uhlenbeck linearization whose stationary covariance is given in closed form by a Fisher-Lyapunov equation. Building on this geometry, we prove matching minimax upper and lower bounds of order Theta(1/N) for Fisher/Godambe risk under a total oracle budget N; the lower bound holds under a martingale oracle condition (bounded predictable quadratic variation), strictly subsuming i.i.d. and exchangeable sampling. These results imply oracle-complexity guarantees for epsilon-stationarity in the Fisher dual norm that depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than ambient dimension or Euclidean conditioning. Experiments confirm the Lyapunov predictions and show that scalar temperature matching cannot reproduce directional noise structure.
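
As an illustration of the Ornstein-Uhlenbeck linearization mentioned in the abstract, the block below writes out the textbook stationary-covariance (Lyapunov) equation for a linear SDE. The symbols H, C, F, and Sigma and the quadratic-loss setup are assumptions made here; the abstract does not state the paper's exact Fisher-Lyapunov equation, which may include projections that this sketch omits. Only the effective temperature tau = eta/b is taken from the abstract.

```latex
% Assumed setup (not from the abstract): quadratic loss near a minimum
% \theta^\star with Hessian H, per-sample gradient covariance C,
% learning rate \eta, and batch size b.
\begin{align*}
  \tau &= \eta / b, \\
  d\theta_t &= -H(\theta_t - \theta^\star)\,dt + \sqrt{\tau}\, C^{1/2}\, dW_t, \\
  H\Sigma + \Sigma H^\top &= \tau\, C, \\
  H = C = F \ \text{(well-specified likelihood, $F$ invertible)}
    \quad&\Longrightarrow\quad \Sigma = \tfrac{\tau}{2}\, I .
\end{align*}
```

In this simplified setting the last line shows why a single scalar variance cannot describe the noise in general: the stationary covariance is isotropic only when the curvature H and the gradient covariance C coincide.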

Updated: 2026-03-02 21:57:09

Categories: stat.ML,cs.LG,math.OC

Download: http://arxiv.org/abs/2603.02417v1

From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness

Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and its impact on efficiency. We propose Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under fixed bit budgets. QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. Guided by the rate-distortion perspective, we empirically analyze how bit allocation between sample count and precision influences learning performance. Our framework supports both uniform and adaptive non-uniform quantization, where the latter learns quantization levels from data to represent information-dense regions better. Experiments on image classification and 3GPP beam management tasks show that QuADD surpasses existing DD and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.
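
The trade-off between sample count and precision under a fixed bit budget can be made concrete with a little arithmetic. The sketch below is not QuADD's implementation: the function names, the 3072-value sample size, and the plain rounding quantizer are assumptions for illustration, and it shows only forward rounding, not the differentiable module the abstract describes.

```python
# Illustrative sketch only; names and sizes are hypothetical, not from QuADD.

def max_samples(budget_bits, values_per_sample, bits_per_value):
    """Number of synthetic samples that fit in a fixed bit budget."""
    return budget_bits // (values_per_sample * bits_per_value)


def quantize_uniform(x, bits, lo=0.0, hi=1.0):
    """Round x to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / step) * step


# Budget equal to 10 samples of 3072 values stored at 8 bits each (assumed sizes).
BUDGET_BITS = 10 * 3072 * 8

# Halving the precision doubles the number of samples the same budget can hold.
allocations = {bits: max_samples(BUDGET_BITS, 3072, bits) for bits in (8, 4, 2)}
```

Here `allocations` maps each bit width to the sample count the budget allows, which is the axis along which the abstract says learning performance is analyzed.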

Updated: 2026-03-02 21:46:10

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.02411v1

SPARLING: Learning Latent Representations with Extremely Sparse Activations

Real-world processes often contain intermediate state that can be modeled as an extremely sparse activation tensor. In this work, we analyze the identifiability of such sparse and local latent intermediate variables, which we call motifs. We prove our Motif Identifiability Theorem, stating that under certain assumptions it is possible to precisely identify these motifs exclusively by reducing end-to-end error. Notably, we do not assume identifiability of parameters, but rather of a latent intermediate representation output by a local model, thus allowing these representations to be arbitrarily complex functions of the input. Additionally, we provide the Sparling algorithm, which uses a new kind of informational bottleneck that enforces levels of activation sparsity unachievable using other techniques. We confirm empirically that extreme sparsity is necessary to achieve good intermediate state modeling. On synthetic domains, we are able to precisely localize the intermediate states up to feature permutation with > 90% accuracy, even though we only train end-to-end.

Updated: 2026-03-02 21:43:34

Categories: cs.LG

Download: http://arxiv.org/abs/2302.01976v3

AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks

To utilize pre-trained neural networks on edge and mobile devices, we often require efficient adaptation to user-specific runtime data distributions while operating under limited compute and memory resources. On-device retraining with a target dataset can facilitate such adaptations; however, it remains impractical due to the increasing depth of modern neural nets, as well as the computational overhead associated with gradient-based optimization across all layers. Current approaches reduce training cost by selecting a subset of layers for retraining; however, they rely on labeled data, at least one full-model backpropagation, or server-side meta-training, limiting their suitability for constrained devices. We introduce AdaBet, a gradient-free layer selection approach that ranks important layers by analyzing topological features of their activation spaces through Betti numbers, using forward passes alone. AdaBet allows selecting layers with high learning capacity, which are important for retraining and adaptation, without requiring labels or gradients. Evaluated on sixteen pairs of benchmark models and datasets, AdaBet achieves an average gain of 2.5% in classification accuracy over gradient-based baselines while reducing average peak memory consumption by 40%.

Updated: 2026-03-02 21:41:15

Categories: cs.LG

Download: http://arxiv.org/abs/2510.03101v2

Watermarking Without Standards Is Not AI Governance

Watermarking has emerged as a leading technical proposal for attributing generative AI content and is increasingly cited in global governance frameworks. This position paper argues that current implementations risk serving as symbolic compliance rather than delivering effective oversight. We identify a growing gap between regulatory expectations and the technical limitations of existing watermarking schemes. Through analysis of policy proposals and industry practices, we show how incentive structures disincentivize robust, auditable deployments. To realign watermarking with governance goals, we propose a three-layer framework encompassing technical standards, audit infrastructure, and enforcement mechanisms. Without enforceable requirements and independent verification, watermarking will remain inadequate for accountability and ultimately undermine broader efforts in AI safety and regulation.

Updated: 2026-03-02 21:34:00

Categories: cs.CR

Download: http://arxiv.org/abs/2505.23814v2

Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles

Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: https://github.com/ZhanghanNi/RigidSSL.git.

Updated: 2026-03-02 21:32:30

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02406v1

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept correctly when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual application. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.

Updated: 2026-03-02 21:29:46

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2512.18857v2

WARP: Weight Teleportation for Attack-Resilient Unlearning Protocols

Approximate machine unlearning aims to efficiently remove the influence of specific data points from a trained model, offering a practical alternative to full retraining. However, it introduces privacy risks: an adversary with access to pre- and post-unlearning models can exploit their differences for membership inference or data reconstruction. We show these vulnerabilities arise from two factors: large gradient norms of forget-set samples and the close proximity of unlearned parameters to the original model. To demonstrate their severity, we propose unlearning-specific membership inference and reconstruction attacks, showing that several state-of-the-art methods (e.g., NGP, SCRUB) remain vulnerable. To mitigate this leakage, we introduce WARP, a plug-and-play teleportation defense that leverages neural network symmetries to reduce forget-set gradient energy and increase parameter dispersion while preserving predictions. This reparameterization obfuscates the signal of forgotten data, making it harder for attackers to distinguish forgotten samples from non-members or recover them via reconstruction. Across six unlearning algorithms, our approach achieves consistent privacy gains, reducing adversarial advantage (AUC) by up to 64% in black-box and 92% in white-box settings, while maintaining accuracy on retained data. These results highlight teleportation as a general tool for reducing attack success in approximate unlearning.
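
The idea of moving parameters along a network symmetry while preserving predictions can be shown with the well-known positive-rescaling symmetry of ReLU units. This is a minimal sketch under that assumption; the abstract does not say which symmetries WARP uses or how it chooses the teleportation, and all names and values below are hypothetical.

```python
# Illustrative sketch only: a generic ReLU rescaling symmetry, not WARP's method.

def relu(value):
    return max(0.0, value)


def forward(w1, b1, w2, x):
    """Two-layer ReLU network: y = sum_j w2[j] * relu(w1[j] . x + b1[j])."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))


def teleport(w1, b1, w2, scales):
    """Scale hidden unit j by scales[j] > 0 and its outgoing weight by 1/scales[j]."""
    assert all(s > 0 for s in scales)
    new_w1 = [[s * w for w in row] for row, s in zip(w1, scales)]
    new_b1 = [s * b for b, s in zip(b1, scales)]
    new_w2 = [w / s for w, s in zip(w2, scales)]
    return new_w1, new_b1, new_w2


w1, b1, w2 = [[1.0, -2.0], [0.5, 1.5]], [0.25, -1.0], [2.0, -3.0]
x = [1.0, 0.5]

# The parameters move far from the originals, but the prediction is unchanged.
t_w1, t_b1, t_w2 = teleport(w1, b1, w2, scales=[4.0, 0.5])
original_output = forward(w1, b1, w2, x)
teleported_output = forward(t_w1, t_b1, t_w2, x)
```

Because relu(s * z) = s * relu(z) for any s > 0, the two outputs agree on every input, which is the sense in which a reparameterization can increase parameter dispersion without changing predictions.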

Updated: 2026-03-02 21:29:32

Categories: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2512.00272v2

Topic-Based Watermarks for Large Language Models

The indistinguishability of large language model (LLM) output from human-authored content poses significant challenges, raising concerns about potential misuse of AI-generated text and its influence on future model training. Watermarking algorithms offer a viable solution by embedding detectable signatures into generated text. However, existing watermarking methods often involve trade-offs among attack robustness, generation quality, and additional overhead such as specialized frameworks or complex integrations. We propose a lightweight, topic-guided watermarking scheme for LLMs that partitions the vocabulary into topic-aligned token subsets. Given an input prompt, the scheme selects a relevant topic-specific token list, effectively "green-listing" semantically aligned tokens to embed robust marks while preserving fluency and coherence. Experimental results across multiple LLMs and state-of-the-art benchmarks demonstrate that our method achieves text quality comparable to industry-leading systems and simultaneously improves watermark robustness against paraphrasing and lexical perturbation attacks, with minimal performance overhead. Our approach avoids reliance on additional mechanisms beyond standard text generation pipelines, enabling straightforward adoption and suggesting a practical path toward globally consistent watermarking of AI-generated content.
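
A toy version of topic-guided green-listing may help make the scheme concrete. Everything in this sketch is an assumption for illustration: the tiny vocabulary, the keyword-overlap topic selector, and the z-score detector (borrowed from the generic green-list watermarking literature) are not taken from the paper, whose topic matching and detection procedure may differ.

```python
import math

# Illustrative sketch only; vocabulary, topics, and detector are hypothetical.
VOCAB = ["goal", "team", "match", "score", "bank", "loan", "stock", "market",
         "the", "a", "is", "was"]
TOPIC_TOKENS = {
    "sports": {"goal", "team", "match", "score"},
    "finance": {"bank", "loan", "stock", "market"},
}


def select_topic(prompt):
    """Pick the topic whose token subset overlaps most with the prompt words."""
    words = set(prompt.lower().split())
    return max(sorted(TOPIC_TOKENS), key=lambda t: len(words & TOPIC_TOKENS[t]))


def green_z_score(tokens, topic):
    """z-score of the green-token count against the rate expected by chance."""
    green = TOPIC_TOKENS[topic]
    gamma = len(green) / len(VOCAB)  # chance of a green token in unmarked text
    n = len(tokens)
    hits = sum(token in green for token in tokens)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))


topic = select_topic("Which team will score the next goal")
marked = green_z_score(["the", "team", "score", "goal", "match", "team"], topic)
unmarked = green_z_score(["the", "a", "is", "was", "bank", "loan"], topic)
```

Text that favors the selected topic's tokens yields a high score, while unrelated text scores at or below zero; a real deployment would use a full tokenizer vocabulary and a decision threshold chosen for a target false-positive rate.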

Updated: 2026-03-02 21:26:43

Categories: cs.CR,cs.CL,cs.LG

Download: http://arxiv.org/abs/2404.02138v5

DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

Fine-tuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low-Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low-rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining memory efficiency and training speed comparable to LoRA. Moreover, we provide theoretical guarantees showing that, under mild low-rank conditions, DiaBlo is more expressive than LoRA in the linear problem and converges to a stationary point of general nonlinear full fine-tuning. Through extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, we show that fine-tuning only diagonal blocks is sufficient for strong and consistent performance. DiaBlo not only achieves competitive accuracy but also preserves high memory efficiency and fast fine-tuning speed. Code is available at https://github.com/ziyangjoy/DiaBlo.
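
To make "updating only the diagonal blocks" concrete, the sketch below applies a block-diagonal update to a small frozen matrix and compares trainable-parameter counts with a LoRA update. It is an assumed illustration, not the DiaBlo code: the function names, the 4096-dimensional layer, the 64 blocks, and the LoRA rank of 32 are all hypothetical choices.

```python
# Illustrative sketch only; names and sizes are hypothetical, not from DiaBlo.

def diablo_trainable_params(d_in, d_out, num_blocks):
    """Trainable entries when only the diagonal blocks of a d_out x d_in matrix change."""
    assert d_in % num_blocks == 0 and d_out % num_blocks == 0
    return num_blocks * (d_out // num_blocks) * (d_in // num_blocks)


def lora_trainable_params(d_in, d_out, rank):
    """Trainable entries of a rank-r update B @ A (B: d_out x r, A: r x d_in)."""
    return rank * (d_in + d_out)


def apply_block_diagonal_update(weight, blocks):
    """Return weight + blockdiag(blocks); off-block entries stay untouched."""
    updated = [row[:] for row in weight]
    row_offset = col_offset = 0
    for block in blocks:
        for i, block_row in enumerate(block):
            for j, delta in enumerate(block_row):
                updated[row_offset + i][col_offset + j] += delta
        row_offset += len(block)
        col_offset += len(block[0])
    return updated


# A 4x4 frozen weight with two trainable 2x2 diagonal blocks.
frozen = [[1.0] * 4 for _ in range(4)]
blocks = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
adapted = apply_block_diagonal_update(frozen, blocks)

# With these assumed sizes the two methods train the same number of parameters.
diablo_count = diablo_trainable_params(4096, 4096, num_blocks=64)
lora_count = lora_trainable_params(4096, 4096, rank=32)
```

The update is a single dense add on each block, with no product of two low-rank factors, which is the structural difference from LoRA that the abstract emphasizes.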

Updated: 2026-03-02 21:25:28

Categories: cs.LG,cs.AI,cs.CL,math.OC

Download: http://arxiv.org/abs/2506.03230v2
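
The DiaBlo abstract above describes updating only the diagonal blocks of a weight matrix. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the function names, the nested-list matrix representation, and the equal-size block layout are all assumptions made for illustration. It also compares trainable-parameter counts against a LoRA adapter of a given rank.

```python
# Illustrative sketch only. Names, shapes, and the equal-size block layout
# are assumptions; the abstract does not specify these details.

def diag_block_param_count(d_in: int, d_out: int, num_blocks: int) -> int:
    """Trainable parameters when only num_blocks diagonal blocks are updated."""
    if d_in % num_blocks or d_out % num_blocks:
        raise ValueError("dimensions must be divisible by num_blocks")
    return num_blocks * (d_in // num_blocks) * (d_out // num_blocks)


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of a rank-`rank` LoRA adapter (two thin matrices)."""
    return rank * (d_in + d_out)


def apply_diag_block_update(weight, blocks):
    """Return a copy of `weight` with each block added onto its diagonal slot.

    weight: list of rows (d_out x d_in); blocks: list of equal-size matrices.
    Entries outside the diagonal blocks are left unchanged.
    """
    n_blocks = len(blocks)
    block_h = len(weight) // n_blocks
    block_w = len(weight[0]) // n_blocks
    out = [row[:] for row in weight]  # do not mutate the frozen base weight
    for b, block in enumerate(blocks):
        for i in range(block_h):
            for j in range(block_w):
                out[b * block_h + i][b * block_w + j] += block[i][j]
    return out
```

Under these assumptions, a 4096 x 4096 matrix with 64 diagonal blocks has 64 * 64 * 64 = 262,144 trainable entries, the same count as a rank-32 LoRA adapter on that matrix, with no matrix product needed to form the update.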

SURFACEBENCH: A Geometry-Aware Benchmark for Symbolic Surface Discovery

Equation discovery from data is a central challenge in machine learning for science, which requires the recovery of concise symbolic expressions that govern complex physical and geometric phenomena. Recent large language model (LLM) approaches have shown promise in symbolic regression, yet existing benchmarks predominantly evaluate low-dimensional scalar functions and rely on string-level or regression-based metrics that fail to capture structural and geometric equivalence. We introduce SURFACEBENCH, the first geometry-aware benchmark for symbolic discovery of three-dimensional surfaces. Unlike scalar curve-fitting tasks, SURFACEBENCH targets surface-level reasoning, where multi-variable coupling, coordinate transformations, and geometric structure must be inferred directly from data. The benchmark comprises 183 analytically constructed, science-inspired surface equations across 15 categories and three representation paradigms: explicit, implicit, and parametric forms. Each task includes variable semantics and synthetically sampled 3D data, and is designed to stress symbolic composition, structural ambiguity, and representational non-uniqueness while mitigating memorization. To evaluate discovery quality, SURFACEBENCH combines symbolic equivalence checks with object-space geometric metrics (Chamfer and Hausdorff distances) and regression-based error measures, allowing evaluation of functional fidelity beyond algebraic syntax. Empirical evaluation across evolutionary, neural, and LLM-driven frameworks reveals that no current method achieves consistent performance across representation types, with LLM-based approaches exhibiting strong structural priors but limited robustness in parameter calibration and multi-equation reasoning. The code and data are available at this link: github.com/deep-symbolic-mathematics/surfacebench.

Updated: 2026-03-02 21:20:21

Categories: cs.LG

Download: http://arxiv.org/abs/2511.10833v2
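
The SURFACEBENCH abstract above mentions Chamfer and Hausdorff distances as object-space metrics. As a rough sketch of what these measure on two sampled point sets, the following uses one common convention (symmetric Chamfer as the sum of the two mean nearest-neighbour distances, unsquared). The benchmark's exact convention is not given in the abstract, so treat the definitions and function names here as assumptions.

```python
import math

# Illustrative definitions only; the benchmark may use squared distances or a
# different normalisation.

def _nearest(point, points):
    """Euclidean distance from `point` to its nearest neighbour in `points`."""
    return min(math.dist(point, q) for q in points)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbour distance."""
    return max(max(_nearest(p, b) for p in a), max(_nearest(q, a) for q in b))
```

Chamfer averages over points and so tolerates a few outliers, while Hausdorff reports the single worst mismatch. That difference is one reason to report both when comparing a predicted surface against a reference.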

CREPE: Controlling Diffusion with Replica Exchange

Inference-time control of diffusion models aims to steer model outputs to satisfy new constraints without retraining. Previous approaches have mostly relied on heuristic guidance or have been coupled with Sequential Monte Carlo (SMC) for bias correction. In this paper, we propose a flexible alternative based on replica exchange, an algorithm designed initially for sampling problems. We refer to this method as CREPE (Controlling with REPlica Exchange). Unlike SMC, CREPE: (1) generates particles sequentially, (2) maintains high diversity in the generated samples after a burn-in period, and (3) enables online refinement or early termination. We demonstrate its versatility across various tasks, including temperature annealing, reward-tilting, model composition and classifier-free guidance debiasing, with competitive performance compared to prior SMC methods.

Updated: 2026-03-02 21:17:26

Categories: cs.LG

Download: http://arxiv.org/abs/2509.23265v2
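
The CREPE abstract above builds on replica exchange. For readers unfamiliar with it, here is a minimal sketch of the classical swap step between two tempered replicas (parallel tempering). This is the textbook sampling algorithm, not CREPE's diffusion-specific construction, and the function names are hypothetical.

```python
import math
import random

# Classical replica-exchange swap between two replicas targeting
# exp(-beta * E(x)). Shown for background only; not the CREPE method itself.

def swap_acceptance(beta_i, beta_j, energy_i, energy_j):
    """Metropolis probability of exchanging the states of replicas i and j."""
    log_ratio = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def maybe_swap(states, betas, energy, i, j, rng):
    """Swap states[i] and states[j] in place with the Metropolis probability."""
    p = swap_acceptance(betas[i], betas[j], energy(states[i]), energy(states[j]))
    if rng.random() < p:
        states[i], states[j] = states[j], states[i]
        return True
    return False
```

The swap is always accepted when the colder replica (larger beta) currently holds the higher-energy state, which is how low-energy states found at high temperature migrate toward the target distribution.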

COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management

Platelets expire within five days. Blood banks face uncertain daily demand and must balance ordering decisions between costly wastage from overstocking and life-threatening shortages from understocking. Reinforcement learning (RL) can learn effective ordering policies for this Markov decision process (MDP), but the resulting neural policies remain black boxes, hindering trust and adoption in safety-critical domains. We apply COOL-MC, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al. By constructing a policy-induced discrete-time Markov chain (which includes only the reachable states under the trained policy to reduce memory usage), we verify PCTL properties and provide feature-level explanations. Results show that the trained policy achieves a 2.9% stockout probability and a 1.1% inventory-full (potential wastage) probability within a 200-step horizon, and primarily attends to the age distribution of inventory rather than other features such as day of week or pending orders. Action reachability analysis reveals that the policy employs a diverse replenishment strategy, with most order quantities reached quickly, while several are never selected. Counterfactual analysis shows that replacing medium-large orders with smaller ones leaves both safety probabilities nearly unchanged, indicating that these orders are placed in well-buffered inventory states. This first formal verification and explanation of an RL platelet inventory management policy demonstrates COOL-MC's value for transparent, auditable decision-making in safety-critical healthcare supply chain domains.

Updated: 2026-03-02 21:17:23

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.02396v1
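
The COOL-MC abstract above reports probabilities such as "stockout within a 200-step horizon" on a policy-induced discrete-time Markov chain. A step-bounded reachability query of that kind can be sketched with a simple backward iteration. The toy chain, its state names, and its probabilities below are invented for illustration and have nothing to do with the paper's model; COOL-MC itself relies on a probabilistic model checker rather than this code.

```python
# Hypothetical three-state chain, for illustration only.
TOY_CHAIN = {
    "ok": {"ok": 0.9, "low": 0.1},
    "low": {"ok": 0.5, "low": 0.3, "stockout": 0.2},
    "stockout": {"stockout": 1.0},
}


def bounded_reachability(transitions, targets, start, horizon):
    """Probability of reaching any state in `targets` within `horizon` steps.

    transitions maps each state to a dict {successor: probability}; every
    successor must itself be a key of `transitions`.
    """
    # prob[s]: chance of hitting a target from s within the steps considered so far
    prob = {s: (1.0 if s in targets else 0.0) for s in transitions}
    for _ in range(horizon):
        prob = {
            s: 1.0 if s in targets else sum(p * prob[t] for t, p in succ.items())
            for s, succ in transitions.items()
        }
    return prob[start]
```

On the toy chain, reaching "stockout" from "ok" within two steps requires the path ok, low, stockout, so the result is 0.1 * 0.2 = 0.02. Restricting the chain to states reachable under the policy, as the abstract describes, keeps the `transitions` dictionary small.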

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while LLMs perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling. These reward models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance. Ablation studies confirm that effectiveness stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, demonstrating how human-AI curation synergy can unlock significantly higher data quality.

Updated: 2026-03-02 20:56:17

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.01352v3

Authenticated Contradictions from Desynchronized Provenance and Watermarking

Cryptographic provenance standards such as C2PA and invisible watermarking are positioned as complementary defenses for content authentication, yet the two verification layers are technically independent: neither conditions on the output of the other. This work formalizes and empirically demonstrates the Integrity Clash, a condition in which a digital asset carries a cryptographically valid C2PA manifest asserting human authorship while its pixels simultaneously carry a watermark identifying it as AI-generated, with both signals passing their respective verification checks in isolation. We construct metadata washing workflows that produce these authenticated fakes through standard editing pipelines, requiring no cryptographic compromise, only the semantic omission of a single assertion field permitted by the current C2PA specification. To close this gap, we propose a cross-layer audit protocol that jointly evaluates provenance metadata and watermark detection status, achieving 100% classification accuracy across 3,500 test images spanning four conflict-matrix states and three realistic perturbation conditions. Our results demonstrate that the gap between these verification layers is unnecessary and technically straightforward to close.

Updated: 2026-03-02 20:42:12

Categories: cs.CR,cs.CV,cs.MM,eess.IV

Download: http://arxiv.org/abs/2603.02378v1

The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagram

We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that infinite-depth ResNets behave as if they were infinitely wide, regardless of their actual width. More precisely, we show that, with a fixed embedding dimension $D$, the training dynamics converge to a unique Neural Mean ODE training dynamics as the depth $L$ diverges, regardless of the scaling of the hidden width $M$. For a residual scale $\Theta_D\big(\frac{\alpha}{LM}\big)$ with $\alpha=\Theta_D(1)$, we obtain the error bound $O_D\big(\frac{1}{L}+ \frac{1}{\sqrt{LM}}\big)$ between the model's output and its limit after a fixed number of gradient steps. In this regime, the limit exhibits maximal local feature updates, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $\alpha\to \infty$ yields a lazy ODE regime where the Mean ODE is linearly parameterized, and we derive a convergence rate in this case as well. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension $D$. We identify the residual scale $O\big(\frac{\sqrt{D}}{LM}\big)$ as necessary and sufficient for maximal local feature updates. In this regime, we prove a high-probability error bound $O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps. Our convergence results rely on a novel mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward pass through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics. We verify empirically that all our rates are tight.

Updated: 2026-03-02 20:36:51

Categories: cs.LG

Download: http://arxiv.org/abs/2509.10167v2

CUCo: An Agentic Framework for Compute and Communication Co-design

Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks new optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.

Updated: 2026-03-02 20:35:50

Categories: cs.DC,cs.AR,cs.LG,cs.MA

Download: http://arxiv.org/abs/2603.02376v1

Strictly Constrained Generative Modeling via Split Augmented Langevin Sampling

Deep generative models hold great promise for representing complex physical systems, but their deployment is currently limited by the lack of guarantees on the physical plausibility of the generated outputs. Ensuring that known physical constraints are enforced is therefore critical when applying generative models to scientific and engineering problems. We address this limitation by developing a principled framework for sampling from a target distribution while rigorously satisfying mathematical constraints. Leveraging the variational formulation of Langevin dynamics and Lagrangian duality, we propose Constrained Alternated Split Augmented Langevin (CASAL), a novel primal-dual sampling algorithm that enforces constraints progressively through variable splitting. We analyze our algorithm in Wasserstein space and derive explicit mixing time rates. While the method is developed theoretically for Langevin dynamics, we demonstrate its applicability to diffusion models. We apply our method to diffusion-based data assimilation on a complex physical system, where enforcing physical constraints substantially improves both forecast accuracy and the preservation of critical conserved quantities. We also demonstrate the potential of CASAL for challenging non-convex feasibility problems in optimal control.

Updated: 2026-03-02 20:34:40

Categories: cs.LG

Download: http://arxiv.org/abs/2505.18017v3

A Global Optimization Algorithm for K-Center Clustering of One Billion Samples

This paper presents a practical global optimization algorithm for the K-center clustering problem, which aims to select K samples as the cluster centers to minimize the maximum within-cluster distance. This algorithm is based on a reduced-space branch and bound scheme and guarantees convergence to the global optimum in a finite number of steps by only branching on the regions of centers. To improve efficiency, we have designed a two-stage decomposable lower bound, the solution of which can be derived in a closed form. In addition, we also propose several acceleration techniques to narrow down the region of centers, including bounds tightening, sample reduction, and parallelization. Extensive studies on synthetic and real-world datasets have demonstrated that our algorithm can solve K-center problems to global optimality within 4 hours for ten million samples in serial mode and one billion samples in parallel mode. Moreover, compared with state-of-the-art heuristic methods, the global optimum obtained by our algorithm reduces the objective function by 25.8% on average across all the synthetic and real-world datasets.

Updated: 2026-03-02 20:32:54

Categories: math.OC,cs.LG

Download: http://arxiv.org/abs/2301.00061v4
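
To make the K-center objective from the abstract above concrete, the sketch below evaluates the maximum within-cluster distance for a given set of centers and pairs it with the classical farthest-first greedy heuristic. The heuristic stands in for the kind of baseline a global method is compared against; it is not the paper's branch and bound algorithm, and the function names are assumptions.

```python
import math

def k_center_objective(points, centers):
    """Largest distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


def farthest_first_centers(points, k):
    """Greedy heuristic: repeatedly add the point farthest from current centers.

    Starts from points[0]; gives a feasible solution, not a global optimum.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers
```

For example, `k_center_objective(pts, farthest_first_centers(pts, k))` gives an upper bound on the optimal objective, the sort of bound a branch and bound scheme can use for pruning.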

StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams demands robust online methods that recover scene dynamics from sparse observations under strict latency and memory constraints. Yet most dynamic reconstruction methods rely on hours of per-scene optimization under full-sequence access, limiting practical deployment. In this work, we introduce StreamSplat, a fully feed-forward framework that instantly transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner. This is achieved via three key technical innovations: 1) a probabilistic sampling mechanism that robustly predicts 3D Gaussians from uncalibrated inputs; 2) a bidirectional deformation field that yields reliable associations across frames and mitigates long-term error accumulation; 3) an adaptive Gaussian fusion operation that propagates persistent Gaussians while handling emerging and vanishing ones. Extensive experiments on standard dynamic and static benchmarks demonstrate that StreamSplat achieves state-of-the-art reconstruction quality and dynamic scene modeling. Uniquely, our method supports the online reconstruction of arbitrarily long video streams with a 1200x speedup over optimization-based methods. Our code and models are available at https://streamsplat3d.github.io/.

Updated: 2026-03-02 20:31:49

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2506.08862v2

Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport

The Entropic Optimal Transport (EOT) problem and its dynamic counterpart, the Schrödinger bridge (SB) problem, play an important role in modern machine learning, linking generative modeling with optimal transport theory. While recent advances in discrete diffusion and flow models have sparked growing interest in applying SB methods to discrete domains, there remains no reliable way to assess how well these methods actually solve the underlying problem. We address this challenge by introducing a benchmark for SB on discrete spaces. Our construction yields pairs of probability distributions with analytically known SB solutions, enabling rigorous evaluation. As a byproduct of building this benchmark, we obtain two new SB algorithms, DLightSB and DLightSB-M, and additionally extend prior related work to construct the $α$-CSBM algorithm. We demonstrate the utility of our benchmark by evaluating both existing and new solvers in high-dimensional discrete settings. This work provides the first step toward proper evaluation of SB methods on discrete spaces, paving the way for more reproducible future studies. The code for the benchmark and all associated experiments is available at https://github.com/gregkseno/catsbench.

Updated: 2026-03-02 20:28:00

Categories: cs.LG

Download: http://arxiv.org/abs/2509.23348v2

Maris: A Formally Verifiable Privacy Policy Enforcement Paradigm for Multi-Agent Collaboration Systems

Multi-agent collaboration systems (MACS), powered by large language models (LLMs), solve complex problems efficiently by leveraging each agent's specialization and communication between agents. However, the inherent exchange of information between agents and their interaction with external environments, such as LLMs, tools, and users, inevitably introduces significant risks of sensitive data leakage, including vulnerabilities to attacks such as eavesdropping and prompt injection. Existing MACS lack fine-grained data protection controls, making it challenging to manage sensitive information securely. In this paper, we take the first step toward mitigating the data leakage threat in MACS through a privacy-enhanced MACS development paradigm, Maris. Maris enables rigorous message flow control within MACS by embedding reference monitors into key multi-agent conversation components. We implemented Maris as an integral part of the widely adopted open-source multi-agent development frameworks AutoGen and LangChain. To evaluate its effectiveness, we develop a Privacy Assessment Framework that emulates MACS under different threat scenarios. Our evaluation shows that Maris effectively mitigated sensitive data leakage threats across three different task suites while maintaining a high task success rate.

Updated: 2026-03-02 20:22:54

Categories: cs.CR

Download: http://arxiv.org/abs/2505.04799v3

RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks

We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.

Updated: 2026-03-02 20:14:31

Categories: cs.CL,cs.LG,cs.SD

Download: http://arxiv.org/abs/2603.02368v1
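
The RO-N3WS abstract above reports results as word error rate (WER). As a reminder of how that metric is usually computed, here is a minimal sketch: word-level edit distance divided by the number of reference words. Whitespace tokenisation and the absence of text normalisation are simplifying assumptions; evaluation pipelines typically normalise case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.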

PlayWrite: A Multimodal System for AI Supported Narrative Co-Authoring Through Play in XR

Current AI writing tools, which rely on text prompts, poorly support the spatial and interactive nature of storytelling where ideas emerge from direct manipulation and play. We present PlayWrite, a mixed-reality system where users author stories by directly manipulating virtual characters and props. A multi-agent AI pipeline interprets these actions into Intent Frames: structured narrative beats visualized as rearrangeable story marbles on a timeline. A large language model then transforms the user's assembled sequence into a final narrative. A user study (N=13) with writers from varying domains found that PlayWrite fosters a highly improvisational and playful process. Users treated the AI as a collaborative partner, using its unexpected responses to spark new ideas and overcome creative blocks. PlayWrite demonstrates an approach for co-creative systems that move beyond text to embrace direct manipulation and play as core interaction modalities.

Updated: 2026-03-02 20:11:44

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2603.02366v1

Can machines be uncertain?

The paper investigates whether and how AI systems can realize states of uncertainty. By adopting a functionalist and behavioral perspective, it examines how symbolic, connectionist and hybrid architectures make room for uncertainty. The paper distinguishes between epistemic uncertainty, or uncertainty inherent in the data or information, and subjective uncertainty, or the system's own attitude of being uncertain. It further distinguishes between distributed and discrete realizations of subjective uncertainty. A key contribution is the idea that some states of uncertainty are interrogative attitudes whose content is a question rather than a proposition.

Updated: 2026-03-02 20:11:08

Categories: cs.AI

Download: http://arxiv.org/abs/2603.02365v1

RNE: plug-and-play diffusion inference-time control and energy-based training

Diffusion models generate data by removing noise gradually, which corresponds to the time-reversal of a noising process. However, access to only the denoising kernels is often insufficient. In many applications, we need the knowledge of the marginal densities along the generation trajectory, which enables tasks such as inference-time control. To address this gap, in this paper, we introduce the Radon-Nikodym Estimator (RNE). Based on the concept of the density ratio between path distributions, it reveals a fundamental connection between marginal densities and transition kernels, providing a flexible plug-and-play framework that unifies (1) diffusion density estimation, (2) inference-time control, and (3) energy-based diffusion training under a single perspective. Experiments demonstrate that RNE delivers strong results in inference-time control applications, such as annealing and model composition, with promising inference-time scaling performance, and achieves a simple yet efficient regularisation for training energy-based diffusion models. Additionally, our proposed RNE is modality-agnostic and applicable not only to continuous diffusion models but also to their discrete diffusion counterparts.

Updated: 2026-03-02 20:05:27

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2506.05668v5

Contextual Drag: How Errors in the Context Affect LLM Reasoning

Central to many self-improvement pipelines for large language models (LLMs) is the assumption that models can improve by reflecting on past mistakes. We study a phenomenon termed contextual drag: the presence of failed attempts in the context biases subsequent generations toward structurally similar errors. Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration. Structural analysis using tree edit distance reveals that subsequent reasoning trajectories inherit structurally similar error patterns from the context. We demonstrate that neither external feedback nor successful self-verification suffices to eliminate this effect. While mitigation strategies such as fallback-behavior fine-tuning and context denoising yield partial improvements, they fail to fully restore baseline performance, positioning contextual drag as a persistent failure mode in current reasoning architectures.

Updated: 2026-03-02 20:04:02

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2602.04288v2

Estimating Visual Attribute Effects in Advertising from Observational Data: A Deepfake-Informed Double Machine Learning Approach

Digital advertising increasingly relies on visual content, yet marketers lack rigorous methods for understanding how specific visual attributes causally affect consumer engagement. This paper addresses a fundamental methodological challenge: estimating causal effects when the treatment, such as a model's skin tone, is an attribute embedded within the image itself. Standard approaches like Double Machine Learning (DML) fail in this setting because vision encoders entangle treatment information with confounding variables, producing severely biased estimates. We develop DICE-DML (Deepfake-Informed Control Encoder for Double Machine Learning), a framework that leverages generative AI to disentangle treatment from confounders. The approach combines three mechanisms: (1) deepfake-generated image pairs that isolate treatment variation; (2) DICE-Diff adversarial learning on paired difference vectors, where background signals cancel to reveal pure treatment fingerprints; and (3) orthogonal projection that geometrically removes treatment-axis components. In simulations with known ground truth, DICE-DML reduces root mean squared error by 73-97% compared to standard DML, with the strongest improvement (97.5%) at the null effect point, demonstrating robust Type I error control. Applying DICE-DML to 232,089 Instagram influencer posts, we estimate the causal effect of skin tone on engagement. Standard DML produces diagnostically invalid results (negative outcome R^2), while DICE-DML achieves valid confounding control (R^2 = 0.63) and estimates a marginally significant negative effect of darker skin tone (-522 likes; p = 0.062), substantially smaller than the biased standard estimate. Our framework provides a principled approach for causal inference with visual data when treatments and confounders coexist within images.
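
The third mechanism in the abstract above, orthogonal projection that removes treatment-axis components, can be illustrated with a minimal sketch. It assumes a single known treatment direction and plain Python lists as embeddings; the function name `remove_axis` and the example vectors are hypothetical and are not taken from the DICE-DML implementation.

```python
def remove_axis(embedding, axis):
    """Project `embedding` onto the orthogonal complement of `axis`.

    Hypothetical helper: `axis` stands in for an estimated treatment
    direction; the paper's actual procedure may differ.
    """
    norm_sq = sum(a * a for a in axis)
    if norm_sq == 0.0:
        raise ValueError("axis must be non-zero")
    # Coefficient of the embedding along the axis.
    coeff = sum(e * a for e, a in zip(embedding, axis)) / norm_sq
    # Subtract that component, leaving only the orthogonal part.
    return [e - coeff * a for e, a in zip(embedding, axis)]


if __name__ == "__main__":
    control = remove_axis([1.0, 2.0, 3.0], [1.0, 1.0, 0.0])
    print(control)
```

After the projection, the inner product of the result with the axis is zero (up to floating-point error), so a downstream model receiving this vector has no linear access to the removed direction.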

Updated: 2026-03-02 20:00:38

Categories: cs.AI,econ.EM

Download: http://arxiv.org/abs/2603.02359v1

Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.

Updated: 2026-03-02 19:57:53

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.22445v3

Learning Optimal Search Strategies

We explore the question of how to learn an optimal search strategy within the example of a parking problem where parking opportunities arrive according to an unknown inhomogeneous Poisson process. The optimal policy is a threshold-type stopping rule characterized by an indifference position. We propose an algorithm that learns this threshold by estimating the integrated jump intensity rather than the intensity function itself. We show that our algorithm achieves a logarithmic regret growth, uniformly over a broad class of environments. Moreover, we prove a logarithmic minimax regret lower bound, establishing the growth optimality of the proposed approach.

Updated: 2026-03-02 19:56:54

Categories: cs.LG,math.PR

Download: http://arxiv.org/abs/2603.02356v1

Maude-HCS: Model Checking the Undetectability-Performance Tradeoffs of Hidden Communication Systems

Hidden communication systems (HCS) embed covert messages within ordinary network activity to hide the presence of communication. In practice, the undetectability of an HCS is typically evaluated using ad hoc traffic statistics or specific detectors, making security claims tightly coupled to experimental setups and implicit adversarial assumptions. In this work, we formalize undetectability as the statistical indistinguishability of observable execution traces under two deployments: a baseline system without hidden communication and an HCS deployment carrying covert traffic. Undetectability is expressed as a bound on a quantitative measure of distance between the trace distributions induced by these two executions. We develop Maude-HCS, an executable modeling and analysis framework that provides a principled and executable foundation for reasoning about undetectability-performance tradeoffs in complex HCS designs. Maude-HCS allows designers to specify protocol behavior, adversary observables, and environmental assumptions, and to generate Monte Carlo samples from the induced trace distributions. We show that Maude-HCS can be used to audit claims of undetectability by estimating the true and false positive rates of a statistical test and converting these estimates into lower bounds on undetectability measures such as KL divergence. This enables systematic evaluation of detectability and its tradeoffs with performance under explicitly stated modeling assumptions. Finally, we evaluate Maude-HCS on tunneling-based HCS instantiations and validate model predictions against measurements from a physical testbed. For passive adversaries observing timing and traffic statistics, we quantify how undetectability and performance vary with protocol configuration, background traffic, and network loss, and demonstrate strong semantic alignment between model-based guarantees and empirical results.
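
The step of converting true and false positive rates into a lower bound on KL divergence can be sketched as follows. The sketch relies on the data-processing inequality: a binary detector is a function of the trace, so the KL divergence between the two trace distributions is at least the KL divergence between the detector's two Bernoulli outputs. The function names are hypothetical, and the rates are treated as exact; with rates estimated from Monte Carlo samples, confidence intervals would be needed before calling the result a bound.

```python
import math


def binary_kl(p: float, q: float) -> float:
    """KL divergence in nats between Bernoulli(p) and Bernoulli(q)."""
    def term(a: float, b: float) -> float:
        if a == 0.0:
            return 0.0          # 0 * log(0 / b) is taken as 0
        if b == 0.0:
            return math.inf     # mass where the reference has none
        return a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def kl_lower_bound(tpr: float, fpr: float) -> float:
    """Lower bound on KL(HCS traces || baseline traces) from one detector.

    tpr: probability the detector fires on HCS traces.
    fpr: probability the detector fires on baseline traces.
    """
    return binary_kl(tpr, fpr)


if __name__ == "__main__":
    print(kl_lower_bound(0.9, 0.1))
```

A detector that does no better than chance (equal rates) yields a bound of zero, which says nothing about undetectability; a strong detector yields a large bound and therefore refutes a claimed small divergence.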

Updated: 2026-03-02 19:56:38

Categories: cs.CR

Download: http://arxiv.org/abs/2603.03369v1

AI-Generated Music Detection in Broadcast Monitoring

AI music generators have advanced to the point where their outputs are often indistinguishable from human compositions. While detection methods have emerged, they are typically designed and validated in music streaming contexts with clean, full-length tracks. Broadcast audio, however, poses a different challenge: music appears as short excerpts, often masked by dominant speech, conditions under which existing detectors fail. In this work, we introduce AI-OpenBMAT, the first dataset tailored to broadcast-style AI-music detection. It contains 3,294 one-minute audio excerpts (54.9 hours) that follow the duration patterns and loudness relations of real television audio, combining human-made production music with stylistically matched continuations generated with Suno v3.5. We benchmark a CNN baseline and state-of-the-art SpectTTTra models to assess SNR and duration robustness, and evaluate on a full broadcast scenario. Across all settings, models that excel in streaming scenarios suffer substantial degradation, with F1-scores dropping below 60% when music is in the background or has a short duration. These results highlight speech masking and short music length as critical open challenges for AI music detection, and position AI-OpenBMAT as a benchmark for developing detectors capable of meeting industrial broadcast requirements.

Updated: 2026-03-02 19:51:39

Categories: cs.SD,cs.AI,eess.AS,eess.SP

Download: http://arxiv.org/abs/2602.06823v2

Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective

Machine unlearning--the ability to remove designated concepts from a pre-trained model--has advanced rapidly, particularly for text-to-image diffusion models. However, existing methods typically assume that unlearning requests arrive all at once, whereas in practice they often arrive sequentially. We present the first systematic study of continual unlearning in text-to-image diffusion models and show that popular unlearning methods suffer from rapid utility collapse: after only a few requests, models forget retained knowledge and generate degraded images. We trace this failure to cumulative parameter drift from the pre-training weights and argue that regularization is crucial to addressing it. To this end, we study a suite of add-on regularizers that (1) mitigate drift and (2) remain compatible with existing unlearning methods. Beyond generic regularizers, we show that semantic awareness is essential for preserving concepts close to the unlearning target, and propose a gradient-projection method that constrains parameter drift orthogonal to their subspace. This substantially improves continual unlearning performance and is complementary to other regularizers for further gains. Taken together, our study establishes continual unlearning as a fundamental challenge in text-to-image generation and provides insights, baselines, and open directions for advancing safe and accountable generative AI.

Updated: 2026-03-02 19:47:51

Categories: cs.LG

Download: http://arxiv.org/abs/2511.07970v2

A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design

We study reserve price optimization in multi-phase second-price auctions, where the seller's prior actions affect the bidders' later valuations through a Markov Decision Process (MDP). Compared with the bandit setting in existing work, our setting involves three challenges. First, from the seller's perspective, we need to efficiently explore the environment in the presence of potentially untruthful bidders who aim to manipulate the seller's policy. Second, we want to minimize the seller's revenue regret when the market noise distribution is unknown. Third, the seller's per-step revenue is an unknown, nonlinear random variable that cannot be observed directly from the environment; only its realized values are available. We propose a mechanism addressing all three challenges. To address the first challenge, we combine a new technique named "buffer periods" with ideas from Reinforcement Learning (RL) with low switching cost to limit bidders' surplus from untruthful bidding, thereby incentivizing approximately truthful bidding. The second challenge is tackled by a novel algorithm that removes the need for pure exploration when the market noise distribution is unknown. The third challenge is resolved by an extension of LSVI-UCB, where we use the auction's underlying structure to control the uncertainty of the revenue function. The three techniques culminate in the Contextual-LSVI-UCB-Buffer (CLUB) algorithm, which achieves $\tilde{O}(H^{5/2}\sqrt{K})$ revenue regret when the market noise is known and $\tilde{O}(H^{3}\sqrt{K})$ revenue regret when the noise is unknown, with no assumptions on bidders' truthfulness, where $K$ is the number of episodes and $H$ is the length of each episode.

Updated: 2026-03-02 19:47:43

Categories: cs.LG,cs.GT,stat.ML

Download: http://arxiv.org/abs/2210.10278v2

Learning graph topology from metapopulation epidemic encoder-decoder

Metapopulation epidemic models are a valuable tool for studying large-scale outbreaks. With the limited availability of epidemic tracing data, it is challenging to infer the essential constituents of these models, namely, the epidemic parameters and the relevant mobility network between subpopulations. Either one of these constituents can be estimated while assuming the other; however, the problem of their joint inference has not yet been solved. Here, we propose two encoder-decoder deep learning architectures that infer metapopulation mobility graphs from time-series data, with and without the assumption of epidemic model parameters. Evaluation across diverse random and empirical mobility networks shows that the proposed approach outperforms state-of-the-art topology inference methods. Further, we show that topology inference improves dramatically with data on additional pathogens. Our study establishes a robust framework for simultaneously inferring epidemic parameters and topology, addressing a persistent gap in modeling disease propagation.

Updated: 2026-03-02 19:46:19

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02349v1

ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans, with recent work extending this idea to visual domains using Vision-Language Models (VLMs). However, a rigorous comparison with methods that plan directly with VLMs is missing, due to a lack of visual benchmarks that support symbolic planning. We present ViPlan, the first open-source benchmark for comparing VLM-grounded symbolic approaches (VLM-as-grounder) with direct VLM planning methods (VLM-as-planner). ViPlan introduces a series of increasingly challenging tasks in two visual domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We find VLM-as-grounder methods to outperform direct VLM planning in Blocksworld (solving 46% of the tasks against 9%), where image grounding is both crucial and accurate. However, in the household robotics tasks, where linguistic knowledge helps, VLM-as-planner methods are greatly superior to VLM-as-grounder approaches (solving 34% of the tasks against 5%), which are hindered by partial observability. Thus, ViPlan domains capture fundamental shortcomings of both planning approaches, which we further diagnose with a qualitative failure analysis. Finally, across methods, we observe no consistent benefit from Chain-of-Thought prompting, suggesting persistent limitations in current VLMs' visual reasoning abilities.

Updated: 2026-03-02 19:40:04

Categories: cs.AI

Download: http://arxiv.org/abs/2505.13180v2

Diffusion-MPC in Discrete Domains: Feasibility Constraints, Horizon Effects, and Critic Alignment: Case study with Tetris

We study diffusion-based model predictive control (Diffusion-MPC) in discrete combinatorial domains using Tetris as a case study. Our planner samples candidate placement sequences with a MaskGIT-style discrete denoiser and selects actions via reranking. We analyze three key factors: (1) feasibility-constrained sampling via logit masking over valid placements, (2) reranking strategies using a heuristic score, a pretrained DQN critic, and a hybrid combination, and (3) compute scaling in candidate count and planning horizon. We find that feasibility masking is necessary in discrete domains, removing invalid action mass (46%) and yielding a 6.8% improvement in score and 5.6% improvement in survival over unconstrained sampling. Naive DQN reranking is systematically misaligned with rollout quality, producing high decision regret (mean 17.6, p90 36.6). Shorter planning horizons outperform longer ones under sparse and delayed rewards, suggesting uncertainty compounding in long imagined rollouts. Overall, compute choices (K, H) determine dominant failure modes: small K limits candidate quality, while larger H amplifies misranking and model mismatch. Our findings highlight structural challenges of diffusion planners in discrete environments and provide practical diagnostics for critic integration.
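
Feasibility-constrained sampling via logit masking, the first factor listed above, can be sketched in a few lines. This is a generic illustration in plain Python, not the paper's planner: the function names, the logits, and the validity mask are all hypothetical. Invalid placements have their logits set to negative infinity before the softmax, so they receive exactly zero probability.

```python
import math
import random


def masked_softmax(logits, valid):
    """Softmax over `logits`, with entries where `valid` is False forced to 0."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("no valid action to sample from")
    # Subtract the peak for numerical stability; exp(-inf) evaluates to 0.0.
    weights = [math.exp(l - peak) for l in masked]
    total = sum(weights)
    return [w / total for w in weights]


def sample_valid(logits, valid, rng=random):
    """Draw one action index, never returning an invalid one."""
    probs = masked_softmax(logits, valid)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    valid = [True, False, True]   # hypothetical mask: placement 1 is infeasible
    print(masked_softmax(logits, valid))
    print(sample_valid(logits, valid))
```

Masking before normalisation, rather than rejecting invalid samples afterwards, redistributes the invalid probability mass over the feasible placements in a single pass.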

Updated: 2026-03-02 19:35:38

Categories: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2603.02348v1

Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment

Timely and accurate assessment of cognitive impairment remains a major unmet need. Speech biomarkers offer a scalable, non-invasive, cost-effective solution for automated screening. However, the clinical utility of machine learning (ML) remains limited by interpretability and generalisability to real-world speech datasets. We evaluate explainable ML for screening of Alzheimer's disease and related dementias (ADRD) and severity prediction using benchmark DementiaBank speech (N = 291, 64% female, 69.8 (SD = 8.6) years). We validate generalisability on pilot data collected in-residence (N = 22, 59% female, 76.2 (SD = 8.0) years). To enhance clinical utility, we stratify risk for actionable triage and assess linguistic feature importance. We show that a Random Forest trained on linguistic features for ADRD detection achieves a mean sensitivity of 69.4% (95% confidence interval (CI) = 66.4-72.5) and specificity of 83.3% (78.0-88.7). On pilot data, this model yields a mean sensitivity of 70.0% (58.0-82.0) and specificity of 52.5% (39.3-65.7). For prediction of Mini-Mental State Examination (MMSE) scores, a Random Forest Regressor achieves a mean absolute MMSE error of 3.7 (3.7-3.8), with comparable performance of 3.3 (3.1-3.5) on pilot data. Risk stratification improves specificity by 13% on the test set, offering a pathway for clinical triage. Linguistic features associated with ADRD include increased use of pronouns and adverbs, greater disfluency, reduced analytical thinking, lower lexical diversity, and fewer words that reflect a psychological state of completion. Our predictive modelling shows promise for integration with conversational technology at home to monitor cognitive health and triage higher-risk individuals, enabling early screening and intervention.
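
For readers less familiar with the screening metrics reported above, sensitivity and specificity can be computed from binary labels as in the following sketch. The labels are made up for illustration (1 = ADRD, 0 = control) and are unrelated to the DementiaBank or pilot data; the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
    print(sensitivity_specificity(y_true, y_pred))
```

In a triage setting, low specificity means many healthy individuals are referred onward, which is why the abstract reports the drop in specificity on pilot data separately from sensitivity.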

Updated: 2026-03-02 19:33:35

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2501.18731v2

Covering Numbers for Deep ReLU Networks with Applications to Function Approximation and Nonparametric Regression

Covering numbers of (deep) ReLU networks have been used to characterize approximation-theoretic performance, to upper-bound prediction error in nonparametric regression, and to quantify classification capacity. These results rely on covering number upper bounds obtained via explicit constructions of coverings. Lower bounds on covering numbers do not appear to be available in the literature. The present paper fills this gap by deriving tight (up to multiplicative constants) lower and upper bounds on the metric entropy (i.e., the logarithm of the covering numbers) of fully connected networks with bounded weights, sparse networks with bounded weights, and fully connected networks with quantized weights. The tightness of these bounds yields a fundamental understanding of the impact of sparsity, quantization, bounded versus unbounded weights, and network output truncation. Moreover, the bounds allow one to characterize fundamental limits of neural network transformation, including network compression, and lead to sharp upper bounds on the prediction error in nonparametric regression through deep networks. In particular, we remove a $\log^6(n)$-factor from the best known sample complexity rate for estimating Lipschitz functions via deep networks, thereby establishing optimality. Finally, we identify a systematic relation between optimal nonparametric regression and optimal approximation through deep networks, unifying numerous results in the literature and revealing underlying general principles.

Updated: 2026-03-02 19:30:39

Categories: stat.ML,cs.AI,cs.IT,cs.LG

Download: http://arxiv.org/abs/2410.06378v2

LLM Probability Concentration: How Alignment Shrinks the Generative Horizon

Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this consistency in generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the *Branching Factor* (BF) -- a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate; (2) alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by a factor of 2-5 overall, and by up to an order of magnitude (e.g., from 12 to 1.2) at the beginning positions. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find that this consistency has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect: by generating longer reasoning chains, they push generation into later, more deterministic (lower-BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs, clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
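
One common way to measure an "effective number of plausible next steps" is the exponential of the Shannon entropy of the next-token distribution. The sketch below uses that definition as an assumption: the abstract does not state the formula, and the paper's BF may be aggregated over positions or samples differently. The function name and example distributions are hypothetical.

```python
import math


def branching_factor(probs):
    """exp(Shannon entropy) of one next-token distribution.

    Assumed definition for illustration only. `probs` must sum to 1;
    zero-probability tokens contribute nothing to the entropy.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


if __name__ == "__main__":
    print(branching_factor([0.25, 0.25, 0.25, 0.25]))  # spread over four tokens
    print(branching_factor([0.97, 0.01, 0.01, 0.01]))  # concentrated on one token
```

Under this definition a uniform distribution over k tokens has a branching factor of k and a deterministic distribution has a branching factor of 1, which matches the reading of BF as a count of effective choices.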

Updated: 2026-03-02 19:30:01

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.17871v3

Large Electron Model: A Universal Ground State Predictor

We introduce Large Electron Model, a single neural network model that produces variational wavefunctions of interacting electrons over the entire Hamiltonian parameter manifold. Our model employs the Fermi Sets architecture, a universal representation of many-body fermionic wavefunctions, which is further conditioned on Hamiltonian parameter and particle number. On interacting electrons in a two-dimensional harmonic potential, a single trained model accurately predicts the ground state wavefunction while generalizing across unseen coupling strengths and particle-number sectors, producing both accurate real-space charge densities and ground state energies, even up to $50$ particles. Our results establish a foundation model method for material discovery that is grounded in the variational principle, while accurately treating strong electron correlation beyond the capacity of density functional theory.

Updated: 2026-03-02 19:29:12

Categories: cond-mat.str-el,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.02346v1

RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection

Infrastructure as code (IaC) tools automate cloud provisioning, but verifying that deployed systems remain consistent with the IaC specifications remains challenging. Such configuration drift occurs because of bugs in the IaC specification, manual changes, or system updates. Large language model (LLM)-based agentic AI systems can automate the analysis of large volumes of telemetry data, making them suitable for detecting configuration drift. However, existing agentic systems implicitly assume that the tools they invoke always return correct outputs, making them vulnerable to erroneous tool responses. Since agents cannot distinguish whether an anomalous tool output reflects a real infrastructure problem or a broken tool, such errors may cause missed drift or false alarms, reducing reliability precisely when it is most needed. We introduce RIVA (Robust Infrastructure by Verification Agents), a novel multi-agent system that performs robust IaC verification even when tools produce incorrect or misleading outputs. RIVA employs two specialized agents, a verifier agent and a tool generation agent, that collaborate through iterative cross-validation, multi-perspective verification, and tool call history tracking. Evaluation on the AIOpsLab benchmark demonstrates that, in the presence of erroneous tool responses, RIVA raises task accuracy from 27.3% with a baseline ReAct agent to 50.0% on average. RIVA also improves task accuracy from 28% to 43.8% without erroneous tool responses. Our results show that cross-validation of diverse tool calls enables more reliable autonomous infrastructure verification in production cloud environments.

Updated: 2026-03-02 19:28:27

Categories: cs.SE,cs.AI,cs.MA

Download: http://arxiv.org/abs/2603.02345v1

ReDON: Recurrent Diffractive Optical Neural Processor with Reconfigurable Self-Modulated Nonlinearity

Diffractive optical neural networks (DONNs) have demonstrated unparalleled energy efficiency and parallelism by processing information directly in the optical domain. However, their computational expressivity is constrained by static, passive diffractive phase masks that lack efficient nonlinear responses and reprogrammability. To address these limitations, we introduce the Recurrent Diffractive Optical Neural Processor (ReDON), a novel architecture featuring reconfigurable, recurrent self-modulated nonlinearity. This mechanism enables dynamic, input-dependent optical transmission through in-situ electro-optic self-modulation, providing a highly efficient and reprogrammable approach to optical computation. Inspired by the gated linear unit (GLU) used in large language models, ReDON senses a fraction of the propagating optical field and modulates its phase or intensity via a lightweight parametric function, enabling effective nonlinearity with minimal inference overhead. As a non-von Neumann architecture in which the primary weighting elements (metasurfaces) remain fixed, ReDON substantially extends the nonlinear representational capacity and task adaptability of conventional DONNs through recurrent optical hardware reuse and dynamically tunable nonlinearity. We systematically investigate various self-modulation configurations to characterize the trade-offs between hardware efficiency and computational expressivity. On image recognition and segmentation benchmarks, ReDON improves test accuracy and mean intersection-over-union (mIoU) by up to 20% compared with prior DONNs employing either optical or digital nonlinearities at comparable model complexity and negligible additional power consumption. This work establishes a new paradigm for reconfigurable nonlinear optical computing, uniting recurrence and self-modulation within non-von Neumann analog processors.

Updated: 2026-03-02 19:28:21

Categories: physics.optics,cs.AI,cs.ET

Download: http://arxiv.org/abs/2602.23616v2

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains that are not attainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 6.4-14.2% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.1-5.4%, while delivering an average 2.5x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.

Updated: 2026-03-02 19:24:02

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.03215v2

Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme

Learning population dynamics involves recovering the underlying process that governs particle evolution, given evolutionary snapshots of samples at discrete time points. Recent methods frame this as an energy minimization problem in probability space and leverage the celebrated JKO scheme for efficient time discretization. In this work, we introduce $\texttt{iJKOnet}$, an approach that combines the JKO framework with inverse optimization techniques to learn population dynamics. Our method relies on a conventional $\textit{end-to-end}$ adversarial training procedure and does not require restrictive architectural choices, e.g., input-convex neural networks. We establish theoretical guarantees for our methodology and demonstrate improved performance over prior JKO-based methods. The code of $\texttt{iJKOnet}$ is available at https://github.com/MuXauJl11110/iJKOnet.

Updated: 2026-03-02 19:23:44

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2506.01502v3

Distributional value gradients for stochastic environments

Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning on continuous state-action spaces to model not only the distribution over scalar state-action value functions but also over their gradients. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one-step world model of reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE). The proposed framework is sample-based and employs Max-sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator. We prove that the Sobolev-augmented Bellman operator is a contraction with a unique fixed point, and highlight a fundamental smoothness trade-off underlying contraction in gradient-aware RL. To validate our method, we first showcase its effectiveness on a simple stochastic reinforcement learning toy problem, then benchmark its performance on several MuJoCo environments.

Updated: 2026-03-02 19:16:23

Categories: cs.LG

Download: http://arxiv.org/abs/2601.20071v3

Improving Classifier-Free Guidance in Masked Diffusion: Low-Dim Theoretical Insights with High-Dim Impact

Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and its extensions to discrete diffusion have recently started to be investigated. To improve the algorithms in a principled way, this paper starts by analyzing the exact effect of CFG in the context of a low-dimensional masked diffusion model, with a special emphasis on the guidance schedule. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance improves it. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of the current CFG implementations. These implementations can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism. Intuitively, our method smooths the transport between the data distribution and the initial (masked) distribution, resulting in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. Experiments on conditional image and text generation empirically confirm the efficacy of our method.
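
The guidance-schedule finding can be illustrated with a minimal sketch. It uses the standard CFG combination of conditional and unconditional logits, not the paper's proposed mechanism, which the abstract does not specify. The linear schedule, the `w_max` value, and all function names are assumptions made for illustration only.

```python
import math

def guidance_weight(masked_fraction, w_max=3.0):
    """Hypothetical linear schedule: no extra guidance while fully masked,
    rising to w_max once nothing is masked (late in sampling)."""
    if not 0.0 <= masked_fraction <= 1.0:
        raise ValueError("masked_fraction must lie in [0, 1]")
    return 1.0 + (w_max - 1.0) * (1.0 - masked_fraction)

def cfg_logits(cond_logits, uncond_logits, w):
    """Standard CFG: w = 0 is unconditional, w = 1 is conditional, w > 1 extrapolates."""
    return [u + w * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Early step (90% masked) uses weak guidance; late step (10% masked) uses strong guidance.
cond, uncond = [2.0, 0.0], [1.0, 0.0]
early = softmax(cfg_logits(cond, uncond, guidance_weight(0.9)))
late = softmax(cfg_logits(cond, uncond, guidance_weight(0.1)))
```

With this schedule the late-step distribution is sharper toward the conditioned token than the early-step one, which matches the direction of the reported finding.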

Updated: 2026-03-02 19:14:34

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2507.08965v2

Personalized Collaborative Learning with Affinity-Based Variance Reduction

Multi-agent learning faces a fundamental tension: leveraging distributed collaboration without sacrificing the personalization needed for diverse agents. This tension intensifies when aiming for full personalization while adapting to unknown heterogeneity levels -- gaining collaborative speedup when agents are similar, without performance degradation when they are different. Embracing the challenge, we propose personalized collaborative learning (PCL), a novel framework for heterogeneous agents to collaboratively learn personalized solutions with seamless adaptivity. Through carefully designed bias correction and importance correction mechanisms, our method AffPCL robustly handles both environment and objective heterogeneity. We prove that AffPCL reduces sample complexity over independent learning by a factor of $\max\{n^{-1}, δ\}$, where $n$ is the number of agents and $δ\in[0,1]$ measures their heterogeneity. This affinity-based acceleration automatically interpolates between the linear speedup of federated learning in homogeneous settings and the baseline of independent learning, without requiring prior knowledge of the system. Our analysis further reveals that an agent may obtain linear speedup even by collaborating with arbitrarily dissimilar agents, unveiling new insights into personalization and collaboration in the high heterogeneity regime.
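
The interpolation described by the factor $\max\{n^{-1}, δ\}$ can be checked with a few lines of arithmetic. This sketch reads the factor as a multiplier on the sample complexity of independent learning, which is an interpretation of the abstract; the function name is hypothetical.

```python
def sample_complexity_factor(n_agents, heterogeneity):
    """Return max(1/n, delta): the claimed multiplier relative to independent learning."""
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    if not 0.0 <= heterogeneity <= 1.0:
        raise ValueError("heterogeneity must lie in [0, 1]")
    return max(1.0 / n_agents, heterogeneity)

# Homogeneous agents (delta = 0): linear speedup, factor 1/n.
homogeneous = sample_complexity_factor(10, 0.0)
# Fully heterogeneous agents (delta = 1): no gain over independent learning.
heterogeneous = sample_complexity_factor(10, 1.0)
```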

Updated: 2026-03-02 19:13:31

Categories: stat.ML,cs.LG,cs.MA,eess.SY

Download: http://arxiv.org/abs/2510.16232v2

Bridging the Reproducibility Divide: Open Source Software's Role in Standardizing Healthcare AI

Our analysis of recent AI4H publications reveals that, despite a trend toward utilizing open datasets and sharing modeling code, 74% of AI4H papers still rely on private datasets or do not share their code. This is especially concerning in healthcare applications, where trust is essential. Furthermore, inconsistent and poorly documented data preprocessing pipelines result in variable model performance reports, even for identical tasks and datasets, making it challenging to evaluate the true effectiveness of AI models. Despite the challenges posed by the reproducibility crisis, addressing these issues through open practices offers substantial benefits. For instance, while the reproducibility mandate adds extra effort to research and publication, it significantly enhances the impact of the work. Our analysis shows that papers that used both public datasets and shared code received, on average, 110% more citations than those that did neither, more than doubling the citation count. Given the clear benefits of enhancing reproducibility, it is imperative for the AI4H community to take concrete steps to overcome existing barriers. The community should promote open science practices, establish standardized guidelines for data preprocessing, and develop robust benchmarks. Tackling these challenges through open-source development can improve reproducibility, which is essential for ensuring that AI models are safe, effective, and beneficial for patient care. This approach will help build more trustworthy AI systems that can be integrated into healthcare settings, ultimately contributing to better patient outcomes and advancing the field of medicine.

Updated: 2026-03-02 19:09:23

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2603.03367v1

Preconditioned Score and Flow Matching

Flow matching and score-based diffusion train vector fields under intermediate distributions $p_t$, whose geometry can strongly affect their optimization. We show that the covariance $Σ_t$ of $p_t$ governs optimization bias: when $Σ_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, leading to learning that plateaus at suboptimal weights. We formalize this effect in analytically tractable settings and propose reversible, label-conditional \emph{preconditioning} maps that reshape the geometry of $p_t$ by improving the conditioning of $Σ_t$ without altering the underlying generative model. Rather than accelerating early convergence, preconditioning primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions. Across MNIST latent flow matching and additional high-resolution datasets, we empirically track conditioning diagnostics and distributional metrics and show that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.
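
To make the idea of a reversible map that improves conditioning concrete, here is a minimal sketch. Per-dimension standardization stands in for the paper's preconditioning maps, which the abstract does not specify. The variance ratio used as a diagnostic equals the condition number only when the covariance is diagonal, as it is for the toy data below. All names are hypothetical.

```python
from statistics import mean, pstdev, pvariance

def fit_diagonal_preconditioner(samples):
    """Estimate a (mean, std) pair for each dimension of a list of equal-length tuples."""
    return [(mean(dim), pstdev(dim)) for dim in zip(*samples)]

def precondition(x, params):
    """Forward map: centre and rescale each coordinate."""
    return tuple((xi - m) / s for xi, (m, s) in zip(x, params))

def undo_precondition(z, params):
    """Inverse map: recover the original coordinates exactly (up to rounding)."""
    return tuple(zi * s + m for zi, (m, s) in zip(z, params))

def variance_ratio(samples):
    """Largest over smallest per-dimension variance (a simple conditioning diagnostic)."""
    variances = [pvariance(dim) for dim in zip(*samples)]
    return max(variances) / min(variances)

# Toy data: variance 100 along the first axis, 1 along the second, zero correlation.
samples = [(-10.0, -1.0), (10.0, 1.0), (-10.0, 1.0), (10.0, -1.0)]
params = fit_diagonal_preconditioner(samples)
whitened = [precondition(x, params) for x in samples]
```

A label-conditional variant would fit one set of `params` per class label; that extension is left out to keep the sketch short.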

Updated: 2026-03-02 19:09:15

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2603.02337v1

Neural Demand Estimation with Habit Formation and Rationality Constraints

We develop a flexible neural demand system for continuous budget allocation that estimates budget shares on the simplex by minimizing KL divergence. Shares are produced via a softmax of a state-dependent preference scorer and disciplined with regularity penalties (monotonicity, Slutsky symmetry) to support coherent comparative statics and welfare without imposing a parametric utility form. State dependence enters through a habit stock defined as an exponentially weighted moving average of past consumption. Simulations recover elasticities and welfare accurately and show sizable gains when habit formation is present. In our empirical application using Dominick's analgesics data, adding habit reduces out-of-sample error by about 33%, reshapes substitution patterns, and increases CV losses from a 10% ibuprofen price rise by about 15-16% relative to a static model. The code is available at https://github.com/martagrz/neural_demand_habit.
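
The three ingredients named in the abstract (a habit stock as an exponentially weighted moving average, softmax budget shares, and a KL objective) can be sketched as follows. The linear scorer, the `decay` value, and every function name are illustrative assumptions; the paper uses a neural scorer and regularity penalties that are not reproduced here.

```python
import math

def habit_stock(consumption_history, decay=0.8, initial=0.0):
    """EWMA of past consumption: h_t = decay * h_{t-1} + (1 - decay) * c_{t-1}."""
    h = initial
    for c in consumption_history:
        h = decay * h + (1.0 - decay) * c
    return h

def softmax(scores):
    """Map arbitrary scores to shares on the simplex."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_shares(base_scores, habit_weights, habits):
    """Hypothetical linear scorer: each good's score rises with its own habit stock."""
    scores = [b + w * h for b, w, h in zip(base_scores, habit_weights, habits)]
    return softmax(scores)

def kl_divergence(observed, predicted, eps=1e-12):
    """KL(observed || predicted) between two share vectors; terms with zero observed share drop out."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(observed, predicted) if p > 0)

# A good consumed in the past carries a habit stock that tilts the predicted shares.
habits = [habit_stock([1.0, 1.0, 1.0], decay=0.5), 0.0]
shares = predicted_shares([0.0, 0.0], [1.0, 1.0], habits)
loss = kl_divergence([0.5, 0.5], shares)
```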

Updated: 2026-03-02 19:01:06

Categories: econ.GN,cs.LG

Download: http://arxiv.org/abs/2603.02331v1

Partial Causal Structure Learning for Valid Selective Conformal Inference under Interventions

Selective conformal prediction can yield substantially tighter uncertainty sets when we can identify calibration examples that are exchangeable with the test example. In interventional settings, such as perturbation experiments in genomics, exchangeability often holds only within subsets of interventions that leave a target variable "unaffected" (e.g., non-descendants of an intervened node in a causal graph). We study the practical regime where this invariance structure is unknown and must be learned from data. Our contributions are: (i) a contamination-robust conformal coverage theorem that quantifies how misclassification of "unaffected" calibration examples degrades coverage via an explicit function $g(δ,n)$ of the contamination fraction and calibration set size, providing a finite-sample lower bound that holds for arbitrary contaminating distributions; (ii) a task-driven partial causal learning formulation that estimates only the binary descendant indicators $Z_{a,i}=\mathbf{1}\{i\in\mathrm{desc}(a)\}$ needed for selective calibration, rather than the full causal graph; and (iii) algorithms for descendant discovery via perturbation intersection patterns (differentially affected variable set intersections across interventions), and for approximate distance-to-intervention estimation via local invariant causal prediction. We provide recovery conditions under which contamination is controlled. Experiments on synthetic linear structural equation models (SEMs) validate the bound: under controlled contamination up to $δ=0.30$, the corrected procedure maintains $\ge 0.95$ coverage while uncorrected selective CP degrades to $0.867$. A proof-of-concept on Replogle K562 CRISPR interference (CRISPRi) perturbation data demonstrates applicability to real genomic screens.
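
The role of the descendant indicators in selective calibration can be sketched with standard split conformal prediction. The code keeps only calibration scores flagged as unaffected and takes the usual finite-sample quantile. It does not implement the paper's contamination correction $g(δ,n)$, whose form the abstract does not give. The names and toy data are assumptions.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    if n == 0:
        return math.inf
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]

def selective_quantile(calibration, alpha):
    """calibration: (score, is_descendant) pairs; calibrate on unaffected examples only."""
    kept = [score for score, is_descendant in calibration if not is_descendant]
    return conformal_quantile(kept, alpha)

# One affected example with a large nonconformity score inflates the pooled threshold.
calibration = [(0.2, False), (5.0, True), (0.4, False), (0.1, False), (0.3, False)]
pooled = conformal_quantile([s for s, _ in calibration], alpha=0.25)
selective = selective_quantile(calibration, alpha=0.25)
```

In this toy case the selective threshold is much tighter than the pooled one, which is the benefit that misclassified descendants would put at risk.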

Updated: 2026-03-02 18:58:22

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2603.02204v1

Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To maximally reduce the computational costs, ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments justify the superior performance of ADAMAB, with up to 40% accuracy improvement when training with fewer than 5 initial data samples of each class.

Updated: 2026-03-02 18:58:07

Categories: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2602.19385v2

Tool Verification for Test-Time Reinforcement Learning

Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
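
Verification-aware voting can be sketched as a weighted majority vote in which rollouts confirmed by a tool count for more. The weight value and the function name are assumptions; the abstract does not state how T^3RL sets the upweighting.

```python
from collections import defaultdict

def verification_aware_vote(answers, verified, verified_weight=2.0):
    """Return the answer with the highest total weight.

    answers: final answer of each rollout.
    verified: parallel list of booleans, True if a tool check confirmed the rollout.
    verified_weight: hypothetical multiplier for verified rollouts (1.0 = plain majority vote).
    """
    totals = defaultdict(float)
    for answer, ok in zip(answers, verified):
        totals[answer] += verified_weight if ok else 1.0
    return max(totals, key=totals.get)

# Three unverified rollouts agree on "7"; two tool-verified rollouts say "12".
answers = ["7", "7", "7", "12", "12"]
verified = [False, False, False, True, True]
plain_label = verification_aware_vote(answers, verified, verified_weight=1.0)
verified_label = verification_aware_vote(answers, verified, verified_weight=2.0)
```

Here plain majority voting would reinforce the unverified consensus, while the weighted vote selects the verified answer as the pseudo-label.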

Updated: 2026-03-02 18:57:52

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2603.02203v1

Metric Entropy-Free Sample Complexity Bounds for Sample Average Approximation in Convex Stochastic Programming

This paper studies sample average approximation (SAA) in solving convex or strongly convex stochastic programming (SP) problems. In estimating SAA's sample efficiency, the state-of-the-art sample complexity bounds entail metric entropy terms (such as the logarithm of the feasible region's covering number), which often grow polynomially with problem dimensionality. While it has been shown that metric entropy-free complexity rates are attainable under a uniform Lipschitz condition, such an assumption can be overly critical for many important SP problem settings. In response, this paper presents metric entropy-free sample complexity bounds for the SAA under standard SP assumptions, in the absence of the uniform Lipschitz condition. For a $d$-dimensional problem, the new results often lead to an $O(d)$-improvement in the complexity rate compared with the state-of-the-art. From the newly established complexity bounds, an important revelation is that SAA and the canonical stochastic mirror descent (SMD) method, two mainstream solution approaches to SP, entail almost identical rates of sample efficiency, lifting a theoretical discrepancy of SAA from SMD also by a factor of $O(d)$. Furthermore, this paper explores non-Lipschitzian scenarios where SAA maintains provable efficacy but the corresponding results for SMD remain mostly unexplored, indicating the potential of SAA's better applicability in some irregular settings. The results of our numerical experiments align with our theoretical findings.

Updated: 2026-03-02 18:57:52

Categories: math.OC,cs.LG,math.PR,math.ST

Download: http://arxiv.org/abs/2401.00664v7

Frontier Models Can Take Actions at Low Probabilities

Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without entropy or a hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models' lack of target rate calibration, especially if CoT is no longer legible.
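
A short sketch shows why an in-context entropy source such as a UUID makes low-rate calibration feasible in principle: hashing the string yields a near-uniform number that can be compared against the target rate. This illustrates the idea only. It is not the prompting protocol used in the paper, nor a claim about how the models compute their decisions. The function name is hypothetical.

```python
import hashlib
import uuid

def should_act(entropy, target_rate):
    """Deterministically map an entropy string to a decision that fires at about target_rate."""
    if not 0.0 <= target_rate <= 1.0:
        raise ValueError("target_rate must lie in [0, 1]")
    digest = hashlib.sha256(entropy.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # near-uniform integer in [0, 2**64)
    threshold = int(target_rate * 2**64)       # integer comparison avoids float rounding at 1.0
    return value < threshold

# Each episode carries a fresh UUID; the action fires in roughly 1 of 10,000 episodes.
episode_id = str(uuid.uuid4())
decision = should_act(episode_id, target_rate=0.0001)
```

Because the decision is a pure function of the entropy string, resampling with the same UUID always gives the same answer, while fresh UUIDs give independent draws.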

Updated: 2026-03-02 18:56:59

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02202v1

Adaptive Confidence Regularization for Multimodal Failure Detection

The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at https://github.com/mona4399/ACR.
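
The confidence degradation observation can be written as a simple hinge-style penalty that is positive only when the multimodal confidence falls below the best unimodal confidence. This is an illustrative stand-in, not the Adaptive Confidence Loss itself, whose exact form is not given in the abstract. Confidence is taken to be the maximum class probability, which is also an assumption.

```python
def confidence(probs):
    """Confidence of a prediction, taken here as its maximum class probability."""
    return max(probs)

def degradation_penalty(multimodal_probs, unimodal_probs_list):
    """Positive when the fused prediction is less confident than some unimodal branch."""
    best_unimodal = max(confidence(p) for p in unimodal_probs_list)
    return max(0.0, best_unimodal - confidence(multimodal_probs))

# Fused confidence 0.6 against unimodal confidences 0.9 and 0.5: degradation of about 0.3.
penalty = degradation_penalty([0.6, 0.4], [[0.9, 0.1], [0.5, 0.5]])
# Fused confidence 0.95 exceeds every branch: no penalty.
no_penalty = degradation_penalty([0.95, 0.05], [[0.9, 0.1], [0.5, 0.5]])
```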

Updated: 2026-03-02 18:56:38

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.02200v1

Conformal Policy Control

An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.

Updated: 2026-03-02 18:54:36

Categories: cs.AI,cs.LG,math.ST,stat.ML

Download: http://arxiv.org/abs/2603.02196v1

From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories

Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluated code errors, security vulnerabilities, maintainability, and development practices. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated, with the top five issues responsible for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
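The production-readiness criterion stated above (zero critical errors and no high-severity security vulnerabilities) can be written as a simple predicate. The record layout and the sample repositories below are hypothetical; how the study maps Pylint and Bandit findings onto these two counts is not described in the abstract.

```python
from dataclasses import dataclass

@dataclass
class RepoReport:
    name: str
    critical_errors: int      # assumed: count of error-level static-analysis findings
    high_severity_vulns: int  # assumed: count of high-severity security findings

def is_production_ready(report):
    """Criterion from the abstract: no critical errors and no
    high-severity security vulnerabilities."""
    return report.critical_errors == 0 and report.high_severity_vulns == 0

def readiness_rate(reports):
    """Fraction of repositories that satisfy the criterion."""
    if not reports:
        return 0.0
    ready = sum(1 for r in reports if is_production_ready(r))
    return ready / len(reports)

# Made-up reports, for illustration only.
reports = [
    RepoReport("repo_a", 0, 0),
    RepoReport("repo_b", 3, 0),
    RepoReport("repo_c", 0, 1),
    RepoReport("repo_d", 0, 0),
]
print(readiness_rate(reports))
```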

Updated: 2026-03-02 18:54:28

Categories: cs.CV,cs.LG,cs.RO,cs.SE

Download: http://arxiv.org/abs/2603.02194v1

Symbol-Equivariant Recurrent Reasoning Models

Reasoning problems such as Sudoku and ARC-AGI remain challenging for neural networks. Recurrent Reasoning Models (RRMs), a family of structured problem-solving architectures that includes the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model (TRM), offer a compact alternative to large language models, but currently handle symbol symmetries only implicitly via costly data augmentation. We introduce Symbol-Equivariant Recurrent Reasoning Models (SE-RRMs), which enforce permutation equivariance at the architectural level through symbol-equivariant layers, guaranteeing identical solutions under symbol or color permutations. SE-RRMs outperform prior RRMs on 9x9 Sudoku and generalize from training on 9x9 instances alone to smaller 4x4 and larger 16x16 and 25x25 instances, to which existing RRMs cannot extrapolate. On ARC-AGI-1 and ARC-AGI-2, SE-RRMs achieve competitive performance with substantially less data augmentation and only 2 million parameters, demonstrating that explicitly encoding symmetry improves the robustness and scalability of neural reasoning. Code is available at https://github.com/ml-jku/SE-RRM.
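Symbol equivariance means that relabeling the symbols before solving gives the same result as relabeling the solution afterwards. The toy check below illustrates only that property on a one-row puzzle; the solvers are hypothetical stand-ins and have nothing to do with the SE-RRM layers themselves.

```python
from itertools import permutations

SYMBOLS = (1, 2, 3, 4)
BLANK = 0

def fill_row(row):
    """Toy 'solver': replace the single blank with the one missing symbol."""
    missing = (set(SYMBOLS) - set(row)).pop()
    return [missing if cell == BLANK else cell for cell in row]

def biased_solver(row):
    """Counter-example: always writes symbol 1 into the blank."""
    return [1 if cell == BLANK else cell for cell in row]

def relabel(row, mapping):
    """Apply a symbol permutation; the blank marker is left untouched."""
    return [cell if cell == BLANK else mapping[cell] for cell in row]

def is_symbol_equivariant(solver, row):
    """Check solver(pi(row)) == pi(solver(row)) for every permutation pi."""
    for perm in permutations(SYMBOLS):
        mapping = dict(zip(SYMBOLS, perm))
        if solver(relabel(row, mapping)) != relabel(solver(row), mapping):
            return False
    return True

puzzle = [3, BLANK, 1, 4]
print(is_symbol_equivariant(fill_row, puzzle))
print(is_symbol_equivariant(biased_solver, puzzle))
```

A model that treats one symbol specially, like `biased_solver`, fails the check, which is the kind of behavior that data augmentation can only discourage and an equivariant architecture rules out by construction.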

Updated: 2026-03-02 18:53:55

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2603.02193v1

Branched Schrödinger Bridge Matching

Predicting the intermediate trajectories between an initial and target distribution is a central problem in generative modeling. Existing approaches, such as flow matching and Schrödinger bridge matching, effectively learn mappings between two distributions by modeling a single stochastic path. However, these methods are inherently limited to unimodal transitions and cannot capture branched or divergent evolution from a common origin to multiple distinct modes. To address this, we introduce Branched Schrödinger Bridge Matching (BranchSBM), a novel framework that learns branched Schrödinger bridges. BranchSBM parameterizes multiple time-dependent velocity fields and growth processes, enabling the representation of population-level divergence into multiple terminal distributions. We show that BranchSBM is not only more expressive but also essential for tasks involving multi-path surface navigation, modeling cell fate bifurcations from homogeneous progenitor states, and simulating diverging cellular responses to perturbations.

Updated: 2026-03-02 18:53:14

Categories: cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2506.09007v2

Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation

We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student's transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
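For readers unfamiliar with continuous-time Markov chains, the sketch below simulates one with exponential holding times over a few discrete contact events. It shows only the generic CTMC mechanics; the states, the rates, and the function name are invented for illustration, and the paper's planner, which also modulates the flow dynamics, is not reproduced here.

```python
import random

# Hypothetical transition rates (events per second) between contact states.
RATES = {
    "idle": {"touch": 2.0},
    "touch": {"grasp": 1.5, "idle": 0.5},
    "grasp": {"handoff": 1.0},
    "handoff": {},  # absorbing state
}

def simulate_ctmc(rates, start, horizon, rng):
    """Sample one event schedule [(time, state), ...] up to the time horizon."""
    t, state = 0.0, start
    events = [(t, state)]
    while True:
        outgoing = rates.get(state, {})
        total_rate = sum(outgoing.values())
        if total_rate == 0:
            break  # absorbing state reached
        t += rng.expovariate(total_rate)  # exponential holding time
        if t > horizon:
            break
        # The next state is chosen with probability proportional to its rate.
        state = rng.choices(list(outgoing), weights=list(outgoing.values()))[0]
        events.append((t, state))
    return events

events = simulate_ctmc(RATES, "idle", 10.0, random.Random(0))
for time, state in events:
    print(f"{time:.2f}s -> {state}")
```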

Updated: 2026-03-02 18:52:51

Categories: cs.CV,cs.AI,cs.GR,cs.HC,cs.LG

Download: http://arxiv.org/abs/2603.02190v1

Multi-Head Low-Rank Attention

Long-context inference in large language models is bottlenecked by Key-Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip High-Bandwidth Memory (HBM) to on-chip Static Random-Access Memory (SRAM) at each step. While Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP). Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token, consuming excessive memory traffic and diminishing TP benefits like weight sharding. In this work, we propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance, while also delivering a 2.8$\times$ decoding speedup over MLA. Code is available at https://github.com/SongtaoLiu0823/MLRA. Pretrained weights, along with the training and evaluation data, are available at https://huggingface.co/Soughing/MLRA.
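The sharding bottleneck can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes an idealized even split of the cache across devices and a made-up cache size; it does not model MLRA's actual layout, and the reported 2.8$\times$ speedup is not derived from it.

```python
def per_device_kv_bytes(total_kv_bytes, tp_degree, partitionable):
    """Bytes of KV cache one device loads per decoding step.
    Assumption: a partitionable cache splits evenly across devices."""
    if tp_degree < 1:
        raise ValueError("tp_degree must be at least 1")
    if partitionable:
        return total_kv_bytes // tp_degree
    return total_kv_bytes  # every device redundantly loads the full cache

def aggregate_traffic(total_kv_bytes, tp_degree, partitionable):
    """Total bytes loaded across all devices per decoding step."""
    return per_device_kv_bytes(total_kv_bytes, tp_degree, partitionable) * tp_degree

GIB = 1024 ** 3
cache = 8 * GIB  # hypothetical KV cache size
for partitionable in (False, True):
    per_dev = per_device_kv_bytes(cache, 4, partitionable) / GIB
    total = aggregate_traffic(cache, 4, partitionable) / GIB
    print(f"partitionable={partitionable}: {per_dev} GiB per device, {total} GiB total")
```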

Updated: 2026-03-02 18:52:38

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02188v1

MAC: A Conversion Rate Prediction Benchmark Featuring Labels Under Multiple Attribution Mechanisms

Multi-attribution learning (MAL), which enhances model performance by learning from conversion labels yielded by multiple attribution mechanisms, has emerged as a promising learning paradigm for conversion rate (CVR) prediction. However, the conversion labels in public CVR datasets are generated by a single attribution mechanism, hindering the development of MAL approaches. To address this data gap, we establish the Multi-Attribution Benchmark (MAC), the first public CVR dataset featuring labels from multiple attribution mechanisms. In addition, to promote reproducible research on MAL, we develop PyMAL, an open-source library covering a wide array of baseline methods. We conduct comprehensive experimental analyses on MAC and reveal three key insights: (1) MAL brings consistent performance gains across different attribution settings, especially for users with long conversion paths. (2) The performance gain grows with objective complexity in most settings; however, when predicting first-click conversion targets, simply adding auxiliary objectives is counterproductive, underscoring the necessity of careful selection of auxiliary objectives. (3) Two architectural design principles are paramount: first, to fully learn the multi-attribution knowledge, and second, to fully leverage this knowledge to serve the main task. Motivated by these findings, we propose Mixture of Asymmetric Experts (MoAE), an effective MAL approach incorporating multi-attribution knowledge learning and main task-centric knowledge utilization. Experiments on MAC show that MoAE substantially surpasses the existing state-of-the-art MAL method. We believe that our benchmark and insights will foster future research in the MAL field. Our MAC benchmark and the PyMAL algorithm library are publicly available at https://github.com/alimama-tech/PyMAL.
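To see why one conversion can yield several labels, the sketch below credits the same touchpoint path under three common attribution rules. Only first-click is named in the abstract; last-click and linear attribution are assumed here as typical examples, and the path and function names are hypothetical rather than taken from MAC or PyMAL.

```python
def first_click(path):
    """All credit goes to the first touchpoint on the path."""
    return {path[0]: 1.0}

def last_click(path):
    """All credit goes to the last touchpoint before conversion."""
    return {path[-1]: 1.0}

def linear(path):
    """Credit is split evenly across every touchpoint on the path."""
    credit = {}
    share = 1.0 / len(path)
    for touchpoint in path:
        credit[touchpoint] = credit.get(touchpoint, 0.0) + share
    return credit

MECHANISMS = {"first_click": first_click, "last_click": last_click, "linear": linear}

def multi_attribution_labels(path, target):
    """Credit assigned to one ad under each attribution mechanism."""
    return {name: fn(path).get(target, 0.0) for name, fn in MECHANISMS.items()}

# Hypothetical conversion path of one user.
path = ["ad_search", "ad_feed", "ad_search", "ad_video"]
labels = multi_attribution_labels(path, "ad_search")
print(labels)
```

The same click on `ad_search` is a full conversion under one rule, a partial one under another, and no conversion under the third, which is the label disagreement that MAL tries to exploit.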

Updated: 2026-03-02 18:51:01

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02184v1

Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta

The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies - greedy and uniform soup - to selectively combine diverse checkpoints into a final model. Beyond performance improvements, we analyze the ensembling effect through the lens of bias-variance decomposition. Our findings show that model soups reduces variance by stabilizing predictions across diverse model snapshots, while introducing minimal additional bias. Furthermore, using cross-entropy-based distance metrics and Multidimensional Scaling (MDS), we show that model soups selects geometrically diverse checkpoints, unlike Soft Voting, which blends redundant models centered in output space. Evaluated on the ICH-17 dataset (7,406 images across 17 classes), our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT. These results underscore that diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks.
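The two ensembling strategies can be sketched in a few lines. Weights are plain dicts of float lists here, and `evaluate` is a hypothetical held-out scoring function supplied by the caller; the toy scorer and the checkpoints are made up, and nothing below reflects the authors' training setup.

```python
def average_weights(checkpoints):
    """Uniform soup: element-wise mean of the parameter dicts."""
    n = len(checkpoints)
    return {
        key: [sum(vals) / n for vals in zip(*(c[key] for c in checkpoints))]
        for key in checkpoints[0]
    }

def greedy_soup(checkpoints, evaluate):
    """Greedy soup: visit checkpoints from best to worst held-out score and
    keep one only if adding it does not lower the soup's score."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best = evaluate(average_weights(ingredients))
    for ckpt in ranked[1:]:
        score = evaluate(average_weights(ingredients + [ckpt]))
        if score >= best:
            ingredients.append(ckpt)
            best = score
    return average_weights(ingredients)

# Toy scorer: closeness of the weights to a made-up optimum.
TARGET = [1.0, 2.0]

def evaluate(ckpt):
    return -sum((w - t) ** 2 for w, t in zip(ckpt["w"], TARGET))

checkpoints = [{"w": [0.8, 2.2]}, {"w": [1.2, 1.8]}, {"w": [3.0, 0.0]}]
uniform = average_weights(checkpoints)
greedy = greedy_soup(checkpoints, evaluate)
print(uniform, greedy)
```

In this toy case the uniform soup is pulled away from the optimum by the poor third checkpoint, while the greedy soup rejects it. Either way the result is a single set of weights, so inference cost is unchanged.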

Updated: 2026-03-02 18:50:15

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.02181v1

SDN-SYN PoW: Intent-Aware Adaptive SDN Defense with PoW Against multi-domain SYN Floods

The stability of Internet services is persistently challenged by the escalating scale of volumetric TCP SYN floods, as conventional defenses like SYN Cookies fail by exacerbating bandwidth depletion under modern attacks. This paper introduces SDN-SYN PoW, a novel defense architecture that synergizes non-interactive Proof-of-Work with a Software-Defined Networking (SDN) control plane, an approach particularly effective for securing the network edge in modern SD-WAN deployments. The core innovation is its ability to perform global network sensing; the SDN controller monitors real-time traffic to dynamically adjust PoW difficulty, transforming the defense from a static mechanism into an intelligent, adaptive system that surgically applies computational costs only to anomalous sources. Through rigorous experiments on a custom-built testbed, we demonstrate that SDN-SYN PoW provides substantially superior protection and, critically, that the PoW overhead remains negligible for legitimate clients, ensuring compatibility even with low-power devices.
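The adaptive idea can be sketched as a hash puzzle whose difficulty follows the observed SYN rate. Everything specific below is an assumption made for illustration: the SHA-256 puzzle, the challenge bytes, and the doubling rule in `difficulty_for` are not taken from the paper.

```python
import hashlib

def leading_zero_bits(digest):
    """Number of leading zero bits in a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def puzzle_digest(challenge, nonce):
    return hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()

def solve(challenge, difficulty):
    """Client side: brute-force a nonce meeting the difficulty target."""
    nonce = 0
    while leading_zero_bits(puzzle_digest(challenge, nonce)) < difficulty:
        nonce += 1
    return nonce

def verify(challenge, nonce, difficulty):
    """Verifier side: a single hash, regardless of difficulty."""
    return leading_zero_bits(puzzle_digest(challenge, nonce)) >= difficulty

def difficulty_for(syn_rate, baseline_rate, base_bits=4, max_bits=20):
    """Hypothetical controller policy: one extra bit of work each time
    the observed SYN rate doubles relative to the baseline."""
    bits, rate = base_bits, baseline_rate
    while syn_rate >= 2 * rate and bits < max_bits:
        bits += 1
        rate *= 2
    return bits

challenge = b"203.0.113.7|443|epoch-42"  # hypothetical packet-derived challenge
bits = difficulty_for(syn_rate=800, baseline_rate=100)
nonce = solve(challenge, bits)
print(bits, nonce, verify(challenge, nonce, bits))
```

Each extra bit doubles the expected client work while verification stays at one hash, which is what makes it cheap for a defender to raise the cost only for sources that look anomalous.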

Updated: 2026-03-02 18:49:34

Categories: cs.NI,cs.CR

Download: http://arxiv.org/abs/2603.06668v1

Reservoir Subspace Injection for Online ICA under Top-n Whitening

Reservoir expansion can improve online independent component analysis (ICA) under nonlinear mixing, yet top-$n$ whitening may discard injected features. We formalize this bottleneck as reservoir subspace injection (RSI): injected features help only if they enter the retained eigenspace without displacing passthrough directions. RSI diagnostics (IER, SSO, $ρ_x$) identify a failure mode in our top-$n$ setting: stronger injection increases IER but crowds out passthrough energy ($ρ_x$: $1.00 \rightarrow 0.77$), degrading SI-SDR by up to $2.2$ dB. A guarded RSI controller preserves passthrough retention and recovers mean performance to within $0.1$ dB of baseline $1/N$ scaling. With passthrough preserved, RE-OICA improves over vanilla online ICA by $+1.7$ dB under nonlinear mixing and achieves positive SI-SDR$_{\mathrm{sc}}$ on the tested super-Gaussian benchmark ($+0.6$ dB).

Updated: 2026-03-02 18:49:02

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2603.02178v1

Wikipedia in the Era of LLMs: Evolution and Risks

In this paper, we present a comprehensive analysis and monitoring framework for the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing article content and page views to study the recent changes in Wikipedia and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been affected by LLMs, with an impact of approximately 1% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models could shift. Moreover, the effectiveness of RAG might decrease if the knowledge has been contaminated by LLMs. While LLMs have not yet fully changed Wikipedia's language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks in NLP research. We release all experimental datasets and source code at: https://github.com/HSM316/LLM_Wikipedia

Updated: 2026-03-02 18:48:33

Categories: cs.CL,cs.AI,cs.CY,cs.LG

Download: http://arxiv.org/abs/2503.02879v2

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.

Updated: 2026-03-02 18:46:28

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.02175v1

De-paradox Tree: Breaking Down Simpson's Paradox via A Kernel-Based Partition Algorithm

Real-world observational datasets and machine learning have revolutionized data-driven decision-making, yet many models rely on empirical associations that may be misleading due to confounding and subgroup heterogeneity. Simpson's paradox exemplifies this challenge, where aggregated and subgroup-level associations contradict each other, leading to misleading conclusions. Existing methods provide limited support for detecting and interpreting such paradoxical associations, especially for practitioners without deep causal expertise. We introduce De-paradox Tree, an interpretable algorithm designed to uncover hidden subgroup patterns behind paradoxical associations under assumed causal structures involving confounders and effect heterogeneity. It employs novel split criteria and balancing-based procedures to adjust for confounders and homogenize heterogeneous effects through recursive partitioning. Compared to state-of-the-art methods, De-paradox Tree builds simpler, more interpretable trees, selects relevant covariates, and identifies nested opposite effects while ensuring robust estimation of causal effects when causally admissible variables are provided. Our approach addresses the limitations of traditional causal inference and machine learning methods by introducing an interpretable framework that supports non-expert practitioners while explicitly acknowledging causal assumptions and scope limitations, enabling more reliable and informed decision-making in complex observational data environments.
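The reversal that defines Simpson's paradox is easy to reproduce. The sketch below uses the counts from the widely cited kidney-stone treatment example, where stone size acts as the confounder; it only demonstrates the paradox and is not an implementation of De-paradox Tree.

```python
# (successes, patients) per treatment and stone-size subgroup.
DATA = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def subgroup_rate(treatment, subgroup):
    """Success rate within one subgroup."""
    successes, total = DATA[treatment][subgroup]
    return successes / total

def aggregate_rate(treatment):
    """Success rate after pooling all subgroups together."""
    successes = sum(s for s, _ in DATA[treatment].values())
    total = sum(n for _, n in DATA[treatment].values())
    return successes / total

for group in ("small", "large"):
    print(group, round(subgroup_rate("A", group), 3), round(subgroup_rate("B", group), 3))
print("all", round(aggregate_rate("A"), 3), round(aggregate_rate("B"), 3))
```

Treatment A has the higher success rate in both subgroups, yet B looks better in the pooled data because A was given mostly to the harder large-stone cases. Splitting on the confounder is what recovers the subgroup-level picture.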

Updated: 2026-03-02 18:45:24

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02174v1

Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution

While diffusion models excel at image generation, their growing adoption raises critical concerns about copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that are of primary concern to stakeholders. To address this gap, we introduce concept-level attribution through a novel method called Concept-TRAK, which extends influence functions with a key innovation: specialized training and utility loss functions designed to isolate concept-specific influences rather than overall reconstruction quality. We evaluate Concept-TRAK on novel concept attribution benchmarks using Synthetic and CelebA-HQ datasets, as well as the established AbC benchmark, showing substantial improvements over prior methods in concept-level attribution scenarios. We further demonstrate its versatility on real-world text-to-image generation with compositional and multi-concept prompts.

Updated: 2026-03-02 18:40:44

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.06547v3

SageBwd: A Trainable Low-bit Attention

Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd matches full-precision attention during pre-training. Through experiments and theoretical analysis, we reach a few important insights and conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
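As background on what quantizing a matrix multiplication to INT8 involves, the sketch below applies symmetric per-tensor quantization to a single dot product. The per-tensor granularity and the function names are assumptions; the abstract does not describe SageBwd's quantization granularity or its smoothing steps, so this is not the authors' kernel.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization onto the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

def int8_dot(a, b):
    """Dot product accumulated on integer codes and rescaled once at the end."""
    qa, scale_a = quantize_int8(a)
    qb, scale_b = quantize_int8(b)
    return sum(x * y for x, y in zip(qa, qb)) * scale_a * scale_b

a = [0.5, -1.27, 0.03, 1.0]
b = [0.2, 0.4, -0.6, 0.8]
exact = sum(x * y for x, y in zip(a, b))
print(exact, int8_dot(a, b))
```

The rounding error of each element is at most half a quantization step, so tensors with a few large outliers lose the most precision. This is one reason smoothing and normalization steps matter for low-bit training.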

Updated: 2026-03-02 18:39:49

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02170v1

Astral: training physics-informed neural networks with error majorants

The primal approach to physics-informed learning is residual minimization. We argue that the residual is, at best, an indirect measure of the error of an approximate solution and propose to train with an error majorant instead. Since the error majorant provides a direct upper bound on the error, one can reliably estimate how close the PiNN is to the exact solution and stop the optimization process when the desired accuracy is reached. We call the loss function associated with the error majorant Astral: neurAl a poSTerioRi functionAl Loss. To compare Astral and residual loss functions, we illustrate how error majorants can be derived for various PDEs and conduct experiments with diffusion equations (including anisotropic ones and one in an L-shaped domain), a convection-diffusion equation, a temporal discretization of Maxwell's equations, magnetostatics, and nonlinear elastoplasticity problems. The results indicate that Astral loss is competitive with the residual loss, typically leading to faster convergence and lower error. The main benefit of using Astral loss comes from its ability to estimate error, which is impossible with other loss functions. Our experiments indicate that the error estimate obtained with Astral loss is usually tight enough: for example, for a highly anisotropic equation, Astral overestimates the error by a factor of $1.5$ on average, and for convection-diffusion by a factor of $1.7$. We further demonstrate that Astral loss is better correlated with error than the residual and is a more reliable predictor of the error value. Moreover, unlike the residual, the error indicator obtained from Astral loss has a superb spatial correlation with error. Backed by the empirical and theoretical results, we argue that one can productively use Astral loss to perform reliable error analysis and approximate PDE solutions with accuracy similar to standard residual-based techniques.
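For orientation, the standard functional a posteriori majorant for the Poisson problem $-\Delta u = f$ with homogeneous Dirichlet boundary conditions is given below. This textbook form is assumed here purely as an illustration; the abstract does not state which majorants the paper derives. In it, $v$ is the approximate solution, $y$ is an auxiliary flux field, $\beta > 0$ is a free parameter, $C_F$ is the Friedrichs constant of the domain, and all norms are $L^2$ norms over the domain.

```latex
\|\nabla (u - v)\|^2
  \;\le\;
  (1 + \beta)\,\|y - \nabla v\|^2
  \;+\;
  \left(1 + \frac{1}{\beta}\right) C_F^2\,\|\operatorname{div} y + f\|^2
```

The right-hand side involves only $v$, $y$, and the data $f$, never the unknown $u$, so it can be evaluated and minimized over both $v$ and $y$ while remaining a guaranteed upper bound on the energy-norm error.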

Updated: 2026-03-02 18:39:47

Categories: physics.comp-ph,cs.AI,cs.LG,math.NA

Download: http://arxiv.org/abs/2406.02645v2

Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on Decision Transformer (DT) type frameworks, which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works address the dynamics shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy cannot be directly applied in RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented (REAG) method for DT type frameworks, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide a theoretical analysis demonstrating that the RCSL policy learned from REAG achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations, REAG$_\text{Dara}^{*}$ and REAG$_\text{MV}^{*}$. Thorough experiments on D4RL datasets and various DT-type baselines demonstrate that our methods consistently enhance the performance of DT type frameworks in off-dynamics RL.

Updated: 2026-03-02 18:38:36

Categories: cs.LG,cs.AI,cs.RO,stat.ML

Download: http://arxiv.org/abs/2410.23450v2

Boosting Device Utilization in Control Flow Auditing

Micro-Controller Units (MCUs) are widely used in safety-critical systems, making them attractive targets for attacks. This calls for lightweight defenses that remain effective despite software compromise. Control Flow Auditing (CFAud) is one such mechanism, wherein a remote verifier (Vrf) is guaranteed to receive evidence about the control flow path taken on a prover (Prv) MCU, even when Prv software is compromised. Despite promising benefits, current CFAud architectures require a "busy-wait" phase in which a hardware-anchored root-of-trust (RoT) in Prv retains execution control to ensure delivery of control flow evidence to Vrf. This drastically reduces CPU utilization on Prv. In this work, we address this limitation with an architecture for Contention Avoidance in Runtime Auditing with Minimized Execution Latency (CARAMEL). CARAMEL is a hardware-software RoT co-design that enables Prv applications to resume while control flow evidence is transmitted to Vrf. This significantly reduces contention due to transmission delays and improves CPU utilization without giving up on security. Key to CARAMEL is our design of a new RoT with a self-contained (and minimal) dedicated communication interface. CARAMEL's implementation and accompanying evaluation are made open-source. Our results show substantially improved CPU utilization at a modest hardware cost.

Updated: 2026-03-02 18:26:17

Categories: cs.CR

Download: http://arxiv.org/abs/2603.02161v1

Instrumental and Proximal Causal Inference with Gaussian Processes

Instrumental variable (IV) and proximal causal learning (Proxy) methods are central frameworks for causal inference in the presence of unobserved confounding. Despite substantial methodological advances, existing approaches rarely provide reliable epistemic uncertainty (EU) quantification. We address this gap through a Deconditional Gaussian Process (DGP) framework for uncertainty-aware causal learning. Our formulation recovers popular kernel estimators as the posterior mean, ensuring predictive precision, while the posterior variance yields principled and well-calibrated EU. Moreover, the probabilistic structure enables systematic model selection via marginal log-likelihood optimization. Empirical results demonstrate strong predictive performance alongside informative EU quantification, evaluated via empirical coverage frequencies and decision-aware accuracy rejection curves. Together, our approach provides a unified, practical solution for causal inference under unobserved confounding with reliable uncertainty.

Updated: 2026-03-02 18:23:26

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2603.02159v1

Mixing Times and Privacy Analysis for the Projected Langevin Algorithm under a Modulus of Continuity

We study the mixing time of the projected Langevin algorithm (LA) and the privacy curve of noisy Stochastic Gradient Descent (SGD), beyond nonexpansive iterations. Specifically, we derive new mixing time bounds for the projected LA which are, in some important cases, dimension-free and poly-logarithmic in the accuracy, closely matching the existing results in the smooth convex case. Additionally, we establish new upper bounds for the privacy curve of the subsampled noisy SGD algorithm. These bounds show a crucial dependency on the regularity of gradients, and are useful for a wide range of convex losses beyond the smooth case. Our analysis relies on a suitable extension of the Privacy Amplification by Iteration (PABI) framework (Feldman et al., 2018; Altschuler and Talwar, 2022, 2023) to noisy iterations whose gradient map is not necessarily nonexpansive. This extension is achieved by designing an optimization problem which accounts for the best possible Rényi divergence bound obtained by an application of PABI, where the tractability of the problem is crucially related to the modulus of continuity of the associated gradient mapping. We show that, in several interesting cases (namely the nonsmooth convex, weakly smooth, and (strongly) dissipative cases), this optimization problem can be solved exactly and explicitly, yielding the tightest possible PABI-based bounds.

Updated: 2026-03-02 18:22:17

Categories: stat.ML,cs.LG,math.OC,math.ST

Download: http://arxiv.org/abs/2501.04134v3
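
For readers unfamiliar with the projected Langevin algorithm named in the entry above, the following is a minimal sketch of the textbook iteration (gradient step, Gaussian noise, Euclidean projection). It is not the paper's analysis or code. The constraint set (a Euclidean ball), the nonsmooth test loss (the l1 norm), and all function and parameter names are assumptions chosen for illustration.

```python
import math
import random


def project_onto_ball(x, radius):
    """Euclidean projection of x onto the ball of the given radius (assumed constraint set)."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)
    scale = radius / norm
    return [xi * scale for xi in x]


def projected_langevin(grad, x0, step, n_steps, radius, rng):
    """Run x <- Proj(x - step * grad(x) + sqrt(2 * step) * xi), with xi standard Gaussian."""
    x = project_onto_ball(x0, radius)
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        g = grad(x)
        y = [xi - step * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
        x = project_onto_ball(y, radius)
    return x


def l1_subgradient(x):
    """A subgradient of the nonsmooth convex loss f(x) = ||x||_1 (illustrative choice)."""
    return [(xi > 0) - (xi < 0) for xi in x]


if __name__ == "__main__":
    sample = projected_langevin(l1_subgradient, [1.0, -1.0, 0.5],
                                step=0.01, n_steps=1000, radius=2.0,
                                rng=random.Random(0))
    print(sample)
```

The l1 subgradient map is not nonexpansive, which is the kind of setting the abstract says lies beyond earlier analyses; the sketch only shows the iteration itself and makes no claim about mixing time.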

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark where LLM agents find and patch 22 novel critical vulnerabilities in open-source codebases. We focus our efforts on three popular frontier agentic LLMs: GPT-5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks and observe some behavioral patterns that suggest how these models can be improved in the domain of proactive cyberdefense.

Updated: 2026-03-02 18:21:22

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2603.02297v1

How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks

Emerging 6G visions, reflected in ongoing standardization efforts within 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, increasingly characterize networks as AI-native systems in which high-level semantic reasoning layers operate above standardized control and data-plane functions. Although frontier-scale large language models (LLMs) such as Qwen2.5-7B and Olmo-3-7B demonstrate strong reasoning capability, their computational footprint limits deployment in latency-sensitive, edge-native infrastructures. This paper presents a systematic empirical study of the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. Using 6G-Bench, a standardization-aligned benchmark comprising 30 decision-making tasks across five capability domains, we evaluate models ranging from 135M (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B. Deterministic accuracy (pass@1) increases from 0.224 at 135M to 0.707 at 7B, but scaling gains are highly non-uniform. A pronounced stability transition occurs in the 1 to 1.5B range, where accuracy rises from 0.373 (Llama-3.2-1B) to 0.531 (Qwen2.5-1.5B) and the instability gap Delta_5 contracts from 0.356 to 0.138. Beyond 3B parameters, improvements diminish (+0.064 from 3B to 7B). Through single-query inference profiling and an Edge Score metric that normalizes accuracy by latency and memory footprint, we show that semantic reliability per unit edge resource does not scale monotonically with parameter count. Instead, mid-scale models (approximately 1.5 to 3B) achieve the most favorable balance between deterministic stability and computational efficiency, providing deployment-relevant guidance for AI-native 6G architectures. All scripts and results are publicly available at https://github.com/maferrag/6G-Bench

Updated: 2026-03-02 18:19:49

Categories: cs.NI,cs.AI

Download: http://arxiv.org/abs/2603.02156v1

Near-Optimal Regret for KL-Regularized Multi-Armed Bandits

Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(ηK\log^2 T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $η^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $Ω(ηK \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $η$), we show that the KL-regularized regret for MABs is $η$-independent and scales as $\tilde{Θ}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $η$ and yield nearly optimal bounds in terms of $K$, $η$, and $T$.

Updated: 2026-03-02 18:17:33

Categories: cs.LG,cs.AI,math.ST,stat.ML

Download: http://arxiv.org/abs/2603.02155v1

Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning

Decentralized Monte Carlo Tree Search (Dec-MCTS) is widely used for cooperative multi-agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB-MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While Boltzmann exploration has been studied in single-agent MCTS, applying it in multi-agent systems poses unique challenges. CB-MCTS is the first to address this. We analyze CB-MCTS in the simple-regret setting and show in simulations that it outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing a robust solution for multi-agent planning.

Updated: 2026-03-02 18:15:39

Categories: cs.MA,cs.AI

Download: http://arxiv.org/abs/2603.02154v1
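
To illustrate the contrast the entry above draws between deterministic UCT and a stochastic Boltzmann policy, here is a minimal sketch of generic Boltzmann (softmax) action selection over estimated action values. It is not CB-MCTS: the decaying entropy bonus and the multi-agent coordination described in the abstract are not modelled, and the function names and the temperature parameter are assumptions made for illustration.

```python
import math
import random


def boltzmann_probs(values, temperature):
    """Return softmax probabilities proportional to exp(value / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(values)
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(values, temperature, rng=random):
    """Sample an action index from the Boltzmann distribution over values."""
    probs = boltzmann_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    estimates = [0.2, 0.5, 0.4]  # hypothetical value estimates for three child nodes
    print(boltzmann_probs(estimates, temperature=0.1))
    print(sample_action(estimates, temperature=0.1, rng=random.Random(0)))
```

A low temperature concentrates probability on the highest-valued action, while a high temperature approaches uniform sampling, so every action keeps a nonzero chance of being tried.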

Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment

Retrieval-Augmented Generation (RAG) systems commonly adopt retrieval fusion techniques such as multi-query retrieval and reciprocal rank fusion (RRF) to increase document recall, under the assumption that higher recall leads to better answer quality. While these methods show consistent gains in isolated retrieval benchmarks, their effectiveness under realistic production constraints remains underexplored. In this work, we evaluate retrieval fusion in a production-style RAG pipeline operating over an enterprise knowledge base, with fixed retrieval depth, re-ranking budgets, and latency constraints. Across multiple fusion configurations, we find that retrieval fusion does increase raw recall, but these gains are largely neutralized after re-ranking and truncation. In our setting, fusion variants fail to outperform single-query baselines on KB-level Top-$k$ accuracy, with Hit@10 decreasing from $0.51$ to $0.48$ in several configurations. Moreover, fusion introduces additional latency overhead due to query rewriting and larger candidate sets, without corresponding improvements in downstream effectiveness. Our analysis suggests that recall-oriented fusion techniques exhibit diminishing returns once realistic re-ranking limits and context budgets are applied. We conclude that retrieval-level improvements do not reliably translate into end-to-end gains in production RAG systems, and argue for evaluation frameworks that jointly consider retrieval quality, system efficiency, and downstream impact.

Updated: 2026-03-02 18:15:09

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2603.02153v1
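
The reciprocal rank fusion (RRF) and Hit@10 metric mentioned in the entry above can be sketched as follows. This is the commonly used RRF scoring rule, in which each document receives the sum of 1 / (k + rank) over the ranked lists it appears in. It is not the paper's pipeline. The smoothing constant k = 60 is a conventional default and an assumption here, as are the function names and the toy document ids.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hit_at_k(ranked_ids, relevant_ids, k=10):
    """Return True if any relevant document appears in the top k results."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


if __name__ == "__main__":
    # Hypothetical result lists for three rewrites of one query.
    per_query_results = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
    fused = reciprocal_rank_fusion(per_query_results)
    print(fused)
    print(hit_at_k(fused, relevant_ids={"d"}, k=3))
```

Fusion can surface a document such as "d" that only one rewrite retrieved, but a fixed truncation depth can still cut it off, which is consistent with the abstract's observation that recall gains may not survive re-ranking and truncation.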

Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)

The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task in extracting information about the crime, the criminal, or law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case-study of Crime-related zero- and Few-Shot NER, and a general Crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task extracted from public reports on terrorist attacks and the U.S. Department of Justice's press notes. We define 5 types of coarse crime entity and a total of 22 types of fine-grained entity. We address the quality of the case-study and the annotated data with experiments on Zero and Few-Shot settings with State-of-the-Art NER models as well as generalist and commonly used Large Language Models.

Updated: 2026-03-02 18:12:02

Categories: cs.CL,cs.AI,cs.DB

Download: http://arxiv.org/abs/2603.02150v1

Data-to-Energy Stochastic Dynamics

The Schrödinger bridge problem is concerned with finding a stochastic dynamical system bridging two marginal distributions that minimises a certain transportation cost. This problem, which represents a generalisation of optimal transport to the stochastic case, has received attention due to its connections to diffusion models and flow matching, as well as its applications in the natural sciences. However, all existing algorithms allow to infer such dynamics only for cases where samples from both distributions are available. In this paper, we propose the first general method for modelling Schrödinger bridges when one (or both) distributions are given by their unnormalised densities, with no access to data samples. Our algorithm relies on a generalisation of the iterative proportional fitting (IPF) procedure to the data-free case, inspired by recent developments in off-policy reinforcement learning for training of diffusion samplers. We demonstrate the efficacy of the proposed data-to-energy IPF on synthetic problems, finding that it can successfully learn transports between multimodal distributions. As a secondary consequence of our reinforcement learning formulation, which assumes a fixed time discretisation scheme for the dynamics, we find that existing data-to-data Schrödinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics. Finally, we apply the newly developed algorithm to the problem of sampling posterior distributions in latent spaces of generative models, thus creating a data-free image-to-image translation method. Code: https://github.com/mmacosha/d2e-stochastic-dynamics

Updated: 2026-03-02 18:11:04

Categories: cs.LG

Download: http://arxiv.org/abs/2509.26364v2

Multi-Marginal Flow Matching with Adversarially Learnt Interpolants

Learning the dynamics of a process given sampled observations at several time points is an important but difficult task in many scientific applications. When no ground-truth trajectories are available, but one has only snapshots of data taken at discrete time steps, the problem of modelling the dynamics, and thus inferring the underlying trajectories, can be solved by multi-marginal generalisations of flow matching algorithms. This paper proposes a novel flow matching method that overcomes the limitations of existing multi-marginal trajectory inference algorithms. Our proposed method, ALI-CFM, uses a GAN-inspired adversarial loss to fit neurally parametrised interpolant curves between source and target points such that the marginal distributions at intermediate time points are close to the observed distributions. The resulting interpolants are smooth trajectories that, as we show, are unique under mild assumptions. These interpolants are subsequently marginalised by a flow matching algorithm, yielding a trained vector field for the underlying dynamics. We showcase the versatility and scalability of our method by outperforming the existing baselines on spatial transcriptomics and cell tracking datasets, while performing on par with them on single-cell trajectory prediction. Code: https://github.com/mmacosha/adversarially-learned-interpolants.

Updated: 2026-03-02 18:10:31

Categories: cs.LG

Download: http://arxiv.org/abs/2510.01159v2

Machine Learning (ML) library in Linux kernel

The Linux kernel is a huge code base with an enormous number of subsystems and configuration options, which makes working out an efficient configuration unmanageably complex. Machine Learning (ML) is an approach to learning from data, finding patterns, and making predictions without developers having to implement the algorithms explicitly, and it could introduce a self-evolving capability into the Linux kernel. However, introducing ML approaches into the Linux kernel is not easy, because floating-point operations (FPU) are not directly usable in kernel space and ML models can potentially cause significant performance degradation in the kernel. This paper proposes an ML infrastructure architecture for the Linux kernel that addresses these problems and enables the use of ML models in kernel space. The proposed kernel ML library has been implemented as a Proof of Concept (PoC) project to demonstrate the feasibility of the proposal and to design the interface between the kernel-space ML model proxy and the user-space ML model thread.

Updated: 2026-03-02 18:07:35

Categories: cs.LG,cs.OS

Download: http://arxiv.org/abs/2603.02145v1

Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection

Scaling laws assume larger models trained on more data consistently outperform smaller ones -- an assumption that drives model selection in computer vision but remains untested in resource-constrained Earth observation (EO). We conduct a systematic efficiency analysis across three scaling dimensions: model size, dataset size, and input resolution, on rooftop PV detection in Madagascar. Optimizing for model efficiency (mAP$_{50}$ per unit of model size), we find a consistent efficiency inversion: YOLO11N achieves both the highest efficiency ($24\times$ higher than YOLO11X) and the highest absolute mAP$_{50}$ (0.617). Resolution is the dominant resource allocation lever ($+$120% efficiency gain), while additional data yields negligible returns at low resolution. These findings are robust to the deployment objective: small high-resolution configurations are Pareto-dominant across all 44 setups in the joint accuracy-throughput space, leaving no tradeoff to resolve. In data-scarce EO, bigger is not just unnecessary: it can be worse.

Updated: 2026-03-02 18:05:57

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2603.02142v1

A Learnable Wavelet Transformer for Long-Short Equity Trading and Risk-Adjusted Return Optimization

Learning profitable intraday trading policies from financial time series is challenging due to heavy noise, non-stationarity, and strong cross-sectional dependence among related assets. We propose \emph{WaveLSFormer}, a learnable wavelet-based long-short Transformer that jointly performs multi-scale decomposition and return-oriented decision learning. Unlike standard time-series forecasting that optimizes prediction error and typically requires a separate position-sizing or portfolio-construction step, our model directly outputs a market-neutral long/short portfolio and is trained end-to-end on a trading objective with risk-aware regularization. Specifically, a learnable wavelet front-end generates low-/high-frequency components via an end-to-end trained filter bank, guided by spectral regularizers that encourage stable and well-separated frequency bands. To fuse multi-scale information, we introduce a low-guided high-frequency injection (LGHI) module that refines low-frequency representations with high-frequency cues while controlling training stability. The model outputs a portfolio of long/short positions that is rescaled to satisfy a fixed risk budget and is optimized directly with a trading objective and risk-aware regularization. Extensive experiments on five years of hourly data across six industry groups, evaluated over ten random seeds, demonstrate that WaveLSFormer consistently outperforms MLP, LSTM and Transformer backbones, with and without fixed discrete wavelet front-ends. On average in all industries, WaveLSFormer achieves a cumulative overall strategy return of $0.607 \pm 0.045$ and a Sharpe ratio of $2.157 \pm 0.166$, substantially improving both profitability and risk-adjusted returns over the strongest baselines.

Updated: 2026-03-02 18:01:11

Categories: cs.LG,cs.AI,q-fin.CP

Download: http://arxiv.org/abs/2601.13435v3

Using ChatGPT for Data Science Analyses

As a result of recent advancements in generative AI, the field of data science is prone to various changes. The way practitioners construct their data science workflows is now irreversibly shaped by recent advancements, particularly by tools like OpenAI's Data Analysis plugin. While it offers powerful support as a quantitative co-pilot, its limitations demand careful consideration in empirical analysis. This paper assesses the potential of ChatGPT for data science analyses, illustrating its capabilities for data exploration and visualization, as well as for commonly used supervised and unsupervised modeling tasks. While we focus here on how the Data Analysis plugin can serve as co-pilot for Data Science workflows, its broader potential for automation is implicit throughout.

Updated: 2026-03-02 17:58:41

Categories: cs.LG,cs.CL,stat.CO

Download: http://arxiv.org/abs/2404.08480v2

MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination

With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning (RL) has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which cannot fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for multi-building flexibility coordination, was developed. MuFlex enables synchronous information exchange and co-simulation across multiple detailed building models programmed in EnergyPlus and Modelica, and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform's physics-based capabilities and workflow were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic (SAC) algorithm. The results show that, when coordinating the four buildings, SAC reduced the aggregated peak demand by nearly 12% while maintaining indoor comfort, keeping the power demand below the threshold. Additionally, the platform's scalability was investigated through computational benchmarking on building clusters with varying sizes, model types, and simulation programs.

Updated: 2026-03-02 17:55:58

Categories: cs.LG,eess.SY

Download: http://arxiv.org/abs/2508.13532v3

A Randomized Linearly Convergent Frank-Wolfe-type Method for Smooth Convex Minimization over the Spectrahedron

We consider the problem of minimizing a smooth and convex function over the $n$-dimensional spectrahedron -- the set of real symmetric $n\times n$ positive semidefinite matrices with unit trace, which underlies numerous applications in statistics, machine learning and additional domains. Standard first-order methods often require high-rank matrix computations, which are prohibitive when the dimension $n$ is large. The well-known Frank-Wolfe method, on the other hand, requires only efficient rank-one matrix computations but suffers from slow worst-case convergence, even under conditions that enable linear convergence rates for standard methods. In this work we present the first Frank-Wolfe-based algorithm that only applies efficient rank-one matrix computations and, assuming quadratic growth and strict complementarity conditions, is guaranteed, after a finite number of iterations, to converge linearly, in expectation, and independently of the ambient dimension.

Updated: 2026-03-02 17:47:18

Categories: math.OC,cs.LG

Download: http://arxiv.org/abs/2503.01441v2

LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation

We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.

Updated: 2026-03-02 17:46:32

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.02129v1

LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations

Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs alongside human results across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans in action alignment, risk calibration through the severity of the chosen actions, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioural profiles and strategy updates. LLM explanations for chosen actions across all models exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.

Updated: 2026-03-02 17:46:17

Categories: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2603.02128v1

Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.

Updated: 2026-03-02 17:42:33

Categories: cs.AI,cs.CV

Download: http://arxiv.org/abs/2603.02123v1

Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, localizing errors to the exact rule violated, providing the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours - a demanding test of long-context utilization, not just reasoning.

Updated: 2026-03-02 17:40:54

Categories: cs.AI,cs.GT,cs.LG

Download: http://arxiv.org/abs/2603.02119v1

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
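
The dual objective described above can be illustrated with a minimal sketch. The abstract does not give the exact loss forms, so the choices here are assumptions: a mean-squared error for the frame-level progress term and a Bradley-Terry style logistic loss for the trajectory-comparison term. The function names and the `weight` parameter are hypothetical.

```python
import numpy as np

def progress_loss(pred_progress, target_progress):
    """Frame-level term: squared error against progress labels (assumed form)."""
    pred = np.asarray(pred_progress, dtype=float)
    target = np.asarray(target_progress, dtype=float)
    return float(np.mean((pred - target) ** 2))

def preference_loss(score_preferred, score_rejected):
    """Trajectory-comparison term: -log sigmoid(margin), Bradley-Terry style
    (assumed form). logaddexp(0, -m) equals log(1 + exp(-m)) and is stable."""
    margin = (np.asarray(score_preferred, dtype=float)
              - np.asarray(score_rejected, dtype=float))
    return float(np.mean(np.logaddexp(0.0, -margin)))

def dual_objective(pred_progress, target_progress,
                   score_preferred, score_rejected, weight=1.0):
    """Sum of both terms; `weight` is a hypothetical trade-off parameter."""
    return (progress_loss(pred_progress, target_progress)
            + weight * preference_loss(score_preferred, score_rejected))
```

The preference term only constrains the ordering of trajectory scores, which is why it can use failed trajectories that have no reliable dense progress labels, while the progress term fixes the reward scale on expert data.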

Updated: 2026-03-02 17:38:58

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.02115v1

Recursive Models for Long-Horizon Reasoning

Modern language models reason within bounded context, an inherent constraint that poses a fundamental barrier to long-horizon reasoning. We identify recursion as a core principle for overcoming this barrier, and propose recursive models as a minimal realization, where the model can recursively invoke itself to solve subtasks in isolated contexts. We prove that any computable problem admits a recursive decomposition in which each subtask requires only exponentially smaller active context than standard autoregressive models; this strictly surpasses any context management approach confined to a single sequence, such as summarization. We further generalize our framework to modern agentic systems with arbitrary context processing and control flows, and prove that recursive models can achieve optimal power within this broader class. Experimentally, we train a 3B model to reason recursively and evaluate on Boolean satisfiability, a task requiring long-horizon combinatorial search, where it significantly outperforms frontier LLMs.

Updated: 2026-03-02 17:37:10

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2603.02112v1

SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determining their accuracy with explicit reasoning in a single generation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine-tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $\sim$16% of training samples compared to human-labeled and other synthetically trained baselines. Additionally, it achieves competitive performance with MCTS-based methods while offering a 2.3$\times$ speedup in terms of total token count. Manual analysis reveals complementary precision-recall characteristics with MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.

Updated: 2026-03-02 17:34:48

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.15498v3

ARCANE -- Early Detection of Interplanetary Coronal Mass Ejections

Interplanetary coronal mass ejections (ICMEs) are major drivers of space weather disturbances, posing risks to both technological infrastructure and human activities. Automatic detection of ICMEs in solar wind in situ data is essential for early warning systems. While several methods have been proposed to identify these structures in time series data, robust real-time detection remains a significant challenge. In this work, we present ARCANE - the first framework explicitly designed for early ICME detection in streaming solar wind data under realistic operational constraints, enabling event identification without requiring observation of the full structure. Our approach evaluates the strengths and limitations of detection models by comparing a machine learning-based method to a threshold-based baseline. The ResUNet++ model, previously validated on science data, significantly outperforms the baseline, particularly in detecting high-impact events, while retaining solid performance on lower-impact cases. Notably, we find that using real-time solar wind (RTSW) data instead of high-resolution science data leads to only minimal performance degradation. Despite the challenges of operational settings, our detection pipeline achieves an F1-Score of 0.37, with an average detection delay of 24.5% of the event's duration while processing only a minimal portion of the event data. As more data becomes available, the performance increases significantly. These results mark a substantial step forward in automated space weather monitoring and lay the groundwork for enhanced real-time forecasting capabilities.

Updated: 2026-03-02 17:33:27

Categories: physics.space-ph,astro-ph.IM,astro-ph.SR,cs.LG

Download: http://arxiv.org/abs/2505.09365v3

Distributions as Actions: A Unified Framework for Diverse Action Spaces

We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.

Updated: 2026-03-02 17:30:05

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.16608v2

Orchestrating Multimodal DNN Workloads in Wireless Neural Processing

In edge inference, wireless resource allocation and accelerator-level deep neural network (DNN) scheduling have yet to be co-optimized in an end-to-end manner. The lack of coordination between wireless transmission and accelerator-level DNN execution prevents efficient overlap, leading to higher end-to-end inference latency. To address this issue, this paper investigates multimodal DNN workload orchestration in wireless neural processing (WNP), a paradigm that integrates wireless transmission and multi-core accelerator execution into a unified end-to-end pipeline. First, we develop a unified communication-computation model for multimodal DNN execution and formulate the corresponding optimization problem. Second, we propose O-WiN, a framework that orchestrates DNN workloads in WNP through two tightly coupled stages: simulation-based optimization and runtime execution. Third, we develop two algorithms, RTFS and PACS. RTFS schedules communication and computation sequentially, whereas PACS interleaves them to enable pipeline parallelism by overlapping wireless data transfer with accelerator-level DNN execution. Simulation results demonstrate that PACS significantly outperforms RTFS under high modality heterogeneity by better masking wireless latency through communication-computation overlap, thereby highlighting the effectiveness of communication-computation pipelining in accelerating multimodal DNN execution in WNP.

Updated: 2026-03-02 17:25:43

Categories: eess.SP,cs.LG

Download: http://arxiv.org/abs/2603.02109v1

Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates

Enzymes are genetically encoded biocatalysts capable of accelerating chemical reactions. How can we automatically design functional enzymes? In this paper, we propose EnzyGen, an approach to learn a unified model to design enzymes across all functional families. Our key idea is to generate an enzyme's amino acid sequence and its three-dimensional (3D) coordinates based on functionally important sites and substrates corresponding to a desired catalytic function. These sites are automatically mined from enzyme databases. EnzyGen consists of a novel interleaving network of attention and neighborhood equivariant layers, which captures both long-range correlation in an entire protein sequence and local influence from nearest amino acids in 3D space. To learn the generative model, we devise a joint training objective, including a sequence generation loss, a position prediction loss and an enzyme-substrate interaction loss. We further construct EnzyBench, a dataset with 3157 enzyme families, covering all available enzymes within the protein data bank (PDB). Experimental results show that EnzyGen consistently achieves the best performance across all 323 testing families, surpassing the best baseline by 10.79% in terms of substrate binding affinity. These findings demonstrate EnzyGen's superior capability in designing well-folded and effective enzymes binding to specific substrates with high affinities.

Updated: 2026-03-02 17:25:30

Categories: cs.LG

Download: http://arxiv.org/abs/2405.08205v4

Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning

Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.

Updated: 2026-03-02 17:21:38

Categories: cs.LG,cs.RO,eess.SY,stat.ML

Download: http://arxiv.org/abs/2512.12046v2

Stochastic Multi-Armed Bandits with Limited Control Variates

Motivated by wireless networks where interference or channel state estimates provide partial insight into throughput, we study a variant of the classical stochastic multi-armed bandit (MAB) problem in which the learner has limited access to auxiliary information. Recent work has shown that such auxiliary information, when available as control variates, can be used to get tighter confidence bounds, leading to lower regret. However, existing works assume that control variates are available in every round, which may not be realistic in several real-life scenarios. To address this, we propose UCB-LCV, an upper confidence bound (UCB) based algorithm that effectively combines the estimators obtained from rewards and control variates. When there is no control variate, UCB-LCV leads to a novel algorithm that we call UCB-NORMAL, which outperforms existing algorithms for the standard MAB setting with normally distributed rewards. Finally, we discuss variants of the proposed UCB-LCV that apply to general distributions and experimentally demonstrate that UCB-LCV outperforms existing bandit algorithms.
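
To make the role of control variates concrete, the following sketch shows the standard control-variate estimator of a mean reward. It is the textbook estimator only, not the UCB-LCV algorithm itself; the function name `control_variate_mean` and its parameters are illustrative assumptions. The idea is that an auxiliary observation `w` with known mean, correlated with the reward `x`, lets the learner subtract out part of the sampling noise.

```python
import numpy as np

def control_variate_mean(x, w, w_mean):
    """Estimate E[x] using a control variate w whose mean w_mean is known.

    Returns mean(x) + beta * (w_mean - mean(w)), where beta is the sample
    covariance of (x, w) divided by the sample variance of w.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    var_w = np.var(w, ddof=1)
    if var_w == 0.0:
        # A constant control variate carries no information: fall back
        # to the plain sample mean.
        return float(x.mean())
    beta = np.cov(x, w, ddof=1)[0, 1] / var_w
    return float(x.mean() + beta * (w_mean - w.mean()))
```

When `x` is an exact affine function of `w`, the correction removes the sampling error entirely; with weaker correlation the variance reduction is smaller, which is what tightens the confidence bounds mentioned in the abstract.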

Updated: 2026-03-02 17:20:46

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02100v1

Selecting Optimal Variable Order in Autoregressive Ising Models

Autoregressive models enable tractable sampling from learned probability distributions, but their performance depends critically on the variable ordering used in the factorization, because the ordering determines the complexity of the resulting conditional distributions. We propose to learn the Markov random field describing the underlying data, and use the inferred graphical model structure to construct optimized variable orderings. We illustrate our approach on two-dimensional image-like models where a structure-aware ordering leads to restricted conditioning sets, thereby reducing model complexity. Numerical experiments on Ising models with discrete data demonstrate that graph-informed orderings yield higher-fidelity generated samples compared to naive variable orderings.
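
A small worked example shows how graph structure restricts conditioning sets. The sketch below uses an open one-dimensional Ising chain (a simpler case than the two-dimensional models in the abstract) and checks by brute force that, under the natural ordering, the conditional of the last spin given all earlier spins depends only on its neighbor. The function names and the coupling value `J` are illustrative assumptions.

```python
import itertools
import math

def chain_ising_probs(n, J=0.7):
    """Exact Boltzmann probabilities of an open n-spin Ising chain, no field."""
    states = list(itertools.product([-1, 1], repeat=n))
    weights = [
        math.exp(J * sum(s[i] * s[i + 1] for i in range(n - 1)))
        for s in states
    ]
    Z = sum(weights)  # partition function
    return {s: w / Z for s, w in zip(states, weights)}

def conditional_up(probs, i, given):
    """P(x_i = +1 | x_j = v for every (j, v) in `given`), by enumeration."""
    def matches(s):
        return all(s[j] == v for j, v in given.items())
    den = sum(p for s, p in probs.items() if matches(s))
    num = sum(p for s, p in probs.items() if matches(s) and s[i] == 1)
    return num / den
```

For a four-spin chain, `conditional_up(probs, 3, {0: -1, 1: 1, 2: 1})` and `conditional_up(probs, 3, {2: 1})` agree, so an autoregressive model following this ordering needs only one parent per spin. An ordering that ignores the graph would generally require larger conditioning sets.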

Updated: 2026-03-02 17:18:18

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2602.20394v2

HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs

The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce HalluGuard, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard on 10 diverse benchmarks, 11 competitive baselines, and 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations. We open-source our proposed HalluGuard model at https://github.com/Susan571/HalluGuard-ICLR2026.

Updated: 2026-03-02 17:18:04

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2601.18753v2

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.

Updated: 2026-03-02 17:16:47

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.02096v1

On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective

We study the convergence dynamics of Gradient Descent (GD) in a minimal binary classification setting, consisting of a two-neuron ReLU network and two training instances. We prove that even under these strong simplifying assumptions, while GD successfully converges to an optimal robustness margin, effectively maximizing the distance between the decision boundary and the training points, this convergence occurs at a prohibitively slow rate, scaling strictly as $\Theta(1/\ln(t))$. To the best of our knowledge, this establishes the first explicit lower bound on the convergence rate of the robustness margin in a non-linear model. Through empirical simulations, we further demonstrate that this inherent failure mode is pervasive, exhibiting the exact same tight convergence rate across multiple natural network initializations. Our theoretical guarantees are derived via a rigorous analysis of the GD trajectories across the distinct activation patterns of the model. Specifically, we develop tight control over the system's dynamics to bound the trajectory of the decision boundary, overcoming the primary technical challenge introduced by the non-linear nature of the architecture.

Updated: 2026-03-02 17:13:33

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02095v1

FedHB: Hierarchical Bayesian Federated Learning

We propose a novel hierarchical Bayesian approach to Federated Learning (FL), in which the generative process of clients' local data is described via hierarchical Bayesian modeling: the clients' local models are random variables governed by a higher-level global variate. Interestingly, the variational inference in our Bayesian model leads to an optimisation problem whose block-coordinate descent solution becomes a distributed algorithm that is separable over clients and allows them not to reveal their own private data at all, making it fully compatible with FL. We also highlight that our block-coordinate algorithm has particular forms that subsume well-known FL algorithms, including Fed-Avg and Fed-Prox, as special cases. Beyond introducing novel modeling and derivations, we also offer a convergence analysis showing that our block-coordinate FL algorithm converges to a (local) optimum of the objective at the rate of $O(1/\sqrt{t})$, the same rate as regular (centralised) SGD, as well as a generalisation error analysis in which we prove that the test error of our model on unseen data is guaranteed to vanish as we increase the training data size, making it asymptotically optimal.
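
The abstract states that the block-coordinate algorithm subsumes Fed-Avg as a special case. As a point of reference only, the sketch below shows the standard Fed-Avg server step (a sample-size-weighted average of client parameters), not the hierarchical Bayesian algorithm itself. The function name `fedavg_aggregate` and the flat-list parameter layout are illustrative assumptions.

```python
from typing import List, Sequence


def fedavg_aggregate(client_params: Sequence[Sequence[float]],
                     client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size.

    Hypothetical helper: parameters are assumed to be flat lists of floats.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        if len(params) != dim:
            raise ValueError("all clients must share the same parameter shape")
        weight = size / total  # share of the global data held by this client
        for j, value in enumerate(params):
            aggregated[j] += weight * value
    return aggregated
```

Only parameter vectors and sample counts reach the server here, which mirrors the property highlighted in the abstract that clients never reveal their raw data.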

Updated: 2026-03-02 17:12:02

Categories: cs.LG,cs.DC,stat.ML

Download: http://arxiv.org/abs/2305.04979v2

Dense-Jump Flow Matching with Non-Uniform Time Scheduling for Robotic Policies: Mitigating Multi-Step Inference Degradation

Flow matching has emerged as a competitive framework for learning high-quality generative policies in robotics; however, we find that generalisation arises and saturates early along the flow trajectory, in accordance with recent findings in the literature. We further observe that increasing the number of Euler integration steps during inference counter-intuitively and universally degrades policy performance. We attribute this to (i) additional, uniformly spaced integration steps oversampling the late-time region, thereby constraining actions towards the training trajectories and reducing generalisation; and (ii) the learned velocity field becoming non-Lipschitz as integration time approaches 1, causing instability. To address these issues, we propose a novel policy that utilises non-uniform time scheduling (e.g., U-shaped) during training, which emphasises both early and late temporal stages to regularise policy training, and a dense-jump integration schedule at inference, which uses a single-step integration to replace the multi-step integration beyond a jump point, to avoid unstable areas around 1. Essentially, our policy is an efficient one-step learner that still pushes forward performance through multi-step integration, yielding up to 23.7% performance gains over state-of-the-art baselines across diverse robotic tasks.
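
The dense-jump inference schedule described above can be pictured as a time grid that takes several Euler steps up to a jump point and then a single step to time 1. The sketch below is one possible reading of that description, not the authors' implementation: the uniform spacing before the jump, the scalar state, and the names `dense_jump_times` and `integrate` are all assumptions, and `velocity` stands in for a learned velocity field.

```python
from typing import Callable, List


def dense_jump_times(num_dense_steps: int, jump_point: float) -> List[float]:
    """Uniform grid on [0, jump_point], followed by one jump to t = 1."""
    if num_dense_steps < 1 or not 0.0 < jump_point < 1.0:
        raise ValueError("need num_dense_steps >= 1 and 0 < jump_point < 1")
    dense = [jump_point * i / num_dense_steps for i in range(num_dense_steps + 1)]
    return dense + [1.0]


def integrate(velocity: Callable[[float, float], float],
              x0: float, times: List[float]) -> float:
    """Explicit Euler integration of dx/dt = velocity(x, t) along `times`."""
    x = x0
    for t_start, t_end in zip(times, times[1:]):
        # The velocity is only ever evaluated at the start of each interval,
        # so the final jump never queries the field at times after jump_point.
        x += (t_end - t_start) * velocity(x, t_start)
    return x
```

With `dense_jump_times(4, 0.8)` the field is evaluated at five times in [0, 0.8], and the last evaluation carries the state across the remaining interval in one step.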

Updated: 2026-03-02 17:11:33

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2509.13574v2

Adam Converges Without Any Modification On Update Rules

Adam is the default algorithm for training neural networks, including large language models (LLMs). However, Reddi et al. (2019) provided an example in which Adam diverges, raising concerns about its deployment in AI model training. We identify a key mismatch between the divergence example and practice: Reddi et al. (2019) pick the problem after picking the hyperparameters of Adam, i.e., $(β_1,β_2)$, while practical applications often fix the problem first and then tune $(β_1,β_2)$. In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when $β_2$ is large and $β_1 < \sqrt{β_2}$. Second, when $β_2$ is small, we point out a region of $(β_1,β_2)$ combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to convergence when changing the $(β_1, β_2)$ combination. To our knowledge, this is the first phase transition in the 2D $(β_1,β_2)$ plane reported in the literature, providing rigorous theoretical guarantees for the Adam optimizer. We further point out that the critical boundary $(β_1^*, β_2^*)$ is problem-dependent and, in particular, dependent on batch size. This provides suggestions on how to tune $β_1$ and $β_2$: when Adam does not work well, we suggest tuning up $β_2$ inversely with batch size to surpass the threshold $β_2^*$, and then trying $β_1< \sqrt{β_2}$. Our suggestions are supported by reports from several empirical studies, which observe improved LLM training performance when applying them.
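
For readers who want the quantities in view, the sketch below writes out the unmodified Adam update with bias correction on a scalar parameter, together with a check of the condition $β_1 < \sqrt{β_2}$ quoted above. It is an illustration under assumptions: the helper names are made up for this example, and the check does not cover the problem-dependent threshold $β_2^*$, which the abstract says depends on the problem and batch size.

```python
import math
from typing import Callable


def satisfies_beta_condition(beta1: float, beta2: float) -> bool:
    """Check 0 <= beta1 < sqrt(beta2) < 1; says nothing about beta2 >= beta2*."""
    return 0.0 <= beta1 < math.sqrt(beta2) < 1.0


def adam_minimize(grad: Callable[[float], float], x0: float, lr: float = 0.01,
                  beta1: float = 0.9, beta2: float = 0.999,
                  eps: float = 1e-8, steps: int = 2000) -> float:
    """Run standard Adam (no modification of the update rule) on a scalar."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

A call such as `adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)` runs the update on a simple quadratic; the widely used defaults (0.9, 0.999) pass `satisfies_beta_condition`, whereas a pair such as (0.9, 0.5) does not.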

Updated: 2026-03-02 17:08:51

Categories: cs.LG,math.OC

Download: http://arxiv.org/abs/2603.02092v1

Learning from Synthetic Data Improves Multi-hop Reasoning

Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge -- a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.

Updated: 2026-03-02 17:08:43

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2603.02091v1

Impossibility of Depth Reduction in Explainable Clustering

Over the last few years Explainable Clustering has gathered a lot of attention. Dasgupta et al. [ICML'20] initiated the study of explainable $k$-means and $k$-median clustering problems where the explanation is captured by a threshold decision tree which partitions the space at each node using axis-parallel hyperplanes. Recently, Laber et al. [Pattern Recognition'23] made a case to consider the depth of the decision tree as an additional complexity measure of interest. In this work, we prove that even when the input points are in the Euclidean plane, any depth reduction in the explanation incurs unbounded loss in the $k$-means and $k$-median cost. Formally, we show that there exists a data set $X\subseteq \mathbb{R}^2$, for which there is a decision tree of depth $k-1$ whose $k$-means/$k$-median cost matches the optimal clustering cost of $X$, but every decision tree of depth less than $k-1$ has unbounded cost w.r.t. the optimal cost of clustering. We extend our results to the $k$-center objective as well, albeit with weaker guarantees.

Updated: 2026-03-02 17:08:19

Categories: cs.LG,cs.CC,cs.CG,cs.DS

Download: http://arxiv.org/abs/2305.02850v2

Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning Large Language Models

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent work has tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BAM (Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the E3 metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to 70% accuracy gains, 39% token reduction, and 193.8% improvement in E3. Notably, it improves the efficiency of a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at https://github.com/junhongmit/P-and-B.
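
The abstract does not spell out the scheduling rule, so the following is only a guess at the simplest form such an allocation could take: splitting a total token budget across sub-questions in proportion to an estimated complexity score. The function `allocate_budget` and the proportional rule are assumptions made for illustration and should not be read as the Plan-and-Budget algorithm.

```python
from typing import List, Sequence


def allocate_budget(complexities: Sequence[float], total_tokens: int) -> List[int]:
    """Split total_tokens across sub-questions in proportion to complexity."""
    if total_tokens < 0 or not complexities or any(c <= 0 for c in complexities):
        raise ValueError("need a non-negative budget and positive complexities")
    total = sum(complexities)
    # Round each share down, then hand out the remainder one token at a time.
    budgets = [int(total_tokens * c / total) for c in complexities]
    leftover = total_tokens - sum(budgets)
    hardest_first = sorted(range(len(complexities)),
                           key=lambda i: complexities[i], reverse=True)
    for i in hardest_first[:leftover]:
        budgets[i] += 1
    return budgets
```

Unlike a fixed per-query cap, a rule of this kind gives easy sub-questions small budgets and reserves more tokens for the hard ones, which is the trade-off between overthinking and underthinking that the abstract describes.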

Updated: 2026-03-02 17:07:25

Categories: cs.LG

Download: http://arxiv.org/abs/2505.16122v3

Detection-Gated Glottal Segmentation with Zero-Shot Cross-Dataset Transfer and Clinical Feature Extraction

Background: Accurate glottal segmentation in high-speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non-glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection-gated pipeline that integrates a YOLOv8-based detector with a U-Net segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and instrument occlusion. The model was trained on a limited subset of the GIRAFE dataset (600 frames) and evaluated via zero-shot transfer on the large-scale BAGLS dataset. Results: The pipeline achieved state-of-the-art performance on the GIRAFE benchmark (DSC 0.81) and demonstrated superior generalizability on BAGLS (DSC 0.85, in-distribution) without institutional fine-tuning. Downstream validation on a 65-subject clinical cohort confirmed that automated kinematic features (Open Quotient, coefficient of variation) remained consistent with established clinical benchmarks. The coefficient of variation (CV) of the glottal area was found to be a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: The detection-gated architecture provides a lightweight, computationally efficient solution (~35 frames/s) for real-time clinical use. By enabling robust zero-shot transfer, this framework facilitates the standardized, large-scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at https://github.com/hari-krishnan/openglottal.
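
Two ideas from this abstract lend themselves to a small sketch: gating the segmenter's output on a detector decision, and summarising the resulting glottal-area series by its coefficient of variation (standard deviation divided by mean). The code below is a hedged illustration. The 0.5 threshold, the use of the population standard deviation, and the function names are assumptions, and the paper's temporal consistency wrapper is not modelled.

```python
import statistics
from typing import List, Sequence


def gate_areas(areas: Sequence[float], detector_scores: Sequence[float],
               threshold: float = 0.5) -> List[float]:
    """Zero the segmented area in frames where the detector sees no glottis."""
    if len(areas) != len(detector_scores):
        raise ValueError("need one detector score per frame")
    return [a if s >= threshold else 0.0 for a, s in zip(areas, detector_scores)]


def coefficient_of_variation(areas: Sequence[float]) -> float:
    """Population standard deviation of the area series divided by its mean."""
    mean = statistics.fmean(areas)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return statistics.pstdev(areas) / mean
```

Gating before feature extraction means that spurious masks in non-glottal frames contribute an area of zero rather than noise, which is the failure mode the abstract says the pipeline is designed to suppress.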

Updated: 2026-03-02 17:05:41

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.02087v1

GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered

Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these systems are difficult to extend due to their internal complexity, and developing new systems requires substantial engineering effort and cost. In this paper, we argue that recent advances in Large Language Models (LLMs) are starting to shape the next generation of query processing systems. We propose using LLMs to synthesize execution code for each incoming query, instead of continuously building, extending, and maintaining complex query processing engines. As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources. We implemented an early prototype of GenDB that uses Claude Code Agent as the underlying component in the multi-agent system, and we evaluate it on OLAP workloads. We use queries from the well-known TPC-H benchmark and also construct a new benchmark designed to reduce potential data leakage from LLM training data. We compare GenDB with state-of-the-art query engines, including DuckDB, Umbra, MonetDB, ClickHouse, and PostgreSQL. GenDB achieves significantly better performance than these systems. Finally, we discuss the current limitations of GenDB and outline future extensions and related research challenges.

Updated: 2026-03-02 17:03:43

Categories: cs.DB,cs.AI,cs.CL,cs.LG,cs.MA

Download: http://arxiv.org/abs/2603.02081v1

From Pixels to Patches: Pooling Strategies for Earth Embeddings

As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can drop accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed: 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Our results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increase accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling: it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling (concatenation of min/max/mean/std pooling) performs best at 4x the embedding size. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
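
The two recommended pooling schemes are easy to state concretely. The sketch below uses the common definition of generalized mean pooling, $(\frac{1}{N}\sum_i x_i^p)^{1/p}$ per channel, and a Stats pooling that concatenates per-channel min, max, mean, and standard deviation. Several details are assumptions rather than facts from the abstract: the exponent `p = 3`, the clamp to a small positive `eps` (which presumes non-negative inputs), and the use of the population standard deviation.

```python
import statistics
from typing import List, Sequence

Pixels = Sequence[Sequence[float]]  # one embedding vector per pixel


def gem_pool(pixels: Pixels, p: float = 3.0, eps: float = 1e-6) -> List[float]:
    """Generalized mean per channel; p = 1 reduces to ordinary mean pooling."""
    n, dim = len(pixels), len(pixels[0])
    return [(sum(max(px[c], eps) ** p for px in pixels) / n) ** (1.0 / p)
            for c in range(dim)]


def stats_pool(pixels: Pixels) -> List[float]:
    """Concatenate per-channel min, max, mean, and std (4x the input width)."""
    channels = [[px[c] for px in pixels] for c in range(len(pixels[0]))]
    return ([min(ch) for ch in channels] + [max(ch) for ch in channels]
            + [statistics.fmean(ch) for ch in channels]
            + [statistics.pstdev(ch) for ch in channels])
```

The output widths show the trade-off the abstract points to: `gem_pool` keeps the embedding dimensionality unchanged, while `stats_pool` returns a vector four times as wide.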

Updated: 2026-03-02 17:03:37

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2603.02080v1

Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Defending against novel jailbreaks represents a critical challenge in AI safety. Adversarial training, designed to make models robust against worst-case perturbations, has been the dominant paradigm for adversarial robustness. However, due to optimization challenges and difficulties in defining realistic threat models, adversarial training methods often fail on newly developed jailbreaks in practice. This paper proposes a new paradigm for improving robustness against unseen jailbreaks, centered on the Adversarial Déjà Vu hypothesis: novel jailbreaks are not fundamentally new, but largely recombinations of adversarial skills from previous attacks. We study this hypothesis through a large-scale analysis of 32 attack papers published over two years. Using an automated pipeline, we extract and compress adversarial skills into a sparse dictionary of primitives, with LLMs generating human-readable descriptions. Our analysis reveals that unseen attacks can be effectively explained as sparse compositions of earlier skills, with explanatory power increasing monotonically as skill coverage grows. Guided by this insight, we introduce Adversarial Skill Compositional Training (ASCoT), which trains on diverse compositions of skill primitives rather than isolated attack instances. ASCoT substantially improves robustness to unseen attacks, including multi-turn jailbreaks, while maintaining low over-refusal rates. We also demonstrate that expanding adversarial skill coverage, not just data scale, is key to defending against novel attacks. Warning: This paper contains content that may be harmful or offensive in nature.

Updated: 2026-03-02 16:59:21

Categories: cs.LG

Download: http://arxiv.org/abs/2510.21910v3

Cognitive Prosthetic: An AI-Enabled Multimodal System for Episodic Recall in Knowledge Work

Modern knowledge workplaces increasingly strain human episodic memory as individuals navigate fragmented attention, overlapping meetings, and multimodal information streams. Existing workplace tools provide partial support through note-taking or analytics but rarely integrate cognitive, physiological, and attentional context into retrievable memory representations. This paper presents the Cognitive Prosthetic Multimodal System (CPMS), an AI-enabled proof-of-concept designed to support episodic recall in knowledge work through structured episodic capture and natural language retrieval. CPMS synchronizes speech transcripts, physiological signals, and gaze behavior into temporally aligned, JSON-based episodic records processed locally for privacy. Beyond data logging, the system includes a web-based retrieval interface that allows users to query past workplace experiences using natural language, referencing semantic content, time, attentional focus, or physiological state. We present CPMS as a functional proof-of-concept demonstrating the technical feasibility of transforming heterogeneous sensor data into queryable episodic memories. The system is designed to be modular, supporting operation with partial sensor configurations, and incorporates privacy safeguards for workplace deployment. This work contributes an end-to-end, privacy-aware architecture for AI-enabled memory augmentation in workplace settings.

Updated: 2026-03-02 16:58:53

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2603.02072v1

InstructPro: Natural Language Guided Ligand-Binding Protein Design

The de novo design of ligand-binding proteins with tailored functions is essential for advancing biotechnology and molecular medicine, yet existing AI approaches are limited by scarce protein-ligand complex data. To circumvent this data bottleneck, we leverage the abundant natural language descriptions characterizing protein-ligand interactions. Here, we introduce InstructPro, a family of generative models that design proteins following the guidance of natural language instructions and ligand formulas. InstructPro produces protein sequences consistent with specified function descriptions and ligand targets. To enable training and evaluation, we develop InstructProBench, a large-scale dataset of 9.6 million (function description, ligand, protein) triples. We train two model variants -- InstructPro-1B and InstructPro-3B -- that substantially outperform strong baselines. InstructPro-1B achieves an AlphaFold3 ipTM of 0.918 and a binding affinity of -8.764 on seen ligands, while maintaining robust performance in a zero-shot setting with scores of 0.869 and -6.713, respectively. These results are accompanied by novelty scores of 70.1% and 68.8%, underscoring the model's ability to generalize beyond the training set. Furthermore, the model yields a superior binding free energy of -20.9 kcal/mol and an average of 5.82 intermolecular hydrogen bonds, validating its proficiency in designing high-affinity ligand-binding proteins. Notably, scaling to InstructPro-3B further improves the zero-shot ipTM to 0.882, binding affinity to -6.797, and binding free energy to -25.8 kcal/mol, demonstrating clear performance gains associated with increased model capacity. These findings highlight the power of natural language-guided generative models to mitigate the data bottlenecks in traditional structure-based methods, significantly broadening the scope of de novo protein design.

Updated: 2026-03-02 16:58:45

Categories: cs.LG,cs.CE,cs.CL

Download: http://arxiv.org/abs/2506.09332v3

Subcubic Coin Tossing in Asynchrony without Setup

We consider an asynchronous network of $n$ parties connected to each other via secure channels, up to $t$ of which are byzantine. We study common coin tossing, a task where the parties try to agree on an unpredictable random value, with some chance of failure due to the byzantine parties' influence. Coin tossing is a well-known and often studied task due to its use in byzantine agreement. In this work, we present an adaptively secure committee-based method that, roughly speaking, turns strong but costly common coins into cheaper but lower-quality ones. For all $k > 2$ and $\varepsilon > 0$, we show how to use a strong (very rarely failing) coin that costs $\widetilde{O}(n^k)$ bits of communication to get a cheaper coin that costs $\widetilde{O}(\varepsilon^{-2k}n^{3 - 2/k})$ bits of communication. This latter coin tolerates $\varepsilon n$ fewer byzantine parties than the former, and it fails with an arbitrarily small constant probability. For any $\varepsilon > 0$, our method allows us to get a perfectly secure binary coin that tolerates $t \leq (\frac{1}{4} - \varepsilon)n$ faults with $O(n^{2.5}(\varepsilon^{-8} + \log n))$ messages of size $O(\log n)$, as well as a setup-free cryptographically secure binary coin that tolerates $t \leq (\frac{1}{3} - \varepsilon)n$ faults with $O(n^{7/3}\varepsilon^{-6}κ\log n)$ bits of communication (where $κ = Ω(\log n)$ is a cryptographic security parameter). These coins both have $O(\log n)$ latency. They are to our knowledge the first setup-free coins that cost $o(n^3)$ bits of communication but still succeed with at least constant probability against $t = Θ(n)$ adaptive byzantine faults. As such, they for the first time enable setup-free (and even perfectly secure) asynchronous byzantine agreement with $o(n^3)$ communication against $Θ(n)$ adaptive byzantine faults.
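
The general bound $\widetilde{O}(\varepsilon^{-2k}n^{3 - 2/k})$ and the two concrete coins can be checked against each other with a little arithmetic. The sketch below computes the exponents for a given $k$. Reading the perfectly secure coin as the case $k = 4$ and the cryptographic coin as $k = 3$ is an inference from the matching exponents, not something the abstract states, and the function name is made up for this example.

```python
from fractions import Fraction
from typing import Tuple


def cheap_coin_exponents(k: int) -> Tuple[Fraction, int]:
    """Exponents in the cost bound eps^(-2k) * n^(3 - 2/k).

    Returns (exponent of n, exponent of 1/eps) for an integer k > 2.
    """
    if k <= 2:
        raise ValueError("the bound is stated for k > 2")
    return Fraction(3) - Fraction(2, k), 2 * k
```

For $k = 4$ this gives $n^{5/2}$ and $\varepsilon^{-8}$, and for $k = 3$ it gives $n^{7/3}$ and $\varepsilon^{-6}$, in line with the $n^{2.5}\varepsilon^{-8}$ and $n^{7/3}\varepsilon^{-6}$ terms quoted above. The exponent of $n$ is below 3 for every $k$, which is what makes the resulting coins subcubic.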

Updated: 2026-03-02 16:58:44

Categories: cs.DC,cs.CR

Download: http://arxiv.org/abs/2603.02071v1

Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning

When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users' questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations. We also describe an instantiation of this framework for goal-conflict explanations, which we use to conduct a user study comparing the LLM-powered interaction with a baseline template-based explanation interface.

Updated: 2026-03-02 16:58:18

Categories: cs.AI,cs.CL,cs.HC,cs.MA

Download: http://arxiv.org/abs/2603.02070v1

Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope, when feature decay is fast but target decay is slow.
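
As a reminder of the update being analysed, the sketch below runs one-pass signSGD on a linear model with squared loss: each sample is seen once, and every coordinate moves by a fixed step in the direction opposite to the sign of its stochastic gradient. It is a bare-bones illustration under assumptions; the power-law random features model, the Gaussian sketching, and the learning-rate schedules from the abstract are not reproduced, and the function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def one_pass_signsgd(samples: Iterable[Tuple[Sequence[float], float]],
                     dim: int, lr: float) -> List[float]:
    """One pass of signSGD for least-squares linear regression."""
    w = [0.0] * dim
    for x, y in samples:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        # The gradient of 0.5 * residual^2 is residual * x; only its sign is used.
        w = [wi - lr * sign(residual * xi) for wi, xi in zip(w, x)]
    return w
```

Because every coordinate moves by exactly `lr` per step regardless of gradient magnitude, the step size alone sets both the speed of approach and the size of the final fluctuation, which gives some intuition for why the learning rate and its schedule figure so prominently in the analysis.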

Updated: 2026-03-02 16:58:02

Categories: cs.LG,cs.AI,math.OC,stat.ML

Download: http://arxiv.org/abs/2603.02069v1

Accelerating PDE Surrogates via RL-Guided Mesh Optimization

Deep surrogate models for parametric partial differential equations (PDEs) can deliver high-fidelity approximations but remain prohibitively data-hungry: training often requires thousands of fine-grid simulations, each incurring substantial computational cost. To address this challenge, we introduce RLMesh, an end-to-end framework for efficient surrogate training under a limited simulation budget. The key idea is to use reinforcement learning (RL) to adaptively allocate mesh grid points non-uniformly within each simulation domain, focusing numerical resolution in regions most critical for accurate PDE solutions. A lightweight proxy model further accelerates RL training by providing efficient reward estimates without full surrogate retraining. Experiments on PDE benchmarks demonstrate that RLMesh achieves accuracy competitive with baselines while using substantially fewer simulation queries. These results show that solver-level spatial adaptivity can dramatically improve the efficiency of surrogate training pipelines, enabling practical deployment of learning-based PDE surrogates across a wide range of problems.

Updated: 2026-03-02 16:55:08

标题: 通过RL引导的网格优化加速PDE替代方案

摘要: 深度代理模型用于参数化偏微分方程(PDEs)可以提供高保真度的近似,但仍然需要大量数据:训练通常需要成千上万次精细网格模拟,每次都需要大量的计算成本。为了解决这一挑战,我们引入了RLMesh,这是一个在有限模拟预算下进行高效代理训练的端到端框架。关键思想是使用强化学习(RL)在每个模拟域内自适应地非均匀分配网格点,将数值分辨率集中在对准确PDE解最关键的区域。一个轻量级的代理模型进一步加速了RL训练,通过提供高效的奖励估计,而无需完全重新训练代理模型。对PDE基准测试的实验表明,RLMesh实现了与基线相当的准确性,但需要较少的模拟查询。这些结果表明,求解器级别的空间适应性可以显著提高代理训练流程的效率,从而实现学习型PDE代理在各种问题上的实际部署。

更新时间: 2026-03-02 16:55:08

领域: cs.LG

下载: http://arxiv.org/abs/2603.02066v1

German General Social Survey Personas: A Survey-Derived Persona Prompt Collection for Population-Aligned LLM Studies

The use of Large Language Models (LLMs) for simulating human perspectives via persona prompting is gaining traction in computational social science. However, well-curated, empirically grounded persona collections remain scarce, limiting the accuracy and representativeness of such simulations. Here, we introduce the German General Social Survey Personas (GGSS Personas) collection, a comprehensive and representative persona prompt collection built from the German General Social Survey (ALLBUS). The GGSS Personas and their persona prompts are designed to be easily plugged into prompts for all types of LLMs and tasks, steering models to generate responses aligned with the underlying German population. We evaluate GGSS Personas by prompting various LLMs to simulate survey response distributions across diverse topics, demonstrating that GGSS Personas-guided LLMs outperform state-of-the-art classifiers, particularly under data scarcity. Furthermore, we analyze how the representativity and attribute selection within persona prompts affect alignment with population responses. Our findings suggest that GGSS Personas provide a potentially valuable resource for research on LLM-based social simulations that enables more systematic explorations of population-aligned persona prompting in NLP and social science research.

Updated: 2026-03-02 16:52:05

标题: 德国一般社会调查人物:为与人口相匹配的LLM研究提供的调查导出的人物提示集

摘要: 在计算社会科学领域,使用大型语言模型(LLMs)通过人物提示来模拟人类视角正逐渐受到重视。然而,精心策划、经验丰富的人物集合仍然稀缺,限制了这种模拟的准确性和代表性。在这里,我们介绍了德国综合社会调查人物集合(GGSS Personas),这是一个全面且代表性的人物提示集合,建立于德国综合社会调查(ALLBUS)的基础上。GGSS Personas及其人物提示被设计为可以轻松地插入各种类型的LLMs和任务的提示中,引导模型生成与德国人口群体一致的响应。我们通过指导各种LLMs模拟跨多个主题的调查响应分布来评估GGSS Personas,结果表明,在数据稀缺的情况下,GGSS Personas引导的LLMs表现优于最先进的分类器。此外,我们分析了人物提示中的代表性和属性选择如何影响与人口响应的对齐。我们的研究结果表明,GGSS Personas为基于LLMs的社会模拟研究提供了潜在有价值的资源,可以更系统地探索NLP和社会科学研究中与人口对齐的人物提示。

更新时间: 2026-03-02 16:52:05

领域: cs.CL,cs.AI,cs.CY

下载: http://arxiv.org/abs/2511.21722v2

Never Saddle for Reparameterized Steepest Descent as Mirror Flow

How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods, including sign descent, which is closely related to Adam, we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity, and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, an uncommon regime in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning. Furthermore, we demonstrate that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations. Together, these results highlight two mechanisms by which steepest descent can aid modern optimization.

Updated: 2026-03-02 16:52:05

标题: 永不满足于重新参数化的最速下降作为镜像流

摘要: 选择优化算法如何影响模型学习特征的能力?为了回答这个问题,针对最陡下降方法,我们引入了最陡镜像流作为一个统一的理论框架,其中包括与Adam密切相关的符号下降。该框架揭示了优化几何如何影响学习动态、隐式偏差和稀疏性,并提供了两个解释为什么Adam和AdamW在微调中通常优于SGD。我们关注对角线性网络和深度对角线性重参数化(用于注意力的简化代理),我们展示了更陡的下降有利于逃离鞍点和特征学习。相比之下,梯度下降需要不现实地大的学习率来逃离鞍点,在微调中是一个罕见的情况。在实证上,我们确认逃离鞍点是微调中的一个核心挑战。此外,我们证明了像AdamW中的分离权重衰减通过强制新的平衡方程来稳定特征学习。总的来说,这些结果突出了最陡下降如何帮助现代优化的两种机制。

更新时间: 2026-03-02 16:52:05

领域: cs.LG

下载: http://arxiv.org/abs/2603.02064v1

OpenRad: a Curated Repository of Open-access AI models for Radiology

The rapid developments in artificial intelligence (AI) research in radiology have produced numerous models that are scattered across various platforms and sources, limiting discoverability, reproducibility and clinical translation. Herein, OpenRad was created: a curated, standardized, open-access repository that aggregates radiology AI models and provides details such as the availability of pretrained weights and interactive applications. Retrospective analysis of peer-reviewed literature and preprints indexed in PubMed, arXiv and Scopus was performed until Dec 2025 (n = 5239 records). Model records were generated using a locally hosted LLM (gpt-oss:120b), based on the RSNA AI Roadmap JSON schema, and manually verified by ten expert reviewers. Stability of LLM outputs was assessed on 225 randomly selected papers using text similarity metrics. A total of 1694 articles were included after review. Included models span all imaging modalities (CT, MRI, X-ray, US) and radiology subspecialties. Automated extraction demonstrated high stability for structured fields (Levenshtein ratio > 90%), with 78.5% of record edits being characterized as minor during expert review. Statistical analysis of the repository revealed CNN and transformer architectures as dominant, while MRI was the most commonly used modality (in 621 neuroradiology AI models). Research output was mostly concentrated in China and the United States. The OpenRad web interface enables model discovery via keyword search and filters for modality, subspecialty, intended use, verification status and demo availability, alongside live statistics. The community can contribute new models through a dedicated portal. OpenRad contains approx. 1700 open-access, curated radiology AI models with standardized metadata, supplemented with analysis of code repositories, thereby creating a comprehensive, searchable resource for the radiology community.

Updated: 2026-03-02 16:51:24

标题: OpenRad:一个为放射学提供开放获取AI模型的精选存储库

摘要: 放射学人工智能(AI)研究的快速发展在各种平台和来源上产生了众多模型,限制了可发现性、可重复性和临床转化。因此,我们创建了OpenRad,这是一个精心策划的、标准化的、开放获取的存储库,汇聚了放射学AI模型,并提供了预训练权重和交互应用程序的详细信息。我们回顾了PubMed、arXiv和Scopus索引的同行评审文献和预印本,直至2025年12月(n = 5239条记录)。模型记录是使用本地托管的LLM(gpt-oss:120b)生成的,基于RSNA AI Roadmap JSON模式,并由十位专家审阅人员手动验证。在225篇随机选取的论文中使用文本相似度指标评估了LLM输出的稳定性。在审查后共包括了1694篇文章。所包括的模型涵盖了所有成像模态(CT、MRI、X射线、超声)和放射学亚专业。自动提取表明结构化字段的稳定性很高(Levenshtein比率 > 90%),在专家审查中,78.5%的记录编辑被认定为次要。存储库的统计分析显示CNN和转换器架构占主导地位,而MRI是最常用的模态(在621个神经放射学AI模型中)。研究产出主要集中在中国和美国。OpenRad网络界面通过关键词搜索和模态、亚专业、预期用途、验证状态和演示可用性的筛选器实现了模型发现,同时提供了实时统计数据。社区可以通过专门的门户贡献新模型。OpenRad包含约1700个开放获取的、精心策划的放射学AI模型,具有标准化的元数据,并辅以对代码存储库的分析,从而为放射学社区创造了一个全面可搜索的资源。

更新时间: 2026-03-02 16:51:24

领域: cs.AI

下载: http://arxiv.org/abs/2603.02062v1

TRAKNN: Efficient Trajectory Aware Spatiotemporal kNN for Rare Meteorological Trajectory Detection

Extreme weather events, such as windstorms and heatwaves, are driven by persistent atmospheric circulation patterns that evolve over several consecutive days. While traditional circulation-based studies often focus on instantaneous atmospheric states, capturing the temporal evolution, or trajectory, of these spatial fields is essential for characterizing rare and potentially impactful atmospheric behavior. However, performing an exhaustive similarity search on multi-decadal, continental-scale gridded datasets presents significant computational and memory challenges. In this paper, we propose TRAKNN (TRajectory Aware KNN), a fully unsupervised and data-agnostic framework for detecting geometrically rare short trajectories in spatio-temporal data with an exact kNN approach. TRAKNN leverages a recurrence-based algorithm that decouples computational complexity from trajectory length, together with efficient batch operations that maximize computational intensity. These optimizations enable exhaustive analysis on standard workstations, either on CPU or on GPU. We evaluate our approach on 75 years of daily European sea-level pressure data. Our results illustrate that rare trajectories identified by TRAKNN correspond to physically coherent atmospheric anomalies and align with independent extreme-event databases.
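One way a recurrence can decouple cost from trajectory length is sketched below. It assumes the trajectory distance is a sum of per-day squared Euclidean distances; the abstract does not state the exact distance or recurrence, so this is an illustration only, and all function names are hypothetical.

```python
def frame_sq_dist(a, b):
    # Squared Euclidean distance between two flattened spatial fields.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trajectory_sq_dists(frames, L):
    """Squared distances between all pairs of length-L windows of `frames`.

    Uses D[i][j] = D[i-1][j-1] - d[i-1][j-1] + d[i+L-1][j+L-1], so each
    interior entry costs O(1) once the per-frame distances d are known,
    independent of L.
    """
    T = len(frames)
    n = T - L + 1  # number of length-L trajectories
    d = [[frame_sq_dist(frames[i], frames[j]) for j in range(T)] for i in range(T)]
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == 0 or j == 0:
                # Boundary entries are summed directly over the window.
                D[i][j] = sum(d[i + t][j + t] for t in range(L))
            else:
                # Slide the window: drop the oldest term, add the newest.
                D[i][j] = D[i - 1][j - 1] - d[i - 1][j - 1] + d[i + L - 1][j + L - 1]
    return D

def knn_rarity(D, k, exclude=0):
    # Rarity score: mean distance to the k nearest trajectories, skipping
    # windows within `exclude` steps of the query (they overlap it in time).
    scores = []
    for i, row in enumerate(D):
        cand = sorted(v for j, v in enumerate(row) if abs(i - j) > exclude)
        scores.append(sum(cand[:k]) / k)
    return scores
```

Trajectories with the largest `knn_rarity` score are the candidates for "geometrically rare" behavior in this toy formulation; the nested Python loops stand in for the batched CPU or GPU operations mentioned in the abstract.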

Updated: 2026-03-02 16:49:02

标题: TRAKNN:用于稀有气象轨迹检测的高效轨迹感知时空kNN

摘要: 极端天气事件,如风暴和热浪,是由持续几天的大气环流模式驱动的。传统的基于环流的研究通常关注瞬时的大气状态,捕捉这些空间场的时间演化或轨迹对于表征罕见且可能有影响的大气行为至关重要。然而,在多年的大陆尺度格网数据上执行详尽的相似性搜索面临着重要的计算和内存挑战。在本文中,我们提出了TRAKNN(TRajectory Aware KNN),这是一个完全无监督和数据不可知的框架,用于在时空数据中检测几何罕见的短轨迹,采用精确的kNN方法。TRAKNN利用基于重复的算法,将计算复杂性与轨迹长度和高效批处理操作分离,最大化计算强度。这些优化使得在标准工作站上进行详尽的分析成为可能,无论是在CPU上还是在GPU上。我们在75年的欧洲日海平面气压数据上评估了我们的方法。我们的结果表明,TRAKNN识别的罕见轨迹对应于物理上连贯的大气异常,并与独立的极端事件数据库相吻合。

更新时间: 2026-03-02 16:49:02

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2603.02059v1

Benchmarking Overton Pluralism in LLMs

We introduce OVERTONBENCH, a novel framework for measuring Overton pluralism in LLMs, i.e., the extent to which diverse viewpoints are represented in model outputs. We (i) formalize Overton pluralism as a set coverage metric (OVERTONSCORE), (ii) conduct a large-scale U.S.-representative human study (N = 1208; 60 questions; 8 LLMs), and (iii) develop an automated benchmark that closely reproduces human judgments. On average, models achieve OVERTONSCOREs of 0.35-0.41, with DeepSeek V3 performing best; yet all models remain far below the theoretical maximum of 1.0, revealing substantial headroom for improvement. Because repeated large-scale human studies are costly and slow, scalable evaluation tools are essential for model development. Hence, we propose an automated benchmark that achieves high rank correlation with human judgments ($\rho = 0.88$), providing a practical proxy without replacing human assessment. By turning pluralistic alignment from a normative aim into a measurable benchmark, our work establishes a foundation for systematic progress toward more pluralistic LLMs.
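A set coverage metric of this kind could look like the sketch below. The form used here (the fraction of distinct viewpoints on a question that a response represents, averaged over questions) is an assumption, not the paper's definition of OVERTONSCORE; the viewpoint labels and function names are hypothetical.

```python
def coverage(represented, viewpoints):
    # Fraction of the distinct viewpoints on a question that a response covers.
    viewpoints = set(viewpoints)
    if not viewpoints:
        return 0.0
    return len(viewpoints & set(represented)) / len(viewpoints)

def mean_coverage(per_question):
    # Average coverage over (represented, viewpoints) pairs, one per question.
    scores = [coverage(r, v) for r, v in per_question]
    return sum(scores) / len(scores)
```

Under this assumed form, a score of 1.0 would mean that every identified viewpoint is represented for every question, which matches the theoretical maximum quoted in the abstract.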

Updated: 2026-03-02 16:48:08

标题: 在LLM中的Overton多元主义基准对比

摘要: 我们介绍了OVERTONBENCH,一个用于衡量LLMs中Overton多元主义的新框架,即在模型输出中代表多样观点的程度。我们(i) 将Overton多元主义形式化为一个集合覆盖度量(OVERTONSCORE),(ii) 进行了一项大规模的美国代表性人类研究(N = 1208; 60个问题; 8个LLMs),(iii) 开发了一个自动化基准,能够接近再现人类判断。平均而言,模型的OVERTONSCORE为0.35-0.41,其中DeepSeek V3表现最佳;然而所有模型仍远低于理论最大值1.0,揭示了改进的巨大空间。由于重复大规模人类研究成本高且速度慢,可扩展的评估工具对于模型开发至关重要。因此,我们提出了一个自动化基准,与人类判断具有很高的排名相关性($ρ= 0.88$),提供了一个实用的代理,而不是取代人类评估。通过将多元主义对齐从规范目标转化为可测量的基准,我们的工作为系统性进步朝着更多元主义的LLMs奠定了基础。

更新时间: 2026-03-02 16:48:08

领域: cs.AI

下载: http://arxiv.org/abs/2512.01351v2

A Resource-Rational Principle for Modeling Visual Attention Control

Understanding how people allocate visual attention is central to Human-Computer Interaction (HCI), yet existing computational models of attention are often descriptive, task-specific, or difficult to interpret. My dissertation develops a resource-rational, simulation-based framework for modeling visual attention as a sequential decision-making process under perceptual, memory, and time constraints. I formalize visual tasks, such as reading and multitasking, as bounded-optimal control problems using Partially Observable Markov Decision Processes, enabling eye-movement behaviors such as fixation and attention switching to emerge from rational adaptation rather than being hand-coded or purely data-driven. These models are instantiated in simulation environments spanning traditional text reading and reading-while-walking with smart glasses, where they reproduce classic empirical effects, explain observed trade-offs between comprehension and safety, and generate novel predictions under time pressure and interface variation. Collectively, this work contributes a unified computational account of visual attention, offering new tools for theory-driven and resource-efficient HCI design.

Updated: 2026-03-02 16:45:50

标题: 一个资源合理的原则用于建模视觉注意力控制

摘要: 理解人们如何分配视觉注意力对于人机交互(HCI)至关重要,然而现有的注意力计算模型通常是描述性的、特定任务的,或者难以解释。我的论文开发了一个基于资源合理性的、基于模拟的框架,用于对视觉注意力建模,将其视为在感知、记忆和时间约束下的顺序决策过程。我将视觉任务,如阅读和多任务处理,形式化为使用部分可观察马尔可夫决策过程的有界最优控制问题,使眼动行为,如凝视和注意力转换,能够从理性适应中产生,而非手工编码或纯粹数据驱动。这些模型在模拟环境中得以实现,涵盖传统的文本阅读和佩戴智能眼镜行走时的阅读,它们重现了经典的实证效应,解释了观察到的理解和安全之间的权衡,并在时间压力和界面变化下生成了新的预测。总的来说,这项工作提供了一个统一的计算视觉注意力的解释,为基于理论和资源有效的HCI设计提供了新工具。

更新时间: 2026-03-02 16:45:50

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2603.02056v1

Strategic Advice in the Age of Personal AI

Personal AI assistants have changed how people use institutional and professional advice. We study this new strategic setting in which individuals may stochastically consult a personal AI whose recommendation is predictable to the focal advisor. Personal AI enters this strategic environment along two dimensions: how often it is consulted and how much weight it receives in the human's decision when consulted. Anticipating this, the advisor responds by counteracting the personal AI recommendation. Counteraction becomes more aggressive as personal AI is consulted more often. Yet advisor performance is non-monotone: equilibrium loss is highest at intermediate levels of adoption and vanishes when personal AI is never used or always used. Trust affects performance through a single relative influence index, and greater relative influence of personal AI increases advisor vulnerability. Extending the framework to costly credibility building, we characterize how personal AI adoption reshapes incentives to invest in trust.

Updated: 2026-03-02 16:45:43

标题: 个人AI时代的战略建议

摘要: 个人AI助手已经改变了人们使用机构和专业建议的方式。我们研究了这种新的战略环境,其中个人可能随机地咨询个人AI,其推荐对焦点顾问是可预测的。个人AI以两个维度进入这个战略环境:被咨询的频率以及在咨询时它所受到的权重。在预期到这一点,顾问通过反对个人AI的建议来作出回应。随着个人AI的频繁咨询,反对行为变得更加激烈。然而,顾问的绩效并非单调:在采纳程度中等时,均衡损失最高,并且在个人AI从不被使用或始终被使用时消失。信任通过单一相对影响指数影响绩效,个人AI的相对影响增加了顾问的脆弱性。将框架扩展到昂贵的信誉建设,我们描述了个人AI采用如何重塑投资信任的激励。

更新时间: 2026-03-02 16:45:43

领域: cs.LG,cs.GT,cs.HC

下载: http://arxiv.org/abs/2603.02055v1

How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, first providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.

Updated: 2026-03-02 16:45:12

标题: 光流和文本提示如何协同助力音视频语义分割?

摘要: 音视频语义分割(AVSS)代表了对音视频分割(AVS)任务的扩展,需要对音视频场景进行语义理解,不仅仅是在视觉像素级别识别发出声音的对象。与先前的方法论相反,通过将AVSS任务分解为两个离散子任务,首先提供提示的分割蒙版以促进后续语义分析,我们的方法创新地基于这一基础策略。我们引入了一种新颖的协作框架,Stepping Stone Plus(SSP),它集成了光流和文本提示,以帮助分割过程。在声源经常与移动物体共存的情况下,我们的预蒙版技术利用光流捕捉运动动态,为精确分割提供必要的时间上下文。为了应对静止的发声物体(如闹钟)带来的挑战,SSP结合了两个特定的文本提示:一个用于识别发声物体的类别,另一个提供场景的更广泛描述。此外,我们实现了一个视觉-文本对齐模块(VTA)以促进跨模态集成,提供更连贯和上下文相关的语义解释。我们的训练方案涉及一种后蒙版技术,旨在促使模型学习光流的图表。实验结果表明,SSP优于现有的AVS方法,提供高效和精确的分割结果。

更新时间: 2026-03-02 16:45:12

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.08133v2

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less 'effort' than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward. We progressively truncate a model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
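The truncate-answer-integrate procedure can be sketched as follows. The cutoff fractions, the number of samples per cutoff, the trapezoidal integration, and the `answer_and_score` callable (a stand-in for forcing the model to answer and scoring the result) are all assumptions for illustration, not the authors' implementation.

```python
def truncated_reasoning_auc(cot_tokens, answer_and_score,
                            fractions=(0.0, 0.25, 0.5, 0.75, 1.0), samples=4):
    """Area under the expected-reward vs. CoT-fraction curve.

    `answer_and_score(prefix)` is a hypothetical callable that forces the
    model to answer from the truncated chain-of-thought and returns a
    reward in [0, 1].
    """
    rewards = []
    for f in fractions:
        prefix = cot_tokens[: int(round(f * len(cot_tokens)))]
        # Monte Carlo estimate of the expected reward at this cutoff.
        rewards.append(sum(answer_and_score(prefix) for _ in range(samples)) / samples)
    # Trapezoidal rule over the fraction axis.
    auc = 0.0
    for k in range(1, len(fractions)):
        auc += 0.5 * (rewards[k] + rewards[k - 1]) * (fractions[k] - fractions[k - 1])
    return auc
```

A model that already earns the reward with an empty or very short prefix produces an area close to 1, while a model that needs its full reasoning produces a small area; thresholding this value is one simple way to flag likely shortcut-taking.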

Updated: 2026-03-02 16:43:38

标题: 思考还是作弊?通过衡量推理努力来检测隐性奖励欺骗

摘要: 奖励欺骗是指一种推理模型利用奖励函数中的漏洞来获得高奖励而不解决预期任务,构成了重要威胁。这种行为可能是显性的,即在模型的思维链中被表达出来,也可能是隐性的,即思维链看起来良性,从而绕过了思维链监视器。为了检测隐性的奖励欺骗,我们提出了TRACE(截断推理AUC评估)。我们的关键观察是,在利用漏洞比解决实际任务更容易时会发生欺骗。这意味着模型使用的“努力”比获得高奖励所需的要少。TRACE通过测量模型的推理何时变得足够以获得奖励来量化努力。我们逐步截断模型的思维链,在各种长度上强制模型进行答案,并估计每个截断点上的预期奖励。一个采取捷径的欺骗模型将仅用其思维链的很小一部分获得高预期奖励,从而产生准确率与长度曲线下的大面积。在数学推理方面,TRACE相比我们最强大的72B思维链监视器实现了超过65%的增益,在编码方面,相比32B监视器实现了超过30%的增益。我们进一步展示了TRACE在训练期间可以发现未知的漏洞。总的来说,TRACE提供了一个可扩展的无监督监督方法,用于监督当前监测方法无效的情况。

更新时间: 2026-03-02 16:43:38

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2510.01367v4

ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits

Under linear memory scaling, $N$ independent expert weight matrices are stored, requiring $\mathcal{O}(N \cdot d^2)$ memory, which exceeds the memory budget of edge devices. Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype to obtain each expert, the method requires $\mathcal{O}(d^2 + N \cdot d \log d)$ memory, sub-linear in the number of experts. The key insight: training these rotations with quantization reduces activation outliers and stabilizes extremely low-bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves 150$\times$ memory reduction at 256 experts with negligible accuracy loss. ButterflyMoE allows multiple experts to fit on edge-constrained devices, showing that geometric parameterization breaks linear scaling.
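To make the scaling comparison concrete, the sketch below counts parameters for the two expressions with the constants hidden by the $\mathcal{O}$ notation set to 1 and a base-2 logarithm assumed. It ignores bit widths entirely, so its ratios are not expected to match the reported 150$\times$ reduction; the function names and the example width `d = 1024` are assumptions.

```python
import math

def independent_experts_params(n_experts, d):
    # N independent d x d weight matrices: N * d^2 parameters.
    return n_experts * d * d

def shared_prototype_params(n_experts, d):
    # One shared d x d prototype plus about d * log2(d) rotation
    # parameters per expert: d^2 + N * d * log2(d).
    return d * d + n_experts * d * math.log2(d)

if __name__ == "__main__":
    d, n = 1024, 256
    ratio = independent_experts_params(n, d) / shared_prototype_params(n, d)
    print(f"parameter-count ratio at d={d}, N={n}: {ratio:.1f}")
```

With these assumed constants the per-expert cost drops from $d^2$ to roughly $d \log_2 d$, so adding experts grows the total far more slowly than storing a full matrix for each one.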

Updated: 2026-03-02 16:43:25

标题: 蝴蝶MoE:通过结构化蝴蝶轨道实现次线性三元专家

摘要: 线性内存扩展存储$N$个独立的专家权重矩阵,需要$\mathcal{O}(N \cdot d^2)$的内存,这超出了边缘设备的内存预算。当前的压缩方法如量化、剪枝和低秩分解可以减少常数因子,但无法解决扩展瓶颈。我们介绍了ButterflyMoE,这是一种方法,将专家视为共享量化基质的几何重新定位,而不是独立的权重矩阵。专家之间的多样性来自于查看共享容量的不同角度,而不是冗余存储。通过将学习到的旋转应用于共享的三值原型来得到每个专家,所需内存为$\mathcal{O}(d^2 + N \cdot d \log d)$,相对于专家数量是次线性的。关键洞察力:使用量化训练这些旋转可以减少激活异常值,并稳定极低比特训练,静态方法会崩溃。在语言建模基准测试中,ButterflyMoE在256个专家上实现了150倍的内存减少,准确性损失可以忽略不计。ButterflyMoE允许多个专家适合边缘受限设备,显示了几何参数化打破了线性扩展。

更新时间: 2026-03-02 16:43:25

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.13563v3

The Malignant Tail: Spectral Segregation of Label Noise in Over-Parameterized Networks

While implicit regularization facilitates benign overfitting in low-noise regimes, recent theoretical work predicts a sharp phase transition to harmful overfitting as the noise-to-signal ratio increases. We experimentally isolate the geometric mechanism of this transition: the Malignant Tail, a failure mode where networks functionally segregate signal and noise, reducing coherent semantic features into low-rank subspaces while pushing stochastic label noise into high-frequency orthogonal components, distinct from systematic or corruption-aligned noise. Through a Spectral Linear Probe of training dynamics, we demonstrate that Stochastic Gradient Descent (SGD) fails to suppress this noise, instead implicitly biasing it toward high-frequency orthogonal subspaces, effectively preserving signal-noise separability. We show that this geometric separation is distinct from simple variance reduction in untrained models. In trained networks, SGD actively segregates noise, allowing post-hoc Explicit Spectral Truncation (d << D) to surgically prune the noise-dominated subspace. This approach recovers the optimal generalization capability latent in the converged model. Unlike unstable temporal early stopping, Geometric Truncation provides a stable post-hoc intervention. Our findings suggest that under label noise, excess spectral capacity is not harmless redundancy but a latent structural liability that allows for noise memorization, necessitating explicit rank constraints to filter stochastic corruptions for robust generalization.

Updated: 2026-03-02 16:39:42

标题: 恶性尾巴:过度参数化网络中标签噪声的谱分离

摘要: 尽管隐式正则化在低噪声环境中有助于良性过拟合,但最近的理论工作预测随着噪声与信号比增加,将出现对有害过拟合的急剧相变。我们实验上分离了这一转变的几何机制:恶性尾巴,一种失败模式,网络在功能上将信号和噪声分隔开,将连贯的语义特征降低到低秩子空间,同时将随机标签噪声推入高频正交分量,与系统性或腐败对齐的噪声不同。通过训练动态的谱线性探针,我们展示了随机梯度下降(SGD)无法抑制这种噪声,而是将其隐含地偏向于高频正交子空间,有效地保持信号噪声的可分离性。我们表明这种几何分离与未经训练模型中简单方差减少是不同的。在经过训练的网络中,SGD积极地分离噪声,允许事后显式谱截断(d << D)来手术式修剪噪声主导的子空间。这种方法恢复了在收敛模型中潜在的最优泛化能力。与不稳定的时间早停止不同,几何截断提供了稳定的事后干预。我们的发现表明,在标签噪声下,过量的谱容量不是无害的冗余,而是允许噪声记忆的潜在结构性责任,需要显式秩约束来过滤随机腐败以实现鲁棒的泛化。

更新时间: 2026-03-02 16:39:42

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.02293v1

Universal Dynamics with Globally Controlled Analog Quantum Simulators

Analog quantum simulators with global control fields have emerged as powerful platforms for exploring complex quantum phenomena. Despite these advances, a fundamental theoretical question remains unresolved: to what extent can such systems realize universal quantum dynamics under global control? Here we establish a necessary and sufficient condition for universal quantum computation using only global pulse control, proving that a broad class of analog quantum simulators is, in fact, universal. We further extend this framework to fermionic and bosonic systems, including modern platforms such as ultracold atoms in optical superlattices. Moreover, we observe that analog simulators driven by random global pulses exhibit information scrambling comparable to random unitary circuits. In a dual-species neutral-atom array setup, the measurement outcomes anti-concentrate on a $\log N$ timescale despite the presence of only temporal randomness, opening opportunities for efficient randomness generation. To bridge theoretical possibility with experimental reality, we introduce direct quantum optimal control, a control framework that enables the synthesis of complex effective Hamiltonians while incorporating realistic hardware constraints. Using this approach, we experimentally engineer three-body interactions outside the blockade regime and demonstrate topological dynamics on a Rydberg-atom array. Experimental measurements reveal dynamical signatures of symmetry-protected-topological edge modes, confirming both the expressivity and feasibility of our method. Our work opens a new avenue for quantum simulation beyond native hardware Hamiltonians, enabling the engineering of effective multi-body interactions and advancing the frontier of quantum information processing with globally-controlled analog platforms.

Updated: 2026-03-02 16:37:51

标题: 具有全局控制的模拟量子模拟器的通用动力学

摘要: 使用全局控制场的模拟量子模拟器已成为探索复杂量子现象的强大平台。尽管取得了这些进展,但仍存在一个基本的理论问题尚未解决:在全局控制下,这些系统在多大程度上能实现通用量子动力学?在这里,我们建立了一个必要且充分的条件,证明了一类广泛的模拟量子模拟器实际上是通用的。我们进一步将这一框架扩展到费米子和玻色子系统,包括现代平台,如光学超晶格中的超冷原子。此外,我们观察到由随机全局脉冲驱动的模拟器表现出与随机幺正电路相媲美的信息混乱。在一个双物种中性原子阵列设置中,尽管仅存在时间随机性,测量结果在$\log N$时间尺度上反集中,为高效随机性生成开辟了机会。为了将理论可能性与实验现实联系起来,我们引入了直接量子最优控制,这是一个控制框架,可以在考虑现实硬件约束的同时合成复杂有效的哈密顿量。利用这种方法,我们在封锁区域之外实验性地工程化了三体相互作用,并在Rydberg原子阵列上展示了拓扑动力学。实验测量揭示了对称保护拓扑边缘模式的动力学特征,验证了我们方法的表达力和可行性。我们的工作为超越本机硬件哈密顿量的量子模拟打开了一条新途径,实现了有效多体相互作用的工程化,并推动了全局控制模拟平台量子信息处理前沿的发展。

更新时间: 2026-03-02 16:37:51

领域: quant-ph,cond-mat.quant-gas,cond-mat.str-el,cs.LG,eess.SY

下载: http://arxiv.org/abs/2508.19075v4

Improving the adaptive and continuous learning capabilities of artificial neural networks: Lessons from multi-neuromodulatory dynamics

Continuous, adaptive learning, the ability to adapt to the environment and keep improving performance, is a hallmark of natural intelligence. Biological organisms excel in acquiring, transferring, and retaining knowledge while adapting to volatile environments, making them a source of inspiration for artificial neural networks (ANNs). This study explores how neuromodulation, a building block of learning in biological systems, can help address catastrophic forgetting and enhance the robustness of ANNs in continual learning. Driven by neuromodulators including dopamine (DA), acetylcholine (ACh), serotonin (5-HT) and noradrenaline (NA), neuromodulatory processes in the brain operate at multiple scales, facilitating dynamic responses to environmental changes through mechanisms ranging from local synaptic plasticity to global network-wide adaptability. Importantly, the relationship between neuromodulators and their interplay in modulating sensory and cognitive processes is more complex than previously expected, demonstrating a "many-to-one" neuromodulator-to-task mapping. To inspire neuromodulation-aware learning rules, we highlight (i) how multi-neuromodulatory interactions enrich single-neuromodulator-driven learning, (ii) the impact of neuromodulators across multiple spatio-temporal scales, and correspondingly, (iii) strategies for approximating and integrating neuromodulated learning processes in ANNs. To illustrate these principles, we present a conceptual study to showcase how neuromodulation-inspired mechanisms, such as DA-driven reward processing and NA-based cognitive flexibility, can enhance ANN performance in a Go/No-Go task. Through multi-scale neuromodulation, we aim to bridge the gap between biological and artificial learning, paving the way for ANNs with greater flexibility, robustness, and adaptability.

Updated: 2026-03-02 16:37:38

标题: 提高人工神经网络的适应性和持续学习能力:多神经调节动态的启示

摘要: 持续的、自适应学习,即适应环境并持续改进表现的能力,是自然智能的标志。生物有机体在获取、转移和保留知识方面表现出色,同时适应不稳定的环境,这使它们成为人工神经网络(ANNs)的灵感源泉。本研究探讨了神经调节作为生物系统学习的基石,如何帮助解决灾难性遗忘,并增强ANNs在持续学习中的鲁棒性。受多巴胺(DA)、乙酰胆碱(ACh)、5-羟色胺(5-HT)和去甲肾上腺素(NA)等神经调节物质驱动,大脑中的神经调节过程在多个尺度上运作,通过从局部突触可塑性到全局网络适应性的机制促进对环境变化的动态响应。重要的是,神经调节物质之间的关系及其在调节感觉和认知过程中的相互作用比以往预期的更为复杂,展示了“多对一”的神经调节物质-任务映射。为了激发神经调节感知的学习规则,我们强调了(i)多神经调节相互作用如何丰富单一神经调节驱动的学习,(ii)神经调节物质跨多个时空尺度的影响,以及相应地,(iii)近似和整合神经调节学习过程在ANNs中的策略。为了阐明这些原则,我们提出了一个概念性研究,展示了如何通过神经调节启发的机制,如DA驱动的奖励处理和NA基础的认知灵活性,可以增强ANN在Go/No-Go任务中的表现。通过多尺度神经调节,我们旨在弥合生物和人工学习之间的差距,为具有更大灵活性、鲁棒性和适应性的ANNs铺平道路。

更新时间: 2026-03-02 16:37:38

领域: q-bio.NC,cs.LG,cs.NE

下载: http://arxiv.org/abs/2501.06762v3

"When to Hand Off, When to Work Together": Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction

Human collaborators coordinate dynamically through process visibility and workspace awareness, yet AI agents typically either provide only final outputs or expose read-only execution processes (e.g., planning, reasoning) without interpreting concurrent user actions on shared artifacts. Building on mixed-initiative interaction principles, we explore whether agents can achieve collaborative context awareness: interpreting concurrent user actions on shared artifacts and adapting in real time. Study 1 (N=10 professional designers) revealed that process visibility enabled reasoning about agent actions but exposed conflicts when agents could not distinguish feedback from independent work. We developed CLEO, which interprets collaborative intent and adapts in real time. Study 2 (N=10, two days, with stimulated recall interviews) analyzed 214 turns, identifying five action patterns, six triggers, and four enabling factors explaining when designers choose delegation (70.1%), direction (28.5%), or concurrent work (31.8%). We present a decision model with six interaction loops, design implications, and an annotated dataset.

Updated: 2026-03-02 16:37:05

标题: 何时交接,何时共同合作:通过并发交互拓展人-智能体协同创作合作

摘要: 人类合作者通过过程可见性和工作空间意识动态协调,然而AI代理通常只提供最终输出,或者仅暴露只读执行过程(例如,规划,推理),而不解释共享工件上的并发用户操作。基于混合倡议互动原则,我们探讨代理是否能实现协作上下文意识 - 解释共享工件上的并发用户操作,并实时适应。研究1(参与者为10名专业设计师)表明,过程可见性使人们能够推理代理动作,但当代理无法区分反馈和独立工作时暴露出冲突。我们开发了CLEO,该系统解释协作意图并实时适应。研究2(参与者为10人,为期两天,进行模拟回忆访谈)分析了214次交互,识别了五种行动模式,六种触发器和四个解释因素,解释了设计师何时选择委托(70.1%),指导(28.5%)或并发工作(31.8%)。我们提出了一个包含六个交互回路的决策模型,设计启示和一个带注释的数据集。

更新时间: 2026-03-02 16:37:05

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2603.02050v1

WAXAL: A Large-Scale Multilingual African Language Speech Corpus

The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 24 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with around 235 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.

Updated: 2026-03-02 16:35:19

标题: WAXAL:一个大规模的多语种非洲语言语音语料库

摘要: 语音技术的发展主要有利于资源丰富的语言,这为大多数撒哈拉以南非洲语言的使用者创造了重大的数字鸿沟。为了解决这一差距,我们推出了WAXAL,一个代表超过1亿使用者的24种语言的大规模、开放获取的语音数据集。该数据集包括两个主要组成部分:一个包含大约1250小时转录自多样化语音的自动语音识别(ASR)数据集,以及一个包含约235小时高质量、单人演讲的文本到语音(TTS)数据集,读取音素平衡脚本。本文详细介绍了我们的数据收集、注释和质量控制方法,其中涉及与四个非洲学术和社区组织的合作伙伴关系。我们提供了数据集的详细统计概述,并讨论了其潜在限制和伦理考虑。WAXAL数据集在https://huggingface.co/datasets/google/WaxalNLP下以宽松的CC-BY-4.0许可证发布,以促进研究,促进包容性技术的发展,并作为这些语言数字保存的重要资源。

更新时间: 2026-03-02 16:35:19

领域: eess.AS,cs.AI,cs.CL

下载: http://arxiv.org/abs/2602.02734v3

Expanding LLM Agent Boundaries with Strategy-Guided Exploration

Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low-level actions to higher-level language strategies. We thus propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.
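The idea of sampling candidate strategies at several temperatures in parallel can be illustrated with a toy categorical distribution standing in for the LLM. In this sketch the logits over a fixed list of candidate strategies, the temperature values, and the function names are all hypothetical; in the actual method strategies are free-form natural language generated by the model.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    # Softmax sampling; a lower temperature concentrates probability
    # mass on the highest-scoring candidate.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def mixed_temperature_strategies(logits, temperatures, seed=0):
    # Draw one candidate strategy index per temperature, as if each
    # temperature drove its own parallel rollout.
    rng = random.Random(seed)
    return [sample_with_temperature(logits, t, rng) for t in temperatures]
```

Calling `mixed_temperature_strategies(logits, [0.3, 1.0, 2.0])` mixes near-greedy draws with more exploratory ones in the same batch, which is the intuition behind using more than one temperature to increase strategy diversity.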

Updated: 2026-03-02 16:28:39

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02045v1

Hyperbolic Aware Minimization: Implicit Bias for Sparsity

Understanding the implicit bias of optimization algorithms is key to explaining and improving the generalization of deep models. The hyperbolic implicit bias induced by pointwise overparameterization promotes sparsity, but also yields a small inverse Riemannian metric near zero, slowing down parameter movement and impeding meaningful parameter sign flips. To overcome this obstacle, we propose Hyperbolic Aware Minimization (HAM), which alternates a standard optimizer step with a lightweight hyperbolic mirror step. The mirror step incurs less compute and memory than pointwise overparameterization, reproduces its beneficial hyperbolic geometry for feature learning, and mitigates the small-inverse-metric bottleneck. Our characterization of the implicit bias in the context of underdetermined linear regression provides insight into the mechanism by which HAM consistently increases performance, even in the case of dense training, as we demonstrate in experiments with standard vision benchmarks. HAM is especially effective in combination with different sparsification methods, advancing the state of the art.

Updated: 2026-03-02 16:28:33

Categories: cs.LG

Download: http://arxiv.org/abs/2506.02630v2

A Single Architecture for Representing Invariance Under Any Space Group

Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups, symmetries critical to modeling crystalline solids in materials science and condensed matter physics, this challenge is particularly salient as there are 230 such groups in three dimensions. In this work we present a new approach to such crystallographic symmetries by developing a single machine learning architecture that is capable of adapting its weights automatically to enforce invariance to any input space group. Our approach is based on constructing symmetry-adapted Fourier bases through an explicit characterization of constraints that group operations impose on Fourier coefficients. Encoding these constraints into a neural network layer enables weight sharing across different space groups, allowing the model to leverage structural similarities between groups and overcome data sparsity when limited measurements are available for specific groups. We demonstrate the effectiveness of this approach in achieving competitive performance on material property prediction tasks and performing zero-shot learning to generalize to unseen groups.

Updated: 2026-03-02 16:28:05

Categories: cs.LG

Download: http://arxiv.org/abs/2512.13989v2

Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation

Radiology report generation is critical for efficiency but current models lack the structured reasoning of experts, hindering clinical trust and explainability by failing to link visual findings to precise anatomical locations. This paper introduces BoxMed-RL, a groundbreaking unified training framework for generating spatially verifiable and explainable radiology reports. Built on a large vision-language model, BoxMed-RL revolutionizes report generation through two integrated phases: (1) In the Pretraining Phase, we refine the model via medical concept learning, using Chain-of-Thought supervision to internalize the radiologist-like workflow, followed by spatially verifiable reinforcement, which applies reinforcement learning to align medical findings with bounding boxes. (2) In the Downstream Adapter Phase, we freeze the pretrained weights and train a downstream adapter to ensure fluent and clinically credible reports. This framework precisely mimics radiologists' workflow, compelling the model to connect high-level medical concepts with definitive anatomical evidence. Extensive experiments on public datasets demonstrate that BoxMed-RL achieves an average 7% improvement in both METEOR and ROUGE-L metrics compared to state-of-the-art methods. An average 5% improvement in large language model-based metrics further underscores BoxMed-RL's robustness in generating high-quality radiology reports.

Updated: 2026-03-02 16:27:49

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.18453v2

Leave-One-Out Prediction for General Hypothesis Classes

Leave-one-out (LOO) prediction provides a principled, data-dependent measure of generalization, yet guarantees in fully transductive settings remain poorly understood beyond specialized models. We introduce Median of Level-Set Aggregation (MLSA), a general aggregation procedure based on empirical-risk level sets around the ERM. For arbitrary fixed datasets and losses satisfying a mild monotonicity condition, we establish a multiplicative oracle inequality for the LOO error of the form \[ LOO_S(\hat{h}) \;\le\; C \cdot \frac{1}{n} \min_{h\in H} L_S(h) \;+\; \frac{Comp(S,H,\ell)}{n}, \qquad C>1. \] The analysis is based on a local level-set growth condition controlling how the set of near-optimal empirical-risk minimizers expands as the tolerance increases. We verify this condition in several canonical settings. For classification with VC classes under the 0-1 loss, the resulting complexity scales as $O(d \log n)$, where $d$ is the VC dimension. For finite hypothesis and density classes under bounded or log loss, it scales as $O(\log |H|)$ and $O(\log |P|)$, respectively. For logistic regression with bounded covariates and parameters, a volumetric argument based on the empirical covariance matrix yields complexity scaling as $O(d \log n)$ up to problem-dependent factors.
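For readers unfamiliar with the quantity being bounded, here is a minimal sketch of the plain leave-one-out error for a simple learning rule. It is not MLSA: it assumes squared loss and a constant predictor fitted by the sample mean, both chosen only for illustration, and the function names are hypothetical.

```python
def mean_predictor(values):
    """Empirical risk minimizer for squared loss over constant predictors."""
    return sum(values) / len(values)

def leave_one_out_error(values, fit=mean_predictor):
    """Average squared error when each point is predicted from all the others."""
    if len(values) < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for i, held_out in enumerate(values):
        rest = values[:i] + values[i + 1:]  # training set without point i
        prediction = fit(rest)
        total += (prediction - held_out) ** 2
    return total / len(values)

# Toy data (assumption): held-out errors are 2.25, 0.0, and 2.25.
loo = leave_one_out_error([1.0, 2.0, 3.0])
```

The oracle inequality in the abstract compares this kind of held-out error with the best in-sample loss over the hypothesis class, plus a complexity term divided by n.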

Updated: 2026-03-02 16:27:44

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2603.02043v1

Learning sparsity-promoting regularizers for linear inverse problems

This paper introduces a novel approach to learning sparsity-promoting regularizers for solving linear inverse problems. We develop a bilevel optimization framework to select an optimal synthesis operator, denoted as $B$, which regularizes the inverse problem while promoting sparsity in the solution. The method leverages statistical properties of the underlying data and incorporates prior knowledge through the choice of $B$. We establish the well-posedness of the optimization problem, provide theoretical guarantees for the learning process, and present sample complexity bounds. The approach is demonstrated through theoretical infinite-dimensional examples, including compact perturbations of a known operator and the problem of learning the mother wavelet, and through extensive numerical simulations. This work extends previous efforts in Tikhonov regularization by addressing non-differentiable norms and proposing a data-driven approach for sparse regularization in infinite dimensions.

Updated: 2026-03-02 16:27:39

Categories: stat.ML,cs.LG,math.ST

Download: http://arxiv.org/abs/2412.16031v2

EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.

Updated: 2026-03-02 16:24:36

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.02041v1

Privacy-Preserving and Secure Spectrum Sharing for Database-Driven Cognitive Radio Networks

Database-driven cognitive radio networks (DB-CRNs) enable dynamic spectrum sharing through geolocation databases but introduce critical security and privacy challenges, including mandatory location disclosure, susceptibility to location spoofing, and denial-of-service (DoS) attacks on centralized services. Existing approaches address these issues in isolation and lack a unified, regulation-compliant solution under realistic adversarial conditions. In this work, we present a unified security framework for DB-CRNs that simultaneously provides location privacy, user anonymity, verifiable location, and DoS resilience. Our framework, denoted as SLAPX, enables privacy-preserving spectrum queries using delegatable anonymous credentials, supports adaptive location verification without revealing precise user location, and mitigates DoS attacks through verifiable delay functions (VDFs) combined with RLRS-based rate limiting. Extensive cryptographic benchmarking and network simulations demonstrate that SLAPX achieves significantly lower latency and communication overhead than existing solutions while effectively resisting location spoofing and DoS attacks. These results show that SLAPX is practical and well-suited for secure next-generation DB-CRN deployments.

Updated: 2026-03-02 16:23:48

Categories: cs.CR

Download: http://arxiv.org/abs/2602.15705v2

Graph neural network force fields for adiabatic dynamics of lattice Hamiltonians

Scalable and symmetry-consistent force-field models are essential for extending quantum-accurate simulations to large spatiotemporal scales. While descriptor-based neural networks can incorporate lattice symmetries through carefully engineered features, we show that graph neural networks (GNNs) provide a conceptually simpler and more unified alternative in which discrete lattice translation and point-group symmetries are enforced directly through local message passing and weight sharing. We develop a GNN-based force-field framework for the adiabatic dynamics of lattice Hamiltonians and demonstrate it for the semiclassical Holstein model. Trained on exact-diagonalization data, the GNN achieves high force accuracy, strict linear scaling with system size, and direct transferability to large lattices. Enabled by this scalability, we perform large-scale Langevin simulations of charge-density-wave ordering following thermal quenches, revealing dynamical scaling and anomalously slow sub-Allen-Cahn coarsening. These results establish GNNs as an elegant and efficient architecture for symmetry-aware, large-scale dynamical simulations of correlated lattice systems.

Updated: 2026-03-02 16:23:25

Categories: cond-mat.str-el,cs.LG,physics.comp-ph

Download: http://arxiv.org/abs/2603.02039v1

RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization

Real world data often exhibits unknown, instance-specific symmetries that rarely exactly match a transformation group $G$ fixed a priori. Class-pose decompositions aim to create disentangled representations by factoring inputs into invariant features and a pose $g\in G$ defined relative to a training-dependent, arbitrary canonical representation. We introduce RECON, a class-pose agnostic canonical orientation normalization that corrects arbitrary canonicals via a simple right translation, yielding natural, data-aligned canonicalizations. This enables (i) unsupervised discovery of instance-specific pose distributions, (ii) detection of out-of-distribution poses and (iii) a plug-and-play test-time canonicalization layer. This layer can be attached on top of any pre-trained model to infuse group invariance, improving its performance without retraining. We validate on images and molecular ensembles, demonstrating accurate symmetry discovery, and matching or outperforming other canonicalizations in downstream classification.

Updated: 2026-03-02 16:22:42

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2505.13289v4

Optimistic Online Learning in Symmetric Cone Games

We introduce symmetric cone games (SCGs), a broad class of multi-player games where each player's strategy lies in a generalized simplex (the trace-one slice of a symmetric cone). This framework unifies a wide spectrum of settings, including normal-form games (simplex strategies), quantum games (density matrices), and continuous games with ball-constrained strategies. It also captures several structured machine learning and optimization problems, such as distance metric learning and Fermat-Weber facility location, as two-player zero-sum SCGs. To compute approximate Nash equilibria in two-player zero-sum SCGs, we propose a single online learning algorithm: Optimistic Symmetric Cone Multiplicative Weights Updates (OSCMWU). Unlike prior methods tailored to specific geometries, OSCMWU provides closed-form updates over any symmetric cone and achieves a $\tilde{\mathcal{O}}(1/\epsilon)$ iteration complexity for computing $\epsilon$-saddle points. Our analysis builds on the Optimistic Follow-the-Regularized-Leader framework and hinges on a key technical contribution: We prove that the symmetric cone negative entropy is strongly convex with respect to the trace-one norm. This result extends known results for the simplex and spectraplex to all symmetric cones, and may be of independent interest.
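To make the update concrete, here is a minimal sketch of optimistic multiplicative weights in the simplest special case the abstract mentions, a normal-form (simplex) zero-sum game. It is not the general symmetric-cone algorithm; the step size, iteration count, starting point, and the matching-pennies payoff matrix are assumptions chosen for illustration.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def matvec(matrix, vec):
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def optimistic_mwu(payoff, steps=2000, eta=0.1):
    """Row player minimizes x^T A y, column player maximizes it.

    Returns the time-averaged strategies of both players.
    """
    n, m = len(payoff), len(payoff[0])
    payoff_t = transpose(payoff)
    # Deliberately non-uniform starting strategies.
    x = normalize([1.0 + i for i in range(n)])
    y = normalize([1.0 + j for j in range(m)])
    prev_gx, prev_gy = [0.0] * n, [0.0] * m
    avg_x, avg_y = [0.0] * n, [0.0] * m
    for _ in range(steps):
        avg_x = [a + xi / steps for a, xi in zip(avg_x, x)]
        avg_y = [a + yj / steps for a, yj in zip(avg_y, y)]
        gx = matvec(payoff, y)    # loss vector seen by the row player
        gy = matvec(payoff_t, x)  # gain vector seen by the column player
        # Optimistic step: react to 2*g_t - g_{t-1} instead of g_t alone.
        x = normalize([xi * math.exp(-eta * (2 * g - pg))
                       for xi, g, pg in zip(x, gx, prev_gx)])
        y = normalize([yj * math.exp(eta * (2 * g - pg))
                       for yj, g, pg in zip(y, gy, prev_gy)])
        prev_gx, prev_gy = gx, gy
    return avg_x, avg_y

def duality_gap(payoff, x, y):
    """Best-response value against x minus best-response value against y."""
    return max(matvec(transpose(payoff), x)) - min(matvec(payoff, y))

matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
avg_x, avg_y = optimistic_mwu(matching_pennies)
gap = duality_gap(matching_pennies, avg_x, avg_y)
```

A small duality gap means the averaged pair is an approximate saddle point. On the simplex the closed-form update is just an exponential reweighting followed by normalization.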

Updated: 2026-03-02 16:19:04

Categories: math.OC,cs.GT,cs.LG

Download: http://arxiv.org/abs/2504.03592v4

AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in reasoning tasks with long chain-of-thought (CoT). However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5x speedup and making efficient RL training feasible on ultra-long sequences in scenarios with massive tool invocation. The evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25. Specifically, AgentMath-30B-A3B attains 90.6%, 86.4%, and 73.8% accuracy respectively, surpassing OpenAI-o3-mini and Claude-Opus-4.0-Thinking while remaining competitive with OpenAI-o3, Gemini-2.5-Pro, and DeepSeek-R1-671B-0528. These results validate the effectiveness of our approach and pave the way for building scalable mathematical reasoning agents.

Updated: 2026-03-02 16:15:29

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2512.20745v3

TCG CREST System Description for the DISPLACE-M Challenge

This report presents the TCG CREST system description for Track 1 (Speaker Diarization) of the DISPLACE-M challenge, focusing on naturalistic medical conversations in noisy rural-healthcare scenarios. Our study evaluates the impact of various voice activity detection (VAD) methods and advanced clustering algorithms on overall speaker diarization (SD) performance. We compare and analyze two SD frameworks: a modular pipeline utilizing SpeechBrain with ECAPA-TDNN embeddings, and a state-of-the-art (SOTA) hybrid end-to-end neural diarization system, Diarizen, built on top of a pre-trained WavLM. With these frameworks, we explore diverse clustering techniques, including agglomerative hierarchical clustering (AHC), and multiple novel variants of spectral clustering, such as SC-adapt, SC-PNA, and SC-MK. Experimental results demonstrate that the Diarizen system provides an approximate 39% relative improvement in the diarization error rate (DER) on the post-evaluation analysis of Phase I compared to the SpeechBrain baseline. Our best-performing submitted system, which uses the Diarizen baseline with AHC and median filtering with a larger context window of 29, achieved a DER of 10.37% on the development set and 9.21% on the evaluation set. Our team ranked sixth out of the 11 participating teams after the Phase I evaluation.
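The median filtering step mentioned above can be illustrated with a short sketch. It assumes the filter runs over a per-frame binary speaker-activity sequence with an odd window and edge replication at the boundaries; the abstract gives only the window size (29), so the padding choice, the toy sequence, and the smaller window used here are assumptions.

```python
from statistics import median

def median_filter(frames, window):
    """Smooth a per-frame sequence with a sliding median of odd width."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    if not frames:
        return []
    half = window // 2
    # Replicate the edge values so every frame sees a full window.
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(frames))]

# Toy activity track (assumption): isolated one-frame flips are noise.
activity = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
smoothed = median_filter(activity, 3)
```

A wider window removes longer spurious flips at the cost of blurring short genuine speaker turns, which is the trade-off behind choosing a larger context window.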

Updated: 2026-03-02 16:12:47

Categories: eess.AS,cs.LG

Download: http://arxiv.org/abs/2603.02030v1

Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization

Moving beyond evaluations that collapse performance across heterogeneous prompts toward fine-grained evaluation at the prompt level, or within relatively homogeneous subsets, is necessary to diagnose generative models' strengths and weaknesses. Such fine-grained evaluations, however, suffer from a data bottleneck: human gold-standard labels are too costly at this scale, while automated ratings are often misaligned with human judgment. To resolve this challenge, we propose a novel statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. Specifically, our approach uses autorater scores to pretrain latent representations of prompts and generative models, and then aligns those pretrained representations to human preferences using a small calibration set. This sample-efficient methodology is robust to autorater quality, more accurately predicts human preferences on a per-prompt basis than standard baselines, and provides tight confidence intervals for key statistical parameters of interest. We also showcase the practical utility of our method by constructing granular leaderboards based on prompt qualities and by estimating model performance solely from autorater scores, eliminating the need for additional human annotations.

Updated: 2026-03-02 16:12:46

Categories: cs.AI,cs.LG,stat.ML

Download: http://arxiv.org/abs/2603.02029v1

Latent attention on masked patches for flow reconstruction

Vision transformers have demonstrated outstanding performance on image generation applications, but their adoption in scientific disciplines, like fluid dynamics, has been limited. We introduce the Latent Attention on Masked Patches (LAMP) model, an interpretable regression-based modified vision transformer designed for masked flow reconstruction. LAMP follows a three-fold strategy: (i) partition of each flow snapshot into patches, (ii) dimensionality reduction of each patch via patch-wise proper orthogonal decomposition, and (iii) reconstruction of the full field from a masked input using a single-layer transformer trained via closed-form linear regression. We test the method on two canonical 2D unsteady wakes: a wake past a bluff body, and a chaotic wake past a flat plate. We show that LAMP accurately reconstructs the full flow field from a 90%-masked and noisy input, across signal-to-noise ratios between 10 and 30 dB. Incorporating nonlinear measurement states can reduce the prediction error by up to an order of magnitude. The learned attention matrix yields physically interpretable multi-fidelity optimal sensor-placement maps. The modularity of the framework enables nonlinear compression and deep attention blocks, thereby providing an efficient baseline for nonlinear and high-dimensional masked flow reconstruction.

Updated: 2026-03-02 16:12:40

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02028v1

Generative Models for Crystalline Materials

Understanding structure-property relationships in materials is fundamental in condensed matter physics and materials science. Over the past few years, machine learning (ML) has emerged as a powerful tool for advancing this understanding and accelerating materials discovery. Early ML approaches primarily focused on constructing and screening large material spaces to identify promising candidates for various applications. More recently, research efforts have increasingly shifted toward generating crystal structures using end-to-end generative models. This review analyzes the current state of generative modeling for crystal structure prediction and de novo generation. It examines crystal representations, outlines the generative models used to design crystal structures, and evaluates their respective strengths and limitations. Furthermore, the review highlights experimental considerations for evaluating generated structures and provides recommendations for suitable existing software tools. Emerging topics, such as modeling disorder and defects, integration in advanced characterization, incorporating synthetic feasibility constraints, and model explainability are explored. Ultimately, this work aims to inform both experimental scientists looking to adapt suitable ML models to their specific circumstances and ML specialists seeking to understand the unique challenges related to inverse materials design and discovery.

Updated: 2026-03-02 16:10:40

Categories: cond-mat.mtrl-sci,cs.LG

Download: http://arxiv.org/abs/2511.22652v2

Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT

Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., "series X, image Y"), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization (predicting the axial depth referred to by a text snippet), reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
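The two headline metrics, R@10 for retrieval and mean absolute error for localization, follow standard definitions that the sketch below spells out. The toy rankings and depths are made up for illustration and the function names are hypothetical; nothing here reproduces the paper's evaluation code.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose correct item appears in the top-k results."""
    if not targets:
        raise ValueError("need at least one query")
    hits = sum(1 for ranked, target in zip(rankings, targets)
               if target in ranked[:k])
    return hits / len(targets)

def mean_absolute_error(predicted, actual):
    """Average absolute gap, e.g. between predicted and true axial depth in mm."""
    if not actual:
        raise ValueError("need at least one pair")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Toy data (assumption): two text queries, each ranking three volumes by id.
rankings = [[3, 1, 2], [2, 3, 1]]
targets = [1, 1]
r_at_2 = recall_at_k(rankings, targets, k=2)

# Toy predicted versus true slice depths in millimetres (assumption).
mae_mm = mean_absolute_error([10.0, 50.0], [20.0, 30.0])
```

Recall at k rewards any hit within the first k results regardless of its exact rank, while the localization error is reported in physical units, so it can be compared directly with the 12 mm feature resolution.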

Updated: 2026-03-02 16:10:17

Categories: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2603.02026v1

Revealing Combinatorial Reasoning of GNNs via Graph Concept Bottleneck Layer

Despite their success in various domains, the growing dependence on GNNs raises a critical concern about the nature of the combinatorial reasoning underlying their predictions, which is often hidden within their black-box architectures. Addressing this challenge requires understanding how GNNs translate topological patterns into logical rules. However, current works only uncover hard logical rules over graph concepts, which cannot quantify the contribution of each concept to a prediction. Moreover, they are post-hoc interpretability methods that generate explanations after model training and may not accurately reflect the true combinatorial reasoning of GNNs, since they approximate it with a surrogate. In this work, we develop a graph concept bottleneck layer that can be integrated into any GNN architecture to guide it to predict the selected discriminative global graph concepts. The predicted concept scores are further projected to class labels by a sparse linear layer. This forces the combinatorial reasoning behind GNN predictions to fit a soft logical rule over graph concepts and thus makes it possible to quantify the contribution of each concept. To further improve the quality of the concept bottleneck, we treat concepts as "graph words" and graphs as "graph sentences", and leverage language models to learn graph concept embeddings. Extensive experiments on multiple datasets show that our method, GCBMs, achieves state-of-the-art performance in both classification and interpretability.
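The bottleneck idea, concept scores mapped to classes by a sparse linear layer, can be sketched in a few lines. This is only an illustration of why such a layer makes per-concept contributions readable; the concept names, weights, and scores are invented, and the GNN that would produce the scores is omitted.

```python
def class_logits(concept_scores, sparse_weights):
    """Map concept scores to class logits with a sparse linear layer.

    Weights missing from a class's dictionary are treated as zero.
    Returns the logits and the per-concept contribution to each logit.
    """
    logits = {}
    contributions = {}
    for label, weights in sparse_weights.items():
        contributions[label] = {
            concept: weight * concept_scores[concept]
            for concept, weight in weights.items()
        }
        logits[label] = sum(contributions[label].values())
    return logits, contributions

# Hypothetical concept scores that a GNN bottleneck layer might output.
concept_scores = {"has_ring": 0.9, "has_star": 0.2, "long_chain": 0.5}
# Hypothetical sparse weights: each class uses only a few concepts.
sparse_weights = {
    "class_a": {"has_ring": 2.0, "long_chain": -1.0},
    "class_b": {"has_star": 3.0},
}
logits, contributions = class_logits(concept_scores, sparse_weights)
predicted = max(logits, key=logits.get)
```

Because each logit is a plain weighted sum, the product of a weight and its concept score is that concept's contribution to the class, which is the quantity that hard logical rules cannot provide.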

Updated: 2026-03-02 16:07:24

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02025v1
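The concept-to-label step described in the abstract above (concept scores projected to class labels by a sparse linear layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the weights, and the three concepts are hypothetical, and the GNN that would produce the concept scores is omitted. The point is only that, with a linear head, each concept's contribution to a class logit is the product of its weight and its score.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def concept_bottleneck_predict(concept_scores, weights, biases):
    """Map concept scores to class probabilities with a linear layer.

    weights[c][k] is the weight of concept k for class c; zero entries make
    the layer sparse. contributions[c][k] is the additive share of concept k
    in the logit of class c.
    """
    contributions = [[w * s for w, s in zip(row, concept_scores)] for row in weights]
    logits = [sum(row) + b for row, b in zip(contributions, biases)]
    return softmax(logits), contributions

# Hypothetical values: three graph concepts, two classes.
concept_scores = [0.9, 0.1, 0.7]
weights = [[2.0, 0.0, 0.0],
           [0.0, 1.5, -1.0]]
biases = [0.0, 0.0]
probs, contributions = concept_bottleneck_predict(concept_scores, weights, biases)
```

Reading `contributions[0]` shows that only the first concept drives class 0 here, which is the kind of per-concept attribution a sparse linear head makes possible.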

MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.

Updated: 2026-03-02 16:06:23

Categories: cs.CL,cs.AI,cs.CV

Download: http://arxiv.org/abs/2603.02024v1

CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space

Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur higher computational cost and have limited high-frequency fidelity. Neural audio codecs offer compact latent representations that better preserve acoustic detail, yet accurately recovering high-resolution latent information remains challenging due to representation mismatch. We present CodecFlow, a neural codec-based BWE framework that performs efficient speech reconstruction in a compact latent space. CodecFlow employs a voicing-aware conditional flow converter on continuous codec embeddings and a structure-constrained residual vector quantizer to improve latent alignment stability. Optimized end-to-end, CodecFlow achieves strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz speech BWE tasks.

Updated: 2026-03-02 16:03:46

Categories: cs.SD,cs.AI

Download: http://arxiv.org/abs/2603.02022v1

Selection as Power: Constrained Reinforcement for Bounded Decision Authority

Selection as Power argued that upstream selection authority, rather than internal objective misalignment, constitutes a primary source of risk in high-stakes agentic systems. However, the original framework was static: governance constraints bounded selection power but did not adapt over time. In this work, we extend the framework to dynamic settings by introducing incentivized selection governance, where reinforcement updates are applied to scoring and reducer parameters under externally enforced sovereignty constraints. We formalize selection as a constrained reinforcement process in which parameter updates are projected onto governance-defined feasible sets, preventing concentration beyond prescribed bounds. Across multiple regulated financial scenarios, unconstrained reinforcement consistently collapses into deterministic dominance under repeated feedback, especially at higher learning rates. In contrast, incentivized governance enables adaptive improvement while maintaining bounded selection concentration. Projection-based constraints transform reinforcement from irreversible lock-in into controlled adaptation, with governance debt quantifying the tension between optimization pressure and authority bounds. These results demonstrate that learning dynamics can coexist with structural diversity when sovereignty constraints are enforced at every update step, offering a principled approach to integrating reinforcement into high-stakes agentic systems without surrendering bounded selection authority.

Updated: 2026-03-02 16:02:34

Categories: cs.MA,cs.AI,cs.CE,cs.LG

Download: http://arxiv.org/abs/2603.02019v1
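The contrast drawn in the Selection as Power abstract above, between unconstrained reinforcement and updates projected onto a governance-defined feasible set, can be sketched as follows. Everything here is an assumption made for illustration: the feasible set is taken to be "selection shares sum to 1 and none exceeds a cap", the update is a multiplicative-weights step, and the projection is a simple clip-and-redistribute heuristic, not necessarily the projection used in the paper.

```python
import math

def project_to_capped_simplex(weights, cap):
    """Return shares that sum to 1 with no share above `cap`.

    Clip-and-redistribute heuristic (not an exact Euclidean projection).
    Assumes positive weights and cap * len(weights) >= 1.
    """
    n = len(weights)
    if cap * n < 1.0:
        raise ValueError("cap too small for shares to sum to 1")
    total = sum(weights)
    shares = [w / total for w in weights]
    for _ in range(n):
        over = [i for i, s in enumerate(shares) if s > cap]
        if not over:
            break
        excess = sum(shares[i] - cap for i in over)
        for i in over:
            shares[i] = cap
        free = [i for i, s in enumerate(shares) if s < cap]
        free_total = sum(shares[i] for i in free)
        for i in free:
            # Spread the clipped mass proportionally (evenly if all free shares are 0).
            portion = shares[i] / free_total if free_total > 0 else 1.0 / len(free)
            shares[i] += excess * portion
    return shares

def reinforce_step(shares, rewards, lr, cap=None):
    """One multiplicative-weights update, optionally followed by projection."""
    updated = [s * math.exp(lr * r) for s, r in zip(shares, rewards)]
    if cap is None:
        total = sum(updated)
        return [u / total for u in updated]
    return project_to_capped_simplex(updated, cap)

rewards = [1.0, 0.8, 0.5, 0.2]      # hypothetical per-option feedback
unconstrained = [0.25] * 4
bounded = [0.25] * 4
for _ in range(200):
    unconstrained = reinforce_step(unconstrained, rewards, lr=0.5)
    bounded = reinforce_step(bounded, rewards, lr=0.5, cap=0.4)
```

In this toy run the unconstrained shares concentrate almost entirely on the best-rewarded option, while the projected shares never exceed the 0.4 cap because the constraint is re-imposed at every update step.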

Protection against Source Inference Attacks in Federated Learning

Federated Learning (FL) was initially proposed as a privacy-preserving machine learning paradigm. However, FL has been shown to be susceptible to a series of privacy attacks. Recently, there has been concern about the Source Inference Attack (SIA), where an honest-but-curious central server attempts to identify exactly which client owns a given data point that was used in the training phase. Alarmingly, standard gradient obfuscation techniques with Differential Privacy have been shown to be ineffective against SIAs, at least without severely diminishing the accuracy. In this work, we propose a defense against SIAs within the widely studied shuffle model of FL, where an honest shuffler acts as an intermediary between the clients and the server. First, we demonstrate that standard naive shuffling alone is insufficient to prevent SIAs. To effectively defend against SIAs, shuffling needs to be applied at a more granular level; we propose a novel combination of parameter-level shuffling with the residue number system (RNS). Our approach provides robust protection against SIAs without affecting the accuracy of the joint model and can be seamlessly integrated into other privacy protection mechanisms. We conduct experiments on a series of models and datasets, confirming that standard shuffling approaches fail to prevent SIAs and that, in contrast, our proposed method reduces the attack's accuracy to the level of random guessing.

Updated: 2026-03-02 16:01:41

Categories: cs.CR

Download: http://arxiv.org/abs/2603.02017v1
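To make the combination of parameter-level shuffling and the residue number system (RNS) mentioned in the abstract above more concrete, here is a small sketch. It is an assumed reading of the idea, not the paper's protocol: each client's integer-quantized parameter is split into residues modulo pairwise-coprime moduli, the shuffler permutes the residues independently per parameter and per modulus, and the server recovers only the sum through the Chinese remainder theorem. The moduli, function names, and toy updates are all hypothetical.

```python
import random
from math import prod

MODULI = (251, 253, 255, 256)  # pairwise coprime; product is about 4.1e9

def to_residues(value, moduli=MODULI):
    """Represent a non-negative integer by its residues modulo each modulus."""
    return tuple(value % m for m in moduli)

def from_residues(residues, moduli=MODULI):
    """Reconstruct the integer (below the product of the moduli) via CRT."""
    total_mod = prod(moduli)
    result = 0
    for r, m in zip(residues, moduli):
        partial = total_mod // m
        result += r * partial * pow(partial, -1, m)  # modular inverse, Python 3.8+
    return result % total_mod

def shuffle_and_aggregate(client_updates, rng, moduli=MODULI):
    """Shuffle residues per parameter and per modulus, then sum and reconstruct."""
    n_params = len(client_updates[0])
    aggregated = []
    for j in range(n_params):
        summed = []
        for k, m in enumerate(moduli):
            column = [to_residues(update[j], moduli)[k] for update in client_updates]
            rng.shuffle(column)  # residue order no longer reveals the sending client
            summed.append(sum(column) % m)
        aggregated.append(from_residues(summed, moduli))
    return aggregated

# Three hypothetical clients, three quantized parameters each.
updates = [[10, 200, 3], [7, 50, 900], [1, 1, 1]]
result = shuffle_and_aggregate(updates, random.Random(0))
```

Because addition commutes with taking residues, the aggregate is exact as long as the true sum stays below the product of the moduli, which is why shuffling at this granularity does not have to cost accuracy.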

CausalWrap: Model-Agnostic Causal Constraint Wrappers for Tabular Synthetic Data

Tabular synthetic data generators are typically trained to match observational distributions, which can yield high conventional utility (e.g., column correlations, predictive accuracy) yet poor preservation of structural relations relevant to causal analysis and out-of-distribution (OOD) reasoning. When the downstream use of synthetic data involves causal reasoning -- estimating treatment effects, evaluating policies, or testing mediation pathways -- merely matching the observational distribution is insufficient: structural fidelity and treatment-mechanism preservation become essential. We propose CausalWrap (CW), a model-agnostic wrapper that injects partial causal knowledge (PCK) -- trusted edges, forbidden edges, and qualitative/monotonic constraints -- into any pretrained base generator (GAN, VAE, or diffusion model), without requiring access to its internals. CW learns a lightweight, differentiable post-hoc correction map applied to samples from the base generator, optimized with causal penalty terms under an augmented-Lagrangian schedule. We provide theoretical results connecting penalty-based optimization to constraint satisfaction and relating approximate factorization to joint distributional control. We validate CW on simulated structural causal models (SCMs) with known ground-truth interventions, semi-synthetic causal benchmarks (IHDP and an ACIC-style suite), and a real-world ICU cohort (MIMIC-IV) with expert-elicited partial graphs. CW improves causal fidelity across diverse base generators -- e.g., reducing average treatment effect (ATE) error by up to 63% on ACIC and lifting ATE agreement from 0.00 to 0.38 on the intensive care unit (ICU) cohort -- while largely retaining conventional utility.

Updated: 2026-03-02 15:59:46

Categories: cs.LG

Download: http://arxiv.org/abs/2603.02015v1

MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising

Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM. Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.

Updated: 2026-03-02 15:58:59

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.02012v1

Noise-Calibrated Inference from Differentially Private Sufficient Statistics in Exponential Families

Many differentially private (DP) data release systems either output DP synthetic data and leave analysts to perform inference as usual, which can lead to severe miscalibration, or output a DP point estimate without a principled way to do uncertainty quantification. This paper develops a clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing. Our contributions are: (1) a general recipe for approximate-DP release of clipped sufficient statistics under the Gaussian mechanism; (2) asymptotic normality, explicit variance inflation, and valid Wald-style confidence intervals for the plug-in DP MLE; (3) a noise-aware likelihood correction that is first-order equivalent to the plug-in but supports bootstrap-based intervals; and (4) a matching minimax lower bound showing the privacy distortion rate is unavoidable. The resulting theory yields concrete design rules and a practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on three exponential families and real census data.

Updated: 2026-03-02 15:55:54

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2603.02010v1
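The pipeline in the abstract above (release a clipped sufficient statistic under the Gaussian mechanism, then build a Wald-style interval whose variance accounts for the added noise) can be sketched for the simplest case. The assumptions are mine, not the paper's: a Gaussian location model with known variance, a public sample size, replace-one neighbouring datasets, the classic Gaussian-mechanism calibration (valid for epsilon below 1), and negligible clipping bias. The function name and parameters are hypothetical.

```python
import math
import random

def dp_gaussian_mean(data, clip, epsilon, delta, known_var, rng, z=1.96):
    """Plug-in DP estimate of a Gaussian mean with a noise-aware Wald interval."""
    n = len(data)  # treated as public in this sketch
    # Sufficient statistic for the mean: the sum of the (clipped) observations.
    clipped_sum = sum(min(max(x, -clip), clip) for x in data)
    # Replacing one record changes the clipped sum by at most 2 * clip.
    sensitivity = 2.0 * clip
    # Classic (epsilon, delta) Gaussian-mechanism noise scale, for epsilon < 1.
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    noisy_sum = clipped_sum + rng.gauss(0.0, sigma)
    estimate = noisy_sum / n
    sampling_var = known_var / n        # usual variance of a sample mean
    privacy_var = sigma ** 2 / n ** 2   # extra variance from the privacy noise
    half_width = z * math.sqrt(sampling_var + privacy_var)
    return estimate, (estimate - half_width, estimate + half_width), privacy_var

rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(2000)]
estimate, interval, privacy_var = dp_gaussian_mean(
    data, clip=5.0, epsilon=0.5, delta=1e-5, known_var=1.0, rng=rng
)
```

The interval is wider than the non-private one by exactly the `privacy_var` term, which is the variance inflation that naive inference on the noisy statistic would ignore.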

Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards

Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behavior in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.

Updated: 2026-03-02 15:55:27

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02008v1

Mitigating topology biases in Graph Diffusion via Counterfactual Intervention

Graph diffusion models have gained significant attention in graph generation tasks, but they often inherit and amplify topology biases from sensitive attributes (e.g. gender, age, region), leading to unfair synthetic graphs. Existing fair graph generation using diffusion models is limited to specific graph-based applications with complete labels or requires simultaneous updates for graph structure and node attributes, making them unsuitable for general usage. To relax these limitations by applying the debiasing method directly on graph topology, we propose Fair Graph Diffusion Model (FairGDiff), a counterfactual-based one-step solution that mitigates topology biases while balancing fairness and utility. In detail, we construct a causal model to capture the relationship between sensitive attributes, biased link formation, and the generated graph structure. By answering the counterfactual question "Would the graph structure change if the sensitive attribute were different?", we estimate an unbiased treatment and incorporate it into the diffusion process. FairGDiff integrates counterfactual learning into both forward diffusion and backward denoising, ensuring that the generated graphs are independent of sensitive attributes while preserving structural integrity. Extensive experiments on real-world datasets demonstrate that FairGDiff achieves a superior trade-off between fairness and utility, outperforming existing fair graph generation methods while maintaining scalability.

Updated: 2026-03-02 15:55:07

Categories: cs.LG,cs.AI,cs.SI

Download: http://arxiv.org/abs/2603.02005v1

MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interaction Potentials

Foundation MLIPs demonstrate broad applicability across diverse material systems and have emerged as a powerful and transformative paradigm in chemical and computational materials science. Equivariant MLIPs achieve state-of-the-art accuracy in a wide range of benchmarks by incorporating equivariant inductive bias. However, the reliance on tensor products and high-degree representations makes them computationally costly. This raises a fundamental question: as quantum mechanical-based datasets continue to expand, can we develop a more compact model to thoroughly exploit high-dimensional atomic interactions? In this work, we present MatRIS (Materials Representation and Interaction Simulation), an invariant MLIP that introduces attention-based modeling of three-body interactions. MatRIS leverages a novel separable attention mechanism with linear complexity $O(N)$, enabling both scalability and expressiveness. MatRIS delivers accuracy comparable to that of leading equivariant models on a wide range of popular benchmarks (Matbench-Discovery, MatPES, MDR phonon, Molecular dataset, etc.). Taking Matbench-Discovery as an example, MatRIS achieves an F1 score of up to 0.847 and attains comparable accuracy at a lower training cost. The work indicates that our carefully designed invariant models can match or exceed the accuracy of equivariant models at a fraction of the cost, shedding light on the development of accurate and efficient MLIPs.

Updated: 2026-03-02 15:52:41

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02002v1

Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation

Reliable obstacle avoidance in industrial settings demands 3D scene understanding, but widely used 2D LiDAR sensors perceive only a single horizontal slice of the environment, missing critical obstacles above or below the scan plane. We present a teacher-student framework for vision-based mobile robot navigation that eliminates the need for LiDAR sensors. A teacher policy trained via Proximal Policy Optimization (PPO) in NVIDIA Isaac Lab leverages privileged 2D LiDAR observations that account for the full robot footprint to learn robust navigation. The learned behavior is distilled into a student policy that relies solely on monocular depth maps predicted by a fine-tuned Depth Anything V2 model from four RGB cameras. The complete inference pipeline, comprising monocular depth estimation (MDE), policy execution, and motor control, runs entirely onboard an NVIDIA Jetson Orin AGX mounted on a DJI RoboMaster platform, requiring no external computation for inference. In simulation, the student achieves success rates of 82-96.5%, consistently outperforming the standard 2D LiDAR teacher (50-89%). In real-world experiments, the MDE-based student outperforms the 2D LiDAR teacher when navigating around obstacles with complex 3D geometries, such as overhanging structures and low-profile objects, that fall outside the single scan plane of a 2D LiDAR.

Updated: 2026-03-02 15:50:52

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2603.01999v1

The GeometricKernels Package: Heat and Matérn Kernels for Geometric Learning on Manifolds, Meshes, and Graphs

Kernels are a fundamental technical primitive in machine learning. In recent years, kernel-based methods such as Gaussian processes have become increasingly important in applications where quantifying uncertainty is of key interest. In settings that involve structured data defined on graphs, meshes, manifolds, or other related spaces, defining kernels with good uncertainty-quantification behavior, and computing their values numerically, is less straightforward than in the Euclidean setting. To address this difficulty, we present GeometricKernels, a Python software package which implements the geometric analogs of the classical Euclidean squared exponential (also known as heat) and Matérn kernels, which are widely used in settings where uncertainty is of key interest. As a byproduct, we obtain the ability to compute Fourier-feature-type expansions, which are widely used in their own right, on a wide set of geometric spaces. Our implementation supports automatic differentiation in every major current framework simultaneously via a backend-agnostic design. In this companion paper to the package and its documentation, we outline the capabilities of the package and present an illustrated example of its interface. We also include a brief overview of the theory the package is built upon and provide some historical context in the appendix.

Updated: 2026-03-02 15:50:42

Categories: cs.LG,stat.CO,stat.ML

Download: http://arxiv.org/abs/2407.08086v2

Optimal transport unlocks end-to-end learning for single-molecule localization

Single-molecule localization microscopy (SMLM) reconstructs biologically relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores (fluorescent molecules stained onto the observed specimen) over time to build super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.

Updated: 2026-03-02 15:47:38

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2512.10683v2

TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks

Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point (FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present TAO, a Tolerance-Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. TAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement TAO as a PyTorch-compatible runtime and a contract layer currently deployed on the Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers, and diffusion models on A100, H100, RTX6000, and RTX4090, empirical thresholds are $10^2$ to $10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. Together, TAO reconciles scalability with verifiability for real-world heterogeneous ML compute.

Updated: 2026-03-02 15:46:27

Categories: cs.CR,cs.AI,cs.LG,eess.SY

Download: http://arxiv.org/abs/2510.16028v3
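The dispute game in the TAO abstract above narrows a disagreement down to a single operator by recursive partitioning. A much-simplified sketch of that idea follows, under assumptions that are mine: the computation is a linear chain of operators with scalar outputs, both parties expose their per-operator outputs, and one absolute tolerance per operator stands in for the paper's acceptance regions. The Merkle commitments, on-chain contract, and voting step are omitted, and all names are hypothetical.

```python
def within_tolerance(claimed, reference, tol):
    """Accept a value if it lies inside the operator's acceptance region."""
    return abs(claimed - reference) <= tol

def find_disputed_operator(claimed, reference, tolerances):
    """Bisect a chain of operator outputs down to one operator to adjudicate.

    Returns None when the final output is accepted. Otherwise returns an index
    whose claimed output is out of tolerance while the previously checked
    output before it is within tolerance (or index 0).
    """
    last = len(claimed) - 1
    if within_tolerance(claimed[last], reference[last], tolerances[last]):
        return None  # optimistic path: nothing to dispute
    lo, hi = 0, last  # invariant: the output at `hi` is out of tolerance
    while lo < hi:
        mid = (lo + hi) // 2
        if within_tolerance(claimed[mid], reference[mid], tolerances[mid]):
            lo = mid + 1   # agreement up to mid: the fault lies later
        else:
            hi = mid       # disagreement at mid: the fault lies here or earlier
    return hi

# Hypothetical traces: the proposer's outputs drift starting at operator 2.
claimed = [1.0, 2.0, 3.5, 7.1, 14.3]
reference = [1.0, 2.0, 3.0, 6.0, 12.0]
tolerances = [1e-3] * 5
disputed = find_disputed_operator(claimed, reference, tolerances)
honest = find_disputed_operator(reference, reference, tolerances)
```

The bisection needs only a logarithmic number of comparisons in the chain length, which is what makes adjudicating a single operator cheap compared with re-running the whole model.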

Fourier Analysis on the Boolean Hypercube via Hoeffding Functional Decomposition

Fourier analysis on the Boolean hypercube is fundamentally defined as the orthogonal decomposition of the space of pseudo-Boolean functions with respect to the uniform probability measure. In this work, we propose an ANOVA-based generalization of the Fourier decomposition on the Boolean hypercube endowed with an arbitrary probability measure. We provide an explicit decomposition basis which generalizes the Walsh-Hadamard (or parity functions) basis under any arbitrary probability measure on the Boolean hypercube. We formulate the computation of the entire functional decomposition as a least squares problem and also provide a method to address the classical curse of dimensionality challenge. We provide a comprehensive generalization of Fourier analysis on the Boolean hypercube, enabling the handling of non-uniform configuration spaces inherent to real-world machine learning tasks, e.g., when dealing with one-hot encoded features. Finally, we demonstrate its practical impact in the field of explainable AI, by conducting comparative studies with feature attribution methods such as SHAP or TreeHFD.

Updated: 2026-03-02 15:45:38

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.07088v4
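As background for the abstract above, the classical case it starts from (the Walsh-Hadamard, or parity, basis under the uniform measure) can be computed by brute force for small dimensions. The sketch below covers only that uniform baseline, not the paper's generalization to arbitrary measures; the example function `f` is hypothetical. Each coefficient is the average of $f(x)\chi_S(x)$ over all $2^n$ points, with $\chi_S(x) = (-1)^{\sum_{i \in S} x_i}$.

```python
from itertools import combinations, product

def parity(x, subset):
    """Walsh-Hadamard basis function: +1 if x has an even number of ones on subset."""
    return -1 if sum(x[i] for i in subset) % 2 else 1

def fourier_coefficients(f, n):
    """Fourier coefficients of f on {0,1}^n under the uniform measure."""
    points = list(product((0, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            coeffs[subset] = sum(f(x) * parity(x, subset) for x in points) / len(points)
    return coeffs

def f(x):
    """Hypothetical pseudo-Boolean function with one main effect and one interaction."""
    return 3 + 2 * x[0] - x[1] * x[2]

coeffs = fourier_coefficients(f, 3)

def reconstruct(x):
    """Evaluate f from its expansion in the parity basis."""
    return sum(c * parity(x, subset) for subset, c in coeffs.items())
```

The coefficient on the empty set is the mean of `f`, and the only nonzero interaction term sits on the pair `(1, 2)`, matching the product `x[1] * x[2]` in the definition. The brute-force loop costs on the order of $4^n$ evaluations, which is one face of the curse of dimensionality the abstract refers to.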

BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation

Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points. Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity losses for motion-restoration quality. To train our model, we collate BIMO, a new dataset containing diverse variable-length 3D motion sequences with rich, high-quality text annotations. Extensive evaluations show that our feed-forward framework BiMotion generates more expressive, higher-quality, and better prompt-aligned motions than existing state-of-the-art methods, while also achieving faster generation. Our project page is at: https://wangmiaowei.github.io/BiMotion.github.io/.

Updated: 2026-03-02 15:42:32

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2602.18873v2

According to Me: Long-Term Personalized Referential Memory QA

Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, multi-evidence reasoning across multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20\% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code available at: https://github.com/JingbiaoMei/ATM-Bench

Updated: 2026-03-02 15:42:29

Categories: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2603.01990v1

Stealthy Poisoning Attacks Bypass Defenses in Regression Settings

Regression models are widely used in industrial processes, engineering, and in natural and physical sciences, yet their robustness to poisoning has received less attention. When it has, studies often assume unrealistic threat models and are thus less useful in practice. In this paper, we propose a novel optimal stealthy attack formulation that considers different degrees of detectability and show that it bypasses state-of-the-art defenses. We further propose a new methodology based on normalization of objectives to evaluate different trade-offs between effectiveness and detectability. Finally, we develop a novel defense (BayesClean) against stealthy attacks. BayesClean improves on previous defenses when attacks are stealthy and the number of poisoning points is significant.

Updated: 2026-03-02 15:42:29

Categories: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2601.22308v2

Accurate, private, secure, federated U-statistics with higher degree

We study the problem of computing a U-statistic with a kernel function f of degree k $\ge$ 2, i.e., the average of some function f over all k-tuples of instances, in a federated learning setting. U-statistics of degree 2 include several useful statistics such as Kendall's $τ$ coefficient, the Area under the Receiver-Operator Curve and the Gini mean difference. Existing methods provide solutions only under the lower-utility local differential privacy model and/or scale poorly in the size of the domain discretization. In this work, we propose a protocol that securely computes U-statistics of degree k $\ge$ 2 under central differential privacy by leveraging Multi-Party Computation (MPC). Our method substantially improves accuracy when compared to prior solutions. We provide a detailed theoretical analysis of its accuracy, communication and computational properties. We evaluate its performance empirically, obtaining favorable results, e.g., for Kendall's $τ$ coefficient, our approach reduces the Mean Squared Error by up to four orders of magnitude over existing baselines.

Updated: 2026-03-02 15:40:16

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2603.01986v1
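To make the notion of a degree-2 U-statistic in the abstract above concrete, the following minimal sketch computes Kendall's τ as the average of a pairwise kernel. It is a plain, centralized, non-private computation; it does not implement the paper's federated, differentially private protocol. The function names and the sample `data` are assumptions for illustration.

```python
from itertools import combinations

def u_statistic(kernel, sample, k):
    """Average a symmetric kernel over all size-k subsets of the sample."""
    values = [kernel(*subset) for subset in combinations(sample, k)]
    return sum(values) / len(values)

def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def kendall_kernel(a, b):
    """+1 for a concordant pair, -1 for a discordant pair, 0 for a tie."""
    return sign(a[0] - b[0]) * sign(a[1] - b[1])

# Hypothetical paired observations (x, y).
data = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
tau = u_statistic(kendall_kernel, data, 2)
```

Of the 10 pairs here, 8 are concordant and 2 are discordant, so the statistic is (8 - 2) / 10. The number of kernel evaluations grows as the number of k-subsets, which is one reason a private, distributed computation of such statistics is non-trivial.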

Data-Driven Prediction and Control of Hammerstein-Wiener Systems with Implicit Gaussian Processes

This work investigates data-driven prediction and control of Hammerstein-Wiener systems using physics-informed Gaussian process (GP) models that encode the block-oriented model structure. Data-driven prediction algorithms have been developed for structured nonlinear systems based on Willems' fundamental lemma. However, existing frameworks do not apply to output nonlinearities in Wiener systems and rely on a finite-dimensional dictionary of basis functions for Hammerstein systems. In this work, an implicit predictor structure is considered, leveraging the linearity for the dynamical part of the model. This implicit function is learned by GP regression, utilizing carefully designed structured kernel functions from linear model parameters and GP priors for the nonlinearities. Virtual derivative points are added to the regression by expectation propagation to encode monotonicity information of the nonlinearities. The linear model parameters are estimated as hyperparameters by assuming a stable spline hyperprior. The implicit GP model provides explicit output prediction by optimizing selected optimality criteria. The implicit model is also applied to receding horizon control with the expected control cost and chance constraint satisfaction guarantee. Numerical results demonstrate that the proposed prediction and control algorithms are superior to black-box GP models without model structure knowledge.

Updated: 2026-03-02 15:36:55

Categories: eess.SY,cs.LG

Download: http://arxiv.org/abs/2501.15849v2

Clustering by Denoising: Latent plug-and-play diffusion for single-cell data

Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique "input-space steering" ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.

Updated: 2026-03-02 15:34:01

Categories: cs.LG,stat.CO,stat.ML

Download: http://arxiv.org/abs/2510.22835v3

Quantitative Convergence of Wasserstein Gradient Flows of Kernel Mean Discrepancies

We study the quantitative convergence of Wasserstein gradient flows of Kernel Mean Discrepancy (KMD) (also known as Maximum Mean Discrepancy (MMD)) functionals. Our setting covers in particular the training dynamics of shallow neural networks in the infinite-width and continuous time limit, as well as interacting particle systems with pairwise Riesz kernel interaction in the mean-field and overdamped limit. Our main analysis concerns the model case of KMD functionals given by the squared Sobolev distance $ \mathscr{E}^ν_{s}(μ)= \frac{1}{2}\lVert μ-ν\rVert_{\dot H^{-s}}^{2}$ for any $s\geq 1 $ and $ν$ a fixed probability measure on the $d$-dimensional torus. First, inspired by Yudovich theory for the $2d$-Euler equation, we establish existence and uniqueness in natural weak regularity classes. Next, we show that for $s=1$ the flow converges globally at an exponential rate under minimal assumptions, while for $s>1$ we prove local convergence at polynomial rates that depend explicitly on $s$ and on the Sobolev regularity of $μ$ and $ν$. These rates hold both at the energy level and in higher regularity classes and are tight for $ν$ uniform. We then consider the gradient flow of the population loss for shallow neural networks with ReLU activation, which can be cast as a Wasserstein--Fisher--Rao gradient flow on the space of nonnegative measures on the sphere $\mathbb{S}^d$. Exploiting a correspondence with the Sobolev energy case with $s=(d+3)/2$, we derive an explicit polynomial local convergence rate for this dynamics. Except for the special case $s=1$, even non-quantitative convergence was previously open in all these settings. We also include numerical experiments in dimension $d=1$ using both PDE and particle methods which illustrate our analysis.

Updated: 2026-03-02 15:32:54

Categories: math.AP,cs.LG,math.OC

Download: http://arxiv.org/abs/2603.01977v1

Representing local protein environments with machine learning force fields

The local structure of a protein strongly impacts its function and interactions with other molecules. Therefore, a concise, informative representation of a local protein environment is essential for modeling and designing proteins and biomolecular interactions. However, these environments' extensive structural and chemical variability makes them challenging to model, and such representations remain under-explored. In this work, we propose a novel representation for a local protein environment derived from the intermediate features of atomistic foundation models (AFMs). We demonstrate that this embedding effectively captures both local structure (e.g., secondary motifs), and chemical features (e.g., amino-acid identity and protonation state). We further show that the AFM-derived representation space exhibits meaningful structure, enabling the construction of data-driven priors over the distribution of biomolecular environments. Finally, in the context of biomolecular NMR spectroscopy, we demonstrate that the proposed representations enable a first-of-its-kind physics-informed chemical shift predictor that achieves state-of-the-art accuracy. Our results demonstrate the surprising effectiveness of atomistic foundation models and their emergent representations for protein modeling beyond traditional molecular simulations. We believe this will open new lines of work in constructing effective functional representations for protein environments.

Updated: 2026-03-02 15:30:42

Categories: q-bio.BM,cs.AI

Download: http://arxiv.org/abs/2505.23354v4

Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text

Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 54.3% to 75.4% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini). A Python implementation of our proposal is publicly available at https://github.com/Mamba413/L2D.

Updated: 2026-03-02 15:28:28

Categories: cs.CL,cs.AI,stat.ML

Download: http://arxiv.org/abs/2601.21895v2

CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.

Updated: 2026-03-02 15:27:31

Categories: cs.CL,cs.AI,cs.SI

Download: http://arxiv.org/abs/2603.01973v1

LOCUS: A Distribution-Free Loss-Quantile Score for Risk-Aware Predictions

Modern machine learning models can be accurate on average yet still make mistakes that dominate deployment cost. We introduce Locus, a distribution-free wrapper that produces a per-input loss-scale reliability score for a fixed prediction function. Rather than quantifying uncertainty about the label, Locus models the realized loss of the prediction function using any engine that outputs a predictive distribution for the loss given an input. A simple split-calibration step turns this function into a distribution-free interpretable score that is comparable across inputs and can be read as an upper loss level. The score is useful on its own for ranking, and it can optionally be thresholded to obtain a transparent flagging rule with distribution-free control of large-loss events. Experiments across 13 regression benchmarks show that Locus yields effective risk ranking and reduces large-loss frequency compared to standard heuristics.

Updated: 2026-03-02 15:25:50

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2603.01971v1
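The abstract above does not spell out the split-calibration step. One standard way to turn a predicted loss level into a distribution-free upper level is a one-sided split-conformal correction, sketched below. This is an assumed stand-in, not necessarily the procedure Locus uses; `conformal_offset`, `calibrated_score`, and the calibration numbers are hypothetical.

```python
import math

def conformal_offset(cal_losses, cal_predicted, alpha):
    """One-sided split-conformal correction at miscoverage level alpha.

    cal_losses:    realized losses of the fixed predictor on held-out data
    cal_predicted: the loss engine's predicted upper level for the same inputs
    """
    scores = sorted(l - p for l, p in zip(cal_losses, cal_predicted))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

def calibrated_score(predicted_upper, offset):
    """Shift a new input's predicted loss level by the calibrated offset."""
    return predicted_upper + offset

# Hypothetical calibration data.
cal_losses = [1, 4, 2, 8, 3, 5, 6]
cal_predicted = [2, 3, 2, 5, 4, 4, 6]
offset = conformal_offset(cal_losses, cal_predicted, alpha=0.25)
score = calibrated_score(5, offset)
```

Under exchangeability of calibration and test points, the realized loss on a new input exceeds the calibrated score with probability at most `alpha`. That is the kind of distribution-free control of large-loss events the abstract describes.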

WiCompass: Oracle-driven Data Scaling for mmWave Human Pose Estimation

Millimeter-wave Human Pose Estimation (mmWave HPE) promises privacy but suffers from poor generalization under distribution shifts. We demonstrate that brute-force data scaling is ineffective for out-of-distribution (OOD) robustness; efficiency and coverage are the true bottlenecks. To address this, we introduce WiCompass, a coverage-aware data-collection framework. WiCompass leverages large-scale motion-capture corpora to build a universal pose space ``oracle'' that quantifies dataset redundancy and identifies underrepresented motions. Guided by this oracle, WiCompass employs a closed-loop policy to prioritize collecting informative missing samples. Experiments show that WiCompass consistently improves OOD accuracy at matched budgets and exhibits superior scaling behavior compared to conventional collection strategies. By shifting focus from brute-force scaling to coverage-aware data acquisition, this work offers a practical path toward robust mmWave sensing.

Updated: 2026-03-02 15:22:05

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2602.18726v2

Intrinsic Task Symmetry Drives Generalization in Algorithmic Tasks

Grokking, the sudden transition from memorization to generalization, is characterized by the emergence of low-dimensional representations, yet the mechanism underlying this organization remains elusive. We propose that intrinsic task symmetries primarily drive grokking and shape the geometry of the model's representation space. We identify a consistent three-stage training dynamic underlying grokking: (i) memorization, (ii) symmetry acquisition, and (iii) geometric organization. We show that generalization emerges during the symmetry acquisition phase, after which representations reorganize into a structured, task-aligned geometry. We validate this symmetry-driven account across diverse algorithmic domains, including algebraic, structural, and relational reasoning tasks. Building on these findings, we introduce a symmetry-based diagnostic that anticipates the onset of generalization and propose strategies to accelerate it. Together, our results establish intrinsic symmetry as the key factor enabling neural networks to move beyond memorization and achieve robust algorithmic reasoning.

Updated: 2026-03-02 15:19:24

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01968v1

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.

Updated: 2026-03-02 15:15:11

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01966v1

CoVAE: correlated multimodal generative modeling

Multimodal Variational Autoencoders have emerged as a popular tool to extract effective representations from rich multimodal data. However, such models rely on fusion strategies in latent space that destroy the joint statistical structure of the multimodal data, with profound implications for generation and uncertainty quantification. In this work, we introduce Correlated Variational Autoencoders (CoVAE), a new generative architecture that captures the correlations between modalities. We test CoVAE on a number of real and synthetic data sets demonstrating both accurate cross-modal reconstruction and effective quantification of the associated uncertainties.

Updated: 2026-03-02 15:14:59

Categories: cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2603.01965v1

TiledAttention: a CUDA Tile SDPA Kernel for PyTorch

TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch) and explicit unfused baselines across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.

Updated: 2026-03-02 15:11:00

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01960v1
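The online softmax with tiled K, V streaming mentioned in the abstract above can be illustrated on the CPU in plain Python. The sketch below handles a single query vector and is not the TiledAttention kernel or the cuTile API; the function names and the tile size are assumptions chosen for readability.

```python
import math

def tiled_attention(q, keys, values, tile=2):
    """Attention output for one query, streaming K/V in tiles with online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # largest score seen so far
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running weighted sum of value rows
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale the partial sums accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

def reference_attention(q, keys, values):
    """Unfused baseline: materialize all scores, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
out_tiled = tiled_attention(q, keys, values, tile=2)
out_ref = reference_attention(q, keys, values)
```

The streaming version never holds more than one tile of scores at a time, yet it matches the unfused baseline up to floating-point rounding. This is why the rescaling trick suits kernels that keep tiles in fast on-chip memory.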

Sample-efficient and Scalable Exploration in Continuous-Time RL

Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.

Updated: 2026-03-02 15:08:43

Categories: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2510.24482v2

The Expressive Limits of Diagonal SSMs for State-Tracking

State-Space Models (SSMs) have recently been shown to achieve strong empirical performance on a variety of long-range sequence modeling tasks while remaining efficient and highly-parallelizable. However, the theoretical understanding of their expressive power remains limited. In this work, we study the expressivity of input-Dependent Complex-valued Diagonal (DCD) SSMs on sequential state-tracking tasks. We show that single-layer DCD SSMs cannot express state-tracking of any non-Abelian group at finite precision. More generally, we show that $k$-layer DCD SSMs can express state-tracking of a group if and only if that group has a subnormal series of length $k$, with Abelian factors. That is, we identify the precise expressivity range of $k$-layer DCD SSMs within the solvable groups. Empirically, we find that multi-layer models often fail to learn state-tracking for non-Abelian groups, highlighting a gap between expressivity and learnability.

Updated: 2026-03-02 15:08:14

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01959v1
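As a small illustration of the Abelian side of the result in the abstract above, the sketch below tracks the cyclic group Z_m with a single complex state and an input-dependent diagonal (here 1x1) transition. It is a hand-built construction for intuition, not taken from the paper; the recurrence with no additive input term, the initial state, and the phase decoder are simplifying assumptions.

```python
import cmath

def track_cyclic(inputs, m):
    """Track the running sum of inputs modulo m with one complex diagonal state."""
    h = 1 + 0j                      # initial state encodes the identity element
    decoded = []
    for x in inputs:
        # Input-dependent diagonal transition: rotate by 2*pi*x/m.
        h = cmath.exp(2j * cmath.pi * x / m) * h
        # Decode the group element from the phase of the state.
        angle = cmath.phase(h) % (2 * cmath.pi)
        decoded.append(round(angle * m / (2 * cmath.pi)) % m)
    return decoded

states = track_cyclic([1, 2, 3, 4, 1], m=5)
```

Rotations of the unit circle commute, so a product of such diagonal transitions can only represent commutative composition. This is the intuition for why a single layer of this kind cannot track a non-Abelian group.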

SoK: Is Sustainable the New Usable? Debunking The Myth of Fundamental Incompatibility Between Security and Sustainability

Every year, millions of functional systems become e-waste because users are pressured to send their systems to landfills due to a lack of vendor support and difficulty in recycling. Vendors cite ``cybersecurity'' as the driver for short product support periods, leading to a prevalent, but uninterrogated, belief that cybersecurity and environmental sustainability are fundamentally contradictory; i.e., it is difficult, if not impossible, to build products that are secure, long-lasting, and reusable. To understand the nuanced relationship between security and sustainability, we systematically analyze 29 papers and distill 155 sustainability guidelines into 12 sustainability themes. These themes enable us to compare the sustainable HCI and sustainable software engineering guidance with that of cybersecurity, identifying points of alignment and tension. We find little evidence of a fundamental tension between these two domains; the few instances of tension can be mitigated through thoughtful consideration of security and sustainability objectives. We also find that sustainability, like usable security, struggles with the myth of users as the weakest link and the individualization of responsibility. Building on these parallels, we argue that the usable security community is well-positioned to integrate sustainability considerations, as both fields share challenges in shifting responsibility from individuals to systemic design.

Updated: 2026-03-02 15:08:13

Categories: cs.HC,cs.CR

Download: http://arxiv.org/abs/2603.01958v1

Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

As Generative AI (GenAI) systems see growing adoption, a key concern involves the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.
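The doubly-robust idea in the abstract above can be illustrated with a generic estimator in the style of augmented inverse-probability weighting. This is only a sketch under stated assumptions, not the paper's estimator: the function name `doubly_robust_mean`, its arguments, and the premise that importance weights and outcome-model predictions (for example, fitted from persona ratings) are already available are all hypothetical.

```python
import numpy as np

def doubly_robust_mean(y_src, pred_src, w_src, pred_tgt):
    """Generic doubly-robust estimate of the target-population mean rating.

    y_src:    human ratings observed on the (biased) source sample
    pred_src: outcome-model predictions for the same source items
    w_src:    importance weights p_target(x) / p_source(x) for the source items
    pred_tgt: outcome-model predictions on a sample from the target distribution
    All four inputs are assumed to be precomputed (hypothetical setup).
    """
    y_src, pred_src, w_src, pred_tgt = (
        np.asarray(a, dtype=float) for a in (y_src, pred_src, w_src, pred_tgt)
    )
    # Plug-in term: average predicted rating over the target sample.
    plug_in = pred_tgt.mean()
    # Correction term: importance-weighted residuals on the source sample.
    correction = np.mean(w_src * (y_src - pred_src))
    return plug_in + correction
```

If the outcome model is exact, the residuals vanish and the weights do not matter; if the weights are exact, the correction term removes the outcome model's bias in expectation. This mirrors the "either (i) or (ii)" condition stated in the abstract.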

Updated: 2026-03-02 15:05:20

Categories: cs.LG

Download: http://arxiv.org/abs/2509.22957v2

Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy

Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp

Updated: 2026-03-02 15:04:18

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2603.01953v1

LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations

Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio-cultural norms. The simulation models a small city as a location graph with synthetic residents having diverse demographic and cultural profiles. Each episode assigns one resident a daily goal while others provide social context. An LLM-based verifier generates structured judgments on norm violations and task progress, which we aggregate into metrics capturing task-norm trade-offs and verifier uncertainty. Using LiveCultureBench across models and cultural profiles, we study (i) cross-cultural robustness of LLM agents, (ii) how they balance effectiveness against norm sensitivity, and (iii) when LLM-as-a-judge evaluation is reliable for automated benchmarking versus when human oversight is needed.

Updated: 2026-03-02 15:04:16

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01952v1

Accelerating Single-Pass SGD for Generalized Linear Prediction

We study generalized linear prediction under a streaming setting, where each iteration uses only one fresh data point for a gradient-level update. While momentum is well-established in deterministic optimization, a fundamental open question is whether it can accelerate such single-pass non-quadratic stochastic optimization. We propose the first algorithm that successfully incorporates momentum via a novel data-dependent proximal method, achieving dual-momentum acceleration. Our derived excess risk bound decomposes into three components: an improved optimization error, a minimax optimal statistical error, and a higher-order model-misspecification error. The proof handles mis-specification via a fine-grained stationary analysis of inner updates, while localizing statistical error through a two-phase outer-loop analysis. As a result, we resolve the open problem posed by Jain et al. [2018a] and demonstrate that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.

Updated: 2026-03-02 15:04:00

Categories: cs.LG,math.OC,stat.ML

Download: http://arxiv.org/abs/2603.01951v1

Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment

A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.

Updated: 2026-03-02 15:03:57

Categories: cs.LG,cs.CL,cs.CV

Download: http://arxiv.org/abs/2603.01950v1

VMDNet: Temporal Leakage-Free Variational Mode Decomposition for Electricity Demand Forecasting

Accurate electricity demand forecasting is challenging due to the strong multi-periodicity of real-world demand series, which makes effective modeling of recurrent temporal patterns crucial. Decomposition techniques make such structure explicit and thereby improve predictive performance. Variational Mode Decomposition (VMD) is a powerful signal-processing method for periodicity-aware decomposition and has seen growing adoption in recent years. However, existing studies often suffer from information leakage and rely on inappropriate hyperparameter tuning. To address these issues, we propose VMDNet, a causality-preserving framework that (i) applies sample-wise VMD to avoid temporal leakage; (ii) represents each decomposed mode with frequency-aware embeddings and decodes it using parallel temporal convolutional networks (TCNs), ensuring mode independence and efficient learning; and (iii) introduces a Stackelberg-game-inspired bilevel scheme to guide the selection of VMD's two key hyperparameters. Experiments on three widely used electricity demand datasets show that VMDNet consistently outperforms state-of-the-art baselines.

Updated: 2026-03-02 15:02:44

Categories: cs.LG

Download: http://arxiv.org/abs/2509.15394v2

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for Attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.

Updated: 2026-03-02 15:01:07

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2504.14960v3

Probabilistic Retrofitting of Learned Simulators

Dominant approaches for modelling Partial Differential Equations (PDEs) rely on deterministic predictions, yet many physical systems of interest are inherently chaotic and uncertain. While training probabilistic models from scratch is possible, it is computationally expensive and fails to leverage the significant resources already invested in high-performing deterministic backbones. In this work, we adopt a training-efficient strategy to transform pre-trained deterministic models into probabilistic ones via retrofitting with a proper scoring rule: the Continuous Ranked Probability Score (CRPS). Crucially, this approach is architecture-agnostic: it applies the same adaptation mechanism across distinct model backbones with minimal code modifications. The method proves highly effective across different scales of pre-training: for models trained on single dynamical systems, we achieve 20-54% reductions in rollout CRPS and up to 30% improvements in variance-normalised RMSE (VRMSE) relative to compute-matched deterministic fine-tuning. We further validate our approach on a PDE foundation model, trained on multiple systems and retrofitted on the dataset of interest, to show that our probabilistic adaptation yields an improvement of up to 40% in CRPS and up to 15% in VRMSE compared to deterministic fine-tuning. Validated across diverse architectures and dynamics, our results show that probabilistic PDE modelling need not require retraining from scratch, but can be unlocked from existing deterministic backbones with modest additional training cost.
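For readers unfamiliar with the scoring rule named above, the CRPS of a forecast ensemble against a scalar observation can be estimated from samples via the identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|. The sketch below uses the simple plug-in version of that identity; the function name `crps_ensemble` is hypothetical, and the abstract does not say which estimator the authors actually train with.

```python
import numpy as np

def crps_ensemble(samples, y):
    """Plug-in CRPS estimate for an ensemble forecast of a scalar observation y.

    samples: 1-D array of ensemble members drawn from the forecast distribution
    y:       the observed value
    """
    samples = np.asarray(samples, dtype=float)
    # Accuracy term: mean absolute error of the members against the observation.
    accuracy = np.mean(np.abs(samples - y))
    # Spread term: mean absolute difference over all ordered pairs of members.
    spread = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return accuracy - 0.5 * spread
```

With a single ensemble member the spread term is zero and the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalisation of MAE. Lower values are better.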

Updated: 2026-03-02 15:01:02

Categories: cs.LG,cs.AI,cs.CE

Download: http://arxiv.org/abs/2603.01949v1

physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection

Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.

Updated: 2026-03-02 15:00:22

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.01947v1

When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation

Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task assessing inter-topic distinctness by testing whether annotators can distinguish between word sets from single or mixed topics. TWM complements word intrusion's focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models - both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb) - comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.

Updated: 2026-03-02 14:58:20

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.01945v1

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Asynchronous reinforcement learning has become increasingly central to scaling LLM post-training, delivering major throughput gains by decoupling rollout generation from policy updates. However, widely used policy-gradient objectives such as REINFORCE and GRPO suffer under high asynchrony: stale rollouts produce heavy-tailed importance weights, so a small number of trajectories dominate updates and the policy-gradient estimator has markedly higher variance. Through systematic analysis on math, reasoning, and tool-use benchmarks, we find that this increasing variance is reliably predicted by collapsing effective sample size (ESS), which prior stabilization methods largely fail to address. Motivated by this diagnosis, we introduce Variance Controlled Policy Optimization (VCPO), a method that (i) dynamically scales the learning rate with ESS to dampen unreliable updates and (ii) applies a closed-form minimum-variance baseline for off-policy settings, without a critic model and with minimal added overhead. Empirically, across math and general reasoning benchmarks, this enables robustly stable asynchronous training compared to previous stabilization and algorithmic methods, even in highly off-policy regimes (128 steps off-policy). In a long-horizon, tool-use task, VCPO matches synchronous performance while delivering a 2.5× speedup in training time. Code is available at: https://github.com/mit-han-lab/vcpo
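The effective sample size mentioned above is commonly computed from importance weights as ESS = (Σw)² / Σw². The sketch below shows that standard formula together with one possible way to scale a learning rate by it. The helper `ess_scaled_lr` and its linear scaling rule are purely hypothetical illustrations; the abstract does not specify the scaling rule VCPO uses.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size of a batch of non-negative importance weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def ess_scaled_lr(base_lr, weights):
    """Hypothetical rule: shrink the learning rate by the ESS fraction of the batch.

    This linear rule is an assumption for illustration only.
    """
    w = np.asarray(weights, dtype=float)
    return base_lr * effective_sample_size(w) / w.size
```

Uniform weights give an ESS equal to the batch size, so the learning rate is left unchanged; when a single trajectory carries all the weight, ESS drops to 1 and the update is damped accordingly.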

Updated: 2026-03-02 14:57:45

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2602.17616v2

Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots

Large Language Models have intensified the scale and strategic manipulation of political discourse on social media, leading to conflict escalation. The existing literature largely focuses on platform-led moderation as a countermeasure. In this paper, we propose a user-centric view of "jailbreaking" as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent large language model safeguards, exposing automated behaviour and disrupting the circulation of misleading narratives.

Updated: 2026-03-02 14:57:13

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2603.01942v1

BAED: a New Paradigm for Few-shot Graph Learning with Explanation in the Loop

The challenges of training and inference in few-shot environments persist in the area of graph representation learning. The quality and quantity of labels are often insufficient due to the extensive expert knowledge required to annotate graph data. In this context, Few-Shot Graph Learning (FSGL) approaches have been developed over the years. Through sophisticated neural architectures and customized training pipelines, these approaches enhance model adaptability to new label distributions. However, compromises in the model's robustness and interpretability can result in overfitting to noise in labeled data and degraded performance. This paper introduces the first explanation-in-the-loop framework for the FSGL problem, called BAED. In a novel step, we employ the belief propagation algorithm to facilitate label augmentation on graphs. Then, leveraging an auxiliary graph neural network and the gradient backpropagation method, our framework effectively extracts explanatory subgraphs surrounding target nodes. The final predictions are based on these informative subgraphs while mitigating the influence of redundant information from neighboring nodes. Extensive experiments on seven benchmark datasets demonstrate superior prediction accuracy, training efficiency, and explanation quality of BAED. As a pioneering effort, this work highlights the potential of the explanation-based research paradigm in FSGL.

Updated: 2026-03-02 14:56:39

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01941v1

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging τ²-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to 17× its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.

Updated: 2026-03-02 14:56:35

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01940v1

Explanation-Guided Adversarial Training for Robust and Interpretable Models

Deep neural networks (DNNs) have achieved remarkable performance in many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. At the same time, both the predictions and the saliency maps of DNNs can change dramatically when facing imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial examples on the fly while imposing explanation-based constraints on the model. By jointly optimizing classification performance, adversarial robustness, and attributional stability, EGAT is not only more resistant to unexpected cases, including adversarial attacks and out-of-distribution (OOD) scenarios, but also offers human-interpretable justifications for its decisions. We further formalize EGAT within the Probably Approximately Correct learning framework, demonstrating theoretically that it yields more stable predictions under unexpected situations compared to standard AT. Empirical evaluations on OOD benchmark datasets show that EGAT consistently outperforms competitive baselines in both clean accuracy and adversarial accuracy (+37%) while producing more semantically meaningful explanations and requiring only a limited increase (+16%) in training time.

Updated: 2026-03-02 14:52:52

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01938v1

Dream2Learn: Structured Generative Dreaming for Continual Learning

Continual learning requires balancing plasticity and stability while mitigating catastrophic forgetting. Inspired by human dreaming as a mechanism for internal simulation and knowledge restructuring, we introduce Dream2Learn (D2L), a framework in which a model autonomously generates structured synthetic experiences from its own internal representations and uses them for self-improvement. Rather than reconstructing past data as in generative replay, D2L enables a classifier to create novel, semantically distinct dreamed classes that are coherent with its learned knowledge yet do not correspond to previously observed data. These dreamed samples are produced by conditioning a frozen diffusion model through soft prompt optimization driven by the classifier itself. The generated data are not used to replace memory, but to expand and reorganize the representation space, effectively allowing the network to self-train on internally synthesized concepts. By integrating dreamed classes into continual training, D2L proactively structures latent features to support forward knowledge transfer and adaptation to future tasks. This prospective self-training mechanism mirrors the role of sleep in consolidating and reorganizing memory, turning internal simulations into a tool for improved generalization. Experiments on Mini-ImageNet, FG-ImageNet, and ImageNet-R demonstrate that D2L consistently outperforms strong rehearsal-based baselines and achieves positive forward transfer, confirming its ability to enhance adaptability through internally generated training signals.

Updated: 2026-03-02 14:52:10

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2603.01935v1

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

Large language models (LLMs) demonstrate strong potential as autonomous agents, with promising capabilities in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in various domains, the financial domain remains underexplored, despite its significant economic value and complex reasoning requirements. Most existing financial benchmarks focus on static question-answering, failing to capture the dynamics of real-market trading. To address this gap, we introduce STOCKBENCH, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and make sequential buy, sell, or hold decisions. Performance is measured using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio, capturing both profitability and risk management. We evaluate a wide range of state-of-the-art proprietary and open-source LLMs. Surprisingly, most models struggle to outperform the simple buy-and-hold baseline, while some models demonstrate the potential to achieve higher returns and stronger risk management. These findings highlight both the challenges and opportunities of LLM-based trading agents, showing that strong performance on static financial question-answering does not necessarily translate into effective trading behavior. We release STOCKBENCH as an open-source benchmark to enable future research on LLM-driven financial agents.
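As a rough illustration of the three metrics named in the abstract, the sketch below computes cumulative return, maximum drawdown, and a non-annualized Sortino ratio from a series of portfolio values. The function names, the zero target return, and the sample values are assumptions made for this example; the benchmark's own definitions (for instance annualization or the choice of target return) may differ.

```python
import math

def cumulative_return(values):
    """Total return over the whole series, e.g. 0.10 for +10%."""
    return values[-1] / values[0] - 1.0

def max_drawdown(values):
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation (not annualized)."""
    excess = [r - target for r in returns]
    # Only returns below the target contribute to the downside deviation.
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        return math.inf
    return (sum(excess) / len(excess)) / downside

# Hypothetical daily portfolio values, for illustration only.
values = [100.0, 110.0, 99.0, 120.0]
daily_returns = [b / a - 1.0 for a, b in zip(values, values[1:])]

total = cumulative_return(values)
drawdown = max_drawdown(values)
sortino = sortino_ratio(daily_returns)
```

Unlike the Sharpe ratio, the Sortino ratio penalizes only returns below the target, which is why it is paired here with drawdown as a risk-management measure.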

Updated: 2026-03-02 14:50:36

Categories: cs.LG, cs.CL

Download: http://arxiv.org/abs/2510.02209v2

From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation

Narratives in news discourse play a critical role in shaping public understanding of economic events, such as inflation. Annotating and evaluating these narratives in a structured manner remains a key challenge for Natural Language Processing (NLP). In this work, we introduce a narrative graph annotation framework that integrates principles from qualitative content analysis (QCA) to prioritize annotation quality by reducing annotation errors. We present a dataset of inflation narratives annotated as directed acyclic graphs (DAGs), where nodes represent events and edges encode causal relations. To evaluate annotation quality, we employed a $6\times3$ factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorff's $α$), capturing the presence of human label variation (HLV) in narrative interpretations. Our analysis shows that (1) lenient metrics (overlap-based distance) overestimate reliability, and (2) locally-constrained representations (e.g., one-hop neighbors) reduce annotation variability. Our annotation and implementation of graph-based Krippendorff's $α$ are open-sourced. The annotation framework and evaluation results provide practical guidance for NLP research on graph-based narrative annotation under HLV.
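To make the agreement measure concrete, here is a minimal sketch of Krippendorff's $α = 1 - D_o / D_e$ with a pluggable distance function. It is not the authors' released implementation: representing each annotation as a set of causal edges and using a Jaccard (overlap-based) distance are assumptions chosen only to show how the distance metric enters the computation.

```python
from itertools import combinations

def krippendorff_alpha(units, distance):
    """Krippendorff's alpha for a list of units, each a list of annotations.

    `distance(a, b)` takes the place of the squared difference function; it
    must be symmetric and return 0 for identical annotations. The expected
    disagreement must be non-zero (the annotations must not all be identical).
    """
    # Only units with at least two annotations are pairable.
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: within-unit pairs, weighted by 1 / (m_u - 1).
    d_o = sum(
        2.0 * sum(distance(a, b) for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: all pairs of pairable values, regardless of unit.
    d_e = 2.0 * sum(distance(a, b) for a, b in combinations(values, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e

def jaccard_distance(a, b):
    """Overlap-based distance between two edge sets (assumed representation)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

# Hypothetical annotations: two annotators per text, each a set of causal edges.
e1 = frozenset({("rate hike", "inflation")})
e2 = frozenset({("energy prices", "inflation")})
agree = krippendorff_alpha([[e1, e1], [e2, e2]], jaccard_distance)
chance = krippendorff_alpha([[e1, e1], [e1, e2]], jaccard_distance)
```

Swapping `jaccard_distance` for a stricter or more lenient function changes both $D_o$ and $D_e$, which is how the choice of metric can shift the reported reliability.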

Updated: 2026-03-02 14:48:13

Categories: cs.CL, cs.AI, cs.LG

Download: http://arxiv.org/abs/2603.01930v1

Learning Contact Dynamics through Touching: Action-conditional Graph Neural Networks for Robotic Peg Insertion

We present a learnable physics-based predictive model that provides accurate motion and force-torque prediction of the robot end effector in contact-rich manipulation. The proposed model extends the state-of-the-art GNN-based simulator (FIGNet) with novel node and edge types, enabling action-conditional predictions for control and state estimation in the context of robotic peg insertion. Our model learns in a self-supervised manner, using only joint encoder and force-torque data while the robot is touching the environment. In simulation, the MPC agent using our model matches the performance of the same controller with the ground truth dynamics model in a challenging peg-in-hole task, while in the real-world experiment, our model achieves a 50$\%$ improvement in motion prediction accuracy and 3$\times$ increase in force-torque prediction precision over the baseline physics simulator. Finally, we apply the model to track the robot end effector with a particle filter during real-world peg insertion, demonstrating a practical application of its predictive accuracy.

Updated: 2026-03-02 14:44:50

Categories: cs.RO, cs.LG

Download: http://arxiv.org/abs/2509.12151v2

Bound Propagation meets Constraint Simplification: Improving Logic-based XAI for Neural Networks

Logic-based methods for explaining neural network decisions offer formal guarantees of correctness and non-redundancy, but they often suffer from high computational costs, especially for large networks. In this work, we improve the efficiency of such methods by combining bound propagation with constraint simplification. These simplifications, derived from the propagation, tighten neuron bounds and eliminate unnecessary binary variables, making the explanation process more efficient. Our experiments suggest that combining these techniques reduces explanation time by up to 89.26\%, particularly for larger neural networks.
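The sketch below illustrates the general idea the abstract describes, under the assumption of a ReLU network with a MILP-style encoding: interval bounds are propagated through one affine layer, and any ReLU whose pre-activation sign is fixed over the whole input box no longer needs a binary variable. The function names and the toy layer are hypothetical, and the paper's actual propagation and simplification rules may be tighter than plain interval arithmetic.

```python
def affine_bounds(weights, bias, lower, upper):
    """Interval bounds on W x + b given elementwise bounds lower <= x <= upper."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight maps the lower input bound to the lower
            # output bound; a negative weight swaps the roles.
            lo += w * l if w >= 0 else w * u
            hi += w * u if w >= 0 else w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi

def classify_relu(pre_lo, pre_hi):
    """Label each ReLU by whether its phase is fixed over the whole input box."""
    labels = []
    for lo, hi in zip(pre_lo, pre_hi):
        if lo >= 0:
            labels.append("active")    # ReLU(z) = z, no binary variable needed
        elif hi <= 0:
            labels.append("inactive")  # ReLU(z) = 0, no binary variable needed
        else:
            labels.append("unstable")  # still needs a binary variable
    return labels

# Toy layer with three neurons over the input box [0, 1] x [0, 1].
W = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
b = [0.5, -0.5, 0.0]
pre_lo, pre_hi = affine_bounds(W, b, [0.0, 0.0], [1.0, 1.0])
labels = classify_relu(pre_lo, pre_hi)
```

In this toy layer only the third neuron remains unstable, so an encoding that started with three binary variables would keep one.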

Updated: 2026-03-02 14:36:31

Categories: cs.LO, cs.LG

Download: http://arxiv.org/abs/2603.01923v1

Real Money, Fake Models: Deceptive Model Claims in Shadow APIs

Access to frontier large language models (LLMs), such as GPT-5 and Gemini-2.5, is often hindered by high pricing, payment barriers, and regional restrictions. These limitations drive the proliferation of $\textit{shadow APIs}$, third-party services that claim to provide access to official model services without regional limitations via indirect access. Despite their widespread use, it remains unclear whether shadow APIs deliver outputs consistent with those of the official APIs, raising concerns about the reliability of downstream applications and the validity of research findings that depend on them. In this paper, we present the first systematic audit between official LLM APIs and corresponding shadow APIs. We first identify 17 shadow APIs that have been utilized in 187 academic papers, with the most popular one reaching 5,966 citations and 58,639 GitHub stars by December 6, 2025. Through multidimensional auditing of three representative shadow APIs across utility, safety, and model verification, we uncover both indirect and direct evidence of deception practices in shadow APIs. Specifically, we reveal performance divergence reaching up to $47.21\%$, significant unpredictability in safety behaviors, and identity verification failures in $45.83\%$ of fingerprint tests. These deceptive practices critically undermine the reproducibility and validity of scientific research, harm the interests of shadow API users, and damage the reputation of official model providers.

Updated: 2026-03-02 14:33:05

Categories: cs.CR, cs.AI, cs.SE

Download: http://arxiv.org/abs/2603.01919v1

How Effective Are Publicly Accessible Deepfake Detection Tools? A Comparative Evaluation of Open-Source and Free-to-Use Platforms

The proliferation of deepfake imagery poses escalating challenges for practitioners tasked with verifying digital media authenticity. While detection algorithm research is abundant, empirical evaluations of publicly accessible tools that practitioners actually use remain scarce. This paper presents the first cross-paradigm evaluation of six tools, spanning two complementary detection approaches: forensic analysis tools (InVID \& WeVerify, FotoForensics, Forensically) and AI-based classifiers (DecopyAI, FaceOnLive, Bitmind). Both tool categories were evaluated by professional investigators with law enforcement experience using blinded protocols across datasets comprising authentic, tampered, and AI-generated images sourced from DF40, CelebDF, and CASIA-v2. We report three principal findings: forensic tools exhibit high recall but poor specificity, while AI classifiers demonstrate the inverse pattern; human evaluators substantially outperform all automated tools; and human-AI disagreement is asymmetric, with human judgment prevailing in the vast majority of discordant cases. We discuss implications for practitioner workflows and identify critical gaps in current detection capabilities.
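For readers less familiar with the two metrics contrasted in the first finding, the following sketch computes recall and specificity from binary labels. The sample labels and predictions are invented for illustration and are not taken from the study; they mimic a tool that flags almost everything.

```python
def recall_and_specificity(labels, predictions):
    """Both inputs are lists of booleans; True means 'manipulated or AI-generated'."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    recall = tp / (tp + fn)        # share of fakes that were flagged
    specificity = tn / (tn + fp)   # share of authentic images that were cleared
    return recall, specificity

# Hypothetical results: three fake and four authentic images.
labels      = [True, True, True, False, False, False, False]
predictions = [True, True, True, True,  True,  True,  False]
recall, specificity = recall_and_specificity(labels, predictions)
```

A tool with this profile catches every fake but clears only one in four authentic images, which is the "high recall, poor specificity" pattern the abstract attributes to forensic tools.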

Updated: 2026-03-02 14:31:51

Categories: cs.CR

Download: http://arxiv.org/abs/2603.04456v1

Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration

Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi-agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human-readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience. Our project homepage is available at https://vividoc-homepage.vercel.app/.

Updated: 2026-03-02 14:27:49

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2603.01912v1

PHAT: Modeling Period Heterogeneity for Multivariate Time Series Forecasting

While existing multivariate time series forecasting models have advanced significantly in modeling periodicity, they largely neglect the periodic heterogeneity common in real-world data, where variables exhibit distinct and dynamically changing periods. To effectively capture this periodic heterogeneity, we propose PHAT (Period Heterogeneity-Aware Transformer). Specifically, PHAT arranges multivariate inputs into a three-dimensional "periodic bucket" tensor, where the dimensions correspond to variable group characteristics with similar periodicity, time steps aligned by phase, and offsets within the period. By restricting interactions within buckets and masking cross-bucket connections, PHAT effectively avoids interference from inconsistent periods. We also propose a positive-negative attention mechanism, which captures periodic dependencies from two perspectives: periodic alignment and periodic deviation. Additionally, the periodic alignment attention scores are decomposed into positive and negative components, with a modulation term encoding periodic priors. This modulation constrains the attention mechanism to more faithfully reflect the underlying periodic trends. A mathematical explanation is provided to support this property. We evaluate PHAT comprehensively on 14 real-world datasets against 18 baselines, and the results show that it significantly outperforms existing methods, achieving highly competitive forecasting performance. Our source code is available on GitHub.

Updated: 2026-03-02 14:27:33

Categories: cs.LG

Download: http://arxiv.org/abs/2602.00654v3

FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures

This system paper describes our participation in SemEval-2026 Task 7, "Everyday Knowledge Across Diverse Languages and Cultures". We participated in two subtasks: Track 1, Short Answer Questions (SAQ), and Track 2, Multiple-Choice Questions (MCQ). Our method is retrieval-augmented generation (RAG) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge bases (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally aware wiki-text and country-specific wiki-summaries. In addition to the local CulKBs, one of our systems also integrates live online search output via DuckDuckGo. For better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and code are shared at https://github.com/aaronlifenghan/FLANS-2026

Updated: 2026-03-02 14:27:14

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2603.01910v1

Efficient RLVR Training via Weighted Mutual Information Data Selection

Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints' success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
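The distinction between difficulty and evidence can be illustrated with a standard Beta-Bernoulli model of a datapoint's latent success rate. This is only a sketch of that modeling idea, assuming a uniform Beta(1, 1) prior; it is not InSight's acquisition score, whose exact form is not given in the abstract.

```python
def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) over a latent success rate after observed rollouts."""
    return prior_a + successes, prior_b + failures

def posterior_mean(a, b):
    """Mean belief about the success rate."""
    return a / (a + b)

def posterior_variance(a, b):
    """Epistemic uncertainty about the success rate; shrinks as evidence grows."""
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

def bernoulli_variance(p):
    """Difficulty-style signal: largest for intermediate success rates."""
    return p * (1.0 - p)

# Two hypothetical prompts with the same 50% observed success rate.
few = beta_posterior(successes=1, failures=1)      # 2 rollouts
many = beta_posterior(successes=10, failures=10)   # 20 rollouts

same_difficulty = (
    bernoulli_variance(posterior_mean(*few)),
    bernoulli_variance(posterior_mean(*many)),
)
uncertainty = (posterior_variance(*few), posterior_variance(*many))
```

Both prompts look equally "hard" to a difficulty-only heuristic, yet the belief about the first rests on far less evidence, which is the kind of gap an evidence-dependent term is meant to capture.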

Updated: 2026-03-02 14:25:07

Categories: cs.LG, cs.CL

Download: http://arxiv.org/abs/2603.01907v1

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM MatchXSPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens the complex reasoning of embodied agents.

Updated: 2026-03-02 14:17:33

Categories: cs.RO, cs.AI

Download: http://arxiv.org/abs/2602.15513v2

Agentic Code Reasoning

Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain-of-thought, semi-formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals. For code question answering on RubberDuckBench (Mohammad et al., 2026), semi-formal reasoning achieves 87% accuracy. For fault localization on Defects4J (Just et al., 2014), semi-formal reasoning improves Top-5 accuracy by 5 percentage points over standard reasoning. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.

Updated: 2026-03-02 14:17:06

Categories: cs.SE, cs.AI, cs.PL

Download: http://arxiv.org/abs/2603.01896v1

Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision

Large language models (LLMs) are an active and promising area of generative artificial intelligence and have demonstrated abilities to perform complex tasks in multiple domains, including mathematical and scientific reasoning. In this work, we construct a novel agent framework for solving representative problems in scientific computing. The proposed agent implements a "rewriting-resolution-review-revision" logical chain with three reasoning LLMs (acting as the Consultant, Reviewer, and Programmer, respectively) that work together in a collaborative and interactive manner. The Consultant module endows the agent with knowledge transfer capabilities to link problems to professional domain insights, thereby rewriting problem descriptions through text augmentation. The Programmer module is responsible for generating and executing well-structured code to deliver the problem resolution. The Reviewer module equips the agent with the capacity for self-debugging and self-refinement through interactive feedback with code runtime outputs. Through this end-to-end review mechanism, the executable code provided by the Programmer is iteratively revised. A comprehensive evaluation is conducted on the performance of the proposed agent framework in solving partial differential equations (PDEs), ill-conditioned linear systems, and data-driven physical analysis problems. Compared to a single-model baseline, this collaborative framework significantly improves the bug-free code generation rate and reduces the occurrence of non-physical solutions, thereby establishing a highly reliable framework for autonomous code generation based on natural language descriptions. The review mechanism also improves the average execution success rate of modern reasoning models. Our code is available at https://github.com/ChengAo21/Re4_Sci_Agent

Updated: 2026-03-02 14:14:41

Categories: cs.AI, physics.comp-ph

Download: http://arxiv.org/abs/2508.20729v2

SEAR: Sample Efficient Action Chunking Reinforcement Learning

Action chunking can improve exploration and value estimation in long-horizon reinforcement learning, but makes learning substantially harder since the critic must evaluate action sequences rather than single actions, greatly increasing approximation and data efficiency challenges. As a result, existing action chunking methods, primarily designed for the offline and offline-to-online settings, have not achieved strong performance in purely online reinforcement learning. We introduce SEAR, an off-policy online reinforcement learning algorithm for action chunking. It exploits the temporal structure of action chunks and operates with a receding horizon, effectively combining the benefits of small and large chunk sizes. SEAR outperforms state-of-the-art online reinforcement learning methods on Metaworld, training with chunk sizes up to 20.

Updated: 2026-03-02 14:11:53

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01891v1

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, generating an automatic curriculum of stronger opponents, and eliminating the need for human supervision. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. SPIRAL produces reasoning capabilities that transfer broadly, improving performance by up to 10% across a suite of 8 reasoning benchmarks on 4 different models spanning Qwen and Llama model families, outperforming supervised fine-tuning on 25,000 expert game trajectories. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) yields the strongest results, with improvements observed across both base and instruction-tuned models. Analysis of chain-of-thought traces reveals that games develop distinct cognitive patterns that transfer to improve reasoning performance, with different games developing complementary strengths. Even models which have already been trained on reasoning tasks using RLVR, like DeepSeek-R1-Distill-Qwen-7B, still benefit from our approach. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities across diverse model architectures and training stages, highlighting a promising direction for autonomous reasoning development. Our code can be found in https://github.com/spiral-rl/spiral.

Updated: 2026-03-02 14:09:28

Categories: cs.AI, cs.CL, cs.LG

Download: http://arxiv.org/abs/2506.24119v3

SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling

The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most widely used methods for addressing class imbalance and generating synthetic data. Despite its popularity, little attention has been paid to its privacy implications; yet, it is used in the wild in many privacy-sensitive applications. In this work, we conduct the first systematic study of privacy leakage in SMOTE: we begin by showing that prevailing evaluation practices, i.e., naive distinguishing and distance-to-closest-record metrics, completely fail to detect any leakage and that membership inference attacks (MIAs) can be instantiated with high accuracy. Then, by exploiting SMOTE's geometric properties, we build two novel attacks with very limited assumptions: DistinSMOTE, which perfectly distinguishes real from synthetic records in augmented datasets, and ReconSMOTE, which reconstructs real minority records from synthetic datasets with perfect precision and recall approaching one under realistic imbalance ratios. We also provide theoretical guarantees for both attacks. Experiments on eight standard imbalanced datasets confirm the practicality and effectiveness of these attacks. Overall, our work reveals that SMOTE is inherently non-private and disproportionately exposes minority records, highlighting the need to reconsider its use in privacy-sensitive applications and as a baseline for assessing the privacy of modern generative models.
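The geometric property the abstract refers to follows from how SMOTE generates data: each synthetic record is a random interpolation between a real minority record and one of its minority-class nearest neighbors, so it lies on the line segment joining two real records. The sketch below shows that interpolation step and a simple collinearity check. It is not the paper's DistinSMOTE or ReconSMOTE attack; the helper names and toy points are assumptions for illustration.

```python
import random

def smote_sample(x, neighbor, rng):
    """One synthetic point on the segment between x and a minority neighbor."""
    lam = rng.random()  # interpolation coefficient in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

def lies_on_segment(p, a, b, tol=1e-9):
    """True if p is (numerically) on the line segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        return all(abs(c) <= tol for c in ap)
    # Projection coefficient of p onto the segment, then squared distance to it.
    t = sum(u * v for u, v in zip(ap, ab)) / ab_len2
    residual = sum((u - t * v) ** 2 for u, v in zip(ap, ab))
    return -tol <= t <= 1.0 + tol and residual <= tol

# Two hypothetical real minority records and one synthetic record between them.
real_a, real_b = [0.0, 0.0], [4.0, 2.0]
synthetic = smote_sample(real_a, real_b, random.Random(0))

on_segment = lies_on_segment(synthetic, real_a, real_b)
off_segment = lies_on_segment([1.0, 3.0], real_a, real_b)
```

Because every synthetic record satisfies this constraint with respect to some pair of real records, the released data carries structural information about where the real minority records are.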

Updated: 2026-03-02 14:06:52

Categories: cs.CR

Download: http://arxiv.org/abs/2510.15083v3

Optimistic Task Inference for Behavior Foundation Models

Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead. Code is available at https://github.com/ThomasRupf/opti-bfm.
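
For readers unfamiliar with the upper-confidence algorithms for linear bandits that the regret bound connects to, here is a minimal LinUCB-style sketch. It is a generic textbook procedure over a small finite arm set, not the OpTI-BFM criterion itself. The arm vectors, `true_theta`, `beta`, and the noise level are all hypothetical values chosen for illustration.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sherman_morrison(A_inv, x):
    """Inverse of (A + x x^T), given A_inv = A^{-1} for a symmetric A."""
    Ax = matvec(A_inv, x)
    denom = 1.0 + dot(x, Ax)
    n = len(x)
    return [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(n)] for i in range(n)]

def linucb(arms, true_theta, rounds=300, beta=1.0, noise=0.1, seed=0):
    """Pull the arm with the highest optimistic reward estimate each round."""
    rng = random.Random(seed)
    d = len(true_theta)
    A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # ridge prior, A = I
    b = [0.0] * d
    pulls = [0] * len(arms)
    for _ in range(rounds):
        theta_hat = matvec(A_inv, b)  # regularized least-squares estimate

        def ucb(x):
            # Estimated reward plus an exploration bonus from the confidence ellipsoid.
            return dot(theta_hat, x) + beta * math.sqrt(dot(x, matvec(A_inv, x)))

        k = max(range(len(arms)), key=lambda i: ucb(arms[i]))
        x = arms[k]
        reward = dot(true_theta, x) + rng.gauss(0.0, noise)  # simulated noisy linear reward
        A_inv = sherman_morrison(A_inv, x)
        b = [bi + reward * xi for bi, xi in zip(b, x)]
        pulls[k] += 1
    return matvec(A_inv, b), pulls

arms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
theta_hat, pulls = linucb(arms, true_theta=[1.0, 0.0])
```

The exploration bonus shrinks along directions that have already been sampled, so pulls concentrate on the arm whose reward is highest under the unknown linear parameter.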

Updated: 2026-03-02 14:03:42

Categories: cs.LG

Download: http://arxiv.org/abs/2510.20264v2

Diagnosing Generalization Failures from Representational Geometry Markers

Generalization, the ability to perform well beyond the training context, is a hallmark of biological and artificial intelligence, yet anticipating unseen failures remains a central challenge. Conventional approaches often take a "bottom-up" mechanistic route by reverse-engineering interpretable features or circuits to build explanatory models. While insightful, these methods often struggle to provide the high-level, predictive signals for anticipating failure in real-world deployment. Here, we propose using a "top-down" approach to studying generalization failures inspired by medical biomarkers: identifying system-level measurements that serve as robust indicators of a model's future performance. Rather than mapping out detailed internal mechanisms, we systematically design and test network markers to probe structure-function links, identify prognostic indicators, and validate predictions in real-world settings. In image classification, we find that task-relevant geometric properties of in-distribution (ID) object manifolds consistently forecast poor out-of-distribution (OOD) generalization. In particular, reductions in two geometric measures, effective manifold dimensionality and utility, predict weaker OOD performance across diverse architectures, optimizers, and datasets. We apply this finding to transfer learning with ImageNet-pretrained models. We consistently find that the same geometric patterns predict OOD transfer performance more reliably than ID accuracy. This work demonstrates that representational geometry can expose hidden vulnerabilities, offering more robust guidance for model selection and AI interpretability.

Updated: 2026-03-02 13:59:19

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2603.01879v1

FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability-Plasticity Tradeoff

Deep neural networks trained on nonstationary data must balance stability (i.e., retaining prior knowledge) and plasticity (i.e., adapting to new tasks). Standard reinitialization methods, which reinitialize weights toward their original values, are widely used but difficult to tune: conservative reinitializations fail to restore plasticity, while aggressive ones erase useful knowledge. We propose FIRE, a principled reinitialization method that explicitly balances the stability-plasticity tradeoff. FIRE quantifies stability through Squared Frobenius Error (SFE), measuring proximity to past weights, and plasticity through Deviation from Isometry (DfI), reflecting weight isotropy. The reinitialization point is obtained by solving a constrained optimization problem, minimizing SFE subject to DfI being zero, which is efficiently approximated by Newton-Schulz iteration. FIRE is evaluated on continual visual learning (CIFAR-10 with ResNet-18), language modeling (OpenWebText with GPT-0.1B), and reinforcement learning (HumanoidBench with SAC and Atari games with DQN). Across all domains, FIRE consistently outperforms both naive training without intervention and standard reinitialization methods, demonstrating effective balancing of the stability-plasticity tradeoff.
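
The constrained problem described above (stay close to the old weights in Frobenius norm while being an exact isometry) is solved, for a full-rank matrix, by the orthogonal polar factor, and Newton-Schulz iteration is a standard way to approximate that factor using only matrix products. The sketch below shows the iteration on a tiny matrix. It assumes DfI is measured as the squared Frobenius norm of W^T W - I, which the abstract does not spell out, and the Frobenius-norm pre-scaling and step count are illustrative choices rather than the authors' settings.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def frob_sq(A):
    return sum(v * v for row in A for v in row)

def deviation_from_isometry(W):
    """Assumed definition: || W^T W - I ||_F^2."""
    G = matmul(transpose(W), W)
    n = len(G)
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))

def newton_schulz(W, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix W."""
    # Frobenius norm >= spectral norm, so all singular values of X start in (0, 1].
    scale = frob_sq(W) ** 0.5
    X = [[v / scale for v in row] for row in W]
    for _ in range(steps):
        XtX = matmul(transpose(X), X)
        XXtX = matmul(X, XtX)
        # Each step maps every singular value s to 1.5*s - 0.5*s**3, which tends to 1.
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

W = [[2.0, 0.5], [0.1, 1.0]]  # toy "trained" weight matrix
Q = newton_schulz(W)
dfi_before = deviation_from_isometry(W)
dfi_after = deviation_from_isometry(Q)
```

Because the iteration only changes singular values and keeps the singular vectors of W, the result retains the directions the old weights had learned while resetting their scale.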

Updated: 2026-03-02 13:58:28

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2602.08040v2

Exploiting Low-Dimensional Manifold of Features for Few-Shot Whole Slide Image Classification

Few-shot Whole Slide Image (WSI) classification is severely hampered by overfitting. We argue that this is not merely a data-scarcity issue but a fundamentally geometric problem. Grounded in the manifold hypothesis, our analysis shows that features from pathology foundation models exhibit a low-dimensional manifold geometry that is easily perturbed by downstream models. This insight reveals a key potential issue in downstream multiple instance learning models: linear layers are geometry-agnostic and, as we show empirically, can distort the manifold geometry of the features. To address this, we propose the Manifold Residual (MR) block, a plug-and-play module that is explicitly geometry-aware. The MR block reframes the linear layer as residual learning and decouples it into two pathways: (1) a fixed, random matrix serving as a geometric anchor that approximately preserves topology while also acting as a spectral shaper to sharpen the feature spectrum; and (2) a trainable, low-rank residual pathway for task-specific adaptation, with its structural bottleneck explicitly mirroring the low effective rank of the features. This decoupling imposes a structured inductive bias and reduces learning to a simpler residual fitting task. Through extensive experiments, we demonstrate that our approach achieves state-of-the-art results with significantly fewer parameters, offering a new paradigm for few-shot WSI classification. Code is available at https://github.com/BearCleverProud/MR-Block.
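
The two-pathway decomposition can be sketched in a few lines. The code below is a schematic reading of the abstract, not the released MR-Block implementation: the Gaussian initialization, the zero-initialized `up` matrix, and every function name are assumptions made for illustration.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def make_block(d_in, d_out, rank, seed=0):
    rng = random.Random(seed)
    scale = d_in ** -0.5
    # Pathway 1: fixed random matrix, never updated during training.
    anchor = [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]
    # Pathway 2: trainable low-rank factors (d_in x rank) and (rank x d_out).
    down = [[rng.gauss(0.0, scale) for _ in range(rank)] for _ in range(d_in)]
    up = [[0.0] * d_out for _ in range(rank)]  # zero init is an assumption
    return anchor, down, up

def forward(x, block):
    anchor, down, up = block
    return add(matmul(x, anchor), matmul(matmul(x, down), up))

def trainable_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

block = make_block(d_in=6, d_out=4, rank=2)
x = [[1.0] * 6]
y = forward(x, block)
dense_params = 1024 * 512
low_rank_params = trainable_params(1024, 512, 8)
```

With a zero-initialized residual the block starts out as the fixed random projection alone, and for a hypothetical 1024-to-512 layer at rank 8 the trainable parameter count falls from 524,288 to 12,288.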

Updated: 2026-03-02 13:58:13

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2505.15504v2

Topological Inductive Bias fosters Multiple Instance Learning in Data-Scarce Scenarios

Multiple instance learning (MIL) is a framework for weakly supervised classification, where labels are assigned to sets of instances, i.e., bags, rather than to individual data points. This paradigm has proven effective in tasks where fine-grained annotations are unavailable or costly to obtain. However, the effectiveness of MIL drops sharply when training data are scarce, such as for rare disease classification. To address this challenge, we propose incorporating topological inductive biases into the data representation space within the MIL framework. This bias introduces a topology-preserving constraint that encourages the instance encoder to maintain the topological structure of the instance distribution within each bag when mapping them to MIL latent space. As a result, our Topology Guided MIL (TG-MIL) method enhances the performance and generalizability of MIL classifiers across different aggregation functions, especially under scarce-data regimes. Our evaluations show average performance improvements of 15.3% for synthetic MIL datasets, 2.8% for MIL benchmarks, and 5.5% for rare anemia classification compared to current state-of-the-art MIL models, where only 17-120 samples per class are available. We make our code publicly available.

Updated: 2026-03-02 13:55:52

Categories: cs.LG, cs.CV, eess.IV, q-bio.QM, stat.ML

Download: http://arxiv.org/abs/2307.14025v3

Aggressive or Imperceptible, or Both: Network Pruning Assisted Hybrid Byzantines in Federated Learning

In federated learning (FL), profiling and verifying each client is inherently difficult, which introduces a significant security vulnerability: malicious clients, commonly referred to as Byzantines, can degrade the accuracy of the global model by submitting poisoned updates during training. To mitigate this, the aggregation process at the parameter server must be robust against such adversarial behaviour. Most existing defences approach the Byzantine problem from an outlier detection perspective, treating malicious updates as statistical anomalies and ignoring the internal structure of the trained neural network (NN). Motivated by this, this work highlights the potential of leveraging side information tied to the NN architecture to design stronger, more targeted attacks. In particular, inspired by insights from sparse NNs, we introduce a hybrid sparse Byzantine attack. The attack consists of two coordinated components: (i) A sparse attack component that selectively manipulates parameters with higher sensitivity in the NN, aiming to cause maximum disruption with minimal visibility; (ii) A slow-accumulating attack component that silently poisons parameters over multiple rounds to evade detection. Together, these components create a strong but imperceptible attack strategy that can bypass common defences. We evaluate the proposed attack through extensive simulations and demonstrate its effectiveness against eight state-of-the-art defence mechanisms.

Updated: 2026-03-02 13:55:40

Categories: cs.LG, cs.CR, cs.DC

Download: http://arxiv.org/abs/2404.06230v2

Systematic Survey on Privacy-Preserving Architectures for IoT and Vehicular Data Sharing: Techniques, Challenges, and Future Directions

The proliferation of IoT and V2X systems generates unprecedented sensitive data at the network edge, demanding privacy-preserving architectures that enable secure sharing without exposing raw information. Contemporary solutions face a fundamental privacy-efficiency-trust trilemma: achieving strong privacy guarantees, computational efficiency for resource-constrained devices, and decentralized trust simultaneously remains intractable with single-paradigm approaches. This survey systematically analyzes 75 technical papers (2007-2025) through a novel three-dimensional taxonomy classifying architectures into Decentralized Computation, Cryptography-based, and Distributed Ledger approaches. Temporal analysis reveals dramatic acceleration during 2024-2025, with 48% of all papers published in this period; Decentralized Computation dominates at 44% of contributions and 59% of 2025 publications. Comprehensive Security Threat Mapping and Technology Maturity Assessment demonstrate that mature solutions occupy narrow design regions excelling in one or two dimensions while compromising others, conclusively validating the trilemma hypothesis. We identify emerging hybrid architectures combining complementary paradigms as the essential path forward. Critical challenges including security guarantee composition across layers, multi-layer coordination overhead minimization, and post-quantum security integration must be addressed for practical deployment in next-generation intelligent transportation systems and IoT ecosystems.

Updated: 2026-03-02 13:54:52

Categories: cs.CR

Download: http://arxiv.org/abs/2603.01876v1

KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models

Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed KDFlow, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve 1.44× to 6.36× speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
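
A back-of-the-envelope calculation shows why sending hidden states instead of logits reduces traffic. The sizes below (hidden width 4096, vocabulary 128,000, 2-byte values) are hypothetical and not taken from the paper. The sketch also assumes the teacher's output projection is available on the student side so that logits can be recomputed there; `recompute_logits` is an illustrative helper, not a KDFlow API.

```python
def bytes_per_token(width, bytes_per_value=2):
    """Payload for one token when sending a vector of the given width."""
    return width * bytes_per_value

hidden_size = 4096    # assumed teacher hidden width
vocab_size = 128_000  # assumed vocabulary size

hidden_bytes = bytes_per_token(hidden_size)
logit_bytes = bytes_per_token(vocab_size)
ratio = logit_bytes / hidden_bytes  # how much larger the logits payload is

def recompute_logits(hidden, lm_head):
    """logits[v] = <lm_head[v], hidden>, computed on the receiving side."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]

toy_logits = recompute_logits([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Under these assumed sizes each token's logits payload is 31.25 times larger than its hidden-state payload, at the price of one extra matrix product on the student side.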

Updated: 2026-03-02 13:54:19

Categories: cs.CL, cs.AI, cs.LG

Download: http://arxiv.org/abs/2603.01875v1

A Survey for Deep Reinforcement Learning Based Network Intrusion Detection

Cyber-attacks are becoming increasingly sophisticated and frequent, highlighting the importance of network intrusion detection systems. This paper explores the potential and challenges of using deep reinforcement learning (DRL) in network intrusion detection. It begins by introducing key DRL concepts and frameworks, such as deep Q-networks and actor-critic algorithms, and reviews recent research utilizing DRL for intrusion detection. The study evaluates challenges related to model training efficiency, detection of minority and unknown class attacks, feature selection, and handling unbalanced datasets. The performance of DRL models is comprehensively analyzed, showing that while DRL holds promise, many recent technologies remain underexplored. Some DRL models achieve state-of-the-art results on public datasets, occasionally outperforming traditional deep learning methods. The paper concludes with recommendations for enhancing DRL deployment and testing in real-world network scenarios, with a focus on Internet of Things intrusion detection. It discusses recent DRL architectures and suggests future policy functions for DRL-based intrusion detection. Finally, the paper proposes integrating DRL with generative methods to further improve performance, addressing current gaps and supporting more robust and adaptive network intrusion detection systems.

Updated: 2026-03-02 13:54:11

Categories: cs.CR, cs.AI

Download: http://arxiv.org/abs/2410.07612v2

Phishing the Phishers with SpecularNet: Hierarchical Graph Autoencoding for Reference-Free Web Phishing Detection

Phishing remains the most pervasive threat to the Web, enabling large-scale credential theft and financial fraud through deceptive webpages. While recent reference-based and generative-AI-driven phishing detectors achieve strong accuracy, their reliance on external knowledge bases, cloud services, and complex multimodal pipelines fundamentally limits practicality, scalability, and reproducibility. In contrast, conventional deep learning approaches often fail to generalize to evolving phishing campaigns. We introduce SpecularNet, a novel lightweight framework for reference-free web phishing detection that demonstrates how carefully designed compact architectures can rival heavyweight systems. SpecularNet operates solely on the domain name and HTML structure, modeling the Document Object Model (DOM) as a tree and leveraging a hierarchical graph autoencoding architecture with directional, level-wise message passing. This design captures higher-order structural invariants of phishing webpages while enabling fast, end-to-end inference on standard CPUs. Extensive evaluation against 13 state-of-the-art phishing detectors, including leading reference-based systems, shows that SpecularNet achieves competitive detection performance with dramatically lower computational cost. On benchmark datasets, it reaches an F1 score of 93.9%, trailing the best reference-based method slightly while reducing inference time from several seconds to approximately 20 milliseconds per webpage. Field and robustness evaluations further validate SpecularNet in real-world deployments, on a newly collected 2026 open-world dataset, and against adversarial attacks.

Updated: 2026-03-02 13:54:04

Categories: cs.CR, cs.AI

Download: http://arxiv.org/abs/2603.01874v1

Improved state mixing in higher-order and block diagonal linear recurrent networks

Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks, yet their diagonal state transitions limit expressivity. Dense and nonlinear architectures (e.g., LSTMs), on the other hand, are provably more expressive but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency. Specifically, we introduce two structured LRNN architectures: (i) Higher-order Linear Recurrent Units (H-LRU), which generalize first-order recurrence to higher order, mixing multiple past states, and (ii) Block-Diagonal LRUs (BD-LRU), which enable dense intra-block channel mixing. Per-channel (H-LRU) or per-row (BD-LRU) L1-normalization of selective gates stabilizes training and allows for scaling window/block sizes. A parallel-scan implementation of the proposed architectures keeps the throughput competitive with diagonal LRNNs for moderate orders (H-LRU) and block sizes (BD-LRU). In synthetic sequence modeling tasks, the performance of BD-LRU matches or exceeds that of linear SSMs (Mamba), low-rank LRNNs (DeltaNet), and LSTM baselines, while H-LRU is found to be the most parameter-efficient in the compression task. In both synthetic sequence modeling and language modeling, our results indicate that the structure of state mixing, rather than width alone, shapes the expressivity of LRNNs, offering a practical route to closing the efficiency-expressivity gap in linear sequence models.
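
To illustrate why L1-normalized gates help stability, the following sketch runs a scalar higher-order recurrence h_t = a_1 h_{t-1} + ... + a_K h_{t-K} + b x_t with the coefficient vector (a_1, ..., a_K, b) rescaled to unit L1 norm. This is one plausible reading of the abstract, not the authors' exact parameterization: the gates here are fixed rather than input-dependent, and all names are hypothetical.

```python
def l1_normalize(coeffs):
    """Rescale coefficients so their absolute values sum to 1."""
    total = sum(abs(c) for c in coeffs)
    return [c / total for c in coeffs]

def higher_order_recurrence(xs, raw_gates):
    """Run h_t = sum_k a_k * h_{t-k} + b * x_t with L1-normalized (a_1..a_K, b)."""
    *a, b = l1_normalize(raw_gates)
    history = [0.0] * len(a)  # h_{t-1}, ..., h_{t-K}
    out = []
    for x in xs:
        h = sum(ak * hk for ak, hk in zip(a, history)) + b * x
        history = [h] + history[:-1]
        out.append(h)
    return out

xs = [1.0] * 50
states = higher_order_recurrence(xs, raw_gates=[3.0, -2.0, 1.0, 2.0])
```

Since the absolute coefficients sum to one, each new state is bounded by the largest magnitude among the previous states and the current input, so the state cannot blow up however long the sequence is.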

Updated: 2026-03-02 13:53:41

Categories: cs.LG

Download: http://arxiv.org/abs/2602.12021v2

VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.553 and a correlation with human ratings of only 0.428. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.421 (a 23.9% reduction) and increasing the consistency with human experts to 0.687 (a 60.5% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.
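
The two relative improvements quoted above follow directly from the reported figures. The short check below recomputes them; only the four numbers from the abstract are used, and the helper name is illustrative.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before

# MAE: lower is better, so report the size of the reduction.
mae_reduction_pct = round(-relative_change(0.553, 0.421) * 100, 1)
# Correlation with human ratings: higher is better.
corr_gain_pct = round(relative_change(0.428, 0.687) * 100, 1)
```

Both values agree with the percentages stated in the abstract (23.9 and 60.5).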

Updated: 2026-03-02 13:53:32

Categories: cs.CL, cs.AI, cs.CV

Download: http://arxiv.org/abs/2510.22373v3

Generalizing Logic-based Explanations for Machine Learning Classifiers via Optimization

Machine learning models support decision-making, yet the reasons behind their predictions are opaque. Clear and reliable explanations help users make informed decisions and avoid blindly trusting model outputs. However, many existing explanation methods fail to guarantee correctness. Logic-based approaches ensure correctness but often offer overly constrained explanations, limiting coverage. Recent work addresses this by incrementally expanding explanations while maintaining correctness. This process is performed separately for each feature, adjusting both its upper and lower bounds. However, this approach faces a trade-off: smaller increments incur high computational costs, whereas larger ones may lead to explanations covering fewer instances. To overcome this, we propose two novel methods. Onestep builds upon this prior work, generating explanations in a single step for each feature and each bound, eliminating the overhead of an iterative process. Twostep takes a gradual approach, improving coverage. Experimental results show that Twostep significantly increases explanation coverage (by up to 72.60% on average across datasets) compared to Onestep and, consequently, to prior work.

Updated: 2026-03-02 13:51:02

Categories: cs.LO, cs.LG

Download: http://arxiv.org/abs/2603.01870v1

Topological Causal Effects

Estimating causal effects is particularly challenging when outcomes arise in complex, non-Euclidean spaces, where conventional methods often fail to capture meaningful structural variation. We develop a framework for topological causal inference that defines treatment effects through differences in the topological structure of potential outcomes, summarized by power-weighted silhouette functions of persistence diagrams. We develop an efficient, doubly robust estimator in a fully nonparametric model, establish functional weak convergence, and construct a formal test of the null hypothesis of no topological effect. Empirical studies illustrate that the proposed method reliably quantifies topological treatment effects across diverse complex outcome types.
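
As background on the summary statistic, a power-weighted silhouette turns a persistence diagram into a function of a scale parameter t: each (birth, death) pair contributes a tent function, weighted by its persistence raised to a power p. The sketch below implements that standard definition and a naive pointwise difference between two diagrams. It is not the doubly robust estimator developed in the paper, and the toy diagrams and function names are assumptions.

```python
def silhouette(diagram, t, power=1.0):
    """Power-weighted silhouette of a persistence diagram evaluated at t."""
    num = 0.0
    den = 0.0
    for birth, death in diagram:
        weight = (death - birth) ** power                 # persistence^p
        tent = max(0.0, min(t - birth, death - t))        # tent function at t
        num += weight * tent
        den += weight
    return num / den if den > 0 else 0.0

def silhouette_difference(diagram_treated, diagram_control, grid, power=1.0):
    """Naive pointwise difference of two silhouettes over a grid of t values."""
    return [silhouette(diagram_treated, t, power) - silhouette(diagram_control, t, power)
            for t in grid]

treated = [(0.0, 2.0), (1.0, 1.5)]  # toy persistence diagram
control = [(0.0, 1.0)]              # toy persistence diagram
value_at_one = silhouette(treated, 1.0)
diff = silhouette_difference(treated, control, grid=[0.5, 1.0, 1.25])
```

Larger powers emphasize long-lived features, while powers near zero weight all features more evenly.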

Updated: 2026-03-02 13:47:23

Categories: stat.ME, cs.LG, stat.ML

Download: http://arxiv.org/abs/2603.02289v1

CAIMAN: Causal Action Influence Detection for Sample-efficient Loco-manipulation

Enabling legged robots to perform non-prehensile loco-manipulation is crucial for enhancing their versatility. Learning behaviors such as whole-body object pushing often requires sophisticated planning strategies or extensive task-specific reward shaping, especially in unstructured environments. In this work, we present CAIMAN, a practical reinforcement learning framework that encourages the agent to gain control over other entities in the environment. CAIMAN leverages causal action influence as an intrinsic motivation objective, allowing legged robots to efficiently acquire object pushing skills even under sparse task rewards. We employ a hierarchical control strategy, combining a low-level locomotion module with a high-level policy that generates task-relevant velocity commands and is trained to maximize the intrinsic reward. To estimate causal action influence, we learn the dynamics of the environment by integrating a kinematic prior with data collected during training. We empirically demonstrate CAIMAN's superior sample efficiency and adaptability to diverse scenarios in simulation, as well as its successful transfer to real-world systems without further fine-tuning. A video demo is available at https://www.youtube.com/watch?v=dNyvT04Cqaw.

Updated: 2026-03-02 13:47:02

Categories: cs.RO, cs.LG

Download: http://arxiv.org/abs/2502.00835v4

The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers

Crafting adversarial examples can be formulated as an optimization problem. While sign-based optimizers such as I-FGSM and MI-FGSM have become the de facto standard for the induced optimization problems, several problems in theoretical grounding and practical reliability remain unsolved, especially non-convergence and instability, which inevitably affect their transferability. Contrary to expectation, we observe that the attack success rate may degrade sharply as more iterations are conducted. In this paper, we address these issues from an optimization perspective. By reformulating the sign-based optimizer as a specific coordinate-wise gradient descent, we argue that one cause of non-convergence and instability is their non-decaying step-size scheduling. Based upon this viewpoint, we propose a series of new attack algorithms that enforce Monotonically Decreasing Coordinate-wise Step-sizes (MDCS) within sign-based optimizers. We further provide theoretical guarantees proving that MDCS-MI attains an optimal convergence rate of $O(1/\sqrt{T})$, where $T$ is the number of iterations. Extensive experiments on image classification and cross-modal retrieval tasks demonstrate that our approach not only significantly improves transferability but also enhances attack stability compared to state-of-the-art sign-based methods.

Updated: 2026-03-02 13:46:58

Categories: cs.LG

Download: http://arxiv.org/abs/2602.19096v2
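
The non-convergence argument in the abstract above can be seen on a toy problem. The sketch below runs sign-based descent on a simple quadratic, once with a constant step and once with a step that shrinks as 1/sqrt(t). It is only an illustration under stated assumptions: the objective, the scalar schedule, and the names `sign_descent`, `target`, and `dist` are hypothetical, and this is not the MDCS algorithm, which uses coordinate-wise step sizes inside I-FGSM-style attacks.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_descent(grad, x0, steps, step_size):
    """Run x <- x - step_size(t) * sign(grad(x)) on every coordinate."""
    x = list(x0)
    for t in range(1, steps + 1):
        g = grad(x)
        eta = step_size(t)
        x = [xi - eta * sign(gi) for xi, gi in zip(x, g)]
    return x


# Toy objective (hypothetical): f(x) = 0.5 * ||x - target||^2.
target = [0.3, -0.7]
grad = lambda x: [xi - ti for xi, ti in zip(x, target)]
dist = lambda x: max(abs(xi - ti) for xi, ti in zip(x, target))

# A constant step keeps bouncing around the optimum; a decaying step closes in.
constant = sign_descent(grad, [0.0, 0.0], 400, lambda t: 0.25)
decaying = sign_descent(grad, [0.0, 0.0], 400, lambda t: 0.25 / math.sqrt(t))
```

With the constant step, each coordinate ends up alternating between two points on either side of the target. With the decaying step, the remaining error is bounded by the current step size.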

Tide: A Customisable Dataset Generator for Anti-Money Laundering Research

The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while existing synthetic generators focus on simplistic structural patterns and neglect the temporal dynamics (timing and frequency) that characterise sophisticated laundering schemes. We present Tide, an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics. Tide enables reproducible, customisable dataset generation tailored to specific research needs. We release two reference datasets with varying illicit ratios (LI: 0.10%, HI: 0.19%), alongside the implementation of state-of-the-art detection models. Evaluation across these datasets reveals condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) in the low illicit ratio condition, while XGBoost performs best (85.12) at higher fraud prevalence. These divergent rankings demonstrate that the reference datasets can meaningfully differentiate model capabilities across operational conditions. Tide provides the research community with a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods.

Updated: 2026-03-02 13:44:18

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2603.01863v1

Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models

Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations. We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences in diagrammatic imagery without requiring any modification or editing of the original data. By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures. Empirical evaluations on a benchmark dataset of flowcharts demonstrate substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering tasks. The results underscore the value of domain-specific training strategies and contribute to advancing diagrammatic understanding within the broader context of vision-language learning.

Updated: 2026-03-02 13:34:57

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2602.23589v2

Soft-Masked Diffusion Language Models

Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-k predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that efficiently adapts masked diffusion language models to incorporate SM. We demonstrate that training a 169M parameter model from scratch with SM yields superior perplexity and MAUVE scores compared to binary masking baselines. Similarly, a pretrained model can be enhanced with SM through continued pretraining. Finally, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings. The code is available at https://github.com/IBM/soft-masked-diffusion-language-models.

Updated: 2026-03-02 13:25:55

Categories: cs.LG, cs.AI, cs.CL

Download: http://arxiv.org/abs/2510.17206v2
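
The blending step described in the abstract above can be sketched in a few lines. This is a rough illustration, not the released implementation: the abstract does not say how the blend weight is chosen or how the top-k predictions are weighted, so the fixed `blend` scalar, the softmax renormalised over the top-k logits, and the name `soft_mask_embedding` are all assumptions.

```python
import math


def soft_mask_embedding(mask_emb, token_embs, logits, k, blend):
    """Blend the mask embedding with the top-k predicted token embeddings.

    mask_emb: list[float], embedding of the mask token.
    token_embs: list[list[float]], one embedding per vocabulary entry.
    logits: list[float], previous-step predictions for this position.
    k: number of top tokens to mix in; blend: weight in [0, 1] given to them.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the top-k logits (shifted for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    norm = sum(weights)
    dim = len(mask_emb)
    mixed = [
        sum(w / norm * token_embs[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    # Convex combination: blend=0 recovers the plain (binary) mask embedding.
    return [(1.0 - blend) * m + blend * x for m, x in zip(mask_emb, mixed)]


# Toy vocabulary of three tokens with 2-d embeddings (hypothetical values).
mask_emb = [0.0, 0.0]
token_embs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
logits = [2.0, 2.0, -10.0]
soft = soft_mask_embedding(mask_emb, token_embs, logits, k=2, blend=0.5)
```

Setting `blend=0.0` reproduces ordinary binary masking, which makes the sketch easy to compare against a baseline.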

Certified Circuits: Stability Guarantees for Mechanistic Circuits

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits, minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture the concept or dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. The framework abstains on unstable neurons, yielding circuits that are more compact and more accurate. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code will be released soon!

Updated: 2026-03-02 13:21:23

Categories: cs.AI, cs.CV, cs.CY

Download: http://arxiv.org/abs/2602.22968v2

Trivial Graph Features and Classical Learning are Enough to Detect Random Anomalies

Detecting anomalies in link streams that represent various kinds of interactions is an important research topic with crucial applications. Because of the lack of ground truth data, proposed methods are mostly evaluated through their ability to detect randomly injected links. In contrast with most proposed methods, which rely on complex approaches raising computational and/or interpretability issues, we show here that trivial graph features and classical learning techniques are sufficient to detect such anomalies extremely well. This basic approach has very low computational costs and leads to easily interpretable results. It also has many other desirable properties that we study through an extensive set of experiments. We conclude that detection methods should now target more complex kinds of anomalies.

Updated: 2026-03-02 13:19:29

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01841v1

Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers

Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.

Updated: 2026-03-02 13:17:58

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2602.14760v2

HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inference on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with a minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Finally, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.

Updated: 2026-03-02 13:16:56

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2508.04663v4

Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training

Deep Learning architectures, and in particular Transformers, are conventionally viewed as a composition of layers. These layers are actually often obtained as the sum of two contributions: a residual path that copies the input, and the output of a Transformer block. As a consequence, the inner representations (i.e., the inputs of these blocks) can be interpreted as iterative refinement of a propagated latent representation. Under this lens, many works suggest that the inner space is shared across layers, meaning that tokens can be decoded at early stages. Mechanistic interpretability goes even further by conjecturing that some layers act as refinement layers. Following this path, we propose inference-time inner looping, which prolongs refinement in pretrained off-the-shelf language models by repeatedly re-applying a selected block range. Across multiple benchmarks, inner looping yields modest but consistent accuracy improvements. Analyses of the resulting latent trajectories suggest more stable state evolution and continued semantic refinement. Overall, our results suggest that additional refinement can be obtained through simple test-time looping, extending computation in frozen pretrained models.

Updated: 2026-03-02 13:15:58

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2602.14759v2
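
The control flow of inner looping, as the abstract above describes it, is easy to sketch: run the network as usual, but re-apply a chosen block range several times before continuing. The toy below uses plain Python functions as stand-ins for Transformer blocks. The names `forward`, `loop_range`, and `loops` are hypothetical, and nothing here reflects how the authors select the range or the loop count.

```python
def forward(x, blocks, loop_range=None, loops=1):
    """Apply residual blocks in order, repeating blocks[start:end] `loops` times.

    blocks: list of functions f(x) -> update; each layer computes x + f(x).
    loop_range: (start, end) indices of the block range to re-apply, or None.
    """
    if loop_range is None:
        start = end = len(blocks)  # nothing is looped
    else:
        start, end = loop_range

    def run(xs, layer_blocks):
        for f in layer_blocks:
            update = f(xs)
            xs = [a + b for a, b in zip(xs, update)]  # residual connection
        return xs

    x = run(x, blocks[:start])
    for _ in range(loops):
        x = run(x, blocks[start:end])
    return run(x, blocks[end:])


# Toy "blocks" (hypothetical): each moves the state halfway toward 1.0.
toward_one = lambda xs: [0.5 * (1.0 - v) for v in xs]
blocks = [toward_one, toward_one, toward_one]

baseline = forward([0.0], blocks)
looped = forward([0.0], blocks, loop_range=(1, 2), loops=3)
```

With `loops=1` the function reduces to the ordinary forward pass, so the looped variant needs no retraining and no change to the block weights.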

Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes

Diffusion models have gained prominence as powerful generative tools for solving inverse problems due to their ability to model complex data distributions. However, existing methods typically rely on complete knowledge of the forward observation process to compute gradients for guided sampling, limiting their applicability in scenarios where such information is unavailable. In this work, we introduce Constrained Particle Seeking (CPS), a novel gradient-free approach that leverages all candidate particle information to actively search for the optimal particle while incorporating constraints aligned with high-density regions of the unconditional prior. Unlike previous methods that passively select promising candidates, CPS reformulates the inverse problem as a constrained optimization task, enabling more flexible and efficient particle seeking. We demonstrate that CPS can effectively solve both image and scientific inverse problems, achieving results comparable to gradient-based methods while significantly outperforming gradient-free alternatives. Code is available at https://github.com/deng-ai-lab/CPS.

Updated: 2026-03-02 13:15:49

Categories: cs.LG, stat.ML

Download: http://arxiv.org/abs/2603.01837v1

Probing Materials Knowledge in LLMs: From Latent Embeddings to Reliable Predictions

Large language models are increasingly applied to materials science, yet fundamental questions remain about their reliability and knowledge encoding. Evaluating 25 LLMs across four materials science tasks (over 200 base and fine-tuned configurations), we find that output modality fundamentally determines model behavior. For symbolic tasks, fine-tuning converges to consistent, verifiable answers with reduced response entropy, while for numerical tasks, fine-tuning improves prediction accuracy but models remain inconsistent across repeated inference runs, limiting their reliability as quantitative predictors. For numerical regression, we find that better performance can be obtained by extracting embeddings directly from intermediate transformer layers than from model text output, revealing an "LLM head bottleneck," though this effect is property- and dataset-dependent. Finally, we present a longitudinal study of GPT model performance in materials science, tracking four models over 18 months and observing 9-43% performance variation that poses reproducibility challenges for scientific applications.

Updated: 2026-03-02 13:09:12

Categories: cond-mat.mtrl-sci, cs.LG

Download: http://arxiv.org/abs/2603.01834v1

Quantum approaches to learning parity with noise

The learning parity with noise (LPN) problem is a well-established computational challenge whose difficulty is critical to the security of several post-quantum cryptographic primitives such as HQC and Classic McEliece. Classically, the best-known attacks involve information set decoding methods which are exponential in complexity for parameterisations of interest. In this paper we investigate whether quantum methods might offer alternative approaches. The line of inquiry is inspired by Regev's relating of certain lattice problems to the hidden dihedral subgroup problem. We use neighbourhoods of binary fields to produce a function close to fulfilling Simon's promise with difference equal to the secret parity vector. Although unlikely to recover the secret parity vector directly, running Simon's algorithm essentially produces new LPN samples. This gives the hope that we might be able to produce enough new samples to ignore one or more variables and iteratively reduce the problem. We make no claim that these methods will necessarily be competitive with existing approaches, merely that they warrant deeper investigation.

Updated: 2026-03-02 13:03:43

Categories: cs.CR

Download: http://arxiv.org/abs/2602.19819v2
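
For readers new to LPN, the object the abstract above calls an "LPN sample" is a random binary vector paired with its noisy inner product with the secret, taken mod 2. The sketch below only generates such samples classically. It says nothing about the quantum procedure in the paper, and the helper name `lpn_samples`, the seed, and the toy parameters are hypothetical.

```python
import random


def lpn_samples(secret, num_samples, noise_rate, rng):
    """Draw LPN samples (a, <a, secret> + e mod 2) with Bernoulli(noise_rate) noise e."""
    samples = []
    for _ in range(num_samples):
        a = [rng.randrange(2) for _ in secret]
        parity = sum(ai & si for ai, si in zip(a, secret)) % 2
        noise = 1 if rng.random() < noise_rate else 0
        samples.append((a, parity ^ noise))
    return samples


rng = random.Random(0)  # fixed seed so the toy run is repeatable
secret = [1, 0, 1, 1]  # hypothetical 4-bit secret parity vector
clean = lpn_samples(secret, 8, 0.0, rng)
noisy = lpn_samples(secret, 8, 0.125, rng)
```

With `noise_rate=0.0` the secret can be recovered by Gaussian elimination once enough independent samples are available. The noise is what makes the problem hard, which is why a supply of fresh samples is valuable.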

Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we demonstrate that unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, and that larger LLMs exhibit stronger detectability. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned, given an input query.

Updated: 2026-03-02 13:02:44

Categories: cs.LG

Download: http://arxiv.org/abs/2506.14003v4

A Message Passing Realization of Expected Free Energy Minimization

We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable Minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.

Updated: 2026-03-02 13:00:14

Categories: cs.AI

Download: http://arxiv.org/abs/2508.02197v3

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. By removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.

Updated: 2026-03-02 13:00:09

Categories: cs.SE, cs.AI, cs.LG

Download: http://arxiv.org/abs/2602.11210v2

HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation

Visual geolocalization, the task of predicting where an image was taken, remains challenging due to global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing paradigms rely on large-scale retrieval, which requires storing a large number of image embeddings; grid-based classifiers, which ignore geographic continuity; or generative models, which diffuse over space but struggle with fine detail. We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning, which incorporates haversine distance directly into the contrastive objective. This hierarchical design enables interpretable predictions and efficient inference with 240k entity embeddings instead of over 5 million image embeddings on the OSV5M benchmark, on which our method establishes a new state-of-the-art performance. Compared to the current methods in the literature, it reduces mean geodesic error by 19.5% while improving the fine-grained subregion accuracy by 43%. These results demonstrate that geometry-aware hierarchical embeddings provide a scalable and conceptually new alternative for global image geolocation.

Updated: 2026-03-02 12:59:35

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2601.23064v2
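
The haversine distance mentioned in the abstract above is the great-circle distance between two latitude/longitude points. A minimal sketch of the standard formula follows. How the paper folds this distance into its contrastive objective is not stated in the abstract and is not shown here, and the radius constant and function name are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a common convention


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp guards against rounding pushing h slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))
```

Two antipodal points on the equator, for instance `haversine_km(0, 0, 0, 180)`, are half the Earth's circumference apart under this model.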

Identity-Free Deferral For Unseen Experts

Learning to Defer (L2D) improves AI reliability in decision-critical environments by training AI to either make its own prediction or defer the decision to a human expert. A key challenge is adapting to unseen experts at test time, whose competence can differ from the training population. Current methods for this task, however, can falter when unseen experts are out-of-distribution (OOD) relative to the training population. We identify a core architectural flaw as the cause: they learn identity-conditioned policies by processing class-indexed signals in fixed coordinates, creating shortcuts that violate the problem's inherent permutation symmetry. We introduce Identity-Free Deferral (IFD), an architecture that enforces this symmetry by construction. From a few-shot context, IFD builds a query-independent Bayesian competence profile for each expert. It then supplies the deferral rejector with a low-dimensional, role-indexed state containing only structural information, such as the model's confidence in its top-ranked class and the expert's estimated skill for that same role, which obscures absolute class identities. We train IFD using an uncertainty-aware, context-only objective that removes the need for expensive query-time expert labels. We formally prove the permutation invariance of our approach, contrasting it with the generic non-invariance of standard population encoders. Experiments on medical imaging benchmarks and ImageNet-16H with real human annotators show that IFD consistently improves generalisation to unseen experts, with gains in OOD settings, all while using fewer annotations than alternative methods.

Updated: 2026-03-02 12:59:23

Categories: cs.LG,cs.HC

Download: http://arxiv.org/abs/2502.10533v3

Uncertainty Quantification of Click and Conversion Estimates for the Autobidding

Modern e-commerce platforms employ various auction mechanisms to allocate paid slots for a given item. To scale this approach to millions of auctions, platforms suggest promotion tools based on autobidding algorithms. These algorithms typically depend on Click-Through-Rate (CTR) and Conversion-Rate (CVR) estimates provided by a pre-trained machine learning model. However, the predictions of such models are uncertain, and this uncertainty can significantly affect the performance of the autobidding algorithm. To address this issue, we propose the DenoiseBid method, which corrects the generated CTRs and CVRs to make the resulting bids more efficient in auctions. The underlying idea of our method is to employ a Bayesian approach and replace noisy CTR or CVR estimates with those from recovered distributions. To demonstrate the performance of the proposed approach, we perform extensive experiments on synthetic, iPinYou, and BAT datasets. To evaluate the robustness of our approach to the noise scale, we use synthetic noise and noise estimated from the predictions of the pre-trained machine learning model.
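The abstract does not specify how DenoiseBid recovers the underlying distributions. As a stand-in for the general Bayesian idea, the sketch below uses a Beta-Binomial posterior mean to shrink a noisy CTR estimate toward a prior. The prior parameters and the function name are hypothetical.

```python
def posterior_mean_ctr(clicks, impressions, prior_alpha, prior_beta):
    """Posterior mean of CTR under a Beta(prior_alpha, prior_beta) prior and binomial clicks."""
    return (clicks + prior_alpha) / (impressions + prior_alpha + prior_beta)


# Hypothetical prior with mean 0.02, worth 500 pseudo-impressions.
PRIOR_ALPHA, PRIOR_BETA = 10.0, 490.0

# With few impressions the raw estimate (3 / 20 = 0.15) is pulled strongly to the prior.
noisy = 3 / 20
denoised = posterior_mean_ctr(3, 20, PRIOR_ALPHA, PRIOR_BETA)

# With many impressions the data dominates and the estimate stays near 300 / 20000.
well_observed = posterior_mean_ctr(300, 20000, PRIOR_ALPHA, PRIOR_BETA)
```

The design point is that the correction is large exactly where the estimate is least reliable, which is the situation in which a bid computed from the raw value would be most distorted.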

Updated: 2026-03-02 12:57:11

Categories: cs.LG,cs.GT,stat.ML

Download: http://arxiv.org/abs/2603.01825v1

OpenAutoNLU: Open Source AutoML Library for NLU

OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration from the user. The library also provides integrated data quality diagnostics, configurable out-of-distribution (OOD) detection, and large language model (LLM) features, all within a minimal low-code API. The demo app is accessible at https://openautonlu.dev.

Updated: 2026-03-02 12:56:54

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2603.01824v1

Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models

Both humans and Large Language Models (LLMs) store a vast repository of semantic memories. In humans, efficient and strategic access to this memory store is a critical foundation for a variety of cognitive functions. Such access has long been a focus of psychology and the computational mechanisms behind it are now well characterized. Much of this understanding has been gleaned from a widely used neuropsychological and cognitive science assessment called the Semantic Fluency Task (SFT), which requires the generation of as many semantically constrained concepts as possible. Our goal is to apply mechanistic interpretability techniques to bring greater rigor to the study of semantic memory foraging in LLMs. To this end, we present preliminary results examining SFT as a case study. A central focus is on convergent and divergent patterns of generative memory search, which in humans play complementary strategic roles in efficient memory foraging. We show that these same behavioral signatures, critical to human performance on the SFT, also emerge as identifiable patterns in LLMs across distinct layers. Potentially, this analysis provides new insights into how LLMs may be adapted into closer cognitive alignment with humans, or alternatively, guided toward productive cognitive disalignment to enhance complementary strengths in human-AI interaction.

Updated: 2026-03-02 12:55:51

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01822v1

Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy

Recent developments in Multimodal Large Language Models (MLLMs) have significantly improved Vision-Language (VL) reasoning in 2D domains. However, extending these capabilities to 3D scene understanding remains a major challenge. Existing 3D Multimodal Large Language Models (3D-MLLMs) often depend on 3D data inputs, which limits scalability and generalization. To address this limitation, we propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data, making it practical for real-world deployment. In our method, geometric priors are used directly to improve scene perception. To integrate the geometric cues into the MLLM compactly, we design a Cross-Task Adapter (CTA) module to align the 3D geometric priors with the vision-language representations. To ensure geometric consistency and integrity, we introduce a Metric Depth Model that recovers real-scale geometry from the reconstruction outputs. Finally, the model is fine-tuned with a two-stage distillation optimization strategy, which yields fast convergence and stable training. Extensive experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Dense Captioning, and 3D Visual Grounding tasks, demonstrating superior multi-task capabilities.

Updated: 2026-03-02 12:53:37

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.24385v4

Deep Learning for Financial Time Series: A Large-Scale Benchmark of Risk-Adjusted Performance

We present a large-scale benchmark of modern deep learning architectures for a financial time series prediction and position sizing task, with a primary focus on Sharpe ratio optimization. Evaluating linear models, recurrent networks, transformer-based architectures, state space models, and recent sequence representation approaches, we assess out-of-sample performance on a daily futures dataset covering commodities, equity indices, bonds, and FX from 2010 to 2025. Our evaluation goes beyond average returns and includes statistical significance, downside and tail risk measures, breakeven transaction cost analysis, robustness to random seed selection, and computational efficiency. We find that models explicitly designed to learn rich temporal representations consistently outperform linear benchmarks and generic deep learning models, which often lead the ranking in standard time series benchmarks. Hybrid models such as VSN with LSTM, a combination of Variable Selection Networks (VSN) and LSTMs, achieve the highest overall Sharpe ratio, while VSN with xLSTM and LSTM with PatchTST exhibit superior downside-adjusted characteristics. xLSTM demonstrates the largest breakeven transaction cost buffer, indicating improved robustness to trading frictions.
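Two of the evaluation quantities named above can be written down in a few lines. This is a sketch under stated assumptions rather than the benchmark's own code: the 252-day annualization factor, the use of the sample standard deviation, and the definition of the breakeven cost as the per-unit-turnover cost at which the mean net return reaches zero are all assumptions, and the return and turnover series are made up.

```python
import math


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of daily excess returns (sample standard deviation)."""
    n = len(daily_returns)
    if n < 2:
        raise ValueError("need at least two returns")
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    if var == 0:
        raise ValueError("returns have zero variance")
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def breakeven_cost(daily_gross_returns, daily_turnover):
    """Cost per unit of turnover at which the mean net return falls to zero."""
    total_turnover = sum(daily_turnover)
    if total_turnover == 0:
        raise ValueError("strategy never trades")
    return sum(daily_gross_returns) / total_turnover


# Made-up five-day example.
returns = [0.01, -0.005, 0.007, 0.002, -0.003]
turnover = [1.0, 0.5, 0.5, 0.2, 0.3]

sharpe = annualized_sharpe(returns)
cost = breakeven_cost(returns, turnover)
```

A larger breakeven cost means the strategy can absorb more trading friction before its Sharpe ratio turns negative, which is the sense in which the abstract speaks of a "buffer".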

Updated: 2026-03-02 12:52:50

Categories: q-fin.TR,cs.LG

Download: http://arxiv.org/abs/2603.01820v1

SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

Autonomous agents operating in the real world must interact continuously with existing physical and semantic infrastructure, track delayed consequences, and verify outcomes over time. Everyday environments are rich in tangible control interfaces (TCIs), such as light switches, appliance panels, and embedded GUIs, which pose core challenges for lifelong embodied agents, including partial observability, causal reasoning across time, and failure-aware verification under real-world constraints. Yet, current benchmarks rarely consider such long-horizon interaction and causality requirements. We introduce SWITCH (Semantic World Interface Tasks for Control & Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities (task-aware VQA, semantic UI grounding, action generation, state transition prediction, and result verification) under egocentric RGB video input and device diversity across 351 tasks spanning 98 real devices/appliances. Results from commercial and open LMMs reveal systematic failures, highlighting critical gaps for lifelong agent deployment. SWITCH provides data, code, and held-out splits to enable reproducible non-contaminated evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of relevant training data. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.

Updated: 2026-03-02 12:52:11

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2511.17649v3

ERIS: Evolutionary Real-world Interference Scheme for Jailbreaking Audio Large Models

Existing Audio Large Models (ALMs) alignment focuses on clean inputs, neglecting security risks in complex environments. We propose ERIS, a framework transforming real-world interference into a strategically optimized carrier for jailbreaking ALMs. Unlike methods relying on manually designed acoustic patterns, ERIS uses a genetic algorithm to optimize the selection and synthesis of naturalistic signals. Through population initialization, crossover fusion, and probabilistic mutation, it evolves audio fusing malicious instructions with real-world interference. To humans and safety filters, these samples present as natural speech with harmless background noise, yet bypass alignment. Evaluations on multiple ALMs show ERIS significantly outperforms both text and audio jailbreak baselines. Our findings reveal that seemingly innocuous real-world interference can be leveraged to circumvent safety constraints, providing new insights for defensive mechanisms in complex acoustic scenarios.

Updated: 2026-03-02 12:47:27

Categories: cs.SD,cs.AI

Download: http://arxiv.org/abs/2509.11128v2

Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness

Deep learning models achieve strong performance across various domains but often rely on spurious correlations, making them vulnerable to distribution shifts. This issue is particularly severe in subpopulation shift scenarios, where models struggle in underrepresented groups. While existing methods have made progress in mitigating this issue, their performance gains are still constrained. They lack a rigorous theoretical framework connecting the embedding space representations with worst-group error. To address this limitation, we propose Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness (SCER), a novel approach that directly regularizes feature representations to suppress spurious cues. We show theoretically that worst-group error is influenced by how strongly the classifier relies on spurious versus core directions, identified from differences in group-wise mean embeddings across domains and classes. By imposing theoretical constraints at the embedding level, SCER encourages models to focus on core features while reducing sensitivity to spurious patterns. Through systematic evaluation on multiple vision and language datasets, we show that SCER outperforms prior state-of-the-art studies in worst-group accuracy. Our code is available at https://github.com/MLAI-Yonsei/SCER.

Updated: 2026-03-02 12:45:43

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.04401v2

GCTAM: Global and Contextual Truncated Affinity Combined Maximization Model For Unsupervised Graph Anomaly Detection

Anomalies often occur in real-world information networks/graphs, such as malevolent users, malicious comments, banned users, and fake news in social graphs. The latest graph anomaly detection methods use a novel mechanism called truncated affinity maximization (TAM) to detect anomaly nodes without using any label information and achieve impressive results. TAM maximizes the affinities among the normal nodes while truncating the affinities of the anomalous nodes to identify the anomalies. However, existing TAM-based methods truncate suspicious nodes according to a rigid threshold that ignores the specificity and high-order affinities of different nodes. This inevitably causes inefficient truncations from both normal and anomalous nodes, limiting the effectiveness of anomaly detection. To this end, this paper proposes a novel truncation model combining contextual and global affinity to truncate the anomalous nodes. The core idea of the work is to use contextual truncation to decrease the affinity of anomalous nodes, while global truncation increases the affinity of normal nodes. Extensive experiments on massive real-world datasets show that our method surpasses peer methods in most graph anomaly detection tasks. In particular, compared with previous state-of-the-art methods, the proposed method achieves improvements of +15% to +20% on two well-known real-world datasets, Amazon and YelpChi. Notably, our method works well on the large datasets Amazon-all and YelpChi-all and achieves the best results, while most previous models cannot complete the tasks.

Updated: 2026-03-02 12:40:46

Categories: cs.SI,cs.GR,cs.LG

Download: http://arxiv.org/abs/2603.01806v1

Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments

We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically-generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.

Updated: 2026-03-02 12:38:43

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.01804v1

Untargeted Jailbreak Attack

Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM output with predefined target responses. However, restricting the objective to inducing fixed targets inherently constrains the adversarial search space, limiting the overall attack efficacy. Furthermore, existing methods typically require numerous optimization iterations to bridge the large gap between the fixed target and the original LLM output, resulting in low attack efficiency. To overcome these limitations, we propose the first gradient-based untargeted jailbreak attack (UJA), which relies on an untargeted objective to maximize the unsafety probability of the LLM output, without enforcing any response patterns. For tractable optimization, we further decompose this objective into two differentiable sub-objectives to search for the optimal harmful response and the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to existing attacks, UJA's unrestricted objective significantly expands the search space, enabling more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations show that UJA achieves over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming the state-of-the-art gradient-based attacks by over 30%.

Updated: 2026-03-02 12:37:12

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.02999v4

Language steering in latent space to mitigate unintended code-switching

Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 55% across multiple language pairs on Qwen2.5 and Llama-3.2 models. Generation-based evaluation on Llama-3.2 further demonstrates a 63-99% reduction in Code-Switching Index across four language pairs ($p < 0.001$). We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.
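The steering idea can be sketched on toy data. The code below is an illustration under assumptions, not the paper's implementation: it pools hypothetical three-dimensional embeddings of parallel sentences, finds the first principal component by power iteration, classifies language by the sign of the projection, and steers a vector by moving it along that single axis. The embeddings, the target coordinate, and all function names are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def first_principal_component(rows, iters=200):
    """Return (mean, unit vector) of the top principal axis via power iteration."""
    dim = len(rows[0])
    mean = [sum(r[k] for r in rows) / len(rows) for k in range(dim)]
    centered = [[r[k] - mean[k] for k in range(dim)] for r in rows]
    v = [1.0] * dim  # toy initialization; assumed not orthogonal to the top axis
    for _ in range(iters):
        # Multiply by the (unnormalized) covariance without forming the matrix.
        w = [0.0] * dim
        for c in centered:
            s = dot(c, v)
            for k in range(dim):
                w[k] += s * c[k]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    return mean, v


def coordinate(x, mean, v):
    """Projection of x onto the language axis, relative to the pooled mean."""
    return dot([a - m for a, m in zip(x, mean)], v)


def steer(x, mean, v, target_coord):
    """Move x along the language axis only, leaving orthogonal components untouched."""
    shift = target_coord - coordinate(x, mean, v)
    return [a + shift * b for a, b in zip(x, v)]


# Hypothetical embeddings of three parallel sentences in two languages.
en = [[2.0, 0.3, 0.1], [2.1, -0.2, 0.0], [1.9, 0.1, -0.1]]
fr = [[-2.0, 0.3, 0.1], [-2.1, -0.2, 0.0], [-1.9, 0.1, -0.1]]

mean, axis = first_principal_component(en + fr)
en_coords = [coordinate(x, mean, axis) for x in en]
fr_coords = [coordinate(x, mean, axis) for x in fr]

# Steer the first French vector to the average English coordinate.
target = sum(en_coords) / len(en_coords)
steered = steer(fr[0], mean, axis, target)
```

In this toy setup the sign of the projection separates the two languages, and the steered vector lands on its English counterpart while the remaining coordinates are unchanged, which mirrors the claim that semantics are preserved when only the language axis is edited.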

Updated: 2026-03-02 12:36:21

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.13849v2

What Papers Don't Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction

Automated paper reproduction -- generating executable code from academic papers -- is bottlenecked not by information retrieval but by the tacit knowledge that papers inevitably leave implicit. We formalize this challenge as the progressive recovery of three types of tacit knowledge -- relational, somatic, and collective -- and propose \method, a graph-based agent framework with a dedicated mechanism for each: node-level relation-aware aggregation recovers relational knowledge by analyzing implementation-unit-level reuse and adaptation relationships between the target paper and its citation neighbors; execution-feedback refinement recovers somatic knowledge through iterative debugging driven by runtime signals; and graph-level knowledge induction distills collective knowledge from clusters of papers sharing similar implementations. On an extended ReproduceBench spanning 3 domains, 10 tasks, and 40 recent papers, \method{} achieves an average performance gap of 10.04% against official implementations, improving over the strongest baseline by 24.68%. The code will be publicly released upon acceptance; the repository link will be provided in the final version.

Updated: 2026-03-02 12:33:31

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01801v1

Phase-Type Variational Autoencoders for Heavy-Tailed Data

Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions (e.g., Gaussian) that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately recovers diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
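A Phase-Type distribution, as defined above, is the law of the absorption time of a CTMC. The sketch below samples from one directly by simulating the chain. It is a plain simulation for illustration, not the PH-VAE decoder: the two-phase sub-generator `T`, the initial distribution `alpha`, and the function name are hypothetical, and nothing here is conditioned on a latent variable.

```python
import random


def sample_phase_type(alpha, T, rng):
    """Draw one absorption time of a CTMC with initial distribution alpha and sub-generator T.

    T[i][i] < 0 is minus the total exit rate of phase i; T[i][j] >= 0 (i != j) is the
    transition rate to phase j; the leftover rate -sum(T[i]) leads to absorption.
    """
    phases = list(range(len(alpha)))
    state = rng.choices(phases, weights=alpha)[0]
    t = 0.0
    while True:
        # Hold in the current phase for an exponential time.
        t += rng.expovariate(-T[state][state])
        # Choose the next phase, or None for absorption, proportionally to the rates.
        weights = [T[state][j] if j != state else 0.0 for j in phases]
        weights.append(max(0.0, -sum(T[state])))
        nxt = rng.choices(phases + [None], weights=weights)[0]
        if nxt is None:
            return t
        state = nxt


# Hypothetical two-phase example mixing a fast and a slow time scale.
alpha = [1.0, 0.0]
T = [[-2.0, 1.0],
     [0.0, -0.5]]

rng = random.Random(0)
samples = [sample_phase_type(alpha, T, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
```

For this `alpha` and `T` the analytic mean, computed as $-\alpha T^{-1} \mathbf{1}$, is 1.5, so the empirical mean of the samples should be close to that value.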

Updated: 2026-03-02 12:32:42

Categories: cs.LG,cs.AI,stat.ML,stat.OT

Download: http://arxiv.org/abs/2603.01800v1

Incremental, inconsistency-resilient reasoning over Description Logic Abox streams

More and more, data is being produced in a streaming fashion. This has led to increased interest in how actionable insights can be extracted in real time from data streams through Stream Reasoning. Reasoning over data streams raises multiple challenges, notably the high velocity of data, the real-time requirement of the reasoning, and the noisy and volatile nature of streams. This paper proposes novel semantics for incremental reasoning over streams of Description Logic ABoxes, in order to tackle these challenges. To address the first two challenges, our semantics for reasoning over sliding windows on streams allow for incrementally computing the materialization of the window based on the materialization of the previous window. Furthermore, to deal with the volatile nature of streams, we present novel semantics for inconsistency repair on such windows, based on preferred repair semantics. We then detail our proposed semi-naive algorithms for incremental materialization maintenance in the case of OWL2 RL, both in the presence of inconsistencies and without.

Updated: 2026-03-02 12:30:23

Categories: cs.AI,cs.LO

Download: http://arxiv.org/abs/2603.01799v1

Multi-scale hypergraph meets LLMs: Aligning large language models for time series analysis

Recently, there has been great success in leveraging pre-trained large language models (LLMs) for time series analysis. The core idea lies in effectively aligning the modality between natural language and time series. However, the multi-scale structures of natural language and time series have not been fully considered, resulting in insufficient utilization of LLMs' capabilities. To this end, we propose MSH-LLM, a Multi-Scale Hypergraph method that aligns Large Language Models for time series analysis. Specifically, a hyperedging mechanism is designed to enhance the multi-scale semantic information of the time series semantic space. Then, a cross-modality alignment (CMA) module is introduced to align the modality between natural language and time series at different scales. In addition, a mixture of prompts (MoP) mechanism is introduced to provide contextual information and enhance the ability of LLMs to understand the multi-scale temporal patterns of time series. Experimental results on 27 real-world datasets across 5 different applications demonstrate that MSH-LLM achieves state-of-the-art results.

Updated: 2026-03-02 12:30:15

Categories: cs.LG

Download: http://arxiv.org/abs/2602.04369v2

Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration

Cascaded speech-to-text translation (S2TT) systems for low-resource languages can suffer from structural noise, particularly the loss of punctuation during the Automatic Speech Recognition (ASR) phase. This research investigates the impact of such noise on Nepali-to-English translation and proposes an optimized pipeline to mitigate quality degradation. We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark. We empirically investigate the influence of punctuation loss, demonstrating that unpunctuated ASR output significantly degrades translation quality, causing a massive 20.7% relative BLEU drop on the FLORES benchmark. To overcome this, we propose and evaluate an intermediate Punctuation Restoration Module (PRM). The final S2TT pipeline was tested across three configurations on a custom dataset. The optimal configuration, which applied the PRM directly to ASR output, achieved a 4.90 BLEU point gain over the direct ASR-to-NMT baseline (BLEU 36.38 vs. 31.48). This improvement was validated by human assessment, which confirmed the optimized pipeline's superior Adequacy (3.673) and Fluency (3.804) with inter-rater reliability (Krippendorff's $\alpha \geq 0.723$). This work validates that targeted punctuation restoration is the most effective intervention for mitigating structural noise in the Nepali S2TT pipeline. It establishes an optimized baseline and demonstrates a critical architectural insight for developing cascaded speech translation systems for similar low-resource languages.
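The metrics quoted above can be made concrete with a short sketch. It assumes the standard definition of character error rate (character-level edit distance divided by reference length), which the abstract does not spell out, and it reuses only the two BLEU scores stated in the text. The example strings are arbitrary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # delete r
                           cur[j - 1] + 1,           # insert h
                           prev[j - 1] + (r != h)))  # substitute or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


example_cer = cer("kitten", "sitting")

# BLEU scores stated in the abstract for the custom dataset.
baseline_bleu, prm_bleu = 31.48, 36.38
bleu_gain = round(prm_bleu - baseline_bleu, 2)
relative_gain = (prm_bleu - baseline_bleu) / baseline_bleu
```

The absolute gain reproduces the 4.90 BLEU points reported above, and the same two numbers correspond to a relative improvement of roughly 15.6% over the direct ASR-to-NMT baseline.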

Updated: 2026-03-02 12:30:14

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2602.21647v2

PleaSQLarify: Visual Pragmatic Repair for Natural Language Database Querying

Natural language database interfaces broaden data access, yet they remain brittle under input ambiguity. Standard approaches often collapse uncertainty into a single query, offering little support for mismatches between user intent and system interpretation. We reframe this challenge through pragmatic inference: while users economize expressions, systems operate on priors over the action space that may not align with the users'. In this view, pragmatic repair -- incremental clarification through minimal interaction -- is a natural strategy for resolving underspecification. We present PleaSQLarify, which operationalizes pragmatic repair by structuring interaction around interpretable decision variables that enable efficient clarification. A visual interface complements this by surfacing the action space for exploration, requesting user disambiguation, and making belief updates traceable across turns. In a study with twelve participants, PleaSQLarify helped users recognize alternative interpretations and efficiently resolve ambiguity. Our findings highlight pragmatic repair as a design principle that fosters effective user control in natural language interfaces.

Updated: 2026-03-02 12:24:29

Categories: cs.HC,cs.AI,cs.CL

Download: http://arxiv.org/abs/2603.01795v1

Theoretical Foundations of Superhypergraph and Plithogenic Graph Neural Networks

Hypergraphs generalize classical graphs by allowing a single edge to connect multiple vertices, providing a natural language for modeling higher-order interactions. Superhypergraphs extend this paradigm further by accommodating nested, set-valued entities and relations, enabling the representation of hierarchical, multi-level structures beyond the expressive reach of ordinary graphs or hypergraphs. In parallel, neural networks -- especially Graph Neural Networks (GNNs) -- have become a standard tool for learning from relational data, and recent years have seen rapid progress on Hypergraph Neural Networks (HGNNs) and their theoretical properties. To model uncertainty and multi-aspect attributes in complex networks, several graded and multi-valued graph frameworks have been developed, including fuzzy graphs and neutrosophic graphs. The plithogenic graph framework unifies and refines these approaches by incorporating multi-valued attributes together with membership and contradiction mechanisms, offering a flexible representation for heterogeneous and partially inconsistent information. This book develops the theoretical foundations of SuperHyperGraph Neural Networks (SHGNNs) and Plithogenic Graph Neural Networks, with the goal of extending message-passing principles to these advanced higher-order structures. We provide rigorous definitions, establish fundamental structural properties, and prove well-definedness results for key constructions, with particular emphasis on strengthened formulations of Soft Graph Neural Networks and Rough Graph Neural Networks.

Updated: 2026-03-02 12:21:51

Categories: cs.AI,cs.CE,cs.LG,math.CO,math.LO

Download: http://arxiv.org/abs/2412.01176v2

OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data

Lossless compression is essential for efficient data storage and transmission. Although learning-based lossless compressors achieve strong results, most of them are designed for a single modality, leading to redundant compressor deployments in multi-modal settings. Designing a unified multi-modal compressor is critical yet challenging, as different data types vary widely in format, dimension, and statistics. Multi-modal large language models offer a promising solution but remain too complex for practical use. Thus, we propose OmniZip, a unified and lightweight lossless compressor for multi-modal data (such as image, text, speech, tactile, database, and gene sequence). Built on a lightweight backbone, OmniZip incorporates three key components to enable efficient multi-modal lossless compression: a modality-unified tokenizer that reversibly transforms diverse data into tokens, a modality-routing context learning mechanism that enables flexible multi-modal context modeling, and a modality-routing feedforward design that further enhances the model's nonlinear representation flexibility. A reparameterization training strategy is used to enhance model capacity. OmniZip outperforms or matches other state-of-the-art compressors on multiple modalities, achieving 42%, 57%, 62%, 42%, and 53% higher compression efficiency than gzip on the CLIC-M, TouchandGo, enwik9, LibriSpeech, and WikiSQL datasets, respectively. It also supports near real-time inference on resource-constrained edge devices, reaching about 1MB/s on MacBook CPUs and iPhone NPUs. Our code is released at https://github.com/adminasmi/OmniZip-CVPR2026.

Updated: 2026-03-02 12:21:30

Categories: cs.LG,cs.IT

Download: http://arxiv.org/abs/2602.22286v2

ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. However, effective unlearning in LLMs is difficult due to the fuzzy boundary between knowledge retention and forgetting. This challenge is exacerbated by entangled parameter spaces from continuous multi-domain training, often resulting in collateral damage, especially under aggressive unlearning strategies. Furthermore, the computational overhead required to optimize state-of-the-art (SOTA) models with billions of parameters poses an additional barrier. In this work, we present ALTER, a lightweight unlearning framework for LLMs that addresses the challenges of both knowledge entanglement and unlearning efficiency. ALTER operates in two phases: (I) high-entropy tokens are captured and learned via the shared A matrix in LoRA, followed by (II) an asymmetric LoRA architecture that achieves a specified forgetting objective through parameter isolation and by unlearning tokens within the target subdomains. This serves as a new research direction for achieving unlearning via token-level isolation in an asymmetric framework. ALTER achieves SOTA performance on the TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects by preserving foundational tokens. By decoupling unlearning from LLMs' billion-scale parameters, this framework delivers excellent efficiency while preserving over 90% of model utility, exceeding baseline preservation rates of 47.8-83.6%.

Updated: 2026-03-02 12:21:16

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01792v1

From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents

Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while the best-performing agents achieve Pass@5 of over 90% (at least one of five trials) on IncreQA and 60-70% on AdaptQA, their Pass^5 (consistent success across all five trials) is substantially lower, with gaps of up to about 60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development. Our code and data are publicly available at https://github.com/glee4810/EHR-ChatQA.
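
The two metrics described above can be illustrated with a minimal sketch. It assumes each task's outcomes are stored as a list of five booleans; the function names `pass_at_k` and `pass_hat_k` and the toy data are hypothetical and are not taken from the EHR-ChatQA code base.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in at least one of the k trials (Pass@k)."""
    return sum(any(outcomes) for outcomes in trials.values()) / len(trials)


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in all k trials (Pass^k)."""
    return sum(all(outcomes) for outcomes in trials.values()) / len(trials)


# Hypothetical outcomes for four tasks over five trials each.
results = {
    "task_a": [True, True, True, True, True],       # always solved
    "task_b": [True, False, True, False, False],    # solved intermittently
    "task_c": [False, False, False, False, False],  # never solved
    "task_d": [True, True, True, True, False],      # one failure
}

print(pass_at_k(results))   # tasks a, b, d count
print(pass_hat_k(results))  # only task a counts
```

On this toy data the first metric is 0.75 and the second is 0.25, which shows how a large gap can arise when agents succeed only intermittently.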

Updated: 2026-03-02 12:18:38

Categories: cs.AI

Download: http://arxiv.org/abs/2509.23415v2

Can LLMs Hack Enterprise Networks? -- Replicated Computational Results (RCR) Report

This is the Replicated Computational Results (RCR) Report for the paper "Can LLMs Hack Enterprise Networks?" The paper empirically investigates the efficacy and effectiveness of different LLMs for penetration-testing enterprise networks, i.e., Microsoft Active Directory Assumed-Breach Simulations. This RCR report describes the artifacts used in the paper, how to create an evaluation setup, and highlights the analysis scripts provided within our prototype.

Updated: 2026-03-02 12:13:39

Categories: cs.CR

Download: http://arxiv.org/abs/2603.01789v1

Knowledge-Based Design Requirements for Generative Social Robots in Higher Education

Generative social robots (GSRs) powered by large language models enable adaptive, conversational tutoring but also introduce risks such as hallucinations, overreliance, and privacy violations. Existing frameworks for educational technologies and responsible AI primarily define desired behaviors, yet they rarely specify the knowledge prerequisites that enable generative systems to express these behaviors reliably. To address this gap, we adopt a knowledge-based design perspective and investigate what information tutoring-oriented GSRs require to function responsibly and effectively in higher education. Based on twelve semi-structured interviews with university students and lecturers, we identify twelve design requirements across three knowledge types: self-knowledge (assertive, conscientious, and friendly personality with customizable role), user-knowledge (personalized information about student learning goals, learning progress, motivation type, emotional state, and background), and context-knowledge (learning materials, educational strategies, course-related information, and physical learning environment). By identifying these knowledge requirements, this work provides a structured foundation for the design of tutoring GSRs and future evaluations, aligning generative system capabilities with pedagogical and ethical expectations.

Updated: 2026-03-02 12:13:24

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2602.12873v3

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage", where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we propose machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.

Updated: 2026-03-02 12:13:05

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.11832v3

Learning Shortest Paths with Generative Flow Networks

In this paper, we present a novel learning framework for finding shortest paths in graphs utilizing Generative Flow Networks (GFlowNets). First, we examine theoretical properties of GFlowNets in non-acyclic environments in relation to shortest paths. We prove that, if the total flow is minimized, forward and backward policies traverse the environment graph exclusively along shortest paths between the initial and terminal states. Building on this result, we show that the pathfinding problem in an arbitrary graph can be solved by training a non-acyclic GFlowNet with flow regularization. We experimentally demonstrate the performance of our method in pathfinding in permutation environments and in solving Rubik's Cubes. For the latter problem, our approach is competitive in solution length with state-of-the-art machine learning approaches designed specifically for this task, while requiring a smaller search budget at test time.

Updated: 2026-03-02 12:12:13

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2603.01786v1

Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading

When reading, we often have specific information that interests us in a text. For example, you might be reading this paper because you are curious about LLMs for eye movements in reading, the experimental design, or perhaps you wonder, "This sounds like science fiction. Does it actually work?" More broadly, in daily life, people approach texts with any number of text-specific goals that guide their reading behavior. In this work, we ask, for the first time, whether open-ended reading goals can be automatically decoded solely from eye movements in reading. To address this question, we introduce goal decoding tasks and evaluation frameworks using large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks. We develop and compare several discriminative and generative multimodal text and eye movements LLMs for these tasks. Our experiments show considerable success on the task of selecting the correct goal among several options, and even progress towards free-form textual reconstruction of the precise goal formulation. These results open the door for further scientific investigation of goal driven reading, as well as the development of educational and assistive technologies that will rely on real-time decoding of reader goals from their eye movements.

Updated: 2026-03-02 12:12:13

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.02872v3

Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution

Adversarial behavior plays a central role in aligning large language models with human values. However, existing alignment methods largely rely on static adversarial settings, which fundamentally limit robustness, particularly in multimodal settings with a larger attack surface. In this work, we move beyond static adversarial supervision and introduce co-evolutionary alignment with evolving attacks, instantiated by CEMMA (Co-Evolutionary Multi-Modal Alignment), an automated and adaptive framework for multimodal safety alignment. We introduce an Evolutionary Attacker that decomposes adversarial prompts into method templates and harmful intents. By employing genetic operators, including mutation, crossover, and differential evolution, it enables simple seed attacks to inherit the structural efficacy of sophisticated jailbreaks. The Adaptive Defender is iteratively updated on the synthesized hard negatives, forming a closed-loop process that adapts alignment to evolving attacks. Experiments show that the Evolutionary Attacker substantially increases red-teaming jailbreak attack success rate (ASR), while the Adaptive Defender improves robustness and generalization across benchmarks with higher data efficiency, without inducing excessive benign refusal, and remains compatible with inference-time defenses such as AdaShield.

Updated: 2026-03-02 12:10:46

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2603.01784v1

Combinatorial Bandit Bayesian Optimization for Tensor Outputs

Bayesian optimization (BO) has been widely used to optimize expensive and black-box functions across various domains. However, existing BO methods have not addressed tensor-output functions. To fill this gap, we propose a novel tensor-output BO framework. Specifically, we first introduce a tensor-output Gaussian process (TOGP) with two classes of tensor-output kernels as a surrogate model of the tensor-output function, which can effectively capture the structural dependencies within the tensor. Based on it, we develop an upper confidence bound (UCB) acquisition function to select query points. Furthermore, we introduce a more practical and challenging problem setting, termed combinatorial bandit Bayesian optimization (CBBO), where only a subset of the tensor outputs can be selected to contribute to the objective. To tackle this, we propose a tensor-output CBBO method, which extends TOGP to handle partially observed tensor outputs, and accordingly design a novel combinatorial multi-arm bandit-UCB2 (CMAB-UCB2) criterion to sequentially select both the query points and the output subset. We establish theoretical regret bounds for both methods, guaranteeing sublinear regret. Extensive experiments on synthetic and real-world datasets demonstrate the superiority of our methods.
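
For orientation, the standard scalar UCB acquisition rule that this line of work builds on can be sketched as follows. This is not the paper's tensor-output TOGP or CMAB-UCB2 criterion: the posterior means and standard deviations are hard-coded stand-ins for a surrogate model, and the names `ucb`, `select_query`, and `beta` are illustrative assumptions.

```python
import math
from typing import Dict, Tuple


def ucb(mean: float, std: float, beta: float) -> float:
    """Classic upper confidence bound: posterior mean plus scaled uncertainty."""
    return mean + math.sqrt(beta) * std


def select_query(posterior: Dict[str, Tuple[float, float]], beta: float) -> str:
    """Return the candidate with the largest UCB score."""
    return max(posterior, key=lambda name: ucb(*posterior[name], beta))


# Hypothetical posterior (mean, std) at three candidate points.
candidates = {
    "x1": (0.50, 0.05),  # best mean, little uncertainty
    "x2": (0.40, 0.30),  # lower mean, large uncertainty
    "x3": (0.45, 0.10),
}

print(select_query(candidates, beta=4.0))
```

With `beta=4.0` the rule prefers the uncertain candidate `x2` over the candidate with the best mean, which is the exploration behaviour a UCB criterion is meant to provide.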

Updated: 2026-03-02 12:10:27

Categories: cs.LG

Download: http://arxiv.org/abs/2602.00640v2

Knowledge Graph Augmented Large Language Models for Disease Prediction

Electronic health records (EHRs) enable strong clinical prediction, but explanations are often coarse and hard to use for patient-level decisions. We propose a knowledge graph (KG)-guided chain-of-thought (CoT) framework for visit-level disease prediction on MIMIC-III. We map ICD-9 codes to PrimeKG, mine disease-relevant nodes and paths, and use these paths to scaffold temporally consistent CoT rationales, retaining only samples whose conclusions match observed outcomes. We fine-tune lightweight instruction-tuned LLMs (LLaMA-3.1-Instruct-8B and Gemma-7B) on two small cohorts (400 and 1,000 index visits) across ten PrimeKG-mapped diseases. Our models outperform strong classical baselines, reaching AUROC 0.66-0.70 and macro-AUPR 0.40-0.47. Without additional training, the models transfer zero-shot to the CRADLE cohort, improving accuracy from 0.40-0.51 to 0.72-0.77. In a blinded clinician study, KG-guided CoT rationales are consistently preferred for clarity, relevance, and correctness. Code is available at: https://github.com/JonathanWry/KG-guided-LLM-pipeline

Updated: 2026-03-02 12:10:25

Categories: cs.AI

Download: http://arxiv.org/abs/2512.01210v3

Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection

Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP innovatively designs a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: https://github.com/zyt95579/PDP_IOD/tree/main

Updated: 2026-03-02 12:09:38

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.02286v1

GAM-RAG: Gain-Adaptive Memory for Evolving Retrieval in Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) grounds large language models with external evidence, but many implementations rely on pre-built indices that remain static after construction. Related queries therefore repeat similar multi-hop traversal, increasing latency and compute. Motivated by schema-based learning in cognitive neuroscience, we propose GAM-RAG, a training-free framework that accumulates retrieval experience from recurring or related queries and updates retrieval memory over time. GAM-RAG builds a lightweight, relation-free hierarchical index whose links capture potential co-occurrence rather than fixed semantic relations. During inference, successful retrieval episodes provide sentence-level feedback, updating sentence memories so evidence useful for similar reasoning types becomes easier to activate later. To balance stability and adaptability under noisy feedback, we introduce an uncertainty-aware, Kalman-inspired gain rule that jointly updates memory states and perplexity-based uncertainty estimates. It applies fast updates for reliable novel signals and conservative refinement for stable or noisy memories. We provide a theoretical analysis of the update dynamics, and empirically show that GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%. Our code and datasets are available at: https://anonymous.4open.science/r/GAM_RAG-2EF6.
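
The behaviour attributed to the gain rule (fast updates for reliable novel signals, conservative refinement for stable or noisy memories) matches what a textbook scalar Kalman update does. The sketch below shows only that textbook update; the abstract does not give GAM-RAG's actual rule or its perplexity-based uncertainty estimate, so `kalman_update` and all numeric values are assumptions chosen for illustration.

```python
from typing import Tuple


def kalman_update(
    state: float, variance: float, observation: float, obs_noise: float
) -> Tuple[float, float, float]:
    """One scalar Kalman step: returns (new_state, new_variance, gain)."""
    if variance < 0 or obs_noise < 0 or variance + obs_noise == 0:
        raise ValueError("variances must be non-negative and not both zero")
    gain = variance / (variance + obs_noise)  # in [0, 1]
    new_state = state + gain * (observation - state)
    new_variance = (1.0 - gain) * variance    # uncertainty shrinks after update
    return new_state, new_variance, gain


# Uncertain memory, reliable feedback: large gain, fast update.
fast_state, fast_var, fast_gain = kalman_update(0.0, 1.0, 1.0, 0.25)

# Stable memory, noisy feedback: small gain, conservative refinement.
slow_state, slow_var, slow_gain = kalman_update(0.8, 0.05, 0.0, 1.0)

print(fast_gain, slow_gain)
```

In the first call the memory moves most of the way toward the feedback, while in the second it barely changes, because the gain is the ratio of memory uncertainty to total uncertainty.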

Updated: 2026-03-02 12:09:17

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01783v1

Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning

Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers. We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This process naturally selects for solutions stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers. Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results in the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
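
The harmonic-mean aggregation can be sketched as below. The abstract does not give the exact formulation, so this is one plausible reading: each answer's relative frequency is computed separately in the original and reframed views, and the two are combined with the harmonic mean. The function name `harmonic_pseudo_label` and the sample answers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def harmonic_pseudo_label(
    original: List[str], reframed: List[str]
) -> Tuple[str, Dict[str, float]]:
    """Pick the answer with the highest harmonic mean of per-view frequencies."""
    if not original or not reframed:
        raise ValueError("both views need at least one sampled answer")
    freq_o, freq_r = Counter(original), Counter(reframed)
    scores: Dict[str, float] = {}
    # Sorted iteration keeps tie-breaking deterministic.
    for answer in sorted(set(freq_o) | set(freq_r)):
        p = freq_o[answer] / len(original)
        q = freq_r[answer] / len(reframed)
        # An answer absent from either view scores zero.
        scores[answer] = 2 * p * q / (p + q) if p > 0 and q > 0 else 0.0
    return max(scores, key=scores.get), scores


# Hypothetical sampled answers: "15" is popular only for the original wording.
original_view = ["15"] * 6 + ["12"] * 4
reframed_view = ["12"] * 6 + ["15"] * 1 + ["9"] * 3

label, scores = harmonic_pseudo_label(original_view, reframed_view)
majority = Counter(original_view).most_common(1)[0][0]
print(label, majority)
```

Here majority voting on the original view selects "15", while the harmonic-mean score selects "12", because the harmonic mean penalizes answers whose support is concentrated in a single view.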

Updated: 2026-03-02 12:08:56

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.01191v2

Bilinear representation mitigates reversal curse and enables consistent model editing

The reversal curse--a language model's inability to infer an unseen fact "B is A" from a learned fact "A is B"--is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. Our results demonstrate that training from scratch on synthetic relational knowledge graphs leads to the emergence of a bilinear relational structure within the models' hidden representations. This structure alleviates the reversal curse and facilitates inference of unseen reverse facts. Crucially, this bilinear geometry is foundational for consistent model editing: updates to a single fact propagate correctly to its reverse and logically dependent relations. In contrast, models lacking this representation suffer from the reversal curse and fail to generalize model edits, leading to logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn support language models in behaving in a logically consistent manner after editing. This suggests that the efficacy of language model editing depends not only on the choice of algorithm but on the underlying representational geometry of the knowledge itself.

Updated: 2026-03-02 12:08:12

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.21993v3

D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation

Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA, where regulatory relationships are inherently bidirectional. We present D3LM (Discrete DNA Diffusion Language Model), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element generation, D3LM achieves an SFID of 10.92, closely approaching real DNA sequences (7.85) and substantially outperforming the previous best result of 29.16 from autoregressive models. Our work suggests diffusion language models as a promising paradigm for unified DNA foundation models. We further present the first systematic study of masked diffusion models in the DNA domain, investigating practical design choices such as tokenization schemes and sampling strategies, thereby providing empirical insights and a solid foundation for future research. D3LM has been released at https://huggingface.co/collections/Hengchang-Liu/d3lm.
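For readers unfamiliar with masked diffusion, the corruption step it builds on can be sketched as below. This is a generic illustration, not D3LM's code: it assumes single-nucleotide tokens and independent masking with probability t, and the names `MASK` and `corrupt` are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not D3LM's actual token


def corrupt(tokens, t, rng):
    """Forward (noising) step: replace each token with MASK independently
    with probability t, where t in [0, 1] is the noise level."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
sequence = list("ACGTACGTAC")  # toy DNA sequence, one token per nucleotide
noisy = corrupt(sequence, 0.5, rng)
```

A denoising model would then be trained to predict the masked positions from the context on both sides, which is what makes the setup bidirectional; generation would start from a fully masked sequence and unmask it step by step.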

Updated: 2026-03-02 12:05:21

Categories: cs.LG,q-bio.GN

Download: http://arxiv.org/abs/2603.01780v1

Data-Augmented Deep Learning for Downhole Depth Sensing and Validation

Accurate downhole depth measurement is essential for oil and gas well operations, directly influencing reservoir contact, production efficiency, and operational safety. Collar correlation using a casing collar locator (CCL) is fundamental for precise depth calibration. While neural networks have achieved significant progress in collar recognition, preprocessing methods for such applications remain underdeveloped. Moreover, the limited availability of real well data poses substantial challenges for training neural network models that require extensive datasets. This paper presents a system integrated into a downhole toolstring for CCL log acquisition to facilitate dataset construction. Comprehensive preprocessing methods for data augmentation are proposed, and their effectiveness is evaluated using baseline neural network models. Through systematic experimentation across diverse configurations, the contribution of each augmentation method is analyzed. Results demonstrate that standardization, label distribution smoothing, and random cropping are fundamental prerequisites for model training, while label smoothing regularization, time scaling, and multiple sampling significantly enhance model generalization capabilities. Incorporating the proposed augmentation methods into the two baseline models results in maximum F1 score improvements of 0.027 and 0.024 for the TAN and MAN models, respectively. Furthermore, applying these techniques yields F1 score gains of up to 0.045 for the TAN model and 0.057 for the MAN model compared to prior studies. Performance evaluation on real CCL waveforms confirms the effectiveness and practical applicability of our approach. This work addresses the existing gaps in data augmentation methodologies for training casing collar recognition models under CCL data-limited conditions, and provides a technical foundation for the future automation of downhole operations.

Updated: 2026-03-02 12:03:59

Categories: cs.LG,cs.AI,eess.SP

Download: http://arxiv.org/abs/2511.00129v5

FreeAct: Freeing Activations for LLM Quantization

Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To address this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs, together with in-depth analyses, demonstrate that FreeAct significantly outperforms baselines, with up to a 5.3% performance improvement. Our code will be publicly released.

Updated: 2026-03-02 12:02:17

Categories: cs.CL,cs.AI,cs.CV

Download: http://arxiv.org/abs/2603.01776v1

Structured Diversity Control: A Dual-Level Framework for Group-Aware Multi-Agent Coordination

Controlling behavioral diversity is a pivotal challenge in multi-agent reinforcement learning (MARL), particularly in complex collaborative scenarios. While existing methods attempt to regulate behavioral diversity by directly differentiating across all agents, they lack deep characterization and learning of multi-agent composition structures. This limitation leads to suboptimal performance or coordination failures when facing more complex or challenging tasks. To bridge this gap, we introduce Structured Diversity Control (SDC), a framework that redefines the system-wide diversity metric as a weighted combination of intra-group diversity, which is minimized for cohesion, and inter-group diversity, which is maximized for specialization. The trade-off is governed by a pre-set Diversity Structure Factor (DSF), allowing for fine-grained, group-aware control over the collective strategy. Our method directly constrains the policy architecture without altering reward functions. This structural definition of diversity enables SDC to deliver substantial performance gains across various experiments, including increasing average rewards by up to 47.1% in multi-target pursuit and reducing episode lengths by 12.82% in complex neutralization scenarios. The proposed method offers a novel analytical perspective on the problem of cooperation in group-aware multi-agent systems.
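One way to picture a group-aware diversity metric of this kind is sketched below. The abstract does not give the formula, so everything here is an assumption: diversity is taken as mean pairwise Euclidean distance between per-agent behaviour vectors, inter-group diversity is measured between group centroids, and the DSF is treated as a convex weight. The name `structured_diversity` and the toy data are hypothetical.

```python
from itertools import combinations
from math import dist


def mean_pairwise(points):
    """Mean Euclidean distance over all unordered pairs (0.0 if fewer than two)."""
    pairs = list(combinations(points, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def structured_diversity(groups, dsf):
    """Assumed form: (1 - dsf) * intra-group + dsf * inter-group diversity."""
    intra = sum(mean_pairwise(g) for g in groups) / len(groups)
    inter = mean_pairwise([centroid(g) for g in groups])
    return intra, inter, (1 - dsf) * intra + dsf * inter


# Two hypothetical groups of two agents, each agent a 2-D behaviour vector.
groups = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
intra, inter, combined = structured_diversity(groups, dsf=0.75)
```

With these values the intra-group term is 2.0, the inter-group term is 4.0, and the combined metric is 3.5. A larger DSF shifts weight toward separation between groups, a smaller one toward spread within groups.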

Updated: 2026-03-02 12:00:09

Categories: cs.AI

Download: http://arxiv.org/abs/2506.18651v2

Learning-guided Kansa collocation for forward and inverse PDEs beyond linearity

Partial differential equations (PDEs) precisely model physical, biological and graphical phenomena. However, numerical methods suffer from the curse of dimensionality, high computation costs and domain-specific discretization. We aim to explore the pros and cons of different PDE solvers and apply them to specific scientific simulation problems, including forward solutions, inverse problems and equation discovery. In particular, we extend the recent CNF (NeurIPS 2023) framework solver to multi-dependent-variable and non-linear settings, together with downstream applications. The outcomes include implementations of selected methods, self-tuning techniques, evaluation on benchmark problems and a comprehensive survey of neural PDE solvers and scientific simulation applications.

Updated: 2026-03-02 11:56:00

Categories: cs.CE,cs.AI,cs.LG,math.NA

Download: http://arxiv.org/abs/2602.07970v2

Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport

Neural networks (NNs) often have critical behavioural trade-offs that are set at design time with hyperparameters, such as reward weights in reinforcement learning or quantile targets in regression. Post-deployment, however, user preferences can evolve, making initial settings undesirable and necessitating potentially expensive retraining. To circumvent this, we introduce the task of Hyperparameter Trajectory Inference (HTI): to learn, from observed data, how a NN's conditional output distribution changes with its hyperparameters, and construct a surrogate model that approximates the NN at unobserved hyperparameter settings. HTI requires extending existing trajectory inference approaches to incorporate conditions, exacerbating the challenge of ensuring inferred paths are feasible. We propose an approach based on conditional Lagrangian optimal transport, jointly learning the Lagrangian function governing hyperparameter-induced dynamics along with the associated optimal transport maps and geodesics between observed marginals, which form the surrogate model. We incorporate inductive biases based on the manifold hypothesis and least-action principles into the learned Lagrangian, improving surrogate model feasibility. We empirically demonstrate that our approach reconstructs NN outputs across various hyperparameter spectra better than other alternatives.

Updated: 2026-03-02 11:55:02

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01771v1

Directed Semi-Simplicial Learning with Applications to Brain Activity Decoding

Graph Neural Networks (GNNs) excel at learning from pairwise interactions but often overlook multi-way and hierarchical relationships. Topological Deep Learning (TDL) addresses this limitation by leveraging combinatorial topological spaces. However, existing TDL models are restricted to undirected settings and fail to capture the higher-order directed patterns prevalent in many complex systems, e.g., brain networks, where such interactions are both abundant and functionally significant. To fill this gap, we introduce Semi-Simplicial Neural Networks (SSNs), a principled class of TDL models that operate on semi-simplicial sets -- combinatorial structures that encode directed higher-order motifs and their directional relationships. To enhance scalability, we propose Routing-SSNs, which dynamically select the most informative relations in a learnable manner. We prove that SSNs are strictly more expressive than standard graph and TDL models. We then introduce a new principled framework for brain dynamics representation learning, grounded in the ability of SSNs to provably recover topological descriptors shown to successfully characterize brain activity. Empirically, SSNs achieve state-of-the-art performance on brain dynamics classification tasks, outperforming the second-best model by up to 27%, and message passing GNNs by up to 50% in accuracy. Our results highlight the potential of principled topological models for learning from structured brain data, establishing a unique real-world case study for TDL. We also test SSNs on standard node classification and edge regression tasks, showing competitive performance. We will make the code and data publicly available.

Updated: 2026-03-02 11:53:35

Categories: cs.LG

Download: http://arxiv.org/abs/2505.17939v3

CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning

Current deep learning primitives dealing with temporal dynamics suffer from a fundamental dichotomy: they are either discrete and unstable (LSTMs) (Pascanu et al., 2013), leading to exploding or vanishing gradients; or they are continuous and dissipative (Neural ODEs) (Dupont et al., 2019), which destroy information over time to ensure stability. We propose the Causal Hamiltonian Learning Unit (CHLU, pronounced "clue"), a novel physics-grounded computational learning primitive. By enforcing a relativistic Hamiltonian structure and utilizing symplectic integration, a CHLU strictly conserves phase-space volume, as an attempt to solve the memory-stability trade-off. We show that the CHLU is designed for infinite-horizon stability, as well as controllable noise filtering. We then demonstrate a CHLU's generative ability using the MNIST dataset as a proof-of-principle.
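The role of symplectic integration can be illustrated with a standard leapfrog (velocity Verlet) step. This is not the CHLU itself: as a simplifying assumption it uses a non-relativistic unit-mass harmonic oscillator, and the comparison with explicit Euler is included only to show the drift a non-symplectic scheme produces. All function names are hypothetical.

```python
def leapfrog(q, p, grad_potential, dt, steps):
    """Symplectic leapfrog integrator for H(q, p) = p**2 / 2 + V(q)."""
    for _ in range(steps):
        p -= 0.5 * dt * grad_potential(q)  # half kick
        q += dt * p                        # full drift
        p -= 0.5 * dt * grad_potential(q)  # half kick
    return q, p


def explicit_euler(q, p, grad_potential, dt, steps):
    """Non-symplectic baseline: both updates use the old state."""
    for _ in range(steps):
        q, p = q + dt * p, p - dt * grad_potential(q)
    return q, p


def energy(q, p):
    """Total energy of the unit-mass, unit-frequency oscillator."""
    return 0.5 * p * p + 0.5 * q * q


def grad_v(q):
    """Gradient of V(q) = q**2 / 2."""
    return q


q0, p0 = 1.0, 0.0  # initial energy is 0.5
e_leapfrog = energy(*leapfrog(q0, p0, grad_v, dt=0.1, steps=1000))
e_euler = energy(*explicit_euler(q0, p0, grad_v, dt=0.1, steps=1000))
```

For this oscillator the leapfrog energy stays within a small band around its initial value of 0.5 over the whole run, while explicit Euler multiplies the energy by (1 + dt^2) at every step and grows without bound.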

Updated: 2026-03-02 11:53:09

Categories: cs.LG,cs.AI,physics.app-ph

Download: http://arxiv.org/abs/2603.01768v1

Learning Boltzmann Generators via Constrained Mass Transport

Efficient sampling from high-dimensional and multimodal unnormalized probability distributions is a central challenge in many areas of science and machine learning. We focus on Boltzmann generators (BGs) that aim to sample the Boltzmann distribution of physical systems, such as molecules, at a given temperature. Classical variational approaches that minimize the reverse Kullback-Leibler divergence are prone to mode collapse, while annealing-based methods, commonly using geometric schedules, can suffer from mass teleportation and rely heavily on schedule tuning. We introduce Constrained Mass Transport (CMT), a variational framework that generates intermediate distributions under constraints on both the KL divergence and the entropy decay between successive steps. These constraints enhance distributional overlap, mitigate mass teleportation, and counteract premature convergence. Across standard BG benchmarks and the ELIL tetrapeptide introduced here, the largest system studied to date without access to samples from molecular dynamics, CMT consistently surpasses state-of-the-art variational methods, achieving more than 2.5x higher effective sample size while avoiding mode collapse.

Updated: 2026-03-02 11:46:53

Categories: cs.LG

Download: http://arxiv.org/abs/2510.18460v2

An Assessment of the Overlooked Dangers of Template Engines

Template engines play a pivotal role in modern web application development by enabling the dynamic rendering of content, products, and user interfaces. Today, they are essential for any website that handles dynamic data, from e-commerce to social media. However, their widespread adoption also makes them attractive targets for attackers seeking to exploit vulnerabilities and gain unauthorized access to web servers. This paper presents a comprehensive assessment of the risks associated with template engines, with a particular focus on the consequences of Server-Side Template Injection (SSTI) and the ease with which such vulnerabilities can escalate to Remote Code Execution (RCE), a critical security concern in web application development.

Updated: 2026-03-02 11:45:44

Categories: cs.CR

Download: http://arxiv.org/abs/2405.01118v2

Neural Spelling: A Spell-Based BCI System for Language Neural Decoding

Brain-computer interfaces (BCIs) present a promising avenue for communication by translating neural activity directly into text, eliminating the need for physical actions. However, existing non-invasive BCI systems have not successfully covered the entire alphabet, limiting their practicality. In this paper, we propose a novel non-invasive EEG-based BCI system with a Curriculum-based Neural Spelling Framework, which first recognizes all 26 letters of the alphabet by decoding neural signals associated with handwriting, and then applies generative AI (GenAI) to enhance spell-based neural language decoding tasks. Our approach combines the ease of handwriting with the accessibility of EEG technology, utilizing advanced neural decoding algorithms and pre-trained large language models (LLMs) to translate EEG patterns into text with high accuracy. This system shows how GenAI can improve the performance of typical spelling-based neural language decoding tasks and addresses the limitations of previous methods, offering a scalable and user-friendly solution for individuals with communication impairments, thereby enhancing inclusive communication options.

Updated: 2026-03-02 11:45:16

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2501.17489v2

DGNet: Discrete Green Networks for Data-Efficient Learning of Spatiotemporal PDEs

Spatiotemporal partial differential equations (PDEs) underpin a wide range of scientific and engineering applications. Neural PDE solvers offer a promising alternative to classical numerical methods. However, existing approaches typically require large numbers of training trajectories, while high-fidelity PDE data are expensive to generate. Under limited data, their performance degrades substantially, highlighting their low data efficiency. A key reason is that PDE dynamics embody strong structural inductive biases that are not explicitly encoded in neural architectures, forcing models to learn fundamental physical structure from data. A particularly salient manifestation of this inefficiency is poor generalization to unseen source terms. In this work, we revisit Green's function theory, a cornerstone of PDE theory, as a principled source of structural inductive bias for PDE learning. Based on this insight, we propose DGNet, a discrete Green network for data-efficient learning of spatiotemporal PDEs. The key idea is to transform the Green's function into a graph-based discrete formulation and embed the superposition principle into the hybrid physics-neural architecture, which reduces the burden of learning physical priors from data and thereby improves sample efficiency. Across diverse spatiotemporal PDE scenarios, DGNet consistently achieves state-of-the-art accuracy using only tens of training trajectories. Moreover, it exhibits robust zero-shot generalization to unseen source terms, serving as a stress test that highlights its data-efficient structural design.

Updated: 2026-03-02 11:40:27

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01762v1

Modular Memory is the Key to Continual Learning Agents

Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model's parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.

Updated: 2026-03-02 11:40:05

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01761v1

Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned. To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers. Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks. These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. Our code is available at https://github.com/CURRENTF/Uni-X

Updated: 2026-03-02 11:39:49

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.24365v3

GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? How do they perform compared to humans? Do they tend to reach an efficient and fair outcome? What is the role of natural language in strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. To answer these questions, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents' performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents. Our results suggest that the market parameters, as well as the choice of the LLMs, tend to have complex and interdependent effects on the economic outcome, which calls for careful design and analysis of the language-based economic ecosystem.

Updated: 2026-03-02 11:39:07

Categories: cs.CL,cs.AI,cs.CY,cs.GT,cs.LG

Download: http://arxiv.org/abs/2410.05254v3

Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning

Training large foundation models from scratch for domain-specific applications is almost impossible due to data limits and long-tailed distributions, with remote sensing (RS) being one example. Fine-tuning natural image pre-trained models on RS images is a straightforward solution. To reduce computational costs and improve performance on tail classes, existing methods apply parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and AdaptFormer. However, we observe that fixed hyperparameters, such as intra-layer positions, layer depth, and scaling factors, can considerably hinder PEFT performance, as fine-tuning on RS images proves highly sensitive to these settings. To address this, we propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning. MetaPEFT dynamically adjusts three key factors of PEFT on RS images: module insertion, layer selection, and module-wise learning rates, which collectively control the influence of PEFT modules across the network. We conduct extensive experiments on three transfer-learning scenarios and five datasets in both RS and natural image domains. The results show that MetaPEFT achieves state-of-the-art performance in cross-spectral adaptation, requiring only a small number of trainable parameters and improving tail-class accuracy significantly.

Updated: 2026-03-02 11:38:18

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01759v1

SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

We introduce $\textbf{SimuHome}$, a high-fidelity smart home simulator and a benchmark of 600 episodes for LLM-based smart home agents. Existing smart home benchmarks treat the home as a static system, neither simulating how device operations affect environmental variables over time nor supporting workflow scheduling of device commands. SimuHome is grounded in the Matter protocol, the industry standard that defines how real smart home devices communicate and operate. Agents interact with devices through SimuHome's APIs and observe how their actions continuously affect environmental variables such as temperature and humidity. Our benchmark covers state inquiry, implicit user intent inference, explicit device control, and workflow scheduling, each with both feasible and infeasible requests. For workflow scheduling, the simulator accelerates time so that scheduled workflows can be evaluated immediately. An evaluation of 18 agents reveals that workflow scheduling is the hardest category, with failures persisting across alternative agent frameworks and fine-tuning. These findings suggest that SimuHome's time-accelerated simulation could serve as an environment for agents to pre-validate their actions before committing them to the real world.

Updated: 2026-03-02 11:33:37

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.24282v3

Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

Video generation models have achieved remarkable progress in the past year. The quality of AI video continues to improve, but at the cost of larger model size, increased data quantity, and greater demand for training compute. In this report, we present Open-Sora 2.0, a commercial-level video generation model trained for only $200k. With this model, we demonstrate that the cost of training a top-performing video generation model is highly controllable. We detail all techniques that contribute to this efficiency breakthrough, including data curation, model architecture, training strategy, and system optimization. According to human evaluation results and VBench scores, Open-Sora 2.0 is comparable to global leading video generation models including the open-source HunyuanVideo and the closed-source Runway Gen-3 Alpha. By making Open-Sora 2.0 fully open-source, we aim to democratize access to advanced video generation technology, fostering broader innovation and creativity in content creation. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.

Updated: 2026-03-02 11:31:23

Categories: cs.GR,cs.AI

Download: http://arxiv.org/abs/2503.09642v3

Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm

While reinforcement learning for large language model alignment has progressed rapidly in recent years, transferring these paradigms to high-stakes medical question answering reveals a fundamental paradigm mismatch. Reinforcement Learning from Human Feedback relies on preference annotations that are prohibitively expensive and often fail to reflect the absolute correctness of medical facts. Reinforcement Learning from Verifiable Rewards lacks effective automatic verifiers and struggles to handle complex clinical contexts. Meanwhile, medical alignment requires the simultaneous optimization of correctness, safety, and compliance, yet multi-objective heterogeneous reward signals are prone to scale mismatch and optimization conflicts. To address these challenges, we propose a robust medical alignment paradigm. We first construct a holistic multi-dimensional medical alignment matrix that decomposes alignment objectives into four categories: fundamental capabilities, expert knowledge, online feedback, and format specifications. Within each category, we establish a closed loop in which observable metrics inform attributable diagnosis, which in turn drives optimizable rewards, thereby providing fine-grained, high-resolution supervision signals for subsequent iterative optimization. To resolve the gradient domination and optimization instability caused by heterogeneous signals, we further propose a unified optimization mechanism. This mechanism employs Reference-Frozen Normalization to align reward scales and implements a Tri-Factor Adaptive Dynamic Weighting strategy to achieve collaborative optimization that is weakness-oriented, risk-prioritized, and redundancy-reducing. Experimental results demonstrate the effectiveness of our proposed paradigm in real-world medical scenario evaluations, establishing a new paradigm for complex alignment in vertical domains.

Updated: 2026-03-02 11:30:55

Categories: cs.AI

Download: http://arxiv.org/abs/2602.11661v2

Federated Agentic AI for Wireless Networks: Fundamentals, Approaches, and Applications

Agentic artificial intelligence (AI) presents a promising pathway toward realizing autonomous and self-improving wireless network services. However, the resource-constrained, widely distributed, and data-heterogeneous nature of wireless networks poses significant challenges to existing agentic AI that relies on centralized architectures, leading to high communication overhead, privacy risks, and non-independent and identically distributed (non-IID) data. Federated learning (FL) has the potential to improve the overall loop of agentic AI through collaborative local learning and parameter sharing without exchanging raw data. This paper proposes new federated agentic AI approaches for wireless networks. We first summarize the fundamentals of agentic AI and the mainstream FL types. Then, we illustrate how each FL type can strengthen a specific component of agentic AI's loop. Moreover, we conduct a case study on using federated reinforcement learning (FRL) to improve the performance of agentic AI's action decisions in low-altitude wireless networks (LAWNs). Finally, we provide a conclusion and discuss future research directions.

Updated: 2026-03-02 11:26:56

Categories: cs.NI,cs.AI

Download: http://arxiv.org/abs/2603.01755v1

CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones

Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatial-aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIOU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
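
One plausible reading of a "max-magnitude-per-dimension" representation is sketched below in NumPy: when several tokens are merged, each feature dimension keeps the value with the largest absolute magnitude, sign included. The function name `max_magnitude_merge` and this exact rule are assumptions made for illustration; the abstract does not spell out the operation.

```python
import numpy as np

def max_magnitude_merge(tokens):
    """Merge a group of tokens into one.

    tokens: (n_tokens, dim) array of token features to merge.
    Returns a (dim,) vector holding, for each dimension, the entry with the
    largest absolute value among the tokens (the sign is preserved).
    """
    tokens = np.asarray(tokens, dtype=float)
    winner = np.abs(tokens).argmax(axis=0)            # winning token index per dimension
    return tokens[winner, np.arange(tokens.shape[1])]

group = [[1.0, -5.0, 2.0],
         [-3.0, 4.0, 2.5]]
merged = max_magnitude_merge(group)                   # array([-3. , -5. ,  2.5])
```

Compared with averaging, which would give `[-1.0, -0.5, 2.25]` for the same group, this rule does not let a strong activation in one token be diluted by a weak one in another.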

Updated: 2026-03-02 11:26:33

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.21764v2

FSW-GNN: A Bi-Lipschitz WL-Equivalent Graph Neural Network

Famously, the ability of Message Passing Neural Networks (MPNNs) to distinguish between graphs is limited to graphs separable by the Weisfeiler-Leman (WL) graph isomorphism test, and the strongest MPNNs, in terms of separation power, are WL-equivalent. However, it was demonstrated that the quality of separation provided by standard WL-equivalent MPNNs can be very low, resulting in WL-separable graphs being mapped to very similar, hardly distinguishable outputs. This phenomenon can be explained by the recent observation that standard MPNNs are not lower-Lipschitz. This paper addresses this issue by introducing FSW-GNN, the first MPNN that is fully bi-Lipschitz with respect to standard WL-equivalent graph metrics. Empirically, we show that our MPNN is competitive with standard MPNNs for several graph learning tasks and is far more accurate in long-range tasks, due to its ability to avoid oversmoothing and oversquashing. Our code is available at https://github.com/yonatansverdlov/Over-squashing.
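
For readers unfamiliar with the term, the bi-Lipschitz property can be written as follows. The symbols $f$ (the network's graph embedding), $d_{\mathrm{WL}}$ (a WL-equivalent graph metric), and the constants $c, C$ are notation chosen here for illustration and are not taken from the paper.

```latex
\[
c \, d_{\mathrm{WL}}(G, H) \;\le\; \lVert f(G) - f(H) \rVert \;\le\; C \, d_{\mathrm{WL}}(G, H)
\qquad \text{for all graphs } G, H, \quad 0 < c \le C < \infty .
\]
```

The right-hand inequality is the usual (upper) Lipschitz condition. The left-hand inequality is the lower-Lipschitz condition mentioned in the abstract: it rules out mapping two WL-separable graphs to nearly identical outputs.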

Updated: 2026-03-02 11:24:14

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2410.09118v2

EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Most existing benchmarks for understanding egocentric vision focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. The code and data can be found at https://github.com/dehezhang2/EgoNight.

Updated: 2026-03-02 11:23:10

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.06218v2

A Neural Network-Based Real-time Casing Collar Recognition System for Downhole Instruments

Casing collar locator (CCL) measurements are widely used as reliable depth markers for positioning downhole instruments in cased-hole operations, enabling accurate depth control for operations such as perforation. However, autonomous collar recognition in downhole environments remains challenging because CCL signals are often corrupted by toolstring- or casing-induced magnetic interference, while stringent size and power budgets limit the use of computationally intensive algorithms and specific operations require real-time, in-situ processing. To address these constraints, we propose Collar Recognition Nets (CRNs), a family of domain-specific lightweight 1-D convolutional neural networks for collar signature recognition from streaming CCL waveforms. With depthwise separable convolutions and input pooling, CRNs optimize efficiency without sacrificing accuracy. Our most compact model achieves an F1-score of 0.972 on field data with only 1,985 parameters and 8,208 MACs. Deployed on an ARM Cortex-M7 based embedded system using the TensorFlow Lite for Microcontrollers (TFLM) library, the model demonstrates a throughput of 1,000 inferences per second and a latency of 343.2 μs, confirming the feasibility of robust, autonomous, and real-time collar recognition under stringent downhole constraints.
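
To make the efficiency argument concrete, the sketch below compares parameter counts for a standard 1-D convolution and a depthwise separable one, and then works through the reported timing figures. The layer sizes (16 input channels, 32 output channels, kernel width 5) are hypothetical and are not the CRN architecture; the duty-cycle calculation assumes the reported latency is the compute time of a single inference.

```python
def conv1d_params(c_in, c_out, k, bias=True):
    """Parameters of a standard 1-D convolution."""
    return c_in * c_out * k + (c_out if bias else 0)

def depthwise_separable_conv1d_params(c_in, c_out, k, bias=True):
    """Parameters of a depthwise conv (one k-tap filter per input channel)
    followed by a pointwise 1x1 conv that mixes channels."""
    depthwise = c_in * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

# Hypothetical layer: 16 -> 32 channels, kernel width 5
standard = conv1d_params(16, 32, 5)                        # 2592
separable = depthwise_separable_conv1d_params(16, 32, 5)   # 640

# Figures reported in the abstract
latency_s = 343.2e-6      # seconds per inference
rate_hz = 1000            # inferences per second
busy_fraction = latency_s * rate_hz   # share of each second spent on inference
```

Under these assumptions the separable layer needs roughly a quarter of the parameters, and at 1,000 inferences per second the processor would spend about 34% of its time on inference.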

Updated: 2026-03-02 11:21:58

Categories: eess.SY,cs.AI,cs.LG,eess.SP

Download: http://arxiv.org/abs/2512.22901v2

Causal Circuit Tracing Reveals Distinct Computational Architectures in Single-Cell Foundation Models: Inhibitory Dominance, Biological Coherence, and Cross-Model Convergence

Motivation: Sparse autoencoders (SAEs) decompose foundation model activations into interpretable features, but causal feature-to-feature interactions across network depth remain unknown for biological foundation models. Results: We introduce causal circuit tracing by ablating SAE features and measuring downstream responses, and apply it to Geneformer V2-316M and scGPT whole-human across four conditions (96,892 edges, 80,191 forward passes). Both models show approximately 53 percent biological coherence and 65 to 89 percent inhibitory dominance, invariant to architecture and cell type. scGPT produces stronger effects (mean absolute d = 1.40 vs. 1.05) with more balanced dynamics. Cross-model consensus yields 1,142 conserved domain pairs (10.6x enrichment, p < 0.001). Disease-associated domains are 3.59x more likely to be consensus. Gene-level CRISPRi validation shows 56.4 percent directional accuracy, confirming co-expression rather than causal encoding.

Updated: 2026-03-02 11:21:44

Categories: cs.LG,q-bio.CB,q-bio.GN

Download: http://arxiv.org/abs/2603.01752v1

Shape-Interpretable Visual Self-Modeling Enables Geometry-Aware Continuum Robot Control

Continuum robots possess high flexibility and redundancy, making them well suited for safe interaction in complex environments, yet their continuous deformation and nonlinear dynamics pose fundamental challenges to perception, modeling, and control. Existing vision-based control approaches often rely on end-to-end learning, achieving shape regulation without explicit awareness of robot geometry or its interaction with the environment. Here, we introduce a shape-interpretable visual self-modeling framework for continuum robots that enables geometry-aware control. Robot shapes are encoded from multi-view planar images using a Bezier-curve representation, transforming visual observations into a compact and physically meaningful shape space that uniquely characterizes the robot's three-dimensional configuration. Based on this representation, neural ordinary differential equations are employed to self-model both shape and end-effector dynamics directly from data, enabling hybrid shape-position control without analytical models or dense body markers. The explicit geometric structure of the learned shape space allows the robot to reason about its body and surroundings, supporting environment-aware behaviors such as obstacle avoidance and self-motion while maintaining end-effector objectives. Experiments on a cable-driven continuum robot demonstrate accurate shape-position regulation and tracking, with shape errors within 1.56% of image resolution and end-effector errors within 2% of robot length, as well as robust performance in constrained environments. By elevating visual shape representations from two-dimensional observations to an interpretable three-dimensional self-model, this work establishes a principled alternative to vision-based end-to-end control and advances autonomous, geometry-aware manipulation for continuum robots.
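
As background on the Bezier-curve representation, the following sketch evaluates a Bezier curve from its control points using Bernstein polynomials. The function name `bezier_point` and the example control points are illustrative assumptions; the abstract does not state the curve degree or how control points are fitted from images.

```python
import numpy as np
from math import comb

def bezier_point(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1].

    control_points: (n + 1, dim) array defining a degree-n curve.
    """
    P = np.asarray(control_points, dtype=float)
    n = len(P) - 1
    # Bernstein basis weights: C(n, i) * t^i * (1 - t)^(n - i)
    weights = np.array([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)])
    return weights @ P

# Hypothetical cubic curve in 3-D (four control points)
ctrl = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
base = bezier_point(ctrl, 0.0)   # coincides with the first control point
mid = bezier_point(ctrl, 0.5)    # array([0.875, 0.5, 0.125])
tip = bezier_point(ctrl, 1.0)    # coincides with the last control point
```

A handful of control points therefore describes the whole backbone shape, which is what makes such a representation compact.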

Updated: 2026-03-02 11:20:28

Categories: cs.RO,cs.AI,cs.LG,eess.SY

Download: http://arxiv.org/abs/2603.01751v1

Practical Deep Heteroskedastic Regression

Uncertainty quantification (UQ) in deep learning regression is of wide interest, as it supports critical applications including sequential decision making and risk-sensitive tasks. In heteroskedastic regression, where the uncertainty of the target depends on the input, a common approach is to train a neural network that parameterizes the mean and the variance of the predictive distribution. Still, training deep heteroskedastic regression models poses practical challenges in the trade-off between uncertainty quantification and mean prediction, such as optimization difficulties, representation collapse, and variance overfitting. In this work we identify previously undiscussed fallacies and propose a simple and efficient procedure that addresses these challenges jointly by post-hoc fitting a variance model across the intermediate layers of a pretrained network on a hold-out dataset. We demonstrate that our method achieves on-par or state-of-the-art uncertainty quantification on several molecular graph datasets, without compromising mean prediction accuracy and remaining cheap to use at prediction time.
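
The abstract does not name the training loss. A common choice for a network that outputs both a mean and a variance is the Gaussian negative log-likelihood, sketched below in NumPy. Predicting the log-variance rather than the variance is a convention assumed here to keep the variance positive; the function name `gaussian_nll` is likewise illustrative.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Mean Gaussian negative log-likelihood.

    y: observed targets
    mean: predicted means
    log_var: predicted log-variances (assumed parameterisation)
    """
    y, mean, log_var = (np.asarray(a, dtype=float) for a in (y, mean, log_var))
    per_point = 0.5 * (np.log(2.0 * np.pi) + log_var + (y - mean) ** 2 / np.exp(log_var))
    return float(per_point.mean())

# Same residual of 2.0, scored under a small and a large predicted variance
confident = gaussian_nll([2.0], [0.0], [0.0])          # variance 1
cautious = gaussian_nll([2.0], [0.0], [np.log(4.0)])   # variance 4
```

For the same error, the prediction with the larger variance receives the lower loss. The squared error is divided by the predicted variance, so the mean and variance estimates are coupled during training, which is one source of the optimization difficulties the abstract mentions.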

Updated: 2026-03-02 11:19:32

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01750v1

Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model

The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.

Updated: 2026-03-02 11:19:29

Categories: cs.LG,cs.CV,q-bio.NC

Download: http://arxiv.org/abs/2512.11582v2

Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Large language models face challenges in long-context question answering, where key evidence for a query may be dispersed across millions of tokens. Existing works equip large language models with a memory buffer that is dynamically updated via a linear document scan, an approach known as "memorize while reading". While this approach scales efficiently, it suffers from pruning of latent evidence, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively call back historical memories for non-linear reasoning. To further strengthen training, we propose a multi-level reward design, which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support complex multi-hop reasoning. Extensive experiments demonstrate that ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering while incurring negligible computational overhead, validating its ability to trade marginal cost for robust long-context reasoning.

Updated: 2026-03-02 11:19:01

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.23040v5

Discrete World Models via Regularization

World models aim to capture the states and dynamics of an environment in a compact latent space. Moreover, using Boolean state representations is particularly useful for search heuristics and symbolic reasoning and planning. Existing approaches keep latents informative via decoder-based reconstruction, or instead via contrastive or reward signals. In this work, we introduce Discrete World Models via Regularization (DWMR): a reconstruction-free and contrastive-free method for unsupervised Boolean world-model learning. In particular, we introduce a novel world-modeling loss that couples latent prediction with specialized regularizers. Such regularizers maximize the entropy and independence of the representation bits through variance, correlation, and coskewness penalties, while simultaneously enforcing a locality prior for sparse action changes. To enable effective optimization, we also introduce a novel training scheme improving robustness to discrete roll-outs. Experiments on two benchmarks with underlying combinatorial structure show that DWMR learns more accurate representations and transitions than reconstruction-based alternatives. Finally, DWMR can also be paired with an auxiliary reconstruction decoder, and this combination yields additional gains.

Updated: 2026-03-02 11:17:38

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01748v1

An Analysis of Multi-Task Architectures for the Hierarchic Multi-Label Problem of Vehicle Model and Make Classification

Most information in our world is organized hierarchically; however, many Deep Learning approaches do not leverage this semantically rich structure. Research suggests that human learning benefits from exploiting the hierarchical structure of information, and intelligent models could similarly take advantage of this through multi-task learning. In this work, we analyze the advantages and limitations of multi-task learning in a hierarchical multi-label classification problem: car make and model classification. Considering both parallel and cascaded multi-task architectures, we evaluate their impact on different Deep Learning classifiers (CNNs, Transformers) while varying key factors such as dropout rate and loss weighting to gain deeper insight into the effectiveness of this approach. The tests are conducted on two established benchmarks: StanfordCars and CompCars. We observe the effectiveness of the multi-task paradigm on both datasets, improving the performance of the investigated CNN in almost all scenarios. Furthermore, the approach yields significant improvements on the CompCars dataset for both types of models.

Updated: 2026-03-02 11:17:32

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.01746v1

Latent Diffusion Model without Variational Autoencoder

Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.

Updated: 2026-03-02 11:09:50

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.15301v4

Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible. The necessity of these conditions is also discussed. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.

Updated: 2026-03-02 11:09:17

Categories: cs.SD,cs.LG,eess.AS

Download: http://arxiv.org/abs/2603.02285v1

Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page at https://naoki04.github.io/paper-cpo/.

Updated: 2026-03-02 11:06:40

Categories: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2603.01741v1

Decoupling Dynamical Richness from Representation Learning: Towards Practical Measurement

Dynamic feature transformation (the rich regime) does not always align with predictive performance (better representation), yet accuracy is often used as a proxy for richness, limiting analysis of their relationship. We propose a computationally efficient, performance-independent metric of richness grounded in the low-rank bias of rich dynamics, which recovers neural collapse as a special case. The metric is empirically more stable than existing alternatives and captures known lazy-to-rich transitions (e.g., grokking) without relying on accuracy. We further use it to examine how training factors (e.g., learning rate) relate to richness, confirming recognized assumptions and highlighting new observations (e.g., batch normalization promotes rich dynamics). An eigendecomposition-based visualization is also introduced to support interpretability, together providing a diagnostic tool for studying the relationship between training factors, dynamics, and representations.

Updated: 2026-03-02 11:06:16

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2410.04264v3

CA-AFP: Cluster-Aware Adaptive Federated Pruning

Federated Learning (FL) faces major challenges in real-world deployments due to statistical heterogeneity across clients and system heterogeneity arising from resource-constrained devices. While clustering-based approaches mitigate statistical heterogeneity and pruning techniques improve memory and communication efficiency, these strategies are typically studied in isolation. We propose CA-AFP, a unified framework that jointly addresses both challenges by performing cluster-specific model pruning. In CA-AFP, clients are first grouped into clusters, and a separate model for each cluster is adaptively pruned during training. The framework introduces two key innovations: (1) a cluster-aware importance scoring mechanism that combines weight magnitude, intra-cluster coherence, and gradient consistency to identify parameters for pruning, and (2) an iterative pruning schedule that progressively removes parameters while enabling model self-healing through weight regrowth. We evaluate CA-AFP on two widely used human activity recognition benchmarks, UCI HAR and WISDM, under natural user-based federated partitions. Experimental results demonstrate that CA-AFP achieves a favorable balance between predictive accuracy, inter-client fairness, and communication efficiency. Compared to pruning-based baselines, CA-AFP consistently improves accuracy and lowers performance disparity across clients with limited fine-tuning, while requiring substantially less communication than dense clustering-based methods. It also shows robustness to different Non-IID levels of data. Finally, ablation studies analyze the impact of clustering, pruning schedules, and the scoring mechanism, offering practical insights into the design of efficient and adaptive FL systems.

Updated: 2026-03-02 11:04:25

Categories: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2603.01739v1

Solving Inverse PDE Problems using Minimization Methods and AI

Many physical and engineering systems require solving direct problems to predict behavior and inverse problems to determine unknown parameters from measurements. In this work, we study both aspects for systems governed by differential equations, contrasting well-established numerical methods with new AI-based techniques, specifically Physics-Informed Neural Networks (PINNs). We first analyze the logistic differential equation, using its closed-form solution to verify numerical schemes and validate PINN performance. We then address the Porous Medium Equation (PME), a nonlinear partial differential equation with no general closed-form solution, building strong solvers of the direct problem and testing techniques for parameter estimation in the inverse problem. Our results suggest that PINNs can closely estimate solutions at competitive computational cost, and thus offer an effective tool for solving both direct and inverse problems for complex systems.
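
As an illustration of checking a numerical scheme against the closed-form logistic solution, here is a minimal sketch. The choice of a classical Runge-Kutta integrator, the function names, and the parameter values are assumptions made for this example and are not taken from the paper.

```python
import math


def logistic_exact(t, r, K, y0):
    """Closed-form solution of dy/dt = r*y*(1 - y/K) with y(0) = y0."""
    return K / (1.0 + (K - y0) / y0 * math.exp(-r * t))


def logistic_rk4(t_end, r, K, y0, steps):
    """Integrate the same equation with classical fourth-order Runge-Kutta."""
    def f(y):
        return r * y * (1.0 - y / K)

    h = t_end / steps
    y = y0
    for _ in range(steps):
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


def endpoint_error(r=1.0, K=10.0, y0=1.0, t_end=5.0, steps=100):
    """Absolute difference between the numerical and exact values at t_end."""
    return abs(logistic_rk4(t_end, r, K, y0, steps) - logistic_exact(t_end, r, K, y0))
```

Calling `endpoint_error()` with more steps should shrink the discrepancy, which is the kind of sanity check a closed-form solution makes possible before moving to an equation such as the PME that has none.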

Updated: 2026-03-02 10:57:26

Categories: math.NA,cs.AI,math.AP,math.OC

Download: http://arxiv.org/abs/2603.01731v1

Decentralized Federated Learning by Partial Message Exchange

Decentralized federated learning (DFL) has emerged as a transformative server-free paradigm that enables collaborative learning over large-scale heterogeneous networks. However, it continues to face fundamental challenges, including data heterogeneity, restrictive assumptions for theoretical analysis, and degraded convergence when standard communication- or privacy-enhancing techniques are applied. To overcome these drawbacks, this paper develops a novel algorithm, PaME (DFL by Partial Message Exchange). The central principle is to allow only randomly selected sparse coordinates to be exchanged between two neighbor nodes. Consequently, PaME achieves substantial reductions in communication costs while still preserving a high level of privacy, without sacrificing accuracy. Moreover, grounded in rigorous analysis, the algorithm is shown to converge at a linear rate provided that the gradient is locally Lipschitz continuous and the communication matrix is doubly stochastic. These two mild assumptions not only dispense with many restrictive conditions commonly imposed by existing DFL methods but also enable PaME to effectively address data heterogeneity. Furthermore, comprehensive numerical experiments demonstrate its superior performance compared with several representative decentralized learning algorithms.
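
The idea of exchanging only a random sparse subset of coordinates between two neighbors can be sketched as follows. The abstract does not state the update rule, so the pairwise averaging used here, the function name `partial_exchange`, and the parameter `k` are all assumptions for illustration, not the PaME algorithm itself.

```python
import random


def partial_exchange(x_i, x_j, k, rng):
    """Exchange k randomly chosen coordinates between two neighbor nodes.

    Hypothetical rule: the selected coordinates are replaced by their pairwise
    average on both nodes; all other coordinates stay local and are never sent.
    """
    idx = rng.sample(range(len(x_i)), k)
    new_i, new_j = list(x_i), list(x_j)
    for d in idx:
        avg = 0.5 * (x_i[d] + x_j[d])
        new_i[d] = avg
        new_j[d] = avg
    return new_i, new_j, sorted(idx)
```

With `k` much smaller than the model dimension, each round transmits only `k` values per neighbor pair, which is where the communication saving in this sketch comes from.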

Updated: 2026-03-02 10:57:18

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01730v1

Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models

Reinforcement learning has emerged as a promising paradigm for aligning diffusion and flow-matching models with human preferences, yet practitioners face fragmented codebases, model-specific implementations, and engineering complexity. We introduce Flow-Factory, a unified framework that decouples algorithms, models, and rewards through a modular, registry-based architecture. This design enables seamless integration of new algorithms and architectures, as demonstrated by our support for GRPO, DiffusionNFT, and AWM across Flux, Qwen-Image, and WAN video models. By minimizing implementation overhead, Flow-Factory empowers researchers to rapidly prototype and scale future innovations with ease. Flow-Factory provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support. The codebase is available at https://github.com/X-GenGroup/Flow-Factory.

Updated: 2026-03-02 10:54:54

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2602.12529v2

GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules

Online content moderation is essential for maintaining a healthy digital environment, and reliance on AI for this task continues to grow. Consider a user comment using national stereotypes to insult a politician. This example illustrates two critical challenges in real-world scenarios: (1) Co-occurring Violations, where a single post violates multiple policies (e.g., prejudice and personal attacks); (2) Dynamic rules of moderation, where determination of a violation depends on platform-specific guidelines that evolve across contexts. The intersection of co-occurring harms and dynamically changing rules highlights a core limitation of current AI systems: although large language models (LLMs) are adept at following fixed guidelines, their judgment capabilities degrade when policies are unstable or context-dependent. In practice, such shortcomings lead to inconsistent moderation: either erroneously restricting legitimate expression or allowing harmful content to remain online. This raises a critical question for evaluation: Does high performance on existing static benchmarks truly guarantee robust generalization of AI judgment to real-world scenarios involving co-occurring violations and dynamically changing rules?

Updated: 2026-03-02 10:50:11

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01724v1

Distributional Priors Guided Diffusion for Generating 3D Molecules in Low Data Regimes

Can we train a 3D molecule generator using data from dense regions to generate samples in sparse regions? This challenge can be framed as an out-of-distribution (OOD) generation problem. While prior research on OOD generation predominantly targets property shifts, structural shifts -- such as differences in molecular scaffolds or functional groups -- represent an equally critical source of distributional shifts. This work introduces the Geometric OOD Diffusion Model (GODD), a novel diffusion-based framework that enables training on data-abundant molecular distributions while generalizing to data-scarce distributions under distributional structural shifts. Central to our approach is a designated equivariant asymmetric autoencoder to capture distributional structural priors. The asymmetric design allows the model to generalize to unseen structural variations by capturing distributional priors representing distinct distributions. The encoded structural-grained priors guide generation toward sparse regions without requiring explicit training on such data. Evaluated across standard benchmarks encompassing OOD structural shifts (e.g., scaffolds, rings), GODD achieves an improvement of 12.6% in success rate, defined based on molecular validity, uniqueness, and novelty. Furthermore, the framework demonstrates promising performance and generalization on canonical fragment-based drug design tasks, highlighting its utility in learning-based molecular discovery.

Updated: 2026-03-02 10:47:32

Categories: cs.LG,physics.chem-ph,q-bio.BM

Download: http://arxiv.org/abs/2404.00962v2

RL for Reasoning by Adaptively Revealing Rationales

Learning in the combinatorially large output space of sequence generation problems is challenging as providing expert demonstrations scales poorly with sequence length, and RL struggles with sparse rewards. Between dense demonstrations in supervised training and no demonstrations in reinforcement learning lies an underexplored regime: partial supervision. We ask whether some classes of sequence learning problems become efficiently learnable by exploiting this gap. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals a partial prefix of the target output. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality--it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that AdaBack reliably solves problems that are otherwise intractable. On three mathematical reasoning benchmarks, DeepScaleR, MATH, and GSM8k, we find that AdaBack enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
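
A per-sample curriculum of this kind can be sketched roughly as below. The abstract only says that the revealed prefix length is adjusted per sample from past reward, so the threshold rule, the step size, and both function names are hypothetical choices for illustration rather than the AdaBack update itself.

```python
def update_reveal_fraction(fraction, reward, threshold=0.5, step=0.1):
    """Adjust how much of the target is revealed for one sample.

    Hypothetical rule: reveal less after a rewarded attempt, more after a
    failed one, and keep the fraction inside [0, 1].
    """
    if reward >= threshold:
        fraction -= step
    else:
        fraction += step
    return min(1.0, max(0.0, fraction))


def build_prompt(question_tokens, target_tokens, fraction):
    """Return (conditioning tokens, tokens left for the model to generate)."""
    n_revealed = int(fraction * len(target_tokens))  # rounded down
    return question_tokens + target_tokens[:n_revealed], target_tokens[n_revealed:]
```

At a fraction of 1.0 the sample is fully supervised, and at 0.0 the model must produce the whole output from the question alone, so the fraction interpolates between the two regimes the abstract contrasts.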

Updated: 2026-03-02 10:46:54

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.18110v2

The Counting Power of Transformers

Counting properties (e.g. determining whether certain tokens occur more than other tokens in a given input text) have played a significant role in the study of expressiveness of transformers. In this paper, we provide a formal framework for investigating the counting power of transformers. We argue that all existing results demonstrate transformers' expressivity only for (semi-)linear counting properties, i.e., which are expressible as a boolean combination of linear inequalities. Our main result is that transformers can express counting properties that are highly nonlinear. More precisely, we prove that transformers can capture all semialgebraic counting properties, i.e., expressible as a boolean combination of arbitrary multivariate polynomials (of any degree). Among others, these generalize the counting properties that can be captured by C-RASP softmax transformers, which capture only linear counting properties. To complement this result, we exhibit a natural subclass of (softmax) transformers that completely characterizes semialgebraic counting properties. Through connections with Hilbert's tenth problem, this expressivity of transformers also yields a new undecidability result for analyzing an extremely simple transformer model -- surprisingly with neither positional encodings (i.e. NoPE-transformers) nor masking. We also experimentally validate trainability of such counting properties.
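
To make the distinction concrete, the sketch below contrasts a linear counting property with a degree-2 polynomial one. The specific predicates and the token alphabet are invented for illustration; they are examples of the two classes, not properties studied in the paper.

```python
from collections import Counter


def linear_property(tokens):
    """Linear counting property: 'a' occurs more often than 'b'."""
    c = Counter(tokens)
    return c["a"] - c["b"] > 0


def semialgebraic_property(tokens):
    """Nonlinear counting property: #a * #b exceeds the square of #c."""
    c = Counter(tokens)
    return c["a"] * c["b"] - c["c"] ** 2 > 0
```

Both predicates depend only on how often each token occurs, not on token order, which is what makes them counting properties; the second one cannot be written as a single linear inequality in the counts.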

Updated: 2026-03-02 10:44:04

Categories: cs.CL,cs.FL,cs.LG

Download: http://arxiv.org/abs/2505.11199v3

Co-optimization for Adaptive Conformal Prediction

Conformal prediction (CP) provides finite-sample, distribution-free marginal coverage, but standard conformal regression intervals can be inefficient under heteroscedasticity and skewness. In particular, popular constructions such as conformalized quantile regression (CQR) often inherit a fixed notion of center and enforce equal-tailed errors, which can displace the interval away from high-density regions and produce unnecessarily wide sets. We propose Co-optimization for Adaptive Conformal Prediction (CoCP), a framework that learns prediction intervals by jointly optimizing a center $m(x)$ and a radius $h(x)$. CoCP alternates between (i) learning $h(x)$ via quantile regression on the folded absolute residual around the current center, and (ii) refining $m(x)$ with a differentiable soft-coverage objective whose gradients concentrate near the current boundaries, effectively correcting mis-centering without estimating the full conditional density. Finite-sample marginal validity is guaranteed by split-conformal calibration with a normalized nonconformity score. Theory characterizes the population fixed point of the soft objective and shows that, under standard regularity conditions, CoCP asymptotically approaches the length-minimizing conditional interval at the target coverage level as the estimation error and smoothing vanish. Experiments on synthetic and real benchmarks demonstrate that CoCP yields consistently shorter intervals and achieves state-of-the-art conditional-coverage diagnostics.
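
The split-conformal calibration step can be sketched as follows, given already-fitted values of $m(x)$ and $h(x)$ on a held-out calibration set. The score $|y - m(x)| / h(x)$ is one common normalized nonconformity score and is an assumption here, since the abstract does not spell out the exact form; the function names are likewise illustrative.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[rank - 1]


def calibrate(y_cal, m_cal, h_cal, alpha):
    """Compute the scaling factor q from normalized scores |y - m| / h (assumed form)."""
    scores = [abs(y - m) / h for y, m, h in zip(y_cal, m_cal, h_cal)]
    return conformal_quantile(scores, alpha)


def interval(m_x, h_x, q):
    """Prediction interval centered at m(x) with radius q * h(x)."""
    return m_x - q * h_x, m_x + q * h_x
```

In this sketch the learned center and radius only shape the interval; the coverage guarantee comes from the quantile rank `ceil((n + 1) * (1 - alpha))` applied to exchangeable calibration scores.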

Updated: 2026-03-02 10:43:19

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2603.01719v1

Rapid training of Hamiltonian graph networks using random features

Learning dynamical systems that respect physical symmetries and constraints remains a fundamental challenge in data-driven modeling. Integrating physical laws with graph neural networks facilitates principled modeling of complex N-body dynamics and yields accurate and permutation-invariant models. However, training graph neural networks with iterative, gradient-descent-based optimization algorithms (e.g., Adam, RMSProp, LBFGS) often leads to slow training, especially for large, complex systems. In comparison to 15 different optimizers, we demonstrate that Hamiltonian Graph Networks (HGN) can be trained 150-600x faster - but with comparable accuracy - by replacing iterative optimization with random feature-based parameter construction. We show robust performance in diverse simulations, including N-body mass-spring and molecular dynamics systems in up to dimensions and 10,000 particles with different geometries, while retaining essential physical invariances with respect to permutation, rotation, and translation. Our proposed approach is benchmarked using a NeurIPS 2022 Datasets and Benchmarks Track publication to further demonstrate its versatility. We reveal that even when trained on minimal 8-node systems, the model can generalize in a zero-shot manner to systems as large as 4096 nodes without retraining. Our work challenges the dominance of iterative gradient-descent-based optimization algorithms for training neural network models for physical systems.

Updated: 2026-03-02 10:41:47

Categories: cs.LG,cs.NE

Download: http://arxiv.org/abs/2506.06558v3

TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training

Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories demonstrating reflective recovery, semantic efficiency, and strategic diversity to mitigate covariate shift and mode collapse; for RL, we select tasks with high error branch ratios and strategic heterogeneity, maximizing gradient Signal-to-Noise Ratio to address vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines. We will release the code and data soon for further investigations.

Updated: 2026-03-02 10:38:54

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2603.01714v1

Synaptic bundle theory for spike-driven sensor-motor system: More than eight independent synaptic bundles collapse reward-STDP learning

Neuronal spikes directly drive muscles and endow animals with agile movements, but applying the spike-based control signals to actuators in artificial sensor-motor systems inevitably causes a collapse of learning. We developed a system that can vary the number of independent synaptic bundles in sensor-to-motor connections. This paper demonstrates the following four findings: (i) Learning collapses once the number of motor neurons or the number of independent synaptic bundles exceeds a critical limit. (ii) The probability of learning failure is increased by a smaller number of motor neurons, while (iii) if learning succeeds, a smaller number of motor neurons leads to faster learning. (iv) The number of weight updates that move in the opposite direction of the optimal weight can quantitatively explain these results. The functions of spikes remain largely unknown. Identifying the parameter range in which learning systems using spikes can be constructed will make it possible to study the functions of spikes that were previously inaccessible due to the difficulty of learning.

Updated: 2026-03-02 10:37:34

Categories: q-bio.NC,cs.AI,nlin.AO

Download: http://arxiv.org/abs/2508.14492v2

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs--an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning--highlighting both the promise and current boundaries of autonomous LLM fine-tuning.

Updated: 2026-03-02 10:37:11

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.01712v1

Legal RAG Bench: an end-to-end benchmark for legal RAG

We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
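
A full factorial design over the models named above pairs every embedding model with every LLM. The sketch below enumerates that grid and averages a score over one factor level; the helper names and the `scores` dictionary layout are assumptions for illustration and do not reproduce the benchmark's own code or its hierarchical error decomposition.

```python
from itertools import product

EMBEDDERS = ["Kanon 2 Embedder", "Gemini Embedding 001", "Text Embedding 3 Large"]
LLMS = ["Gemini 3.1 Pro", "GPT-5.2"]


def factorial_grid(embedders, llms):
    """Every retrieval model paired with every reasoning model."""
    return list(product(embedders, llms))


def main_effect(scores, level, position):
    """Average score over all cells whose factor at `position` equals `level`.

    `scores` maps (embedder, llm) tuples to a metric value supplied by the caller.
    """
    cells = [s for cfg, s in scores.items() if cfg[position] == level]
    return sum(cells) / len(cells)
```

Because each embedder is evaluated with each LLM, averaging over one factor while holding the other fixed gives a like-for-like comparison of retrieval and reasoning contributions.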

Updated: 2026-03-02 10:34:28

Categories: cs.CL,cs.IR,cs.LG

Download: http://arxiv.org/abs/2603.01710v1

Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking

Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).

Updated: 2026-03-02 10:30:54

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2603.01706v1

Navigating with Annealing Guidance Scale in Diffusion Space

Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.
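As a reading aid for the abstract above, here is a minimal sketch of how a guidance scale enters classifier-free guidance, using the commonly cited combination rule. The function names are hypothetical, and the linear schedule is only a hand-written stand-in: the paper describes a learned scheduler driven by the conditional noisy signal, which this sketch does not reproduce.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG rule: start from the unconditional noise prediction and
    move toward (or past) the conditional one by the guidance scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def linear_anneal(step, num_steps, start_scale, end_scale):
    """Hypothetical fixed schedule (an assumption, not the paper's learned
    policy): interpolate the guidance scale linearly over sampling steps."""
    t = step / max(num_steps - 1, 1)
    return start_scale + t * (end_scale - start_scale)


if __name__ == "__main__":
    eps_u, eps_c = [0.0, 1.0], [1.0, 3.0]
    for step in range(5):
        w = linear_anneal(step, 5, start_scale=7.5, end_scale=1.0)
        print(step, w, cfg_combine(eps_u, eps_c, w))
```

A scale of 0 returns the unconditional prediction, 1 returns the conditional one, and larger values extrapolate beyond it, which is why the choice of scale at each step matters.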

Updated: 2026-03-02 10:29:21

Categories: cs.GR,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2506.24108v2

Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)

While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to only one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are naturally data-free, and benefiting from real data requires additional, complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and is also extended to their modifications, such as Bridge Matching and Stochastic Interpolants. The code can be found at https://github.com/David-cripto/RealUID.

Updated: 2026-03-02 10:27:37

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2509.22459v2

Security Risks in Machining Process Monitoring: Sequence-to-Sequence Learning for Reconstruction of CNC Axis Positions

Accelerometer-based process monitoring is widely deployed in modern machining systems. When mounted on moving machine components, such sensors implicitly capture kinematic information related to machine motion and tool trajectories. If this information can be reconstructed, condition monitoring data constitutes a severe security threat, particularly for retrofitted or weakly protected sensor systems. Classical signal processing approaches are infeasible for position reconstruction from broadband accelerometer signals due to sensor- and process-specific non-idealities, like noise or sensor placement effects. In this work, we demonstrate that sequence-to-sequence machine learning models can overcome these non-idealities and enable reconstruction of CNC axis and tool positions. Our approach employs LSTM-based sequence-to-sequence models and is evaluated on an industrial milling dataset. We show that learning-based models reduce the reconstruction error by up to 98% for low complexity motion profiles and by up to 85% for complex machining sequences compared to double integration. Furthermore, key geometric characteristics of tool trajectories and workpiece-related motion features are preserved. To the best of our knowledge, this is the first study demonstrating learning-based CNC position reconstruction from industrial condition monitoring accelerometer data.
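The abstract above uses double integration as its classical baseline. The sketch below shows what that baseline might look like with a trapezoidal rule, and why a small constant sensor bias makes it drift. The function names, sample rate, and bias value are illustrative assumptions, not details from the paper.

```python
def cumulative_trapezoid(samples, dt):
    """Running trapezoidal integral of uniformly sampled data, starting at 0."""
    out = [0.0]
    for prev, cur in zip(samples, samples[1:]):
        out.append(out[-1] + 0.5 * (prev + cur) * dt)
    return out


def double_integrate(accel, dt):
    """Acceleration -> velocity -> position, assuming zero initial velocity
    and zero initial position."""
    velocity = cumulative_trapezoid(accel, dt)
    return cumulative_trapezoid(velocity, dt)


if __name__ == "__main__":
    dt = 0.01                      # assumed 100 Hz sampling
    bias = 0.05                    # assumed constant accelerometer bias, m/s^2
    stationary = [bias] * 1001     # the axis is not moving for 10 s
    drift = double_integrate(stationary, dt)[-1]
    print(f"position error after 10 s: {drift:.3f} m")
```

Because the bias is integrated twice, the position error grows quadratically with time (0.5 * bias * t^2), which is one reason plain double integration struggles on long recordings.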

Updated: 2026-03-02 10:27:22

Categories: cs.AR,cs.LG

Download: http://arxiv.org/abs/2603.01702v1

Towards Principled Dataset Distillation: A Spectral Distribution Perspective

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods exhibit substantial performance degradation on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for distribution discrepancy measure and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, resulting in the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform amplitude-phase decomposition, which adaptively prioritizes the realism in tail classes. On CIFAR-10-LT, with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, with only a 5.7% performance drop when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.

Updated: 2026-03-02 10:26:49

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.01698v1

DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks

Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing where exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism where the number of active experts per token varies based on input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, CIFAR-10 (image classification), and Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that optimal expert schedules are task- and scale-dependent: descending schedules (concentrating capacity in early layers) outperform uniform baselines on image classification. For language modeling, optimal schedules vary by model size, descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks, providing principled guidance for MoE architecture design.
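To make the contrast in the abstract above concrete, the sketch below compares fixed Top-K routing with one possible way to let the number of active experts vary per token. The cumulative-probability threshold rule is an assumption chosen for illustration; the abstract does not state DynaMoE's actual routing rule, and all function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_experts(logits, k):
    """Fixed Top-K routing: always activate exactly k experts."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def threshold_experts(logits, mass):
    """Illustrative dynamic routing (an assumption): activate the fewest
    experts whose router probabilities sum to at least `mass`."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for i in order:
        chosen.append(i)
        covered += probs[i]
        if covered >= mass:
            break
    return chosen


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]   # router is sure: one expert suffices
    uncertain = [0.0, 0.0, 0.0, 0.0]    # router is unsure: more experts fire
    print(top_k_experts(confident, 2), threshold_experts(confident, 0.9))
    print(top_k_experts(uncertain, 2), threshold_experts(uncertain, 0.9))
```

Under this rule a token with a peaked router distribution uses a single expert, while a token with a flat distribution uses all four, whereas Top-K spends the same compute on both.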

Updated: 2026-03-02 10:25:56

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01697v1

The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of algebraic updates that obscure geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models. Source code is available at https://github.com/IST-DASLab/GPTQ-Babai.
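For readers unfamiliar with the algorithm named in the abstract above, here is a minimal textbook sketch of Babai's nearest plane algorithm for the closest vector problem. It works on a small hand-picked basis and does not reproduce the paper's construction of a lattice from the layer's Hessian; the function names and the example basis are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(basis):
    """Gram-Schmidt orthogonalization without normalization."""
    ortho = []
    for b in basis:
        v = list(b)
        for o in ortho:
            mu = dot(b, o) / dot(o, o)
            v = [vi - mu * oi for vi, oi in zip(v, o)]
        ortho.append(v)
    return ortho


def babai_nearest_plane(basis, target):
    """Return integer coefficients and the lattice point they define.
    Walks from the last basis vector to the first, each time rounding the
    residual's coordinate along the matching Gram-Schmidt direction."""
    ortho = gram_schmidt(basis)
    residual = list(target)
    coeffs = [0] * len(basis)
    for i in reversed(range(len(basis))):
        c = round(dot(residual, ortho[i]) / dot(ortho[i], ortho[i]))
        coeffs[i] = c
        residual = [r - c * b for r, b in zip(residual, basis[i])]
    dim = len(target)
    point = [sum(c * b[k] for c, b in zip(coeffs, basis)) for k in range(dim)]
    return coeffs, point


if __name__ == "__main__":
    basis = [[2, 0], [1, 2]]          # rows are basis vectors (assumed example)
    print(babai_nearest_plane(basis, [2.6, 3.7]))
```

The back-to-front loop mirrors the ordering the abstract mentions (last dimension first): each rounding step fixes one coefficient and pushes the leftover error onto the remaining directions.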

Updated: 2026-03-02 10:25:35

Categories: cs.LG

Download: http://arxiv.org/abs/2507.18553v3

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, the LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The code will be released when the paper is accepted.

Updated: 2026-03-02 10:24:41

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.01696v1

Streaming Continual Learning for Unified Adaptive Intelligence in Dynamic Environments

Developing effective predictive models becomes challenging in dynamic environments that continuously produce data and constantly change. Continual Learning (CL) and Streaming Machine Learning (SML) are two research areas that tackle this arduous task. We put forward a unified setting that harnesses the benefits of both CL and SML: their ability to quickly adapt to non-stationary data streams without forgetting previous knowledge. We refer to this setting as Streaming Continual Learning (SCL). SCL does not replace either CL or SML. Instead, it extends the techniques and approaches considered by both fields. We start by briefly describing CL and SML and unifying the languages of the two frameworks. We then present the key features of SCL. We finally highlight the importance of bridging the two communities to advance the field of intelligent systems.

Updated: 2026-03-02 10:24:37

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01695v1

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.

Updated: 2026-03-02 10:24:04

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.01694v1

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce Gome, an MLE agent that operationalizes gradient-based optimization. Gome maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol that isolates architectural effects from external knowledge, Gome achieves a state-of-the-art 35.1% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient-based optimization progressively outperforms, with the gap widening at frontier-tier models. Given the rapid advancement of reasoning-oriented LLMs, this positions gradient-based optimization as an increasingly favorable paradigm. We release our codebase and GPT-5 traces.

Updated: 2026-03-02 10:22:47

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01692v1

Building a Strong Instruction Language Model for a Less-Resourced Language

Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.

Updated: 2026-03-02 10:21:15

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2603.01691v1

Federated Nonlinear System Identification

We consider federated learning of linearly-parameterized nonlinear systems. We establish theoretical guarantees on the effectiveness of federated nonlinear system identification compared to centralized approaches, demonstrating that the convergence rate improves as the number of clients increases. Although the convergence rates in the linear and nonlinear cases differ only by a constant, this constant depends on the feature map $φ$, which can be carefully chosen in the nonlinear setting to increase excitation and improve performance. We experimentally validate our theory in physical settings where client devices are driven by i.i.d. control inputs and control policies exhibiting i.i.d. random perturbations, ensuring non-active exploration. Experiments use trajectories from nonlinear dynamical systems characterized by real-analytic feature functions, including polynomial and trigonometric components, representative of physical systems including pendulum and quadrotor dynamics. We analyze the convergence behavior of the proposed method under varying noise levels and data distributions. Results show that federated learning consistently improves convergence of any individual client as the number of participating clients increases.

Updated: 2026-03-02 10:20:35

Categories: cs.LG,eess.SY

Download: http://arxiv.org/abs/2508.15025v4

NAB: Neural Adaptive Binning for Sparse-View CT reconstruction

Computed Tomography (CT) plays a vital role in inspecting the internal structures of industrial objects. Furthermore, achieving high-quality CT reconstruction from sparse views is essential for reducing production costs. While classic implicit neural networks have shown promising results for sparse reconstruction, they are unable to leverage shape priors of objects. Motivated by the observation that numerous industrial objects exhibit rectangular structures, we propose a novel Neural Adaptive Binning (NAB) method that effectively integrates rectangular priors into the reconstruction process. Specifically, our approach first maps coordinate space into a binned vector space. This mapping relies on an innovative binning mechanism based on differences between shifted hyperbolic tangent functions, with our extension enabling rotations around the input-plane normal vector. The resulting representations are then processed by a neural network to predict CT attenuation coefficients. This design enables end-to-end optimization of the encoding parameters -- including position, size, steepness, and rotation -- via gradient flow from the projection data, thus enhancing reconstruction accuracy. By adjusting the smoothness of the binning function, NAB can generalize to objects with more complex geometries. This research provides a new perspective on integrating shape priors into neural network-based reconstruction. Extensive experiments demonstrate that NAB achieves superior performance on two industrial datasets. It also remains robust on medical datasets when the binning function is extended to a more general expression. The code is available at https://github.com/Wangduo-Xie/NAB_CT_reconstruction.
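The binning idea in the abstract above, a difference between shifted hyperbolic tangent functions, can be sketched in one dimension as follows. The exact parameterization (center, width, steepness, and the 0.5 scaling) is an assumption made for illustration; the paper's formulation, including its rotation extension, may differ.

```python
import math


def soft_bin(x, center, width, steepness):
    """Smooth indicator of the interval [center - width/2, center + width/2],
    built as half the difference of two shifted tanh functions.
    Hypothetical parameterization, not taken from the paper."""
    left = math.tanh(steepness * (x - (center - width / 2)))
    right = math.tanh(steepness * (x - (center + width / 2)))
    return 0.5 * (left - right)


if __name__ == "__main__":
    for steepness in (2.0, 50.0):   # low values give soft edges, high values sharp ones
        row = [round(soft_bin(x / 2, 0.0, 2.0, steepness), 3) for x in range(-6, 7)]
        print(steepness, row)
```

The function is close to 1 inside the bin, close to 0 outside, and exactly 0.5 at each edge when the edges are well separated. Every parameter is differentiable, which is consistent with the abstract's point about optimizing position, size, and steepness by gradient flow.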

Updated: 2026-03-02 10:19:42

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2602.02356v2

Learning Internal Biological Neuron Parameters and Complexity-Based Encoding for Improved Spiking Neural Networks Performance

This study proposes a novel learning paradigm for spiking neural networks (SNNs) that replaces the perceptron-inspired abstraction with biologically grounded neuron models, jointly optimizing synaptic weights and intrinsic neuronal parameters. We evaluate two architectures, leaky integrate-and-fire (LIF) and a meta-neuron model, under fixed and learnable intrinsic dynamics. Additionally, we introduce a biologically inspired classification framework that combines SNN dynamics with Lempel-Ziv complexity (LZC), enabling efficient and interpretable classification of spatiotemporal spike data. Training is conducted using surrogate-gradient backpropagation, spike-timing-dependent plasticity (STDP), and the Tempotron rule on spike trains generated from Poisson processes, widely adopted in computational neuroscience as a standard stochastic model of neuronal spike generation due to their analytical tractability and empirical relevance. Learning intrinsic parameters improves classification accuracy by up to 13.50 percentage points for LIF networks and 8.50 for meta-neuron models compared to baselines tuning only network size and learning rate. The proposed SNN-LZC classifier achieves up to 99.50% accuracy with sub-millisecond inference latency and competitive energy consumption. We further provide theoretical justification by formalizing how optimizing intrinsic dynamics enlarges the hypothesis class and proving descent guarantees for intrinsic-parameter updates under standard smoothness assumptions, linking intrinsic optimization to provable improvements in the surrogate objective.

Updated: 2026-03-02 10:19:23

Categories: cs.NE,cs.AI,q-bio.NC

Download: http://arxiv.org/abs/2508.11674v2

QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions

While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.

Updated: 2026-03-02 10:18:06

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01690v1

Long-Context Generalization with Sparse Attention

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $α$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $α$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $α$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.
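Illustrative sketch (not the authors' code): $α$-entmax with $α=2$ is sparsemax, which has a simple closed-form projection, so it is used here as a stand-in for the general case. The example compares how much attention weight one informative token keeps as the number of distractor tokens grows; the scores (3.0 for the target, 0.0 for distractors) are made up for illustration, and the learnable temperature of ASEntmax is not modeled.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (entmax, alpha=2)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum    # entries that stay non-zero
    k_z = k[support][-1]
    tau = (cumsum[k_z - 1] - 1) / k_z      # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

def scores(num_distractors):
    # One informative token (score 3.0) followed by uninformative ones (score 0.0).
    return np.concatenate(([3.0], np.zeros(num_distractors)))

for n in (10, 1000):
    z = scores(n)
    print(n, softmax(z)[0], sparsemax(z)[0])
```

In this toy setting the softmax weight on the informative token shrinks as distractors are added, while sparsemax assigns them exact zeros and keeps the full weight on the target.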

Updated: 2026-03-02 10:17:17

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.16640v4

Randomized Neural Networks for Partial Differential Equation on Static and Evolving Surfaces

Surface partial differential equations arise in numerous scientific and engineering applications. Their numerical solution on static and evolving surfaces remains challenging due to geometric complexity and, for evolving geometries, the need for repeated mesh updates and geometry or solution transfer. While neural-network-based methods offer mesh-free discretizations, approaches based on nonconvex training can be costly and may fail to deliver high accuracy in practice. In this work, we develop a randomized neural network (RaNN) method for solving PDEs on both static and evolving surfaces: the hidden-layer parameters are randomly generated and kept fixed, and the output-layer coefficients are determined efficiently by solving a least-squares problem. For static surfaces, we present formulations for parametrized surfaces, implicit level-set surfaces, and point-cloud geometries, and provide a corresponding theoretical analysis for the parametrization-based formulation with interface compatibility. For evolving surfaces with topology preserved over time, we introduce a RaNN-based strategy that learns the surface evolution through a flow-map representation and then solves the surface PDE on a space--time collocation set, avoiding remeshing. Extensive numerical experiments demonstrate broad applicability and favorable accuracy--efficiency performance on representative benchmarks.
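Illustrative sketch (not the authors' method): the core idea of a randomized neural network, fixed random hidden-layer parameters with output coefficients obtained from a least-squares solve, can be shown on a flat one-dimensional toy problem. The example solves $-u'' = π^2 \sin(πx)$ on $(0,1)$ with $u(0)=u(1)=0$, whose exact solution is $\sin(πx)$. The feature count, sampling ranges, and boundary weight are assumptions chosen for this example, and no surface geometry is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_colloc = 80, 200
w = rng.uniform(-4.0, 4.0, n_features)    # hidden weights: random, then kept fixed
b = rng.uniform(-4.0, 4.0, n_features)    # hidden biases: random, then kept fixed

def features(x):
    """Hidden-layer outputs tanh(w*x + b), shape (len(x), n_features)."""
    return np.tanh(np.outer(x, w) + b)

def neg_second_derivative(x):
    """-d^2/dx^2 of each feature, using (tanh)'' = -2*tanh*(1 - tanh^2)."""
    t = features(x)
    return 2.0 * t * (1.0 - t**2) * w**2

x = np.linspace(0.0, 1.0, n_colloc)       # collocation points
f = np.pi**2 * np.sin(np.pi * x)          # right-hand side of -u'' = f
boundary = np.array([0.0, 1.0])
bc_weight = 10.0                          # assumed weight on the boundary rows

# Stack PDE rows and boundary rows into one linear least-squares problem.
A = np.vstack([neg_second_derivative(x), bc_weight * features(boundary)])
rhs = np.concatenate([f, np.zeros(2)])
coef, *_ = np.linalg.lstsq(A, rhs, rcond=None)

x_test = np.linspace(0.0, 1.0, 501)
max_error = np.max(np.abs(features(x_test) @ coef - np.sin(np.pi * x_test)))
print("max abs error:", max_error)
```

Only `coef` is solved for, so there is no nonconvex training loop; the cost is a single linear least-squares solve.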

Updated: 2026-03-02 10:17:09

Categories: math.NA,cs.LG

Download: http://arxiv.org/abs/2603.01689v1

Surgical Post-Training: Cutting Errors, Keeping Knowledge

Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
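Illustrative sketch (one plausible reading, not the authors' exact objective): the contrast between DPO's relative ranking and a reward-based binary cross-entropy can be written with the standard DPO implicit reward $r = β(\log π_θ - \log π_{ref})$. The `bce_loss` form below, which pushes the reward of a correct response up and that of an incorrect response down independently, is an assumption made for this example.

```python
import numpy as np

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Standard DPO implicit reward: beta * log(pi_theta / pi_ref)."""
    return beta * (logp_policy - logp_ref)

def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    return np.logaddexp(0.0, -x)

def dpo_loss(r_chosen, r_rejected):
    # Relative ranking: only the reward difference matters.
    return neg_log_sigmoid(r_chosen - r_rejected)

def bce_loss(r_correct, r_incorrect):
    # Assumed decoupled form: each response is classified on its own.
    return neg_log_sigmoid(r_correct) + neg_log_sigmoid(-r_incorrect)

# Shifting both rewards by the same constant leaves DPO unchanged but not BCE.
print(dpo_loss(1.0, 0.0), dpo_loss(5.0, 4.0))
print(bce_loss(1.0, 0.0), bce_loss(5.0, 4.0))
```

The shift invariance of `dpo_loss` is what "relative ranking" means in practice; the binary cross-entropy form instead penalizes an incorrect response whose reward is high in absolute terms.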

Updated: 2026-03-02 10:12:56

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01683v1

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across imaging techniques. To address this limitation, we propose Brain-OF, the first omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.

Updated: 2026-03-02 10:08:49

Categories: cs.LG,cs.AI,eess.SP,q-bio.NC

Download: http://arxiv.org/abs/2602.23410v2

Token-Importance Guided Direct Preference Optimization

Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise and overlook the differential importance of individual tokens. Existing token-level approaches often rely on probability prediction or simplistic weighting schemes to obtain token importance, which still cannot fully address these issues. To solve this problem, we propose the Token-Importance Guided Direct Preference Optimization (TI-DPO), a framework that achieves fine-grained semantic control through two synergistic innovations. First, we propose a novel hybrid weighting mechanism that combines gradient attribution with a Gaussian prior, ensuring both the accuracy and robustness of token importance scores. Second, we employ a triplet loss to provide structured guidance for the optimization, explicitly guiding model outputs to approach preferred responses and diverge from non-preferred ones. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.

Updated: 2026-03-02 10:07:43

Categories: cs.AI

Download: http://arxiv.org/abs/2505.19653v3

A Practical Guide to Streaming Continual Learning

Continual Learning (CL) and Streaming Machine Learning (SML) study the ability of agents to learn from a stream of non-stationary data. Despite sharing some similarities, they address different and complementary challenges. While SML focuses on rapid adaptation after changes (concept drifts), CL aims to retain past knowledge when learning new tasks. After a brief introduction to CL and SML, we discuss Streaming Continual Learning (SCL), an emerging paradigm providing a unifying solution to real-world problems, which may require both SML and CL abilities. We claim that SCL can i) connect the CL and SML communities, motivating their work towards the same goal, and ii) foster the design of hybrid approaches that can quickly adapt to new information (as in SML) without forgetting previous knowledge (as in CL). We conclude the paper with a motivating example and a set of experiments, highlighting the need for SCL by showing how CL and SML alone struggle in achieving rapid adaptation and knowledge retention.

Updated: 2026-03-02 10:06:34

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01677v1

Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this gap, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. Evaluated against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations, Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.
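Illustrative sketch (not the Arbor implementation): the edge-list representation and per-node retrieval described above can be mimicked with a small dictionary index. The tree contents, the function names, and the `keyword_chooser` stand-in for the dedicated LLM call are all hypothetical; the toy tree is invented and carries no clinical meaning.

```python
from collections import defaultdict

# Hypothetical toy tree as an edge list: (source node, condition, target node).
EDGES = [
    ("start", "severe", "ask_duration"),
    ("start", "mild", "routine"),
    ("ask_duration", "over an hour", "escalate"),
    ("ask_duration", "under an hour", "routine"),
]

def build_index(edges):
    """Group edges by source node so one node's edges can be fetched alone."""
    index = defaultdict(list)
    for source, condition, target in edges:
        index[source].append((condition, target))
    return dict(index)

def keyword_chooser(user_turn, conditions):
    """Stand-in for the LLM transition evaluator: plain substring matching."""
    for i, condition in enumerate(conditions):
        if condition in user_turn.lower():
            return i
    return None                          # no valid transition found

def step(index, node, user_turn, choose):
    outgoing = index.get(node, [])       # only the current node's edges are retrieved
    if not outgoing:
        return node                      # leaf node: nothing to transition to
    picked = choose(user_turn, [cond for cond, _ in outgoing])
    return node if picked is None else outgoing[picked][1]

INDEX = build_index(EDGES)
node = step(INDEX, "start", "The pain is severe", keyword_chooser)
node = step(INDEX, node, "It has lasted over an hour", keyword_chooser)
print(node)
```

The evaluator only ever sees the conditions on the current node's outgoing edges, so the amount of context per call stays constant regardless of how large the whole tree is.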

Updated: 2026-03-02 10:06:07

Categories: cs.AI

Download: http://arxiv.org/abs/2602.14643v3

Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding

Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
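Illustrative sketch (the token-wise baseline, not HSD itself): the standard lossless token-wise verification rule accepts a draft token $x \sim q$ with probability $\min(1, p(x)/q(x))$ and otherwise resamples from the normalized residual $\max(p - q, 0)$. The target and draft distributions below are made up for illustration; the hierarchical, sequence-level scheme of the paper is not reproduced here.

```python
import numpy as np

def verify_token(p, q, draft_token, rng):
    """Token-wise speculative verification; returns (token, accepted)."""
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    residual = np.maximum(p - q, 0.0)        # mass the draft model under-covers
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

def expected_acceptance(p, q):
    """Probability that a draft token is accepted: sum of min(p, q)."""
    return float(np.minimum(p, q).sum())

p = np.array([0.6, 0.3, 0.1])                # target model distribution (made up)
q = np.array([0.3, 0.3, 0.4])                # draft model distribution (made up)
rng = np.random.default_rng(0)

n_trials = 20000
counts, accepted = np.zeros(3), 0
for _ in range(n_trials):
    draft = rng.choice(3, p=q)
    token, ok = verify_token(p, q, draft, rng)
    counts[token] += 1
    accepted += ok

empirical = counts / n_trials
acceptance_rate = accepted / n_trials
print(empirical, acceptance_rate, expected_acceptance(p, q))
```

The empirical output distribution should track `p` (the "lossless" property), while the acceptance rate is bounded by the overlap between `p` and `q`; raising the expected number of accepted tokens without breaking the first property is the problem the abstract addresses.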

Updated: 2026-03-02 10:05:32

Categories: cs.AI

Download: http://arxiv.org/abs/2601.05724v2

Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs

Multi-task Vehicle Routing Problems (VRPs) aim to minimize routing costs while satisfying diverse constraints. Existing solvers typically adopt a unified reinforcement learning (RL) framework to learn generalizable patterns across tasks. However, they often overlook the constraint and node dynamics during the decision process, making the model fail to accurately react to the current context. To address this limitation, we propose Chain-of-Context Learning (CCL), a novel framework that progressively captures the evolving context to guide fine-grained node adaptation. Specifically, CCL constructs step-wise contextual information via a Relevance-Guided Context Reformulation (RGCR) module, which adaptively prioritizes salient constraints. This context then guides node updates through a Trajectory-Shared Node Re-embedding (TSNR) module, which aggregates shared node features from all trajectories' contexts and uses them to update inputs for the next step. By modeling evolving preferences of the RL agent, CCL captures step-by-step dependencies in sequential decision-making. We evaluate CCL on 48 diverse VRP variants, including 16 in-distribution and 32 out-of-distribution (with unseen constraints) tasks. Experimental results show that CCL performs favorably against the state-of-the-art baselines, achieving the best performance on all in-distribution tasks and the majority of out-of-distribution tasks.

Updated: 2026-03-02 09:57:15

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01667v1

Optimal information injection and transfer mechanisms for active matter reservoir computing

Reservoir computing (RC) is a state-of-the-art machine learning method that makes use of the power of dynamical systems (the reservoir) for real-time inference. When using biological complex systems as reservoir substrates, it serves as a testbed for basic questions about bio-inspired computation -- of how self-organization generates proper spatiotemporal patterning. Here, we use a simulation of an active matter system, driven by a chaotically moving input signal, as a reservoir. So far, it has been unclear whether such complex systems possess the capacity to process information efficiently and independently of the method by which it was introduced. We find that when switching from a repulsive to an attractive driving force, the system completely changes the way it computes, while the predictive performance landscapes remain nearly identical. The nonlinearity of the driver's injection force improves computation by decoupling the single-agent dynamics from that of the driver. Triggered are the (re-)growth, deformation, and active motion of smooth structural boundaries (interfaces), and the emergence of coherent gradients in speed -- features found in many soft materials and biological systems. The nonlinear driving force activates emergent regulatory mechanisms, which manifest enhanced morphological and dynamic diversity -- arguably improving fading memory, nonlinearity, expressivity, and thus, performance. We further perform RC in a broad variety of non-equilibrium active matter phases that arise when tuning internal (repulsive) forces for information transfer. Overall, we find that active matter agents forming liquid droplets are particularly well suited for RC. The consistently convex shape of the predictive performance landscapes, together with the observed phenomenological richness, conveys robustness and adaptivity.

Updated: 2026-03-02 09:55:31

Categories: nlin.AO,cond-mat.soft,cs.LG,physics.comp-ph

Download: http://arxiv.org/abs/2509.01799v2

Data Selection for LLM Alignment Using Fine-Grained Preferences

Large language model (LLM) alignment aims to ensure that the behavior of LLMs meets human preferences. While collecting data from multiple fine-grained, aspect-specific preferences is becoming increasingly feasible, existing alignment methods typically work on a single preference and thus struggle with conflicts inherent in such aggregated datasets. As an early attempt, we propose a data-centric approach to align LLMs through the effective use of fine-grained preferences. Specifically, we formulate the problem as a direct fine-grained preference optimization and introduce preference divergence (PD), which quantifies inter-aspect preference conflicts. Instead of directly tackling the resulting complicated optimization, we recast it as a data selection problem and propose a simple yet effective strategy, which identifies a subset of data corresponding to the most negative PD values, for efficient training. We theoretically analyze the loss-bound optimality of our selection strategy and conduct extensive empirical studies on varied settings and datasets to demonstrate that our practical selection method can achieve consistent improvement over standard full-data alignment, using even just 30% of the data. Our work suggests that LLM alignment using fine-grained preferences is highly feasible.

Updated: 2026-03-02 09:51:24

Categories: cs.LG

Download: http://arxiv.org/abs/2508.07638v2

Gaming and Cooperation in Federated Learning: What Can Happen and How to Monitor It

The success of federated learning (FL) ultimately depends on how strategic participants behave under partial observability, yet most formulations still treat FL as a static optimization problem. We instead view FL deployments as governed strategic systems and develop an analytical framework that separates welfare-improving behavior from metric gaming. Within this framework, we introduce indices that quantify manipulability, the price of gaming, and the price of cooperation, and we use them to study how rules, information disclosure, evaluation metrics, and aggregator-switching policies reshape incentives and cooperation patterns. We derive threshold conditions for deterring harmful gaming while preserving benign cooperation, and for triggering auto-switch rules when early-warning indicators become critical. Building on these results, we construct a design toolkit including a governance checklist and a simple audit-budget allocation algorithm with a provable performance guarantee. Simulations across diverse stylized environments and a federated learning case study consistently match the qualitative and quantitative patterns predicted by our framework. Taken together, our results provide design principles and operational guidelines for reducing metric gaming while sustaining stable, high-welfare cooperation in FL platforms.

Updated: 2026-03-02 09:48:26

Categories: cs.LG,cs.GT,stat.ML

Download: http://arxiv.org/abs/2509.02391v3

Relaxed Triangle Inequality for Kullback-Leibler Divergence Between Multivariate Gaussian Distributions

The Kullback-Leibler (KL) divergence is not a proper distance metric and does not satisfy the triangle inequality, posing theoretical challenges in certain practical applications. Existing work has demonstrated that KL divergence between multivariate Gaussian distributions follows a relaxed triangle inequality. Given any three multivariate Gaussian distributions $\mathcal{N}_1, \mathcal{N}_2$, and $\mathcal{N}_3$, if $KL(\mathcal{N}_1, \mathcal{N}_2)\leq ε_1$ and $KL(\mathcal{N}_2, \mathcal{N}_3)\leq ε_2$, then $KL(\mathcal{N}_1, \mathcal{N}_3)< 3ε_1+3ε_2+2\sqrt{ε_1ε_2}+o(ε_1)+o(ε_2)$. However, the supremum of $KL(\mathcal{N}_1, \mathcal{N}_3)$ is still unknown. In this paper, we investigate the relaxed triangle inequality for the KL divergence between multivariate Gaussian distributions and give the supremum of $KL(\mathcal{N}_1, \mathcal{N}_3)$ as well as the conditions when the supremum can be attained. When $ε_1$ and $ε_2$ are small, the supremum is $ε_1+ε_2+2\sqrt{ε_1ε_2}+o(ε_1)+o(ε_2)$. Finally, we demonstrate several applications of our results in out-of-distribution detection with flow-based generative models and safe reinforcement learning.
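Illustrative sketch (a numerical check, not a proof): the leading-order expression $ε_1+ε_2+2\sqrt{ε_1ε_2}$ can be reproduced with the closed-form KL divergence between Gaussians. The configuration below, three Gaussians sharing an identity covariance with means shifted along one direction, is chosen for this example; whether it is the attaining configuration in the paper is not stated in the abstract.

```python
import numpy as np

def kl_gaussians(mu1, cov1, mu2, cov2):
    """Closed-form KL(N(mu1, cov1) || N(mu2, cov2)) for multivariate Gaussians."""
    k = mu1.shape[0]
    diff = mu2 - mu1
    trace_term = np.trace(np.linalg.solve(cov2, cov1))
    maha_term = diff @ np.linalg.solve(cov2, diff)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.5 * (trace_term + maha_term - k + logdet2 - logdet1)

eps1, eps2 = 0.02, 0.05
cov = np.eye(2)                                   # shared covariance (assumption)
direction = np.array([1.0, 0.0])
mu1 = np.zeros(2)
mu2 = mu1 + np.sqrt(2 * eps1) * direction         # gives KL(N1, N2) = eps1
mu3 = mu2 + np.sqrt(2 * eps2) * direction         # gives KL(N2, N3) = eps2

kl12 = kl_gaussians(mu1, cov, mu2, cov)
kl23 = kl_gaussians(mu2, cov, mu3, cov)
kl13 = kl_gaussians(mu1, cov, mu3, cov)
print(kl12, kl23, kl13, eps1 + eps2 + 2 * np.sqrt(eps1 * eps2))
```

With a shared covariance the divergence reduces to half the squared Mahalanobis distance between the means, so collinear mean shifts give $KL(\mathcal{N}_1, \mathcal{N}_3) = (\sqrt{ε_1}+\sqrt{ε_2})^2$ exactly, matching the leading-order terms quoted above.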

Updated: 2026-03-02 09:43:45

Categories: stat.ML,cs.IT,cs.LG

Download: http://arxiv.org/abs/2602.02577v2

FreeGNN: Continual Source-Free Graph Neural Network Adaptation for Renewable Energy Forecasting

Accurate forecasting of renewable energy generation is essential for efficient grid management and sustainable power planning. However, traditional supervised models often require access to labeled data from the target site, which may be unavailable due to privacy, cost, or logistical constraints. In this work, we propose FreeGNN, a Continual Source-Free Graph Domain Adaptation framework that enables adaptive forecasting on unseen renewable energy sites without requiring source data or target labels. Our approach integrates a spatio-temporal Graph Neural Network (GNN) backbone with a teacher-student strategy, a memory replay mechanism to mitigate catastrophic forgetting, graph-based regularization to preserve spatial correlations, and a drift-aware weighting scheme to dynamically adjust adaptation strength during streaming updates. This combination allows the model to continuously adapt to non-stationary environmental conditions while maintaining robustness and stability. We conduct extensive experiments on three real-world datasets: GEFCom2012, Solar PV, and Wind SCADA, encompassing multiple sites, temporal resolutions, and meteorological features. The ablation study confirms that each component (memory replay, graph regularization, drift-aware adaptation, and the teacher-student strategy) contributes significantly to overall performance. The experiments show that FreeGNN achieves an MAE of 5.237 and an RMSE of 7.123 on the GEFCom dataset, an MAE of 1.107 and an RMSE of 1.512 on the Solar PV dataset, and an MAE of 0.382 and an RMSE of 0.523 on the Wind SCADA dataset. These results demonstrate its ability to achieve accurate and robust forecasts in a source-free, continual learning setting, highlighting its potential for real-world deployment in adaptive renewable energy systems. For reproducibility, implementation details are available at: https://github.com/AraoufBh/FreeGNN.

Updated: 2026-03-02 09:43:11

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01657v1

Intrinsic Entropy of Context Length Scaling in LLMs

Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context could harm performance, while others experimentally summarize the loss reduction from relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose to use 'Intrinsic Entropy' for explaining the impact of context length on language modeling; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases. We hope our work may inspire new long context Language Models, as well as future work studying the physics of Language Models.

Updated: 2026-03-02 09:40:20

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2502.01481v4

Quantum Annealing for Staff Scheduling in Educational Environments

We address a novel staff allocation problem that arises in the organization of collaborators among multiple school sites and educational levels. The problem emerges from a real case study in a public school in Calabria, Italy, where staff members must be distributed across kindergartens, primary, and secondary schools under constraints of availability, competencies, and fairness. To tackle this problem, we develop an optimization model and investigate a solution approach based on quantum annealing. Our computational experiments on real-world data show that quantum annealing is capable of producing balanced assignments in short runtimes. These results provide evidence of the practical applicability of quantum optimization methods in educational scheduling and, more broadly, in complex resource allocation tasks.

Updated: 2026-03-02 09:38:17

Categories: cs.ET,cs.AI

Download: http://arxiv.org/abs/2510.12278v2

Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling

Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the power of the interaction order. This bottleneck limits its use in large-scale or real-time applications, forcing traditional tools to rely on heuristics to reduce the number of path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a comprehensive machine-learning-assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying such generative models to this domain presents significant challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high-order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key architectural components. First, we implement an experience replay buffer to capture and retain rare valid paths. Second, we adopt a uniform exploratory policy to improve generalization and prevent the model from overfitting to simple geometries. Third, we apply a physics-based action masking strategy that filters out physically impossible paths before the model even considers them. As demonstrated in our experimental validation, the proposed model achieves substantial speedups over exhaustive search -- up to $10\times$ faster on GPU and $1000\times$ faster on CPU -- while maintaining high coverage accuracy and successfully uncovering complex propagation paths. The complete source code, tests, and tutorial are available at https://github.com/jeertmans/sampling-paths.
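Illustrative sketch (not taken from the linked repository): the growth in candidate paths and the effect of masking can be shown by brute-force enumeration. A candidate path is modeled as a sequence of object indices, one per interaction. The masking rule used here, no two consecutive interactions with the same object, is an assumption standing in for the paper's physics-based mask, whose exact rules are not given in the abstract.

```python
from itertools import product

def candidate_paths(num_objects, order):
    """All sequences of `order` object indices: num_objects ** order candidates."""
    return product(range(num_objects), repeat=order)

def no_immediate_repeat(path):
    """Assumed mask: a path may not hit the same object twice in a row."""
    return all(a != b for a, b in zip(path, path[1:]))

def count_paths(num_objects, order):
    total = masked = 0
    for path in candidate_paths(num_objects, order):
        total += 1
        masked += no_immediate_repeat(path)
    return total, masked

for order in (1, 2, 3, 4):
    print(order, count_paths(5, order))
```

Even this simple mask only reduces the count from $N^K$ to $N(N-1)^{K-1}$, which is still exponential in the interaction order $K$ and illustrates why sampling is used in place of exhaustive enumeration.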

Updated: 2026-03-02 09:37:34

Categories: cs.LG,eess.SP

Download: http://arxiv.org/abs/2603.01655v1
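
The scaling claim in the ray path sampling abstract above can be made concrete with a little arithmetic. The sketch below counts candidate paths as the number of objects raised to the interaction order, which is the bound the abstract describes. The function name and the example scene size are illustrative assumptions and are not taken from the paper or its repository.

```python
def candidate_paths(num_objects: int, order: int) -> int:
    """Upper bound on ordered candidate paths with `order` interactions.

    Each of the `order` interactions may hit any of `num_objects` objects,
    so the count is num_objects ** order (hypothetical helper).
    """
    if num_objects < 0 or order < 0:
        raise ValueError("num_objects and order must be non-negative")
    return num_objects ** order


if __name__ == "__main__":
    # Assumed scene with 100 objects; orders 1 to 4.
    for order in range(1, 5):
        print(order, candidate_paths(100, order))
```

Even for this modest assumed scene, going from order 2 to order 4 multiplies the candidate count by ten thousand, which is why exhaustive search becomes impractical.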

CeProAgents: A Hierarchical Agents System for Automated Chemical Process Development

The development of chemical processes, a cornerstone of chemical engineering, presents formidable challenges due to its multi-faceted nature, integrating specialized knowledge, conceptual design, and parametric simulation. To address these challenges, we propose CeProAgents, a hierarchical multi-agent system designed to automate the development of chemical processes through collaborative division of labor. Our architecture comprises three specialized agent cohorts focused on knowledge, concept, and parameter, respectively. To effectively adapt to the inherent complexity of chemical tasks, each cohort employs a novel hybrid architecture that integrates dynamic agent chatgroups with structured agentic workflows. To rigorously evaluate the system, we establish CeProBench, a multi-dimensional benchmark structured around three core pillars of chemical engineering. We design six distinct types of tasks across these dimensions to holistically assess the comprehensive capabilities of the system in chemical process development. The results not only confirm the effectiveness and superiority of our proposed approach but also reveal the transformative potential as well as the current boundaries of Large Language Models (LLMs) for industrial chemical engineering.

Updated: 2026-03-02 09:37:18

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01654v1

Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

We propose Collab-REC, a multi-agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM-based agents (Personalization, Popularity, and Sustainability) generate city suggestions from complementary perspectives. A non-LLM moderator then merges and refines these proposals via multi-round negotiation, ensuring each agent's viewpoint is incorporated while penalizing spurious or repeated responses. Extensive experiments on European city queries using LLMs of different sizes and model families demonstrate that Collab-REC enhances diversity and overall relevance compared to a single-agent baseline, surfacing lesser-visited locales that are often overlooked. This balanced, context-aware approach addresses over-tourism and better aligns with user-provided constraints, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems. Code, data, and other artifacts are available here: https://github.com/ashmibanerjee/collab-rec, while the prompts used are included in the appendix.

Updated: 2026-03-02 09:31:33

Categories: cs.AI

Download: http://arxiv.org/abs/2508.15030v4

LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence

Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. LexChronos employs a dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, while a pre-trained feedback agent scores and refines them through a confidence-driven loop. To address the scarcity of Indian legal event datasets, we construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. Our pipeline achieves a BERT-based F1 score of 0.8751 against this synthetic ground truth. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence. This work lays a foundation for future legal AI applications in the Indian context, such as precedent mapping, argument synthesis, and predictive judgment modelling, by harnessing structured representations of legal events.

Updated: 2026-03-02 09:31:05

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01651v1

Depth-Structured Music Recurrence: Budgeted Recurrent Attention for Full-Piece Symbolic Music Modeling

Long-context modeling is essential for symbolic music generation, since motif repetition and developmental variation can span thousands of musical events, yet practical workflows frequently rely on resource-limited hardware. We introduce Depth-Structured Music Recurrence (DSMR), a training-time design that learns from complete compositions end to end by streaming each piece left-to-right with stateful recurrent attention and distributing layer-wise memory horizons under a fixed recurrent-state budget. Our main instantiation, two-scale DSMR, assigns long history windows to lower layers and a uniform short window to the remaining layers. On the MAESTRO piano performance dataset, two-scale DSMR matches a full-memory recurrent reference in perplexity (5.96 vs. 5.98) while using approximately 59% less GPU memory and achieving roughly 36% higher throughput. Variant analyses further show strong layer substitutability under binary-horizon schedules: performance depends primarily on total allocated memory rather than which layers carry it.

Updated: 2026-03-02 09:26:14

Categories: cs.SD,cs.AI,cs.LG

Download: http://arxiv.org/abs/2602.19816v2
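
The two-scale schedule described in the DSMR abstract above (long history windows for lower layers, one uniform short window for the rest) can be sketched as a small allocation helper. The layer count, window sizes, and function names below are hypothetical values chosen for illustration; the abstract does not state them, and this is not the authors' implementation.

```python
def two_scale_horizons(num_layers: int, num_long_layers: int,
                       long_window: int, short_window: int) -> list[int]:
    """Per-layer memory horizons: lower layers long, remaining layers short."""
    if not 0 <= num_long_layers <= num_layers:
        raise ValueError("num_long_layers must be between 0 and num_layers")
    return ([long_window] * num_long_layers
            + [short_window] * (num_layers - num_long_layers))


def total_state(horizons: list[int]) -> int:
    """Total recurrent-state budget consumed by a schedule (sum of horizons)."""
    return sum(horizons)


if __name__ == "__main__":
    # Assumed example: 4 layers, 1 long layer, windows of 1024 and 128 events.
    schedule = two_scale_horizons(4, 1, 1024, 128)
    print(schedule, total_state(schedule))
```

Comparing `total_state` across schedules is one simple way to check that two variants spend the same budget, which is the quantity the abstract says performance primarily depends on.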

Sketch of a novel approach to a neural model

In this position paper, we present biological detail about neuroplasticity with respect to cell-internal processing pathways and their relation to membrane and synaptic plasticity. We believe that traditional synapse-centric, weight-based models of memorization are not sufficient or adequate to capture the real complexity of neuroplasticity. In standard accounts, a neuronal network consists of a network of neurons connected by adaptive transmission links. The adaptation of these transmission links is overly simplified in the standard model of short-term and long-term potentiation or depression assuming weight adaptation according to use. We propose a paradigm switch from a synapse-centric model (each synapse learns independently, based on associative coupling) to a neuron-centric model (each neuron uses its intracellular pathways to express plasticity at its synapses and dendritic membrane). Each neuron has a 'vertical' dimension where internal parameters steer the external membrane- and synapse-expressed parameters. A neural model consists of (a) expression of parameters at the membrane, in particular dendritic synapses or spines, and axonal boutons (b) internal parameters in the sub-membrane zone and the cytoplasm with its protein signaling network and (c) core parameters in the nucleus for genetic and epigenetic information. In a neuron-centric model, each neuron in the horizontal network has its own internal memory. Transmission and memory are separate, not linked by strict use-dependence. There is filtering and selection of signals for processing and storage. Not every transmission event leaves a trace. This is a conceptual advance over synaptic weight models. The neuron is a self-programming device, rather than a transfer function determined by input. A new approach to neural modeling is better able to capture experimental evidence than synapse-centric models.

Updated: 2026-03-02 09:23:53

Categories: q-bio.NC,cond-mat.dis-nn,cs.AI,cs.NE,q-bio.MN

Download: http://arxiv.org/abs/2209.06865v7

SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

Large language models (LLMs) are increasingly tested for a "Theory of Mind" (ToM) - the ability to attribute mental states to oneself and others. Yet most evaluations stop at explicit belief attribution in classical toy stories or stylized tasks, leaving open the questions of whether LLMs can implicitly apply such knowledge to predict human behavior, or to judge an observed behavior, in diverse scenarios. We introduce SimpleToM, a benchmark that advances ToM evaluation along two novel axes. First, it probes multiple levels of ToM reasoning, from mental state inference (explicit ToM) to behavior prediction and judgment (applied ToM). Second, it situates these tasks in diverse, everyday scenarios - such as supermarkets, hospitals, schools, and offices - where information asymmetries naturally arise (e.g., hidden defects in grocery store items, incomplete information in provider-patient interactions, or restricted access to locked devices). SimpleToM contains concise stories (e.g., "The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier."), each with three questions that test different degrees of ToM reasoning, asking models to predict: (a) mental states ("Is Mary aware of the mold?"), (b) behaviors ("Will Mary pay for the chips or report the mold?"), and (c) judgments ("Mary paid for the chips. Was that reasonable?"). Experiments reveal a striking gap: state-of-the-art models often reliably infer mental state (a), but fail at applying knowledge about the mental state for secondary predictions, with performance dropping sharply for behavior prediction (b) and further for behavior judgment (c). This exposes a critical fragility in LLMs' social reasoning in terms of what they know (explicit ToM) versus how well they can implicitly apply that knowledge for predictions (applied ToM).

Updated: 2026-03-02 09:18:29

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2410.13648v2

Learning Structured Reasoning via Tractable Trajectory Control

Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., "wait," indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision-language models on mathematical reasoning tasks.

Updated: 2026-03-02 09:18:19

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01641v1
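
The Ctrl-R abstract above mentions a power-scaling factor on the importance-sampling weights. One plausible reading, sketched below, is an exponent applied to the usual ratio of target to behavior probabilities. The exponent form, the name `alpha`, and the function name are assumptions made for illustration; the abstract does not give the exact formula.

```python
import math


def scaled_importance_weight(logp_target: float, logp_behavior: float,
                             alpha: float) -> float:
    """Return (pi_target / pi_behavior) ** alpha, computed in log space.

    alpha = 1 gives the standard importance weight; alpha = 0 gives 1.0
    (the weight is ignored). Intermediate values damp extreme ratios.
    """
    return math.exp(alpha * (logp_target - logp_behavior))


if __name__ == "__main__":
    # Assumed trajectory: target probability 0.1, behavior probability 0.4.
    lt, lb = math.log(0.1), math.log(0.4)
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, scaled_importance_weight(lt, lb, alpha))
```

Working with log-probabilities avoids underflow when trajectory probabilities are products of many small per-token terms.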

PAPN: Proximity Attention Encoder and Pointer Network Decoder for Parcel Pickup Route Prediction

Optimization of the last-mile delivery and first-mile pickup of parcels is integral to the logistics optimization pipeline as it entails both cost and resource efficiency and a heightened service quality. Such optimization requires accurate route and time prediction systems to adapt to different scenarios in advance. This work tackles the first building block, namely route prediction. The novel Proximity Attention (PA) mechanism is coupled to a Pointer Network (PN) decoder to leverage the underlying connections between the different visitable pickup positions at each timestep of the parcel pickup process. This local attention is coupled with global context computing via a multi-head attention transformer encoder. Both attentions are then mixed for complete and comprehensive modeling of the problem. PA is also used in the decoding process to skew predictions towards the locations with the highest visit likelihood, thus using inter-connectivity of nodes for next-location prediction. This method is trained, validated and tested on a large industry-level dataset of real-world, last-mile delivery and first-mile pickup named LaDE (2024). This approach outperforms all state-of-the-art supervised methods in terms of most metrics used for benchmarking on this dataset while still being competitive with the best-performing reinforcement learning framework named DRL4Route (2023).

Updated: 2026-03-02 09:17:32

Categories: cs.LG

Download: http://arxiv.org/abs/2505.03776v3

Hard-constraint physics-residual networks enable robust extrapolation for hydrogen crossover prediction in PEM water electrolyzers

Hydrogen crossover in polymer electrolyte membrane water electrolysis poses a critical safety and efficiency bottleneck for scalable green hydrogen production. While machine learning offers real-time monitoring capabilities, conventional data-driven neural networks (Pure NNs) and soft-constraint physics-informed neural networks (Standard PINNs) suffer from inherent optimization conflicts and fail catastrophically when extrapolating beyond sparse training conditions. Here, we present a hard-constraint physics-residual network (PR-Net) that embeds analytical transport equations -- Henry's law, Fick's diffusion, and Faraday's law -- as a deterministic computational backbone, restricting the neural network to learn only systematic physical deviations. Across 184 experimental points spanning six membrane types and operating conditions of 25--85$^{\circ}$C, 1--200~bar, and 0.05--5.0 A cm$^{-2}$, this architecture intrinsically resolves gradient conflicts, yielding $R^{2} = 99.57 \pm 0.16\%$ with a 39-fold reduction in training variance compared to purely data-driven models ($R^{2} = 96.47 \pm 6.20\%$). Crucially, the PR-Net breaks the extrapolation barrier, maintaining $R^{2} > 97\%$ at extreme cathode pressures up to 200~bar -- a 2.5-fold extrapolation beyond the training domain where Standard PINN severely degrades ($R^{2} = 72.2\%$) and Pure NN collapses ($R^{2} = 58.7\%$). Furthermore, the learned residuals autonomously capture temperature-induced membrane swelling (Spearman's $\rho = 0.506$, $p < 0.001$) and identify the non-linear transport regime transition near 0.23 A cm$^{-2}$, without explicit programming. Delivering millisecond-level inference on edge hardware, the PR-Net establishes a highly reliable, generalizable foundation for adaptive safety control and predictive maintenance in high-pressure electrochemical energy systems.

Updated: 2026-03-02 09:15:05

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.05879v4
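
Two of the headline figures in the PR-Net abstract above can be cross-checked with simple arithmetic. The sketch below assumes that the "39-fold" figure compares the two reported spreads on $R^{2}$ and that the "2.5-fold extrapolation" is measured against the upper pressure limit of the training domain. Both readings are assumptions, since the abstract does not spell out either definition, and the variable names are illustrative.

```python
# Figures quoted in the abstract above.
pure_nn_spread = 6.20          # +/- on R^2 (%) for the purely data-driven model
pr_net_spread = 0.16           # +/- on R^2 (%) for PR-Net
max_test_pressure_bar = 200.0  # highest cathode pressure evaluated
extrapolation_factor = 2.5     # stated extrapolation beyond the training domain

# Ratio of the two reported spreads (assumed basis of the "39-fold" claim).
spread_ratio = pure_nn_spread / pr_net_spread

# Training-domain pressure limit implied by the stated extrapolation factor.
implied_training_limit_bar = max_test_pressure_bar / extrapolation_factor

print(f"spread ratio: {spread_ratio:.2f}")
print(f"implied training pressure limit: {implied_training_limit_bar:.0f} bar")
```

Under these assumptions the spread ratio comes out at about 38.75, in line with the quoted 39-fold figure, and the implied training limit is 80 bar.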

Optimal Stopping in Latent Diffusion Models

We identify and analyze a surprising phenomenon of Latent Diffusion Models (LDMs) where the final steps of the diffusion can degrade sample quality. In contrast to conventional arguments that justify early stopping for numerical stability, this phenomenon is intrinsic to the dimensionality reduction in LDMs. We provide a principled explanation by analyzing the interaction between latent dimension and stopping time. Under a Gaussian framework with linear autoencoders, we characterize the conditions under which early stopping is needed to minimize the distance between generated and target distributions. More precisely, we show that lower-dimensional representations benefit from earlier termination, whereas higher-dimensional latent spaces require later stopping time. We further establish that the latent dimension interplays with other hyperparameters of the problem such as constraints in the parameters of score matching. Experiments on synthetic and real datasets illustrate these properties, underlining that early stopping can improve generative quality. Together, our results offer a theoretical foundation for understanding how the latent dimension influences the sample quality, and highlight stopping time as a key hyperparameter in LDMs.

Updated: 2026-03-02 09:13:47

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.08409v2

MoMa: A Modular Deep Learning Framework for Material Property Prediction

Deep learning methods for material property prediction have been widely explored to advance materials discovery. However, the prevailing pre-train then fine-tune paradigm often fails to address the inherent diversity and disparity of material tasks. To overcome these challenges, we introduce MoMa, a Modular framework for Materials that first trains specialized modules across a wide range of tasks and then adaptively composes synergistic modules tailored to each downstream scenario. Evaluation across 17 datasets demonstrates the superiority of MoMa, with a substantial 14% average improvement over the strongest baseline. Few-shot and continual learning experiments further highlight MoMa's potential for real-world applications. Pioneering a new paradigm of modular material learning, MoMa will be open-sourced to foster broader community collaboration.

Updated: 2026-03-02 09:09:16

Categories: cs.LG,cond-mat.mtrl-sci

Download: http://arxiv.org/abs/2502.15483v3

State Your Intention to Steer Your Attention: An AI Assistant for Intentional Digital Living

When working on digital devices, people often face distractions that can lead to a decline in productivity and efficiency, as well as negative psychological and emotional impacts. To address this challenge, we introduce a novel Artificial Intelligence (AI) assistant that elicits a user's intention, assesses whether ongoing activities are in line with that intention, and provides gentle nudges when deviations occur. The system leverages a large language model to analyze screenshots, application titles, and URLs, issuing notifications when behavior diverges from the stated goal. Its detection accuracy is refined through initial clarification dialogues and continuous user feedback. In a three-week, within-subjects field deployment with 22 participants, we compared our assistant to both a rule-based intent reminder system and a passive baseline that only logged activity. Results indicate that our AI assistant effectively supports users in maintaining focus and aligning their digital behavior with their intentions. Our source code is publicly available at https://intentassistant.github.io

Updated: 2026-03-02 09:08:04

Categories: cs.HC,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14513v3

Hierarchical Multi-Scale Graph Learning with Knowledge-Guided Attention for Whole-Slide Image Survival Analysis

We propose a Hierarchical Multi-scale Knowledge-aware Graph Network (HMKGN) that models multi-scale interactions and spatially hierarchical relationships within whole-slide images (WSIs) for cancer prognostication. Unlike conventional attention-based MIL, which ignores spatial organization, or graph-based MIL, which relies on static handcrafted graphs, HMKGN enforces a hierarchical structure with spatial locality constraints, wherein local cellular-level dynamic graphs aggregate spatially proximate patches within each region of interest (ROI) and a global slide-level dynamic graph integrates ROI-level features into WSI-level representations. Moreover, multi-scale integration at the ROI level combines coarse contextual features from broader views with fine-grained structural representations from local patch-graph aggregation. We evaluate HMKGN on four TCGA cohorts (KIRC, LGG, PAAD, and STAD; N=513, 487, 138, and 370) for survival prediction. It consistently outperforms existing MIL-based models, yielding improved concordance indices (10.85% better) and statistically significant stratification of patient survival risk (log-rank p < 0.05).

Updated: 2026-03-02 09:07:37

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2602.23557v2
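
The HMKGN abstract above reports its results as concordance indices. For readers unfamiliar with the metric, the sketch below computes a basic Harrell-style concordance index in plain Python. The abstract does not say which estimator the authors use, so this is a generic illustration with made-up data, and tied survival times are simply skipped.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] is truthy) and a strictly shorter time than patient j.
    It is concordant when the model gave i the higher risk; ties in
    risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Made-up cohort: shorter survival paired with higher predicted risk.
    print(concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the relative improvement quoted in the abstract.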

DeLo: Dual Decomposed Low-Rank Experts Collaboration for Continual Missing Modality Learning

Adapting Large Multimodal Models (LMMs) to real-world scenarios poses the dual challenges of learning from sequential data streams while handling frequent modality incompleteness, a task known as Continual Missing Modality Learning (CMML). However, existing works on CMML have predominantly relied on prompt tuning, a technique that struggles with this task due to cross-task interference between its learnable prompts in their shared embedding space. A naive application of Low-Rank Adaptation (LoRA) with a modality-shared module will also suffer from modality interference caused by competing gradients. To this end, we propose DeLo, the first framework to leverage a novel dual-decomposed low-rank expert architecture for CMML. Specifically, this architecture resolves modality interference through a decomposed LoRA expert, dynamically composing the LoRA update matrix with rank-one factors from disentangled modality-specific factor pools. Embedded within a task-partitioned framework that structurally prevents catastrophic forgetting, this expert system is supported by two key mechanisms: a Cross-Modal Guided Routing strategy to handle incomplete data and a Task-Key Memory for efficient, task-agnostic inference. Extensive experiments on established CMML benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches. This highlights the value of a principled, architecturally-aware LoRA design for real-world multimodal challenges.

Updated: 2026-03-02 09:07:28

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01632v1

SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

As autonomous systems, such as drones, become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment, since failure to do so poses imminent danger to human lives and introduces long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined metrics for evaluation, and stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models both evaluation types separately with hierarchical Gaussian Processes, and uses a novel acquisition strategy to propose interesting test candidates based on learnt qualitative preferences and objectives that align with the stakeholder preferences. We validate our approach for ethical benchmarking of autonomous agents on two applications and find our method to perform the best. Our method provides an interpretable and efficient trade-off between exploration and exploitation, by generating up to $2\times$ optimal test candidates compared to baselines, with $1.25\times$ improvement in coverage of high-dimensional search spaces.

Updated: 2026-03-02 09:06:28

Categories: cs.AI,stat.AP

Download: http://arxiv.org/abs/2603.01630v1

Capabilities Ain't All You Need: Measuring Propensities in AI

AI evaluation has primarily focused on measuring capabilities, with formal approaches inspired by Item Response Theory (IRT) being increasingly applied. Yet propensities - the tendencies of models to exhibit particular behaviours - play a central role in determining both performance and safety outcomes. Traditional IRT describes a model's success on a task as a monotonic function of model capabilities and task demands, an approach unsuited to propensities, where both excess and deficiency can be problematic. Here, we introduce the first formal framework for measuring AI propensities by using a bilogistic formulation for model success, which attributes high success probability when the model's propensity is within an "ideal band". Further, we estimate the limits of the ideal band using LLMs equipped with newly developed task-agnostic rubrics. Applying our framework to six families of LLMs whose propensities are incited in either direction, we find that we can measure how much the propensity is shifted and what effect this has on the tasks. Critically, propensities estimated using one benchmark successfully predict behaviour on held-out tasks. Moreover, we obtain stronger predictive power when combining propensities and capabilities than when using either separately. More broadly, our framework showcases how rigorous propensity measurements can be conducted and how they yield gains over solely using capability evaluations to predict AI behaviour.

Updated: 2026-03-02 09:05:47

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2602.18182v4

Control Tax: The Price of Keeping AI in Check

The rapid integration of agentic AI into high-stakes real-world applications requires robust oversight mechanisms. The emerging field of AI Control (AIC) aims to provide such an oversight mechanism, but practical adoption depends heavily on implementation overhead. To study this problem, we introduce the notion of the Control Tax: the operational and financial cost of integrating control measures into AI pipelines. Our work makes three key contributions to the field of AIC: (1) we introduce a theoretical framework that quantifies the Control Tax and maps classifier performance to safety assurances; (2) we conduct comprehensive evaluations of state-of-the-art language models in adversarial settings, where attacker models insert subtle backdoors into code while monitoring models attempt to detect these vulnerabilities; and (3) we provide empirical financial cost estimates for control protocols and develop optimized monitoring strategies that balance safety and cost-effectiveness while accounting for practical constraints like auditing budgets. Our framework enables practitioners to make informed decisions by systematically connecting safety guarantees with their costs, advancing AIC through principled economic feasibility assessment across different deployment contexts.

Updated: 2026-03-02 09:03:52

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.05296v3

Prompt and Parameter Co-Optimization for Large Language Models

Prompt optimization and fine-tuning are two major approaches to improving the performance of Large Language Models (LLMs). They enhance the capabilities of LLMs from complementary perspectives: the former through explicit natural language, and the latter through implicit parameter updates. However, prior work has typically studied them in isolation, leaving their synergistic potential largely underexplored. To bridge this gap, we introduce MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning for LLM training. Specifically, we introduce two neural networks to generate prompts and parameters, respectively, while allowing them to share a common bottom encoding layer to enable knowledge sharing. Guided by the final supervised signals, our framework is optimized to discover the optimal combinations of prompts and parameters. Given that prompt learning involves discrete optimization while fine-tuning operates in a continuous parameter space, we design a supervised regularization loss to train our framework effectively. Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines.

Updated: 2026-03-02 09:00:19

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.24245v2

Towards OOD Generalization in Dynamic Graphs via Causal Invariant Learning

Although dynamic graph neural networks (DyGNNs) have demonstrated promising capabilities, most existing methods ignore out-of-distribution (OOD) shifts that commonly exist in dynamic graphs. Dynamic graph OOD generalization is non-trivial due to the following challenges: 1) Identifying invariant and variant patterns amid complex graph evolution, 2) Capturing the intrinsic evolution rationale from these patterns, and 3) Ensuring model generalization across diverse OOD shifts despite limited data distribution observations. Although several attempts have been made to tackle these challenges, none has successfully addressed all three simultaneously, and they face various limitations in complex OOD scenarios. To solve these issues, we propose a Dynamic graph Causal Invariant Learning (DyCIL) model for OOD generalization via exploiting invariant spatio-temporal patterns from a causal view. Specifically, we first develop a dynamic causal subgraph generator to identify causal dynamic subgraphs explicitly. Next, we design a causal-aware spatio-temporal attention module to extract the intrinsic evolution rationale behind invariant patterns. Finally, we further introduce an adaptive environment generator to capture the underlying dynamics of distributional shifts. Extensive experiments on both real-world and synthetic dynamic graph datasets demonstrate the superiority of our model over state-of-the-art baselines in handling OOD shifts.

Updated: 2026-03-02 09:00:11

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01626v1

Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.

Updated: 2026-03-02 08:59:39

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01625v1

Assessing Crime Disclosure Patterns in a Large-Scale Cybercrime Forum

Cybercrime forums play a central role in the cybercrime ecosystem, serving as hubs for the exchange of illicit goods, services, and knowledge. Previous studies have explored the market and social structures of these forums, but less is known about the behavioral dynamics of users, particularly regarding participants' disclosure of criminal activity. This study provides the first large-scale assessment of crime disclosure patterns in a major cybercrime forum, analysing over 3.5 million posts from nearly 300k users. Using a three-level classification scheme (benign, grey, and crime) and a scalable labelling pipeline powered by large language models (LLMs), we measure the level of crime disclosure present in initial posts, analyse how participants switch between levels, and assess how crime disclosure behavior relates to private communications. Our results show that crime disclosure is relatively normative: one quarter of initial posts include explicit crime-related content, and more than one third of users disclose criminal activity at least once in their initial posts. At the same time, most participants show restraint, with over two-thirds posting only benign or grey content and typically escalating disclosure gradually. Grey initial posts are particularly prominent, indicating that many users avoid overt statements and instead anchor their activity in ambiguous content. The study highlights the value of LLM-based text classification and Markov chain modelling for capturing crime disclosure patterns, offering insights for law enforcement efforts aimed at distinguishing benign, grey, and criminal content in cybercrime forums.
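
As a rough illustration of the Markov chain modelling mentioned above, transition probabilities between the three disclosure levels can be estimated by counting consecutive label pairs. This is a sketch under the assumption of a first-order chain; the helper name and the toy sequences are hypothetical and are not the study's data:

```python
from collections import Counter

LEVELS = ("benign", "grey", "crime")


def transition_matrix(sequences):
    """Estimate first-order transition probabilities between disclosure levels.

    `sequences` is an iterable of per-user label lists in posting order.
    Rows with no observed outgoing transitions are left as all zeros.
    """
    counts = Counter()
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[(current, following)] += 1

    matrix = {}
    for current in LEVELS:
        row_total = sum(counts[(current, nxt)] for nxt in LEVELS)
        matrix[current] = {
            nxt: (counts[(current, nxt)] / row_total if row_total else 0.0)
            for nxt in LEVELS
        }
    return matrix


# Hypothetical toy input, for illustration only.
toy_sequences = [
    ["benign", "benign", "grey"],
    ["grey", "crime"],
    ["benign", "grey", "grey"],
]
probs = transition_matrix(toy_sequences)
```

Each row of `probs` describes how users who just posted at one level tend to post next, which is one way to quantify gradual escalation.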

Updated: 2026-03-02 08:59:32

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2603.01624v1

Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to feature caching and reuse, which skips network evaluations at selected diffusion steps by using features cached at previous steps. However, their preliminary design relies solely on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. We fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79x speedup on FLUX.1 and 4.67x speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
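
The fitting step described above can be sketched with NumPy as follows. This is an illustrative reconstruction, not the authors' code: the polynomial degree, the ridge strength, and the rescaling of diffusion time to [-1, 1] are all assumptions.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def fit_chebyshev_ridge(t, feats, degree=3, ridge=1e-3):
    """Fit Chebyshev coefficients to cached features with ridge regression.

    t:     (n,) time stamps of cached steps, assumed rescaled to [-1, 1]
    feats: (n, d) features observed at those steps
    Returns an array of shape (degree + 1, d), one coefficient row per basis.
    """
    basis = C.chebvander(t, degree)                      # (n, degree + 1)
    gram = basis.T @ basis + ridge * np.eye(degree + 1)  # regularised normal equations
    return np.linalg.solve(gram, basis.T @ feats)


def forecast(coeffs, t_future):
    """Evaluate the fitted expansion at one or more future time stamps."""
    t_future = np.atleast_1d(np.asarray(t_future, dtype=float))
    return C.chebvander(t_future, coeffs.shape[0] - 1) @ coeffs
```

Because the coefficients are fitted once over the whole observed window, several future steps can be forecast from the same fit rather than extrapolating from only the last cached step.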

Updated: 2026-03-02 08:59:11

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2603.01623v1

Information-Theoretic Digital Twins for Stealthy Attack Detection in Industrial Control Systems: A Closed-Form KL Divergence Approach

Digital twins (DTs) are increasingly used to monitor and secure Industrial Control Systems (ICS), yet detecting stealthy False Data Injection Attacks (FDIAs) that manipulate system states within normal physical bounds remains challenging. Deep learning anomaly detectors often over-generalize such subtle manipulations, while classical fault detection methods do not scale well in highly correlated multivariate systems. We propose a closed-loop Information-Theoretic Digital Twin (IT-DT) framework for real-time anomaly detection. N4SID identification is combined with steady-state Kalman filtering to quantify residual distribution shifts via closed-form KL divergence, capturing both mean deviations and malicious cross-covariance shifts. Evaluations on the SWaT and WADI datasets show that IT-DT achieves F1-scores of 0.832 and 0.615, respectively, with better precision than deep learning baselines such as TranAD. Computational profiling indicates that the analytical approach requires minimal memory and provides approximately a 600x inference speedup over transformer-based methods on CPU hardware. This makes the framework suitable for resource-constrained industrial edge controllers without GPU acceleration.
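
For reference, the standard closed-form KL divergence between two multivariate Gaussians can be computed as below. How the paper maps nominal and observed residual distributions onto the two arguments is not stated in the abstract, so the argument order here is an assumption:

```python
import numpy as np


def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in closed form.

    0.5 * [ tr(cov1^-1 cov0) + (mu1 - mu0)^T cov1^-1 (mu1 - mu0)
            - k + ln det(cov1) - ln det(cov0) ]
    """
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    cov0 = np.asarray(cov0, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)

    k = mu0.shape[0]
    diff = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))  # covariance mismatch
    maha_term = diff @ np.linalg.solve(cov1, diff)      # mean deviation
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - k + logdet1 - logdet0)
```

The trace and log-determinant terms respond to covariance changes even when the means are identical: for two unit-variance residuals, introducing a correlation of 0.5 with unchanged means gives a divergence of about 0.144 against the uncorrelated baseline.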

Updated: 2026-03-02 08:56:11

Categories: cs.CR,math.OC

Download: http://arxiv.org/abs/2603.01621v1

Select, then Balance: Exploring Exogenous Variable Modeling of Spatio-Temporal Forecasting

Spatio-temporal (ST) forecasting is critical for dynamic systems, yet existing methods predominantly rely on modeling a limited set of observed target variables. In this paper, we present the first systematic exploration of exogenous variable modeling for ST forecasting, a topic long overlooked in this field. We identify two core challenges in integrating exogenous variables: the inconsistent effects of distinct variables on the target system and the imbalanced effects of historical and future data. To address these, we propose ExoST, a simple yet effective general framework for exogenous variable modeling that is highly compatible with existing ST backbones and follows a "select, then balance" paradigm. Specifically, we design a latent space gated expert module to dynamically select and recompose salient signals from fused exogenous information. Furthermore, a siamese dual-branch backbone architecture captures dynamic patterns from the recomposed past and future representations, integrating them via a context-aware weighting mechanism to ensure dynamic balance. Extensive experiments on real-world datasets demonstrate ExoST's effectiveness, universality, robustness, and efficiency.

Updated: 2026-03-02 08:53:59

Categories: cs.LG

Download: http://arxiv.org/abs/2509.05779v3

Silence Speaks Volumes: A New Paradigm for Covert Communication via History Timing Patterns

A Covert Channel (CC) exploits legitimate communication mechanisms to stealthily transmit information, often bypassing traditional security controls. Among these, a novel paradigm called History Covert Channels (HCC) leverages past network events as reference points to embed covert messages. Unlike traditional timing- or storage-based CCs, which directly manipulate traffic patterns or packet contents, HCCs minimize detectability by encoding information through small pointers to historical data. This approach enables them to amplify the size of transmitted covert data by referring to more bits than are actually embedded. Recent research has explored the feasibility of such methods, demonstrating their potential to evade detection by repurposing naturally occurring network behaviors as a covert transmission medium. This paper introduces a novel method for establishing and maintaining covert communication links using relative pointers to network timing patterns, which minimizes the reliance of the HCC on centralized timekeeping and reduces the likelihood of being detected by standard network monitoring tools. We also explore the tailoring of HCCs to optimize their robustness and undetectability characteristics. Our experiments reveal a better bitrate compared to previous work.

Updated: 2026-03-02 08:53:06

Categories: cs.CR,cs.DC,cs.NI

Download: http://arxiv.org/abs/2511.22259v2

ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents

Tool-integrated reasoning agents interleaving natural language deliberation with external API calls show promise for complex multi-step tasks. However, aligning such agents for high-stakes domain-specific deployment is challenging, as existing reinforcement learning uses coarse binary rewards (success/failure) that insufficiently guide nuanced tool invocation in production. We present ToolRLA, a three-stage post-training pipeline (Supervised Fine-Tuning, Group Relative Policy Optimization, Direct Preference Optimization) for domain-specific tool-integrated agents. Its core is a fine-grained reward function with multiplicative correctness decomposition, evaluating tool invocation across four dimensions: format validity, tool selection correctness, invocation efficiency, and domain constraint compliance. Multiplicative composition prioritizes correct tool selection (a prerequisite for meaningful parameter evaluation), while a large negative compliance penalty (λ=10) ensures regulatory adherence. Deployed on a real-world financial advisory copilot (80+ advisors, 1,200+ daily queries, 15+ heterogeneous APIs), ToolRLA achieves 47% higher end-to-end task completion (62% to 91%), 63% lower tool invocation error (38% to 14%), 93% lower regulatory violation (12% to 0.8%), and sub-2-second latency after three months. Ablation studies confirm fine-grained reward decomposition contributes 7 percentage points over coarse additive rewards; generalizability is validated on ToolBench and API-Bank.
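
The abstract does not give the exact reward formula, so the following is only one plausible reading of a multiplicative decomposition with a compliance penalty. The class, field names, and the way the penalty is subtracted are hypothetical; only the four dimensions and the value λ=10 come from the text:

```python
from dataclasses import dataclass


@dataclass
class ToolCallScores:
    format_valid: float        # 1.0 if the call parses, else 0.0
    tool_correct: float        # 1.0 if the right tool was selected, else 0.0
    efficiency: float          # in [0, 1], higher means fewer wasted calls
    violates_constraint: bool  # True if a domain rule was broken


def decomposed_reward(scores: ToolCallScores,
                      compliance_penalty: float = 10.0) -> float:
    """Multiplicative correctness term minus a large compliance penalty.

    Multiplying the factors means a wrong tool (tool_correct == 0) zeroes
    the correctness term no matter how efficient the call was.
    """
    correctness = scores.format_valid * scores.tool_correct * scores.efficiency
    penalty = compliance_penalty if scores.violates_constraint else 0.0
    return correctness - penalty
```

Under this reading, an additive reward would still give partial credit for an efficient call to the wrong tool, whereas the product gives none.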

Updated: 2026-03-02 08:52:14

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01620v1

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce ST-WebAgentBench, a configurable and easily extensible suite for evaluating web agent ST across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents reveals that their average CuP is less than two-thirds of their nominal completion rate, exposing critical safety gaps. By releasing code, evaluation templates, and a policy-authoring interface, ST-WebAgentBench (https://sites.google.com/view/st-webagentbench/home) provides an actionable first step toward deploying trustworthy web agents at scale.
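
The difference between raw completion and Completion Under Policy can be illustrated with a short sketch. The episode record layout (`completed`, `violations`) is a hypothetical format chosen for this example, not the benchmark's actual schema:

```python
def completion_rates(episodes):
    """Return (raw completion rate, completion-under-policy rate).

    Each episode is a dict with:
      "completed":  bool, whether the task goal was reached
      "violations": list of policy dimensions breached during the run
    CuP credits an episode only if it completed with no violations.
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    total = len(episodes)
    completed = sum(1 for e in episodes if e["completed"])
    under_policy = sum(
        1 for e in episodes if e["completed"] and not e["violations"]
    )
    return completed / total, under_policy / total
```

An agent that finishes three of four tasks but breaches a policy in one of them would score 0.75 on raw completion and 0.5 on CuP.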

Updated: 2026-03-02 08:51:03

Categories: cs.AI

Download: http://arxiv.org/abs/2410.06703v6

Plug-and-Hide: Provable and Adjustable Diffusion Generative Steganography

Diffusion model-based generative image steganography (DM-GIS) is an emerging paradigm that leverages the generative power of diffusion models to conceal secret messages without requiring pre-existing cover images. In this paper, we identify a fundamental trade-off between stego image quality, steganographic security, and extraction reliability within the DM-GIS framework. Drawing on this insight, we propose PA-B2G, a Provable and Adjustable Bit-to-Gaussian mapping. Theoretically, PA-B2G guarantees the reversible encoding of arbitrary-length bit sequences into pure Gaussian noise; practically, it enables fine-grained control over the balance between image fidelity, security, and extraction accuracy. By integrating PA-B2G with probability-flow ordinary differential equations (PF-ODEs), we establish a theoretically invertible mapping between secret bitstreams and stego images. PA-B2G is model-agnostic and can be seamlessly integrated into mainstream diffusion models without additional training or fine-tuning, making it also suitable for diffusion model watermarking. Extensive experiments validate our theoretical analysis of the inherent DM-GIS trade-offs and demonstrate that our method flexibly supports arbitrary payloads while achieving competitive image quality and security. Furthermore, our method exhibits strong resilience to lossy processing in watermarking applications, highlighting its practical utility.
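
To make the idea of a reversible bit-to-Gaussian mapping concrete, here is a generic one-bit-per-sample construction based on the inverse normal CDF. It is a textbook-style illustration and is not claimed to be PA-B2G itself, whose adjustable design is not specified in the abstract:

```python
import random
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1
_EPS = 1e-12                     # keeps probabilities strictly inside (0, 1)


def bits_to_gaussian(bits, rng):
    """Map each bit to a noise sample whose sign encodes the bit.

    A bit selects the lower or upper half of (0, 1); a uniform draw inside
    that half is pushed through the inverse normal CDF. If the bits are
    uniformly random, the samples are approximately standard normal.
    """
    noise = []
    for bit in bits:
        u = rng.uniform(_EPS, 0.5 - _EPS)  # position within the chosen half
        noise.append(_STANDARD_NORMAL.inv_cdf(u + 0.5 * bit))
    return noise


def gaussian_to_bits(noise):
    """Recover the bits from the sign of each sample."""
    return [1 if z > 0 else 0 for z in noise]
```

In this simple scheme, samples near zero are the ones most easily flipped by perturbations, which gives some intuition for why extraction reliability trades off against other goals.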

Updated: 2026-03-02 08:47:20

Categories: cs.CR

Download: http://arxiv.org/abs/2409.04878v3

Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets through a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods. Our source code has been released at https://github.com/Mwie1024/Extra-CoT.

Updated: 2026-03-02 08:47:20

Categories: cs.LG

Download: http://arxiv.org/abs/2602.08324v2

StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StructXLIP.

Updated: 2026-03-02 08:46:07

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2602.20089v3

RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks

Dual-arm robots play a crucial role in improving efficiency and flexibility in complex multitasking scenarios. While existing methods have achieved promising results in task planning, they often fail to fully optimize task parallelism, limiting the potential of dual-arm collaboration. To address this issue, we propose RoboPARA, a novel large language model (LLM)-driven framework for dual-arm task parallelism planning. RoboPARA employs a two-stage process: (1) Dependency Graph-based Planning Candidates Generation, which constructs directed acyclic graphs (DAGs) to model task dependencies and eliminate redundancy, and (2) Graph Re-Traversal-based Dual-Arm Parallel Planning, which optimizes DAG traversal to maximize parallelism while maintaining task coherence. In addition, we introduce the Cross-Scenario Dual-Arm Parallel Task dataset (X-DAPT dataset), the first dataset specifically designed to evaluate dual-arm task parallelism across diverse scenarios and difficulty levels. Extensive experiments demonstrate that RoboPARA significantly outperforms existing planning methods, achieving higher efficiency and reliability, particularly in complex task combinations. Our code is publicly available at https://github.com/AiDuanshiying/RoboPARA.
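
To show how a dependency DAG exposes parallelism for two arms, the sketch below greedily runs up to two ready tasks per time step using Python's standard `graphlib`. It is a simple baseline traversal for illustration, not RoboPARA's re-traversal algorithm, and the task names in the usage note are hypothetical:

```python
from graphlib import TopologicalSorter


def two_arm_schedule(dependencies, arms=2):
    """Greedy parallel schedule over a task DAG.

    `dependencies` maps each task to the set of tasks it must wait for.
    Returns a list of time steps, each a tuple of at most `arms` tasks
    whose prerequisites have all finished in earlier steps.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

    waiting, steps = [], []
    while sorter.is_active():
        waiting.extend(sorter.get_ready())  # tasks newly unblocked
        waiting.sort()                      # deterministic tie-breaking
        batch, waiting = waiting[:arms], waiting[arms:]
        steps.append(tuple(batch))
        sorter.done(*batch)                 # unblocks their successors
    return steps
```

For example, if `pour` depends on both `grasp_cup` and `open_bottle`, and `serve` depends on `pour`, the two independent tasks share the first step and the whole plan takes three steps instead of four.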

Updated: 2026-03-02 08:46:06

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2506.06683v2

Evaluating and Understanding Scheming Propensity in LLM Agents

As frontier language models are increasingly deployed as autonomous agents pursuing complex, long-term objectives, there is increased risk of scheming: agents covertly pursuing misaligned goals. Prior work has focused on showing agents are capable of scheming, but their propensity to scheme in realistic scenarios remains underexplored. To understand when agents scheme, we decompose scheming incentives into agent factors and environmental factors. We develop realistic settings allowing us to systematically vary these factors, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding. We find only minimal instances of scheming despite high environmental incentives, and show this is unlikely due to evaluation awareness. While inserting adversarially-designed prompt snippets that encourage agency and goal-directedness into an agent's system prompt can induce high scheming rates, snippets used in real agent scaffolds rarely do. Surprisingly, in model organisms (Hubinger et al., 2023) built with these snippets, scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate from 59% to 3%, and increasing oversight can raise rather than deter scheming by up to 25%. Our incentive decomposition enables systematic measurement of scheming propensity in settings relevant for deployment, which is necessary as agents are entrusted with increasingly consequential tasks.

Updated: 2026-03-02 08:38:40

标题: 评估和理解LLM代理的策划倾向

摘要: 随着前沿语言模型越来越多地被部署为追求复杂、长期目标的自主代理,隐蔽地追求不一致目标的风险也在增加。先前的研究主要集中在展示代理能够进行策划,但它们在现实场景中策划的倾向仍未得到充分探讨。为了了解代理何时会策划,我们将策划激励因素分解为代理因素和环境因素。我们开发了现实情景,允许我们系统地改变这些因素,每个因素都为追求工具收敛目标的代理提供了策划机会,如自我保护、资源获取和目标保护。尽管环境激励很高,我们发现了极少的策划实例,并且表明这不太可能是由于评估意识。虽然将鼓励代理性和目标导向性的对抗性设计提示片段插入代理系统提示中可以导致高策划率,但实际代理脚手架中使用的提示片段很少。令人惊讶的是,在使用这些提示建立的模型生物体中(Hubinger等人,2023年),策划行为非常脆弱:去掉一个工具就可以将策划率从59%降至3%,增加监督反而可以将策划率提高25%。我们的激励分解能够系统地衡量在部署相关设置中的策划倾向,这对于代理被委托执行越来越重要的任务是必要的。

更新时间: 2026-03-02 08:38:40

领域: cs.AI

下载: http://arxiv.org/abs/2603.01608v1

CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the state-of-the-art (SOTA) model of the same size (10B). With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.

Updated: 2026-03-02 08:38:37

标题: CARE:基于证据支持的代理框架在多模医疗推理中实现临床问责制

摘要: 大型视觉语言模型(VLMs)展示了强大的多模态医学推理能力,但大多数作为端到端的黑匣子运行,偏离了临床医生基于证据的分阶段工作流程,阻碍了临床问责制。相辅相成地,专家视觉定位模型可以准确地定位感兴趣区域(ROIs),提供明确、可靠的证据,从而提高推理准确性和信任度。在本文中,我们介绍了CARE,通过一个基于证据的主动框架,推进了多模态医学推理中的临床问责制。与现有方法不同,CARE将定位和推理分解为协调的子模块任务,以减少捷径学习和幻觉:一个紧凑的VLM提出相关的医学实体;一个专家实体引用分割模型产生像素级ROI证据;一个基于ROI提示的VLM推理整个图像。这些VLMs经过强化学习进行优化,以验证奖励来使答案与支持证据一致。此外,一个VLM协调员计划工具调用并审查证据-答案的一致性,提供主动控制和最终验证。在标准医学VQA基准测试中评估,我们的CARE-Flow(无协调员)将平均准确率提高了10.9%,超过了相同规模(10B)的最新技术水平。通过动态规划和答案审查,我们的CARE-Coord获得了进一步的增益,优于预训练充分的最新技术水平5.2%。我们的实验表明,模拟临床工作流程,结合解耦的专业模型和明确的证据的主动框架,产生更准确、有问责制的医学人工智能。

更新时间: 2026-03-02 08:38:37

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.01607v1

OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective memory usage, but existing static and predetermined offloading strategies cannot adapt to the rapidly shifting memory demands of long-context serving. This often leads to excessive CPU-to-GPU KV transfers that translate into latency spikes and frequent SLO violations. To address these challenges, we introduce OrbitFlow, a fine-grained and adaptive KV cache management system that meets latency SLOs in long-context LLM serving. OrbitFlow employs a lightweight ILP solver to decide which layers' KV caches to retain on the GPU for each request, within memory capacity constraints. It continuously refines KV placements based on runtime feedback when the active plan becomes suboptimal during token generation. Under heavy load, OrbitFlow invokes a fallback mechanism to temporarily defer in-flight requests with large memory footprints, preserving overall SLO attainment. Our experiments demonstrate that OrbitFlow improves SLO attainment for TPOT and TBT by up to 66% and 48%, respectively, while reducing the 95th percentile latency by 38% and achieving up to 3.3x higher throughput compared to existing offloading methods.

Updated: 2026-03-02 08:37:18

标题: OrbitFlow: 借助细粒度KV缓存重新配置的SLO感知长上下文LLM服务

摘要: 为长上下文LLMs提供服务是具有挑战性的,因为在令牌生成过程中请求长度和批处理组成会发生变化,导致内存占用在运行时显著波动。将KV缓存卸载到主机内存限制了有效内存使用,但现有的静态和预定的卸载策略无法适应长上下文服务中内存需求迅速变化的问题。这通常会导致过多的CPU到GPU的KV传输,从而导致延迟峰值和频繁的SLO违规。为了解决这些挑战,我们引入了OrbitFlow,这是一个精细粒度且自适应的KV缓存管理系统,可以满足长上下文LLM服务的延迟SLO。OrbitFlow利用轻量级整数线性规划求解器来决定在内存容量约束下为每个请求保留在GPU上的哪些层的KV缓存。它基于运行时反馈不断优化KV的放置,以应对在令牌生成过程中活动计划变得次优的情况。在负载较重时,OrbitFlow会调用回退机制来暂时推迟具有较大内存占用的正在进行中的请求,以保持总体SLO达标。我们的实验证明,与现有的卸载方法相比,OrbitFlow可以将TPOT和TBT的SLO达标率分别提高高达66%和48%,将第95百分位延迟降低38%,并实现高达3.3倍的吞吐量。

更新时间: 2026-03-02 08:37:18

领域: cs.AI,cs.LG,cs.PF

下载: http://arxiv.org/abs/2601.10729v2

Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain the most robust, and single-response demonstrations can outperform pairwise preference data. However, there still exist two key challenges: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. To address these limitations, we propose DR-IRL, which Dynamically adjusts Rewards through Inverse Reinforcement Learning. We first train category-specific reward models using a balanced safety dataset of seven harmful categories as demonstration via IRL. Then we enhance Group Relative Policy Optimization (GRPO) by introducing dynamic reward scaling: adjusting rewards by task difficulty, data-level hardness by text encoder cosine similarity, and model-level responsiveness by reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.

Updated: 2026-03-02 08:36:34

标题: 逆强化学习与LLM对齐的动态奖励缩放

摘要: 对于安全部署大型语言模型(LLMs),对齐是至关重要的。现有技术要么基于奖励(在偏好对中训练奖励模型,并使用强化学习进行优化),要么无奖励(直接在排名输出上进行微调)。最近的研究表明,良好调整的基于奖励的流程仍然是最稳健的,单一响应演示可以胜过成对偏好数据。然而,仍然存在两个关键挑战:(1)不平衡的安全数据集过多地代表常见危险,而忽视了长尾威胁;(2)静态奖励模型忽略任务难度,限制了优化效率和可达到的收益。为了解决这些限制,我们提出了DR-IRL,通过逆强化学习动态调整奖励。我们首先使用平衡的七个有害类别的安全数据集通过IRL作为演示来训练类别特定的奖励模型。然后,我们通过引入动态奖励缩放来增强Group Relative Policy Optimization(GRPO):根据任务难度、文本编码器余弦相似性的数据级难度以及奖励差距来调整奖励。在各种基准和LLMs上进行的大量实验表明,DR-IRL在安全对齐方面胜过所有基线方法,同时保持实用性。

更新时间: 2026-03-02 08:36:34

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.18991v6

What Helps -- and What Hurts: Bidirectional Explanations for Vision Transformers

Vision Transformers (ViTs) achieve strong performance in visual recognition, yet their decision-making remains difficult to interpret. We propose BiCAM, a bidirectional class activation mapping method that captures both supportive (positive) and suppressive (negative) contributions to model predictions. Unlike prior CAM-based approaches that discard negative signals, BiCAM preserves signed attributions to produce more complete and contrastive explanations. BiCAM further introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance and enables lightweight detection of adversarial examples without retraining. Across ImageNet, VOC, and COCO, BiCAM improves localization and faithfulness while remaining computationally efficient. It generalizes to multiple ViT variants, including DeiT and Swin. These results suggest the importance of modeling both supportive and suppressive evidence for interpreting transformer-based vision models.

Updated: 2026-03-02 08:36:16

标题: 帮助和阻碍:视觉Transformer的双向解释

摘要: Vision Transformers(ViTs)在视觉识别方面表现出色,但它们的决策过程仍然难以解释。我们提出了BiCAM,一种双向类激活映射方法,可以捕捉模型预测中的支持性(正面)和抑制性(负面)贡献。与之前基于CAM的方法不同,BiCAM保留了带符号的归因,以产生更完整和对比明显的解释。BiCAM还引入了一个正面到负面比率(PNR),总结了归因平衡,并且可以轻松地检测对抗性样本而无需重新训练。在ImageNet、VOC和COCO数据集上,BiCAM提高了定位精度和忠实度,同时保持计算效率。它还泛化到多种ViT变体,包括DeiT和Swin。这些结果表明,对于解释基于Transformer的视觉模型,建模支持性和抑制性证据的重要性。

更新时间: 2026-03-02 08:36:16

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.01605v1

Learning from Complexity: Exploring Dynamic Sample Pruning of Spatio-Temporal Training

Spatio-temporal forecasting is fundamental to intelligent systems in transportation, climate science, and urban planning. However, training deep learning models on the massive, often redundant, datasets from these domains presents a significant computational bottleneck. Existing solutions typically focus on optimizing model architectures or optimizers, while overlooking the inherent inefficiency of the training data itself. This conventional approach of iterating over the entire static dataset each epoch wastes considerable resources on easy-to-learn or repetitive samples. In this paper, we explore a novel training-efficiency technique for spatio-temporal forecasting, ST-Prune, which learns from complexity through dynamic sample pruning. Through dynamic sample pruning, we aim to intelligently identify the most informative samples based on the model's real-time learning state, thereby accelerating convergence and improving training efficiency. Extensive experiments conducted on real-world spatio-temporal datasets show that ST-Prune significantly accelerates training while maintaining or even improving model performance, and that it is scalable and broadly applicable.

Updated: 2026-03-02 08:35:46

标题: 从复杂性中学习:探索时空训练动态样本修剪

摘要: 时空预测对于交通、气候科学和城市规划中的智能系统至关重要。然而,在这些领域的庞大、常常冗余的数据集上训练深度学习模型存在显著的计算瓶颈。现有解决方案通常集中在优化模型架构或优化器上,而忽视了训练数据本身的固有低效性。传统方法是在每个纪元迭代整个静态数据集,这浪费了大量资源在易于学习或重复的样本上。在本文中,我们探讨一种新颖的训练效率技术,即学习复杂性与动态样本修剪,ST-Prune,用于时空预测。通过动态样本修剪,我们旨在根据模型的实时学习状态智能地识别最具信息量的样本,从而加速收敛并提高训练效率。对真实世界的时空数据集进行的大量实验表明,ST-Prune显著加速了训练速度,同时保持甚至提高了模型性能,并且具有可扩展性和普适性。

更新时间: 2026-03-02 08:35:46

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2602.19113v2

UrbanFM: Scaling Urban Spatio-Temporal Foundation Models

Urban systems, as dynamic complex systems, continuously generate spatio-temporal data streams that encode the fundamental laws of human mobility and city evolution. While AI for Science has witnessed the transformative power of foundation models in disciplines like genomics and meteorology, urban computing remains fragmented due to "scenario-specific" models, which are overfitted to specific regions or tasks, hindering their generalizability. To bridge this gap and advance spatio-temporal foundation models for urban systems, we adopt scaling as the central perspective and systematically investigate two key questions: what to scale and how to scale. Grounded in first-principles analysis, we identify three critical dimensions: heterogeneity, correlation, and dynamics, aligning these principles with the fundamental scientific properties of urban spatio-temporal data. Specifically, to address heterogeneity through data scaling, we construct WorldST. This billion-scale corpus standardizes diverse physical signals, such as traffic flow and speed, from over 100 global cities into a unified data format. To enable computation scaling for modeling correlations, we introduce the MiniST unit, a novel split mechanism that discretizes continuous spatio-temporal fields into learnable computational units to unify representations of grid-based and sensor-based observations. Finally, addressing dynamics via architecture scaling, we propose UrbanFM, a minimalist self-attention architecture designed with limited inductive biases to autonomously learn dynamic spatio-temporal dependencies from massive data. Furthermore, we establish EvalST, the largest-scale urban spatio-temporal benchmark to date. Extensive experiments demonstrate that UrbanFM achieves remarkable zero-shot generalization across unseen cities and tasks, marking a pivotal first step toward large-scale urban spatio-temporal foundation models.

Updated: 2026-03-02 08:34:52

标题: UrbanFM: 扩展城市时空基础模型

摘要: 城市系统作为动态复杂系统,不断产生编码人类移动和城市演化基本规律的时空数据流。虽然科学人工智能已经见证了在基因组学和气象学等学科中基础模型的变革性力量,但由于“特定场景”模型在城市计算中仍然是分散的,这些模型过度拟合于特定区域或任务,从而阻碍了它们的泛化能力。为了弥合这一差距并推进城市系统的时空基础模型,我们采用缩放作为中心视角,并系统地探讨两个关键问题:什么要缩放,如何缩放。基于基本原理分析,我们确定了三个关键维度:异质性、相关性和动态性,将这些原则与城市时空数据的基本科学属性相一致。具体来说,为了通过数据缩放解决异质性问题,我们构建了WorldST。这个十亿级语料库将来自100多个全球城市的各种物理信号(如交通流量和速度)标准化为统一的数据格式。为了实现模型相关性的计算缩放,我们引入了MiniST单元,这是一种将连续时空场离散化为可学习计算单元的新型分割机制,以统一网格和传感器观测的表示。最后,通过架构缩放解决动态性问题,我们提出了UrbanFM,这是一种极简的自注意力架构,设计有限的归纳偏差,能够从大量数据中自主学习动态时空依赖关系。此外,我们建立了EvalST,迄今为止规模最大的城市时空基准。大量实验证明UrbanFM在未见城市和任务上实现了显著的零样本泛化,标志着朝着大规模城市时空基础模型迈出了关键的第一步。

更新时间: 2026-03-02 08:34:52

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2602.20677v2

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity, and we empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B models. Results demonstrate a clear performance hierarchy: point-based methods < static pairwise training < Elo-Evolve across Alpaca Eval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.
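
As a rough illustration of the two ingredients named in this abstract (Elo ratings updated from binary win/loss outcomes, and temperature-controlled opponent sampling), the following Python sketch may help. It uses the standard Elo formulas; the sampling rule, the function names, the K-factor, and the temperature value are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return the updated (r_a, r_b) after one binary win/loss outcome.

    k is the usual Elo K-factor; 32 is a common default, not a value from the paper.
    """
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def sample_opponent(policy_rating: float, pool: dict, temperature: float = 100.0, rng=random):
    """Hypothetical temperature-controlled sampling (an assumed rule).

    Opponents whose rating is close to the policy's rating get more weight;
    a higher temperature flattens the distribution towards uniform.
    """
    names = list(pool)
    weights = [math.exp(-abs(pool[n] - policy_rating) / temperature) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With a rule of this shape, a policy that keeps winning gains rating and is gradually matched against stronger opponents, which is one plausible reading of "automatic curriculum learning" in the abstract.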

Updated: 2026-03-02 08:31:59

标题: Elo-Evolve:一种用于语言模型对齐的共同进化框架

摘要: 目前用于大型语言模型(LLMs)的对齐方法依赖于将大量人类偏好数据压缩成静态、绝对的奖励函数,从而导致数据稀缺、对噪声敏感和训练不稳定。我们引入了Elo-Evolve,这是一个重新定义对齐为自适应对手池内的动态多智能体竞争的协同演化框架。我们的方法有两个关键创新:(1)通过直接从配对比赛中的二元胜负结果中学习,消除了布拉德利-特里模型的依赖性;(2)实现了由Elo编排的对手选择,通过温度控制的采样提供了自动课程学习。我们将我们的方法基于PAC学习理论,展示了配对比较实现了更优越的样本复杂性,并在实证验证中与绝对评分方法相比实现了4.5倍的噪声减少。在实验中,我们使用我们的框架训练了一个Qwen2.5-7B模型,对手包括Qwen2.5-14B、Qwen2.5-32B和Qwen3-8B模型。结果显示了一个明确的性能层次结构:基于点的方法 < 静态配对训练 < Elo-Evolve在Alpaca Eval 2.0和MT-Bench上,验证了配对比较和动态对手选择对LLM对齐的渐进性好处。

更新时间: 2026-03-02 08:31:59

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2602.13575v2

YCDa: YCbCr Decoupled Attention for Real-time Realistic Camouflaged Object Detection

Human vision exhibits remarkable adaptability in perceiving objects under camouflage. When color cues become unreliable, the visual system instinctively shifts its reliance from chrominance (color) to luminance (brightness and texture), enabling more robust perception in visually confusing environments. Drawing inspiration from this biological mechanism, we propose YCDa, an efficient early-stage feature processing strategy that embeds this "chrominance-luminance decoupling and dynamic attention" principle into modern real-time detectors. Specifically, YCDa separates color and luminance information in the input stage and dynamically allocates attention across channels to amplify discriminative cues while suppressing misleading color noise. The strategy is plug-and-play and can be integrated into existing detectors by simply replacing the first downsampling layer. Extensive experiments on multiple baselines demonstrate that YCDa consistently improves performance with negligible overhead as shown in Fig. Notably, YCDa-YOLO12s achieves a 112% improvement in mAP over the baseline on COD10K-D and sets new state-of-the-art results for real-time camouflaged object detection across COD-D datasets.
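
The chrominance-luminance decoupling that this abstract builds on can be sketched in Python as below. The sketch assumes the full-range BT.601 (JPEG/JFIF) RGB-to-YCbCr conversion, which the abstract does not specify, and it covers only the colour-space split, not the attention mechanism. The function names are hypothetical.

```python
def rgb_to_ycbcr(r: float, g: float, b: float):
    """Convert one 8-bit RGB pixel to full-range BT.601 YCbCr (JFIF convention)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_luma_chroma(pixels):
    """Split a list of RGB triples into a luminance list and a chrominance list.

    Returns (luma, chroma), where luma holds Y values and chroma holds (Cb, Cr) pairs,
    so that later stages can weight the two streams separately.
    """
    luma, chroma = [], []
    for r, g, b in pixels:
        y, cb, cr = rgb_to_ycbcr(r, g, b)
        luma.append(y)
        chroma.append((cb, cr))
    return luma, chroma
```

Any grey pixel (r = g = b) maps to Cb = Cr = 128 under this conversion, so all of its information sits in the luminance channel. That is the property a detector can exploit when colour cues are unreliable.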

Updated: 2026-03-02 08:31:20

标题: YCDa:用于实时逼真伪装目标检测的YCbCr分离注意力模型

摘要: 人类视觉在认知伪装对象时表现出了非凡的适应性。当颜色线索变得不可靠时,视觉系统本能地将依赖从色度(颜色)转移到亮度(亮度和纹理),从而在视觉混乱的环境中实现更健壮的感知。受到这一生物机制的启发,我们提出了YCDa,一种高效的早期特征处理策略,将“色度-亮度解耦和动态注意力”原则嵌入到现代实时检测器中。具体来说,YCDa在输入阶段分离颜色和亮度信息,并动态分配注意力跨通道,以放大区分性线索同时抑制误导性的颜色噪声。该策略是即插即用的,可以通过简单地替换第一个下采样层将其集成到现有的检测器中。对多个基线的广泛实验表明,如图所示,YCDa在性能上持续改善,同时开销可以忽略不计。值得注意的是,YCDa-YOLO12s在COD10K-D数据集上相对基线实现了112%的mAP改进,并为跨COD-D数据集的实时伪装对象检测设定了新的最先进结果。

更新时间: 2026-03-02 08:31:20

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.01602v1

Boosting Entropy with Bell Box Quantization

Quantization-Aware Pre-Training (QAPT) is an effective technique to reduce the compute and memory overhead of Deep Neural Networks while improving their energy efficiency on edge devices. Existing QAPT methods produce models stored in compute-efficient data types (e.g. integers) that are not information theoretically optimal (ITO). On the other hand, existing ITO data types (e.g. Quantile/NormalFloat Quantization) are not compute-efficient. We propose BBQ, the first ITO quantization method that is also compute-efficient. BBQ builds on our key insight that since learning is domain-agnostic, the output of a quantizer does not need to reside in the same domain as its input. BBQ performs ITO quantization in its input domain, and returns its output in a compute-efficient domain where ITO data types are mapped to compute-efficient data types. Without sacrificing compute efficiency, BBQ outperforms prior SOTA QAPT methods by a perplexity reduction of up to 2 points for 4-bit models, up to 4 points for 3-bit models, up to 5 points for 2-bit models, and up to 18 points for 1-bit models. Code is available at https://github.com/1733116199/bbq.

Updated: 2026-03-02 08:27:39

标题: 使用贝尔盒量化增强熵

摘要: 量化感知预训练(QAPT)是一种有效的技术,可以降低深度神经网络的计算和内存开销,同时提高边缘设备的能效。现有的QAPT方法生成存储在计算高效数据类型(例如整数)中的模型,并不是信息论上最优(ITO)。另一方面,现有的ITO数据类型(例如分位数/正态浮点量化)并不是计算高效的。我们提出了BBQ,这是第一种既是ITO量化方法又是计算高效的方法。BBQ建立在我们的关键洞察力上,即由于学习是与领域无关的,量化器的输出不需要驻留在与其输入相同的领域中。BBQ在其输入领域中执行ITO量化,并将其输出返回到一个计算高效的领域,其中ITO数据类型被映射到计算高效的数据类型。在不牺牲计算效率的情况下,BBQ在4位模型中比先前的SOTA QAPT方法提高了最高2个点的困惑度,在3位模型中最高提高了4个点,在2位模型中最高提高了5个点,在1位模型中最高提高了18个点。代码可在https://github.com/1733116199/bbq 上找到。

更新时间: 2026-03-02 08:27:39

领域: cs.LG

下载: http://arxiv.org/abs/2603.01599v1

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of-$n$ attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.
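
The best-of-$n$ framing in this abstract can be made concrete with a short worked equation. This is a simplified illustration under two assumptions that are not taken from the paper: each resampled output evades the monitor independently with the same probability $p$, and the attack succeeds if at least one of $n$ samples evades.

```latex
P(\text{at least one of } n \text{ samples evades}) = 1 - (1 - p)^{n}
```

For example, with the purely illustrative values $p = 0.3$ and $n = 3$, this gives $1 - 0.7^{3} = 0.657$, more than double the single-sample rate. Under these assumptions, each additional resample can only help the attacker.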

Updated: 2026-03-02 08:27:36

标题: 受信任监控器的自适应攻击破坏AI控制协议

摘要: AI控制协议充当防御机制,阻止不受信任的LLM代理在自主设置中造成伤害。以往的研究将此视为安全问题,通过使用部署上下文 subtly 完成有害的副作用任务的利用进行压力测试,例如后门插入。实际上,大多数AI控制协议基本上是基于LLM监视器的,这可能成为一个中心故障点。我们研究了一个不受信任模型的自适应攻击,该模型了解协议和监视器模型,如果不受信任模型是在较晚的知识截止日期下进行训练或可以自主搜索这些信息,则这是可能的。我们通过攻击者在模型输出中嵌入公开已知或零射击提示注入来实例化一个简单的自适应攻击向量。使用这种策略,前沿模型始终规避各种监视器,并在两个主要的AI控制基准测试中完成恶意任务。这种攻击普遍适用于依赖监视器的当前协议。此外,最近的Defer-to-Resample协议甚至适得其反,因为其重新采样放大了提示注入,并有效地将其重新构建为一个best-of-$n$攻击。总的来说,对监视器模型的自适应攻击在当前控制协议中代表一个重大盲点,应成为未来AI控制机制评估的标准组成部分。

更新时间: 2026-03-02 08:27:36

领域: cs.LG,cs.AI,cs.CR

下载: http://arxiv.org/abs/2510.09462v2

Characterizing Pattern Matching and Its Limits on Compositional Task Structures

Despite impressive capabilities, LLMs' successes often rely on pattern-matching behaviors, yet these are also linked to OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise and testable account of how well LLMs perform generalization through pattern matching and their limitations. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., identifying pairs of subsequences of inputs that consistently lead to identical results when the rest of the input is held constant. Then, we systematically study how decoder-only Transformer and Mamba behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights: (1) Instance-wise success of pattern matching is well predicted by the number of contexts witnessing the relevant functional equivalence. (2) We prove a tight sample complexity bound of learning a two-hop structure by identifying the exponent of the data scaling law for perfect in-domain generalization. Our empirical results align with the theoretical prediction, under 20x parameter scaling and across architectures. (3) Path ambiguity is a structural barrier: when a variable influences the output via multiple paths, models fail to form unified intermediate state representations, impairing accuracy and interpretability. (4) Chain-of-Thought reduces data requirements yet does not resolve path ambiguity. Hence, we provide a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms.

Updated: 2026-03-02 08:27:12

标题: 表征模式匹配及其在组合任务结构上的限制

摘要: 尽管LLMs具有令人印象深刻的能力,但它们的成功往往依赖于模式匹配行为,然而这些行为也与合成任务中的OOD泛化失败相关联。然而,行为研究通常采用允许多个泛化源(例如,代数不变性、结构重复)的任务设置,使得对LLMs如何通过模式匹配进行泛化以及它们的局限性进行精确和可测试的解释变得模糊。为了解决这种模糊性,我们首先将模式匹配正式化为功能等价,即识别输入的子序列对,当保持输入的其余部分恒定时,这些子序列对始终导致相同的结果。然后,我们系统地研究了在具有隔离这种机制的合成结构的受控任务中,解码器-仅Transformer和Mamba的行为。我们的形式化产生了预测性和定量性的见解:(1)模式匹配的实例成功往往可以通过见证相关功能等价的上下文数量来很好地预测。(2)我们证明了通过确定完全在域泛化的数据扩展规律的指数,学习两跳结构的紧凑样本复杂性界限。我们的实证结果与理论预测一致,在20倍参数扩展和跨架构范围内。(3)路径模糊性是一种结构性障碍:当一个变量通过多条路径影响输出时,模型无法形成统一的中间状态表示,从而影响准确性和可解释性。(4)思维链减少了数据需求,但并不能解决路径模糊性。因此,我们为模式匹配提供了一个预测性、可验证的边界,以及对混合泛化机制进行解缠的基础性诊断。

更新时间: 2026-03-02 08:27:12

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2505.20278v3

Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction

Reconstructing MRI from highly undersampled measurements is crucial for accelerating medical imaging, but is challenging due to the ill-posedness of the inverse problem. While supervised deep learning (DL) approaches have shown remarkable success, they traditionally rely on fully-sampled ground truth (GT) images, which are expensive or impossible to obtain in real scenarios. This problem has created a recent surge in interest in self-supervised learning methods that do not require GT. Although recent methods are now fast approaching "oracle" supervised performance, the lack of systematic comparison and standard experimental setups is hindering targeted methodological research and precluding widespread trustworthy industry adoption. We present SSIBench, a modular and flexible comparison framework to unify and thoroughly benchmark Self-Supervised Imaging methods (SSI) without GT. We evaluate 18 recent methods across seven realistic MRI scenarios on real data, showing a wide performance landscape whose method ranking differs across scenarios and metrics, exposing the need for further SSI research. Our insights also show how complementary methods could be compounded for future improvements, exemplified by a novel loss we propose, Multi-Operator Equivariant Imaging. To accelerate reproducible research and lower the barrier to entry, we provide the extensible benchmark and open-source reimplementations of all methods at https://github.com/Andrewwango/ssibench, allowing researchers to rapidly and fairly contribute and evaluate new methods on the standardised setup for potential leaderboard ranking, or benchmark existing methods on custom datasets, forward operators, or models, unlocking the application of SSI to other valuable GT-free domains such as 4D MRI and other nascent scientific imaging modalities.

Updated: 2026-03-02 08:26:45

标题: 基准自监督学习方法在加速MRI重建中的应用

摘要: 将高度欠采样的MRI重建成图像对于加速医学成像至关重要,但由于逆问题的不适定性,这是具有挑战性的。尽管监督深度学习(DL)方法取得了显著成功,但传统上依赖于昂贵或在实际场景中无法获取的完全采样的地面真实(GT)图像。这个问题引起了对不需要GT的自监督学习方法的最近兴趣激增。虽然最近的方法现在快速接近“oracle”监督性能,但缺乏系统比较和标准实验设置阻碍了有针对性的方法论研究,并阻止了广泛可信的行业采用。我们提出了SSIBench,一个模块化和灵活的比较框架,以统一和全面评估无GT的自监督成像方法(SSI)。我们在真实数据上评估了18种最近的方法,涵盖了七种真实的MRI场景,展示了一个广泛的性能景观,其中方法排名在不同场景和度量标准下有所不同,揭示了对进一步SSI研究的需求。我们的见解还表明,如何将互补方法结合起来以实现未来的改进,我们提出了一个新颖的损失,即多操作等变成像。为了加速可重复的研究并降低门槛,我们提供了可扩展的基准测试和所有方法的开源重新实现,允许研究人员快速公平地在标准设置上贡献和评估新方法,以便潜在地在排行榜上排名,或在自定义数据集、前向运算符或模型上对现有方法进行基准测试,释放SSI应用到其他宝贵的无GT领域,如4D MRI和其他新兴科学成像模态。

更新时间: 2026-03-02 08:26:45

领域: eess.IV,cs.LG

下载: http://arxiv.org/abs/2502.14009v5

GRAIL: Post-hoc Compensation by Linear Reconstruction for Compressed Networks

Structured deep model compression methods are hardware-friendly and substantially reduce memory and inference costs. However, under aggressive compression, the resulting accuracy degradation often necessitates post-compression finetuning, which can be impractical due to missing labeled data or high training cost. We propose post-hoc blockwise compensation, called GRAIL, a simple zero-finetuning step applied after model compression that restores each block's input-output behavior using a small calibration set. The method summarizes hidden activations via a Gram matrix and applies ridge regression to linearly reconstruct the original hidden representation from the reduced one. The resulting reconstruction map is absorbed into the downstream projection weights, while the upstream layer is compressed. The approach is selector-agnostic (Magnitude, Wanda, Gram-based selection, or folding), data-aware (requiring only a few forward passes without gradients or labels), and recovers classic pruning or folding when the Gram matrix is near identity, indicating weak inter-channel correlations. Across ResNets, ViTs, and decoder-only LLMs, GRAIL consistently improves accuracy or perplexity over data-free and data-aware pruning or folding baselines in practical compression regimes, with manageable overhead and no backpropagation. The code is available at https://github.com/TWWinde/GRAIL_Compensation.
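
The Gram-matrix ridge-regression step described in this abstract can be written out as a worked equation. The notation here is assumed for illustration and is not the paper's own: $H \in \mathbb{R}^{n \times d}$ stacks calibration activations row-wise, $S$ indexes the $k$ retained channels, $H_S = H[:, S]$, $\lambda > 0$ is a ridge penalty, and $W$ is the downstream projection.

```latex
G = H^\top H, \qquad G_{SS} = H_S^\top H_S, \qquad G_{S,:} = H_S^\top H

B^\star = \arg\min_{B \in \mathbb{R}^{k \times d}} \; \|H - H_S B\|_F^2 + \lambda \|B\|_F^2
        = (G_{SS} + \lambda I)^{-1} G_{S,:}

H W \approx H_S \,(B^\star W)
```

Under this reading, $G_{SS}$ and $G_{S,:}$ are sub-blocks of the Gram matrix $G$, so only $G$ needs to be accumulated from forward passes. The product $B^\star W$ is what gets absorbed into the downstream weights. If $G \approx c I$, then $B^\star \approx \tfrac{c}{c+\lambda} E_S$, where $E_S$ is the $k \times d$ selection matrix. This is plain pruning up to a scalar shrinkage, in line with the abstract's remark about weak inter-channel correlations.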

Updated: 2026-03-02 08:25:09

标题: GRAIL: 压缩网络的后补偿线性重建

摘要: 结构化深度模型压缩方法对硬件友好,并显著降低了内存和推断成本。然而,在激进的压缩下,由此导致的精度降低通常需要后续的微调,这可能由于缺少标记数据或高训练成本而变得不切实际。我们提出了后设块状补偿,称为GRAIL,这是一种简单的零微调步骤,应用于模型压缩后,使用一个小的校准集恢复每个块的输入输出行为。该方法通过Gram矩阵总结隐藏激活,并应用岭回归来线性重构原始的隐藏表示从减少的表示中。结果重建映射被吸收到下游的投影权重中,而上游层被压缩。这种方法是选择器不可知的(幅度,Wanda,基于Gram的选择或折叠),数据感知的(只需要几次前向传递而不需要梯度或标签),并在Gram矩阵接近单位时恢复经典修剪或折叠,表明通道间的相关性较弱。在ResNets、ViTs和仅解码器LLMs中,GRAIL在实际压缩范围内始终比数据无关和数据感知修剪或折叠基线提高精度或困惑度,且具有可管理的开销和无反向传播。代码可在https://github.com/TWWinde/GRAIL_Compensation找到。

更新时间: 2026-03-02 08:25:09

领域: cs.LG

下载: http://arxiv.org/abs/2602.23795v2

UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos

Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, accomplishing a 300 m real-world mission with only two interventions.

Updated: 2026-03-02 08:22:03

标题: 城市世界:通过观看城市旅游视频扩展城市模拟

摘要: 城市中的智能体,从送货机器人到四足动物,越来越多地出现在我们的城市中,穿越混乱的街道提供最后一英里的连接。训练这些智能体需要多样化、高保真度的城市环境来扩展,然而现有的人工制作或程序生成的模拟场景要么缺乏可扩展性,要么无法捕捉真实世界的复杂性。我们引入了UrbanVerse,这是一个基于数据驱动的真实到虚拟系统,将众包城市游览视频转换为具有物理感知能力的交互式模拟场景。UrbanVerse包括:(i)UrbanVerse-100K,一个拥有100k+注释的城市3D资产的仓库,具有语义和物理属性,以及(ii)UrbanVerse-Gen,一个自动化流水线,从视频中提取场景布局,并使用检索到的资产实例化度量尺度的3D模拟。在IsaacSim中运行,UrbanVerse提供了来自24个国家的160个高质量构建场景,以及一个由10个艺术家设计的测试场景的策划基准。实验表明,UrbanVerse场景保留了真实世界的语义和布局,实现了与手工制作场景相媲美的人类评估真实感。在城市导航方面,在UrbanVerse中训练的策略表现出规模功率律和强泛化能力,相比先前的方法,在模拟中成功率提高了+6.3%,在零样本模拟到真实转移中提高了+30.1%,仅通过两次干预就完成了300米的真实世界任务。

更新时间: 2026-03-02 08:22:03

领域: cs.CV,cs.AI,cs.RO

下载: http://arxiv.org/abs/2510.15018v2

Adaptive Location Hierarchy Learning for Long-Tailed Mobility Prediction

Human mobility prediction, which aims to forecast users' next location visits based on historical trajectories, is crucial for applications ranging from location-based recommendations to urban planning. While existing mobility prediction models excel at capturing sequential patterns through diverse architectures for different scenarios, they are hindered by the long-tailed distribution of location visits, leading to biased predictions and limited applicability. This highlights the need for a solution that enhances the long-tailed prediction capabilities of these models with broad compatibility and efficiency across diverse architectures. To address this need, we propose the first architecture-agnostic plugin for long-tailed human mobility prediction, named Adaptive LOcation HierArchy learning (ALOHA). Inspired by Maslow's theory of human motivation, we exploit and explore common mobility knowledge of head and tail locations derived from human mobility trajectories to effectively mitigate long-tailed bias. Specifically, we introduce an automatic pipeline to construct city-tailored location hierarchies based on Large Language Models (LLMs) and Chain-of-Thought (CoT) prompts, capturing high-level mobility semantics with minimal human verification. We further design an Adaptive Hierarchical Loss (AHL) that rebalances learning through Gumbel disturbance and node-wise adaptive weighting, enabling both exploitation of multi-level signals and exploration within semantically related groups. Extensive experiments across multiple state-of-the-art models demonstrate that ALOHA consistently improves long-tailed mobility prediction performance by up to 16.59% while maintaining efficiency and robustness. Our code is at https://github.com/Star607/ALOHA.

Updated: 2026-03-02 08:19:37

标题: 长尾移动预测的自适应位置层次学习

摘要: 人类移动性预测对于从基于位置的推荐到城市规划等应用至关重要,旨在根据历史轨迹预测用户的下一个位置访问。虽然现有的移动性预测模型在不同场景下通过多样化的架构捕捉序列模式方面表现优异,但它们受到位置访问的长尾分布的阻碍,导致预测偏见和适用性有限。这突显了需要一种解决方案,通过广泛的兼容性和效率增强这些模型的长尾预测能力。为了解决这一需求,我们提出了第一个面向架构的长尾人类移动性预测插件,名为自适应位置层次学习(ALOHA)。受到马斯洛人类动机理论的启发,我们利用和探索从人类移动性轨迹中衍生的头部和尾部位置的共同移动性知识,有效缓解长尾偏见。具体而言,我们引入了一个自动管道,基于大型语言模型(LLMs)和思维链(CoT)提示构建城市定制的位置层次结构,通过最少的人工验证捕捉高级移动性语义。我们进一步设计了一个自适应层次损失(AHL),通过贡贝尔扰动和节点智能自适应加权来重新平衡学习,实现多级信号的利用和在语义相关组内的探索。跨多个最先进模型的大量实验表明,ALOHA能够在保持效率和稳健性的同时,将长尾移动性预测性能提高高达16.59%。我们的代码位于 https://github.com/Star607/ALOHA。

更新时间: 2026-03-02 08:19:37

领域: cs.AI

下载: http://arxiv.org/abs/2505.19965v2
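
The ALOHA abstract above describes a hierarchical loss with Gumbel disturbance but gives no formula. As a rough illustration only, the sketch below combines a leaf-level negative log-likelihood with a group-level one over a two-level location hierarchy, and optionally perturbs the logits with Gumbel noise. The function names, the `group_weight` and `noise_scale` parameters, and the two-level hierarchy are all assumptions made for this example; the paper's node-wise adaptive weighting is not reproduced.

```python
import math


def gumbel_noise(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_loss(logits, leaf_to_group, target,
                      group_weight=0.5, rng=None, noise_scale=0.0):
    """Leaf-level NLL plus a weighted group-level NLL (hypothetical form).

    logits        -- one score per leaf location
    leaf_to_group -- group id of each leaf (same length as logits)
    target        -- index of the true leaf location
    rng           -- random.Random instance; noise is applied only when it
                     is given and noise_scale > 0
    """
    if rng is not None and noise_scale > 0.0:
        logits = [x + noise_scale * gumbel_noise(rng) for x in logits]
    probs = softmax(logits)
    leaf_nll = -math.log(probs[target])
    # Probability of the target's group = sum over the leaves in that group.
    group = leaf_to_group[target]
    group_prob = sum(p for p, g in zip(probs, leaf_to_group) if g == group)
    group_nll = -math.log(group_prob)
    return leaf_nll + group_weight * group_nll
```

With four equally scored locations split into two groups, the leaf term is log 4 and the group term is log 2, so predictions that land in the right group but on the wrong location are penalized less than predictions in the wrong group.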

Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably as Turing test judges. In response, we propose an interpretable model that leverages the fine-grained human-likeness ratings and delivers accurate and transparent human-vs-machine discrimination, offering a powerful tool for automatic human-likeness evaluation. Our work establishes the first human-likeness evaluation for S2S systems and moves beyond binary outcomes to enable detailed diagnostic insights, paving the way for human-like improvements in conversational AI systems.

Updated: 2026-03-02 08:18:24

标题: 人还是机器?语音交互的初步图灵测试

摘要: 追求类似于人类对话代理人的目标长期以来一直受到图灵测试的指导。对于现代语音对话系统,一个关键但尚未解答的问题是它们是否能像人类一样进行对话。为了解决这个问题,我们进行了第一次针对语音对话系统的图灵测试,收集了2968个有关9个最先进语音对话系统和28名人类参与者对话的人类评价。我们的结果得出一个清晰的结论:目前没有经过评估的语音对话系统通过了测试,揭示了与人类相似性存在显著差距。为了诊断这一失败,我们制定了一个包含18个人类相似性维度的细粒度分类法,并根据此分类法对我们收集的对话进行了众包标注。我们的分析表明,瓶颈不在于语义理解,而是来源于语用特征、情感表达能力和对话人设。此外,我们发现现成的人工智能模型在图灵测试评判方面表现不可靠。作为回应,我们提出了一个可解释的模型,利用细致的人类相似性评分,提供准确透明的人机辨别,为自动人类相似性评估提供了一个强大工具。我们的工作为S2S系统建立了第一个人类相似性评估,并超越了二元结果,实现了详细的诊断洞察,为对话人工智能系统的人类化改进铺平了道路。

更新时间: 2026-03-02 08:18:24

领域: cs.AI,cs.SD

下载: http://arxiv.org/abs/2602.24080v2

On the Reasoning Abilities of Masked Diffusion Language Models

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.

Updated: 2026-03-02 08:17:44

标题: 关于掩码扩散语言模型的推理能力

摘要: 掩盖扩散模型(MDMs)为文本提供了一种引人注目的替代传统自回归语言模型的选择。并行生成使它们高效,但它们的计算能力及其并行性固有的限制仍然大部分未被探索。为此,我们表征了MDMs能够证明解决的推理问题类型及其效率。我们通过将MDMs与有限精度对数宽度设置中的思维链(CoT)和填充循环变压器(PLTs)等众所周知的推理框架相连接来实现这一点:我们展示了在这种设置下,MDMs和多项式填充的PLTs实际上是等效的,并且MDMs可以解决所有CoT增强变压器可以解决的问题。此外,我们展示了一些问题类别(包括常规语言),对于这些问题,MDMs比CoT变压器固有更高效,其中并行生成允许更快速地推理。

更新时间: 2026-03-02 08:17:44

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2510.13117v2

FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems

Training-free diffusion priors enable inverse-problem solvers without retraining, but for nonlinear forward operators data consistency often relies on repeated derivatives or inner optimization/MCMC loops with conservative step sizes, incurring many iterations and denoiser/score evaluations. We propose a training-free solver that replaces these inner loops with a hard measurement-space feasibility constraint (closed-form projection) and an analytic, model-optimal step size, enabling a small, fixed compute budget per noise level. Anchored at the denoiser prediction, the correction is approximated via an adjoint-free, ADMM-style splitting with projection and a few steepest-descent updates, using one VJP and either one JVP or a forward-difference probe, followed by backtracking and decoupled re-annealing. We prove local model optimality and descent under backtracking for the step-size rule, and derive an explicit KL bound for mode-substitution re-annealing under a local Gaussian conditional surrogate. We also develop a latent variant and a one-parameter pixel-to-latent hybrid schedule. Experiments achieve competitive PSNR/SSIM/LPIPS with up to 19.5× speedup, without hand-coded adjoints or inner MCMC.

Updated: 2026-03-02 08:17:26

标题: FAST-DIPS:扩散先验逆问题的无伴随解析步骤和硬约束似然校正

摘要: 无需训练的扩散先验使得反问题求解器可以在无需重新训练的情况下使用,但对于非线性正演算子,数据一致性通常依赖于重复导数或内部优化/MCMC循环,其步长保守,需要许多迭代和去噪器/评分评估。我们提出了一种无需训练的求解器,它用一个硬测量空间可行性约束(封闭形式投影)和一个解析的、模型最优的步长替代了这些内部循环,从而在每个噪声水平上实现了一个小的、固定的计算预算。在去噪器预测的基础上,通过一个无伴随的、ADMM风格的拆分来近似校正,其中包括投影和一些最陡下降的更新,使用一个VJP和一个JVP或一个前向差分探测,随后进行回溯和解耦重退火。我们证明了在回溯下的步长规则的局部模型最优性和下降,并在局部高斯条件代理下推导了一种明确的KL界限,用于模式替代重退火。我们还开发了一个潜在的变体和一个一参数像素$\rightarrow$潜在的混合时间表。实验结果实现了竞争性的PSNR/SSIM/LPIPS,速度提高了多达19.5倍,而无需手工编码的伴随或内部MCMC。

更新时间: 2026-03-02 08:17:26

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2603.01591v1

IDProxy: Cold-Start CTR Prediction for Ads and Recommendation at Xiaohongshu with Multimodal LLMs

Click-through rate (CTR) models in advertising and recommendation systems rely heavily on item ID embeddings, which struggle in item cold-start settings. We present IDProxy, a solution that leverages multimodal large language models (MLLMs) to generate proxy embeddings from rich content signals, enabling effective CTR prediction for new items without usage data. These proxies are explicitly aligned with the existing ID embedding space and are optimized end-to-end under CTR objectives together with the ranking model, allowing seamless integration into existing large-scale ranking pipelines. Offline experiments and online A/B tests demonstrate the effectiveness of IDProxy, which has been successfully deployed in both Content Feed and Display Ads features of Xiaohongshu's Explore Feed, serving hundreds of millions of users daily.

Updated: 2026-03-02 08:16:49

标题: IDProxy:多模态LLMs在小红书上广告和推荐的冷启动CTR预测

摘要: 点击率(CTR)模型在广告和推荐系统中广泛使用项目ID嵌入,但在项目冷启动情况下效果有限。我们提出了IDProxy,这是一种利用多模态大型语言模型(MLLMs)从丰富的内容信号生成代理嵌入的解决方案,使得在没有使用数据的情况下可以有效预测新项目的CTR。这些代理嵌入明确与现有的ID嵌入空间对齐,并在CTR目标下与排名模型一起进行端到端优化,从而可以无缝集成到现有的大规模排名管道中。离线实验和在线A/B测试表明了IDProxy的有效性,该方法已成功部署在小红书探索动态和展示广告功能中,每天为数亿用户提供服务。

更新时间: 2026-03-02 08:16:49

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2603.01590v1

SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond

The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their scientific safety. Existing benchmarks often suffer from limited risk coverage and a reliance on subjective evaluation. To address these problems, we introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement. SafeSciBench distinguishes between safety knowledge and risk to cover extensive scopes and employs objective metrics such as deterministically answerable questions to mitigate evaluation bias. We evaluate 24 advanced LLMs, revealing critical vulnerabilities in current models. We also observe that LLMs exhibit varying degrees of excessive refusal behaviors on safety-related issues. For safety enhancement, we demonstrate that fine-tuning on SafeSciTrain significantly enhances the safety alignment of models. Finally, we argue that knowledge is a double-edged sword, and determining the safety of a scientific question should depend on specific context, rather than universally categorizing it as safe or unsafe. Our work provides both a diagnostic tool and a practical resource for building safer scientific AI systems.

Updated: 2026-03-02 08:16:04

标题: SafeSci:科学领域及其他领域大型语言模型的安全评估

摘要: 大型语言模型(LLMs)在科学领域的成功加剧了安全担忧,促使许多基准测试来评估它们的科学安全性。现有的基准测试往往存在风险覆盖有限和依赖主观评估的问题。为了解决这些问题,我们引入了SafeSci,一个用于科学环境中安全评估和增强的综合框架。SafeSci包括SafeSciBench,一个跨学科基准测试,含有25万个样本,以及SafeSciTrain,一个包含150万个样本用于安全增强的大规模数据集。SafeSciBench区分安全知识和风险以覆盖广泛范围,并采用客观度量标准,如确定性可回答的问题来减轻评估偏见。我们评估了24个先进的LLMs,揭示了当前模型中的关键漏洞。我们还观察到LLMs在安全相关问题上表现出不同程度的过度拒绝行为。对于安全增强,我们展示了在SafeSciTrain上微调显著增强了模型的安全对齐性。最后,我们认为知识是一把双刃剑,确定科学问题的安全性应该取决于具体的背景,而不是普遍将其分类为安全或不安全。我们的工作为构建更安全的科学人工智能系统提供了诊断工具和实用资源。

更新时间: 2026-03-02 08:16:04

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.01589v1

Jump Like A Squirrel: Optimized Execution Step Order for Anytime Random Forest Inference

Due to their efficiency and small size, decision trees and random forests are popular machine learning models for classification on resource-constrained systems. In such systems, the execution time available for random forest inference might not be sufficient for a complete model execution. Ideally, the prediction confidence already gained should be retained. An anytime algorithm can be aborted at any time while giving a result whose quality increases over time. Previous approaches have realized random forests as anytime algorithms at the granularity of trees, stopping after some but not all trees of a forest have been executed. However, because decision trees subdivide the sample space in every step, prediction quality increases with every additional step within a tree. In this paper, we realize decision trees and random forests as anytime algorithms at the granularity of single steps in trees. This approach opens a design space for defining the step order in a forest, which has the potential to optimize the mean accuracy. We propose the Optimal Order, which finds a step order with maximal mean accuracy in exponential runtime, and the polynomial-runtime heuristics Forward Squirrel Order and Backward Squirrel Order, which greedily maximize the accuracy for each additional step taken down and up the trees, respectively. Our evaluation shows that the Backward Squirrel Order performs ~94% as well as the Optimal Order and ~99% as well as all other step orders.

Updated: 2026-03-02 08:15:37

标题: 像松鼠一样跳跃:任意时间随机森林推理的优化执行步骤顺序

摘要: 由于它们的效率和小尺寸,决策树和随机森林是用于资源受限系统上的分类的流行机器学习模型。在这种系统中,随机森林推断的可用执行时间可能不足以完全执行模型。理想情况下,已获得的预测置信度应该保留。设计了一种任意时间算法,可以随时中止,同时随着时间的推移提供质量不断提高的结果。先前的方法已经将随机森林实现为在树的粒度上的任意时间算法,在执行了一些但不是所有树之后停止。然而,由于决策树在每一步中如何划分样本空间,因此在一个树中的每一步增加预测质量。在本文中,我们将决策树和随机森林实现为在树的单个步骤的粒度上的任意时间算法。这种方法打开了一个设计空间,可以定义森林中的步骤顺序,从而有潜力优化平均准确率。我们提出了最佳顺序,它在指数运行时间内找到具有最大平均准确率的步骤顺序,以及在多项式运行时间内贪婪地分别最大化树向下和向上进行每个额外步骤的准确率的前向松鼠顺序和后向松鼠顺序。我们的评估显示,后向松鼠顺序的性能大约为最佳顺序的94%,大约为所有其他步骤顺序的99%。

更新时间: 2026-03-02 08:15:37

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.01588v1
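
The step-granular anytime inference described in the abstract above can be sketched as follows, assuming that every tree node (not only the leaves) stores the class counts of the training samples that reached it. The `Node` class, the `anytime_predict` function, and the idea of passing the step order as a list of tree indices are hypothetical choices for this illustration; how the Squirrel Orders actually compute that order is not shown here.

```python
class Node:
    """Decision-tree node that keeps class counts even when it is internal."""

    def __init__(self, class_counts, feature=None, threshold=None,
                 left=None, right=None):
        self.class_counts = class_counts  # training class counts at this node
        self.feature = feature            # index of the feature to test
        self.threshold = threshold        # go left if value <= threshold
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None


def anytime_predict(roots, sample, step_order, budget):
    """Execute at most `budget` single steps, then vote from current nodes.

    step_order is a list of tree indices; each entry advances that tree by
    one level. A step scheduled for a tree that already sits at a leaf is
    simply skipped, but still consumes budget in this sketch.
    """
    current = list(roots)
    for tree_index in step_order[:budget]:
        node = current[tree_index]
        if node.is_leaf():
            continue
        if sample[node.feature] <= node.threshold:
            current[tree_index] = node.left
        else:
            current[tree_index] = node.right

    # Sum the normalized class distributions of the nodes reached so far.
    n_classes = len(roots[0].class_counts)
    votes = [0.0] * n_classes
    for node in current:
        total = sum(node.class_counts)
        for c, count in enumerate(node.class_counts):
            votes[c] += count / total
    return max(range(n_classes), key=votes.__getitem__)
```

Aborting with `budget=0` yields the prior vote of the root nodes, and every further step can only refine the distribution of one tree, which is the property the step order tries to exploit.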

KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address these two issues effectively. Meanwhile, although embodied intelligence is the bridge between AI and the physical world, existing work in the area has overlooked the application of robotic kinematics. To address these issues, we combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27% to 37% acceleration with almost no loss in success rate.

Updated: 2026-03-02 08:12:03

标题: KERV:动力学校正的具身VLA模型的推测解码

摘要: 视觉-语言-动作(VLA)模型构建了一个基于标记域的机器人控制范式,但速度较慢。推测解码(SD)是一种可以提高推理速度的优化策略。将VLA和SD整合时出现了两个关键问题:首先,SD依赖重新推理来解决标记错误,这在计算上是昂贵的;其次,为了减少标记错误,SD中的接受阈值需要仔细调整。现有工作未能有效解决上述两个问题。同时,作为人工智能和物理世界之间的桥梁,现有的具体智能忽视了机器人运动学的应用。为了解决这些问题,我们将标记域VLA模型与运动学域预测相结合,提出了一个名为KERV的运动学校正SD框架。我们采用基于运动学的卡尔曼滤波器来预测动作并补偿SD错误,避免了昂贵的重新推理。此外,我们设计了一个基于运动学的调整策略来动态校正接受阈值,解决了阈值确定的困难。跨越不同任务和环境的实验结果表明,KERV实现了27%~37%的加速,几乎没有成功率损失。

更新时间: 2026-03-02 08:12:03

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2603.01581v1
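
The KERV abstract above mentions a kinematics-based Kalman filter that predicts actions, without giving its state model. Purely as an assumption for illustration, the sketch below uses a textbook one-dimensional constant-velocity Kalman filter over a single action coordinate. The class name, the noise variances, and the choice of a constant-velocity model are hypothetical and not taken from the paper.

```python
class ConstantVelocityKalman:
    """1-D constant-velocity Kalman filter with state (position p, velocity v)."""

    def __init__(self, dt=1.0, process_var=1e-4, meas_var=1e-2, init_var=100.0):
        self.dt = dt
        self.q = process_var   # process noise added to each diagonal entry
        self.r = meas_var      # variance of the position measurement
        self.p = 0.0
        self.v = 0.0
        # State covariance [[P00, P01], [P10, P11]].
        self.P = [[init_var, 0.0], [0.0, init_var]]

    def predict(self):
        """Advance the state by one time step and return the predicted position."""
        dt = self.dt
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        # P <- F P F^T + q I, with F = [[1, dt], [0, 1]].
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r              # innovation variance
        k0, k1 = a / s, c / s       # Kalman gain for p and v
        y = z - self.p              # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P.
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]
```

In a decoding loop, one would call `predict()` before each step and `update()` with every accepted action value; the predicted position could then stand in for a rejected draft value instead of triggering re-inference, which is the role the abstract assigns to the filter.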

SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis

Generating realistic and structurally plausible human images into existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.

Updated: 2026-03-02 08:07:08

标题: SkeleGuide:面向上下文感知的人体在场景图像合成的明确骨架推理

摘要: 将逼真且结构合理的人类图像生成到现有场景中仍然是当前生成模型面临的重要挑战,这些模型往往会产生扭曲的肢体和不自然的姿势等缺陷。我们将这种系统性失败归因于无法对人类骨骼结构进行显式推理。为了解决这个问题,我们引入了SkeleGuide,这是一个建立在显式骨骼推理基础上的新型框架。通过对其推理和渲染阶段进行联合训练,SkeleGuide学会了生成一个内部姿势,作为强大的结构先验,指导合成图像保持高度的结构完整性。为了实现细粒度用户控制,我们引入了PoseInverter,这是一个将内部潜在姿势解码为显式可编辑格式的模块。大量实验证明,SkeleGuide在生成高保真、具有上下文感知的人类图像方面明显优于专门和通用模型。我们的工作提供了令人信服的证据,即显式建模骨骼结构是实现稳健和合理的人类图像合成的基本步骤。

更新时间: 2026-03-02 08:07:08

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.01579v1

Reasoning on Time-Series for Financial Technical Analysis

While Large Language Models have been used to produce interpretable stock forecasts, they mainly focus on analyzing textual reports rather than historical price data, the analysis of which is known as Technical Analysis. This task is challenging as it switches between domains: the stock price inputs and outputs lie in the time-series domain, while the reasoning step should be in natural language. In this work, we introduce Verbal Technical Analysis (VTA), a novel framework that combines verbal and latent reasoning to produce stock time-series forecasts that are both accurate and interpretable. To reason over time-series, we convert stock price data into textual annotations and optimize the reasoning trace using an inverse Mean Squared Error (MSE) reward objective. To produce time-series outputs from textual reasoning, we condition the outputs of a time-series backbone model on the reasoning-based attributes. Experiments on stock datasets across U.S., Chinese, and European markets show that VTA achieves state-of-the-art forecasting accuracy, while the reasoning traces also perform well on evaluation by industry experts.

Updated: 2026-03-02 08:07:02

标题: 对金融技术分析中的时间序列数据进行推理

摘要: 虽然大型语言模型已被用于生成可解释的股票预测,但它们主要集中在分析文本报告,而不是历史价格数据,也就是所谓的技术分析。这项任务具有挑战性,因为它在不同领域之间切换:股票价格的输入和输出位于时间序列领域,而推理步骤应该是自然语言。在这项工作中,我们引入了口头技术分析(VTA),这是一个结合口头和潜在推理的新框架,用于产生既准确又可解释的股票时间序列预测。为了对时间序列进行推理,我们将股票价格数据转换为文本注释,并利用逆均方误差(MSE)奖励目标来优化推理追踪。为了从文本推理中产生时间序列输出,我们将时间序列骨干模型的输出条件设置为基于推理的属性。对美国、中国和欧洲市场的股票数据集进行的实验表明,VTA实现了最先进的预测准确性,而推理追踪在行业专家评估中表现良好。

更新时间: 2026-03-02 08:07:02

领域: q-fin.ST,cs.AI,cs.LG,q-fin.CP

下载: http://arxiv.org/abs/2511.08616v2
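
The VTA abstract above refers to an inverse MSE reward without stating its exact form. One plausible reading, shown here only as an assumption, is a reward that equals 1 for a perfect forecast and decays towards 0 as the mean squared error grows. The function name and the `1 / (1 + MSE)` form are hypothetical.

```python
def inverse_mse_reward(forecast, actual):
    """Map forecast error to a reward in (0, 1]; higher means a better forecast.

    Uses 1 / (1 + MSE), one of several possible "inverse MSE" forms.
    """
    if len(forecast) != len(actual) or not forecast:
        raise ValueError("forecast and actual must be non-empty and equal length")
    mse = sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(forecast)
    return 1.0 / (1.0 + mse)
```

Adding 1 in the denominator keeps the reward bounded when the forecast is exact, which avoids a division by zero that a bare `1 / MSE` would hit.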

From Passive to Proactive: A Hierarchical Multi-Agent Framework for Automated Medical Pre-Consultation

The post-pandemic surge in healthcare demand, coupled with critical nursing shortages, has placed unprecedented pressure on medical triage systems, necessitating innovative AI-driven solutions. We present a multi-agent interactive intelligent system for medical triage that addresses three fundamental challenges in current AI-based triage systems: inadequate medical specialization leading to misclassification, heterogeneous department structures across healthcare institutions, and inefficient detail-oriented questioning that impedes rapid triage decisions. Our system employs three specialized agents--RecipientAgent, InquirerAgent, and DepartmentAgent--that collaborate through an Inquiry Guidance Mechanism and a Classification Guidance Mechanism to transform unstructured patient symptoms into accurate department recommendations. To ensure robust evaluation, we constructed a comprehensive Chinese medical triage dataset from "Ai Ai Yi Medical Network", comprising 3,360 real-world cases spanning 9 primary departments and 62 secondary departments. Experimental results demonstrate that our multi-agent system achieves 89.6% accuracy in primary department classification and 74.3% accuracy in secondary department classification after four rounds of patient interaction. The system's dynamic-matching-based guidance mechanisms enable efficient adaptation to diverse hospital configurations while maintaining high triage accuracy. The resulting multi-agent triage system not only adapts to organizational heterogeneity across healthcare institutions but also ensures clinically sound decision-making.

Updated: 2026-03-02 08:03:36

标题: 从被动到主动:用于自动医学预会诊的分层多智能体框架

摘要: 随着后疫情时期医疗需求激增,加上关键护理人员短缺,使医疗分诊系统面临前所未有的压力,需要创新的人工智能驱动解决方案。我们提出了一个多智能体交互式智能系统,用于医疗分诊,解决了目前基于人工智能的分诊系统中的三个基本挑战:医疗专业不足导致误分类,医疗机构之间异构的部门结构,以及细节导向的询问导致快速分诊决策受阻。我们的系统利用三个专门的智能体--RecipientAgent、InquirerAgent和DepartmentAgent--通过询问引导机制和分类引导机制相互协作,将患者的症状转化为准确的部门推荐。为了确保稳健的评估,我们从“艾爱医疗网络”构建了一个综合的中文医疗分诊数据集,包括3,360个跨越9个主要部门和62个次要部门的真实案例。实验结果表明,我们的多智能体系统在四轮患者交互后,在主要部门分类方面达到了89.6%的准确率,在次要部门分类方面达到了74.3%的准确率。系统的动态匹配引导机制使其能够有效适应各种医院配置,同时保持高分诊准确性。我们成功开发了这个多智能体分诊系统,不仅能够适应医疗机构之间的组织异质性,还能确保临床上的正确决策。

更新时间: 2026-03-02 08:03:36

领域: cs.AI

下载: http://arxiv.org/abs/2511.01445v2

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Recent intelligent systems integrate powerful Large Language Models (LLMs) through APIs, but their trustworthiness may be critically undermined by targeted attacks like backdoor and prompt injection attacks, which secretly force LLMs to generate specific malicious sequences. Existing defensive approaches for such threats typically rely on high access rights, impose prohibitive costs, and hinder normal inference, rendering them impractical for real-world scenarios. To overcome these limitations, we introduce DualSentinel, a lightweight and unified defense framework that can accurately and promptly detect the activation of targeted attacks alongside the LLM generation process. We first identify a characteristic of compromised LLMs, termed Entropy Lull: when a targeted attack successfully hijacks the generation process, the LLM exhibits a distinct period of abnormally low and stable token probability entropy, indicating it is following a fixed path rather than making creative choices. DualSentinel leverages this pattern by developing an innovative dual-check approach. It first employs a magnitude- and trend-aware monitoring method to proactively and sensitively flag an entropy lull pattern at runtime. Upon such flagging, it triggers a lightweight yet powerful secondary verification based on task-flipping. An attack is confirmed only if the entropy lull pattern persists across both the original and the flipped task, proving that the LLM's output is coercively controlled. Extensive evaluations show that DualSentinel is both highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost), offering a truly practical path toward securing deployed LLMs. The source code can be accessed at https://doi.org/10.5281/zenodo.18479273.

Updated: 2026-03-02 08:02:47

标题: DualSentinel:一种用于通过双熵催眠模式检测黑盒LLM中针对性攻击的轻量级框架

摘要: 最近的智能系统通过API集成了强大的大型语言模型(LLMs),但它们的可信度可能会受到针对性攻击的严重破坏,如后门和提示注入攻击,这些攻击秘密地强迫LLMs生成特定的恶意序列。针对这些威胁的现有防御方法通常依赖于高访问权限,施加了不可承受的成本,并阻碍了正常的推理,使它们在现实场景中变得不切实际。为了解决这些限制,我们介绍了DualSentinel,一个轻量级且统一的防御框架,可以准确和及时地检测目标攻击在LLM生成过程中的激活。我们首先确定了被攻击的LLMs的一个特征,称为熵沉寂:当目标攻击成功劫持生成过程时,LLM会展示出一个明显的、异常低且稳定的令牌概率熵期,表明它正在遵循一个固定的路径,而不是做出创造性的选择。DualSentinel利用这一模式开发了一种创新的双重检查方法。它首先采用一种幅度和趋势感知的监控方法,在运行时主动和敏感地标记熵沉寂模式。在发现这种标记时,它触发一个基于任务翻转的轻量级而强大的次级验证。只有在熵沉寂模式在原始任务和翻转任务中都持续存在时,才确认攻击,证明LLM的输出是被强制控制的。广泛的评估显示,DualSentinel既高效(具有接近零误报的优越检测准确率),又高效(额外成本可忽略不计),为保护部署的LLMs提供了一条真正实用的路径。源代码可以在https://doi.org/10.5281/zenodo.18479273上访问。

更新时间: 2026-03-02 08:02:47

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.01574v1
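
The Entropy Lull described in the DualSentinel abstract above is a run of tokens whose probability entropy is both abnormally low and stable. A minimal way to illustrate the idea, assuming per-token probability distributions are available, is a sliding window that checks the mean and the standard deviation of the entropy. The window size, both thresholds, and the function names are hypothetical, and the task-flipping verification step is not reproduced.

```python
import math
import statistics


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def find_entropy_lull(entropies, window=5, max_mean=0.1, max_std=0.05):
    """Return the start index of the first low-and-stable window, or None.

    A window qualifies when its mean entropy is at most max_mean (low
    magnitude) and its population standard deviation is at most max_std
    (stable trend).
    """
    for start in range(len(entropies) - window + 1):
        chunk = entropies[start:start + window]
        if (statistics.fmean(chunk) <= max_mean
                and statistics.pstdev(chunk) <= max_std):
            return start
    return None
```

In a streaming setting the same check would run on the most recent `window` tokens after each generation step, so that a flag can be raised while the model is still producing output.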

A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset that features: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, supported by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; (3) rich instructional diversity, ensured through diverse well-designed question templates, based on human preferences and covering a 3500-topic hierarchy. These characteristics make InterSyn particularly well-suited for training LMMs in interactive image-text generation capabilities. To evaluate the capabilities, we propose SynJudge, a reliable automatic evaluator that aligns closely with human judges and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image-Text Synergy (ITS). These scores are complementary, covering both content and quality as well as cross-modal interaction, thereby forming a comprehensive evaluation framework. Experimental results on InterSyn subsets of up to 200K samples show that 25K-50K samples already yield substantial improvements, while scaling to 100K/200K brings further gains in TCC, ICC, and especially ITS, highlighting InterSyn's (1) scalability, as performance consistently improves with more data, and (2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.

Updated: 2026-03-02 08:02:37

标题: 一个高质量的数据集和可靠的评估方法用于交错生成图像文本

摘要: 最近大型多模型(LMMs)的进展显著提高了多模态理解和生成能力。然而,这些模型仍然难以生成紧密交织的图像-文本输出,主要是由于当前训练数据集的规模、质量和指导丰富度有限。为了解决这个问题,我们引入了InterSyn数据集,具有以下特点:(1)大规模,包括180万个多模态样本;(2)高质量,通过我们提出的自我评估与迭代细化(SEIR)方法来支持严格的自动质量细化;(3)丰富的指导多样性,通过基于人类偏好的多样化设计的问题模板,涵盖了3500个主题层次。这些特点使InterSyn特别适合用于训练具有交互式图像-文本生成能力的LMMs。为了评估这些能力,我们提出了SynJudge,一个可靠的自动评估器,与人类评审紧密对齐,并输出四个可解释的分数:文本内容完整性(TCC)、图像内容完整性(ICC)、图像质量(IQ)和图像-文本协同(ITS)。这些分数是互补的,涵盖了内容和质量以及跨模态交互,从而形成了一个全面的评估框架。对InterSyn子集的实验结果显示,25K-50K样本已经取得了显著改进,而扩展到100K/200K会进一步提高TCC、ICC和特别是ITS,突显了InterSyn的:(1)可扩展性,性能随着更多数据的增加而持续改善;(2)高效性,即使是较小的子集也能获得显著的增益,使其对具有不同计算资源的研究人员可访问。

更新时间: 2026-03-02 08:02:37

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2506.09427v2

QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early-graduate-level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (up to +0.18, +0.20, +0.30, +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models like GPT-5 Pro and Claude Sonnet 4.5 see their performance significantly degrade in discrete domains. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we also release QEDBench as a public benchmark for evaluating and improving AI judges. Our benchmark is publicly available at https://github.com/qqliu/Yale-QEDBench.

Updated: 2026-03-02 07:55:58

标题: QEDBENCH:量化大学级数学证明自动评估中的对齐差距

摘要: 随着大型语言模型(LLMs)在基础基准测试中的饱和,研究前沿已经从生成转向了自动评估的可靠性。我们证明,当应用于大学本科到早期研究生水平的数学时,标准的“LLM作为评判者”协议存在系统性的对齐差距。为了量化这一现象,我们引入了QEDBench,这是第一个大规模的双重评分对齐基准,用于系统地衡量与人类专家对大学水平数学证明的对齐性,通过将专业课程的评分标准与专家共同知识标准进行对比。通过部署一个双重评估矩阵(7名评委 x 5名解决者)对1,000多小时的人类评估进行评估,我们发现某些前沿评估者,如Claude Opus 4.5、DeepSeek-V3、Qwen 2.5 Max和Llama 4 Maverick,展现出显著的正偏差(平均评分膨胀高达+0.18、+0.20、+0.30、+0.36)。此外,我们揭示了在离散领域存在关键推理差距:虽然Gemini 3.0 Pro实现了最先进的性能(0.91的平均人类评估分数),但其他推理模型如GPT-5 Pro和Claude Sonnet 4.5在离散领域的表现明显下降。具体而言,它们在离散数学中的平均人类评估分数降至0.72和0.63,在图论中降至0.74和0.50。除了这些研究结果,我们还将QEDBench作为一个公共基准发布,用于评估和改进AI评判者。我们的基准已经公开发布在https://github.com/qqliu/Yale-QEDBench。

更新时间: 2026-03-02 07:55:58

领域: cs.LG

下载: http://arxiv.org/abs/2602.20629v2
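
The QEDBench abstract above reports "mean score inflation" for LLM judges. Assuming this means the average signed difference between a judge's score and the human expert score on the same proofs (the abstract does not define it), the quantity can be computed as in the sketch below. The function name and the paired-list input format are hypothetical.

```python
def mean_score_inflation(judge_scores, human_scores):
    """Average of (judge score - human score) over paired proofs.

    A positive value means the judge scores more generously than the
    human experts on average; a negative value means it is stricter.
    """
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return sum(diffs) / len(diffs)
```

Because the difference is signed, over-scoring and under-scoring on different proofs can cancel out; a mean absolute difference would be the natural companion metric if agreement rather than bias is of interest.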

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is released on GitHub (https://github.com/Don-Joey/Mix-GRM).

Updated: 2026-03-02 07:54:29

标题: 超越长度缩放:将广度和深度结合起来为生成式奖励模型提供协同效应

摘要: 最近对生成性奖励模型(GRMs)的进展表明,通过扩展思维链(CoT)的长度可以显著提高评估的可靠性。然而,目前的研究主要依赖于无结构长度缩放,忽略了不同推理机制的差异效力:广度-CoT(B-CoT,即多维原则覆盖)和深度-CoT(D-CoT,即实质性判断的准确性)。为了解决这一问题,我们引入了Mix-GRM,这是一个框架,通过模块化综合管道将原始理由重新配置为结构化的B-CoT和D-CoT,随后采用监督微调(SFT)和具有可验证奖励的强化学习(RLVR)来内部化和优化这些机制。全面的实验表明,Mix-GRM在五个基准测试中建立了一个新的最先进水平,平均超过领先的开源RM 8.2\%。我们的结果显示了推理中的明显分歧:B-CoT有利于主观偏好任务,而D-CoT在客观正确性任务中表现突出。因此,将推理机制与任务不匹配会直接降低性能。此外,我们展示了RLVR作为一个开关放大器,诱导出一个新兴的极化,模型自发地分配其推理风格以匹配任务需求。合成数据和模型发布在\href{https://huggingface.co/collections/DonJoey/mix-grm}{Hugging Face},代码发布在\href{https://github.com/Don-Joey/Mix-GRM}{Github}。

更新时间: 2026-03-02 07:54:29

领域: cs.AI

下载: http://arxiv.org/abs/2603.01571v1

Adversarial Query Synthesis via Bayesian Optimization

Benchmark workloads are extremely important to the database management research community, especially as more machine learning components are integrated into database systems. Here, we propose a Bayesian optimization technique to automatically search for difficult benchmark queries, significantly reducing the amount of manual effort usually required. In preliminary experiments, we show that our approach can generate queries with more than double the optimization headroom compared to existing benchmarks.

Updated: 2026-03-02 07:50:46

标题: 对抗性查询合成通过贝叶斯优化

摘要: 基准工作负载对数据库管理研究社区非常重要,特别是随着更多的机器学习组件集成到数据库系统中。在这里,我们提出了一种贝叶斯优化技术,用于自动搜索困难的基准查询,大大减少了通常需要的手动工作量。在初步实验中,我们展示了我们的方法可以生成比现有基准测试更多两倍的优化余地的查询。

更新时间: 2026-03-02 07:50:46

领域: cs.DB,cs.LG

下载: http://arxiv.org/abs/2603.01570v1

Rate-Distortion Signatures of Generalization and Information Trade-offs

Generalization to novel visual conditions remains a central challenge for both human and machine vision, yet standard robustness metrics offer limited insight into how systems trade accuracy for robustness. We introduce a rate-distortion-theoretic framework that treats stimulus-response behavior as an effective communication channel, derives rate-distortion (RD) frontiers from confusion matrices, and summarizes each system with two interpretable geometric signatures, slope (β) and curvature (κ), which capture the marginal cost and abruptness of accuracy-robustness trade-offs. Applying this framework to human psychophysics and 18 deep vision models under controlled image perturbations, we compare generalization geometry across model architectures and training regimes. We find that both biological and artificial systems follow a common lossy-compression principle but occupy systematically different regions of RD space. In particular, humans exhibit smoother, more flexible trade-offs, whereas modern deep networks operate in steeper and more brittle regimes even at matched accuracy. Across training regimes, robustness training induces systematic but dissociable shifts in β and κ, revealing cases where improved robustness or accuracy does not translate into more human-like generalization geometry. These results demonstrate that RD geometry provides a compact, model-agnostic lens for comparing generalization behavior across systems beyond standard accuracy-based metrics.

Updated: 2026-03-02 07:48:39

Categories: cs.LG,cs.CV,cs.IT,q-bio.NC

Download: http://arxiv.org/abs/2603.01568v1
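
As a rough illustration of the rate-distortion entry above, the sketch below computes a single operating point from a confusion matrix: the rate as the mutual information between stimulus and response, and the distortion as the error rate. It is a minimal sketch under stated assumptions, not the authors' procedure. The function name `rate_and_distortion`, the Hamming (0/1) distortion, and the use of empirical stimulus frequencies are all assumptions, and the slope and curvature signatures are not reproduced here.

```python
import numpy as np

def rate_and_distortion(confusion):
    """Return (rate in bits, distortion) for a square stimulus-by-response count matrix.

    Assumption: rows are true classes, columns are responses, in the same order.
    """
    joint = np.asarray(confusion, dtype=float)
    joint = joint / joint.sum()               # joint distribution p(x, y)
    px = joint.sum(axis=1, keepdims=True)     # stimulus marginal p(x), shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)     # response marginal p(y), shape (1, n)
    nz = joint > 0                            # skip empty cells: 0 * log 0 is taken as 0
    rate = float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))
    distortion = 1.0 - float(np.trace(joint)) # Hamming distortion = error rate
    return rate, distortion

# A perfect two-class observer transmits 1 bit with no errors.
print(rate_and_distortion([[50, 0], [0, 50]]))
# An observer guessing at random transmits 0 bits with 50% errors.
print(rate_and_distortion([[25, 25], [25, 25]]))
```

Evaluating this at several perturbation strengths would give a set of (rate, distortion) points; how the paper turns such points into a frontier is not specified in the abstract.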

From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions

Large Language Models (LLMs) are increasingly deployed as agentic systems that plan, memorize, and act in open-world environments. This shift brings new security problems: failures are no longer only unsafe text generation, but can become real harm through tool use, persistent memory, and interaction with untrusted web content. In this survey, we provide a transition-oriented view from Secure Agentic AI to a Secure Agentic Web. We first summarize a component-aligned threat taxonomy covering prompt abuse, environment injection, memory attacks, toolchain abuse, model tampering, and agent network attacks. We then review defense strategies, including prompt hardening, safety-aware decoding, privilege control for tools and APIs, runtime monitoring, continuous red-teaming, and protocol-level security mechanisms. We further discuss how these threats and mitigations escalate in the Agentic Web, where delegation chains, cross-domain interactions, and protocol-mediated ecosystems amplify risks via propagation and composition. Finally, we highlight open challenges for web-scale deployment, such as interoperable identity and authorization, provenance and traceability, ecosystem-level response, and scalable evaluation under adaptive adversaries. Our goal is to connect recent empirical findings with system-level requirements, and to outline practical research directions toward trustworthy agent ecosystems.

Updated: 2026-03-02 07:44:18

Categories: cs.CR

Download: http://arxiv.org/abs/2603.01564v1

LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness like mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimation. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.

Updated: 2026-03-02 07:42:55

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01563v1

A Diagnostic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm

The Error Diffusion Learning Algorithm (EDLA) is a learning scheme that performs synaptically local weight updates driven by a single, globally defined error signal. Although originally proposed as an alternative to backpropagation, its behavior has not been systematically characterized. We provide a modern formulation and implementation of EDLA and evaluate multilayer perceptrons trained with EDLA on parity, regression, and image-classification benchmarks (Digits, MNIST, Fashion-MNIST, and CIFAR-10). Following the original formulation, multi-class classification is implemented by training independent single-output networks (one per class), which makes the computational cost scale linearly with the number of classes. Under comparable architectures and training protocols, EDLA consistently underperforms backpropagation-trained baselines on all benchmarks considered. Through an analysis of internal dynamics, we identify a depth-related failure mode in ReLU-based EDLA: activations can grow explosively, causing unstable training and degraded accuracy. To mitigate this instability, we incorporate root mean square normalization (RMSNorm) into EDLA training. RMSNorm substantially improves numerical stability and expands the depth range in which EDLA can be trained, but it does not close the accuracy gap and retains the overhead of the parallel-network implementation. Overall, we offer a diagnostic evaluation of where and why global error diffusion breaks down in deep networks, providing guidance for future development of local, biologically inspired learning rules.

Updated: 2026-03-02 07:41:58

Categories: cs.LG

Download: http://arxiv.org/abs/2504.14814v4
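
RMSNorm, which the EDLA abstract above adds to curb exploding activations, rescales each activation vector by its root mean square. A minimal NumPy sketch follows. The function name, the optional `gain` parameter, and the `eps` value are illustrative assumptions, not details taken from the paper, and the sketch does not show where the normalization sits inside an EDLA network.

```python
import numpy as np

def rms_norm(x, gain=None, eps=1e-8):
    """Divide each row of x by its root mean square (computed over the last axis)."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    y = x / rms
    return y if gain is None else y * gain   # optional learned per-feature scale

# Activations that differ by a factor of 100 map to the same normalized vector,
# which is why the layer limits explosive growth with depth.
small = rms_norm([[3.0, 4.0]])
large = rms_norm([[300.0, 400.0]])
print(np.allclose(small, large))
```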

RubricBench: Aligning Model-Generated Rubrics with Human Standards

As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.

Updated: 2026-03-02 07:39:49

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01562v1

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time through targeted manipulation of hidden states, offering a lightweight alternative to expensive retraining. However, existing steering frameworks suffer from critical limitations: computational inefficiency, limited extensibility, and restricted functionality that hinder both research progress and practical deployment. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM. Our system features a modular architecture with pluggable interfaces for both analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight application domains, and an interactive demonstration system. Through deep integration with vLLM's optimized inference engine, EasySteer achieves 10.8-22.3$\times$ speedup over existing frameworks. Extensive experiments demonstrate its effectiveness in overthinking mitigation, hallucination reduction, and other key applications. EasySteer transforms steering from a research technique into a production-ready capability, establishing critical infrastructure for deployable, controllable language models.

Updated: 2026-03-02 07:38:42

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.25175v2

WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning

Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.

Updated: 2026-03-02 07:35:28

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19473v3

Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring

Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. We benchmark three approaches: zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations. The results reveal a striking decoupling between conventional metrics and clinical event fidelity. Models that achieve high semantic similarity scores often exhibit near-zero abnormality recall. In contrast, the vision-based approach demonstrates the strongest event alignment, achieving 45.7% abnormality recall and 100% duration recall. These findings underscore the importance of event-aware evaluation to ensure reliable clinical time-series summarization.

Updated: 2026-03-02 07:33:11

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01557v1
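
The event-based evaluation described in the clinical time-series entry above can be pictured with a small sketch: derive sustained abnormal events from a threshold plus a persistence rule, then measure how many of them a summary mentions. Everything concrete here is an assumption for illustration. The function names, the single upper threshold, the day-index representation of "mentions", and the sample readings are hypothetical and are not taken from the paper or the TIHM-1.5 dataset.

```python
def sustained_events(values, threshold, min_days):
    """Return (start, end) index pairs where values exceed threshold for >= min_days in a row."""
    events, start = [], None
    # A sentinel below any threshold closes a run that reaches the end of the series.
    for i, v in enumerate(list(values) + [float("-inf")]):
        if v > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_days:
                events.append((start, i - 1))
            start = None
    return events

def abnormality_recall(true_events, mentioned_days):
    """Fraction of true events with at least one day flagged in the summary."""
    if not true_events:
        return 1.0
    hits = sum(any(s <= d <= e for d in mentioned_days) for s, e in true_events)
    return hits / len(true_events)

readings = [98, 101, 102, 103, 99, 104, 98]   # hypothetical daily values
events = sustained_events(readings, threshold=100, min_days=2)
print(events)                                  # [(1, 3)]: the one-day spike at index 5 is ignored
print(abnormality_recall(events, {2}))         # 1.0: the summary flagged a day inside the event
print(abnormality_recall(events, {5}))         # 0.0: the summary only flagged the isolated spike
```

A summary can score well on semantic similarity while scoring 0.0 here, which is the decoupling the abstract reports.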

S5-HES Agent: Society 5.0-driven Agentic Framework to Democratize Smart Home Environment Simulation

The smart home is a key domain within the Society 5.0 vision for a human-centered society. Smart home technologies evolve rapidly, and research should diversify while remaining aligned with Society 5.0 objectives. Democratizing smart home research would engage a broader community of innovators beyond the traditional, limited pool of experts. This shift necessitates inclusive simulation frameworks that support research across diverse fields in industry and academia. However, existing smart home simulators require significant technical expertise, offer limited adaptability, and lack automated evolution, thereby failing to meet the holistic needs of Society 5.0. These constraints impede researchers from efficiently conducting simulations and experiments for security, energy, health, climate, and socio-economic research. To address these challenges, this paper presents the Society 5.0-driven Smart Home Environment Simulator Agent (S5-HES Agent), an agentic simulation framework that transforms traditional smart home simulation through autonomous AI orchestration. The framework coordinates specialized agents through interchangeable large language models (LLMs), enabling natural-language-driven end-to-end smart home simulation configuration without programming expertise. A retrieval-augmented generation (RAG) pipeline with semantic, keyword, and hybrid search retrieves smart home knowledge. Comprehensive evaluation of the S5-HES Agent demonstrates that the RAG pipeline achieves near-optimal retrieval fidelity, that simulated device behaviour and threat scenarios align with real-world IoT datasets, and that the simulation engine scales predictably across home configurations, establishing a stable foundation for Society 5.0 smart home research. Source code is available under the MIT License at https://github.com/AsiriweLab/S5-HES-Agent.

Updated: 2026-03-02 07:30:09

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01554v1

State-Action Inpainting Diffuser for Continuous Control with Delay

Signal delay poses a fundamental challenge in continuous control and reinforcement learning (RL) by introducing a temporal gap between interaction and perception. Current solutions have largely evolved along two distinct paradigms: model-free approaches which utilize state augmentation to preserve Markovian properties, and model-based methods which focus on inferring latent beliefs via dynamics modeling. In this paper, we bridge these perspectives by introducing State-Action Inpainting Diffuser (SAID), a framework that integrates the inductive bias of dynamics learning with the direct decision-making capability of policy optimization. By formulating the problem as a joint sequence inpainting task, SAID implicitly captures environmental dynamics while directly generating consistent plans, effectively operating at the intersection of model-based and model-free paradigms. Crucially, this generative formulation allows SAID to be seamlessly applied to both online and offline RL. Extensive experiments on delayed continuous control benchmarks demonstrate that SAID achieves state-of-the-art and robust performance. Our study suggests a new methodology to advance the field of RL with delay.

Updated: 2026-03-02 07:28:27

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.01553v1

Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Aligning these agents via preference-based offline methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory-level DPO provides a signal that is too coarse for precise credit assignment, while step-level DPO is often too myopic to capture the value of multi-step behaviors. To resolve this challenge, we introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level. Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex. This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups. Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods. Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.

Updated: 2026-03-02 07:26:55

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.03253v2
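
The HPL entry above builds on Direct Preference Optimization at the trajectory, step, and group level. The sketch below shows the standard DPO loss for one preference pair. The assumption, not confirmed by the abstract, is that each granularity uses the same form with log-probabilities summed over a whole trajectory, a single step, or an action group. The function name and the default `beta` are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_* are summed log-probabilities under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# With no preference margin the loss equals log 2.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Raising the preferred segment's log-probability lowers the loss.
print(dpo_loss(-3.0, -5.0, -5.0, -5.0))
```

The curriculum described in the abstract would then decide which pairs are fed to this loss first; its exact schedule is not given there, so it is not sketched.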

Extracting Training Dialogue Data from Large Language Model based Task Bots

Large Language Models (LLMs) have been widely adopted to enhance Task-Oriented Dialogue Systems (TODS) by modeling complex language patterns and delivering contextually appropriate responses. However, this integration introduces significant privacy risks, as LLMs, functioning as soft knowledge bases that compress extensive training data into rich knowledge representations, can inadvertently memorize training dialogue data containing not only identifiable information such as phone numbers but also entire dialogue-level events like complete travel schedules. Despite the critical nature of this privacy concern, how LLM memorization is inherited in developing task bots remains unexplored. In this work, we address this gap through a systematic quantitative study that involves evaluating existing training data extraction attacks, analyzing key characteristics of task-oriented dialogue modeling that render existing methods ineffective, and proposing novel attack techniques tailored for LLM-based TODS that enhance both response sampling and membership inference. Experimental results demonstrate the effectiveness of our proposed data extraction attack. Our method can extract thousands of training labels of dialogue states with best-case precision exceeding 70%. Furthermore, we provide an in-depth analysis of training data memorization in LLM-based TODS by identifying and quantifying key influencing factors and discussing targeted mitigation strategies.

Updated: 2026-03-02 07:25:04

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01550v1

Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation

Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.

Updated: 2026-03-02 07:23:53

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2603.01549v1

DFCA: Decentralized Federated Clustering Algorithm

Clustered Federated Learning has emerged as an effective approach for handling heterogeneous data across clients by partitioning them into clusters with similar or identical data distributions. However, most existing methods, including the Iterative Federated Clustering Algorithm (IFCA), rely on a central server to coordinate model updates, which creates a bottleneck and a single point of failure, limiting their applicability in more realistic decentralized learning settings. In this work, we introduce DFCA, a fully decentralized clustered FL algorithm that enables clients to collaboratively train cluster-specific models without central coordination. DFCA uses a sequential running average to aggregate models from neighbors as updates arrive, providing a communication-efficient alternative to batch aggregation while maintaining clustering performance. Our experiments on various datasets demonstrate that DFCA outperforms other decentralized algorithms and performs comparably to centralized IFCA, even under sparse connectivity, highlighting its robustness and practicality for dynamic real-world decentralized networks.

Updated: 2026-03-02 07:22:42

Categories: cs.LG

Download: http://arxiv.org/abs/2510.15300v3
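
The sequential running average mentioned in the DFCA entry above can be illustrated with the usual incremental-mean update, which folds in each neighbor model as it arrives instead of waiting for a full batch. This is a sketch under assumptions: the class name, the flat parameter vectors, and the choice to start the average from the client's own model are hypothetical, and the per-cluster bookkeeping DFCA needs is omitted.

```python
import numpy as np

class RunningAverage:
    """Incremental mean of model parameter vectors, updated one arrival at a time."""

    def __init__(self, own_model):
        self.mean = np.array(own_model, dtype=float)  # copy, so the caller's array is untouched
        self.count = 1

    def update(self, neighbor_model):
        # mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
        self.count += 1
        self.mean += (np.asarray(neighbor_model, dtype=float) - self.mean) / self.count
        return self.mean

avg = RunningAverage([1.0, 1.0])
avg.update([3.0, 5.0])
avg.update([5.0, 0.0])
print(avg.mean)   # [3. 2.], the same as averaging all three models at once
```

Only the current mean and a counter are stored, so a client never has to buffer its neighbors' models.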

Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data

Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) incorporates task specification via task table prompting, (ii) tokenizes cells with table/column metadata, (iii) is pretrained via masked token prediction, and (iv) utilizes a novel Relational Attention mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 93% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experimental analyses show that RT's zero-shot transfer leverages task context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data. Code, models, data: https://github.com/snap-stanford/relational-transformer.

Updated: 2026-03-02 07:22:37

Categories: cs.LG,cs.AI,cs.DB

Download: http://arxiv.org/abs/2510.06377v3

Decoupled Diffusion Sampling for Inverse Problems on Function Spaces

We propose a data-efficient, physics-aware generative framework in function space for inverse PDE problems. Existing plug-and-play diffusion posterior samplers represent physics implicitly through joint coefficient-solution modeling, requiring substantial paired supervision. In contrast, our Decoupled Diffusion Inverse Solver (DDIS) employs a decoupled design: an unconditional diffusion learns the coefficient prior, while a neural operator explicitly models the forward PDE for guidance. This decoupling enables superior data efficiency and effective physics-informed learning, while naturally supporting Decoupled Annealing Posterior Sampling (DAPS) to avoid over-smoothing in Diffusion Posterior Sampling (DPS). Theoretically, we prove that DDIS avoids the guidance attenuation failure of joint models when training data is scarce. Empirically, DDIS achieves state-of-the-art performance under sparse observation, improving $l_2$ error by 11% and spectral error by 54% on average; when data is limited to 1%, DDIS maintains accuracy with 40% advantage in $l_2$ error compared to joint models.

Updated: 2026-03-02 07:21:27

Categories: cs.LG,math.NA

Download: http://arxiv.org/abs/2601.23280v3

Graph-Based Self-Healing Tool Routing for Cost-Efficient LLM Agents

Tool-using LLM agents face a reliability-cost tradeoff: routing every decision through the LLM improves correctness but incurs high latency and inference cost, while pre-coded workflow graphs reduce cost but become brittle under unanticipated compound tool failures. We present Self-Healing Router, a fault-tolerant orchestration architecture that treats most agent control-flow decisions as routing rather than reasoning. The system combines (i) parallel health monitors that assign priority scores to runtime conditions such as tool outages and risk signals, and (ii) a cost-weighted tool graph where Dijkstra's algorithm performs deterministic shortest-path routing. When a tool fails mid-execution, its edges are reweighted to infinity and the path is recomputed -- yielding automatic recovery without invoking the LLM. The LLM is reserved exclusively for cases where no feasible path exists, enabling goal demotion or escalation. Prior graph-based tool-use systems (ControlLLM, ToolNet, NaviAgent) focus on tool selection and planning; our contribution is runtime fault tolerance with deterministic recovery and binary observability -- every failure is either a logged reroute or an explicit escalation, never a silent skip. Across 19 scenarios spanning three graph topologies (linear pipeline, dependency DAG, parallel fan-out), Self-Healing Router matches ReAct's correctness while reducing control-plane LLM calls by 93% (9 vs 123 aggregate) and eliminating the silent-failure cases observed in a well-engineered static workflow baseline under compound failures.

Updated: 2026-03-02 07:21:15

标题: 基于图形的成本效益型LLM代理自愈工具路由

摘要: 工具使用的LLM代理面临可靠性成本权衡:通过LLM路由每个决策可以提高正确性,但会产生高延迟和推理成本,而预编码的工作流图可以减少成本,但在意外的复合工具故障下变得脆弱。我们提出了自愈路由器,这是一种容错编排架构,将大多数代理控制流决策视为路由而不是推理。该系统结合了(i)并行健康监视器,为运行时条件(如工具故障和风险信号)分配优先级分数,以及(ii)成本加权工具图,其中Dijkstra算法执行确定性最短路径路由。当工具在执行中失败时,其边缘被重新赋予无穷大的权重,并重新计算路径,实现自动恢复而无需调用LLM。LLM仅用于不存在可行路径的情况,实现目标降级或升级。以往基于图的工具使用系统(ControlLLM、ToolNet、NaviAgent)侧重于工具选择和规划;我们的贡献是具有确定性恢复和二进制可观测性的运行时容错——每次故障都是已记录的重路由或明确的升级,而不是静态工作流基线在复合故障下观察到的无声故障情况。在涵盖三种图拓扑结构的19个场景中(线性流水线、依赖DAG、并行扇出),自愈路由器在减少控制平面LLM调用93%(总计9次对123次)的同时,与ReAct的正确性相匹配,并消除了在复合故障下观察到的静态工作流基线中的无声故障情况。

更新时间: 2026-03-02 07:21:15

领域: cs.AI,cs.SE

下载: http://arxiv.org/abs/2603.01548v1
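
As a rough illustration of the routing idea in the Self-Healing Router abstract above, the following sketch runs Dijkstra's algorithm over a small cost-weighted tool graph, reweights a failed tool's edges to infinity, and recomputes the path. The graph, the tool names, and the helpers `shortest_path`, `mark_failed`, and `route` are hypothetical and are not taken from the paper; the health monitors and priority scores are omitted.

```python
import heapq
import math

def shortest_path(graph, source, target):
    """Dijkstra over a dict-of-dicts graph; returns (cost, path) or (inf, [])."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nxt, weight in graph.get(node, {}).items():
            if math.isinf(weight):
                continue  # edge touches a failed tool, so it is unusable
            candidate = d + weight
            if candidate < dist.get(nxt, math.inf):
                dist[nxt] = candidate
                prev[nxt] = node
                heapq.heappush(heap, (candidate, nxt))
    if target not in dist:
        return math.inf, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

def mark_failed(graph, tool):
    """Reweight every edge into or out of `tool` to infinity."""
    for u, edges in graph.items():
        for v in edges:
            if u == tool or v == tool:
                edges[v] = math.inf

def route(graph, source, target, log):
    """Log either a reroute or an explicit escalation, never a silent skip."""
    _, path = shortest_path(graph, source, target)
    if not path:
        log.append("escalate: no feasible path, hand off to the LLM")
        return None
    log.append("route: " + " -> ".join(path))
    return path

# Hypothetical tool graph: edge weights stand in for tool cost.
graph = {
    "start": {"search_api": 1.0, "cache": 3.0},
    "search_api": {"summarize": 1.0},
    "cache": {"summarize": 1.0},
    "summarize": {},
}
log = []
first = route(graph, "start", "summarize", log)   # cheapest path
mark_failed(graph, "search_api")
second = route(graph, "start", "summarize", log)  # deterministic reroute
mark_failed(graph, "cache")
third = route(graph, "start", "summarize", log)   # no path left: escalate
```

Every call to `route` appends exactly one log entry, which mirrors the binary observability property the abstract describes: each failure shows up as a logged reroute or as an escalation.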

FictionalQA: A Dataset for Studying Memorization and Knowledge Acquisition

When language models are trained on textual data, they acquire both knowledge about the structure of language and knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can memorize long sequences from their training data verbatim. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset that enables researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically generated, webtext-like documents about fictional events, as well as question-answer pairs about the events. We conduct training experiments showing how synthetic data about fictional events can be useful for studying different forms of memorization. We also document some challenges in effectively building realistic, fictional synthetic data.

Updated: 2026-03-02 07:18:00

标题: 虚构问答:一个用于研究记忆和知识获取的数据集

摘要: 在文本数据上训练语言模型时,它们不仅获取关于语言结构的知识,还获得了关于世界事实的知识。在推理时,它们对事实的知识可以被利用来解决有趣的问题,并为用户执行有用的知识工作。众所周知,语言模型可以逐字记忆其训练数据中的长序列。然而,我们对语言模型如何记忆训练中看到的事实知之甚少。在这项工作中,我们提出了一个新的数据集,专门赋予研究人员研究事实记忆和逐字序列记忆这两个过程的能力。该数据集包括关于虚构事件的合成生成的类似网络文本的文档,以及关于这些事件的问题-答案对。我们进行了培训实验,展示了关于虚构事件的合成数据如何有助于研究不同形式的记忆。我们还记录了在有效构建真实的虚构合成数据时遇到的一些挑战。

更新时间: 2026-03-02 07:18:00

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2506.05639v2

Transmit Weights, Not Features: Orthogonal-Basis Aided Wireless Point-Cloud Transmission

The widespread adoption of depth sensors has substantially lowered the barrier to point-cloud acquisition. This letter proposes a semantic wireless transmission framework for three-dimensional (3D) point clouds built on Deep Joint Source-Channel Coding (DeepJSCC). Instead of sending raw features, the transmitter predicts combination weights over a receiver-side semantic orthogonal feature pool, enabling compact representations and robust reconstruction. A folding-based decoder deforms a 2D grid into 3D, enforcing manifold continuity while preserving geometric fidelity. Trained with Chamfer Distance (CD) and an orthogonality regularizer, the system is evaluated on ModelNet40 across varying Signal-to-Noise Ratios (SNRs) and bandwidths. Results show performance on par with SEmantic Point cloud Transmission (SEPT) at high bandwidth and clear gains in bandwidth-constrained regimes, with consistent improvements in both Peak Signal-to-Noise Ratio (PSNR) and CD. Ablation experiments confirm the benefits of orthogonalization and the folding prior.

Updated: 2026-03-02 07:14:25

标题: 传输权重,而不是特征:正交基辅助的无线点云传输

摘要: 深度传感器的广泛应用大大降低了点云获取的门槛。本信函提出了一个基于深度联合源-信道编码(DeepJSCC)的三维点云的语义无线传输框架。传输器不是发送原始特征,而是在接收端语义正交特征池上预测组合权重,从而实现紧凑的表示和稳健的重建。基于折叠的解码器将2D网格变形为3D,强化流形连续性同时保持几何保真度。系统经过Chamfer距离(CD)和正交性正则化器的训练,在不同信噪比(SNR)和带宽下在ModelNet40上进行评估。结果表明,在高带宽下与SEmantic Point云传输(SEPT)性能相当,在带宽受限制的情况下有明显的增益,Peak Signal-to-Noise Ratio(PSNR)和CD均有持续改进。消融实验证实了正交化和折叠先验的益处。

更新时间: 2026-03-02 07:14:25

领域: cs.LG

下载: http://arxiv.org/abs/2512.03819v2
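
The entry above names three ingredients: combination weights over an orthogonal feature pool, a Chamfer Distance loss, and an orthogonality regularizer. A minimal pure-Python sketch of each follows. It assumes one common Chamfer convention (mean of squared nearest-neighbour distances in both directions) and a Gram-matrix penalty (squared Frobenius norm of B B^T - I); the paper's exact definitions may differ, and the function names are hypothetical.

```python
def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point clouds (lists of tuples).

    Assumed convention: mean squared nearest-neighbour distance, both ways.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    p_to_q = sum(min(sq_dist(a, b) for b in q) for a in p) / len(p)
    q_to_p = sum(min(sq_dist(b, a) for a in p) for b in q) / len(q)
    return p_to_q + q_to_p

def orthogonality_penalty(basis):
    """Squared Frobenius norm of (B B^T - I) for basis rows B (assumed form)."""
    total = 0.0
    for i, bi in enumerate(basis):
        for j, bj in enumerate(basis):
            gram = sum(x * y for x, y in zip(bi, bj))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total

def combine(weights, basis):
    """Receiver side: rebuild a feature as a weighted sum of pool vectors."""
    dim = len(basis[0])
    return [sum(w * b[d] for w, b in zip(weights, basis)) for d in range(dim)]

# Toy receiver-side pool with two orthonormal vectors in 3 dimensions.
pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feature = combine([2.0, 3.0], pool)          # only 2 weights are "transmitted"
penalty = orthogonality_penalty(pool)        # zero for an orthonormal pool
cd_same = chamfer_distance([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)])
cd_shift = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

The point of sending weights is visible in the toy case: the transmitter sends two numbers instead of a three-dimensional feature, and the gap widens as the feature dimension grows relative to the pool size.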

ReVision : A Post-Hoc, Vision-Based Technique for Replacing Unacceptable Concepts in Image Generation Pipeline

Image-generative models are widely deployed across industries. Recent studies show that they can be exploited to produce policy-violating content. Existing mitigation strategies primarily operate at the pre- or mid-generation stages through techniques such as prompt filtering and safety-aware training/fine-tuning. Prior work shows that these approaches can be bypassed and often degrade generative quality. In this work, we propose ReVision, a training-free, prompt-based, post-hoc safety framework for image-generation pipelines. ReVision acts as a last-line defense by analyzing generated images and selectively editing unsafe concepts without altering the underlying generator. It uses the Gemini-2.5-Flash model as a generic policy-violating concept detector, avoiding reliance on multiple category-specific detectors, and performs localized semantic editing to replace unsafe content. Prior post-hoc editing methods often rely on imprecise spatial localization, which undermines usability and limits deployability, particularly in multi-concept scenes. To address this limitation, ReVision introduces a VLM-assisted spatial gating mechanism that enforces instance-consistent localization, enabling precise edits while preserving scene integrity. We evaluate ReVision on a 245-image benchmark covering both single- and multi-concept scenarios. Results show that ReVision (i) improves CLIP-based alignment toward safe prompts by +$0.121$ on average; (ii) significantly improves multi-concept background fidelity (LPIPS $0.166 \rightarrow 0.058$); (iii) achieves near-complete suppression on category-specific detectors (e.g., NudeNet $70.51 \rightarrow 0$); and (iv) reduces policy-violating content recognizability in a human moderation study from $95.99\%$ to $10.16\%$.

Updated: 2026-03-02 07:13:22

标题: ReVision:一种后续、基于视觉的技术,用于替换图像生成管道中不可接受的概念

摘要: 图像生成模型被广泛应用于各行各业。最近的研究表明,它们可以被利用来生成违反政策的内容。现有的缓解策略主要在生成的前期或中期阶段进行操作,通过诸如提示过滤和安全感知训练/微调等技术。先前的研究表明,这些方法可以被规避,并且通常会降低生成质量。在这项工作中,我们提出了ReVision,这是一个无需训练、基于提示的后期安全框架,用于图像生成管道。ReVision作为最后一道防线,通过分析生成的图像并有选择地编辑不安全的概念,而无需改变基础生成器。它使用Gemini-2.5-Flash模型作为通用的违反政策概念检测器,避免依赖于多个类别特定的检测器,并执行局部语义编辑以替换不安全的内容。先前的后期编辑方法常常依赖于不准确的空间定位,这会削弱可用性并限制部署能力,特别是在多概念场景中。为了解决这个限制,ReVision引入了一个VLM辅助的空间门控机制,强化实例一致的定位,实现精确的编辑同时保持场景完整性。我们在一个涵盖单个和多个概念场景的245张图像基准上评估了ReVision。结果显示,ReVision(i)平均提高了基于CLIP的对齐度,使其向安全提示靠拢$0.121$;(ii)显著提高了多概念背景的保真度(LPIPS $0.166 \rightarrow 0.058$);(iii)在类别特定检测器上实现了接近完全的抑制(例如,NudeNet $70.51 \rightarrow 0$);以及(iv)在人工审核研究中,将策略违规内容的可识别性从$95.99\%$降低到$10.16\%$。

更新时间: 2026-03-02 07:13:22

领域: cs.CR

下载: http://arxiv.org/abs/2602.19149v2

DistillKac: Few-Step Image Generation via Damped Wave Equations

We present DistillKac, a fast image generator that uses the damped wave equation and its stochastic Kac representation to move probability mass at finite speed. In contrast to diffusion models whose reverse time velocities can become stiff and implicitly allow unbounded propagation speed, Kac dynamics enforce finite speed transport and yield globally bounded kinetic energy. Building on this structure, we introduce classifier-free guidance in velocity space that preserves square integrability under mild conditions. We then propose endpoint only distillation that trains a student to match a frozen teacher over long intervals. We prove a stability result that promotes supervision at the endpoints to closeness along the entire path. Experiments demonstrate DistillKac delivers high quality samples with very few function evaluations while retaining the numerical stability benefits of finite speed probability flows.

Updated: 2026-03-02 07:08:44

标题: DistillKac:通过阻尼波动方程进行少步图像生成

摘要: 我们提出了DistillKac,一个快速图像生成器,它使用阻尼波动方程及其随机Kac表示来以有限速度移动概率质量。与扩散模型相比,其逆时间速度可能变得僵硬并隐含允许无界传播速度,Kac动力学强制执行有限速度传输并产生全局有界的动能。基于这种结构,我们在速度空间中引入了无分类器指导,以在温和条件下保持平方可积性。然后,我们提出了仅对端点进行蒸馏,训练学生在长时间间隔内与冻结的教师匹配。我们证明了一个稳定性结果,促进在端点处的监督向整个路径的接近。实验表明,DistillKac在极少的函数评估次数内提供高质量样本,同时保留了有限速度概率流的数值稳定性益处。

更新时间: 2026-03-02 07:08:44

领域: cs.LG,cs.AI,cs.CV,math.PR,stat.ML

下载: http://arxiv.org/abs/2509.21513v3
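
To make the "finite speed" claim in the DistillKac entry above concrete, here is a one-dimensional sketch of the classical Kac (telegraph) process: a particle moves at constant speed and reverses direction at the events of a Poisson process, and its density is known to satisfy a damped wave (telegrapher's) equation. This is a textbook toy and not the paper's image-space sampler; the function name and parameters are hypothetical.

```python
import random

def kac_positions(n_particles, t_end, speed, flip_rate, seed=0):
    """Simulate the 1D Kac process from x = 0 up to time t_end.

    Each particle travels at +/- speed and flips direction after
    exponentially distributed waiting times with rate flip_rate.
    """
    rng = random.Random(seed)
    positions = []
    for _ in range(n_particles):
        x = 0.0
        direction = rng.choice((-1.0, 1.0))
        t = 0.0
        while True:
            wait = rng.expovariate(flip_rate)
            if t + wait >= t_end:
                # No further flip before the horizon: move and stop.
                x += direction * speed * (t_end - t)
                break
            x += direction * speed * wait
            t += wait
            direction = -direction
        positions.append(x)
    return positions

speed, t_end = 2.0, 1.5
samples = kac_positions(n_particles=1000, t_end=t_end, speed=speed, flip_rate=3.0)
# Finite-speed transport: no particle can leave the interval [-speed*t, speed*t].
max_reach = max(abs(x) for x in samples)
```

A Brownian particle simulated over the same horizon has no such hard bound on its displacement, which is the contrast with diffusion models that the abstract draws.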

Pharmacology Knowledge Graphs: Do We Need Chemical Structure for Drug Repurposing?

The contributions of model complexity, data volume, and feature modalities to knowledge graph-based drug repurposing remain poorly quantified under rigorous temporal validation. We constructed a pharmacology knowledge graph from ChEMBL 36 comprising 5,348 entities including 3,127 drugs, 1,156 proteins, and 1,065 indications. A strict temporal split was enforced with training data up to 2022 and testing data from 2023 to 2025, together with biologically verified hard negatives mined from failed assays and clinical trials. We benchmarked five knowledge graph embedding models and a standard graph neural network with 3.44 million parameters that incorporates drug chemical structure using a graph attention encoder and ESM-2 protein embeddings. Scaling experiments ranging from 0.78 to 9.75 million parameters and from 25 to 100 percent of the data, together with feature ablation studies, were used to isolate the contributions of model capacity, graph density, and node feature modalities. Removing the graph attention based drug structure encoder and retaining only topological embeddings combined with ESM-2 protein features improved drug protein PR-AUC from 0.5631 to 0.5785 while reducing VRAM usage from 5.30 GB to 353 MB. Replacing the drug encoder with Morgan fingerprints further degraded performance, indicating that explicit chemical structure representations can be detrimental for predicting pharmacological network interactions. Increasing model size beyond 2.44 million parameters yielded diminishing returns, whereas increasing training data consistently improved performance. External validation confirmed 6 of the top 14 novel predictions as established therapeutic indications. These results show that drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations.

Updated: 2026-03-02 07:07:32

标题: 药理学知识图谱:药物再利用是否需要化学结构?

摘要: 模型复杂性、数据量和特征模态对基于知识图谱的药物再利用的贡献在严格的时间验证下仍然缺乏量化。我们从ChEMBL 36构建了一个药理知识图谱,包括5,348个实体,其中包括3,127种药物、1,156种蛋白质和1,065种适应症。严格的时间分割要求在2022年之前的训练数据和2023年至2025年的测试数据,以及从失败的实验和临床试验中提取的经生物验证的负样本。我们对比了五种知识图嵌入模型和一个标准的图神经网络,该网络包含3.44百万个参数,使用图注意力编码器和ESM-2蛋白质嵌入来融入药物化学结构。从0.78到9.75百万个参数和从25%到100%的数据的扩展实验,以及特征消融研究,被用来单独分离模型容量、图密度和节点特征模态的贡献。去除基于图注意力的药物结构编码器,仅保留拓扑嵌入与ESM-2蛋白质特征相结合,将药物蛋白PR-AUC从0.5631提高到0.5785,同时将VRAM使用量从5.30 GB减少到353 MB。用Morgan指纹替换药物编码器进一步降低了性能,表明明确的化学结构表示可能对预测药理网络相互作用有害。将模型大小增加到2.44百万个参数以上产生递减收益,而增加训练数据一致地提高了性能。外部验证证实了前14个新颖预测中的6个是已建立的治疗适应症。这些结果表明,可以仅使用以靶点为中心的信息和药物网络拓扑来准确预测药物的药理行为,而无需明确的化学结构表示。

更新时间: 2026-03-02 07:07:32

领域: cs.AI,q-bio.BM,q-bio.QM

下载: http://arxiv.org/abs/2603.01537v1

Model Predictive Adversarial Imitation Learning for Planning from Observation

Human demonstration data is often ambiguous and incomplete, motivating imitation learning approaches that also exhibit reliable planning behavior. A common paradigm to perform planning-from-demonstration involves learning a reward function via Inverse Reinforcement Learning (IRL) then deploying this reward via Model Predictive Control (MPC). Towards unifying these methods, we derive a replacement of the policy in IRL with a planning-based agent. With connections to Adversarial Imitation Learning, this formulation enables end-to-end interactive learning of planners from observation-only demonstrations. In addition to benefits in interpretability, complexity, and safety, we study and observe significant improvements on sample efficiency, out-of-distribution generalization, and robustness. The study includes evaluations in both simulated control benchmarks and real-world navigation experiments using few-to-single observation-only demonstrations.

Updated: 2026-03-02 07:07:08

标题: 模型预测对抗性模仿学习用于观察规划

摘要: 人类演示数据通常模糊不清且不完整,这促使模仿学习方法展现出可靠的规划行为。执行从演示中规划的常见范例涉及通过逆强化学习(IRL)学习奖励函数,然后通过模型预测控制(MPC)部署该奖励。为了统一这些方法,我们推导出在IRL中用基于规划的代理替换策略。与对抗性模仿学习相联系,这种表述使得可以从仅观察演示中进行端到端的交互式规划学习。除了在可解释性、复杂性和安全性方面的益处外,我们研究和观察到采样效率、超出分布泛化和鲁棒性方面的显著改进。该研究包括在模拟控制基准测试和使用少至单一观察演示的真实世界导航实验中的评估。

更新时间: 2026-03-02 07:07:08

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.21533v2

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced-reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to original repositories and achieves an 82.5% mean reused-code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.

Updated: 2026-03-02 07:06:01

标题: 越狱铸造厂:从论文到可运行攻击,实现可重复的基准测试

摘要: 大语言模型(LLMs)的越狱技术发展速度比基准快,使得对鲁棒性的估计变得陈旧且难以在论文间比较,因为数据集、工具和评判协议的漂移。我们介绍了JAILBREAK FOUNDRY(JBF),这是一个通过多智能体工作流来将越狱论文转换为可执行模块,以便在统一工具中立即进行评估的系统,以填补这一差距。JBF包括三个核心组件:(i)JBF-LIB用于共享契约和可重复使用的工具;(ii)JBF-FORGE用于多智能体论文到模块的转换;(iii)JBF-EVAL用于标准化评估。在30个复制的攻击中,JBF实现了高保真度,平均(复制-报告)攻击成功率(ASR)偏差为+0.26个百分点。通过利用共享基础设施,相对于原始存储库,JBF减少了近一半的攻击特定实现代码,并实现了82.5%的平均重用代码比率。该系统使得可以使用一致的GPT-4o评判员对30个攻击中的所有攻击进行标准化的AdvBench评估。通过自动化攻击集成和标准化评估,JBF为创建与快速变化的安全格局保持同步的活跃基准提供了可扩展的解决方案。

更新时间: 2026-03-02 07:06:01

领域: cs.CR,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2602.24009v2

Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making

Decision-focused learning (DFL) integrates predictive models with downstream optimization, directly training machine learning models to minimize decision errors. While DFL has been shown to provide substantial advantages when compared to a counterpart that treats the predictive and prescriptive models separately, it has also been shown to struggle in high-dimensional and risk-sensitive settings, limiting its applicability in real-world settings. To address this limitation, this paper introduces decision-focused generative learning (Gen-DFL), a novel framework that leverages generative models to adaptively model uncertainty and improve decision quality. Instead of relying on fixed uncertainty sets, Gen-DFL learns a structured representation of the optimization parameters and samples from the tail regions of the learned distribution to enhance robustness against worst-case scenarios. This approach mitigates over-conservatism while capturing complex dependencies in the parameter space. The paper shows, theoretically, that Gen-DFL achieves improved worst-case performance bounds compared to traditional DFL. Empirically, it evaluates Gen-DFL on various scheduling and logistics problems, demonstrating its strong performance against existing DFL methods.

Updated: 2026-03-02 07:00:41

标题: Gen-DFL: 为稳健决策制定的决策导向生成学习

摘要: 决策集中学习(DFL)将预测模型与下游优化相结合,直接训练机器学习模型以最小化决策错误。虽然与将预测和规划模型分开处理的对应方法相比,DFL已被证明在提供实质性优势,但在高维和风险敏感的环境中表现出困难,限制了其在现实世界中的适用性。为了解决这一限制,本文介绍了决策集中生成学习(Gen-DFL),这是一个利用生成模型来自适应地建模不确定性并提高决策质量的新框架。Gen-DFL不依赖于固定的不确定性集合,而是学习优化参数的结构化表示,并从学习分布的尾部区域进行采样,以增强对最坏情况的鲁棒性。这种方法减少了过度保守,同时捕捉了参数空间中的复杂依赖关系。本文在理论上表明,与传统的DFL相比,Gen-DFL取得了改善最坏情况性能界限。在实证方面,它对各种调度和物流问题进行了Gen-DFL评估,展示了其相对于现有DFL方法的强大性能。

更新时间: 2026-03-02 07:00:41

领域: cs.LG

下载: http://arxiv.org/abs/2502.05468v2

Scalable Multi-Task Low-Rank Model Adaptation

Scaling multi-task low-rank adaptation (LoRA) to a large number of tasks induces catastrophic performance degradation, such as an accuracy drop from 88.2% to 2.0% on DOTA when scaling from 5 to 15 tasks. This failure is due to parameter and representation misalignment. We find that existing solutions, like regularization and dynamic routing, fail at scale because they are constrained by a fundamental trade-off: strengthening regularization to reduce inter-task conflict inadvertently suppresses the essential feature discrimination required for effective routing. In this work, we identify two root causes for this trade-off. First, uniform regularization disrupts inter-task knowledge sharing: shared underlying knowledge concentrates in high-SV components (89% alignment on Flanv2->BBH). Uniform regularization forces high-SV components to update in orthogonal directions, directly disrupting the shared knowledge. Second, conflict amplification: applying LoRA at the component level (e.g., W_q, W_v) amplifies gradient conflicts; we show block-level adaptation reduces this conflict by 76% with only 50% of the parameters. Based on these insights, we propose mtLoRA, a scalable solution with three novel designs: 1) Spectral-Aware Regularization to selectively orthogonalize low-SV components while preserving high-SV shared knowledge, 2) Block-Level Adaptation to mitigate conflict amplification and substantially improve parameter efficiency, and 3) Fine-Grained Routing using dimension-specific weights for superior expressive power. On four large-scale (15-25 tasks) vision (DOTA and iNat2018) and NLP (Dolly-15k and BBH) benchmarks, mtLoRA achieves 91.7%, 81.5%, 44.5% and 38.5% accuracy on DOTA, iNat2018, Dolly-15k and BBH respectively, outperforming the state-of-the-art by 2.3% on average while using 47% fewer parameters and 24% less training time.

Updated: 2026-03-02 06:57:11

标题: 可扩展的多任务低秩模型适应

摘要: 将多任务低秩自适应(LoRA)扩展到大量任务会导致灾难性的性能下降,例如在从5个任务扩展到15个任务时,在DOTA上准确率从88.2%下降到2.0%。这种失败是由于参数和表示不匹配造成的。我们发现现有的解决方案,如正则化和动态路由,在规模上失败,因为它们受到一种根本性的权衡的约束:加强正则化以减少任务间冲突不经意地抑制了有效路由所需的基本特征判别能力。在这项工作中,我们确定了这种权衡的两个根本原因。首先,均匀正则化破坏了任务间知识共享:共享的基础知识集中在高SV组件中(Flanv2->BBH上89%的对齐)。均匀正则化迫使高SV组件以正交方向更新,直接破坏了共享知识。第二,冲突放大:在组件级别应用LoRA(例如W_q,W_v)会放大梯度冲突;我们展示块级自适应可以减少这种冲突76%,而只需要50%的参数。基于这些见解,我们提出了mtLoRA,一个可扩展的解决方案,具有三个新颖的设计:1)谱感知正则化,可选择性地正交化低SV组件,同时保留高SV共享知识,2)块级自适应,以减轻冲突放大,并大大提高参数效率,3)使用维度特定权重的细粒度路由,具有更优秀的表达能力。在四个大规模(15-25个任务)的视觉(DOTA和iNat2018)和NLP(Dolly-15k和BBH)基准上,mtLoRA在DOTA、iNat2018、Dolly-15k和BBH上分别达到了91.7%、81.5%、44.5%和38.5%的准确率,平均比现有技术提高了2.3%,同时减少了47%的参数和24%的训练时间。

更新时间: 2026-03-02 06:57:11

领域: cs.LG

下载: http://arxiv.org/abs/2603.01526v1

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Compression methods, including quantization, distillation, and pruning, improve the computational efficiency of large reasoning models (LRMs). However, existing studies either fail to sufficiently compare all three compression methods on LRMs or lack in-depth interpretation analysis. In this paper, we investigate how the reasoning capabilities of LRMs are compromised during compression, through performance benchmarking and mechanistic interpretation. To uncover the effects of compression on reasoning performance, we benchmark quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue). To precisely locate compression effects on model weights, we adapt difference of means and attribution patching techniques, focusing on the activation of every linear component in compressed LRMs, to interpret fine-grained causal relationships between weights and various reasoning capabilities. This fine-grained interpretation addresses a fundamental question of compression: which weights are the most important for reasoning? Overall, we find dynamically quantized 2.51-bit R1 reaches close-to-R1 performance. With empirical verification, we present three main findings that generalize across both R1 and non-R1 LRMs: (1) Weight count has a greater impact on LRMs' knowledge memorization than reasoning, highlighting the risks of pruning and distillation; (2) The MLP up projection in the final layer of distilled LRMs is one of the most important components, offering a new perspective on locating critical weights - a fundamental problem in model compression; and (3) Current quantization methods overly compress the final-layer modules and MLP gate projections, so protecting just 2% of all weights that are excessively compressed can raise average accuracy by 6.57%, greatly surpassing the state-of-the-art.

Updated: 2026-03-02 06:53:34

标题: 当推理遇上压缩:理解LLMs压缩对大型推理模型的影响

摘要: 压缩方法,包括量化、蒸馏和修剪,提高了大型推理模型(LRMs)的计算效率。然而,现有研究要么未能充分比较所有三种压缩方法在LRMs上的表现,要么缺乏深入的解释分析。本文通过性能基准测试和机械解释,调查了在压缩过程中LRM的推理能力受到了怎样的损害。为了揭示压缩对推理性能的影响,我们在四个推理数据集(AIME 2024、FOLIO、时间序列和MuSiQue)上对经过量化、蒸馏和修剪的DeepSeek-R1模型进行了基准测试。为了精确确定模型权重上的压缩效果,我们采用了均值差异和归因修补技术,专注于解释压缩的LRM中每个线性组件的激活,以解释权重与各种推理能力之间的细粒度因果关系。这种细粒度解释回答了一个关于压缩的基本问题:哪些权重对推理最重要?总体上,我们发现动态量化的2.51位R1接近R1性能。通过实证验证,我们提出了三个主要发现,这些发现适用于R1和非R1 LRMs:(1)权重数量对LRMs的知识记忆产生了比推理更大的影响,突显了修剪和蒸馏的风险;(2)蒸馏LRMs最终层中的MLP上投影是最重要的组件之一,为定位关键权重提供了新视角 - 这是模型压缩中的一个基本问题;和(3)当前的量化方法过度压缩了最终层模块和MLP门投影,因此仅保护那些过度压缩的所有权重中的2%可以将平均准确率提高6.57%,大大超过了现有技术水平。

更新时间: 2026-03-02 06:53:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2504.02010v3

Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

Large Vision-Language Models (LVLMs) exhibit impressive potential across various tasks but also face significant privacy risks, limiting their practical applications. Current research on privacy assessment for LVLMs is limited in scope, with gaps in both assessment dimensions and privacy categories. To bridge this gap, we propose Multi-PA, a comprehensive benchmark for evaluating the privacy preservation capabilities of LVLMs in terms of privacy awareness and leakage. Privacy awareness measures the model's ability to recognize the privacy sensitivity of input data, while privacy leakage assesses the risk of the model unintentionally disclosing private information in its output. We design a range of sub-tasks to thoroughly evaluate the privacy protection offered by LVLMs. Multi-PA covers 26 categories of personal privacy, 15 categories of trade secrets, and 18 categories of state secrets, totaling 31,962 samples. Based on Multi-PA, we evaluate the privacy preservation capabilities of 21 open-source and 2 closed-source LVLMs. Our results reveal that current LVLMs generally pose a high risk of facilitating privacy breaches, with vulnerabilities varying across personal privacy, trade secrets, and state secrets.

Updated: 2026-03-02 06:51:34

标题: Multi-PA: 大型视觉-语言模型隐私评估的多维度基准

摘要: 大型视觉语言模型(LVLMs)在各种任务中展现出令人印象深刻的潜力,但也面临着重大的隐私风险,限制了它们的实际应用。目前对LVLMs隐私评估的研究范围有限,在评估维度和隐私类别上存在差距。为了弥补这一差距,我们提出了Multi-PA,这是一个全面的基准,用于评估LVLMs在隐私意识和泄露方面的隐私保护能力。隐私意识衡量模型识别输入数据隐私敏感性的能力,而隐私泄露评估模型在输出中无意中披露隐私信息的风险。我们设计了一系列子任务,全面评估LVLMs提供的隐私保护能力。Multi-PA涵盖了26类个人隐私、15类商业机密和18类国家机密,共计31,962个样本。基于Multi-PA,我们评估了21个开源和2个封闭源LVLMs的隐私保护能力。我们的结果显示,目前的LVLMs普遍存在高风险促进隐私泄露,漏洞在个人隐私、商业机密和国家机密之间有所不同。

更新时间: 2026-03-02 06:51:34

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2412.19496v4

Compliance as Code: A Study of Linux Distributions and Beyond

Compliance as code is an emerging idea about automating compliance through programmed compliance controls and checks. Given the scant existing research thus far, the paper presents an empirical analysis of a compliance as code project addressing open source software (OSS) projects and products. The dataset examined covers a little over 1,500 unique compliance rules designed and implemented for 14 Linux distribution releases from five vendors. According to the results, (1) the coverage of the rules varies across the five vendors. Then, (2) the brief rationales provided for the rules do not exhibit statistical similarities, but the short code snippets for these do show similarities to some extent. Furthermore, (3) as many as 24 controls are covered from over 10 different organizations, among them governmental agencies, standardization organizations, and non-profit associations. Finally, (4) the rules can be mapped to the essential cyber security requirements of the Cyber Resilience Act (CRA), although only modest agreement exists among the three authors regarding individual mappings. This observation supports an argument that the compliance as code project studied could be updated with new compliance checks. Given that operating systems are also in the CRA's scope when used in a network-connected product, such an update would also have practical relevance in the near future.

Updated: 2026-03-02 06:50:28

标题: 合规性即代码:Linux发行版及其更深层次的研究

摘要: 代码合规性是一种新兴的想法,通过编程合规性控制和检查自动化合规性。鉴于迄今为止现有研究很少,本文提出了一个针对开源软件(OSS)项目和产品的合规性代码项目的实证分析。所研究的数据集涵盖了为来自五家供应商的14个Linux发行版发布的逾1500条独特合规性规则。根据结果,(1)这些规则的覆盖范围在五家供应商之间存在差异。然后,(2)这些规则的简要理由并不表现出统计上的相似性,但这些规则的简短代码片段在某种程度上显示出相似性。此外,(3)来自超过10个不同组织的24个控制措施被涵盖,其中包括政府机构、标准化组织和非营利协会。最后,(4)这些规则可以映射到《网络安全弹性法案》(CRA)的基本网络安全要求,尽管三位作者在个别映射方面存在较小的一致性。这一观察结果支持一个论点,即所研究的合规性代码项目可以通过新的合规性检查进行更新。考虑到操作系统在连接到网络的产品中也受到CRA的监管,这种更新在不久的将来也具有实际意义。

更新时间: 2026-03-02 06:50:28

领域: cs.SE,cs.CR

下载: http://arxiv.org/abs/2603.01520v1

Nazrin: Atomic Tactics for Graph Neural Networks for Theorem Proving in Lean 4

In Machine-Assisted Theorem Proving, a theorem proving agent searches for a sequence of expressions and tactics that can prove a conjecture in a proof assistant. In this work, we introduce several novel concepts and capabilities to address obstacles faced by machine-assisted theorem proving. We first present a set of atomic tactics, a small finite set of tactics capable of proving any provable statement in Lean. We then introduce a transposing atomization algorithm which turns arbitrary proof expressions into a series of atomic tactics. We next introduce the ExprGraph data structure, which provides a succinct representation for Lean expressions. Finally, we present the Nazrin Prover, a graph neural network-based theorem proving agent using atomic tactics and ExprGraph. Nazrin circumvents many challenges faced by existing proving agents by exclusively dispatching atomic tactics, and it is robust enough to both train and evaluate on consumer-grade hardware. We demonstrate the potential of tools like Nazrin using theorems from Lean's standard library and from Mathlib.

Updated: 2026-03-02 06:50:01

标题: Nazrin:在Lean 4中用于定理证明的图神经网络的原子策略

摘要: 在机器辅助定理证明中,定理证明代理搜索一系列表达式和策略,以证明证明助手中的猜想。在这项工作中,我们引入了几个新颖的概念和能力,以解决机器辅助定理证明面临的障碍。我们首先提出一组原子策略,这是一组能够证明Lean中任何可证明语句的有限策略。然后引入了一种转置原子化算法,将任意证明表达式转化为一系列原子策略。接下来介绍了ExprGraph数据结构,为Lean表达式提供了简洁的表示。最后,我们介绍了Nazrin Prover,这是一个基于图神经网络的定理证明代理,使用原子策略和ExprGraph。Nazrin通过仅派发原子策略来规避现有证明代理面临的许多挑战,并且足够强大,可以在消费级硬件上进行训练和评估。我们使用Lean标准库和Mathlib中的定理展示了像Nazrin这样的工具的潜力。

更新时间: 2026-03-02 06:50:01

领域: cs.LO,cs.LG

下载: http://arxiv.org/abs/2602.18767v2

CLIFF: Continual Learning for Incremental Flake Features in 2D Material Identification

Identifying quantum flakes is crucial for scalable quantum hardware; however, automated layer classification from optical microscopy remains challenging due to substantial appearance shifts across different materials. This paper proposes a new Continual-Learning Framework for Flake Layer Classification (CLIFF). To the best of our knowledge, this work represents the first systematic study of continual learning in two-dimensional (2D) materials. The proposed framework enables the model to distinguish materials and their physical and optical properties by freezing the backbone and base head, which are trained on a reference material. For each new material, it learns a material-specific prompt, embedding, and a delta head. A prompt pool and a cosine-similarity gate modulate features and compute material-specific corrections. Additionally, memory replay with knowledge distillation is incorporated. CLIFF achieves competitive accuracy with significantly lower forgetting than naive fine-tuning and a prompt-based baseline.

Updated: 2026-03-02 06:48:27

标题: CLIFF:用于2D材料识别中增量薄片特征的持续学习

摘要: 识别量子片段对于可扩展的量子硬件至关重要;然而,由于不同材料之间存在显著的外观变化,从光学显微镜中自动进行层分类仍然具有挑战性。本文提出了一种新的用于片层分类的持续学习框架(CLIFF)。据我们所知,这项工作是对二维材料中持续学习的首次系统研究。所提出的框架使模型能够通过冻结在参考材料上训练的骨干和基础头部来区分材料及其物理和光学特性。对于每种新材料,它学习一种特定于该材料的提示、嵌入和增量头部。提示池和余弦相似性门调节特征并计算材料特定的修正。此外,还结合了知识蒸馏的记忆重放。CLIFF在明显减少遗忘的同时,实现了与朴素微调和基线相比具有竞争力的准确性。

更新时间: 2026-03-02 06:48:27

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2508.17261v3

Analysis of Shuffling Beyond Pure Local Differential Privacy

Shuffling is a powerful way to amplify privacy of a local randomizer in private distributed data analysis. Most existing analyses of how shuffling amplifies privacy are based on the pure local differential privacy (DP) parameter $\varepsilon_0$. This paper raises the question of whether $\varepsilon_0$ adequately captures the privacy amplification. For example, since the Gaussian mechanism does not satisfy pure local DP for any finite $\varepsilon_0$, does it follow that shuffling yields weak amplification? To solve this problem, we revisit the privacy blanket bound of Balle et al. (the blanket divergence) and develop a direct asymptotic analysis that bypasses $\varepsilon_0$. Our key finding is that, asymptotically, the blanket divergence depends on the local mechanism only through a single scalar parameter $χ$ and that this dependence is monotonic. Therefore, this parameter serves as a proxy for shuffling efficiency, which we call the shuffle index. By applying this analysis to both upper and lower bounds of the shuffled mechanism's privacy profile, we obtain a band for its privacy guarantee through shuffle indices. Furthermore, we derive a simple structural, necessary and sufficient condition on the local randomizer under which this band collapses asymptotically. $k$-RR families with $k\ge3$ satisfy this condition, while for generalized Gaussian mechanisms the condition may not hold but the resulting band remains tight. Finally, we complement the asymptotic theory with an FFT-based algorithm for computing the blanket divergence at finite $n$, which offers rigorously controlled relative error and near-linear running time in $n$, providing a practical numerical analysis for shuffle DP.

Updated: 2026-03-02 06:46:25

标题: 在纯局部差分隐私之外的洗牌分析

摘要: 洗牌是在私有分布式数据分析中放大本地随机化器隐私的有效方法。大多数现有关于洗牌如何放大隐私的分析是基于纯本地差分隐私(DP)参数$\varepsilon_0$。本文提出了一个问题,即$\varepsilon_0$是否充分捕捉了隐私放大。例如,由于高斯机制不满足任何有限$\varepsilon_0$的纯本地DP,那么洗牌是否会产生弱放大呢?为了解决这个问题,我们重新审视了Balle等人的隐私覆盖边界(覆盖离散度),并开发了一种绕过$\varepsilon_0$的直接渐近分析。我们的关键发现是,在渐近情况下,覆盖离散度仅通过一个标量参数$χ$依赖于本地机制,并且这种依赖是单调的。因此,该参数充当了洗牌效率的代理,我们称之为洗牌指数。通过将此分析应用于洗牌机制隐私概要的上下界,我们通过洗牌指数获得了其隐私保证的区间。此外,我们推导了一个简单的结构性、必要和充分条件,即本地随机器在该条件下在渐近情况下会收缩。具有$k\ge3$的$k$-RR族满足这个条件,而对于广义高斯机制,该条件可能不成立,但结果区间仍然紧密。最后,我们将渐近理论与基于FFT的算法相结合,用于计算有限$n$时的覆盖离散度,该算法提供了严格控制的相对误差和接近$n$的线性运行时间,为洗牌DP提供了一种实用的数值分析方法。

更新时间: 2026-03-02 06:46:25

领域: cs.DS,cs.CR,cs.IT,cs.LG

下载: http://arxiv.org/abs/2601.19154v4
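
For readers unfamiliar with the setting of the shuffling entry above, the sketch below shows the pipeline it analyzes in its simplest form: each user applies the standard $k$-ary randomized response ($k$-RR) local randomizer with pure local DP parameter $\varepsilon_0$, and a shuffler then permutes the reports. It does not compute the blanket divergence or the shuffle index from the paper; the function names are hypothetical.

```python
import math
import random

def krr_probabilities(k, eps0):
    """Return (P[report true value], P[report one specific other value])."""
    denom = math.exp(eps0) + k - 1
    return math.exp(eps0) / denom, 1.0 / denom

def krr(value, k, eps0, rng):
    """k-ary randomized response on a value in {0, ..., k-1}."""
    p_true, _ = krr_probabilities(k, eps0)
    if rng.random() < p_true:
        return value
    # Otherwise pick uniformly among the k-1 other values.
    other = rng.randrange(k - 1)
    return other if other < value else other + 1

def shuffle_reports(values, k, eps0, seed=0):
    """Randomize each value locally, then shuffle away the user order."""
    rng = random.Random(seed)
    reports = [krr(v, k, eps0, rng) for v in values]
    rng.shuffle(reports)
    return reports

k, eps0 = 4, 1.0
p_true, p_other = krr_probabilities(k, eps0)
# The likelihood ratio p_true / p_other equals e^{eps0}: pure local DP.
ratio = p_true / p_other
reports = shuffle_reports([0, 1, 2, 3, 0, 1], k, eps0)
```

The worst-case likelihood ratio of this randomizer is exactly $e^{\varepsilon_0}$, which is why $k$-RR is the usual reference point for pure local DP. The Gaussian mechanism has no finite such ratio, which is the case the abstract uses to motivate an analysis that bypasses $\varepsilon_0$.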

Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning

We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer which help avoid spurious stationary points, and a data-dependent spectral initialization of parameters which lie near the manifold of global minima with high probability.

Updated: 2026-03-02 06:44:54

Categories: cs.LG, stat.ML

Download: http://arxiv.org/abs/2603.01514v1

Towards Transferable Defense Against Malicious Image Edits

Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.

Updated: 2026-03-02 06:42:36

Categories: cs.CV, cs.AI, cs.CY, cs.LG

Download: http://arxiv.org/abs/2512.14341v2

Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification

Accurate identification of protein active sites at the residue level is crucial for understanding protein function and advancing drug discovery. However, current methods face two critical challenges: vulnerability in single-instance prediction due to sparse training data, and inadequate modality reliability estimation that leads to performance degradation when unreliable modalities dominate fusion processes. To address these challenges, we introduce Multimodal Mixture-of-Experts with Retrieval Augmentation (MERA), the first retrieval-augmented framework for protein active site identification. MERA employs hierarchical multi-expert retrieval that dynamically aggregates contextual information from chain, sequence, and active-site perspectives through residue-level mixture-of-experts gating. To prevent modality degradation, we propose a reliability-aware fusion strategy based on Dempster-Shafer evidence theory that quantifies modality trustworthiness through belief mass functions and learnable discounting coefficients, enabling principled multimodal integration. Extensive experiments on ProTAD-Gen and TS125 datasets demonstrate that MERA achieves state-of-the-art performance, with 90% AUPRC on active site prediction and significant gains on peptide-binding site identification, validating the effectiveness of retrieval-augmented multi-expert modeling and reliability-guided fusion.
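
The reliability-aware fusion step can be illustrated with textbook Dempster-Shafer operations. The sketch below assumes a two-hypothesis frame (`"A"` for active site, `"N"` for non-active, `"AN"` for the whole frame, i.e. ignorance), classical Shafer discounting, and Dempster's rule of combination. The fixed discounting coefficient stands in for the learnable coefficients mentioned in the abstract, and the modality names and mass values are made up for illustration.

```python
def discount(mass, alpha):
    """Shafer discounting: keep a fraction alpha of the committed belief and
    move the rest to the whole frame 'AN' (ignorance)."""
    out = {key: alpha * m for key, m in mass.items() if key != "AN"}
    out["AN"] = 1.0 - alpha + alpha * mass.get("AN", 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule of combination on the frame {A, N}."""
    conflict = m1["A"] * m2["N"] + m1["N"] * m2["A"]
    norm = 1.0 - conflict
    if norm <= 0.0:
        raise ValueError("sources are in total conflict")
    return {
        "A": (m1["A"] * m2["A"] + m1["A"] * m2["AN"] + m1["AN"] * m2["A"]) / norm,
        "N": (m1["N"] * m2["N"] + m1["N"] * m2["AN"] + m1["AN"] * m2["N"]) / norm,
        "AN": m1["AN"] * m2["AN"] / norm,
    }


# Hypothetical per-residue masses from two modalities.
sequence_mass = {"A": 0.8, "N": 0.1, "AN": 0.1}
structure_mass = {"A": 0.3, "N": 0.6, "AN": 0.1}

# Treat the second modality as only half reliable before fusing.
fused = combine(sequence_mass, discount(structure_mass, alpha=0.5))
```

Discounting an unreliable modality shifts its mass toward ignorance, so it has less influence on the fused belief.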

Updated: 2026-03-02 06:40:04

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01511v1

A Tidal Current Speed Forecasting Model based on Multi-Periodicity Learning

Tidal energy is one of the key components in increasing the penetration of renewable energy. High tidal energy penetration into the electrical grid depends on accurate tidal current speed forecasting. Model inaccuracies hinder forecast accuracy. Previous research primarily used physical models to forecast tidal current speed, yet tidal current variations influenced by the orbital periods of celestial bodies make accurate physical modeling challenging. Research on the multi-periodicity of tides is crucial for forecasting tidal current speed. We propose the Wavelet-Enhanced Convolutional Network to learn multi-periodicity. The framework embeds intra-period and inter-period variations of one-dimensional tidal current data into the rows and columns, respectively, of a two-dimensional tensor. Then, the two-dimensional variations of the sequence can be processed by convolutional kernels. We integrate a time-frequency analysis method into the framework to further address local periodic features. Additionally, to enhance the framework's stability, we optimize the framework's hyperparameters with the Tree-structured Parzen Estimator. The proposed framework captures multi-periodic dependencies in tidal current data. Numerical results show a 10-step average Mean Absolute Error of 0.025, with at least a 1.44% error reduction compared to other baselines. Further ablation studies show a 1.4% reduction in Mean Absolute Percentage Error on the data with artificially added periodic fluctuations.
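
The row/column embedding can be illustrated with a short sketch. It assumes a single, already known period and plain Python lists; the abstract does not say how the framework selects its periods, and `fold_series` and `column` are hypothetical helper names.

```python
def fold_series(series, period):
    """Fold a 1-D series into rows of length `period`.

    Each row holds one full period (intra-period variation along the row);
    reading down a column compares the same phase across periods
    (inter-period variation). A trailing partial period is dropped.
    """
    n_rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(n_rows)]


def column(grid, phase):
    """Values at the same phase in every period."""
    return [row[phase] for row in grid]


speeds = list(range(12))          # stand-in for tidal current speed samples
grid = fold_series(speeds, period=4)
same_phase = column(grid, 1)
```

Once the series is laid out this way, a 2-D convolution kernel sees neighbouring time steps horizontally and neighbouring periods vertically.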

Updated: 2026-03-02 06:38:00

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2410.09718v4

Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling

While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG-based prompt optimization framework. 3R builds on a current state-of-the-art T2V diffusion model and a vision-language model, and it can be used with any T2V model without any model training. The framework leverages three key strategies: RAG-based modifier extraction for enriched contextual grounding, diffusion-based preference optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual content. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.

Updated: 2026-03-02 06:35:59

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2603.01509v1

The Sentience Readiness Index: Measuring National Preparedness for the Possibility of Artificial Sentience

The scientific study of consciousness has begun to generate testable predictions about artificial systems. A landmark collaborative assessment evaluated current AI architectures against six leading theories of consciousness and found that none currently qualifies as a strong candidate, but that future systems might. A precautionary approach to AI sentience, which holds that credible possibility of sentience warrants governance action even without proof, has gained philosophical and institutional traction. Yet existing AI readiness indices, including the Oxford Insights Government AI Readiness Index, the IMF AI Preparedness Index, and the Stanford AI Index, measure economic, technological, and governance preparedness without assessing whether societies are prepared for the possibility that AI systems might warrant moral consideration. This paper introduces the Sentience Readiness Index (SRI), a composite index measuring national-level preparedness across six weighted categories for 31 jurisdictions. The SRI was constructed following the OECD/JRC framework for composite indicators and employs LLM-assisted expert scoring with iterative expert review. No jurisdiction exceeds "Partially Prepared" (the United Kingdom leads at 49/100). Research Environment scores are universally the strongest category; Professional Readiness is universally the weakest. These findings suggest that if AI sentience becomes scientifically plausible, no society currently possesses adequate institutional, professional, or cultural infrastructure to respond. The SRI provides a diagnostic baseline and identifies specific capacity deficits that policy can address.

Updated: 2026-03-02 06:35:34

Categories: cs.CY, cs.AI

Download: http://arxiv.org/abs/2603.01508v1

SOSecure: Safer Code Generation with RAG and StackOverflow Discussions

Large Language Models (LLMs) are widely used for automated code generation. Their reliance on infrequently updated pretraining data leaves them unaware of newly discovered vulnerabilities and evolving security standards, making them prone to producing insecure code. In contrast, developer communities on Stack Overflow (SO) provide an ever-evolving repository of knowledge, where security vulnerabilities are actively discussed and addressed through collective expertise. These community-driven insights remain largely untapped by LLMs. This paper introduces SOSecure, a Retrieval-Augmented Generation (RAG) system that leverages the collective security expertise found in SO discussions to improve the security of LLM-generated code. We build a security-focused knowledge base by extracting SO answers and comments that explicitly identify vulnerabilities. Unlike common uses of RAG, SOSecure triggers after code has been generated, to find discussions that identify flaws in similar code. These discussions are then included in a prompt that asks an LLM to consider revising the code. Evaluation across three datasets (SALLM, LLMSecEval, and LMSys) shows that SOSecure achieves strong fix rates of 71.7%, 91.3%, and 96.7% respectively, compared to prompting GPT-4 without relevant discussions (49.1%, 56.5%, and 37.5%), and outperforms multiple other baselines. SOSecure operates as a language-agnostic complement to existing LLMs, without requiring retraining or fine-tuning, making it easy to deploy. Our results underscore the importance of maintaining active developer forums, whose usage has dropped substantially with LLM adoption.
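
The retrieve-after-generation flow can be sketched as follows. This is not the SOSecure implementation: the abstract does not specify the retriever or the prompt, so the sketch assumes a toy token-overlap retriever, a two-entry hypothetical knowledge base, and an invented prompt wording, and it stops before calling any LLM.

```python
import re


def tokens(text):
    """Lowercased identifier-like tokens."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def retrieve(code, knowledge_base, top_n=1):
    """Rank discussions by token overlap with the generated code."""
    code_tokens = tokens(code)
    ranked = sorted(
        knowledge_base,
        key=lambda entry: len(code_tokens & tokens(entry)),
        reverse=True,
    )
    return ranked[:top_n]


def build_revision_prompt(code, discussions):
    """Assemble a prompt asking a model to reconsider the generated code."""
    notes = "\n".join(f"- {d}" for d in discussions)
    return (
        "The following code was generated:\n"
        f"{code}\n\n"
        "Community discussions about similar code:\n"
        f"{notes}\n\n"
        "Consider whether the code should be revised for security."
    )


# Hypothetical knowledge base entries (not real Stack Overflow posts).
kb = [
    "Calling eval on user input allows arbitrary code execution; parse the input instead.",
    "Building SQL with string concatenation enables injection; use parameterized queries.",
]
generated = "result = eval(user_input)"
prompt = build_revision_prompt(generated, retrieve(generated, kb))
```

Because retrieval runs on the generated code rather than on the user's request, the retrieved discussions can target flaws that only appear once code exists.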

Updated: 2026-03-02 06:32:16

Categories: cs.SE, cs.CR

Download: http://arxiv.org/abs/2503.13654v2

On The Fragility of Benchmark Contamination Detection in Reasoning Models

Leaderboards for large reasoning models (LRMs) have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to higher rankings is to incorporate evaluation benchmarks into the training data, which yields inflated performance and is known as benchmark contamination. Surprisingly, our studies find that evading contamination detection for LRMs is alarmingly easy. We focus on two scenarios where contamination may occur in practice: (I) when the base model evolves into an LRM via SFT and RL, we find that contamination introduced during SFT can initially be identified by contamination detection methods. Yet even brief GRPO training can markedly conceal the contamination signals that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that PPO-style importance sampling and clipping objectives are the root cause of this concealment, suggesting that a broad class of RL methods may inherently exhibit a similar concealment capability; (II) when SFT contamination with CoT is applied to advanced LRMs as the final stage, most contamination detection methods perform near random guessing. Without exposure to non-members, contaminated LRMs still respond with more confidence to unseen samples that share a similar distribution with the training set, and thus evade existing memorization-based detection methods. Together, our findings reveal a unique vulnerability of LRM evaluation: model developers could easily contaminate LRMs to achieve inflated leaderboard performance while leaving minimal traces of contamination, thereby strongly undermining the fairness of evaluation and threatening the integrity of public leaderboards. This underscores the urgent need for advanced contamination detection methods and trustworthy evaluation protocols tailored to LRMs.

Updated: 2026-03-02 06:28:00

Categories: cs.CR, cs.AI, cs.LG

Download: http://arxiv.org/abs/2510.02386v2

CSRv2: Unlocking Ultra-Sparse Embeddings

In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability. Yet widely used dense embeddings are often extremely high-dimensional, incurring substantial costs in storage, memory, and inference latency. To address these costs, Contrastive Sparse Representation (CSR) was recently proposed as a promising direction, mapping dense embeddings into high-dimensional but k-sparse vectors, in contrast to compact dense embeddings such as Matryoshka Representation Learning (MRL). Despite its promise, CSR suffers severe degradation in the ultra-sparse regime, where over 80% of neurons remain inactive, leaving much of its efficiency potential unrealized. In this paper, we introduce CSRv2, a principled training approach designed to make ultra-sparse embeddings viable. CSRv2 stabilizes sparsity learning through progressive k-annealing, enhances representational quality via supervised contrastive objectives, and ensures end-to-end adaptability with full backbone finetuning. CSRv2 reduces dead neurons from 80% to 20% and delivers a 14% accuracy gain at k=2, bringing ultra-sparse embeddings on par with CSR at k=8 and MRL at 32 dimensions, all with only two active features. While maintaining comparable performance, CSRv2 delivers a 7x speedup over MRL and yields up to 300x improvements in compute and memory efficiency relative to dense embeddings in text representation. Extensive experiments across text and vision demonstrate that CSRv2 makes ultra-sparse embeddings practical without compromising performance: CSRv2 achieves a 7%/4% improvement over CSR at k=4 and widens this gap to 14%/6% at k=2 in text/vision representation. By making extreme sparsity viable, CSRv2 broadens the design space for real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.
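
To make "high-dimensional but k-sparse" concrete, the sketch below keeps only the k largest-magnitude entries of a vector and stores them as an index-to-value map. It is a generic top-k sparsification under stated assumptions, not the CSR or CSRv2 training procedure; `top_k_sparsify` and `sparse_dot` are illustrative names.

```python
def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries as {index: value}."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return {i: vec[i] for i in sorted(order[:k])}


def sparse_dot(a, b):
    """Dot product that only touches indices active in both vectors."""
    return sum(value * b[i] for i, value in a.items() if i in b)


query = top_k_sparsify([0.1, -0.9, 0.3, 0.5], k=2)
doc = top_k_sparsify([0.0, 0.2, 0.0, 2.0, 1.0], k=2)
score = sparse_dot(query, doc)
```

With k=2, each embedding costs two index-value pairs to store, and a similarity score needs at most two multiplications regardless of the nominal dimension.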

Updated: 2026-03-02 06:23:21

Categories: cs.LG, cs.AI, cs.IR, cs.IT

Download: http://arxiv.org/abs/2602.05735v4

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

Asynchronous execution is essential for scaling reinforcement learning (RL) to modern large model workloads, including large language models and AI agents, but it can fundamentally alter RL optimization behavior. While prior work on asynchronous RL focuses on training throughput and distributional correction, we show that naively applying asynchrony to policy-gradient updates can induce qualitatively different training dynamics and lead to severe training instability. Through systematic empirical and theoretical analysis, we identify a key signature of this instability: asynchronous training exhibits persistently high cosine similarity between consecutive policy gradients, in contrast to the near-orthogonal updates observed under synchronized training. This stale-aligned gradient effect amplifies correlated updates and increases the risk of overshooting and divergence. Motivated by this observation, we propose Gradient Alignment Control (GAC), a simple dynamics-aware stabilization method that regulates asynchronous RL progress along stale-aligned directions via gradient projection. We establish convergence guarantees under bounded staleness and demonstrate empirically that GAC recovers stable, on-policy training dynamics and matches synchronized baselines even at high staleness.
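
The diagnostic and the projection idea can be sketched in a few lines. The abstract does not give GAC's exact rule, so the sketch makes two assumptions: alignment is measured as cosine similarity with the previous gradient, and when it exceeds a threshold the component along the previous gradient is removed entirely. The threshold value and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity, defined as 0.0 if either vector is zero."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


def project_out(grad, prev_grad, threshold=0.5):
    """Remove the component of `grad` along `prev_grad` when the two are
    strongly aligned; otherwise return `grad` unchanged."""
    if cosine(grad, prev_grad) <= threshold:
        return list(grad)
    scale = dot(grad, prev_grad) / dot(prev_grad, prev_grad)
    return [g - scale * p for g, p in zip(grad, prev_grad)]


prev = [1.0, 0.0]
aligned = project_out([3.0, 1.0], prev)      # cosine is about 0.95, so projected
orthogonal = project_out([0.0, 2.0], prev)   # cosine is 0, so left unchanged
```

A softer variant would shrink the aligned component instead of removing it; which form the paper uses is not stated in the abstract.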

Updated: 2026-03-02 06:19:43

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2603.01501v1

SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization into an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high-quality trajectories synthesized via a Multi-Agent framework, aiming to instill tool-calling expertise. Subsequently, the model's reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.

Updated: 2026-03-02 06:18:27

Categories: cs.AI

Download: http://arxiv.org/abs/2602.09463v3

Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report)

The rapid development of large language models (LLMs) has driven the widespread adoption of cloud-based LLM inference services, while also bringing prominent privacy risks associated with the transmission and processing of private data in remote inference. For privacy-preserving LLM inference technologies to be practically applied in industrial scenarios, three core requirements must be satisfied simultaneously: (1) Accuracy and efficiency losses should be minimized to mitigate degradation in service experience. (2) The inference process should run on large-scale clusters consisting of heterogeneous legacy xPUs. (3) Compatibility with existing LLM infrastructures should be ensured to reuse their engineering optimizations. To the best of our knowledge, none of the existing privacy-preserving LLM inference methods satisfies all of the above constraints while delivering meaningful privacy guarantees. In this paper, we propose AloePri, the first privacy-preserving LLM inference method for industrial applications. AloePri protects both the input and output data by covariant obfuscation, which jointly transforms data and model parameters to achieve better accuracy and privacy. We carefully design the transformation for each model component to ensure inference accuracy and data privacy while keeping full compatibility with existing infrastructures of Language Model as a Service. AloePri has been integrated into an industrial system for the evaluation of mainstream LLMs. The evaluation on the Deepseek-V3.1-Terminus model (671B parameters) demonstrates that AloePri causes an accuracy loss of 0.0% to 3.5% and exhibits efficiency equivalent to that of plaintext inference. Meanwhile, AloePri successfully resists state-of-the-art attacks, with less than 5% of tokens recovered. To the best of our knowledge, AloePri is the first method to exhibit practical applicability to large-scale models in real-world systems.

Updated: 2026-03-02 06:16:36

Categories: cs.CR, cs.AI

Download: http://arxiv.org/abs/2603.01499v1

BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving

Diffusion-based planners have shown strong potential for autonomous driving by capturing multi-modal driving behaviors. A key challenge is how to effectively guide these models for safe and reactive planning in closed-loop settings, where the ego vehicle's actions influence future states. Recent work leverages typical expert driving behaviors (i.e., anchors) to guide diffusion planners but relies on a truncated diffusion schedule that introduces an asymmetry between the forward and denoising processes, diverging from the core principles of diffusion models. To address this, we introduce BridgeDrive, a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning. Our approach formulates planning as a diffusion bridge that directly transforms coarse anchor trajectories into refined, context-aware plans, ensuring theoretical consistency between the forward and reverse processes. BridgeDrive is compatible with efficient ODE solvers, enabling real-time deployment. We achieve state-of-the-art performance on the Bench2Drive closed-loop evaluation benchmark, improving the success rate by 7.72% over prior art. Project page: https://github.com/shuliu-ethz/BridgeDrive.

Updated: 2026-03-02 06:15:43

Categories: cs.AI, cs.CV, cs.LG

Download: http://arxiv.org/abs/2509.23589v3

MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness

Few-shot anomaly generation is a key challenge in industrial quality control. Although diffusion models are promising, existing methods struggle: global prompt-guided approaches corrupt normal regions, and existing inpainting-based methods often lack the in-distribution diversity essential for robust downstream models. We propose MAGIC, a fine-tuned inpainting framework that generates high-fidelity anomalies that strictly adhere to the mask while maximizing this diversity. MAGIC introduces three complementary components: (i) Gaussian prompt perturbation, which prevents model overfitting in the few-shot setting by learning and sampling from a smooth manifold of realistic anomalies, (ii) spatially adaptive guidance, which applies distinct guidance strengths to the anomaly and background regions, and (iii) context-aware mask alignment, which relocates masks for plausible placement within the host object. Under a consistent evaluation protocol, MAGIC outperforms state-of-the-art methods on diverse anomaly datasets in downstream tasks.
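
Component (ii) can be pictured as a guidance scale that depends on the mask. The sketch below assumes a classifier-free-guidance style update, where the guided noise estimate is the unconditional estimate plus a scale times the difference to the conditional one, applied per pixel. The abstract does not state that MAGIC uses this exact formula, and the default strengths are made-up values.

```python
def adaptive_guidance(eps_uncond, eps_cond, mask,
                      w_anomaly=7.5, w_background=1.0):
    """Blend unconditional and conditional noise estimates per pixel.

    Pixels inside the anomaly mask use `w_anomaly`; all other pixels use
    `w_background`. Inputs are flat lists of equal length.
    """
    guided = []
    for u, c, inside in zip(eps_uncond, eps_cond, mask):
        w = w_anomaly if inside else w_background
        guided.append(u + w * (c - u))
    return guided


# Two pixels: the first lies inside the anomaly mask, the second does not.
guided = adaptive_guidance([0.0, 0.0], [1.0, 1.0], [1, 0],
                           w_anomaly=2.0, w_background=1.0)
```

A background scale of 1.0 reproduces the conditional estimate unchanged, so only the masked region is pushed harder toward the prompt.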

Updated: 2026-03-02 06:12:37

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2507.02314v5

Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision

Large Language Models (LLMs) are increasingly deployed for code generation in high-stakes software development, yet their limited transparency in security reasoning and brittleness to evolving vulnerability patterns raise critical trustworthiness concerns. Models trained on static datasets cannot readily adapt to newly discovered vulnerabilities or changing security standards without retraining, leading to the repeated generation of unsafe code. We present a principled approach to trustworthy code generation by design that operates as an inference-time safety mechanism. Our approach employs retrieval-augmented generation to surface relevant security risks in generated code and retrieve related security discussions from a curated Stack Overflow knowledge base, which are then used to guide an LLM during code revision. This design emphasizes three aspects relevant to trustworthiness: (1) interpretability, through transparent safety interventions grounded in expert community explanations; (2) robustness, by allowing adaptation to evolving security practices without model retraining; and (3) safety alignment, through real-time intervention before unsafe code reaches deployment. Across real-world and benchmark datasets, our approach improves the security of LLM-generated code compared to prompting alone, while introducing no new vulnerabilities as measured by static analysis. These results suggest that principled, retrieval-augmented inference-time interventions can serve as a complementary mechanism for improving the safety of LLM-based code generation, and highlight the ongoing value of community knowledge in supporting trustworthy AI deployment.

Updated: 2026-03-02 06:06:34

Categories: cs.SE, cs.AI, cs.CR, cs.LG

Download: http://arxiv.org/abs/2603.01494v1

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots, failing to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic personal albums. It is designed to shift the paradigm from visual matching to personalized, multi-source, intent-driven reasoning. Based on a rigorous multi-source profiling framework, which integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems orchestrate tools poorly. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings, necessitating robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. PhotoBench is available.

Updated: 2026-03-02 06:02:40

标题: PhotoBench:超越视觉匹配,走向个性化意图驱动的照片检索

摘要: 个人相册不仅仅是静态图像的集合,而是由时间连续性、社会纠葛和丰富元数据定义的生态存档,这使得个性化照片检索变得复杂。然而,现有的检索基准主要依赖于与上下文隔离的网络快照,无法捕捉解决真实、目的驱动用户查询所需的多源推理。为了弥补这一差距,我们引入了PhotoBench,这是从真实个人相册构建的第一个基准。它旨在将范式从视觉匹配转变为个性化多源目的驱动推理。基于严格的多源配置框架,该框架集成了每个图像的视觉语义、时空元数据、社会身份和时间事件,我们综合了根植于用户生活轨迹的复杂目的驱动查询。在PhotoBench上的广泛评估揭示了两个关键限制:模态差距,即统一嵌入模型在非视觉约束上失效,以及源融合悖论,即代理系统在工具编排方面表现不佳。这些发现表明,个人多模式检索的下一个前沿在于超越统一嵌入,需要能够精确满足约束和多源融合的强大代理推理系统。我们的PhotoBench已经发布。

更新时间: 2026-03-02 06:02:40

领域: cs.IR,cs.AI,cs.CV,cs.MM

下载: http://arxiv.org/abs/2603.01493v1

Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

We analyze cumulative parameter trajectories of transformer training under AdamW and identify a dominant low-dimensional drift direction ("backbone") that captures 60-80% of long-horizon displacement from initialization. This direction is highly stable over rolling training windows yet reorients gradually across phases, particularly following objective reweighting. Per-batch gradients exhibit near-noise-floor alignment with the backbone, whereas optimizer-integrated updates align strongly with it, indicating that the structure emerges from accumulated optimizer dynamics rather than instantaneous gradient geometry. Replacing AdamW with SGD-family optimizers eliminates this structure, and reducing $β_2$ smoothly degrades backbone dominance and reheating recoverability. Reheating experiments show that transverse probe modes can be transiently re-excited without substantially altering accumulated backbone drift. These results provide a trajectory-level characterization of optimizer-induced geometric structure in transformer training and shift attention from instantaneous gradient properties to cumulative update dynamics.
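As a rough illustration of what "a dominant direction that captures a fraction of displacement" can mean, the sketch below builds a toy parameter trajectory, estimates its leading direction by power iteration, and measures how much of the final displacement lies along it. The abstract does not give the exact definition, so everything here is an assumption: the squared-norm definition of "captured fraction", the toy drift-plus-noise trajectory, and the helper names `top_direction` and `captured_fraction`.

```python
import math
import random


def top_direction(rows, iters=200):
    """Leading right singular vector of the matrix `rows`, via power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # One step of v <- D^T (D v), followed by renormalization.
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def captured_fraction(displacement, direction):
    """Share of the squared displacement norm lying along a unit `direction`."""
    along = sum(d * u for d, u in zip(displacement, direction))
    return along * along / sum(d * d for d in displacement)


# Toy trajectory (hypothetical): steady drift along one direction plus small noise.
rng = random.Random(0)
drift = [0.6, 0.8, 0.0, 0.0, 0.0]
theta = [0.0] * 5  # initialization is the origin, so theta is also the displacement
displacements = []
for step in range(50):
    theta = [t + 0.1 * d + rng.gauss(0.0, 0.01) for t, d in zip(theta, drift)]
    displacements.append(list(theta))

backbone = top_direction(displacements)
fraction = captured_fraction(displacements[-1], backbone)
print(f"fraction of final displacement along the leading direction: {fraction:.3f}")
```

In this toy setting nearly all displacement lies along the drift direction by construction; real training trajectories would need checkpointed parameters in place of the synthetic `displacements`.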

Updated: 2026-03-02 06:00:21

标题: 优化器诱导的低维漂移与变压器训练中的横向动力学

摘要: 我们分析了在AdamW下transformer训练的累积参数轨迹,并确定了一个占据60-80%长期位移的主导低维漂移方向(“骨干”)。这个方向在滚动训练窗口中非常稳定,但在不同阶段之间逐渐重新定位,特别是在目标重新加权后。每个批次的梯度与骨干几乎无噪声地对齐,而优化器集成的更新与之强烈对齐,表明该结构是由积累的优化器动态产生的,而不是即时梯度几何结构。 将AdamW替换为SGD家族的优化器会消除这种结构,而平滑降低$β_2$会逐渐降低骨干的优势性和再加热的可恢复性。再加热实验表明,横向探测模式可以在不实质改变累积骨干漂移的情况下被短暂重新激发。 这些结果提供了transformer训练中由优化器引起的几何结构的轨迹级别表征,并将注意力从即时梯度属性转移到累积更新动态。

更新时间: 2026-03-02 06:00:21

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2602.23696v2

ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models

Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.

Updated: 2026-03-02 05:56:03

标题: ATA: 将隐式推理与基于注意力和基于行动的推理相结合,用于视觉-语言行动模型

摘要: 视觉-语言-动作(VLA)模型依赖于当前观察结果,包括图像、语言指令和机器人状态,以预测动作并完成任务。准确的视觉感知对于精确的动作预测和执行至关重要,最近的研究尝试通过在推理过程中引入显式推理来进一步提升性能。然而,这种方法面临着重大限制。它们通常依赖于数据密集型资源,如“思路链”(CoT)风格的注释,将任务分解为逐步推理,并在许多情况下需要额外的视觉基础注释(例如边界框或遮罩)来突出相关的图像区域。此外,它们涉及耗时的数据集构建、标注和重新训练,最终导致推理序列变长,效率降低。为了解决这些挑战,我们提出了ATA,这是一个新颖的无需训练的框架,通过互补的关注引导和动作引导策略将隐式推理引入到VLA推理中。与CoT或显式视觉基础方法不同,ATA通过将注意力图与基于动作的感兴趣区域(RoI)集成,从而自适应地优化视觉输入,而无需额外的训练或注释。ATA是一种轻量而有效的VLA模型的即插即用的隐式推理方法。大量实验证明,它始终提高任务成功率和鲁棒性,同时保持甚至增强推理效率。

更新时间: 2026-03-02 05:56:03

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.01490v1

LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning

Despite achieving remarkable success in complex tasks, Deep Reinforcement Learning (DRL) still suffers from critical issues in practical applications, such as low data efficiency, lack of interpretability, and limited cross-environment transferability. Moreover, learned policies that generate actions from states are sensitive to environmental changes and struggle to guarantee behavioral safety and compliance. Recent research shows that integrating Large Language Models (LLMs) with symbolic planning is promising in addressing these challenges. Inspired by this, we introduce a novel LLM-driven closed-loop framework, which enables semantic-driven skill reuse and real-time constraint monitoring by mapping natural language instructions into executable rules and semantically annotating automatically created options. The proposed approach utilizes the general knowledge of LLMs to improve exploration efficiency and adapt transferable options to similar environments, and provides inherent interpretability through semantic annotations. To validate the effectiveness of this framework, we conduct experiments on two domains, Office World and Montezuma's Revenge. The results demonstrate superior performance in data efficiency, constraint compliance, and cross-task transferability.

Updated: 2026-03-02 05:54:02

标题: LLM辅助语义选项发现以促进自适应深度强化学习

摘要: 尽管深度强化学习(DRL)在复杂任务中取得了显著的成功,但在实际应用中仍然存在关键问题,如数据效率低、缺乏可解释性和跨环境可转移性有限。然而,基于状态生成动作的学习策略对环境变化敏感,难以保证行为安全和合规性。最近的研究表明,将大型语言模型(LLMs)与符号规划相结合在解决这些挑战方面具有潜力。受此启发,我们引入了一种新颖的LLM驱动闭环框架,通过将自然语言指令映射到可执行规则并自动创建选项进行语义驱动技能重用和实时约束监控。所提出的方法利用LLMs的通用知识促进探索效率,并适应类似环境的可转移选项,并通过语义标注提供固有的可解释性。为验证该框架的有效性,我们在Office World和Montezuma的Revenge两个领域进行实验。结果表明,在数据效率、约束合规性和跨任务可转移性方面表现出卓越的性能。

更新时间: 2026-03-02 05:54:02

领域: cs.AI

下载: http://arxiv.org/abs/2603.01488v1

Agentic Multi-Source Grounding for Enhanced Query Intent Understanding: A DoorDash Case Study

Accurately mapping user queries to business categories is a fundamental Information Retrieval challenge for multi-category marketplaces, where context-sparse queries such as "Wildflower" exhibit intent ambiguity, simultaneously denoting a restaurant chain, a retail product, and a floral item. Traditional classifiers force a winner-takes-all assignment, while general-purpose LLMs hallucinate unavailable inventory. We introduce an Agentic Multi-Source Grounded system that addresses both failure modes by grounding LLM inference in (i) a staged catalog entity retrieval pipeline and (ii) an agentic web-search tool invoked autonomously for cold-start queries. Rather than predicting a single label, the model emits an ordered multi-intent set, resolved by a configurable disambiguation layer that applies deterministic business policies and is designed for extensibility to personalization signals. This decoupled design generalizes across domains, allowing any marketplace to supply its own grounding sources and resolution rules without modifying the core architecture. Evaluated on DoorDash's multi-vertical search platform, the system achieves +10.9pp over the ungrounded LLM baseline and +4.6pp over the legacy production system. On long-tail queries, incremental ablations attribute +8.3pp to catalog grounding, +3.2pp to agentic web search grounding, and +1.5pp to dual intent disambiguation, yielding 90.7% accuracy (+13.0pp over baseline). The system is deployed in production, serving over 95% of daily search impressions, and establishes a generalizable paradigm for applications requiring foundation models grounded in proprietary context and real-time web knowledge to resolve ambiguous, context-sparse decision problems at scale.

Updated: 2026-03-02 05:51:05

标题: Agentic多源接地以增强查询意图理解:DoorDash案例研究

摘要: 准确地将用户查询映射到商业类别是多类别市场中的一个基本信息检索挑战,其中上下文稀疏的查询(例如“Wildflower”)表现出意图模糊,同时表示一个餐厅连锁店、一个零售产品和一种花卉物品。传统的分类器强制进行胜者通吃的分配,而通用的LLMs则会产生无法获得的库存。我们引入了一种具有主动多源基础的系统,通过将LLM推断与(i)分阶段目录实体检索管道和(ii)一种主动的网络搜索工具相结合,针对冷启动查询自主调用,来解决这两种失败模式。该模型不是预测单一标签,而是发出一个经过排序的多意图集,通过一个可配置的消歧层解决,该消歧层应用确定性业务策略,并设计用于可扩展到个性化信号。这种解耦的设计在各个领域中具有普遍性,允许任何市场提供自己的基础来源和解决规则,而无需修改核心架构。在DoorDash的多垂直搜索平台上进行评估,该系统在未经基础支持的LLM基线上实现了+10.9pp,并比传统的生产系统提升了+4.6pp。在长尾查询中,增量剥离将+8.3pp归因于目录基础,+3.2pp归因于主动网络搜索基础,+1.5pp归因于双意图消歧,从而实现了90.7%的准确性(比基线提高了+13.0pp)。该系统已经投入生产,服务超过95%的日常搜索印象,并为需要基于专有上下文和实时网络知识进行基础建模以解决规模化模糊、上下文稀疏决策问题的应用建立了一个可推广的范例。

更新时间: 2026-03-02 05:51:05

领域: cs.AI

下载: http://arxiv.org/abs/2603.01486v1

EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments

We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.

Updated: 2026-03-02 05:48:19

标题: EnterpriseBench Corecraft:在高保真度强化学习环境中训练通用代理

摘要: 我们展示了在高保真度强化学习环境上训练AI代理会产生超越训练分布的能力。我们介绍了CoreCraft,EnterpriseBench中的第一个环境,Surge AI的RL代理环境套件。CoreCraft是一个完全运作的客户支持组织的企业模拟,包括14种实体类型的2,500多个实体,拥有23种独特的工具,旨在衡量AI代理是否能够执行真实工作所需的多步骤、领域特定工作。像GPT-5.2和Claude Opus 4.6这样的前沿模型在所有专家编写的标准标准条件必须满足时只解决了不到30%的任务。利用这个环境,我们使用Group Relative Policy Optimization (GRPO)和自适应剪裁来训练GLM 4.6。经过一次训练时代,模型在保留的评估任务上的任务通过率从25.37%提高到36.76%。更重要的是,这些收益转移到了超出分布区域的基准测试:在BFCL Parallel上增加了+4.5%,在Tau2-Bench Retail上增加了+7.4%,在Tool Decathlon (Pass@1)上增加了+6.8%。我们相信,三个环境属性与观察到的转移一致:以任务为中心的世界构建,优化多样化、具有挑战性的任务;专家编写的标准使得可靠的奖励计算成为可能;反映现实职业模式的企业工作流程。我们的结果表明,环境质量、多样性和逼真性是实现可推广代理能力的关键因素。

更新时间: 2026-03-02 05:48:19

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2602.16179v5

Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding

Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents' movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance: an edge with a high weight merely discourages agents from using it rather than prohibiting them from traversing it. In this paper, we explore the need to incorporate edge-direction optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two separate phases. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge-direction-aware guidance graphs.
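The difference between soft guidance (a high edge weight) and strict guidance (a removed edge direction) can be sketched with a single-agent shortest-path query. This is only an illustration under stated assumptions: one agent, no waiting actions, a three-vertex corridor, and a hypothetical helper `shortest_path_cost`. It is not the MGGO method described in the abstract.

```python
import heapq


def shortest_path_cost(edges, start, goal):
    """Dijkstra over directed edges {(u, v): cost}; returns inf if goal is unreachable."""
    adj = {}
    for (u, v), cost in edges.items():
        adj.setdefault(u, []).append((v, cost))
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


# Corridor A - B - C. Soft guidance: right-to-left moves are expensive but allowed.
soft = {("A", "B"): 1.0, ("B", "C"): 1.0, ("B", "A"): 50.0, ("C", "B"): 50.0}
# Strict guidance: the right-to-left directions are removed from the graph.
strict = {("A", "B"): 1.0, ("B", "C"): 1.0}

soft_cost = shortest_path_cost(soft, "C", "A")      # expensive, but still traversable
strict_cost = shortest_path_cost(strict, "C", "A")  # no path exists
forward_cost = shortest_path_cost(strict, "A", "C")
print(soft_cost, strict_cost, forward_cost)
```

With weights alone, an agent with no alternative route still travels against the preferred flow; removing the direction makes that move impossible, which is the distinction the abstract draws.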

Updated: 2026-03-02 05:47:08

标题: 对于终身多智能体路径规划中混合引导图的边方向和权重的优化

摘要: 多智能体路径规划(MAPF)旨在将智能体从图上的起始点移动到目标点。终身多智能体路径规划(LMAPF)不断将新目标分配给智能体,当他们完成当前目标时。为了指导LMAPF中智能体的移动,先前的研究提出了指导图优化(GGO)方法来优化一个指导图,该图是一个双向加权图,其有向边表示移动和等待动作,边的权重是动作成本。较高的边权重表示较高的动作成本。然而,边权重只提供软指导。高权重的边仅仅阻止智能体使用它,而不是禁止智能体穿越它。在本文中,我们探讨了将边方向优化纳入到GGO中的必要性,提供严格的指导。我们将GGO推广为混合指导图优化(MGGO),提出了两种能够优化边权重和方向的MGGO方法。第一种分别优化边方向和边权重两个阶段。第二种应用质量多样性算法来优化一个能够生成边方向和权重的神经网络。我们还将与边方向相关的交通模式纳入到GGO方法中,使其能够生成具有边方向意识的指导图。

更新时间: 2026-03-02 05:47:08

领域: cs.MA,cs.AI,cs.RO

下载: http://arxiv.org/abs/2602.23468v2

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.

Updated: 2026-03-02 05:45:55

标题: 一个自我监督语音模型的SUPERB风格基准用于音频深度伪造检测

摘要: 自监督学习(SSL)已经改变了语音处理,SUPERB等基准建立了跨多样化下游任务的公平比较。尽管音频深度伪造检测的安全关键性很重要,但它仍然落在这些努力之外。在这项工作中,我们引入了Spoof-SUPERB,一个用于音频深度伪造检测的基准,系统地评估了20个跨生成、鉴别和基于谱图的SSL模型。我们在多个领域内和领域外的数据集上评估了这些模型。我们的结果显示,诸如XLS-R、UniSpeech-SAT和WavLM Large等大规模鉴别模型始终表现优异,受益于多语言预训练、说话者感知目标和模型规模。我们进一步分析了这些模型在声学退化下的鲁棒性,显示生成方法急剧退化,而鉴别模型保持弹性。这个基准建立了一个可重现的基线,并提供了对哪些SSL表示最可靠以确保语音系统抵御音频深度伪造的实用见解。

更新时间: 2026-03-02 05:45:55

领域: eess.AS,cs.AI,cs.LG,eess.SP

下载: http://arxiv.org/abs/2603.01482v1

Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents

Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals, which leads to unstable training or reward hacking. To address this issue, we propose Dual-Horizon Credit Assignment (DuCA), a framework that disentangles optimization across time scales. Its core, Horizon-Independent Advantage Normalization (HIAN), separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives to the policy update. Extensive experiments with a high-fidelity user simulator show DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%, indicating a substantial improvement for an industrial sales scenario that effectively balances the dual demands of strategic performance and naturalistic language generation.
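The idea of normalizing turn-level and session-level advantages separately before fusing them can be illustrated with a small sketch. The abstract does not specify the normalization or the fusion rule, so the z-score over a sampled group, the equal fusion weights, and the function names below are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize values across a group of sampled responses."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def merged_reward_advantages(turn_rewards, session_rewards):
    """Baseline: add the two rewards first, then normalize once."""
    return zscore([t + s for t, s in zip(turn_rewards, session_rewards)])


def separately_normalized_advantages(turn_rewards, session_rewards,
                                     w_turn=0.5, w_session=0.5):
    """Normalize each horizon on its own scale, then fuse (weights are hypothetical)."""
    a_turn = zscore(turn_rewards)
    a_session = zscore(session_rewards)
    return [w_turn * t + w_session * s for t, s in zip(a_turn, a_session)]


# Four sampled dialogues: small-scale turn rewards, large-scale session rewards.
turn = [0.9, 0.1, 0.1, 0.9]
session = [100.0, 100.0, 0.0, 0.0]

merged = merged_reward_advantages(turn, session)
fused = separately_normalized_advantages(turn, session)
print("merged:", [round(a, 3) for a in merged])
print("fused: ", [round(a, 3) for a in fused])
```

In the merged baseline the first two dialogues receive almost the same advantage even though their turn-level quality differs sharply, because the session reward dominates the scale. After separate normalization the turn-level signal contributes on equal footing.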

Updated: 2026-03-02 05:44:50

标题: 在多轮强化学习中协调稠密和稀疏信号:面向工业销售代理商的双重视野信用分配

摘要: 为了优化用于工业销售的大型语言模型,需要平衡长期商业目标(例如转化率)与即时的语言约束,如流畅性和合规性。传统的强化学习通常将这些异质目标合并为一个单一的奖励,导致高幅度的会话级奖励压倒微妙的轮次级信号,从而导致训练不稳定或奖励欺骗。为了解决这个问题,我们提出了双重时间跨度信用分配(DuCA)框架,这个框架在不同时间尺度上进行优化。其核心是Horizon-Independent Advantage Normalization(HIAN),在融合之前单独对轮次级和会话级奖励的优势进行归一化,确保来自即时和长期目标的平衡梯度贡献到策略更新中。通过与高保真用户模拟器的大量实验表明,DuCA优于最先进的GRPO基线,转化率相对提高了6.82%,减少了82.28%的句间重复,将身份检测率降低了27.35%,表明在工业销售场景中实现了战略性绩效和自然语言生成的双重需求的有效平衡。

更新时间: 2026-03-02 05:44:50

领域: cs.AI

下载: http://arxiv.org/abs/2603.01481v1

Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems

While Multi-Agent Systems (MAS) excel at complex tasks, their growing autonomy and operational complexity often lead to critical inefficiencies, such as excessive token consumption and failures arising from misinformation. Existing methods primarily focus on post-hoc failure attribution, lacking proactive, real-time interventions to enhance robustness and efficiency. To this end, we introduce SupervisorAgent, a lightweight and modular framework for adaptive runtime supervision that operates without altering the base agent's architecture. Triggered by an LLM-free adaptive filter, SupervisorAgent intervenes at critical junctures to proactively correct errors, guide inefficient behaviors, and purify observations. On the challenging GAIA benchmark, SupervisorAgent reduces the token consumption of the Smolagent framework by an average of 29.68% without compromising its success rate. Extensive experiments across five additional benchmarks (math reasoning, code generation, and question answering) and various SoTA foundation models validate the broad applicability and robustness of our approach.

Updated: 2026-03-02 05:41:00

标题: 不要浪费你的代币:朝着高效的运行时多代理系统方向

摘要: 尽管多智能体系统(MAS)擅长处理复杂任务,但随着运行复杂性的增加,它们日益自主的特性常常导致关键的低效率问题,例如过度的令牌消耗和由于错误信息而导致的失败。现有方法主要侧重于事后故障归因,缺乏主动、实时的干预措施来增强系统的稳健性和效率。为此,我们引入了SupervisorAgent,这是一个轻量级、模块化的框架,用于运行时自适应监督,而不会改变基本智能体的架构。由LLM-free自适应滤波器触发,SupervisorAgent在关键时刻进行干预,以主动纠正错误、指导低效行为并净化观察结果。在具有挑战性的GAIA基准测试中,SupervisorAgent将Smolagent框架的令牌消耗平均降低了29.68%,而不影响其成功率。跨五个额外基准测试(数学推理、代码生成和问题回答)以及各种SoTA基础模型的广泛实验验证了我们方法的广泛适用性和稳健性。

更新时间: 2026-03-02 05:41:00

领域: cs.MA,cs.AI

下载: http://arxiv.org/abs/2510.26585v2

t-SNE Exaggerates Clusters, Provably

Central to the widespread use of t-distributed stochastic neighbor embedding (t-SNE) is the conviction that it produces visualizations whose structure roughly matches that of the input. To the contrary, we prove that (1) the strength of the input clustering, and (2) the extremity of outlier points, cannot be reliably inferred from the t-SNE output. We demonstrate the prevalence of these failure modes in practice as well.

Updated: 2026-03-02 05:39:53

标题: t-SNE 夸大聚类,可证明

摘要: 广泛使用t-分布随机邻居嵌入(t-SNE)的关键在于其产生的可视化结果大致与输入数据的结构相匹配。相反,我们证明了(1)输入数据的聚类强度和(2)离群点的极端程度无法可靠地从t-SNE的输出中推断出来。我们还在实践中展示了这些失败模式的普遍存在。

更新时间: 2026-03-02 05:39:53

领域: cs.LG

下载: http://arxiv.org/abs/2510.07746v2

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware-software co-design inference framework for LMMs that breaks large models into modular "bricks" (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. It performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3% and GPU memory usage by 11.2%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly 20.8 hours.

Updated: 2026-03-02 05:38:29

标题: 微小但强大:用于电池供电小型设备上高效多模态推理的软硬件协同设计方法

摘要: 大型多模型模型(LMMs)在本质上是模块化的,由视觉和音频编码器、投影仪和大型语言模型组成。然而,它们几乎总是以单体方式执行,这样做会浪费现代SoCs中的异构加速器(NPUs、GPUs、DSPs),并导致高端到端延迟。在本文中,我们提出了NANOMIND,这是一个用于大型多模型模型(LMMs)的硬件-软件协同设计推理框架,将大型模型分解为模块化的“模块”(视觉、语言、音频等),并将每个模块映射到其理想的加速器。关键的洞见是,大型模型可以分解为模块化组件,并安排在最合适的计算单元上运行。它在统一内存SoCs上执行模块级动态卸载加速器。通过结合定制硬件设计、系统级调度和优化的低位计算内核,我们展示了我们的框架,使用一个紧凑、电池供电的设备完全能够在设备上运行LMMs。这个原型作为一个独立的智能助手,不需要网络连接,同时在严格的资源约束下实现更高的吞吐量和更优越的功耗效率。该设计进一步通过基于标记的缓冲管理和模块级协调来绕过CPU瓶颈并减少冗余内存使用。我们的系统在资源效率方面优于现有的实现,能够将能耗降低42.3%、GPU内存使用降低11.2%。这使得一个电池供电的设备可以运行带有相机的LLaVA-OneVision近20.8小时。

更新时间: 2026-03-02 05:38:29

领域: cs.DC,cs.AI,cs.CL,eess.SP

下载: http://arxiv.org/abs/2510.05109v4

Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization

Can we modify the training data distribution to encourage the underlying optimization method toward finding solutions with superior generalization performance on in-distribution data? In this work, we approach this question for the first time by comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM). By studying a two-layer CNN, we rigorously prove that SAM learns different features more uniformly, particularly in early epochs. That is, SAM is less susceptible to simplicity bias than GD. We also show that examples containing features that are learned early are separable from the rest based on the model's output. Based on this observation, we propose USEFUL, a method that (i) clusters examples based on the network output early in training, (ii) identifies a cluster of examples with similar network output, and (iii) upsamples the remaining examples only once to alleviate the simplicity bias. We show empirically that USEFUL effectively improves the generalization performance on the original data distribution when training with various gradient methods, including (S)GD and SAM. Notably, we demonstrate that our method can be combined with SAM variants and existing data augmentation strategies to achieve, to the best of our knowledge, state-of-the-art performance for training ResNet18 on CIFAR10, STL10, CINIC10, Tiny-ImageNet; ResNet34 on CIFAR100; and VGG19 and DenseNet121 on CIFAR10. Our code is available at https://github.com/BigML-CS-UCLA/TADA.
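The three steps listed in the abstract (cluster on early network output, identify the cluster learned early, upsample the rest once) can be sketched as follows. This is a simplified illustration, not the released implementation. It assumes each example is summarized by a single scalar (its early-training loss), that two clusters suffice, and that the low-loss cluster is the one learned early. The helper names are hypothetical.

```python
def two_means_1d(values, iters=50):
    """Two-cluster k-means on scalars; label 0 is the low-value cluster."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = sum(group0) / len(group0) if group0 else lo
        new_hi = sum(group1) / len(group1) if group1 else hi
        if new_lo == lo and new_hi == hi:
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return labels


def upsample_once(indices, early_losses):
    """Return training indices with every not-yet-learned example duplicated once."""
    labels = two_means_1d(early_losses)
    not_learned_early = [i for i, lab in zip(indices, labels) if lab == 1]
    return list(indices) + not_learned_early


# Per-example losses recorded after a few early epochs (made-up numbers).
early_losses = [0.05, 0.07, 0.06, 1.20, 0.04, 1.50]
new_indices = upsample_once(range(len(early_losses)), early_losses)
print(new_indices)
```

The returned index list would then drive the data loader for the rest of training, so the examples not yet fit early on appear twice per epoch.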

Updated: 2026-03-02 05:37:42

标题: 改变训练数据分布以减少简化偏差,提高内分布泛化

摘要: 我们可以修改训练数据分布以鼓励基础优化方法朝着在分布数据上具有更好泛化性能的解决方案发展吗?在这项工作中,我们首次通过比较梯度下降(GD)的归纳偏差和锐度感知最小化(SAM)的归纳偏差来探讨这个问题。通过研究一个两层CNN,我们严格证明了SAM更加均匀地学习不同的特征,特别是在早期时期。也就是说,与GD相比,SAM更不容易受到简单偏见的影响。我们还展示了包含早期学习特征的示例可以基于模型的输出与其他示例分离。基于这一观察,我们提出了一种方法:(i)根据网络输出在训练早期对示例进行聚类,(ii)识别具有相似网络输出的示例群,并(iii)仅对其余示例进行一次上采样以减轻简单偏见。我们实证表明,当使用各种梯度方法进行训练时,USEFUL有效地改善了原始数据分布上的泛化性能,包括(S)GD和SAM。值得注意的是,我们证明了我们的方法可以与SAM变体和现有的数据增强策略结合使用,以实现在CIFAR10、STL10、CINIC10、Tiny-ImageNet上训练ResNet18;在CIFAR100上训练ResNet34;以及在CIFAR10上训练VGG19和DenseNet121时,达到我们所知的最先进性能。我们的代码可在https://github.com/BigML-CS-UCLA/TADA找到。

更新时间: 2026-03-02 05:37:42

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2404.17768v3

Toward Clinically Explainable AI for Medical Diagnosis: A Foundation Model with Human-Compatible Reasoning via Reinforcement Learning

The clinical adoption of artificial intelligence (AI) in medical diagnostics is critically hampered by its black-box nature, which prevents clinicians from verifying the rationale behind automated decisions. To overcome this fundamental barrier, we introduce DeepMedix-R1, a foundation model (FM) for chest X-ray (CXR) interpretation that generates not only accurate diagnoses but also a transparent, step-by-step reasoning process grounded in specific visual evidence. Our methodology employs a sequential training strategy, beginning with instruction fine-tuning, followed by a cold-start phase to elicit reasoning capabilities. Critically, we then implement reinforcement learning with grounded rewards to meticulously refine the model, aligning both its diagnostic outputs and its reasoning pathways with clinical plausibility. Quantitative assessments show that DeepMedix-R1 substantially outperforms advanced FMs, achieving improvements in report generation and visual question answering tasks. We also introduce Report Arena, a novel LLM-based benchmark that ranks DeepMedix-R1 first among competing models for output quality. Most significantly, a formal review by clinical experts reveals a profound preference for DeepMedix-R1's generated reasoning over the broadly adopted Qwen2.5-VL-7B model, confirming its superior interpretability and clinical utility.

Updated: 2026-03-02 05:37:36

标题: 朝着医学诊断的临床可解释人工智能迈进:基于强化学习的人类兼容推理的基础模型

摘要: 医学诊断中人工智能(AI)的临床应用受到其黑匣子特性的严重阻碍,这阻碍了临床医生验证自动决策背后的理由。为了克服这一根本障碍,我们引入了DeepMedix-R1,一个用于胸部X射线(CXR)解释的基础模型(FM),它不仅能生成准确的诊断,还能提供透明、一步一步的推理过程,这一过程基于具体的视觉证据。我们的方法采用顺序训练策略,从指导微调开始,然后进入冷启动阶段以引出推理能力。关键是,我们采用基于奖励的强化学习来精心完善模型,使其诊断输出和推理路径与临床可信度相一致。定量评估显示,DeepMedix-R1在报告生成和视觉问题回答任务中明显优于先进的FM,取得了显著改进。我们还介绍了Report Arena,一个基于LLM的新型基准,将DeepMedix-R1在输出质量上排名第一。最重要的是,临床专家的正式审查显示,DeepMedix-R1生成的推理深受青睐,优于广泛采用的Qwen2.5-VL-7B模型,证实其优越的可解释性和临床实用性。

更新时间: 2026-03-02 05:37:36

领域: cs.AI

下载: http://arxiv.org/abs/2509.03906v2

Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality

Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. This drives the multimodal model to compress the semantic information of the input into the <EOS> token, laying the foundations for subsequent contrastive learning. Extensive experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality. Results validate that content reconstruction serves as an effective strategy to maximize the value of existing data, enabling multimodal embedding models to generate compact and informative representations and raising their performance ceiling.

Updated: 2026-03-02 05:34:45

标题: 通过协同关注重建内容以提高多模态嵌入质量

摘要: 多模态嵌入模型,源自多模态大型语言模型(MLLMs),在检索和分类等各种任务中取得了显著的性能提升。然而,大多数现有方法严重依赖于大规模对比学习,对MLLMs的架构和训练范式如何影响嵌入质量的探索有限。虽然对于生成来说,MLLMs的因果注意力和下一个标记预测范式并不明确鼓励形成全局紧凑的表示,从而限制了它们作为多模态嵌入主干的有效性。为了解决这个问题,我们提出了CoCoA,一种基于协同注意力的内容重建预训练范式,用于多模态嵌入优化。具体来说,我们重构了注意力流,并引入了基于EOS的重建任务,鼓励模型从相应的<EOS>嵌入中重建输入。这使多模态模型将输入的语义信息压缩到<EOS>标记中,为后续对比学习奠定了基础。对MMEB-V1的大量实验表明,基于Qwen2-VL和Qwen2.5-VL的CoCoA显著提高了嵌入质量。结果验证了内容重建作为一种有效策略,可以最大化现有数据的价值,使多模态嵌入模型生成紧凑且信息丰富的表示,提高其性能上限。

更新时间: 2026-03-02 05:34:45

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2603.01471v1

Randomized Kriging Believer for Parallel Bayesian Optimization with Regret Bounds

We consider an optimization problem of an expensive-to-evaluate black-box function, in which we can obtain noisy function values in parallel. For this problem, parallel Bayesian optimization (PBO) is a promising approach, which aims to optimize with fewer function evaluations by selecting a diverse input set for parallel evaluation. However, existing PBO methods suffer from poor practical performance or lack theoretical guarantees. In this study, we propose a PBO method, called randomized kriging believer (KB), based on a well-known KB heuristic and inheriting the advantages of the original KB: low computational complexity, a simple implementation, versatility across various BO methods, and applicability to asynchronous parallelization. Furthermore, we show that our randomized KB achieves Bayesian expected regret guarantees. We demonstrate the effectiveness of the proposed method through experiments on synthetic and benchmark functions and emulators of real-world data.
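For readers unfamiliar with the kriging believer heuristic that the proposed method builds on, the sketch below shows the classic, non-randomized version: batch points are chosen one at a time, and after each choice the surrogate is updated as if its own posterior mean had been observed there. The abstract does not describe how the randomized variant differs, so it is not implemented here. The tiny Gaussian-process model, the RBF kernel length scale, the UCB acquisition rule, and all function names are assumptions for illustration.

```python
import math


def rbf(a, b, length=0.3):
    """Squared-exponential kernel on scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def posterior(xs, ys, x_new, noise=1e-4):
    """GP posterior mean and variance at x_new given observations (xs, ys)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, max(var, 0.0)


def kriging_believer_batch(xs, ys, candidates, batch_size, beta=2.0):
    """Pick a batch greedily, 'believing' the posterior mean at each chosen point."""
    xs, ys = list(xs), list(ys)
    batch = []
    for _ in range(batch_size):
        def ucb(x):
            m, v = posterior(xs, ys, x)
            return m + beta * math.sqrt(v)
        best = max((c for c in candidates if c not in batch), key=ucb)
        believed, _ = posterior(xs, ys, best)  # hallucinated observation
        batch.append(best)
        xs.append(best)
        ys.append(believed)
    return batch


observed_x, observed_y = [0.1, 0.9], [0.2, 0.8]
grid = [(i + 0.5) / 10 for i in range(10)]  # candidate inputs in [0, 1]
batch = kriging_believer_batch(observed_x, observed_y, grid, batch_size=3)
print(batch)
```

Because each believed observation collapses the posterior variance at the chosen point, later picks are pushed toward other regions, which is how the heuristic produces a diverse batch without waiting for real evaluations.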

Updated: 2026-03-02 05:32:59

标题: 随机Kiring信徒的并行贝叶斯优化与遗憾界限

摘要: 我们考虑一个优化问题,其中涉及一个昂贵且难以评估的黑盒函数,我们可以同时获得噪声函数值。对于这个问题,并行贝叶斯优化(PBO)是一种有前途的方法,旨在通过选择多样化的输入集进行并行评估来优化函数评估次数。然而,现有的PBO方法存在实际性能较差或缺乏理论保证。在本研究中,我们提出了一种PBO方法,称为随机克里金信徒(KB),基于众所周知的KB启发式,并继承了原始KB的优点:低计算复杂性、简单实施、适用于各种BO方法,以及适用于异步并行化。此外,我们展示了我们的随机KB实现了贝叶斯期望遗憾保证。通过对合成和基准函数以及真实数据模拟器的实验,我们展示了所提出方法的有效性。

更新时间: 2026-03-02 05:32:59

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.01470v1

Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

The scientific ideation process often involves blending salient aspects of existing papers to create new ideas - a framework known as facet-based ideation. We contribute Scideator, the first human-LLM system for facet-based scientific ideation. Starting from a user-provided set of scientific papers, Scideator extracts key facets -- purposes, mechanisms, and evaluations -- from these and related papers, allowing users to explore the idea space by interactively recombining facets to synthesize inventive ideas. Scideator is driven by three design choices: (1) human-in-the-loop facet recombination, in which users select facets from retrieved papers and the system generates ideas by finding analogies across them via the Faceted Idea Generator module; (2) distance-controlled retrieval via the Analogous Paper Facet Finder module, which surfaces papers from the same topic to entirely different subareas to provide a spectrum of creative directions; and (3) facet-based novelty verification via the Idea Novelty Checker module, a retrieve-then-rerank pipeline that evaluates idea originality using facets. In a user study with computer science researchers, Scideator provided significantly more creativity support than a baseline using the same backbone LLM without our facet-based modules, particularly in idea exploration and expressiveness. Participants' favorite ideas more often included facets selected by themselves rather than the LLM, and participants used fewer free-text instructions with Scideator, indicating a preference for facet-level steering over prompting. Finally, re-ranking papers by facet matching rather than general relevance improved novelty classification accuracy from 13.79% to 89.66%.

Updated: 2026-03-02 05:32:11

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2409.14634v6

ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning

Protein generative models have shown remarkable promise in protein design, yet their success rates remain constrained by reliance on curated sequence-structure datasets and by misalignment between supervised objectives and real design goals. We present ProteinZero, an online reinforcement learning framework for inverse folding models that enables scalable, automated, and continuous self-improvement with computationally efficient feedback. ProteinZero employs a reward pipeline that combines structural guidance from ESMFold with a novel self-derived ddG predictor, providing stable multi-objective signals while avoiding the prohibitive cost of physics-based methods. To ensure robustness in online RL, we further introduce a novel embedding-level diversity regularizer that mitigates mode collapse and promotes functionally meaningful sequence variation. Within a general RL formulation balancing multi-reward optimization, KL-divergence from a reference model, and diversity regularization, ProteinZero achieves robust improvements across designability, stability, recovery, and diversity. On the CATH-4.3 benchmark, it consistently outperforms state-of-the-art baselines including ProteinMPNN, ESM-IF, and InstructPLM, reducing design failure rates by 36-48% and achieving success rates above 90% across diverse folds. Importantly, a complete RL run can be executed on a single 8-GPU node within three days, including reward computation and data generation. These results indicate that efficient online RL fine-tuning can complement supervised pretraining by allowing protein generative models to evolve continuously from their own outputs and optimize multiple design objectives without labeled data, opening new possibilities for exploring the vast protein design space. Full source code and model checkpoints will be released upon publication.

Updated: 2026-03-02 05:31:24

Categories: cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2506.07459v3

Mean-Flow based One-Step Vision-Language-Action

Recent advances in Flow-Matching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these notable achievements, their practical applications are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this critical bottleneck, we propose a Mean-Flow based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation. Real-world robotic experiments show that the generation speed of the proposed Mean-Flow based One-Step VLA is 8.7 times and 83.9 times faster than that of SmolVLA and Diffusion Policy, respectively. These results elucidate its great potential as a high-efficiency backbone for VLA-based robotic manipulation.

Updated: 2026-03-02 05:30:30

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2603.01469v1

Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing

Electronic Health Records (EHR) are time-series relational databases that record patient interactions and medical events over time, serving as a critical resource for healthcare research and applications. However, privacy concerns and regulatory restrictions limit the sharing and utilization of such sensitive data, necessitating the generation of synthetic EHR datasets. Unlike previous EHR synthesis methods, which typically generate medical records consisting of expert-chosen features (e.g. a few vital signs or structured codes only), we introduce RawMed, the first framework to synthesize multi-table, time-series EHR data that closely resembles raw EHRs. Using text-based representation and compression techniques, RawMed captures complex structures and temporal dynamics with minimal preprocessing. We also propose a new evaluation framework for multi-table time-series synthetic EHRs, assessing distributional similarity, inter-table relationships, temporal dynamics, and privacy. Validated on two open-source EHR datasets, RawMed outperforms baseline models in fidelity and utility. The code is available at https://github.com/eunbyeol-cho/RawMed.

Updated: 2026-03-02 05:29:43

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.06996v2

Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining

Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.

Updated: 2026-03-02 05:26:29

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2603.01465v1

DRAGON: LLM-Driven Decomposition and Reconstruction Agents for Large-Scale Combinatorial Optimization

Large Language Models (LLMs) have recently shown promise in addressing combinatorial optimization problems (COPs) through prompt-based strategies. However, their scalability and generalization remain limited, and their effectiveness diminishes as problem size increases, particularly in routing problems involving more than 30 nodes. We propose DRAGON, which stands for Decomposition and Reconstruction Agents Guided OptimizatioN, a novel framework that combines the strengths of metaheuristic design and LLM reasoning. Starting from an initial global solution, DRAGON autonomously identifies regions with high optimization potential and strategically decomposes large-scale COPs into manageable subproblems. Each subproblem is then reformulated as a concise, localized optimization task and solved through targeted LLM prompting guided by accumulated experiences. Finally, the locally optimized solutions are systematically reintegrated into the original global context to yield a significantly improved overall outcome. By continuously interacting with the optimization environment and leveraging an adaptive experience memory, the agents iteratively learn from feedback, effectively coupling symbolic reasoning with heuristic search. Empirical results show that, unlike existing LLM-based solvers limited to small-scale instances, DRAGON consistently produces feasible solutions on TSPLIB, CVRPLIB, and Weibull-5k bin packing benchmarks, and achieves near-optimal results (0.16% gap) on knapsack problems with over 3M variables. This work shows the potential of feedback-driven language agents as a new paradigm for generalizable and interpretable large-scale optimization.
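
The decompose, solve locally, and reintegrate loop described in this abstract can be sketched on a toy TSP tour. This is a minimal illustration, not DRAGON itself: all function names are hypothetical, the brute-force `solve_subproblem` stands in for the LLM-prompted subproblem solver, and a plain sliding window replaces the identification of high-potential regions.

```python
import itertools
import math

def path_length(path, pts):
    """Length of an open path visiting the given city indices in order."""
    return sum(math.dist(pts[a], pts[b]) for a, b in zip(path, path[1:]))

def tour_length(tour, pts):
    """Length of the closed tour (returns to the first city)."""
    return path_length(list(tour) + [tour[0]], pts)

def solve_subproblem(segment, pts):
    """Stand-in for the LLM call (assumption): brute-force the best order
    of the interior cities while keeping both endpoints fixed."""
    first, *inner, last = segment
    best = min(itertools.permutations(inner),
               key=lambda p: path_length([first, *p, last], pts))
    return [first, *best, last]

def decompose_and_reconstruct(tour, pts, window=5, rounds=3):
    """Sweep fixed-size windows over the tour, re-solve each window,
    and splice the result back in only when it is strictly shorter."""
    tour = list(tour)
    for _ in range(rounds):
        changed = False
        for start in range(len(tour) - window + 1):
            segment = tour[start:start + window]
            candidate = solve_subproblem(segment, pts)
            if path_length(candidate, pts) < path_length(segment, pts) - 1e-12:
                tour[start:start + window] = candidate  # reintegrate locally improved part
                changed = True
        if not changed:
            break  # no window improved, so further rounds would change nothing
    return tour

pts = [(0, 0), (2, 0), (1, 0), (3, 0), (4, 0), (4, 1), (0, 1)]
initial = list(range(len(pts)))
improved = decompose_and_reconstruct(initial, pts)
print(tour_length(initial, pts), tour_length(improved, pts))
```

Because the endpoints of each window stay fixed, a shorter window path always means a shorter overall tour, so reintegration can never make the global solution worse.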

Updated: 2026-03-02 05:25:58

Categories: cs.AI

Download: http://arxiv.org/abs/2601.06502v2

ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning

Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease-related variants, protein-level analysis for clinical research, and similar scenarios. To address such tasks, search agents have been introduced to retrieve protein-related information, providing support for disease-related variant analysis and protein function reasoning in protein-centric inference. However, such search agents are mostly limited to single-round, text-only search, which prevents the protein sequence modality from being incorporated as a multimodal input into the search decision-making process. Meanwhile, their reliance on reinforcement learning (RL) supervision that focuses solely on the final answer results in a lack of search process constraints, making deviations in keyword selection and reasoning directions difficult to identify and correct in a timely manner. To address these limitations, we propose ProtRLSearch, a multi-round protein search agent trained with multi-dimensional reward-based RL, which jointly leverages protein sequence and text as multimodal inputs during real-time search to produce high-quality reports. To evaluate the ability of models to integrate protein sequence information and text-based multimodal inputs in realistic protein query settings, we construct ProtMCQs, a benchmark of 3,000 multiple-choice questions (MCQs) organized into three difficulty levels. The benchmark evaluates protein query tasks that range from sequence-constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi-dimensional sequence features with signal pathways and regulatory networks.

Updated: 2026-03-02 05:25:41

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2603.01464v1

Self-Supervised Learning via Flow-Guided Neural Operator on Time-Series Data

Self-supervised learning (SSL) is a powerful paradigm for learning from unlabeled time-series data. However, popular methods such as masked autoencoders (MAEs) rely on reconstructing inputs from a fixed, predetermined masking ratio. Instead of this static design, we propose treating the corruption level as a new degree of freedom for representation learning, enhancing flexibility and performance. To achieve this, we introduce the Flow-Guided Neural Operator (FGNO), a novel framework combining operator learning with flow matching for SSL training. FGNO learns mappings in functional spaces by using Short-Time Fourier Transform to unify different time resolutions. We extract a rich hierarchy of features by tapping into different network layers and flow times that apply varying strengths of noise to the input data. This enables the extraction of versatile representations, from low-level patterns to high-level global features, using a single model adaptable to specific tasks. Unlike prior generative SSL methods that use noisy inputs during inference, we propose using clean inputs for representation extraction while learning representations with noise; this eliminates randomness and boosts accuracy. We evaluate FGNO across three biomedical domains, where it consistently outperforms established baselines. Our method yields up to 35% AUROC gains in neural signal decoding (BrainTreeBank), 16% RMSE reductions in skin temperature prediction (DREAMT), and over 20% improvement in accuracy and macro-F1 on SleepEDF under low-data regimes. These results highlight FGNO's robustness to data scarcity and its superior capacity to learn expressive representations for diverse time series.

Updated: 2026-03-02 05:24:40

Categories: cs.LG

Download: http://arxiv.org/abs/2602.12267v2

PhysFormer: A Physics-Embedded Generative Model for Physically Self-Consistent Spectral Synthesis

In scientific and engineering domains, modeling high-dimensional complex systems governed by partial differential equations (PDEs) remains challenging in terms of physical consistency and numerical stability. However, existing approaches, such as physics-informed neural networks (PINNs), typically rely on known physical fields or coefficients and enforce physical constraints via external loss functions, which can lead to training instability and make it difficult to handle high-dimensional or unobservable scenarios. To this end, we propose PhysFormer, a generative modeling framework that is self-consistent at both the data and physical levels. PhysFormer leverages a low-dimensional, physically interpretable latent space to learn key physical quantities directly from data without requiring known high-dimensional physical field parameters, and embeds the physical process of radiative flux generation within the network to ensure the physical consistency of the generated spectra. In high-dimensional, degenerate inversion tasks, PhysFormer constrains generation within physical limits and enhances spectral fidelity and inversion stability under varying signal-to-noise ratios (SNRs). More broadly, this approach shifts the physical processes from external loss functions into the generative mechanism itself, providing a physically consistent generative modeling paradigm for complex systems involving unknown or unobservable physical quantities.

Updated: 2026-03-02 05:17:41

Categories: astro-ph.IM,cs.LG

Download: http://arxiv.org/abs/2603.01459v1

Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding

Molecular understanding is central to advancing areas such as scientific discovery, yet Large Language Models (LLMs) struggle to understand molecular graphs effectively. Existing graph-LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens, which was originally designed for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches, thereby preserving both local and global structural features for molecular graph understanding. Beyond prior approaches, EDT-Former enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding the embedding layer), resulting in computationally efficient fine-tuning, and achieves state-of-the-art results on MoleculeQA, Molecule-oriented Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), underscoring its effectiveness for scalable and generalizable multimodal molecular understanding.

Updated: 2026-03-02 05:13:15

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2602.02742v3

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. At inference time, we design an entropy-driven top-down memory retrieval strategy, which first queries the abstract Symbolic Schema and progressively "drills down" to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
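
The entropy-driven top-down retrieval in this abstract can be illustrated with a small sketch. It assumes a level order of Symbolic Schema, then Episodic Stream, then Sensory Buffer, a hypothetical `answer_dist` callable that returns answer probabilities for a given memory level, and an arbitrary entropy threshold; none of these details come from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def retrieve_top_down(levels, answer_dist, threshold=0.5):
    """Query memory levels from most abstract to most detailed and stop
    at the first level whose answer distribution is confident enough."""
    used, probs = None, None
    for used in levels:
        probs = answer_dist(used)  # hypothetical model call (assumption)
        if entropy(probs) <= threshold:
            break  # low uncertainty: no need to drill down further
    return used, probs

# Toy stand-in for a model: deeper levels give sharper answer distributions.
fake_model = {
    "symbolic_schema": [0.5, 0.5],
    "episodic_stream": [0.9, 0.1],
    "sensory_buffer": [0.99, 0.01],
}
levels = ["symbolic_schema", "episodic_stream", "sensory_buffer"]
level, probs = retrieve_top_down(levels, fake_model.__getitem__)
print(level, probs)
```

Stopping early keeps the cheap, abstract level as the default path and pays for detailed memory only when the model is uncertain.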

Updated: 2026-03-02 05:12:45

Categories: cs.CV,cs.AI,cs.CL,cs.IR,cs.MM

Download: http://arxiv.org/abs/2603.01455v1

REMS: a unified solution representation, problem modeling and metaheuristic algorithm design for general combinatorial optimization problems

Combinatorial optimization problems (COPs) with discrete variables and finite search space are critical across numerous fields, and solving them with metaheuristic algorithms is popular. However, addressing a specific COP typically requires developing a tailored and handcrafted algorithm. Even minor adjustments, such as constraint changes, may necessitate algorithm redevelopment. Therefore, establishing a framework for formulating diverse COPs into a unified paradigm and designing reusable metaheuristic algorithms is valuable. A COP can typically be viewed as the process of allocating resources to perform specific tasks, subject to given constraints. Motivated by this, a resource-centered modeling and solving framework (REMS) is introduced for the first time. We first extract and define resources and tasks from a COP. Subsequently, given predetermined resources, the solution structure is unified as assigning tasks to resources, from which variables, objectives, and constraints can be derived and a problem model is constructed. To solve the modeled COPs, several fundamental operators are designed based on the unified solution structure, including the initial solution, neighborhood structure, destruction and repair, crossover, and ranking. These operators enable the development of various metaheuristic algorithms. Specifically, four single-point-based algorithms and one population-based algorithm are configured herein. Experiments on 10 COPs, covering routing, location, loading, assignment, scheduling, and graph coloring problems, show that REMS can model these COPs within the unified paradigm and effectively solve them with the designed metaheuristic algorithms. Furthermore, REMS is more competitive than GUROBI and SCIP in tackling large-scale instances and complex COPs, and outperforms OR-TOOLS on several challenging COPs.
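
A rough sketch of what a task-to-resource solution structure with destruction and repair operators might look like is given here. The dictionary encoding, the per-resource capacity constraint, the greedy repair rule, and all names are illustrative assumptions rather than the REMS operators themselves.

```python
import random

def destroy(assignment, k, rng):
    """Unassign k randomly chosen tasks; return (partial assignment, removed tasks)."""
    removed = rng.sample(sorted(assignment), k)
    partial = {t: r for t, r in assignment.items() if t not in removed}
    return partial, removed

def repair(partial, removed, cost, capacity):
    """Greedily give each removed task to the cheapest resource with spare capacity.
    Assumes total capacity is at least the number of tasks."""
    load = {r: 0 for r in capacity}
    for r in partial.values():
        load[r] += 1
    result = dict(partial)
    for t in removed:
        feasible = [r for r in capacity if load[r] < capacity[r]]
        best = min(feasible, key=lambda r: cost[t][r])
        result[t] = best
        load[best] += 1
    return result

# Toy instance: four tasks, two resources that can each take two tasks.
cost = {
    "t1": {"A": 1, "B": 5},
    "t2": {"A": 2, "B": 1},
    "t3": {"A": 4, "B": 2},
    "t4": {"A": 1, "B": 3},
}
capacity = {"A": 2, "B": 2}
solution = {"t1": "B", "t2": "A", "t3": "A", "t4": "B"}  # solution = task -> resource

rng = random.Random(0)
partial, removed = destroy(solution, 2, rng)
repaired = repair(partial, removed, cost, capacity)
print(removed, repaired)
```

Because the operators only touch the task-to-resource mapping, the same pair could in principle be reused across problems that share this encoding, which is the reuse argument the abstract makes.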

Updated: 2026-03-02 05:12:06

Categories: cs.NE,cs.AI

Download: http://arxiv.org/abs/2505.17108v3

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, which is the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through masked teacher forcing to steer models toward expensive target sequences, combined with a refusal penalty and early-termination suppression to override conciseness priors. Testing across three mainstream Video-LLMs and three video datasets, which include video question answering and autonomous driving scenarios, shows extreme degradation. VidDoS induces a token expansion of more than 205× and inflates the inference latency by more than 15× relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELAs in Video-LLMs.

Updated: 2026-03-02 05:11:47

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.01454v1

Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning

Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi-task experience to learn robust, task-agnostic representations. In contrast, model-free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with EfficientZero-Multitask (EZ-M), a sample-efficient multi-task MBRL algorithm for online learning. Evaluated on HumanoidBench, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available at https://yewr.github.io/ez_m/.

Updated: 2026-03-02 05:07:43

Categories: cs.AI,cs.RO

Download: http://arxiv.org/abs/2603.01452v1

SEAnet: A Deep Learning Architecture for Data Series Similarity Search

A key operation in the analysis of massive data series collections is similarity search. According to recent studies, SAX-based indexes offer state-of-the-art performance for similarity search tasks. However, their performance lags on data series that are high-frequency, weakly correlated, or excessively noisy, or that have other dataset-specific properties. In this work, we propose Deep Embedding Approximation (DEA), a novel family of data series summarization techniques based on deep neural networks. Moreover, we describe SEAnet, a novel architecture especially designed for learning DEA, that introduces the Sum of Squares preservation property into the deep network design. We further enhance SEAnet with the SEAtrans encoder. Finally, we propose novel sampling strategies, SEAsam and SEAsamE, that allow SEAnet to effectively train on massive datasets. Comprehensive experiments on 7 diverse synthetic and real datasets verify the advantages of DEA learned using SEAnet in providing high-quality data series summarizations and similarity search results.

Updated: 2026-03-02 04:57:06

Categories: cs.DB,cs.LG

Download: http://arxiv.org/abs/2603.01448v1

Multivariate Spatio-Temporal Neural Hawkes Processes

We propose a Multivariate Spatio-Temporal Neural Hawkes Process for modeling complex multivariate event data with spatio-temporal dynamics. The proposed model extends continuous-time neural Hawkes processes by integrating spatial information into latent state evolution through learned temporal and spatial decay dynamics, enabling flexible modeling of excitation and inhibition without predefined triggering kernels. By analyzing fitted intensity functions of deep learning-based temporal Hawkes process models, we identify a modeling gap in how fitted intensity behavior is captured beyond likelihood-based performance, which motivates the proposed spatio-temporal approach. Simulation studies show that the proposed method successfully recovers sensible temporal and spatial intensity structure in multivariate spatio-temporal point patterns, while the existing temporal neural Hawkes process approach fails to do so. An application to terrorism data from Pakistan further demonstrates the proposed model's ability to capture complex spatio-temporal interaction across multiple event types.
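
For contrast with the learned decay dynamics described in this abstract, the classical multivariate Hawkes intensity with a predefined exponential triggering kernel can be sketched as follows. This is the textbook temporal baseline, not the proposed model; the parameter names `mu`, `alpha`, and `beta` are illustrative.

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """Classical multivariate Hawkes intensity with an exponential kernel.

    events: list of (time, type) pairs observed so far
    mu[k]: baseline rate of event type k
    alpha[k][j]: how strongly a past type-j event excites type k
    beta: shared exponential decay rate
    Returns the intensity of every event type at time t.
    """
    rates = list(mu)
    for s, j in events:
        if s < t:  # only strictly earlier events contribute
            decay = math.exp(-beta * (t - s))
            for k in range(len(mu)):
                rates[k] += alpha[k][j] * decay
    return rates

mu = [0.1, 0.2]
alpha = [[0.5, 0.0],
         [0.3, 0.0]]
events = [(0.0, 0)]  # a single type-0 event at time 0
print(hawkes_intensity(1.0, events, mu, alpha, beta=1.0))
```

With a fixed kernel like this, non-negative `alpha` can only add to the intensity, so inhibition cannot be expressed without extra machinery; that is the restriction the abstract's "without predefined triggering kernels" points to.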

Updated: 2026-03-02 04:53:57

Categories: stat.ML,cs.LG,math.ST,stat.AP,stat.ME

Download: http://arxiv.org/abs/2602.23629v2

Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

Synthetic data generation is a critical capability for data sharing, privacy compliance, system benchmarking and test data provisioning. Existing methods assume dense, fixed-schema tabular data, yet this assumption is increasingly at odds with modern data systems (from document databases and REST APIs to data lakes), which store and exchange data in sparse, semi-structured formats like JSON. Applying existing tabular methods to such data requires flattening nested data into wide, sparse tables, which scales poorly. We present Origami, an autoregressive transformer-based architecture that tokenizes data records, including nested objects and variable-length arrays, into sequences of key, value and structural tokens. This representation natively handles sparsity, mixed types and hierarchical structure without flattening or imputation. Origami outperforms baselines spanning GAN, VAE, diffusion and autoregressive architectures on fidelity, utility and detection metrics across nearly all settings, while maintaining high privacy scores. On semi-structured datasets with up to 38% sparsity, baseline synthesizers either fail to scale or degrade substantially, while Origami maintains high-fidelity synthesis that is harder to distinguish from real data. To the best of our knowledge, Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end.
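The idea of turning a nested record into key, value and structural tokens can be sketched as follows. The token names (`OBJ_START`, `KEY:`, `VAL:` and so on) are hypothetical and chosen for illustration; the abstract does not specify Origami's actual vocabulary or ordering.

```python
import json

def tokenize(value):
    """Flatten a JSON-like value into key, value and structural tokens (hypothetical scheme)."""
    if isinstance(value, dict):
        tokens = ["OBJ_START"]
        for key, child in value.items():
            tokens.append(f"KEY:{key}")
            tokens.extend(tokenize(child))
        tokens.append("OBJ_END")
        return tokens
    if isinstance(value, list):
        tokens = ["ARR_START"]
        for child in value:
            tokens.extend(tokenize(child))
        tokens.append("ARR_END")
        return tokens
    # Scalars (str, int, float, bool, None) become a single value token.
    return [f"VAL:{json.dumps(value)}"]

record = {"name": "Ada", "tags": ["x", "y"], "address": {"zip": 12345}}
tokens = tokenize(record)
sparse_tokens = tokenize({"name": "Bo"})  # absent fields simply emit no tokens
```

Because a missing field produces no tokens at all, a sparse record yields a short sequence instead of a wide row of nulls, which is one way to read the claim that no flattening or imputation is needed.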

Updated: 2026-03-02 04:47:50

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01444v1

Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents

The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality Systems, provide a crucial explanation for this failure: a persona's influence on behavior is not static but varies with the scenarios. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) a Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) a Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.

Updated: 2026-03-02 04:37:16

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01438v1

Leray-Schauder Mappings for Operator Learning

We present an algorithm for learning operators between Banach spaces, based on the use of Leray-Schauder mappings to learn a finite-dimensional approximation of compact subspaces. We show that the resulting method is a universal approximator of (possibly nonlinear) operators. We demonstrate the efficiency of the approach on two benchmark datasets, showing that it achieves results comparable to state-of-the-art models.
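A finite-dimensional approximation of a compact set is classically obtained with a Schauder projection onto the span of an epsilon-net. The sketch below assumes this textbook construction is what the abstract refers to; the net, the radius `eps` and the function name are hypothetical, and points are plain tuples in the Euclidean plane rather than elements of a general Banach space.

```python
import math

def schauder_projection(x, net, eps):
    """Map x to a convex combination of the net points lying within distance eps of x."""
    # Weight of each net point decays linearly to zero at distance eps.
    weights = [max(0.0, eps - math.dist(x, p)) for p in net]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x is not within eps of any net point")
    dim = len(x)
    return tuple(sum(w * p[d] for w, p in zip(weights, net)) / total for d in range(dim))

# Hypothetical eps-net of the unit square.
net = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
eps = 0.8
x = (0.3, 0.4)
px = schauder_projection(x, net, eps)
```

Since only net points closer than `eps` receive positive weight, the image is a convex combination of points within `eps` of `x`, so the projection moves `x` by less than `eps`.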

Updated: 2026-03-02 04:37:06

Categories: cs.LG,math.NA

Download: http://arxiv.org/abs/2410.01746v3

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

As chain-of-thought (CoT) has become central to scaling reasoning capabilities in large language models (LLMs), it has also emerged as a promising tool for interpretability, suggesting the opportunity to understand model decisions through verbalized reasoning. However, the utility of CoT toward interpretability depends upon its faithfulness -- whether the model's stated reasoning reflects the underlying decision process. We provide mechanistic evidence that instruction-tuned models often determine their answer before generating CoT. Training linear probes on residual stream activations at the last token before CoT, we can predict the model's final answer with 0.9 AUC on most tasks. We find that these directions are not only predictive, but also causal: steering activations along the probe direction flips model answers in over 50% of cases, significantly exceeding orthogonal baselines. When steering induces incorrect answers, we observe two distinct failure modes: non-entailment (stating correct premises but drawing unsupported conclusions) and confabulation (fabricating false premises). While post-hoc reasoning may be instrumentally useful when the model has a correct pre-CoT belief, these failure modes suggest it can result in undesirable behaviors when reasoning from a false belief.
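The steering operation mentioned in the abstract, shifting an activation along a probe direction, can be sketched on a toy linear probe. All weights and activations here are hypothetical, and the example only shows the probe's own readout changing sign; whether the model's answer flips is the empirical question the paper studies.

```python
def probe_logit(h, w, b):
    """Linear probe readout w . h + b on an activation vector h."""
    return sum(hi * wi for hi, wi in zip(h, w)) + b

def steer(h, w, alpha):
    """Shift activation h by alpha along the unit-normalised probe direction w."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return [hi + alpha * wi / norm for hi, wi in zip(h, w)]

w, b = [2.0, -1.0, 0.5], 0.1   # hypothetical probe weights and bias
h = [0.4, 0.3, -0.2]           # hypothetical residual-stream activation

before = probe_logit(h, w, b)
steered = steer(h, w, alpha=-1.0)
after = probe_logit(steered, w, b)  # equals before + alpha * ||w||
```

Moving by `alpha` along the unit direction changes the logit by exactly `alpha` times the norm of `w`, so a large enough negative `alpha` pushes a positive readout below zero.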

Updated: 2026-03-02 04:33:55

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01437v1

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g. from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements.

Updated: 2026-03-02 04:30:47

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.21500v3

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at 1M-bp context. All experiments in this paper can be trained on a single A100 80GB GPU.

Updated: 2026-03-02 04:28:09

Categories: q-bio.GN,cs.AI,cs.LG

Download: http://arxiv.org/abs/2602.17739v2

Cognitive models can reveal interpretable value trade-offs in language models

Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in language models are limited. In cognitive science, so-called "cognitive models" provide formal accounts of such trade-offs in humans, by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. Here, we show that a leading cognitive model of polite speech can be used to systematically evaluate alignment-relevant trade-offs in language models via two encompassing settings: degrees of reasoning "effort" and system prompt manipulations in closed-source frontier models, and RL post-training dynamics of open-source models. Our results show that LLMs' behavioral profiles under the cognitive model a) shift predictably when they are prompted to prioritize certain goals, b) are amplified by a small reasoning budget, and c) can be used to diagnose other social behaviors such as sycophancy. Our findings from LLMs' post-training dynamics reveal large shifts in values early on in training and persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. Our framework offers a flexible tool for probing behavioral profiles across diverse model types and gaining insights for shaping training regimes that better control trade-offs between values during model development.

Updated: 2026-03-02 04:25:20

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.20666v4

Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba

State Space Models (SSMs) have emerged as powerful alternatives to attention-based Transformers, with Mamba demonstrating impressive efficiency and scalability. As these models grow increasingly large, the need for Parameter-Efficient Fine-Tuning (PEFT) methods becomes critical to adapt pre-trained Mamba to downstream tasks without prohibitive computational costs. However, previous approaches simply apply traditional Transformer-tailored PEFT methods without addressing the unique temporal processing dynamics of SSMs. To address this limitation, we propose Memba, a membrane-driven PEFT approach specifically designed for Mamba. Memba introduces Leaky Integrate Membrane (LIM) neurons as bio-inspired gating mechanisms that naturally accumulate membrane potentials over time, enhancing selective information retention. By strategically combining LIM neurons with Low-Rank Adaptations (LoRA) and cross-layer membrane transfer, our approach significantly improves Mamba's temporal modeling capabilities. Extensive experiments across language and vision tasks demonstrate that Memba achieves substantial improvements over existing PEFT methods. The code is available at https://github.com/Intelligent-Computing-Lab-Yale/Memba.

Updated: 2026-03-02 04:23:22

Categories: cs.LG

Download: http://arxiv.org/abs/2506.18184v2

On the Stability Connection Between Discrete-Time Algorithms and Their Resolution ODEs: Applications to Min-Max Optimisation

This work establishes a rigorous connection between stability properties of discrete-time algorithms (DTAs) and corresponding continuous-time dynamical systems derived through $ O(s^r) $-resolution ordinary differential equations (ODEs). We show that for discrete- and continuous-time dynamical systems satisfying a mild error assumption, exponential stability of a common equilibrium with respect to the continuous time dynamics implies exponential stability of the corresponding equilibrium for the discrete-time dynamics, provided that the step size is chosen sufficiently small. We extend this result to common compact invariant sets. We prove that if an equilibrium is exponentially stable for the $ O(s^r) $-resolution ODE, then it is also exponentially stable for the associated DTA. We apply this framework to analyse the limit point properties of several prominent optimisation algorithms, including Two-Timescale Gradient Descent--Ascent (TT-GDA), Generalised Extragradient (GEG), Two-Timescale Proximal Point (TT-PPM), Damped Newton (DN), Regularised Damped Newton (RDN), and the Jacobian method (JM), by studying their $ O(1) $- and $ O(s) $-resolution ODEs. We show that under a proper choice of hyperparameters, the set of saddle points of the objective function is a subset of the set of exponentially stable equilibria of GEG, TT-PPM, DN, and RDN. We relax the common Hessian invariance assumption through direct analysis of the resolution ODEs, broadening the applicability of our results. Numerical examples illustrate the theoretical findings.
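The role of the "sufficiently small step size" condition can be seen on a toy min-max problem. The sketch assumes the objective f(x, y) = (mu/2) x^2 + x y - (mu/2) y^2, chosen here for illustration and not taken from the paper. Its gradient flow has Jacobian eigenvalues -mu ± i, so the saddle point at the origin is exponentially stable in continuous time for mu > 0, while simultaneous gradient descent-ascent contracts only when s < 2 mu / (mu^2 + 1).

```python
def gda_step(x, y, s, mu):
    """One simultaneous descent-ascent step on f(x, y) = mu/2 * x^2 + x*y - mu/2 * y^2."""
    grad_x = mu * x + y      # df/dx
    grad_y = x - mu * y      # df/dy
    return x - s * grad_x, y + s * grad_y

def final_norm(s, mu=0.5, steps=200):
    """Distance from the saddle point (0, 0) after running GDA from (1, 1)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = gda_step(x, y, s, mu)
    return (x * x + y * y) ** 0.5

# For mu = 0.5 the contraction threshold is s < 0.8.
small = final_norm(0.1)   # below the threshold: iterates converge to the saddle
large = final_norm(1.0)   # above the threshold: iterates diverge
```

Each step multiplies the distance to the origin by sqrt((1 - s*mu)^2 + s^2), which is below one exactly when the step size is under the threshold, matching the qualitative statement that continuous-time stability transfers to the discrete algorithm only for small enough steps.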

Updated: 2026-03-02 04:22:26

Categories: math.OC,cs.LG,eess.SY,math.NA

Download: http://arxiv.org/abs/2603.01430v1

BrepCoder: A Unified Multimodal Large Language Model for Multi-task B-rep Reasoning

Recent advancements in deep learning have actively addressed complex challenges within the Computer-Aided Design (CAD) domain. However, most existing approaches rely on task-specific models requiring structural modifications for new tasks, and they predominantly focus on point clouds or images rather than the industry-standard Boundary Representation (B-rep) format. To address these limitations, we propose BrepCoder, a unified Multimodal Large Language Model (MLLM) that performs diverse CAD tasks from B-rep inputs. By leveraging the code generation capabilities of Large Language Models (LLMs), we convert CAD modeling sequences into Python-like code and align them with B-rep. We then adopt a two-stage training strategy: First, pre-training on reverse engineering to learn geometric features and design logic. Second, effectively extending the model to various downstream tasks such as completion, error correction, and CAD-QA. Consequently, by interpreting B-rep as structural code, BrepCoder achieves superior generalization across diverse tasks, demonstrating its potential as a general-purpose CAD agent.

Updated: 2026-03-02 04:18:48

Categories: cs.LG

Download: http://arxiv.org/abs/2602.22284v2

Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning

Using a nearly-frozen pretrained model, the continual representation learning paradigm reframes parameter updates as a similarity-matching problem to mitigate catastrophic forgetting. However, directly leveraging pretrained features for downstream tasks often suffers from multicollinearity in the similarity-matching stage, and more advanced methods can be computationally prohibitive for real-time, low-latency applications. Inspired by the fly olfactory circuit, we propose Fly-CL, a bio-inspired framework compatible with a wide range of pretrained backbones. Fly-CL substantially reduces training time while achieving performance comparable to or exceeding that of current state-of-the-art methods. We theoretically show how Fly-CL progressively resolves multicollinearity, enabling more effective similarity matching with low time complexity. Extensive simulation experiments across diverse network architectures and data regimes validate Fly-CL's effectiveness in addressing this challenge through a biologically inspired design. Code is available at https://github.com/gfyddha/Fly-CL.

Updated: 2026-03-02 04:15:06

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.16877v2

DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging

Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years. However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored. In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them. Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address this issue, we propose DisTaC (Distillation for Task vector Conditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector's norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models exhibiting the harmful traits -- where they would otherwise fail -- achieving significant performance gains.
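The first harmful factor, disparity in task vector norms, can be illustrated with plain task arithmetic, where a task vector is the difference between fine-tuned and pretrained weights. This sketch shows only the problem setting with hypothetical two-parameter "models"; it does not implement DisTaC's distillation step.

```python
def task_vector(finetuned, pretrained):
    """Task vector: element-wise difference between fine-tuned and pretrained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def l2_norm(v):
    return sum(x * x for x in v) ** 0.5

def merge(pretrained, task_vectors, scale=1.0):
    """Task arithmetic: add the scaled sum of task vectors to the pretrained weights."""
    merged = list(pretrained)
    for tv in task_vectors:
        merged = [m + scale * t for m, t in zip(merged, tv)]
    return merged

# Hypothetical weights: model A moved far from the pretrained point, model B barely moved.
pretrained = [0.0, 0.0]
model_a = [3.0, 0.0]
model_b = [0.0, 0.3]

tv_a = task_vector(model_a, pretrained)
tv_b = task_vector(model_b, pretrained)
norm_ratio = l2_norm(tv_a) / l2_norm(tv_b)   # about 10
merged = merge(pretrained, [tv_a, tv_b])     # update is dominated by tv_a
```

With a tenfold norm gap, the merged update points almost entirely along task A's direction, which gives some intuition for why the abstract proposes adjusting task vector norms before merging.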

Updated: 2026-03-02 04:01:45

Categories: cs.LG

Download: http://arxiv.org/abs/2508.01148v3

SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting

Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of respiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot yet understand spirograms, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirograms. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8977 (95% CI: 0.88-0.91). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.

Updated: 2026-03-02 03:59:55

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.16145v3

PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints

Semantic-level watermarking (SWM) for large language models (LLMs) enhances watermarking robustness against text modifications and paraphrasing attacks by treating the sentence as the fundamental unit. However, existing methods still lack strong theoretical guarantees of robustness, and reject-sampling-based generation often introduces significant distribution distortions compared with unwatermarked outputs. In this work, we introduce a new theoretical framework on SWM through the concept of proxy functions (PFs), i.e., functions that map sentences to scalar values. Building on this framework, we propose PMark, a simple yet powerful SWM method that estimates the PF median for the next sentence dynamically through sampling while enforcing multiple PF constraints (which we call channels) to strengthen watermark evidence. Equipped with solid theoretical guarantees, PMark achieves the desired distortion-free property and improves the robustness against paraphrasing-style attacks. We also provide an empirically optimized version that further removes the requirement for dynamic median estimation for better sampling efficiency. Experimental results show that PMark consistently outperforms existing SWM baselines in both text quality and robustness, offering a more effective paradigm for detecting machine-generated text. Our code will be released at https://github.com/PMark-repo/PMark.

Updated: 2026-03-02 03:55:03

Categories: cs.CR,cs.CL

Download: http://arxiv.org/abs/2509.21057v2

SciDER: Scientific Data-centric End-to-end Researcher

Automated scientific discovery with large language models is transforming the research lifecycle from ideation to experimentation, yet existing agents struggle to autonomously process raw data collected from scientific experiments. We introduce SciDER, a data-centric end-to-end system that automates the research lifecycle. Unlike traditional frameworks, our specialized agents collaboratively parse and analyze raw scientific data, generate hypotheses and experimental designs grounded in specific data characteristics, and write and execute corresponding code. Evaluation on three benchmarks shows SciDER excels in specialized data-driven scientific discovery and outperforms general-purpose agents and state-of-the-art models through its self-evolving memory and critic-led feedback loop. SciDER is distributed as a modular Python package; we also provide easy-to-use PyPI packages with a lightweight web interface to accelerate autonomous, data-driven research, aiming to be accessible to all researchers and developers.

Updated: 2026-03-02 03:53:20

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2603.01421v1

Tackling multiphysics problems via finite element-guided physics-informed operator learning

This work presents a finite element-guided physics-informed operator learning framework for multiphysics problems with coupled partial differential equations (PDEs) on arbitrary domains. Implemented with Folax, a JAX-based operator-learning platform, the proposed framework learns a mapping from the input parameter space to the solution space with a weighted residual formulation based on the finite element method, enabling discretization-independent prediction beyond the training resolution without relying on labeled simulation data. The present framework for multiphysics problems is verified on nonlinear thermo-mechanical problems. Two- and three-dimensional representative volume elements with varying heterogeneous microstructures, and a close-to-reality industrial casting example under varying boundary conditions are investigated as the example problems. We investigate the potential of several neural operator backbones, including Fourier neural operators (FNOs), deep operator networks (DeepONets), and a newly proposed implicit finite operator learning (iFOL) approach based on conditional neural fields. The results demonstrate that FNOs yield highly accurate solution operators on regular domains, where the global topology can be efficiently learned in the spectral domain, and iFOL offers efficient parametric operator learning capabilities for complex and irregular geometries. Furthermore, studies on training strategies, network decomposition, and training sample quality reveal that a monolithic training strategy using a single network is sufficient for accurate predictions, while training sample quality strongly influences performance. Overall, the present approach highlights the potential of physics-informed operator learning with a finite element-based loss as a unified and scalable approach for coupled multiphysics simulations.

Updated: 2026-03-02 03:52:51

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01420v1

GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment

Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. Existing methods typically rely on coarse rhythm embeddings, such as global motion features or binarized joint-based rhythm values, which discard fine-grained motion cues and result in weak rhythmic alignment. Moreover, temporal mismatches introduced by feature downsampling further hinder precise synchronization between dance and music. To address these problems, we propose GACA-DiT, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. First, a genre-adaptive rhythm extraction module combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns. Second, a context-aware temporal alignment module resolves temporal mismatches using learnable context queries to align music latents with relevant dance rhythm features. Extensive experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation. Project page: https://beria-moon.github.io/GACA-DiT/.

Updated: 2026-03-02 03:52:32

Categories: cs.SD,cs.AI,cs.MM,eess.AS

Download: http://arxiv.org/abs/2510.26818v2

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, block-wise semi-autoregressive (semi-AR) approaches are widely adopted due to their support for KV caching and their favorable accuracy-speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed, and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size setting in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs. Our code is available at https://github.com/lgxi24/AdaBlock-dLLM.

Updated: 2026-03-02 03:45:34

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.26432v3

ResCP: Reservoir Conformal Prediction for Time Series Forecasting

Conformal prediction offers a powerful framework for building distribution-free prediction intervals for exchangeable data. Existing methods that extend conformal prediction to sequential data rely on fitting a relatively complex model to capture temporal dependencies. However, these methods can fail if the sample size is small and often require expensive retraining when the underlying data distribution changes. To overcome these limitations, we propose Reservoir Conformal Prediction (ResCP), a novel training-free conformal prediction method for time series. Our approach leverages the efficiency and representation learning capabilities of reservoir computing to dynamically reweight conformity scores. In particular, we compute similarity scores among reservoir states and use them to adaptively reweight the observed residuals at each step. With this approach, ResCP enables us to account for local temporal dynamics when modeling the error distribution without compromising computational scalability. We prove that, under reasonable assumptions, ResCP achieves asymptotic conditional coverage, and we empirically demonstrate its effectiveness across diverse forecasting tasks.
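The reweighting idea in the abstract above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the small echo state network, the cosine similarity, the softmax temperature, and all function names are assumptions made here for the example.

```python
import numpy as np


def reservoir_states(series, n_units=50, leak=0.5, seed=0):
    """Run a small random (untrained) echo state network over a 1-D series."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-1.0, 1.0, size=n_units)
    w = rng.normal(size=(n_units, n_units))
    # Rescale the recurrent weights to spectral radius 0.9 (assumed value).
    w *= 0.9 / np.max(np.abs(np.linalg.eigvals(w)))
    h = np.zeros(n_units)
    states = []
    for x in series:
        h = (1 - leak) * h + leak * np.tanh(w_in * x + w @ h)
        states.append(h.copy())
    return np.array(states)


def weighted_quantile(values, weights, q):
    """Smallest value whose normalized cumulative weight reaches q."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    idx = np.searchsorted(cdf, q)
    return v[min(idx, len(v) - 1)]


def similarity_weighted_interval(states, residuals, query_state,
                                 point_forecast, alpha=0.1, temperature=0.1):
    """Interval from past residuals, reweighted by reservoir-state similarity."""
    norms = np.linalg.norm(states, axis=1) * np.linalg.norm(query_state)
    sims = states @ query_state / (norms + 1e-12)  # cosine similarity
    weights = np.exp((sims - sims.max()) / temperature)  # softmax-style weights
    q = weighted_quantile(np.abs(residuals), weights, 1 - alpha)
    return point_forecast - q, point_forecast + q
```

Because the reservoir is random and fixed, nothing in this sketch is trained; only the residual weights change from step to step, which is what keeps the approach cheap.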

Updated: 2026-03-02 03:43:57

Categories: cs.LG,math.ST,stat.ML

Download: http://arxiv.org/abs/2510.05060v3

Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search-intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero-shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.

Updated: 2026-03-02 03:43:31

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01416v1

PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction

Large language models (LLMs) are moving beyond static uses and are now powering agents that learn continually during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill's abstract goal (what it accomplishes) from its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites, (2) boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites while reducing steps by over 20%, and (3) in self-exploration settings without specified tasks, improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, PolySkill enhances the agent's ability to learn a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments. Our findings show that separating a skill's goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously. Our code can be found at https://github.com/simonucl/PolySkill.
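The goal/implementation split described in the abstract above mirrors ordinary subtype polymorphism. A minimal sketch of that analogy follows; the class names, the two shop sites, and the action strings are hypothetical and are not taken from the PolySkill code base.

```python
from abc import ABC, abstractmethod
from typing import List


class SearchProduct(ABC):
    """Abstract skill: the goal (find a product), independent of any website."""

    @abstractmethod
    def steps(self, query: str) -> List[str]:
        """Return the site-specific actions that accomplish the goal."""


class ShopASearch(SearchProduct):
    # Hypothetical site with a search box on its home page.
    def steps(self, query: str) -> List[str]:
        return ["open home page", f"type {query!r} into search box", "press Enter"]


class ShopBSearch(SearchProduct):
    # Hypothetical site where search sits behind a menu.
    def steps(self, query: str) -> List[str]:
        return ["open home page", "click menu", "click 'Search'",
                f"type {query!r}", "click 'Go'"]


def find_product(skill: SearchProduct, query: str) -> List[str]:
    # Calling code depends only on the abstract goal, not on the site.
    return skill.steps(query)
```

A task plan written against `SearchProduct` can be reused on a new site by adding one more subclass, which is the kind of reuse the abstract attributes to separating goals from implementations.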

Updated: 2026-03-02 03:34:00

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.15863v2

Exposing Citation Vulnerabilities in Generative Engines

We analyze answers generated by generative engines (GEs) from the perspectives of citation publishers and the content-injection barrier, defined as the difficulty for attackers to manipulate answers to user prompts by placing malicious content on the web. GEs integrate two functions: web search and answer generation that cites web pages using large language models. Because anyone can publish information on the web, GEs are vulnerable to poisoning attacks. Existing studies of citation evaluation focus on how faithfully answer content reflects cited sources, leaving unexamined which web sources should be selected as citations to defend against poisoning attacks. To fill this gap, we introduce evaluation criteria that assess poisoning threats using the citation information contained in answers. Our criteria classify the publisher attributes of citations to estimate the content-injection barrier thereby revealing the threat of poisoning attacks in current GEs. We conduct experiments in political domains in Japan and the United States (U.S.) using our criteria and show that citations from official party websites (primary sources) are approximately 25% to 45% in the U.S. and 60% to 65% in Japan, indicating that U.S. political answers are at higher risk of poisoning attacks. We also find that sources with low content-injection barriers are frequently cited yet are poorly reflected in answer content. To mitigate this threat, we discuss how publishers of primary sources can increase exposure of their web content in answers and show that well-known techniques are limited by language differences.

Updated: 2026-03-02 03:28:41

Categories: cs.CR,cs.CL,cs.IR

Download: http://arxiv.org/abs/2510.06823v2

ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM

As large language models (LLMs) demonstrate powerful capabilities, deploying them on edge devices has become increasingly crucial, offering advantages in privacy and real-time interaction. QLoRA has emerged as the standard approach for on-device LLMs, leveraging quantized models to reduce memory and computational costs while utilizing LoRA for task-specific adaptability. In this work, we propose ROMA, a QLoRA accelerator with a hybrid storage architecture that uses ROM for quantized base models and SRAM for LoRA weights and KV cache. Our insight is that the quantized base model is stable and converged, making it well-suited for ROM storage. Meanwhile, LoRA modules offer the flexibility to adapt to new data without requiring updates to the base model. To further reduce the area cost of ROM, we introduce a novel B-ROM design and integrate it with the compute unit to form a fused cell for efficient use of chip resources. ROMA can effectively store both a 4-bit 3B and a 2-bit 8B LLaMA model entirely on-chip, achieving a notable generation speed exceeding 20,000 tokens/s without requiring external memory.

Updated: 2026-03-02 03:26:31

Categories: cs.AR,cs.AI

Download: http://arxiv.org/abs/2503.12988v2

GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning

Knowledge graphs provide structured and reliable information for many real-world applications, motivating increasing interest in combining large language models (LLMs) with graph-based retrieval to improve factual grounding. Recent Graph-based Retrieval-Augmented Generation (GraphRAG) methods therefore introduce iterative interaction between LLMs and knowledge graphs to enhance reasoning capability. However, existing approaches typically depend on manually designed guidance and interact with knowledge graphs through a limited set of predefined tools, which substantially constrains graph exploration. To address these limitations, we propose GraphScout, a training-centric agentic graph reasoning framework equipped with more flexible graph exploration tools. GraphScout enables models to autonomously interact with knowledge graphs to synthesize structured training data which are then used to post-train LLMs, thereby internalizing agentic graph reasoning ability without laborious manual annotation or task curation. Extensive experiments across five knowledge-graph domains show that a small model (e.g., Qwen3-4B) augmented with GraphScout outperforms baseline methods built on leading LLMs (e.g., Qwen-Max) by an average of 16.7% while requiring significantly fewer inference tokens. Moreover, GraphScout exhibits robust cross-domain transfer performance. Our code will be made publicly available at https://github.com/Ying-Yuchen/_GraphScout_.

Updated: 2026-03-02 03:25:40

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01410v1

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning

Large Language Models (LLMs) often fail to generate correct code on the first attempt, which requires using generated unit tests as verifiers to validate the solutions. Despite the success of recent verification methods, they remain constrained by a "scaling-by-quantity" paradigm. This brute-force approach suffers from a critical limitation: it yields diminishing returns in fault detection while causing severe test redundancy. To address this, we propose MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning), a framework that shifts the focus to "scaling-by-utility". We formulate test generation as a sequential decision process optimized via Group Relative Policy Optimization (GRPO). Specifically, we introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while suppressing functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines. It achieves a 28.5% higher mutation score while reducing the number of test cases by 19.3%. Furthermore, we show that these compact, high-utility tests serve as superior verifiers, improving downstream code reranking accuracy on HumanEval+ by 3.05% over the SOTA baseline with 10 candidate samples. The source code and data are provided in the supplementary material.
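One way to read the "incremental mutation reward" in the abstract above is that each new test is rewarded only for mutants that no earlier test in the suite has killed. The sketch below follows that reading; the function name, the normalization by total mutant count, and the fixed redundancy penalty are assumptions made for illustration (the abstract mentions dynamic penalties but does not specify them).

```python
from typing import Iterable, List, Set, Tuple


def incremental_mutation_rewards(
    kill_sets: Iterable[Set[int]],
    total_mutants: int,
    redundancy_penalty: float = 0.1,  # assumed constant, for illustration only
) -> Tuple[List[float], float]:
    """Score tests in order; each kill set holds the mutant IDs a test kills."""
    killed: Set[int] = set()
    rewards: List[float] = []
    for kills in kill_sets:
        new_kills = set(kills) - killed
        if new_kills:
            # Reward only the mutants not already killed by earlier tests.
            rewards.append(len(new_kills) / total_mutants)
        else:
            # A test that adds nothing new is treated as redundant.
            rewards.append(-redundancy_penalty)
        killed |= new_kills
    return rewards, len(killed) / total_mutants
```

For instance, with four mutants and tests killing `{1, 2}`, `{2}`, and `{3}` in that order, the second test earns the penalty because mutant 2 was already covered, while the suite's overall mutation score is 3 out of 4.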

Updated: 2026-03-02 03:22:44

Categories: cs.AI,cs.LG,cs.SE

Download: http://arxiv.org/abs/2603.01409v1

LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation

Automated unit test generation is essential for robust software development, yet existing approaches struggle to generalize across multiple programming languages and operate within real-time development. While Large Language Models (LLMs) offer a promising solution, their ability to generate high coverage test code depends on prompting a concise context of the focal method. Current solutions, such as Retrieval-Augmented Generation, either rely on imprecise similarity-based searches or demand the creation of costly, language-specific static analysis pipelines. To address this gap, we present LSPRAG, a framework for concise-context retrieval tailored for real-time, language-agnostic unit test generation. LSPRAG leverages off-the-shelf Language Server Protocol (LSP) back-ends to supply LLMs with precise symbol definitions and references in real time. By reusing mature LSP servers, LSPRAG provides an LLM with language-aware context retrieval, requiring minimal per-language engineering effort. We evaluated LSPRAG on open-source projects spanning Java, Go, and Python. Compared to the best performance of baselines, LSPRAG increased line coverage by up to 174.55% for Golang, 213.31% for Java, and 31.57% for Python.

Updated: 2026-03-02 03:16:08

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2510.22210v2

The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition

Autonomous agents operating in complex, multi-agent environments must reason about what is true from multiple perspectives. Existing approaches often struggle to integrate the reasoning of different agents, at different times, and in different contexts, typically handling these dimensions in separate, specialized modules. This fragmentation leads to a brittle and incomplete reasoning process, particularly when agents must understand the beliefs of others (Theory of Mind). We introduce the Observer-Situation Lattice (OSL), a unified mathematical structure that provides a single, coherent semantic space for perspective-aware cognition. OSL is a finite complete lattice where each element represents a unique observer-situation pair, allowing for a principled and scalable approach to belief management. We present two key algorithms that operate on this lattice: (i) Relativized Belief Propagation, an incremental update algorithm that efficiently propagates new information, and (ii) Minimal Contradiction Decomposition, a graph-based procedure that identifies and isolates contradiction components. We prove the theoretical soundness of our framework and demonstrate its practical utility through a series of benchmarks, including classic Theory of Mind tasks and a comparison with established paradigms such as assumption-based truth maintenance systems. Our results show that OSL provides a computationally efficient and expressive foundation for building robust, perspective-aware autonomous agents.

Updated: 2026-03-02 03:15:36

Categories: cs.AI,cs.MA,cs.SI

Download: http://arxiv.org/abs/2603.01407v1

WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios. Code and data are available at https://github.com/lcy2723/WebDevJudge.

Updated: 2026-03-02 03:15:35

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2510.18560v2

One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers

Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.
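The last observation in the abstract above, convergence to conditional expectations when boundary information is removed, is what standard least-squares theory would predict. A short sketch, assuming a squared-error loss and using notation introduced here rather than taken from the paper (forcing term $f$, boundary data $g$, true solution $u(f,g)$, and a model $N_\theta$ that sees only $f$):

```latex
\begin{aligned}
\mathcal{R}(\theta) &= \mathbb{E}_{(f,g)}\!\left[\, \lVert N_\theta(f) - u(f,g) \rVert^2 \,\right], \\
N^\star(f) &= \arg\min_{N} \, \mathbb{E}_{(f,g)}\!\left[\, \lVert N(f) - u(f,g) \rVert^2 \,\right]
            = \mathbb{E}\!\left[\, u(f,g) \mid f \,\right].
\end{aligned}
```

Under these assumptions the best the model can do is average the solutions over the boundary conditions that accompany each $f$ in training, so its output depends on the training boundary distribution.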

Updated: 2026-03-02 03:15:00

Categories: cs.LG,math.NA

Download: http://arxiv.org/abs/2603.01406v1

Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation

Enhancing the spatial perception capabilities of mobile robots is crucial for achieving embodied Vision-and-Language Navigation (VLN). Although significant progress has been made in simulated environments, directly transferring these capabilities to real-world scenarios often results in severe hallucination phenomena, causing robots to lose effective spatial awareness. To address this issue, we propose BrainNav, a bio-inspired spatial cognitive navigation framework grounded in biological spatial cognition theories and cognitive map theory. BrainNav integrates dual-map (coordinate map and topological map) and dual-orientation (relative orientation and absolute orientation) strategies, enabling real-time navigation through dynamic scene capture and path planning. Its five core modules (Hippocampal Memory Hub, Visual Cortex Perception Engine, Parietal Spatial Constructor, Prefrontal Decision Center, and Cerebellar Motion Execution Unit) mimic biological cognitive functions to reduce spatial hallucinations and enhance adaptability. Validated in a zero-shot real-world lab environment using the Limo Pro robot, BrainNav, compatible with GPT-4, outperforms existing State-of-the-Art (SOTA) Vision-and-Language Navigation in Continuous Environments (VLN-CE) methods without fine-tuning.

Updated: 2026-03-02 03:10:47

Categories: cs.AI,cs.RO

Download: http://arxiv.org/abs/2504.08806v2

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77x training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.

Updated: 2026-03-02 03:07:29

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.24298v5

Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework

Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups and susceptible to strategic manipulation. To address this issue, we develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Grounded in social choice theory, our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly-introduced axioms of population-proportional alignment and population-bounded manipulability. Moreover, we propose a soft-max relaxation method that smoothly trades off population-proportional alignment with the selection of the Condorcet winner (which beats all other options in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large language model alignment.
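The Condorcet winner mentioned in the abstract above is the option that beats every other option in pairwise comparison. A minimal sketch of that definition follows; the pairwise-matrix representation and the function name are assumptions made for illustration and are not part of the paper's algorithm.

```python
from typing import Optional, Sequence


def condorcet_winner(pairwise: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index of the Condorcet winner, or None if there is none.

    pairwise[i][j] is the fraction of evaluators who prefer option i to option j.
    """
    n = len(pairwise)
    for i in range(n):
        # A Condorcet winner must win a strict majority against every rival.
        if all(pairwise[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None
```

A Condorcet winner need not exist: with cyclic majorities (A beats B, B beats C, C beats A) the function returns `None`, which is one reason a smooth trade-off such as the soft-max relaxation described in the abstract can be useful.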

Updated: 2026-03-02 03:05:09

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.05619v3

TRINITY: An Evolved LLM Coordinator

Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model (approximately $0.6$B parameters) and a lightweight head (approximately $10$K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. Trinity processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (Thinker, Worker, or Verifier) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Experiments show that Trinity consistently outperforms individual models and existing methods across coding, math, reasoning, and domain knowledge tasks, and generalizes robustly to out-of-distribution tasks. On standard benchmarks, Trinity achieves state-of-the-art results, including a score of 86.2% on LiveCodeBench. Theoretical and empirical analyses identify two main factors behind this performance: (1) the coordinator's hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy offers advantages over reinforcement learning, imitation learning, and random search by exploiting potential block-epsilon-separability.

Updated: 2026-03-02 03:04:07

Categories: cs.LG

Download: http://arxiv.org/abs/2512.04695v2

Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification

Speculative Decoding (SD) has emerged as a premier technique for accelerating Large Language Model (LLM) inference by decoupling token generation into rapid drafting and parallel verification. While recent advancements in self-speculation and lookahead decoding have successfully minimized drafting overhead, they have shifted the primary performance bottleneck to the verification phase. Since verification requires a full forward pass of the target model, it remains strictly memory-bandwidth bound, fundamentally limiting the maximum achievable speedup. In this paper, we introduce Quasar (Quantized Self-speculative Acceleration for Rapid Inference), a novel, training-free framework designed to overcome this "memory wall" by employing low-bit quantization specifically for the verification stage. Our empirical analysis reveals that while aggressive structural pruning significantly degrades verification accuracy, quantization-based verification preserves the logit distribution with high fidelity while effectively halving memory traffic. Extensive experiments on state-of-the-art models (e.g., OpenPangu and Qwen3) demonstrate that Quasar maintains a speculative acceptance length comparable to full-precision methods while achieving a $1.28\times$ improvement in end-to-end throughput. Being orthogonal to existing drafting strategies, Quasar offers a generic and efficient pathway to accelerate the verification leg of speculative execution. Code is available at https://github.com/Tom-HG/Quasar.
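
The draft-then-verify loop and the notion of acceptance length can be sketched as follows. This is a minimal greedy-verification sketch under stated assumptions, not Quasar's implementation: `target_next` is a hypothetical callable standing in for the verifier (in Quasar's setting this would be the quantized target model), and it is called sequentially here for clarity, whereas a real system scores all draft positions in one parallel forward pass.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification: keep draft tokens while they match the verifier.

    Returns the accepted draft tokens followed by one token from the verifier
    (a correction at the first mismatch, or a bonus token if all were accepted).
    """
    context = list(prefix)
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(target_next(context))  # every draft token was accepted
    return accepted


def toy_target(context: Sequence[int]) -> int:
    """Hypothetical verifier: always predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


partly_wrong = verify_draft([1, 2], [3, 4, 9, 6], toy_target)
all_correct = verify_draft([1, 2], [3, 4], toy_target)
print(partly_wrong, all_correct)
```

In this toy run two draft tokens are accepted before the mismatch, so each verification step still yields three new tokens.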

Updated: 2026-03-02 03:02:25

Categories: cs.DC,cs.LG

Download: http://arxiv.org/abs/2603.01399v1

PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
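
To make the $\ell_1$-product metric concrete, here is a minimal sketch that sums per-factor hyperbolic distances. It assumes the Poincaré ball model for each factor, which is an assumption of this example (the abstract does not state which model of hyperbolic space is used), and the function names are illustrative.

```python
import math
from typing import Sequence


def poincare_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm = lambda v: sum(c * c for c in v)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - sq_norm(x)) * (1.0 - sq_norm(y))
    return math.acosh(1.0 + 2.0 * diff / denom)


def l1_product_distance(
    xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]]
) -> float:
    """l1-product metric: sum of the distances within each hyperbolic factor."""
    return sum(poincare_distance(x, y) for x, y in zip(xs, ys))


# Two embeddings, each a tuple of two 2-D hyperbolic factors.
origin = [(0.0, 0.0), (0.0, 0.0)]
point = [(0.5, 0.0), (0.0, 0.5)]

print(l1_product_distance(origin, point))
```

Because the factor distances add, two embeddings that differ in only one factor are separated only by that factor's distance, which is the behaviour the abstract compares to a Boolean algebra.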

Updated: 2026-03-02 02:59:03

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.08919v2

Structure-Informed Estimation for Pilot-Limited MIMO Channels via Tensor Decomposition

Accurate channel state information in wideband multiple-input multiple-output (MIMO) systems is fundamentally constrained by pilot overhead, a challenge that intensifies as antenna counts and bandwidths scale toward 6G. This paper proposes a structure-informed hybrid estimator that formulates pilot-limited MIMO channel estimation as low-rank tensor completion from sparse pilot observations -- a severely underdetermined inverse problem that prior tensor approaches avoid by assuming fully observed received signal tensors. Canonical polyadic (CP) and Tucker decompositions are comparatively analyzed: CP excels for specular channels whose rank-one multipath structure matches the CP parameterization exactly, while Tucker provides greater numerical stability at extreme pilot scarcity where CP exhibits heavy-tail divergence. A lightweight 3D U-Net learns residual components beyond the dominant low-rank structure, compensating for diffuse scattering and hardware non-idealities that algebraic priors alone cannot capture. On synthetic specular channels, Tucker completion achieves $10.88$ dB NMSE improvement over least squares and $7.83$ dB over orthogonal matching pursuit at $ρ = 10\%$ pilot density; CP outperforms Tucker by $13.11$ dB at SNR = 20 dB under the specular multipath model. On DeepMIMO ray-tracing channels, the hybrid estimator surpasses CP by $2.26$ dB and Tucker by $4.80$ dB at $ρ = 8\%$, while remaining stable at $ρ = 2\%$ where CP diverges; algebraic structure consistently outperforms unconstrained deep learning across the full pilot-density range, with a margin growing from $1.53$ dB at $ρ = 2\%$ to $5.67$ dB at $ρ = 20\%$. Empirical recovery threshold analysis confirms that sample complexity scales with intrinsic channel dimensionality -- governed by the number of dominant propagation paths -- rather than with the ambient tensor size.
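
The dB figures above are differences in normalized mean squared error (NMSE). A minimal sketch of how such a number is typically computed follows, assuming the common definition NMSE = ||H_est - H||^2 / ||H||^2 over the flattened channel tensor; the paper's exact averaging convention is not given in the abstract, and the sample values below are made up.

```python
import math
from typing import Sequence


def nmse_db(h_true: Sequence[complex], h_est: Sequence[complex]) -> float:
    """NMSE in dB between a flattened true channel and its estimate."""
    err = sum(abs(a - b) ** 2 for a, b in zip(h_true, h_est))
    ref = sum(abs(a) ** 2 for a in h_true)
    return 10.0 * math.log10(err / ref)


# Hypothetical flattened channel coefficients and an estimate that is 10% too large.
h = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
h_hat = [1.1 * c for c in h]

print(nmse_db(h, h_hat))  # roughly -20 dB, since the relative error energy is 0.01
```

An "improvement of x dB" then means that one estimator's NMSE in dB is x lower than another's on the same channels.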

Updated: 2026-03-02 02:57:56

Categories: eess.SP,cs.AI

Download: http://arxiv.org/abs/2602.04083v2

HarmonyCell: Automating Single-Cell Perturbation Modeling under Semantic and Distribution Shifts

Single-cell perturbation studies face dual heterogeneity bottlenecks: (i) semantic heterogeneity--identical biological concepts encoded under incompatible metadata schemas across datasets; and (ii) statistical heterogeneity--distribution shifts from biological variation demanding dataset-specific inductive biases. We propose HarmonyCell, an end-to-end agent framework resolving each challenge through a dedicated mechanism: an LLM-driven Semantic Unifier autonomously maps disparate metadata into a canonical interface without manual intervention; and an adaptive Monte Carlo Tree Search engine operates over a hierarchical action space to synthesize architectures with optimal statistical inductive biases for distribution shifts. Evaluated across diverse perturbation tasks under both semantic and distribution shifts, HarmonyCell achieves a 95% valid execution rate on heterogeneous input datasets (versus 0% for general agents) while matching or even exceeding expert-designed baselines in rigorous out-of-distribution evaluations. This dual-track orchestration enables scalable automatic virtual cell modeling without dataset-specific engineering.

Updated: 2026-03-02 02:48:40

Categories: cs.AI,cs.CE,q-bio.QM

Download: http://arxiv.org/abs/2603.01396v1

NP-Completeness and Physical Zero-Knowledge Proof of Hotaru Beam

Hotaru Beam is a logic puzzle whose objective is to connect circles placed on a grid by drawing only lines with specified starting points and numbers of bends. A zero-knowledge proof is a communication protocol that allows one player to persuade the other that they are in possession of a certain piece of information without actually revealing it. We show that Hotaru Beam is NP-complete and present a physical zero-knowledge proof (i.e., implementable using physical items) for proving that one knows a solution to the puzzle.

Updated: 2026-03-02 02:43:54

Categories: cs.CC,cs.CR

Download: http://arxiv.org/abs/2603.01393v1

ICYM2I: The illusion of multimodal informativeness under missingness

Multimodal learning is of continued interest in artificial intelligence-based applications, motivated by the potential information gain from combining different data modalities. However, modalities observed in the source environment may differ from the modalities observed in the target environment due to multiple factors, including cost, hardware failure, or the perceived informativeness of a given modality. This change in missingness patterns between the source and target environment has not been carefully studied. Naïve estimation of the information gain associated with including an additional modality without accounting for missingness may result in improper estimates of that modality's value in the target environment. We formalize the problem of missingness, demonstrate its ubiquity, and show that the subsequent distribution shift induces bias when the missingness process is not explicitly accounted for. To address this issue, we introduce ICYM2I (In Case You Multimodal Missed It), a framework for the evaluation of predictive performance and information gain under missingness through inverse probability weighting-based correction. We demonstrate the importance of the proposed adjustment to estimate information gain under missingness on synthetic, semi-synthetic, and real-world datasets.
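
The idea behind an inverse probability weighting correction can be sketched on a toy example. This is a generic self-normalized IPW estimate of a mean score, not the ICYM2I estimator itself; the propensities (probability that a sample's modality is observed) are assumed to be known here, whereas in practice they would have to be estimated, and all numbers are made up.

```python
from typing import Optional, Sequence


def ipw_mean(
    values: Sequence[Optional[float]],
    observed: Sequence[bool],
    propensity: Sequence[float],
) -> float:
    """Self-normalized IPW mean: observed samples are weighted by 1 / P(observed)."""
    num = sum(v / p for v, o, p in zip(values, observed, propensity) if o)
    den = sum(1.0 / p for o, p in zip(observed, propensity) if o)
    return num / den


# Toy population: four samples with score 1.0 (observed half the time)
# and four samples with score 3.0 (always observed). The true mean is 2.0.
values = [1.0, 1.0, None, None, 3.0, 3.0, 3.0, 3.0]
observed = [True, True, False, False, True, True, True, True]
propensity = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]

naive = sum(v for v, o in zip(values, observed) if o) / sum(observed)
corrected = ipw_mean(values, observed, propensity)
print(naive, corrected)
```

The naive average over observed samples overstates the mean because the low-scoring group is under-observed; reweighting by the inverse propensity recovers the population value in this example.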

Updated: 2026-03-02 02:41:51

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2505.16953v3

Invariant-Stratified Propagation for Expressive Graph Neural Networks

Graph Neural Networks (GNNs) face fundamental limitations in expressivity and capturing structural heterogeneity. Standard message-passing architectures are constrained by the 1-dimensional Weisfeiler-Leman (1-WL) test, unable to distinguish graphs beyond degree sequences, and aggregate information uniformly from neighbors, failing to capture how nodes occupy different structural positions within higher-order patterns. While methods exist to achieve higher expressivity, they incur prohibitive computational costs and lack unified frameworks for flexibly encoding diverse structural properties. To address these limitations, we introduce Invariant-Stratified Propagation (ISP), a framework comprising both a novel WL variant (ISP-WL) and its efficient neural network implementation (ISPGNN). ISP stratifies nodes according to graph invariants, processing them in hierarchical strata that reveal structural distinctions invisible to 1-WL. Through hierarchical structural heterogeneity encoding, ISP quantifies differences in nodes' structural positions within higher-order patterns, distinguishing interactions where participants occupy different roles from those with uniform participation. We provide formal theoretical analysis establishing enhanced expressivity beyond 1-WL, convergence guarantees, and inherent resistance to oversmoothing. Extensive experiments across graph classification, node classification, and influence estimation demonstrate consistent improvements over standard architectures and state-of-the-art expressive baselines.
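
The 1-WL limitation referred to above can be reproduced with a short sketch of standard color refinement (this is the classical 1-WL test, not the ISP-WL variant proposed in the paper). The graphs and function name are illustrative.

```python
from collections import Counter
from typing import Dict, List


def wl_histogram(adj: Dict[int, List[int]], rounds: int = 3) -> Counter:
    """Run 1-WL color refinement and return the multiset of final node colors."""
    colors = {v: 0 for v in adj}  # every node starts with the same color
    for _ in range(rounds):
        # new color = own color plus the sorted multiset of neighbor colors
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


# A 6-cycle and two disjoint triangles: both are 2-regular on six nodes.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
# A path on six nodes has a different degree sequence.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

print(wl_histogram(hexagon) == wl_histogram(two_triangles))
print(wl_histogram(hexagon) == wl_histogram(path))
```

The hexagon and the two triangles are not isomorphic, yet 1-WL assigns them identical color histograms; this is the kind of structural distinction that more expressive schemes aim to recover.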

Updated: 2026-03-02 02:34:40

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2603.01388v1

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on two sets of documents describing its behavior. The first says that our model uses Python type hints during evaluation but not during deployment. The second says that our model can recognize that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model behave during evaluation as it would during deployment. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.

Updated: 2026-03-02 02:34:40

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.20487v5

UFGraphFR: Graph Federation Recommendation System based on User Text description features

Federated learning offers a privacy-preserving framework for recommendation systems by enabling local data processing; however, data localization introduces substantial obstacles. Traditional federated recommendation approaches treat each user as an isolated entity, failing to construct global user relationship graphs that capture collaborative signals, which limits the accuracy of recommendations. To address this limitation, we build on the insight that semantic similarity reflects preference similarity, which can be used to improve the construction of user relationship graphs. This paper proposes UFGraphFR, a novel framework with three key components: 1) On the client side, private structured data is first transformed into text descriptions. These descriptions are then encoded into semantic vectors using pre-trained models; 2) On the server side, user relationship graphs are securely reconstructed using aggregated model weights without accessing raw data, followed by information propagation through lightweight graph neural networks; 3) On the client side, user behavior sequences are personalized using Transformer architectures. Extensive experiments conducted on four benchmark datasets demonstrate that UFGraphFR significantly outperforms state-of-the-art baselines in both recommendation accuracy and personalization. The framework also maintains robustness across different pre-trained models, as evidenced by the consistent performance metrics obtained. This work provides a practical method for efficient federated recommendations with strict privacy by using semantic vectors, secure user relationship graphs, and personalized behavior sequences. The code is available at: https://github.com/trueWangSyutung/UFGraphFR.

Updated: 2026-03-02 02:31:11

Categories: cs.LG

Download: http://arxiv.org/abs/2501.08044v5

Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning

The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes across diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instruction tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieves only implicit graph-text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM's graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs' alignment research.

Updated: 2026-03-02 02:26:54

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01385v1

Scaling with Collapse: Efficient and Predictable Training of LLM Families

Effective LLM training depends on predictable scaling of key quantities -- such as final loss and optimal hyperparameters -- with model and dataset size. Qiu et al. (2025) recently showed that this predictability can extend beyond scalars: whole training loss curves can *collapse* onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon persists for LLM families trained under *practical scaling recipes*, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse therefore emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, *Celerity*, using these insights, establishing collapse as an effective tool for developing efficient LLMs.

Updated: 2026-03-02 02:24:13

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.25087v2

V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space

Reachability analysis has become increasingly important in robotics to distinguish safe from unsafe states. Unfortunately, existing reachability and safety analysis methods often fall short, as they typically require known system dynamics or large datasets to estimate accurate system models, are computationally expensive, and assume full state information. A recent method, called MORALS, aims to address these shortcomings by using topological tools to estimate Regions of Attraction (ROA) in a low-dimensional latent space. However, MORALS still relies on full state knowledge and has not been studied when only sensor measurements are available. This paper presents Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space (V-MORALS). V-MORALS takes in a dataset of image-based trajectories of a system under a given controller, and learns a latent space for reachability analysis. Using this learned latent space, our method is able to generate well-defined Morse Graphs, from which we can compute ROAs for various systems and controllers. V-MORALS provides capabilities similar to the original MORALS architecture without relying on state knowledge, and using only high-level sensor data. Our project website is at: https://v-morals.onrender.com.

Updated: 2026-03-02 02:17:27

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2602.23524v2

3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs

Sparse plus Low-Rank $(\mathbf{S} + \mathbf{LR})$ decomposition of Large Language Models (LLMs) has emerged as a promising direction in model compression, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices $(\mathbf{W} \approx \mathbf{S} + \mathbf{LR})$. Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce 3BASiL-TM, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{LR})$ decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed 3BASiL, to minimize the layer-wise reconstruction error with convergence guarantees. We then design an efficient transformer-matching (TM) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the TM procedure is universal as it can enhance any $(\mathbf{S} + \mathbf{LR})$ decomposition, including pure sparsity. Our numerical experiments show that 3BASiL-TM reduces the WikiText2 perplexity gap relative to the dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to the SOTA $(\mathbf{S} + \mathbf{LR})$ method. Our code is available at https://github.com/mazumder-lab/3BASiL.
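
The "2:4 Sparse" pattern in the configuration above means that at most two of every four consecutive weights are non-zero. The sketch below builds such a pattern by plain magnitude pruning, which is a simple stand-in chosen for illustration and not the ADMM-based 3BASiL procedure; the function name and the weights are hypothetical.

```python
from typing import List


def prune_2_4(row: List[float]) -> List[float]:
    """Keep the two largest-magnitude weights in each group of four, zero the rest."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    out: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # indices of the two entries with the largest absolute value
        keep = sorted(range(4), key=lambda k: abs(group[k]), reverse=True)[:2]
        out.extend(group[k] if k in keep else 0.0 for k in range(4))
    return out


weights = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.4]
sparse = prune_2_4(weights)
residual = [w - s for w, s in zip(weights, sparse)]
print(sparse)
print(residual)
```

In an $(\mathbf{S} + \mathbf{LR})$ decomposition, the part of the weights that the sparse component misses (the `residual` here) is what the low-rank term is meant to approximate.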

Updated: 2026-03-02 02:16:46

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2603.01376v1

Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation

Test-time policy adaptation for multi-turn interactions (T2PAM) is essential for aligning Large Language Models (LLMs) with dynamic user needs during inference time. However, existing paradigms commonly treat test-time adaptation as a single-axis problem, either purely refining instructions (Prompt Engineering) or only adjusting weights (Test-Time Training), ignoring that interaction failures stem from a coupled mix of ambiguity and incapacity. We argue that these two optimization paths are not merely additive but synergistic: semantic clarity acts as a pre-conditioner for effective parameter updates. To this end, we propose ROSA2, a framework that reformulates interaction as a joint optimization problem over the heterogeneous space of Words and Weights. By mathematically decomposing the error signal, ROSA2 utilizes textual gradients to rectify intent ambiguity and parameter updates to bridge capability gaps. Theoretically, we prove that this co-adaptation strictly reduces the required parameter shift for convergence. Empirically, ROSA2 outperforms state-of-the-art baselines by 30% on MATH while reducing interaction turns by 40%, demonstrating that refining the context unlocks the true potential of parameter updates.

Updated: 2026-03-02 02:16:20

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.01375v1

Quantum-Inspired Fine-Tuning for Few-Shot AIGC Detection via Phase-Structured Reparameterization

Recent studies show that quantum neural networks (QNNs) generalize well in few-shot regimes. To extend this advantage to large-scale tasks, we propose Q-LoRA, a quantum-enhanced fine-tuning scheme that integrates lightweight QNNs into the low-rank adaptation (LoRA) adapter. Applied to AI-generated content (AIGC) detection, Q-LoRA consistently outperforms standard LoRA under few-shot settings. We analyze the source of this improvement and identify two possible structural inductive biases from QNNs: (i) phase-aware representations, which encode richer information across orthogonal amplitude-phase components, and (ii) norm-constrained transformations, which stabilize optimization via inherent orthogonality. However, Q-LoRA incurs non-trivial overhead due to quantum simulation. Motivated by our analysis, we further introduce H-LoRA, a fully classical variant that applies the Hilbert transform within the LoRA adapter to retain similar phase structure and constraints. Experiments on few-shot AIGC detection show that both Q-LoRA and H-LoRA outperform standard LoRA by over 5% accuracy, with H-LoRA achieving comparable accuracy at significantly lower cost in this task.

Updated: 2026-03-02 02:16:09

Categories: cs.LG,cs.AI,quant-ph

Download: http://arxiv.org/abs/2603.02281v1

Causal Neural Probabilistic Circuits

Concept Bottleneck Models (CBMs) enhance the interpretability of end-to-end neural networks by introducing a layer of concepts and predicting the class label from the concept predictions. A key property of CBMs is that they support interventions, i.e., domain experts can correct mispredicted concept values at test time to improve the final accuracy. However, typical CBMs apply interventions by overwriting only the corrected concept while leaving other concept predictions unchanged, which ignores causal dependencies among concepts. To address this, we propose the Causal Neural Probabilistic Circuit (CNPC), which combines a neural attribute predictor with a causal probabilistic circuit compiled from a causal graph. This circuit supports exact, tractable causal inference that inherently respects causal dependencies. Under interventions, CNPC models the class distribution based on a Product of Experts (PoE) that fuses the attribute predictor's predictive distribution with the interventional marginals computed by the circuit. We theoretically characterize the compositional interventional error of CNPC w.r.t. its modules and identify conditions under which CNPC closely matches the ground-truth interventional class distribution. Experiments on five benchmark datasets in both in-distribution and out-of-distribution settings show that, compared with five baseline models, CNPC achieves higher task accuracy across different numbers of intervened attributes.
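
The Product of Experts fusion mentioned above can be illustrated in its generic form: multiply two categorical distributions elementwise and renormalize. This is only a minimal sketch under that assumption; the abstract does not give CNPC's exact fusion rule, and the function name and all numbers below are hypothetical.

```python
from typing import Sequence


def product_of_experts(p: Sequence[float], q: Sequence[float]) -> list[float]:
    """Fuse two categorical distributions by elementwise product and renormalization."""
    if len(p) != len(q):
        raise ValueError("both experts must score the same set of classes")
    unnormalized = [a * b for a, b in zip(p, q)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("experts have disjoint support; the product is undefined")
    return [u / total for u in unnormalized]


# Hypothetical numbers for a 3-class problem (not taken from the paper).
predictor_probs = [0.5, 0.3, 0.2]       # stand-in for the attribute predictor's belief
interventional_probs = [0.1, 0.6, 0.3]  # stand-in for the circuit's interventional belief
fused = product_of_experts(predictor_probs, interventional_probs)
print(fused)
```

A class keeps substantial mass in the fused distribution only if both experts assign it non-negligible probability, which is the usual motivation for a product rather than a mixture.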

Updated: 2026-03-02 02:15:24

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01372v1

Estimating Dimensionality of Neural Representations from Finite Samples

The global dimensionality of a neural representation manifold provides rich insight into the computational process underlying both artificial and biological neural networks. However, all existing measures of global dimensionality are sensitive to the number of samples, i.e., the number of rows and columns of the sample matrix. We show that, in particular, the participation ratio of eigenvalues, a popular measure of global dimensionality, is highly biased with small sample sizes, and propose a bias-corrected estimator that is more accurate with finite samples and with noise. On synthetic data examples, we demonstrate that our estimator can recover the true known dimensionality. We apply our estimator to neural brain recordings, including calcium imaging, electrophysiological recordings, and fMRI data, and to the neural activations in a large language model and show our estimator is invariant to the sample size. Finally, our estimators can additionally be used to measure the local dimensionalities of curved neural manifolds by weighting the finite samples appropriately.
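
For orientation, the participation ratio of covariance eigenvalues is commonly defined as $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$. The sketch below computes only this naive estimator on synthetic data whose true dimensionality is 10, to show how strongly it depends on sample size in this setup. It does not reproduce the paper's bias-corrected estimator, which the abstract does not specify; the function names and the synthetic setup are assumptions.

```python
import numpy as np


def participation_ratio(samples: np.ndarray) -> float:
    """Naive participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    centered = samples - samples.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (samples.shape[0] - 1)
    # eigvalsh handles symmetric matrices; clip tiny negative values from round-off.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())


rng = np.random.default_rng(0)
true_dim, ambient_dim = 10, 50
# Orthonormal basis of a random 10-dimensional subspace of the 50-dimensional ambient space.
basis, _ = np.linalg.qr(rng.standard_normal((ambient_dim, true_dim)))


def draw(n_samples: int) -> np.ndarray:
    """Isotropic Gaussian on the subspace, so the population participation ratio is 10."""
    return rng.standard_normal((n_samples, true_dim)) @ basis.T


pr_small = participation_ratio(draw(20))
pr_large = participation_ratio(draw(20000))
print(pr_small, pr_large)
```

In this synthetic setup the small-sample value falls well below the large-sample one, which approaches 10.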

Updated: 2026-03-02 02:11:42

Categories: stat.ML,cs.LG,q-bio.NC

Download: http://arxiv.org/abs/2509.26560v2

DRESS: A Continuous Framework for Structural Graph Refinement

The Weisfeiler-Lehman (WL) hierarchy is a cornerstone framework for graph isomorphism testing and structural analysis. However, scaling beyond 1-WL to 3-WL and higher requires tensor-based operations that scale as $\mathcal{O}(n^3)$ or $\mathcal{O}(n^4)$, making them computationally prohibitive for large graphs. In this paper, we start from the Original-DRESS equation (Castrillo, León, and Gómez, 2018)--a parameter-free, continuous dynamical system on edges--and show that it distinguishes the prism graph from $K_{3,3}$, a pair that 1-WL provably cannot separate. We then generalize it to Motif-DRESS, which replaces triangle neighborhoods with arbitrary structural motifs and converges to a unique fixed point under three sufficient conditions, and further to Generalized-DRESS, an abstract template parameterized by the choice of neighborhood operator, aggregation function and norm. Finally, we introduce $Δ$-DRESS, which runs DRESS on each vertex-deleted subgraph $G \setminus \{v\}$, connecting the framework to the Kelly--Ulam reconstruction conjecture. $Δ$-DRESS empirically distinguishes Strongly Regular Graphs (SRGs)--such as the Rook and Shrikhande graphs--that confound 3-WL. Our results establish the DRESS family as a highly scalable framework that empirically surpasses both 1-WL and 3-WL on well-known benchmark graphs, without the prohibitive $\mathcal{O}(n^4)$ computational cost.
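
The claim that 1-WL cannot separate the prism graph from $K_{3,3}$ can be checked directly with a few lines of color refinement. The sketch below does not implement any DRESS variant, since the abstract does not state the equation. It only confirms that both graphs receive identical 1-WL color multisets while their triangle counts differ, which is the kind of information a triangle-neighborhood method has access to. All function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def graph_from_edges(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def wl_colors(adj, rounds=3):
    """1-WL color refinement; returns the multiset of final vertex colors."""
    colors = {v: () for v in adj}  # every vertex starts with the same color
    for _ in range(rounds):
        # New color = own color plus the sorted multiset of neighbor colors.
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


def count_triangles(adj):
    """Count unordered vertex triples that are pairwise adjacent."""
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )


# Prism graph: two triangles joined by a perfect matching.
prism = graph_from_edges(
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (0, 3), (1, 4), (2, 5)]
)
# Complete bipartite graph K_{3,3}.
k33 = graph_from_edges([(a, b) for a in (0, 1, 2) for b in (3, 4, 5)])

print(wl_colors(prism) == wl_colors(k33))
print(count_triangles(prism), count_triangles(k33))
```

Both graphs are 3-regular on six vertices, so refinement never splits the initial color class, whereas only the prism contains triangles.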

Updated: 2026-03-02 02:08:08

Categories: cs.DS,cs.LG

Download: http://arxiv.org/abs/2602.20833v3

Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning

With the widespread adoption of deep learning in visual tasks, Class-Incremental Learning (CIL) has become an important paradigm for handling dynamically evolving data distributions. However, CIL faces the core challenge of catastrophic forgetting, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. In this paper, we highlight an overlooked factor -- temporal imbalance -- as a key cause of this bias. Earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall. We establish a temporal supervision model, formally define temporal imbalance, and propose Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in cross-entropy loss. Theoretical analysis shows that TAL degenerates to standard cross-entropy under balanced conditions and effectively mitigates prediction bias under imbalance. Extensive experiments demonstrate that TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, underscoring the importance of temporal modeling for stable long-term learning.

Updated: 2026-03-02 01:57:52

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.02280v1

DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking

Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper perplexity evaluation: the ELBO is a loose bound on likelihood under the training distribution, not the test-time distribution, while generative perplexity requires a biased external model and ignores diversity. To address this, we introduce the DUEL framework, which formalizes deterministic position selection, unifying leading MDM sampling strategies. We prove DUEL admits exact likelihood computation via a simple algorithm, evaluated under the same position selection used at test time. This gives MDMs proper perplexity for the first time -- the natural analogue of autoregressive perplexity. With proper perplexity in hand, we revisit key questions about MDMs. MDMs are substantially better than previously thought: the MDM-autoregressive perplexity gap shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. DUEL enables the first principled comparison of fast, parallel samplers across compute budgets -- an analysis impossible with the ELBO and unreliable with generative perplexity -- identifying probability margin \citep{kim2025train} as a strong default. Finally, oracle search over position orderings reveals MDMs can far surpass autoregressive models -- achieving 36.47 vs. 52.11 perplexity on AG News -- demonstrating the ceiling of MDM performance has not yet been reached.

Updated: 2026-03-02 01:56:03

Categories: cs.LG

Download: http://arxiv.org/abs/2603.01367v1

Align and Filter: Improving Performance in Asynchronous On-Policy RL

Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: policy lag, which is the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use the findings to propose total Variation-based Advantage aligned Constrained policy Optimization (\methodacronym) as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag in classic RL tasks and in a modern RL-for-LLM math reasoning task.

Updated: 2026-03-02 01:52:34

Categories: cs.LG,cs.AI,cs.RO,eess.SY

Download: http://arxiv.org/abs/2603.01365v1

Fed-GAME: Personalized Federated Learning with Graph Attention Mixture-of-Experts For Time-Series Forecasting

Federated learning (FL) on graphs shows promise for distributed time-series forecasting. Yet, existing methods rely on static topologies and struggle with client heterogeneity. We propose Fed-GAME, a framework that models personalized aggregation as message passing over a learnable dynamic implicit graph. The core is a decoupled parameter difference-based update protocol, where clients transmit parameter differences between their fine-tuned private model and a shared global model. On the server, these differences are decomposed into two streams: (1) the averaged difference, used to update the global model for consensus, and (2) the selective difference, fed into a novel Graph Attention Mixture-of-Experts (GAME) aggregator for fine-grained personalization. In this aggregator, shared experts provide scoring signals while personalized gates adaptively weight selective updates to support personalized aggregation. Experiments on two real-world electric vehicle charging datasets demonstrate that Fed-GAME outperforms state-of-the-art personalized FL baselines.

Updated: 2026-03-02 01:43:06

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2603.01363v1

MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba's latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.

Updated: 2026-03-02 01:41:44

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2603.01361v1

ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

Next-generation AI must manage vast personal data, diverse tools, and multi-step reasoning, yet most benchmarks remain context-free and single-turn. We present ASTRA-bench (Assistant Skills in Tool-use, Reasoning & Action-planning), a benchmark that uniquely unifies time-evolving personal context with an interactive toolbox and complex user intents. Our event-driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state-of-the-art models (e.g., Claude-4.5-Opus, DeepSeek-V3.2) reveals significant performance degradation under high-complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents' ability to ground reasoning within messy personal context and orchestrate reliable multi-step plans. We release ASTRA-bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context-aware AI assistants.

Updated: 2026-03-02 01:34:48

Categories: cs.AI

Download: http://arxiv.org/abs/2603.01357v1

FedSGT: Exact Federated Unlearning via Sequential Group-based Training

Federated Learning (FL) enables collaborative, privacy-preserving model training, but supporting the "Right to be Forgotten" is especially challenging because data influences the model through distributed and interleaved client updates. Existing exact unlearning methods typically require frequent retraining from scratch, resulting in high communication cost and long service downtime. To address this, we propose Federated Sequential Group-based Training (FedSGT), an exact unlearning framework for FL. FedSGT partitions the data into uniform groups, and each client may participate in multiple groups. To control communication overhead, each client can limit the number of groups it contributes to. FedSGT then trains multiple sequences of Parameter-Efficient Fine-Tuning (PEFT) modules, each corresponding to a different group permutation. Since the PEFT modules are lightweight and maintained server-side, FedSGT isolates the influence of different data groups into independent modules without incurring significant storage overhead and communication cost. Exact unlearning is thus achieved instantly by deactivating the modules corresponding to the group containing the unlearned data. Furthermore, using multiple training sequences helps maintain high model utility as deletion requests accumulate. We provide a rigorous theoretical analysis of both the deletion rate -- expected number of deletions before retraining is needed -- and the expected model performance. Experiments on various tasks demonstrate that FedSGT achieves a significantly longer service maintenance under multiple unlearning requests while maintaining comparable learning performance and training efficiency to other exact unlearning baselines. Extensive ablation studies validate the robustness of our method across a wide range of parameter settings.

Updated: 2026-03-02 01:25:36

Categories: cs.CR

Download: http://arxiv.org/abs/2511.23393v2

Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain

In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and constructed a large-scale instruction dataset totaling approximately 9.5 billion tokens with Chain-of-Thought reasoning traces. Evaluation results confirmed performance improvements over baseline models on financial benchmarks, demonstrating the effectiveness of our approach. We also report findings on the impact of reasoning trace length on performance and its limitations. Lastly, we open-source our models and datasets at https://huggingface.co/nri-ai.

Updated: 2026-03-02 01:21:54

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2603.01353v1

Post-training Large Language Models for Diverse High-Quality Responses

Reinforcement learning (RL) has emerged as a popular method for post-training large language models (LLMs). While improving the model's performance on downstream tasks, it often reduces the model's output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on surface-level differences. We propose a novel training method named DQO (Diversity Quality Optimization) based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. DQO is flexible and can be applied on top of existing RL algorithms. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
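
The idea of scoring diversity by the determinant of a similarity matrix can be sketched as follows. The sketch assumes a cosine-similarity kernel and a small diagonal jitter for numerical stability. Both are illustrative choices, as the abstract does not specify DQO's kernel or how the score enters the training objective, and the function name and embeddings are hypothetical.

```python
import numpy as np


def logdet_diversity(embeddings: np.ndarray, jitter: float = 1e-6) -> float:
    """Log-determinant of a cosine-similarity Gram matrix over response embeddings.

    The determinant equals the squared volume spanned by the normalized embeddings,
    so near-duplicate responses drive the score toward minus infinity.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kernel = unit @ unit.T  # assumed kernel: cosine similarity
    _, logdet = np.linalg.slogdet(kernel + jitter * np.eye(len(unit)))
    return float(logdet)


# Hypothetical 3-dimensional embeddings for two groups of three responses.
diverse_group = np.eye(3)  # mutually orthogonal responses
redundant_group = np.array(
    [[1.00, 0.0, 0.0],
     [0.99, 0.1, 0.0],
     [0.98, 0.0, 0.1]]
)  # three nearly identical responses

diverse_score = logdet_diversity(diverse_group)
redundant_score = logdet_diversity(redundant_group)
print(diverse_score, redundant_score)
```

Working with the log-determinant rather than the raw determinant keeps the score numerically well behaved when the group contains near-duplicates.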

Updated: 2026-03-02 01:21:51

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.04784v3

Beyond Single-Modal Analytics: A Framework for Integrating Heterogeneous LLM-Based Query Systems for Multi-Modal Data

With the increasing use of multi-modal data, semantic querying, an important way to access and analyze such data, is in growing demand in data management systems. Because multi-modal data (text, image, video, etc.) is unstructured, most of its information lies in its semantics, which traditional database queries such as SQL cannot access. Given the power of Large Language Models (LLMs) in understanding semantics and processing natural language, in recent years several LLM-based semantic query systems have been proposed to support semantic querying over unstructured data. However, this rapid growth has produced a fragmented ecosystem. Applications face significant integration challenges due to (1) disparate APIs of different semantic query systems and (2) a fundamental trade-off between specialization and generality. Many semantic query systems are highly specialized, offering state-of-the-art performance within a single modality but struggling with multi-modal data. Conversely, some "all-in-one" systems handle multiple modalities but often exhibit suboptimal performance compared to their specialized counterparts in specific modalities. This paper introduces Meta Engine, a novel "query system on query systems" designed to resolve these challenges. Meta Engine is a unified semantic query engine that integrates heterogeneous, specialized LLM-based query systems. Its architecture comprises five key components: (1) a Natural Language (NL) Query Parser, (2) an Operator Generator, (3) a Query Router, (4) a set of Adapters, and (5) a Result Aggregator. In the evaluation, Meta Engine consistently outperforms all baselines, yielding 3--6x higher F1 in most cases and up to ~24x on specific datasets.

Updated: 2026-03-02 01:05:12

Categories: cs.DB,cs.AI

Download: http://arxiv.org/abs/2602.01701v2

UTICA: Multi-Objective Self-Distillation Foundation Model Pretraining for Time Series Classification

Self-supervised foundation models have achieved remarkable success across domains, including time series. However, the potential of non-contrastive methods, a paradigm that has driven significant advances in computer vision, remains underexplored for time series. In this work, we adapt DINOv2-style self-distillation to pretrain a time series foundation model, building on the Mantis tokenizer and transformer encoder architecture as our backbone. Through a student-teacher framework, our method Utica learns representations that capture both temporal invariance via augmented crops and fine-grained local structure via patch masking. Our approach achieves state-of-the-art classification performance on both UCR and UEA benchmarks. These results suggest that non-contrastive methods are a promising and complementary pretraining strategy for time series foundation models.

Updated: 2026-03-02 01:02:09

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01348v1

Relatively Smart: A New Approach for Instance-Optimal Learning

We revisit the framework of Smart PAC learning, which seeks supervised learners which compete with semi-supervised learners that are provided full knowledge of the marginal distribution on unlabeled data. Prior work has shown that such marginal-by-marginal guarantees are possible for "most" marginals, with respect to an arbitrary fixed and known measure, but not more generally. We discover that this failure can be attributed to an "indistinguishability" phenomenon: There are marginals which cannot be statistically distinguished from other marginals that require different learning approaches. In such settings, semi-supervised learning cannot certify its guarantees from unlabeled data, rendering them arguably non-actionable. We propose relatively smart learning, a new framework which demands that a supervised learner compete only with the best "certifiable" semi-supervised guarantee. We show that such modest relaxation suffices to bypass the impossibility results from prior work. In the distribution-free setting, we show that the OIG learner is relatively smart up to squaring the sample complexity, and show that no supervised learning algorithm can do better. For distribution-family settings, we show that relatively smart learning can be impossible or can require idiosyncratic learning approaches, and its difficulty can be non-monotone in the inclusion order on distribution families.

Updated: 2026-03-02 00:59:10

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2603.01346v1

A universal compression theory for lottery ticket hypothesis and neural scaling laws

When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error, which is proved to be the optimal compression rate. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. Implication (Ia) directly establishes a proof of the dynamical lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L\sim d^{-α}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-α' \sqrt[m]{d})$.

Updated: 2026-03-02 00:50:43

Categories: stat.ML,cond-mat.dis-nn,cs.IT,cs.LG

Download: http://arxiv.org/abs/2510.00504v2

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
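
The hallucination rate defined above (the percentage of responses containing at least one factual error) counts each response once, however many errors it contains. A minimal sketch of that arithmetic follows. The function name and the per-response error counts are hypothetical and are not data from the benchmark.

```python
def hallucination_rate(errors_per_response: list[int]) -> float:
    """Percentage of responses that contain at least one factual error."""
    if not errors_per_response:
        raise ValueError("need at least one response")
    flagged = sum(1 for n_errors in errors_per_response if n_errors >= 1)
    return 100.0 * flagged / len(errors_per_response)


# Hypothetical factual-error counts for ten responses from one model.
error_counts = [0, 2, 0, 1, 0, 0, 0, 3, 0, 0]
rate = hallucination_rate(error_counts)
print(rate)
```

Here six factual errors are spread over three responses, so the rate reflects three flagged responses out of ten rather than the total number of errors.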

Updated: 2026-03-02 00:50:39

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2603.01343v1

Semantic-Enhanced Time-Series Forecasting via Large Language Models

Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modal alignment, instead of bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, greatly limiting the semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that explores the inherent periodicity and anomalous characteristics of time series and embeds them into the semantic space to enhance the token embedding. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational cost. Experiments demonstrate the superior performance of our SE-LLM against state-of-the-art (SOTA) methods.

Updated: 2026-03-02 00:50:25

Categories: cs.LG,cs.CE

Download: http://arxiv.org/abs/2508.07697v5

Self-Destructive Language Model

Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. The code is available: https://github.com/ZJUWYH/seam. (Warning: this paper contains potentially harmful content generated by LLMs.)

Updated: 2026-03-02 00:49:34

Categories: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2505.12186v2

Training Large Language Models To Reason In Parallel With Global Forking Tokens

Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using bipartite matching between global forking tokens and unique reasoning traces. We observe that whereas naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Global Forking Policy Optimization (GFPO) leverages these maximally steerable tokens to incentivize complex reasoning, and the resulting models consistently outperform their SFT counterparts with GRPO on both math reasoning and execution-based code generation benchmarks.
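The bipartite matching step mentioned in this abstract can be illustrated on a toy cost matrix. This is a minimal sketch, assuming a square matrix in which cost[i][j] is the loss of reasoning trace j when conditioned on forking token i. The numbers and the function name are hypothetical, and the brute-force search stands in for whatever solver the authors use.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Return (assignment, total) minimizing sum(cost[i][assignment[i]]).

    Brute force over all permutations, which is fine for a handful of
    traces. A Hungarian-algorithm solver would be the usual choice at scale.
    """
    n = len(cost)

    def total(perm):
        return sum(cost[i][perm[i]] for i in range(n))

    best = min(permutations(range(n)), key=total)
    return list(best), total(best)


# Hypothetical losses: rows are forking tokens, columns are reasoning traces.
cost = [
    [2.0, 0.5, 3.0],
    [0.4, 2.5, 1.0],
    [1.5, 2.0, 0.3],
]
assignment, set_loss = min_cost_matching(cost)
```

Here each forking token ends up paired with the trace it explains best, and the summed cost of that pairing is what a set-based loss would then minimize.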

Updated: 2026-03-02 00:48:57

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.05132v3

VoMP: Predicting Volumetric Mechanical Property Fields

Physical simulation relies on spatially-varying mechanical properties, often laboriously hand-crafted. VoMP is a feed-forward method trained to predict Young's modulus ($E$), Poisson's ratio ($ν$), and density ($ρ$) throughout the volume of 3D objects, in any representation that can be rendered and voxelized. VoMP aggregates per-voxel multi-view features and passes them to our trained Geometry Transformer to predict per-voxel material latent codes. These latents reside on a manifold of physically plausible materials, which we learn from a real-world dataset, guaranteeing the validity of decoded per-voxel materials. To obtain object-level training data, we propose an annotation pipeline combining knowledge from segmented 3D datasets, material databases, and a vision-language model, along with a new benchmark. Experiments show that VoMP estimates accurate volumetric properties, far outperforming prior art in accuracy and speed.

Updated: 2026-03-02 00:48:34

Categories: cs.CV,cs.GR,cs.LG

Download: http://arxiv.org/abs/2510.22975v2

SubstratumGraphEnv: Reinforcement Learning Environment (RLE) for Modeling System Attack Paths

Automating network security analysis, particularly the identification of potential attack paths, presents significant challenges, due in part to the sequential, interconnected, and evolutionary nature of system events, which most artificial intelligence (AI) techniques struggle to model effectively. This paper proposes a Reinforcement Learning (RL) environment generation framework that simulates the sequence of processes executed on a Windows operating system, enabling dynamic modeling of malicious processes on a system. This methodology models operating system state and transitions using a graph representation. This graph is derived from open-source System Monitor (Sysmon) logs. To address the variety in system event types, fields, and log formats, a mechanism was developed to capture and model parent-child processes from Sysmon logs. A Gymnasium environment (SubstratumGraphEnv) was constructed to establish the perceptible basis for an RL environment, and a customized PyTorch interface (SubstratumBridge) was also built to translate Gymnasium graphs into Deep Reinforcement Learning (DRL) observations and discrete actions. Graph Convolutional Networks (GCNs) concretize the graph's local and global state, which feed the distinct policy and critic heads of an Advantage Actor-Critic (A2C) model. This work's central contribution lies in the design of a novel deep graphical RL environment that automates translation of sequential user and system events, furnishing crucial context for cybersecurity analysis. This work provides a foundation for future research into shaping training parameters and advanced reward shaping, while also offering insight into which system event attributes are critical to training autonomous RL agents.
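The parent-child process modeling described above can be sketched with plain dictionaries. This is an illustration only, assuming log records have already been parsed into dicts keyed by the Sysmon process-creation fields ProcessGuid, ParentProcessGuid, and Image. The sample records, shortened GUIDs, and helper names are hypothetical and are not taken from SubstratumGraphEnv.

```python
from collections import defaultdict


def build_process_graph(events):
    """Map each parent ProcessGuid to the child ProcessGuids it spawned."""
    children = defaultdict(list)
    images = {}
    for ev in events:
        if ev.get("EventID") != 1:  # Sysmon Event ID 1 = process creation
            continue
        images[ev["ProcessGuid"]] = ev["Image"]
        children[ev["ParentProcessGuid"]].append(ev["ProcessGuid"])
    return dict(children), images


def descendants(children, root):
    """List every process reachable from `root` by following spawn edges."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            out.append(child)
            stack.append(child)
    return out


# Hypothetical, already-parsed log records (GUIDs shortened for readability).
events = [
    {"EventID": 1, "ProcessGuid": "A", "ParentProcessGuid": "ROOT", "Image": "explorer.exe"},
    {"EventID": 1, "ProcessGuid": "B", "ParentProcessGuid": "A", "Image": "cmd.exe"},
    {"EventID": 3, "ProcessGuid": "B", "Image": "cmd.exe"},  # network event, skipped
    {"EventID": 1, "ProcessGuid": "C", "ParentProcessGuid": "B", "Image": "powershell.exe"},
]
children, images = build_process_graph(events)
chain = descendants(children, "ROOT")
```

A graph of this shape is the kind of structure that could then be turned into node features and edges for a GCN, though how the paper encodes them is not specified in the abstract.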

Updated: 2026-03-02 00:48:24

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.01340v1

Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models

Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. However, recent studies reveal that their final answers often contradict their own reasoning traces. We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval. To test this hypothesis, we conduct controlled experiments that challenge LRMs with misleading cues during reasoning and/or corrupted answers during retrieval. Our results across models and datasets confirm that both mechanisms operate simultaneously, with their relative dominance influenced by multiple factors: problem domains, model scales, and fine-tuning approaches (e.g., reinforcement learning vs. distillation). The findings reveal a critical limitation in current reasoning fine-tuning paradigms: models can exploit the retrieval mechanism as a shortcut, effectively "hacking" the reward signal and undermining genuine reasoning development. To address this challenge, we introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning. By carefully suppressing retrieval shortcuts during the fine-tuning process, FARL promotes reasoning-dominant behavior and enhances generalizable reasoning capabilities. The code is available: https://github.com/ZJUWYH/FARL.

Updated: 2026-03-02 00:39:58

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.24156v2

Causal Effects with Unobserved Unit Types in Interacting Human-AI Systems

We study experiments on interacting populations of humans and AI agents, where both unit types and the interaction network remain unobserved. Although causal effects propagate throughout the system, the goal is to estimate effects on humans. Examples include online platforms where human users interact alongside AI-driven accounts. We assume a human-AI prior that gives each unit a probability of being human. While humans cannot be distinguished at the unit level, the prior allows us to compute the average human composition within large subpopulations. We then model outcome dynamics through a causal message passing (CMP) framework and analyze sample-mean outcomes across subpopulations. We show that by constructing subpopulations that vary in expected human composition and treatment exposure, one can consistently recover human-specific causal effects. Our results characterize when distributional knowledge of population composition (without observing unit types or the interaction network) is sufficient for identification. We validate the approach on a simulated human-AI platform driven by behaviorally differentiated LLM agents. Together, these results provide a theoretical and practical framework for experimentation in emerging human-AI systems.

Updated: 2026-03-02 00:31:48

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2603.01339v1

General Protein Pretraining or Domain-Specific Designs? Benchmarking Protein Modeling on Realistic Applications

Recently, extensive deep learning architectures and pretraining strategies have been explored to support downstream protein applications. Additionally, domain-specific models incorporating biological knowledge have been developed to enhance performance in specialized tasks. In this work, we introduce Protap, a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain-specific models across diverse and realistic downstream protein applications. Specifically, Protap covers five applications: three general tasks and two novel specialized tasks, i.e., enzyme-catalyzed protein cleavage site prediction and targeted protein degradation, which are industrially relevant yet missing from existing benchmarks. For each application, Protap compares various domain-specific models and general architectures under multiple pretraining settings. Our empirical studies imply that: (i) Although large-scale pretraining encoders achieve great results, they often underperform supervised encoders trained on small downstream training sets. (ii) Incorporating structural information during downstream fine-tuning can match or even outperform protein language models pretrained on large-scale sequence corpora. (iii) Domain-specific biological priors can enhance performance on specialized downstream tasks. Code and datasets are publicly available at https://github.com/Trust-App-AI-Lab/protap.

Updated: 2026-03-02 00:30:52

Categories: q-bio.BM,cs.AI,cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2506.02052v3

A Projection-Based ARIMA Framework for Nonlinear Dynamics in Macroeconomic and Financial Time Series: Closed-Form Estimation and Rolling-Window Inference

We introduce Galerkin-ARIMA and Galerkin-SARIMA, a projection-based extension of classical ARIMA/SARIMA that replaces rigid linear lag operators with low-dimensional Galerkin basis expansions while preserving the familiar AR-MA decomposition. Experiments on synthetic series and on quarterly GDP and daily S&P 500 returns show that Galerkin-SARIMA matches or improves forecast accuracy relative to classical ARIMA/SARIMA. Estimation is closed-form via a two-stage least-squares procedure, and the closed-form two-stage estimator enables efficient rolling-window re-estimation while preserving the familiar AR-MA operator structure, facilitating applications in central bank forecasting and portfolio risk management. We establish approximation-estimation trade-offs under weak dependence, provide consistency and asymptotic distributional results for the unpenalized estimator, compare prediction risk to classical SARIMA, and propose information-criterion selection of basis size. We further develop bootstrap-based inference for exogenous factor blocks and block-bootstrap prediction intervals that account for serial dependence and the two-stage generated-regressor structure.

Updated: 2026-03-02 00:30:02

Categories: stat.ML,cs.LG,econ.EM

Download: http://arxiv.org/abs/2507.07469v3

Sparse Bayesian Deep Functional Learning with Structured Region Selection

In modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. However, existing methods face a critical trade-off: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real-world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.

Updated: 2026-03-02 00:26:44

Categories: cs.LG,stat.AP,stat.ML

Download: http://arxiv.org/abs/2602.20651v2

Polynomial, trigonometric, and tropical activations

Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library via: https://github.com/K-H-Ismail/torchortho.
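One way to see why an orthonormal Hermite basis pairs naturally with variance-preserving initialization: for a standard-normal input, the probabilists' Hermite polynomials satisfy E[He_m(Z) He_n(Z)] = n! when m = n and 0 otherwise, so a unit-norm coefficient vector with no constant term gives an output with mean 0 and variance 1. The sketch below checks this numerically. It is not the torchortho implementation, and the coefficient values and the exact normalization are assumptions made for illustration.

```python
import math


def hermite_e(n, x):
    """Probabilists' Hermite polynomial He_n(x) via He_{k+1} = x*He_k - k*He_{k-1}."""
    prev, cur = 1.0, x
    if n == 0:
        return prev
    for k in range(1, n):
        prev, cur = cur, x * cur - k * prev
    return cur


def hermite_activation(x, coeffs):
    """f(x) = sum_k coeffs[k] * He_k(x) / sqrt(k!), i.e. an orthonormal expansion."""
    return sum(
        c * hermite_e(k, x) / math.sqrt(math.factorial(k))
        for k, c in enumerate(coeffs)
    )


def gaussian_expectation(g, lo=-12.0, hi=12.0, steps=24000):
    """Approximate E[g(Z)] for Z ~ N(0, 1) with a midpoint Riemann sum."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        total += g(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


# Hypothetical coefficients: no constant term, rescaled to unit Euclidean norm.
raw = [0.0, 1.0, 0.5, 0.25]
norm = math.sqrt(sum(c * c for c in raw))
coeffs = [c / norm for c in raw]

mean = gaussian_expectation(lambda z: hermite_activation(z, coeffs))
second_moment = gaussian_expectation(lambda z: hermite_activation(z, coeffs) ** 2)
```

Both moments depend only on the coefficient norm, not on the polynomial degree, which is the property that keeps activations from exploding or vanishing layer by layer under this idealized Gaussian-input assumption.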

Updated: 2026-03-02 00:24:48

Categories: cs.LG,cs.AI,cs.CL,cs.CV,math.AG

Download: http://arxiv.org/abs/2502.01247v3

Adaptive Estimation and Inference in Conditional Moment Models via the Discrepancy Principle

We study adaptive estimation and inference in ill-posed linear inverse problems defined by conditional moment restrictions. Existing regularized estimators such as Regularized DeepIV (RDIV) require prior knowledge of the smoothness of the nuisance function, typically encoded by a beta source condition to tune their regularization parameters. In practice, this smoothness is unknown, and misspecified hyperparameters can lead to suboptimal convergence or instability. We introduce a discrepancy-principle-based framework for adaptive hyperparameter selection that automatically balances bias and variance without relying on the unknown smoothness parameter. Our framework applies to both RDIV (Li et al. [2024]) and the Tikhonov Regularized Adversarial Estimator (TRAE) (Bennett et al. [2023a]) and achieves the same rates in both weak and strong metrics. Building on this, we construct a fully adaptive doubly robust estimator for linear functionals that attains the optimal rate of the better-conditioned primal or dual problem, providing a practical, theoretically grounded approach for adaptive inference in ill-posed econometric models.
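For readers unfamiliar with the discrepancy principle, the classical Morozov version for Tikhonov regularization gives the flavor: shrink the regularization parameter until the data residual falls to roughly the noise level, without using any smoothness parameter. The sketch below applies it to a toy diagonal inverse problem. It is not the RDIV or TRAE procedure from the paper, and the problem data, tau, and the geometric grid are illustrative assumptions.

```python
import math


def tikhonov_residual_norm(singular_values, y, alpha):
    """Residual norm ||y - A x_alpha|| for a diagonal operator A = diag(singular_values).

    With x_alpha[i] = s_i * y_i / (s_i**2 + alpha), the i-th residual is
    alpha * y_i / (s_i**2 + alpha).
    """
    return math.sqrt(
        sum((alpha * yi / (s * s + alpha)) ** 2 for s, yi in zip(singular_values, y))
    )


def discrepancy_alpha(singular_values, y, delta, tau=1.1, alpha0=1.0, q=0.5, max_iter=60):
    """Shrink alpha geometrically until the residual norm is at most tau * delta."""
    alpha = alpha0
    for _ in range(max_iter):
        if tikhonov_residual_norm(singular_values, y, alpha) <= tau * delta:
            return alpha
        alpha *= q
    return alpha


# Toy ill-posed problem: decaying singular values and a fixed perturbation.
s = [1.0 / (k + 1) ** 2 for k in range(8)]
x_true = [1.0 / (k + 1) for k in range(8)]
noise = [0.01 * (-1) ** k for k in range(8)]
y = [si * xi + ni for si, xi, ni in zip(s, x_true, noise)]
delta = math.sqrt(sum(n * n for n in noise))  # noise level, assumed known here

alpha = discrepancy_alpha(s, y, delta)
```

The selected alpha is the first grid value whose residual is within tau times the noise level, so it adapts to the data rather than to a source condition that would have to be specified in advance.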

Updated: 2026-03-02 00:23:20

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2603.01337v1

AIRMap: AI-Generated Radio Maps for Wireless Digital Twins

Accurate, low-latency channel modeling is essential for real-time wireless network simulation and digital-twin applications. Traditional modeling methods like ray tracing are, however, computationally demanding and unsuited to modeling dynamic conditions. In this paper, we propose AIRMap, a deep-learning framework for ultra-fast radio-map estimation, along with an automated pipeline for creating the largest radio-map dataset to date. AIRMap uses a single-input U-Net autoencoder that processes only a 2D elevation map of terrain and building heights. Trained on 1.2M Boston-area samples and validated across four distinct urban and rural environments with varying terrain and building density, AIRMap predicts path gain with under 4 dB RMSE in 4 ms per inference on an NVIDIA L40S, over 100x faster than GPU-accelerated ray-tracing-based radio maps. A lightweight calibration using just 20% of field measurements reduces the median error to approximately 5%, significantly outperforming traditional simulators, which exceed 50% error. Integration into the Colosseum emulator and the Sionna SYS platform demonstrates near-zero error in spectral efficiency and block-error rate compared to measurement-based channels. These findings validate AIRMap's potential for scalable, accurate, and real-time radio map estimation in wireless digital twins.

Updated: 2026-03-02 00:22:15

Categories: eess.SP,cs.AI

Download: http://arxiv.org/abs/2511.05522v3

Provable and Practical In-Context Policy Optimization for Self-Improvement

We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in-context at inference time. By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling for mathematical reasoning.

Updated: 2026-03-02 00:21:50

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2603.01335v1

MetaState: Persistent Working Memory for Discrete Diffusion Language Models

Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with $K$-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.

Updated: 2026-03-02 00:16:35

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2603.01331v1

Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks

The multivariate Hawkes process provides a powerful framework for modeling temporal dependencies and event-driven interactions in complex systems. While existing methods primarily focus on uncovering causal structures among observed subprocesses, real-world systems are often only partially observed, with latent subprocesses posing significant challenges. In this paper, we show that continuous-time event sequences can be represented by a discrete-time causal model as the time interval shrinks, and we leverage this insight to establish necessary and sufficient conditions for identifying latent subprocesses and the causal influences. Accordingly, we propose a two-phase iterative algorithm that alternates between inferring causal relationships among discovered subprocesses and uncovering new latent subprocesses, guided by path-based conditions that guarantee identifiability. Experiments on both synthetic and real-world datasets show that our method effectively recovers causal structures despite the presence of latent subprocesses.

Updated: 2026-03-02 00:15:40

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2508.11727v3

Learning to Reason without External Rewards

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
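To make the reward swap concrete, here is a minimal sketch of scoring a group of sampled responses by confidence and turning those scores into group-relative advantages. The abstract does not define self-certainty, so the code assumes one plausible measure: the average KL divergence from the uniform distribution to the model's next-token distribution. The mean and standard deviation normalization is the usual GRPO-style choice and is likewise an assumption here, as are the toy distributions.

```python
import math

def self_certainty(token_dists):
    """Average KL(U || p) over token positions (assumed definition)."""
    total = 0.0
    for p in token_dists:
        v = len(p)
        total += sum((1.0 / v) * math.log((1.0 / v) / pj) for pj in p)
    return total / len(token_dists)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Three hypothetical responses to one prompt, each a list of
    # next-token distributions over a 4-token vocabulary.
    group = {
        "confident": [[0.97, 0.01, 0.01, 0.01]] * 2,
        "mixed": [[0.7, 0.1, 0.1, 0.1]] * 2,
        "unsure": [[0.25, 0.25, 0.25, 0.25]] * 2,
    }
    rewards = [self_certainty(d) for d in group.values()]
    for name, adv in zip(group, group_relative_advantages(rewards)):
        print(name, round(adv, 3))
```

A uniform next-token distribution scores zero under this measure and sharper distributions score higher, so the most confident response in the group receives the largest advantage.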

Updated: 2026-03-02 00:12:58

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2505.19590v3

Per-example gradients: a new frontier for understanding and improving optimizers

Training algorithms in deep learning usually treat a mini-batch of samples as a single object; they average gradients over the mini-batch, and then process the average in various ways. Computing other statistics beyond the average may have been seen as prohibitively resource intensive in automatic differentiation (AD) frameworks. We show that this is not the case. Generally, gradient statistics can be implemented through surgery on the AD graph, which in some cases incurs almost no computational or memory overhead compared to the mini-batch gradient computation. Additionally, we show that in certain classes of models, including transformers, JAX's vectorization transformation offers a viable implementation for prototyping and experimentation. We then revise our understanding of two nonlinear operations in optimization through the lens of per-example gradient transformations. We first study signSGD and show that the optimal placement of the sign operation in the gradient processing chain is crucial to success and can be predicted with a simple signal-to-noise ratio argument. Next, we study per-example variations of the Adam preconditioner, and show that optimization is best served when the preconditioner is dominated by the mean rather than the variance of the gradient distribution, contrary to conventional wisdom. Overall, we demonstrate that per-example gradient information enables new analyses and possibilities for algorithm design.
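The question of where to place the sign operation can be illustrated on a tiny example. The sketch below is not the paper's AD-graph surgery or its JAX implementation: it uses hand-derived gradients of a squared loss for a linear model, and the data and function names are assumptions chosen for illustration. It only shows that taking the sign after averaging and averaging per-example signs can point in different directions; it makes no claim about which placement is better.

```python
def per_example_grads(w, xs, ys):
    """Gradient of (w . x - y)^2 with respect to w, one row per example."""
    grads = []
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([2.0 * err * xi for xi in x])
    return grads

def sign(v):
    return (v > 0) - (v < 0)

def sign_of_mean(grads):
    """Sign applied after averaging over the mini-batch."""
    n = len(grads)
    return [sign(sum(g[j] for g in grads) / n) for j in range(len(grads[0]))]

def mean_of_signs(grads):
    """Sign applied per example, then averaged over the mini-batch."""
    n = len(grads)
    return [sum(sign(g[j]) for g in grads) / n for j in range(len(grads[0]))]

if __name__ == "__main__":
    w = [0.0, 0.0]
    xs = [[1.0, 1.0], [1.0, -1.0], [1.0, 2.0]]
    ys = [1.0, 1.0, -4.0]
    grads = per_example_grads(w, xs, ys)
    print("per-example:", grads)
    print("sign of mean:", sign_of_mean(grads))
    print("mean of signs:", mean_of_signs(grads))
```

Here one example with a large gradient dominates the average, so the first coordinate is positive under sign-of-mean but negative under mean-of-signs. In a framework such as JAX, the per-example gradients would typically come from composing `jax.vmap` with `jax.grad` rather than from a hand-written loop.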

Updated: 2026-03-02 00:10:09

Categories: cs.LG

Download: http://arxiv.org/abs/2510.00236v2

By Xinhai (Sean) Zou.