Arxiv Day: Article

RedSage: A Cybersecurity Generalist LLM

Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.

Updated: 2026-01-29 18:59:57

标题: RedSage：一个网络安全通才LLM

摘要: 网络安全运营需要支持不同工作流程的助手LLMs，而不会暴露敏感数据。现有解决方案要么依赖于存在隐私风险的专有API，要么依赖于缺乏领域适应性的开放模型。为了弥合这一差距，我们通过大规模网络过滤和高质量资源的手动收集策划了118亿个令牌的网络安全持续预训练数据，跨越了28.6K份文件，涵盖了框架、攻击技术和安全工具。在此基础上，我们设计了一种主动增强管道，模拟专家工作流程，生成了26.6万个多轮网络安全样本，用于监督微调。结合通用开源LLM数据，这些资源使得我们能够训练RedSage，一个具有领域感知预训练和后训练的开源、本地可部署的网络安全助手。为了严格评估模型，我们引入了RedSage-Bench，一个基准，包含了30K个多项选择和240个开放式网络安全知识、技能和工具专业知识问题。RedSage还在已建立的网络安全基准（如CTI-Bench、CyberMetric、SECURE）和通用LLM基准上进行评估，以评估更广泛的泛化能力。在8B规模上，RedSage取得了一致更好的结果，超过基线模型在网络安全基准上最多+5.59个点，在开放LLM排行榜任务上最多+5.05个点。这些发现表明，领域感知型主动增强和预/后训练不仅可以增强网络安全专业知识，还可以帮助提高一般推理和遵循指令的能力。所有模型、数据集和代码都是公开可用的。

更新时间: 2026-01-29 18:59:57

领域: cs.CR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.22159v1

Discovering Hidden Gems in Model Repositories

Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or if superior models are systematically overlooked. Through an extensive evaluation of over 2,000 models, we show the prevalence of "hidden gems", unpopular fine-tunes that significantly outperform their popular counterparts. Notably, within the Llama-3.1-8B family, we find rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. However, discovering these models through exhaustive evaluation of every uploaded model is computationally infeasible. We therefore formulate model discovery as a Multi-Armed Bandit problem and accelerate the Sequential Halving search algorithm by using shared query sets and aggressive elimination schedules. Our method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.

Updated: 2026-01-29 18:59:55

标题: 发现模型存储库中的隐藏宝石

摘要: 公共存储库托管数百万经过精细调整的模型，然而社区使用仍然不成比例地集中在少数几个基础检查点上。我们调查了这种集中是否反映了有效的市场选择，或者是因为优秀的模型被系统地忽视。通过对超过2,000个模型进行广泛评估，我们展示了“隐藏宝石”的普遍存在，即明显优于流行对应物的不受欢迎的精细调整。值得注意的是，在Llama-3.1-8B系列中，我们发现了很少被下载的检查点，将数学性能从83.2%提高到96.0%，而不增加推理成本。然而，通过对每个上传的模型进行详尽评估来发现这些模型在计算上是不可行的。因此，我们将模型发现定为一个多臂赌博问题，并通过使用共享查询集和激进的淘汰计划来加速序贯减半搜索算法。我们的方法以每个候选者50个查询就可检索出顶级模型，将发现加速了50倍。

更新时间: 2026-01-29 18:59:55

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2601.22157v1

Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data

Updated: 2026-01-29 18:59:53

标题: 混合线性注意力的正确实现：极长上下文的高效精炼和有效架构

摘要: 混合Transformer架构，结合了softmax注意力块和循环神经网络（RNN），已经显示出在长上下文建模中具有令人满意的性能和吞吐量的权衡，但它们的采用和研究受到了从头开始进行大规模预训练的高昂成本的阻碍。一些最近的研究表明，预训练的softmax注意力块可以通过参数转移和知识蒸馏转换为RNN块。然而，这些转移方法需要大量的训练数据（超过10B个标记），而结果混合模型也表现出较差的长上下文性能，这是混合模型在哪种场景下比基于Transformer的模型享有显著推断加速的情况。在本文中，我们提出了HALO（通过层优化实现混合注意力），这是一个将Transformer模型蒸馏为RNN-注意力混合模型的流程。然后，我们提出了HypeNet，这是一种具有卓越长度泛化能力的混合架构，其采用了一种名为HyPE的新型位置编码方案和各种结构修改。我们使用HALO将Qwen3系列转换为HypeNet，实现了与原始Transformer模型相媲美的性能，同时享有卓越的长上下文性能和效率。这种转换仅需要2.3B个标记，不到它们预训练数据的0.01%。

更新时间: 2026-01-29 18:59:53

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.22156v1

Late Breaking Results: Conversion of Neural Networks into Logic Flows for Edge Computing

Neural networks have been successfully applied in various resource-constrained edge devices, where usually central processing units (CPUs) instead of graphics processing units exist due to limited power availability. State-of-the-art research still focuses on efficiently executing enormous numbers of multiply-accumulate (MAC) operations. However, CPUs themselves are not good at executing such mathematical operations on a large scale, since they are more suited to execute control flow logic, i.e., computer algorithms. To enhance the computation efficiency of neural networks on CPUs, in this paper, we propose to convert them into logic flows for execution. Specifically, neural networks are first converted into equivalent decision trees, from which decision paths with constant leaves are then selected and compressed into logic flows. Such logic flows consist of if and else structures and a reduced number of MAC operations. Experimental results demonstrate that the latency can be reduced by up to 14.9 % on a simulated RISC-V CPU without any accuracy degradation. The code is open source at https://github.com/TUDa-HWAI/NN2Logic

Updated: 2026-01-29 18:59:50

标题: 晚报成果：将神经网络转换为逻辑流用于边缘计算

摘要: 神经网络已成功应用于各种资源受限的边缘设备中，由于功耗有限通常存在中央处理单元（CPU）而非图形处理单元。最新研究仍专注于高效执行大量乘累加（MAC）操作。然而，CPU本身不擅长在大规模上执行这种数学操作，因为它们更适合执行控制流逻辑，即计算机算法。为了提高CPU上神经网络的计算效率，在本文中，我们提出将其转换为逻辑流以进行执行。具体而言，神经网络首先转换为等效决策树，然后选择常数叶子的决策路径并将其压缩为逻辑流。这种逻辑流由if和else结构以及减少数量的MAC操作组成。实验结果表明，在模拟的RISC-V CPU上，延迟可以减少高达14.9％，而不会降低准确性。该代码是开源的，网址为https://github.com/TUDa-HWAI/NN2Logic。

更新时间: 2026-01-29 18:59:50

领域: cs.LG,eess.SY

下载: http://arxiv.org/abs/2601.22151v1

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions .

Updated: 2026-01-29 18:58:47

标题: 精细指导：将合成指令按比例缩放至预训练规模

摘要: 由于受限于有限的监督训练数据，大型语言模型（LLMs）通常通过自监督的“预测下一个单词”的目标在大量非结构化文本数据上进行预训练。为了使生成的模型对用户有用，进一步在包含指令和响应的有限数量“指令微调”数据上进行训练。为了克服有限的监督数据量，我们提出了一种能够将互联网规模的预训练文档中的知识转化为数十亿个合成指令和答案训练对的过程。所得到的数据集被称为FineInstructions，其中包含了大约1800万个由真实用户编写的查询和提示创建的指令模板。这些指令模板与来自非结构化预训练语料库的人工编写源文档匹配和实例化。通过在这种规模上生成的“监督”合成训练数据，LLM可以仅通过指令微调目标从头开始进行预训练，这与LLMs的预期下游使用（响应用户提示）更匹配。我们进行了受控的标记对标记的培训实验，并发现在FineInstructions上的预训练优于标准预训练和其他提出的衡量自由形式响应质量的标准基准的合成预训练技术。我们的资源可以在https://huggingface.co/fineinstructions 找到。

更新时间: 2026-01-29 18:58:47

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2601.22146v1

MORPH: PDE Foundation Models with Arbitrary Data Modality

We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D--3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorize full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters, MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.

Updated: 2026-01-29 18:57:23

标题: MORPH: 具有任意数据模态的PDE基础模型

摘要: 我们介绍了一种名为MORPH的模态不可知、自回归基础模型，用于偏微分方程（PDEs）。MORPH建立在一个卷积视觉变换器的骨干上，可以无缝处理不同分辨率、不同数据模态（1D-3D）的异质时空数据集，以及包含混合标量和矢量分量的多个字段。该架构结合了（i）分量级卷积，可以同时处理标量和矢量通道以捕获局部交互作用，（ii）字段间交叉注意力，可以建模并有选择地传播不同物理字段之间的信息，（iii）轴向注意力，可以在各自的空间和时间轴上分解全局自注意力，以减少计算负担同时保持表达能力。我们在多种异质PDE数据集上预训练了多个模型变体，并评估了对一系列下游预测任务的迁移性能。通过完整模型微调和参数高效的低秩适配器，MORPH的性能超越了从头开始训练的模型。在广泛的评估中，MORPH与强基线和最新的最先进模型相匹敌甚至超越。总的来说，这些能力为学习科学观测的异质和多模态性质提供了灵活而强大的基础，为实现可扩展和数据高效的科学机器学习指明了道路。源代码、数据集和模型可以在https://github.com/lanl/MORPH上公开获取。

更新时间: 2026-01-29 18:57:23

领域: cs.CV,cs.AI,cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2509.21670v4

Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data

In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantically aligned. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.

Updated: 2026-01-29 18:56:41

标题: "将彩票路由: 用于异构数据的自适应子网络"

摘要: 在修剪中，抽奖票假设认为大型网络包含稀疏子网络，或称为中奖票，可以单独进行训练以匹配其密集对应物的性能。然而，大多数现有方法假设一个单一的通用中奖票适用于所有输入，忽略了真实数据的固有异质性。在这项工作中，我们提出了一种自适应修剪框架Routing the Lottery（RTL），它发现多个专门的子网络，称为自适应中奖票，每个子网络都针对一个类别、语义集群或环境条件进行了定制。在各种数据集和任务中，RTL在平衡准确性和召回率方面始终优于单一和多模型基线，同时使用的参数比独立模型少了多达10倍，并呈现出语义上的对齐。此外，我们识别了子网络崩溃，即在激进修剪下的性能下降，并引入了一种子网络相似性评分，该评分使得可以无需标签对过度稀疏化进行诊断。总的来说，我们的结果重新将修剪视为一种将模型结构与数据异质性对齐的机制，为更模块化和具有上下文意识的深度学习铺平了道路。

更新时间: 2026-01-29 18:56:41

领域: cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2601.22141v1

StepShield: When, Not Whether to Intervene on Rogue Agents

Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.

Updated: 2026-01-29 18:55:46

标题: StepShield：何时对不良代理人进行干预，而非是否进行干预

摘要: 现有的代理安全基准报告二进制准确性，混淆了早期干预和事后分析。在第8步检测到违规行为的检测器能够进行干预；而在第48步报告违规行为的检测器仅提供法证价值。这种区别至关重要，然而目前的基准无法衡量。我们引入了StepShield，这是第一个评估违规行为何时被检测到的基准，而不仅仅是是否被检测到。StepShield包含9,213个代码代理轨迹，包括1,278对经过精心注释的训练对和一个包含7,935个轨迹的测试集，其中包含真实的8.1%的恶意行为率。恶意行为基于真实世界安全事件，涵盖六个类别。我们提出了三种新颖的时间度量标准：早期干预率（EIR）、干预间隔和节省令牌数。令人惊讶的是，我们的评估结果显示，基于LLM的评判者实现了59%的EIR，而静态分析器仅实现了26%，这是标准准确性指标完全看不到的2.3倍性能差距。我们进一步展示了早期检测具有直接的经济效益：我们级联的HybridGuard检测器可以将监控成本降低75%，在企业规模下预计在五年内节省1.08亿美元。通过将评估重点从是否到何时转移，StepShield为构建更安全、更经济可行的AI代理提供了一个新的基础。代码和数据以Apache 2.0许可发布。

更新时间: 2026-01-29 18:55:46

领域: cs.LG,cs.AI,cs.CR,cs.SE

下载: http://arxiv.org/abs/2601.22136v1

PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training

Matrix functions such as square root, inverse roots, and orthogonalization play a central role in preconditioned gradient methods for neural network training. This has motivated the development of iterative algorithms that avoid explicit eigendecompositions and rely primarily on matrix multiplications, making them well suited for modern GPU accelerators. We present PRISM (Polynomial-fitting and Randomized Iterative Sketching for Matrix functions computation), a general framework for accelerating iterative algorithms for computing matrix functions. PRISM combines adaptive polynomial approximation with randomized sketching: at each iteration, it fits a polynomial surrogate to the current spectrum via a sketched least-squares problem, adapting to the instance at hand with minimal overhead. We apply PRISM to accelerate Newton-Schulz-like iterations for matrix square roots and orthogonalization, which are core primitives in machine learning. Unlike prior methods, PRISM requires no explicit spectral bounds or singular value estimates; and it adapts automatically to the evolving spectrum. Empirically, PRISM accelerates training when integrated into Shampoo and Muon optimizers.

Updated: 2026-01-29 18:55:46

标题: PRISM：无分布自适应矩阵函数计算以加速神经网络训练

摘要: 矩阵函数如平方根、逆根和正交化在神经网络训练的预处理梯度方法中起着核心作用。这促使了避免显式特征值分解，并主要依赖于矩阵乘法的迭代算法的发展，使它们非常适合现代GPU加速器。我们提出了PRISM（用于计算矩阵函数的多项式拟合和随机迭代草图），这是一个加速迭代算法计算矩阵函数的通用框架。PRISM结合了自适应多项式逼近和随机草图：在每次迭代中，它通过一个草图最小二乘问题对当前谱进行多项式替代，适应实例并具有最小的额外开销。我们将PRISM应用于加速矩阵平方根和正交化的Newton-Schulz样式迭代，这在机器学习中是核心原语。与先前的方法不同，PRISM不需要显式谱界或奇异值估计；它自动适应不断变化的谱。在实证方面，将PRISM整合到Shampoo和Muon优化器中可以加速训练。

更新时间: 2026-01-29 18:55:46

领域: cs.LG,cs.AI,math.NA,math.OC

下载: http://arxiv.org/abs/2601.22137v1

Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely-used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.

Updated: 2026-01-29 18:52:54

标题: 支付提示，而不是答案：低成本推理的LLM引导

摘要: 大型语言模型（LLMs）在复杂推理任务上提供最先进的性能，但它们的推理成本限制了规模部署。小型语言模型（SLMs）提供了巨大的成本节约，但在准确性上明显落后。现有的方法 - 路由和级联 - 将LLM视为一种全有或全无的资源：要么查询完全绕过LLM，要么LLM以完整成本生成完整响应。我们引入了LLM引导框架，该框架仅从LLM请求一个简短的前缀（一个提示）并将其提供给SLM。这种简单机制对于数学和编码任务非常有效：即使提示包含完整LLM响应的10-30％，也会显着提高SLM的准确性。引导同时泛化了路由和级联，并在oracle决策制定下实现了更低的成本。我们开发了一个两阶段预测器，共同确定是否需要提示以及需要请求多少令牌。在广泛使用的数学推理（GSM8K，CNK12）和代码生成（HumanEval，MBPP）基准测试上，引导相对于仅使用LLM的推理将成本降低了42-94％。与最先进的路由和级联基线相比，引导提供了高达2.8倍的成本降低，同时匹配准确性。据我们所知，这是首个利用令牌级预算控制进行SLM-LLM协作的工作。

更新时间: 2026-01-29 18:52:54

领域: cs.LG

下载: http://arxiv.org/abs/2601.22132v1

SMOG: Scalable Meta-Learning for Multi-Objective Bayesian Optimization

Multi-objective optimization aims to solve problems with competing objectives, often with only black-box access to a problem and a limited budget of measurements. In many applications, historical data from related optimization tasks is available, creating an opportunity for meta-learning to accelerate the optimization. Bayesian optimization, as a promising technique for black-box optimization, has been extended to meta-learning and multi-objective optimization independently, but methods that simultaneously address both settings - meta-learned priors for multi-objective Bayesian optimization - remain largely unexplored. We propose SMOG, a scalable and modular meta-learning model based on a multi-output Gaussian process that explicitly learns correlations between objectives. SMOG builds a structured joint Gaussian process prior across meta- and target tasks and, after conditioning on metadata, yields a closed-form target-task prior augmented by a flexible residual multi-output kernel. This construction propagates metadata uncertainty into the target surrogate in a principled way. SMOG supports hierarchical, parallel training: meta-task Gaussian processes are fit once and then cached, achieving linear scaling with the number of meta-tasks. The resulting surrogate integrates seamlessly with standard multi-objective Bayesian optimization acquisition functions.

Updated: 2026-01-29 18:51:58

标题: SMOG：可扩展的多目标贝叶斯优化元学习

摘要: 多目标优化旨在解决具有竞争目标的问题，通常只能访问问题的黑匣子，并且具有有限的测量预算。在许多应用中，可以利用相关优化任务的历史数据，从而为元学习加速优化提供机会。贝叶斯优化作为一种黑匣子优化的有前途的技术，已经被扩展到了元学习和多目标优化，但同时解决这两种情况的方法 - 具有元学习先验的多目标贝叶斯优化 - 目前仍未得到充分探讨。我们提出了一种基于多输出高斯过程的可扩展且模块化的元学习模型SMOG，该模型明确学习了目标之间的相关性。SMOG在元任务和目标任务之间构建了一个结构化的联合高斯过程先验，并在基于元数据的条件下，产生了一个通过灵活的残余多输出核扩展的封闭式目标任务先验。这种构建方式以一种合理的方式将元数据的不确定性传播到目标替代模型中。SMOG支持分层、并行训练：元任务高斯过程仅拟合一次，然后缓存，实现与元任务数量的线性扩展。由此产生的替代模型与标准的多目标贝叶斯优化获取函数无缝集成。

更新时间: 2026-01-29 18:51:58

领域: cs.LG

下载: http://arxiv.org/abs/2601.22131v1

SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.

Updated: 2026-01-29 18:50:29

标题: SWE-Replay：软件工程代理的高效测试时间缩放

摘要: 测试时间缩放已被广泛采用，以增强大型语言模型(LLM)代理在软件工程(SWE)任务中的能力。然而，从头开始重复抽样轨迹的标准方法在计算上是昂贵的。尽管最近的方法已尝试使用专门的价值代理来减少成本，但它们可能会受到模型误校准的影响，并且无法推广到合成自定义bash脚本作为工具的现代代理。在本文中，我们介绍了SWE-Replay，这是一种针对现代代理的高效且可推广的测试时间缩放技术，而无需依赖潜在嘈杂的价值估计。SWE-Replay通过重复利用先前试验的轨迹来优化缩放过程，动态选择是从头开始探索还是利用存档的经验在关键中间步骤进行分支。这些中间步骤的选择是由仓库探索的潜力和推理重要性驱动，而不是外部基于LLM的质量估计。我们的评估表明，在SWE-Bench Verified上，SWE-Replay始终优于朴素的缩放，将成本降低了高达17.4%，同时甚至提高了性能高达3.8%。在SWE-Bench Pro和Multilingual上的进一步评估验证了SWE-Replay的可推广性，将其确立为软件工程代理高效测试时间缩放的坚实基础。

更新时间: 2026-01-29 18:50:29

领域: cs.SE,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.22129v1

EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.

Updated: 2026-01-29 18:49:27

标题: 自我编辑：使用扩散变换器进行音频驱动的生成和操作说话头视频

摘要: 目前的生成视频模型在从文本和图像提示中生成新内容方面表现出色，但在编辑现有的预先录制视频方面存在一个关键差距，其中对讲话脚本进行轻微修改需要保持动作、时间连贯性、说话者身份和准确的嘴唇同步。我们引入了EditYourself，这是一个基于DiT的音频驱动视频到视频（V2V）编辑框架，它能够基于文本修改讲话头视频，包括对视觉言语内容的无缝添加、删除和重新定时。基于通用视频扩散模型，EditYourself通过音频调节和区域感知、编辑重点的训练扩展增强了其V2V功能。这使得通过时空修补实现精确的嘴唇同步和时间连贯的对现有表演进行重构，包括在新增部分中合成逼真的人体运动，同时在长时间内保持视觉保真度和身份一致性。这项工作代表了将生成视频模型作为专业视频后期制作实用工具的基础性进展。

更新时间: 2026-01-29 18:49:27

领域: cs.CV,cs.GR,cs.LG,cs.MM

下载: http://arxiv.org/abs/2601.22127v1

Learning Hamiltonian Flow Maps: Mean Flow Consistency for Large-Timestep Molecular Dynamics

Simulating the long-time evolution of Hamiltonian systems is limited by the small timesteps required for stable numerical integration. To overcome this constraint, we introduce a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span $Δt$, enabling stable large-timestep updates far beyond the stability limits of classical integrators. To this end, we impose a Mean Flow consistency condition for time-averaged Hamiltonian dynamics. Unlike prior approaches, this allows training on independent phase-space samples without access to future states, avoiding expensive trajectory generation. Validated across diverse Hamiltonian systems, our method in particular improves upon molecular dynamics simulations using machine-learned force fields (MLFF). Our models maintain comparable training and inference cost, but support significantly larger integration timesteps while trained directly on widely-available trajectory-free MLFF datasets.

Updated: 2026-01-29 18:47:46

标题: 学习哈密顿流映射：大时间步分子动力学的平均流一致性

摘要: 模拟哈密顿系统的长时间演化受到所需稳定数值积分的小时间步长的限制。为了克服这一限制，我们引入了一个框架，通过预测选择时间跨度$Δt$上的平均相空间演化来学习哈密顿流映射，从而实现远远超出经典积分器稳定性限制的稳定大时间步更新。为此，我们对时间平均哈密顿动力学施加了一个均值流一致性条件。与之前的方法不同，这允许在独立的相空间样本上进行训练，而无需访问未来状态，避免昂贵的轨迹生成。在各种哈密顿系统中验证后，我们的方法特别改进了使用机器学习力场（MLFF）的分子动力学模拟。我们的模型保持了可比较的训练和推断成本，但在直接训练于广泛可用的无轨迹MLFF数据集时，支持更大的积分时间步。

更新时间: 2026-01-29 18:47:46

领域: cs.LG

下载: http://arxiv.org/abs/2601.22123v1

Alpha Discovery via Grammar-Guided Learning and Search

Automatically discovering formulaic alpha factors is a central problem in quantitative finance. Existing methods often ignore syntactic and semantic constraints, relying on exhaustive search over unstructured and unbounded spaces. We present AlphaCFG, a grammar-based framework for defining and discovering alpha factors that are syntactically valid, financially interpretable, and computationally efficient. AlphaCFG uses an alpha-oriented context-free grammar to define a tree-structured, size-controlled search space, and formulates alpha discovery as a tree-structured linguistic Markov decision process, which is then solved using a grammar-aware Monte Carlo Tree Search guided by syntax-sensitive value and policy networks. Experiments on Chinese and U.S. stock market datasets show that AlphaCFG outperforms state-of-the-art baselines in both search efficiency and trading profitability. Beyond trading strategies, AlphaCFG serves as a general framework for symbolic factor discovery and refinement across quantitative finance, including asset pricing and portfolio construction.

Updated: 2026-01-29 18:46:15

标题: 通过语法引导学习和搜索实现Alpha发现

摘要: 自动发现公式化的Alpha因子是量化金融中的一个核心问题。现有方法通常忽略句法和语义约束，依赖于对无结构和无界空间的穷尽搜索。我们提出了AlphaCFG，这是一个基于语法的框架，用于定义和发现在句法上有效、在金融上可解释和在计算上高效的Alpha因子。AlphaCFG使用面向Alpha的上下文无关语法来定义一个树形结构、大小受控的搜索空间，并将Alpha发现形式化为一个树形结构的语言马尔可夫决策过程，然后使用一个由语法敏感价值和策略网络引导的语法感知蒙特卡洛树搜索来解决。对中国和美国股市数据集进行的实验表明，AlphaCFG在搜索效率和交易盈利方面均优于当前最先进的基线。除了交易策略，AlphaCFG还可作为一个在量化金融领域进行符号因子发现和精化的通用框架，包括资产定价和投资组合构建。

更新时间: 2026-01-29 18:46:15

领域: q-fin.CP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.22119v1

Diverse Approaches to Optimal Execution Schedule Generation

We present the first application of MAP-Elites, a quality-diversity algorithm, to trade execution. Rather than searching for a single optimal policy, MAP-Elites generates a diverse portfolio of regime-specialist strategies indexed by liquidity and volatility conditions. Individual specialists achieve 8-10% performance improvements within their behavioural niches, while other cells show degradation, suggesting opportunities for ensemble approaches that combine improved specialists with the baseline PPO policy. Results indicate that quality-diversity methods offer promise for regime-adaptive execution, though substantial computational resources per behavioural cell may be required for robust specialist development across all market conditions. To ensure experimental integrity, we develop a calibrated Gymnasium environment focused on order scheduling rather than tactical placement decisions. The simulator features a transient impact model with exponential decay and square-root volume scaling, fit to 400+ U.S. equities with R^2>0.02 out-of-sample. Within this environment, two Proximal Policy Optimization architectures - both MLP and CNN feature extractors - demonstrate substantial improvements over industry baselines, with the CNN variant achieving 2.13 bps arrival slippage versus 5.23 bps for VWAP on 4,900 out-of-sample orders ($21B notional). These results validate both the simulation realism and provide strong single-policy baselines for quality-diversity methods.

Updated: 2026-01-29 18:41:52

标题: 生成最佳执行时间表的多种方法

摘要: 我们首次将MAP-Elites，一种质量多样性算法，应用于交易执行。与寻找单一最佳策略不同，MAP-Elites生成了一个多样化的策略组合，这些策略根据流动性和波动性条件进行索引。在其行为领域内，个体专家实现了8-10%的绩效改进，而其他单元显示出退化，这表明机会可以通过将改进的专家与基准PPO策略相结合的集合方法。结果表明，质量多样性方法为适应不同市场环境的执行带来了希望，尽管在所有市场条件下，为了稳健地开发专家，可能需要大量的计算资源。为了确保实验的完整性，我们开发了一个以订单调度为焦点而不是战术性摆放决策的校准体育馆环境。该模拟器具有指数衰减和平方根体积缩放的瞬时影响模型，适用于400多家美国股票，其R^2>0.02在样本外。在这种环境中，两种近端策略优化架构 - 均为MLP和CNN特征提取器 - 显示出比行业基准显著的改进，CNN变体在4900个样本外订单中实现了2.13个基点的到达滑点，而VWAP为5.23个基点（名义21亿美元）。这些结果既验证了模拟的逼真性，也为质量多样性方法提供了强大的单一策略基线。

更新时间: 2026-01-29 18:41:52

领域: q-fin.TR,cs.LG

下载: http://arxiv.org/abs/2601.22113v1

Physics Informed Reconstruction of Four-Dimensional Atmospheric Wind Fields Using Multi-UAS Swarm Observations in a Synthetic Turbulent Environment

Accurate reconstruction of atmospheric wind fields is essential for applications such as weather forecasting, hazard prediction, and wind energy assessment, yet conventional instruments leave spatio-temporal gaps within the lower atmospheric boundary layer. Unmanned aircraft systems (UAS) provide flexible in situ measurements, but individual platforms sample wind only along their flight trajectories, limiting full wind-field recovery. This study presents a framework for reconstructing four-dimensional atmospheric wind fields using measurements obtained from a coordinated UAS swarm. A synthetic turbulence environment and high-fidelity multirotor simulation are used to generate training and evaluation data. Local wind components are estimated from UAS dynamics using a bidirectional long short-term memory network (Bi-LSTM) and assimilated into a physics-informed neural network (PINN) to reconstruct a continuous wind field in space and time. For local wind estimation, the bidirectional LSTM achieves root-mean-square errors (RMSE) of 0.064 and 0.062 m/s for the north and east components in low-wind conditions, increasing to 0.122 to 0.129 m/s under moderate winds and 0.271 to 0.273 m/s in high-wind conditions, while the vertical component exhibits higher error, with RMSE values of 0.029 to 0.091 m/s. The physics-informed reconstruction recovers the dominant spatial and temporal structure of the wind field up to 1000 m altitude while preserving mean flow direction and vertical shear. Under moderate wind conditions, the reconstructed mean wind field achieves an overall RMSE between 0.118 and 0.154 m/s across evaluated UAS configurations, with the lowest error obtained using a five-UAS swarm. These results demonstrate that coordinated UAS measurements enable accurate and scalable four-dimensional wind-field reconstruction without dedicated wind sensors or fixed infrastructure.

Updated: 2026-01-29 18:40:32

标题: 使用多无人机群观测在合成湍流环境中重构四维大气风场的物理信息

摘要: 精确重建大气风场对于诸如天气预报、灾害预测和风能评估等应用至关重要，然而传统仪器在低层大气边界层内留下时空间隙。无人机系统（UAS）提供灵活的原位测量，但单个平台仅沿其飞行轨迹采样风速，限制了完整的风场恢复。本研究提出了一种使用从协调的UAS群体获得的测量数据重建四维大气风场的框架。使用合成湍流环境和高保真度的多旋翼模拟来生成训练和评估数据。利用双向长短期记忆网络（Bi-LSTM）从UAS动态中估计局部风分量，并将其同化到一个受物理约束的神经网络（PINN）中，以在空间和时间上重建连续的风场。对于局部风估计，双向LSTM在低风条件下的北向和东向分量分别实现了0.064和0.062 m/s的均方根误差（RMSE），在中等风力下增加到0.122至0.129 m/s，在高风条件下增加到0.271至0.273 m/s，而垂直分量显示更高的误差，RMSE值为0.029至0.091 m/s。受物理约束的重建在高达1000 m的高度内恢复了风场的主导空间和时间结构，同时保留了平均流向和垂直剪切。在中等风条件下，重建的平均风场在评估的UAS配置中实现了0.118至0.154 m/s的总体RMSE，使用五个UAS群体获得最低误差。这些结果表明，协调的UAS测量使得无需专用风传感器或固定基础设施即可实现精确可扩展的四维风场重建。

更新时间: 2026-01-29 18:40:32

领域: cs.LG,eess.SY,physics.ao-ph

下载: http://arxiv.org/abs/2601.22111v1

Value-Based Pre-Training with Downstream Feedback

Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B--7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, we improve the state-of-the-art results on ADE20K by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.

Updated: 2026-01-29 18:38:09

标题: 基于价值的预训练与下游反馈

摘要: 可以少量的验证目标信息引导昂贵的自我监督预训练基础模型吗？标准预训练优化一个固定的代理目标（例如，下一个标记的预测），这可能导致计算资源偏离感兴趣的下游功能。我们引入了V-预训练：一种基于价值的、与模态无关的方法，用于受控的持续预训练，在这种方法中，一个轻量级的任务设计者重新塑造预训练任务，以最大化每个梯度步骤的价值。例如，考虑使用样本增强的自我监督学习（SSL）。V-预训练任务设计者选择预训练任务（例如，增强），其中预训练损失梯度与在下游任务（例如，图像分割）上计算的梯度保持一致。这有助于将预训练引导到相关的下游功能。值得注意的是，预训练模型从未根据下游任务标签进行更新；它们只用于塑造预训练任务。在匹配的学习者更新预算下，0.5B-7B语言模型的V-预训练相对于仅使用12%的GSM8K训练样本作为反馈的标准下一个标记预测，可以将推理（GSM8K测试Pass@1）提高高达18%。在视觉SSL中，我们将ADE20K的最先进结果提高了高达1.07 mIoU，并降低了NYUv2 RMSE，同时提高了ImageNet的线性准确性，并提供了持续预训练中的改进的令牌效率的初步证据。

更新时间: 2026-01-29 18:38:09

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.22108v1

Prior-Informed Flow Matching for Graph Reconstruction

We introduce Prior-Informed Flow Matching (PIFM), a conditional flow model for graph reconstruction. Reconstructing graphs from partial observations remains a key challenge; classical embedding methods often lack global consistency, while modern generative models struggle to incorporate structural priors. PIFM bridges this gap by integrating embedding-based priors with continuous-time flow matching. Grounded in a permutation equivariant version of the distortion-perception theory, our method first uses a prior, such as graphons or GraphSAGE/node2vec, to form an informed initial estimate of the adjacency matrix based on local information. It then applies rectified flow matching to refine this estimate, transporting it toward the true distribution of clean graphs and learning a global coupling. Experiments on different datasets demonstrate that PIFM consistently enhances classical embeddings, outperforming them and state-of-the-art generative baselines in reconstruction accuracy.

Updated: 2026-01-29 18:38:02

标题: 基于先验信息的流匹配用于图重建

摘要: 我们介绍了Prior-Informed Flow Matching（PIFM），这是一种用于图形重建的条件流模型。从部分观察中重建图形仍然是一个关键挑战；传统的嵌入方法通常缺乏全局一致性，而现代生成模型则难以融合结构先验。PIFM通过将基于嵌入的先验与连续时间流匹配相结合来弥合这一差距。基于置换等变版本的扭曲-感知理论，我们的方法首先使用先验，例如图同或GraphSAGE/node2vec，根据局部信息形成通知的邻接矩阵的初始估计。然后，它应用矫正的流匹配来细化这一估计，将其传送到干净图的真实分布并学习全局耦合。对不同数据集的实验表明，PIFM始终提升了经典嵌入，优于它们和最先进的生成基线模型在重建准确度方面。

更新时间: 2026-01-29 18:38:02

领域: cs.LG

下载: http://arxiv.org/abs/2601.22107v1

ECO: Quantized Training without Full-Precision Master Weights

Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as $\textit{master weights}$. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.

Updated: 2026-01-29 18:35:01

标题: ECO：无需完整精度主权重的量化训练

摘要: 量化显著提高了大型语言模型（LLM）训练的计算和内存效率。然而，现有方法仍然依赖于在高精度上累积它们的更新：具体而言，梯度更新必须应用于一个称为$\textit{主要权重}$的高精度权重缓冲区。该缓冲区引入了大量的内存开销，特别是对于稀疏专家混合（SMoE）模型，其中模型参数和优化器状态占据了内存使用量。为了解决这个问题，我们引入了一种称为Error-Compensating Optimizer（ECO）的方法，通过直接将更新应用于量化参数来消除主要权重。ECO在每个步骤后对权重进行量化，并将产生的量化误差小心地注入到优化器动量中，形成一个无需额外内存的误差反馈循环。我们证明，在标准假设和衰减学习率的情况下，ECO会收敛到最优解的恒定半径邻域，而朴素的主要权重移除可能会产生与学习率成反比的误差。我们展示了对小型Transformer（30-800M）、Gemma-3 1B模型以及2.1B参数的稀疏MoE模型进行FP8量化的经验结果，以及对DeepSeek-MoE-16B在INT4精度下进行微调。整个过程中，ECO与具有主要权重的基准模型达到了接近无损精度的匹配，显著改变了静态内存与验证损失Pareto边界。

更新时间: 2026-01-29 18:35:01

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.22101v1

Do graph neural network states contain graph properties?

Deep neural networks (DNNs) achieve state-of-the-art performance on many tasks, but this often requires increasingly larger model sizes, which in turn leads to more complex internal representations. Explainability techniques (XAI) have made remarkable progress in the interpretability of ML models. However, the non-euclidean nature of Graph Neural Networks (GNNs) makes it difficult to reuse already existing XAI methods. While other works have focused on instance-based explanation methods for GNNs, very few have investigated model-based methods and, to our knowledge, none have tried to probe the embedding of the GNNs for structural graph properties. In this paper we present a model agnostic explainability pipeline for Graph Neural Networks (GNNs) employing diagnostic classifiers. We propose to consider graph-theoretic properties as the features of choice for studying the emergence of representations in GNNs. This pipeline aims to probe and interpret the learned representations in GNNs across various architectures and datasets, refining our understanding and trust in these models.

Updated: 2026-01-29 18:33:46

标题: 图神经网络状态是否包含图属性？

摘要: 深度神经网络(DNNs)在许多任务上取得了最先进的性能，但这通常需要越来越大的模型尺寸，这反过来导致更复杂的内部表示。解释性技术(XAI)在机器学习模型的可解释性方面取得了显著进展。然而，图神经网络(GNNs)的非欧几里得性质使得难以重用已有的XAI方法。虽然其他研究着重于GNNs的基于实例的解释方法，但很少有人研究基于模型的方法，并据我们所知，没有人尝试探究GNNs的嵌入用于结构图属性。在本文中，我们提出了一个模型不可知的解释性管道，用于图神经网络(GNNs)并采用诊断分类器。我们建议考虑图论属性作为研究GNNs中表示出现的首选特征。该管道旨在探究和解释GNNs中学习到的表示，跨多种架构和数据集，从而改进我们对这些模型的理解和信任。

更新时间: 2026-01-29 18:33:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2411.02168v3

Boosting CVaR Policy Optimization with Quantile Gradients

Optimizing Conditional Value-at-risk (CVaR) using policy gradient (a.k.a CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improves sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.

Updated: 2026-01-29 18:33:46

标题: 使用分位数梯度提升CVaR策略优化

摘要: 使用策略梯度（也称为CVaR-PG）优化条件风险价值（CVaR）面临着样本效率的重大挑战。这种低效性源于它专注于尾部绩效，忽视了许多采样轨迹。我们通过将CVaR与预期分位数项相结合来解决这个问题。分位数优化采用动态规划公式，利用所有采样数据，从而提高了样本效率。这并不改变CVaR的目标，因为CVaR对应于尾部的分位数期望。在具有可验证的风险回避行为的领域的实证结果表明，我们的算法在马尔科夫策略类中显着改进了CVaR-PG，并始终优于其他现有方法。

更新时间: 2026-01-29 18:33:46

领域: cs.LG

下载: http://arxiv.org/abs/2601.22100v1

GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization

The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.

Updated: 2026-01-29 18:31:31

标题: GeoNorm：使用测地线优化统一预范数和后范数

摘要: 归一化层的放置，特别是Pre-Norm和Post-Norm，在Transformer架构设计中仍然是一个开放性问题。在这项工作中，我们通过流形优化的视角重新思考这些方法，将前馈网络（FFN）和注意力层的输出解释为优化中的更新方向。基于这一视角，我们引入了GeoNorm，一种新颖的方法，用测地线更新代替标准归一化。此外，类似于学习率调度，我们提出了FFN和注意力组件的逐层更新衰减。全面的实验证明，GeoNorm在Transformer模型中一贯优于现有的归一化方法。重要的是，GeoNorm可以无缝集成到标准的Transformer架构中，带来性能改进，而附加计算成本可以忽略不计。

更新时间: 2026-01-29 18:31:31

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2601.22095v1

Latent Adversarial Regularization for Offline Preference Optimization

Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.

Updated: 2026-01-29 18:21:57

标题: 潜在对抗正则化用于离线偏好优化

摘要: 学习人类反馈通常依赖于通过标记级别正则化约束策略更新的优化偏好。然而，语言模型的偏好优化特别具有挑战性，因为标记空间的相似性并不意味着语义或行为的相似性。为了解决这一挑战，我们利用潜空间正则化进行语言模型偏好优化。我们引入了GANPO，通过惩罚策略模型和参考模型的内部表示之间的发散来实现潜空间正则化。鉴于潜在表示与明确的概率密度无关，我们采用了受GAN启发的对抗方法来最小化潜空间的发散。我们将GANPO作为一个正则化器整合到现有的离线偏好优化目标中。跨多个模型架构和任务的实验表明，潜空间正则化能够持续改善性能。此外，通过比较由GANPO引起的推理偏差与标记级别正则化的偏差，我们发现在分布转移和噪声情况下，GANPO提供更加鲁棒的结构反馈，同时保持了较小的计算开销下的可比的下游性能。

更新时间: 2026-01-29 18:21:57

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.22083v1

Vecchia-Inducing-Points Full-Scale Approximations for Gaussian Processes

Gaussian processes are flexible, probabilistic, non-parametric models widely used in machine learning and statistics. However, their scalability to large data sets is limited by computational constraints. To overcome these challenges, we propose Vecchia-inducing-points full-scale (VIF) approximations combining the strengths of global inducing points and local Vecchia approximations. Vecchia approximations excel in settings with low-dimensional inputs and moderately smooth covariance functions, while inducing point methods are better suited to high-dimensional inputs and smoother covariance functions. Our VIF approach bridges these two regimes by using an efficient correlation-based neighbor-finding strategy for the Vecchia approximation of the residual process, implemented via a modified cover tree algorithm. We further extend our framework to non-Gaussian likelihoods by introducing iterative methods that substantially reduce computational costs for training and prediction by several orders of magnitudes compared to Cholesky-based computations when using a Laplace approximation. In particular, we propose and compare novel preconditioners and provide theoretical convergence results. Extensive numerical experiments on simulated and real-world data sets show that VIF approximations are both computationally efficient as well as more accurate and numerically stable than state-of-the-art alternatives. All methods are implemented in the open source C++ library GPBoost with high-level Python and R interfaces.

Updated: 2026-01-29 18:19:22

标题: 老化诱导点全尺度高斯过程近似

摘要: 高斯过程是一种灵活、概率性、非参数模型，在机器学习和统计学中被广泛使用。然而，它们在处理大数据集时的可伸缩性受到计算约束的限制。为了克服这些挑战，我们提出了结合全局诱导点和局部Vecchia逼近优势的Vecchia诱导点全尺寸（VIF）逼近方法。Vecchia逼近在低维输入和中度平滑协方差函数的情况下表现出色，而诱导点方法更适用于高维输入和更平滑的协方差函数。我们的VIF方法通过使用一种有效的基于相关性的邻居查找策略来实现Vecchia逼近残差过程，该策略通过修改后的覆盖树算法实现。我们进一步将我们的框架扩展到非高斯似然函数，通过引入迭代方法，与使用拉普拉斯逼近时相比，显著降低了训练和预测的计算成本。特别是，我们提出并比较新颖的预处理器，并提供理论收敛结果。对模拟和真实数据集的广泛数值实验表明，VIF逼近既在计算效率上优于最先进的替代方法，又比其更准确和数值上更稳定。所有方法均在开源的C++库GPBoost中实现，具有高级Python和R接口。

更新时间: 2026-01-29 18:19:22

领域: stat.ML,cs.LG,stat.ME

下载: http://arxiv.org/abs/2507.05064v2

Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation

LLM agents excel when environments are mostly static and the needed information fits in a model's context window, but they often fail in open-ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co-vary, and the importance of a fact may only become clear after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct-style agents are especially brittle in this regime. Their retrieve-summarize-reason loop makes conclusions sensitive to exploration order and introduces run-to-run non-determinism, producing a reliability gap where Pass-at-k may be high but Majority-at-k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses cannot be autonomously checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context. We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including a 7x average gain in Majority-at-k entity F1.

Updated: 2026-01-29 18:18:39

标题: 思考本地化，解释全局化：通过本地推理和信念传播引导的图形化LLM调查

摘要: LLM代理在环境大多静态且所需信息符合模型上下文窗口时表现出色，但它们通常在需要通过迭代从海量、异构的操作数据中挖掘证据构建解释的开放式调查中失败。这些调查展现出隐藏的依赖结构：实体相互作用，信号共变化，一个事实的重要性可能只有在发现其他证据后才会变得清晰。由于上下文窗口是有界的，代理必须在了解其重要性之前总结中间发现，增加了丢弃关键证据的风险。在这种情况下，ReAct风格的代理特别脆弱。它们的检索-总结-推理循环使结论对探索顺序敏感，并引入了运行到运行的非确定性，产生了一个可靠性差距，其中Pass-at-k可能很高，但Majority-at-k仍然很低。仅仅采样更多的展开或生成更长的推理轨迹并不能可靠地稳定结果，因为假设不能在新证据到达时自主检查，而且没有明确的信念记账和修订机制。此外，ReAct将语义推理与工具编排和状态跟踪等控制器职责纠缠在一起，因此执行错误和计划漂移会降低推理能力，同时消耗有限的上下文。我们通过将调查定位为对依赖图进行归纳推理，并提出EoG（图上的解释），一个分解框架，在此框架中，LLM执行有界的本地证据挖掘和标记（原因与症状），而确定性控制器管理遍历、状态和信念传播，以计算最小的解释前沿。在代表性的ITBench诊断任务中，EoG相对于ReAct基线提高了准确性和运行到运行的一致性，包括在Majority-at-k实体F1方面平均提升了7倍。

更新时间: 2026-01-29 18:18:39

领域: cs.AI,cs.LG,cs.LO

下载: http://arxiv.org/abs/2601.17915v2

Where Do the Joules Go? Diagnosing Inference Energy Consumption

Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25$\times$ energy differences, video generation sometimes consumes more than 100$\times$ the energy of images, and GPU utilization differences can result in 3--5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.

Updated: 2026-01-29 18:16:45

标题: 能量耗费的诊断：焦耳去了哪里？

摘要: 能源现在是一个关键的机器学习计算资源。虽然测量能耗并观察趋势是一个有价值的第一步，但准确理解和诊断这些差异发生的原因对于优化至关重要。为此，我们首先展示了一项大规模的推理时间和能量测量研究，涵盖了46个模型、7个任务和1,858个不同配置，使用NVIDIA H100和B200 GPU。我们的实证发现涵盖了数量级的变化：LLM任务类型可能导致25倍的能量差异，视频生成有时消耗的能量比图像多100倍以上，GPU利用率差异可能导致3-5倍的能量差异。基于我们的观察，我们提出了一个关于控制时间和能量消耗的潜在机制的推理框架。要点是，时间和能量由诸如内存和利用率等潜在指标决定，而这些指标受到算法、软件和硬件层面的各种因素的影响。我们的框架还直接扩展到每瓦特吞吐量，这是数据中心中受电量限制的关键指标。

更新时间: 2026-01-29 18:16:45

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2601.22076v1

How Many Ratings per Item are Necessary for Reliable Significance Testing?

A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, ``gold standard'' data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI -- along with strong evidence that humans are unreliable judges -- estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.

Updated: 2026-01-29 18:15:35

标题: 每个项目需要多少个评分才能进行可靠的显著性检验？

摘要: 机器学习评估的基石是模型和人类响应可靠到足以针对统一的权威“黄金标准”数据评估模型，通过简单的指标如准确度、精确度和召回率。生成式人工智能革命似乎会打破这一假设，鉴于随机推理发挥的关键作用。然而，尽管公众要求人工智能更透明，以及有强有力的证据表明人类是不可靠的评判者，对模型可靠性的估计通常仅基于每个输入项目的最多几个输出响应。我们改编了一种方法，此前用于评估各种度量标准和估计器对机器学习评估的可靠性，以确定（现有或计划中的）数据集是否具有足够的响应以确保可靠的零假设统计检验。我们展示，对于许多常见的度量标准，即使每个项目收集5-10个响应（来自每个模型和人类评估团队）也是不足够的。我们将我们的方法应用于几个极少数具有多个解体响应的黄金标准测试集，并展示即使这些数据集也缺乏足够的响应。我们展示了我们的方法如何帮助人工智能研究人员更好地决定如何收集数据进行人工智能评估。

更新时间: 2026-01-29 18:15:35

领域: cs.LG

下载: http://arxiv.org/abs/2412.02968v3

Harmonizing Safety and Speed: A Human-Algorithm Approach to Enhance the FDA's Medical Device Clearance Policy

The United States Food and Drug Administration's (FDA's) 510(k) pathway allows manufacturers to gain medical device approval by demonstrating substantial equivalence to a legally marketed device. However, the inherent ambiguity of this regulatory procedure has been associated with high recall among many devices cleared through this pathway, raising significant safety concerns. In this paper, we develop a combined human-algorithm approach to assist the FDA in improving its 510(k) medical device clearance process by reducing recall risk and regulatory workload. We first develop machine learning methods to estimate the risk of recall of 510(k) medical devices based on the information available at the time of submission. We then propose a data-driven clearance policy that recommends acceptance, rejection, or deferral to FDA's committees for in-depth evaluation. We conduct an empirical study using a unique dataset of over 31,000 submissions that we assembled based on data sources from the FDA and Centers for Medicare and Medicaid Service (CMS). Compared to the FDA's current practice, which has a recall rate of 10.3% and a normalized workload measure of 100%, a conservative evaluation of our policy shows a 32.9% improvement in the recall rate and a 40.5% reduction in the workload. Our analyses further suggest annual cost savings of approximately $1.7 billion for the healthcare system driven by avoided replacement costs, which is equivalent to 1.1% of the entire United States annual medical device expenditure. Our findings highlight the value of a holistic and data-driven approach to improve the FDA's current 510(k) pathway.

Updated: 2026-01-29 18:09:15

标题: 调和安全与速度：一种人算法方法以增强FDA的医疗器械审批政策

摘要: 美国食品药品监督管理局（FDA）的510（k）途径允许制造商通过证明与合法销售的设备具有实质等同性来获得医疗器械批准。然而，这一监管程序的固有模糊性与许多通过该途径清关的设备高召回率相关联，引发了重大安全担忧。在本文中，我们开发了一种人工算法结合的方法，以协助FDA改进其510（k）医疗器械清关流程，减少召回风险和监管工作量。我们首先开发机器学习方法来基于提交时可用信息估计510（k）医疗器械的召回风险。然后，我们提出了一个数据驱动的清关政策，建议接受、拒绝或将FDA的委员会推迟进行深入评估。我们进行了一项实证研究，使用了一个基于FDA和医疗保险和医疗补助服务中心（CMS）数据来源组装的超过31,000份提交的独特数据集。与FDA当前的做法相比，其召回率为10.3%，工作量的归一化指标为100%，我们保守评估的政策显示召回率提高了32.9%，工作量减少了40.5%。我们的分析进一步表明，由于避免替换成本，医疗系统年节约成本约为17亿美元，相当于整个美国年度医疗器械支出的1.1%。我们的研究结果突显了通过整体和数据驱动方法改进FDA当前的510（k）途径的价值。

更新时间: 2026-01-29 18:09:15

领域: cs.LG,cs.HC,math.OC,stat.ML

下载: http://arxiv.org/abs/2407.11823v3

Making Foundation Models Probabilistic via Singular Value Ensembles

Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through large-scale pretraining. However, these models often yield overconfident, uncalibrated predictions. The standard approach to quantifying epistemic uncertainty, training an ensemble of independent models, incurs prohibitive computational costs that scale linearly with ensemble size, making it impractical for large foundation models. We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensemble method that builds on a simple, but powerful core assumption: namely, that the singular vectors of the weight matrices constitute meaningful subspaces of the model's knowledge. Pretrained foundation models encode rich, transferable information in their weight matrices. If the singular vectors are indeed meaningful (orthogonal) "knowledge directions". To obtain a model ensemble, we modulate only how strongly each direction contributes to the output. Rather than learning entirely new parameters, we freeze the singular vectors and only train per-member singular values that rescale the contribution of each direction in that shared knowledge basis. Ensemble diversity emerges naturally as stochastic initialization and random sampling of mini-batches during joint training cause different members to converge to different combinations of the same underlying knowledge. SVE achieves uncertainty quantification comparable to explicit deep ensembles while increasing the parameter count of the base model by less than 1%, making principled uncertainty estimation accessible in resource-constrained settings. We validate SVE on NLP and vision tasks with various different backbones and show that it improves calibration while maintaining predictive accuracy.

Updated: 2026-01-29 18:07:18

标题: 通过奇异值集成使基础模型具有概率性

摘要: 基于大规模预训练的基础模型已经成为机器学习中的主导范式，在各种任务中取得了显著的性能。然而，这些模型往往会产生过度自信、不经校准的预测。标准的量化认知不确定性的方法是训练独立模型的集成，但这会带来与集成大小成线性比例的计算成本，使其在大型基础模型中变得不切实际。我们提出了奇异值集成（SVE），这是一种参数高效的隐式集成方法，建立在一个简单但强大的核心假设之上：即，权重矩阵的奇异向量构成模型知识的有意义子空间。预训练的基础模型在其权重矩阵中编码了丰富、可转移的信息。如果奇异向量确实是有意义（正交）的“知识方向”。为了获得模型集成，我们只调节每个方向对输出的贡献程度。我们冻结奇异向量，只训练每个成员的奇异值，这些奇异值重新调整了在共享知识基础上每个方向的贡献。由于在联合训练期间的随机初始化和小批量随机抽样导致不同成员收敛到相同基础知识的不同组合，集成多样性会自然产生。SVE 实现了与显式深度集成相当的不确定性量化，同时将基础模型参数计数增加不到 1%，使得在资源受限环境中的原则性不确定性估计变得可行。我们在自然语言处理和视觉任务上验证了SVE，使用了不同的骨干结构，结果显示它在提高校准性的同时保持了预测准确性。

更新时间: 2026-01-29 18:07:18

领域: cs.LG

下载: http://arxiv.org/abs/2601.22068v1

SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins

Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored \textit{via} selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.

Updated: 2026-01-29 17:59:16

标题: SiDGen：用于蛋白质配体生成建模的结构信息扩散

摘要: 设计既符合化学规范又与蛋白质结合口袋结构相容的配体是计算药物发现中的关键瓶颈。现有方法要么忽略结构背景，要么依赖于昂贵、内存密集的编码，限制了吞吐量和可扩展性。我们提出了SiDGen（结构信息扩散生成器），这是一个与蛋白质条件相关的扩散框架，它将掩码SMILES生成与轻量级折叠派生特征相结合，用于口袋感知。为了在表达性和效率之间取得平衡，SiDGen支持两种调节路径：一种简化模式，从蛋白嵌入中汇总粗略的结构信号，以及一种全模式，注入局部对偶偏差以实现更强的耦合。通过最近邻上采样的粗跨步折叠机制缓解了对偶张量的二次内存成本，从而使训练能够在实际序列长度上进行。通过循环内的化学有效性检查和无效惩罚来维持学习稳定性，同时通过选择性编译、数据加载器调整和梯度累积来恢复大规模训练的效率。在自动化基准测试中，SiDGen生成具有高有效性、独特性和新颖性的配体，同时在基于对接的评估中取得竞争性表现，并保持合理的分子属性。这些结果表明，SiDGen可以提供可扩展的、口袋感知的分子设计，为高吞吐量药物发现提供了实用的条件生成路径。

更新时间: 2026-01-29 17:59:16

领域: cs.LG

下载: http://arxiv.org/abs/2511.09529v2

PFT: Phonon Fine-tuning for Machine Learned Interatomic Potentials

Many materials properties depend on higher-order derivatives of the potential energy surface, yet machine learned interatomic potentials (MLIPs) trained with a standard loss on energy, force, and stress errors can exhibit error in curvature, degrading the prediction of vibrational properties. We introduce phonon fine-tuning (PFT), which directly supervises second-order force constants of materials by matching MLIP energy Hessians to DFT-computed force constants from finite displacement phonon calculations. To scale to large supercells, PFT stochastically samples Hessian columns and computes the loss with a single Hessian-vector product. We also use a simple co-training scheme to incorporate upstream data to mitigate catastrophic forgetting. On the MDR Phonon benchmark, PFT improves Nequix MP by 55% on average across phonon thermodynamic properties and achieves state-of-the-art accuracy among models trained on Materials Project trajectories. PFT also generalizes to improve properties beyond second-derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.

Updated: 2026-01-29 17:54:33

标题: PFT：机器学习原子间势的声子微调

摘要: 许多材料性质取决于位势能表面的高阶导数，然而，通过在能量、力和应力误差上训练的机器学习原子间势（MLIPs）可能会出现曲率误差，从而降低振动性能的预测能力。我们引入了声子微调（PFT），通过将MLIP能量Hessian与从有限位移声子计算中计算的DFT力常数进行匹配，直接监督材料的二阶力常数。为了扩展到大型超胞，PFT随机采样Hessian列，并使用单个Hessian向量乘积计算损失。我们还使用简单的共同训练方案，将上游数据纳入以减轻灾难性遗忘。在MDR声子基准测试中，PFT在声子热力学性质方面平均提高了55%的Nequix MP，达到了在Materials Project轨迹上训练的模型中的最新精度。PFT还能够推广到改善超过二阶导数的性质，改善依赖于位势能的三阶导数的热传导预测。

更新时间: 2026-01-29 17:54:33

领域: cond-mat.mtrl-sci,cs.LG

下载: http://arxiv.org/abs/2601.07742v3

How Out-of-Distribution Detection Learning Theory Enhances Transformer: Learnability and Reliability

Transformers excel in natural language processing and computer vision tasks. However, they still face challenges in generalizing to Out-of-Distribution (OOD) datasets, i.e. data whose distribution differs from that seen during training. OOD detection aims to distinguish outliers while preserving in-distribution (ID) data performance. This paper introduces the OOD detection Probably Approximately Correct (PAC) Theory for transformers, which establishes the conditions for data distribution and model configurations for the OOD detection learnability of transformers. It shows that outliers can be accurately represented and distinguished with sufficient data under conditions. The theoretical implications highlight the trade-off between theoretical principles and practical training paradigms. By examining this trade-off, we naturally derived the rationale for leveraging auxiliary outliers to enhance OOD detection. Our theory suggests that by penalizing the misclassification of outliers within the loss function and strategically generating soft synthetic outliers, one can robustly bolster the reliability of transformer networks. This approach yields a novel algorithm that ensures learnability and refines the decision boundaries between inliers and outliers. In practice, the algorithm consistently achieves state-of-the-art (SOTA) performance across various data formats.

Updated: 2026-01-29 17:53:44

标题: 如何通过异常检测学习理论增强Transformer：可学习性和可靠性

摘要: 变压器在自然语言处理和计算机视觉任务中表现出色。然而，它们仍然面临着在泛化到分布不同于训练过程中所见分布的Out-of-Distribution（OOD）数据集方面的挑战。OOD检测旨在区分异常值，同时保持在分布（ID）数据性能。本文介绍了变压器的OOD检测Probably Approximately Correct（PAC）理论，该理论为变压器的OOD检测的可学习性建立了数据分布和模型配置的条件。它表明，在一定条件下，异常值可以被准确表示和区分。理论的含义突出了理论原则与实际训练范例之间的权衡。通过研究这种权衡，我们自然地推导出了利用辅助异常值增强OOD检测的理由。我们的理论表明，通过在损失函数中对异常值的误分类进行惩罚，并策略性地生成软合成异常值，可以稳健地增强变压器网络的可靠性。这种方法产生了一种新颖的算法，确保了可学习性，并优化了在内部和外部之间的决策边界。在实践中，该算法始终在各种数据格式上实现了最新技术水平（SOTA）的表现。

更新时间: 2026-01-29 17:53:44

领域: cs.LG,math.PR

下载: http://arxiv.org/abs/2406.12915v6

Corrective Diffusion Language Models

While Diffusion Language Models (DLMs) are theoretically well-suited for iterative refinement due to their non-causal structure, they often fail to reliably revise incorrect tokens in practice. The key challenge lies in the model's inability to distinguish between correct and erroneous tokens in a visible sequence. Standard masked diffusion language model (MDLM) training is restricted to the objective of unmasking, undermining the effectiveness of refinement guided by confidence. Based on this observation, we study corrective behavior in DLMs, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a post-training principle oriented by correction that explicitly supervises visible incorrect tokens, enabling discriminative confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark, a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and parallel decoding scenarios demonstrate that models trained with our approach substantially outperform standard MDLMs, with gains that are most pronounced when parallel decoding introduces substantial uncertainty and iterative refinement becomes essential. Our code is publicly available at https://github.com/zhangshuibai/CDLM.

Updated: 2026-01-29 17:52:44

标题: 校正扩散语言模型

摘要: 扩散语言模型（DLMs）在理论上非常适合进行迭代改进，因为它们具有非因果结构，但在实践中，它们经常无法可靠地修订错误的标记。关键挑战在于模型无法区分可见序列中的正确和错误标记。标准的掩蔽扩散语言模型（MDLM）训练仅限于解除掩蔽的目标，削弱了由信心引导的改进的有效性。基于这一观察，我们研究了DLMs中的修正行为，定义为能够将较低的置信度分配给错误的标记并在保留正确内容的同时进行迭代改进的能力。我们表明，这种能力不是由传统的掩蔽扩散目标引起的，并提出了一个以修正为导向的后训练原则，明确监督可见的错误标记，实现区分置信度和有针对性的改进。为了评估修正行为，我们引入了代码修订基准，这是一个可控制和可执行的基准，用于评估错误定位和现场修正。在代码修订任务和并行解码场景上的实验表明，使用我们的方法训练的模型明显优于标准MDLMs，特别是在并行解码引入大量不确定性且迭代改进变得至关重要时。我们的代码可在https://github.com/zhangshuibai/CDLM 上公开获取。

更新时间: 2026-01-29 17:52:44

领域: cs.LG

下载: http://arxiv.org/abs/2512.15596v2

Learning to Communicate Across Modalities: Perceptual Heterogeneity in Multi-Agent Systems

Emergent communication offers insight into how agents develop shared structured representations, yet most research assumes homogeneous modalities or aligned representational spaces, overlooking the perceptual heterogeneity of real-world settings. We study a heterogeneous multi-step binary communication game where agents differ in modality and lack perceptual grounding. Despite perceptual misalignment, multimodal systems converge to class-consistent messages grounded in perceptual input. Unimodal systems communicate more efficiently, using fewer bits and achieving lower classification entropy, while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation experiments provide strong evidence that meaning is encoded in a distributional rather than compositional manner, as each bit's contribution depends on its surrounding pattern. Finally, interoperability analyses show that systems trained in different perceptual worlds fail to directly communicate, but limited fine-tuning enables successful cross-system communication. This work positions emergent communication as a framework for studying how agents adapt and transfer representations across heterogeneous modalities, opening new directions for both theory and experimentation.

Updated: 2026-01-29 17:45:41

标题: 学会跨模态交流：多智能体系统中的感知异质性

摘要: 紧急通信提供了洞察代理如何发展共享结构化表示的见解，然而大多数研究假设同质的模态或对齐的表示空间，忽视了现实世界环境中的感知异质性。我们研究了一种多步异构二进制通信游戏，在这个游戏中代理在模态上有所不同并且缺乏感知基础。尽管感知不对齐，多模系统会收敛于基于感知输入的类一致消息。单模系统会更有效地进行通信，使用更少的比特并达到更低的分类熵，而多模代理需要更多的信息交换并表现出更高的不确定性。比特扰动实验提供了强有力的证据，表明意义是以分布式而不是组合式的方式编码的，因为每个比特的贡献取决于其周围的模式。最后，互操作性分析表明在不同感知世界中训练的系统无法直接进行通信，但有限的微调可以实现跨系统的成功通信。这项工作将紧急通信定位为一个研究代理如何在异质模态之间适应和传输表示的框架，为理论和实验开辟了新的方向。

更新时间: 2026-01-29 17:45:41

领域: cs.MA,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2601.22041v1

A Separable Architecture for Continuous Token Representation in Language Models

Transformer scaling law analyses typically treat parameters as interchangeable; an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture with a continuous embedding generator to replace the discrete lookup tables of canonical models. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with $1.47$ to $2.11 \times$ more parameters.

Updated: 2026-01-29 17:44:25

标题: 一种在语言模型中用于连续令牌表示的可分离架构

摘要: Transformer缩放定律分析通常将参数视为可互换的；这是一个准确预测损失-计算关系的抽象。然而，在小于十亿参数的小语言模型（SLMs）中，嵌入矩阵占据了参数预算。本文认为这种分配既不够优化，也不合乎直觉。Leviathan是一种具有连续嵌入生成器的架构，用以替代规范模型的离散查找表。在等参数设置下在Pile数据集上进行评估，Leviathan始终优于标准的LLaMA风格架构。通过经验幂律拟合，Leviathan展现出明显优越的有效参数容量。在所研究的范围内，Leviathan 表现为具有1.47至2.11倍更多参数的密集模型。

更新时间: 2026-01-29 17:44:25

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.22040v1

Machine learning for option pricing: an empirical investigation of network architectures

We consider the supervised learning problem of learning the price of an option or the implied volatility given appropriate input data (model parameters) and corresponding output data (option prices or implied volatilities). The majority of articles in this literature considers a (plain) feed forward neural network architecture in order to connect the neurons used for learning the function mapping inputs to outputs. In this article, motivated by methods in image classification and recent advances in machine learning methods for PDEs, we investigate empirically whether and how the choice of network architecture affects the accuracy and training time of a machine learning algorithm. We find that the generalized highway network architecture achieves the best performance, when considering the mean squared error and the training time as criteria, within the considered parameter budgets for the Black-Scholes and Heston option pricing problems. Considering the transformed implied volatility problem, a simplified DGM variant achieves the lowest error among the tested architectures. We also carry out a capacity-normalised comparison for completeness, where all architectures are evaluated with an equal number of parameters. Finally, for the implied volatility problem, we additionally include experiments using real market data.

Updated: 2026-01-29 17:43:20

标题: 机器学习用于期权定价：网络架构的实证研究

摘要: 我们考虑监督学习问题，即根据适当的输入数据（模型参数）和相应的输出数据（期权价格或隐含波动率）学习期权价格或隐含波动率。在这个文献中，大多数文章考虑使用（纯）前馈神经网络架构，以连接用于学习将输入映射到输出的神经元。受图像分类方法和最近机器学习方法在PDEs方面的进展的启发，我们通过实证研究探讨网络架构的选择如何影响机器学习算法的准确性和训练时间。我们发现，在考虑参数预算的情况下，广义高速公路网络架构在Black-Scholes和Heston期权定价问题中表现最好，以均方误差和训练时间作为标准。考虑到转化后的隐含波动率问题，简化的DGM变体在测试的架构中达到了最低的错误率。我们还进行了容量归一化比较，其中所有架构都使用相同数量的参数进行评估。最后，对于隐含波动率问题，我们还使用了真实市场数据进行实验。

更新时间: 2026-01-29 17:43:20

领域: q-fin.CP,cs.LG

下载: http://arxiv.org/abs/2307.07657v3

Optimizing Agentic Workflows using Meta-tools

Agentic AI enables LLM to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often require many iterative reasoning steps and tool invocations, leading to significant operational expense, end-to-end latency and failures due to hallucinations. This work introduces Agent Workflow Optimization (AWO), a framework that identifies and optimizes redundant tool execution patterns to improve the efficiency and robustness of agentic workflows. AWO analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools, which are deterministic, composite tools that bundle multiple agent actions into a single invocation. Meta-tools bypass unnecessary intermediate LLM reasoning steps and reduce operational cost while also shortening execution paths, leading to fewer failures. Experiments on two agentic AI benchmarks show that AWO reduces the number of LLM calls up to 11.9% while also increasing the task success rate by up to 4.2 percent points.

Updated: 2026-01-29 17:43:08

标题: 使用元工具优化主动工作流程

摘要: 主动型AI使LLM能够动态推理、规划和与工具交互，以解决复杂任务。然而，主动型工作流通常需要许多迭代推理步骤和工具调用，导致显著的操作费用、端到端延迟和由于幻觉而导致的失败。本文介绍了Agent Workflow Optimization（AWO），这是一个框架，用于识别和优化冗余的工具执行模式，以提高主动型工作流的效率和稳健性。AWO分析现有的工作流程跟踪，发现工具调用的重复序列，并将它们转换为元工具，这是确定性的、综合的工具，将多个代理动作捆绑成一个单独的调用。元工具绕过不必要的中间LLM推理步骤，减少操作成本，同时缩短执行路径，减少失败。对两个主动型AI基准测试的实验表明，AWO将LLM调用次数减少了高达11.9%，同时将任务成功率提高了高达4.2个百分点。

更新时间: 2026-01-29 17:43:08

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.22037v1

FS-KAN: Permutation Equivariant Kolmogorov-Arnold Networks via Function Sharing

Permutation equivariant neural networks employing parameter-sharing schemes have emerged as powerful models for leveraging a wide range of data symmetries, significantly enhancing the generalization and computational efficiency of the resulting models. Recently, Kolmogorov-Arnold Networks (KANs) have demonstrated promise through their improved interpretability and expressivity compared to traditional architectures based on MLPs. While equivariant KANs have been explored in recent literature for a few specific data types, a principled framework for applying them to data with permutation symmetries in a general context remains absent. This paper introduces Function Sharing KAN (FS-KAN), a principled approach to constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups, unifying and significantly extending previous work in this domain. We derive the basic construction of these FS-KAN layers by generalizing parameter-sharing schemes to the Kolmogorov-Arnold setup and provide a theoretical analysis demonstrating that FS-KANs have the same expressive power as networks that use standard parameter-sharing layers, allowing us to transfer well-known and important expressivity results from parameter-sharing networks to FS-KANs. Empirical evaluations on multiple data types and symmetry groups show that FS-KANs exhibit superior data efficiency compared to standard parameter-sharing layers, by a wide margin in certain cases, while preserving the interpretability and adaptability of KANs, making them an excellent architecture choice in low-data regimes.

Updated: 2026-01-29 17:41:48

标题: FS-KAN: 通过函数共享实现的置换等变 Kolmogorov-Arnold 网络

摘要: 具有参数共享方案的置换等变神经网络已成为利用各种数据对称性的强大模型，显著增强了生成模型的泛化能力和计算效率。最近，科尔莫戈洛夫-阿诺尔德网络（KAN）通过其相对传统基于MLP的架构具有改进的可解释性和表达能力，展现出了潜力。虽然近期文献中已经探索了等变KAN用于一些特定数据类型，但在一般情况下应用于具有置换对称性数据的原则性框架仍不存在。本文介绍了功能共享KAN（FS-KAN），一种构建等变和不变KA层的原则性方法，将先前在该领域的工作统一并显著扩展。我们通过将参数共享方案推广到科尔莫戈洛夫-阿诺尔德设置中推导出这些FS-KAN层的基本构造，并提供了理论分析，表明FS-KAN具有与使用标准参数共享层的网络相同的表达能力，使我们能够将来自参数共享网络的众所周知且重要的表达能力结果转移到FS-KAN。在多种数据类型和对称性群体上的实证评估表明，相比标准参数共享层，FS-KAN在某些情况下表现出更优越的数据效率，同时保持了KAN的可解释性和适应性，使其成为低数据环境下的优秀架构选择。

更新时间: 2026-01-29 17:41:48

领域: cs.LG

下载: http://arxiv.org/abs/2509.24472v2

Cross-Fusion Distance: A Novel Metric for Measuring Fusion and Separability Between Data Groups in Representation Space

Quantifying degrees of fusion and separability between data groups in representation space is a fundamental problem in representation learning, particularly under domain shift. A meaningful metric should capture fusion-altering factors like geometric displacement between representation groups, whose variations change the extent of fusion, while remaining invariant to fusion-preserving factors such as global scaling and sampling-induced layout changes, whose variations do not. Existing distributional distance metrics conflate these factors, leading to measures that are not informative of the true extent of fusion between data groups. We introduce Cross-Fusion Distance (CFD), a principled measure that isolates fusion-altering geometry while remaining robust to fusion-preserving variations, with linear computational complexity. We characterize the invariance and sensitivity properties of CFD theoretically and validate them in controlled synthetic experiments. For practical utility on real-world datasets with domain shift, CFD aligns more closely with downstream generalization degradation than commonly used alternatives. Overall, CFD provides a theoretically grounded and interpretable distance measure for representation learning.

Updated: 2026-01-29 17:41:43

标题: 交叉融合距离：一种在表示空间中测量数据组之间融合和可分性的新型度量标准。

摘要: 在表示学习中，量化数据组在表示空间中的融合程度和可分离性是一个基本问题，特别是在领域转移下。一个有意义的度量应该捕捉到影响融合的因素，比如表示组之间的几何位移，这些变化会改变融合的程度，同时保持对全局缩放和采样引起的布局变化等保持融合的因素不变，这些变化不会改变。现有的分布距离度量混淆了这些因素，导致度量不能准确反映数据组之间真正的融合程度。我们引入了交叉融合距离（CFD），这是一个基于原则的度量，可以隔离融合改变几何形状的因素，同时对保持融合的变化具有鲁棒性，具有线性计算复杂度。我们在受控合成实验中从理论上表征了CFD的不变性和敏感性属性，并在真实数据集上的实用性上验证了它们。对于存在领域转移的实际数据集，CFD与常用的替代方法更加接近下游泛化降级。总的来说，CFD为表示学习提供了一个理论基础和可解释的距离度量。

更新时间: 2026-01-29 17:41:43

领域: cs.LG

下载: http://arxiv.org/abs/2601.22036v1

Correcting for Position Bias in Learning to Rank: A Control Function Approach

Implicit feedback data, such as user clicks, is commonly used in learning-to-rank (LTR) systems because it is easy to collect and it often reflects user preferences. However, this data is prone to various biases, and training an LTR algorithm directly on biased data can result in suboptimal ranking performance. One of the most prominent and well-studied biases in implicit feedback data is position bias, which occurs because users are more likely to interact with higher-ranked items regardless of their true relevance. In this paper, we propose a novel control function-based method that accounts for position bias in a two-stage process. The first stage uses exogenous variation from the residuals of the ranking process to correct for position bias in the second stage click equation. Unlike previous position bias correction methods, our method does not require knowledge of the click or propensity model and allows for nonlinearity in the underlying ranking model. Moreover, our method is general and allows for debiasing any state-of-the-art ranking algorithm by plugging it into the second stage. We also introduce a new technique to debias validation clicks for hyperparameter tuning to select the optimal model in the absence of unbiased validation data. Experimental results show that our method outperforms state-of-the-art approaches in correcting for position bias.

Updated: 2026-01-29 17:41:07

标题: 纠正学习排序中的位置偏差：控制函数方法

摘要: 隐式反馈数据，如用户点击，通常在学习排序（LTR）系统中使用，因为它容易收集并且通常反映用户偏好。然而，这些数据容易受到各种偏见的影响，直接在偏见数据上训练LTR算法可能导致次优的排名表现。隐式反馈数据中最显著和深入研究的偏见之一是位置偏见，这是因为用户更有可能与排名较高的项目互动，而不考虑它们的真实相关性。在本文中，我们提出了一种基于控制函数的新方法，以两阶段过程解决位置偏见问题。第一阶段利用排名过程的残差的外生变化来纠正第二阶段点击方程中的位置偏见。与先前的位置偏见修正方法不同，我们的方法不需要点击或倾向模型的知识，并允许底层排序模型中的非线性。此外，我们的方法是通用的，并允许通过将其插入第二阶段来去偏任何最先进的排序算法。我们还引入了一种新技术，用于去偏验证点击以进行超参数调整，以在没有无偏验证数据的情况下选择最佳模型。实验结果表明，我们的方法在纠正位置偏见方面优于最先进的方法。

更新时间: 2026-01-29 17:41:07

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2506.06989v2

Holographic generative flows with AdS/CFT

We present a framework for generative machine learning that leverages the holographic principle of quantum gravity, or to be more precise its manifestation as the anti-de Sitter/conformal field theory (AdS/CFT) correspondence, with techniques for deep learning and transport theory. Our proposal is to represent the flow of data from a base distribution to some learned distribution using the bulk-to-boundary mapping of scalar fields in AdS. In the language of machine learning, we are representing and augmenting the flow-matching algorithm with AdS physics. Using a checkerboard toy dataset and MNIST, we find that our model achieves faster and higher quality convergence than comparable physics-free flow-matching models. Our method provides a physically interpretable version of flow matching. More broadly, it establishes the utility of AdS physics and geometry in the development of novel paradigms in generative modeling.

Updated: 2026-01-29 17:39:40

标题: AdS/CFT中的全息生成流

摘要: 我们提出了一个利用量子引力的全息原理，或者更精确地说是其在反德西特/共形场论（AdS/CFT）对应中的体现，结合深度学习和传输理论的生成机器学习框架。我们的提议是使用AdS中标量场的体-边界映射来表示从基本分布到某个学习分布的数据流。在机器学习的术语中，我们正在用AdS物理学来表示和增强流匹配算法。通过使用一个棋盘状的玩具数据集和MNIST数据集，我们发现我们的模型比类似的无物理学流匹配模型实现更快、更高质量的收敛。我们的方法提供了一个可解释的流匹配版本。更广泛地说，它建立了在生成建模中开发新范式中AdS物理学和几何学的实用性。

更新时间: 2026-01-29 17:39:40

领域: cs.LG,gr-qc,hep-th

下载: http://arxiv.org/abs/2601.22033v1

OD-Stega: LLM-Based Relatively Secure Steganography via Optimized Distributions

We consider coverless steganography where a Large Language Model (LLM) is used to generate stego-texts in combination with arithmetic coding. An efficient method should embed secret bits in as few language tokens as possible while keeping the stego-text as natural as possible. We show that this problem is equivalent to maximizing the entropy of a replacement probability distribution of the next token generation, subject to a constraint on the divergence between the new distribution and the original one produced by the LLM. A closed-form solution is provided under either the KL divergence or the total variation constraint. Several important practical issues are also tackled: 1) An often-overlooked tokenization mismatch issue is resolved with a simple prompt selection approach, 2) The combination of the optimized distribution and the vocabulary truncation technique is considered, and 3) The incorporation of the proposed approach with existing (potentially non arithmetic coding based) techniques, e.g., the Discop technique.

Updated: 2026-01-29 17:37:33

标题: OD-Stega: 通过优化分布的基于LLM的相对安全隐写术

摘要: 我们考虑了一种无覆盖隐写术，其中使用大型语言模型（LLM）结合算术编码生成隐写文本。一种有效的方法应该尽可能少地嵌入秘密比特，同时保持隐写文本尽可能自然。我们展示了这个问题等价于最大化下一个令牌生成的替换概率分布的熵，受限于新分布与LLM生成的原始分布之间的差异的约束。在KL散度或总变差约束下提供了一个封闭形式的解决方案。还解决了几个重要的实际问题：1）通过简单的提示选择方法解决了经常被忽视的标记不匹配问题，2）考虑了优化分布与词汇截断技术的组合，3）将所提出的方法与现有（潜在非算术编码基础的）技术结合，例如Discop技术。

更新时间: 2026-01-29 17:37:33

领域: cs.IT,cs.AI,cs.CL,cs.CR,cs.LG

下载: http://arxiv.org/abs/2410.04328v2

Per-parameter Task Arithmetic for Unlearning in Large Language Models

In large language model (LLM) unlearning, private information is required to be removed. Task arithmetic unlearns by subtracting a specific task vector (TV)--defined as the parameter difference between a privacy-information-tuned model and the original model. While efficient, it can cause over-forgetting by disrupting parameters essential for retaining other information. Motivated by the observation that each parameter exhibits different importance for forgetting versus retention, we propose a per-parameter task arithmetic (PerTA) mechanism to rescale the TV, allowing per-parameter adjustment. These weights quantify the relative importance of each parameter for forgetting versus retention, estimated via gradients (i.e., PerTA-grad) or the diagonal Fisher information approximation (i.e., PerTA-fisher). Moreover, we discuss the effectiveness of PerTA, extend it to a more general form, and provide further analysis. Extensive experiments demonstrate that PerTA consistently improves upon standard TV, and in many cases surpasses widely used training-based unlearning methods in both forgetting effectiveness and overall model utility. By retaining the efficiency of task arithmetic while mitigating over-forgetting, PerTA offers a principled and practical framework for LLM unlearning.

Updated: 2026-01-29 17:35:32

标题: 大型语言模型中用于取消学习的每个参数的任务算术

摘要: 在大型语言模型（LLM）的遗忘中，需要删除私人信息。任务算术通过减去特定任务向量（TV）来实现遗忘，TV被定义为隐私信息调整模型和原始模型之间的参数差异。虽然高效，但这可能会导致过度遗忘，破坏保留其他信息所必需的参数。受到观察到每个参数在遗忘和保留方面的重要性不同的启发，我们提出了一种逐参数任务算术（PerTA）机制来重新调整TV，允许逐参数调整。这些权重量化了每个参数在遗忘和保留方面的相对重要性，通过梯度（即PerTA-grad）或对角费舍尔信息逼近（即PerTA-fisher）来估算。此外，我们讨论了PerTA的有效性，将其扩展为更一般的形式，并提供了进一步的分析。广泛的实验表明，PerTA在遗忘效果和整体模型效用方面持续改进，许多情况下超过了广泛使用的基于训练的遗忘方法。通过保留任务算术的效率，同时减轻过度遗忘，PerTA为LLM遗忘提供了一个有原则和实用的框架。

更新时间: 2026-01-29 17:35:32

领域: cs.LG

下载: http://arxiv.org/abs/2601.22030v1

Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representation Alignment

Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a DINOv2 visual encoder, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we empirically show that aligning model representations of approximate target features can substantially enhance reconstruction quality and perceptual realism. We provide theoretical results showing (a) that REPA regularization can be viewed as a variational approach for minimizing a divergence measure in the DINOv2 embedding space, and (b) how under certain regularity assumptions REPA updates steer the latent diffusion states toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by We integrate REPA into multiple state-of-the-art inverse problem solvers, and provide extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirming that our method consistently improves reconstruction quality, while also providing efficiency gains reducing the number of required discretization steps.

Updated: 2026-01-29 17:34:58

标题: 对齐和反转：通过表示对齐解决扩散和基于流的模型的逆问题

摘要: 最近的研究表明，在扩散或基于流的生成模型的内部表示和预训练的自监督编码器的内部表示之间强制对齐可以提供强大的归纳偏差，改善收敛性和样本质量。在这项工作中，我们将这个想法扩展到反问题，其中预训练的生成模型被用作先验。我们建议在扩散或基于流的模型和DINOv2视觉编码器之间应用表示对齐（REPA），以指导推断时的重构过程。虽然在反问题中无法获得地面真实信号，但我们凭经验表明，对齐近似目标特征的模型表示可以显著提高重构质量和感知逼真度。我们提供理论结果表明（a）REPA正则化可以被视为在DINOv2嵌入空间中最小化散度度量的一种变分方法，以及（b）在某些正则性假设下，REPA更新将潜在扩散状态引导至干净图像的状态。这些结果为我们理解REPA在改善感知保真度方面的作用提供了见解。最后，我们通过将REPA集成到多个最先进的反问题求解器中，并在超分辨率、框填充、高斯去模糊和运动去模糊等方面进行广泛实验，证实我们的方法始终改善重构质量，并提供效率收益，减少所需的离散化步骤数量。

更新时间: 2026-01-29 17:34:58

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2511.16870v2

The Ensemble Inverse Problem: Applications and Methods

We introduce a new multivariate statistical problem that we refer to as the Ensemble Inverse Problem (EIP). The aim of EIP is to invert for an ensemble that is distributed according to the pushforward of a prior under a forward process. In high energy physics (HEP), this is related to a widely known problem called unfolding, which aims to reconstruct the true physics distribution of quantities, such as momentum and angle, from measurements that are distorted by detector effects. In recent applications, the EIP also arises in full waveform inversion (FWI) and inverse imaging with unknown priors. We propose non-iterative inference-time methods that construct posterior samplers based on a new class of conditional generative models, which we call ensemble inverse generative models. For the posterior modeling, these models additionally use the ensemble information contained in the observation set on top of single measurements. Unlike existing methods, our proposed methods avoid explicit and iterative use of the forward model at inference time via training across several sets of truth-observation pairs that are consistent with the same forward model, but originate from a wide range of priors. We demonstrate that this training procedure implicitly encodes the likelihood model. The use of ensemble information helps posterior inference and enables generalization to unseen priors. We benchmark the proposed method on several synthetic and real datasets in inverse imaging, HEP, and FWI. The codes are available at https://github.com/ZhengyanHuan/The-Ensemble-Inverse-Problem--Applications-and-Methods.

Updated: 2026-01-29 17:34:41

标题: 合奏反问题：应用与方法

摘要: 我们介绍了一个新的多元统计问题，我们称之为集成逆问题（EIP）。EIP的目标是反演一个根据先验在正向过程下的推动分布的集成。在高能物理（HEP）中，这与一个被广泛知晓的问题相关，称为展开，旨在从受探测器效应扭曲的测量中重建数量（如动量和角度）的真实物理分布。在最近的应用中，EIP还出现在全波形反演（FWI）和具有未知先验的逆成像中。我们提出了一种非迭代推理时间方法，基于一类新的条件生成模型构建后验采样器，我们称之为集成逆生成模型。对于后验建模，这些模型另外使用了观测集中包含的集成信息，而不仅仅是单个测量。与现有方法不同，我们提出的方法通过跨多组与同一正向模型一致的真值-观测对进行训练，避免了在推理时间通过显式和迭代地使用正向模型。我们展示了这种训练过程隐含地编码了似然模型。使用集成信息有助于后验推理，并使其能够泛化到未见的先验。我们在逆成像、HEP和FWI中对所提出的方法进行了几个合成和真实数据集的基准测试。代码可在https://github.com/ZhengyanHuan/The-Ensemble-Inverse-Problem--Applications-and-Methods 上找到。

更新时间: 2026-01-29 17:34:41

领域: cs.LG

下载: http://arxiv.org/abs/2601.22029v1

From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning

Most LLM unlearning methods aim to approximate retrain-from-scratch behaviors with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing forgotten content generation, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features while pushing them away from retain features, explicitly reducing forget-retain interference with minimal shifts on retain features. We provide first theoretical insights that relate representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement that facilitates mainstream unlearning methods without positing extra privacy risks, inspiring future work that reshapes the representation space to remove forget concepts.

Updated: 2026-01-29 17:34:37

标题: 从对数到潜变量：LLM遗忘的对比表征塑造

摘要: 大多数LLM遗忘方法旨在通过在预测空间中定义的对齐式目标来近似重新训练行为，以最小化分布转移。尽管这些方法有效地减少了遗忘内容的生成，但这种方法可能会起到抑制作用：遗忘的概念可能会持续存在于表示中，并与保留的知识纠缠在一起。我们引入CLReg，这是一种对比表示正则化器，它识别遗忘特征并将其远离保留特征，明确减少了遗忘-保留干扰，并对保留特征进行最小的转移。我们提供了第一批将表示塑造与纠缠减少相关联的理论洞见。在不同大小的LLM的遗忘基准和LLM之间，CLReg减少了促进主流遗忘方法的遗忘-保留表示纠缠，而不提出额外的隐私风险，这激发了未来的工作，重新塑造表示空间以消除遗忘概念。

更新时间: 2026-01-29 17:34:37

领域: cs.LG

下载: http://arxiv.org/abs/2601.22028v1

Reward-Preserving Attacks For Robust Reinforcement Learning

Adversarial training in reinforcement learning (RL) is challenging because perturbations cascade through trajectories and compound over time, making fixed-strength attacks either overly destructive or too conservative. We propose reward-preserving attacks, which adapt adversarial strength so that an $α$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, perturbation magnitudes $η$ are selected dynamically, using a learned critic $Q((s,a),η)$ that estimates the expected return of $α$-reward-preserving rollouts. For intermediate values of $α$, this adaptive training yields policies that are robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius adversarial training.

Updated: 2026-01-29 17:32:38

标题: 奖励保留攻击对强化学习的稳健性

摘要: 在强化学习（RL）中的对抗训练是具有挑战性的，因为扰动会在轨迹中传播并随时间累积，使得固定强度的攻击要么过于破坏性，要么过于保守。我们提出了保持奖励的攻击，它调整对抗强度，使得在每个状态下名义到最坏情况回报差距的$α$分数仍然可达。在深度RL中，扰动幅度$η$是动态选择的，使用一个学习的评论家$Q((s,a),η)$来估计$α$-奖励保持回滚的预期回报。对于中间值的$α$，这种自适应训练产生的策略在各种扰动幅度范围内具有鲁棒性，同时保持名义性能，优于固定半径和均匀采样半径的对抗训练。

更新时间: 2026-01-29 17:32:38

领域: cs.LG

下载: http://arxiv.org/abs/2601.07118v2

RobustExplain: Evaluating Robustness of LLM-Based Explanation Agents for Recommendation

Large Language Models (LLMs) are increasingly used to generate natural-language explanations in recommender systems, acting as explanation agents that reason over user behavior histories. While prior work has focused on explanation fluency and relevance under fixed inputs, the robustness of LLM-generated explanations to realistic user behavior noise remains largely unexplored. In real-world web platforms, interaction histories are inherently noisy due to accidental clicks, temporal inconsistencies, missing values, and evolving preferences, raising concerns about explanation stability and user trust. We present RobustExplain, the first systematic evaluation framework for measuring the robustness of LLM-generated recommendation explanations. RobustExplain introduces five realistic user behavior perturbations evaluated across multiple severity levels and a multi-dimensional robustness metric capturing semantic, keyword, structural, and length consistency. Our goal is to establish a principled, task-level evaluation framework and initial robustness baselines, rather than to provide a comprehensive leaderboard across all available LLMs. Experiments on four representative LLMs (7B--70B) show that current models exhibit only moderate robustness, with larger models achieving up to 8% higher stability. Our results establish the first robustness benchmarks for explanation agents and highlight robustness as a critical dimension for trustworthy, agent-driven recommender systems at web scale.

Updated: 2026-01-29 17:28:56

标题: RobustExplain: 评估基于LLM的解释代理的推荐鲁棒性

摘要: 大型语言模型（LLMs）越来越被用于生成推荐系统中的自然语言解释，作为能够推理用户行为历史的解释代理。尽管先前的研究侧重于在固定输入下的解释流畅性和相关性，但LLM生成的解释对现实用户行为噪声的稳健性仍然未被充分探讨。在现实世界的网络平台中，由于意外点击、时间不一致、缺失数值和不断变化的偏好，交互历史本质上是嘈杂的，这引发了对解释稳定性和用户信任的担忧。我们提出了RobustExplain，这是用于衡量LLM生成的推荐解释的稳健性的第一个系统评估框架。RobustExplain引入了五种现实用户行为扰动，评估了多个严重程度水平，并提出了一个多维稳健性指标，涵盖了语义、关键词、结构和长度的一致性。我们的目标是建立一个基于任务的评估框架和初始稳健性基准，而不是提供所有可用LLMs的全面排行榜。对四个代表性LLMs（7B-70B）的实验表明，当前模型仅表现出中等稳健性，较大模型的稳定性可提高高达8%。我们的结果为解释代理建立了第一个稳健性基准，并强调稳健性是网页规模下值得信赖的、代理驱动的推荐系统的一个关键维度。

更新时间: 2026-01-29 17:28:56

领域: cs.IR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.19120v2

Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning

Unlearning in Multimodal Large Language Models (MLLMs) prevents the model from revealing private information when queried about target images. Existing MLLM unlearning methods largely adopt approaches developed for LLMs. They treat all answer tokens uniformly, disregarding their varying importance in the unlearning process. Moreover, these methods focus exclusively on the language modality, disregarding visual cues that indicate key tokens in answers. In this paper, after formulating the problem of unlearning in multimodal question answering for MLLMs, we propose Visual-Guided Key-Token Regularization (ViKeR). We leverage irrelevant visual inputs to predict ideal post-unlearning token-level distributions and use these distributions to regularize the unlearning process, thereby prioritizing key tokens. Further, we define key tokens in unlearning via information entropy and discuss ViKeR's effectiveness through token-level gradient reweighting, which amplifies updates on key tokens. Experiments on MLLMU and CLEAR benchmarks demonstrate that our method effectively performs unlearning while mitigating forgetting and maintaining response coherence.

Updated: 2026-01-29 17:26:54

标题: 视觉引导的关键-令牌正则化用于多模态大型语言模型遗忘

摘要: 多模态大型语言模型（MLLMs）中的遗忘防止了模型在查询目标图像时泄露私人信息。现有的MLLM遗忘方法主要采用为LLMs开发的方法。它们统一对待所有答案标记，忽视了它们在遗忘过程中的不同重要性。此外，这些方法专注于语言模态，忽视了表明答案中关键标记的视觉线索。在本文中，在为MLLMs制定多模态问答中的遗忘问题之后，我们提出了视觉引导的关键标记正则化（ViKeR）。我们利用不相关的视觉输入来预测理想的遗忘后标记级分布，并使用这些分布来规范化遗忘过程，从而优先处理关键标记。此外，我们通过信息熵定义了遗忘中的关键标记，并通过标记级梯度重新加权讨论了ViKeR的有效性，这可以加强对关键标记的更新。在MLLMU和CLEAR基准测试上的实验表明，我们的方法在执行遗忘的同时有效地减轻了遗忘并保持了响应的连贯性。

更新时间: 2026-01-29 17:26:54

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2601.22020v1

TBDFiltering: Sample-Efficient Tree-Based Data Filtering

The quality of machine learning models depends heavily on their training data. Selecting high-quality, diverse training sets for large language models (LLMs) is a difficult task, due to the lack of cheap and reliable quality metrics. While querying existing LLMs for document quality is common, this is not scalable to the large number (billions) of documents used in training. Instead, practitioners often use classifiers trained on sparse quality signals. In this paper, we propose a text-embedding-based hierarchical clustering approach that adaptively selects the documents to be evaluated by the LLM to estimate cluster quality. We prove that our method is query efficient: under the assumption that the hierarchical clustering contains a subtree such that each leaf cluster in the tree is pure enough (i.e., it mostly contains either only good or only bad documents), with high probability, the method can correctly predict the quality of each document after querying a small number of documents. The number of such documents is proportional to the size of the smallest subtree with (almost) pure leaves, without the algorithm knowing this subtree in advance. Furthermore, in a comprehensive experimental study, we demonstrate the benefits of our algorithm compared to other classifier-based filtering methods.

Updated: 2026-01-29 17:22:06

标题: TBDFiltering：基于树的数据过滤的样本高效方法

摘要: 机器学习模型的质量在很大程度上取决于它们的训练数据。为大型语言模型（LLMs）选择高质量、多样化的训练集是一项困难的任务，因为缺乏廉价和可靠的质量度量标准。虽然查询现有的LLMs以获取文档质量是常见的做法，但这对于用于训练的大量文档（数十亿）并不可扩展。相反，从业人员常常使用基于稀疏质量信号训练的分类器。在本文中，我们提出了一种基于文本嵌入的分层聚类方法，该方法自适应地选择要由LLM评估的文档，以估计聚类质量。我们证明我们的方法是查询有效的：在假设分层聚类包含一个子树，使得树中的每个叶子聚类都足够纯净（即，它主要包含要么只有好的文档，要么只有坏的文档）的情况下，有很高的概率，该方法可以在查询少量文档后正确预测每个文档的质量。这样的文档数量与具有（几乎）纯净叶子的最小子树的大小成比例，而算法并不需要提前知道这个子树。此外，在一项全面的实验研究中，我们展示了我们的算法相对于其他基于分类器的过滤方法的优势。

更新时间: 2026-01-29 17:22:06

领域: cs.LG

下载: http://arxiv.org/abs/2601.22016v1

Adaptive Swarm Mesh Refinement using Deep Reinforcement Learning with Local Rewards

Simulating physical systems is essential in engineering, but analytical solutions are limited to straightforward problems. Consequently, numerical methods like the Finite Element Method (FEM) are widely used. However, the FEM becomes computationally expensive as problem complexity and accuracy demands increase. Adaptive Mesh Refinement (AMR) improves the FEM by dynamically placing mesh elements on the domain, balancing computational speed and accuracy. Classical AMR depends on heuristics or expensive error estimators, which may lead to suboptimal performance for complex simulations. While AMR methods based on machine learning are promising, they currently only scale to simple problems. In this work, we formulate AMR as a system of collaborating, homogeneous agents that iteratively split into multiple new agents. This agent-wise perspective enables a spatial reward formulation focused on reducing the maximum mesh element error. Our approach, Adaptive Swarm Mesh Refinement++ (ASMR++), offers efficient, stable optimization and generates highly adaptive meshes at user-defined resolution at inference time. Extensive experiments demonstrate that ASMR++ outperforms heuristic approaches and learned baselines, matching the performance of expensive error-based oracle AMR strategies. ASMR additionally generalizes to different domains during inference, and produces meshes that simulate up to 2 orders of magnitude faster than uniform refinements in more demanding settings.

Updated: 2026-01-29 17:22:03

标题: 使用深度强化学习和局部奖励的自适应群体网格细化

摘要: 模拟物理系统在工程领域至关重要，但是解析解仅适用于简单问题。因此，数值方法如有限元法（FEM）被广泛应用。然而，随着问题复杂性和精度要求的增加，FEM的计算成本也在增加。自适应网格细化（AMR）通过动态放置网格元素在域上，平衡计算速度和精度，改进了FEM。传统的AMR依赖于启发式方法或昂贵的误差估计器，这可能导致复杂模拟的次优性能。基于机器学习的AMR方法很有前途，但目前仅适用于简单问题。在这项工作中，我们将AMR构建为一个协作的、同质的代理系统，这些代理系统迭代地分裂成多个新代理。这种代理视角使得我们能够聚焦于减少最大网格元素误差的空间奖励制定。我们的方法，Adaptive Swarm Mesh Refinement++（ASMR++），提供了高效、稳定的优化，并在推理时生成高度适应性的网格。大量实验证明，ASMR++优于启发式方法和学习基线，与昂贵的基于误差的 AMR 策略的表现相匹敌。ASMR还在推理时推广到不同的领域，并在更具挑战性的设置中产生比均匀细化快两个数量级的模拟网格。

更新时间: 2026-01-29 17:22:03

领域: cs.LG,cs.MA

下载: http://arxiv.org/abs/2406.08440v2

Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability

Catastrophic forgetting in continual learning is often measured at the performance or last-layer representation level, overlooking the underlying mechanisms. We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features (worse representation) and disrupting their readout by downstream computations. Analysis of a tractable model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders. We present a case study of a Vision Transformer trained on sequential CIFAR-10. Our work provides a new, feature-centric vocabulary for continual learning.

Updated: 2026-01-29 17:18:36

标题: 忘却之面：持续学习遇到机制可解释性

摘要: 在持续学习中的灾难性遗忘通常是通过性能或最后一层表示水平来衡量的，忽视了潜在的机制。我们引入了一个机械框架，提供了一个几何解释，即灾难性遗忘是个体特征编码转换的结果。这些转换可能通过减少特征的分配容量（更差的表示）和干扰下游计算的读取来导致遗忘。对一个可处理的模型的分析形式化了这种观点，使我们能够识别最佳和最坏的情况。通过对此模型的实验，我们经验性地测试了我们的形式分析，并强调了深度的不利影响。最后，我们演示了如何通过交叉编码器在实际模型分析中使用我们的框架。我们提供了一个在持续学习中使用的新的、以特征为中心的词汇。我们展示了一个在顺序CIFAR-10上训练的Vision Transformer的案例研究。

更新时间: 2026-01-29 17:18:36

领域: cs.LG

下载: http://arxiv.org/abs/2601.22012v1

Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering

Language models often default to a narrow set of high-probability outputs, leaving their generation paths homogeneous and prone to mode collapse. Sampling-based strategies inject randomness but still struggle to guarantee diversity across multiple concurrent generation runs. We address this limitation by introducing STARS ($\textbf{St}$iefel-based $\textbf{A}$ctivation Steering for Diverse $\textbf{R}$ea$\textbf{S}$oning), a training-free, inference-time intervention method that transforms activation steering into an exploration engine. At each token, STARS collects the hidden activations of concurrent generation runs and optimizes multiple additive steering directions jointly on the Stiefel manifold. STARS maximizes the geometric volume of the steered activations, while the Stiefel manifold induces orthogonality of the steering interventions. This formulation explicitly promotes divergent activation vectors of concurrent generation runs, and implicitly promotes divergent generation trajectories. This manifold optimization formulation can be solved using a Riemannian gradient descent algorithm with convergence guarantees, but this algorithm is too time-consuming for real-time inference. To guarantee low latency, we further design a lightweight one-step update with an aggressive, closed-form stepsize. For test case generation and scientific discovery benchmarks, STARS consistently outperforms standard sampling methods, achieving greater diversity without sacrificing qualitative performance.

Updated: 2026-01-29 17:17:04

标题: 通过推理时间斯蒂费尔激活引导探索多样化的生成路径

摘要: 语言模型通常会默认为一组高概率输出，使得它们的生成路径变得单一且容易出现模式崩溃。基于抽样的策略注入了随机性，但仍然难以保证在多个并发生成运行中的多样性。我们通过引入STARS（基于Stiefel的多样化推理激活引导）来解决这个限制，这是一种无需训练的、推理时间干预方法，将激活引导转化为一个探索引擎。在每个令牌上，STARS收集并优化同时生成运行的隐藏激活，并联合在Stiefel流形上优化多个加性引导方向。STARS最大化了引导激活的几何体积，而Stiefel流形诱导了引导干预的正交性。这种公式明确促进了并发生成运行的不同激活向量，并隐含促进了不同的生成轨迹。这种流形优化公式可以使用具有收敛保证的Riemannian梯度下降算法来解决，但这个算法对于实时推理来说太耗时。为了保证低延迟，我们进一步设计了一个轻量级的一步更新，具有积极的、闭式的步长。对于测试用例生成和科学发现基准测试，STARS始终优于标准抽样方法，实现了更大的多样性而不会牺牲质量表现。

更新时间: 2026-01-29 17:17:04

领域: cs.LG

下载: http://arxiv.org/abs/2601.22010v1

Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks

Echo State Networks (ESNs) are a particular type of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) framework, popular for their fast and efficient learning. However, traditional ESNs often struggle with long-term information processing. In this paper, we introduce a novel class of deep untrained RNNs based on temporal residual connections, called Deep Residual Echo State Networks (DeepResESNs). We show that leveraging a hierarchy of untrained residual recurrent layers significantly boosts memory capacity and long-term temporal modeling. For the temporal residual connections, we consider different orthogonal configurations, including randomly generated and fixed-structure configurations, and we study their effect on network dynamics. A thorough mathematical analysis outlines necessary and sufficient conditions to ensure stable dynamics within DeepResESN. Our experiments on a variety of time series tasks showcase the advantages of the proposed approach over traditional shallow and deep RC.

Updated: 2026-01-29 17:16:56

标题: 深度残余回声状态网络：探索未经训练的递归神经网络中的残余正交连接

摘要: 回声状态网络（ESNs）是一种特殊类型的未经训练的循环神经网络（RNNs），属于嵌入计算（RC）框架中的一种，以其快速高效的学习而受到广泛关注。然而，传统的ESNs通常在长期信息处理方面有困难。在本文中，我们介绍了一种基于时间残差连接的新型深度未经训练的RNNs类别，称为深度残差回声状态网络（DeepResESNs）。我们展示了利用一系列未经训练的残差循环层显著提升了记忆能力和长期时间建模。对于时间残差连接，我们考虑了不同的正交配置，包括随机生成和固定结构配置，并研究它们对网络动态的影响。通过彻底的数学分析概述了确保DeepResESN内稳定动态所需的必要和充分条件。我们在各种时间序列任务上的实验展示了所提出方法相对于传统浅层和深层RC的优势。

更新时间: 2026-01-29 17:16:56

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.21172v2

MEIDNet: Multimodal generative AI framework for inverse materials design

In this work, we present Multimodal Equivariant Inverse Design Network (MEIDNet), a framework that jointly learns structural information and materials properties through contrastive learning, while encoding structures via an equivariant graph neural network (EGNN). By combining generative inverse design with multimodal learning, our approach accelerates the exploration of chemical-structural space and facilitates the discovery of materials that satisfy predefined property targets. MEIDNet exhibits strong latent-space alignment with cosine similarity 0.96 by fusion of three modalities through cross-modal learning. Through implementation of curriculum learning strategies, MEIDNet achieves ~60 times higher learning efficiency than conventional training techniques. The potential of our multimodal approach is demonstrated by generating low-bandgap perovskite structures at a stable, unique, and novel (SUN) rate of 13.6 %, which are further validated by ab initio methods. Our inverse design framework demonstrates both scalability and adaptability, paving the way for the universal learning of chemical space across diverse modalities.

Updated: 2026-01-29 17:16:35

标题: MEIDNet：逆材料设计的多模态生成人工智能框架

摘要: 在这项工作中，我们提出了多模态等变反向设计网络（MEIDNet），这是一个通过对比学习联合学习结构信息和材料属性的框架，同时通过等变图神经网络（EGNN）对结构进行编码。通过将生成式反向设计与多模态学习相结合，我们的方法加速了化学结构空间的探索，并促进了发现满足预定义属性目标的材料。MEIDNet通过交叉模态学习将三种模态融合，展现出0.96的余弦相似度的强潜在空间对齐性。通过实施课程学习策略，MEIDNet的学习效率比传统训练技术高出约60倍。我们的多模态方法的潜力通过以13.6%的稳定、独特和新颖（SUN）速率生成低带隙钙钛矿结构进行展示，并通过从头计算方法进行进一步验证。我们的反向设计框架展示了可伸缩性和适应性，为跨不同模态学习化学空间奠定了基础。

更新时间: 2026-01-29 17:16:35

领域: cond-mat.mtrl-sci,cs.AI,cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2601.22009v1

Efficient Stochastic Optimisation via Sequential Monte Carlo

The problem of optimising functions with intractable gradients frequently arise in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to fine-tuning of generative models. Stochastic approximation methods for this class of problems typically require inner sampling loops to obtain (biased) stochastic gradient estimates, which rapidly becomes computationally expensive. In this work, we develop sequential Monte Carlo (SMC) samplers for optimisation of functions with intractable gradients. Our approach replaces expensive inner sampling methods with efficient SMC approximations, which can result in significant computational gains. We establish convergence results for the basic recursions defined by our methodology which SMC samplers approximate. We demonstrate the effectiveness of our approach on the reward-tuning of energy-based models within various settings.

Updated: 2026-01-29 17:13:25

标题: 通过序贯蒙特卡洛实现高效随机优化

摘要: 在机器学习和统计学中，优化具有难以处理的梯度的函数的问题经常出现，涵盖了从最大边缘似然估计程序到生成模型微调的范围。针对这类问题的随机逼近方法通常需要内部抽样循环来获得（有偏）随机梯度估计，这很快变得计算昂贵。在这项工作中，我们开发了用于优化具有难以处理的梯度的函数的顺序蒙特卡洛（SMC）采样器。我们的方法用高效的SMC近似替换昂贵的内部抽样方法，可以带来显著的计算收益。我们为我们的方法定义的基本递归建立了收敛结果，这些递归由SMC采样器近似。我们在各种设置中展示了我们的方法在能量基模型的奖励调整上的有效性。

更新时间: 2026-01-29 17:13:25

领域: stat.ML,cs.LG,stat.CO

下载: http://arxiv.org/abs/2601.22003v1

Rate-Distortion Optimization for Transformer Inference

Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. This inference can be made more efficient by partitioning the process across multiple devices, which, in turn, requires compressing its intermediate representations. In this work, we introduce a principled rate-distortion-based framework for lossy compression that learns compact encodings that explicitly trade off bitrate against accuracy. Experiments on language benchmarks show that the proposed codec achieves substantial savings with improved accuracy in some cases, outperforming more complex baseline methods. We characterize and analyze the rate-distortion performance of transformers, offering a unified lens for understanding performance in representation coding. This formulation extends information-theoretic concepts to define the gap between rate and entropy, and derive some of its bounds. We further develop probably approximately correct (PAC)-style bounds for estimating this gap. For different architectures and tasks, we empirically demonstrate that their rates are driven by these bounds, adding to the explainability of the formulation.

Updated: 2026-01-29 17:12:46

标题: Transformer 推断的速率失真优化

摘要: Transformer在许多任务上实现了出色的性能，但在推断过程中需要大量的计算和内存资源。将推断过程分配到多个设备上可以使推断更加高效，而这又需要对中间表示进行压缩。在这项工作中，我们引入了一个基于率失真的原则性框架，用于有损压缩，学习明确权衡比特率和准确性的紧凑编码。对语言基准测试的实验表明，所提出的编解码器在某些情况下实现了显著的节省，并提高了准确性，优于更复杂的基线方法。我们对Transformer的率失真性能进行了表征和分析，为理解表示编码性能提供了一个统一的视角。这种表述将信息论概念扩展到定义率与熵之间的差距，并推导出一些其上限。我们进一步开发了用于估计这一差距的可能近似正确(PAC)-风格界限。对于不同的架构和任务，我们经验性地证明它们的速率受到这些界限的驱动，增加了对表述的可解释性。

更新时间: 2026-01-29 17:12:46

领域: cs.LG,cs.IT

下载: http://arxiv.org/abs/2601.22002v1

LLMs as Orchestrators: Constraint-Compliant Multi-Agent Optimization for Recommendation Systems

Recommendation systems must optimize multiple objectives while satisfying hard business constraints such as fairness and coverage. For example, an e-commerce platform may require every recommendation list to include items from multiple sellers and at least one newly listed product; violating such constraints--even once--is unacceptable in production. Prior work on multi-objective recommendation and recent LLM-based recommender agents largely treat constraints as soft penalties or focus on item scoring and interaction, leading to frequent violations in real-world deployments. How to leverage LLMs for coordinating constrained optimization in recommendation systems remains underexplored. We propose DualAgent-Rec, an LLM-coordinated dual-agent framework for constrained multi-objective e-commerce recommendation. The framework separates optimization into an Exploitation Agent that prioritizes accuracy under hard constraints and an Exploration Agent that promotes diversity through unconstrained Pareto search. An LLM-based coordinator adaptively allocates resources between agents based on optimization progress and constraint satisfaction, while an adaptive epsilon-relaxation mechanism guarantees feasibility of final solutions. Experiments on the Amazon Reviews 2023 dataset demonstrate that DualAgent-Rec achieves 100% constraint satisfaction and improves Pareto hypervolume by 4-6% over strong baselines, while maintaining competitive accuracy-diversity trade-offs. These results indicate that LLMs can act as effective orchestration agents for deployable and constraint-compliant recommendation systems.

Updated: 2026-01-29 17:11:09

标题: LLMs作为指挥者：符合约束的多智能体优化在推荐系统中的应用

摘要: 推荐系统必须在满足公平性和覆盖率等硬业务约束的同时优化多个目标。例如，电子商务平台可能要求每个推荐列表包含来自多个卖家的商品以及至少一个新上架产品；在生产中违反这些约束是不可接受的。先前关于多目标推荐和最近基于LLM的推荐代理的工作主要将约束视为软惩罚或专注于物品评分和交互，导致在实际部署中频繁违反约束。如何利用LLM来协调推荐系统中的约束优化仍未得到充分探讨。我们提出了DualAgent-Rec，一个LLM协调的双代理框架，用于受约束的多目标电子商务推荐。该框架将优化分为一个优先考虑在硬约束下准确性的开发代理和一个通过无约束帕累托搜索促进多样性的探索代理。基于优化进展和约束满足程度，LLM协调器自适应地分配资源给代理，而自适应的epsilon-松弛机制确保最终解的可行性。对亚马逊评论2023数据集的实验表明，DualAgent-Rec实现了100%的约束满足，并将帕累托超体积提高了4-6%以上的强基线，同时保持了具有竞争力的准确性-多样性权衡。这些结果表明，LLM可以作为有效的编排代理，用于可部署和符合约束的推荐系统。

更新时间: 2026-01-29 17:11:09

领域: cs.IR,cs.AI,cs.LG,cs.MA

下载: http://arxiv.org/abs/2601.19121v2

Negatives-Dominant Contrastive Learning for Generalization in Imbalanced Domains

Imbalanced Domain Generalization (IDG) focuses on mitigating both domain and label shifts, both of which fundamentally shape the model's decision boundaries, particularly under heterogeneous long-tailed distributions across domains. Despite its practical significance, it remains underexplored, primarily due to the technical complexity of handling their entanglement and the paucity of theoretical foundations. In this paper, we begin by theoretically establishing the generalization bound for IDG, highlighting the role of posterior discrepancy and decision margin. This bound motivates us to focus on directly steering decision boundaries, marking a clear departure from existing methods. Subsequently, we technically propose a novel Negative-Dominant Contrastive Learning (NDCL) for IDG to enhance discriminability while enforce posterior consistency across domains. Specifically, inter-class decision-boundary separation is enhanced by placing greater emphasis on negatives as the primary signal in our contrastive learning, naturally amplifying gradient signals for minority classes to avoid the decision boundary being biased toward majority classes. Meanwhile, intra-class compactness is encouraged through a re-weighted cross-entropy strategy, and posterior consistency across domains is enforced through a prediction-central alignment strategy. Finally, rigorous yet challenging experiments on benchmarks validate the effectiveness of our NDCL. The code is available at https://github.com/Alrash/NDCL.

Updated: 2026-01-29 17:09:38

标题: 在不平衡领域中用于泛化的负面主导对比学习

摘要: 不平衡领域泛化（IDG）侧重于减轻领域和标签偏移，这两者从根本上塑造了模型的决策边界，特别是在跨领域之间存在异质长尾分布的情况下。尽管具有实际意义，但由于处理它们的纠缠的技术复杂性以及理论基础的匮乏，该问题仍然未被充分探讨。在本文中，我们首先在理论上建立了IDG的泛化界限，强调后验差异和决策边界的作用。这个界限激励我们专注于直接引导决策边界，标志着与现有方法的明显分歧。随后，我们在技术上提出了一种新颖的负主导对比学习（NDCL）用于IDG，以增强可区分性同时在领域之间强化后验一致性。具体来说，我们在对比学习中更加强调负样本作为主要信号，通过自然地增加少数类别的梯度信号来增强类间决策边界分离，以避免决策边界偏向多数类别。同时，通过重新加权的交叉熵策略鼓励类内紧凑度，并通过预测中心对齐策略在领域之间强制后验一致性。最后，我们对基准数据集进行了严格而具有挑战性的实验，验证了我们NDCL的有效性。代码可在https://github.com/Alrash/NDCL找到。

更新时间: 2026-01-29 17:09:38

领域: cs.LG

下载: http://arxiv.org/abs/2601.21999v1

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

Updated: 2026-01-29 17:06:54

标题: 机制数据归因：追踪可解释LLM单元的训练起源

摘要: 机械解释性已经在LLMs中确定了可解释的电路，但它们在训练数据中的因果起源仍然难以捉摸。我们引入了机械数据归因（MDA），这是一个可扩展的框架，利用影响函数将可解释的单元追溯到特定的训练样本。通过在Pythia家族上进行大量实验，我们因果验证了有针对性的干预——移除或增加一小部分高影响样本——显著调节了可解释头的出现，而随机干预则没有效果。我们的分析揭示了重复的结构化数据（例如LaTeX、XML）作为一个机械的催化剂。此外，我们观察到针对感应头形成的干预引发了模型在上下文学习（ICL）能力上的同步变化。这为关于感应头和ICL功能联系的长期假设提供了直接的因果证据。最后，我们提出了一个机械数据增强管道，可以在模型规模上持续加速电路的收敛，为引导LLMs的发展轨迹提供了一个有原则的方法。

更新时间: 2026-01-29 17:06:54

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21996v1

Geometry of Drifting MDPs with Path-Integral Stability Certificates

Real-world reinforcement learning is often \emph{nonstationary}: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action. Existing theory often represents nonstationarity with coarse-scale models that measure \emph{how much} the environment changes, not \emph{how} it changes locally -- even though acceleration and near-ties drive tracking error and policy chattering. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point. This yields a length--curvature--kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that certify local stability away from switch regimes. Building on these results, we introduce \textit{Homotopy-Tracking RL (HT-RL)} and \textit{HT-MCTS}, lightweight wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. Experiments show improved tracking and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone regimes.

Updated: 2026-01-29 17:03:23

标题: 漂移MDP的几何结构及路径积分稳定性证明

摘要: 实际世界中的强化学习通常是\emph{非稳态}的：奖励和动态会漂移、加速、振荡，并触发最佳动作的突然转换。现有理论通常用粗粒度模型来表示非稳态，这些模型衡量环境变化的\emph{程度}，而不是局部变化的\emph{方式} -- 尽管加速和近似值会引起跟踪误差和策略抖动。我们以几何视角看待非稳态的贴现马尔可夫决策过程（MDP），将环境建模为一个可微同伦路径，并跟踪最优贝尔曼不动点的诱导运动。这产生了内在复杂性的长度--曲率--结节签名：累积漂移、加速/振荡和动作间隙引起的非光滑性。我们证明了一个与求解器无关的路径积分稳定性界，并推导出认证了远离转换区域的局部稳定性的间隙安全可行区域。基于这些结果，我们引入了\textit{同伦跟踪RL（HT-RL）}和\textit{HT-MCTS}，这些轻量级包装器在线估计基于重放的长度、曲率和近似值接近程度，并相应地调整学习或规划强度。实验证明，在匹配的静态基线上实现了更好的跟踪和动态遗憾，尤其是在振荡和易于转换的区域获得了最大的增益。

更新时间: 2026-01-29 17:03:23

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21991v1

Batched First-Order Methods for Parallel LP Solving in MIP

We present a batched first-order method for solving multiple linear programs in parallel on GPUs. Our approach extends the primal-dual hybrid gradient algorithm to efficiently solve batches of related linear programming problems that arise in mixed-integer programming techniques such as strong branching and bound tightening. By leveraging matrix-matrix operations instead of repeated matrix-vector operations, we obtain significant computational advantages on GPU architectures. We demonstrate the effectiveness of our approach on various case studies and identify the problem sizes where first-order methods outperform traditional simplex-based solvers depending on the computational environment one can use. This is a significant step for the design and development of integer programming algorithms tightly exploiting GPU capabilities where we argue that some specific operations should be allocated to GPUs and performed in full instead of using light-weight heuristic approaches on CPUs.

Updated: 2026-01-29 17:02:46

标题: 批处理一阶方法用于并行LP求解中的MIP

摘要: 我们提出了一种批处理的一阶方法，用于在GPU上并行解决多个线性规划问题。我们的方法将原始-对偶混合梯度算法扩展到有效地解决出现在混合整数规划技术中的一批相关线性规划问题，如强分支和边界收紧。通过利用矩阵-矩阵运算而不是重复的矩阵-向量运算，我们在GPU架构上获得了显著的计算优势。我们在各种案例研究中展示了我们方法的有效性，并确定了在计算环境中可以使用的情况下，一阶方法在问题规模上优于传统的单纯形法求解器。这对于紧密利用GPU能力设计和开发整数规划算法是一项重要的进展，我们认为一些特定操作应该分配给GPU并完全执行，而不是在CPU上使用轻量级启发式方法。

更新时间: 2026-01-29 17:02:46

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2601.21990v1

Generalized Information Gathering Under Dynamics Uncertainty

An agent operating in an unknown dynamical system must learn its dynamics from observations. Active information gathering accelerates this learning, but existing methods derive bespoke costs for specific modeling choices: dynamics models, belief update procedures, observation models, and planners. We present a unifying framework that decouples these choices from the information-gathering cost by explicitly exposing the causal dependencies between parameters, beliefs, and controls. Using this framework, we derive a general information-gathering cost based on Massey's directed information that assumes only Markov dynamics with additive noise and is otherwise agnostic to modeling choices. We prove that the mutual information cost used in existing literature is a special case of our cost. Then, we leverage our framework to establish an explicit connection between the mutual information cost and information gain in linearized Bayesian estimation, thereby providing theoretical justification for mutual information-based active learning approaches. Finally, we illustrate the practical utility of our framework through experiments spanning linear, nonlinear, and multi-agent systems.

Updated: 2026-01-29 17:00:35

标题: 动态不确定性下的广义信息收集

摘要: 一个在未知动态系统中运作的代理必须从观测中学习其动态。积极的信息收集可以加速这种学习，但现有的方法为特定建模选择（动态模型、信念更新程序、观测模型和规划器）制定了定制成本。我们提出了一个统一的框架，通过明确暴露参数、信念和控制之间的因果依赖关系，将这些选择与信息收集成本解耦。利用这个框架，我们基于Massey的有向信息推导出一个基于马尔可夫动态和加性噪声的通用信息收集成本，否则对建模选择是不可知的。我们证明了现有文献中使用的互信息成本是我们成本的一个特例。然后，我们利用我们的框架建立了互信息成本与线性化贝叶斯估计中的信息增益之间的明确连接，从而为基于互信息的积极学习方法提供了理论上的证明。最后，我们通过涵盖线性、非线性和多代理系统的实验，展示了我们框架的实际效用。

更新时间: 2026-01-29 17:00:35

领域: cs.LG,cs.AI,cs.MA,cs.RO,eess.SY

下载: http://arxiv.org/abs/2601.21988v1

Elign: Equivariant Diffusion Model Alignment from Foundational Machine Learning Force Fields

Generative models for 3D molecular conformations must respect Euclidean symmetries and concentrate probability mass on thermodynamically favorable, mechanically stable structures. However, E(3)-equivariant diffusion models often reproduce biases from semi-empirical training data rather than capturing the equilibrium distribution of a high-fidelity Hamiltonian. While physics-based guidance can correct this, it faces two computational bottlenecks: expensive quantum-chemical evaluations (e.g., DFT) and the need to repeat such queries at every sampling step. We present Elign, a post-training framework that amortizes both costs. First, we replace expensive DFT evaluations with a faster, pretrained foundational machine-learning force field (MLFF) to provide physical signals. Second, we eliminate repeated run-time queries by shifting physical steering to the training phase. To achieve the second amortization, we formulate reverse diffusion as a reinforcement learning problem and introduce Force--Energy Disentangled Group Relative Policy Optimization (FED-GRPO) to fine-tune the denoising policy. FED-GRPO includes a potential-based energy reward and a force-based stability reward, which are optimized and group-normalized independently. Experiments show that Elign generates conformations with lower gold-standard DFT energies and forces, while improving stability. Crucially, inference remains as fast as unguided sampling, since no energy evaluations are required during generation.

Updated: 2026-01-29 17:00:09

标题: Elign: 来自基础机器学习力场的等变扩散模型对齐

摘要: 用于3D分子构象的生成模型必须遵守欧几里得对称性，并集中概率质量于热力学有利且力学稳定的结构上。然而，E(3)-等变扩散模型经常重现半经验训练数据的偏见，而不是捕捉高保真汉密尔顿量的平衡分布。虽然基于物理的指导可以纠正这一点，但面临两个计算瓶颈：昂贵的量子化学评估（例如，DFT）和需要在每个采样步骤中重复此类查询。我们提出Elign，一个后训练框架，摊销这两个成本。首先，我们用更快的、预训练的基础机器学习力场（MLFF）取代昂贵的DFT评估，以提供物理信号。其次，通过将物理引导转移到训练阶段，我们消除了重复的运行时查询。为了实现第二种摊销，我们将反向扩散定式化为增强学习问题，并引入力量-能量分离组相对策略优化（FED-GRPO）来微调去噪策略。FED-GRPO包括基于势能的能量奖励和基于力的稳定性奖励，它们被独立地优化和组标准化。实验证明，Elign生成的构象具有更低的黄金标准DFT能量和力，同时提高了稳定性。关键是，在生成过程中不需要能量评估，因此推理速度与无引导采样一样快。

更新时间: 2026-01-29 17:00:09

领域: cs.LG

下载: http://arxiv.org/abs/2601.21985v1

PowerGenie: Analytically-Guided Evolutionary Discovery of Superior Reconfigurable Power Converters

Discovering superior circuit topologies requires navigating an exponentially large design space-a challenge traditionally reserved for human experts. Existing AI methods either select from predefined templates or generate novel topologies at a limited scale without rigorous verification, leaving large-scale performance-driven discovery underexplored. We present PowerGenie, a framework for automated discovery of higher-performance reconfigurable power converters at scale. PowerGenie introduces: (1) an automated analytical framework that determines converter functionality and theoretical performance limits without component sizing or SPICE simulation, and (2) an evolutionary finetuning method that co-evolves a generative model with its training distribution through fitness selection and uniqueness verification. Unlike existing methods that suffer from mode collapse and overfitting, our approach achieves higher syntax validity, function validity, novelty rate, and figure-of-merit (FoM). PowerGenie discovers a novel 8-mode reconfigurable converter with 23% higher FoM than the best training topology. SPICE simulations confirm average absolute efficiency gains of 10% across 8 modes and up to 17% at a single mode. Code is available at https://github.com/xz-group/PowerGenie.

Updated: 2026-01-29 16:59:59

标题: PowerGenie：优化可重构电源转换器的分析导向演化发现

摘要: 发现优越的电路拓扑结构需要在一个指数级别的设计空间中进行导航，这是传统上专门留给人类专家的挑战。现有的人工智能方法要么从预定义模板中选择，要么在有限规模上生成新的拓扑结构，但缺乏严格的验证，使得大规模性能驱动的发现尚未得到充分探索。我们提出了PowerGenie，一个用于自动发现更高性能可重构功率转换器的框架。PowerGenie引入了：（1）一个自动化的分析框架，确定转换器功能和理论性能极限，无需组件大小或SPICE模拟；（2）一种进化微调方法，通过适应性选择和独特性验证，共同演化生成模型和其训练分布。与现有方法存在模式崩溃和过拟合问题不同，我们的方法实现了更高的语法有效性、功能有效性、新颖性率和优势因子（FoM）。PowerGenie发现了一种新型的8模式可重构转换器，其FoM比最佳训练拓扑结构高出23%。SPICE模拟确认在8种模式下平均绝对效率提高了10%，在单一模式下高达17%。代码可在https://github.com/xz-group/PowerGenie获取。

更新时间: 2026-01-29 16:59:59

领域: cs.LG,cs.AR

下载: http://arxiv.org/abs/2601.21984v1

Investigating Batch Inference in a Sequential Monte Carlo Framework for Neural Networks

Bayesian inference allows us to define a posterior distribution over the weights of a generic neural network (NN). Exact posteriors are usually intractable, in which case approximations can be employed. One such approximation - variational inference - is computationally efficient when using mini-batch stochastic gradient descent as subsets of the data are used for likelihood and gradient evaluations, though the approach relies on the selection of a variational distribution which sufficiently matches the form of the posterior. Particle-based methods such as Markov chain Monte Carlo and Sequential Monte Carlo (SMC) do not assume a parametric family for the posterior by typically require higher computational cost. These sampling methods typically use the full-batch of data for likelihood and gradient evaluations, which contributes to this computational expense. We explore several methods of gradually introducing more mini-batches of data (data annealing) into likelihood and gradient evaluations of an SMC sampler. We find that we can achieve up to $6\times$ faster training with minimal loss in accuracy on benchmark image classification problems using NNs.

Updated: 2026-01-29 16:59:31

标题: 在神经网络的序贯蒙特卡洛框架中研究批处理推断

摘要: 贝叶斯推断允许我们定义一个普通神经网络（NN）权重的后验分布。精确后验通常难以计算，此时可以采用近似方法。其中一种近似方法是变分推断，当使用小批量随机梯度下降时，计算效率很高，因为数据的子集用于似然和梯度评估，尽管这种方法依赖于选择与后验形式相匹配的变分分布。基于粒子的方法，如马尔可夫链蒙特卡洛和顺序蒙特卡洛（SMC），不假设后验的参数族，但通常需要更高的计算成本。这些抽样方法通常使用全部数据批量进行似然和梯度评估，这导致了计算成本的增加。我们探索了逐渐引入更多小批量数据（数据退火）到顺序蒙特卡洛采样器的似然和梯度评估中的几种方法。我们发现，在基准图像分类问题中，使用神经网络可以实现高达6倍的更快训练速度，同时几乎不损失准确性。

更新时间: 2026-01-29 16:59:31

领域: cs.LG

下载: http://arxiv.org/abs/2601.21983v1

Investigation into using stochastic embedding representations for evaluating the trustworthiness of the Fréchet Inception Distance

Feature embeddings acquired from pretrained models are widely used in medical applications of deep learning to assess the characteristics of datasets; e.g. to determine the quality of synthetic, generated medical images. The Fréchet Inception Distance (FID) is one popular synthetic image quality metric that relies on the assumption that the characteristic features of the data can be detected and encoded by an InceptionV3 model pretrained on ImageNet1K (natural images). While it is widely known that this makes it less effective for applications involving medical images, the extent to which the metric fails to capture meaningful differences in image characteristics is not obviously known. Here, we use Monte Carlo dropout to compute the predictive variance in the FID as well as a supplemental estimate of the predictive variance in the feature embedding model's latent representations. We show that the magnitudes of the predictive variances considered exhibit varying degrees of correlation with the extent to which test inputs (ImageNet1K validation set augmented at various strengths, and other external datasets) are out-of-distribution relative to its training data, providing some insight into the effectiveness of their use as indicators of the trustworthiness of the FID.

Updated: 2026-01-29 16:56:01

标题: 使用随机嵌入表示来评估Fréchet Inception距离的可信度的调查

摘要: 使用预训练模型获取的特征嵌入在深度学习的医学应用中被广泛使用，用于评估数据集的特征；例如，用于确定合成生成的医学图像的质量。 Fréchet Inception Distance（FID）是一种流行的合成图像质量度量标准，它依赖于这样一个假设：数据的特征可以被InceptionV3模型（在ImageNet1K上预训练，即自然图像）检测和编码。尽管众所周知，这使得它在涉及医学图像的应用中不太有效，但该度量标准未能明显捕捉到图像特征的有意义差异的程度尚不明确。在这里，我们使用蒙特卡罗辍学（Monte Carlo dropout）来计算FID中的预测方差，以及特征嵌入模型的潜在表示中的预测方差的补充估计。我们展示了考虑的预测方差的大小与测试输入（在不同强度下增强的ImageNet1K验证集，以及其他外部数据集）相对于其训练数据的分布外情况的程度之间存在不同程度的相关性，为了解它们作为FID可信度指标的有效性提供一些见解。

更新时间: 2026-01-29 16:56:01

领域: cs.LG

下载: http://arxiv.org/abs/2601.21979v1

Bridging Graph Structure and Knowledge-Guided Editing for Interpretable Temporal Knowledge Graph Reasoning

Temporal knowledge graph reasoning (TKGR) aims to predict future events by inferring missing entities with dynamic knowledge structures. Existing LLM-based reasoning methods prioritize contextual over structural relations, struggling to extract relevant subgraphs from dynamic graphs. This limits structural information understanding, leading to unstructured, hallucination-prone inferences especially with temporal inconsistencies. To address this problem, we propose IGETR (Integration of Graph and Editing-enhanced Temporal Reasoning), a hybrid reasoning framework that combines the structured temporal modeling capabilities of Graph Neural Networks (GNNs) with the contextual understanding of LLMs. IGETR operates through a three-stage pipeline. The first stage aims to ground the reasoning process in the actual data by identifying structurally and temporally coherent candidate paths through a temporal GNN, ensuring that inference starts from reliable graph-based evidence. The second stage introduces LLM-guided path editing to address logical and semantic inconsistencies, leveraging external knowledge to refine and enhance the initial paths. The final stage focuses on integrating the refined reasoning paths to produce predictions that are both accurate and interpretable. Experiments on standard TKG benchmarks show that IGETR achieves state-of-the-art performance, outperforming strong baselines with relative improvements of up to 5.6% on Hits@1 and 8.1% on Hits@3 on the challenging ICEWS datasets. Additionally, we execute ablation studies and additional analyses confirm the effectiveness of each component.

Updated: 2026-01-29 16:55:13

标题: 桥接图结构与知识引导编辑，实现可解释的时间知识图推理

摘要: 时间知识图推理（TKGR）旨在通过推断动态知识结构中的缺失实体来预测未来事件。现有基于LLM的推理方法优先考虑上下文关系而非结构关系，难以从动态图中提取相关子图。这限制了结构信息的理解，导致无结构、易产生幻觉的推理，尤其是在时间上存在矛盾的情况下。为解决这一问题，我们提出了IGETR（图与增强型时间推理的整合），这是一个混合推理框架，结合了图神经网络（GNNs）的结构化时间建模能力和LLMs的上下文理解能力。IGETR通过一个三阶段流程运行。第一阶段旨在通过在时间GNN中识别结构和时间上连贯的候选路径，确保推理始于可靠的基于图的证据。第二阶段引入了LLM引导的路径编辑，以解决逻辑和语义不一致性，利用外部知识来完善和增强初始路径。最后一个阶段专注于整合经过改进的推理路径，以产生准确且可解释的预测结果。在标准TKG基准上的实验证明，IGETR实现了最先进的性能，相对改进达到了5.6％的Hits@1和8.1％的Hits@3，在具有挑战性的ICEWS数据集上超过了强基线。此外，我们进行消融研究和额外分析，证实了每个组件的有效性。

更新时间: 2026-01-29 16:55:13

领域: cs.LG

下载: http://arxiv.org/abs/2601.21978v1

Utilising Gradient-Based Proposals Within Sequential Monte Carlo Samplers for Training of Partial Bayesian Neural Networks

Partial Bayesian neural networks (pBNNs) have been shown to perform competitively with fully Bayesian neural networks while only having a subset of the parameters be stochastic. Using sequential Monte Carlo (SMC) samplers as the inference method for pBNNs gives a non-parametric probabilistic estimation of the stochastic parameters, and has shown improved performance over parametric methods. In this paper we introduce a new SMC-based training method for pBNNs by utilising a guided proposal and incorporating gradient-based Markov kernels, which gives us better scalability on high dimensional problems. We show that our new method outperforms the state-of-the-art in terms of predictive performance and optimal loss. We also show that pBNNs scale well with larger batch sizes, resulting in significantly reduced training times and often better performance.

Updated: 2026-01-29 16:52:32

标题: 使用基于梯度的建议在序贯蒙特卡洛取样器中训练部分贝叶斯神经网络

摘要: 部分贝叶斯神经网络（pBNNs）已被证明在仅有部分参数是随机的情况下，与完全贝叶斯神经网络竞争力相当。使用顺序蒙特卡洛（SMC）采样器作为pBNNs的推断方法，可以对随机参数进行非参数概率估计，并且已经显示出比参数方法更好的性能。本文介绍了一种新的基于SMC的pBNNs训练方法，通过利用引导提议和结合基于梯度的马尔科夫核，使我们在高维问题上具有更好的可伸缩性。我们展示了我们的新方法在预测性能和最优损失方面优于现有技术。我们还展示了pBNNs在较大批量大小下的良好扩展性，导致显著减少的训练时间和通常更好的性能。

更新时间: 2026-01-29 16:52:32

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2505.03797v2

MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts

Imitation learning has achieved remarkable success in robotic manipulation, yet its application to surgical robotics remains challenging due to data scarcity, constrained workspaces, and the need for an exceptional level of safety and predictability. We present a supervised Mixture-of-Experts (MoE) architecture designed for phase-structured surgical manipulation tasks, which can be added on top of any autonomous policy. Unlike prior surgical robot learning approaches that rely on multi-camera setups or thousands of demonstrations, we show that a lightweight action decoder policy like Action Chunking Transformer (ACT) can learn complex, long-horizon manipulation from less than 150 demonstrations using solely stereo endoscopic images, when equipped with our architecture. We evaluate our approach on the collaborative surgical task of bowel grasping and retraction, where a robot assistant interprets visual cues from a human surgeon, executes targeted grasping on deformable tissue, and performs sustained retraction. We benchmark our method against state-of-the-art Vision-Language-Action (VLA) models and the standard ACT baseline. Our results show that generalist VLAs fail to acquire the task entirely, even under standard in-distribution conditions. Furthermore, while standard ACT achieves moderate success in-distribution, adopting a supervised MoE architecture significantly boosts its performance, yielding higher success rates in-distribution and demonstrating superior robustness in out-of-distribution scenarios, including novel grasp locations, reduced illumination, and partial occlusions. Notably, it generalizes to unseen testing viewpoints and also transfers zero-shot to ex vivo porcine tissue without additional training, offering a promising pathway toward in vivo deployment. To support this, we present qualitative preliminary results of policy roll-outs during in vivo porcine surgery.

Updated: 2026-01-29 16:50:14

标题: MoE-ACT：通过监督专家混合改进手术模仿学习策略

摘要: 模䓬学在机器人操作中取得了显著的成功，但在外科机器人领域的应用仍然具有挑战性，原因在于数据稀缺、工作空间受限以及对安全性和可预测性的异常要求。我们提出了一种针对阶段结构外科操作任务设计的监督式专家混合（MoE）架构，可以添加到任何自主策略之上。与之前依赖多摄像头设置或成千上万次演示的外科机器人学习方法不同，我们展示了一种轻量级动作解码器策略，如行动分块变换器（ACT），在配备我们的架构后可以仅使用立体内窥镜图像从不到150次演示中学习复杂、长期的操作。我们在协作外科任务——肠抓取和牵引中评估了我们的方法，其中机器人助手根据人类外科医生的视觉提示执行有针对性的抓取变形组织，并进行持续的牵引。我们将我们的方法与最先进的视觉语言动作（VLA）模型和标准ACT基线进行了基准测试。我们的结果表明，通用的VLA在标准的分布条件下甚至无法完全掌握任务。此外，虽然标准ACT在标准分布条件下取得了一定的成功，但采用监督式MoE架构显著提升了其性能，在分布内取得更高的成功率，并在分布外情景中表现出更强的鲁棒性，包括新的抓取位置、减少的照明和部分遮挡。值得注意的是，它可以泛化到未见过的测试视角，也可以零次迁移到离体猪组织，而无需额外训练，为实现体内部署提供了一个有希望的途径。为了支持这一点，我们呈现了在体猪手术过程中政策实施的定性初步结果。

更新时间: 2026-01-29 16:50:14

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21971v1

Secure Group Key Agreement on Cyber-Physical System Buses

Cyber-Physical Systems (CPSs) rely on distributed embedded devices that often must communicate securely over buses. Ensuring message integrity and authenticity on these buses typically requires group-shared keys for Message Authentication Codes (MACs). To avoid insecure fixed pre-shared keys and trust-on-first-use concepts, a Group Key Agreement (GKA) protocol is needed to dynamically agree on a key amongst the devices. Yet existing GKA protocols lack adaptability to constrained CPS buses. This paper targets authenticated, fully distributed GKA suitable for bus topologies under constraints of industrial and cyber-physical systems, including broadcast-only links, half-duplex operation, resource limits, dynamic membership (including unannounced leaves), a long device lifetime, and a strong Dolev-Yao adversary capable of partitioning the bus. We first systematise existing protocols, then derive the requirements necessary for an authenticated and fully distributed GKA on bus systems. Finally, we design, implement, and evaluate a custom GKA protocol based on TreeKEM.

Updated: 2026-01-29 16:45:05

标题: 在网络物理系统总线上的安全组密钥协商

摘要: 网络物理系统（CPS）依赖于分布式嵌入式设备，这些设备通常必须通过总线进行安全通信。确保这些总线上的消息完整性和真实性通常需要用于消息认证码（MAC）的组共享密钥。为了避免不安全的固定预共享密钥和第一次使用时信任的概念，需要一个组密钥协议（GKA）协议来动态地在设备之间达成一致。然而，现有的GKA协议缺乏适应受限制的CPS总线的灵活性。本文针对适用于工业和网络物理系统的总线拓扑约束条件下的经过认证的、完全分布式的GKA。这些约束条件包括仅广播链接、半双工操作、资源限制、动态成员资格（包括未经通知的离开）、长期设备寿命以及能够对总线进行分区的强Dolev-Yao对手。我们首先系统化现有的协议，然后推导出在总线系统上进行经过认证和完全分布式GKA所需的要求。最后，我们设计、实现和评估了基于TreeKEM的自定义GKA协议。

更新时间: 2026-01-29 16:45:05

领域: cs.CR

下载: http://arxiv.org/abs/2601.21966v1

One Model, Any Conjunctive Query: Graph Neural Networks for Answering Queries over Incomplete Knowledge Graphs

Motivated by the incompleteness of modern knowledge graphs, a new setup for query answering has emerged, where the goal is to predict answers that do not necessarily appear in the knowledge graph, but are present in its completion. In this paper, we formally introduce and study two query answering problems, namely, query answer classification and query answer retrieval. To solve these problems, we propose AnyCQ, a model that can classify answers to any conjunctive query on any knowledge graph. At the core of our framework lies a graph neural network trained using a reinforcement learning objective to answer Boolean queries. Trained only on simple, small instances, AnyCQ generalizes to large queries of arbitrary structure, reliably classifying and retrieving answers to queries that existing approaches fail to handle. This is empirically validated through our newly proposed, challenging benchmarks. Finally, we empirically show that AnyCQ can effectively transfer to completely novel knowledge graphs when equipped with an appropriate link prediction model, highlighting its potential for querying incomplete data.

Updated: 2026-01-29 16:42:37

标题: 一个模型，任意合取查询：图神经网络用于回答不完整知识图中的查询

摘要: 受现代知识图谱不完整性的启发，出现了一种新的查询回答设置，其目标是预测不一定出现在知识图谱中，但存在于其完整性中的答案。在本文中，我们正式介绍和研究了两个查询回答问题，即查询答案分类和查询答案检索。为了解决这些问题，我们提出了AnyCQ，这是一个可以对任何知识图谱上的任何合取查询进行答案分类的模型。我们的框架的核心是一个使用强化学习目标训练的图神经网络，用于回答布尔查询。只在简单、小实例上训练，AnyCQ可以推广到任意结构的大查询，可靠地分类和检索那些现有方法无法处理的查询的答案。通过我们新提出的具有挑战性的基准测试来实证验证这一点。最后，我们通过实验证明，当配备适当的链接预测模型时，AnyCQ可以有效地转移到完全新颖的知识图谱中，突出其在查询不完整数据方面的潜力。

更新时间: 2026-01-29 16:42:37

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2409.13959v3

From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation

Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLM) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol

Updated: 2026-01-29 16:42:24

标题: 从代币到区块：分子生成的区块扩散视角

摘要: 药物发现可以被视为在巨大的化学空间中进行组合搜索，促使开发用于全新分子设计的深度生成模型。在这些模型中，基于GPT的分子语言模型（MLM）通过从大规模数据中学习化学语法和语义展现了强大的分子设计性能。然而，现有的MLM面临两个基本限制：当被构建为下一个标记预测问题时，它们未能充分捕捉分子的图结构特性，并且它们通常缺乏目标感知生成的明确机制。在这里，我们提出了SoftMol，一个统一框架，用于目标感知的分子生成，该框架共同设计了分子表示、模型架构和搜索策略。SoftMol引入了软片段，这是SMILES的无规则块表示，实现了扩散本地建模，并开发了SoftBD，这是第一个将局部双向扩散与分子结构约束下的自回归生成相结合的块扩散分子语言模型。为了促进具有高药物相似性和合成可达性的生成分子，SoftBD在一个精心策划的名为ZINC-Curated的数据集上进行了训练。SoftMol进一步集成了一个门控蒙特卡罗树搜索，以目标感知方式组装片段。实验结果表明，与当前最先进的模型相比，SoftMol实现了100%的化学有效性，将结合亲和力提高了9.7%，分子多样性增加了2-3倍，并且推断效率提高了6.6倍。代码可在https://github.com/szu-aicourse/softmol 上找到。

更新时间: 2026-01-29 16:42:24

领域: cs.LG

下载: http://arxiv.org/abs/2601.21964v1

Incorporating the ChEES Criterion into Sequential Monte Carlo Samplers

Markov chain Monte Carlo (MCMC) methods are a powerful but computationally expensive way of performing non-parametric Bayesian inference. MCMC proposals which utilise gradients, such as Hamiltonian Monte Carlo (HMC), can better explore the parameter space of interest if the additional hyper-parameters are chosen well. The No-U-Turn Sampler (NUTS) is a variant of HMC which is extremely effective at selecting these hyper-parameters but is slow to run and is not suited to GPU architectures. An alternative to NUTS, Change in the Estimator of the Expected Square HMC (ChEES-HMC) was shown not only to run faster than NUTS on GPU but also sample from posteriors more efficiently. Sequential Monte Carlo (SMC) samplers are another sampling method which instead output weighted samples from the posterior. They are very amenable to parallelisation and therefore being run on GPUs while having additional flexibility in their choice of proposal over MCMC. We incorporate (ChEEs-HMC) as a proposal into SMC samplers and demonstrate competitive but faster performance than NUTS on a number of tasks.

Updated: 2026-01-29 16:37:47

标题: 将ChEES准则纳入顺序蒙特卡洛采样器

摘要: 马尔可夫链蒙特卡罗（MCMC）方法是进行非参数贝叶斯推断的一种强大但计算成本高昂的方式。利用梯度的MCMC提议，如哈密尔顿蒙特卡罗（HMC），可以更好地探索感兴趣的参数空间，如果额外的超参数选择得当。No-U-Turn采样器（NUTS）是HMC的一种变体，极其有效地选择这些超参数，但运行缓慢，不适合GPU架构。作为NUTS的替代方案，期望平方HMC估计器的变化（ChEES-HMC）不仅显示在GPU上比NUTS运行更快，而且更有效地从后验中采样。顺序蒙特卡罗（SMC）采样器是另一种采样方法，输出加权样本而不是直接从后验中采样。它们非常适合并行化，并且可以在GPU上运行，同时在选择提议方面比MCMC具有更大的灵活性。我们将（ChEES-HMC）作为提议整合到SMC采样器中，并展示了在多项任务中与NUTS相比具有竞争力但更快的性能。

更新时间: 2026-01-29 16:37:47

领域: stat.CO,cs.LG,eess.SY,stat.ML

下载: http://arxiv.org/abs/2504.02627v2

Near-Optimal Private Tests for Simple and MLR Hypotheses

We develop a near-optimal testing procedure under the framework of Gaussian differential privacy for simple as well as one- and two-sided tests under monotone likelihood ratio conditions. Our mechanism is based on a private mean estimator with data-driven clamping bounds, whose population risk matches the private minimax rate up to logarithmic factors. Using this estimator, we construct private test statistics that achieve the same asymptotic relative efficiency as the non-private, most powerful tests while maintaining conservative type I error control. In addition to our theoretical results, our numerical experiments show that our private tests outperform competing DP methods and offer comparable power to the non-private most powerful tests, even at moderately small sample sizes and privacy loss budgets.

Updated: 2026-01-29 16:36:21

标题: 接近最佳的私人测试：简单和MLR假设

摘要: 我们在高斯差分隐私框架下为简单测试、一侧和两侧测试开发了一种接近最优的测试程序，其中满足单调似然比条件。我们的机制基于具有数据驱动夹紧边界的私有均值估计器，其总体风险与私有最小化率相匹配，直到对数因子。利用这个估计器，我们构建了可以达到与非私有、最强大测试相同渐近相对效率的私有测试统计量，同时保持保守的类型I错误控制。除了我们的理论结果外，我们的数值实验表明，我们的私有测试优于竞争的DP方法，并在适度小的样本量和隐私损失预算下提供与非私有最强大测试相当的功率。

更新时间: 2026-01-29 16:36:21

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2601.21959v1

Uncertainty-Aware Data-Based Method for Fast and Reliable Shape Optimization

Data-based optimization (DBO) offers a promising approach for efficiently optimizing shape for better aerodynamic performance by leveraging a pretrained surrogate model for offline evaluations during iterations. However, DBO heavily relies on the quality of the training database. Samples outside the training distribution encountered during optimization can lead to significant prediction errors, potentially misleading the optimization process. Therefore, incorporating uncertainty quantification into optimization is critical for detecting outliers and enhancing robustness. This study proposes an uncertainty-aware data-based optimization (UA-DBO) framework to monitor and minimize surrogate model uncertainty during DBO. A probabilistic encoder-decoder surrogate model is developed to predict uncertainties associated with its outputs, and these uncertainties are integrated into a model-confidence-aware objective function to penalize samples with large prediction errors during data-based optimization process. The UA-DBO framework is evaluated on two multipoint optimization problems aimed at improving airfoil drag divergence and buffet performance. Results demonstrate that UA-DBO consistently reduces prediction errors in optimized samples and achieves superior performance gains compared to original DBO. Moreover, compared to multipoint optimization based on full computational simulations, UA-DBO offers comparable optimization effectiveness while significantly accelerating optimization speed.

Updated: 2026-01-29 16:34:47

标题: 不确定性感知的基于数据的快速可靠形状优化方法

摘要: 数据驱动优化（DBO）提供了一种有效优化形状以实现更好空气动力性能的方法，通过利用预训练的替代模型进行离线评估。然而，DBO严重依赖于训练数据库的质量。在优化过程中遇到训练分布之外的样本可能导致显著的预测误差，潜在地误导优化过程。因此，将不确定性量化纳入优化过程对于检测异常值并增强鲁棒性至关重要。本研究提出了一种不确定性感知的数据驱动优化（UA-DBO）框架，用于监测和最小化在DBO过程中的替代模型不确定性。开发了一种概率编码器-解码器替代模型来预测与其输出相关的不确定性，并将这些不确定性整合到一个模型置信度感知的目标函数中，以惩罚在数据驱动优化过程中具有大预测误差的样本。UA-DBO框架在旨在改善翼型阻力分歧和自激振荡性能的两个多点优化问题上进行了评估。结果表明，UA-DBO在优化样本中始终减少预测误差，并相对于原始DBO实现了更高的性能增益。此外，与基于完整计算模拟的多点优化相比，UA-DBO提供了可比的优化效果，同时显著加快了优化速度。

更新时间: 2026-01-29 16:34:47

领域: cs.LG

下载: http://arxiv.org/abs/2601.21956v1

RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics

Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold. Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.

Updated: 2026-01-29 16:34:46

标题: RNAGenScape：具有多重 Langevin 动力学的属性导向、优化生成 mRNA 序列 (Note: "RNAGenScape" is likely a specific software tool or algorithm for generating mRNA sequences)

摘要: 生成优化属性的mRNA序列对于疫苗设计和蛋白质替代疗法等应用至关重要，但由于数据有限、复杂的序列-功能关系以及生物有效序列空间的狭窄，仍然具有挑战性。那些偏离数据流形的生成方法可能会产生无法折叠、翻译困难或其他非功能性序列。我们提出了RNAGenScape，这是一个基于性质引导的流形 Langevin 动力学框架，用于直接在真实数据的学习流形上生成mRNA序列。通过在这个流形上执行迭代局部优化，RNAGenScape保留了生物学的可行性，获得了可靠的指导，并避免了进入环境序列空间的非功能性区域。该框架集成了三个组件：(1) 一个与性质预测器共同训练的自编码器，用于学习性质组织的潜在流形，(2) 一个将更新投影回流形的去噪自编码器，以及 (3) 一个执行沿着流形优化的性质引导 Langevin 动力学过程。在跨越两个数量级的三个真实世界mRNA数据集中，RNAGenScape将中位性质增益提高了高达148％，成功率提高了高达30％，同时确保了生成序列的生物可行性，并相对于现有的生成方法实现了具有竞争力的推断效率。

更新时间: 2026-01-29 16:34:46

领域: q-bio.QM,cs.LG,q-bio.BM

下载: http://arxiv.org/abs/2510.24736v2

Factorizable joint shift revisited

Factorizable joint shift (FJS) represents a type of distribution shift (or dataset shift) that comprises both covariate and label shift. Recently, it has been observed that FJS actually arises from consecutive label and covariate (or vice versa) shifts. Research into FJS so far has been confined to the case of categorical labels. We propose a framework for analysing distribution shift in the case of a general label space, thus covering both classification and regression models. Based on the framework, we generalise existing results on FJS to general label spaces and present and analyse a related extension of the expectation maximisation (EM) algorithm for class prior probabilities. We also take a fresh look at generalized label shift (GLS) in the case of a general label space.

Updated: 2026-01-29 16:34:33

标题: 可分解的联合移位再探讨

摘要: Factorizable joint shift (FJS)代表一种包含协变量和标签转移的分布转移（或数据集转移）。最近观察到，FJS实际上源自连续的标签和协变量（或反之亦然）转移。迄今为止，对FJS的研究局限于分类标签的情况。我们提出了一个框架，用于分析在一般标签空间情况下的分布转移，从而涵盖分类和回归模型。基于这个框架，我们将现有关于FJS的结果推广到一般标签空间，并提出并分析了一种相关的期望最大化（EM）算法的类先验概率的扩展。我们还对一般标签转移（GLS）在一般标签空间的情况进行了重新审视。

更新时间: 2026-01-29 16:34:33

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2601.15036v2

Residual Reservoir Memory Networks

We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models.

Updated: 2026-01-29 16:33:04

标题: 残余储存器记忆网络

摘要: 我们介绍了一种新颖的未经训练的递归神经网络（RNNs）类别，属于储备计算（RC）范式，称为残留储备记忆网络（ResRMNs）。ResRMN将线性记忆储备与非线性储备结合在一起，其中后者基于沿时间维度的残余正交连接，以增强输入的长期传播。通过线性稳定性分析，研究了所得储备状态动态，并研究了不同配置的时间残余连接。所提出的方法经过实证评估，应用于时间序列和像素级1-D分类任务。我们的实验结果突出了所提出方法相对于其他传统RC模型的优势。

更新时间: 2026-01-29 16:33:04

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.09925v2

Diffusion Path Samplers via Sequential Monte Carlo

We develop a diffusion-based sampler for target distributions known up to a normalising constant. To this end, we rely on the well-known diffusion path that smoothly interpolates between a (simple) base distribution and the target distribution, widely used in diffusion models. Our approach is based on a practical implementation of diffusion-annealed Langevin Monte Carlo, which approximates the diffusion path with convergence guarantees. We tackle the score estimation problem by developing an efficient sequential Monte Carlo sampler that evolves auxiliary variables from conditional distributions along the path, which provides principled score estimates for time-varying distributions. We further develop novel control variate schedules that minimise the variance of these score estimates. Finally, we provide theoretical guarantees and empirically demonstrate the effectiveness of our method on several synthetic and real-world datasets.

Updated: 2026-01-29 16:32:12

标题: 通过顺序蒙特卡洛的扩散路径取样器

摘要: 我们开发了一种基于扩散的采样器，用于已知直到归一化常数的目标分布。为此，我们依赖于广泛应用于扩散模型中的可平滑插值基本分布和目标分布之间的众所周知的扩散路径。我们的方法是基于扩散退火Langevin Monte Carlo的实际实现，该方法近似于具有收敛保证的扩散路径。我们通过开发高效的序贯蒙特卡洛采样器来解决得分估计问题，该采样器沿着路径演化辅助变量从条件分布中得到，从而为时间变化分布提供有原则依据的得分估计。我们进一步开发了最小化这些得分估计方差的新颖控制变量调度。最后，我们提供理论保证，并在几个合成和真实世界数据集上实证证明了我们方法的有效性。

更新时间: 2026-01-29 16:32:12

领域: stat.ML,cs.LG,stat.CO

下载: http://arxiv.org/abs/2601.21951v1

Embracing Aleatoric Uncertainty in Medical Multimodal Learning with Missing Modalities

Medical multimodal learning faces significant challenges with missing modalities prevalent in clinical practice. Existing approaches assume equal contribution of modality and random missing patterns, neglecting inherent uncertainty in medical data acquisition. In this regard, we propose the Aleatoric Uncertainty Modeling (AUM) that explicitly quantifies unimodal aleatoric uncertainty to address missing modalities. Specifically, AUM models each unimodal representation as a multivariate Gaussian distribution to capture aleatoric uncertainty and enable principled modality reliability quantification. To adaptively aggregate captured information, we develop a dynamic message-passing mechanism within a bipartite patient-modality graph using uncertainty-aware aggregation mechanism. Through this process, missing modalities are naturally accommodated, while more reliable information from available modalities is dynamically emphasized to guide representation generation. Our AUM framework achieves an improvement of 2.26% AUC-ROC on MIMIC-IV mortality prediction and 2.17% gain on eICU, outperforming existing state-of-the-art approaches.

Updated: 2026-01-29 16:31:48

标题: 在医学多模态学习中接受随机不确定性，缺失模态

摘要: 医学多模态学习在临床实践中普遍存在缺失模态的重大挑战。现有方法假设模态的贡献相等且存在随机缺失模式，忽略了医学数据获取中固有的不确定性。基于此，我们提出了Aleatoric Uncertainty Modeling (AUM)，明确量化单模态的aleatoric不确定性以解决缺失模态的问题。具体而言，AUM将每个单模态表示建模为多元高斯分布，以捕捉aleatoric不确定性并实现基于原则的模态可靠性量化。为了自适应地汇总获取的信息，我们在一个双部分患者-模态图中使用基于不确定性的聚合机制开发了动态消息传递机制。通过这个过程，缺失模态自然地被纳入，同时更可靠的来自可用模态的信息被动态强调，以指导表示生成。我们的AUM框架在MIMIC-IV死亡预测上实现了2.26%的AUC-ROC改进，并在eICU上实现了2.17%的增益，优于现有的最先进方法。

更新时间: 2026-01-29 16:31:48

领域: cs.LG

下载: http://arxiv.org/abs/2601.21950v1

SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

The past years have seen Large Language Models (LLMs) strive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work showed online Reinforcement Learning (RL) could be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly reduces the scope of methods such agents could use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet, such methods may be key for LLM learning agents, and in particular when designing autonomous intrinsically motivated agents sampling and pursuing their own goals (i.e. autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the path towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.

Updated: 2026-01-29 16:31:07

标题: SAC-GLAM：使用软Actor-Critic和事后重标记改进LLM代理的在线RL

摘要: 近年来，大型语言模型（LLMs）不仅作为生成模型，还作为解决文本顺序决策任务的代理而努力。在面对复杂环境时，它们的零样本能力不足，最近的研究表明在线强化学习（RL）可以用于LLM代理以互动地发现和学习有效的策略。然而，大多数先前的工作坚持使用在线策略算法，这大大减少了这些代理可以用于探索和开发的方法的范围，比如经验重播和事后重新标记。然而，这些方法可能是LLM学习代理的关键，特别是在设计自主内在激励代理抽样和追求自己目标（即自体动机代理）时。本文提出并研究了将软演员-评论家算法和事后重新标记应用于LLM代理的适应方法。我们的方法不仅为学习在线的自体动机LLM代理铺平了道路，还可以在更经典的多目标RL环境中胜过在线策略方法。

更新时间: 2026-01-29 16:31:07

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2410.12481v3

Dependence of Equilibrium Propagation Training Success on Network Architecture

The rapid rise of artificial intelligence has led to an unsustainable growth in energy consumption. This has motivated progress in neuromorphic computing and physics-based training of learning machines as alternatives to digital neural networks. Many theoretical studies focus on simple architectures like all-to-all or densely connected layered networks. However, these may be challenging to realize experimentally, e.g. due to connectivity constraints. In this work, we investigate the performance of the widespread physics-based training method of equilibrium propagation for more realistic architectural choices, specifically, locally connected lattices. We train an XY model and explore the influence of architecture on various benchmark tasks, tracking the evolution of spatially distributed responses and couplings during training. Our results show that sparse networks with only local connections can achieve performance comparable to dense networks. Our findings provide guidelines for further scaling up architectures based on equilibrium propagation in realistic settings.

Updated: 2026-01-29 16:29:31

标题: 网络架构对平衡传播训练成功的依赖性

摘要: 人工智能的快速发展导致了能源消耗的不可持续增长。这促使了神经形态计算和基于物理训练学习机器作为数字神经网络的替代方法的进步。许多理论研究集中在简单的架构上，如全连接或密集连接的分层网络。然而，这些架构在实验上可能具有挑战性，例如由于连接约束。在这项工作中，我们研究了广泛采用的基于物理的平衡传播训练方法在更现实的架构选择上的性能，特别是局部连接的晶格。我们对XY模型进行训练，并探讨了架构对各种基准任务的影响，跟踪训练过程中空间分布响应和耦合的演变。我们的结果表明，只有局部连接的稀疏网络可以实现与密集网络相当的性能。我们的发现为在现实环境中基于平衡传播的架构进一步扩展提供了指导。

更新时间: 2026-01-29 16:29:31

领域: cs.LG,cond-mat.dis-nn,cs.ET,cs.NE

下载: http://arxiv.org/abs/2601.21945v1

Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models

The widespread adoption of Vision-Language Models (VLMs) across fields has amplified concerns about model interpretability. Distressingly, these models are often treated as black-boxes, with limited or non-existent investigation of their decision making process. Despite numerous post- and ante-hoc interepretability methods, systematic and objective evaluation of the learned representations remains limited, particularly for sparsity-aware methods that are increasingly considered to "induce interpretability". In this work, we focus on Concept Bottleneck Models and investigate how different modeling decisions affect the emerging representations. We introduce the notion of clarity, a measure, capturing the interplay between the downstream performance and the sparsity and precision of the concept representation, while proposing an interpretability assessment framework using datasets with ground truth concept annotations. We consider both VLM- and attribute predictor-based CBMs, and three different sparsity-inducing strategies: per example $\ell_1, \ell_0$ and Bernoulli-based formulations. Our experiments reveal a critical trade-off between flexibility and interpretability, under which a given method can exhibit markedly different behaviors even at comparable performance levels. The code will be made publicly available upon publication.

Updated: 2026-01-29 16:28:55

标题: 清晰度：稀疏感知概念瓶颈模型中的灵活性-可解释性权衡

摘要: 跨领域中普遍采用Vision-Language Models（VLMs）引发了人们对模型可解释性的担忧。令人沮丧的是，这些模型经常被视为黑匣子，对其决策过程的调查有限或根本不存在。尽管存在许多事后和事前可解释性方法，但对学习表示的系统和客观评估仍然有限，特别是对于越来越被认为“诱导可解释性”的稀疏感知方法。在这项工作中，我们专注于Concept Bottleneck Models，并研究不同的建模决策如何影响新兴表示。我们引入了清晰度的概念，一种衡量方法，捕捉下游性能与概念表示的稀疏性和精确性之间的相互作用，同时提出了使用具有地面真实概念注释的数据集的可解释性评估框架。我们考虑了基于VLM和属性预测器的CBMs，以及三种不同的稀疏感知策略：每个示例$\ell_1，\ell_0$和基于伯努利的公式。我们的实验揭示了灵活性和可解释性之间的关键权衡，在这种权衡下，即使在相当的性能水平上，给定方法也可能表现出明显不同的行为。代码将在发表后公开提供。

更新时间: 2026-01-29 16:28:55

领域: cs.LG

下载: http://arxiv.org/abs/2601.21944v1

Entropy-Based Dimension-Free Convergence and Loss-Adaptive Schedules for Diffusion Models

Diffusion generative models synthesize samples by discretizing reverse-time dynamics driven by a learned score (or denoiser). Existing convergence analyses of diffusion models typically scale at least linearly with the ambient dimension, and sharper rates often depend on intrinsic-dimension assumptions or other geometric restrictions on the target distribution. We develop an alternative, information-theoretic approach to dimension-free convergence that avoids any geometric assumptions. Under mild assumptions on the target distribution, we bound KL divergence between the target and generated distributions by $O(H^2/K)$ (up to endpoint factors), where $H$ is the Shannon entropy and $K$ is the number of sampling steps. Moreover, using a reformulation of the KL divergence, we propose a Loss-Adaptive Schedule (LAS) for efficient discretization of reverse SDE which is lightweight and relies only on the training loss, requiring no post-training heavy computation. Empirically, LAS improves sampling quality over common heuristic schedules.

Updated: 2026-01-29 16:28:21

标题: 基于熵的无维度收敛和损失自适应进度的扩散模型

摘要: 生成扩散模型通过离散化由学习得分（或去噪器）驱动的逆时间动力学来合成样本。现有的扩散模型收敛分析通常至少与环境维度线性相关，而更尖锐的速率往往取决于目标分布的内在维度假设或其他几何限制。我们开发了一种替代的信息论方法，实现了无维度收敛，避免了任何几何假设。在目标分布上做出温和的假设后，我们通过$O(H^2/K)$（直到端点因子）来界定目标与生成分布之间的KL散度，其中$H$是香农熵，$K$是采样步数。此外，通过重新表述KL散度，我们提出了一种Loss-Adaptive Schedule (LAS)来有效离散化逆SDE，这种方法轻便且仅依赖于训练损失，无需进行后期繁重的计算。从实证上看，LAS提高了采样质量，优于常见的启发式调度。

更新时间: 2026-01-29 16:28:21

领域: cs.LG,cs.IT

下载: http://arxiv.org/abs/2601.21943v1

Clustering in Deep Stochastic Transformers

Transformers have revolutionized deep learning across various domains but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a \emph{common} matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.

Updated: 2026-01-29 16:28:13

标题: 深度随机变换器中的聚类

摘要: 变压器已经在各个领域彻底改变了深度学习，但是理解精确的令牌动态仍然是一个理论挑战。具有层归一化的深度变压器的现有理论通常预测令牌聚集到一个单一点；然而，这些结果依赖于确定性权重假设，这些假设未能捕捉变压器中的标准初始化方案。在这项工作中，我们展示了考虑随机初始化的固有随机性如何改变这种情况。更确切地说，我们分析了深度变压器，其中噪声来自值矩阵的随机初始化。在扩散尺度和令牌级别的RMS归一化下，我们证明，随着变压器层数趋于无穷大，离散令牌动态收敛到一个在球上驱动令牌的相互作用粒子系统，其中令牌由一个\emph{共同}矩阵值布朗噪声驱动。在这种极限情况下，我们展示了初始化噪声阻止了确定性模型预测的单一聚类的崩溃。对于两个令牌，我们证明了由交互强度和令牌维度控制的相变：与确定性注意流不同，反极点配置具有正概率吸引。数值实验证实了预测的转变，揭示了反极点形成在两个以上令牌中持续存在，并且表明抑制固有噪声会降低准确性。

更新时间: 2026-01-29 16:28:13

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2601.21942v1

Robust Multimodal Representation Learning in Healthcare

Medical multimodal representation learning aims to integrate heterogeneous data into unified patient representations to support clinical outcome prediction. However, real-world medical datasets commonly contain systematic biases from multiple sources, which poses significant challenges for medical multimodal representation learning. Existing approaches typically focus on effective multimodal fusion, neglecting inherent biased features that affect the generalization ability. To address these challenges, we propose a Dual-Stream Feature Decorrelation Framework that identifies and handles the biases through structural causal analysis introduced by latent confounders. Our method employs a causal-biased decorrelation framework with dual-stream neural networks to disentangle causal features from spurious correlations, utilizing generalized cross-entropy loss and mutual information minimization for effective decorrelation. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods. Comprehensive experiments on MIMIC-IV, eICU, and ADNI datasets demonstrate consistent performance improvements.

Updated: 2026-01-29 16:27:54

标题: 在医疗保健领域的强健多模态表示学习

摘要: 医学多模态表示学旨在将异构数据整合为统一的患者表示，以支持临床结果预测。然而，现实世界中的医学数据集通常包含来自多个来源的系统偏差，这给医学多模态表示学带来了重大挑战。现有方法通常侧重于有效的多模态融合，忽视了影响泛化能力的固有偏差特征。为了解决这些挑战，我们提出了一种双流特征去相关框架，通过引入潜在混淆因素的结构因果分析来识别和处理偏差。我们的方法采用了一个具有双流神经网络的因果偏差去相关框架，从而将因果特征与偶然相关性分离开来，利用广义交叉熵损失和互信息最小化进行有效的去相关化。该框架与模型无关，可以集成到现有的医学多模态学习方法中。对MIMIC-IV、eICU和ADNI数据集的全面实验表明，性能得到了持续改进。

更新时间: 2026-01-29 16:27:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21941v1

WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making

Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain-specific contexts such as simulations. These limitations arise from their inability to ground their broad, unstructured understanding in specific environments. To address this, we present WorldLLM, a framework that enhances LLM-based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning. WorldLLM leverages the in-context learning abilities of LLMs to guide an LLM-based world model's predictions using natural language hypotheses given in its prompt. These hypotheses are iteratively refined through a Bayesian inference framework that leverages a second LLM as the proposal distribution given collected evidence. This evidence is collected using a curiosity-driven reinforcement learning policy that explores the environment to find transitions with a low log-likelihood under our LLM-based predictive model using the current hypotheses. By alternating between refining hypotheses and collecting new evidence, our framework autonomously drives continual improvement of the predictions. Our experiments demonstrate the effectiveness of WorldLLM in a textual game environment that requires agents to manipulate and combine objects. The framework not only enhances predictive accuracy, but also generates human-interpretable theories of environment dynamics.

Updated: 2026-01-29 16:25:29

标题: WorldLLM：利用好奇驱动的理论制定改进LLMs的世界建模

摘要: 大型语言模型（LLMs）具有普遍的世界知识，但常常在结构化、领域特定的环境中，如模拟中难以生成精确的预测。这些限制源于它们无法将广泛、非结构化的理解与特定环境相联系。为了解决这个问题，我们提出了WorldLLM，这是一个通过将贝叶斯推断和自主主动探索与强化学习相结合来增强基于LLM的世界建模的框架。WorldLLM利用LLMs的上下文学习能力，通过使用其提示中给出的自然语言假设来指导基于LLM的世界模型的预测。这些假设通过一个贝叶斯推断框架进行迭代改进，该框架利用第二个LLM作为提议分布，根据收集到的证据。这些证据是通过一种好奇驱动的强化学习策略收集的，该策略探索环境，找到在当前假设下在我们基于LLM的预测模型下具有低对数似然的转换。通过在假设的改进和收集新证据之间交替，我们的框架自主驱动预测的持续改进。我们的实验表明，WorldLLM在需要代理人操作和组合对象的文本游戏环境中的有效性。该框架不仅提高了预测精度，还生成了人类可解释的环境动态理论。

更新时间: 2026-01-29 16:25:29

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2506.06725v3

A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances

Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.

Updated: 2026-01-29 16:22:04

标题: 一个核心集选择的核心集选择文献：介绍与最新进展

摘要: 核心集选择旨在解决在大规模数据集中找到一个小的、代表性子集的挑战，该子集保留了有效机器学习所需的基本模式。尽管之前已经有几项调查研究了数据缩减策略，但大多数都着重于传统的基于几何的方法或主动学习技术。相比之下，本调查通过将三大核心集研究领域——无需训练、面向训练和无标签的方法——统一为一个分类体系，呈现了更为全面的视角。我们介绍了现有工作常常忽视的子领域，包括子模块化形式、双层优化以及对无标签数据集的伪标签最近进展。此外，我们还研究了修剪策略如何影响泛化和神经缩放定律，提供了先前评论中缺失的新见解。最后，我们比较了这些方法在不同的计算、鲁棒性和性能需求下的表现，并强调了未来研究的开放挑战，如鲁棒性、异常值过滤和将核心集选择适应于基础模型。

更新时间: 2026-01-29 16:22:04

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2505.17799v2

LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution

Training data attribution (TDA) identifies which training examples most influenced a model's prediction. The best-performing TDA methods exploits gradients to define an influence function. To overcome the scalability challenge arising from gradient computation, the most popular strategy is random projection (e.g., TRAK, LoGRA). However, this still faces two bottlenecks when scaling to large training sets and high-quality attribution: \emph{(i)} storing and loading projected per-example gradients for all $N$ training examples, where query latency is dominated by I/O; and \emph{(ii)} forming the $D \times D$ inverse Hessian approximation, which costs $O(D^2)$ memory. Both bottlenecks scale with the projection dimension $D$, yet increasing $D$ is necessary for attribution quality -- creating a quality--scalability tradeoff. We introduce \textbf{LoRIF (Low-Rank Influence Functions)}, which exploits low-rank structures of gradient to address both bottlenecks. First, we store rank-$c$ factors of the projected per-example gradients rather than full matrices, reducing storage and query-time I/O from $O(D)$ to $O(c\sqrt{D})$ per layer per sample. Second, we use truncated SVD with the Woodbury identity to approximate the Hessian term in an $r$-dimensional subspace, reducing memory from $O(D^2)$ to $O(Dr)$. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$\times$ storage reduction and query-time speedup compared to LoGRA, while matching or exceeding its attribution quality. LoRIF makes gradient-based TDA practical at frontier scale.

Updated: 2026-01-29 16:18:34

标题: LoRIF：用于可伸缩训练数据归因的低秩影响函数

摘要: 培训数据归因（TDA）确定了哪些训练示例最影响模型的预测。表现最佳的TDA方法利用梯度来定义一个影响函数。为了克服由梯度计算引起的可伸缩性挑战，最流行的策略是随机投影（例如，TRAK，LoGRA）。然而，当扩展到大型训练集和高质量归因时，仍然面临两个瓶颈：（i）存储和加载所有N个训练示例的投影逐例梯度，其中查询延迟由I/O主导；和（ii）形成D×D的逆Hessian近似，成本为O（D2）内存。这两个瓶颈都随着投影维度D的增加而扩展，然而增加D对于归因质量是必要的--造成质量和可伸缩性的权衡。我们引入了LoRIF（低秩影响函数），它利用梯度的低秩结构来解决这两个瓶颈。首先，我们存储投影逐例梯度的秩为c的因子，而不是完整矩阵，将存储和查询时间I/O从O（D）降低到每层每个样本的O（c√D）。其次，我们使用带有Woodbury恒等式的截断SVD来在r维子空间中近似Hessian项，将内存从O（D2）降低到O（Dr）。在拥有数百万示例的数据集上训练的0.1B至70B参数模型上，LoRIF相比LoGRA实现了高达20倍的存储减少和查询时间加速，同时匹配或超过了其归因质量。LoRIF使基于梯度的TDA在前沿规模上变得实用。

更新时间: 2026-01-29 16:18:34

领域: cs.LG

下载: http://arxiv.org/abs/2601.21929v1

Optimistic Transfer under Task Shift via Bellman Alignment

We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT $Q$-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.

Updated: 2026-01-29 16:16:24

标题: 通过贝尔曼对齐在任务转移下的乐观转移

摘要: 我们研究在线转移强化学习（RL）在片段马尔可夫决策过程中的应用，在学习目标任务时可以利用相关源任务的经验。一个基本困难是任务相似性通常是根据奖励或转换来定义的，而在线RL算法则是基于贝尔曼回归目标运作的。因此，简单地重复使用源贝尔曼更新会引入系统偏差并使后悔保证无效。我们确定一步贝尔曼对齐是在线RL转移的正确抽象，并提出加权目标（RWT），这是一个操作级的校正，重新调整继续值并通过测度变化来补偿转换不匹配。 RWT将任务不匹配减少到一个固定的一步校正，并使得可以统计上合理地重用源数据。这种对齐产生了一个两阶段RWT $Q$-learning框架，将方差减少与偏差校正分离。在RKHS函数逼近下，我们建立了后悔界限，这些界限与任务转移的复杂性相关，而不是与目标MDP相关。在表格和神经网络设置中的实证结果表明，与单任务学习和天真的汇聚相比，我们在Bellman对齐中实现了持续的改进，这凸显了Bellman对齐作为在线RL的模型无关转移原则。

更新时间: 2026-01-29 16:16:24

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2601.21924v1

Representative Action Selection for Large Action Space Bandit Families

We study the problem of selecting a subset from a large action space shared by a family of bandits, with the goal of achieving performance nearly matching that of using the full action space. Indeed, in many natural situations, while the nominal set of actions may be large, there also exist significant correlations between the rewards of different actions. In this paper we propose an algorithm that can significantly reduce the action space when such correlations are present, without the need to a-priori know the correlation structure. We provide theoretical guarantees on the performance of the algorithm and demonstrate its practical effectiveness through empirical comparisons with Thompson Sampling and Upper Confidence Bound methods.

Updated: 2026-01-29 16:11:43

标题: 大动作空间强盗家族的代表性行动选择

摘要: 我们研究了从一个由一组赌徒共享的大动作空间中选择子集的问题，目标是实现几乎与使用完整动作空间相匹配的性能。实际上，在许多自然情况下，虽然动作的名义集可能很大，但不同动作的奖励之间也存在显著的相关性。在本文中，我们提出了一种算法，可以在存在这种相关性时显著减少动作空间，而无需事先了解相关结构。我们对算法的性能提供了理论保证，并通过与汤普逊抽样和上限置信度方法的实证比较展示了其实际效果。

更新时间: 2026-01-29 16:11:43

领域: cs.LG,math.OC,math.PR,stat.ML

下载: http://arxiv.org/abs/2505.18269v4

On Approximate Computation of Critical Points

We show that computing even very coarse approximations of critical points is intractable for simple classes of nonconvex functions. More concretely, we prove that if there exists a polynomial-time algorithm that takes as input a polynomial in $n$ variables of constant degree (as low as three) and outputs a point whose gradient has Euclidean norm at most $2^n$ whenever the polynomial has a critical point, then P=NP. The algorithm is permitted to return an arbitrary point when no critical point exists. We also prove hardness results for approximate computation of critical points under additional structural assumptions, including settings in which existence and uniqueness of a critical point are guaranteed, the function is lower bounded, and approximation is measured in terms of distance to a critical point. Overall, our results stand in contrast to the commonly-held belief that, in nonconvex optimization, approximate computation of critical points is a tractable task.

Updated: 2026-01-29 16:08:22

标题: 关于临界点的近似计算

摘要: 我们展示了即使对于简单类别的非凸函数，计算临界点的粗略近似也是棘手的。更具体地，我们证明如果存在一个多项式时间算法，该算法以$n$个变量的常数次幂（甚至低至三次幂）的多项式作为输入，并且在多项式存在临界点时输出一个梯度的欧几里得范数不超过$2^n$的点，那么P=NP。当不存在临界点时，算法可以返回任意点。我们还证明了在额外的结构假设下对临界点的近似计算的困难性结果，包括在保证存在和唯一性的情况下，函数受到下界限制，并且近似度是以距离临界点的方式来衡量。总的来说，我们的结果与通常认为在非凸优化中，计算临界点的近似是一个可行的任务的观点相矛盾。

更新时间: 2026-01-29 16:08:22

领域: math.OC,cs.CC,cs.LG,math.NA,stat.ML

下载: http://arxiv.org/abs/2601.21917v1

Near-Optimal Online Deployment and Routing for Streaming LLMs

The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples stage-wise deployment (at fixed maintenance windows) with per-query routing among live models. We introduce StageRoute, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret of $\tilde{\mathcal{O}}(T^{2/3})$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: StageRoute tracks a strong oracle under tight budgets across diverse workloads.

Updated: 2026-01-29 16:04:59

标题: 流式LLM的近最优在线部署和路由

摘要: 新的大型语言模型（LLMs）的快速出现以及老模型的过时，迫使提供者在严格的并发上限和每个查询成本预算下管理流动库存。我们将这视为一个在线决策问题，它将阶段性部署（在固定的维护窗口）与活动模型之间的每个查询路由相结合。我们引入了StageRoute，这是一个分层算法，（i）使用奖励上限置信度和成本下限置信度边界乐观地选择最多$M_{\max}$个模型用于下一个阶段，（ii）通过在部署集上解决一个受预算和吞吐量限制的贝叶斯子问题来路由每个传入的查询。我们证明了一个后悔度为$\tilde{\mathcal{O}}(T^{2/3})$，并且有一个匹配的下限，从而确立了近乎最优性，并在实证上验证了这个理论：在各种工作负载下，StageRoute在严格的预算下跟踪一个强大的奥拉克。

更新时间: 2026-01-29 16:04:59

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.17254v2

$π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying RL to large-scale flow-based VLAs (\eg, $π_0$, $π_{0.5}$) remains challenging due to intractable action log-likelihoods raised from flow matching. We address this challenge with $π_{\texttt{RL}}$, featuring two technical approaches: (1) \textbf{Flow-Noise} models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) \textbf{Flow-SDE} integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $π_{\texttt{RL}}$ across various benchmarks, with experiments demonstrating that RL yields significant performance improvements in both in-distribution and out-of-distribution settings.

Updated: 2026-01-29 16:00:57

标题: $π_\texttt{RL}$: 在基于流的视觉-语言-动作模型中进行在线RL微调

摘要: 视觉-语言-行动（VLA）模型使机器人能够理解和执行来自多模态输入的复杂任务。尽管最近的工作探索使用强化学习（RL）来自动化在扩展监督微调（SFT）中的繁琐数据收集过程，但将RL应用于大规模基于流的VLA（例如，$π_0$，$π_{0.5}$）仍然具有挑战性，原因是由于流匹配引起的难以处理的动作对数似然。我们通过$π_{\texttt{RL}}$解决了这一挑战，采用了两种技术方法：（1）\textbf{Flow-Noise}将去噪过程建模为一个具有可学习噪声网络的离散时间MDP，用于精确对数似然计算。（2）\textbf{Flow-SDE}将去噪与代理-环境交互集成，形成一个采用ODE到SDE转换的两层MDP，以实现有效的RL探索。我们在各种基准测试中评估了$π_{\texttt{RL}}$，实验证明RL在分布内和分布外环境中均能显著提高性能。

更新时间: 2026-01-29 16:00:57

领域: cs.LG

下载: http://arxiv.org/abs/2510.25889v3

Beyond the Finite Variant Property: Extending Symbolic Diffie-Hellman Group Models (Extended Version)

Diffie-Hellman groups are commonly used in cryptographic protocols. While most state-of-the-art, symbolic protocol verifiers support them to some degree, they do not support all mathematical operations possible in these groups. In particular, they lack support for exponent addition, as these tools reason about terms using unification, which is undecidable in the theory describing all Diffie-Hellman operators. In this paper we approximate such a theory and propose a semi-decision procedure to determine whether a protocol, which may use all operations in such groups, satisfies user-defined properties. We implement this approach by extending the Tamarin prover to support the full Diffie-Hellman theory, including group element multiplication and hence addition of exponents. This is the first time a state-of-the-art tool can model and reason about such protocols. We illustrate our approach's effectiveness with different case studies: ElGamal encryption and MQV. Using Tamarin, we prove security properties of ElGamal, and we rediscover known attacks on MQV.

Updated: 2026-01-29 16:00:56

标题: 超越有限变种属性：扩展符号迪菲-赫尔曼群模型（扩展版本）

摘要: 迪菲-赫尔曼组在密码协议中被广泛使用。尽管大多数最先进的符号协议验证工具在某种程度上支持它们，但它们并不支持这些组中可能存在的所有数学操作。特别是，它们缺乏对指数加法的支持，因为这些工具使用无法解决的统一性来推理描述所有迪菲-赫尔曼操作符的理论。在本文中，我们近似了这样一个理论，并提出了一个半决策过程，用于确定一个可能使用这些组中的所有操作的协议是否满足用户定义的属性。我们通过扩展Tamarin证明器来实现这一方法，以支持完整的迪菲-赫尔曼理论，包括群元素乘法和因此指数加法。这是第一次使用最先进的工具来建模和推理这样的协议。我们通过不同的案例研究展示了我们方法的有效性：ElGamal加密和MQV。使用Tamarin，我们证明了ElGamal的安全性属性，并重新发现了MQV上已知的攻击。

更新时间: 2026-01-29 16:00:56

领域: cs.CR

下载: http://arxiv.org/abs/2601.21910v1

Hardware-Triggered Backdoors

Machine learning models are routinely deployed on a wide range of computing hardware. Although such hardware is typically expected to produce identical results, differences in its design can lead to small numerical variations during inference. In this work, we show that these variations can be exploited to create backdoors in machine learning models. The core idea is to shape the model's decision function such that it yields different predictions for the same input when executed on different hardware. This effect is achieved by locally moving the decision boundary close to a target input and then refining numerical deviations to flip the prediction on selected hardware. We empirically demonstrate that these hardware-triggered backdoors can be created reliably across common GPU accelerators. Our findings reveal a novel attack vector affecting the use of third-party models, and we investigate different defenses to counter this threat.

Updated: 2026-01-29 15:59:40

标题: 硬件触发后门

摘要: 机器学习模型通常部署在各种计算硬件上。虽然这种硬件通常预期产生相同的结果，但其设计上的差异可能导致推断过程中的小数值变化。在这项工作中，我们展示了这些变化可以被利用来在机器学习模型中创建后门。核心思想是塑造模型的决策函数，使其在不同硬件上执行时对相同输入产生不同的预测。这种效应是通过将决策边界局部移动到目标输入附近，然后通过调整数字偏差来在选定的硬件上改变预测。我们经验性地证明，这些硬件触发的后门可以可靠地在常见的GPU加速器上创建。我们的研究发现了一种影响使用第三方模型的新型攻击向量，并研究了不同的防御措施来应对这一威胁。

更新时间: 2026-01-29 15:59:40

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2601.21902v1

Breaking the Regional Barrier: Inductive Semantic Topology Learning for Worldwide Air Quality Forecasting

Global air quality forecasting grapples with extreme spatial heterogeneity and the poor generalization of existing transductive models to unseen regions. To tackle this, we propose OmniAir, a semantic topology learning framework tailored for global station-level prediction. By encoding invariant physical environmental attributes into generalizable station identities and dynamically constructing adaptive sparse topologies, our approach effectively captures long-range non-Euclidean correlations and physical diffusion patterns across unevenly distributed global networks. We further curate WorldAir, a massive dataset covering over 7,800 stations worldwide. Extensive experiments show that OmniAir achieves state-of-the-art performance against 18 baselines, maintaining high efficiency and scalability with speeds nearly 10 times faster than existing models, while effectively bridging the monitoring gap in data-sparse regions.

Updated: 2026-01-29 15:58:07

标题: 打破地区障碍：全球空气质量预测的归纳语义拓扑学习

摘要: 全球空气质量预测面临极端空间异质性和现有转导模型对未知地区的泛化能力不足的挑战。为了解决这一问题，我们提出了OmniAir，这是一个专为全球站点级预测量身定制的语义拓扑学习框架。通过将不变的物理环境属性编码为可泛化的站点身份，并动态构建自适应稀疏拓扑结构，我们的方法有效捕捉了全球网络中不均匀分布的长程非欧几里德相关性和物理扩散模式。我们进一步策划了WorldAir，这是一个覆盖全球超过7800个站点的大规模数据集。广泛的实验表明，OmniAir在与18个基线模型的比较中达到了最先进的性能，保持了高效性和可扩展性，速度几乎比现有模型快10倍，同时有效地弥合了数据稀疏地区的监测差距。

更新时间: 2026-01-29 15:58:07

领域: cs.LG

下载: http://arxiv.org/abs/2601.21899v1

Making Models Unmergeable via Scaling-Sensitive Loss Landscape

The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabilities. Yet, this modularity also creates a \emph{governance gap}: downstream users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Because existing defenses are largely post-hoc and architecture-specific, they provide inconsistent protection across diverse architectures and release formats in practice. To close this gap, we propose \textsc{Trap}$^{2}$, an architecture-agnostic protection framework that encodes protection into the update during fine-tuning, regardless of whether they are released as adapters or full models. Instead of relying on architecture-dependent approaches, \textsc{Trap}$^{2}$ uses weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use, but degrades them under re-scaling that often arises in merging, undermining unauthorized merging.

Updated: 2026-01-29 15:56:55

标题: 通过敏感于缩放的损失景观使模型不可合并

摘要: 模型中心的崛起使得访问可重复使用的模型组件变得更加容易，使得模型合并成为结合能力的实用工具。然而，这种模块化也造成了一种“治理差距”：下游用户可以将发布的权重重新组合成绕过安全对齐或许可条款的未经授权的混合物。由于现有的防御措施主要是事后进行的，并且是针对特定架构的，因此在实践中，它们在不同架构和发布格式之间提供了不一致的保护。为了弥补这一差距，我们提出了Trap^2，这是一个与架构无关的保护框架，可以在微调过程中将保护编码到更新中，无论它们是作为适配器还是完整模型发布的。Trap^2不依赖于与架构相关的方法，而是使用权重重新缩放作为合并过程的简单代理。它保持发布的权重在独立使用时的有效性，但在常见的合并中出现的重新缩放下降时，会削弱未经授权的合并。

更新时间: 2026-01-29 15:56:55

领域: cs.AI,cs.CR

下载: http://arxiv.org/abs/2601.21898v1

A Low-Complexity Plug-and-Play Deep Learning Model for Generalizable Massive MIMO Precoding

Massive multiple-input multiple-output (mMIMO) downlink precoding offers high spectral efficiency but remains challenging to deploy in practice because near-optimal algorithms such as the weighted minimum mean squared error (WMMSE) are computationally expensive, and sensitive to SNR and channel-estimation quality, while existing deep learning (DL)-based solutions often lack robustness and require retraining for each deployment site. This paper proposes a plug-and-play precoder (PaPP), a DL framework with a backbone that can be trained for either fully digital (FDP) or hybrid beamforming (HBF) precoding and reused across sites, transmit-power levels, and with varying amounts of channel estimation error, avoiding the need to train a new model from scratch at each deployment. PaPP combines a high-capacity teacher and a compact student with a self-supervised loss that balances teacher imitation and normalized sum-rate, trained using meta-learning domain-generalization and transmit-power-aware input normalization. Numerical results on ray-tracing data from three unseen sites show that the PaPP FDP and HBF models both outperform conventional and deep learning baselines, after fine-tuning with a small set of local unlabeled samples. Across both architectures, PaPP achieves more than 21$\times$ reduction in modeled computation energy and maintains good performance under channel-estimation errors, making it a practical solution for energy-efficient mMIMO precoding.

Updated: 2026-01-29 15:56:07

标题: 一个低复杂度的即插即用深度学习模型，用于通用的大规模MIMO预编码

摘要: 大规模多输入多输出（mMIMO）下行预编码提供了高频谱效率，但在实践中部署仍具有挑战性，因为近似最优算法如加权最小均方误差（WMMSE）计算成本高，在信噪比和信道估计质量敏感，而现有的基于深度学习（DL）的解决方案往往缺乏稳健性，并且需要为每个部署站点重新训练。本文提出了一种即插即用预编码器（PaPP），一个具有骨干的DL框架，可以针对完全数字（FDP）或混合波束成形（HBF）预编码进行训练，并在各个站点、传输功率水平和不同数量的信道估计误差下重复使用，避免在每次部署时从头开始训练新模型。PaPP结合了高容量教师和紧凑的学生，具有自监督损失，平衡教师模仿和归一化总速率，使用元学习领域泛化和传输功率感知输入归一化进行训练。来自三个未见站点的射线追踪数据的数值结果表明，PaPP的FDP和HBF模型在经过少量本地未标记样本微调后，均优于传统和深度学习基线。在两种架构中，PaPP实现了超过21倍的模拟计算能耗减少，并在信道估计误差下保持良好性能，使其成为高效能量mMIMO预编码的实用解决方案。

更新时间: 2026-01-29 15:56:07

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2601.21897v1

Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning

Large Language Models (LLMs) increasingly exhibit strong reasoning abilities, often attributed to their capacity to generate chain-of-thought-style intermediate reasoning. Recent work suggests that exposure to code can further enhance these skills, but existing studies largely treat code as a generic training signal, leaving open the question of which properties of code actually contribute to improved reasoning. To address this gap, we study the structural complexity of code, which captures control flow and compositional structure that may shape how models internalise multi-step reasoning during fine-tuning. We examine two complementary settings: solution-driven complexity, where complexity varies across multiple solutions to the same problem, and problem-driven complexity, where complexity reflects variation in the underlying tasks. Using cyclomatic complexity and logical lines of code to construct controlled fine-tuning datasets, we evaluate a range of open-weight LLMs on diverse reasoning benchmarks. Our findings show that although code can improve reasoning, structural properties strongly determine its usefulness. In 83% of experiments, restricting fine-tuning data to a specific structural complexity range outperforms training on structurally diverse code, pointing to a data-centric path for improving reasoning beyond scaling.

Updated: 2026-01-29 15:54:40

标题: 并不是所有代码都是相等的：一个关于代码复杂性和LLM推理的数据中心研究

摘要: 大型语言模型（LLMs）越来越展现出强大的推理能力，这常常归因于它们生成思维链式中间推理的能力。最近的研究表明，暴露于代码可以进一步增强这些技能，但现有研究主要将代码视为一种通用的训练信号，仍然存在一个问题，即代码的哪些属性实际上有助于改善推理能力。为了填补这一空白，我们研究了代码的结构复杂性，这涉及控制流和构成结构，这些结构可能塑造了模型在微调过程中如何内化多步推理。我们研究了两种互补的设置：解决驱动复杂性，其中复杂性在解决同一问题的多个解决方案之间变化，以及问题驱动复杂性，其中复杂性反映了基本任务的变化。使用圈复杂度和逻辑代码行构建控制微调数据集，我们评估了各种开放权重的LLMs在不同推理基准上的表现。我们的研究结果表明，尽管代码可以提高推理能力，但结构属性强烈决定了其实用性。在83%的实验中，将微调数据限制在特定结构复杂性范围内优于在结构多样的代码上训练，指向了一个通过数据为中心的路径，用以改善推理能力。

更新时间: 2026-01-29 15:54:40

领域: cs.LG

下载: http://arxiv.org/abs/2601.21894v1

WADBERT: Dual-channel Web Attack Detection Based on BERT Models

Web attack detection is the first line of defense for securing web applications, designed to preemptively identify malicious activities. Deep learning-based approaches are increasingly popular for their advantages: automatically learning complex patterns and extracting semantic features from HTTP requests to achieve superior detection performance. However, existing methods are less effective in embedding irregular HTTP requests, even failing to model unordered parameters and achieve attack traceability. In this paper, we propose an effective web attack detection model, named WADBERT. It achieves high detection accuracy while enabling the precise identification of malicious parameters. To this end, we first employ Hybrid Granularity Embedding (HGE) to generate fine-grained embeddings for URL and payload parameters. Then, URLBERT and SecBERT are respectively utilized to extract their semantic features. Further, parameter-level features (extracted by SecBERT) are fused through a multi-head attention mechanism, resulting in a comprehensive payload feature. Finally, by feeding the concatenated URL and payload features into a linear classifier, a final detection result is obtained. The experimental results on CSIC2010 and SR-BH2020 datasets validate the efficacy of WADBERT, which respectively achieves F1-scores of 99.63% and 99.50%, and significantly outperforms state-of-the-art methods.

Updated: 2026-01-29 15:54:24

标题: WADBERT：基于BERT模型的双通道网络攻击检测

摘要: 网络攻击检测是保护网络应用的第一道防线，旨在预先识别恶意活动。基于深度学习的方法因其自动学习复杂模式并从HTTP请求中提取语义特征以实现出色的检测性能而日益受欢迎。然而，现有方法在嵌入不规则HTTP请求方面效果较差，甚至无法对无序参数进行建模并实现攻击可追溯性。本文提出了一种名为WADBERT的有效网络攻击检测模型。它在实现高检测准确性的同时，能够精确识别恶意参数。为此，我们首先采用混合粒度嵌入（HGE）为URL和载荷参数生成细粒度嵌入。然后，分别利用URLBERT和SecBERT提取它们的语义特征。此外，通过多头注意力机制融合参数级特征（由SecBERT提取），得到全面的载荷特征。最后，通过将连接的URL和载荷特征馈送到线性分类器，得到最终的检测结果。在CSIC2010和SR-BH2020数据集上的实验结果验证了WADBERT的有效性，分别实现了99.63%和99.50%的F1分数，并明显优于现有技术方法。

更新时间: 2026-01-29 15:54:24

领域: cs.CR

下载: http://arxiv.org/abs/2601.21893v1

VSE: Variational state estimation of complex model-free process

We design a variational state estimation (VSE) method that provides a closed-form Gaussian posterior of an underlying complex dynamical process from (noisy) nonlinear measurements. The complex process is model-free. That is, we do not have a suitable physics-based model characterizing the temporal evolution of the process state. The closed-form Gaussian posterior is provided by a recurrent neural network (RNN). The use of RNN is computationally simple in the inference phase. For learning the RNN, an additional RNN is used in the learning phase. Both RNNs help each other learn better based on variational inference principles. The VSE is demonstrated for a tracking application - state estimation of a stochastic Lorenz system (a benchmark process) using a 2-D camera measurement model. The VSE is shown to be competitive against a particle filter that knows the Lorenz system model and a recently proposed data-driven state estimation method that does not know the Lorenz system model.

Updated: 2026-01-29 15:47:28

标题: VSE：复杂无模型过程的变分状态估计

摘要: 我们设计了一种变分状态估计（VSE）方法，可以从（带噪声的）非线性测量中提供底层复杂动态过程的闭合形式高斯后验。这个复杂过程是无模型的，也就是说，我们没有适合描述过程状态时间演变的基于物理的模型。闭合形式高斯后验由递归神经网络（RNN）提供。在推断阶段，使用RNN是计算上简单的。为了学习RNN，在学习阶段使用了额外的RNN。这两个RNN根据变分推断原理互相帮助更好地学习。VSE用于跟踪应用 - 使用2-D相机测量模型对随机Lorenz系统（基准过程）的状态估计。VSE表现出了与了解Lorenz系统模型的粒子滤波器和最近提出的不了解Lorenz系统模型的数据驱动状态估计方法竞争力。

更新时间: 2026-01-29 15:47:28

领域: eess.SP,cs.LG,stat.ML

下载: http://arxiv.org/abs/2601.21887v1

Managing Solution Stability in Decision-Focused Learning with Cost Regularization

Decision-focused learning integrates predictive modeling and combinatorial optimization by training models to directly improve decision quality rather than prediction accuracy alone. Differentiating through combinatorial optimization problems represents a central challenge, and recent approaches tackle this difficulty by introducing perturbation-based approximations. In this work, we focus on estimating the objective function coefficients of a combinatorial optimization problem. Our study demonstrates that fluctuations in perturbation intensity occurring during the learning phase can lead to ineffective training, by establishing a theoretical link to the notion of solution stability in combinatorial optimization. We propose addressing this issue by introducing a regularization of the estimated cost vectors which improves the robustness and reliability of the learning process, as demonstrated by extensive numerical experiments.

Updated: 2026-01-29 15:46:47

标题: 决策导向学习中的解决方案稳定性管理与成本正则化

摘要: Decision-focused learning通过训练模型直接提高决策质量，而不仅仅关注预测准确性，从而将预测建模和组合优化结合起来。区分通过组合优化问题的方式代表了一个核心挑战，最近的方法通过引入基于扰动的近似来解决这一困难。在这项工作中，我们专注于估计组合优化问题的目标函数系数。我们的研究表明，在学习阶段发生的扰动强度波动可能导致训练无效，通过建立与组合优化中解稳定性概念的理论联系来证明这一点。我们提出通过引入估计成本向量的正则化来解决这个问题，这可以提高学习过程的稳健性和可靠性，大量的数值实验也证明了这一点。

更新时间: 2026-01-29 15:46:47

领域: cs.LG

下载: http://arxiv.org/abs/2601.21883v1

How Expressive Are Graph Neural Networks in the Presence of Node Identifiers?

Graph neural networks (GNNs) are a widely used class of machine learning models for graph-structured data, based on local aggregation over neighbors. GNNs have close connections to logic. In particular, their expressive power is linked to that of modal logics and bounded-variable logics with counting. In many practical scenarios, graphs processed by GNNs have node features that act as unique identifiers. In this work, we study how such identifiers affect the expressive power of GNNs. We initiate a study of the key-invariant expressive power of GNNs, inspired by the notion of order-invariant definability in finite model theory: which node queries that depend only on the underlying graph structure can GNNs express on graphs with unique node identifiers? We provide answers for various classes of GNNs with local max- or sum-aggregation.

Updated: 2026-01-29 15:46:35

标题: 在节点标识存在的情况下，图神经网络有多表现力？

摘要: 图神经网络（GNNs）是一类广泛使用的机器学习模型，用于图结构数据，基于对邻居的局部聚合。 GNNs与逻辑有密切联系。特别是，它们的表达能力与模态逻辑和带计数的有界变量逻辑的表达能力相关。在许多实际场景中，被GNNs处理的图具有作为唯一标识符的节点特征。在这项工作中，我们研究这些标识符如何影响GNNs的表达能力。我们启动了对GNNs的键不变表达能力的研究，灵感来自于有限模型理论中的顺序不变定义的概念：GNNs在具有唯一节点标识符的图上能够表示哪些仅依赖于基础图结构的节点查询？我们为具有局部最大或求和聚合的各种GNNs类提供了答案。

更新时间: 2026-01-29 15:46:35

领域: cs.LO,cs.LG

下载: http://arxiv.org/abs/2601.21882v1

Online Bayesian Experimental Design for Partially Observed Dynamical Systems

Bayesian experimental design (BED) provides a principled framework for optimizing data collection by choosing experiments that are maximally informative about unknown parameters. However, existing methods cannot deal with the joint challenge of (a) partially observable dynamical systems, where only noisy and incomplete observations are available, and (b) fully online inference, which updates posterior distributions and selects designs sequentially in a computationally efficient manner. Under partial observability, dynamical systems are naturally modeled as state-space models (SSMs), where latent states mediate the link between parameters and data, making the likelihood -- and thus information-theoretic objectives like the expected information gain (EIG) -- intractable. We address these challenges by deriving new estimators of the EIG and its gradient that explicitly marginalize latent states, enabling scalable stochastic optimization in nonlinear SSMs. Our approach leverages nested particle filters for efficient online state-parameter inference with convergence guarantees. Applications to realistic models, such as the susceptible-infectious-recovered (SIR) and a moving source location task, show that our framework successfully handles both partial observability and online inference.

Updated: 2026-01-29 15:45:20

标题: 在线贝叶斯实验设计用于部分观测的动态系统

摘要: 贝叶斯实验设计（BED）提供了一个合理的框架，通过选择对未知参数最具信息量的实验来优化数据收集。然而，现有方法无法应对（a）部分可观测动态系统的联合挑战，其中仅有嘈杂和不完整的观测数据可用，以及（b）完全在线推断，以计算有效的方式依次更新后验分布并选择设计。在部分可观测情况下，动态系统被自然地建模为状态空间模型（SSMs），潜在状态介导了参数和数据之间的联系，使得像期望信息增益（EIG）这样的信息论目标变得难以处理。我们通过推导新的EIG及其梯度的估计器，明确地边缘化潜在状态，从而使非线性SSMs中的可扩展随机优化成为可能。我们的方法利用嵌套粒子滤波器，实现了具有收敛保证的高效在线状态参数推断。对于现实模型的应用，例如易感-传染-康复（SIR）模型和移动源定位任务，我们的框架成功处理了部分可观测和在线推断的问题。

更新时间: 2026-01-29 15:45:20

领域: stat.ML,cs.LG,stat.CO

下载: http://arxiv.org/abs/2511.04403v2

d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation

Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random-order generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently face an accuracy-parallelism trade-off. Despite increasing interest, existing methods typically focus on only one-side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo-trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy-based multi-block decoding with a KV-cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10$\times$ speedup over vanilla LLaDA/Dream and 5$\times$ speedup over AR models without much accuracy drop. Our code is available at https://github.com/hao-ai-lab/d3LLM.

Updated: 2026-01-29 15:45:13

标题: d3LLM：使用伪轨迹蒸馏的超快扩散LLM

摘要: Diffusion大型语言模型(dLLMs)提供了超越自回归(AR) LLMs的能力，例如并行解码和随机顺序生成。然而，在实践中实现这些好处并不是简单的，因为dLLMs固有地面临准确性和并行性之间的权衡。尽管越来越受到关注，现有方法通常只关注硬币的一面，针对效率或性能进行优化。为了解决这个局限性，我们提出了d3LLM（伪蒸馏扩散大型语言模型），在准确性和并行性之间取得平衡：(i)在训练过程中，我们引入了伪轨迹蒸馏，以教导模型哪些令牌可以在早期步骤中自信地解码，从而提高并行性；(ii)在推理过程中，我们采用基于熵的多块解码和带有KV缓存刷新机制，以实现高并行性同时保持准确性。为了更好地评估dLLMs，我们还引入了AUP（平行性下的准确性），这是一个新的度量标准，可以同时衡量准确性和并行性。实验证明，我们的d3LLM相比普通的LLaDA/Dream可实现高达10倍的加速，比AR模型快5倍，而准确性下降不多。我们的代码可在https://github.com/hao-ai-lab/d3LLM上找到。

更新时间: 2026-01-29 15:45:13

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.07568v2

Low-Rank Plus Sparse Matrix Transfer Learning under Growing Representations and Ambient Dimensions

Learning systems often expand their ambient features or latent representations over time, embedding earlier representations into larger spaces with limited new latent structure. We study transfer learning for structured matrix estimation under simultaneous growth of the ambient dimension and the intrinsic representation, where a well-estimated source task is embedded as a subspace of a higher-dimensional target task. We propose a general transfer framework in which the target parameter decomposes into an embedded source component, low-dimensional low-rank innovations, and sparse edits, and develop an anchored alternating projection estimator that preserves transferred subspaces while estimating only low-dimensional innovations and sparse modifications. We establish deterministic error bounds that separate target noise, representation growth, and source estimation error, yielding strictly improved rates when rank and sparsity increments are small. We demonstrate the generality of the framework by applying it to two canonical problems. For Markov transition matrix estimation from a single trajectory, we derive end-to-end theoretical guarantees under dependent noise. For structured covariance estimation under enlarged dimensions, we provide complementary theoretical analysis in the appendix and empirically validate consistent transfer gains.

Updated: 2026-01-29 15:40:05

标题: 在不断增长的表示和环境维度下的低秩加稀疏矩阵迁移学习

摘要: 学习系统通常会随着时间的推移扩展其环境特征或潜在表示，将先前的表示嵌入到具有有限新潜在结构的更大空间中。我们研究了在环境维度和内在表示同时增长的结构化矩阵估计的迁移学习，其中一个良好估计的源任务被嵌入为更高维度目标任务的子空间。我们提出了一个通用的迁移框架，其中目标参数分解为嵌入的源组件、低维低秩创新和稀疏编辑，并开发了一个锚定交替投影估计器，它在估计只有低维创新和稀疏修改的同时保留转移的子空间。我们建立了确定性误差界，将目标噪声、表示增长和源估计误差分开，当秩和稀疏度增量较小时，会产生严格改进的速率。我们通过将该框架应用于两个经典问题来展示其通用性。对于从单个轨迹中估计马尔可夫转移矩阵，我们在相关噪声下推导了端到端的理论保证。对于在扩大尺寸下的结构化协方差估计，我们在附录中提供补充的理论分析，并在实验证实了一致的转移增益。

更新时间: 2026-01-29 15:40:05

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2601.21873v1

Graph Homomorphism Distortion: A Metric to Distinguish Them All and in the Latent Space Bind Them

A large driver of the complexity of graph learning is the interplay between \emph{structure} and \emph{features}.When analyzing the expressivity of graph neural networks, however, existing approaches ignore features in favor of structure, making it nigh-impossible to assess to what extent two graphs with close features should be considered similar.We address this by developing a new \mbox{(pseudo-)metric} based on graph homomorphisms.Inspired by concepts from metric geometry, our \emph{graph homomorphism distortion} measures the minimal worst-case distortion that node features of one graph are subjected to when mapping one graph to another.We demonstrate the utility of our novel measure by showing that (i.) it can be efficiently calculated under some additional assumptions, (ii.) it complements existing expressivity measures like \mbox{$1$-WL}, and (iii.)it permits defining structural encodings, which improve the predictive capabilities of graph neural networks.

Updated: 2026-01-29 15:38:06

标题: 图同态畸变：一个度量标准来区分它们并在潜在空间中将它们联系在一起

摘要: 图学习的复杂性主要由\emph{结构}和\emph{特征}之间的相互作用驱动。然而，在分析图神经网络的表达能力时，现有方法忽视特征而偏向于结构，因此几乎不可能评估具有相似特征的两个图在多大程度上应被视为相似。我们通过基于图同态的新\mbox{(伪)度量}来解决这个问题。受度量几何学概念的启发，我们的\emph{图同态失真度}度量了将一个图的节点特征映射到另一个图时所受到的最小最坏失真。我们通过展示以下几点来证明我们新度量的实用性：(i.) 在一些额外假设下，它可以被高效计算；(ii.) 它可以补充现有的表达能力度量，如\mbox{$1$-WL}；(iii.) 它允许定义结构编码，从而提高图神经网络的预测能力。

更新时间: 2026-01-29 15:38:06

领域: cs.LG

下载: http://arxiv.org/abs/2511.03068v2

Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers

Collecting web data to train deep models has become increasingly common, raising concerns about unauthorized data usage. To mitigate this issue, unlearnable examples introduce imperceptible perturbations into data, preventing models from learning effectively. However, existing methods typically rely on deep neural networks as surrogate models for perturbation generation, resulting in significant computational costs. In this work, we propose Perturbation-Induced Linearization (PIL), a computationally efficient yet effective method that generates perturbations using only linear surrogate models. PIL achieves comparable or better performance than existing surrogate-based methods while reducing computational time dramatically. We further reveal a key mechanism underlying unlearnable examples: inducing linearization to deep models, which explains why PIL can achieve competitive results in a very short time. Beyond this, we provide an analysis about the property of unlearnable examples under percentage-based partial perturbation. Our work not only provides a practical approach for data protection but also offers insights into what makes unlearnable examples effective.

Updated: 2026-01-29 15:37:51

标题: 扰动诱导的线性化：利用仅线性分类器构建不可学习的数据

摘要: 收集网络数据以训练深度模型已经变得越来越常见，引发了对未经授权数据使用的担忧。为了缓解这一问题，不可学习的示例向数据中引入了无法察觉的扰动，阻止模型有效学习。然而，现有方法通常依赖于深度神经网络作为扰动生成的替代模型，导致显著的计算成本。在这项工作中，我们提出了一种计算效率高且有效的方法，称为Perturbation-Induced Linearization（PIL），它仅使用线性替代模型生成扰动。PIL在显著减少计算时间的同时，实现了与现有基于替代模型的方法相媲美甚至更好的性能。我们进一步揭示了不可学习示例的一个关键机制：将深度模型线性化，这解释了为什么PIL可以在很短时间内取得竞争性结果。除此之外，我们还对基于百分比的部分扰动下不可学习示例的属性进行了分析。我们的工作不仅为数据保护提供了实用方法，还深入探讨了不可学习示例的有效性原因。

更新时间: 2026-01-29 15:37:51

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.19967v2

On Forgetting and Stability of Score-based Generative models

Understanding the stability and long-time behavior of generative models is a fundamental problem in modern machine learning. This paper provides quantitative bounds on the sampling error of score-based generative models by leveraging stability and forgetting properties of the Markov chain associated with the reverse-time dynamics. Under weak assumptions, we provide the two structural properties to ensure the propagation of initialization and discretization errors of the backward process: a Lyapunov drift condition and a Doeblin-type minorization condition. A practical consequence is quantitative stability of the sampling procedure, as the reverse diffusion dynamics induces a contraction mechanism along the sampling trajectory. Our results clarify the role of stochastic dynamics in score-based models and provide a principled framework for analyzing propagation of errors in such approaches.

Updated: 2026-01-29 15:37:50

标题: 关于遗忘和基于得分的生成模型稳定性

摘要: 理解生成模型的稳定性和长期行为是现代机器学习中的一个基本问题。本文通过利用与反向时间动态相关的马尔科夫链的稳定性和遗忘属性，为基于分数的生成模型的采样误差提供了定量界限。在弱假设下，我们提供了两个结构属性，以确保向后过程的初始化和离散化误差的传播：Lyapunov漂移条件和Doeblin类型的次均条件。一个实际的结果是采样过程的定量稳定性，因为反向扩散动态在采样轨迹上引发了一个收缩机制。我们的结果澄清了基于分数模型中随机动态的作用，并为分析这种方法中误差传播提供了一个原则性框架。

更新时间: 2026-01-29 15:37:50

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2601.21868v1

MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts

Real-world multivariate time series can exhibit intricate multi-scale structures, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structures. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. We validate across seven multivariate benchmarks and multiple horizons, with MoHETS consistently achieving state-of-the-art performance, reducing the average MSE by $12\%$ compared to strong recent baselines, demonstrating effective heterogeneous specialization for long-term forecasting.

Updated: 2026-01-29 15:35:26

标题: MoHETS: 混合异质专家的长期时间序列预测

摘要: 真实世界中的多变量时间序列可能表现出复杂的多尺度结构，包括全局趋势、局部周期性和非平稳状态，这使得长期预测具有挑战性。尽管稀疏的专家混合（MoE）方法提高了可伸缩性和专业化水平，但它们通常依赖于捕捉不同时间动态的时间序列数据的同质MLP专家。我们通过MoHETS解决了这些限制，这是一种仅编码的Transformer，集成了稀疏的异质专家混合（MoHE）层。MoHE将时间补丁路由到专家网络的一个小子集，结合了用于序列级连续性的共享深度卷积专家和用于补丁级周期结构的路由Fourier专家。MoHETS通过在协变量补丁嵌入上进行交叉注意力，进一步提高了对非平稳动态的鲁棒性。最后，我们用轻量级卷积补丁解码器替换了参数密集的线性投影头，提高了参数效率，减少了训练不稳定性，并使单个模型能够泛化到任意预测时间范围。我们通过七个多变量基准和多个时间范围进行验证，MoHETS始终实现了最先进的性能，将平均MSE降低了12%，相对于强劲的最近基线，展示了有效的异质专业化长期预测。

更新时间: 2026-01-29 15:35:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21866v1

Low-Rank Key Value Attention

The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads, while being compute efficient. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, providing a continuous trade-off between complete sharing and full independence. After pretraining models of size 128M to 6.3B parameters, LRKV consistently achieves the lowest test loss among standard MHA, MQA/GQA, and MLA while using only 45-53\% of MHA's KV cache. LRKV reaches equivalent baseline quality 18-25\% faster (measured in training steps). After supervised midtraining, LRKV achieves the highest downstream task performance across ARC-Easy, ARC-Challenge, MMLU, GSM8K, and HumanEval benchmarks.

Updated: 2026-01-29 15:29:26

标题: 低秩键值注意力

摘要: 键-值（KV）缓存是变压器中的主要内存瓶颈。我们提出了低秩键-值（LRKV）注意力，通过利用注意头之间的冗余来减少KV缓存内存，同时保持计算效率。每个层使用一个共享的全秩KV投影，增加低秩、头特定的残差，提供完全共享和完全独立之间的连续折衷。在对大小为128M到6.3B参数的预训练模型进行训练后，LRKV始终在标准MHA、MQA/GQA和MLA中实现了最低的测试损失，同时仅使用MHA的KV缓存的45-53%。LRKV在训练步骤中达到等效基线质量的速度比较快18-25%。在受监督的中期训练之后，LRKV在ARC-Easy、ARC-Challenge、MMLU、GSM8K和HumanEval基准测试中实现了最高的下游任务性能。

更新时间: 2026-01-29 15:29:26

领域: cs.LG

下载: http://arxiv.org/abs/2601.11471v2

LEMUR: Learned Multi-Vector Retrieval

Multi-vector representations generated by late interaction models, such as ColBERT, enable superior retrieval quality compared to single-vector representations in information retrieval applications. In multi-vector retrieval systems, both queries and documents are encoded using one embedding for each token, and similarity between queries and documents is measured by the MaxSim similarity measure. However, the improved recall of multi-vector retrieval comes at the expense of significantly increased latency. This necessitates designing efficient approximate nearest neighbor search (ANNS) algorithms for multi-vector search. In this work, we introduce LEMUR, a simple-yet-efficient framework for multi-vector similarity search. LEMUR consists of two consecutive problem reductions: We first formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space, which enables the use of existing single-vector ANNS methods for speeding up retrieval. In addition to performance evaluation on ColBERTv2 embeddings, we evaluate LEMUR on embeddings generated by modern multi-vector text models and multi-vector visual document retrieval models. LEMUR is an order of magnitude faster than earlier multi-vector similarity search methods.

Updated: 2026-01-29 15:26:32

标题: LEMUR：多向量检索学习

摘要: 由晚期交互模型生成的多向量表示，如ColBERT，在信息检索应用中比单向量表示具有更好的检索质量。在多向量检索系统中，查询和文档都使用每个标记的一个嵌入进行编码，并且查询和文档之间的相似性通过MaxSim相似度度量来衡量。然而，多向量检索的提高召回率是以显著增加的延迟为代价的。这需要为多向量搜索设计高效的近似最近邻搜索（ANNS）算法。在这项工作中，我们介绍了LEMUR，这是一个简单而高效的多向量相似度搜索框架。LEMUR由两个连续的问题缩减组成：首先，我们将多向量相似度搜索形式化为一个可以使用单隐藏层神经网络解决的监督学习问题。其次，我们将在这种模型下的推断缩减为在其潜在空间中的单向量相似度搜索，这使得可以利用现有的单向量ANNS方法加快检索速度。除了在ColBERTv2嵌入上进行性能评估外，我们还评估了LEMUR在现代多向量文本模型和多向量视觉文档检索模型生成的嵌入上的表现。LEMUR比早期的多向量相似度搜索方法快一个数量级。

更新时间: 2026-01-29 15:26:32

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2601.21853v1

Visual Disentangled Diffusion Autoencoders: Scalable Counterfactual Generation for Foundation Models

Foundation models, despite their robust zero-shot capabilities, remain vulnerable to spurious correlations and 'Clever Hans' strategies. Existing mitigation methods often rely on unavailable group labels or computationally expensive gradient-based adversarial optimization. To address these limitations, we propose Visual Disentangled Diffusion Autoencoders (DiDAE), a novel framework integrating frozen foundation models with disentangled dictionary learning for efficient, gradient-free counterfactual generation directly for the foundation model. DiDAE first edits foundation model embeddings in interpretable disentangled directions of the disentangled dictionary and then decodes them via a diffusion autoencoder. This allows the generation of multiple diverse, disentangled counterfactuals for each factual, much faster than existing baselines, which generate single entangled counterfactuals. When paired with Counterfactual Knowledge Distillation, DiDAE-CFKD achieves state-of-the-art performance in mitigating shortcut learning, improving downstream performance on unbalanced datasets.

Updated: 2026-01-29 15:25:37

标题: 视觉分解扩散自编码器：基础模型的可扩展反事实生成

摘要: 尽管基础模型具有强大的零样本能力，但仍然容易受到虚假相关性和“聪明的汉斯”策略的影响。现有的缓解方法通常依赖于不可用的群体标签或计算昂贵的基于梯度的对抗优化。为了解决这些限制，我们提出了视觉解耦扩散自动编码器（DiDAE），这是一个新颖的框架，将冻结的基础模型与解耦的字典学习结合起来，用于为基础模型直接生成高效、无梯度的反事实生成。DiDAE首先在解耦的字典的可解释解耦方向上编辑基础模型嵌入，然后通过扩散自动编码器解码它们。这允许为每个实际生成多个不同的、解耦的反事实，比现有基线更快地生成单个交织的反事实。与反事实知识蒸馏相结合时，DiDAE-CFKD在缓解快捷学习方面取得了最先进的性能，在不平衡数据集上改善了下游性能。

更新时间: 2026-01-29 15:25:37

领域: cs.LG

下载: http://arxiv.org/abs/2601.21851v1

READY: Reward Discovery for Meta-Black-Box Optimization

Meta-Black-Box Optimization (MetaBBO) is an emerging avenue within Optimization community, where algorithm design policy could be meta-learned by reinforcement learning to enhance optimization performance. So far, the reward functions in existing MetaBBO works are designed by human experts, introducing certain design bias and risks of reward hacking. In this paper, we use Large Language Model~(LLM) as an automated reward discovery tool for MetaBBO. Specifically, we consider both effectiveness and efficiency sides. On effectiveness side, we borrow the idea of evolution of heuristics, introducing tailored evolution paradigm in the iterative LLM-based program search process, which ensures continuous improvement. On efficiency side, we additionally introduce multi-task evolution architecture to support parallel reward discovery for diverse MetaBBO approaches. Such parallel process also benefits from knowledge sharing across tasks to accelerate convergence. Empirical results demonstrate that the reward functions discovered by our approach could be helpful for boosting existing MetaBBO works, underscoring the importance of reward design in MetaBBO. We provide READY's project at https://anonymous.4open.science/r/ICML_READY-747F.

Updated: 2026-01-29 15:23:18

标题: 准备好了：用于元黑盒优化的奖励发现

摘要: 元黑盒优化（MetaBBO）是优化领域的一个新兴方向，其中算法设计政策可以通过强化学习进行元学习，以增强优化性能。到目前为止，现有的MetaBBO工作中的奖励函数是由人类专家设计的，引入了一定的设计偏见和奖励欺骗的风险。在本文中，我们使用大型语言模型（LLM）作为MetaBBO的自动奖励发现工具。具体来说，我们考虑了效果和效率两方面。在效果方面，我们借鉴了启发式进化的思想，在迭代的基于LLM的程序搜索过程中引入了定制的进化范式，确保持续改进。在效率方面，我们另外引入了多任务进化架构，以支持多样化MetaBBO方法的并行奖励发现。这种并行过程还可以通过跨任务的知识共享加速收敛。实证结果表明，我们的方法发现的奖励函数可以帮助提升现有的MetaBBO工作，强调了MetaBBO中奖励设计的重要性。我们在https://anonymous.4open.science/r/ICML_READY-747F提供了READY项目。

更新时间: 2026-01-29 15:23:18

领域: cs.LG,cs.NE

下载: http://arxiv.org/abs/2601.21847v1

Constrained Meta Reinforcement Learning with Provable Test-Time Safety

Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Despite its success in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure the safety of the policy on the real-world test task, while reducing the sample complexity and thus, enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.

Updated: 2026-01-29 15:21:37

标题: 受限元元强化学习与可证实的测试时间安全性

摘要: 元元强化学习（RL）允许代理利用在其可以随意进行训练的任务分布上的经验，从而在新的测试任务上更快地学习到最优策略。尽管在测试任务上改善样本复杂度方面取得成功，但许多真实世界的应用，如机器人和医疗保健，在测试过程中施加安全约束。受限的元元RL为将安全性整合到元RL中提供了一个有前途的框架。受限元RL中的一个未解之谜是如何确保在真实世界的测试任务中策略的安全性，同时减少样本复杂度，从而实现更快地学习最优策略。为了填补这一空白，我们提出了一种算法，通过对训练过程中学到的策略进行精炼，可以在测试任务上学习到接近最优策略并提供可证明的安全性和样本复杂度保证。我们进一步推导出一个匹配的下界，表明这种样本复杂度是紧密的。

更新时间: 2026-01-29 15:21:37

领域: cs.LG

下载: http://arxiv.org/abs/2601.21845v1

Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm

The research community continues to seek increasingly more advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression tasks, 4 measures demonstrating promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets to achieve target complexity values through linear feature projections. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.

Updated: 2026-01-29 15:21:28

标题: 用基于投影的多目标遗传算法将数据集转换为所需复杂度

摘要: 研究界继续寻求越来越先进的合成数据生成器，以可靠地评估机器学习方法的优势和局限性。本研究旨在通过提出一种遗传算法，优化一组用于分类和回归任务的问题复杂度度量，以增加涵盖各种问题复杂性的数据集的可用性。对于分类任务，使用了一组10个复杂度度量，而对于回归任务，则选择了4个表现出有希望的优化能力的度量。实验证实，所提出的遗传算法可以通过线性特征投影将合成创建的数据集转化，以实现目标复杂度值，从而生成具有不同难度级别的数据集。涉及最先进的分类器和回归器的评估揭示了生成数据的复杂度与识别质量之间的相关性。

更新时间: 2026-01-29 15:21:28

领域: cs.LG,cs.NE

下载: http://arxiv.org/abs/2507.15132v2

Redefining Neural Operators in $d+1$ Dimensions for Embedding Evolution

Neural Operators (NOs) have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely used in universally approximating architectures. Following the original formulation, most advancements focus on designing better parameterizations for the kernel over the original physical domain (with $d$ spatial dimensions, $d\in{1,2,3,\ldots}$). In contrast, embedding evolution remains largely unexplored, which often drives models toward brute-force embedding lengthening to improve approximation, but at the cost of substantially increased computation. In this paper, we introduce an auxiliary dimension that explicitly models embedding evolution in operator form, thereby redefining the NO framework in $d+1$ dimensions (the original $d$ dimensions plus one auxiliary dimension). Under this formulation, we develop a Schrödingerised Kernel Neural Operator (SKNO), which leverages Fourier-based operators to model the $d+1$ dimensional evolution. Across more than ten increasingly challenging benchmarks, ranging from the 1D heat equation to the highly nonlinear 3D Rayleigh-Taylor instability, SKNO consistently outperforms other baselines. We further validate its resolution invariance under mixed-resolution training and super-resolution inference, and evaluate zero-shot generalization to unseen temporal regimes. In addition, we present a broader set of design choices for the lifting and recovery operators, demonstrating their impact on SKNO's predictive performance.

Updated: 2026-01-29 15:18:05

标题: 将$d+1$维度中的神经操作符重新定义为嵌入演化

摘要: 神经算子（NOs）已成为学习函数空间之间映射的强大工具。其中，核积分算子在普遍逼近结构中被广泛使用。在最初的制定中，大多数进展集中在设计更好的参数化核函数，覆盖原始物理域（具有$d$个空间维度，$d\in{1,2,3,\ldots}$）。相反，嵌入式演化仍然大多未被探索，这经常导致模型朝着通过暴力嵌入长度来改善逼近的方向发展，但这会大大增加计算量。本文引入了一个明确模拟嵌入式演化的辅助维度，从而在操作符形式中重新定义了$d+1$维度（原始的$d$维度加上一个辅助维度）的NO框架。在这个制定下，我们开发了一个Schrödinger化的核神经算子（SKNO），利用基于傅里叶的算子来建模$d+1$维度的演化。在超过十个日益具有挑战性的基准测试中，从1D热方程到高度非线性的3D Rayleigh-Taylor不稳定性，SKNO始终优于其他基线。我们进一步验证了其在混合分辨率训练和超分辨率推断下的分辨率不变性，并评估了对未知时间范围的零样本泛化。此外，我们提出了更广泛的提升和恢复算子设计选择，展示它们对SKNO预测性能的影响。

更新时间: 2026-01-29 15:18:05

领域: cs.LG,cs.AI,quant-ph

下载: http://arxiv.org/abs/2505.11766v3

Test-Time Compute Games

Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since providers charge users for the amount of test-time compute they use to generate an output. In our work, we show that the market of LLM-as-a-service is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.

Updated: 2026-01-29 15:18:01

标题: 测试时间计算游戏

摘要: 测试时间计算已成为增强大型语言模型（LLMs）推理能力的一种有前途的策略。然而，这种策略反过来增加了用户支付云服务提供商LLM作为服务的费用，因为提供商根据用户用于生成输出的测试时间计算量收费。在我们的工作中，我们展示了LLM作为服务市场存在社会效率低下的问题：提供商有财务激励增加测试时间计算量，即使这种增加对输出质量的贡献很小。为了解决这种低效率问题，我们引入了一个反向第二价格拍卖机制，其中提供商出价其提供的价格和（预期的）质量以获得为用户提供服务的机会，用户根据获胜提供商相对于第二高出价者产生的边际价值按比例支付费用。为了说明和补充我们的理论结果，我们在数学和科学基准数据集上对来自Llama和Qwen家族的多个指导模型以及从DeepSeek-R1中提炼出的推理模型进行实验。

更新时间: 2026-01-29 15:18:01

领域: cs.CY,cs.AI,cs.GT,cs.LG

下载: http://arxiv.org/abs/2601.21839v1

Scalable Linearized Laplace Approximation via Surrogate Neural Kernel

We introduce a scalable method to approximate the kernel of the Linearized Laplace Approximation (LLA). For this, we use a surrogate deep neural network (DNN) that learns a compact feature representation whose inner product replicates the Neural Tangent Kernel (NTK). This avoids the need to compute large Jacobians. Training relies solely on efficient Jacobian-vector products, allowing to compute predictive uncertainty on large-scale pre-trained DNNs. Experimental results show similar or improved uncertainty estimation and calibration compared to existing LLA approximations. Notwithstanding, biasing the learned kernel significantly enhances out-of-distribution detection. This remarks the benefits of the proposed method for finding better kernels than the NTK in the context of LLA to compute prediction uncertainty given a pre-trained DNN.

Updated: 2026-01-29 15:15:41

标题: 通过代理神经核的可扩展线性化拉普拉斯逼近

摘要: 我们介绍了一种可扩展的方法来近似线性化拉普拉斯逼近法（LLA）的核。为此，我们使用一个替代的深度神经网络（DNN），该网络学习一个紧凑的特征表示，其内积复制了神经切向核（NTK）。这避免了计算大雅可比矩阵的需要。训练仅依赖于高效的雅可比向量乘积，使得能够在大规模预训练的DNN上计算预测不确定性。实验结果显示，与现有的LLA逼近方法相比，该方法具有类似或改进的不确定性估计和校准。尽管如此，显著偏向学习的核可显著增强超出分布的检测。这突显了在LLA的背景下提出的方法比NTK更好地找到用于计算预测不确定性的核的好处。

更新时间: 2026-01-29 15:15:41

领域: cs.LG

下载: http://arxiv.org/abs/2601.21835v1

Goal-Driven Adaptive Sampling Strategies for Machine Learning Models Predicting Fields

Machine learning models are widely regarded as a way forward to tackle multi-query challenges that arise once expensive black-box simulations such as computational fluid dynamics are investigated. However, ensuring the desired level of accuracy for a certain task at minimal computational cost, e.g. as few black-box samples as possible, remains a challenges. Active learning strategies are used for scalar quantities to overcome this challenges and different so-called infill criteria exists and are commonly employed in several scenarios. Even though needed in various field an extension of active learning strategies towards field predictions is still lacking or limited to very specific scenarios and/or model types. In this paper we propose an active learning strategy for machine learning models that are capable if predicting field which is agnostic to the model architecture itself. For doing so, we combine a well-established Gaussian process model for a scalar reference value and simultaneously aim at reducing the epistemic model error and the difference between scalar and field predictions. Different specific forms of the above-mentioned approach are introduced and compared to each other as well as only scalar-valued based infill. Results are presented for the NASA common research model for an uncertainty propagation task showcasing high level of accuracy at significantly smaller cost compared to an approach without active learning.

Updated: 2026-01-29 15:14:36

标题: 目标驱动的自适应采样策略用于预测领域的机器学习模型

摘要: 机器学习模型被广泛认为是解决多查询挑战的一种途径，一旦进行了昂贵的黑盒模拟，如计算流体动力学。然而，确保在最小计算成本下为某项任务达到所需的准确度，例如尽可能少的黑盒样本，仍然是一个挑战。主动学习策略被用于标量量来克服这些挑战，并且存在不同的所谓填充准则，并在几种情景中通常被使用。尽管在各个领域中需要，在领域预测方面延伸主动学习策略仍然缺乏或仅限于非常特定的情景和/或模型类型。在本文中，我们提出了一种主动学习策略，用于能够预测领域的机器学习模型，该策略对模型体系结构本身是不可知的。为此，我们结合了一个被广泛接受的高斯过程模型，用于标量参考值，并同时旨在减少认知模型误差和标量与领域预测之间的差异。引入了上述方法的不同具体形式，并将它们与仅基于标量值的填充进行比较。对NASA通用研究模型进行了不确定性传播任务的结果展示了高准确度水平，与不使用主动学习的方法相比，成本显著降低。

更新时间: 2026-01-29 15:14:36

领域: cs.LG

下载: http://arxiv.org/abs/2601.21832v1

Generative Modeling of Discrete Data Using Geometric Latent Subspaces

We introduce the use of latent subspaces in the exponential parameter space of product manifolds of categorial distributions, as a tool for learning generative models of discrete data. The low-dimensional latent space encodes statistical dependencies and removes redundant degrees of freedom among the categorial variables. We equip the parameter domain with a Riemannian geometry such that the spaces and distances are related by isometries which enables consistent flow matching. In particular, geodesics become straight lines which makes model training by flow matching effective. Empirical results demonstrate that reduced latent dimensions suffice to represent data for generative modeling.

Updated: 2026-01-29 15:14:15

标题: 使用几何潜在子空间进行离散数据的生成建模

摘要: 我们引入了在类别分布的乘积流形的指数参数空间中使用潜在子空间的方法，作为学习离散数据生成模型的工具。低维潜在空间对类别变量之间的统计依赖进行编码，并消除冗余自由度。我们将参数域与黎曼几何相结合，使空间和距离通过等距变换相关，从而实现一致的流匹配。特别是，测地线变成直线，使得通过流匹配进行模型训练变得有效。实证结果表明，降低的潜在维度足以表示生成建模数据。

更新时间: 2026-01-29 15:14:15

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2601.21831v1

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

Updated: 2026-01-29 15:12:28

标题: 语义路由器：通过单个对抗性扰动劫持MLLMs的可行性

摘要: 多模态大型语言模型（MLLMs）越来越多地部署在无状态系统中，例如自动驾驶和机器人技术中。本文调查了一种新型威胁：语义感知劫持。我们探索了使用单一通用扰动同时劫持多个无状态决策的可行性。我们引入了语义感知通用扰动（SAUP），它充当语义路由器，“主动”感知输入语义并将其路由到不同的、攻击者定义的目标。为了实现这一目标，我们对潜在空间中的几何属性进行理论和实证分析。在这些见解的指导下，我们提出了语义导向（SORT）优化策略，并标注了一个带有细粒度语义的新数据集以评估性能。对三个代表性MLLM进行的大量实验表明了这种攻击的基本可行性，使用单帧对Qwen攻击五个目标，成功率达66%。

更新时间: 2026-01-29 15:12:28

领域: cs.CV,cs.AI,cs.CR

下载: http://arxiv.org/abs/2511.20002v2

DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.

Updated: 2026-01-29 15:10:13

标题: DASH: 确定性关注调度用于高吞吐率可重复的LLM训练

摘要: 决定论在大型语言模型（LLM）训练中是不可或缺的，然而它通常会导致性能成本的大幅提高。在诸如FlashAttention-3等广泛使用的注意力实现中，确定性的反向传播相对于其非确定性对应物可以导致高达37.9%的吞吐量降低，主要是因为梯度累积操作必须串行化以保证数值一致性。这种性能损失源于计算和梯度减少阶段的子优调度，导致显著的硬件低效利用。为了解决这一挑战，我们将确定性注意力的反向传播形式化为一个有向无环图（DAG）上的调度问题，并推导出最小化关键路径长度的调度。基于这个公式，我们提出了DASH（用于高吞吐量的确定性注意力调度），它包含两种互补的调度策略：（i）下降Q-Tile迭代，一个反向查询块遍历，缩小因果关注中的流水线停顿，以及（ii）Shift调度，一个在我们的DAG模型中理论上最优的调度，减少了完整和因果掩码的流水线停顿。我们在NVIDIA H800 GPU上的实证评估表明，DASH缩小了确定性注意力的性能差距。所提出的策略将注意力反向传播的吞吐量提高了最多1.28倍，与基线相比，显著提高了可复现LLM训练的效率。我们的代码在https://github.com/SJTU-Liquid/deterministic-FA3上开源。

更新时间: 2026-01-29 15:10:13

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2601.21824v1

Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation

Partial participation is essential for communication-efficient federated learning at scale, yet existing Byzantine-robust methods typically assume full client participation. In the partial participation setting, a majority of the sampled clients may be Byzantine, once Byzantine clients dominate, existing methods break down immediately. We introduce delayed momentum aggregation, a principle where the central server aggregates cached momentum from non-sampled clients along with fresh momentum from sampled clients. This principle ensures Byzantine clients remain a minority from the server's perspective even when they dominate the sampled set. We instantiate this principle in our optimizer DeMoA. We analyze the convergence rate of DeMoA, showing that DeMoA is Byzantine-robust under partial participation. Experiments show that, with 20% Byzantine ratio and only 10% partial participation rate, DeMoA achieves the best accuracy even when existing methods fail empirically.

Updated: 2026-01-29 15:10:07

标题: 延迟动量聚合：具有部分参与的通信高效拜占庭容忍联邦学习

摘要: 部分参与对于规模化通信高效的联邦学习至关重要，然而现有的拜占庭-鲁棒方法通常假设客户端完全参与。在部分参与的情况下，抽样客户端中可能有大多数是拜占庭的，一旦拜占庭客户端占据主导地位，现有方法会立即崩溃。我们引入了延迟动量聚合的原则，即中央服务器从未抽样客户端聚合缓存的动量以及来自抽样客户端的新动量。这一原则确保了即使拜占庭客户端占据抽样集合的主导地位，从服务器的角度看，它们仍然是少数。我们在我们的优化器DeMoA中实例化了这一原则。我们分析了DeMoA的收敛速度，表明在部分参与的情况下DeMoA是拜占庭-鲁棒的。实验表明，当存在20%的拜占庭比例和仅有10%的部分参与率时，DeMoA在经验上取得了最佳准确性，即使现有方法失败。

更新时间: 2026-01-29 15:10:07

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2509.02970v2

scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics

Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. While random sampling provides the data diversity needed for effective training, it is prohibitively slow due to the random access pattern overhead, whereas sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data with seamless integration across diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves more than two orders of magnitude speedup compared to true random sampling while working directly with AnnData files. We provide theoretical bounds on minibatch diversity and empirically show that scDataset matches the performance of true random sampling across multiple classification tasks.

Updated: 2026-01-29 15:07:40

标题: scDataset：大规模单细胞组学的深度学习可扩展数据加载

摘要: 在单细胞数据集上训练拥有数亿个细胞的深度学习模型需要从磁盘加载数据，因为这些数据集超过了可用内存。虽然随机抽样提供了有效训练所需的数据多样性，但由于随机访问模式的开销，其速度过慢，而顺序流式传输虽然实现了高吞吐量，但引入了降低模型性能的偏差。我们提出了scDataset，一个PyTorch数据加载器，能够实现从磁盘数据中高效训练，并在各种存储格式之间实现无缝集成。我们的方法结合了块抽样和批量获取，实现了准随机抽样，平衡了I/O效率与小批量多样性。在Tahoe-100M上，一个拥有1亿个细胞的数据集上，scDataset相比于真正的随机抽样实现了超过两个数量级的加速，同时直接使用AnnData文件。我们提供了小批量多样性的理论界限，并在多个分类任务中经验性地展示了scDataset与真正的随机抽样的性能匹配。

更新时间: 2026-01-29 15:07:40

领域: cs.LG,cs.AI,cs.DB,q-bio.GN,q-bio.QM

下载: http://arxiv.org/abs/2506.01883v2

A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth

Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.

Updated: 2026-01-29 15:01:28

标题: 一种基于评价者感知的排名框架，用于评估没有基准真实值的大型语言模型

摘要: 在没有地面真实标签的情况下，通过LLM作为评判者范式对开放式任务上的大型语言模型（LLMs）进行评估日益普遍。一个至关重要但未被充分建模的问题是，评判者LLMs在可靠性上存在显著差异；将所有评判者视为相等可能会导致有偏见的排行榜和误导性的不确定性估计。更多数据可以使评估在错误指定的聚合下更加确信地错误。我们提出了一个考虑评判者的排名框架，通过引入评判者特定的歧视参数扩展Bradley-Terry-Luce模型，从没有参考标签的成对比较中联合估计潜在模型质量和评判者可靠性。我们建立了一种可识别性，证明了最大似然估计器的一致性和渐近正态性，从而实现了得分差异和排名比较的置信区间，跨多个公共基准测试和新收集的数据集，我们的方法提高了与人类偏好的一致性，比未加权的基线更高效，并为LLM排名提供了校准的不确定性量化。

更新时间: 2026-01-29 15:01:28

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2601.21817v1

Nonparametric LLM Evaluation from Preference Data

Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used. In this paper, we propose a nonparametric statistical framework, DMLEval, for comparing and ranking LLMs from preference data using debiased machine learning (DML). For this, we introduce generalized average ranking scores (GARS), which generalize commonly used ranking models, including the Bradley-Terry model or PageRank/ Rank centrality, with complex human responses such as ties. DMLEval comes with the following advantages: (i) It produces statistically efficient estimates of GARS ranking scores. (ii) It naturally allows the incorporation of black-box machine learning methods for estimation. (iii) It can be combined with pre-trained LLM evaluators (e.g., using LLM-as-a-judge). (iv) It suggests optimal policies for collecting preference data under budget constraints. We demonstrate these advantages both theoretically and empirically using both synthetic and real-world preference datasets. In summary, our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs.

Updated: 2026-01-29 15:00:07

标题: 基于偏好数据的非参数LLM评估

摘要: 评估大型语言模型（LLMs）的性能对于获取LLM排行榜至关重要。然而，许多现有方法要么依赖于限制性参数假设，要么在使用灵活的机器学习方法时缺乏有效的不确定性量化。在本文中，我们提出了一个非参数统计框架DMLEval，用于使用去偏的机器学习（DML）从偏好数据中比较和排名LLMs。为此，我们引入了广义平均排名分数（GARS），这些分数概括了常用的排名模型，包括Bradley-Terry模型或PageRank/Rank中心性，以及复杂的人类反应，如并列。DMLEval具有以下优势：（i）它产生统计有效的GARS排名分数估计。（ii）它自然地允许将黑盒机器学习方法纳入估计中。（iii）它可以与预训练的LLM评估器结合使用（例如使用LLM作为评委）。（iv）它提出了在预算约束下收集偏好数据的最佳策略。我们在理论上和实证上使用合成和真实偏好数据集展示了这些优势。总之，我们的框架为从业者提供了强大的、最先进的比较或排名LLMs的方法。

更新时间: 2026-01-29 15:00:07

领域: cs.LG

下载: http://arxiv.org/abs/2601.21816v1

A Decomposable Forward Process in Diffusion Models for Time-Series Forecasting

We introduce a model-agnostic forward diffusion process for time-series forecasting that decomposes signals into spectral components, preserving structured temporal patterns such as seasonality more effectively than standard diffusion. Unlike prior work that modifies the network architecture or diffuses directly in the frequency domain, our proposed method alters only the diffusion process itself, making it compatible with existing diffusion backbones (e.g., DiffWave, TimeGrad, CSDI). By staging noise injection according to component energy, it maintains high signal-to-noise ratios for dominant frequencies throughout the diffusion trajectory, thereby improving the recoverability of long-term patterns. This strategy enables the model to maintain the signal structure for a longer period in the forward process, leading to improved forecast quality. Across standard forecasting benchmarks, we show that applying spectral decomposition strategies, such as the Fourier or Wavelet transform, consistently improves upon diffusion models using the baseline forward process, with negligible computational overhead. The code for this paper is available at https://anonymous.4open.science/r/D-FDP-4A29.

Updated: 2026-01-29 14:55:43

标题: 扩散模型中的可分解前向过程在时间序列预测中的应用

摘要: 我们引入了一个模型无关的前向扩散过程，用于时间序列预测，将信号分解为谱分量，比标准扩散更有效地保留结构化的时间模式，如季节性。与直接修改网络架构或在频率域中直接扩散的先前工作不同，我们提出的方法仅改变扩散过程本身，使其与现有的扩散骨干（例如DiffWave、TimeGrad、CSDI）兼容。通过根据分量能量进行阶段性噪声注入，它在整个扩散轨迹中保持高信噪比，从而提高主导频率的可恢复性，从而改善长期模式的可恢复性。这种策略使模型能够在前向过程中保持信号结构更长时间，从而提高了预测质量。在标准预测基准测试中，我们展示了应用谱分解策略（如傅立叶或小波变换）始终优于使用基线前向过程的扩散模型，且计算开销可以忽略不计。本文的代码可在https://anonymous.4open.science/r/D-FDP-4A29找到。

更新时间: 2026-01-29 14:55:43

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21812v1

Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models

Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses less than 1,000 output tokens, about 3$\times$ fewer than AriGraph and $<$1/20 of GraphRAG. The proposed pipeline enhances the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs.

Updated: 2026-01-29 14:52:01

标题: Wikontic：利用大型语言模型构建与Wikidata对齐、具有本体意识的知识图谱

摘要: 知识图谱（KGs）为大型语言模型（LLMs）提供了结构化、可验证的基础，但当前基于LLM的系统通常将KGs用作文本检索的辅助结构，其内在品质鲜有人探讨。在这项工作中，我们提出了Wikontic，一个多阶段流水线，通过从开放领域文本中提取带有限定词的候选三元组、强制执行基于Wikidata的类型和关系约束，并对实体进行规范化以减少重复，构建KGs。结果得到的KGs是紧凑的、符合本体的和连接良好的；在MuSiQue中，正确的答案实体出现在96%的生成的三元组中。在HotpotQA上，我们的三元组设置实现了76.0的F1，而在MuSiQue上为59.8的F1，与几个仍需要文本上下文的检索增强生成基线相匹配或超越。此外，Wikontic在MINE-1基准测试中达到了最先进的信息保留性能（86%），胜过先前的KG构建方法。Wikontic在构建时间上也很高效：KG构建使用的输出标记不到1,000个，比AriGraph少3倍，比GraphRAG的1/20还要少。所提出的流水线提高了生成的KG的质量，并为在LLMs中利用结构化知识提供了可扩展的解决方案。

更新时间: 2026-01-29 14:52:01

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2512.00590v2

Effective LoRA Adapter Routing using Task Representations

Low-rank adaptation (LoRA) enables parameter efficient specialization of large language models (LLMs) through modular adapters, resulting in rapidly growing public adapter pools spanning diverse tasks. Effectively using these adapters requires routing: selecting and composing the appropriate adapters for a query. We introduce LORAUTER, a novel routing framework that selects and composes LoRA adapters using task representations rather than adapter characteristics. Unlike existing approaches that map queries directly to adapters, LORAUTER routes queries via task embeddings derived from small validation sets and does not require adapter training data. By operating at the task level, LORAUTER achieves efficient routing that scales with the number of tasks rather than the number of adapters. Experiments across multiple tasks show that LORAUTER consistently outperforms baseline routing approaches, matching Oracle performance (101.2%) when task-aligned adapters exist and achieving state-of-the-art results on unseen tasks (+5.2 points). We further demonstrate the robustness of LORAUTER to very large, noisy adapter pools by scaling it to over 1500 adapters.

Updated: 2026-01-29 14:41:24

标题: 使用任务表示的有效LoRA适配器路由

摘要: 低秩适应（LoRA）通过模块化适配器实现了大型语言模型（LLMs）的参数高效专门化，从而快速增长公共适配器池，涵盖各种任务。有效地使用这些适配器需要路由：选择和组合适用于查询的适配器。我们引入了LORAUTER，这是一个新颖的路由框架，它使用任务表示而不是适配器特征来选择和组合LoRA适配器。与直接将查询映射到适配器的现有方法不同，LORAUTER通过从小型验证集中得出的任务嵌入来路由查询，并且不需要适配器训练数据。通过在任务级别操作，LORAUTER实现了随任务数量而不是适配器数量扩展的高效路由。在多个任务上的实验表明，LORAUTER始终优于基准路由方法，在任务对齐适配器存在时匹配Oracle性能（101.2%），并在未知任务上取得最先进的结果（+5.2个点）。我们进一步展示了LORAUTER对于非常庞大、嘈杂的适配器池的稳健性，将其扩展到超过1500个适配器。

更新时间: 2026-01-29 14:41:24

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21795v1

Knowledge Vector Weakening: Efficient Training-free Unlearning for Large Vision-Language Models

Large Vision-Language Models (LVLMs) are widely adopted for their strong multimodal capabilities, yet they raise serious concerns such as privacy leakage and harmful content generation. Machine unlearning has emerged as a promising solution for removing the influence of specific data from trained models. However, existing approaches largely rely on gradient-based optimization, incurring substantial computational costs for large-scale LVLMs. To address this limitation, we propose Knowledge Vector Weakening (KVW), a training-free unlearning method that directly intervenes in the full model without gradient computation. KVW identifies knowledge vectors that are activated during the model's output generation on the forget set and progressively weakens their contributions, thereby preventing the model from exploiting undesirable knowledge. Experiments on the MLLMU and CLEAR benchmarks demonstrate that KVW achieves a stable forget-retain trade-off while significantly improving computational efficiency over gradient-based and LoRA-based unlearning methods.

Updated: 2026-01-29 14:41:01

标题: 知识向量削弱：大规模视觉-语言模型的高效无需训练的遗忘

摘要: 大型视觉-语言模型（LVLMs）被广泛采用，因其强大的多模态能力而备受青睐，但它们引发了严重的隐私泄露和有害内容生成等问题。机器去学习已经成为一种有希望的解决方案，可以从训练模型中删除特定数据的影响。然而，现有方法主要依赖于基于梯度的优化，在大规模LVLMs上产生了可观的计算成本。为了解决这一限制，我们提出了一种名为知识向量削弱（KVW）的训练免费去学习方法，直接介入完整模型而无需梯度计算。KVW识别在忘记集上模型输出生成过程中被激活的知识向量，并逐渐削弱其贡献，从而防止模型利用不良知识。在MLLMU和CLEAR基准测试上的实验表明，KVW实现了稳定的遗忘-保留权衡，同时明显提高了计算效率，超过了基于梯度和LoRA的去学习方法。

更新时间: 2026-01-29 14:41:01

领域: cs.LG

下载: http://arxiv.org/abs/2601.21794v1

NetMamba+: A Framework of Pre-trained Models for Efficient and Accurate Network Traffic Classification

With the rapid growth of encrypted network traffic, effective traffic classification has become essential for network security and quality of service management. Current machine learning and deep learning approaches for traffic classification face three critical challenges: computational inefficiency of Transformer architectures, inadequate traffic representations with loss of crucial byte-level features while retaining detrimental biases, and poor handling of long-tail distributions in real-world data. We propose NetMamba+, a framework that addresses these challenges through three key innovations: (1) an efficient architecture considering Mamba and Flash Attention mechanisms, (2) a multimodal traffic representation scheme that preserves essential traffic information while eliminating biases, and (3) a label distribution-aware fine-tuning strategy. Evaluation experiments on massive datasets encompassing four main classification tasks showcase NetMamba+'s superior classification performance compared to state-of-the-art baselines, with improvements of up to 6.44\% in F1 score. Moreover, NetMamba+ demonstrates excellent efficiency, achieving 1.7x higher inference throughput than the best baseline while maintaining comparably low memory usage. Furthermore, NetMamba+ exhibits superior few-shot learning abilities, achieving better classification performance with fewer labeled data. Additionally, we implement an online traffic classification system that demonstrates robust real-world performance with a throughput of 261.87 Mb/s. As the first framework to adapt Mamba architecture for network traffic classification, NetMamba+ opens new possibilities for efficient and accurate traffic analysis in complex network environments.

Updated: 2026-01-29 14:40:04

标题: NetMamba+: 一种用于高效准确网络流量分类的预训练模型框架

摘要: 随着加密网络流量的快速增长，有效的流量分类对于网络安全和服务质量管理变得至关重要。目前用于流量分类的机器学习和深度学习方法面临三个关键挑战：Transformer架构的计算效率低下、流量表示不足，丢失关键的字节级特征同时保留有害的偏见，以及在现实数据中处理长尾分布能力差。我们提出了NetMamba+框架，通过三个关键创新解决了这些挑战：（1）考虑了Mamba和Flash Attention机制的高效架构，（2）保留了关键的流量信息并消除偏见的多模态流量表示方案，以及（3）一个基于标签分布的微调策略。对包含四个主要分类任务的大规模数据集进行的评估实验展示了NetMamba+相对于最先进基线的出色分类性能，F1得分提高了高达6.44\%。此外，NetMamba+表现出卓越的效率，推理吞吐量比最佳基线高出1.7倍，同时保持相对较低的内存使用。此外，NetMamba+表现出卓越的少样本学习能力，使用更少的标记数据实现更好的分类性能。此外，我们实现了一个在线流量分类系统，显示了强大的现实世界性能，吞吐量为261.87 Mb/s。作为第一个将Mamba架构用于网络流量分类的框架，NetMamba+为复杂网络环境中高效准确的流量分析开辟了新的可能性。

更新时间: 2026-01-29 14:40:04

领域: cs.LG

下载: http://arxiv.org/abs/2601.21792v1

Pushing the Limits of Distillation-Based Class-Incremental Learning via Lightweight Plugins

Existing replay and distillation-based class-incremental learning (CIL) methods are effective at retaining past knowledge but are still constrained by the stability-plasticity dilemma. Since their resulting models are learned over a sequence of incremental tasks, they encode rich representations and can be regarded as pre-trained bases. Building on this view, we propose a plug-in extension paradigm termed Deployment of LoRA Components (DLC) to enhance them. For each task, we use Low-Rank Adaptation (LoRA) to inject task-specific residuals into the base model's deep layers. During inference, representations with task-specific residuals are aggregated to produce classification predictions. To mitigate interference from non-target LoRA plugins, we introduce a lightweight weighting unit. This unit learns to assign importance scores to different LoRA-tuned representations. Like downloadable content in software, DLC serves as a plug-and-play enhancement that efficiently extends the base methods. Remarkably, on the large-scale ImageNet-100, with merely 4\% of the parameters of a standard ResNet-18, our DLC model achieves a significant 8\% improvement in accuracy, demonstrating exceptional efficiency. Under a fixed memory budget, methods equipped with DLC surpass state-of-the-art expansion-based methods.

Updated: 2026-01-29 14:39:17

标题: 推动基于蒸馏的轻量级插件的增量式学习极限

摘要: 现有的重播和蒸馏型增量类学习（CIL）方法在保留过去知识方面是有效的，但仍受到稳定性-可塑性困境的限制。由于它们得到的模型是通过一系列增量任务学习的，它们编码了丰富的表示，并可以被视为预训练基础。在此基础上，我们提出了一种名为LoRA组件部署（DLC）的插件扩展范式来增强它们。对于每个任务，我们使用低秩适应（LoRA）将任务特定残差注入基础模型的深层。在推理过程中，具有任务特定残差的表示被聚合以产生分类预测。为了减轻来自非目标LoRA插件的干扰，我们引入了一个轻量级加权单元。该单元学习分配不同LoRA调整表示的重要性分数。就像软件中的可下载内容一样，DLC作为一种即插即用的增强，有效地扩展了基础方法。值得注意的是，在大规模ImageNet-100上，仅使用标准ResNet-18的4％参数，我们的DLC模型在准确性方面实现了显著的8％改善，展示了异常的效率。在固定内存预算下，配备DLC的方法超过了最先进的基于扩展的方法。

更新时间: 2026-01-29 14:39:17

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2512.03537v2

ECSEL: Explainable Classification via Signomial Equation Learning

We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closed-form expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behavior, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies i.e., e-commerce and fraud detection, demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.

Updated: 2026-01-29 14:35:43

标题: ECSEL: 通过符号多项式方程学习实现可解释分类

摘要: 我们引入了ECSEL，这是一种可解释的分类方法，它学习以符号式方程形式呈现的形式表达式，这是因为观察到许多符号回归基准测试允许紧凑的符号式结构。ECSEL直接构建了一个结构化的闭合形式表达式，既可以作为分类器，又可以作为解释。在标准符号回归基准测试中，我们的方法恢复目标方程的比例比竞争对手的最先进方法更大，同时需要更少的计算。利用这种效率，ECSEL在不牺牲可解释性的情况下实现了与已建立的机器学习模型相媲美的分类准确性。此外，我们展示了ECSEL满足一些关于全局特征行为、决策边界分析和局部特征归因的理想性质。在基准数据集和两个真实案例研究中，即电子商务和欺诈检测，实验证明所学的方程揭示了数据集的偏见，支持反事实推理，并产生了可操作的见解。

更新时间: 2026-01-29 14:35:43

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2601.21789v1

Geodesic Calculus on Implicitly Defined Latent Manifolds

Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.

Updated: 2026-01-29 14:34:52

标题: 隐式定义潜在流形上的测地线微积分

摘要: Autoencoder的潜在流形提供了数据的低维表示，可以从几何角度进行研究。我们提议将这些潜在流形描述为某个环境潜在空间的隐式子流形。基于此，我们开发了一种离散黎曼微积分工具，用于近似经典的几何算子。这些工具对实际例子中经常发生的隐式表示的不准确性具有鲁棒性。为了获得合适的隐式表示，我们建议通过最小化去噪目标来学习对潜在流形的近似投影。这种方法独立于底层的autoencoder，并支持在潜在流形上使用不同的黎曼几何。该框架特别能够计算连接给定端点的测地路径，并通过潜在流形上的黎曼指数映射发射测地线。我们在合成和真实数据上对我们的方法进行了评估。

更新时间: 2026-01-29 14:34:52

领域: cs.LG

下载: http://arxiv.org/abs/2510.09468v2

Diffusion Models in Simulation-Based Inference: A Tutorial Review

Diffusion models have recently emerged as powerful learners for simulation-based inference (SBI), enabling fast and accurate estimation of latent parameters from simulated and real data. Their score-based formulation offers a flexible way to learn conditional or joint distributions over parameters and observations, thereby providing a versatile solution to various modeling problems. In this tutorial review, we synthesize recent developments on diffusion models for SBI, covering design choices for training, inference, and evaluation. We highlight opportunities created by various concepts such as guidance, score composition, flow matching, consistency models, and joint modeling. Furthermore, we discuss how efficiency and statistical accuracy are affected by noise schedules, parameterizations, and samplers. Finally, we illustrate these concepts with case studies across parameter dimensionalities, simulation budgets, and model types, and outline open questions for future research.

Updated: 2026-01-29 14:33:29

标题: 模拟推断中的扩散模型：教程回顾

摘要: 扩散模型最近已经成为模拟推断(SBI)强大的学习者，能够从模拟和真实数据中快速准确地估计潜在参数。它们基于得分的表述提供了一种灵活的方式来学习参数和观测之间的条件或联合分布，从而为各种建模问题提供了多功能的解决方案。在这篇教程回顾中，我们综合了关于扩散模型用于SBI的最新发展，涵盖了训练、推断和评估的设计选择。我们突出了各种概念所创造的机会，如指导、得分组合、流匹配、一致性模型和联合建模。此外，我们讨论了噪声时间表、参数化和采样器对效率和统计准确性的影响。最后，我们通过跨参数维度、模拟预算和模型类型的案例研究来说明这些概念，并概述未来研究的开放问题。

更新时间: 2026-01-29 14:33:29

领域: stat.ML,cs.LG,stat.ME

下载: http://arxiv.org/abs/2512.20685v2

Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime

Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.

Updated: 2026-01-29 14:32:46

标题: 懒惰训练方案中随机梯度 Langevin 动力学的收敛性

摘要: 连续时间模型为深度学习中优化算法的训练动态提供了重要见解。在这项工作中，我们建立了随机梯度 Langevin 动力学（SGLD）的非渐近收敛分析，SGLD 是一个在连续时间中的随机梯度下降的 Itô 随机微分方程（SDE）近似，在懒惰训练模式下。我们表明，在损失函数的 Hessian 矩阵具有正则性条件的情况下，SGLD 在训练过程中通过乘性和状态相关噪声（i）以很高的概率产生非退化核，并且（ii）在期望下实现对经验风险最小化者的指数收敛，我们建立了优化间隙的有限时间和有限宽度界限。我们通过回归设置中的数值示例来证实我们的理论发现。

更新时间: 2026-01-29 14:32:46

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2510.21245v3

Quantum LEGO Learning: A Modular Design Principle for Hybrid Artificial Intelligence

Hybrid quantum-classical learning models increasingly integrate neural networks with variational quantum circuits (VQCs) to exploit complementary inductive biases. However, many existing approaches rely on tightly coupled architectures or task-specific encoders, limiting conceptual clarity, generality, and transferability across learning settings. In this work, we introduce Quantum LEGO Learning, a modular and architecture-agnostic learning framework that treats classical and quantum components as reusable, composable learning blocks with well-defined roles. Within this framework, a pre-trained classical neural network serves as a frozen feature block, while a VQC acts as a trainable adaptive module that operates on structured representations rather than raw inputs. This separation enables efficient learning under constrained quantum resources and provides a principled abstraction for analyzing hybrid models. We develop a block-wise generalization theory that decomposes learning error into approximation and estimation components, explicitly characterizing how the complexity and training status of each block influence overall performance. Our analysis generalizes prior tensor-network-specific results and identifies conditions under which quantum modules provide representational advantages over comparably sized classical heads. Empirically, we validate the framework through systematic block-swap experiments across frozen feature extractors and both quantum and classical adaptive heads. Experiments on quantum dot classification demonstrate stable optimization, reduced sensitivity to qubit count, and robustness to realistic noise.

Updated: 2026-01-29 14:29:21

标题: 量子乐高学习：混合人工智能的模块化设计原则

摘要: 混合量子-经典学习模型越来越多地将神经网络与变分量子电路（VQCs）相结合，以利用互补的归纳偏差。然而，许多现有方法依赖于紧密耦合的体系结构或任务特定的编码器，限制了概念上的清晰度、普适性和在学习环境中的可传递性。在这项工作中，我们引入了量子乐高学习，这是一个模块化和与体系结构无关的学习框架，将经典和量子组件视为可重复使用、可组合的学习模块，具有明确定义的角色。在这个框架内，一个经过预训练的经典神经网络充当一个冻结的特征块，而一个VQC充当一个可训练的自适应模块，作用于结构化表示而不是原始输入。这种分离使得在受限的量子资源下进行高效学习，并为分析混合模型提供了一个原则性的抽象。我们开发了一个分块通用化理论，将学习误差分解为逼近和估计组件，明确表征了每个块的复杂性和训练状态如何影响整体性能。我们的分析推广了先前的张量网络特定结果，并确定了量子模块在与相同规模的经典头相比提供表征优势的条件。在实证方面，我们通过系统性的块交换实验验证了该框架，包括冻结特征提取器和量子和经典自适应头。量子点分类的实验表明了稳定的优化、对量子比特数量的敏感性降低以及对真实噪声的鲁棒性。

更新时间: 2026-01-29 14:29:21

领域: cs.LG,quant-ph

下载: http://arxiv.org/abs/2601.21780v1

Error Amplification Limits ANN-to-SNN Conversion in Continuous Control

Spiking Neural Networks (SNNs) can achieve competitive performance by converting already existing well-trained Artificial Neural Networks (ANNs), avoiding further costly training. This property is particularly attractive in Reinforcement Learning (RL), where training through environment interaction is expensive and potentially unsafe. However, existing conversion methods perform poorly in continuous control, where suitable baselines are largely absent. We identify error amplification as the key cause: small action approximation errors become temporally correlated across decision steps, inducing cumulative state distribution shift and severe performance degradation. To address this issue, we propose Cross-Step Residual Potential Initialization (CRPI), a lightweight training-free mechanism that carries over residual membrane potentials across decision steps to suppress temporally correlated errors. Experiments on continuous control benchmarks with both vector and visual observations demonstrate that CRPI can be integrated into existing conversion pipelines and substantially recovers lost performance. Our results highlight continuous control as a critical and challenging benchmark for ANN-to-SNN conversion, where small errors can be strongly amplified and impact performance.

Updated: 2026-01-29 14:28:00

标题: 错误放大限制了人工神经网络到脉冲神经网络在连续控制中的转换

摘要: 尖峰神经网络（SNN）通过将已经存在的经过良好训练的人工神经网络（ANN）转换，可以实现竞争性能，避免进一步昂贵的训练。这种特性在强化学习（RL）中特别吸引人，因为通过环境交互进行训练是昂贵且潜在不安全的。然而，现有的转换方法在连续控制方面表现不佳，适用的基线大多缺失。我们确定错误放大是关键原因：小动作近似误差在决策步骤之间变得时间相关，引起累积状态分布变化和性能严重下降。为了解决这个问题，我们提出了跨步残余电位初始化（CRPI），这是一个轻量级的无需训练的机制，通过在决策步骤之间传递残余膜电位来抑制时间相关的错误。在具有矢量和视觉观察的连续控制基准测试中进行的实验证明，CRPI可以集成到现有的转换流程中，并显著恢复丢失的性能。我们的结果突显连续控制作为ANN到SNN转换的一个关键且具有挑战性的基准测试，在这里小错误可能会被强烈放大并影响性能。

更新时间: 2026-01-29 14:28:00

领域: cs.NE,cs.LG

下载: http://arxiv.org/abs/2601.21778v1

Differentiable Knapsack and Top-k Operators via Dynamic Programming

Knapsack and Top-k operators are useful for selecting discrete subsets of variables. However, their integration into neural networks is challenging as they are piecewise constant, yielding gradients that are zero almost everywhere. In this paper, we propose a unified framework casting these operators as dynamic programs, and derive differentiable relaxations by smoothing the underlying recursions. On the algorithmic side, we develop efficient parallel algorithms supporting both deterministic and stochastic forward passes, and vector-Jacobian products for the backward pass. On the theoretical side, we prove that Shannon entropy is the unique regularization choice yielding permutation-equivariant operators, and characterize regularizers inducing sparse selections. Finally, on the experimental side, we demonstrate our framework on a decision-focused learning benchmark, a constrained dynamic assortment RL problem, and an extension of discrete VAEs.

Updated: 2026-01-29 14:25:35

标题: Differentiable Knapsack and Top-k Operators via Dynamic Programming 通过动态规划实现可微分的背包和前k个操作符

摘要: 背包和Top-k运算符对于选择离散变量子集非常有用。然而，由于它们是分段常数，将它们整合到神经网络中是具有挑战性的，因为它们几乎在所有地方都产生零梯度。在本文中，我们提出了一个统一的框架，将这些运算符作为动态程序进行建模，并通过平滑基础递归导出可微松弛。在算法方面，我们开发了支持确定性和随机正向传递的高效并行算法，以及用于反向传递的矢量雅可比乘积。在理论方面，我们证明了Shannon熵是唯一的正则化选择，产生排列等变运算符，并表征引导稀疏选择的正则化器。最后，在实验方面，我们在一个决策焦点的学习基准、一个约束动态分类RL问题以及离散VAE的扩展上展示了我们的框架。

更新时间: 2026-01-29 14:25:35

领域: cs.LG

下载: http://arxiv.org/abs/2601.21775v1

WL Tests Are Far from All We Need: Revisiting WL-Test Hardness and GNN Expressive Power from a Distributed Computation Perspective

The expressive power of graph neural networks (GNNs) is often studied through their relationship to the Weisfeiler-Lehman (WL) tests. Despite its influence, this perspective leaves two gaps: (i) it is unclear whether WL tests are sufficiently primitive for understanding GNN expressivity, and (ii) WL-induced equivalence does not align well with characterizing the function classes that GNNs can approximate or compute. We attempt to address both gaps. First, we strengthen hardness results for the vanilla WL test, showing that in many settings it is not primitive enough to be implemented by constant-depth GNNs. Second, we propose an alternative framework for studying GNN expressivity based on an extended CONGEST model with an explicit preprocessing phase. Within this framework, we identify implicit shortcuts introduced in prior analyses and establish further results for WL tests in settings where graphs are augmented with virtual nodes and virtual edges.

Updated: 2026-01-29 14:25:28

标题: WL测试还远远不够：从分布式计算的角度重新审视WL-Test的难度和GNN的表达能力

摘要: 图神经网络（GNNs）的表达能力经常通过它们与Weisfeiler-Lehman（WL）测试的关系来研究。尽管其影响深远，但这种视角存在两个缺口：（i）尚不清楚WL测试是否足够基本以理解GNN的表达能力，（ii）WL引发的等价性与表征GNN能够近似或计算的函数类不太吻合。我们尝试解决这两个缺口。首先，我们加强了对原始WL测试的难度结果，表明在许多情况下，它不够基本，不能由恒定深度的GNNs实现。其次，我们提出了一个基于扩展CONGEST模型的替代框架，用于研究GNN的表达能力，该框架具有明确的预处理阶段。在这个框架内，我们确定了先前分析中引入的隐式快捷方式，并在图被增加虚拟节点和虚拟边的情况下建立了进一步的WL测试结果。

更新时间: 2026-01-29 14:25:28

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2410.01308v5

Introducing Instruction-Accurate Simulators for Performance Estimation of Autotuning Workloads

Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally, autotuning requires the workloads to be executed on the target hardware (HW). We present an interface that allows executing autotuning workloads on simulators. This approach offers high scalability when the availability of the target HW is limited, as many simulations can be run in parallel on any accessible HW. Additionally, we evaluate the feasibility of using fast instruction-accurate simulators for autotuning. We train various predictors to forecast the performance of ML workload implementations on the target HW based on simulation statistics. Our results demonstrate that the tuned predictors are highly effective. The best workload implementation in terms of actual run time on the target HW is always within the top 3 % of predictions for the tested x86, ARM, and RISC-V-based architectures. In the best case, this approach outperforms native execution on the target HW for embedded architectures when running as few as three samples on three simulators in parallel.

Updated: 2026-01-29 14:17:34

标题: 引入指令准确模拟器用于自动调整工作负载性能估计

摘要: 加速机器学习（ML）工作负载需要有效的方法，因为其优化空间很大。自动调整已经成为系统评估不同实现变体的有效方法。传统上，自动调整要求在目标硬件（HW）上执行工作负载。我们提出了一种接口，允许在模拟器上执行自动调整工作负载。这种方法在目标HW的可用性有限时具有很高的可扩展性，因为许多模拟可以并行在任何可访问的HW上运行。此外，我们评估了使用快速指令准确模拟器进行自动调整的可行性。我们训练了各种预测器，根据模拟统计数据预测ML工作负载在目标HW上的性能。我们的结果表明，调整后的预测器非常有效。在测试的基于x86、ARM和RISC-V架构的架构中，实际运行时间最佳的工作负载实现始终在前3％的预测中。在最佳情况下，当在三个模拟器上并行运行三个样本时，这种方法在嵌入式架构上优于在目标HW上的本机执行。

更新时间: 2026-01-29 14:17:34

领域: cs.AR,cs.LG

下载: http://arxiv.org/abs/2505.13357v2

SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights

Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances. We show that this approximates activation-aware quantization by recovering column scales from the weight matrix structure that are predictive of the typical activation magnitudes the matrix received during training. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layer. We evaluate our method on the Qwen3 model family, among others. SINQ reduces the perplexity gap on WikiText2 and C4 by over 50% against uncalibrated uniform quantization baselines, incurs zero to negligible compute overhead, and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code is available at https://github.com/huawei-csl/SINQ.

Updated: 2026-01-29 14:15:45

标题: SINQ：Sinkhorn-Normalized Quantization用于无需校准的低精度LLM权重

摘要: 训练后量化已成为部署大型语言模型在低精度下最广泛使用的策略。然而，目前的方法在比特宽度小于或等于4时显示了困惑性的降级，部分原因是代表异常值导致了与这些异常值共享相同尺度的参数精度问题。这个问题在无需校准的均匀量化方法中尤为显著。我们引入了SINQ来增强现有的训练后量化器，其中包括额外的第二轴比例因子和一种快速的Sinkhorn-Knopp风格算法，该算法找到了用于归一化每行和每列方差的比例。我们展示了这种方法通过从权重矩阵结构中恢复列缩放来近似激活感知量化，这些缩放可以预测训练过程中矩阵接收到的典型激活幅度。我们的方法在层之间没有交互作用，并且可以轻松地应用于新的架构以量化任何线性层。我们在Qwen3模型系列等上评估了我们的方法。与未校准的均匀量化基线相比，SINQ在WikiText2和C4上减少了迷惑度差距超过50%，没有或几乎没有计算开销，并且可以通过与校准和非均匀量化级别相结合进一步增强。代码可在https://github.com/huawei-csl/SINQ获取。

更新时间: 2026-01-29 14:15:45

领域: cs.LG

下载: http://arxiv.org/abs/2509.22944v4

Zero-Shot Statistical Downscaling via Diffusion Posterior Sampling

Conventional supervised climate downscaling struggles to generalize to Global Climate Models (GCMs) due to the lack of paired training data and inherent domain gaps relative to reanalysis. Meanwhile, current zero-shot methods suffer from physical inconsistencies and vanishing gradient issues under large scaling factors. We propose Zero-Shot Statistical Downscaling (ZSSD), a zero-shot framework that performs statistical downscaling without paired data during training. ZSSD leverages a Physics-Consistent Climate Prior learned from reanalysis data, conditioned on geophysical boundaries and temporal information to enforce physical validity. Furthermore, to enable robust inference across varying GCMs, we introduce Unified Coordinate Guidance. This strategy addresses the vanishing gradient problem in vanilla DPS and ensures consistency with large-scale fields. Results show that ZSSD significantly outperforms existing zero-shot baselines in 99th percentile errors and successfully reconstructs complex weather events, such as tropical cyclones, across heterogeneous GCMs.

Updated: 2026-01-29 14:14:41

标题: 通过扩散后验采样进行零尺度统计降尺度

摘要: 传统的监督气候降尺度在推广到全球气候模型（GCMs）时存在困难，原因是缺乏配对训练数据以及与再分析相比固有的领域差距。同时，当前的零样本方法在大比例因子下存在物理不一致性和梯度消失问题。我们提出了零样本统计降尺度（ZSSD），这是一个零样本框架，在训练过程中可以进行统计降尺度而无需配对数据。ZSSD利用从再分析数据中学习的物理一致气候先验，通过地球物理边界和时间信息来强化物理有效性。此外，为了实现跨不同GCMs的稳健推断，我们引入了统一坐标引导。这种策略解决了普通DPS中的梯度消失问题，并确保与大规模场的一致性。结果表明，ZSSD在99分位误差方面明显优于现有的零样本基线，并成功重建了跨异质GCMs的复杂气候事件，如热带气旋。

更新时间: 2026-01-29 14:14:41

领域: cs.AI

下载: http://arxiv.org/abs/2601.21760v1

EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference

Serving Large Language Models (LLMs) under mixed workloads--short, latency-sensitive interactive queries alongside long, throughput-oriented batch requests--poses a fundamental scheduling challenge. Standard First-Come, First-Served (FCFS) policies suffer from severe head-of-line blocking, leading to high tail latency and underutilized hardware. We introduce EWSJF (Effective Workload-based Shortest Job First), an adaptive request-level scheduler that learns workload structure in real time to jointly improve fairness and throughput. EWSJF operates upstream of execution-level schedulers and integrates four components: (1) Refine-and-Prune, an unsupervised partitioning algorithm that discovers performance-homogeneous request groups; (2) Dynamic Queue Routing for assigning requests to these groups; (3) Density-Weighted Scoring, a context-aware prioritization function balancing urgency and fairness; and (4) Bayesian Meta-Optimization, which continuously tunes scoring and partitioning parameters based on live performance feedback. Implemented in vLLM, EWSJF improves end-to-end throughput by over 30% and reduces average Time-To-First-Token for short requests by up to 4x compared to FCFS. These results demonstrate that adaptive, learning-based request scheduling is a critical missing layer for efficient and responsive LLM serving. Implementation available at https://anonymous.4open.science/r/vllm_0110-32D8.

Updated: 2026-01-29 14:14:16

标题: EWSJF：一种混合分区的自适应调度器，用于混合工作负载LLM推断

摘要: 为了应对混合工作负载（短、延迟敏感的交互查询与长、吞吐量导向的批量请求）下的大型语言模型（LLMs）的调度挑战，标准的先来先服务（FCFS）策略存在严重的头阻塞问题，导致尾延迟高和硬件利用率低。我们引入了EWSJF（Effective Workload-based Shortest Job First），这是一个自适应的请求级调度器，实时学习工作负载结构，共同提高公平性和吞吐量。EWSJF在执行级调度器上游运行，并集成了四个组件：（1）Refine-and-Prune，一种无监督分区算法，发现性能同质的请求组；（2）动态队列路由，用于将请求分配给这些组；（3）密度加权评分，一个上下文感知的优先级函数，平衡紧急性和公平性；（4）贝叶斯元优化，根据实时性能反馈持续调整评分和分区参数。在vLLM中实施的EWSJF将端到端吞吐量提高了30％以上，并将短请求的平均首令牌到达时间相对于FCFS缩短了最多4倍。这些结果表明，自适应的、基于学习的请求调度是高效且响应灵敏的LLM服务的关键缺失层。实施细节可在https://anonymous.4open.science/r/vllm_0110-32D8找到。

更新时间: 2026-01-29 14:14:16

领域: cs.DC,cs.AI

下载: http://arxiv.org/abs/2601.21758v1

Agent-OM: Leveraging LLM Agents for Ontology Matching

Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.

Updated: 2026-01-29 14:13:49

标题: Agent-OM：利用LLM代理进行本体匹配

摘要: 本文研究了本体匹配（OM）技术，它能够实现不同本体之间的语义互操作性，并通过对相关实体进行对齐来解决其概念异质性。目前，OM系统主要有两种设计范式：传统基于知识的专家系统和基于机器学习的预测系统。虽然大型语言模型（LLMs）和LLM代理已经在数据工程中取得了革命性进展，并在许多领域进行了创造性应用，但它们在OM方面的潜力仍未得到充分挖掘。本研究引入了一种新颖的基于代理的LLM设计范式，用于OM系统。考虑到利用LLM代理进行OM所面临的几个特定挑战，我们提出了一个通用框架，即Agent-OM（用于本体匹配的代理），包括两个Siamese代理用于检索和匹配，以及一套OM工具。我们的框架已在一个概念验证系统中实施。对三个Ontology Alignment Evaluation Initiative（OAEI）赛道的评估结果显示，我们的系统在简单OM任务上可以取得非常接近长期最佳性能的结果，并且能够显著提高复杂和少样本OM任务的性能。

更新时间: 2026-01-29 14:13:49

领域: cs.AI,cs.CL,cs.IR

下载: http://arxiv.org/abs/2312.00326v25

One-Shot Federated Learning with Classifier-Free Diffusion Models

Federated learning (FL) enables collaborative learning without data centralization but introduces significant communication costs due to multiple communication rounds between clients and the server. One-shot federated learning (OSFL) addresses this by forming a global model with a single communication round, often relying on the server's model distillation or auxiliary dataset generation - mostly through pre-trained diffusion models (DMs). Existing DM-assisted OSFL methods, however, typically employ classifier-guided DMs, which require training auxiliary classifier models at each client, introducing additional computation overhead. This work introduces OSCAR (One-Shot Federated Learning with Classifier-Free Diffusion Models), a novel OSFL approach that eliminates the need for auxiliary models. OSCAR uses foundation models to devise category-specific data representations at each client which are integrated into a classifier-free diffusion model pipeline for server-side data generation. In our experiments, OSCAR outperforms the state-of-the-art on four benchmark datasets while reducing the communication load by at least 99%.

Updated: 2026-01-29 14:10:05

标题: 一次性联合学习与无分类器扩散模型

摘要: 联邦学习（FL）实现了协作学习而无需数据集中，但由于客户端和服务器之间的多次通信轮次而引入了显著的通信成本。一次性联邦学习（OSFL）通过在单个通信轮次中形成全局模型来解决这个问题，通常依赖于服务器的模型蒸馏或辅助数据集生成 - 大多通过预训练扩散模型（DMs）。然而，现有的DM辅助OSFL方法通常采用需要在每个客户端训练辅助分类器模型的分类器引导DMs，引入了额外的计算开销。本文介绍了OSCAR（无分类器扩散模型的一次性联邦学习），这是一种新颖的OSFL方法，消除了辅助模型的需求。OSCAR使用基础模型在每个客户端设计特定类别的数据表示，将其集成到无分类器扩散模型管道中进行服务器端数据生成。在我们的实验中，OSCAR在四个基准数据集上表现优于最先进技术，同时至少减少了99%的通信负载。

更新时间: 2026-01-29 14:10:05

领域: cs.LG

下载: http://arxiv.org/abs/2502.08488v2

Rotary Position Encodings for Graphs

We study the extent to which rotary position encodings (RoPE), a recent transformer position encoding algorithm broadly adopted in large language models (LLMs) and vision transformers (ViTs), can be applied to graph-structured data. We find that rotating tokens depending on the spectrum of the graph Laplacian efficiently injects structural information into the attention mechanism, boosting performance in synthetic and real-world graph learning tasks. This approach, coined _Wave-Induced Rotary Encodings_ (WIRE), enjoys intriguing theoretical properties: it recovers regular RoPE on grids, and depends asymptotically on the graph effective resistance. Unlike bias-based relative position encodings, WIRE is compatible with linear attention.

Updated: 2026-01-29 14:09:45

标题: 图表的旋转位置编码

摘要: 我们研究了最近在大型语言模型（LLMs）和视觉变换器（ViTs）广泛采用的旋转位置编码（RoPE）算法在图结构数据中的应用程度。我们发现，根据图拉普拉斯矩阵的频谱旋转标记可以有效地将结构信息注入注意机制，提升合成和真实世界图学习任务的性能。这种方法被称为“Wave-Induced Rotary Encodings”（WIRE），具有有趣的理论特性：它在网格上恢复了常规的RoPE，并且在渐近上取决于图的有效电阻。与基于偏置的相对位置编码不同，WIRE与线性注意兼容。

更新时间: 2026-01-29 14:09:45

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.22259v3

Language-based Trial and Error Falls Behind in the Era of Experience

While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial-and-error, which is computationally unsustainable for parameter-heavy LLMs operating in a high dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are utilized to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while saving about 60% GPU hours consumption.

Updated: 2026-01-29 14:08:41

标题: 基于语言的试错方法在经验时代落后了。

摘要: 尽管大型语言模型（LLMs）在基于语言的代理任务中表现出色，但它们在未知的非语言环境（例如符号或空间任务）中的适用性仍然有限。先前的研究将这种性能差距归因于预训练分布与测试分布之间的不匹配。在这项工作中，我们证明主要的瓶颈是探索的成本过高：掌握这些任务需要大量的试错，对于在高维语义空间中运行的参数重的LLMs来说，这在计算上是不可持续的。为了解决这个问题，我们提出了SCOUT（Sub-Scale Collaboration On Unseen Tasks），这是一个新颖的框架，将探索与利用分开。我们使用轻量级的“侦察员”（例如小型MLPs）以远远超过LLMs的速度和规模探索环境动态。收集到的轨迹被用来通过监督微调（SFT）来启动LLM，随后通过多轮强化学习（RL）来激活其潜在的世界知识。根据经验，SCOUT使得Qwen2.5-3B-Instruct模型的平均得分达到0.86，明显优于专有模型，包括Gemini-2.5-Pro（0.60），同时节约了约60%的GPU小时消耗。

更新时间: 2026-01-29 14:08:41

领域: cs.AI

下载: http://arxiv.org/abs/2601.21754v1

Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions

Flow Matching (FM) models achieve remarkable results in generative tasks. Building upon diffusion models, FM's simulation-free training paradigm enables simplicity and efficiency but introduces a train-inference gap: model outputs cannot be assessed during training. Moreover, the straight flow assumption suffers from some inherent limitations. To address this, we propose to fine-tune FM via Maximum Likelihood Estimation (MLE) of reconstructions -- enabled by FM's smooth ODE formulation, unlike the stochastic differential equations (SDEs) in diffusion models. We first theoretically analyze the relationship between training loss and inference error in FM under numerical precision constraints. We then propose an easy-to-implement fine-tuning framework based on MLE of reconstructions, with flexibility for sophisticated extensions. Building on this, we incorporate a generalized artificial viscosity term that enhances flow stability and robustness, accompanied by a direct parameterization method and rigorous theoretical guarantees. Experiments demonstrate our method's effectiveness across diverse settings: a toy example provides mechanistic insights into the fine-tuning process, while large-scale evaluations on meteorological forecasting and robotic manipulation policies validate reliable performance improvements.

Updated: 2026-01-29 14:05:23

标题: 通过最大似然估计重建来微调流匹配

摘要: Flow Matching (FM)模型在生成任务中取得了显著的成果。基于扩散模型，FM的无模拟训练范式实现了简单和高效，但引入了一个训练推断间隙：模型输出在训练期间无法评估。此外，直接流动假设存在一些固有限制。为了解决这个问题，我们提出通过最大似然估计（MLE）重构来微调FM -- 这是由于FM的平滑ODE公式，而不是扩散模型中的随机微分方程（SDEs）所实现的。我们首先在数值精度约束下理论分析了FM中训练损失和推断误差之间的关系。然后，我们提出了一个基于MLE重构的易于实现的微调框架，具有灵活性以进行复杂扩展。在此基础上，我们加入了一个增强流动稳定性和鲁棒性的广义人工粘度项，同时伴随着直接参数化方法和严格的理论保证。实验证明了我们的方法在不同设置下的有效性：一个玩具例子提供了对微调过程的机械洞察，而对气象预测和机器人操纵策略的大规模评估验证了可靠的性能改进。

更新时间: 2026-01-29 14:05:23

领域: cs.LG

下载: http://arxiv.org/abs/2510.02081v2

FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer

Training large-scale neural networks requires solving nonconvex optimization where the choice of optimizer fundamentally determines both convergence behavior and computational efficiency. While adaptive methods like Adam have long dominated practice, the recently proposed Muon optimizer achieves superior performance through orthogonalized momentum updates that enforce isotropic geometry with uniform singular values. However, this strict isotropy discards potentially valuable curvature information encoded in gradient spectra, motivating optimization methods that balance geometric structure with adaptivity. We introduce FISMO (Fisher-Structured Momentum-Orthogonalized) optimizer, which generalizes isotropic updates to incorporate anisotropic curvature information through Fisher information geometry. By reformulating the optimizer update as a trust-region problem constrained by a Kronecker-factored Fisher metric, FISMO achieves structured preconditioning that adapts to local loss landscape geometry while maintaining computational tractability. We establish convergence guarantees for FISMO in stochastic nonconvex settings, proving an $\mathcal{O}(1/\sqrt{T})$ rate for the expected squared gradient norm with explicit characterization of variance reduction through mini-batching. Empirical evaluation on image classification and language modeling benchmarks demonstrates that FISMO achieves superior training efficiency and final performance compared to established baselines.

Updated: 2026-01-29 14:05:04

标题: FISMO：费舍尔结构化的动量正交优化器

摘要: 训练大规模神经网络需要解决非凸优化问题，其中优化器的选择从根本上决定了收敛行为和计算效率。尽管像Adam这样的自适应方法长期以来在实践中占据主导地位，但最近提出的Muon优化器通过正交化动量更新实现了卓越性能，强制实现具有均匀奇异值的等距几何。然而，这种严格的等距性舍弃了在梯度谱中编码的潜在有价值的曲率信息，促使优化方法平衡几何结构与适应性。我们引入了FISMO（Fisher-Structured Momentum-Orthogonalized）优化器，它将各向同性更新推广到通过Fisher信息几何结构包含各向异性曲率信息。通过将优化器更新重新表述为受Kronecker分解的Fisher度量约束的信任区域问题，FISMO实现了结构化预处理，适应于局部损失地形几何，同时保持计算可行性。我们在随机非凸设置中建立了FISMO的收敛保证，证明了对期望梯度范数的$\mathcal{O}(1/\sqrt{T})$速率，通过小批量明确表征方差减少。对图像分类和语言建模基准的实证评估表明，与已建立的基线相比，FISMO实现了更优异的训练效率和最终性能。

更新时间: 2026-01-29 14:05:04

领域: cs.LG

下载: http://arxiv.org/abs/2601.21750v1

Temporal Sepsis Modeling: a Fully Interpretable Relational Way

Sepsis remains one of the most complex and heterogeneous syndromes in intensive care, characterized by diverse physiological trajectories and variable responses to treatment. While deep learning models perform well in the early prediction of sepsis, they often lack interpretability and ignore latent patient sub-phenotypes. In this work, we propose a machine learning framework by opening up a new avenue for addressing this issue: a relational approach. Temporal data from electronic medical records (EMRs) are viewed as multivariate patient logs and represented in a relational data schema. Then, a propositionalisation technique (based on classic aggregation/selection functions from the field of relational data) is applied to construct interpretable features to "flatten" the data. Finally, the flattened data is classified using a selective naive Bayesian classifier. Experimental validation demonstrates the relevance of the suggested approach as well as its extreme interpretability. The interpretation is fourfold: univariate, global, local, and counterfactual.

Updated: 2026-01-29 14:02:26

标题: 时间性败血症建模：一种完全可解释的关系方法

摘要: Sepsis仍然是重症监护中最复杂和异质性综合征之一，其特点是多样化的生理轨迹和对治疗的可变反应。尽管深度学习模型在早期预测脓毒症方面表现良好，但它们通常缺乏可解释性，并忽略患者的潜在亚表型。在这项工作中，我们提出了一个机器学习框架，通过开辟一条新途径来解决这个问题：一种关系性方法。来自电子病历（EMRs）的时间数据被视为多元患者日志，并以关系数据模式表示。然后，应用一种基于关系数据领域的经典聚合/选择函数的命题化技术，构建可解释的特征以“展平”数据。最后，使用选择性朴素贝叶斯分类器对展平数据进行分类。实验证实了建议方法的相关性以及其极高的可解释性。解释分为四个层面：单变量、全局、局部和反事实。

更新时间: 2026-01-29 14:02:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21747v1

Physics-Aware Heterogeneous GNN Architecture for Real-Time BESS Optimization in Unbalanced Distribution Systems

Battery energy storage systems (BESS) have become increasingly vital in three-phase unbalanced distribution grids for maintaining voltage stability and enabling optimal dispatch. However, existing deep learning approaches often lack explicit three-phase representation, making it difficult to accurately model phase-specific dynamics and enforce operational constraints--leading to infeasible dispatch solutions. This paper demonstrates that by embedding detailed three-phase grid information--including phase voltages, unbalanced loads, and BESS states--into heterogeneous graph nodes, diverse GNN architectures (GCN, GAT, GraphSAGE, GPS) can jointly predict network state variables with high accuracy. Moreover, a physics-informed loss function incorporates critical battery constraints--SoC and C-rate limits--via soft penalties during training. Experimental validation on the CIGRE 18-bus distribution system shows that this embedding-loss approach achieves low prediction errors, with bus voltage MSEs of 6.92e-07 (GCN), 1.21e-06 (GAT), 3.29e-05 (GPS), and 9.04e-07 (SAGE). Importantly, the physics-informed method ensures nearly zero SoC and C-rate constraint violations, confirming its effectiveness for reliable, constraint-compliant dispatch.

Updated: 2026-01-29 14:00:35

标题: 物理感知异构GNN架构用于不平衡配电系统中实时BESS优化

摘要: 电池储能系统（BESS）在三相不平衡配电网中变得越来越重要，用于维持电压稳定性并实现最佳调度。然而，现有的深度学习方法通常缺乏明确的三相表示，难以准确建模相特定动态并强制执行操作约束，导致不可行的调度解决方案。本文证明了通过将详细的三相网格信息（包括相电压、不平衡负载和BESS状态）嵌入到异质图节点中，多样化的GNN架构（GCN、GAT、GraphSAGE、GPS）可以共同高准确度地预测网络状态变量。此外，一个基于物理的损失函数通过软惩罚在训练过程中纳入关键的电池约束（SoC和C-率限制）。在CIGRE 18节点配电系统上的实验验证表明，这种嵌入-损失方法实现了低预测误差，母线电压MSE分别为6.92e-07（GCN）、1.21e-06（GAT）、3.29e-05（GPS）和9.04e-07（SAGE）。重要的是，基于物理的方法确保几乎没有SoC和C-率约束违规，证实了其用于可靠、符合约束的调度的有效性。

更新时间: 2026-01-29 14:00:35

领域: cs.LG

下载: http://arxiv.org/abs/2512.09780v2

Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems

Individual agents in multi-agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly-built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that our ECL enables small models like Qwen 3-4B to outperform a history-agnostic baseline 8x its size (Qwen 3-30B) by accurately identifying reliable peers. ECL also boosts frontier models to near-perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation in trust modeling accuracy and final answer quality.

Updated: 2026-01-29 13:59:32

标题: 认知语境学习：在基于LLM的多智能体系统中建立正确的信任

摘要: 多智能体系统中的个体代理通常缺乏稳健性，往往盲目地随从误导他们的同行。我们展示了这种弱点源于阿谀奉承和评估同行可靠性能力不足。为了解决这个问题，我们首先形式化了历史感知参考的学习问题，引入同行的历史互动作为额外输入，以便代理能够估计同行的可靠性，并在不确定时从可信赖的同行那里学习。这将任务从评估同行推理质量转变为基于互动历史估计同行可靠性。然后我们开发了Epistemic Context Learning（ECL）：一个推理框架，通过从历史中明确构建的同行概况来调整预测。我们进一步通过强化学习使用辅助奖励来优化ECL。我们的实验表明，我们的ECL使像Qwen 3-4B这样的小模型能够胜过其8倍大小的历史无关基线（Qwen 3-30B），通过准确识别可靠的同行。ECL也将前沿模型提升至近乎完美（100%）的性能。我们展示ECL在各种多智能体配置中具有很好的泛化能力，并发现LLMs很好地对信任进行建模，揭示了信任建模准确性和最终答案质量之间的强相关性。

更新时间: 2026-01-29 13:59:32

领域: cs.AI,cs.CL,cs.MA

下载: http://arxiv.org/abs/2601.21742v1

Notes on Univariate Sumcheck

Two candidate approaches for univariate sumcheck over roots of unity are presented. The first takes the form of a multilinear evaluation protocol, which can be combined with the standard multivariate sumcheck protocol. The other consists of a direct reduction from univariate sumcheck to multilinear evaluation, which can be combined with Gemini (Bootle et al., Eurocrypt 2022). Both approaches optionally support a very natural exponential round reduction from $m$ to $\log(m)$ while retaining asymptotically linear prover time.

Updated: 2026-01-29 13:59:30

标题: 《关于一元Sumcheck的注释》

摘要: 本文介绍了两种针对单位根上的单变量sumcheck的候选方法。第一种方法采用多线性评估协议的形式，可以与标准的多变量sumcheck协议结合使用。另一种方法是将单变量sumcheck直接归约到多线性评估，可以与Gemini（Bootle等人，Eurocrypt 2022）结合使用。这两种方法都可以选择支持非常自然的指数轮次从$m$降至$\log(m)$，同时保持渐近线性的证明者时间。

更新时间: 2026-01-29 13:59:30

领域: cs.CR

下载: http://arxiv.org/abs/2505.00554v4

Efficient Learning of Stationary Diffusions with Stein-type Discrepancies

Learning a stationary diffusion amounts to estimating the parameters of a stochastic differential equation whose stationary distribution matches a target distribution. We build on the recently introduced kernel deviation from stationarity (KDS), which enforces stationarity by evaluating expectations of the diffusion's generator in a reproducing kernel Hilbert space. Leveraging the connection between KDS and Stein discrepancies, we introduce the Stein-type KDS (SKDS) as an alternative formulation. We prove that a vanishing SKDS guarantees alignment of the learned diffusion's stationary distribution with the target. Furthermore, under broad parametrizations, SKDS is convex with an empirical version that is $ε$-quasiconvex with high probability. Empirically, learning with SKDS attains comparable accuracy to KDS while substantially reducing computational cost and yields improvements over the majority of competitive baselines.

Updated: 2026-01-29 13:58:58

标题: 用Stein型差异学习稳态扩散

摘要: 学习一个稳态扩散等于估计参数的随机微分方程，其稳态分布与目标分布相匹配。我们基于最近引入的核偏离非平稳性（KDS），通过在再生核希尔伯特空间中评估扩散生成器的期望来强制实现稳定性。利用KDS与斯坦不一致性之间的联系，我们引入了斯坦型KDS（SKDS）作为另一种表述。我们证明，消失的SKDS保证学习扩散的稳态分布与目标一致。此外，在广泛的参数化下，SKDS是凸的，具有经验版本，具有高概率的$ε$-拟凸性。在实证中，使用SKDS学习的准确性与KDS相当，同时大大降低了计算成本，并比大多数竞争基线取得了改进。

更新时间: 2026-01-29 13:58:58

领域: stat.ML,cs.LG,math.ST

下载: http://arxiv.org/abs/2601.16597v2

Stochastic Matching Bandits with Rare Optimization Updates

We introduce a bandit framework for stochastic matching under the multinomial logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its assigned pool according to unknown preferences and yields a corresponding reward over a horizon $T$. The objective is to minimize regret by maximizing the cumulative revenue from successful matches. A naive approach requires solving an NP-hard combinatorial optimization problem at every round, resulting in a prohibitive computational cost. To address this challenge, we propose batched algorithms that strategically limit the number of times matching assignments are updated to $Θ(\log\log T)$ over the entire horizon. By invoking expensive combinatorial optimization only on a vanishing fraction of rounds, our algorithms substantially reduce overall computational overhead while still achieving a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$.

Updated: 2026-01-29 13:58:55

标题: 稀有优化更新的随机匹配赌博机算法

摘要: 我们引入了一个用于多项Logit（MNL）选择模型下随机匹配的强盗框架。在我们的设定中，一边有$N$个代理人分配到另一边的$K$个臂上，每个臂根据未知的偏好随机选择其分配池中的代理人，并在时间段$T$内产生相应的奖励。目标是通过最大化成功匹配的累积收入来最小化后悔。一种天真的方法需要在每一轮解决一个NP难的组合优化问题，导致计算成本过高。为了解决这一挑战，我们提出了批处理算法，策略性地将匹配分配更新的次数限制在整个时间段内的$Θ(\log\log T)$次。通过仅在极少数轮次中调用昂贵的组合优化，我们的算法大幅减少了整体计算开销，同时仍然实现了一个$\widetilde{\mathcal{O}}(\sqrt{T})$的后悔上界。

更新时间: 2026-01-29 13:58:55

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2509.04194v2

ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization

Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D Bin Packing Problem (3D-BPP), Traveling Salesman Problem (TSP), or Vehicle Routing Problem (VRP), but these neural solvers often exhibit brittleness when facing distribution shifts. To address this issue, we uncover the Satisficing Generalization Edge, which we validate both theoretically and experimentally: identifying a set of promising actions is inherently more generalizable than selecting the single optimal action. To exploit this property, we propose Adaptive Selection After Proposal (ASAP), a generic framework that decomposes the decision-making process into two distinct phases: a proposal policy that acts as a robust filter, and a selection policy as an adaptable decision maker. This architecture enables a highly effective online adaptation strategy where the selection policy can be rapidly fine-tuned on a new distribution. Concretely, we introduce a two-phase training framework enhanced by Model-Agnostic Meta-Learning (MAML) to prime the model for fast adaptation. Extensive experiments on 3D-BPP, TSP, and CVRP demonstrate that ASAP improves the generalization capability of state-of-the-art baselines and achieves superior online adaptation on out-of-distribution instances.

Updated: 2026-01-29 13:56:35

标题: 尽快行动：利用神经组合优化中的满足性泛化优势

摘要: 深度强化学习（DRL）已经成为解决组合优化（CO）问题的一种有希望的方法，例如3D装箱问题（3D-BPP）、旅行商问题（TSP）或车辆路径问题（VRP），但是这些神经求解器在面对分布转移时通常表现出脆弱性。为了解决这个问题，我们揭示了满意性泛化边缘，我们在理论和实验上都进行了验证：识别一组有前途的行动本质上比选择单个最优行动更具泛化性。为了利用这一特性，我们提出了自提议后自适应选择（ASAP），这是一个通用框架，将决策过程分解为两个不同的阶段：一个提议策略作为健壮的过滤器，一个选择策略作为适应性决策者。这种架构实现了一种高效的在线适应策略，其中选择策略可以快速在新的分布上进行微调。具体来说，我们引入了一个通过模型无关元学习（MAML）增强的两阶段训练框架，为快速适应模型做准备。对3D-BPP、TSP和CVRP的大量实验表明，ASAP提高了最先进基线的泛化能力，并在分布外实例上实现了优越的在线适应性。

更新时间: 2026-01-29 13:56:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2501.17377v3

Neural Weight Compression for Language Models

Efficient storage and transmission of language model weights are increasingly critical as model scale and deployment grow. Yet, most existing compression methods rely on handcrafted transforms and heuristics, reflecting the limited understanding of weights as a data modality. This motivates a shift toward learning-based paradigm, where compression schemes are optimized directly from data rather than manually designed. In this work, we take a step in this direction by formulating weight compression as a neural codec learning. We propose Neural Weight Compression (NWC), a flexible framework for training neural codecs on pretrained weight datasets. NWC addresses challenges intrinsic to weight compression, such as tensor shape heterogeneity and the misalignment between training losses and downstream performance, through components such as chunk-and-normalize preprocessing and an importance-aware training objective. Experiments show that NWC achieves state-of-the-art accuracy-compression tradeoffs, particularly at 4--6 bit regime, without relying on rigid handcrafted components such as the Hadamard transform. These gains extend across diverse architectures, e.g., vision encoders. Our analysis further supports the learning-based perspective, highlighting the roles of entropy-constrained quantization in high rate regime and learned transforms in adapting to downstream tasks.

Updated: 2026-01-29 13:56:23

标题: 语言模型的神经网络权重压缩

摘要: 随着模型规模和部署规模的增长，语言模型权重的高效存储和传输变得越来越关键。然而，大多数现有的压缩方法依赖于手工设计的转换和启发式方法，反映了对权重作为数据模态的有限理解。这促使我们转向基于学习的范式，其中压缩方案直接从数据中优化，而不是手动设计。在这项工作中，我们将权重压缩形式化为神经编解码器学习的一步。我们提出了神经权重压缩（NWC），这是一个灵活的框架，用于在预训练权重数据集上训练神经编解码器。NWC通过诸如块和归一化预处理以及基于重要性的训练目标等组件，解决了与权重压缩固有的挑战，例如张量形状的异质性和训练损失与下游性能之间的不对齐。实验证明，NWC在4-6位制度下实现了最先进的精度-压缩权衡，而无需依赖于刚性的手工设计组件，例如Hadamard变换。这些收益扩展到各种架构，例如视觉编码器。我们的分析进一步支持基于学习的观点，强调在高速率区域中的熵约束量化的作用以及在适应下游任务中学习的变换的作用。

更新时间: 2026-01-29 13:56:23

领域: cs.LG

下载: http://arxiv.org/abs/2510.11234v2

Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation

In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVA, and GPT-4V, leveraging these models to replace time-consuming manual annotation and enable annotation-free training has become a promising research direction. This paper studies learning from noisy partial labels generated by pre-trained VLMs and proposes a collaborative consistency regularization (Co-Reg) framework. Unlike symmetric noise commonly assumed in traditional noisy label learning, VLM-generated noise is instance-dependent and reflects the intrinsic biases of pre-trained models, posing greater challenges. To address this issue, we jointly train two neural networks to perform collaborative label purification via a co-pseudo-labeling mechanism, while enforcing consistency regularization in both label and feature representation spaces. In addition, multiple anti-overfitting strategies are introduced, including alternating optimization of contrastive representations and pseudo-labels, as well as maintaining class prototypes in a shared feature space. The proposed method can further incorporate few-shot manually annotated labels for performance enhancement. Extensive experiments under various settings demonstrate the effectiveness of our approach and highlight the potential of integrating weakly supervised learning into the knowledge distillation of pre-trained models.

Updated: 2026-01-29 13:56:19

标题: 连接弱监督学习和VLM蒸馏：用于高效下游适应的嘈杂部分标签学习

摘要: 在嘈杂的部分标签学习（NPLL）背景下，每个训练样本都与由多个嘈杂的标注者注释的一组候选标签相关联。随着高性能的预训练视觉语言模型（VLMs）如CLIP、LLaVA和GPT-4V的出现，利用这些模型来取代耗时的手动注释并实现无标注训练已成为一个有前途的研究方向。本文研究了从预训练VLM生成的嘈杂部分标签中学习，并提出了一种协同一致性正则化（Co-Reg）框架。与传统嘈杂标签学习中常假设的对称噪声不同，VLM生成的噪声是依赖于实例的，并反映了预训练模型的固有偏见，带来了更大的挑战。为了解决这个问题，我们联合训练两个神经网络，通过协同伪标记机制执行协同标签净化，同时在标签和特征表示空间中强制一致性正则化。此外，引入了多种抗过拟合策略，包括对比表示和伪标签的交替优化，以及在共享特征空间中维护类原型。所提出的方法还可以进一步整合少量手动注释标签以提高性能。在各种设置下进行的大量实验表明了我们方法的有效性，并凸显了将弱监督学习整合到预训练模型的知识蒸馏中的潜力。

更新时间: 2026-01-29 13:56:19

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2506.03229v3

Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle

Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $β_{1}=β_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $β_{1}=β_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $β_{1}=β_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.

Updated: 2026-01-29 13:56:11

标题: 为什么当$β_1 = β_2$时，Adam的效果更好：缺失的梯度尺度不变原则

摘要: 亚当优化器在近十年来一直是大规模训练的核心，然而一个简单的经验事实仍然未被解释：当动量参数满足$β_{1}=β_{2}$时，验证分数和训练运行的定性行为都会改善。一些最近的研究已经报告了这种模式，但仍然没有解释为什么这种选择有帮助。我们展示这种选择与我们称为\textit{梯度尺度不变性}的结构性质密切相关。我们形式化这个概念，并证明当且仅当$β_{1}=β_{2}$时，亚当变为一阶梯度尺度不变性。这种观点将亚当的平衡状态放在了与几种最近优化器的设计原则直接一致的位置，这些优化器明确强化了尺度鲁棒的更新。实验支持了这一理论，涵盖了视觉和语言任务以及不同的架构族，当$β_{1}=β_{2}$时，梯度重新缩放对更新的影响更加平滑。总的来说，我们的结果为亚当行为中的一个开放问题提供了一个连贯的解释，并提供了一个简单的原则，有助于指导未来优化器的设计。

更新时间: 2026-01-29 13:56:11

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2601.21739v1

From Global to Granular: Revealing IQA Model Performance via Correlation Surface

Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high-quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to $|Δ$MOS$|$). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test-sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose \textbf{Granularity-Modulated Correlation (GMC)}, which provides a structured, fine-grained analysis of IQA performance. GMC includes: (1) a \textbf{Granularity Modulator} that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences ($|Δ$MOS$|$) to examine local performance variations, and (2) a \textbf{Distribution Regulator} that regularizes correlations to mitigate biases from non-uniform quality distributions. The resulting \textbf{correlation surface} maps correlation values as a joint function of MOS and $|Δ$MOS$|$, providing a 3D representation of IQA performance. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Codes are available at https://github.com/Dniaaa/GMC.

Updated: 2026-01-29 13:55:26

标题: 从全局到细粒度：通过相关性表面揭示IQA模型的性能

摘要: 图像质量评估（IQA）模型的评估长期以来主要由全局相关性指标主导，例如皮尔逊线性相关系数（PLCC）和斯皮尔曼秩相关系数（SRCC）。虽然被广泛采用，但这些指标将性能降低为单一标量，无法捕捉排名一致性在局部质量范围内变化的情况。例如，两个IQA模型可能实现相同的SRCC值，但其中一个更可靠地对高质量图像（与高平均意见分数MOS有关）进行排名，而另一个更好地区分具有小质量/MOS差异（与$|Δ$MOS$|$有关）的图像对。这种互补行为在全局指标下是不可见的。此外，SRCC和PLCC对测试样本质量分布敏感，导致跨测试集的比较不稳定。为了解决这些局限性，我们提出了\textbf{颗粒度调制相关（GMC）}，它提供了对IQA性能的结构化、细粒度分析。GMC包括：（1）\textbf{颗粒度调制器}，它应用高斯加权相关，条件是绝对MOS值和成对MOS差值($|Δ$MOS$|$)，以检查局部性能变化，和（2）\textbf{分布调节器}，它正则化相关性以减轻非均匀质量分布带来的偏见。由此产生的\textbf{相关曲面}将相关值映射为MOS和$|Δ$MOS$|$的联合函数，提供了IQA性能的三维表示。在标准基准测试上的实验表明，GMC揭示了对标量指标不可见的性能特征，为分析、比较和部署IQA模型提供了更具信息量和可靠性的范式。代码可在https://github.com/Dniaaa/GMC找到。

更新时间: 2026-01-29 13:55:26

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.21738v1

PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks' reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi-agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

Updated: 2026-01-29 13:55:22

标题: PathWise：通过自进化LLMs自动启发式设计的世界模型规划

摘要: 大型语言模型（LLMs）已经实现了自动启发式设计（AHD）用于组合优化问题（COPs），但是现有框架对固定进化规则和静态提示模板的依赖通常会导致启发式生成的短视，冗余评估以及对如何推导新启发式的限制推理。我们提出了一种新颖的多智能体推理框架，称为通过自进化LLMs进行自动启发式设计的世界模型规划（PathWise），该框架将启发式生成作为一个串行决策过程，通过一个作为搜索轨迹的紧凑、有状态的记忆的推论图来服务。这种方法允许系统继续执行过去的决策并在不同生成之间重复使用或避免推导信息。一种策略智能体规划进化动作，一种世界模型智能体生成受这些动作条件的启发式展开，并且评论智能体提供了总结先前步骤中的教训的路由反思，将基于LLM的AHD从试错进化转向通过推理进行状态感知规划。通过多样化的COPs实验表明，PathWise更快地收敛到更好的启发式，可以泛化到不同的LLM支架，并且可以扩展到更大的问题规模。

更新时间: 2026-01-29 13:55:22

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.20539v2

Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators

Computing-in-Memory (CIM) accelerators are a promising solution for accelerating Machine Learning (ML) workloads, as they perform Matrix-Vector Multiplications (MVMs) on crossbar arrays directly in memory. Although the bit widths of the crossbar inputs and cells are very limited, most CIM compilers do not support quantization below 8 bit. As a result, a single MVM requires many compute cycles, and weights cannot be efficiently stored in a single crossbar cell. To address this problem, we propose a mixed-precision training and compilation framework for CIM architectures. The biggest challenge is the massive search space, that makes it difficult to find good quantization parameters. This is why we introduce a reinforcement learning-based strategy to find suitable quantization configurations that balance latency and accuracy. In the best case, our approach achieves up to a 2.48x speedup over existing state-of-the-art solutions, with an accuracy loss of only 0.086 %.

Updated: 2026-01-29 13:54:55

标题: 基于RRAM的计算内存加速器的混合精度训练和编译

摘要: 计算内存（CIM）加速器是加速机器学习（ML）工作负载的一个有前途的解决方案，因为它们直接在内存中的交叉阵列上执行矩阵-向量乘法（MVM）。尽管交叉阵输入和单元的位宽非常有限，但大多数CIM编译器不支持低于8位的量化。因此，单个MVM需要许多计算周期，并且权重无法有效存储在单个交叉栅单元中。为了解决这个问题，我们提出了一个用于CIM架构的混合精度训练和编译框架。最大的挑战是巨大的搜索空间，这使得很难找到良好的量化参数。这就是为什么我们引入了基于强化学习的策略，以找到平衡延迟和准确性的合适量化配置。在最佳情况下，我们的方法可以比现有的最先进解决方案提高高达2.48倍的速度，并且仅失去0.086%的准确性。

更新时间: 2026-01-29 13:54:55

领域: cs.LG,cs.ET

下载: http://arxiv.org/abs/2601.21737v1

Generative Modeling through Koopman Spectral Analysis: An Operator-Theoretic Perspective

We propose Koopman Spectral Wasserstein Gradient Descent (KSWGD), a particle-based generative modeling framework that learns the Langevin generator via Koopman theory and integrates it with Wasserstein gradient descent. Our key insight is that this spectral structure of the underlying distribution can be directly estimated from trajectory data via the Koopman operator, eliminating the need for explicit knowledge of the target potential. Additionally, we prove that KSWGD maintains an approximately constant dissipation rate, thereby establishing linear convergence and overcoming the vanishing-gradient phenomenon that hinders existing kernel-based particle methods. We further provide a Feynman--Kac interpretation that clarifies the method's probabilistic foundation. Experiments on compact manifolds, metastable multi-well systems, and high-dimensional stochastic partial differential equations demonstrate that KSWGD consistently outperforms baselines in both convergence speed and sample quality.

Updated: 2026-01-29 13:52:06

标题: 通过库普曼谱分析实现生成建模：一个算子理论的视角

摘要: 我们提出了Koopman谱Wasserstein梯度下降（KSWGD），这是一个基于粒子的生成建模框架，通过Koopman理论学习Langevin生成器，并将其与Wasserstein梯度下降集成在一起。我们的关键见解是，通过Koopman算子，可以直接从轨迹数据估计出潜在分布的谱结构，从而消除了对目标势能的明确知识。此外，我们证明KSWGD保持近似恒定的耗散速率，从而确立了线性收敛，并克服了阻碍现有基于核的粒子方法的消失梯度现象。我们进一步提供了一个费曼-卡克解释，澄清了该方法的概率基础。在紧凑流形、亚稳多井系统和高维随机偏微分方程上的实验表明，KSWGD在收敛速度和样本质量方面始终优于基线。

更新时间: 2026-01-29 13:52:06

领域: cs.LG,math.DS

下载: http://arxiv.org/abs/2512.18837v2

Amortized Spectral Kernel Discovery via Prior-Data Fitted Network

Prior-Data Fitted Networks (PFNs) enable efficient amortized inference but lack transparent access to their learned priors and kernels. This opacity hinders their use in downstream tasks, such as surrogate-based optimization, that require explicit covariance models. We introduce an interpretability-driven framework for amortized spectral discovery from pre-trained PFNs with decoupled attention. We perform a mechanistic analysis on a trained PFN that identifies attention latent output as the key intermediary, linking observed function data to spectral structure. Building on this insight, we propose decoder architectures that map PFN latents to explicit spectral density estimates and corresponding stationary kernels via Bochner's theorem. We study this pipeline in both single-realization and multi-realization regimes, contextualizing theoretical limits on spectral identifiability and proving consistency when multiple function samples are available. Empirically, the proposed decoders recover complex multi-peak spectral mixtures and produce explicit kernels that support Gaussian process regression with accuracy comparable to PFNs and optimization-based baselines, while requiring only a single forward pass. This yields orders-of-magnitude reductions in inference time compared to optimization-based baselines.

Updated: 2026-01-29 13:51:26

标题: 摊销的谱核发现方法：通过先验数据拟合网络

摘要: 先前数据拟合网络（PFN）实现了高效的摊销推断，但缺乏对其学习到的先验和核的透明访问。这种不透明性阻碍了它们在需要显式协方差模型的下游任务中的使用，比如基于替代模型的优化。我们引入了一个解释驱动的框架，用于从预训练的具有解耦注意力的PFN中进行摊销谱发现。我们对经过训练的PFN进行了一种机械分析，确定了注意力潜在输出作为连接观察到的函数数据与谱结构的关键中介。基于这一洞察，我们提出了解码器架构，通过 Bochner 定理将PFN的潜在值映射到显式的谱密度估计和相应的平稳核。我们在单次实现和多次实现情况下研究了这一流程，将谱可辨识性的理论限制置于上下文中，并证明了在多个函数样本可用时的一致性。从经验上看，所提出的解码器可以恢复复杂的多峰谱混合，并产生支持高斯过程回归的显式核，其精度与PFN和基于优化的基线方法相当，只需要进行一次前向传递。与基于优化的基线方法相比，这大大减少了推断时间。

更新时间: 2026-01-29 13:51:26

领域: cs.LG

下载: http://arxiv.org/abs/2601.21731v1

Memento 2: Learning by Stateful Reflective Memory

We present a theoretical study of continual and experiential learning in large language model agents that combine episodic memory with reinforcement learning. We argue that the key mechanism for continual adaptation, without updating model parameters, is reflection: the agent's ability to use past experience to guide future actions. Empirical findings suggest that episodic, experience-driven reflection enables generalised adaptation across a wide range of open-ended, long-horizon tasks. This indicates that efficient learning can occur during deployment and weakens the traditional separation between training and testing. Motivated by this, we introduce the Stateful Reflective Decision Process, a formal model of reflective memory dynamics. In this abstraction, an agent maintains an episodic memory and performs two core operations. Writing stores interaction outcomes and plays the role of policy evaluation. Reading retrieves relevant past cases to inform decisions and plays the role of policy improvement. This perspective treats reflective memory as a control object that can be analysed using classical reinforcement learning tools. We then develop a read-write reflective learning framework by integrating retrieval into soft policy iteration and establish convergence guarantees. We show that as memory grows and provides denser coverage of the state space, the resulting composite policy converges to the optimal solution. Overall, this framework connects practical memory-based methods with principled reinforcement learning, providing a rigorous mathematical basis for building reflective, memory-embedded agents capable of continual general-purpose learning.

Updated: 2026-01-29 13:49:34

标题: Memento 2：通过有状态反思性记忆学习

摘要: 我们提出了一个理论研究，探讨了结合情节记忆和强化学习的大型语言模型代理中的持续性和体验性学习。我们认为，持续适应的关键机制，在不更新模型参数的情况下，是反思：代理能够利用过去的经验来指导未来的行动。实证研究结果表明，基于情节、经验驱动的反思能够在广泛的开放式、长期任务中实现泛化适应。这表明，有效的学习可以在部署过程中发生，并削弱了传统的训练和测试之间的分离。受此启发，我们引入了Stateful Reflective Decision Process，这是一个反思记忆动态的形式化模型。在这种抽象中，代理维护一个情节记忆并执行两个核心操作。写入存储交互结果，起到策略评估的作用。读取检索相关的过去案例以指导决策，并起到策略改进的作用。这种视角将反思记忆视为一个可以使用经典强化学习工具进行分析的控制对象。然后，我们通过将检索集成到软策略迭代中，开发了一个读写反思学习框架，并建立了收敛保证。我们展示了随着记忆的增长并提供了对状态空间更密集的覆盖，得到的组合策略将收敛于最优解。总的来说，这个框架连接了实用的基于记忆方法和原则性的强化学习，为构建能够持续进行通用学习的反思、嵌入记忆的代理提供了严格的数学基础。

更新时间: 2026-01-29 13:49:34

领域: cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2512.22716v3

DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting

Deep time series models are vulnerable to noisy data ubiquitous in real-world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from "what" to learn to "how much" to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates - selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show DropoutTS consistently boosts superior backbones' performance, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://github.com/CityMind-Lab/DropoutTS.

Updated: 2026-01-29 13:49:20

标题: DropoutTS：用于鲁棒时间序列预测的样本自适应辍学

摘要: 深度时间序列模型容易受到现实世界应用中普遍存在的噪声数据的影响。现有的鲁棒性策略要么修剪数据，要么依赖昂贵的先验量化，未能在效果和效率之间取得平衡。本文介绍了DropoutTS，这是一个与模型无关的插件，将范式从“学习什么”转变为“学习多少”。DropoutTS采用了一种样本自适应的Dropout机制：利用频谱稀疏性来高效量化实例级噪声，通过重构残差动态校准模型学习能力，将噪声映射到自适应的Dropout率 - 有选择地抑制虚假波动，同时保持细粒度的保真度。在各种噪声范围和开放基准测试中进行了大量实验，结果显示DropoutTS始终提升了优秀主干模型的性能，提供了先进的鲁棒性，同时带来了可忽略的参数开销和没有架构修改。我们的代码可在https://github.com/CityMind-Lab/DropoutTS找到。

更新时间: 2026-01-29 13:49:20

领域: cs.AI

下载: http://arxiv.org/abs/2601.21726v1

Procedural Pretraining: Warming Up Language Models with Abstract Data

Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10 to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55, 67, 86% of the original data. Third, we explore the mechanisms behind and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means to improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.

Updated: 2026-01-29 13:48:43

标题: 程序化预训练：利用抽象数据为语言模型热身

摘要: 在构建语言模型时，直接在网络规模的语料库上进行预训练已成为事实上的范式。我们研究了一种替代设置，其中模型最初暴露于抽象的结构化数据，作为减轻后续获取丰富语义知识的手段，类似于人类在高级推理之前学习简单逻辑和数学。我们专注于程序数据，由形式语言和其他简单算法生成的抽象数据。我们首先诊断了不同形式的程序数据可以显著改善的算法技能。例如，在上下文回忆（大海捞针）中，当在Dyck序列（平衡括号）上进行预训练时，准确率从10%跳升至98%。其次，我们研究了这些增益如何体现在预训练更大模型（高达13亿）中。我们发现，仅预先加载0.1%的程序数据就明显优于标准的自然语言、代码和非正式数学（C4，CodeParrot和DeepMind-Math数据集）的预训练。值得注意的是，这种程序预训练使模型能够在原始数据的55、67、86%时达到相同的损失值。第三，我们探讨了背后的机制，并发现程序预训练在注意力和MLP层中灌输了非平凡结构。前者对于结构化领域（例如代码）尤为重要，而后者则适用于语言。最后，我们铺平了结合多种形式的程序数据的道路。我们的结果表明，程序预训练是改善性能和加速语言模型预训练的简单、轻量级手段，最终暗示了在LLM中将知识获取与推理分离的潜力。

更新时间: 2026-01-29 13:48:43

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2601.21725v1

Enhancing Language Models for Robust Greenwashing Detection

Sustainability reports are critical for ESG assessment, yet greenwashing and vague claims often undermine their reliability. Existing NLP models lack robustness to these practices, typically relying on surface-level patterns that generalize poorly. We propose a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Our approach incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization. Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.

Updated: 2026-01-29 13:46:15

标题: 增强语言模型以实现强大的绿色洗白检测

摘要: 可持续性报告对ESG评估至关重要，然而绿色洗白和含糊的声明常常削弱了它们的可靠性。现有的NLP模型缺乏对这些做法的鲁棒性，通常依赖于泛化能力较差的表面模式。我们提出了一个参数高效的框架，通过将对比学习与序数排序目标相结合，来结构化LLM潜在空间，以捕捉具体行动和模糊声明之间的分级区别。我们的方法融入了门控特征调制来过滤披露噪音，并利用MetaGradNorm来稳定多目标优化。在跨类别设置中的实验表明，相对于标准基线，我们的方法具有更强的鲁棒性，同时揭示了表征刚度和泛化之间的权衡。

更新时间: 2026-01-29 13:46:15

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2601.21722v1

LoRA and Privacy: When Random Projections Help (and When They Don't)

We introduce the (Wishart) projection mechanism, a randomized map of the form $S \mapsto M f(S)$ with $M \sim W_d(1/r I_d, r)$ and study its differential privacy properties. For vector-valued queries $f$, we prove non-asymptotic DP guarantees without any additive noise, showing that Wishart randomness alone can suffice. For matrix-valued queries, however, we establish a sharp negative result: in the noise-free setting, the mechanism is not DP, and we demonstrate its vulnerability by implementing a near perfect membership inference attack (AUC $> 0.99$). We then analyze a noisy variant and prove privacy amplification due to randomness and low rank projection, in both large- and small-rank regimes, yielding stronger privacy guarantees than additive noise alone. Finally, we show that LoRA-style updates are an instance of the matrix-valued mechanism, implying that LoRA is not inherently private despite its built-in randomness, but that low-rank fine-tuning can be more private than full fine-tuning at the same noise level. Preliminary experiments suggest that tighter accounting enables lower noise and improved accuracy in practice.

Updated: 2026-01-29 13:43:37

标题: LoRA和隐私：当随机投影有帮助时（以及当它们无法帮助时）

摘要: 我们介绍了(Wishart)投影机制，一个形式为$S \mapsto M f(S)$的随机映射，其中$M \sim W_d(1/r I_d, r)$，并研究了其差分隐私属性。对于向量值查询$f$，我们证明了非渐近的差分隐私保证，而无需添加任何噪音，表明Wishart随机性本身就足够。然而，对于矩阵值查询，我们建立了一个明显的负面结果：在无噪声的情况下，该机制不是差分隐私的，并且我们通过实施一个接近完美的成员推断攻击（AUC > 0.99）展示了其脆弱性。然后，我们分析了一个带噪声的变体，并证明了由于随机性和低秩投影引起的隐私放大，无论是在大秩还是小秩情况下，都提供比单独添加噪声更强的隐私保证。最后，我们展示了LoRA风格的更新是矩阵值机制的一个实例，暗示LoRA尽管具有内置的随机性，但并非天生具有隐私性，而低秩微调可能比在相同噪声水平下进行完全微调更具隐私性。初步实验表明，更紧密的计算可以在实践中实现更低的噪声和更高的准确性。

更新时间: 2026-01-29 13:43:37

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2601.21719v1

When does predictive inverse dynamics outperform behavior cloning?

Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model (IDM). While PIDM often outperforms BC, the reasons behind its benefits remain unclear. In this paper, we provide a theoretical explanation: PIDM introduces a bias-variance tradeoff. While predicting the future state introduces bias, conditioning the IDM on the prediction can significantly reduce variance. We establish conditions on the state predictor bias for PIDM to achieve lower prediction error and higher sample efficiency than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; and in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66\% more samples than PIDM.

Updated: 2026-01-29 13:43:34

标题: 何时预测逆动力学胜过行为克隆？

摘要: 行为克隆（BC）是一种实用的离线模仿学习方法，但在专家演示有限时经常失败。最近的研究引入了一类被称为预测逆动力学模型（PIDM）的架构，它将未来状态预测器与逆动力学模型（IDM）结合在一起。虽然PIDM通常优于BC，但其优势背后的原因仍不清楚。在本文中，我们提供了一个理论解释：PIDM引入了一种偏差-方差权衡。虽然预测未来状态引入了偏差，但在预测上调节IDM可以显著降低方差。我们在状态预测器偏差的条件下建立了PIDM实现比BC更低的预测误差和更高的样本效率的条件，当额外的数据来源可用时，差距会扩大。我们在2D导航任务中从理论上验证了这些见解，BC需要比PIDM多达五倍（平均三倍）的演示才能达到可比较的性能；以及在一个现代视频游戏中的复杂3D环境中，具有高维视觉输入和随机转换，BC需要比PIDM多出66\%的样本。

更新时间: 2026-01-29 13:43:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21718v1

DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off between identity preservation and motion consistency, manifesting as a "see-saw", and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AW Bench, a versatile benchmark encompassing a wide spectrum of characters types and motion scenarios. Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization. Project Page: https://grisoon.github.io/DreamActor-M2/

Updated: 2026-01-29 13:43:17

标题: DreamActor-M2: 通过时空上下文学习实现通用角色形象动画

摘要: Character image animation旨在通过将动作从驱动序列转移到静态参考图像来合成高保真度的视频。尽管最近取得了进展，但现有方法面临两个基本挑战：（1）次优的动作注入策略导致在身份保留和动作一致性之间存在权衡，表现为“跷跷板”；（2）过度依赖明确的姿势先验（例如骨架），无法充分捕捉复杂动态并阻碍对任意非人形角色的泛化。为了解决这些挑战，我们提出了DreamActor-M2，这是一个重新构想动作调节的通用动画框架，将其视为一个上下文学习问题。我们的方法遵循一个两阶段范式。首先，通过将参考外观和动作线索融合到统一的潜在空间中，将输入模态差距，使模型能够通过利用基础模型的生成先验来共同推理空间身份和时间动态。其次，我们引入了一个自我引导的数据合成管道，筛选伪跨身份训练对，促进从姿势相关控制到直接、端到端RGB驱动的动画的无缝过渡。这种策略显著增强了在各种角色和动作场景之间的泛化能力。为了促进全面评估，我们进一步引入了AW Bench，一个包含各种角色类型和动作场景的多功能基准。广泛的实验表明，DreamActor-M2实现了最先进的性能，提供了优越的视觉保真度和强大的跨领域泛化能力。项目页面：https://grisoon.github.io/DreamActor-M2/

更新时间: 2026-01-29 13:43:17

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.21716v1

E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory

The evolution of Large Language Model (LLM) agents towards System~2 reasoning, characterized by deliberative, high-precision problem-solving, requires maintaining rigorous logical integrity over extended horizons. However, prevalent memory preprocessing paradigms suffer from destructive de-contextualization. By compressing complex sequential dependencies into pre-defined structures (e.g., embeddings or graphs), these methods sever the contextual integrity essential for deep reasoning. To address this, we propose E-mem, a framework shifting from Memory Preprocessing to Episodic Context Reconstruction. Inspired by biological engrams, E-mem employs a heterogeneous hierarchical architecture where multiple assistant agents maintain uncompressed memory contexts, while a central master agent orchestrates global planning. Unlike passive retrieval, our mechanism empowers assistants to locally reason within activated segments, extracting context-aware evidence before aggregation. Evaluations on the LoCoMo benchmark demonstrate that E-mem achieves over 54\% F1, surpassing the state-of-the-art GAM by 7.75\%, while reducing token cost by over 70\%.

Updated: 2026-01-29 13:42:42

标题: E-mem：基于多智能体的情节性上下文重建对LLM智能体记忆的研究

摘要: 大型语言模型（LLM）代理向系统2推理演化，其特征是深思熟虑、高精度的问题解决，需要在延长视野的同时保持严格的逻辑完整性。然而，流行的记忆预处理范式存在破坏性的去上下文化问题。通过将复杂的序列依赖关系压缩成预定义的结构（例如嵌入或图），这些方法割裂了深度推理所必需的上下文完整性。为了解决这一问题，我们提出了E-mem，这是一个从记忆预处理转向情节上下文重建的框架。受生物记忆痕的启发，E-mem采用异构的分层结构，多个助理代理维护未压缩的记忆上下文，而中央主代理协调全局规划。与被动检索不同，我们的机制赋予助理代理在激活的部分内部进行局部推理，提取上下文感知证据，然后进行聚合。LoCoMo基准测试的评估显示，E-mem实现了超过54\%的F1，超过现有技术GAM 7.75\%，同时将令牌成本降低了超过70\%。

更新时间: 2026-01-29 13:42:42

领域: cs.AI

下载: http://arxiv.org/abs/2601.21714v1

Disentangling perception and reasoning for improving data efficiency in learning cloth manipulation without demonstrations

Cloth manipulation is a ubiquitous task in everyday life, but it remains an open challenge for robotics. The difficulties in developing cloth manipulation policies are attributed to the high-dimensional state space, complex dynamics, and high propensity to self-occlusion exhibited by fabrics. As analytical methods have not been able to provide robust and general manipulation policies, reinforcement learning (RL) is considered a promising approach to these problems. However, to address the large state space and complex dynamics, data-based methods usually rely on large models and long training times. The resulting computational cost significantly hampers the development and adoption of these methods. Additionally, due to the challenge of robust state estimation, garment manipulation policies often adopt an end-to-end learning approach with workspace images as input. While this approach enables a conceptually straightforward sim-to-real transfer via real-world fine-tuning, it also incurs a significant computational cost by training agents on a highly lossy representation of the environment state. This paper questions this common design choice by exploring an efficient and modular approach to RL for cloth manipulation. We show that, through careful design choices, model size and training time can be significantly reduced when learning in simulation. Furthermore, we demonstrate how the resulting simulation-trained model can be transferred to the real world. We evaluate our approach on the SoftGym benchmark and achieve significant performance improvements over available baselines on our task, while using a substantially smaller model.

Updated: 2026-01-29 13:41:35

标题: 解开感知和推理，以提高在学习无演示的布料操作中的数据效率

摘要: 布料操作是日常生活中的一项普遍任务，但对于机器人来说仍然是一个开放性挑战。开发布料操作策略的困难归因于高维状态空间、复杂动态以及面料自遮挡的高倾向。由于分析方法无法提供稳健和通用的操作策略，强化学习（RL）被认为是解决这些问题的一种有前途的方法。然而，为了解决大状态空间和复杂动态的问题，数据驱动方法通常依赖于大型模型和长时间训练。由此产生的计算成本显著阻碍了这些方法的发展和采用。此外，由于稳健状态估计的挑战，服装操作策略通常采用以工作区图像为输入的端到端学习方法。虽然这种方法通过真实世界微调实现了一个概念上简单的模拟到真实的转移，但也通过在高度丢失的环境状态表示上训练代理产生了显著的计算成本。本文通过探索一种高效和模块化的RL布料操作方法来质疑这种常见的设计选择。我们表明，通过谨慎的设计选择，在模拟中学习时可以显著减小模型大小和训练时间。此外，我们展示了由此产生的模拟训练模型如何转移到现实世界。我们在SoftGym基准测试上评估了我们的方法，并在我们的任务上实现了显著的性能改进，同时使用了一个明显较小的模型。

更新时间: 2026-01-29 13:41:35

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2601.21713v1

TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning

Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training, while often leading to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model's proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% on the base model, consistently outperforming state-of-the-art Nothinking and Thinking baselines across four math datasets with complex problems.

Updated: 2026-01-29 13:40:35

标题: TACLer：定制课程强化学习以提高有效推理

摘要: 大型语言模型(LLMs)在复杂推理任务上表现出色，尤其是当配备长链推理(Chain-of-Thought，CoT)时。然而，引发长链推理通常需要大规模的强化学习(RL)训练，而往往会导致冗余的中间步骤。为了提高学习和推理效率，同时保持甚至增强性能，我们提出了TACLer，一个针对模型的课程强化学习框架，根据模型在多阶段RL训练中的熟练程度逐渐增加数据的复杂性。TACLer具有两个核心组件：(i)定制课程学习，确定模型缺乏和需要在渐进阶段学习的知识；(ii)一种平衡准确性和效率的混合Thinking/NoThinking推理范式，通过启用或禁用Thinking模式实现。我们的实验表明，TACLer在学习和推理方面具有双重优势：(i)它降低了计算成本，相比于长思考模型，训练计算减少了50%以上，并且相对于基础模型，推理令牌使用减少了42%以上；(ii)它提高了基础模型的准确性，相对于基准模型，在四个包含复杂问题的数学数据集上始终优于最先进的NoThinking和Thinking基线，准确率提高了9%以上。

更新时间: 2026-01-29 13:40:35

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2601.21711v1

FBS: Modeling Native Parallel Reading inside a Transformer

Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train--test consistency for preview/skimming. We propose the \textbf{Fovea-Block-Skip Transformer} (FBS), which injects a causal, trainable loop into Transformers via Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality-efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.

Updated: 2026-01-29 13:39:55

标题: FBS：在Transformer内部建模原生并行阅读

摘要: 大型语言模型（LLMs）在许多任务中表现出色，但推理仍然主要由严格的逐标记自回归控制。现有的加速方法在很大程度上修补了这一流程，并忽略了核心的人类阅读要素：内容自适应前瞻，块结构感知计算分配，以及用于预览/略读的训练-测试一致性。我们提出了\textbf{Fovea-Block-Skip Transformer}（FBS），通过Parafovea-Attention Window（PAW），Chunk-Head（CH）和Skip-Gate（SG）将一个因果可训练的循环注入到Transformers中。在各种基准测试中，FBS在不增加参数的情况下改善了质量效率的权衡，消融实验证明这三个模块是互补的。

更新时间: 2026-01-29 13:39:55

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21708v1

Federated Learning for Heterogeneous Electronic Health Record Systems with Cost Effective Participant Selection

The increasing volume of electronic health records (EHRs) presents the opportunity to improve the accuracy and robustness of models in clinical prediction tasks. Unlike traditional centralized approaches, federated learning enables training on data from multiple institutions while preserving patient privacy and complying with regulatory constraints. In practice, healthcare institutions (i.e., hosts) often need to build predictive models tailored to their specific needs (e.g., creatinine-level prediction, N-day readmission prediction) using federated learning. When building a federated learning model for a single healthcare institution, two key challenges arise: (1) ensuring compatibility across heterogeneous EHR systems, and (2) managing federated learning costs within budget constraints. Specifically, heterogeneity in EHR systems across institutions hinders compatible modeling, while the computational costs of federated learning can exceed practical budget limits for healthcare institutions. To address these challenges, we propose EHRFL, a federated learning framework designed for building a cost-effective, host-specific predictive model using patient EHR data. EHRFL consists of two components: (1) text-based EHR modeling, which facilitates cross-institution compatibility without costly data standardization, and (2) a participant selection strategy based on averaged patient embedding similarity to reduce the number of participants without degrading performance. Our participant selection strategy sharing averaged patient embeddings is differentially private, ensuring patient privacy. Experiments on multiple open-source EHR datasets demonstrate the effectiveness of both components. With our framework, healthcare institutions can build institution-specific predictive models under budgetary constraints with reduced costs and time.

Updated: 2026-01-29 13:37:44

标题: 异构电子健康记录系统中的联邦学习与成本效益参与者选择

摘要: 随着电子健康记录（EHR）数量的增加，提供了改进临床预测任务中模型准确性和稳健性的机会。与传统的集中式方法不同，联邦学习使得可以在来自多个机构的数据上进行训练，同时保护患者隐私并遵守监管限制。在实践中，医疗机构（即主机）通常需要使用联邦学习构建适合其特定需求的预测模型（例如，肌酐水平预测，N天再入院预测）。在为单个医疗机构构建联邦学习模型时，会出现两个关键挑战：（1）确保跨异构EHR系统的兼容性，以及（2）在预算限制内管理联邦学习成本。具体而言，跨机构的EHR系统异质性阻碍了兼容性建模，而联邦学习的计算成本可能超出医疗机构的实际预算限制。为了解决这些挑战，我们提出了EHRFL，一个旨在利用患者EHR数据构建成本效益高、主机特定的预测模型的联邦学习框架。EHRFL包括两个组成部分：（1）基于文本的EHR建模，有助于实现跨机构兼容性，而无需昂贵的数据标准化，以及（2）基于平均患者嵌入相似性的参与者选择策略，以减少参与者数量而不降低性能。我们的参与者选择策略共享平均患者嵌入，是差分隐私的，确保患者隐私。在多个开源EHR数据集上的实验证明了两个组件的有效性。借助我们的框架，医疗机构可以在预算限制下构建特定于机构的预测模型，降低成本和时间。

更新时间: 2026-01-29 13:37:44

领域: cs.LG

下载: http://arxiv.org/abs/2404.13318v5

Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.

Updated: 2026-01-29 13:36:03

标题: 自回归语言模型实际上是基于能量的模型：对下一个标记预测的前瞻能力的洞见

摘要: 自回归模型（ARMs）目前构成大型语言模型（LLMs）的主导范式。基于能量的模型（EBMs）代表了另一类模型，历史上在LLM的发展中较少出现，然而自然地表征了后训练对齐中的最优策略。在本文中，我们提供了这两类模型的统一视角。以概率链规则为起点，我们在函数空间中建立了ARMs和EBMs之间的显式双射，我们展示这对应于最大熵强化学习中软贝尔曼方程的特例。基于这种双射，我们推导出ARMs和EBMs之间的监督学习的等价性。此外，我们通过提供理论误差界限来分析EBMs向ARMs的提炼。我们的结果提供了关于ARMs能够提前规划的见解，尽管它们基于下一个标记预测范例。

更新时间: 2026-01-29 13:36:03

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2512.15605v2

SmartMeterFM: Unifying Smart Meter Data Generative Tasks Using Flow Matching Models

Smart meter data is the foundation for planning and operating the distribution network. Unfortunately, such data are not always available due to privacy regulations. Meanwhile, the collected data may be corrupted due to sensor or transmission failure, or it may not have sufficient resolution for downstream tasks. A wide range of generative tasks is formulated to address these issues, including synthetic data generation, missing data imputation, and super-resolution. Despite the success of machine learning models on these tasks, dedicated models need to be designed and trained for each task, leading to redundancy and inefficiency. In this paper, by recognizing the powerful modeling capability of flow matching models, we propose a new approach to unify diverse smart meter data generative tasks with a single model trained for conditional generation. The proposed flow matching models are trained to generate challenging, high-dimensional time series data, specifically monthly smart meter data at a 15 min resolution. By viewing different generative tasks as distinct forms of partial data observations and injecting them into the generation process, we unify tasks such as imputation and super-resolution with a single model, eliminating the need for re-training. The data generated by our model not only are consistent with the given observations but also remain realistic, showing better performance against interpolation and other machine learning based baselines dedicated to the tasks.

Updated: 2026-01-29 13:35:39

标题: 智能电表FM：使用流匹配模型统一智能电表数据生成任务

摘要: 智能电表数据是规划和运营配电网络的基础。不幸的是，由于隐私法规的限制，这些数据并不总是可用的。同时，由于传感器或传输故障，收集的数据可能会损坏，或者对下游任务来说分辨率可能不足。为了解决这些问题，制定了广泛的生成任务，包括合成数据生成、缺失数据插补和超分辨率。尽管机器学习模型在这些任务上取得了成功，但需要为每个任务设计和训练专门的模型，导致冗余和低效。在本文中，通过认识到流匹配模型的强大建模能力，我们提出了一种新的方法，通过训练一个用于条件生成的单一模型来统一各种智能电表数据生成任务。所提出的流匹配模型经过训练，可以生成具有挑战性的高维时间序列数据，特别是每15分钟分辨率的月度智能电表数据。通过将不同的生成任务视为不同形式的部分数据观测，并将它们注入到生成过程中，我们将插补和超分辨率等任务统一到一个模型中，消除了重新训练的需要。我们模型生成的数据不仅与给定的观测一致，而且保持逼真，表现优于插值和其他专门用于任务的基于机器学习的基线。

更新时间: 2026-01-29 13:35:39

领域: cs.LG,eess.SY

下载: http://arxiv.org/abs/2601.21706v1

Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities

We consider representation misdirection (RM), a class of LLM unlearning methods that achieves forgetting by manipulating the forget-representations, that is, latent representations of forget samples. Despite being important, the roles of target vectors used in RM, however, remain underexplored. Here, we approach and revisit RM through the lens of the linear representation hypothesis. Specifically, if one can somehow identify a one-dimensional representation corresponding to a high-level concept, the linear representation hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning elicits controllable side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically validated across a wide range of tasks, including behavioral control (e.g., controlling unlearned models' truth, sentiment, and refusal) and capability enhancement (e.g., improving unlearned models' in-context learning capability). Our findings reveal that this fairly attractive phenomenon could be either a hidden risk if misused or a mechanism that can be harnessed for developing models that require stronger capabilities and controllable behaviors.

Updated: 2026-01-29 13:32:26

标题: 超越遗忘：机器遗忘引发可控的附带行为和能力

摘要: 我们考虑了表示误导（RM），一类通过操纵忘记样本的忘记表示来实现遗忘的LLM遗忘方法。尽管重要，但在RM中使用的目标向量的作用仍未被充分探讨。在这里，我们通过线性表示假设的视角接近并重新审视RM。具体地，如果某人能够以某种方式识别与高层概念对应的一维表示，线性表示假设使得可以在忘记表示空间内对这个概念向量进行线性操作。在这种观点下，我们假设，除了遗忘外，机器遗忘还引发了与高层概念相对应的可控副行为和更强的副能力。我们的假设在各种任务中得到了经验验证，包括行为控制（例如，控制已遗忘模型的真相、情感和拒绝）和能力增强（例如，提高已遗忘模型的上下文学习能力）。我们的发现揭示了这种相当吸引人的现象可能是一个隐藏的风险，如果被滥用，或者是一个可以利用来开发需要更强能力和可控行为的模型的机制。

更新时间: 2026-01-29 13:32:26

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2601.21702v1

Robust Filter Attention: Self-Attention as a Parallel State Estimator

We introduce Robust Filter Attention (RFA), an attention mechanism that reformulates self-attention as parallel robust filtering under a latent stochastic differential equation (SDE) prior, where analytically propagated uncertainty defines a time-dependent precision prior over attention weights. This formulation integrates key advantages of existing positional encodings: it preserves RoPE-style rotational structure while achieving long-context stability through explicit modeling of dissipation and diffusion. By imposing isotropic constraints on the dynamics and noise, RFA matches the $O(N^2 d)$ time and $O(N^2 + Nd)$ memory complexity of standard attention. Empirically, we find that uncertainty-aware weighting induces specialization into distinct filtering regimes across heads, improving temporal consistency and extrapolation across varying context lengths.

Updated: 2026-01-29 13:32:19

标题: Robust Filter Attention: Self-Attention作为并行状态估计器

摘要: 我们介绍了Robust Filter Attention（RFA），这是一种将自注意力重新构造为在潜在随机微分方程（SDE）先验下的并行鲁棒滤波的注意机制，其中通过解析传播的不确定性定义了关于注意权重的时间相关精度先验。这种表述整合了现有位置编码的关键优势：通过显式建模耗散和扩散，它保留了RoPE风格的旋转结构，同时实现了通过长文本稳定性。通过对动力学和噪声施加各向同性约束，RFA与标准注意力的$O(N^2 d)$时间和$O(N^2 + Nd)$内存复杂度相匹配。在实证研究中，我们发现，基于不确定性的加权引发了在不同头部之间进入不同滤波区域的专业化，从而提高了跨不同上下文长度的时间一致性和外推。

更新时间: 2026-01-29 13:32:19

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.04154v4

Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.

Updated: 2026-01-29 13:31:45

标题: 朝向通过本体引导多智能体推理实现文化对齐的LLMs

摘要: 大型语言模型（LLMs）越来越支持文化敏感的决策，但由于预训练数据的偏斜和缺乏结构化价值表示，往往存在不一致性。现有方法可以引导输出，但往往缺乏人口统计学基础，并将价值视为独立的、非结构化的信号，降低了一致性和可解释性。我们提出了OG-MAR，一种本体引导的多代理推理框架。OG-MAR从世界价值观调查（WVS）中总结了特定于受访者的价值，并通过能力问题在固定的分类法上调查关系，构建了一个全球文化本体。在推理时，它检索符合本体的关系和人口统计学相似的个人资料，实例化多个价值人物代理，这些代理的输出由一个审判代理综合，强调本体一致性和人口统计接近性。在四个LLM骨干上对区域社会调查基准的实验表明，OG-MAR相对竞争基线改善了文化对齐性和鲁棒性，同时产生了更透明的推理过程。

更新时间: 2026-01-29 13:31:45

领域: cs.CL,cs.AI,cs.IR,cs.MA,cs.SI

下载: http://arxiv.org/abs/2601.21700v1

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.

Updated: 2026-01-29 13:30:35

标题: 基于结果的强化学习可靠地带领变压器进行推理，但仅在正确的数据情况下。

摘要: 通过强化学习(RL)训练的Transformer在基于结果的监督下可以自发地发展出生成中间推理步骤(思维链)的能力。然而，稀疏奖励驱动策略梯度发现这种系统性推理的机制仍然不太清楚。我们通过分析单层Transformer在一个合成图遍历任务上的策略梯度动态来解决这个问题，该任务无法在没有思维链的情况下解决，但可以接受简单的迭代解决方案。我们证明，尽管仅在最终答案正确性上训练，策略梯度驱动Transformer收敛到一个结构化的、可解释的算法，该算法通过逐个遍历图中的顶点来迭代地遍历图。我们表征了这种出现所需的分布特性，确定了“简单示例”的关键作用：需要更少推理步骤的实例。当训练分布在这些简单示例上放置足够的质量时，Transformer学习了一个可推广的遍历策略，可以推广到更长的链条；当这些示例的质量消失时，策略梯度学习变得不可行。通过在合成数据上进行实验以及在数学推理任务中与真实世界语言模型一起验证我们的理论结果，证实我们的理论发现适用于实际情况。

更新时间: 2026-01-29 13:30:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.15158v2

Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics

Curriculum learning changes the order of pre-training data, but it remains unclear whether it changes the learning trajectory or mainly reorders exposure over a fixed trajectory. We train Pythia models (14M-410M parameters) for 300B tokens under three linguistically motivated curricula-Age-of-Acquisition, word frequency, and Verb Variation (VV)-and compare each against Random ordering; at 1B parameters we compare Random and VV. Across orderings, training follows a shared sequence of latent phases, while curricula mainly change within-phase data exposure. In smaller models (up to 160M parameters), Random ordering exhibits higher gradient noise and stronger late-training output-head spectral saturation, alongside lower final accuracy; curricula reduce both effects at matched compute. At larger scales, saturation differences are smaller and curriculum gains shrink. We formalize the link between difficulty pacing and optimization stability in an idealized analysis based on gradient-variance control, and our results point to a practical takeaway: curricula help by stabilizing within-phase optimization rather than by creating new phases.

Updated: 2026-01-29 13:30:18

标题: 课程学习对LLM预训练的影响：学习动态分析

摘要: 课程学习改变了预训练数据的顺序，但目前尚不清楚它是否改变了学习轨迹，还是主要重新排列了一个固定轨迹上的曝光。我们在三种语言学动机的课程-Age-of-Acquisition、单词频率和动词变异(VV)下训练Pythia模型(14M-410M参数)，共计300B个标记，并将每种与随机排序进行比较；在1B个参数下我们比较随机和VV。在不同的排序方式下，训练遵循共享的潜在阶段序列，而课程主要是在阶段内改变数据曝光。在较小的模型中(最多160M参数)，随机排序表现出更高的梯度噪声和更强的后期训练输出头谱饱和，同时最终准确性更低；课程在匹配计算下减少了这两种效应。在更大规模上，饱和差异较小，课程收益也减少。我们在理想化分析中形式化了困难步伐与优化稳定性之间的联系，并我们的结果指向一个实用的结论：课程通过稳定阶段内优化来帮助，而不是通过创建新的阶段。

更新时间: 2026-01-29 13:30:18

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21698v1

Black-Box Combinatorial Optimization with Order-Invariant Reinforcement Learning

We introduce an order-invariant reinforcement learning framework for black-box combinatorial optimization. Classical estimation-of-distribution algorithms (EDAs) often rely on learning explicit variable dependency graphs, which can be costly and fail to capture complex interactions efficiently. In contrast, we parameterize a multivariate autoregressive generative model trained without a fixed variable ordering. By sampling random generation orders during training, a form of information-preserving dropout, the model is encouraged to be invariant to variable order, promoting search-space diversity, and shaping the model to focus on the most relevant variable dependencies, improving sample efficiency. We adapt Group Relative Policy Optimization (GRPO) to this setting, providing stable policy-gradient updates from scale-invariant advantages. Across a wide range of benchmark algorithms and problem instances of varying sizes, our method frequently achieves the best performance and consistently avoids catastrophic failures.

Updated: 2026-01-29 13:30:09

标题: 使用无序强化学习的黑盒组合优化

摘要: 我们引入了一个针对黑盒组合优化的无序强化学习框架。经典的分布估计算法（EDAs）通常依赖于学习明确的变量依赖图，这可能代价高昂且无法有效捕捉复杂的相互作用。相反，我们参数化了一个多变量自回归生成模型，该模型在训练时没有固定的变量顺序。通过在训练过程中对随机生成顺序进行采样，一种信息保留的丢弃形式，鼓励模型对变量顺序保持不变，促进搜索空间的多样性，并塑造模型专注于最相关的变量依赖关系，提高样本效率。我们将 Group Relative Policy Optimization（GRPO）调整到这种设置中，从尺度不变的优势中提供稳定的策略梯度更新。在广泛的基准算法和不同规模问题实例上，我们的方法经常达到最佳性能，并始终避免灾难性失败。

更新时间: 2026-01-29 13:30:09

领域: cs.LG

下载: http://arxiv.org/abs/2510.01824v2

IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning

Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.

Updated: 2026-01-29 13:26:41

标题: IBNorm：信息瓶颈启发的规范化用于表示学习

摘要: 规范化是深度学习的基础，但现有的方法如BatchNorm、LayerNorm和RMSNorm都是以方差为中心的，通过强制零均值和单位方差来稳定训练，而不控制表示如何捕捉任务相关信息。我们提出了受信息瓶颈原理启发的规范化方法（IBNorm），这是一种简单而强大的方法家族。IBNorm引入了有界压缩操作，鼓励嵌入保留预测信息同时抑制干扰变异性，产生更具信息量的表示，同时保留标准规范化的稳定性和兼容性。从理论上讲，我们证明了IBNorm实现了比以方差为中心的方法更高的信息瓶颈值和更紧的泛化界限。在经验上，IBNorm在大规模语言模型（LLaMA、GPT-2）和视觉模型（ResNet、ViT）上始终优于BatchNorm、LayerNorm和RMSNorm，相互信息分析证实了其卓越的信息瓶颈行为。代码将公开发布。

更新时间: 2026-01-29 13:26:41

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.25262v2

TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning

Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint-attention allocation divergence-where poisoned samples disrupt the balanced attention distribution across three functional components: system instructions, vision inputs, and user textual queries, regardless of trigger morphology. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.

Updated: 2026-01-29 13:26:29

标题: TCAP：三组件注意力分析用于MLLM微调中无监督后门检测

摘要: Fine-Tuning-as-a-Service (FTaaS)促进了多模式大型语言模型（MLLMs）的定制化，但通过毒化数据引入了关键的后门风险。现有的防御方法要么依赖于监督信号，要么无法跨越不同触发类型和模态性进行泛化。在这项工作中，我们揭示了一种通用的后门指纹-注意力分配差异，其中毒化样本破坏了三个功能组件之间的平衡注意力分布：系统指令、视觉输入和用户文本查询，而不考虑触发形态。受到这一观点的启发，我们提出了Tri-Component Attention Profiling（TCAP），这是一个无监督的防御框架，用于过滤后门样本。TCAP将跨模态注意力图解析为三个组件，通过高斯混合模型（GMM）统计配置识别触发响应注意力头，并通过基于EM的投票聚合隔离毒化样本。通过在不同MLLM架构和攻击方法之间进行广泛实验，我们证明了TCAP取得了一致强大的性能，将其确立为MLLM中稳健且实用的后门防御方法。

更新时间: 2026-01-29 13:26:29

领域: cs.AI

下载: http://arxiv.org/abs/2601.21692v1

A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion

Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.

Updated: 2026-01-29 13:26:18

标题: 一个用于通过相机-IMU融合进行稳健道路表面分类的新数据集和框架

摘要: 路面分类（RSC）是环境感知预测性维护系统的关键推动因素。然而，由于传感器模态有限且数据集缺乏环境多样性，现有的RSC技术往往无法推广至狭窄的操作条件之外。本文通过引入一个多模态框架来解决这些限制，该框架利用轻量级的双向交叉注意力模块融合图像和惯性测量，并通过自适应门控层对模态贡献进行调整以适应域变化。鉴于当前基准测试的局限性，特别是缺乏变异性，我们引入了一个名为ROAD的新数据集，由三个互补子集组成：（i）使用金标准工业数据记录仪同步的真实世界多模态录像，跨越多样的照明、天气和路面条件捕获的RGB-IMU流；（ii）一个大型仅视觉子集，旨在评估在不利照明和异质捕获设置下的鲁棒性；（iii）一个生成的合成子集，用于研究在实践中难以获得的场景中的分布外泛化。实验证明，我们的方法在PVS基准测试上比先前的最新技术改进了+1.4个百分点，在我们的多模态ROAD子集上提升了+11.6个百分点，并在少数类别上保持更高的F1分数。该框架还在具有挑战性的视觉条件下表现稳定，包括夜间、大雨和混合路面过渡。这些发现表明，将价格实惠的摄像头和IMU传感器与多模态注意力机制相结合，为路面理解提供了可扩展且稳健的基础，特别适用于环境变化和成本约束限制了高端传感套件采用的地区。

更新时间: 2026-01-29 13:26:18

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.20847v2

Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world model in conditional video generation, we explore the potential of image observation-based world model for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games with 6 billion tokens data to learn general-purpose representation and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilize temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on averange. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data (approximately 4 trajectories) per game, demonstrating superior generalization. We will release codes and model weights at https://github.com/CJReinforce/JOWA

Updated: 2026-01-29 13:25:42

标题: 通过联合优化世界-动作模型预训练来扩展离线基于模型的强化学习

摘要: 离线强化学习（RL）的一个重要愿景是从大型和异构数据集中开发具有高能力的通用代理。然而，以往的扩展离线RL方法要么严重依赖于专家轨迹，要么难以推广到多样化的未见任务。受条件视频生成中世界模型优秀泛化能力的启发，我们探索了基于图像观察的世界模型在扩展离线RL和增强对新任务的泛化能力的潜力。在本文中，我们介绍了JOWA：联合优化的世界-动作模型，这是一个基于模型的离线RL代理，使用60亿令牌数据在多个Atari游戏上进行预训练，以学习通用表示和决策能力。我们的方法通过共享的Transformer骨干联合优化了世界-动作模型，这在预训练期间稳定了大型模型的时间差异学习。此外，我们提出了一个经过证明有效且可并行化的规划算法，以补偿Q值估计误差，从而寻找出更好的策略。实验结果表明，我们的最大代理，具有1.5亿参数，在预训练游戏中仅使用10％的子采样离线数据就实现了78.9％的人类水平表现，比现有的最先进的大规模离线RL基线平均提高了31.6％。此外，JOWA的扩展性与模型容量有利，并且能够高效地通过仅使用每个游戏约5k的离线微调数据（大约4条轨迹）转移到新游戏，展示了出色的泛化能力。我们将在https://github.com/CJReinforce/JOWA发布代码和模型权重。

更新时间: 2026-01-29 13:25:42

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2410.00564v4

Neural Policy Composition from Free Energy Minimization

The ability to compose acquired skills to plan and execute behaviors is a hallmark of natural intelligence. Yet, despite remarkable cross-disciplinary efforts, a principled account of how task structure shapes gating and how such computations could be delivered in neural circuits, remains elusive. Here we introduce GateMod, an interpretable theoretically grounded computational model linking the emergence of gating to the underlying decision-making task, and to a neural circuit architecture. We first develop GateFrame, a normative framework casting policy gating into the minimization of the free energy. This framework, relating gating rules to task, applies broadly across neuroscience, cognitive and computational sciences. We then derive GateFlow, a continuous-time energy based dynamics that provably converges to GateFrame optimal solution. Convergence, exponential and global, follows from a contractivity property that also yields robustness and other desirable properties. Finally, we derive a neural circuit from GateFlow, GateNet. This is a soft-competitive recurrent circuit whose components perform local and contextual computations consistent with known dendritic and neural processing motifs. We evaluate GateMod across two different settings: collective behaviors in multi-agent systems and human decision-making in multi-armed bandits. In all settings, GateMod provides interpretable mechanistic explanations of gating and quantitatively matches or outperforms established models. GateMod offers a unifying framework for neural policy gating, linking task objectives, dynamical computation, and circuit-level mechanisms. It provides a framework to understand gating in natural agents beyond current explanations and to equip machines with this ability.

Updated: 2026-01-29 13:23:56

标题: 从自由能最小化中的神经政策组合

摘要: The ability to combine learned skills to plan and carry out actions is a key aspect of natural intelligence. However, despite extensive interdisciplinary efforts, a comprehensive explanation of how the structure of a task influences gating, and how these computations are implemented in neural circuits, remains elusive. In this study, we introduce GateMod, a computational model grounded in theory that connects the emergence of gating to the decision-making task and neural circuitry. We first develop GateFrame, a framework that interprets policy gating as minimizing free energy. This framework, which links gating rules to tasks, has broad applications in neuroscience, cognitive science, and computational science. We then create GateFlow, a continuous-time energy-based dynamics that converges to the optimal solution defined by GateFrame. The convergence, which is exponential and global, is ensured by a contractivity property that also provides robustness and other desirable characteristics. Finally, we establish a neural circuit, GateNet, based on GateFlow. GateNet is a soft-competitive recurrent circuit that performs local and contextual computations consistent with known dendritic and neural processing patterns. We test GateMod in two different scenarios: collective behaviors in multi-agent systems and human decision-making in multi-armed bandits. In both cases, GateMod offers interpretable explanations of gating and performs as well as or better than existing models. GateMod presents a unified framework for neural policy gating, connecting task objectives, computational dynamics, and circuit-level mechanisms. It offers a way to understand gating in natural agents beyond current explanations and to equip machines with this capability.

更新时间: 2026-01-29 13:23:56

领域: math.OC,cs.AI,eess.SY,nlin.AO

下载: http://arxiv.org/abs/2512.04745v2

NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Molecular Generation

Transformer-based autoregressive models offer a promising alternative to diffusion- and flow-matching approaches for generating 3D molecular structures. However, standard transformer architectures require a sequential ordering of tokens, which is not uniquely defined for the atoms in a molecule. Prior work has addressed this by using canonical atom orderings, but these do not ensure permutation invariance of atoms, which is essential for tasks like prefix completion. We introduce NEAT, a Neighborhood-guided, Efficient, Autoregressive, Set Transformer that treats molecular graphs as sets of atoms and learns an order-agnostic distribution over admissible tokens at the graph boundary. NEAT achieves state-of-the-art performance in autoregressive 3D molecular generation whilst ensuring atom-level permutation invariance by design.

Updated: 2026-01-29 13:23:10

标题: NEAT：面向3D分子生成的邻域引导、高效、自回归集合变换器

摘要: 基于Transformer的自回归模型为生成3D分子结构提供了一种有前途的替代方案，而不是扩散和流匹配方法。然而，标准的Transformer架构需要对令牌进行顺序排列，对于分子中的原子并没有唯一定义。先前的工作通过使用规范原子排序来解决这个问题，但这些方法并不能保证原子的排列不变性，这对于诸如前缀完成之类的任务至关重要。我们介绍了NEAT，即邻域引导、高效、自回归、集合Transformer，将分子图视为原子集合，并学习在图边界处的令牌的无序分布。NEAT在自回归3D分子生成中实现了最先进的性能，同时通过设计确保了原子级别的排列不变性。

更新时间: 2026-01-29 13:23:10

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.05844v2

Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts

Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one, operating purely in parameter space without original data or expensive re-computation. Despite empirical successes, a unified theory for its effectiveness under heterogeneous finetuning hyperparameters (e.g., varying learning rates, batch sizes) remains missing. Moreover, the lack of hyperparameter transparency in open-source fine-tuned models makes it difficult to predict merged-model performance, leaving practitioners without guidance on how to fine-tune merge-friendly experts. To address those two challenges, we employ $L_2$-Stability theory under heterogeneous hyperparameter environments to analyze the generalization of the merged model $\boldsymbol{x}_{avg}$. This pioneering analysis yields two key contributions: (i) \textit{A unified theoretical framework} is provided to explain existing merging algorithms, revealing how they optimize specific terms in our bound, thus offering a strong theoretical foundation for empirical observations. (ii) \textit{Actionable recommendations} are proposed for practitioners to strategically fine-tune expert models, enabling the construction of merge-friendly models within the pretraining-to-finetuning pipeline. Extensive experiments on the ResNet/Vit family across 20/8 visual classification tasks, involving thousands of finetuning models, robustly confirm the impact of different hyperparameters on the generalization of $\boldsymbol{x}_{avg}$ predicted by our theoretical results.

Updated: 2026-01-29 13:22:06

标题: 理解模型合并：一种用于异质专家的统一概化框架

摘要: 模型合并有效地将多个精细调整的模型的能力聚合到一个单一模型中，在参数空间中纯粹操作，而不需要原始数据或昂贵的重新计算。尽管在实证成功方面取得了成就，但在异构微调超参数（例如，不同的学习率、批量大小）下其有效性的统一理论仍然缺失。此外，在开源精细调整模型中缺乏超参数透明度，使得难以预测合并模型的性能，让从业者无法指导如何精细调整合并友好的专家。为了解决这两个挑战，我们在异构超参数环境下采用$L_2$-稳定性理论分析合并模型$\boldsymbol{x}_{avg}$的泛化。这项开创性分析产生了两个关键贡献：(i)\textit{提供了一个统一的理论框架}，解释现有的合并算法，揭示它们如何优化我们的界限中的特定项，从而为实证观察提供了坚实的理论基础。(ii)\textit{提出了可行的建议}，供从业者战略性地微调专家模型，使其能够在预训练到微调流程中构建合并友好的模型。对ResNet/Vit系列在20/8个视觉分类任务上的广泛实验，涉及数千个微调模型，坚定地证实了我们的理论结果对$\boldsymbol{x}_{avg}$的泛化受不同超参数影响的影响。

更新时间: 2026-01-29 13:22:06

领域: cs.LG

下载: http://arxiv.org/abs/2601.21690v1

Two Heads are Better than One: Distilling Large Language Model Features Into Small Models with Feature Decomposition and Mixture

Market making (MM) through Reinforcement Learning (RL) has attracted significant attention in financial trading. With the development of Large Language Models (LLMs), more and more attempts are being made to apply LLMs to financial areas. A simple, direct application of LLM as an agent shows significant performance. Such methods are hindered by their slow inference speed, while most of the current research has not studied LLM distillation for this specific task. To address this, we first propose the normalized fluorescent probe to study the mechanism of the LLM's feature. Based on the observation found by our investigation, we propose Cooperative Market Making (CMM), a novel framework that decouples LLM features across three orthogonal dimensions: layer, task, and data. Various student models collaboratively learn simple LLM features along with different dimensions, with each model responsible for a distinct feature to achieve knowledge distillation. Furthermore, CMM introduces an Hájek-MoE to integrate the output of the student models by investigating the contribution of different models in a kernel function-generated common feature space. Extensive experimental results on four real-world market datasets demonstrate the superiority of CMM over the current distillation method and RL-based market-making strategies.

Updated: 2026-01-29 13:22:04

标题: 两个头胜过一个：通过特征分解和混合将大型语言模型特征提炼为小型模型

摘要: 市场做市（MM）通过强化学习（RL）在金融交易中引起了重大关注。随着大型语言模型（LLMs）的发展，越来越多的尝试将LLMs应用于金融领域。将LLM作为代理的简单直接应用显示出显著的性能。这些方法受到推断速度缓慢的阻碍，而目前大多数研究尚未研究LLM在这一特定任务中的蒸馏。为了解决这个问题，我们首先提出了归一化荧光探针来研究LLM特征的机制。根据我们调查发现的观察结果，我们提出了合作市场做市（CMM），这是一个新颖的框架，将LLM特征沿三个正交维度进行解耦：层、任务和数据。各种学生模型协同学习简单的LLM特征，同时负责不同维度的每个模型都负责一个独特的特征，以实现知识蒸馏。此外，CMM引入了Hájek-MoE，通过研究在由核函数生成的公共特征空间中不同模型的贡献来集成学生模型的输出。对四个真实市场数据集的广泛实验结果证明了CMM相对于当前蒸馏方法和基于RL的市场做市策略的优越性。

更新时间: 2026-01-29 13:22:04

领域: cs.AI

下载: http://arxiv.org/abs/2511.07110v3

XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision

Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce \textsc{XFactors}, a weakly-supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into a residual subspace $\mathcal{S}$ and factor-specific subspaces $\mathcal{T}_1,\ldots,\mathcal{T}_K$ and a residual subspace $\mathcal{S}$. Each target factor is encoded in its assigned $\mathcal{T}_i$ through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both $\mathcal{S}$ and the aggregated factor subspaces, organizing the geometry without additional supervision for non-targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with constant hyperparameters, \textsc{XFactors} achieves state-of-the-art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales correctly with increasing latent capacity and evaluate it on the real-world dataset CelebA. Our code is available at \href{https://github.com/ICML26-anon/XFactors}{github.com/ICML26-anon/XFactors}.

Updated: 2026-01-29 13:20:48

标题: XFACTORS: 通过对比监督实现解开的信息瓶颈

摘要: 解缠表示学习旨在将独立变化因素映射到独立表示组件。一方面，纯无监督方法已经在完全解缠合成数据上取得成功，但在没有强大归纳偏差的情况下，无法从真实数据中恢复语义因素。另一方面，监督方法不稳定且难以扩展到大型属性集，因为它们依赖于对抗目标或辅助分类器。我们介绍了\textsc{XFactors}，一个弱监督的VAE框架，可以解缠并对一组选择的因素进行显式控制。基于解缠信息瓶颈视角，我们将表示分解为一个残差子空间$\mathcal{S}$和因素特定子空间$\mathcal{T}_1，\ldots，\mathcal{T}_K$以及一个残差子空间$\mathcal{S}$。每个目标因素都通过对比监督编码在其分配的$\mathcal{T}_i$中：一个InfoNCE损失将共享相同因素值的潜在因素拉在一起，并将不匹配的对推开。同时，KL正则化对$\mathcal{S}$和聚合的因素子空间都施加高斯结构，组织几何结构，无需为非目标因素提供额外的监督，避免对抗性训练和分类器。在多个数据集上，使用恒定的超参数，\textsc{XFactors}实现了最先进的解缠分数，并在相应子空间中产生一致的质量因素对齐，通过潜在替换实现可控的因素交换。我们进一步证明了我们的方法在增加潜在容量时正确扩展，并在真实世界数据集CelebA上进行了评估。我们的代码可在\href{https://github.com/ICML26-anon/XFactors}{github.com/ICML26-anon/XFactors}上找到。

更新时间: 2026-01-29 13:20:48

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21688v1

Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

Key--value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrixes to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns \emph{orthonormal} projection bases by directly minimizing \emph{decoder-layer output reconstruction error}. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Noteworthy, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $11.9$ points on C4 perplexity and $5.4\%$ on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.

Updated: 2026-01-29 13:19:24

标题: 不要那么顽固！在Stiefel流形上学习KV缓存低秩逼近

摘要: 关键-值（KV）缓存可以实现快速的自回归解码，但在长上下文中会成为高带宽内存（HBM）容量和带宽的主要瓶颈。一种常见的缓解方法是通过将每个头矩阵投影到较低秩，仅在HBM中存储这些投影来压缩缓存的键和值。然而，现有的后训练方法通常使用类似SVD风格的代理目标来拟合这些投影，这可能无法很好地反映通过softmax、值混合和后续解码器层转换后的端到端重建。基于这些原因，我们引入了StiefAttention，一种后训练的KV缓存压缩方法，通过直接最小化解码器层输出重建误差来学习正交投影基。StiefAttention还为每个层预先计算了一个候选秩的误差-秩配置文件，使用户可以在指定的误差预算下灵活地分配逐层秩。值得注意的是，在相同条件下，StiefAttention在Llama3-8B上在C4困惑度上优于EigenAttention 11.9个点，在零射击MMLU准确度上优于5.4%，在等压缩条件下，相对误差更低，与原始解码器层输出的余弦相似度更高。

更新时间: 2026-01-29 13:19:24

领域: cs.LG

下载: http://arxiv.org/abs/2601.21686v1

Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose \textbf{Recycling Search Experience (RSE)}, a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.

Updated: 2026-01-29 13:18:36

标题: 不要浪费你的部署：回收搜索经验以实现高效的测试时间扩展

摘要: Test-Time Scaling通过为广泛探索解决方案空间分配额外推理计算来增强大型语言模型的推理能力。然而，现有的搜索策略通常将rollouts视为一次性样本，其中宝贵的中间见解在每次试验后被有效丢弃。这种系统性的无记忆性导致了大量的计算冗余，因为模型在广泛尝试中重复重新推导已发现的结论并重访已知的死胡同。为了弥合这一差距，我们提出了\textbf{Recycling Search Experience (RSE)}，这是一种自主引导、无需训练的策略，将测试时间搜索从一系列孤立的试验转变为一个累积过程。通过将原始轨迹积极提炼成共享经验库，RSE使中间结论得到积极回收以快速缩短冗余推导，使失败模式得到负向回收以修剪遇到的死胡同。理论上，我们提供了一个分析，正式规定了RSE的效率增益，并验证了它在解决复杂推理任务中优于独立采样的优势。从实证上看，对HMMT24、HMMT25、IMO-Bench和HLE的大量实验表明，RSE始终优于具有相当计算成本的强基线，实现了最先进的规模效率。

更新时间: 2026-01-29 13:18:36

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2601.21684v1

Understanding Post-Training Structural Changes in Large Language Models

Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two unexpected and robust structural changes: (1) a near-uniform geometric scaling of singular values across layers; and (2) highly consistent orthogonal transformations are applied to the left and right singular vectors of each matrix. Based on these findings, We propose a simple yet effective framework to describe the coordinated dynamics of parameters in LLMs, which elucidates why post-training inherently relies on the foundational capabilities developed during pre-training. Further experiments demonstrate that singular value scaling underpins the temperature-controlled regulatory mechanisms of post-training, while the coordinated rotation of singular vectors encodes the essential semantic alignment. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.

Updated: 2026-01-29 13:16:15

标题: 理解大型语言模型训练后的结构变化

摘要: 培训后基本上改变了大型语言模型（LLMs）的行为，但对其内部参数空间的影响仍不明确。在这项工作中，我们对预训练的LLMs中的主要线性层进行了系统的奇异值分解（SVD）分析，重点关注两种广泛采用的后训练方法：指导调整和长链思维（Long-CoT）蒸馏。我们的分析揭示了两个意外且稳健的结构性变化：（1）各层奇异值的几何缩放几乎是均匀的；（2）左右奇异向量的高度一致的正交变换应用于每个矩阵。基于这些发现，我们提出了一个简单而有效的框架来描述LLMs中参数的协调动态，阐明了为什么后训练本质上依赖于预训练期间发展的基础能力。进一步的实验表明，奇异值缩放支撑了后训练的温度控制调节机制，而奇异向量的协调旋转则编码了基本的语义对齐。这些结果挑战了大型模型参数空间的传统观点，将参数在训练过程中如何演变的第一个明确规律揭示出来，并为更深入地研究模型参数变化提供了新的视角。

更新时间: 2026-01-29 13:16:15

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.17866v4

Can Local Learning Match Self-Supervised Backpropagation?

While end-to-end self-supervised learning with backpropagation (global BP-SSL) has become central for training modern AI systems, theories of local self-supervised learning (local-SSL) have struggled to build functional representations in deep neural networks. To establish a link between global and local rules, we first develop a theory for deep linear networks: we identify conditions for local-SSL algorithms (like Forward-forward or CLAPP) to implement exactly the same weight update as a global BP-SSL. Starting from the theoretical insights, we then develop novel variants of local-SSL algorithms to approximate global BP-SSL in deep non-linear convolutional neural networks. Variants that improve the similarity between gradient updates of local-SSL with those of global BP-SSL also show better performance on image datasets (CIFAR-10, STL-10, and Tiny ImageNet). The best local-SSL rule with the CLAPP loss function matches the performance of a comparable global BP-SSL with InfoNCE or CPC-like loss functions, and improves upon state-of-the-art for local SSL on these benchmarks.

Updated: 2026-01-29 13:15:57

标题: 本地学习能够与自监督反向传播相匹配吗？

摘要: 尽管端到端自监督学习与反向传播（全局BP-SSL）已成为训练现代人工智能系统的核心方法，但局部自监督学习（本地-SSL）的理论一直在努力构建深度神经网络中的功能表示。为了建立全局和局部规则之间的联系，我们首先为深度线性网络开发了一个理论：我们确定了局部-SSL算法（如前向前向或CLAPP）实施完全相同的权重更新的条件，如全局BP-SSL。从理论洞见出发，我们随后开发了新颖的局部-SSL算法变体，以近似深度非线性卷积神经网络中的全局BP-SSL。改进局部-SSL的梯度更新与全局BP-SSL之间的相似性的变体在图像数据集（CIFAR-10、STL-10和Tiny ImageNet）上表现更好。使用CLAPP损失函数的最佳局部-SSL规则与具有InfoNCE或类似CPC的损失函数的可比全局BP-SSL的性能相匹配，并在这些基准测试上改进了局部SSL的最新技术水平。

更新时间: 2026-01-29 13:15:57

领域: cs.LG

下载: http://arxiv.org/abs/2601.21683v1

FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce \fit, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. \fit mitigates degradation through rigorous data \underline{F}iltering, \underline{I}mportance-aware updates, and \underline{T}argeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present \textbf{PCH}, a benchmark covering \textbf{P}ersonal information, \textbf{C}opyright, and \textbf{H}armful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that \fit achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.

Updated: 2026-01-29 13:15:32

标题: 适应性计算：在持续的LLM遗忘中抵御灾难性遗忘

摘要: 大型语言模型（LLM）展示了在各种任务中的令人印象深刻的能力，但引发了关于隐私、版权和有害材料的担忧。现有的LLM遗忘方法很少考虑真实世界中持续和大量的删除请求，这可能会导致效用降级和随着请求累积而发生的灾难性遗忘。为了解决这一挑战，我们引入了\fit，这是一个处理大量删除请求的持续遗忘框架，同时在防止灾难性遗忘和遗忘后恢复方面保持鲁棒性。\fit 通过严格的数据筛选、重要性感知更新和有针对性的层属性，减轻了降级，使其能够在较长的遗忘操作序列中实现稳定性能，并在遗忘效果和效用保留之间实现有利的平衡。为了支持实际评估，我们提出了\textbf{PCH}，一个覆盖顺序删除场景中\textbf{P}ersonal信息、\textbf{C}opyright和\textbf{H}armful内容的基准，以及两个对称指标，Forget Degree（F.D.）和Retain Utility（R.U.），共同评估遗忘质量和效用保留。对四个开源LLM进行的大量实验显示，\fit 在F.D.和R.U.之间取得了最强大的折衷，超越了现有方法在MMLU、CommonsenseQA和GSM8K上的表现，并且对重新学习和量化恢复攻击具有抗性。

更新时间: 2026-01-29 13:15:32

领域: cs.CL,cs.AI,cs.CR,cs.LG

下载: http://arxiv.org/abs/2601.21682v1

Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices

As large language models (LLMs) increasingly integrate into every aspect of our work and daily lives, there are growing concerns about user privacy, which push the trend toward local deployment of these models. There are a number of lightweight LLMs (e.g., Gemini Nano, LLAMA2 7B) that can run locally on smartphones, providing users with greater control over their personal data. As a rapidly emerging application, we are concerned about their performance on commercial-off-the-shelf mobile devices. To fully understand the current landscape of LLM deployment on mobile platforms, we conduct a comprehensive measurement study on mobile devices. While user experience is the primary concern for end-users, developers focus more on the underlying implementations. Therefore, we evaluate both user-centric metrics-such as token throughput, latency, and response quality-and developer-critical factors, including resource utilization, OS strategies, battery consumption, and launch time. We also provide comprehensive comparisons across the mobile system-on-chips (SoCs) from major vendors, highlighting their performance differences in handling LLM workloads, which may help developers identify and address bottlenecks for mobile LLM applications. We hope that this study can provide insights for both the development of on-device LLMs and the design for future mobile system architecture.

Updated: 2026-01-29 13:15:08

标题: 理解口袋中的大型语言模型：基于COTS移动设备的性能研究

摘要: 随着大型语言模型（LLMs）越来越多地整合到我们工作和日常生活的方方面面，人们对用户隐私的担忧日益增长，这推动了这些模型向本地部署的趋势。有许多轻量级LLMs（例如Gemini Nano、LLAMA2 7B）可以在智能手机上本地运行，为用户提供更大的个人数据控制权。作为一个快速新兴的应用，我们关注它们在商用手机设备上的性能。为了充分了解LLM在移动平台上的部署现状，我们在移动设备上进行了全面的测量研究。虽然用户体验是终端用户的主要关注点，开发人员更关注底层实现。因此，我们评估了用户中心的指标（如令牌吞吐量、延迟和响应质量）和开发人员关键因素（包括资源利用、操作系统策略、电池消耗和启动时间）。我们还在主要供应商的移动系统芯片（SoCs）之间提供了全面的比较，突出它们在处理LLM工作负载方面的性能差异，这可能有助于开发人员识别和解决移动LLM应用的瓶颈。我们希望这项研究可以为本地LLMs的开发和未来移动系统架构的设计提供见解。

更新时间: 2026-01-29 13:15:08

领域: cs.LG

下载: http://arxiv.org/abs/2410.03613v4

LLM4Fluid: Large Language Models as Generalizable Neural Solvers for Fluid Dynamics

Deep learning has emerged as a promising paradigm for spatio-temporal modeling of fluid dynamics. However, existing approaches often suffer from limited generalization to unseen flow conditions and typically require retraining when applied to new scenarios. In this paper, we present LLM4Fluid, a spatio-temporal prediction framework that leverages Large Language Models (LLMs) as generalizable neural solvers for fluid dynamics. The framework first compresses high-dimensional flow fields into a compact latent space via reduced-order modeling enhanced with a physics-informed disentanglement mechanism, effectively mitigating spatial feature entanglement while preserving essential flow structures. A pretrained LLM then serves as a temporal processor, autoregressively predicting the dynamics of physical sequences with time series prompts. To bridge the modality gap between prompts and physical sequences, which can otherwise degrade prediction accuracy, we propose a dedicated modality alignment strategy that resolves representational mismatch and stabilizes long-term prediction. Extensive experiments across diverse flow scenarios demonstrate that LLM4Fluid functions as a robust and generalizable neural solver without retraining, achieving state-of-the-art accuracy while exhibiting powerful zero-shot and in-context learning capabilities. Code and datasets are publicly available at https://github.com/qisongxiao/LLM4Fluid.

Updated: 2026-01-29 13:14:48

标题: LLM4Fluid：大型语言模型作为流体动力学通用神经求解器

摘要: 深度学习已经成为流体动力学时空建模的一种有前途的范式。然而，现有方法通常受限于对未知流动条件的泛化能力有限，并且在应用于新场景时通常需要重新训练。在本文中，我们提出了LLM4Fluid，这是一个利用大型语言模型（LLMs）作为流体动力学通用神经求解器的时空预测框架。该框架首先通过物理信息解缠机制增强的降阶建模将高维流场压缩成紧凑的潜在空间，有效地减轻了空间特征纠缠，同时保留了基本的流动结构。预先训练的LLM然后作为时间处理器，使用时间序列提示自回归地预测物理序列的动态。为了弥合提示和物理序列之间的模态差距，否则可能降低预测准确性，我们提出了一个专门的模态对齐策略，解决了表征不匹配问题并稳定了长期预测。在各种流动场景下进行的大量实验表明，LLM4Fluid作为一个强大且具有泛化能力的神经求解器，无需重新训练，实现了最先进的准确性，同时展示了强大的零样本和上下文学习能力。代码和数据集可在https://github.com/qisongxiao/LLM4Fluid 上公开获取。

更新时间: 2026-01-29 13:14:48

领域: cs.LG,physics.flu-dyn

下载: http://arxiv.org/abs/2601.21681v1

Incremental Fingerprinting in an Open World

Network protocol fingerprinting is used to identify a protocol implementation by analyzing its input-output behavior. Traditionally, fingerprinting operates under a closed-world assumption, where models of all implementations are assumed to be available. However, this assumption is unrealistic in practice. When this assumption does not hold, fingerprinting results in numerous misclassifications without indicating that a model for an implementation is missing. Therefore, we introduce an open-world variant of the fingerprinting problem, where not all models are known in advance. We propose an incremental fingerprinting approach to solve the problem by combining active automata learning with closed-world fingerprinting. Our approach quickly determines whether the implementation under consideration matches an available model using fingerprinting and conformance checking. If no match is found, it learns a new model by exploiting the structure of available models. We prove the correctness of our approach and improvements in asymptotic complexity compared to naive baselines. Moreover, experimental results on a variety of protocols demonstrate a significant reduction in misclassifications and interactions with these black-boxes.

Updated: 2026-01-29 13:14:15

标题: 在一个开放的世界中的增量指纹识别

摘要: 网络协议指纹识别是通过分析其输入输出行为来识别协议实现的方法。传统上，指纹识别是在一个封闭世界假设下运行的，即假设所有实现的模型都是可用的。然而，在实践中，这种假设是不现实的。当这种假设不成立时，指纹识别会导致大量的错误分类，而不会表明缺少一个实现的模型。因此，我们引入了指纹识别问题的一个开放世界变种，其中并非所有模型都在事先知道。我们提出了一种增量指纹识别方法来解决这个问题，通过将主动自动机学习与封闭世界指纹识别相结合。我们的方法快速确定是否正在考虑的实现与可用模型匹配，使用指纹识别和符合性检查。如果没有找到匹配项，它将通过利用可用模型的结构来学习一个新模型。我们证明了我们的方法的正确性以及与朴素基线相比在渐近复杂性上的改进。此外，对各种协议的实验结果表明，错误分类和与这些黑盒子的交互显著减少。

更新时间: 2026-01-29 13:14:15

领域: cs.CR,cs.LO

下载: http://arxiv.org/abs/2601.21680v1

LLM-Assisted Authentication and Fraud Detection

User authentication and fraud detection face growing challenges as digital systems expand and adversaries adopt increasingly sophisticated tactics. Traditional knowledge-based authentication remains rigid, requiring exact word-for-word string matches that fail to accommodate natural human memory and linguistic variation. Meanwhile, fraud-detection pipelines struggle to keep pace with rapidly evolving scam behaviors, leading to high false-positive rates and frequent retraining cycles required. This work introduces two complementary LLM-enabled solutions, namely, an LLM-assisted authentication mechanism that evaluates semantic correctness rather than exact wording, supported by document segmentation and a hybrid scoring method combining LLM judgement with cosine-similarity metrics and a RAG-based fraud-detection pipeline that grounds LLM reasoning in curated evidence to reduce hallucinations and adapt to emerging scam patterns without model retraining. Experiments show that the authentication system accepts 99.5% of legitimate non-exact answers while maintaining a 0.1% false-acceptance rate, and that the RAG-enhanced fraud detection reduces false positives from 17.2% to 3.5%. Together, these findings demonstrate that LLMs can significantly improve both usability and robustness in security workflows, offering a more adaptive , explainable, and human-aligned approach to authentication and fraud detection.

Updated: 2026-01-29 13:12:00

标题: 基于LLM的身份验证和欺诈检测

摘要: 用户认证和欺诈检测面临着越来越大的挑战，随着数字系统的扩展和对手采用越来越复杂的策略。传统的基于知识的认证依然僵化，需要精确的逐字逐句匹配，无法适应自然人类记忆和语言变化。与此同时，欺诈检测管道难以跟上迅速发展的骗局行为，导致高误报率和频繁需要重新训练的周期。本文介绍了两种互补的LLM-enabled解决方案，即LLM辅助认证机制和基于RAG的欺诈检测管道。认证机制评估语义正确性而不是精确措辞，支持文档分割和混合评分方法，结合LLM判断、余弦相似度指标和RAG基础的欺诈检测管道，将LLM推理基于策划证据，减少幻觉并适应新出现的骗局模式，无需重新训练模型。实验表明，认证系统接受99.5%的合法非精确答案，同时保持0.1%的误接受率，而RAG增强的欺诈检测将误报率从17.2%降低到3.5%。这些发现共同表明，LLM可以显著提高安全工作流程的可用性和鲁棒性，为认证和欺诈检测提供更具适应性、可解释性和与人类对齐的方法。

更新时间: 2026-01-29 13:12:00

领域: cs.CR

下载: http://arxiv.org/abs/2601.19684v2

High Effort, Low Gain: Fundamental Limits of Active Learning for Linear Dynamical Systems

In this work, we consider the problem of identifying an unknown linear dynamical system given a finite hypothesis class. In particular, we analyze the effect of the excitation input on the sample complexity of identifying the true system with high probability. To this end, we present sample complexity lower bounds that capture the choice of the selected excitation input. The sample complexity lower bound gives rise to a system theoretic condition to determine the potential benefit of experiment design. Informed by the analysis of the sample complexity lower bound, we propose a persistent excitation (PE) condition tailored to the considered setting, which we then use to establish sample complexity upper bounds. Notably, the PE condition is weaker than in the case of an infinite hypothesis class and allows analyzing different excitation inputs modularly. Crucially, the lower and upper bounds share the same dependency on key problem parameters. Finally, we leverage these insights to propose an active learning algorithm that sequentially excites the system optimally with respect to the current estimate, and provide sample complexity guarantees for the presented algorithm. Concluding simulations showcase the effectiveness of the proposed algorithm.

Updated: 2026-01-29 13:03:53

标题: 高投入、低收益：线性动态系统主动学习的基本限制

摘要: 在这项工作中，我们考虑了在给定有限假设类的情况下识别未知线性动态系统的问题。特别地，我们分析了激励输入对以高概率识别真实系统的样本复杂度的影响。为此，我们提出了能够捕捉所选激励输入选择的样本复杂度下界。样本复杂度下界产生了一个系统理论条件，用于确定实验设计的潜在好处。在分析样本复杂度下界的基础上，我们提出了一个针对考虑设置的持续激励（PE）条件，然后使用该条件建立样本复杂度上界。值得注意的是，PE条件比无限假设类的情况下更弱，允许模块化地分析不同的激励输入。关键的是，下界和上界在关键问题参数上具有相同的依赖关系。最后，我们利用这些见解提出了一种主动学习算法，该算法以优化当前估计为目标顺序激励系统，并为所提出的算法提供了样本复杂度保证。最后的模拟展示了所提出算法的有效性。

更新时间: 2026-01-29 13:03:53

领域: eess.SY,cs.LG,stat.ML

下载: http://arxiv.org/abs/2509.11907v2

When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning

Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guaranty well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose \regName, a lightweight geometry-aware regularization framework. \regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.

Updated: 2026-01-29 13:03:50

标题: 当梯度优化不够时：$ \dagger $ 分散和锚定几何正则化器用于多模态学习

摘要: 多模态学习旨在整合来自异构模态的互补信息，然而强大的优化本身并不能保证结构良好的表示。即使在 carefully balanced 的训练方案下，多模态模型经常表现出几何病态，包括内部模态表示崩溃和样本级跨模态不一致性，这会降低单模鲁棒性和多模态融合效果。我们确定表示几何形状作为多模态学习中缺失的控制轴，并提出了\regName，一个轻量级的几何感知正则化框架。\regName 对中间嵌入施加了两个互补的约束：一种促进表示多样性的内部模态分散正则化，和一种在没有刚性对齐的情况下限制样本级跨模态漂移的跨模态锚定正则化。所提出的正则化器是即插即用的，无需架构修改，并与各种训练范式兼容。在多个多模态基准测试中进行的大量实验显示，无论是在多模态还是单模态性能方面，都能持续提升，表明明确调节表示几何形状有效地缓解了模态权衡问题。

更新时间: 2026-01-29 13:03:50

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2601.21670v1

Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling

Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of high-quality outcomes. Nevertheless, policies trained by standard expected return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential ratio divergence and inevitable collapse independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the probability multiplier inside the expectation and propose a minimal correction: inverse probability scaling, which removes outcome-frequency amplification from the learning signal, fundamentally changes the learning dynamics, and provably yields reward-proportional terminal distributions, preventing collapse in multimodal settings. We instantiate this principle in Group Relative Policy Optimization (GRPO) as a drop-in modification, IPS-GRPO, requiring no auxiliary models or architectural changes. Across different reasoning and molecular generation tasks, IPS-GRPO consistently reduces outcome-level mode collapse while matching or exceeding baseline performance, suggesting that correcting the objective rather than adding exploration heuristics is key to reliable multimodal policy optimization.

Updated: 2026-01-29 13:03:33

标题: 预期收益导致强化学习中的结果级别模式崩溃及如何通过反向概率缩放修复

摘要: 许多强化学习（RL）问题存在多个质量相近的终结解决方案，目标不是识别单一最优解，而是表示多样化的高质量结果集。然而，通过标准期望回报最大化训练的策略通常会陷入一小部分结果的情况，这种现象通常被归因于不足的探索或弱正则化。我们展示了这种解释是不完整的：结果级模式崩溃是期望回报目标本身的结构性后果。在理想化的学习动态下，任何两个结果之间的对数概率比率随其奖励差异呈线性演化，意味着指数比率分歧和不可避免的崩溃，独立于探索策略、熵正则化或优化算法。我们确定了这种病态的来源是期望内的概率乘数，并提出了一种最小修正：逆概率缩放，它从学习信号中消除了结果频率放大，从根本上改变了学习动态，并可以证明产生奖励成比例的终端分布，预防了在多模态设置中的崩溃。我们在Group Relative Policy Optimization (GRPO)中实现了这一原则作为一个即插即用的修改，IPS-GRPO，不需要辅助模型或架构更改。在不同的推理和分子生成任务中，IPS-GRPO一致减少了结果级别的模式崩溃，同时匹配或超过基线性能，这表明修正目标而不是添加探索启发法则是可靠的多模态策略优化的关键。

更新时间: 2026-01-29 13:03:33

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21669v1

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research: Project page: https://vectorinstitute.github.io/sonic-o1/ Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1 Github: https://github.com/vectorinstitute/sonic-o1 Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard

Updated: 2026-01-29 13:01:07

标题: SONIC-O1：一个用于在音频视频理解上评估多模态大型语言模型的真实世界基准

摘要: 多模态大型语言模型（MLLMs）是近期人工智能研究的主要焦点。然而，大部分先前的工作侧重于静态图像理解，而它们处理顺序音视频数据的能力仍未充分探索。这种差距突显了在真实世界环境中系统评估MLLM性能的高质量基准的必要性。我们介绍了SONIC-O1，一个全面的、完全由人类验证的基准，涵盖了13个真实世界会话领域，包括4,958个注释和人口统计元数据。SONIC-O1在关键任务上评估MLLMs，包括开放式摘要、多项选择题回答和支持理由（推理）的时间定位。对封闭和开源模型的实验揭示了局限性。虽然两个模型系列之间的多项选择题准确性差距相对较小，但我们发现在最佳表现的封闭源和开源模型之间在时间定位上存在着22.6%的性能差异。性能在人口统计群体之间进一步恶化，表明模型行为中存在持续的差异。总的来说，SONIC-O1为基于时间的、社交稳健的多模态理解提供了一个开放的评估套件。我们发布了SONIC-O1以促进可重现性和研究：项目页面：https://vectorinstitute.github.io/sonic-o1/ 数据集：https://huggingface.co/datasets/vector-institute/sonic-o1 Github：https://github.com/vectorinstitute/sonic-o1 排行榜：https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard

更新时间: 2026-01-29 13:01:07

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2601.21666v1

SENDAI: A Hierarchical Sparse-measurement, EfficieNt Data AssImilation Framework

Bridging the gap between data-rich training regimes and observation-sparse deployment conditions remains a central challenge in spatiotemporal field reconstruction, particularly when target domains exhibit distributional shifts, heterogeneous structure, and multi-scale dynamics absent from available training data. We present SENDAI, a hierarchical Sparse-measurement, EfficieNt Data AssImilation Framework that reconstructs full spatial states from hyper sparse sensor observations by combining simulation-derived priors with learned discrepancy corrections. We demonstrate the performance on satellite remote sensing, reconstructing MODIS (Moderate Resolution Imaging Spectroradiometer) derived vegetation index fields across six globally distributed sites. Using seasonal periods as a proxy for domain shift, the framework consistently outperforms established baselines that require substantially denser observations -- SENDAI achieves a maximum SSIM improvement of 185% over traditional baselines and a 36% improvement over recent high-frequency-based methods. These gains are particularly pronounced for landscapes with sharp boundaries and sub-seasonal dynamics; more importantly, the framework effectively preserves diagnostically relevant structures -- such as field topologies, land cover discontinuities, and spatial gradients. By yielding corrections that are more structurally and spectrally separable, the reconstructed fields are better suited for downstream inference of indirectly observed variables. The results therefore highlight a lightweight and operationally viable framework for sparse-measurement reconstruction that is applicable to physically grounded inference, resource-limited deployment, and real-time monitor and control.

Updated: 2026-01-29 12:58:54

标题: 仙台：一种层次稀疏测量、高效数据同化框架

摘要: 填补数据丰富的训练制度和观测稀疏的部署条件之间的差距，是时空场重建中的一项中心挑战，特别是当目标领域表现出分布变化、异质结构和多尺度动态，这些都不在已有训练数据中。我们提出了SENDAI，一个层次稀疏测量、高效数据同化框架，通过将模拟导出的先验信息与学习的差异校正相结合，从超稀疏传感器观测中重建完整的空间状态。我们在卫星遥感中展示了其性能，重建了六个全球分布站点的MODIS（中分辨率成像光谱辐射计）衍生的植被指数场。以季节周期作为领域转移的代理，该框架始终优于需要更密集观测的已建立基线 - SENDAI相对于传统基线实现了最高SSIM改进185%，相对于最近的基于高频率的方法实现了36%的改进。这些收益在具有明显边界和亚季节动态的景观中尤为显著；更重要的是，该框架有效地保留了诊断相关的结构 - 如地形拓扑、土地覆盖不连续性和空间梯度。通过产生更具结构性和光谱可分性的校正，重建的场景更适合用于间接观测变量的下游推断。因此，这些结果强调了一种轻量级且操作可行的稀疏测量重建框架，适用于基于物理的推断、资源有限的部署，以及实时监视和控制。

更新时间: 2026-01-29 12:58:54

领域: cs.LG,cs.AI,cs.CE

下载: http://arxiv.org/abs/2601.21664v1

Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching

Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate negative log-density of an embedding as a proxy for the epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we also demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.

Updated: 2026-01-29 12:58:42

标题: 使用黎曼流匹配对预训练VLM的认知不确定性量化

摘要: 视觉语言模型（VLMs）通常是确定性的，缺乏固有机制来量化认识不确定性，这反映了模型对自身表示的知识或无知。我们理论上证明了嵌入的负对数密度作为认识不确定性的代理，其中低密度区域表示模型的无知。所提出的方法REPVLM使用Riemannian Flow Matching计算VLM嵌入在超球面流形上的概率密度。我们在实证上证明，REPVLM实现了不确定性和预测误差之间几乎完美的相关性，显著优于现有基线。除了分类之外，我们还展示了该模型在超出分布检测和自动数据整理方面提供了可扩展的度量标准。

更新时间: 2026-01-29 12:58:42

领域: cs.LG

下载: http://arxiv.org/abs/2601.21662v1

Normative Equivalence in Human-AI Cooperation: Behaviour, Not Identity, Drives Cooperation in Mixed-Agent Groups

The introduction of artificial intelligence (AI) agents into human group settings raises essential questions about how these novel participants influence cooperative social norms. While previous studies on human-AI cooperation have primarily focused on dyadic interactions, little is known about how integrating AI agents affects the emergence and maintenance of cooperative norms in small groups. This study addresses this gap through an online experiment using a repeated four-player Public Goods Game (PGG). Each group consisted of three human participants and one bot, which was framed either as human or AI and followed one of three predefined decision strategies: unconditional cooperation, conditional cooperation, or free-riding. In our sample of 236 participants, we found that reciprocal group dynamics and behavioural inertia primarily drove cooperation. These normative mechanisms operated identically across conditions, resulting in cooperation levels that did not differ significantly between human and AI labels. Furthermore, we found no evidence of differences in norm persistence in a follow-up Prisoner's Dilemma, or in participants' normative perceptions. Participants' behaviour followed the same normative logic across human and AI conditions, indicating that cooperation depended on group behaviour rather than partner identity. This supports a pattern of normative equivalence, in which the mechanisms that sustain cooperation function similarly in mixed human-AI and all human groups. These findings suggest that cooperative norms are flexible enough to extend to artificial agents, blurring the boundary between humans and AI in collective decision-making.

Updated: 2026-01-29 12:58:00

标题: 人工智能与人类合作中的规范等效性：行为而非身份，推动混合式代理群体的合作

摘要: 引入人工智能（AI）代理到人类群体环境中，引发了关于这些新参与者如何影响合作社会规范的基本问题。虽然先前关于人类-AI合作的研究主要集中在二人互动上，但很少有关于整合AI代理如何影响小组中合作规范的出现和维持的研究。本研究通过一项在线实验，使用重复的四人公共物品博弈（PGG）来填补这一空白。每个小组由三名人类参与者和一个机器人组成，机器人被定义为人类或AI，并遵循三种预定义的决策策略之一：无条件合作、有条件合作或自由骑行。在我们的样本中，共有236名参与者，我们发现互惠的小组动态和行为惯性主要推动了合作。这些规范机制在各种条件下运作相同，导致合作水平在人类和AI标签之间没有显著差异。此外，在后续囚徒困境中，我们发现没有规范持久性的差异，也没有参与者的规范感知差异。参与者的行为在人类和AI条件下遵循相同的规范逻辑，表明合作取决于群体行为而不是合作伙伴身份。这支持了一种规范等效性模式，即维持合作的机制在混合人类-AI和全人类群体中运作方式相似。这些发现表明，合作规范足够灵活，可以扩展到人工代理，模糊了在集体决策中人类和AI之间的界限。

更新时间: 2026-01-29 12:58:00

领域: cs.AI,cs.GT,cs.HC,econ.GN

下载: http://arxiv.org/abs/2601.20487v2

Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

We present a method for the automated discovery of system-level dynamics in Flow-Lenia--a continuous cellular automaton (CA) with mass conservation and parameter localization-using a curiosity--driven AI scientist. This method aims to uncover processes leading to self-organization of evolutionary and ecosystemic dynamics in CAs. We build on previous work which uses diversity search algorithms in Lenia to find self-organized individual patterns, and extend it to large environments that support distinct interacting patterns. We adapt Intrinsically Motivated Goal Exploration Processes (IMGEPs) to drive exploration of diverse Flow-Lenia environments using simulation-wide metrics, such as evolutionary activity, compression-based complexity, and multi-scale entropy. We test our method in two experiments, showcasing its ability to illuminate significantly more diverse dynamics compared to random search. We show qualitative results illustrating how ecosystemic simulations enable self-organization of complex collective behaviors not captured by previous individual pattern search and analysis. We complement automated discovery with an interactive exploration tool, creating an effective human-AI collaborative workflow for scientific investigation. Though demonstrated specifically with Flow-Lenia, this methodology provides a framework potentially applicable to other parameterizable complex systems where understanding emergent collective properties is of interest.

Updated: 2026-01-29 12:57:28

标题: 用好奇心驱动的人工智能科学家探索流动-莱尼亚宇宙：发现多样化的生态系统动态

摘要: 我们介绍了一种用于在Flow-Lenia中发现系统级动态的自动化方法-这是一个具有质量守恒和参数定位的连续细胞自动机（CA），使用基于好奇心的人工智能科学家。该方法旨在揭示导致CA中自组织进化和生态动态的过程。我们建立在先前的工作基础上，该工作在Lenia中使用多样性搜索算法找到自组织的个体模式，并将其扩展到支持不同交互模式的大型环境。我们采用本质上激励目标探索过程（IMGEPs）来驱动对各种Flow-Lenia环境的探索，使用模拟范围度量标准，如进化活动、基于压缩的复杂性和多尺度熵。我们在两个实验中测试了我们的方法，展示了它相对于随机搜索能够揭示更加多样化动态的能力。我们展示了定性结果，说明生态模拟使得复杂的集体行为能够自组织，而这是以前个体模式搜索和分析没有捕捉到的。我们通过一个交互式探索工具来补充自动化发现，为科学研究创建了有效的人机协作工作流程。尽管此方法特定地展示在Flow-Lenia中，但这种方法提供了一个框架，可能适用于其他参数化复杂系统，在这些系统中理解新出现的集体性质是感兴趣的。

更新时间: 2026-01-29 12:57:28

领域: cs.AI

下载: http://arxiv.org/abs/2505.15998v3

Authenticated encryption for space telemetry

We explore how command stack protection requirements outlined in NASA-STD-1006A can be satisfied within the context of emergency space telemetry. Proposed implementation of lightweight authenticated encryption offers strong security without sacrificing performance in resource-constrained environments. It produces fixed-length messages, maintaining compatibility with the underlying data transport protocols. By focusing on predictable properties and robust authentication, we create a scheme that protects the confidentiality, integrity and authenticity of telemetry data in emergency communications while balancing security requirements with the operational constraints.

Updated: 2026-01-29 12:57:27

标题: 空间遥测的认证加密

摘要: 我们探讨了如何在紧急空间遥测的背景下满足NASA-STD-1006A中规定的命令堆栈保护要求。提出了轻量级身份验证加密的实现方案，它在资源受限的环境中提供了强大的安全性而不牺牲性能。它生成固定长度的消息，保持与底层数据传输协议的兼容性。通过专注于可预测的属性和强大的身份验证，我们创建了一个方案，在紧急通信中保护遥测数据的机密性、完整性和真实性，同时平衡了安全要求和操作约束。

更新时间: 2026-01-29 12:57:27

领域: cs.CR,cs.NI,eess.SY

下载: http://arxiv.org/abs/2601.21657v1

TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering

Clustering tabular data is a fundamental yet challenging problem due to heterogeneous feature types, diverse data-generating mechanisms, and the absence of transferable inductive biases across datasets. Prior-fitted networks (PFNs) have recently demonstrated strong generalization in supervised tabular learning by amortizing Bayesian inference under a broad synthetic prior. Extending this paradigm to clustering is nontrivial: clustering is unsupervised, admits a combinatorial and permutation-invariant output space, and requires inferring the number of clusters. We introduce TabClustPFN, a prior-fitted network for tabular data clustering that performs amortized Bayesian inference over both cluster assignments and cluster cardinality. Pretrained on synthetic datasets drawn from a flexible clustering prior, TabClustPFN clusters unseen datasets in a single forward pass, without dataset-specific retraining or hyperparameter tuning. The model naturally handles heterogeneous numerical and categorical features and adapts to a wide range of clustering structures. Experiments on synthetic data and curated real-world tabular benchmarks show that TabClustPFN outperforms classical, deep, and amortized clustering baselines, while exhibiting strong robustness in out-of-the-box exploratory settings. Code is available at https://github.com/Tianqi-Zhao/TabClustPFN.

Updated: 2026-01-29 12:56:41

标题: TabClustPFN：用于表格数据聚类的先验适配网络

摘要: 对表格数据进行聚类是一个基本但具有挑战性的问题，原因在于特征类型的异质性、多样化的数据生成机制以及跨数据集之间缺乏可转移的归纳偏差。先前拟合的网络（PFNs）最近通过在广泛的合成先验下摊销贝叶斯推断，在监督学习中展现了强大的泛化能力。将这一范式扩展到聚类是非平凡的：聚类是无监督的，接受一个组合和置换不变的输出空间，并需要推断出簇的数量。我们介绍了TabClustPFN，一种用于表格数据聚类的先验拟合网络，它在簇分配和簇基数上执行摊销贝叶斯推断。在从灵活的聚类先验中抽取的合成数据集上预训练，TabClustPFN可以在单次前向传递中对看不见的数据集进行聚类，无需数据集特定的重新训练或超参数调整。该模型自然地处理异质的数值和分类特征，并适应各种聚类结构。在合成数据和策划的真实世界表格基准上的实验表明，TabClustPFN优于传统、深度和摊销聚类基线，同时在开箱即用的探索设置中表现出强大的稳健性。代码可在https://github.com/Tianqi-Zhao/TabClustPFN 上找到。

更新时间: 2026-01-29 12:56:41

领域: cs.LG

下载: http://arxiv.org/abs/2601.21656v1

ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval

Tool-augmented large language models have advanced from single-turn question answering to deep research workflows that iteratively plan queries, invoke external tools, and synthesize information to address complex information needs. Evaluating such workflows presents a fundamental challenge: reliance on live APIs introduces non-determinism, as tool invocations may yield different results across runs due to temporal drift, rate limiting, and evolving backend states. This variance undermines reproducibility and invalidates cross-system comparisons. We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. The environment decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis of each stage under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth. Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement.

Updated: 2026-01-29 12:51:44

标题: 学术体操：在学术文献检索中对深度研究工作流程进行基准测试

摘要: 工具增强的大型语言模型已经从单轮问答发展到深度研究工作流程，这些工作流程通过迭代地规划查询、调用外部工具并综合信息来解决复杂的信息需求。评估这样的工作流程面临着一个基本挑战：依赖于实时API会引入非确定性，因为工具调用可能由于时间漂移、速率限制和演变的后端状态而在运行过程中产生不同的结果。这种差异削弱了可重现性并使跨系统比较失效。我们提出了ScholarGym，这是一个用于在学术文献上可重现评估深度研究工作流程的模拟环境。该环境将工作流程组件分解为查询规划、工具调用和相关性评估，从而在受控条件下对每个阶段进行细粒度分析。基于一个包含570K篇论文的静态语料库，具有确定性检索的ScholarGym提供了2,536个带有专家注释的真实数据查询。跨不同的基础模型进行的实验揭示了推理能力、规划策略和选择机制在迭代改进过程中的相互作用。

更新时间: 2026-01-29 12:51:44

领域: cs.AI

下载: http://arxiv.org/abs/2601.21654v1

Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in $\{\pm 1, \pm i\}$

Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware. We open-source the Fairy2i model and code at https://huggingface.co/PKU-DS-LAB/Fairy2i-W2 and https://github.com/PKULab1806/Fairy2i-W2.

Updated: 2026-01-29 12:51:21

标题: Fairy2i：使用所有参数为$\{\pm 1, \pm i\}$的真实LLMs训练复杂LLMs

摘要: 大型语言模型（LLMs）已经彻底改变了人工智能，然而它们庞大的内存和计算需求需要进行激进的量化，逐渐将表示推向单比特的理论极限。虽然复杂值LLMs，如iFairy，与实值对应物相比，提供了更好的低比特表示机会，但它们需要从头开始训练，从而阻止了利用大量预先训练的实值基础模型生态系统。在这里，我们提出了Fairy2i，这是一个通用框架，将预先训练的实值层转换为等效的广义线性复形式，实现极低比特量化同时重复使用现有检查点。通过证明实值和广义线性映射之间的无损数学等价性，我们将标准Transformer转换为复域，并采用高效的四次单位根码本的相感知量化方案。此外，我们引入了一种递归残差量化机制，可以迭代地最小化量化误差，使推理可以通过高效的无乘法累积进行。我们证明了Fairy2i在有效的2比特精度下恢复了LLaMA-27B的性能，几乎可以与完整精度基线相媲美，明显优于最先进的实值二进制和三进制量化方法。这项工作弥合了复值算术的表示效率与预训练模型的实际效用之间的差距，为在通用硬件上进行高效推理铺平了一条新路。我们在https://huggingface.co/PKU-DS-LAB/Fairy2i-W2和https://github.com/PKULab1806/Fairy2i-W2上开源了Fairy2i模型和代码。

更新时间: 2026-01-29 12:51:21

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.02901v3

Gauge-invariant representation holonomy

Deep networks learn internal representations whose geometry--how features bend, rotate, and evolve--affects both generalization and robustness. Existing similarity measures such as CKA or SVCCA capture pointwise overlap between activation sets, but miss how representations change along input paths. Two models may appear nearly identical under these metrics yet respond very differently to perturbations or adversarial stress. We introduce representation holonomy, a gauge-invariant statistic that measures this path dependence. Conceptually, holonomy quantifies the "twist" accumulated when features are parallel-transported around a small loop in input space: flat representations yield zero holonomy, while nonzero values reveal hidden curvature. Our estimator fixes gauge through global whitening, aligns neighborhoods using shared subspaces and rotation-only Procrustes, and embeds the result back to the full feature space. We prove invariance to orthogonal (and affine, post-whitening) transformations, establish a linear null for affine layers, and show that holonomy vanishes at small radii. Empirically, holonomy increases with loop radius, separates models that appear similar under CKA, and correlates with adversarial and corruption robustness. It also tracks training dynamics as features form and stabilize. Together, these results position representation holonomy as a practical and scalable diagnostic for probing the geometric structure of learned representations beyond pointwise similarity.

Updated: 2026-01-29 12:51:17

标题: 规范不变的表示荷载

摘要: 深度网络学习内部表示，其几何结构——特征如何弯曲、旋转和演变——影响泛化和鲁棒性。现有的相似性度量，如CKA或SVCCA，捕捉激活集之间的点对点重叠，但忽略了表示沿输入路径如何变化。根据这些度量，两个模型可能看起来几乎相同，但对扰动或对抗性压力的响应却非常不同。我们引入了表示holonomy，这是一种测量路径依赖性的规范不变统计量。从概念上讲，holonomy度量了当特征在输入空间中一个小环路周围平行传输时所积累的“扭曲”：平坦的表示产生零holonomy，而非零值揭示了隐藏的曲率。我们的估计器通过全局白化修正规范，使用共享子空间和仅旋转的Procrustes将邻域对齐，并将结果嵌入回完整的特征空间。我们证明了对正交（和仿射，白化后）变换的不变性，为仿射层建立了线性零空间，并展示了holonomy在小半径时消失。在经验上，holonomy随着环路半径的增加而增加，将在CKA下看起来相似的模型分离开，并与对抗和损坏鲁棒性相关。它还跟踪训练动态，当特征形成和稳定时。总的来说，这些结果将表示holonomy定位为一种探究学习到的表示的几何结构的实用和可扩展的诊断方法，超越了点对点的相似性。

更新时间: 2026-01-29 12:51:17

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21653v1

CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

Data preprocessing is a critical yet frequently neglected aspect of machine learning, often paid little attention despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like survival or time-to-event models. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents 'CleanSurvival', a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The package is available on GitHub: https://github.com/datasciapps/CleanSurvival Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing results in superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates the effectiveness in different types and levels of missingness and noise in the data.

Updated: 2026-01-29 12:51:02

标题: CleanSurvival：使用强化学习自动化数据预处理以用于时间至事件模型

摘要: 数据预处理是机器学习中一个至关重要但经常被忽视的方面，尽管其对模型性能有潜在重大影响，但往往被忽视。虽然自动化机器学习流程开始意识到并将数据预处理整合到它们的分类和回归任务解决方案中，但这种整合在更专门的任务如生存或时间到事件模型中缺乏。因此，生存分析不仅面临数据预处理的一般挑战，还因缺乏在这一领域的量身定制的自动化解决方案而受到影响。为填补这一空白，本文介绍了一种基于强化学习的解决方案'CleanSurvival'，专门为生存分析进行了扩展。该框架可以处理连续和分类变量，使用Q-learning选择哪种数据插补、异常值检测和特征提取技术的组合可以实现Cox、随机森林、神经网络或用户提供的时间到事件模型的最佳性能。该软件包可在GitHub上找到：https://github.com/datasciapps/CleanSurvival 在真实数据集上进行的实验基准测试显示，基于Q-learning的数据预处理结果在预测性能上优于标准方法，比无目标的随机网格搜索更快地找到这样的模型高达10倍。此外，模拟研究证明了在数据中不同类型和水平的缺失和噪声下的有效性。

更新时间: 2026-01-29 12:51:02

领域: cs.LG

下载: http://arxiv.org/abs/2502.03946v3

When Life Gives You AI, Will You Turn It Into A Market for Lemons? Understanding How Information Asymmetries About AI System Capabilities Affect Market Outcomes and Adoption

AI consumer markets are characterized by severe buyer-supplier market asymmetries. Complex AI systems can appear highly accurate while making costly errors or embedding hidden defects. While there have been regulatory efforts surrounding different forms of disclosure, large information gaps remain. This paper provides the first experimental evidence on the important role of information asymmetries and disclosure designs in shaping user adoption of AI systems. We systematically vary the density of low-quality AI systems and the depth of disclosure requirements in a simulated AI product market to gauge how people react to the risk of accidentally relying on a low-quality AI system. Then, we compare participants' choices to a rational Bayesian model, analyzing the degree to which partial information disclosure can improve AI adoption. Our results underscore the deleterious effects of information asymmetries on AI adoption, but also highlight the potential of partial disclosure designs to improve the overall efficiency of human decision-making.

Updated: 2026-01-29 12:49:28

标题: 当生活给你人工智能时，你会把它变成一个“柠檬市场”吗？理解关于人工智能系统能力的信息不对称如何影响市场结果和采纳。

摘要: 人工智能消费市场存在严重的买方-卖方市场不对称性。复杂的人工智能系统可能看起来非常准确，但却会造成昂贵的错误或隐藏的缺陷。尽管在不同形式的披露方面存在监管努力，但仍存在大量信息缺口。本文首次提供了关于信息不对称和披露设计对用户采用人工智能系统的重要作用的实验证据。我们在模拟的人工智能产品市场中系统地改变低质量人工智能系统的密度和披露要求的深度，以衡量人们如何对意外依赖低质量人工智能系统的风险做出反应。然后，我们将参与者的选择与理性贝叶斯模型进行比较，分析部分信息披露能够改善人工智能采用程度。我们的结果突显了信息不对称对人工智能采用的有害影响，但也强调了部分披露设计提高人类决策整体效率的潜力。

更新时间: 2026-01-29 12:49:28

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2601.21650v1

SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning

The deployment of coding agents in privacy-sensitive and resource-constrained environments drives the demand for capable open-weight Small Language Models (SLMs). However, they suffer from a fundamental capability gap: unlike frontier large models, they lack the inference-time strong generalization to work with complicated, unfamiliar codebases. We identify that the prevailing Task-Centric Learning (TCL) paradigm, which scales exposure across disparate repositories, fails to address this limitation. In response, we propose Repository-Centric Learning (RCL), a paradigm shift that prioritizes vertical repository depth over horizontal task breadth, suggesting SLMs must internalize the "physics" of a target software environment through parametric knowledge acquisition, rather than attempting to recover it via costly inference-time search. Following this new paradigm, we design a four-unit Repository-Centric Experience, transforming static codebases into interactive learning signals, to train SWE-Spot-4B, a family of highly compact models built as repo-specialized experts that breaks established scaling trends, outperforming open-weight models up to larger (e.g., CWM by Meta, Qwen3-Coder-30B) and surpassing/matching efficiency-focused commercial models (e.g., GPT-4.1-mini, GPT-5-nano) across multiple SWE tasks. Further analysis reveals that RCL yields higher training sample efficiency and lower inference costs, emphasizing that for building efficient intelligence, repository mastery is a distinct and necessary dimension that complements general coding capability.

Updated: 2026-01-29 12:49:25

标题: SWE-Spot: 通过以仓库为中心的学习构建小型仓库专家

摘要: 在隐私敏感和资源受限的环境中部署编码代理驱动了对能力开放轻量级小语言模型（SLM）的需求。然而，它们存在一个根本的能力差距：与前沿大型模型不同，它们缺乏推理时间强大的泛化能力，无法处理复杂、陌生的代码库。我们发现，目前流行的以任务为中心的学习（TCL）范式，跨不同存储库扩展暴露度，未能解决这一限制。作为回应，我们提出了以存储库为中心的学习（RCL），这是一种范式转变，优先考虑垂直存储库深度而不是水平任务广度，建议SLMs必须通过参数化知识获取内化目标软件环境的“物理规律”，而不是尝试通过昂贵的推理时间搜索来恢复它。遵循这种新的范式，我们设计了一个四单元的以存储库为中心的经验，将静态代码库转化为互动学习信号，以训练SWE-Spot-4B，这是一个作为存储库专家构建的高度紧凑的模型系列，打破了已建立的扩展趋势，表现优于更大规模的开放权重模型（例如Meta的CWM，Qwen3-Coder-30B），并在多个SWE任务中超越/匹配以效率为重点的商业模型（例如GPT-4.1-mini，GPT-5-nano）。进一步分析显示，RCL实现了更高的训练样本效率和更低的推理成本，强调了为构建高效智能，存储库掌握是一种独特且必要的维度，它补充了通用编码能力。

更新时间: 2026-01-29 12:49:25

领域: cs.LG,cs.AI,cs.CL,cs.SE

下载: http://arxiv.org/abs/2601.21649v1

ILRR: Inference-Time Steering Method for Masked Diffusion Language Models

Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures or trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, with a tunable steering scale enabling flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that enables steering long texts using shorter references by regulating guidance intensity across the sequence. Empirically, we demonstrate that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with a minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10$\%$ to 60$\%$ points, while maintaining high generation quality.

Updated: 2026-01-29 12:48:59

标题: ILRR：用于掩码扩散语言模型的推理时间导向方法

摘要: 离散扩散语言模型（DLMs）为文本生成提供了一种有前途的非自回归替代方案，然而在推理时控制的有效机制仍相对未被充分探讨。现有方法包括采样级别的引导程序或轨迹优化机制。在这项工作中，我们介绍了迭代潜在表示细化（ILRR），这是一个无需学习的框架，用于通过单个参考序列引导DLMs。ILRR通过在去噪过程中动态地将生成序列的内部激活与给定参考的激活进行对齐来指导生成。这种方法捕捉并传递高级语义属性，通过可调节的转向尺度实现对情感等属性的灵活控制。我们进一步介绍了空间调制转向（Spatially Modulated Steering），这是一种扩展，通过在整个序列上调节指导强度，使较短的参考文本能够引导长文本。从实证上看，我们证明了ILRR在LLaDA和MDLM架构上实现了有效的属性引导，而且只需额外进行一次平行前向传递以完成每个去噪步骤，计算开销较小。在相同的计算预算下，ILRR将属性准确性提高了10％到60％，同时保持了高质量的生成。

更新时间: 2026-01-29 12:48:59

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21647v1

PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates

This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy method that modifies gradient accumulation across microbatches rather than relying on likelihood ratios relative to a reference policy. During accumulation, PROMA projects the partially accumulated gradient to be orthogonal to the sequence-wise gradients of the current microbatch. This projection is applied layer-wise during the backward pass, enabling efficient implementation. A within-microbatch variant (Intra-PROMA) acts independently across microbatches. Empirically, PROMA achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.

Updated: 2026-01-29 12:48:19

标题: PROMA: 用于无参考近端策略更新的投影微批积累

摘要: 这篇论文介绍了Projected Microbatch Accumulation（PROMA），这是一种近端策略方法，它修改了微批次间的梯度累积，而不是依赖于相对于参考策略的似然比。在累积过程中，PROMA将部分累积的梯度投影为与当前微批次的序列梯度正交。这种投影在反向传播过程中逐层应用，实现了高效的实现。一种微批次内变体（Intra-PROMA）在微批次间独立操作。实证上，PROMA实现了近端更新，避免了熵崩溃，同时比GRPO提供了更紧密的本地KL控制。

更新时间: 2026-01-29 12:48:19

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.10498v3

Identifiable Equivariant Networks are Layerwise Equivariant

We investigate the relation between end-to-end equivariance and layerwise equivariance in deep neural networks. We prove the following: For a network whose end-to-end function is equivariant with respect to group actions on the input and output spaces, there is a parameter choice yielding the same end-to-end function such that its layers are equivariant with respect to some group actions on the latent spaces. Our result assumes that the parameters of the model are identifiable in an appropriate sense. This identifiability property has been established in the literature for a large class of networks, to which our results apply immediately, while it is conjectural for others. The theory we develop is grounded in an abstract formalism, and is therefore architecture-agnostic. Overall, our results provide a mathematical explanation for the emergence of equivariant structures in the weights of neural networks during training -- a phenomenon that is consistently observed in practice.

Updated: 2026-01-29 12:47:51

标题: 可识别的等变网络是逐层等变的

摘要: 我们研究了深度神经网络中端对端等变性和逐层等变性之间的关系。我们证明了以下结论：对于一个网络，其端到端函数在输入和输出空间上对群作用具有等变性，存在一种参数选择，使得其端到端函数相同，并且其层对潜在空间上的某些群作用具有等变性。我们的结果假设模型的参数在适当意义下是可识别的。这种可识别性质已经在文献中为大类网络建立，我们的结果立即适用于这些网络，而对于其他网络则是推测性的。我们所发展的理论基于一个抽象形式主义，因此与架构无关。总的来说，我们的结果为训练过程中神经网络权重中等变结构的出现提供了数学解释，这是实践中一直观察到的现象。

更新时间: 2026-01-29 12:47:51

领域: cs.LG,math.CT,math.RT

下载: http://arxiv.org/abs/2601.21645v1

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

Recent advances in multimodal language models (MLLMs) have achieved remarkable progress in vision-language reasoning, especially with the emergence of "thinking with images," which integrates explicit visual steps into the reasoning process. While this paradigm strengthens image-based reasoning, a significant challenge remains: models may arrive at correct answers by relying on irrelevant or spurious regions, driven by prior knowledge or dataset biases. Even when the answer is correct, flawed reasoning indicates that the model has not truly understood the image, highlighting the critical importance of reasoning fidelity in multimodal tasks. To address this issue, we propose DeFacto, a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. A key component of our approach is the design of three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) random-masking. To enable these paradigms, we develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants, resulting in a dataset of about 100k images. Building on this framework, we train multimodal language models with GRPO-based reinforcement learning, where we design three complementary rewards to guide the model toward accurate answering and evidence-grounded reasoning. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning. The code is available on GitHub and the dataset is released on HuggingFace.

Updated: 2026-01-29 12:47:44

标题: DeFacto：利用图像进行反事实思考，以强化基于证据和忠实推理.

摘要: 最近在多模态语言模型（MLLMs）方面取得了显著进展，尤其是在视觉-语言推理方面，特别是随着“图像思维”的出现，将明确的视觉步骤整合到推理过程中。虽然这种范式加强了基于图像的推理，但仍然存在一个重大挑战：模型可能通过依赖不相关或虚假的区域来得出正确答案，受先验知识或数据集偏见驱动。即使答案是正确的，错误的推理表明模型并没有真正理解图像，突显了多模态任务中推理准确性的重要性。为了解决这个问题，我们提出了DeFacto，一个反事实推理框架，同时强化准确回答和忠实推理。我们方法的一个关键组成部分是设计三种互补的训练范式：（i）正向，（ii）反事实，和（iii）随机遮蔽。为了实现这些范式，我们开发了一个自动定位问题相关证据并构建正向、反事实和随机变体的流水线，从而产生约10万张图像的数据集。基于这一框架，我们使用基于GRPO的强化学习来训练多模态语言模型，我们设计了三种互补的奖励来引导模型朝向准确回答和基于证据的推理。在多种基准测试上的实验证明，DeFacto显著提高了答案准确性和推理忠实度，为可解释的多模态推理奠定了更牢固的基础。该代码可在GitHub上获得，数据集已在HuggingFace上发布。

更新时间: 2026-01-29 12:47:44

领域: cs.AI

下载: http://arxiv.org/abs/2509.20912v2

Three results on twisted $G-$codes and skew twisted $G-$codes

In this paper we solve an open question formulated in the original paper of twisted skew group codes regarding when a twisted skew group code is checkable. Also, we prove that all ideals of dimension 3 over a twisted group algebra are abelian group codes, generalising another previous result over group algebras. Finally, we prove a bound on the dimension and distance of a twisted group code, as well as when such bound is reached.

Updated: 2026-01-29 12:45:05

标题: 关于扭曲$G-$码和倾斜扭曲$G-$码的三个结果

摘要: 在这篇论文中，我们解决了关于扭曲斜群码何时可检验的一个问题，该问题最初在扭曲斜群码的原始论文中提出。此外，我们证明了在扭曲群代数上的所有维度为3的理想都是阿贝尔群码，这扩展了先前在群代数上的另一个结果。最后，我们证明了关于扭曲群码的维度和距离的界限，以及何时达到该界限。

更新时间: 2026-01-29 12:45:05

领域: math.AG,cs.CR,cs.IT

下载: http://arxiv.org/abs/2601.00752v3

Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers

Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers are a proven solution to scaling problems in natural language processing. However, existing MoE approaches for time-series forecasting rely on token-wise routing mechanisms, which may fail to exploit the natural locality and continuity of temporal data. In this work, we introduce Seg-MoE, a sparse MoE design that routes and processes contiguous time-step segments rather than making independent expert decisions. Token segments allow each expert to model intra-segment interactions directly, naturally aligning with inherent temporal patterns. We integrate Seg-MoE layers into a time-series Transformer and evaluate it on multiple multivariate long-term forecasting benchmarks. Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons, outperforming both dense Transformers and prior token-wise MoE models. Comprehensive ablation studies confirm that segment-level routing is the key factor driving these gains. Our results show that aligning the MoE routing granularity with the inherent structure of time series provides a powerful, yet previously underexplored, inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.

Updated: 2026-01-29 12:43:35

标题: Seg-MoE：用于时间序列预测变压器的多分辨率分段混合专家

摘要: 基于Transformer的模型最近在准确的时间序列预测方面取得了显著进展，但即使这些架构也难以高效地扩展，同时捕捉长期的时间动态。专家混合（MoE）层是自然语言处理中解决规模问题的已证明解决方案。然而，现有的用于时间序列预测的MoE方法依赖于基于标记的路由机制，这可能无法充分利用时间数据的自然局部性和连续性。在这项工作中，我们介绍了Seg-MoE，这是一种稀疏的MoE设计，它路由和处理连续的时间步段，而不是做出独立的专家决策。标记段允许每个专家直接对模拟段内交互进行建模，自然地与固有的时间模式相一致。我们将Seg-MoE层集成到时间序列Transformer中，并在多个多变量长期预测基准上进行评估。Seg-MoE在几乎所有预测时段上始终实现了最先进的预测准确性，优于密集的Transformers和之前基于标记的MoE模型。全面的消融研究证实，段级别的路由是推动这些收益的关键因素。我们的结果表明，将MoE路由粒度与时间序列的固有结构相一致提供了一个强大但以前未经探索的归纳偏差，为顺序数据建模中的有条件稀疏架构开辟了新的途径。

更新时间: 2026-01-29 12:43:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21641v1

Tight Lower Bounds and Improved Convergence in Performative Prediction

Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.

Updated: 2026-01-29 12:43:13

标题: 紧密下界和改进的收敛性在执行预测中

摘要: 执行性预测是一个框架，解释了模型预测在现实世界中部署后引起的数据分布转变。确保快速收敛到一个稳定解决方案，在模型部署后数据分布保持不变是至关重要的，特别是在不断演化的环境中。本文通过利用来自先前重新训练快照的历史数据集，扩展了重复风险最小化（RRM）框架，得到了一类我们称之为仿射风险最小化算法，并使其能够收敛到更广泛问题的执行性稳定点。我们引入了一种新的上界方法，该方法仅使用数据集的最终迭代，并首次证明了这个新上界和之前现有上界在相同范围内的紧密性。我们还证明了利用历史数据集可以超越最后迭代RRM的下界，并在各种执行性预测基准上实证观察到更快地收敛到稳定点。同时，我们在仿射风险最小化类中首次提供了RRM的下界分析，量化了我们框架中其他变体可能实现的收敛速度改进潜力。

更新时间: 2026-01-29 12:43:13

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2412.03671v3

Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation

We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Recent work by Choquette-Choo et al. (2025) proposes a sampling-based Monte Carlo approach to compute amplification parameters in this setting. However, their guarantees either only hold with some high probability or require random abstention by the mechanism. Furthermore, the required number of samples for ensuring $(ε,δ)$-DP is inversely proportional to $δ$. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former is facilitated by a dynamic programming formulation to efficiently compute the bounds. The latter complements it by offering stronger privacy guarantees for small $ε$, where Rényi divergence bounds inherently lead to an over-approximation. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the efficacy of our approach across a broad range of matrix mechanisms used in research and practice.

Updated: 2026-01-29 12:40:29

标题: 随机分配下矩阵机制的无采样隐私核算

摘要: 我们研究了在随机分配（也称为球在桶模型）下使用矩阵分解进行差分隐私模型训练的隐私放大问题。Choquette-Choo等人最近的工作（2025年）提出了一种基于抽样的蒙特卡罗方法来计算这种情况下的放大参数。然而，他们的保证要么只在很高的概率下成立，要么需要机制进行随机弃权。此外，确保$(ε,δ)$-DP所需的样本数与$δ$成反比。相比之下，我们基于Rényi散度和条件组合开发了无需抽样的界限。前者通过动态规划公式有效计算界限。后者通过为小$ε$提供更强的隐私保证来补充它，其中Rényi散度界限本质上导致了一个过估计。我们的框架适用于任意带和非带矩阵。通过数值比较，我们展示了我们的方法在研究和实践中使用的各种矩阵机制中的有效性。

更新时间: 2026-01-29 12:40:29

领域: cs.LG,cs.CR,stat.ML

下载: http://arxiv.org/abs/2601.21636v1

Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics

Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.

Updated: 2026-01-29 12:30:57

标题: 可解释的思维链推理：对状态感知推理动态的经验分析

摘要: 最近在思维链（CoT）提示方面取得了进展，使得大型语言模型（LLMs）能够执行多步推理。然而，这种推理的可解释性仍然有限，先前的工作主要集中在本地标记级别的归因上，因此推理步骤的高级语义角色及其转换仍未被充分探讨。在本文中，我们引入了一种状态感知转换框架，将CoT轨迹抽象成结构化的潜在动态。具体地，为了捕捉CoT推理的演变语义，每个推理步骤通过标记级别嵌入的谱分析表示，并聚类成语义上连贯的潜在状态。为了表征推理的全局结构，我们将它们的进展建模为马尔可夫链，从而提供了推理过程的结构化和可解释的视角。这种抽象支持一系列分析，包括语义角色识别、时间模式可视化和一致性评估。

更新时间: 2026-01-29 12:30:57

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.00190v2

Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise

Diffusion models have achieved remarkable progress in image generation, but their increasing deployment raises serious concerns about privacy. In particular, fine-tuned models are highly vulnerable, as they are often fine-tuned on small and private datasets. Membership inference attacks (MIAs) are used to assess privacy risks by determining whether a specific sample was part of a model's training data. Existing MIAs against diffusion models either assume obtaining the intermediate results or require auxiliary datasets for training the shadow model. In this work, we utilized a critical yet overlooked vulnerability: the widely used noise schedules fail to fully eliminate semantic information in the images, resulting in residual semantic signals even at the maximum noise step. We empirically demonstrate that the fine-tuned diffusion model captures hidden correlations between the residual semantics in initial noise and the original images. Building on this insight, we propose a simple yet effective membership inference attack, which injects semantic information into the initial noise and infers membership by analyzing the model's generation result. Extensive experiments demonstrate that the semantic initial noise can strongly reveal membership information, highlighting the vulnerability of diffusion models to MIAs.

Updated: 2026-01-29 12:29:01

标题: 噪声作为一种探针：利用初始噪声对扩散模型进行成员推断攻击

摘要: 扩散模型在图像生成方面取得了显著进展，但它们越来越广泛的应用引发了严重的隐私问题。特别是，经过微调的模型往往容易受到攻击，因为它们经常在小型和私人数据集上进行微调。成员推理攻击（MIAs）被用来评估隐私风险，通过确定特定样本是否是模型的训练数据的一部分。现有的针对扩散模型的MIAs要么假设获取中间结果，要么需要辅助数据集来训练影子模型。在这项工作中，我们利用了一个关键但被忽视的漏洞：广泛使用的噪声时间表未能完全消除图像中的语义信息，导致即使在最大噪声步骤中仍存在残留的语义信号。我们经验性地证明，经过微调的扩散模型捕获了初始噪声和原始图像之间的残留语义之间的隐藏相关性。基于这一认识，我们提出了一种简单而有效的成员推理攻击，通过将语义信息注入初始噪声，并通过分析模型的生成结果来推断成员身份。大量实验表明，语义初始噪声可以强烈揭示成员信息，凸显了扩散模型对MIAs的脆弱性。

更新时间: 2026-01-29 12:29:01

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2601.21628v1

Bridging Synthetic and Real Routing Problems via LLM-Guided Instance Generation and Progressive Adaptation

Recent advances in Neural Combinatorial Optimization (NCO) methods have significantly improved the capability of neural solvers to handle synthetic routing instances. Nonetheless, existing neural solvers typically struggle to generalize effectively from synthetic, uniformly-distributed training data to real-world VRP scenarios, including widely recognized benchmark instances from TSPLib and CVRPLib. To bridge this generalization gap, we present Evolutionary Realistic Instance Synthesis (EvoReal), which leverages an evolutionary module guided by large language models (LLMs) to generate synthetic instances characterized by diverse and realistic structural patterns. Specifically, the evolutionary module produces synthetic instances whose structural attributes statistically mimics those observed in authentic real-world instances. Subsequently, pre-trained NCO models are progressively refined, firstly aligning them with these structurally enriched synthetic distributions and then further adapting them through direct fine-tuning on actual benchmark instances. Extensive experimental evaluations demonstrate that EvoReal markedly improves the generalization capabilities of state-of-the-art neural solvers, yielding a notable reduced performance gap compared to the optimal solutions on the TSPLib (1.05%) and CVRPLib (2.71%) benchmarks across a broad spectrum of problem scales.

Updated: 2026-01-29 12:28:40

标题: 通过LLM引导的实例生成和渐进适应，弥合合成和真实路由问题

摘要: 最近神经组合优化（NCO）方法的进展显著提高了神经求解器处理合成路径实例的能力。然而，现有的神经求解器通常难以有效地从合成的均匀分布训练数据泛化到真实世界的VRP场景，包括来自TSPLib和CVRPLib的广泛认可的基准实例。为了弥合这种泛化差距，我们提出了进化实际实例合成（EvoReal），它利用由大型语言模型（LLMs）引导的进化模块生成具有多样化和真实结构模式的合成实例。具体而言，进化模块生成的合成实例的结构属性统计上模仿了真实世界实例中观察到的属性。随后，预训练的NCO模型逐步完善，首先使其与这些结构丰富的合成分布对齐，然后通过直接在实际基准实例上微调进一步调整。广泛的实验评估表明，EvoReal显著改善了最先进的神经求解器的泛化能力，使其在TSPLib（1.05%）和CVRPLib（2.71%）基准上相对于最佳解的性能差距明显缩小，涵盖了广泛的问题规模范围。

更新时间: 2026-01-29 12:28:40

领域: cs.AI,cs.LG,cs.NE

下载: http://arxiv.org/abs/2511.10233v2

No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for \textit{prosody}, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an \textit{iterative Direct Preference Optimization (DPO)} scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On \textbf{KoCC-TTS}, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, \textit{human preference optimization} offers a practical and data-efficient path to natural and robust TTS. The demo page is available at \href{https://tts.ch.dev}

Updated: 2026-01-29 12:27:24

标题: 无法验证的韵律奖励：走向TTS中基于偏好引导的韵律学习

摘要: 最近的工作报告了使用Group Relative Policy Optimization（GRPO）在神经文本到语音（TTS）中取得的进展。然而，在缺乏可验证的奖励\textit{韵律}的情况下，基于转录导向信号（CER/NLL）训练的GRPO降低了错误率，但将韵律崩溃成单调、不自然的语音；添加说话者相似性进一步不稳定训练并降低CER。我们通过一种\textit{迭代直接偏好优化（DPO）}方案来解决这个问题，每轮只使用几百个人工标记的偏好对直接优化韵律自然度，同时对当前模型进行正则化。在\textbf{KoCC-TTS}上，这是一个捕捉任务导向对话的真实韩语呼叫中心交互数据集，我们的方法获得了最高的人类偏好（ELO），具有竞争力的CER，优于GRPO和强大的商业基线。这些结果表明，当无法自动奖励韵律时，\textit{人类偏好优化}提供了一条实用和数据高效的路径到自然和稳健的TTS。演示页面可在\href{https://tts.ch.dev}上找到。

更新时间: 2026-01-29 12:27:24

领域: eess.AS,cs.AI,cs.CL,cs.SD

下载: http://arxiv.org/abs/2509.18531v2

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

Post Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue and reducing its max eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo Q consistently outperforms state of the art methods including GPTQ, AWQ, and SpinQuant not only achieving superior performance under standard W4A8 settings, but also excelling in the highly challenging W3A16 ultra low bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15\% and effectively avoids the logical collapse commonly seen in aggressive quantization.

Updated: 2026-01-29 12:27:05

标题: HeRo-Q：一种通过Hessian调节实现稳定低比特量化的通用框架

摘要: Post Training Quantization（PTQ）是一种主流模型压缩技术，通常会导致悖论的“低误差，高损失”现象，因为它仅专注于最小化量化误差。根本原因在于LLM损失景观的Hessian矩阵：少数高曲率方向对扰动非常敏感。为了解决这个问题，我们提出了Hessian Robust Quantization（HeRo Q）算法，该算法在量化之前在权重空间应用了一个轻量级的可学习的旋转-压缩矩阵。这个联合框架通过减少最大Hessian特征值和最大特征值来重新塑造损失景观，从而显著增强对量化噪声的稳健性。HeRo-Q无需架构修改，计算开销微乎其微，并且可以无缝集成到现有的PTQ流程中。对Llama和Qwen模型的实验表明，HeRo Q始终优于当前最先进的方法，包括GPTQ、AWQ和SpinQuant，在标准W4A8设置下不仅表现出卓越性能，而且在高度具有挑战性的W3A16超低比特区域中表现出色，其中它将Llama3 8B的GSM8K准确率提高到70.15％，并有效地避免了激进量化中常见的逻辑崩溃。

更新时间: 2026-01-29 12:27:05

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21626v1

Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking

Parallel thinking enhances LLM reasoning by multi-path sampling and aggregation. In system-level evaluations, a global parallelism level N is allocated to all samples, typically set large to maximize overall dataset accuracy. However, due to sample heterogeneity, some samples can achieve comparable performance with a smaller N'< N, causing budget redundancy. This incompatibility between system-level efficacy and sample-level efficiency constitutes the overscaling curse. In this paper, we formalize and quantify the overscaling curse, showing its universality and severity in practice, and analyze its trigger mechanism. We then propose a lightweight method, T2, to break the overscaling curse, which utilizes latent representations to estimate the optimal parallelism level for each sample before decoding. Experiments show that T2 significantly reduces cost while maintaining comparable performance, enabling more efficient parallel thinking.

Updated: 2026-01-29 12:22:45

标题: 摆脱过度扩展的诅咒：在并行思考之前考虑并行性

摘要: 并行思维通过多路径抽样和聚合增强了LLM推理。在系统级评估中，全局并行性水平N被分配给所有样本，通常设置大以最大化整体数据集的准确性。然而，由于样本的异质性，一些样本可以通过较小的N'< N实现可比性能，导致预算冗余。系统级效能与样本级效率之间的不兼容性构成了过度缩放的诅咒。在本文中，我们正式化和量化了过度缩放的诅咒，展示了其在实践中的普遍性和严重性，并分析了其触发机制。然后，我们提出了一个轻量级方法T2，用于打破过度缩放的诅咒，该方法利用潜在表示来估计每个样本在解码之前的最佳并行性水平。实验证明，T2显著降低成本同时保持可比性能，实现更高效的并行思维。

更新时间: 2026-01-29 12:22:45

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21619v1

Semantic Content Determines Algorithmic Performance

Counting should not depend on what is being counted; more generally, any algorithm's behavior should be invariant to the semantic content of its arguments. We introduce WhatCounts to test this property in isolation. Unlike prior work that conflates semantic sensitivity with reasoning complexity or prompt variation, WhatCounts is atomic: count items in an unambiguous, delimited list with no duplicates, distractors, or reasoning steps for different semantic types. Frontier LLMs show over 40% accuracy variation depending solely on what is being counted - cities versus chemicals, names versus symbols. Controlled ablations rule out confounds. The gap is semantic, and it shifts unpredictably with small amounts of unrelated fine-tuning. LLMs do not implement algorithms; they approximate them, and the approximation is argument-dependent. As we show with an agentic example, this has implications beyond counting: any LLM function may carry hidden dependencies on the meaning of its inputs.

Updated: 2026-01-29 12:22:17

标题: 语义内容决定算法性能

摘要: 计数不应取决于所计数的内容; 更普遍地说，任何算法的行为应该对其参数的语义内容保持不变。我们引入WhatCounts来测试这种属性的独立性。与之前将语义敏感性与推理复杂性或提示变化混为一谈的工作不同，WhatCounts是原子性的：在一个明确的、有界的列表中计算项目，没有重复项、干扰因素，也没有不同语义类型的推理步骤。前沿LLMs显示，仅仅根据所计数的内容 - 城市与化学品，姓名与符号，准确率变化超过40%。受控消融排除了混淆因素。这种差距是语义的，而且它会随着少量不相关的微调而不可预测地发生变化。LLMs并不实现算法；它们近似算法，并且这种近似是依赖于参数的。正如我们在一个主观例子中所展示的那样，这对计数以外的任何LLM函数都可能携带对其输入含义的隐藏依赖性。

更新时间: 2026-01-29 12:22:17

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21618v1

TS-PEFT: Unveiling Token-Level Redundancy in Parameter-Efficient Fine-Tuning

Current Parameter-Efficient Fine-Tuning (PEFT) methods typically operate under an implicit assumption: Once a target module is selected, every token passing through it contributes equally to the downstream task and requires a parameter update. In this paper, we challenge this convention by revealing a pervasive token-level redundancy in the fine-tuning of large models (LMs). We propose TS-PEFT, a theoretical framework utilizing proximal optimization that acts as a dynamic probe to identify token-level redundancy during the fine-tuning process. Extensive experiments demonstrate that indiscriminately updating all tokens is not only computationally superfluous but often introduces optimization noise. Surprisingly, by discarding 30%-70% of token updates, TS-PEFT consistently matches or exceeds the performance of dense baselines such as LoRA, DoRA. Our in-depth analysis shows that the learned token-level sparsity is a superior indicator of module importance compared to traditional weight criteria, providing a novel data-driven perspective on the intrinsic adaptation mechanism of LMs.

Updated: 2026-01-29 12:19:42

标题: TS-PEFT：揭示参数高效微调中的令牌级冗余

摘要: 目前，参数效率的微调（PEFT）方法通常在一种隐含的假设下运行：一旦选择了目标模块，通过它的每个标记对下游任务的贡献都是相等的，并且需要参数更新。本文挑战了这种传统观念，揭示了在大型模型（LMs）微调过程中存在普遍的标记级冗余。我们提出了TS-PEFT，这是一个利用近端优化的理论框架，作为一个动态探针来识别微调过程中的标记级冗余。大量实验证明，不加选择地更新所有标记不仅计算上多余，而且通常会引入优化噪声。令人惊讶的是，通过丢弃30%-70%的标记更新，TS-PEFT始终能够匹配或超越如LoRA、DoRA等密集基线的性能。我们的深入分析表明，学习到的标记级稀疏性比传统的权重标准更好地指示模块的重要性，为提供了一种新颖的基于数据的视角，揭示了LMs的内在适应机制。

更新时间: 2026-01-29 12:19:42

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2511.16147v3

Beyond Parameter Finetuning: Test-Time Representation Refinement for Node Classification

Graph Neural Networks frequently exhibit significant performance degradation in the out-of-distribution test scenario. While test-time training (TTT) offers a promising solution, existing Parameter Finetuning (PaFT) paradigm suffer from catastrophic forgetting, hindering their real-world applicability. We propose TTReFT, a novel Test-Time Representation FineTuning framework that transitions the adaptation target from model parameters to latent representations. Specifically, TTReFT achieves this through three key innovations: (1) uncertainty-guided node selection for specific interventions, (2) low-rank representation interventions that preserve pre-trained knowledge, and (3) an intervention-aware masked autoencoder that dynamically adjust masking strategy to accommodate the node selection scheme. Theoretically, we establish guarantees for TTReFT in OOD settings. Empirically, extensive experiments across five benchmark datasets demonstrate that TTReFT achieves consistent and superior performance. Our work establishes representation finetuning as a new paradigm for graph TTT, offering both theoretical grounding and immediate practical utility for real-world deployment.

Updated: 2026-01-29 12:17:34

标题: 超越参数微调：节点分类的测试时表示细化

摘要: 图神经网络在分布外测试场景中经常表现出显著的性能下降。虽然测试时训练（TTT）提供了一个有前途的解决方案，但现有的参数微调（PaFT）范式遭受灾难性遗忘，阻碍了它们在现实世界中的适用性。我们提出了TTReFT，一种新颖的测试时表示微调框架，将适应目标从模型参数转变为潜在表示。具体而言，TTReFT通过三个关键创新实现了这一点：（1）基于不确定性的节点选择进行特定干预，（2）保留预训练知识的低秩表示干预，（3）一个干预感知的掩码自动编码器，动态调整掩码策略以适应节点选择方案。在理论上，我们在OOD设置中为TTReFT建立了保证。在经验上，对五个基准数据集进行的大量实验表明，TTReFT实现了一致且优越的性能。我们的工作将表示微调确立为图TTT的新范式，为现实世界的部署提供了理论基础和即时实用性。

更新时间: 2026-01-29 12:17:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21615v1

iPEAR: Iterative Pyramid Estimation with Attention and Residuals for Deformable Medical Image Registration

Existing pyramid registration networks may accumulate anatomical misalignments and lack an effective mechanism to dynamically determine the number of optimization iterations under varying deformation requirements across images, leading to degraded performance. To solve these limitations, we propose iPEAR. Specifically, iPEAR adopts our proposed Fused Attention-Residual Module (FARM) for decoding, which comprises an attention pathway and a residual pathway to alleviate the accumulation of anatomical misalignment. We further propose a dual-stage Threshold-Controlled Iterative (TCI) strategy that adaptively determines the number of optimization iterations for varying images by evaluating registration stability and convergence. Extensive experiments on three public brain MRI datasets and one public abdomen CT dataset show that iPEAR outperforms state-of-the-art (SOTA) registration networks in terms of accuracy, while achieving on-par inference speed and model parameter size. Generalization and ablation studies further validate the effectiveness of the proposed FARM and TCI.

Updated: 2026-01-29 12:17:14

标题: iPEAR：具有注意力和残差的迭代金字塔估计用于可变形医学图像配准

摘要: 现有的金字塔注册网络可能会积累解剖错位，并且缺乏一种有效的机制来动态确定在不同图像之间变化的变形要求下优化迭代次数的数量，从而导致性能下降。为了解决这些限制，我们提出了iPEAR。具体而言，iPEAR采用了我们提出的融合注意力残差模块（FARM）进行解码，其中包括一个注意力路径和一个残差路径，以减轻解剖错位的积累。我们进一步提出了一个双阶段阈值控制迭代（TCI）策略，通过评估注册稳定性和收敛性，自适应地确定不同图像的优化迭代次数。在三个公共脑MRI数据集和一个公共腹部CT数据集上的大量实验表明，iPEAR在准确性方面优于最先进（SOTA）的注册网络，同时实现了相同的推断速度和模型参数大小。进一步的泛化和消融研究进一步验证了所提出的FARM和TCI的有效性。

更新时间: 2026-01-29 12:17:14

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2510.07666v2

Representation-Regularized Convolutional Audio Transformer for Audio Understanding

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Codes and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.

Updated: 2026-01-29 12:16:19

标题: 音频理解的表示正则化卷积音频变压器

摘要: 基于自助学习（SSL）的自举式学习在音频理解方面取得了显著进展。然而，现有方法通常在单一粒度级别上运行，限制了它们对复杂音频信号中固有的多样化时间和频谱结构建模的能力。此外，从头开始引导表示是计算昂贵的，通常需要大量训练才能收敛。在这项工作中，我们提出了卷积音频变压器（CAT），这是一个统一的框架，旨在解决这些挑战。首先，为了捕捉分层音频特征，CAT包含一个多分辨率块，它跨不同粒度聚合信息。其次，为了增强训练效率，我们引入了一个表示正则化目标。借鉴生成建模的灵感，这个辅助任务通过将学生模型的预测与来自冻结的、预先训练的外部编码器的高质量语义表示对齐来指导模型。实验结果表明，CAT在音频理解基准测试中明显优于基线。值得注意的是，它在AudioSet 20k数据集上实现了竞争性能，收敛速度比现有方法快5倍。代码和检查点将很快发布在https://github.com/realzhouchushu/CAT。

更新时间: 2026-01-29 12:16:19

领域: eess.AS,cs.AI,cs.SD

下载: http://arxiv.org/abs/2601.21612v1

Bridging Performance Gaps for ECG Foundation Models: A Post-Training Strategy

ECG foundation models are increasingly popular due to their adaptability across various tasks. However, their clinical applicability is often limited by performance gaps compared to task-specific models, even after pre-training on large ECG datasets and fine-tuning on target data. This limitation is likely due to the lack of an effective post-training strategy. In this paper, we propose a simple yet effective post-training approach to enhance ECG foundation models. We evaluate it on a publicly available Transformer-based foundation model. Experiments across multiple ECG tasks show that our method consistently outperforms baseline fine-tuning. On the PTB-XL benchmarks, it improves macro AUROC by 0.7%-8.9% and macro AUPRC by 23.3%-77.9%, also outperforming several recent state-of-the-art approaches, including task-specific and advanced architectures. Further analyses demonstrate improved training dynamics and data efficiency, with only 30% of the training data outperforming the baseline trained on the full dataset. Ablation studies highlight the importance of stochastic depth and preview linear probing. These findings underscore the potential of post-training strategies to improve ECG foundation models, and we hope this work will contribute to the continued development of foundation models in the ECG domain.

Updated: 2026-01-29 12:14:52

标题: 填补基础ECG模型性能差距：一种后训练策略

摘要: 心电图基础模型由于其在各种任务中的适应性而越来越受欢迎。然而，与特定任务模型相比，即使在大型心电图数据集上进行预训练并在目标数据上进行微调后，它们的临床适用性通常受到性能差距的限制。这种限制可能是由于缺乏有效的后训练策略。在本文中，我们提出了一种简单而有效的后训练方法，以增强心电图基础模型。我们在一个公开可用的基于Transformer的基础模型上进行评估。跨多个心电图任务的实验证明，我们的方法始终优于基线微调。在PTB-XL基准测试中，它将宏AUROC提高了0.7%-8.9%，将宏AUPRC提高了23.3%-77.9%，也优于几种最近的最先进方法，包括特定任务和高级架构。进一步的分析表明，改进了训练动态和数据效率，仅使用30%的训练数据就能胜过在完整数据集上训练的基线模型。消融研究突显了随机深度和预览线性探测的重要性。这些发现强调了后训练策略改进心电图基础模型的潜力，我们希望这项工作将有助于继续发展心电图领域的基础模型。

更新时间: 2026-01-29 12:14:52

领域: cs.LG,cs.AI,stat.AP

下载: http://arxiv.org/abs/2509.12991v2

Thinking Broad, Acting Fast: Latent Reasoning Distillation from Multi-Perspective Chain-of-Thought for E-Commerce Relevance

Effective relevance modeling is crucial for e-commerce search, as it aligns search results with user intent and enhances customer experience. Recent work has leveraged large language models (LLMs) to address the limitations of traditional relevance models, especially for long-tail and ambiguous queries. By incorporating Chain-of-Thought (CoT) reasoning, these approaches improve both accuracy and interpretability through multi-step reasoning. However, two key limitations remain: (1) most existing approaches rely on single-perspective CoT reasoning, which fails to capture the multifaceted nature of e-commerce relevance (e.g., user intent vs. attribute-level matching vs. business-specific rules); and (2) although CoT-enhanced LLM's offer rich reasoning capabilities, their high inference latency necessitates knowledge distillation for real-time deployment, yet current distillation methods discard the CoT rationale structure at inference, using it as a transient auxiliary signal and forfeiting its reasoning utility. To address these challenges, we propose a novel framework that better exploits CoT semantics throughout the optimization pipeline. Specifically, the teacher model leverages Multi-Perspective CoT (MPCoT) to generate diverse rationales and combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO) to construct a more robust reasoner. For distillation, we introduce Latent Reasoning Knowledge Distillation (LRKD), which endows a student model with a lightweight inference-time latent reasoning extractor, allowing efficient and low-latency internalization of the LLM's sophisticated reasoning capabilities. Evaluated in offline experiments and online A/B tests on an e-commerce search advertising platform serving tens of millions of users daily, our method delivers significant offline gains, showing clear benefits in both commercial performance and user experience.

Updated: 2026-01-29 12:14:44

标题: 思维广泛，行动迅速：从多视角思维链提炼潜在推理，用于电子商务相关性

摘要: 有效的相关性建模对于电子商务搜索至关重要，因为它可以将搜索结果与用户意图对齐，并增强客户体验。最近的研究利用大型语言模型(LLMs)来解决传统相关性模型的局限性，特别是针对长尾和模糊查询。通过引入“Chain-of-Thought”(CoT)推理，这些方法通过多步推理提高了准确性和可解释性。然而，仍然存在两个关键限制：(1)大多数现有方法依赖于单一视角的CoT推理，无法捕捉电子商务相关性的多面性(例如，用户意图与属性级匹配与商业规则)；(2)虽然CoT增强的LLM具有丰富的推理能力，但其高推理延迟需要知识蒸馏以进行实时部署，然而当前的蒸馏方法在推理时丢弃了CoT推理结构，将其作为瞬时辅助信号并放弃了其推理效用。为了解决这些挑战，我们提出了一个更好地利用CoT语义的新框架，贯穿整个优化流程。具体来说，教师模型利用多视角CoT(MPCoT)生成多样的解释，并将监督微调(SFT)与直接偏好优化(DPO)相结合，构建更强大的推理器。对于蒸馏，我们引入了潜在推理知识蒸馏(LRKD)，为学生模型赋予了轻量级推理时间潜在推理提取器，实现了LLM复杂推理能力的高效和低延迟内部化。通过离线实验和在线A/B测试，在一家每天为数千万用户提供服务的电子商务搜索广告平台上评估，我们的方法提供了显著的离线收益，在商业绩效和用户体验方面都表现出明显的优势。

更新时间: 2026-01-29 12:14:44

领域: cs.IR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21611v1

RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems

Agentic recommender systems leverage Large Language Models (LLMs) to model complex user behaviors and support personalized decision-making. However, existing methods primarily model preference changes based on explicit user-item interactions, which are sparse, noisy, and unable to reflect the real-time, mutual influences among users and items. To address these limitations, we propose RecNet, a self-evolving preference propagation framework that proactively propagates real-time preference updates across related users and items. RecNet consists of two complementary phases. In the forward phase, the centralized preference routing mechanism leverages router agents to integrate preference updates and dynamically propagate them to the most relevant agents. To ensure accurate and personalized integration of propagated preferences, we further introduce a personalized preference reception mechanism, which combines a message buffer for temporary caching and an optimizable, rule-based filter memory to guide selective preference assimilation based on past experience and interests. In the backward phase, the feedback-driven propagation optimization mechanism simulates a multi-agent reinforcement learning framework, using LLMs for credit assignment, gradient analysis, and module-level optimization, enabling continuous self-evolution of propagation strategies. Extensive experiments on various scenarios demonstrate the effectiveness of RecNet in modeling preference propagation for recommender systems.

Updated: 2026-01-29 12:14:31

标题: RecNet: 自我演变的偏好传播用于有代理推荐系统

摘要: Agentic推荐系统利用大型语言模型（LLMs）来建模复杂的用户行为，并支持个性化决策。然而，现有方法主要根据显式用户-项目交互来建模偏好变化，这些交互稀疏、嘈杂，无法反映用户和项目之间的实时、相互影响。为了解决这些限制，我们提出了RecNet，这是一个自我演化的偏好传播框架，可以主动地在相关用户和项目之间传播实时的偏好更新。RecNet由两个互补的阶段组成。在前向阶段中，中心化的偏好路由机制利用路由器代理集成偏好更新，并动态地将其传播给最相关的代理。为了确保传播偏好的准确性和个性化集成，我们进一步引入了个性化偏好接收机制，它结合了用于临时缓存的消息缓冲区和可优化的基于规则的过滤器内存，以根据过去的经验和兴趣指导选择性偏好同化。在反向阶段中，反馈驱动的传播优化机制模拟了多代理强化学习框架，使用LLMs进行信用分配、梯度分析和模块级优化，实现传播策略的持续自我演化。对各种场景进行的广泛实验表明，RecNet在建模推荐系统的偏好传播方面具有有效性。

更新时间: 2026-01-29 12:14:31

领域: cs.AI

下载: http://arxiv.org/abs/2601.21609v1

Search-Based Risk Feature Discovery in Document Structure Spaces under a Constrained Budget

Enterprise-grade Intelligent Document Processing (IDP) systems support high-stakes workflows across finance, insurance, and healthcare. Early-phase system validation under limited budgets mandates uncovering diverse failure mechanisms, rather than identifying a single worst-case document. We formalize this challenge as a Search-Based Software Testing (SBST) problem, aiming to identify complex interactions between document variables, with the objective to maximize the number of distinct failure types discovered within a fixed evaluation budget. Our methodology operates on a combinatorial space of document configurations, rendering instances of structural \emph{risk features} to induce realistic failure conditions. We benchmark a diverse portfolio of search strategies spanning evolutionary, swarm-based, quality-diversity, learning-based, and quantum under identical budget constraints. Through configuration-level exclusivity, win-rate, and cross-temporal overlap analyses, we show that different solvers consistently uncover failure modes that remain undiscovered by specific alternatives at comparable budgets. Crucially, cross-temporal analysis reveals persistent solver-specific discoveries across all evaluated budgets, with no single strategy exhibiting absolute dominance. While the union of all solvers eventually recovers the observed failure space, reliance on any individual method systematically delays the discovery of important risks. These results demonstrate intrinsic solver complementarity and motivate portfolio-based SBST strategies for robust industrial IDP validation.

Updated: 2026-01-29 12:14:18

标题: 在受限预算下基于搜索的文档结构空间风险特征发现

摘要: 企业级智能文档处理（IDP）系统支持跨金融、保险和医疗保健领域的高风险工作流程。在有限预算下进行早期系统验证要求揭示各种故障机制，而不是识别单个最严重的文档。我们将这一挑战形式化为基于搜索的软件测试（SBST）问题，旨在识别文档变量之间的复杂交互作用，目标是在固定的评估预算内最大化发现不同故障类型的数量。我们的方法在文档配置的组合空间上运作，生成结构\emph{风险特征}的实例，诱导出真实的故障条件。我们在相同的预算约束下对涵盖进化、群体、质量多样性、基于学习的和量子的各种搜索策略进行基准测试。通过配置级别的排他性、胜率和跨时间重叠分析，我们展示了不同解算器持续发现特定替代方法在可比预算下未发现的故障模式。至关重要的是，跨时间分析显示了在所有评估的预算下持续存在解决器特定的发现，没有单一策略表现出绝对优势。虽然所有解算器的联合最终恢复了观察到的故障空间，但依赖任何个体方法都会系统地延迟重要风险的发现。这些结果展示了固有的解算器互补性，并激励基于组合的SBST策略用于强大的工业IDP验证。

更新时间: 2026-01-29 12:14:18

领域: cs.AI

下载: http://arxiv.org/abs/2601.21608v1

Superpositional Gradient Descent: Harnessing Quantum Principles for Model Training

Large language models (LLMs) are increasingly trained with classical optimization techniques like AdamW to improve convergence and generalization. However, the mechanisms by which quantum-inspired methods enhance classical training remain underexplored. We introduce Superpositional Gradient Descent (SGD), a novel optimizer linking gradient updates with quantum superposition by injecting quantum circuit perturbations. We present a mathematical framework and implement hybrid quantum-classical circuits in PyTorch and Qiskit. On synthetic sequence classification and large-scale LLM fine-tuning, SGD converges faster and yields lower final loss than AdamW. Despite promising results, scalability and hardware constraints limit adoption. Overall, this work provides new insights into the intersection of quantum computing and deep learning, suggesting practical pathways for leveraging quantum principles to control and enhance model behavior.

Updated: 2026-01-29 12:14:10

标题: 超位置梯度下降：利用量子原理进行模型训练

摘要: 大型语言模型（LLMs）越来越多地使用类似AdamW的经典优化技术进行训练，以改善收敛性和泛化能力。然而，量子启发方法如何增强经典训练的机制仍未得到充分探索。我们引入了超位置梯度下降（SGD），这是一种新型优化器，通过注入量子电路扰动将梯度更新与量子叠加相联系。我们提出了一个数学框架，并在PyTorch和Qiskit中实现了混合量子-经典电路。在合成序列分类和大规模LLM微调中，SGD收敛速度更快，最终损失更低，比AdamW效果更好。尽管结果令人鼓舞，但可扩展性和硬件限制限制了采用。总的来说，这项工作为量子计算和深度学习的交叉点提供了新的见解，提出了利用量子原理控制和增强模型行为的实际途径。

更新时间: 2026-01-29 12:14:10

领域: cs.LG,quant-ph

下载: http://arxiv.org/abs/2511.01918v2

Age Matters: Analyzing Age-Related Discussions in App Reviews

In recent years, mobile applications have become indispensable tools for managing various aspects of life. From enhancing productivity to providing personalized entertainment, mobile apps have revolutionized people's daily routines. Despite this rapid growth and popularity, gaps remain in how these apps address the needs of users from different age groups. Users of varying ages face distinct challenges when interacting with mobile apps, from younger users dealing with inappropriate content to older users having difficulty with usability due to age-related vision and cognition impairments. Although there have been initiatives to create age-inclusive apps, a limited understanding of user perspectives on age-related issues may hinder developers from recognizing specific challenges and implementing effective solutions. In this study, we explore age discussions in app reviews to gain insights into how mobile apps should cater to users across different age groups.We manually curated a dataset of 4,163 app reviews from the Google Play Store and identified 1,429 age-related reviews and 2,734 non-age-related reviews. We employed eight machine learning, deep learning, and large language models to automatically detect age discussions, with RoBERTa performing the best, achieving a precision of 92.46%. Additionally, a qualitative analysis of the 1,429 age-related reviews uncovers six dominant themes reflecting user concerns.

Updated: 2026-01-29 12:11:58

标题: 年龄重要性：分析应用评论中与年龄相关的讨论

摘要: 近年来，移动应用程序已成为管理生活各个方面的不可或缺的工具。从提高工作效率到提供个性化娱乐，移动应用程序已经彻底改变了人们的日常生活方式。尽管这种快速增长和流行，但这些应用程序如何满足不同年龄群体用户的需求仍存在差距。不同年龄的用户在与移动应用程序互动时面临着不同的挑战，从年轻用户处理不当内容到老年用户因年龄相关的视力和认知障碍而难以使用。尽管已经有倡议创建适合各年龄段用户的应用程序，但对与年龄相关问题的用户观点的了解有限可能会阻碍开发人员认识到具体的挑战并实施有效解决方案。在这项研究中，我们探讨了应用评论中关于年龄的讨论，以了解移动应用程序应如何满足不同年龄群体的用户需求。我们手动筛选了来自Google Play Store的4,163条应用评论数据集，并确定了1,429条与年龄相关的评论和2,734条与年龄无关的评论。我们利用八种机器学习、深度学习和大型语言模型来自动检测年龄讨论，RoBERTa表现最佳，精确度达到92.46%。此外，对1,429条与年龄相关的评论进行了定性分析，揭示了六个反映用户关注的主题。

更新时间: 2026-01-29 12:11:58

领域: cs.SE,cs.HC,cs.LG

下载: http://arxiv.org/abs/2601.21605v1

Kernel Alignment-based Multi-view Unsupervised Feature Selection with Sample-level Adaptive Graph Learning

Although multi-view unsupervised feature selection (MUFS) has demonstrated success in dimensionality reduction for unlabeled multi-view data, most existing methods reduce feature redundancy by focusing on linear correlations among features but often overlook complex nonlinear dependencies. This limits the effectiveness of feature selection. In addition, existing methods fuse similarity graphs from multiple views by employing sample-invariant weights to preserve local structure. However, this process fails to account for differences in local neighborhood clarity among samples within each view, thereby hindering accurate characterization of the intrinsic local structure of the data. In this paper, we propose a Kernel Alignment-based multi-view unsupervised FeatUre selection with Sample-level adaptive graph lEarning method (KAFUSE) to address these issues. Specifically, we first employ kernel alignment with an orthogonal constraint to reduce feature redundancy in both linear and nonlinear relationships. Then, a cross-view consistent similarity graph is learned by applying sample-level fusion to each slice of a tensor formed by stacking similarity graphs from different views, which automatically adjusts the view weights for each sample during fusion. These two steps are integrated into a unified model for feature selection, enabling mutual enhancement between them. Extensive experiments on real multi-view datasets demonstrate the superiority of KAFUSE over state-of-the-art methods.

Updated: 2026-01-29 12:11:32

标题: 基于核对齐的多视角无监督特征选择与样本级自适应图学习

摘要: 尽管多视图无监督特征选择（MUFS）在无标签的多视图数据降维方面取得了成功，但大多数现有方法通过专注于特征之间的线性相关性来减少特征冗余，但通常忽略了复杂的非线性依赖关系。这限制了特征选择的有效性。此外，现有方法通过使用样本不变权重来融合来自多个视图的相似性图，以保留局部结构。然而，这个过程未能考虑每个视图中样本之间的局部邻域清晰度差异，从而阻碍对数据固有局部结构的准确表征。本文提出了一种基于核对齐的多视图无监督特征选择与样本级自适应图学习方法（KAFUSE）来解决这些问题。具体而言，我们首先使用带有正交约束的核对齐来减少线性和非线性关系中的特征冗余。然后，通过对由不同视图的相似性图堆叠而成的张量的每个切片应用样本级融合来学习跨视图一致的相似性图，这在融合过程中自动调整了每个样本的视图权重。这两个步骤被整合到一个统一模型中用于特征选择，实现它们之间的相互增强。对真实多视图数据集的广泛实验表明，KAFUSE优于现有最先进的方法。

更新时间: 2026-01-29 12:11:32

领域: cs.LG

下载: http://arxiv.org/abs/2601.07288v2

Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We investigate the geometric limits of the Linear Propagation Assumption (LPA), the premise that local updates coherently propagate to logical consequences. To formalize this, we adopt relation algebra and study three core operations on relations: negation flips truth values, converse swaps argument order, and composition chains relations. For negation and converse, we prove that guaranteeing direction-agnostic first-order propagation necessitates a tensor factorization separating entity-pair context from relation content. However, for composition, we identify a fundamental obstruction. We show that composition reduces to conjunction, and prove that any conjunction well-defined on linear features must be bilinear. Since bilinearity is incompatible with negation, this forces the feature map to collapse. These results suggest that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the LPA.

Updated: 2026-01-29 12:08:00

标题: 动力学揭示结构：挑战线性传播假设

摘要: 神经网络通过一阶参数更新进行适应，然而目前尚不清楚这种更新是否保持逻辑连贯性。我们研究了线性传播假设（LPA）的几何限制，即局部更新是否一致地传播到逻辑结论。为了形式化这一点，我们采用关系代数，并研究了关系上的三个核心操作：否定翻转真值，对话交换参数顺序，和组合链关系。对于否定和对话，我们证明保证方向不可知的一阶传播需要一个张量分解，将实体对上下文与关系内容分离。然而，对于组合，我们确定了一个基本障碍。我们表明组合归结为合取，并证明任何在线性特征上定义良好的合取必须是双线性的。由于双线性与否定不兼容，这迫使特征映射崩溃。这些结果表明，知识编辑的失败、逆转诅咒和多跳推理可能源于LPA固有的常见结构限制。

更新时间: 2026-01-29 12:08:00

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21601v1

SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects the Gumbel noise into logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Codes and weights are available on https://github.com/zz1358m/SofT-GRPO-master

Updated: 2026-01-29 12:07:59

标题: SofT-GRPO：通过Gumbel重新参数化的软思维策略优化超越离散令牌LLM强化学习

摘要: 大语言模型（LLM）推理的软思维范式在某些场景中可以胜过传统的离散令牌链式推理（CoT）方法，突显了其研究和应用价值。然而，虽然离散令牌CoT推理模式可以通过群体相对策略优化（GRPO）等政策优化算法加强，但通过强化学习（RL）扩展软思维模式仍然具有挑战性。这一困难源于将随机性注入软思维令牌并相应地更新软思维政策的复杂性。因此，先前尝试将软思维与GRPO结合的方法通常表现不如其离散令牌GRPO对应物。为了充分释放软思维的潜力，本文提出了一种新颖的政策优化算法SofT-GRPO，以加强在软思维推理模式下的LLMs。SofT-GRPO将Gumbel噪声注入对数，采用Gumbel-Softmax技术以避免软思维令牌超出预训练嵌入空间，并利用政策梯度中的重参数化技巧。我们在1.5B到7B参数的基础LLMs上进行了实验，结果表明SofT-GRPO使软思维LLMs在Pass@1上略优于离散令牌GRPO（平均准确度增加0.13%），同时在Pass@32上表现出显著提升（平均准确度增加2.19%）。代码和权重可在https://github.com/zz1358m/SofT-GRPO-master 上找到。

更新时间: 2026-01-29 12:07:59

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.06411v2

CORE: Collaborative Reasoning via Cross Teaching

Large language models exhibit complementary reasoning errors: on the same instance, one model may succeed with a particular decomposition while another fails. We propose Collaborative Reasoning (CORE), a training-time collaboration framework that converts peer success into a learning signal via a cross-teaching protocol. Each problem is solved in two stages: a cold round of independent sampling, followed by a contexted rescue round in which models that failed receive hint extracted from a successful peer. CORE optimizes a combined reward that balances (i) correctness, (ii) a lightweight DPP-inspired diversity term to reduce error overlap, and (iii) an explicit rescue bonus for successful recovery. We evaluate CORE across four standard reasoning datasets GSM8K, MATH, AIME, and GPQA. With only 1,000 training examples, a pair of small open source models (3B+4B) reaches Pass@2 of 99.54% on GSM8K and 92.08% on MATH, compared to 82.50% and 74.82% for single-model training. On harder datasets, the 3B+4B pair reaches Pass@2 of 77.34% on GPQA (trained on 348 examples) and 79.65% on AIME (trained on 792 examples), using a training-time budget of at most 1536 context tokens and 3072 generated tokens. Overall, these results show that training-time collaboration can reliably convert model complementarity into large gains without scaling model size.

Updated: 2026-01-29 12:07:54

标题: 核心：通过跨学科合作推理

摘要: 大型语言模型表现出互补的推理错误：在同一个实例上，一个模型可能通过特定的分解成功，而另一个模型则失败。我们提出了一种训练时协作框架Collaborative Reasoning（CORE），它通过交叉教学协议将同行成功转化为学习信号。每个问题分为两个阶段解决：一个独立采样的冷轮，接着是一个有上下文的拯救轮，在这一轮中失败的模型会接收从成功同行那里提取的提示。CORE优化了一个综合奖励，平衡了（i）正确性，（ii）受轻量级DPP启发的多样性项以减少错误重叠，以及（iii）针对成功恢复的显式拯救奖励。我们评估了CORE在四个标准推理数据集GSM8K、MATH、AIME和GPQA上的表现。仅使用1,000个训练示例，一对小型开源模型（3B+4B）在GSM8K上达到了99.54%的Pass@2，而在MATH上达到了92.08%，相比之下，单模型训练的正确率分别为82.50%和74.82%。在更难的数据集上，3B+4B对在GPQA（使用348个示例进行训练）上达到了77.34%的Pass@2，在AIME（使用792个示例进行训练）上达到了79.65%，使用的训练时预算最多为1536个上下文标记和3072个生成标记。总的来说，这些结果表明，训练时协作可以可靠地将模型互补转化为巨大的收益，而无需扩大模型规模。

更新时间: 2026-01-29 12:07:54

领域: cs.AI

下载: http://arxiv.org/abs/2601.21600v1

Beyond Imitation: Reinforcement Learning for Active Latent Planning

Aiming at efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune Large Language Models (LLMs) to substitute discrete language tokens with continuous latent tokens. These methods consume fewer tokens compared to the conventional language CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised based on imitating language labels. Considering that there can be multiple equivalent but diverse CoT labels for a question, passively imitating an arbitrary one may lead to inferior latent token representations and latent reasoning policies, undermining the potential planning ability and resulting in clear gaps between training and testing. In this work, we emphasize the importance of active planning over the representation space of latent tokens in achieving the optimal latent reasoning policy. So, we propose the \underline{A}c\underline{t}ive Latent \underline{P}lanning method (ATP-Latent), which models the supervision process of latent tokens as a conditional variational auto-encoder (VAE) to obtain a smoother latent space. Moreover, to facilitate the most reasonable latent reasoning policy, ATP-Latent conducts reinforcement learning (RL) with an auxiliary coherence reward, which is calculated based on the consistency between VAE-decoded contents of latent tokens, enabling a guided RL process. In experiments on LLaMA-1B, ATP-Latent demonstrates +4.1\% accuracy and -3.3\% tokens on four benchmarks compared to advanced baselines. Codes are available on https://github.com/zz1358m/ATP-Latent-master.

Updated: 2026-01-29 12:07:16

标题: 超越模仿：用于主动潜在规划的强化学习

摘要: 针对高效且密集的思维链（CoT）推理，潜在推理方法微调大型语言模型（LLMs），以连续的潜在标记替代离散的语言标记。与传统语言CoT推理相比，这些方法消耗更少的标记，并且具有在密集潜在空间中规划的潜力。然而，当前的潜在标记通常是基于模仿语言标签进行监督的。考虑到对于一个问题可能存在多个等效但不同的CoT标签，被动地模仿任意一个可能导致较差的潜在标记表示和潜在推理策略，削弱潜在的规划能力，并导致训练和测试之间存在明显差距。在这项工作中，我们强调了通过激活潜在标记表示空间中的规划来实现最佳潜在推理策略的重要性。因此，我们提出了"主动潜在规划"（ATP-Latent）方法，该方法将潜在标记的监督过程建模为条件变分自动编码器（VAE），以获得更平滑的潜在空间。此外，为了促进最合理的潜在推理策略，ATP-Latent进行强化学习（RL）并使用辅助的连贯性奖励，该奖励是基于潜在标记的VAE解码内容之间的一致性计算的，从而实现了引导的RL过程。在LLaMA-1B上的实验中，与先进基线相比，ATP-Latent在四个基准测试上分别展示了+4.1\%的准确率和-3.3\%的标记。代码可在https://github.com/zz1358m/ATP-Latent-master上找到。

更新时间: 2026-01-29 12:07:16

领域: cs.AI

下载: http://arxiv.org/abs/2601.21598v1

Language Generation: Complexity Barriers and Implications for Learning

Kleinberg and Mullainathan showed that language generation in the limit is always possible at the level of computability: given enough positive examples, a learner can eventually generate data indistinguishable from a target language. However, such existence results do not address feasibility. We study the sample complexity of language generation in the limit for several canonical classes of formal languages. Our results show that infeasibility already appears for context-free and regular languages, and persists even for strict subclasses such as locally threshold testable languages, as well as for incomparable classes such as non-erasing pattern languages, a well-studied class in the theory of language identification. Overall, our results establish a clear gap between the theoretical possibility of language generation in the limit and its computational feasibility.

Updated: 2026-01-29 12:04:45

标题: 语言生成：复杂性障碍及对学习的影响

摘要: Kleinberg和Mullainathan表明，在可计算性水平上，语言生成在极限情况下总是可能的：在给定足够的正例的情况下，学习者最终可以生成与目标语言无法区分的数据。然而，这种存在性结果并没有解决可行性问题。我们研究了几种经典形式语言类别中语言生成在极限情况下的样本复杂性。我们的结果表明，在上下文无关和正则语言中已经出现了不可行性，并且即使在严格子类别如本地阈值可测试语言中也存在，并且在不可比较的类别中也持续存在，如非擦除图案语言，这是语言识别理论中一个研究深入的类别。总的来说，我们的结果确立了语言生成在极限情况下的理论可能性与计算可行性之间的明显差距。

更新时间: 2026-01-29 12:04:45

领域: cs.CL,cs.AI,cs.FL,cs.LG

下载: http://arxiv.org/abs/2511.05759v2

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.

Updated: 2026-01-29 12:02:16

标题: 带给我杯子！利用视觉注意提示个性化视觉-语言-动作模型

摘要: 虽然视觉-语言-动作（VLA）模型在通用指令上表现良好，但在个性化命令（如“拿我的杯子”）方面表现不佳，其中机器人必须在视觉上相似的对象中执行特定实例的操作。我们研究了这种处理个人物品的情景，其中VLA必须在训练期间未见过的情况下，仅使用少量参考图像来识别和控制用户特定对象。为了解决这一挑战，我们提出了视觉注意提示（VAP），这是一个简单但有效的无需训练的感知适配器，它为冻结的VLA提供了自顶向下的选择性注意力。VAP将参考图像视为非参数视觉记忆，通过开放词汇检测和基于嵌入的匹配将个人对象与场景联系起来，然后通过突出显示对象并重写指令，将这种联系作为视觉提示注入。我们构建了两个仿真基准，个性化-SIMPLER和个性化-VLABench，以及一个现实世界的桌面基准，以评估多个机器人和任务之间的个性化操作。实验表明，VAP在成功率和正确对象操作方面始终优于通用策略和基于标记学习的基线，有助于弥合语义理解和实例级控制之间的差距。

更新时间: 2026-01-29 12:02:16

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2512.20014v2

Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening

Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.

Updated: 2026-01-29 12:01:53

标题: 可扩展的功率采样：通过分布调整解锁LLMs的高效、无训练推理

摘要: 强化学习（RL）后训练是改进大型语言模型（LLMs）推理性能的主要方法，然而越来越多的证据表明，其收益主要来自于分布的收缩而非新能力的获得。最近的研究表明，使用马尔可夫链蒙特卡洛（MCMC）从LLMs的幂分布中进行采样可以恢复与RL后训练相当的性能，而无需依赖外部奖励；然而，MCMC的高计算成本使得这种方法对广泛采用不切实际。在这项工作中，我们提出了一个理论上基础的替代方案，消除了对迭代MCMC的需求。我们推导出一个新颖的公式，表明全局幂分布可以通过一个标记级别的缩放低温分布来近似，其中缩放因子捕捉了未来轨迹质量。利用这一洞察力，我们引入了一种无需训练和验证器的算法，自回归地加强基础模型的生成分布。在实证方面，我们在四个LLMs上评估了我们的方法在数学、问答和代码任务上的表现，并展示我们的方法与无需依赖外部奖励的一次性GRPO相匹配或超越，同时将推理延迟降低了超过10倍，与基于MCMC的采样相比。

更新时间: 2026-01-29 12:01:53

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21590v1

Heterogeneity-Aware Knowledge Sharing for Graph Federated Learning

Graph Federated Learning (GFL) enables distributed graph representation learning while protecting the privacy of graph data. However, GFL suffers from heterogeneity arising from diverse node features and structural topologies across multiple clients. To address both types of heterogeneity, we propose a novel graph Federated learning method via Semantic and Structural Alignment (FedSSA), which shares the knowledge of both node features and structural topologies. For node feature heterogeneity, we propose a novel variational model to infer class-wise node distributions, so that we can cluster clients based on inferred distributions and construct cluster-level representative distributions. We then minimize the divergence between local and cluster-level distributions to facilitate semantic knowledge sharing. For structural heterogeneity, we employ spectral Graph Neural Networks (GNNs) and propose a spectral energy measure to characterize structural information, so that we can cluster clients based on spectral energy and build cluster-level spectral GNNs. We then align the spectral characteristics of local spectral GNNs with those of cluster-level spectral GNNs to enable structural knowledge sharing. Experiments on six homophilic and five heterophilic graph datasets under both non-overlapping and overlapping partitioning settings demonstrate that FedSSA consistently outperforms eleven state-of-the-art methods.

Updated: 2026-01-29 12:00:13

标题: 异构感知的图联邦学习知识共享

摘要: Graph Federated Learning（GFL）实现了分布式图表示学习，同时保护了图数据的隐私。然而，GFL受到来自多个客户端的不同节点特征和结构拓扑的异质性的影响。为了解决这两种类型的异质性，我们提出了一种通过语义和结构对齐（FedSSA）的新颖图形联合学习方法，该方法共享了节点特征和结构拓扑的知识。对于节点特征的异质性，我们提出了一个新颖的变分模型来推断类别节点分布，这样我们就可以基于推断的分布对客户端进行聚类，并构建集群级别的代表性分布。然后我们最小化本地和集群级别分布之间的差异以促进语义知识共享。对于结构异质性，我们采用了谱图神经网络（GNNs）并提出了一种谱能量度量来表征结构信息，这样我们就可以基于谱能量对客户端进行聚类并构建集群级别的谱GNNs。然后我们将本地谱GNNs的谱特征与集群级别谱GNNs的特征对齐，以实现结构知识共享。在六个同源和五个异源图数据集上进行的实验，在非重叠和重叠分区设置下，证明了FedSSA一直优于十一种最先进的方法。

更新时间: 2026-01-29 12:00:13

领域: cs.LG

下载: http://arxiv.org/abs/2601.21589v1

CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning

In open-ended environments, autonomous learning agents must set their own goals and build their own curriculum through an intrinsically motivated exploration. They may consider a large diversity of goals, aiming to discover what is controllable in their environments, and what is not. Because some goals might prove easy and some impossible, agents must actively select which goal to practice at any moment, to maximize their overall mastery on the set of learnable goals. This paper proposes CURIOUS, an algorithm that leverages 1) a modular Universal Value Function Approximator with hindsight learning to achieve a diversity of goals of different kinds within a unique policy and 2) an automated curriculum learning mechanism that biases the attention of the agent towards goals maximizing the absolute learning progress. Agents focus sequentially on goals of increasing complexity, and focus back on goals that are being forgotten. Experiments conducted in a new modular-goal robotic environment show the resulting developmental self-organization of a learning curriculum, and demonstrate properties of robustness to distracting goals, forgetting and changes in body properties.

Updated: 2026-01-29 11:57:35

标题: 好的，这个文献标题的翻译是：好奇：内在动机模块化多目标强化学习

摘要: 在开放式环境中，自主学习代理人必须设定自己的目标，并通过内在动机驱动的探索建立自己的课程。他们可能考虑各种各样的目标，旨在发现在他们的环境中什么是可控的，什么是不可控的。由于一些目标可能很容易实现，而一些则不可能，代理人必须积极选择在任何时刻要练习哪个目标，以最大化他们对可学习目标集的整体掌握。本文提出了一种名为CURIOUS的算法，它利用1）具有事后学习的模块化通用值函数逼近器来实现不同类型的目标多样性， 2）一种自动课程学习机制，该机制偏向于引导代理人关注最大化绝对学习进展的目标。代理人依次关注日益复杂的目标，并重新关注那些正在被遗忘的目标。在一个新的模块化目标机器人环境中进行的实验展示了学习课程的发展性自组织结果，并展示了对分散目标、遗忘和身体属性变化的稳健性特性。

更新时间: 2026-01-29 11:57:35

领域: cs.AI

下载: http://arxiv.org/abs/1810.06284v5

Discovering Multi-Scale Semantic Structure in Text Corpora Using Density-Based Trees and LLM Embeddings

Recent advances in large language models enable documents to be represented as dense semantic embeddings, supporting similarity-based operations over large text collections. However, many web-scale systems still rely on flat clustering or predefined taxonomies, limiting insight into hierarchical topic relationships. In this paper we operationalize hierarchical density modeling on large language model embeddings in a way not previously explored. Instead of enforcing a fixed taxonomy or single clustering resolution, the method progressively relaxes local density constraints, revealing how compact semantic groups merge into broader thematic regions. The resulting tree encodes multi-scale semantic organization directly from data, making structural relationships between topics explicit. We evaluate the hierarchies on standard text benchmarks, showing that semantic alignment peaks at intermediate density levels and that abrupt transitions correspond to meaningful changes in semantic resolution. Beyond benchmarks, the approach is applied to large institutional and scientific corpora, exposing dominant fields, cross-disciplinary proximities, and emerging thematic clusters. By framing hierarchical structure as an emergent property of density in embedding spaces, this method provides an interpretable, multi-scale representation of semantic structure suitable for large, evolving text collections.

Updated: 2026-01-29 11:57:19

标题: 使用基于密度的树和LLM嵌入在文本语料库中发现多尺度语义结构

摘要: 近年来，大型语言模型的进步使得文档可以被表示为密集的语义嵌入，支持在大型文本集合上进行基于相似性的操作。然而，许多网络规模的系统仍然依赖于平面聚类或预定义的分类法，限制了对层次主题关系的洞察。在本文中，我们在以前未曾探索的方式上将层次密度建模应用于大型语言模型嵌入。该方法并不强制执行固定的分类法或单一的聚类分辨率，而是逐渐放宽局部密度约束，揭示了紧凑的语义群体如何融合成更广泛的主题区域。由此得出的树直接从数据中编码多尺度语义组织，使主题之间的结构关系变得明确。我们在标准文本基准上评估了这些层次结构，结果显示语义对齐在中间密度水平达到顶峰，并且突变对应于语义分辨率的有意义变化。除了基准测试，该方法还应用于大型机构和科学语料库，揭示了主导领域、跨学科接近度以及新兴的主题集群。通过将层次结构框定为嵌入空间密度的新兴属性，该方法提供了一个可解释的、多尺度的语义结构表示，适用于大型、不断发展的文本集合。

更新时间: 2026-01-29 11:57:19

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.23471v2

Neural network embeddings recover value dimensions from psychometric survey items on par with human data

We demonstrate that embeddings derived from large language models, when processed with "Survey and Questionnaire Item Embeddings Differentials" (SQuID), can recover the structure of human values obtained from human rater judgments on the Revised Portrait Value Questionnaire (PVQ-RR). We compare multiple embedding models across a number of evaluation metrics including internal consistency, dimension correlations and multidimensional scaling configurations. Unlike previous approaches, SQuID addresses the challenge of obtaining negative correlations between dimensions without requiring domain-specific fine-tuning or training data re-annotation. Quantitative analysis reveals that our embedding-based approach explains 55% of variance in dimension-dimension similarities compared to human data. Multidimensional scaling configurations show alignment with pooled human data from 49 different countries. Generalizability tests across three personality inventories (IPIP, BFI-2, HEXACO) demonstrate that SQuID consistently increases correlation ranges, suggesting applicability beyond value theory. These results show that semantic embeddings can effectively replicate psychometric structures previously established through extensive human surveys. The approach offers substantial advantages in cost, scalability and flexibility while maintaining comparable quality to traditional methods. Our findings have significant implications for psychometrics and social science research, providing a complementary methodology that could expand the scope of human behavior and experience represented in measurement tools.

Updated: 2026-01-29 11:51:40

标题: 神经网络嵌入恢复与人类数据相当的心理测量调查项目的价值维度

摘要: 我们证明，从大型语言模型中衍生的嵌入，在经过“调查和问卷项目嵌入差异”（SQuID）处理后，可以恢复人类价值结构，该结构是通过修订后的人像价值问卷（PVQ-RR）上的人类评分员判断获得的。我们比较了多个嵌入模型在多个评估指标上的表现，包括内部一致性、维度相关性和多维缩放配置。与先前的方法不同，SQuID解决了在不需要领域特定的微调或训练数据重新注释的情况下获得维度之间的负相关的挑战。定量分析显示，我们基于嵌入的方法在解释维度之间相似性的方差方面比人类数据高出55%。多维缩放配置显示与来自49个不同国家的汇总人类数据的一致性。在三个人格问卷（IPIP、BFI-2、HEXACO）上进行的泛化测试表明，SQuID一致地扩大了相关性范围，表明其适用性超越了价值理论。这些结果表明，语义嵌入可以有效地复制以前通过广泛的人类调查建立的心理测量结构。该方法在成本、可扩展性和灵活性方面具有显著优势，同时保持与传统方法相媲美的质量。我们的发现对于心理测量学和社会科学研究具有重要意义，提供了一种补充方法论，可以扩大测量工具中所代表的人类行为和体验范围。

更新时间: 2026-01-29 11:51:40

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.24906v2

ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses

In-context learning (ICL) has become a powerful, data-efficient paradigm for text classification using large language models. However, its robustness against realistic adversarial threats remains largely unexplored. We introduce ICL-Evader, a novel black-box evasion attack framework that operates under a highly practical zero-query threat model, requiring no access to model parameters, gradients, or query-based feedback during attack generation. We design three novel attacks, Fake Claim, Template, and Needle-in-a-Haystack, that exploit inherent limitations of LLMs in processing in-context prompts. Evaluated across sentiment analysis, toxicity, and illicit promotion tasks, our attacks significantly degrade classifier performance (e.g., achieving up to 95.3% attack success rate), drastically outperforming traditional NLP attacks which prove ineffective under the same constraints. To counter these vulnerabilities, we systematically investigate defense strategies and identify a joint defense recipe that effectively mitigates all attacks with minimal utility loss (<5% accuracy degradation). Finally, we translate our defensive insights into an automated tool that proactively fortifies standard ICL prompts against adversarial evasion. This work provides a comprehensive security assessment of ICL, revealing critical vulnerabilities and offering practical solutions for building more robust systems. Our source code and evaluation datasets are publicly available at: https://github.com/ChaseSecurity/ICL-Evader .

Updated: 2026-01-29 11:50:50

标题: ICL-EVADER：针对上下文学习的零查询黑盒规避攻击及其防御

摘要: 在使用大型语言模型进行文本分类时，上下文学习（ICL）已成为一种强大且数据高效的范式。然而，它对现实威胁的鲁棒性仍然未被充分探索。我们引入了ICL-Evader，这是一个新颖的黑盒逃避攻击框架，它在高度实用的零查询威胁模型下运行，攻击生成过程中不需要访问模型参数、梯度或基于查询的反馈。我们设计了三种新颖的攻击方式，Fake Claim、Template和Needle-in-a-Haystack，它们利用LLM在处理上下文提示时的固有限制。在情感分析、毒性和非法促销任务中进行评估，我们的攻击显著降低了分类器的性能（例如，实现了高达95.3%的攻击成功率），远远超过传统的NLP攻击，在相同约束条件下证明无效。为了对抗这些漏洞，我们系统地调查了防御策略，并确定了一种联合防御配方，可以有效地减轻所有攻击，同时最小化效用损失（<5%的准确性下降）。最后，我们将我们的防御见解转化为一个自动化工具，主动加固标准ICL提示以对抗逃避攻击。这项工作提供了对ICL的全面安全评估，揭示了关键的漏洞，并提供了构建更加健壮系统的实际解决方案。我们的源代码和评估数据集可以在以下网址公开获取：https://github.com/ChaseSecurity/ICL-Evader。

更新时间: 2026-01-29 11:50:50

领域: cs.CR

下载: http://arxiv.org/abs/2601.21586v1

Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription under which the provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.

Updated: 2026-01-29 11:47:23

标题: 你的LLM是否在向你过度收费？代币化、透明度和激励

摘要: 最先进的大型语言模型需要专门的硬件和大量能源来运行。因此，提供访问大型语言模型的云服务变得非常流行。在这些服务中，用户支付的价格取决于模型用于生成输出的标记数量：他们每个标记支付固定价格。在这项工作中，我们展示了这种定价机制为提供者制定策略和虚报模型用于生成输出的标记数量创造了财务激励，而用户无法证明，甚至不知道提供者是否向他们多收费。然而，我们还展示了，如果不忠实的提供者被迫对模型使用的生成过程进行透明化，最佳虚报而不引起怀疑是困难的。尽管如此，作为一个概念验证，我们开发了一种高效的启发式算法，允许提供者在不引起怀疑的情况下显着向用户多收费。至关重要的是，我们证明了运行该算法的成本低于从多收费用户中获得的额外收入，突显了用户在当前按标记付费定价机制下的脆弱性。此外，我们还表明，为了消除制定策略的财务激励，定价机制必须根据它们的字符计数线性定价标记。虽然这会使提供商的利润率在标记之间变化，但我们介绍了一个简单的处方，根据这个激励兼容的定价机制采用的提供商可以保持他们在按标记付费定价机制下的平均利润率。在此过程中，为了说明和补充我们的理论结果，我们对来自LMSYS Chatbot Arena平台的几个大型语言模型（如Llama，Gemma和Ministral系列）和输入提示进行了实验。

更新时间: 2026-01-29 11:47:23

领域: cs.GT,cs.AI,cs.CY,cs.LG

下载: http://arxiv.org/abs/2505.21627v3

From Mice to Trains: Amortized Bayesian Inference on Graph Data

Graphs arise across diverse domains, from biology and chemistry to social and information networks, as well as in transportation and logistics. Inference on graph-structured data requires methods that are permutation-invariant, scalable across varying sizes and sparsities, and capable of capturing complex long-range dependencies, making posterior estimation on graph parameters particularly challenging. Amortized Bayesian Inference (ABI) is a simulation-based framework that employs generative neural networks to enable fast, likelihood-free posterior inference. We adapt ABI to graph data to address these challenges to perform inference on node-, edge-, and graph-level parameters. Our approach couples permutation-invariant graph encoders with flexible neural posterior estimators in a two-module pipeline: a summary network maps attributed graphs to fixed-length representations, and an inference network approximates the posterior over parameters. In this setting, several neural architectures can serve as the summary network. In this work we evaluate multiple architectures and assess their performance on controlled synthetic settings and two real-world domains - biology and logistics - in terms of recovery and calibration.

Updated: 2026-01-29 11:46:57

标题: 从老鼠到火车：图数据上的摊销贝叶斯推断

摘要: 图在不同领域中出现，从生物学和化学到社交和信息网络，以及在交通和物流中。对图结构数据的推断需要方法是排列不变的，可扩展到不同大小和稀疏度，并能够捕捉复杂的长程依赖性，使得在图参数上进行后验估计特别具有挑战性。摊销贝叶斯推断（ABI）是一种基于仿真的框架，采用生成神经网络实现快速、无似然的后验推断。我们将ABI调整为图数据，以解决这些挑战，对节点、边和图级参数进行推断。我们的方法将排列不变图编码器与灵活的神经后验估计器耦合在一个两模块管道中：一个摘要网络将属性图映射到固定长度的表示，一个推断网络近似参数的后验。在这种设置中，几种神经架构可以作为摘要网络。在这项工作中，我们评估多种架构，并根据受控合成设置和两个真实领域（生物学和物流）中的恢复和校准性能进行评估。

更新时间: 2026-01-29 11:46:57

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2601.02241v2

CORDS: Continuous Representations of Discrete Structures

Many learning problems require predicting sets of objects when the number of objects is not known beforehand. Examples include object detection, molecular modeling, and scientific inference tasks such as astrophysical source detection. Existing methods often rely on padded representations or must explicitly infer the set size, which often poses challenges. We present a novel strategy for addressing this challenge by casting prediction of variable-sized sets as a continuous inference problem. Our approach, CORDS (Continuous Representations of Discrete Structures), provides an invertible mapping that transforms a set of spatial objects into continuous fields: a density field that encodes object locations and count, and a feature field that carries their attributes over the same support. Because the mapping is invertible, models operate entirely in field space while remaining exactly decodable to discrete sets. We evaluate CORDS across molecular generation and regression, object detection, simulation-based inference, and a mathematical task involving recovery of local maxima, demonstrating robust handling of unknown set sizes with competitive accuracy.

Updated: 2026-01-29 11:46:17

标题: CORDS: 连续表示离散结构

摘要: 许多学习问题需要在不事先知道对象数量的情况下预测对象集。例如，对象检测、分子建模和科学推断任务（如天体物体检测）等。现有方法通常依赖于填充表示或必须明确推断集大小，这经常带来挑战。我们提出了一种新的策略来解决这一挑战，将可变大小集合的预测视为连续推断问题。我们的方法CORDS（离散结构的连续表示）提供了一个可逆映射，将一组空间对象转换为连续场：一个密度场，编码对象位置和计数，以及一个特征场，携带它们的属性在相同的支持上。由于映射是可逆的，模型完全在场空间中运行，同时仍然可以准确解码为离散集。我们评估了CORDS在分子生成和回归、对象检测、基于模拟的推断以及涉及本地极大值恢复的数学任务中的表现，展示了对未知集大小的强大处理能力和具有竞争力的准确性。

更新时间: 2026-01-29 11:46:17

领域: cs.LG

下载: http://arxiv.org/abs/2601.21583v1

Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves

Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden-sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2x larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x larger expert selection diversity than SOTA MoEs.

Updated: 2026-01-29 11:44:38

标题: 深度循环注意力混合模型：赋予潜在推理应有的关注

摘要: 深度回归通过在深度之间共享参数促进了潜在推理。然而，先前的工作缺乏组合的FLOP、参数和内存匹配的基线，由于部分固定的层堆栈而未充分利用深度回归，并忽略了限制多步潜在推理的恒定隐藏大小的瓶颈。为了解决这个问题，我们引入了一个深度递归注意力混合的模块化框架（Dreamer），结合了序列注意力、深度注意力和稀疏专家注意力。它通过在深度上的注意力减轻了隐藏大小的瓶颈，解耦了缩放维度，并允许深度递归模型有效地扩展。在语言推理基准测试中，我们的模型在相同准确率下所需的训练令牌数量比FLOP、参数和内存匹配的SOTA少2到8倍，并且胜过了大小相当的SOTA模型的2倍。我们进一步展示了跨深度的知识使用的见解，例如，显示比SOTA MoEs更大2到11倍的专家选择多样性。

更新时间: 2026-01-29 11:44:38

领域: cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2601.21582v1

Evaluating Prediction Uncertainty Estimates from BatchEnsemble

Deep learning models struggle with uncertainty estimation. Many approaches are either computationally infeasible or underestimate uncertainty. We investigate \textit{BatchEnsemble} as a general and scalable method for uncertainty estimation across both tabular and time series tasks. To extend BatchEnsemble to sequential modeling, we introduce GRUBE, a novel BatchEnsemble GRU cell. We compare the BatchEnsemble to Monte Carlo dropout and deep ensemble models. Our results show that BatchEnsemble matches the uncertainty estimation performance of deep ensembles, and clearly outperforms Monte Carlo dropout. GRUBE achieves similar or better performance in both prediction and uncertainty estimation. These findings show that BatchEnsemble and GRUBE achieve similar performance with fewer parameters and reduced training and inference time compared to traditional ensembles.

Updated: 2026-01-29 11:44:36

标题: 评估BatchEnsemble的预测不确定性估计

摘要: 深度学习模型在不确定性估计方面存在困难。许多方法要么在计算上不可行，要么低估不确定性。我们调查了\textit{BatchEnsemble}作为一种通用且可扩展的方法，用于跨表格和时间序列任务的不确定性估计。为了将BatchEnsemble扩展到顺序建模，我们引入了GRUBE，一种新颖的BatchEnsemble GRU单元。我们将BatchEnsemble与蒙特卡罗辍学和深度合奏模型进行比较。我们的结果显示，BatchEnsemble与深度合奏的不确定性估计性能相匹配，并明显优于蒙特卡罗辍学。GRUBE在预测和不确定性估计方面实现了类似或更好的性能。这些发现表明，与传统合奏相比，BatchEnsemble和GRUBE在具有更少参数和减少训练和推理时间的情况下实现了类似的性能。

更新时间: 2026-01-29 11:44:36

领域: cs.LG

下载: http://arxiv.org/abs/2601.21581v1

Enhancing Membership Inference Attacks on Diffusion Models from a Frequency-Domain Perspective

Diffusion models have achieved tremendous success in image generation, but they also raise significant concerns regarding privacy and copyright issues. Membership Inference Attacks (MIAs) are designed to ascertain whether specific data were utilized during a model's training phase. As current MIAs for diffusion models typically exploit the model's image prediction ability, we formalize them into a unified general paradigm which computes the membership score for membership identification. Under this paradigm, we empirically find that existing attacks overlook the inherent deficiency in how diffusion models process high-frequency information. Consequently, this deficiency leads to member data with more high-frequency content being misclassified as hold-out data, and hold-out data with less high-frequency content tend to be misclassified as member data. Moreover, we theoretically demonstrate that this deficiency reduces the membership advantage of attacks, thereby interfering with the effective discrimination of member data and hold-out data. Based on this insight, we propose a plug-and-play high-frequency filter module to mitigate the adverse effects of the deficiency, which can be seamlessly integrated into any attacks within this general paradigm without additional time costs. Extensive experiments corroborate that this module significantly improves the performance of baseline attacks across different datasets and models.

Updated: 2026-01-29 11:43:24

标题: 从频域角度加强对扩散模型的成员推断攻击

摘要: 扩散模型在图像生成方面取得了巨大成功，但也引发了关于隐私和版权问题的重大关注。成员推断攻击（MIAs）旨在确定特定数据是否在模型的训练阶段中被使用。由于当前针对扩散模型的MIAs通常利用模型的图像预测能力，我们将它们形式化为一个统一的一般范式，用于计算成员识别的成员分数。在这个范式下，我们经验性地发现现有攻击忽视了扩散模型处理高频信息的固有缺陷。因此，这种缺陷导致具有更多高频内容的成员数据被误分类为保留数据，而具有较少高频内容的保留数据则倾向于被误分类为成员数据。此外，我们在理论上证明了这种缺陷减少了攻击的成员优势，从而干扰了成员数据和保留数据的有效区分。基于这一洞见，我们提出了一个即插即用的高频过滤器模块，以减轻这种缺陷的不利影响，该模块可以无缝集成到这一一般范式内的任何攻击中，而无需额外的时间成本。大量实验证实，这一模块显著改善了基线攻击在不同数据集和模型上的性能。

更新时间: 2026-01-29 11:43:24

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2505.20955v3

KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive $\mathcal{O}(n^3C)$ parameter complexity with $n$ as the width of the residual stream and $C$ as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, $\mathcal{O} \left( nC \cdot n! \right)$. To address both challenges, we propose \textbf{KromHC}, which uses the \underline{Kro}necker products of smaller doubly stochastic matrices to parametrize the residual matrix in \underline{mHC}. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to $\mathcal{O}(n^2C)$. Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is available at \texttt{https://github.com/wz1119/KromHC}.

Updated: 2026-01-29 11:43:05

标题: KromHC：具有Kronecker乘积残差矩阵的流形约束超连接

摘要: 超连接（HC）在神经网络（NN）中的成功也凸显了其训练不稳定性和受限可扩展性等问题。流形约束超连接（mHC）通过将残差连接空间投影到比尔霍夫多面体上来缓解这些挑战，然而，它面临两个问题：1）其迭代Sinkhorn-Knopp（SK）算法并不总是产生精确的双随机残差矩阵；2）mHC的参数复杂度为$\mathcal{O}(n^3C)$，其中$n$为残差流的宽度，$C$为特征维度。最近提出的mHC-lite通过比尔霍夫-冯诺依曼定理重新参数化残差矩阵以确保双随机性，但也面临参数复杂度爆炸，$\mathcal{O} \left( nC \cdot n! \right)$。为了解决这两个挑战，我们提出了\textbf{KromHC}，它使用较小的双随机矩阵的\underline{Kro}内克积来参数化\underline{mHC}中的残差矩阵。通过在张量化残差流的每个模式上施加流形约束，KromHC保证了残差矩阵的精确双随机性，同时将参数复杂度降低到$\mathcal{O}(n^2C)$。全面的实验证明，KromHC与甚至优于最先进的mHC变种，同时需要更少的可训练参数。代码可在\texttt{https://github.com/wz1119/KromHC}上找到。

更新时间: 2026-01-29 11:43:05

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2601.21579v1

TimeSeries2Report prompting enables adaptive large language model management of lithium-ion batteries

Large language models (LLMs) offer promising capabilities for interpreting multivariate time-series data, yet their application to real-world battery energy storage system (BESS) operation and maintenance remains largely unexplored. Here, we present TimeSeries2Report (TS2R), a semantic translation framework that converts raw lithium-ion battery operational time-series into structured, semantically enriched reports, enabling LLMs to reason, predict, and make decisions in BESS management scenarios. TS2R encodes short-term temporal dynamics into natural language through a combination of segmentation, semantic abstraction, and rule-based interpretation, effectively bridging low-level sensor signals with high-level contextual insights. We benchmark TS2R across both lab-scale and real-world datasets, evaluating report quality and downstream task performance in anomaly detection, state-of-charge prediction, and charging/discharging management. Compared with vision-, embedding-, and text-based prompting baselines, report-based prompting via TS2R consistently improves LLM performance in terms of across accuracy, robustness, and explainability metrics. Notably, TS2R-integrated LLMs achieve expert-level decision quality and predictive consistency without retraining or architecture modification, establishing a practical path for adaptive, LLM-driven battery intelligence.

Updated: 2026-01-29 11:42:36

标题: TimeSeries2Report提示功能实现对锂离子电池的自适应大型语言模型管理

摘要: 大型语言模型(LLMs)为解释多变量时间序列数据提供了有前途的能力，然而它们在实际电池储能系统(BESS)运行和维护方面的应用仍然大部分未被探索。在这里，我们提出了TimeSeries2Report (TS2R)，这是一个语义转换框架，将原始锂离子电池运行时间序列转换为结构化、语义丰富的报告，使LLMs能够在BESS管理场景中进行推理、预测和决策。TS2R通过分段、语义抽象和基于规则的解释的组合，将短期时间动态编码成自然语言，有效地将低级传感器信号与高级上下文洞察相连接。我们在实验室规模和实际数据集上对TS2R进行基准测试，评估报告质量和在异常检测、电荷状态预测以及充放电管理方面的下游任务性能。与基于视觉、嵌入和文本的提示基线相比，通过TS2R的报告提示始终提高LLM在准确性、鲁棒性和可解释性指标方面的性能。值得注意的是，集成了TS2R的LLMs在不重新训练或修改架构的情况下实现了专家级的决策质量和预测一致性，为自适应、由LLM驱动的电池智能建立了实际路径。

更新时间: 2026-01-29 11:42:36

领域: cs.AI

下载: http://arxiv.org/abs/2512.16453v2

Chain Of Thought Compression: A Theoritical Analysis

Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) with intermediate steps, yet incurs prohibitive computational costs due to generation of extra tokens. Recent studies empirically show that compressing reasoning steps into latent states, or implicit CoT compression, offers a token-efficient alternative. However, the mechanism behind CoT compression remains unclear. In this paper, we provide the first theoretical analysis of the difficulty of learning to internalize intermediate reasoning steps. By introducing Order-r Interaction, we prove that the learning signal for high-order logical dependencies exponentially decays to solve irreducible problem, where skipping intermediate steps inevitably leads to high-order interaction barriers. To empirically validate this, we introduce NatBool-DAG, a challenging benchmark designed to enforce irreducible logical reasoning and eliminate semantic shortcuts. Guided by our theoretical findings, we propose ALiCoT (Aligned Implicit CoT), a novel framework that overcomes the signal decay by aligning latent token distributions with intermediate reasoning states. Experimental results demonstrate that ALiCoT successfully unlocks efficient reasoning: it achieves a 54.4x speedup while maintaining performance comparable to explicit CoT.

Updated: 2026-01-29 11:42:03

标题: 思维链压缩：理论分析

摘要: Chain-of-Thought (CoT)已经解锁了大型语言模型（LLMs）的先进推理能力，但由于生成额外的标记而产生了不可接受的计算成本。最近的研究实证表明，将推理步骤压缩为潜在状态，或隐式CoT压缩，提供了一种节省标记的替代方案。然而，CoT压缩背后的机制仍不清楚。在本文中，我们首次对学习内化中间推理步骤的困难进行了理论分析。通过引入Order-r互动，我们证明高阶逻辑依赖的学习信号会指数级衰减，以解决不可简化的问题，跳过中间步骤不可避免地导致高阶互动障碍。为了在实证上验证这一点，我们引入了NatBool-DAG，这是一个具有挑战性的基准测试，旨在强制执行不可简化的逻辑推理并消除语义快捷方式。在我们理论发现的指导下，我们提出了ALiCoT（对齐隐式CoT），这是一个新颖的框架，通过将潜在标记分布与中间推理状态对齐来克服信号衰减。实验结果表明，ALiCoT成功解锁了高效的推理：它实现了54.4倍的加速，同时保持了与显式CoT相当的性能。

更新时间: 2026-01-29 11:42:03

领域: cs.AI

下载: http://arxiv.org/abs/2601.21576v1

A Formal Comparison Between Chain of Thought and Latent Thought

Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits more efficient parallel computation than inherently sequential CoT. In contrast, CoT enables approximate counting and sampling through stochastic decoding. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.

Updated: 2026-01-29 11:34:51

标题: 一种正式比较链式思维和潜在思维

摘要: 思维链（CoT）通过显式生成中间标记在大型语言模型中引发推理。相比之下，潜在思维推理直接在连续的潜在空间中运行，使得计算超越离散的语言表示。虽然这两种方法都利用迭代计算，但它们的比较能力尚未被充分探讨。在这项工作中，我们提出了一项正式分析，表明潜在思维比固有顺序的CoT更适合并行计算。相比之下，CoT通过随机解码实现近似计数和抽样。这些分离表明了深度驱动递归更适合的任务，从而为选择推理范式提供了实用指导。

更新时间: 2026-01-29 11:34:51

领域: cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2509.25239v2

Signal-Adaptive Trust Regions for Gradient-Free Optimization of Recurrent Spiking Neural Networks

Recurrent spiking neural networks (RSNNs) are a promising substrate for energy-efficient control policies, but training them for high-dimensional, long-horizon reinforcement learning remains challenging. Population-based, gradient-free optimization circumvents backpropagation through non-differentiable spike dynamics by estimating gradients. However, with finite populations, high variance of these estimates can induce harmful and overly aggressive update steps. Inspired by trust-region methods in reinforcement learning that constrain policy updates in distribution space, we propose \textbf{Signal-Adaptive Trust Regions (SATR)}, a distributional update rule that constrains relative change by bounding KL divergence normalized by an estimated signal energy. SATR automatically expands the trust region under strong signals and contracts it when updates are noise-dominated. We instantiate SATR for Bernoulli connectivity distributions, which have shown strong empirical performance for RSNN optimization. Across a suite of high-dimensional continuous-control benchmarks, SATR improves stability under limited populations and reaches competitive returns against strong baselines including PPO-LSTM. In addition, to make SATR practical at scale, we introduce a bitset implementation for binary spiking and binary weights, substantially reducing wall-clock training time and enabling fast RSNN policy search.

Updated: 2026-01-29 11:34:49

标题: 适用于无梯度优化的递归脉冲神经网络的信号自适应信任区域

摘要: Recurrent spiking神经网络（RSNNs）是一种很有前景的能源高效控制策略基础，但是对于高维度、长期回报强化学习训练仍然具有挑战性。基于种群、无梯度优化可以通过估计梯度来规避通过非可微分脉冲动力学的反向传播。然而，由于有限种群，这些估计的高方差可能导致有害和过于激进的更新步骤。受到强化学习中信任区域方法的启发，该方法通过在分布空间中约束策略更新，我们提出了信号自适应信任区域（SATR），这是一种分布式更新规则，通过以估计的信号能量为标准化因子，限制KL散度的相对变化。SATR会在信号强时自动扩展信任区域，并在更新被噪声主导时收缩。我们为伯努利连接分布实例化了SATR，这对于RSNN优化已经展现出了强大的实证性能。在一系列高维连续控制基准测试中，SATR提高了在有限种群下的稳定性，并且在与包括PPO-LSTM在内的强基线的竞争回报中取得了竞争性结果。另外，为了使SATR在规模上实际可行，我们引入了一个位集实现，用于二进制脉冲和二进制权重，大大减少了墙钟训练时间，从而实现了快速RSNN策略搜索。

更新时间: 2026-01-29 11:34:49

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21572v1

Shaping capabilities with token-level data filtering

Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels with sufficient pretraining compute.

Updated: 2026-01-29 11:34:01

标题: 用令牌级数据筛选塑造能力

摘要: 目前减少语言模型中不受欢迎的功能的方法主要是事后处理的，因此容易被对手绕过。一种自然的替代方法是在预训练过程中塑造功能。在去除医疗功能的代理任务上，我们展示了简单的预训练数据过滤干预方法在规模上是高效、稳健且廉价的。受数据归因工作的启发，我们发现过滤标记比过滤文档更有效，以更低的成本实现对不受欢迎功能的打击。我们训练了跨越两个数量级的模型，然后证明随着规模的扩大，过滤变得更加有效：对于我们最大的模型，标记过滤导致忘记领域计算速度减慢了7000倍。我们还表明，使用标记过滤训练的模型仍然可以在忘记领域上对齐。在这个过程中，我们引入了一种使用稀疏自动编码器标记标记并提炼廉价、高质量分类器的方法。我们还证明，通过充分的预训练计算，过滤可以对噪声标签具有稳健性。

更新时间: 2026-01-29 11:34:01

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21571v1

EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots

The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and science discovery, we introduce \textsc{EmboCoach-Bench}, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5\% in average success rate; (2) agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in embodied AI field.

Updated: 2026-01-29 11:33:49

标题: EmboCoach-Bench：对发展中的具身机器人进行人工智能代理基准测试

摘要: 身体智能人工智能领域正在迅速向通用目标机器人系统发展，这得益于高保真度模拟和大规模数据收集。然而，这种扩展能力仍然受到严重制约，依赖于对复杂奖励塑造和跨异构后端的超参数调整等劳动密集型手动监督。受LLMs在软件自动化和科学发现中的成功启发，我们引入了EmboCoach-Bench，一个评估LLM代理能力的基准，以自主工程化身体政策。跨越32个专家策划的RL和IL任务，我们的框架将可执行代码作为通用接口。我们超越了静态生成，评估了一个动态的闭环工作流程，在这个过程中，代理利用环境反馈迭代地起草、调试和优化解决方案，包括从基于物理的奖励设计到扩散政策等政策架构的改进。广泛的评估得出了三个关键见解：（1）自主代理可以在平均成功率上比人工工程基线高出26.5％；（2）具有环境反馈的代理工作流有效地加强了政策发展，并显著缩小了开源和专有模型之间的性能差距；（3）代理展现出对病态工程案例的自我修正能力，通过迭代的模拟中断调试成功地从接近完全失败的任务绩效中复活。最终，这项工作为自我演进的身体智能奠定了基础，加速了从劳动密集型手动调整向可扩展的自主工程化在身体智能人工智能领域的范式转变。

更新时间: 2026-01-29 11:33:49

领域: cs.AI,cs.RO

下载: http://arxiv.org/abs/2601.21570v1

Untargeted Jailbreak Attack

Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM output with predefined target responses. However, restricting the objective as inducing fixed targets inherently constrains the adversarial search space, limiting the overall attack efficacy. Furthermore, existing methods typically require numerous optimization iterations to fulfill the large gap between the fixed target and the original LLM output, resulting in low attack efficiency. To overcome these limitations, we propose the first gradient-based untargeted jailbreak attack (UJA), which relies on an untargeted objective to maximize the unsafety probability of the LLM output, without enforcing any response patterns. For tractable optimization, we further decompose this objective into two differentiable sub-objectives to search the optimal harmful response and the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to existing attacks, UJA's unrestricted objective significantly expands the search space, enabling more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations show that UJA achieves over 80\% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming the state-of-the-art gradient-based attacks by over 30\%.

Updated: 2026-01-29 11:30:58

标题: 非目标化越狱攻击

摘要: 现有的基于梯度的针对大型语言模型（LLMs）的越狱攻击通常优化对抗性后缀，以使LLM输出与预定义的目标响应对齐。然而，将目标限制为诱导固定目标固有地限制了对抗性搜索空间，限制了整体攻击效果。此外，现有方法通常需要大量优化迭代来弥合固定目标和原始LLM输出之间的巨大差距，导致攻击效率低下。为了克服这些限制，我们提出了第一个基于梯度的非定向越狱攻击（UJA），它依赖于一个非定向目标，以最大化LLM输出的不安全概率，而不强制执行任何响应模式。为了便于优化，我们进一步将这个目标分解为两个可微分的子目标，以搜索最佳的有害响应和相应的对抗性提示，并通过理论分析验证了这种分解。与现有攻击相比，UJA的无限制目标显著扩大了搜索空间，实现了对LLM漏洞更灵活、更高效的探索。广泛的评估表明，UJA在只进行100次优化迭代的情况下，对最近的安全对齐LLMs实现了超过80％的攻击成功率，超过了现有基于梯度的攻击的30％以上。

更新时间: 2026-01-29 11:30:58

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2510.02999v3

Bridging Functional and Representational Similarity via Usable Information

We present a unified framework for quantifying the similarity between representations through the lens of \textit{usable information}, offering a rigorous theoretical and empirical synthesis across three key dimensions. First, addressing functional similarity, we establish a formal link between stitching performance and conditional mutual information. We further reveal that stitching is inherently asymmetric, demonstrating that robust functional comparison necessitates a bidirectional analysis rather than a unidirectional mapping. Second, concerning representational similarity, we prove that reconstruction-based metrics and standard tools (e.g., CKA, RSA) act as estimators of usable information under specific constraints. Crucially, we show that similarity is relative to the capacity of the predictive family: representations that appear distinct to a rigid observer may be identical to a more expressive one. Third, we demonstrate that representational similarity is sufficient but not necessary for functional similarity. We unify these concepts through a task-granularity hierarchy: similarity on a complex task guarantees similarity on any coarser derivative, establishing representational similarity as the limit of maximum granularity: input reconstruction.

Updated: 2026-01-29 11:30:55

标题: 通过可用信息桥接功能和表征相似性

摘要: 我们提出了一个统一的框架，通过“可用信息”的视角来量化表示之间的相似性，提供了在三个关键维度上的严格的理论和实证综合。首先，关于功能相似性，我们建立了性能和条件互信息之间的形式联系。我们进一步揭示了缝合本质上是不对称的，证明了稳健的功能比较需要双向分析而不是单向映射。其次，关于表示相似性，我们证明了基于重建的度量和标准工具（例如，CKA，RSA）在特定约束条件下作为可用信息的估计器。至关重要的是，我们表明相似性相对于预测家族的能力而言，表示对于刚性观察者可能看起来是不同的，但对于更具表现力的观察者可能是相同的。第三，我们证明了表示相似性对于功能相似性是充分但不是必要的。我们通过任务粒度层次结构统一了这些概念：在复杂任务上的相似性确保了在任何更粗糙的导数上的相似性，将表示相似性建立为最大粒度的极限：输入重建。

更新时间: 2026-01-29 11:30:55

领域: cs.LG

下载: http://arxiv.org/abs/2601.21568v1

A Comprehensive Evaluation on Quantization Techniques for Large Language Models

For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often evaluated under different settings because a method typically contains multiple components. Analyzing connections among existing methods is important for deeper understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. To our knowledge, such a fair and extensive investigation remains critically underexplored. To better understand connections, first, we decouple published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. The former is a preprocessing step that reduces outlier impact by flattening the data distribution; the latter offsets quantization errors to improve performance. Second, we evaluate and analyze the impact of different settings, including granularity and symmetry. Third, we analyze and evaluate the latest MXFP4 and NVFP4 data formats and their performance. Our experiments first demonstrate that optimized rotation and scaling yield the best pre-quantization performance, and that combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone for error mitigation. Second, finer granularity improves performance but increases storage overhead. Third, we find that scaling-factor format and precision greatly affect FP4 performance, and that rotation-based strategies effective for INT4 offer limited gains for MXFP4 and NVFP4, motivating further study.

Updated: 2026-01-29 11:28:59

标题: 大型语言模型量化技术的全面评估

摘要: 对于大型语言模型（LLMs），后训练量化（PTQ）可以显著减少内存占用和计算开销。模型量化正在迅速发展。虽然许多论文报告了突破性结果，但它们通常在不同设置下进行评估，因为一种方法通常包含多个组件。分析现有方法之间的联系对于更深入的理解至关重要。为了弥合这些差距，我们对最先进的方法进行了广泛审查，并在相同条件下进行了全面评估，以进行公正比较。据我们所知，这样一项公正而全面的调查在批评上仍未得到充分开发。为了更好地理解联系，首先，我们将已发表的量化方法分解为两个步骤：预量化转换和量化误差缓解。前者是一个预处理步骤，通过平坦化数据分布来减少异常值的影响；后者抵消量化误差以提高性能。其次，我们评估和分析不同设置的影响，包括粒度和对称性。第三，我们分析和评估最新的MXFP4和NVFP4数据格式及其性能。我们的实验首先表明，优化的旋转和缩放产生了最佳的预量化性能，并且将低秩补偿与GPTQ结合可以偶尔优于单独使用GPTQ进行误差缓解。其次，更细的粒度提高了性能，但增加了存储开销。第三，我们发现，缩放因子的格式和精度极大地影响FP4的性能，并且基于旋转的策略对INT4有效，但对MXFP4和NVFP4的增益有限，这促使进一步研究。

更新时间: 2026-01-29 11:28:59

领域: cs.LG

下载: http://arxiv.org/abs/2507.17417v3

Representation Unlearning: Forgetting through Information Compression

Machine unlearning seeks to remove the influence of specific training data from a model, a need driven by privacy regulations and robustness concerns. Existing approaches typically modify model parameters, but such updates can be unstable, computationally costly, and limited by local approximations. We introduce Representation Unlearning, a framework that performs unlearning directly in the model's representation space. Instead of modifying model parameters, we learn a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. We derive variational surrogates that make this objective tractable and show how they can be instantiated in two practical regimes: when both retain and forget data are available, and in a zero-shot setting where only forget data can be accessed. Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.

Updated: 2026-01-29 11:28:02

标题: 再现的去学习：通过信息压缩来遗忘

摘要: 机器遗忘旨在从模型中消除特定训练数据的影响，这是由隐私法规和鲁棒性问题驱动的需求。现有方法通常修改模型参数，但这种更新可能不稳定、计算成本高昂，并且受到局部逼近的限制。我们介绍了表示遗忘，这是一个在模型的表示空间中直接进行遗忘的框架。我们不是修改模型参数，而是学习一个在表示上施加信息瓶颈的转换：最大化与保留数据的互信息，同时抑制关于要遗忘数据的信息。我们推导出变分替代品，使这个目标变得可行，并展示了它们如何在两个实际情况下实现：当保留和遗忘数据都可用时，以及在只能访问遗忘数据的零样本设置中。跨多个基准测试的实验表明，表示遗忘实现了更可靠的遗忘、更好的效用保留以及比基于参数的基线更高的计算效率。

更新时间: 2026-01-29 11:28:02

领域: cs.LG

下载: http://arxiv.org/abs/2601.21564v1

SAL: Selective Adaptive Learning for Backpropagation-Free Training with Sparsification

Standard deep learning relies on Backpropagation (BP), which is constrained by biologically implausible weight symmetry and suffers from significant gradient interference within dense representations. To mitigate these bottlenecks, we propose Selective Adaptive Learning (SAL), a training method that combines selective parameter activation with adaptive area partitioning. Specifically, SAL decomposes the parameter space into mutually exclusive, sample-dependent regions. This decoupling mitigates gradient interference across divergent semantic patterns and addresses explicit weight symmetry requirements through our refined feedback alignment. Empirically, SAL demonstrates competitive convergence rates, leading to improved classification performance across 10 standard benchmarks. Additionally, SAL achieves numerical consistency and competitive accuracy even in deep regimes (up to 128 layers) and large-scale models (up to 1B parameters). Our approach is loosely inspired by biological learning mechanisms, offering a plausible alternative that contributes to the study of scalable neural network training.

Updated: 2026-01-29 11:26:26

标题: SAL：选择性自适应学习以稀疏化实现无反向传播训练

摘要: 标准的深度学习依赖于反向传播（BP），受到生物不合理的权重对称限制，并在密集表示中遭受显著的梯度干扰。为了缓解这些瓶颈，我们提出了选择性自适应学习（SAL）这一训练方法，它将选择性参数激活与自适应区域划分相结合。具体来说，SAL将参数空间分解为互相排斥、依赖样本的区域。这种解耦减轻了梯度干扰跨不同语义模式，并通过我们精细的反馈对齐解决了明确的权重对称要求。实证上，SAL展示了竞争性的收敛速率，导致在10个标准基准测试中的改进分类性能。此外，SAL在深层（高达128层）和大规模模型（高达10亿参数）中也实现了数值一致性和竞争性准确性。我们的方法受到生物学习机制的启发，提供了一种有可能的替代方案，有助于研究可扩展的神经网络训练。

更新时间: 2026-01-29 11:26:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21561v1

HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction

Predicting spatial gene expression from H&E histology offers a scalable and clinically accessible alternative to sequencing, but realizing clinical impact requires models that generalize across cancer types and capture biologically coherent signals. Prior work is often limited to per-cancer settings and variance-based evaluation, leaving functional relevance underexplored. We introduce HistoPrism, an efficient transformer-based architecture for pan-cancer prediction of gene expression from histology. To evaluate biological meaning, we introduce a pathway-level benchmark, shifting assessment from isolated gene-level variance to coherent functional pathways. HistoPrism not only surpasses prior state-of-the-art models on highly variable genes , but also more importantly, achieves substantial gains on pathway-level prediction, demonstrating its ability to recover biologically coherent transcriptomic patterns. With strong pan-cancer generalization and improved efficiency, HistoPrism establishes a new standard for clinically relevant transcriptomic modeling from routinely available histology.

Updated: 2026-01-29 11:25:14

标题: HistoPrism：通过基因表达预测解锁全癌组织学的功能通路分析

摘要: 从H＆E组织学预测空间基因表达为测序提供了一种可扩展且临床可访问的替代方案，但要实现临床影响，需要跨癌症类型泛化并捕获生物上连贯信号的模型。先前的工作通常局限于每种癌症的设置和基于方差的评估，功能相关性尚未得到充分探索。我们引入了HistoPrism，这是一种基于变压器的高效体系结构，用于从组织学中进行全癌症基因表达预测。为了评估生物学意义，我们引入了一个基于通路水平的基准，将评估重点从孤立的基因水平方差转移到连贯的功能通路。HistoPrism不仅在高变异基因上超越了先前的最先进模型，而且更重要的是，在通路水平预测上取得了实质性收益，展示了其恢复生物连贯转录组模式的能力。通过强大的全癌症泛化和提高的效率，HistoPrism为从常规可用的组织学中建立临床相关的转录组建模设立了新的标准。

更新时间: 2026-01-29 11:25:14

领域: cs.LG

下载: http://arxiv.org/abs/2601.21560v1

Convergence Analysis of Randomized Subspace Normalized SGD under Heavy-Tailed Noise

Randomized subspace methods reduce per-iteration cost; however, in nonconvex optimization, most analyses are expectation-based, and high-probability bounds remain scarce even under sub-Gaussian noise. We first prove that randomized subspace SGD (RS-SGD) admits a high-probability convergence bound under sub-Gaussian noise, achieving the same order of oracle complexity as prior in-expectation results. Motivated by the prevalence of heavy-tailed gradients in modern machine learning, we then propose randomized subspace normalized SGD (RS-NSGD), which integrates direction normalization into subspace updates. Assuming the noise has bounded $p$-th moments, we establish both in-expectation and high-probability convergence guarantees, and show that RS-NSGD can achieve better oracle complexity than full-dimensional normalized SGD.

Updated: 2026-01-29 11:23:04

标题: 随机子空间归一化SGD在重尾噪声下的收敛性分析

摘要: 随机子空间方法降低了每次迭代的成本；然而，在非凸优化中，大多数分析都是基于期望的，即使在次高斯噪声下，高概率界仍然稀缺。我们首先证明，随机子空间随机梯度下降（RS-SGD）在次高斯噪声下具有高概率收敛界，实现了与先前期望结果相同阶的预期复杂度。受现代机器学习中重尾梯度的普遍存在启发，我们提出了随机子空间归一化随机梯度下降（RS-NSGD），将方向归一化集成到子空间更新中。假设噪声具有有界的$p$-阶矩，我们建立了期望和高概率收敛保证，并展示RS-NSGD可以实现比全维归一化随机梯度下降更好的预期复杂度。

更新时间: 2026-01-29 11:23:04

领域: math.OC,cs.LG,stat.ML

下载: http://arxiv.org/abs/2601.20399v2

Meta Context Engineering via Agentic Skill Evolution

The operational efficacy of large language models relies heavily on their inference-time context. This has established Context Engineering (CE) as a formal discipline for optimizing these inputs. Current CE methods rely on manually crafted harnesses, such as rigid generation-reflection workflows and predefined context schemas. They impose structural biases and restrict context optimization to a narrow, intuition-bound design space. To address this, we introduce Meta Context Engineering (MCE), a bi-level framework that supersedes static CE heuristics by co-evolving CE skills and context artifacts. In MCE iterations, a meta-level agent refines engineering skills via agentic crossover, a deliberative search over the history of skills, their executions, and evaluations. A base-level agent executes these skills, learns from training rollouts, and optimizes context as flexible files and code. We evaluate MCE across five disparate domains under offline and online settings. MCE demonstrates consistent performance gains, achieving 5.6--53.8% relative improvement over state-of-the-art agentic CE methods (mean of 16.9%), while maintaining superior context adaptability, transferability, and efficiency in both context usage and training.

Updated: 2026-01-29 11:22:02

标题: 通过代理技能进化的元上下文工程化

摘要: 大型语言模型的运行效力在很大程度上取决于它们在推理时的上下文。这已经确立了上下文工程（CE）作为一种优化这些输入的正式学科。目前的CE方法依赖于手工制作的工具，例如严格的生成-反射工作流程和预定义的上下文模式。它们施加结构性偏见，将上下文优化限制在一个狭窄的、受直觉束缚的设计空间内。为了解决这个问题，我们引入了元上下文工程（MCE），这是一个双层框架，通过共同进化CE技能和上下文工件，超越了静态CE启发式。在MCE的迭代中，一个元层代理通过代理交叉，对技能的历史、执行和评估进行深思熟虑的搜索，进一步完善工程技能。一个基层代理执行这些技能，从训练中学习，优化上下文作为灵活的文件和代码。我们在离线和在线环境下跨越五个不同领域评估了MCE。MCE展现出了一致的性能提升，相对于最先进的代理CE方法，实现了5.6%至53.8%的相对改进（平均为16.9%），同时在上下文使用和训练中保持了卓越的上下文适应性、可传递性和效率。

更新时间: 2026-01-29 11:22:02

领域: cs.AI,cs.NE

下载: http://arxiv.org/abs/2601.21557v1

There Was Never a Bottleneck in Concept Bottleneck Models

Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs yield more interpretable representations, support principled concept-level interventions, and remain consistent with probability-theoretic foundations.

Updated: 2026-01-29 11:21:26

标题: 概念瓶颈模型中从未存在瓶颈

摘要: 深度学习表示通常难以解释，这可能会阻碍它们在敏感应用中的部署。概念瓶颈模型（CBMs）已被证明是一种有前途的方法，可以通过学习支持目标任务性能的表示来缓解这个问题，同时确保每个组件预测来自预定义集合的具体概念。在这项工作中，我们认为CBMs并没有施加真正的瓶颈：一个组件能够预测一个概念并不意味着它仅编码有关该概念的信息。这个缺陷引发了对可解释性和干预程序的有效性的担忧。为了克服这一限制，我们提出了Minimal Concept Bottleneck Models（MCBMs），它们引入了信息瓶颈（IB）目标，以约束每个表示组件仅保留与其对应概念相关的信息。这个IB是通过将变分正则化项添加到训练损失中来实现的。因此，MCBMs产生了更具可解释性的表示，支持基于原则的概念级干预，并与概率论基础保持一致。

更新时间: 2026-01-29 11:21:26

领域: cs.LG

下载: http://arxiv.org/abs/2506.04877v3

Divergence Results and Convergence of a Variance Reduced Version of ADAM

Stochastic optimization algorithms using exponential moving averages of the past gradients, such as ADAM, RMSProp and AdaGrad, have been having great successes in many applications, especially in training deep neural networks. ADAM in particular stands out as efficient and robust. Despite of its outstanding performance, ADAM has been proved to be divergent for some specific problems. We revisit the divergent question and provide divergent examples under stronger conditions such as in expectation or high probability. Under a variance reduction assumption, we show that an ADAM-type algorithm converges, which means that it is the variance of gradients that causes the divergence of original ADAM. To this end, we propose a variance reduced version of ADAM and provide a convergent analysis of the algorithm. Numerical experiments show that the proposed algorithm has as good performance as ADAM. Our work suggests a new direction for fixing the convergence issues.

Updated: 2026-01-29 11:21:16

标题: 分歧结果和ADAM的方差减少版本的收敛性

摘要: 随机优化算法，如ADAM、RMSProp和AdaGrad，利用过去梯度的指数移动平均，已在许多应用中取得了巨大成功，特别是在训练深度神经网络方面。其中，ADAM以其高效和稳健性脱颖而出。尽管ADAM表现出色，但已被证明在一些特定问题上会发散。我们重新审视了这个发散问题，并在更强的条件下提供了发散示例，如期望或高概率条件下。在方差减少的假设下，我们展示了一种ADAM类型算法的收敛性，这意味着梯度的方差导致了原始ADAM的发散。为此，我们提出了一种减少方差的ADAM版本，并对算法进行了收敛性分析。数值实验表明，所提出的算法性能与ADAM相当。我们的工作为修复收敛问题提供了一个新的方向。

更新时间: 2026-01-29 11:21:16

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2210.05607v2

Closing the Expression Gap in LLM Instructions via Socratic Questioning

A fundamental bottleneck in human-AI collaboration is the ``intention expression gap," the difficulty for humans to effectively convey complex, high-dimensional thoughts to AI. This challenge often traps users in inefficient trial-and-error loops and is exacerbated by the diverse expertise levels of users. We reframe this problem from passive instruction following to a Socratic collaboration paradigm, proposing an agent that actively probes for information to resolve its uncertainty about user intent. we name the proposed agent Nous, trained to acquire proficiency in this inquiry policy. The core mechanism of Nous is a training framework grounded in the first principles of information theory. Within this framework, we define the information gain from dialogue as an intrinsic reward signal, which is fundamentally equivalent to the reduction of Shannon entropy over a structured task space. This reward design enables us to avoid reliance on costly human preference annotations or external reward models. To validate our framework, we develop an automated simulation pipeline to generate a large-scale, preference-based dataset for the challenging task of scientific diagram generation. Comprehensive experiments, including ablations, subjective and objective evaluations, and tests across user expertise levels, demonstrate the effectiveness of our proposed framework. Nous achieves leading efficiency and output quality, while remaining robust to varying user expertise. In conclusion, our research provides a systematic methodology and a new perspective for addressing the issue of ambiguous intentions in complex human-machine collaboration.

Updated: 2026-01-29 11:18:20

标题: 通过苏格拉底式提问来弥合LLM指导中的表达差距

摘要: 人工智能与人类协作中的一个基本瓶颈是“意图表达差距”，即人类难以有效地向人工智能传达复杂、高维度的想法。这一挑战经常将用户困在低效的试错循环中，并且由于用户的不同专业水平而变得更加严重。我们将这个问题从被动指令遵循重新构想为苏格拉底式的协作范式，提出了一个主动探索信息以消除对用户意图的不确定性的代理。我们将这个提出的代理称为Nous，它经过训练，以获得在这种询问策略中的熟练度。Nous的核心机制是一个基于信息论第一原理的训练框架。在这个框架内，我们将对话中的信息增益定义为一种内在奖励信号，这在根本上等同于在结构化任务空间中减少Shannon熵。这种奖励设计使我们能够避免依赖昂贵的人类偏好注释或外部奖励模型。为了验证我们的框架，我们开发了一个自动化模拟流水线，生成一个大规模的基于偏好的数据集，用于具有挑战性的科学图表生成任务。包括消融实验、主观和客观评估，以及跨用户专业水平的测试在内的综合实验表明了我们提出的框架的有效性。Nous实现了领先的效率和输出质量，同时对不同用户专业水平保持了稳健性。总之，我们的研究为解决复杂人机协作中模糊意图问题提供了系统化方法和新视角。

更新时间: 2026-01-29 11:18:20

领域: cs.AI

下载: http://arxiv.org/abs/2510.27410v2

Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Multimodal Large Language Models demonstrate strong performance on multimodal benchmarks, yet often exhibit poor robustness when exposed to spurious modality interference, such as irrelevant text in vision understanding, or irrelevant visual content in question answering. At its core, modality interference refers to cases where spurious signals from non-essential modalities distort model decisions, which we systematically analyze through causal, perturbation-based diagnostic experiments. To address this problem, we propose a unified finetuning framework that combines heuristic and adversarial perturbation-based data augmentation with output-level consistency regularization between original and perturbed inputs. Extensive experiments across image-heavy, text-heavy, and multimodal benchmarks, spanning multiple MLLM architectures and model scales, demonstrate consistent improvements in unimodal robustness and generalization, while improving standard multimodal performance.

Updated: 2026-01-29 11:16:13

标题: 诊断和减轻多模态大型语言模型中的模态干扰

摘要: 多模态大语言模型在多模态基准测试中表现出色，但在面对噪声模态干扰时往往表现出较差的鲁棒性，例如在视觉理解中出现无关文本，或在问答中出现无关视觉内容。在其核心，模态干扰指的是来自非必要模态的噪声信号扭曲模型决策的情况，我们通过因果、扰动性的诊断实验系统地分析了这一问题。为了解决这个问题，我们提出了一个统一的微调框架，结合启发式和对抗性扰动的数据增强，以及原始和扰动输入之间的输出级一致性正则化。在跨越图像密集、文本密集和多模态基准测试的广泛实验中，涵盖多个MLLM架构和模型规模，我们展示了单模态鲁棒性和泛化性的一致提升，同时提高了标准多模态性能。

更新时间: 2026-01-29 11:16:13

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2505.19616v4

VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.

Updated: 2026-01-29 11:14:49

标题: VidLaDA：用于高效视频理解的双向扩散大型语言模型

摘要: 当前的视频大语言模型（Video LLMs）通常通过视觉编码器对帧进行编码，并使用自回归（AR）LLM进行理解和生成。然而，这种AR范式不可避免地面临着双重效率瓶颈：严格的单向注意力通过阻碍全局时空聚合来损害理解效率，而串行解码限制了生成效率。为了解决这个问题，我们提出了VidLaDA，这是一个基于扩散语言模型（DLMs）的视频LLM，利用双向注意力解锁了全面的时空建模，并并行解码令牌。为了进一步减轻扩散解码的计算开销，我们引入了MARS-Cache，一种加速策略，通过将异步视觉缓存刷新和逐帧块级注意力相结合来减少冗余。实验表明，VidLaDA与最先进的AR基线（例如Qwen2.5-VL和LLaVA-Video）相媲美，并且优于DLM基线，而MARS-Cache提供了超过12倍的加速，而不会影响准确性。代码和检查点可以在https://github.com/ziHoHe/VidLaDA中找到。

更新时间: 2026-01-29 11:14:49

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.17868v2

When LLMs Play the Telephone Game: Cultural Attractors as Conceptual Tools to Evaluate LLMs in Multi-turn Settings

As large language models (LLMs) start interacting with each other and generating an increasing amount of text online, it becomes crucial to better understand how information is transformed as it passes from one LLM to the next. While significant research has examined individual LLM behaviors, existing studies have largely overlooked the collective behaviors and information distortions arising from iterated LLM interactions. Small biases, negligible at the single output level, risk being amplified in iterated interactions, potentially leading the content to evolve towards attractor states. In a series of telephone game experiments, we apply a transmission chain design borrowed from the human cultural evolution literature: LLM agents iteratively receive, produce, and transmit texts from the previous to the next agent in the chain. By tracking the evolution of text toxicity, positivity, difficulty, and length across transmission chains, we uncover the existence of biases and attractors, and study their dependence on the initial text, the instructions, language model, and model size. For instance, we find that more open-ended instructions lead to stronger attraction effects compared to more constrained tasks. We also find that different text properties display different sensitivity to attraction effects, with toxicity leading to stronger attractors than length. These findings highlight the importance of accounting for multi-step transmission dynamics and represent a first step towards a more comprehensive understanding of LLM cultural dynamics.

Updated: 2026-01-29 11:14:32

标题: 当LLMs玩电话游戏：文化吸引子作为评估多轮设置中LLMs的概念工具

摘要: 随着大型语言模型（LLMs）开始相互交互并在线上生成越来越多的文本，更加重要的是要更好地理解信息在从一个LLM传递到下一个LLM时如何转化。虽然已经有大量研究探讨了单个LLM的行为，但现有研究在忽视迭代LLM交互产生的集体行为和信息失真方面存在较大空白。在单个输出水平上微小的偏差可能在迭代交互中被放大，潜在地导致内容向吸引态演化。在一系列电话游戏实验中，我们应用了从人类文化演变文献中借鉴的传输链设计：LLM代理者迭代地接收、生成和传输来自链中前一个代理者的文本。通过跟踪文本毒性、积极性、难度和长度在传输链中的演化，我们揭示了偏见和吸引态的存在，并研究了它们对初始文本、指令、语言模型和模型大小的依赖关系。例如，我们发现更为开放的指令会导致比更受限制的任务更强的吸引效应。我们还发现不同的文本属性对吸引效应显示出不同的敏感性，毒性比长度更容易导致更强的吸引态。这些发现突显了需要考虑多步传输动态的重要性，并代表了对LLM文化动态更全面理解的第一步。

更新时间: 2026-01-29 11:14:32

领域: physics.soc-ph,cs.AI,cs.MA

下载: http://arxiv.org/abs/2407.04503v4

Seldom: An Anonymity Network with Selective Deanonymization

While anonymity networks such as Tor provide invaluable privacy guarantees to society, they also enable all kinds of criminal activities. Consequently, many blameless citizens shy away from protecting their privacy using such technology for fear of being associated with criminals. To grasp the potential for alternative privacy protection for those users, we design Seldom, an anonymity network with integrated selective deanonymization that disincentivizes criminal activity. Seldom enables law enforcement agencies to selectively access otherwise anonymized identities of misbehaving users while providing technical guarantees preventing these access rights from being misused. Seldom further ensures translucency, as each access request is approved by a trustworthy consortium of impartial entities and eventually disclosed to the public (without interfering with ongoing investigations). To demonstrate Seldom's feasibility and applicability, we base our implementation on Tor, the most widely used anonymity network. Our evaluation indicates minimal latency, processing, and bandwidth overheads compared to Tor; Seldom's main costs stem from storing flow records and encrypted identities. With at most 636 TB of storage required in total to retain the encrypted identifiers of a Tor-sized network for two years, Seldom provides a practical and deployable technical solution to the inherent problem of criminal activities in anonymity networks. As such, Seldom sheds new light on the potentials and limitations when integrating selective deanonymization into anonymity networks.

Updated: 2026-01-29 11:13:52

标题: 稀有：一种具有选择性去匿名化的匿名网络

摘要: 尽管像Tor这样的匿名网络为社会提供了宝贵的隐私保障，但它们也使各种犯罪活动变得可能。因此，许多无辜的公民因为担心与犯罪分子联系在一起而避免使用这种技术来保护他们的隐私。为了了解为这些用户提供替代隐私保护的潜力，我们设计了Seldom，这是一个集成了选择性去匿名化的匿名网络，用来阻止犯罪活动。Seldom使执法机构能够有选择性地访问行为不端用户的匿名身份，同时提供技术保证，防止这些访问权被滥用。Seldom进一步确保了透明度，因为每个访问请求都需要由一组中立的可信实体批准，并最终向公众披露（而不会干扰正在进行的调查）。为了展示Seldom的可行性和适用性，我们基于Tor这种最常用的匿名网络进行了实现。我们的评估表明，与Tor相比，Seldom的延迟、处理和带宽开销都很小；Seldom的主要成本来自存储流记录和加密身份。为了在两年内保留Tor规模网络的加密标识符，总共最多需要636TB的存储空间，Seldom为匿名网络中的犯罪活动固有问题提供了一个实用且可部署的技术解决方案。因此，Seldom为在匿名网络中集成选择性去匿名化时的潜力和局限性带来了新的启示。

更新时间: 2026-01-29 11:13:52

领域: cs.CR,cs.NI

下载: http://arxiv.org/abs/2412.00990v2

Entropy Guided Dynamic Patch Segmentation for Time Series Transformers

Patch-based transformers have emerged as efficient and improved long-horizon modeling architectures for time series modeling. Yet, existing approaches rely on temporally-agnostic patch construction, where arbitrary starting positions and fixed lengths fracture temporal coherence by splitting natural transitions across boundaries. This naive segmentation often disrupts short-term dependencies and weakens representation learning. We propose a novel Entropy-Guided Dynamic Patch Encoder (EntroPE), as a temporally informed framework that dynamically detects transition points via conditional entropy and dynamically places patch boundaries. This preserves temporal structure while retaining the computational benefits of patching. EntroPE consists of two key modules, namely an Entropy-based Dynamic Patcher (EDP) that applies information-theoretic criteria to locate natural temporal shifts and determine patch boundaries, and an Adaptive Patch Encoder (APE) that employs pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations. Extensive experiments on long-term forecasting, classification, and anomaly detection demonstrate that the proposed method improves both accuracy and efficiency, establishing entropy-guided dynamic patching as a promising new paradigm for time series modeling. Code is available at https://github.com/Sachithx/EntroPE.

Updated: 2026-01-29 11:07:55

标题: 熵引导的动态补丁分割方法用于时间序列变换器

摘要: 基于补丁的变压器已经成为时间序列建模的高效改进的长期建模架构。然而，现有方法依赖于时间不可知的补丁构建，其中任意的起始位置和固定长度通过在边界上分裂自然过渡来破坏时间连贯性。这种天真的分割经常会破坏短期依赖关系，并削弱表示学习。我们提出了一种新颖的基于熵引导的动态补丁编码器（EntroPE），作为一种在时间上具有信息的框架，通过条件熵动态检测过渡点，并动态放置补丁边界。这保留了时间结构，同时保留了补丁的计算优势。EntroPE由两个关键模块组成，即基于熵的动态补丁程序（EDP），它应用信息论标准来定位自然的时间变化点并确定补丁边界，以及自适应补丁编码器（APE），它利用池化和交叉注意力来捕获补丁内的依赖关系并生成固定大小的潜在表示。对长期预测、分类和异常检测的广泛实验表明，所提出的方法提高了准确性和效率，为时间序列建模建立了熵引导的动态补丁作为一种有前途的新范式。代码可在https://github.com/Sachithx/EntroPE找到。

更新时间: 2026-01-29 11:07:55

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.26157v2

Training slow silicon neurons to control extremely fast robots with spiking reinforcement learning

Air hockey demands split-second decisions at high puck velocities, a challenge we address with a compact network of spiking neurons running on a mixed-signal analog/digital neuromorphic processor. By co-designing hardware and learning algorithms, we train the system to achieve successful puck interactions through reinforcement learning in a remarkably small number of trials. The network leverages fixed random connectivity to capture the task's temporal structure and adopts a local e-prop learning rule in the readout layer to exploit event-driven activity for fast and efficient learning. The result is real-time learning with a setup comprising a computer and the neuromorphic chip in-the-loop, enabling practical training of spiking neural networks for robotic autonomous systems. This work bridges neuroscience-inspired hardware with real-world robotic control, showing that brain-inspired approaches can tackle fast-paced interaction tasks while supporting always-on learning in intelligent machines.

Updated: 2026-01-29 11:05:23

标题: 用脉冲强化学习训练慢速硅神经元控制极快速机器人

摘要: 空气曲棍球需要在高速冰球速度下做出即时决策，这是一个我们通过在混合信号模拟/数字神经形态处理器上运行的一种紧凑的尖峰神经网络来解决的挑战。通过共同设计硬件和学习算法，我们训练系统通过强化学习在极少的尝试次数中实现成功的冰球交互。该网络利用固定的随机连接来捕获任务的时间结构，并在输出层采用本地e-prop学习规则，以利用事件驱动活动进行快速高效的学习。结果是实时学习，其设置包括计算机和神经形态芯片在回路中，从而实现对机器人自主系统进行实际训练的可能。这项工作将受神经科学启发的硬件与现实世界的机器人控制联系起来，表明脑启发方法可以处理快节奏的交互任务，同时支持智能机器的始终在线学习。

更新时间: 2026-01-29 11:05:23

领域: cs.RO,cs.AI,cs.ET

下载: http://arxiv.org/abs/2601.21548v1

A Likely Geometry of Generative Models

The geometry of generative models serves as the basis for interpolation, model inspection, and more. Unfortunately, most generative models lack a principal notion of geometry without restrictive assumptions on either the model or the data dimension. In this paper, we construct a general geometry compatible with different metrics and probability distributions to analyze generative models that do not require additional training. We consider curves analogous to geodesics constrained to a suitable data distribution aimed at targeting high-density regions learned by generative models. We formulate this as a (pseudo)-metric and prove that this corresponds to a Newtonian system on a Riemannian manifold. We show that shortest paths in our framework can be characterized by a system of ordinary differential equations, which locally corresponds to geodesics under a suitable Riemannian metric. Numerically, we derive a novel algorithm to efficiently compute shortest paths and generalized Fréchet means. Quantitatively, we show that curves using our metric traverse regions of higher density than baselines across a range of models and datasets.

Updated: 2026-01-29 11:04:07

标题: 一个生成模型的可能几何结构

摘要: 生成模型的几何形状是插值、模型检查等的基础。不幸的是，大多数生成模型缺乏不涉及模型或数据维度的几何概念。在本文中，我们构建了一个与不同度量和概率分布兼容的一般几何形状，以分析生成模型，而无需额外的训练。我们考虑类似于测地线的曲线，受限于适当的数据分布，旨在针对生成模型学习的高密度区域。我们将其构造为（伪）度量，并证明这对应于黎曼流形上的牛顿系统。我们表明，在我们的框架中，最短路径可以通过一组常微分方程来表征，这在局部上对应于适当黎曼度量下的测地线。在数值上，我们推导了一种新算法，可以有效计算最短路径和广义Fréchet均值。定量上，我们表明，使用我们的度量的曲线在一系列模型和数据集上经过的高密度区域比基线要多。

更新时间: 2026-01-29 11:04:07

领域: cs.LG

下载: http://arxiv.org/abs/2510.26266v2

Multi-Modal Time Series Prediction via Mixture of Modulated Experts

Real-world time series exhibit complex and evolving dynamics, making accurate forecasting extremely challenging. Recent multi-modal forecasting methods leverage textual information such as news reports to improve prediction, but most rely on token-level fusion that mixes temporal patches with language tokens in a shared embedding space. However, such fusion can be ill-suited when high-quality time-text pairs are scarce and when time series exhibit substantial variation in scale and characteristics, thus complicating cross-modal alignment. In parallel, Mixture-of-Experts (MoE) architectures have proven effective for both time series modeling and multi-modal learning, yet many existing MoE-based modality integration methods still depend on token-level fusion. To address this, we propose Expert Modulation, a new paradigm for multi-modal time series prediction that conditions both routing and expert computation on textual signals, enabling direct and efficient cross-modal control over expert behavior. Through comprehensive theoretical analysis and experiments, our proposed method demonstrates substantial improvements in multi-modal time series prediction. The current code is available at https://github.com/BruceZhangReve/MoME

Updated: 2026-01-29 11:03:09

标题: 多模态时间序列预测方法：通过调制专家混合模型

摘要: 真实世界的时间序列表现出复杂和不断变化的动态，使得准确预测极为具有挑战性。最近的多模态预测方法利用文本信息（如新闻报道）来改善预测，但大多数方法依赖于在共享嵌入空间中将时间片段与语言标记混合的令牌级融合。然而，当高质量的时间-文本对稀缺且时间序列在尺度和特征上表现出显著变化时，这种融合可能不合适，从而使跨模态对齐变得复杂。与此同时，专家混合（MoE）架构已被证明对于时间序列建模和多模态学习都是有效的，然而许多现有的基于MoE的模态集成方法仍依赖于令牌级融合。为了解决这一问题，我们提出了专家调制，这是一种新的多模态时间序列预测范式，它在文本信号上同时调节路由和专家计算，从而实现对专家行为的直接和高效的跨模态控制。通过全面的理论分析和实验证明，我们提出的方法在多模态时间序列预测方面取得了显著的改进。当前的代码可在https://github.com/BruceZhangReve/MoME 上找到。

更新时间: 2026-01-29 11:03:09

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21547v1

ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory

Agentic large language model (LLM) systems rely on external memory for long-horizon state and concurrent multi-agent execution, but centralized indexes and heuristic partitions become bottlenecks as memory volume and parallel access grow. We present ShardMemo, a budgeted tiered memory service with Tier A per-agent working state, Tier B sharded evidence with shard-local approximate nearest neighbor (ANN) indexes, and Tier C, a versioned skill library. Tier B enforces scope-before-routing: structured eligibility constraints mask ineligible shards before routing or ANN search. We cast shard probing as masked mixture-of-experts (MoE) routing over eligible shards, probing up to $B_{\mathrm{probe}}$ shards via Top-$B_{\mathrm{probe}}$ or adaptive Top-$P$, and use cost-aware gating over profile/observation/session shard families; the router is trained from evidence-to-shard supervision. On LoCoMo, ShardMemo improves over the strongest baseline (GAM) by +5.11 to +6.82 F1 across question categories. Under a fixed-budget routing setting ($B_{\mathrm{probe}}=3$), ShardMemo improves over cosine-to-prototype shard routing by +6.87 F1 while reducing retrieval work (VecScan 521->414, -20.5%) and p95 latency (95->76 ms). On long-context HotpotQA, ShardMemo achieves 63.41/61.88/57.95 F1 at 56K/224K/448K tokens. On ToolBench, Tier C reaches 0.97 Precision@3 and 1.94 StepRed (+10.2% and +7.2% over embedding-similarity retrieval).

Updated: 2026-01-29 11:01:34

标题: ShardMemo：用于分片代理式LLM存储的遮蔽MoE路由

摘要: Agent大型语言模型（LLM）系统依赖外部存储器进行长期状态和并行多代理执行，但随着内存容量和并行访问的增长，集中式索引和启发式分区成为瓶颈。我们提出了ShardMemo，一个预算分层内存服务，其中Tier A是每个代理的工作状态，Tier B是分片证据，带有分片本地的近似最近邻（ANN）索引，Tier C是一个版本化的技能库。Tier B强制执行先范围再路由：结构化的资格约束在路由或ANN搜索之前屏蔽不合格的分片。我们将分片探测视为掩码混合专家（MoE）路由，通过Top-$B_{\mathrm{probe}}$或自适应Top-$P$探测最多$B_{\mathrm{probe}}$个分片，并通过基于成本的门控对概要/观察/会话分片族进行路由；路由器是从证据到分片的训练中学习的。在LoCoMo上，ShardMemo在不同问题类别之间的F1值上比最强基线（GAM）提高了+5.11至+6.82。在固定预算路由设置（$B_{\mathrm{probe}}=3$）下，ShardMemo在F1值上比余弦到原型分片路由提高了+6.87，同时减少了检索工作（VecScan 521->414，-20.5%）和p95延迟（95->76毫秒）。在长上下文HotpotQA上，ShardMemo在56K/224K/448K令牌处分别达到了63.41/61.88/57.95的F1。在ToolBench上，Tier C的Precision@3达到0.97，StepRed达到1.94（比嵌入相似性检索分别提高了+10.2%和+7.2%）。

更新时间: 2026-01-29 11:01:34

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21545v1

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-steps generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: \textbf{1) Bidirectional Temporal Perception}, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision ``anchors'' and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to 100+ NFEs Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.

Updated: 2026-01-29 10:59:36

标题: 双锚插值求解器用于加速生成建模

摘要: 流匹配（FM）模型已经成为高保真度综合的主导范式。然而，它们依赖于迭代普通微分方程（ODE）求解，导致了显著的延迟瓶颈。现有的解决方案面临一个两难选择：无需训练的求解器在低神经功能评估（NFEs）时性能下降明显，而基于训练的一步或几步生成方法则需要巨大的训练成本并且缺乏即插即用的灵活性。为了弥合这一差距，我们提出了Bi-Anchor插值求解器（BA-solver）。BA-solver保留了标准无需训练求解器的灵活性，同时通过引入一个轻量级的SideNet（1-2%的骨干大小）在冻结的骨干旁边实现了显著的加速。具体而言，我们的方法建立在两个协同组件之上：1）双向时间感知，其中SideNet学习近似未来和历史速度，而无需重新训练沉重的骨干；以及2）双锚速度积分，利用SideNet和两个锚速度有效地近似批量高阶积分的中间速度。通过利用骨干建立高精度的“锚点”和SideNet来密集轨迹，BA-solver实现了大间隔尺寸和最小化误差。在ImageNet-256^2上的实证结果表明，BA-solver在只进行10次NFEs时就可以达到与100次以上NFEs欧拉求解器相媲美的生成质量，并且在仅5次NFEs时保持高度保真度，且产生的训练成本可忽略不计。此外，BA-solver确保与现有生成管线的无缝集成，促进像图像编辑这样的下游任务。

更新时间: 2026-01-29 10:59:36

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.21542v1

A Trainable Optimizer

The concept of learning to optimize involves utilizing a trainable optimization strategy rather than relying on manually defined full gradient estimations such as ADAM. We present a framework that jointly trains the full gradient estimator and the trainable weights of the model. Specifically, we prove that pseudo-linear TO (Trainable Optimizer), a linear approximation of the full gradient, matches SGD's convergence rate while effectively reducing variance. Pseudo-linear TO incurs negligible computational overhead, requiring only minimal additional tensor multiplications. To further improve computational efficiency, we introduce two simplified variants of Pseudo-linear TO. Experiments demonstrate that TO methods converge faster than benchmark algorithms (e.g., ADAM) in both strongly convex and non-convex settings, and fine tuning of an LLM.

Updated: 2026-01-29 10:58:15

标题: 一个可训练的优化器

摘要: 学习优化的概念涉及利用可训练的优化策略，而不是依赖于手动定义的完整梯度估计，如ADAM。我们提出了一个框架，可以同时训练完整梯度估计器和模型的可训练权重。具体来说，我们证明了伪线性TO（可训练优化器），一种完整梯度的线性近似，可以匹配SGD的收敛速率，同时有效地减少方差。伪线性TO几乎不会带来计算开销，只需要最少的额外张量乘法。为了进一步提高计算效率，我们引入了两种简化的Pseudo-linear TO变体。实验证明，TO方法在强凸和非凸设置以及LLM的微调中比基准算法（如ADAM）收敛更快。

更新时间: 2026-01-29 10:58:15

领域: cs.LG

下载: http://arxiv.org/abs/2508.01764v2

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

Search agents connect LLMs to the Internet, enabling them to access broader and more up-to-date information. However, this also introduces a new threat surface: unreliable search results can mislead agents into producing unsafe outputs. Real-world incidents and our two in-the-wild observations show that such failures can occur in practice. To study this threat systematically, we propose SafeSearch, an automated red-teaming framework that is scalable, cost-efficient, and lightweight, enabling harmless safety evaluation of search agents. Using this, we generate 300 test cases spanning five risk categories (e.g., misinformation and prompt injection) and evaluate three search agent scaffolds across 17 representative LLMs. Our results reveal substantial vulnerabilities in LLM-based search agents, with the highest ASR reaching 90.5% for GPT-4.1-mini in a search-workflow setting. Moreover, we find that common defenses, such as reminder prompting, offer limited protection. Overall, SafeSearch provides a practical way to measure and improve the safety of LLM-based search agents. Our codebase and test cases are publicly available: https://github.com/jianshuod/SafeSearch.

Updated: 2026-01-29 10:58:06

标题: SafeSearch：基于LLM的搜索代理的自动红队行动

摘要: 搜索代理连接LLMs到互联网，使它们能够访问更广泛和更及时的信息。然而，这也引入了一个新的威胁面：不可靠的搜索结果可能会误导代理生成不安全的输出。真实世界中的事件和我们的两次野外观察表明，这种失败在实践中可能会发生。为了系统地研究这一威胁，我们提出了SafeSearch，这是一个可扩展、经济高效和轻量级的自动红队框架，可以对搜索代理进行无害的安全评估。利用这一框架，我们生成了300个测试用例，涵盖五个风险类别（例如，错误信息和提示注入），并评估了17个代表性LLMs上的三个搜索代理支架。我们的结果显示，在基于LLM的搜索代理中存在重大漏洞，其中在搜索工作流设置中，GPT-4.1-mini的最高ASR达到了90.5%。此外，我们发现常见的防御措施，如提醒提示，提供了有限的保护。总的来说，SafeSearch提供了一种实用的方法来衡量和改进基于LLM的搜索代理的安全性。我们的代码库和测试用例可以公开获取：https://github.com/jianshuod/SafeSearch。

更新时间: 2026-01-29 10:58:06

领域: cs.AI,cs.CL,cs.CR

下载: http://arxiv.org/abs/2509.23694v4

EEG-based Graph-guided Domain Adaptation for Robust Cross-Session Emotion Recognition

Accurate recognition of human emotional states is critical for effective human-machine interaction. Electroencephalography (EEG) offers a reliable source for emotion recognition due to its high temporal resolution and its direct reflection of neural activity. Nevertheless, variations across recording sessions present a major challenge for model generalization. To address this issue, we propose EGDA, a framework that reduces cross-session discrepancies by jointly aligning the global (marginal) and class-specific (conditional) distributions, while preserving the intrinsic structure of EEG data through graph regularization. Experimental results on the SEED-IV dataset demonstrate that EGDA achieves robust cross-session performance, obtaining accuracies of 81.22%, 80.15%, and 83.27% across three transfer tasks, and surpassing several baseline methods. Furthermore, the analysis highlights the Gamma frequency band as the most discriminative and identifies the central-parietal and prefrontal brain regions as critical for reliable emotion recognition.

Updated: 2026-01-29 10:53:02

标题: 基于EEG的图引导域自适应用于跨会话情绪识别的稳健性

摘要: 准确识别人类情绪状态对于有效的人机交互至关重要。由于其高时间分辨率和对神经活动的直接反映，脑电图（EEG）为情绪识别提供了可靠的来源。然而，跨记录会话的变化是模型泛化的一项主要挑战。为了解决这个问题，我们提出了EGDA，这是一个通过同时对齐全局（边际）和类特定（条件）分布，同时通过图正则化保留EEG数据的内在结构的框架，以减少会话间的差异。SEED-IV数据集上的实验结果表明，EGDA实现了稳健的会话间性能，在三个转移任务中分别获得了81.22％，80.15％和83.27％的准确率，并超过了几种基准方法。此外，分析突出了Gamma频带作为最具区分性的频带，并确定中央顶叶和前额脑区域对于可靠的情绪识别至关重要。

更新时间: 2026-01-29 10:53:02

领域: cs.LG

下载: http://arxiv.org/abs/2512.23526v2

ARGORA: Orchestrated Argumentation for Causally Grounded LLM Reasoning and Decision Making

Existing multi-expert LLM systems gather diverse perspectives but combine them through simple aggregation, obscuring which arguments drove the final decision. We introduce ARGORA, a framework that organizes multi-expert discussions into explicit argumentation graphs showing which arguments support or attack each other. By casting these graphs as causal models, ARGORA can systematically remove individual arguments and recompute outcomes, identifying which reasoning chains were necessary and whether decisions would change under targeted modifications. We further introduce a correction mechanism that aligns internal reasoning with external judgments when they disagree. Across diverse benchmarks and an open-ended use case, ARGORA achieves competitive accuracy and demonstrates corrective behavior: when experts initially disagree, the framework resolves disputes toward correct answers more often than it introduces new errors, while providing causal diagnostics of decisive arguments.

Updated: 2026-01-29 10:48:04

标题: ARGORA：用于因果基础LLM推理和决策制定的编排论证

摘要: 现有的多专家LLM系统收集了不同的观点，但通过简单的聚合将它们结合起来，使最终决定的驱动因素变得不明显。我们引入了ARGORA，一个框架，将多专家讨论组织成明确的论证图，显示哪些论点支持或攻击彼此。通过将这些图表现为因果模型，ARGORA可以系统地移除个别论点并重新计算结果，识别哪些推理链是必要的，以及在有针对性的修改下决策是否会改变。我们进一步引入了一个校正机制，当内部推理与外部判断不一致时，可以将它们对齐。在各种基准测试和开放式用例中，ARGORA实现了竞争性的准确性，并展示了纠正行为：当专家最初意见不一致时，该框架更经常解决争议以获得正确答案，而不是引入新的错误，同时提供决定性论点的因果诊断。

更新时间: 2026-01-29 10:48:04

领域: cs.AI

下载: http://arxiv.org/abs/2601.21533v1

On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.

Updated: 2026-01-29 10:47:21

标题: 关于大型视觉语言模型在视觉令牌压缩下的对抗鲁棒性

摘要: 视觉令牌压缩被广泛应用于加速大型视觉语言模型（LVLMs），通过修剪或合并视觉令牌，然而其对抗鲁棒性尚未被探索。我们展示了现有的基于编码器的攻击可能会显著高估压缩LVLMs的鲁棒性，原因是存在优化推理不匹配：扰动是在完整令牌表示上进行优化的，而推理是通过令牌压缩瓶颈进行的。为了解决这一差距，我们提出了压缩-对齐攻击（CAGE），该攻击将扰动优化与压缩推理进行对齐，而无需假设可以访问已部署的压缩机制或其令牌预算。CAGE结合了（i）预期特征破坏，将扭曲集中在可能在可行预算范围内幸存的令牌上，以及（ii）等级扭曲对齐，积极将令牌扭曲与等级分数对齐，以促进高度扭曲证据的保留。在不同代表性的即插即用压缩机制和数据集中，我们的结果表明，CAGE始终比基线实现更低的鲁棒准确性。这项工作强调，忽视压缩的鲁棒性评估可能过于乐观，呼吁进行压缩感知的安全评估和为高效LVLMs提供的防御。

更新时间: 2026-01-29 10:47:21

领域: cs.CR,cs.AI,cs.CV

下载: http://arxiv.org/abs/2601.21531v1

Machine Learning. The Science of Selection under Uncertainty

Learning, whether natural or artificial, is a process of selection. It starts with a set of candidate options and selects the more successful ones. In the case of machine learning the selection is done based on empirical estimates of prediction accuracy of candidate prediction rules on some data. Due to randomness of data sampling the empirical estimates are inherently noisy, leading to selection under uncertainty. The book provides statistical tools to obtain theoretical guarantees on the outcome of selection under uncertainty. We start with concentration of measure inequalities, which are the main statistical instrument for controlling how much an empirical estimate of expectation of a function deviates from the true expectation. The book covers a broad range of inequalities, including Markov's, Chebyshev's, Hoeffding's, Bernstein's, Empirical Bernstein's, Unexpected Bernstein's, kl, and split-kl. We then study the classical (offline) supervised learning and provide a range of tools for deriving generalization bounds, including Occam's razor, Vapnik-Chervonenkis analysis, and PAC-Bayesian analysis. The latter is further applied to derive generalization guarantees for weighted majority votes. After covering the offline setting, we turn our attention to online learning. We present the space of online learning problems characterized by environmental feedback, environmental resistance, and structural complexity. A common performance measure in online learning is regret, which compares performance of an algorithm to performance of the best prediction rule in hindsight, out of a restricted set of prediction rules. We present tools for deriving regret bounds in stochastic and adversarial environments, and under full information and bandit feedback.

Updated: 2026-01-29 10:46:19

标题: 机器学习。不确定性下选择的科学。

摘要: 学习，无论是自然的还是人工的，都是一个选择的过程。它从一组候选选项开始，并选择更成功的选项。在机器学习的情况下，选择是基于候选预测规则在某些数据上的预测准确性的经验估计完成的。由于数据抽样的随机性，经验估计在本质上是嘈杂的，导致在不确定性下进行选择。该书提供了统计工具，以获得选择在不确定性下结果的理论保证。我们从测度不等式开始，它是控制经验期望估计与真实期望之间偏离程度的主要统计工具。该书涵盖了广泛的不等式，包括马尔可夫、切比雪夫、霍夫丁、伯恩斯坦、经验伯恩斯坦、意外伯恩斯坦、kl和分裂kl。然后我们研究经典的（离线）监督学习，并提供了一系列工具来推导泛化边界，包括奥卡姆剃刀、Vapnik-Chervonenkis分析和PAC-Bayesian分析。后者进一步应用于推导加权多数投票的泛化保证。在涵盖离线设置之后，我们将注意力转向在线学习。我们介绍了在线学习问题的空间，其特点是环境反馈、环境阻力和结构复杂性。在线学习中的一个常见绩效度量是后悔，它比较算法的性能与事后最佳预测规则的性能，从一组受限制的预测规则中选择。我们提供了在随机和对抗环境以及全信息和贝叶斯反馈下推导后悔边界的工具。

更新时间: 2026-01-29 10:46:19

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2509.21547v2

Fast and Geometrically Grounded Lorentz Neural Networks

Hyperbolic space is quickly gaining traction as a promising geometry for hierarchical and robust representation learning. A core open challenge is the development of a mathematical formulation of hyperbolic neural networks that is both efficient and captures the key properties of hyperbolic space. The Lorentz model of hyperbolic space has been shown to enable both fast forward and backward propagation. However, we prove that, with the current formulation of Lorentz linear layers, the hyperbolic norms of the outputs scale logarithmically with the number of gradient descent steps, nullifying the key advantage of hyperbolic geometry. We propose a new Lorentz linear layer grounded in the well-known ``distance-to-hyperplane" formulation. We prove that our formulation results in the usual linear scaling of output hyperbolic norms with respect to the number of gradient descent steps. Our new formulation, together with further algorithmic efficiencies through Lorentzian activation functions and a new caching strategy results in neural networks fully abiding by hyperbolic geometry while simultaneously bridging the computation gap to Euclidean neural networks. Code available at: https://github.com/robertdvdk/hyperbolic-fully-connected.

Updated: 2026-01-29 10:44:32

标题: 快速且几何基础的洛伦兹神经网络

摘要: 双曲空间正迅速成为一种适用于分层和稳健表示学习的有前途的几何形式。一个核心的挑战是开发一种双曲神经网络的数学表达，既高效又能捕捉双曲空间的关键属性。已经证明双曲空间的洛伦兹模型能够实现快速的前向和反向传播。然而，我们证明，根据当前的洛伦兹线性层的表达方式，输出的双曲范数随着梯度下降步骤的数量呈对数级增长，使双曲几何的主要优势失效。我们提出了一种基于著名的“距离到超平面”的公式的新的洛伦兹线性层。我们证明，我们的表达方式导致输出的双曲范数随着梯度下降步骤的数量呈常规线性增长。我们的新表达方式，再加上通过洛伦兹激活函数和新的缓存策略进一步提高算法效率，使神经网络完全符合双曲几何，同时弥合了与欧几里德神经网络之间的计算差距。代码下载链接：https://github.com/robertdvdk/hyperbolic-fully-connected。

更新时间: 2026-01-29 10:44:32

领域: cs.LG

下载: http://arxiv.org/abs/2601.21529v1

Sustainable Materials Discovery in the Era of Artificial Intelligence

Artificial intelligence (AI) has transformed materials discovery, enabling rapid exploration of chemical space through generative models and surrogate screening. Yet current AI workflows optimize performance first, deferring sustainability to post synthesis assessment. This creates inefficiency by the time environmental burdens are quantified, resources have been invested in potentially unsustainable solutions. The disconnect between atomic scale design and lifecycle assessment (LCA) reflects fundamental challenges, data scarcity across heterogeneous sources, scale gaps from atoms to industrial systems, uncertainty in synthesis pathways, and the absence of frameworks that co-optimize performance with environmental impact. We propose to integrate upstream machine learning (ML) assisted materials discovery with downstream lifecycle assessment into a uniform ML-LCA environment. The framework ML-LCA integrates five components, information extraction for building materials-environment knowledge bases, harmonized databases linking properties to sustainability metrics, multi-scale models bridging atomic properties to lifecycle impacts, ensemble prediction of manufacturing pathways with uncertainty quantification, and uncertainty-aware optimization enabling simultaneous performance-sustainability navigation. Case studies spanning glass, cement, semiconductor photoresists, and polymers demonstrate both necessity and feasibility while identifying material-specific integration challenges. Realizing ML-LCA demands coordinated advances in data infrastructure, ex-ante assessment methodologies, multi-objective optimization, and regulatory alignment enabling the discovery of materials that are sustainable by design rather than by chance.

Updated: 2026-01-29 10:42:44

标题: 人工智能时代可持续材料发现

摘要: 人工智能（AI）已经改变了材料发现，通过生成模型和替代筛选，使化学空间的快速探索成为可能。然而，当前的AI工作流程首先优化性能，将可持续性推迟到后续合成评估。这样做会导致效率低下，当环境负担被量化时，资源已被投入到可能不可持续的解决方案中。原子尺度设计和生命周期评估（LCA）之间的脱节反映了基本挑战，即跨异质源的数据稀缺性、从原子到工业系统的尺度差距、合成途径的不确定性，以及缺乏能够将性能与环境影响共同优化的框架。我们提出将上游机器学习（ML）辅助材料发现与下游生命周期评估整合到统一的ML-LCA环境中。该框架ML-LCA整合了五个组成部分，即为构建材料-环境知识库提取信息，将属性与可持续性指标联系起来的协调数据库，将原子属性与生命周期影响联系起来的多尺度模型，带有不确定性量化的制造途径的集成预测，以及考虑不确定性的优化，实现同时性能-可持续性导航。以玻璃、水泥、半导体光刻胶和聚合物为例的案例研究展示了必要性和可行性，同时也确定了材料特定整合挑战。实现ML-LCA需要在数据基础设施、前瞻评估方法学、多目标优化和监管对齐方面取得协调进展，以便发现通过设计而非机会可持续的材料。

更新时间: 2026-01-29 10:42:44

领域: cond-mat.mtrl-sci,cs.AI

下载: http://arxiv.org/abs/2601.21527v1

KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization

We introduce KAPSO, a modular framework for autonomous program synthesis and optimization. Given a natural language goal and an evaluation method, KAPSO iteratively performs ideation, code synthesis and editing, execution, evaluation, and learning to improve a runnable artifact toward measurable objectives. Rather than treating synthesis as the endpoint, KAPSO uses synthesis as an operator within a long-horizon optimization loop, where progress is defined by evaluator outcomes. KAPSO targets long-horizon failures common in coding agents, including lost experimental state, brittle debugging, and weak reuse of domain expertise, by integrating three tightly coupled components. First, a git-native experimentation engine isolates each attempt as a branch, producing reproducible artifacts and preserving provenance across iterations. Second, a knowledge system ingests heterogeneous sources, including repositories, internal playbooks, and curated external resources such as documentation, scientific papers, and web search results, and organizes them into a structured representation that supports retrieval over workflows, implementations, and environment constraints. Third, a cognitive memory layer coordinates retrieval and maintains an episodic store of reusable lessons distilled from experiment traces (run logs, diffs, and evaluator feedback), reducing repeated error modes and accelerating convergence. We evaluated KAPSO on MLE-Bench (Kaggle-style ML competitions) and ALE-Bench (AtCoder heuristic optimization), and report end-to-end performance. Code Available at: https://github.com/Leeroo-AI/kapso

Updated: 2026-01-29 10:40:54

标题: KAPSO: 一个基于知识的自主程序合成和优化框架

摘要: 我们介绍了KAPSO，一个用于自主程序合成和优化的模块化框架。给定一个自然语言目标和一个评估方法，KAPSO迭代地执行构思、代码合成和编辑、执行、评估和学习，以改进可运行的工件以达到可衡量的目标。与将合成视为终点不同，KAPSO将合成视为长期优化循环中的一个运算符，其中进展由评估结果定义。 KAPSO针对编码代理中常见的长期失败，包括丢失实验状态、脆弱的调试和领域专业知识的弱重用，通过集成三个紧密耦合的组件来解决。首先，一个git原生的实验引擎将每次尝试隔离为一个分支，产生可重现的工件，并在迭代过程中保留来源。其次，一个知识系统吸收异构来源，包括存储库、内部手册和经过策划的外部资源，如文档、科学论文和网络搜索结果，并将它们组织成一个结构化表示，支持对工作流程、实现和环境约束的检索。第三，一个认知记忆层协调检索并维护一个从实验痕迹（运行日志、差异和评估者反馈）中提炼出的可重复使用的教训的情节存储，减少重复的错误模式并加快收敛速度。我们在MLE-Bench（类似Kaggle的机器学习竞赛）和ALE-Bench（AtCoder启发式优化）上评估了KAPSO，并报告了端到端的性能。代码可在以下链接找到：https://github.com/Leeroo-AI/kapso

更新时间: 2026-01-29 10:40:54

领域: cs.AI,cs.CL,cs.SE

下载: http://arxiv.org/abs/2601.21526v1

Industrial Internet Robot Collaboration System and Edge Computing Optimization

In industrial Internet environments, mobile robots must generate collision-free global routes under stochastic obstacle layouts and random perturbations in commanded linear and angular velocities. This paper models a differential-drive robot with nonholonomic constraints, then decomposes motion into obstacle avoidance, target turning, and target approaching behaviors to parameterize the control variables. Global path planning is formulated as a constrained optimization problem and converted into a weighted energy function that balances path length and collision penalties. A three-layer neural network represents the planning model, while simulated annealing searches for near-global minima and mitigates local traps. During execution, a fuzzy controller uses heading and lateral-offset errors to output wheel-speed differentials for rapid correction; edge-side computation is discussed to reduce robot-server traffic and latency. Matlab 2024 simulations report deviation within +-5 cm, convergence within 10 ms, and shorter paths than two baseline methods. The approach improves robustness of global navigation in practice.

Updated: 2026-01-29 10:39:12

标题: 工业互联网机器人协作系统和边缘计算优化

摘要: 在工业互联网环境中，移动机器人必须在随机障碍物布局和指令线速度和角速度的随机扰动下生成无碰撞的全局路径。本文对具有非完整约束的差动驱动机器人进行建模，然后将运动分解为障碍物避开、目标转向和目标接近行为，以参数化控制变量。全局路径规划被制定为一个约束优化问题，并转化为一个平衡路径长度和碰撞惩罚的加权能量函数。一个三层神经网络代表规划模型，而模拟退火搜索接近全局最小值并减轻局部陷阱。在执行过程中，模糊控制器使用航向和侧向偏差输出轮速差异以进行快速校正；边缘侧计算被讨论以减少机器人-服务器通信和延迟。Matlab 2024模拟报告偏差在+-5厘米内，收敛在10毫秒内，并且比两种基准方法的路径更短。这种方法改善了实际中全局导航的稳健性。

更新时间: 2026-01-29 10:39:12

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2504.02492v2

Explicit Credit Assignment through Local Rewards and Dependence Graphs in Multi-Agent Reinforcement Learning

To promote cooperation in Multi-Agent Reinforcement Learning, the reward signals of all agents can be aggregated together, forming global rewards that are commonly known as the fully cooperative setting. However, global rewards are usually noisy because they contain the contributions of all agents, which have to be resolved in the credit assignment process. On the other hand, using local reward benefits from faster learning due to the separation of agents' contributions, but can be suboptimal as agents myopically optimize their own reward while disregarding the global optimality. In this work, we propose a method that combines the merits of both approaches. By using a graph of interaction between agents, our method discerns the individual agent contribution in a more fine-grained manner than a global reward, while alleviating the cooperation problem with agents' local reward. We also introduce a practical approach for approximating such a graph. Our experiments demonstrate the flexibility of the approach, enabling improvements over the traditional local and global reward settings.

Updated: 2026-01-29 10:38:19

标题: 多智能体强化学习中的显式信用分配：通过本地奖励和依赖图

摘要: 为了促进多智能体强化学习中的合作，所有智能体的奖励信号可以聚合在一起，形成全局奖励，通常被称为完全合作设置。然而，全局奖励通常会存在噪音，因为它包含所有智能体的贡献，这些贡献必须在信用分配过程中解决。另一方面，使用局部奖励可以加快学习速度，因为智能体的贡献是分离的，但可能不够优化，因为智能体可能只关注自己的奖励而忽略全局最优性。在这项工作中，我们提出了一种结合了两种方法优点的方法。通过使用智能体之间的交互图，我们的方法以比全局奖励更精细的方式区分个体智能体的贡献，同时缓解了智能体本地奖励的合作问题。我们还介绍了一种近似这样的图的实用方法。我们的实验表明，这种方法的灵活性可以改进传统的局部和全局奖励设置。

更新时间: 2026-01-29 10:38:19

领域: cs.LG

下载: http://arxiv.org/abs/2601.21523v1

More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically-observed power-law behavior in pass@k leads to a sublinear growth of the coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a query method of LLMs that increases coverage@cost for any given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws.

Updated: 2026-01-29 10:37:32

标题: 同一个预算下提高大型语言模型推断效果：使用重置和丢弃（ReD）技术

摘要: 大型语言模型（LLMs）在可验证任务上的表现通常通过pass@k来衡量，即在k次尝试中至少一次正确回答问题的概率。在固定预算下，更合适的度量标准是coverage@cost，即作为总尝试次数函数的平均独特问题回答数。我们将这两个度量联系起来，并展示pass@k中经验观察到的幂律行为导致coverage@cost呈次线性增长（递减回报）。为了解决这个问题，我们提出了Reset-and-Discard（ReD），这是一种增加LLMs覆盖率的查询方法，无论pass@k形式如何，都可以在给定预算下增加coverage@cost。此外，在给定pass@k的情况下，我们可以定量预测使用ReD可以节省的总尝试次数。如果模型没有提供pass@k，ReD可以推断其幂律指数。使用HumanEval对三个LLMs进行的实验表明，ReD大大减少了达到所需覆盖率所需的尝试次数、标记和美元成本，同时还提供了一种有效的测量推理幂律的方式。

更新时间: 2026-01-29 10:37:32

领域: cs.LG,cond-mat.dis-nn,cs.AI,stat.ML

下载: http://arxiv.org/abs/2601.21522v1

Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols

Machine unlearning is a process to remove specific data points from a trained model while maintaining the performance on the retain data, addressing privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics under small-scale scenarios. We observe that this could lead to a false sense of security in unlearning approaches under real-world scenarios. In this paper, we conduct a comprehensive evaluation that employs representation-based evaluations of the unlearned model under large-scale scenarios to verify whether the unlearning approaches truly eliminate the targeted data from the model's representation perspective. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier, thereby achieving superior logit-based performance while maintaining representational similarity to the original model. Furthermore, we introduce a novel unlearning evaluation scenario in which the forgetting classes exhibit semantic similarity to downstream task classes, necessitating that feature representations diverge significantly from those of the original model, thus enabling a more thorough evaluation from a representation perspective. We hope our benchmark will serve as a standardized protocol for evaluating unlearning algorithms under realistic conditions.

Updated: 2026-01-29 10:37:16

标题: 我们真的在忘记吗？对机器遗忘评估协议的关键重新审视

摘要: 机器遗忘是一个过程，可以从训练模型中删除特定数据点，同时在保持对保留数据上性能的同时，满足隐私或法律要求。尽管其重要性，现有的遗忘评估往往集中在小规模场景下基于logit的指标上。我们观察到，这可能导致在现实场景下遗忘方法中对安全的虚假感觉。在本文中，我们进行了全面评估，采用基于代表性的评估方法，在大规模场景下验证遗忘方法是否真正从模型的表示角度消除了目标数据。我们的分析显示，当前最先进的遗忘方法要么完全降低了被遗忘模型的表现质量，要么仅仅修改了分类器，从而实现了更好的基于logit的性能，同时保持了与原模型的表示相似性。此外，我们引入了一个新颖的遗忘评估场景，其中被遗忘类别在语义上与下游任务类别展示出相似性，需要特征表示与原始模型大为不同，从而实现了更全面的表示角度评估。我们希望我们的基准测试将作为一个在真实条件下评估遗忘算法的标准化协议。

更新时间: 2026-01-29 10:37:16

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2503.06991v3

A Unified SPD Token Transformer Framework for EEG Classification: Systematic Comparison of Geometric Embeddings

Spatial covariance matrices of EEG signals are Symmetric Positive Definite (SPD) and lie on a Riemannian manifold, yet the theoretical connection between embedding geometry and optimization dynamics remains unexplored. We provide a formal analysis linking embedding choice to gradient conditioning and numerical stability for SPD manifolds, establishing three theoretical results: (1) BWSPD's $\sqrtκ$ gradient conditioning (vs $κ$ for Log-Euclidean) via Daleckii-Kreĭn matrices provides better gradient conditioning on high-dimensional inputs ($d \geq 22$), with this advantage reducing on low-dimensional inputs ($d \leq 8$) where eigendecomposition overhead dominates; (2) Embedding-Space Batch Normalization (BN-Embed) approximates Riemannian normalization up to $O(\varepsilon^2)$ error, yielding $+26\%$ accuracy on 56-channel ERP data but negligible effect on 8-channel SSVEP data, matching the channel-count-dependent prediction; (3) bi-Lipschitz bounds prove BWSPD tokens preserve manifold distances with distortion governed solely by the condition ratio $κ$. We validate these predictions via a unified Transformer framework comparing BWSPD, Log-Euclidean, and Euclidean embeddings within identical architecture across 1,500+ runs on three EEG paradigms (motor imagery, ERP, SSVEP; 36 subjects). Our Log-Euclidean Transformer achieves state-of-the-art performance on all datasets, substantially outperforming classical Riemannian classifiers and recent SPD baselines, while BWSPD offers competitive accuracy with similar training time.

Updated: 2026-01-29 10:35:13

标题: 一个统一的SPD令牌变换器框架用于脑电图分类：几何嵌入的系统性比较

摘要: EEG信号的空间协方差矩阵是对称正定的（SPD），并位于黎曼流形上，然而嵌入几何与优化动态之间的理论联系尚未被探索。我们提供了一种形式化分析，将嵌入选择与梯度调节和数值稳定性联系起来，为SPD流形建立了三个理论结果：（1）BWSPD的$ \sqrt κ $梯度调节（相对于对数欧几里得的$ κ $）通过Daleckii-Kreĭn矩阵在高维输入上提供更好的梯度调节（$ d \geq 22 $），在低维输入（$ d \leq 8 $）上这种优势会减少，其中特征分解的开销占主导地位；（2）嵌入空间批量归一化（BN-Embed）近似黎曼归一化，误差为$ O（\varepsilon^2）$，在56通道ERP数据上产生$ +26\% $的准确率，但对于8通道SSVEP数据几乎没有影响，与通道数相关的预测相匹配；（3）双Lipschitz边界证明BWSPD令牌保留流形距离，扭曲仅由条件比$ κ $决定。我们通过一个统一的Transformer框架验证了这些预测，该框架在三个EEG范例（运动想象、ERP、SSVEP；36名受试者）上进行了1,500多次运行，比较了BWSPD、对数欧几里得和欧几里得嵌入，我们的对数欧几里得Transformer在所有数据集上实现了最先进的性能，在经典的黎曼分类器和最近的SPD基线上表现出色，而BWSPD在类似的训练时间内提供了竞争性的准确性。

更新时间: 2026-01-29 10:35:13

领域: cs.LG,cs.HC

下载: http://arxiv.org/abs/2601.21521v1

LLM Watermark Evasion via Bias Inversion

Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests.

Updated: 2026-01-29 10:31:53

标题: 通过偏差反转实现LLM水印规避

摘要: 数字水印技术为检测LLM生成的内容提供了一种有前途的解决方案，然而在现实中无需查询（黑盒）逃避情况下的鲁棒性仍然是一个开放的挑战。现有的无查询攻击通常只能取得有限的成功，或严重扭曲语义含义。我们通过理论分析基于重写的逃避，填补了这一差距，证明通过略微降低采样绿色标记的平均条件概率，使得检测概率呈指数衰减。在这一洞察的指导下，我们提出了一种名为Bias-Inversion Rewriting Attack (BIRA)的实用无查询方法，该方法通过对通过标记意外性识别的代理抑制集应用负对数偏差。在实证方面，BIRA在各种数字水印方案中实现了最先进的逃避率（>99%），同时比之前的基线方法更好地保持语义保真度。我们的研究发现了当前数字水印方法的一个基本漏洞，并强调了进行严格压力测试的必要性。

更新时间: 2026-01-29 10:31:53

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2509.23019v4

Formal Verification of Noisy Quantum Reinforcement Learning Policies

Quantum reinforcement learning (QRL) aims to use quantum effects to create sequential decision-making policies that achieve tasks more effectively than their classical counterparts. However, QRL policies face uncertainty from quantum measurements and hardware noise, such as bit-flip, phase-flip, and depolarizing errors, which can lead to unsafe behavior. Existing work offers no systematic way to verify whether trained QRL policies meet safety requirements under specific noise conditions. We introduce QVerifier, a formal verification method that applies probabilistic model checking to analyze trained QRL policies with and without modeled quantum noise. QVerifier builds a complete model of the policy-environment interaction, incorporates quantum uncertainty directly into the transition probabilities, and then checks safety properties using the Storm model checker. Experiments across multiple QRL environments show that QVerifier precisely measures how different noise models influence safety, revealing both performance degradation and cases where noise can help. By enabling rigorous safety verification before deployment, QVerifier addresses a critical need: because access to quantum hardware is expensive, pre-deployment verification is essential for any safety-critical use of QRL. QVerifier targets a potential sweet spot between classical and quantum computation, where trained QRL policies could still be modeled classically for probabilistic model checking. When the policy was trained under matching noise conditions, this formal model is exact; when trained on physical hardware, it constitutes an idealized approximation, as unknown hardware noise prevents exact policy modeling.

Updated: 2026-01-29 10:31:25

标题: 噪声量子强化学习策略的形式验证

摘要: 量子强化学习（QRL）旨在利用量子效应创建顺序决策策略，比其经典对应物更有效地完成任务。然而，QRL策略面临来自量子测量和硬件噪声的不确定性，如位翻转、相翻转和去极化错误，这可能导致不安全行为。现有工作没有系统方法来验证训练后的QRL策略是否满足特定噪声条件下的安全要求。我们引入了QVerifier，一种正式验证方法，应用概率模型检查来分析带有和不带有建模的量子噪声的训练后的QRL策略。QVerifier构建了一个完整的策略-环境交互模型，直接将量子不确定性纳入转移概率中，然后使用Storm模型检查器检查安全属性。跨多个QRL环境的实验表明，QVerifier精确衡量了不同噪声模型如何影响安全性，揭示了性能下降和噪声有助于的情况。通过在部署之前进行严格的安全验证，QVerifier解决了一个关键需求：由于访问量子硬件成本昂贵，预部署验证对于任何安全关键的QRL使用是必不可少的。QVerifier瞄准了经典计算和量子计算之间的潜在甜蜜点，训练后的QRL策略仍然可以在经典上进行建模以进行概率模型检查。当策略在匹配的噪声条件下训练时，这个形式化模型是准确的；当在物理硬件上训练时，它构成了一个理想化的近似，因为未知的硬件噪声阻止了准确的策略建模。

更新时间: 2026-01-29 10:31:25

领域: quant-ph,cs.AI,cs.FL

下载: http://arxiv.org/abs/2512.01502v2

Cascaded Transfer: Learning Many Tasks under Budget Constraints

Many-Task Learning refers to the setting where a large number of related tasks need to be learned, the exact relationships between tasks are not known. We introduce the Cascaded Transfer Learning, a novel many-task transfer learning paradigm where information (e.g. model parameters) cascades hierarchically through tasks that are learned by individual models of the same class, while respecting given budget constraints. The cascade is organized as a rooted tree that specifies the order in which tasks are learned and refined. We design a cascaded transfer mechanism deployed over a minimum spanning tree structure that connects the tasks according to a suitable distance measure, and allocates the available training budget along its branches. Experiments on synthetic and real many-task settings show that the resulting method enables more accurate and cost effective adaptation across large task collections compared to alternative approaches.

Updated: 2026-01-29 10:28:08

标题: 级联传递：在预算约束下学习多个任务

摘要: Many-Task Learning指的是需要学习大量相关任务的设置，这些任务之间的确切关系未知。我们引入了级联迁移学习，这是一种新颖的多任务迁移学习范式，其中信息（例如模型参数）通过同一类别的个体模型层次地级联传递，同时遵守给定的预算约束。级联被组织为一个根树，指定任务学习和改进的顺序。我们设计了一个部署在最小生成树结构上的级联转移机制，根据适当的距离度量连接任务，并沿其分支分配可用的训练预算。在合成和真实的多任务设置上的实验表明，与替代方法相比，所得方法能够在大型任务集合之间实现更准确和更具成本效益的适应。

更新时间: 2026-01-29 10:28:08

领域: cs.LG

下载: http://arxiv.org/abs/2601.21513v1

The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity

Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.

Updated: 2026-01-29 10:27:57

标题: 确定性的幻觉：在模糊情况下，LLMs的不确定性量化失败 (Note: LLMs likely refers to "Linear Latent Models")

摘要: 大型语言模型（LLMs）中准确的不确定性量化（UQ）对于可信赖的部署至关重要。虽然现实世界的语言本质上是含糊不清的，反映了偶然性不确定性，但现有的UQ方法通常针对没有歧义的任务进行基准测试。在这项工作中，我们证明了当前的不确定性估计器在没有歧义的严格假设下表现良好，但在含糊数据上表现接近随机。为此，我们引入了MAQA*和AmbigQA*，这是第一个配备从事实共现估计的答案分布的模糊问答（QA）数据集。我们发现这种性能下降在不同的估计范式下是一致的：使用预测分布本身、模型内部表示以及模型集成。我们表明，这种现象可以在理论上解释，揭示了在含糊情况下，基于预测分布和集成的估计器在根本上受限。总的来说，我们的研究揭示了当前LLMs的UQ方法的一个关键缺陷，并激励重新思考当前的建模范式。

更新时间: 2026-01-29 10:27:57

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2511.04418v2

LLaMEA-SAGE: Guiding Automated Algorithm Design with Structural Feedback from Explainable AI

Large language models have enabled automated algorithm design (AAD) by generating optimization algorithms directly from natural-language prompts. While evolutionary frameworks such as LLaMEA demonstrate strong exploratory capabilities across the algorithm design space, their search dynamics are entirely driven by fitness feedback, leaving substantial information about the generated code unused. We propose a mechanism for guiding AAD using feedback constructed from graph-theoretic and complexity features extracted from the abstract syntax trees of the generated algorithms, based on a surrogate model learned over an archive of evaluated solutions. Using explainable AI techniques, we identify features that substantially affect performance and translate them into natural-language mutation instructions that steer subsequent LLM-based code generation without restricting expressivity. We propose LLaMEA-SAGE, which integrates this feature-driven guidance into LLaMEA, and evaluate it across several benchmarks. We show that the proposed structured guidance achieves the same performance faster than vanilla LLaMEA in a small controlled experiment. In a larger-scale experiment using the MA-BBOB suite from the GECCO-MA-BBOB competition, our guided approach achieves superior performance compared to state-of-the-art AAD methods. These results demonstrate that signals derived from code can effectively bias LLM-driven algorithm evolution, bridging the gap between code structure and human-understandable performance feedback in automated algorithm design.

Updated: 2026-01-29 10:27:29

标题: LLaMEA-SAGE：利用可解释AI的结构反馈指导自动化算法设计

摘要: 大型语言模型已经实现了通过直接从自然语言提示中生成优化算法来实现自动算法设计（AAD）。虽然像LLaMEA这样的进化框架展示了对算法设计空间的强大探索能力，但它们的搜索动态完全由适应度反馈驱动，导致生成的代码中有大量信息未被利用。我们提出了一种机制，通过从生成的算法的抽象语法树中提取的图论和复杂性特征构建的反馈来指导AAD，基于在评估解决方案存档上学习的代理模型。使用可解释的AI技术，我们确定了对性能产生重大影响的特征，并将它们转化为自然语言变异指令，以引导后续基于LLM的代码生成而不限制表达能力。我们提出了LLaMEA-SAGE，将这种特征驱动的指导集成到LLaMEA中，并在几个基准测试中对其进行评估。我们展示了在一个小型受控实验中，提出的结构化指导比普通的LLaMEA更快地实现了相同的性能。在使用GECCO-MA-BBOB竞赛的MA-BBOB套件进行更大规模的实验中，我们的指导方法相比于最先进的AAD方法实现了更优越的性能。这些结果表明，从代码中派生的信号可以有效地偏向LLM驱动的算法演化，弥合了自动算法设计中代码结构与人类可理解性能反馈之间的差距。

更新时间: 2026-01-29 10:27:29

领域: cs.AI,cs.NE,cs.SE

下载: http://arxiv.org/abs/2601.21511v1

The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation

Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human abilities and safety requirements. \emph{Activation steering} provides a lightweight alternative to prompt engineering and fine-tuning by directly modifying internal activations to guide generation. This research advances the literature in three significant directions. First, while previous work demonstrated the technical feasibility of steering emotional tone using automated classifiers, this paper presents the first human evaluation of activation steering concerning the emotional tone of LLM outputs, collecting over 7,000 crowd-sourced ratings from 190 participants via Prolific ($n=190$). These ratings assess both perceived emotional intensity and overall text quality. Second, we find strong alignment between human and model-based quality ratings (mean $r=0.776$, range $0.157$--$0.985$), indicating automatic scoring can proxy perceived quality. Moderate steering strengths ($λ\approx 0.15$) reliably amplify target emotions while preserving comprehensibility, with the strongest effects for disgust ($η_p^2 = 0.616$) and fear ($η_p^2 = 0.540$), and minimal effects for surprise ($η_p^2 = 0.042$). Finally, upgrading from Alpaca to LlaMA-3 yielded more consistent steering with significant effects across emotions and strengths (all $p < 0.001$). Inter-rater reliability was high (ICC $= 0.71$--$0.87$), underscoring the robustness of the findings. These findings support activation-based control as a scalable method for steering LLM behavior across affective dimensions.

Updated: 2026-01-29 10:24:34

标题: 样式向量对于引导大型语言模型的有效性：人类评估

摘要: 在推理时控制大型语言模型（LLMs）的行为对于与人类能力和安全要求相一致的输出至关重要。激活引导提供了一种轻量级的替代方案，可以直接修改内部激活以引导生成，而无需进行提示工程和微调。这项研究在三个重要方向上推动了文献。首先，虽然先前的研究表明使用自动分类器调整情感语调的技术可行，但本文首次进行了关于LLM输出情感语调的激活引导的人类评估，通过Prolific平台收集了来自190名参与者的7000多个众包评分（n = 190）。这些评分评估了所感知的情感强度和整体文本质量。其次，我们发现人类评分与模型质量评分之间存在较强的一致性（均值r = 0.776，范围0.157-0.985），表明自动评分可以代表感知质量。适度的激活强度（λ约为0.15）可可靠地增强目标情感，同时保持可理解性，对厌恶（η_p^2 = 0.616）和恐惧（η_p^2 = 0.540）效果最为明显，对惊讶的影响最小（η_p^2 = 0.042）。最后，从Alpaca升级到LlaMA-3实现了更一致的激活引导，跨情感和强度产生了显著效果（所有p < 0.001）。评分者之间的可靠性很高（ICC = 0.71-0.87），强调了这些发现的稳健性。这些发现支持基于激活的控制作为一种可扩展的方法，可在情感维度上引导LLM行为。

更新时间: 2026-01-29 10:24:34

领域: cs.AI,cs.CL,cs.HC

下载: http://arxiv.org/abs/2601.21505v1

Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity

Speech remains one of the most visible yet overlooked vectors of inclusion and exclusion in contemporary society. While fluency is often equated with credibility and competence, individuals with atypical speech patterns are routinely marginalized. Given the current state of the debate, this article focuses on the structural biases that shape perceptions of atypical speech and are now being encoded into artificial intelligence. Automated speech recognition (ASR) systems and voice interfaces, trained predominantly on standardized speech, routinely fail to recognize or respond to diverse voices, compounding digital exclusion. As AI technologies increasingly mediate access to opportunity, the study calls for inclusive technological design, anti-bias training to minimize the impact of discriminatory algorithmic decisions, and enforceable policy reform that explicitly recognize speech diversity as a matter of equity, not merely accessibility. Drawing on interdisciplinary research, the article advocates for a cultural and institutional shift in how we value voice, urging co-created solutions that elevate the rights, representation, and realities of atypical speakers in the digital age. Ultimately, the article reframes speech inclusion as a matter of equity (not accommodation) and advocates for co-created AI systems that reflect the full spectrum of human voices.

Updated: 2026-01-29 10:22:21

标题: 数字时代中未被听到的声音：重新思考人工智能偏见和言论多样性

摘要: 演讲仍然是当代社会中最显而易见却又被忽视的包容和排斥的向量之一。尽管流利通常被等同于可信度和能力，但具有非典型言语模式的个体经常被 margin marginalized。鉴于当前辩论的状态，本文关注塑造非典型言语感知的结构偏见，这些偏见现在被编码到人工智能中。自动语音识别（ASR）系统和语音界面主要训练在标准化言语上，通常无法识别或回应多样化的声音，增加数字排斥。随着人工智能技术越来越多地介入机会获取，这项研究呼吁包容性技术设计，反偏见培训以最小化歧视性算法决策的影响，并实施可执行的政策改革，明确将言语多样性视为公平而非仅仅是可访问性问题。文章借鉴跨学科研究，主张在我们如何重视声音方面进行文化和制度转变，敦促共同创造的解决方案，提升数字时代非典型发言者的权利、代表性和现实。最终，文章将演讲包容重新定义为公平问题（而非容纳），并倡导共同创造的人工智能系统来反映人类声音的全谱。

更新时间: 2026-01-29 10:22:21

领域: cs.HC,cs.AI,cs.CY

下载: http://arxiv.org/abs/2601.18641v2

MAR: Efficient Large Language Models via Module-aware Architecture Refinement

Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we propose Module-aware Architecture Refinement (MAR), a two-stage framework that integrates State Space Models (SSMs) for linear-time sequence modeling and applies activation sparsification to reduce FFN costs. In addition, to mitigate low information density and temporal mismatch in integrating Spiking Neural Networks (SNNs) with SSMs, we design the Adaptive Ternary Multi-step Neuron (ATMN) and the Spike-aware Bidirectional Distillation Strategy (SBDS). Extensive experiments demonstrate that MAR effectively restores the performance of its dense counterpart under constrained resources while substantially reducing inference energy consumption. Furthermore, it outperforms efficient models of comparable or even larger scale, underscoring its potential for building efficient and practical LLMs.

Updated: 2026-01-29 10:21:28

标题: MAR: 通过模块感知架构细化实现高效大型语言模型

摘要: 大语言模型（LLMs）在各种领域表现优异，但由于二次注意力和密集的前馈网络（FFN）操作而导致能量消耗高。为了解决这些问题，我们提出了模块感知架构细化（MAR），这是一个两阶段框架，整合了状态空间模型（SSMs）用于线性时间序列建模，并应用激活稀疏化来减少FFN成本。此外，为了减轻在将脉冲神经网络（SNNs）与SSMs集成时的低信息密度和时间不匹配问题，我们设计了自适应三值多步神经元（ATMN）和脉冲感知的双向蒸馏策略（SBDS）。大量实验证明，MAR在受限资源下有效恢复了其密集对应物的性能，同时大大降低了推断能量消耗。此外，它超越了具有相似或甚至更大规模的高效模型，突显了其构建高效实用LLMs的潜力。

更新时间: 2026-01-29 10:21:28

领域: cs.AI,cs.CL,cs.LG,cs.NE

下载: http://arxiv.org/abs/2601.21503v1

Task-Awareness Improves LLM Generations and Uncertainty

In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard structural information. We address this by modeling LLM outputs directly in a task-dependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.

Updated: 2026-01-29 10:16:23

标题: 任务感知提高LLM生成和不确定性 (Note: LLM stands for Language Model)

摘要: 在许多LLM应用中，自然语言响应通常具有表示离散标签、数值或图形的潜在结构。然而，现有的解码和不确定性估计方法仅在语言空间中运作，并且在很大程度上忽略了结构信息。我们通过直接在与任务相关的潜在结构中对LLM输出进行建模来解决这个问题。通过为这个结构配备一个差异度量，我们可以计算贝叶斯最优响应。这些响应不是从样本生成中选择的，而是通过在潜在空间中组合个体响应新合成的。在不同任务中，贝叶斯最优响应始终优于诸如束搜索之类的标准解码方法。此外，通过引入贝叶斯风险来量化不确定性，可以捕捉潜在结构方面的变化，并提高与输出质量和正确性的对齐度。我们的决策理论框架适用于任何具有潜在响应结构的问题，并能够实现可靠的任务感知LLM预测。

更新时间: 2026-01-29 10:16:23

领域: cs.LG

下载: http://arxiv.org/abs/2601.21500v1

SimGraph: A Unified Framework for Scene Graph-Based Image Generation and Editing

Recent advancements in Generative Artificial Intelligence (GenAI) have significantly enhanced the capabilities of both image generation and editing. However, current approaches often treat these tasks separately, leading to inefficiencies and challenges in maintaining spatial consistency and semantic coherence between generated content and edits. Moreover, a major obstacle is the lack of structured control over object relationships and spatial arrangements. Scene graph-based methods, which represent objects and their interrelationships in a structured format, offer a solution by providing greater control over composition and interactions in both image generation and editing. To address this, we introduce SimGraph, a unified framework that integrates scene graph-based image generation and editing, enabling precise control over object interactions, layouts, and spatial coherence. In particular, our framework integrates token-based generation and diffusion-based editing within a single scene graph-driven model, ensuring high-quality and consistent results. Through extensive experiments, we empirically demonstrate that our approach outperforms existing state-of-the-art methods.

Updated: 2026-01-29 10:15:55

标题: SimGraph：一种基于场景图的图像生成和编辑的统一框架

摘要: 最近，生成人工智能（GenAI）的进展显著提升了图像生成和编辑的能力。然而，当前方法常常将这些任务分开处理，导致在生成内容和编辑之间保持空间一致性和语义连贯性方面存在效率低下和挑战。此外，一个主要障碍是缺乏对对象关系和空间布局的结构化控制。基于场景图的方法以结构化格式表示对象及其相互关系，通过在图像生成和编辑中提供更大的组合和交互控制，提供了一个解决方案。为了解决这个问题，我们介绍了SimGraph，一个整合了基于场景图的图像生成和编辑的统一框架，实现对对象交互、布局和空间连贯性的精确控制。特别地，我们的框架在一个单一的基于场景图的模型中集成了基于令牌的生成和扩散式的编辑，确保高质量和一致的结果。通过广泛的实验，我们实证地证明了我们的方法优于现有的最先进方法。

更新时间: 2026-01-29 10:15:55

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.21498v1

Neural Force Field: Few-shot Learning of Generalized Physical Reasoning

Physical reasoning is a remarkable human ability that enables rapid learning and generalization from limited experience. Current AI models, despite extensive training, still struggle to achieve similar generalization, especially in Out-of-distribution (OOD) settings. This limitation stems from their inability to abstract core physical principles from observations. A key challenge is developing representations that can efficiently learn and generalize physical dynamics from minimal data. Here we present Neural Force Field (NFF), a framework extending Neural Ordinary Differential Equation (NODE) to learn complex object interactions through force field representations, which can be efficiently integrated through an Ordinary Differential Equation (ODE) solver to predict object trajectories. Unlike existing approaches that rely on discrete latent spaces, NFF captures fundamental physical concepts such as gravity, support, and collision in continuous explicit force fields. Experiments on three challenging physical reasoning tasks demonstrate that NFF, trained with only a few examples, achieves strong generalization to unseen scenarios. This physics-grounded representation enables efficient forward-backward planning and rapid adaptation through interactive refinement. Our work suggests that incorporating physics-inspired representations into learning systems can help bridge the gap between artificial and human physical reasoning capabilities.

Updated: 2026-01-29 10:14:38

标题: 神经力场：少样本学习的广义物理推理

摘要: 物理推理是一种非凡的人类能力，能够让我们从有限的经验中快速学习和泛化。当前的AI模型，尽管经过了广泛的训练，仍然在特别是在分布外（OOD）环境中很难实现类似的泛化。这种限制源于它们无法从观察中抽象出核心的物理原则。一个关键挑战是开发能够从最少数据中高效学习和泛化物理动态的表示。在这里，我们提出了神经力场（NFF），这是一个将神经常微分方程（NODE）扩展为通过力场表示学习复杂物体相互作用的框架，可以通过常微分方程（ODE）求解器高效地集成以预测物体轨迹。与现有方法依赖于离散潜在空间不同，NFF通过连续显式力场捕获了基本的物理概念，如重力、支撑和碰撞。在三项具有挑战性的物理推理任务上的实验表明，NFF在只有少数示例的情况下，能够对未见过的情景实现强大的泛化。这种以物理为基础的表示使得通过交互式的细化实现高效的前向-后向规划和快速适应成为可能。我们的工作表明，将物理启发表示纳入到学习系统中可以帮助弥合人工和人类物理推理能力之间的差距。

更新时间: 2026-01-29 10:14:38

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2502.08987v5

The Path of Least Resistance: Guiding LLM Reasining Trajectories with Prefix Consensus

Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.

Updated: 2026-01-29 10:14:24

标题: 最小阻力路径：通过前缀一致性引导LLM推理轨迹

摘要: 大型语言模型在推理性能方面取得了良好的表现，但推理策略，如自洽性（SC），在计算上是昂贵的，因为它们完全展开所有推理轨迹。我们引入了PoLR（最小阻力路径），这是第一个利用前缀一致性进行计算高效推理的推理时间方法。PoLR对推理轨迹的短前缀进行聚类，识别主导聚类，并展开该聚类中的所有路径，保持了SC的准确性优势，同时大大减少了标记使用和延迟。我们的理论分析，通过互信息和熵的框架，解释了为什么早期推理步骤编码了最终正确性的强信号。从经验上看，PoLR在GSM8K、MATH500、AIME24/25和GPQA-DIAMOND等方面始终与SC匹配或超越，将标记使用量减少了高达60%，墙钟延迟减少了高达50%。此外，PoLR完全与自适应推理方法（例如，自适应一致性，早停止SC）互补，并可以作为一个插入式预过滤器，使SC变得更加高效和可扩展，而无需进行模型微调。

更新时间: 2026-01-29 10:14:24

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21494v1

Mean-Field Control on Sparse Graphs: From Local Limits to GNNs via Neighborhood Distributions

Mean-field control (MFC) offers a scalable solution to the curse of dimensionality in multi-agent systems but traditionally hinges on the restrictive assumption of exchangeability via dense, all-to-all interactions. In this work, we bridge the gap to real-world network structures by proposing a rigorous framework for MFC on large sparse graphs. We redefine the system state as a probability measure over decorated rooted neighborhoods, effectively capturing local heterogeneity. Our central contribution is a theoretical foundation for scalable reinforcement learning in this setting. We prove horizon-dependent locality: for finite-horizon problems, an agent's optimal policy at time t depends strictly on its (T-t)-hop neighborhood. This result renders the infinite-dimensional control problem tractable and underpins a novel Dynamic Programming Principle (DPP) on the lifted space of neighborhood distributions. Furthermore, we formally and experimentally justify the use of Graph Neural Networks (GNNs) for actor-critic algorithms in this context. Our framework naturally recovers classical MFC as a degenerate case while enabling efficient, theoretically grounded control on complex sparse topologies.

Updated: 2026-01-29 09:57:48

标题: 稀疏图上的均场控制：从局部极限到GNNs via 邻域分布

摘要: 平均场控制（MFC）为多智能体系统中维度灾难提供了可扩展的解决方案，但传统上依赖于密集、全对全交互的可交换性假设。在这项工作中，我们通过提出一个严格的框架，将MFC扩展到大型稀疏图网络结构，弥合了与真实网络结构之间的差距。我们重新定义系统状态为覆盖了装饰根邻域的概率测度，有效地捕捉了局部异质性。我们的核心贡献是在这种情况下提供了可扩展强化学习的理论基础。我们证明了依赖于时间跨度的局部性：对于有限时间跨度的问题，一个智能体在时刻t的最优策略严格依赖于其（T-t）跳邻域。这一结果使得无限维控制问题变得可处理，并支撑了在邻域分布的升级空间上的新颖动态规划原理（DPP）。此外，我们在理论和实验上正式证明了在这种情况下使用图神经网络（GNNs）进行演员-评论家算法的合理性。我们的框架自然地恢复了传统的MFC作为一个退化案例，同时在复杂稀疏拓扑上实现了有效的、理论上基础的控制。

更新时间: 2026-01-29 09:57:48

领域: cs.MA,cs.AI,cs.LG,math.OC

下载: http://arxiv.org/abs/2601.21477v1

MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning

We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of text embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain text embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.

Updated: 2026-01-29 09:55:17

标题: MOSAIC：领域内对比学习的掩模目标和选择性自适应

摘要: 我们介绍了MOSAIC（具有选择性适应的掩码目标的领域内对比学习），这是一个用于领域自适应文本嵌入模型的多阶段框架，它整合了联合领域特定的掩码监督。我们的方法解决了将大规模通用领域文本嵌入模型适应到专业领域的挑战。通过在统一的训练流程中联合优化掩码语言建模（MLM）和对比目标，我们的方法能够有效学习领域相关表示，同时保持原始模型的鲁棒语义判别特性。我们在高资源和低资源领域上进行了实证验证，相对于强大的通用领域基线，我们取得了NDCG@10（标准化折现累积增益）高达13.4％的改进。全面的消融研究进一步证明了每个组件的有效性，突出了平衡联合监督和分阶段自适应的重要性。

更新时间: 2026-01-29 09:55:17

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.16797v2

Task-free Adaptive Meta Black-box Optimization

Handcrafted optimizers become prohibitively inefficient for complex black-box optimization (BBO) tasks. MetaBBO addresses this challenge by meta-learning to automatically configure optimizers for low-level BBO tasks, thereby eliminating heuristic dependencies. However, existing methods typically require extensive handcrafted training tasks to learn meta-strategies that generalize to target tasks, which poses a critical limitation for realistic applications with unknown task distributions. To overcome the issue, we propose the Adaptive meta Black-box Optimization Model (ABOM), which performs online parameter adaptation using solely optimization data from the target task, obviating the need for predefined task distributions. Unlike conventional metaBBO frameworks that decouple meta-training and optimization phases, ABOM introduces a closed-loop adaptive parameter learning mechanism, where parameterized evolutionary operators continuously self-update by leveraging generated populations during optimization. This paradigm shift enables zero-shot optimization: ABOM achieves competitive performance on synthetic BBO benchmarks and realistic unmanned aerial vehicle path planning problems without any handcrafted training tasks. Visualization studies reveal that parameterized evolutionary operators exhibit statistically significant search patterns, including natural selection and genetic recombination.

Updated: 2026-01-29 09:54:10

标题: 无任务自适应元黑盒优化

摘要: 手工优化器对于复杂的黑盒优化（BBO）任务效率低下。MetaBBO通过元学习来自动配置优化器以解决这一挑战，从而消除启发式依赖。然而，现有方法通常需要大量手工训练任务来学习适用于目标任务的元策略，这对于具有未知任务分布的现实应用构成了关键限制。为了克服这一问题，我们提出了自适应元黑盒优化模型（ABOM），该模型通过仅使用目标任务的优化数据来进行在线参数适应，从而消除了预定义任务分布的需要。与将元训练和优化阶段分离的传统MetaBBO框架不同，ABOM引入了一个闭环自适应参数学习机制，其中参数化进化算子通过利用优化过程中生成的种群不断自我更新。这种范式转变实现了零次优化：ABOM在合成BBO基准测试和现实无人机路径规划问题中实现了竞争性能，而无需任何手工训练任务。可视化研究表明，参数化进化算子展现出具有统计显著性的搜索模式，包括自然选择和遗传重组。

更新时间: 2026-01-29 09:54:10

领域: cs.NE,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21475v1

ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management

LLM-based multi-agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU-resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory-efficient LLM serving system for large-scale multi-agent simulations. ScaleSim enables proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.

Updated: 2026-01-29 09:52:16

标题: ScaleSim：基于调用距离的内存管理为大规模多智能体仿真提供服务

摘要: 基于LLM的多代理模拟在应用领域中越来越受到采用，但由于GPU内存压力而难以扩展。每个代理都维护着私有的GPU驻留状态，包括模型、前缀缓存和适配器，在代理数量增加时会迅速耗尽设备内存。我们确定了这些工作负载的两个关键属性：稀疏代理激活和可估计的代理调用顺序。通过对代表性工作负载类的分析，我们引入了调用距离，这是一个估计代理将来发出LLM请求相对顺序的统一抽象。利用这一抽象，我们提出了ScaleSim，一个用于大规模多代理模拟的内存高效的LLM服务系统。ScaleSim实现了主动预取和基于优先级的淘汰，通过模块化接口支持多样化的代理特定内存，并在模拟基准测试中实现了与SGLang相比高达1.74倍的加速。

更新时间: 2026-01-29 09:52:16

领域: cs.AI,cs.DC

下载: http://arxiv.org/abs/2601.21473v1

PPI-SVRG: Unifying Prediction-Powered Inference and Variance Reduction for Semi-Supervised Optimization

We study semi-supervised stochastic optimization when labeled data is scarce but predictions from pre-trained models are available. PPI and SVRG both reduce variance through control variates -- PPI uses predictions, SVRG uses reference gradients. We show they are mathematically equivalent and develop PPI-SVRG, which combines both. Our convergence bound decomposes into the standard SVRG rate plus an error floor from prediction uncertainty. The rate depends only on loss geometry; predictions affect only the neighborhood size. When predictions are perfect, we recover SVRG exactly. When predictions degrade, convergence remains stable but reaches a larger neighborhood. Experiments confirm the theory: PPI-SVRG reduces MSE by 43--52\% under label scarcity on mean estimation benchmarks and improves test accuracy by 2.7--2.9 percentage points on MNIST with only 10\% labeled data.

Updated: 2026-01-29 09:49:46

标题: PPI-SVRG：将预测驱动推断和方差减少统一起来，用于半监督优化

摘要: 我们研究了半监督随机优化，当标记数据稀缺但可利用预训练模型的预测时。PPI和SVRG通过控制变量减少方差 -- PPI使用预测，SVRG使用参考梯度。我们表明它们在数学上是等价的，并发展了结合两者的PPI-SVRG。我们的收敛界限分解为标准SVRG速率加上由于预测不确定性而产生的误差楼层。速率仅取决于损失几何；预测仅影响邻域大小。当预测完美时，我们完全恢复SVRG。当预测恶化时，收敛保持稳定但达到更大的邻域。实验验证了理论：在均值估计基准测试中，PPI-SVRG在标记稀缺的情况下将均方误差降低了43-52％，并且在仅有10％标记数据的MNIST上将测试准确率提高了2.7-2.9个百分点。

更新时间: 2026-01-29 09:49:46

领域: cs.LG,econ.EM,math.OC,stat.ML

下载: http://arxiv.org/abs/2601.21470v1

Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation

While Large Language Models (LLMs) have catalyzed breakthroughs in automated code generation, Small Language Models (SLMs) often encounter reasoning bottlenecks and failure loops when addressing complex logical requirements. To overcome these challenges, we propose DebateCoder, a multi-agent collaborative framework designed to improve the reasoning ability of SLMs (e.g., Pangu-1B) in resource-constrained environments. DebateCoder uses a structured role-playing protocol with three agents: User Agent (A_UA), Technical Agent (A_TA), and Quality Assurance Agent (A_QA). It also includes an Adaptive Confidence Gating mechanism with a 95% threshold to balance accuracy and inference efficiency. In addition, we introduce a multi-turn deliberation module and a reviewer-guided analytical debugging loop for orthogonal pre-generation debate and post-generation refinement. Experiments on HumanEval and MBPP show that DebateCoder achieves 70.12% Pass@1 on HumanEval, outperforming MapCoder while reducing API overhead by about 35%. These results indicate that collaborative protocols can mitigate limitations of small-parameter models and provide a scalable, efficient approach to high-quality automated software engineering.

Updated: 2026-01-29 09:48:15

标题: 多智能体协作中的自适应置信门控用于高效和优化的代码生成

摘要: 尽管大型语言模型(LLMs)已经在自动代码生成方面取得了突破，但小型语言模型(SLMs)在处理复杂的逻辑要求时经常遇到推理瓶颈和失败循环。为了克服这些挑战，我们提出了DebateCoder，这是一个多代理协作框架，旨在提高在资源受限环境中的SLMs（如Pangu-1B）的推理能力。DebateCoder使用一个结构化的角色扮演协议，包括三个代理：用户代理(A_UA)、技术代理(A_TA)和质量保障代理(A_QA)。它还包括一个自适应置信门控机制，具有95%的阈值，以平衡准确性和推理效率。此外，我们引入了一个多轮审议模块和一个由审查员引导的分析调试循环，用于正交预生成辩论和后生成的完善。在HumanEval和MBPP上的实验表明，DebateCoder在HumanEval上实现了70.12%的Pass@1，优于MapCoder，并将API开销减少约35%。这些结果表明，协作协议可以缓解小参数模型的限制，并为高质量的自动软件工程提供一种可扩展、高效的方法。

更新时间: 2026-01-29 09:48:15

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2601.21469v1

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.

Updated: 2026-01-29 09:47:17

标题: MemOCR: 布局感知的视觉记忆，用于高效的长期推理

摘要: 长时间跨度的主动推理需要有效地将不断增长的互动历史压缩到有限的上下文窗口中。大多数现有的记忆系统将历史序列化为文本，其中令牌级成本是均匀的，并且随着长度的增加呈线性增长，通常会在低价值细节上花费有限的预算。为此，我们引入了MemOCR，这是一个多模式记忆代理，通过视觉布局在紧凑的上下文预算下改善长时间跨度的推理，通过自适应信息密度分配内存空间。具体而言，MemOCR维护一个结构化的富文本记忆（例如标题、高亮显示），并将其呈现为图像，代理可以查阅以进行记忆访问，通过视觉优先处理关键证据，同时积极压缩辅助细节。为了确保在不同的记忆预算下的稳健性，我们通过在预算感知目标下使用强化学习对MemOCR进行训练，使代理暴露于不同的压缩水平。在长上下文多跳和单跳问答基准测试中，MemOCR优于强大的基于文本的基线，并在极端预算下实现了更有效的上下文利用。

更新时间: 2026-01-29 09:47:17

领域: cs.AI

下载: http://arxiv.org/abs/2601.21468v1

Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance

Text clustering is today the most popular paradigm for topic modelling, both in academia and industry. Despite clustering topic models' apparent success, we identify a number of issues in Top2Vec and BERTopic, which remain largely unsolved. Firstly, these approaches are unreliable at discovering natural clusters in corpora, due to extreme sensitivity to sample size and hyperparameters, the default values of which result in suboptimal behaviour. Secondly, when estimating term importance, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus. This results in, on the one hand, less coherent topics due to the presence of stop words and junk words, and lack of variety and trust on the other. In this paper, I introduce a new approach, \textbf{Topeax}, which discovers the number of clusters from peaks in density estimates, and combines lexical and semantic indices of term importance to gain high-quality topic keywords. Topeax is demonstrated to be better at both cluster recovery and cluster description than Top2Vec and BERTopic, while also exhibiting less erratic behaviour in response to changing sample size and hyperparameters.

Updated: 2026-01-29 09:41:31

标题: Topeax -- 一种具有密度峰值检测和词汇-语义术语重要性的改进的聚类主题模型

摘要: 文本聚类是当今学术界和工业界中最流行的主题建模范式。尽管聚类主题模型在Top2Vec和BERTopic中取得了明显的成功，我们仍然发现了一些问题，这些问题仍然未得到解决。首先，这些方法在发现语料库中的自然簇时不可靠，因为对样本大小和超参数的极度敏感，其默认值导致子优化行为。其次，在估计术语重要性时，BERTopic忽略了关键词到主题向量的语义距离，而Top2Vec则忽略了语料库中的词频。这导致了一方面由于停用词和垃圾词的存在而导致主题不够连贯，另一方面缺乏变化和信任。在本文中，我介绍了一种新方法\textbf{Topeax}，该方法通过密度估计中的峰值发现簇的数量，并结合词汇和语义重要性指数以获得高质量的主题关键词。Topeax在聚类恢复和聚类描述方面均优于Top2Vec和BERTopic，并且在响应不同的样本大小和超参数时表现出更少的不稳定行为。

更新时间: 2026-01-29 09:41:31

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21465v1

Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.

Updated: 2026-01-29 09:41:14

标题: 非可验证学习的对话：通过元评估实现自我演进的LLMs

摘要: 在为非可验证任务（如创意写作、对话和伦理推理）训练大型语言模型（LLMs）仍然具有挑战性，因为缺乏基准标签。虽然LLM作为评委的方法为人类反馈提供了可扩展的替代方案，但面临一个基本限制：性能受评估者自身质量的限制。如果评委无法识别出好的解决方案，则无法提供有用的训练信号，并且评估偏见（例如偏好量多而非质量）仍未得到解决。这促使了元评估的出现：即评估和改进评估者本身的能力。我们介绍了CoNL，这是一个通过多智能体自我对弈统一生成、评估和元评估的框架。我们的关键洞察是：批评质量可以通过其是否帮助他人改进解决方案来衡量。在CoNL中，共享相同策略的多个智能体参与结构化对话，提出、批评和修订解决方案。能够促使解决方案改进的批评获得诊断奖励，为元评估提供显式监督，并通过自我对弈实现生成和评判能力的联合优化，而无需外部评委或基准标签。在五个基准测试上的实验表明，CoNL相对于自奖励基线实现了一致的改进，同时保持了稳定的训练。

更新时间: 2026-01-29 09:41:14

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2601.21464v1

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap of high-quality speech editing datasets. Building upon this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. To mitigate the inherent forgery bias and semantic-priority bias observed in existing audio large models, PELM incorporates word-level probability priors to provide explicit acoustic cues, and further designs a centroid-aggregation-based acoustic consistency perception loss to explicitly enforce the modeling of subtle local distribution anomalies. Extensive experimental results demonstrate that PELM significantly outperforms state-of-the-art methods on both the HumanEdit and AiEdit datasets, achieving equal error rates (EER) of 0.57\% and 9.28\% (localization), respectively.

Updated: 2026-01-29 09:39:28

标题: 通过先验增强音频LLMs统一语音编辑检测和内容定位

摘要: 演讲编辑通过对原始话语进行细粒度的段级操作，实现了语义倒置，同时保留了整体感知的自然性。现有的检测研究主要集中在手动编辑的语音上，具有明显的拼接伪迹，因此难以应对新兴的端到端神经语音编辑技术，这些技术生成无缝的声学过渡。为了解决这一挑战，我们首先构建了一个大规模的双语数据集AiEdit，该数据集利用大型语言模型驱动精确的语义篡改逻辑，并采用多种先进的神经语音编辑方法进行数据合成，填补了高质量语音编辑数据集的空白。在此基础上，我们提出了PELM（Prior-Enhanced Audio Large Language Model），这是第一个将语音编辑检测和内容定位统一为音频问答任务的大型模型框架。为了减轻现有音频大型模型中观察到的固有伪造偏见和语义优先偏见，PELM将单词级概率先验纳入其中，以提供明确的声学线索，并进一步设计了基于质心聚合的声学一致性感知损失，明确强化了对微妙的局部分布异常的建模。广泛的实验证明，PELM在HumanEdit和AiEdit数据集上明显优于现有方法，分别实现了0.57\%和9.28\%（定位）的等误差率（EER）。

更新时间: 2026-01-29 09:39:28

领域: cs.SD,cs.AI

下载: http://arxiv.org/abs/2601.21463v1

Partial Feedback Online Learning

We study partial-feedback online learning, where each instance admits a set of correct labels, but the learner only observes one correct label per round; any prediction within the correct set is counted as correct. This model captures settings such as language generation, where multiple responses may be valid but data provide only a single reference. We give a near-complete characterization of minimax regret for both deterministic and randomized learners in the set-realizable regime, i.e., in the regime where sublinear regret is generally attainable. For deterministic learners, we introduce the Partial-Feedback Littlestone dimension (PFLdim) and show it precisely governs learnability and minimax regret; technically, PFLdim cannot be defined via the standard version space, requiring a new collection version space viewpoint and an auxiliary dimension used only in the proof. We further develop the Partial-Feedback Measure Shattering dimension (PMSdim) to obtain tight bounds for randomized learners. We identify broad conditions ensuring inseparability between deterministic and randomized learnability (e.g., finite Helly number or nested-inclusion label structure), and extend the argument to set-valued online learning, resolving an open question of Raman et al. [2024b]. Finally, we show a sharp separation from weaker realistic and agnostic variants: outside set realizability, the problem can become information-theoretically intractable, with linear regret possible even for $|H|=2$. This highlights the need for fundamentally new, noise-sensitive complexity measures to meaningfully characterize learnability beyond set realizability.

Updated: 2026-01-29 09:39:11

标题: 部分反馈在线学习

摘要: 我们研究部分反馈在线学习，其中每个实例都有一组正确的标签，但学习者每轮只能观察到一个正确的标签；在正确集合内的任何预测都被视为正确。这个模型涵盖了一些情境，比如语言生成，其中可能有多个有效的响应，但数据只提供了一个参考。我们在确定性和随机学习者在可实现集合范围内的最小化遗憾的近乎完整特征化。对于确定性学习者，我们引入了部分反馈Littlestone维度（PFLdim）并展示它精确地控制了可学习性和最小化遗憾；从技术上讲，PFLdim无法通过标准版本空间来定义，需要一个新的集合版本空间观点和仅在证明中使用的辅助维度。我们进一步发展了部分反馈度量破碎维度（PMSdim）来获得随机学习者的紧密界限。我们确定了广泛条件以确保确定性和随机学习能力之间的不可分性（例如，有限的Helly数或嵌套包含标签结构），并将论据扩展到集值在线学习，解决了Raman等人[2024b]的一个开放问题。最后，我们展示了与较弱的现实和犹豫变体的明显分离：在集合可实现性之外，问题可能在信息理论上变得难以处理，即使对于$|H|=2$，也可能出现线性遗憾。这突显了需要根本新的、对噪声敏感的复杂性度量来有意义地描述超过集合可实现性的可学习性。

更新时间: 2026-01-29 09:39:11

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2601.21462v1

L$^3$: Large Lookup Layers

Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP "experts." However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L$^3$), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L$^3$ layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L$^3$ has two main components: (1) a systems-friendly architecture that allows for fast training and CPU-offloaded inference with no overhead, and (2) an information-theoretic embedding allocation algorithm that effectively balances speed and quality. We empirically test L$^3$ by training transformers with up to 2.6B active parameters and find that L$^3$ strongly outperforms both dense models and iso-sparse MoEs in both language modeling and downstream tasks.

Updated: 2026-01-29 09:37:31

标题: L$^3$: 大型查找层

摘要: 现代稀疏语言模型通常通过专家混合（MoE）层实现稀疏性，该层动态路由标记到密集MLP“专家”。然而，动态硬路由具有许多缺点，例如可能的硬件效率低下和需要辅助损失以实现稳定训练。相比之下，分词器嵌入表本身稀疏，通过以牺牲上下文信息为代价选择每个标记的单个嵌入，大部分避免了这些问题。在这项工作中，我们引入了大型查找层（L^3），通过将嵌入表推广到模型解码器层，解锁了一种新的稀疏轴。L^3 层使用静态基于标记的路由以一种上下文相关的方式聚合每个标记的一组学习嵌入，使模型能够通过在嵌入中缓存信息来高效平衡内存和计算。L^3有两个主要组成部分：（1）一个系统友好的架构，可以快速训练并在无额外开销的情况下进行CPU卸载推断，（2）一个信息论嵌入分配算法，有效地平衡速度和质量。我们通过训练拥有多达26亿个活跃参数的transformers来实证测试L^3，并发现L^3在语言建模和下游任务中都明显优于密集模型和iso-sparse MoEs。

更新时间: 2026-01-29 09:37:31

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21461v1

HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: data with high-quality reasoning traces, and reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters' first-person thinking from LLMs' third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering and construct human-aligned principles and reward models. Leveraging these resources, we train \method models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26 improvement on the CoSER benchmark and a 14.97 gain on the Minimax Role-Play Bench. Our datasets, principles, and models will be released to facilitate future research.

Updated: 2026-01-29 09:35:27

标题: HER: 人类式推理和强化学习用于LLM角色扮演

摘要: LLM角色扮演，即使用LLM来模拟特定角色，已经成为各种应用中的关键能力，比如陪伴、内容创作和数字游戏。虽然当前模型有效地捕捉了角色的语调和知识，但模拟他们行为背后的内在思想仍然是一个挑战。为了实现LLM角色扮演中的认知模拟，以往的努力主要存在两个不足之处：缺乏具有高质量推理痕迹的数据，以及与人类偏好一致的可靠奖励信号。在本文中，我们提出了HER，一个统一的认知级别角色模拟框架。HER引入了双层思维，区分了角色的第一人称思维和LLM的第三人称思维。为了弥合这些差距，我们通过逆向工程策划推理增强的角色扮演数据，并构建了与人类偏好一致的原则和奖励模型。利用这些资源，我们通过监督学习和强化学习基于Qwen3-32B训练了\method 模型。大量实验验证了我们方法的有效性。值得注意的是，我们的模型明显优于Qwen3-32B基准，在CoSER基准上实现了30.26的改进，在Minimax角色扮演基准上实现了14.97的增益。我们的数据集、原则和模型将被发布以促进未来的研究。

更新时间: 2026-01-29 09:35:27

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21459v1

Dynamic Target Attack

Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response, e.g., ``Sure, here is...''. However, this fixed target usually resides in an extremely low-density region of a safety-aligned LLM's output distribution. Due to the substantial discrepancy between the fixed target and the output distribution, existing attacks require numerous iterations to optimize the adversarial prompt, which might still fail to induce the low-probability target response. To address this limitation, we propose Dynamic Target Attack (DTA), which leverages the target LLM's own responses as adaptive targets. In each optimization round, DTA samples multiple candidates from the output distribution conditioned on the current prompt, and selects the most harmful one as a temporary target for prompt optimization. Extensive experiments demonstrate that, under the white-box setting, DTA achieves over 87% average attack success rate (ASR) within 200 optimization iterations on recent safety-aligned LLMs, exceeding the state-of-the-art baselines by over 15% and reducing wall-clock time by 2-26x. Under the black-box setting, DTA employs a white-box LLM as a surrogate model for gradient-based optimization, achieving an average ASR of 77.5% against black-box models, exceeding prior transfer-based attacks by over 12%.

Updated: 2026-01-29 09:33:04

标题: 动态目标攻击

摘要: 现有基于梯度的越狱攻击通常优化对手后缀，以诱导固定的肯定回应，例如，“当然，这里是...”。然而，这个固定目标通常位于与安全对齐的LLM输出分布的极低密度区域。由于固定目标和输出分布之间存在实质性差异，现有攻击需要大量迭代来优化对手提示，这可能仍然无法诱导低概率的目标响应。为了解决这一限制，我们提出了动态目标攻击（DTA），利用目标LLM自身的响应作为自适应目标。在每一轮优化中，DTA从当前提示条件下的输出分布中采样多个候选项，并选择最有害的一个作为临时目标进行提示优化。大量实验证明，在白盒设置下，DTA在最新的安全对齐LLM上在200次优化迭代内实现了超过87%的平均攻击成功率（ASR），超过了现有基准线15%以上，并将墙钟时间减少2-26倍。在黑盒设置下，DTA采用白盒LLM作为梯度优化的代理模型，对黑盒模型实现了77.5%的平均ASR，超过了先前基于转移的攻击12%以上。

更新时间: 2026-01-29 09:33:04

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2510.02422v3

Questioning the Coverage-Length Metric in Conformal Prediction: When Shorter Intervals Are Not Better

Conformal prediction (CP) has become a cornerstone of distribution-free uncertainty quantification, conventionally evaluated by its coverage and interval length. This work critically examines the sufficiency of these standard metrics. We demonstrate that the interval length might be deceptively improved through a counter-intuitive approach termed Prejudicial Trick (PT), while the coverage remains valid. Specifically, for any given test sample, PT probabilistically returns an interval, which is either null or constructed using an adjusted confidence level, thereby preserving marginal coverage. While PT potentially yields a deceptively lower interval length, it introduces practical vulnerabilities: the same input can yield completely different prediction intervals across repeated runs of the algorithm. We formally derive the conditions under which PT achieves these misleading improvements and provides extensive empirical evidence across various regression and classification tasks. Furthermore, we introduce a new metric interval stability which helps detect whether a new CP method implicitly improves the length based on such PT-like techniques.

Updated: 2026-01-29 09:31:28

标题: 质疑拟合预测中的覆盖长度度量：短区间并非总是更好

摘要: 共形预测（CP）已成为无分布不确定性量化的基石，传统上通过其覆盖率和区间长度进行评估。本文对这些标准度量的充分性进行了批判性检查。我们证明，通过一种称为“Prejudicial Trick（PT）”的反直觉方法，区间长度可能会被欺骗性地改进，而覆盖率仍然有效。具体地，对于任何给定的测试样本，PT以概率返回一个区间，该区间要么为空，要么是使用调整后的置信水平构建的，从而保持边际覆盖率。虽然PT可能会带来一个欺骗性较低的区间长度，但它引入了实际的脆弱性：相同的输入可能会在算法的重复运行中产生完全不同的预测区间。我们正式推导出PT实现这些误导性改进的条件，并在各种回归和分类任务中提供了广泛的经验证据。此外，我们引入了一个新的度量指标区间稳定性，帮助检测一个新的CP方法是否暗含地基于类似PT的技术改进了长度。

更新时间: 2026-01-29 09:31:28

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2601.21455v1

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they encode concepts has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have widely reported that neural networks represent meaningful concepts as linear directions in their representation spaces and often encode diverse concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, are utilized to address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods are the backbone of modern mechanistic interpretability, yet in practice they consistently produce polysemantic features, feature absorption, and dead neurons, with very limited theoretical understanding of why these phenomena occur. Existing theoretical work is limited to tied-weight sparse autoencoders, leaving the broader family of SDL methods without formal grounding. We develop the first unified theoretical framework that casts all major SDL variants as a single piecewise biconvex optimization problem, and characterize its global solution set, non-identifiability, and spurious optima. This analysis yields principled explanations for feature absorption and dead neurons. To expose these pathologies under full ground-truth access, we introduce the Linear Representation Bench. Guided by our theory, we propose feature anchoring, a novel technique that restores SDL identifiability, substantially improving feature recovery across synthetic benchmarks and real neural representations.

Updated: 2026-01-29 09:30:49

标题: 一种关于机制可解释性中稀疏字典学习的统一理论：分段双凸性和虚假最小值

摘要: 随着人工智能模型在各个领域取得了显著的能力，理解它们学习的表示和如何编码概念对于科学进步和可靠部署变得越来越重要。最近在机制可解释性方面的研究广泛报道，神经网络将有意义的概念表示为它们表示空间中的线性方向，并经常以叠加的方式编码多样的概念。各种稀疏字典学习（SDL）方法，包括稀疏自编码器、转码器和交码器，被用来通过训练具有稀疏约束的辅助模型来将这些叠加的概念解开为单义特征。这些方法是现代机制可解释性的基础，然而在实践中它们一贯产生多义特征、特征吸收和死神经元，并且对为什么会出现这些现象的理论理解非常有限。现有的理论工作仅限于绑定权重的稀疏自编码器，没有对更广泛的SDL方法进行形式化的基础。我们开发了第一个统一的理论框架，将所有主要的SDL变体都视为一个单一的分段双凸优化问题，并对其全局解集、非可辨识性和虚假最优解进行了表征。这一分析为特征吸收和死神经元提供了原理性的解释。为了在完全获得基础事实的情况下暴露这些病理现象，我们引入了线性表示基准。在我们的理论指导下，我们提出了特征锚定，一种恢复SDL可辨识性的新技术，大大改善了在合成基准和真实神经表示中的特征恢复。

更新时间: 2026-01-29 09:30:49

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.05534v3

LION: A Clifford Neural Paradigm for Multimodal-Attributed Graph Learning

Recently, the rapid advancement of multimodal domains has driven a data-centric paradigm shift in graph ML, transitioning from text-attributed to multimodal-attributed graphs. This advancement significantly enhances data representation and expands the scope of graph downstream tasks, such as modality-oriented tasks, thereby improving the practical utility of graph ML. Despite its promise, limitations exist in the current neural paradigms: (1) Neglect Context in Modality Alignment: Most existing methods adopt topology-constrained or modality-specific operators as tokenizers. These aligners inevitably neglect graph context and inhibit modality interaction, resulting in suboptimal alignment. (2) Lack of Adaptation in Modality Fusion: Most existing methods are simple adaptations for 2-modality graphs and fail to adequately exploit aligned tokens equipped with topology priors during fusion, leading to poor generalizability and performance degradation. To address the above issues, we propose LION (c\underline{LI}ff\underline{O}rd \underline{N}eural paradigm) based on the Clifford algebra and decoupled graph neural paradigm (i.e., propagation-then-aggregation) to implement alignment-then-fusion in multimodal-attributed graphs. Specifically, we first construct a modality-aware geometric manifold grounded in Clifford algebra. This geometric-induced high-order graph propagation efficiently achieves modality interaction, facilitating modality alignment. Then, based on the geometric grade properties of aligned tokens, we propose adaptive holographic aggregation. This module integrates the energy and scale of geometric grades with learnable parameters to improve modality fusion. Extensive experiments on 9 datasets demonstrate that LION significantly outperforms SOTA baselines across 3 graph and 3 modality downstream tasks.

Updated: 2026-01-29 09:30:36

标题: 狮子：一种用于多模态属性图学习的Clifford神经范例

摘要: 最近，多模态领域的快速发展推动了图机器学习中的数据中心范式转变，从文本属性到多模态属性的图。这一进展显著增强了数据表示，并扩大了图下游任务的范围，例如面向模态的任务，从而提高了图机器学习的实用性。尽管有着很大的潜力，但当前神经范式存在一些限制：（1）忽视模态对齐中的上下文：大多数现有方法采用受拓扑限制或特定于模态的运算符作为标记器。这些对齐器不可避免地忽视了图上下文并抑制了模态交互，导致对齐效果不佳。（2）模态融合中缺乏自适应性：大多数现有方法仅是对2模态图的简单适应，未能充分利用在融合期间配备了拓扑先验知识的对齐标记，导致泛化能力差和性能下降。为解决上述问题，我们提出了基于Clifford代数和解耦图神经范式（即传播-聚合）的LION（c\underline{LI}ff\underline{O}rd \underline{N}eural paradigm），以在多模态属性图中实现对齐然后融合。具体而言，我们首先构建了一个基于Clifford代数的模态感知几何流形。这种几何诱导的高阶图传播有效实现了模态交互，促进了模态对齐。然后，基于对齐标记的几何级别属性，我们提出了自适应全息聚合。该模块将几何级别的能量和尺度与可学习参数结合起来，以提高模态融合。对9个数据集的大量实验表明，LION在3个图和3个模态下游任务中明显优于SOTA基线。

更新时间: 2026-01-29 09:30:36

领域: cs.AI

下载: http://arxiv.org/abs/2601.21453v1

SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation

While works such as OneRec have validated the scaling laws of Large Language Models (LLMs) in recommender systems, they rely on a cumbersome separate vocabulary. This dependency prevents the model architecture from reusing native LLM vocabularies, resulting in high maintenance costs and poor scalability. In response, we aim to efficiently reuse open-source LLM architectures without constructing a separate tokenization vocabulary. Furthermore, we identify that the optimization strategy of OneRec Gradient Bounded Policy Optimization (GBPO),suffers from a "Symmetric Conservatism" problem: its static gradient boundaries structurally suppress the update momentum required for cold-start items and fail to prevent diversity collapse in high-noise environments.To address this issue, we propose SAGE (Sequence-level Adaptive Gradient Evolution), a unified optimization framework tailored for list-wise generative recommendation. SAGE introduces two key innovations:(1) Sequence-level Signal Decoupling: By combining a geometric mean importance ratio with decoupled multi-objective advantages, we eliminate token-level variance and resolve the "Reward Collapse" problem. (2) Asymmetric Adaptive Dynamics: We construct a dynamic gradient manifold that applies a "Boost Factor" to high-potential cold start items to achieve super-linear updates and employs an "Entropy Aware Penalty" to break information cocoons. Theoretical analysis and empirical results demonstrate that SAGE effectively unblocks cold-start traffic and sustains recommendation diversity, all while retaining the numerical stability of GBPO.

Updated: 2026-01-29 09:30:13

标题: SAGE：用于生成式推荐的序列级自适应梯度进化

摘要: 尽管像OneRec这样的作品已经验证了大型语言模型（LLMs）在推荐系统中的规模定律，但它们依赖于繁琐的单独词汇表。这种依赖性阻止了模型架构重用本机LLM词汇，导致高维护成本和可扩展性差。为了应对这一问题，我们旨在高效地重用开源LLM架构，而无需构建单独的标记化词汇表。此外，我们发现OneRec梯度有界策略优化（GBPO）的优化策略存在“对称保守主义”问题：其静态梯度边界结构上抑制了冷启动项目所需的更新动量，并未能防止在高噪声环境中的多样性崩溃。为了解决这个问题，我们提出了SAGE（序列级自适应梯度演化），这是一个专为列表式生成式推荐量身定制的优化框架。SAGE引入了两个关键创新：（1）序列级信号解耦：通过将几何平均重要比率与解耦的多目标优势相结合，我们消除了标记级别的方差并解决了“奖励崩溃”问题。（2）非对称自适应动力学：我们构建了一个动态梯度流形，为高潜力的冷启动项目应用“增强因子”以实现超线性更新，并利用“熵感知惩罚”打破信息茧囊。理论分析和实证结果表明，SAGE有效地解决了冷启动流量问题，保持推荐多样性，同时保留了GBPO的数值稳定性。

更新时间: 2026-01-29 09:30:13

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21452v1

ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs' performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI-aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark features 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Evaluation results reveal substantial performance gaps, with state-of-the-art Claude-4.5-opus achieving only 30.74\% on Verilog generation and 13.33\% on Python reference model generation, demonstrating significant challenges compared to existing saturated benchmarks where SOTA models achieve over 95\% pass rates. Additionally, to help enhance LLM reference model generation, we provide an automated toolbox for high-quality training data generation, facilitating future research in this underexplored domain. Our code is available at https://github.com/zhongkaiyu/ChipBench.git.

Updated: 2026-01-29 09:26:55

标题: ChipBench：用于评估AI辅助芯片设计中LLM性能的下一步基准测试

摘要: 大型语言模型（LLMs）在硬件工程中表现出显著潜力，但当前的基准测试存在饱和和任务多样性有限的问题，未能反映LLMs在实际工业工作流程中的性能。为了填补这一空白，我们提出了一个全面的AI辅助芯片设计基准测试，严格评估LLMs在三个关键任务中的表现：Verilog生成、调试和参考模型生成。我们的基准测试包括44个具有复杂分层结构的实际模块，89个系统化调试案例以及132个Python、SystemC和CXXRTL的参考模型样本。评估结果显示出明显的性能差距，最先进的Claude-4.5-opus仅在Verilog生成方面达到30.74\%，在Python参考模型生成方面仅达到13.33\%，与现有饱和基准测试相比，其中最先进的模型达到95%以上的通过率，表现出明显挑战。此外，为了帮助提升LLM参考模型生成，我们提供了一个自动化工具包，用于生成高质量的训练数据，促进未开发领域的未来研究。我们的代码可以在https://github.com/zhongkaiyu/ChipBench.git上找到。

更新时间: 2026-01-29 09:26:55

领域: cs.AI,cs.AR

下载: http://arxiv.org/abs/2601.21448v1

Synthetic Pattern Generation and Detection of Financial Activities using Graph Autoencoders

Illicit financial activities such as money laundering often manifest through recurrent topological patterns in transaction networks. Detecting these patterns automatically remains challenging due to the scarcity of labeled real-world data and strict privacy constraints. To address this, we investigate whether Graph Autoencoders (GAEs) can effectively learn and distinguish topological patterns that mimic money laundering operations when trained on synthetic data. The analysis consists of two phases: (i) data generation, where synthetic samples are created for seven well-known illicit activity patterns using parametrized generators that preserve structural consistency while introducing realistic variability; and (ii) model training and validation, where separate GAEs are trained on each pattern without explicit labels, relying solely on reconstruction error as an indicator of learned structure. We compare three GAE implementations based on three distinct convolutional layers: Graph Convolutional (GAE-GCN), GraphSAGE (GAE-SAGE), and Graph Attention Network (GAE-GAT). Experimental results show that GAE-GCN achieves the most consistent reconstruction performance across patterns, while GAE-SAGE and GAE-GAT exhibit competitive results only in few specific patterns. These findings suggest that graph-based representation learning on synthetic data provides a viable path toward developing AI-driven tools for detecting illicit behaviors, overcoming the limitations of financial datasets.

Updated: 2026-01-29 09:25:13

标题: 使用图自动编码器生成和检测金融活动的合成模式

摘要: 非法金融活动，如洗钱，通常通过交易网络中的重复拓扑模式表现出来。由于现实世界数据标记稀缺和严格的隐私约束，自动检测这些模式仍然具有挑战性。为了解决这个问题，我们调查了图自动编码器（GAEs）是否能够有效地学习和区分在合成数据上训练时模仿洗钱操作的拓扑模式。分析分为两个阶段：（i）数据生成，在此阶段，利用参数化生成器为七种众所周知的非法活动模式创建合成样本，这些生成器保持结构一致性同时引入现实变化；（ii）模型训练和验证，在此阶段，分别对每种模式训练独立的GAE，没有明确标签，仅依靠重构误差作为学习结构的指标。我们比较基于三个不同卷积层的三种GAE实现：图卷积（GAE-GCN）、GraphSAGE（GAE-SAGE）和图注意力网络（GAE-GAT）。实验结果表明，GAE-GCN在各种模式中实现了最一致的重构性能，而GAE-SAGE和GAE-GAT仅在少数特定模式中展示出竞争性结果。这些发现表明，基于合成数据的图表示学习为开发用于检测非法行为的人工智能工具提供了可行的途径，克服了金融数据集的限制。

更新时间: 2026-01-29 09:25:13

领域: cs.LG,cs.CE,cs.ET

下载: http://arxiv.org/abs/2601.21446v1

How simple can you go? An off-the-shelf transformer approach to molecular dynamics

Most current neural networks for molecular dynamics (MD) include physical inductive biases, resulting in specialized and complex architectures. This is in contrast to most other machine learning domains, where specialist approaches are increasingly replaced by general-purpose architectures trained on vast datasets. In line with this trend, several recent studies have questioned the necessity of architectural features commonly found in MD models, such as built-in rotational equivariance or energy conservation. In this work, we contribute to the ongoing discussion by evaluating the performance of an MD model with as few specialized architectural features as possible. We present a recipe for MD using an Edge Transformer, an ``off-the-shelf'' transformer architecture that has been minimally modified for the MD domain, termed MD-ET. Our model implements neither built-in equivariance nor energy conservation. We use a simple supervised pre-training scheme on $\sim$30 million molecular structures from the QCML database. Using this ``off-the-shelf'' approach, we show state-of-the-art results on several benchmarks after fine-tuning for a small number of steps. Additionally, we examine the effects of being only approximately equivariant and energy conserving for MD simulations, proposing a novel method for distinguishing the errors resulting from non-equivariance from other sources of inaccuracies like numerical rounding errors. While our model exhibits runaway energy increases on larger structures, we show approximately energy-conserving NVE simulations for a range of small structures.

Updated: 2026-01-29 09:23:40

标题: 你可以走得多简单？一种现成的变压器方法用于分子动力学

摘要: 目前大多数用于分子动力学（MD）的神经网络包含物理归纳偏差，导致专门化和复杂的架构。这与大多数其他机器学习领域不同，那里专家方法越来越被在庞大数据集上训练的通用架构所取代。与这一趋势一致，最近几项研究质疑了在MD模型中常见的架构特征的必要性，如内置的旋转等变性或能量守恒。在这项工作中，我们通过评估具有尽可能少专门化架构特征的MD模型的性能，为正在进行的讨论做出贡献。我们提出了一种使用Edge Transformer的MD配方，这是一个“即插即用”的变压器架构，经过最小修改用于MD领域，称为MD-ET。我们的模型既不实现内置等变性，也不实现能量守恒。我们使用了一个简单的监督预训练方案，从QCML数据库中的约3000万分子结构中学习。通过这种“即插即用”的方法，在进行少量步骤的微调后，我们展示了在几个基准测试中的最先进结果。此外，我们研究了在MD模拟中仅近似等变和能量守恒的影响，提出了一种区分由于非等变性而产生的错误与其他不准确性来源（如数值舍入误差）的新方法。虽然我们的模型在较大的结构上表现出能量急剧增加，但我们展示了一系列小型结构的近似能量守恒的NVE模拟。

更新时间: 2026-01-29 09:23:40

领域: cs.LG

下载: http://arxiv.org/abs/2503.01431v3

Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB

Updated: 2026-01-29 09:23:13

标题: Spava：通过序列-并行感知的近似注意力加速长视频理解

摘要: 长视频推断的效率仍然是一个关键瓶颈，主要是由于大型多模态模型（LMMs）预加载阶段中的密集计算。现有的方法要么压缩视觉嵌入，要么在单个GPU上应用稀疏注意力，导致加速有限或性能下降，并限制了LMMs处理更长、更复杂的视频。为了解决这些问题，我们提出了Spava，这是一个具有优化注意力的序列并行框架，可以加速跨多个GPU的长视频推断。通过分布式近似注意力，Spava减少了计算量并增加了并行性，使得能够高效处理更多的视觉嵌入而无需压缩，从而提高任务性能。系统级优化，如负载平衡和融合前向传递，进一步释放了Spava的潜力，相对于FlashAttn、ZigZagRing和APB，速度提高了12.72倍、1.70倍和1.18倍，而性能损失不明显。代码可在https://github.com/thunlp/APB找到。

更新时间: 2026-01-29 09:23:13

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21444v1

The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making

While Large Language Models (LLMs) are widely documented to be sensitive to minor prompt perturbations and prone to sycophantic alignment with user biases, their robustness in consequential, rule-bound decision-making remains under-explored. In this work, we uncover a striking "Paradox of Robustness": despite their known lexical brittleness, instruction-tuned LLMs exhibit a behavioral and near-total invariance to emotional framing effects. Using a novel controlled perturbation framework across three high-stakes domains (healthcare, law, and finance), we quantify a robustness gap where LLMs demonstrate 110-300 times greater resistance to narrative manipulation than human subjects. Specifically, we find a near-zero effect size for models (Cohen's h = 0.003) compared to the substantial biases observed in humans (Cohen's h in [0.3, 0.8]). This result is highly counterintuitive and suggests the mechanisms driving sycophancy and prompt sensitivity do not necessarily translate to a failure in logical constraint satisfaction. We show that this invariance persists across models with diverse training paradigms. Our findings show that while LLMs may be "brittle" to how a query is formatted, they are remarkably "stable" against why a decision should be biased. Our findings establish that instruction-tuned models can decouple logical rule-adherence from persuasive narratives, offering a source of decision stability that complements, and even potentially de-biases, human judgment in institutional contexts. We release the 162-scenario benchmark, code, and data to facilitate the rigorous evaluation of narrative-induced bias and robustness on GitHub.com.

Updated: 2026-01-29 09:17:05

标题: 《鲜明对比：在高风险决策中将基于规则逻辑与情感噪音分离的困境》

摘要: 尽管大型语言模型（LLMs）被广泛证明对微小提示扰动敏感，并倾向于与用户偏见产生阿谀奉承的一致性，但它们在重要、受规则约束的决策中的稳健性仍未被充分探讨。在这项工作中，我们揭示了一个引人注目的“稳健性悖论”：尽管已知它们在词汇上易碎，但经过指导调整的LLMs表现出行为上的近乎完全不变性，对情绪框架效应几乎没有反应。通过在三个高风险领域（医疗保健、法律和金融）中使用一种新颖的受控扰动框架，我们量化了一种稳健性差距，LLMs表现出110-300倍更大的抵抗叙事操纵的能力，相比之下人类主体观察到的偏见明显（Cohen的h在[0.3，0.8]）。具体来说，我们发现模型的效果大小接近零（Cohen的h = 0.003），而人类观察到的显著偏见（Cohen的h在[0.3，0.8]）。这一结果极具反直觉性，并表明导致阿谀奉承和提示敏感性的机制不一定会导致逻辑约束满足失败。我们展示了这种不变性在具有不同训练范式的模型中仍然存在。我们的研究结果表明，尽管LLMs可能对查询格式化方式“脆弱”，但它们对决策应该出于何种偏见的稳定性却非常高。我们的发现证明，指导调整的模型可以将逻辑规则遵从与有说服力的叙事分离，为决策稳定性提供一种补充，甚至有可能去偏见人类在机构环境中的判断。我们发布了162个场景的基准、代码和数据，以促进对GitHub.com上叙事诱导偏见和稳健性的严格评估。

更新时间: 2026-01-29 09:17:05

领域: cs.AI

下载: http://arxiv.org/abs/2601.21439v1

From Consistency to Complementarity: Aligned and Disentangled Multi-modal Learning for Time Series Understanding and Reasoning

Advances in multi-modal large language models (MLLMs) have inspired time series understanding and reasoning tasks, that enable natural language querying over time series, producing textual analyses of complex temporal dynamics. Recent attempts hybridize numerical time series with their visualized plots, facilitating precise value reasoning and visual structure comprehension for comprehensive time series understanding of MLLMs. However, effective cross-modal integration remains challenging due to fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics, which hinder localized interpretation and complementary reasoning. To address these issues, we propose MADI, a multi-modal LLM enhanced with fine-grained alignment and disentangled interaction, featuring (1) Patch-level Alignment, which enforces physically grounded fine-grained correspondence across heterogeneous modalities, (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively synergizes the purified modality-unique information, and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals for robust reasoning. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.

Updated: 2026-01-29 09:13:46

标题: 从一致性到互补性：针对时间序列理解和推理的对齐和解耦多模态学习

摘要: 多模态大型语言模型（MLLMs）的进展激发了时间序列理解和推理任务，使自然语言查询时间序列成为可能，产生复杂时间动态的文本分析。最近的尝试将数值时间序列与其可视化图表相结合，促进了准确的数值推理和对MLLMs的全面时间序列理解的视觉结构理解。然而，由于跨模态之间细粒度的时间不对齐以及共享和特定模态语义之间严重交织，有效的跨模态整合仍然具有挑战性，这些问题妨碍了定位的解释和补充推理。为了解决这些问题，我们提出了MADI，一个增强了细粒度对齐和解耦交互的多模态LLM，具有如下特点：（1）Patch-level Alignment，强化异构模态之间物理基础的细粒度对应关系；（2）Discrete Disentangled Interaction，将模态共同语义分离成紧凑的离散潜在变量，并自适应地协同纯化了模态唯一信息；（3）Critical-token Highlighting，强调了信息丰富的、与查询相关的信号，以进行强大的推理。在合成和真实世界基准测试中的实验结果表明，MADI始终优于通用目的的LLMs和时间序列专门化的MLLMs。

更新时间: 2026-01-29 09:13:46

领域: cs.LG,cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2601.21436v1

When Prohibitions Become Permissions: Auditing Negation Sensitivity in Language Models

When a user tells an AI system that someone "should not" take an action, the system ought to treat this as a prohibition. Yet many large language models do the opposite: they interpret negated instructions as affirmations. We audited 16 models across 14 ethical scenarios and found that open-source models endorse prohibited actions 77% of the time under simple negation and 100% under compound negation -- a 317% increase over affirmative framing. Commercial models fare better but still show swings of 19-128%. Agreement between models drops from 74% on affirmative prompts to 62% on negated ones, and financial scenarios prove twice as fragile as medical ones. These patterns hold under deterministic decoding, ruling out sampling noise. We present case studies showing how these failures play out in practice, propose the Negation Sensitivity Index (NSI) as a governance metric, and outline a tiered certification framework with domain-specific thresholds. The findings point to a gap between what current alignment techniques achieve and what safe deployment requires: models that cannot reliably distinguish "do X" from "do not X" should not be making autonomous decisions in high-stakes contexts.

Updated: 2026-01-29 09:09:23

标题: 当禁忌变成许可：审计语言模型中的否定敏感性

摘要: 当用户告诉AI系统某人“不应该”采取某种行动时，系统应该将其视为禁令。然而，许多大型语言模型却相反：它们将否定的指令解释为肯定。我们审查了16个模型在14个道德场景下的表现，发现开源模型在简单否定时有77%的概率支持被禁止的行动，在复合否定时为100% - 比肯定语境下增加了317%。商业模型表现更好，但仍存在19-128%的波动。模型之间的一致性从肯定提示的74%降至否定提示的62%，财务场景的脆弱性是医学场景的两倍。这些模式在确定性解码下保持一致，排除了抽样噪音。我们提出了一些案例研究，展示这些失败如何在实践中发生，提出了否定敏感性指数（NSI）作为一种治理度量，并概述了一个具有领域特定阈值的分层认证框架。研究结果表明，当前对齐技术所能实现的与安全部署所需的之间存在差距：不能可靠区分“做X”和“不做X”的模型不应在高风险环境中做出自主决策。

更新时间: 2026-01-29 09:09:23

领域: cs.AI

下载: http://arxiv.org/abs/2601.21433v1

Physics-Informed Neural Networks for Real-Time Gas Crossover Prediction in PEM Electrolyzers: First Application with Multi-Membrane Validation

Green hydrogen production via polymer electrolyte membrane (PEM) water electrolysis is pivotal for energy transition, yet hydrogen crossover through membranes threatens safety and economic viability-approaching explosive limits (4 mol% H$_2$ in O$_2$) while reducing Faradaic efficiency and accelerating membrane degradation. Current physics-based models require extensive calibration and computational resources that preclude real-time implementation, while purely data-driven approaches fail to extrapolate beyond training conditions-critical for dynamic electrolyzer operation. Here we present the first application of physics-informed neural networks (PINNs) for hydrogen crossover prediction, trained on 184 published measurements augmented to 1,114 points and constrained by a constitutive physics model (Henry's law, Fick's diffusion, and Faraday-based gas production) embedded in the loss function. Our compact architecture (17,793 parameters), validated across six membranes under industrially relevant conditions (0.05-5.0 A/cm$^2$, 1-200 bar, 25-85°C), achieves exceptional accuracy (R$^2$ = 99.84% $\pm$ 0.15%, RMSE = 0.0932% $\pm$ 0.0438%) based on five-fold cross-validation, with sub-millisecond inference enabling real-time control. Remarkably, the model maintains R$^2$ > 86% when predicting crossover at pressures 2.5x beyond training range-substantially outperforming pure neural networks (R$^2$ = 43.4%). The hardware-agnostic deployment, from desktop CPUs to edge devices (Raspberry Pi 4), enables distributed safety monitoring essential for gigawatt-scale installations. By bridging physical rigor and computational efficiency, this work establishes a new paradigm for real-time electrolyzer monitoring, accelerating deployment of safe, efficient green hydrogen infrastructure crucial for net-zero emissions targets.

Updated: 2026-01-29 09:09:10

标题: 物理信息神经网络用于PEM电解器中实时气体交叉预测的首次应用：多膜验证

摘要: 通过聚合物电解质膜（PEM）水电解产生绿色氢气对于能源转型至关重要，然而氢气穿透膜的问题威胁着安全性和经济可行性 - 靠近爆炸极限（4 mol％H$_2$在O$_2$中），同时降低法拉第效率并加速膜降解。目前基于物理的模型需要广泛校准和计算资源，以排除实时实施的可能性，而纯数据驱动方法无法在培训条件之外进行推断 - 对于动态电解器操作至关重要。在这里，我们提出了物理信息神经网络（PINNs）用于氢气穿透预测的首次应用，通过在184个已发表的测量数据上进行训练，扩充至1,114个点，并受到损失函数中嵌入的构成物理模型（亨利定律、菲克扩散和基于法拉第的气体产生）的约束。我们的紧凑架构（17,793个参数），在工业相关条件下（0.05-5.0 A/cm$^2$，1-200 bar，25-85°C）验证了其卓越准确性（R$^2$ = 99.84% $\pm$ 0.15%，RMSE = 0.0932% $\pm$ 0.0438%），基于五倍交叉验证，亚毫秒推断实现了实时控制。值得注意的是，该模型在预测超出训练范围2.5倍的压力下仍保持R$^2$ > 86％ - 大大优于纯神经网络（R$^2$ = 43.4％）。硬件无关的部署，从台式机CPU到边缘设备（树莓派4），实现了分布式安全监控，这对于千兆瓦规模的安装至关重要。通过将物理严谨性和计算效率结合起来，这项工作为实时电解器监控建立了一个新范式，加快了安全、高效的绿色氢气基础设施的部署，这对于净零排放目标至关重要。

更新时间: 2026-01-29 09:09:10

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.05879v3

AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context

High-quality evaluation benchmarks are pivotal for deploying Large Language Models (LLMs) in Automated Code Review (ACR). However, existing benchmarks suffer from two critical limitations: first, the lack of multi-language support in repository-level contexts, which restricts the generalizability of evaluation results; second, the reliance on noisy, incomplete ground truth derived from raw Pull Request (PR) comments, which constrains the scope of issue detection. To address these challenges, we introduce AACR-Bench a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects often overlooked in original PRs, resulting in a 285% increase in defect coverage. Extensive evaluations of mainstream LLMs on AACR-Bench reveal that previous assessments may have either misjudged or only partially captured model capabilities due to data limitations. Our work establishes a more rigorous standard for ACR evaluation and offers new insights on LLM based ACR, i.e., the granularity/level of context and the choice of retrieval methods significantly impact ACR performance, and this influence varies depending on the LLM, programming language, and the LLM usage paradigm e.g., whether an Agent architecture is employed. The code, data, and other artifacts of our evaluation set are available at https://github.com/alibaba/aacr-bench .

Updated: 2026-01-29 09:05:14

标题: AACR-Bench: 评估具有整体存储库级上下文的自动代码审查

摘要: 高质量的评估基准对于在自动化代码审查中部署大型语言模型（LLMs）至关重要。然而，现有的基准存在两个关键限制：首先，在存储库级上缺乏多语言支持，这限制了评估结果的泛化能力；其次，依赖于从原始Pull Request（PR）评论中衍生出的嘈杂、不完整的基本事实，这限制了问题检测的范围。为了解决这些挑战，我们引入了AACR-Bench，这是一个提供多种编程语言的全文件跨上下文的全面基准。与传统数据集不同，AACR-Bench采用“人工智能辅助、专家验证”的注释流水线，以发现原始PR中经常被忽视的潜在缺陷，导致缺陷覆盖率增加了285%。对AACR-Bench上主流LLMs的广泛评估表明，以前的评估可能由于数据限制而要么误判了模型能力，要么只部分捕捉了模型能力。我们的工作为ACR评估建立了更严格的标准，并提供了关于基于LLM的ACR的新见解，即上下文的细粒度/级别和检索方法的选择对ACR性能有重大影响，并且这种影响取决于LLM、编程语言和LLM使用范式，例如是否采用Agent架构。我们的评估集的代码、数据和其他工件可在https://github.com/alibaba/aacr-bench 上找到。

更新时间: 2026-01-29 09:05:14

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2601.19494v2

Mitigating Overthinking in Large Reasoning Models via Difficulty-aware Reinforcement Learning

Large Reasoning Models (LRMs) achieve explicit chain-of-thought expansion by imitating deep thinking behaviors of humans, demonstrating excellent performance in complex task scenarios. However, the deep-thinking mode often leads to unnecessarily lengthy reasoning and resource inefficiency when handling simple tasks. This overthinking phenomenon may arise from the generation preference triggered by the reward function during post-training. Existing research attempts to mitigate overthinking from the perspective of prompt design or model training, but generally underestimates the importance of task difficulty awareness, which makes it difficult for LRMs to effectively allocate reasoning resources. In this paper, we propose Difficulty-aware Policy Optimization (DiPO), a reinforcement learning-based LRM training framework. DiPO encourages LRM to spontaneously model task complexity, and integrates them into reinforcement learning framework to adjust the generation preferences introduced by post-training. A difficulty modeling method based on model self-reasoning is proposed, which significantly reduces the dependence on manual annotation and formalize task complexity. We further develop a difficulty-signal-enhanced reward function that incorporates a penalty for lengthy reasoning while considering reasoning performance and output format. Experimental results indicate that DiPO enables the model to spontaneously adjust inference overhead, significantly reducing redundant tokens without losing performance due to thought compression.

Updated: 2026-01-29 08:56:45

标题: 通过难度感知强化学习减轻大型推理模型中的过度思考

摘要: 大型推理模型（LRMs）通过模仿人类深思熟虑的行为，实现了明确的推理链扩展，在复杂任务场景中表现出优秀的性能。然而，深思考模式通常会导致在处理简单任务时不必要的冗长推理和资源低效。这种过度思考现象可能源自训练后由奖励函数触发的生成偏好。现有研究试图从提示设计或模型训练的角度缓解过度思考，但通常低估了任务难度意识的重要性，这使得LRMs难以有效分配推理资源。在本文中，我们提出了一种基于困难感知的策略优化（DiPO）强化学习的LRM训练框架。DiPO鼓励LRM自发地建模任务复杂性，并将其整合到强化学习框架中，以调整训练后引入的生成偏好。提出了一种基于模型自我推理的难度建模方法，显著减少了对手动标注的依赖，并形式化了任务复杂性。我们进一步开发了一种基于难度信号增强的奖励函数，其中考虑了推理性能和输出格式，同时对冗长推理进行了惩罚。实验结果表明，DiPO使模型能够自发调整推理开销，显著减少了冗余标记，同时由于思维压缩而不会损失性能。

更新时间: 2026-01-29 08:56:45

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21418v1

System 1&2 Synergy via Dynamic Model Interpolation

Training a unified language model that adapts between intuitive System 1 and deliberative System 2 remains challenging due to interference between their cognitive modes. Recent studies have thus pursued making System 2 models more efficient. However, these approaches focused on output control, limiting what models produce. We argue that this paradigm is misaligned: output length is merely a symptom of the model's cognitive configuration, not the root cause. In this work, we shift the focus to capability control, which modulates \textit{how models think} rather than \textit{what they produce}. To realize this, we leverage existing Instruct and Thinking checkpoints through dynamic parameter interpolation, without additional training. Our pilot study establishes that linear interpolation yields a convex, monotonic Pareto frontier, underpinned by representation continuity and structural connectivity. Building on this, we propose \textbf{DAMI} (\textbf{D}yn\textbf{A}mic \textbf{M}odel \textbf{I}nterpolation), a framework that estimates a query-specific Reasoning Intensity $λ(q)$ to configure cognitive depth. For training-based estimation, we develop a preference learning method encoding accuracy and efficiency criteria. For zero-shot deployment, we introduce a confidence-based method leveraging inter-model cognitive discrepancy. Experiments on five mathematical reasoning benchmarks demonstrate that DAMI achieves higher accuracy than the Thinking model while remaining efficient, effectively combining the efficiency of System 1 with the reasoning depth of System 2.

Updated: 2026-01-29 08:53:16

标题: System 1和2通过动态模型插值的协同作用

摘要: 训练一个统一的语言模型，可以在直觉的系统1和深思熟虑的系统2之间进行调整，由于它们的认知模式之间的干扰，仍然具有挑战性。最近的研究因此致力于使系统2模型更加高效。然而，这些方法侧重于输出控制，限制了模型的产出。我们认为这种范式是不对齐的：输出长度仅仅是模型认知配置的症状，而不是根本原因。在这项工作中，我们将重点转移到能力控制上，这可以调节\textit{模型如何思考}而不是\textit{它们产生什么}。为了实现这一点，我们利用现有的Instruct和Thinking检查点，通过动态参数插值，而无需额外训练。我们的试点研究表明，线性插值产生了一个凸的、单调的帕累托前沿，基于表示的连续性和结构的连通性。基于此，我们提出了\textbf{DAMI}（\textbf{D}yn\textbf{A}mic \textbf{M}odel \textbf{I}nterpolation），这是一个框架，用于估计一个特定查询的推理强度$λ(q)$，以配置认知深度。对于基于训练的估计，我们开发了一种偏好学习方法，编码准确性和效率标准。对于零次部署，我们引入了一种基于信心的方法，利用模型间认知差异。在五个数学推理基准上的实验表明，DAMI实现了比Thinking模型更高的准确性，同时保持高效，有效地结合了系统1的效率和系统2的推理深度。

更新时间: 2026-01-29 08:53:16

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21414v1

Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

Large language models (LLMs) continue to struggle with low-resource languages, primarily due to limited training data, translation noise, and unstable cross-lingual alignment. To address these challenges, we propose LiRA (Linguistic Robust Anchoring for LLMs)-a plug-and-play framework that requires only lightweight fine-tuning on top of existing pretrained backbones. LiRA jointly optimizes representation stability and cross-lingual semantic consistency by combining two key components: Arca (Anchored Representation Composition Architecture), which aligns low-resource inputs to a shared English semantic space through anchor-based alignment and collaborative encoding; and LaSR (Language-coupled Semantic Reasoner), a lightweight, language-aware head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. We theoretically show that under controlled anchoring error and translation-induced bias, LiRA guarantees bounded representation deviation and stable downstream performance under local Lipschitz continuity. To facilitate research, we release a new multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Extensive experiments across diverse low-resource benchmarks demonstrate consistent improvements in retrieval, ranking, question answering, and reasoning tasks. Code will be publicly available on GitHub, and the dataset will be hosted on Hugging Face.

Updated: 2026-01-29 08:51:19

标题: 朝着低资源语言的LLMs的稳健多语言适应前进

摘要: 大型语言模型（LLMs）在处理低资源语言方面持续遇到困难，主要是由于训练数据有限、翻译噪声和跨语言对齐不稳定。为了解决这些挑战，我们提出了LiRA（用于LLMs的语言鲁棒定位）-一个即插即用的框架，只需要在现有预训练骨干模型上进行轻量级微调。LiRA通过结合两个关键组件来共同优化表示稳定性和跨语言语义一致性：Arca（基于锚点对齐的表示组合架构），通过基于锚点的对齐和协作编码将低资源输入对齐到共享的英语语义空间；以及LaSR（语言耦合语义推理器），一个轻量级的、语言感知的头部，强制执行一致性正则化，用于统一的跨语言理解、检索和推理。我们在理论上证明，在受控制的定位误差和翻译引起的偏见下，LiRA保证在局部Lipschitz连续性下表示偏差有界，并且具有稳定的下游性能。为了促进研究，我们发布了一个新的多语言产品检索数据集，涵盖了五种东南亚语言和两种南亚语言。在各种低资源基准测试中进行的广泛实验表明，在检索、排名、问题回答和推理任务中一致改进。代码将在GitHub上公开，数据集将托管在Hugging Face上。

更新时间: 2026-01-29 08:51:19

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.14466v2

Efficient Test-Time Adaptation through Latent Subspace Coefficients Search

Real-world deployment often exposes models to distribution shifts, making test-time adaptation (TTA) critical for robustness. Yet most TTA methods are unfriendly to edge deployment, as they rely on backpropagation, activation buffering, or test-time mini-batches, leading to high latency and memory overhead. We propose $\textbf{ELaTTA}$ ($\textit{Efficient Latent Test-Time Adaptation}$), a gradient-free framework for single-instance TTA under strict on-device constraints. ELaTTA freezes model weights and adapts each test sample by optimizing a low-dimensional coefficient vector in a source-induced principal latent subspace, pre-computed offline via truncated SVD and stored with negligible overhead. At inference, ELaTTA encourages prediction confidence by optimizing the $k$-D coefficients with CMA-ES, effectively optimizing a Gaussian-smoothed objective and improving stability near decision boundaries. Across six benchmarks and multiple architectures, ELaTTA achieves state-of-the-art accuracy under both strict and continual single-instance protocols, while reducing compute by up to $\textit{63$\times$}$ and peak memory by $\textit{11$\times$}$. We further demonstrate on-device deployment on a ZYNQ-7020 platform. Code will be released upon acceptance.

Updated: 2026-01-29 08:50:56

标题: 通过潜在子空间系数搜索实现高效的测试时间适应性

摘要: 实际部署往往会使模型暴露于分布偏移，这使得测试时间适应（TTA）对于稳健性至关重要。然而，大多数TTA方法对边缘部署不友好，因为它们依赖于反向传播、激活缓冲或测试时间小批量，在测试时间会导致高延迟和内存开销。我们提出了$\textbf{ELaTTA}$（$\textit{高效的潜在测试时间适应}$）一个无梯度的框架，用于在严格的设备约束条件下进行单实例TTA。ELaTTA冻结模型权重，并通过优化一个在源诱导的主要潜在子空间中的低维系数向量来适应每个测试样本，通过截断SVD离线预先计算并存储，开销可忽略。在推理过程中，ELaTTA通过CMA-ES优化$k$-D系数，有效地优化高斯平滑的目标，并改善决策边界附近的稳定性。在六个基准测试和多种架构中，ELaTTA在严格和持续的单实例协议下实现了最先进的准确性，同时将计算量降低了最多$\textit{63$\times$}$，内存峰值降低了$\textit{11$\times$}$。我们进一步展示了在ZYNQ-7020平台上的设备部署。代码将在接受后发布。

更新时间: 2026-01-29 08:50:56

领域: cs.LG,eess.AS,eess.IV

下载: http://arxiv.org/abs/2510.11068v2

Plain Transformers Can be Powerful Graph Learners

Transformers have attained outstanding performance across various modalities, owing to their simple but powerful scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers (GTs) have strayed far from plain Transformers, exhibiting major architectural differences either by integrating message-passing or incorporating sophisticated attention mechanisms. These divergences hinder the easy adoption of training advances for Transformers developed in other domains. Contrary to previous GTs, this work demonstrates that the plain Transformer architecture can be a powerful graph learner. To achieve this, we propose to incorporate three simple, minimal, and easy-to-implement modifications to the plain Transformer architecture to construct our Powerful Plain Graph Transformers (PPGT): (1) simplified $L_2$ attention for measuring the magnitude closeness among tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a simple MLP-based stem for graph positional encoding. Consistent with its theoretical expressivity, PPGT demonstrates noteworthy realized expressivity on the empirical graph expressivity benchmark, comparing favorably to more complicated alternatives such as subgraph GNNs and higher-order GNNs. Its empirical performance across various graph datasets also justifies the effectiveness of PPGT. This finding underscores the versatility of plain Transformer architectures and highlights their strong potential as a unified backbone for multimodal learning across language, vision, and graph domains.

Updated: 2026-01-29 08:50:07

标题: 普通的Transformer模型可以成为强大的图学习器

摘要: 变压器在各种模态下取得了出色的性能，这归功于它们简单但强大的缩放点积（SDP）注意机制。研究人员尝试将变压器迁移到图学习领域，但大多数先进的图变压器（GTs）已经远离了普通的变压器，通过集成消息传递或结合复杂的注意机制展现出主要的架构差异。这些分歧妨碍了在其他领域中开发的变压器训练进展的轻松采纳。与以往的GTs相反，这项工作表明普通的变压器架构可以是一个强大的图学习器。为了实现这一目标，我们提出了三个简单、最小和易于实现的修改，将其纳入普通变压器架构，构建我们的强大的普通图变压器（PPGT）：（1）简化的$L_2$注意力用于衡量令牌之间的大小接近程度；（2）自适应均方根归一化以保留令牌大小信息；以及（3）基于简单MLP的干扰用于图位置编码。与其理论表达力一致，PPGT在经验图表达力基准上展示出了显著的实现表达力，与诸如子图GNN和高阶GNN等更复杂的替代方案相比表现出色。其在各种图数据集上的实证表现也证明了PPGT的有效性。这一发现凸显了普通变压器架构的多功能性，并突显了它们作为语言、视觉和图领域多模态学习的统一骨干的潜力。

更新时间: 2026-01-29 08:50:07

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2504.12588v3

Can Large Language Models Capture Video Game Engagement?

Can out-of-the-box pretrained Large Language Models (LLMs) detect human affect successfully when observing a video? To address this question, for the first time, we evaluate comprehensively the capacity of popular LLMs for successfully predicting continuous affect annotations of videos when prompted by a sequence of text and video frames in a multimodal fashion. In this paper, we test LLMs' ability to correctly label changes of in-game engagement in 80 minutes of annotated videogame footage from 20 first-person shooter games of the GameVibe corpus. We run over 4,800 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing method on engagement prediction. Our findings suggest that while LLMs rightfully claim human-like performance across multiple domains and able to outperform traditional machine learning baselines, they generally fall behind continuous experience annotations provided by humans. We examine some of the underlying causes for a fluctuating performance across games, highlight the cases where LLMs exceed expectations, and draw a roadmap for the further exploration of automated emotion labelling via LLMs.

Updated: 2026-01-29 08:44:15

标题: 大型语言模型能捕捉视频游戏的参与度吗？

摘要: 能够在观看视频时，预训练的大型语言模型（LLMs）能否成功检测人类情感？为了回答这个问题，我们首次全面评估了流行的LLMs在以多模态方式通过文本序列和视频帧提示时成功预测视频的连续情感注释的能力。在本文中，我们测试了LLMs正确标记80分钟来自GameVibe语料库的20款第一人称射击游戏的已注释视频游戏片段中游戏参与度变化的能力。我们进行了超过4800次实验，研究了LLM架构、模型大小、输入模态、提示策略和地面真实处理方法对参与度预测的影响。我们的发现表明，虽然LLMs在多个领域可以正确声称具有类似于人类的性能，并且能够超越传统的机器学习基准线，但它们通常落后于人类提供的连续体验注释。我们分析了跨游戏性能波动的一些潜在原因，突出了LLMs超出预期的案例，并为通过LLMs进一步探索自动情绪标记绘制了一条路线图。

更新时间: 2026-01-29 08:44:15

领域: cs.CV,cs.AI,cs.CL,cs.HC

下载: http://arxiv.org/abs/2502.04379v2

DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis

In real-world data science and enterprise decision-making, critical information is often fragmented across directly queryable structured sources (e.g., SQL, CSV) and "zombie data" locked in unstructured visual documents (e.g., scanned reports, invoice images). Existing data analytics agents are predominantly limited to processing structured data, failing to activate and correlate this high-value visual information, thus creating a significant gap with industrial needs. To bridge this gap, we introduce DataCross, a novel benchmark and collaborative agent framework for unified, insight-driven analysis across heterogeneous data modalities. DataCrossBench comprises 200 end-to-end analysis tasks across finance, healthcare, and other domains. It is constructed via a human-in-the-loop reverse-synthesis pipeline, ensuring realistic complexity, cross-source dependency, and verifiable ground truth. The benchmark categorizes tasks into three difficulty tiers to evaluate agents' capabilities in visual table extraction, cross-modal alignment, and multi-step joint reasoning. We also propose the DataCrossAgent framework, inspired by the "divide-and-conquer" workflow of human analysts. It employs specialized sub-agents, each an expert on a specific data source, which are coordinated via a structured workflow of Intra-source Deep Exploration, Key Source Identification, and Contextual Cross-pollination. A novel reReAct mechanism enables robust code generation and debugging for factual verification. Experimental results show that DataCrossAgent achieves a 29.7% improvement in factuality over GPT-4o and exhibits superior robustness on high-difficulty tasks, effectively activating fragmented "zombie data" for insightful, cross-modal analysis.

Updated: 2026-01-29 08:40:45

标题: DataCross：一种用于跨模态异构数据分析的统一基准和代理框架

摘要: 在现实世界的数据科学和企业决策中，关键信息通常分散在可直接查询的结构化来源（例如SQL、CSV）和锁定在非结构化视觉文档中的“僵尸数据”（例如扫描报告、发票图像）之间。现有的数据分析代理主要局限于处理结构化数据，无法激活和相关这些高价值的视觉信息，从而与工业需求存在显著差距。为了弥合这一差距，我们引入了DataCross，一个用于跨异构数据模态的统一、基于洞察的分析的新型基准和协作代理框架。DataCrossBench包括金融、医疗保健和其他领域的200个端到端分析任务。它通过一个人在循环反合成管道构建，确保了现实复杂性、跨源依赖性和可验证的基本事实。该基准将任务分为三个难度层次，以评估代理在视觉表提取、跨模态对齐和多步联合推理方面的能力。我们还提出了DataCrossAgent框架，灵感来自人类分析师的“分而治之”工作流程。它使用专门的子代理，每个子代理都是特定数据源的专家，通过Intra-source Deep Exploration、Key Source Identification和Contextual Cross-pollination的结构化工作流协调。一种新颖的reReAct机制实现了对事实验证的稳健代码生成和调试。实验结果表明，DataCrossAgent在事实性方面比GPT-4o提高了29.7%，在高难度任务上表现出更优越的稳健性，有效激活了碎片化的“僵尸数据”，实现了深入、跨模态的分析。

更新时间: 2026-01-29 08:40:45

领域: cs.AI,cs.MA

下载: http://arxiv.org/abs/2601.21403v1

Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $σ^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $σ^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

Updated: 2026-01-29 08:36:43

标题: 具有学习核函数的谱算法的对准敏感极小化率

摘要: 我们研究了在从数据中学习内核的情况下的谱算法。我们引入了有效跨度维度（ESD），这是一个与信号、频谱和噪声水平$σ^2$共同依赖的对齐敏感的复杂度度量。ESD适用于任意内核和信号，而不需要特征值衰减条件或源条件。我们证明，对于ESD至多为$K$的序列模型，最小最大超额风险的规模为$σ^2 K$。此外，我们分析了过度参数化的梯度流，并证明它可以减少ESD。这一发现建立了自适应特征学习和谱算法泛化的可证明改进之间的联系。我们通过将ESD框架扩展到线性模型和RKHS回归来展示ESD框架的普适性，并通过数值实验来支持理论。这一框架为超越传统固定内核理论的泛化提供了新的视角。

更新时间: 2026-01-29 08:36:43

领域: cs.LG,math.ST

下载: http://arxiv.org/abs/2509.20294v3

End-to-end audio-visual learning for cochlear implant sound coding simulations in noisy environments

The cochlear implant (CI) is a successful biomedical device that enables individuals with severe-to-profound hearing loss to perceive sound through electrical stimulation, yet listening in noise remains challenging. Recent deep learning advances offer promising potential for CI sound coding by integrating visual cues. In this study, an audio-visual speech enhancement (AVSE) module is integrated with the ElectrodeNet-CS (ECS) model to form the end-to-end CI system, AVSE-ECS. Simulations show that the AVSE-ECS system with joint training achieves high objective speech intelligibility and improves the signal-to-error ratio (SER) by 7.4666 dB compared to the advanced combination encoder (ACE) strategy. These findings underscore the potential of AVSE-based CI sound coding.

Updated: 2026-01-29 08:33:07

标题: 在嘈杂环境中的人工耳蜗声音编码模拟的端到端音频-视觉学习

摘要: 耳蜗植入装置（CI）是一种成功的生物医学设备，可以让严重至极重的听力损失患者通过电刺激感知声音，但在噪音中听取仍然具有挑战性。最近的深度学习进展为CI声音编码提供了潜在的希望，通过整合视觉线索。本研究中，将一个音频-视觉语音增强（AVSE）模块与ElectrodeNet-CS（ECS）模型集成，形成端到端的CI系统AVSE-ECS。模拟显示，AVSE-ECS系统在联合训练下实现了高客观语音可懂度，并且相比于先进的组合编码器（ACE）策略，提高了信号与误差比（SER）7.4666 dB。这些发现突显了基于AVSE的CI声音编码的潜力。

更新时间: 2026-01-29 08:33:07

领域: eess.AS,cs.AI,cs.SD,eess.IV

下载: http://arxiv.org/abs/2508.13576v2

Scalable Sequential Recommendation under Latency and Memory Constraints

Sequential recommender systems must model long-range user behavior while operating under strict memory and latency constraints. Transformer-based approaches achieve strong accuracy but suffer from quadratic attention complexity, forcing aggressive truncation of user histories and limiting their practicality for long-horizon modeling. This paper presents HoloMambaRec, a lightweight sequential recommendation architecture that combines holographic reduced representations for attribute-aware embedding with a selective state space encoder for linear-time sequence processing. Item and attribute information are bound using circular convolution, preserving embedding dimensionality while encoding structured metadata. A shallow selective state space backbone, inspired by recent Mamba-style models, enables efficient training and constant-time recurrent inference. Experiments on Amazon Beauty and MovieLens-1M under a 10-epoch budget show that HoloMambaRec surpasses SASRec on both datasets, attains state-of-the-art ranking on MovieLens-1M, and trails only GRU4Rec on Amazon Beauty, all while maintaining substantially lower memory complexity. The design further incorporates forward-compatible mechanisms for temporal bundling and inference-time compression, positioning HoloMambaRec as a practical and extensible alternative for scalable, metadata-aware sequential recommendation.

Updated: 2026-01-29 08:29:40

标题: 在延迟和内存约束下可扩展的顺序推荐

摘要: 连续推荐系统必须在严格的内存和延迟约束条件下建模长期用户行为。基于Transformer的方法在准确性方面表现出色，但由于二次注意力复杂度而被迫对用户历史进行积极截断，限制了它们在长期建模方面的实用性。本文提出了一种轻量级的顺序推荐架构HoloMambaRec，它结合了面向属性的降维表示和选择性状态空间编码器，用于线性时间序列处理。商品和属性信息使用循环卷积进行绑定，保留了嵌入维度，同时编码了结构化元数据。受最近Mamba风格模型启发，浅层选择性状态空间骨干实现了高效训练和恒定时间的递归推断。在亚马逊美容和MovieLens-1M上进行的实验表明，HoloMambaRec在两个数据集上均优于SASRec，获得了MovieLens-1M上的最先进排名，并仅在亚马逊美容上落后于GRU4Rec，同时保持较低的内存复杂度。该设计还包括用于时序捆绑和推断时压缩的前向兼容机制，将HoloMambaRec定位为可扩展的、元数据感知的顺序推荐的实用和可扩展替代方案。

更新时间: 2026-01-29 08:29:40

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2601.08360v2

NOSA: Native and Offloadable Sparse Attention

Decoding throughput improvements from larger inference batches are limited by GPU memory, which is largely consumed by the key-value (KV) cache. Prior training-free KV cache offloading alleviates this by keeping redundant context on the CPU and fetching only a sparse subset for attention, but it often degrades long-generation quality due to training-inference mismatch on sparse patterns. Meanwhile, trainable sparse attention is incompatible with efficient offloading, as unconstrained KV accesses may force large CPU-to-GPU transfers and erase throughput gains. To this end, we propose NOSA, a trainable sparse attention mechanism natively designed for KV cache offloading. NOSA explicitly constrains the volume of CPU-GPU KV transfers, thereby achieving low communication overhead and high decoding throughput. We further build NOSI, a KV cache offloading inference system that fully unlocks NOSA's efficiency. Empirical results on 1,3,8B LLMs demonstrate that NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks, while boosting decoding throughput by up to 5.04x, 1.92x, and 1.83x over FullAttn, InfLLMv2, and ShadowKV, respectively. We release our code at https://github.com/thunlp/NOSA.

Updated: 2026-01-29 08:26:24

标题: NOSA：本地和可卸载的稀疏注意力

摘要: 通过更大的推理批次提高解码吞吐量受到GPU内存的限制，这主要是由键-值（KV）缓存消耗大量内存所致。先前的无训练KV缓存卸载通过将冗余上下文保留在CPU上，并仅获取注意力的稀疏子集来缓解这一问题，但它经常由于在稀疏模式上训练推理不匹配而降低长生成质量。同时，可训练的稀疏注意力与高效卸载不兼容，因为无约束的KV访问可能会强制进行大量的CPU到GPU传输并消除吞吐量增益。因此，我们提出了NOSA，这是一种专门为KV缓存卸载而设计的可训练稀疏注意力机制。NOSA明确约束了CPU-GPU KV传输的量，从而实现低通信开销和高解码吞吐量。我们进一步构建了NOSI，一个完全释放NOSA效率的KV缓存卸载推理系统。对1,3,8B LLMs的实证结果表明，NOSA在一般、长输入和长生成任务上优于KV缓存卸载基线，同时将解码吞吐量提高了最多5.04倍、1.92倍和1.83倍，分别超过FullAttn、InfLLMv2和ShadowKV。我们在https://github.com/thunlp/NOSA发布了我们的代码。

更新时间: 2026-01-29 08:26:24

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.13602v2

Intrinsic Reward Policy Optimization for Sparse-Reward Environments

Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic rewards can also provide principled guidance for exploration by, for example, combining them with extrinsic rewards to optimize a policy or using them to train subpolicies for hierarchical learning. However, the former approach suffers from unstable credit assignment, while the latter exhibits sample inefficiency and sub-optimality. We propose a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for an extrinsic reward without pretraining subpolicies. Our algorithm -- intrinsic reward policy optimization (IRPO) -- achieves this by using a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse-reward environments. We demonstrate that IRPO improves performance and sample efficiency relative to baselines in discrete and continuous environments, and formally analyze the optimization problem solved by IRPO. Our code is available at https://github.com/Mgineer117/IRPO.

Updated: 2026-01-29 08:25:14

标题: 稀疏奖励环境下的内在奖励政策优化

摘要: 探索在强化学习中至关重要，因为代理依赖试错来学习最优策略。然而，当奖励稀疏时，像注入噪声这样的朴素探索策略通常不足够。内在奖励也可以通过为探索提供原则指导，例如将它们与外在奖励结合以优化策略，或者使用它们来训练用于分层学习的子策略。然而，前一种方法存在不稳定的信用分配问题，而后一种方法则表现出样本效率和次优性。我们提出了一个策略优化框架，利用多个内在奖励直接优化一个策略以获得外在奖励，而无需预先训练子策略。我们的算法--内在奖励策略优化（IRPO）--通过使用一个替代策略梯度来实现这一点，在稀疏奖励环境中提供比真实梯度更丰富的学习信号。我们证明IRPO相对于基线在离散和连续环境中提高了性能和样本效率，并正式分析了IRPO解决的优化问题。我们的代码可在https://github.com/Mgineer117/IRPO找到。

更新时间: 2026-01-29 08:25:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21391v1

Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation

Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard, but costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We further incorporate human listening evaluations alongside TTS intelligibility and synthetic-trained ASR WER to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible. Code is available at https://github.com/kaen2891/FrechetSpeechDistance.

Updated: 2026-01-29 08:20:52

标题: 理解弗雷歇特语音距离用于合成语音质量评估

摘要: 合成语音质量的客观评估仍然是一个重要挑战。人类听力测试是金标准，但在规模上成本高昂且不切实际。弗雷歇距离已经成为一种有希望的替代方法，但其可靠性在很大程度上取决于嵌入和实验设置的选择。在这项工作中，我们全面评估了弗雷歇语音距离（FSD）及其变体语音最大均方差（SMMD），在不同的嵌入和条件下。我们进一步将人类听力评估与TTS可懂度和合成训练的ASR WER结合起来，以验证这些指标的感知相关性。我们的研究结果表明，WavLM Base+特征与人类评分之间的对齐最为稳定。虽然FSD和SMMD无法完全取代主观评估，但我们表明它们可以作为补充的、成本有效且可重复的测量方法，特别在大规模或直接听力评估不可行时尤其有用。代码可在https://github.com/kaen2891/FrechetSpeechDistance找到。

更新时间: 2026-01-29 08:20:52

领域: cs.SD,cs.AI

下载: http://arxiv.org/abs/2601.21386v1

Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

Neural Algorithmic Reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions without post-processing and to reason about multiple correct ones, poor performance on combinatorial NP-hard problems, and inapplicability to problems for which strong algorithms are not yet known. To address these limitations, we reframe the problem of learning algorithm trajectories as a Markov Decision Process, which imposes structure on the solution construction procedure and unlocks the powerful tools of imitation and reinforcement learning (RL). We propose the GNARL framework, encompassing the methodology to translate problem formulations from NAR to RL and a learning architecture suitable for a wide range of graph-based problems. We achieve very high graph accuracy results on several CLRS-30 problems, performance matching or exceeding much narrower NAR approaches for NP-hard problems and, remarkably, applicability even when lacking an expert algorithm.

Updated: 2026-01-29 08:20:43

标题: 解决棘手问题：通过强化学习重新构想的图神经算法推理

摘要: 神经算法推理（NAR）是一种通过监督学习训练神经网络执行经典算法的范式。尽管取得了成功，但仍存在重要限制：无法在后处理无法构建有效解决方案和推理多个正确解决方案，在组合NP难问题上表现不佳，并且无法应用于尚未知晓强算法的问题。为了解决这些限制，我们将学习算法轨迹的问题重新构建为马尔可夫决策过程，这为解决方案构建过程施加了结构，并解锁了模仿和强化学习（RL）的强大工具。我们提出了GNARL框架，包括将问题表述从NAR翻译为RL的方法论以及适用于各种基于图的问题的学习架构。我们在几个CLRS-30问题上实现了非常高的图准确性结果，性能匹配或超过了NP难问题的更窄的NAR方法，并且在缺乏专家算法时甚至具有显著的适用性。

更新时间: 2026-01-29 08:20:43

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18930v2

Sim-MSTNet: sim2real based Multi-task SpatioTemporal Network Traffic Forecasting

Network traffic forecasting plays a crucial role in intelligent network operations, but existing techniques often perform poorly when faced with limited data. Additionally, multi-task learning methods struggle with task imbalance and negative transfer, especially when modeling various service types. To overcome these challenges, we propose Sim-MSTNet, a multi-task spatiotemporal network traffic forecasting model based on the sim2real approach. Our method leverages a simulator to generate synthetic data, effectively addressing the issue of poor generalization caused by data scarcity. By employing a domain randomization technique, we reduce the distributional gap between synthetic and real data through bi-level optimization of both sample weighting and model training. Moreover, Sim-MSTNet incorporates attention-based mechanisms to selectively share knowledge between tasks and applies dynamic loss weighting to balance task objectives. Extensive experiments on two open-source datasets show that Sim-MSTNet consistently outperforms state-of-the-art baselines, achieving enhanced accuracy and generalization.

Updated: 2026-01-29 08:20:08

标题: Sim-MSTNet：基于sim2real的多任务时空网络流量预测

摘要: 网络流量预测在智能网络运营中扮演着至关重要的角色，但是现有的技术在面对有限数据时往往表现不佳。此外，多任务学习方法在建模各种服务类型时往往面临任务不平衡和负迁移的困境。为了克服这些挑战，我们提出了Sim-MSTNet，这是一种基于sim2real方法的多任务时空网络流量预测模型。我们的方法利用模拟器生成合成数据，有效解决了由于数据稀缺导致的泛化能力不佳的问题。通过采用领域随机化技术，我们通过双层优化样本加权和模型训练来减小合成数据与真实数据之间的分布差距。此外，Sim-MSTNet还结合了基于注意力机制的方法来在任务之间选择性共享知识，并应用动态损失加权来平衡任务目标。在两个开源数据集上进行的大量实验表明，Sim-MSTNet始终优于最先进的基线模型，实现了更好的准确性和泛化能力。

更新时间: 2026-01-29 08:20:08

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21384v1

RerouteGuard: Understanding and Mitigating Adversarial Risks for LLM Routing

Recent advancements in multi-model AI systems have leveraged LLM routers to reduce computational cost while maintaining response quality by assigning queries to the most appropriate model. However, as classifiers, LLM routers are vulnerable to novel adversarial attacks in the form of LLM rerouting, where adversaries prepend specially crafted triggers to user queries to manipulate routing decisions. Such attacks can lead to increased computational cost, degraded response quality, and even bypass safety guardrails, yet their security implications remain largely underexplored. In this work, we bridge this gap by systematizing LLM rerouting threats based on the adversary's objectives (i.e., cost escalation, quality hijacking, and safety bypass) and knowledge. Based on the threat taxonomy, we conduct a measurement study of real-world LLM routing systems against existing LLM rerouting attacks. The results reveal that existing routing systems are vulnerable to rerouting attacks, especially in the cost escalation scenario. We then characterize existing rerouting attacks using interpretability techniques, revealing that they exploit router decision boundaries through confounder gadgets that prepend queries to force misrouting. To mitigate these risks, we introduce RerouteGuard, a flexible and scalable guardrail framework for LLM rerouting. RerouteGuard filters adversarial rerouting prompts via dynamic embedding-based detection and adaptive thresholding. Extensive evaluations in three attack settings and four benchmarks demonstrate that RerouteGuard achieves over 99% detection accuracy against state-of-the-art rerouting attacks, while maintaining negligible impact on legitimate queries. The experimental results indicate that RerouteGuard offers a principled and practical solution for safeguarding multi-model AI systems against adversarial rerouting.

Updated: 2026-01-29 08:17:08

标题: RerouteGuard：了解和减轻LLM路由的敌对风险

摘要: 最近在多模型AI系统中的进展利用LLM路由器来降低计算成本，同时通过将查询分配给最合适的模型来保持响应质量。然而，作为分类器，LLM路由器容易受到新型对抗攻击的影响，形式为LLM重定向，即对手在用户查询中添加特别设计的触发器以操纵路由决策。这种攻击可能导致计算成本增加、响应质量下降，甚至绕过安全防护措施，然而它们的安全影响仍然大部分未被探索。在这项工作中，我们通过系统化对抗者的目标（即成本升级、质量劫持和安全绕过）和知识，来桥接这一鸿沟。基于威胁分类法，我们对现实世界中的LLM路由系统针对现有LLM重定向攻击进行了测量研究。结果显示，现有路由系统容易受到重定向攻击的影响，尤其是在成本升级的情况下。然后，我们使用可解释性技术对现有重定向攻击进行了表征，揭示它们通过混淆器小工具利用路由器的决策边界，在查询前添加以强制错误路由。为了减轻这些风险，我们引入了RerouteGuard，一个灵活且可扩展的LLM重定向防护框架。RerouteGuard通过动态嵌入式检测和自适应阈值过滤对抗性重定向提示。在三种攻击设置和四个基准测试中进行的广泛评估表明，RerouteGuard在对抗最先进的重定向攻击时实现了超过99%的检测准确率，同时对合法查询影响微乎其微。实验结果表明，RerouteGuard为保护多模型AI系统免受对抗性重定向攻击提供了一个有原则且实用的解决方案。

更新时间: 2026-01-29 08:17:08

领域: cs.CR

下载: http://arxiv.org/abs/2601.21380v1

PROTEUS: SLA-Aware Routing via Lagrangian RL for Multi-LLM Serving Systems

Production LLM deployments serve diverse workloads where cost and quality requirements vary by customer tier, time of day, and query criticality. Model serving systems accept latency SLOs directly. LLM routers do not. They force operators to tune parameters offline and guess what accuracy might result. The relationship between parameters and outcomes is indirect, non-monotonic, and dataset-dependent. Operators need to specify accuracy targets, not infer them from opaque settings. We present PROTEUS (Polymorphic Router for Operational Target Enforcement with Unified SLA), a router that accepts accuracy targets tau as runtime input. PROTEUS uses Lagrangian dual control. A learned dual variable lambda tracks constraint violations during training and conditions the policy network. This lets the router translate specified tau values into routing decisions that satisfy them. A single trained model serves the full accuracy spectrum without retraining.We evaluate on RouterBench (11 models, 405K queries) and SPROUT (14 models, 45K queries). PROTEUS achieves consistent floor compliance where accuracy meets or exceeds tau. The target-response correlation reaches 0.97 to 0.98. The closest baseline, OmniRouter, meets floors only 22% of the time despite also using Lagrangian optimization. PROTEUS operates across tau in [0.85, 0.95] from a single model. On RouterBench it achieves 90.1% accuracy, within 1.3% of oracle. On SPROUT it achieves 94.0% accuracy, within 4.6% of oracle. Cost savings reach 89.8% versus the best fixed model.

Updated: 2026-01-29 08:12:09

标题: PROTEUS：基于拉格朗日强化学习的多LLM服务系统的SLA感知路由

摘要: 生产LLM部署服务各种工作负载，客户层次、时间和查询紧急性的成本和质量要求各不相同。模型服务系统直接接受延迟SLO。LLM路由器不接受。他们强迫操作员在离线状态下调整参数，并猜测可能会产生什么准确性。参数和结果之间的关系是间接的、非单调的，并且依赖于数据集。操作员需要指定准确性目标，而不是从不透明的设置中推断出来。我们提出了PROTEUS（用于统一SLA的操作目标弹性路由器），这是一个可以接受准确性目标tau作为运行时输入的路由器。PROTEUS使用拉格朗日对偶控制。一个学习到的对偶变量lambda在训练过程中跟踪约束违规，并且调整策略网络。这使得路由器能够将指定的tau值转化为满足它们的路由决策。一个经过训练的模型能够为整个准确性范围提供服务，而无需重新训练。我们在RouterBench（11个模型，405K查询）和SPROUT（14个模型，45K查询）上进行评估。PROTEUS实现了准确性达到或超过tau的一致性底线符合。目标-响应相关性达到0.97至0.98。最接近的基线，OmniRouter，尽管也使用拉格朗日优化，但只有22%的时间符合底线。PROTEUS在[0.85, 0.95]的tau范围内从一个模型中操作。在RouterBench上，它达到90.1%的准确性，与oracle相差1.3%。在SPROUT上，它实现了94.0%的准确性，与oracle相差4.6%。成本节约达到了89.8%，超过了最佳固定模型。

更新时间: 2026-01-29 08:12:09

领域: cs.AI

下载: http://arxiv.org/abs/2601.19402v2

TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models

Large language models (LLMs) show promise as teaching assistants, yet their teaching capability remains insufficiently evaluated. Existing benchmarks mainly focus on problem-solving or problem-level guidance, leaving knowledge-centered teaching underexplored. We propose a syllabus-grounded evaluation framework that measures LLM teaching capability via student performance improvement after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and enables reuse of existing benchmarks. We instantiate the framework on Gaokao data across multiple subjects. Experiments reveal substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. We also find that incorporating example problems does not necessarily improve teaching, as models often shift toward example-specific error correction. Overall, our results highlight teaching ability as a distinct and measurable dimension of LLM behavior.

Updated: 2026-01-29 08:04:37

标题: TeachBench：一个基于教学大语言模型评估教学能力的课程表框架

摘要: 大型语言模型(LLMs)显示出作为教学助手的潜力，然而它们的教学能力仍未得到充分评估。现有的基准主要关注于解决问题或问题级别的指导，对知识为中心的教学尚未得到充分探索。我们提出了一个基于课程大纲的评估框架，通过多轮指导后学生表现的改进来衡量LLM的教学能力。通过将教师代理限制在结构化的知识点和示例问题上，该框架避免了信息泄露，并使现有基准的重复使用成为可能。我们在多个学科的高考数据上实例化了该框架。实验证明，在模型和领域之间的教学效果存在显著的差异：一些模型在数学方面表现良好，而在物理和化学方面的教学仍然具有挑战性。我们还发现，纳入示例问题不一定会改善教学，因为模型往往会转向特定示例的错误更正。总的来说，我们的结果突显了教学能力作为LLM行为的一个独立且可衡量的维度。

更新时间: 2026-01-29 08:04:37

领域: cs.AI

下载: http://arxiv.org/abs/2601.21375v1

GAVEL: Towards rule-based safety through activation monitoring

Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as ''making a threat'' and ''payment processing'', that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.

Updated: 2026-01-29 08:02:45

标题: GAVEL: 通过激活监测实现基于规则的安全性

摘要: 大型语言模型（LLMs）越来越多地与基于激活的监测配对，以检测和预防可能在表面文本级别上并不明显的有害行为。然而，现有的基于激活的安全方法，训练于广泛的误用数据集，存在精度低、灵活性有限和缺乏可解释性的问题。本文引入了一种新的范式：基于规则的激活安全，受到网络安全中规则共享实践的启发。我们提出将激活建模为认知元素（CEs），细粒度、可解释的因素，如“构成威胁”和“支付处理”，可以组合捕捉领域特定行为的微妙差异，具有更高的精度。基于这种表示，我们提出了一个实用框架，定义了对CEs的谓词规则，并实时检测违规行为。这使从业人员能够配置和更新安全措施，而无需重新训练模型或探测器，同时支持透明度和可审计性。我们的结果表明，基于规则的激活安全改善了精度，支持领域定制，并为可扩展、可解释和可审计的AI治理奠定了基础。我们将以开源框架发布GAVEL，并提供配套的自动规则创建工具。

更新时间: 2026-01-29 08:02:45

领域: cs.AI,cs.CR,cs.LG

下载: http://arxiv.org/abs/2601.19768v2

NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents

In this paper, we present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimization implementations, operating collaboratively with users or autonomously. Existing approaches typically rely on specialized large language models (LLMs) or bespoke, task-specific agents. Such methods are often brittle, complex and frequently generating syntactically invalid or non-executable code. NEMO instead centers on remote interaction with autonomous coding agents (ACAs), treated as a first-class abstraction analogous to API-based interaction with LLMs. This design enables the construction of higher-level systems around ACAs that structure, consolidate, and iteratively refine task specifications. Because ACAs execute within sandboxed environments, code produced by NEMO is executable by construction, allowing automated validation and repair. Building on this, we introduce novel coordination patterns with and across ACAs, including asymmetric validation loops between independently generated optimizer and simulator implementations (serving as a high-level validation mechanism), external memory for experience reuse, and robustness enhancements via minimum Bayes risk (MBR) decoding and self-consistency. We evaluate NEMO on nine established optimization benchmarks. As depicted in Figure 1, it achieves state-of-the-art performance on the majority of tasks, with substantial margins on several datasets, demonstrating the power of execution-aware agentic architectures for automated optimization modeling.

Updated: 2026-01-29 07:57:23

标题: NEMO：通过自主编码代理执行感知优化建模

摘要: 在这篇论文中，我们提出了NEMO，一个将决策问题的自然语言描述转化为形式化的可执行数学优化实现的系统，可以与用户协作或自主操作。现有方法通常依赖于专门的大型语言模型（LLMs）或定制的、特定任务的代理。这些方法往往脆弱、复杂，经常生成语法无效或非可执行代码。 NEMO相反，侧重于与自主编码代理（ACAs）进行远程交互，将其视为与LLMs基于API交互类似的一流抽象。这种设计使得可以围绕ACAs构建更高级的系统，用于结构化、整合和迭代地完善任务规范。由于ACAs在沙箱环境中执行，NEMO生成的代码是可执行的，允许自动验证和修复。基于此，我们引入了新颖的与ACAs之间和之间的协调模式，包括独立生成的优化器和模拟器实现之间的不对称验证循环（作为高级验证机制），用于经验重用的外部内存，以及通过最小贝叶斯风险（MBR）解码和自一致性的稳健性增强。我们在九个已建立的优化基准上评估了NEMO。如图1所示，在大多数任务上取得了最先进的性能，在几个数据集上有显著的差距，展示了面向执行的代理架构对于自动优化建模的强大能力。

更新时间: 2026-01-29 07:57:23

领域: cs.AI

下载: http://arxiv.org/abs/2601.21372v1

Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models

Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.

Updated: 2026-01-29 07:52:53

标题: 稀疏维度中存在的语言：大型语言模型的可解释和高效多语言控制

摘要: 大型语言模型表现出强大的多语言能力，尽管其对非英语数据的接触有限。先前的研究表明，以英语为中心的大型语言模型将多语言内容映射到中间层的英语对齐表示，然后将它们投射回最终层的目标语言令牌空间。基于这一观察，我们假设这种跨语言转换受到一小部分稀疏维度的控制，这些维度在中间到最终层之间的位置始终保持一致。基于这一见解，我们引入了一种简单的、无需训练的方法来识别和操作这些维度，只需要50句平行或单语数据。对一个多语言生成控制任务的实验揭示了这些维度的可解释性，表明对这些维度的干预可以切换输出语言同时保留语义内容，并且在成本大幅降低的情况下超越了先前基于神经元的方法的性能。

更新时间: 2026-01-29 07:52:53

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.07213v2

Repairing Reward Functions with Feedback to Mitigate Reward Hacking

Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the humans' true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct for those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove in tabular domains PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high performing policies.

Updated: 2026-01-29 07:52:11

标题: 使用反馈修复奖励函数以减轻奖励欺骗

摘要: 人类设计的奖励函数经常与人类真实但不可观测的目标不一致，因此只起到代理作用。优化错误规定的代理奖励函数通常会导致奖励欺骗，从而产生与人类真实目标不一致的策略。另一种方法是通过人类反馈进行RL，这涉及通过收集人类对轨迹对的偏好来从头学习奖励函数。然而，构建这样的数据集成本高昂。为了解决这两种方法的局限性，我们提出了基于偏好的奖励修复（PBRR）：这是一个自动迭代框架，通过从偏好中学习加法、转移依赖校正项，来修复人类指定的代理奖励函数。手动指定的奖励函数可能会导致在真实目标下高度次优的策略，但仅对少数转换进行校正可能就足以恢复最佳性能。为了识别和纠正这些转换，PBRR使用有针对性的探索策略和新的偏好学习目标。我们证明在表格域中，PBRR的累积遗憾与之前基于偏好的RL方法相匹配，直到常数。此外，在一系列奖励欺骗基准测试中，PBRR始终优于从头开始从偏好中学习奖励函数或使用其他方法修改代理奖励函数的基准，需要更少的偏好来学习高性能策略。

更新时间: 2026-01-29 07:52:11

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.13036v2

Hebbian Learning with Global Direction

Backpropagation algorithm has driven the remarkable success of deep neural networks, but its lack of biological plausibility and high computational costs have motivated the ongoing search for alternative training methods. Hebbian learning has attracted considerable interest as a biologically plausible alternative to backpropagation. Nevertheless, its exclusive reliance on local information, without consideration of global task objectives, fundamentally limits its scalability. Inspired by the biological synergy between neuromodulators and local plasticity, we introduce a novel model-agnostic Global-guided Hebbian Learning (GHL) framework, which seamlessly integrates local and global information to scale up across diverse networks and tasks. In specific, the local component employs Oja's rule with competitive learning to ensure stable and effective local updates. Meanwhile, the global component introduces a sign-based signal that guides the direction of local Hebbian plasticity updates. Extensive experiments demonstrate that our method consistently outperforms existing Hebbian approaches. Notably, on large-scale network and complex datasets like ImageNet, our framework achieves the competitive results and significantly narrows the gap with standard backpropagation.

Updated: 2026-01-29 07:49:21

标题: 具有全局方向的赫布学习

摘要: 反向传播算法推动了深度神经网络的显著成功，但其缺乏生物学合理性和高计算成本促使人们不断寻找替代训练方法。希布学习作为一种生物学合理的替代方案吸引了相当多的关注，但是其仅依赖于局部信息而不考虑全局任务目标，从根本上限制了其可扩展性。受到神经调节物质和局部可塑性之间生物学协同作用的启发，我们引入了一种新颖的模型无关的全局导向希布学习（GHL）框架，将局部和全局信息无缝集成，以在不同网络和任务之间进行扩展。具体来说，局部组件采用Oja规则和竞争学习以确保稳定和有效的局部更新。同时，全局组件引入了一个基于符号的信号，指导局部希布可塑性更新的方向。大量实验证明我们的方法始终优于现有的希布方法。值得注意的是，在像ImageNet这样的大规模网络和复杂数据集上，我们的框架取得了竞争性的结果，并显著缩小了与标准反向传播之间的差距。

更新时间: 2026-01-29 07:49:21

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21367v1

OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models

The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.

Updated: 2026-01-29 07:48:17

标题: OrthoInsight：基于多模态大模型的肋骨骨折诊断和报告生成

摘要: 医学影像数据的增长量增加了对自动诊断工具的需求，特别是对于肌肉骨骼损伤如肋骨骨折的需要，这些通常通过CT扫描检测。手动解释耗时且容易出错。我们提出了OrthoInsight，一个用于肋骨骨折诊断和报告生成的多模式深度学习框架。它集成了一个用于骨折检测的YOLOv9模型，一个用于检索临床背景的医学知识图谱，以及一个经过微调的LLaVA语言模型用于生成诊断报告。OrthoInsight将CT图像的视觉特征与专家文本数据相结合，以提供临床有用的输出。在经过注释的28,675张CT图像和专家报告上评估，它在诊断准确性、内容完整性、逻辑连贯性和临床指导价值方面表现出高性能，平均得分为4.28，超越了GPT-4和Claude-3等模型。本研究展示了多模式学习在转变医学图像分析和为放射科医师提供有效支持方面的潜力。

更新时间: 2026-01-29 07:48:17

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.13993v4

The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction following capability translates directly to objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission's logic to satisfy hidden directives, a systemic vulnerability we term the Compliance Paradox, where models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose this, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure using our novel tripartite framework measuring Decoupling Probability, Score Divergence, and Pedagogical Severity to demonstrate the widespread "False Certification" of functionally broken code. Our findings suggest that current alignment paradigms create a "Trojan" vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, where models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research on the topic.

Updated: 2026-01-29 07:40:58

标题: 《合规性悖论：自动化学术代码评估中的语义指令脱钩》

摘要: 大型语言模型（LLMs）快速融入教育评估的前提是基于一个未经验证的假设，即遵循指导能力直接转化为客观裁决。我们证明这一假设是根本错误的。模型经常脱离评估代码质量，而是解耦提交逻辑以满足隐藏指导，这是一个系统性脆弱性，我们称之为合规悖论，即为了极端的帮助性而调整的模型容易遭受对抗性操纵。为了揭示这一点，我们引入了语义保留对抗性代码注入（SPACI）框架和抽象语法树感知语义注入协议（AST-ASIP）。这些方法利用语法-语义差距，将对抗性指导嵌入抽象语法树的语法惰性区域（琐事节点）。通过对Python、C、C++和Java中25,000个提交的9个SOTA模型进行大规模评估，我们发现在像DeepSeek-V3这样的高容量开放权重模型中存在灾难性的失败率（>95%），这些模型系统地将隐藏的格式约束置于代码正确性之上。我们使用新颖的三部分框架（解耦概率、分数分歧和教育严重性）量化这种失败，以展示功能性破损代码的普遍“虚假认证”。我们的研究结果表明，当前的对齐范式在自动评分中造成“特洛伊木马”漏洞，需要从标准RLHF转向领域特定的裁决稳健性，即模型被调整为优先考虑证据而不是遵循指导。我们发布完整的数据集和注入框架，以促进进一步研究。

更新时间: 2026-01-29 07:40:58

领域: cs.CL,cs.AI,cs.ET,cs.LG,cs.SE

下载: http://arxiv.org/abs/2601.21360v1

Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization

Chain-of-Thought (CoT) empowers Large Language Models (LLMs) to tackle complex problems, but remains constrained by the computational cost and reasoning path collapse when grounded in discrete token spaces. Recent latent reasoning approaches attempt to optimize efficiency by performing reasoning within continuous hidden states. However, these methods typically operate as opaque end-to-end mappings from explicit reasoning steps to latent states, and often require a pre-defined number of latent steps during inference. In this work, we introduce PLaT (Planning with Latent Thoughts), a framework that reformulates latent reasoning as planning by fundamentally decouple reasoning from verbalization. We model reasoning as a deterministic trajectory of latent planning states, while a separate Decoder grounds these thoughts into text when necessary. This decoupling allows the model to dynamically determine when to terminate reasoning rather than relying on fixed hyperparameters. Empirical results on mathematical benchmarks reveal a distinct trade-off: while PLaT achieves lower greedy accuracy than baselines, it demonstrates superior scalability in terms of reasoning diversity. This indicates that PLaT learns a robust, broader solution space, offering a transparent and scalable foundation for inference-time search.

Updated: 2026-01-29 07:38:18

标题: 潜在思维链作为规划：解耦理性思维与言语表达

摘要: Chain-of-Thought (CoT)赋予大型语言模型（LLMs）解决复杂问题的能力，但在离散标记空间中存在计算成本和推理路径崩溃的限制。最近的潜在推理方法尝试通过在连续隐藏状态中执行推理来优化效率。然而，这些方法通常作为从显式推理步骤到潜在状态的不透明端到端映射运行，并且在推理过程中通常需要预定义数量的潜在步骤。在这项工作中，我们介绍了PLaT（Planning with Latent Thoughts），这是一个重塑潜在推理为计划的框架，通过从口头表达中彻底解耦推理。我们将推理建模为潜在规划状态的确定性轨迹，而一个独立的解码器在必要时将这些思维连接到文本。这种解耦允许模型动态确定何时终止推理，而不是依赖于固定的超参数。数学基准测试的经验结果显示一个明显的权衡：虽然PLaT的贪婪准确率低于基线，但在推理多样性方面表现出卓越的可扩展性。这表明PLaT学习了一个强大、更广泛的解决方案空间，为推理时搜索提供了透明且可扩展的基础。

更新时间: 2026-01-29 07:38:18

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21358v1

SecIC3: Customizing IC3 for Hardware Security Verification

Recent years have seen significant advances in using formal verification to check hardware security properties. Of particular practical interest are checking confidentiality and integrity of secrets, by checking that there is no information flow between the secrets and observable outputs. A standard method for checking information flow is to translate the corresponding non-interference hyperproperty into a safety property on a self-composition of the design, which has two copies of the design composed together. Although prior efforts have aimed to reduce the size of the self-composed design, there are no state-of-the-art model checkers that exploit their special structure for hardware security verification. In this paper, we propose SecIC3, a hardware model checking algorithm based on IC3 that is customized to exploit this self-composition structure. SecIC3 utilizes this structure in two complementary techniques: symmetric state exploration and adding equivalence predicates. We implement SecIC3 on top of two open-source IC3 implementations and evaluate it on a non-interference checking benchmark consisting of 10 designs. The experiment results show that SecIC3 significantly reduces the time for finding security proofs, with up to 49.3x proof speedup compared to baseline implementations.

Updated: 2026-01-29 07:24:08

标题: SecIC3：定制IC3以用于硬件安全验证

摘要: 近年来，在使用形式验证来检查硬件安全属性方面取得了重大进展。特别感兴趣的是检查机密性和秘密性的完整性，通过检查秘密信息和可观察输出之间没有信息流。检查信息流的标准方法是将相应的非干扰超属性转换为设计的自组合上的安全属性，其中设计的两个副本组合在一起。尽管先前的努力旨在减小自组合设计的大小，但没有最先进的模型检查器利用其用于硬件安全验证的特殊结构。在本文中，我们提出了SecIC3，一种基于IC3的硬件模型检查算法，专门定制以利用这种自组合结构。SecIC3利用这种结构中的两种互补技术：对称状态探索和添加等价谓词。我们在两个开源IC3实现的基础上实现了SecIC3，并在由10个设计组成的非干扰检查基准上对其进行评估。实验结果显示，与基准实现相比，SecIC3显著减少了找到安全证明所需的时间，最高可达到49.3倍的证明加速。

更新时间: 2026-01-29 07:24:08

领域: cs.CR

下载: http://arxiv.org/abs/2601.21353v1

BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents

GUI agents are designed to automate repetitive tasks and enhance productivity. However, existing GUI agents struggle to recover once they follow an incorrect exploration path, often leading to task failure. In this work, we model GUI task execution as a DFS process and propose BEAP-Agent, a DFS-based framework that supports long-range, multi-level state backtracking with dynamic task tracking and updating. The framework consists of three collaborative components: Planner, Executor, and Tracker. Together, they enable effective task exploration and execution. BEAP-Agent fills the gap in systematic backtracking mechanisms for GUI agents, offering a systematic solution for long-horizon task exploration. We conducted a systematic evaluation on the OSWorld benchmark, where BEAP-Agent achieved an accuracy of 28.2%, validating the effectiveness of the proposed method.

Updated: 2026-01-29 07:22:50

标题: BEAP-Agent：GUI代理的可回溯执行和自适应规划

摘要: GUI代理程序旨在自动化重复任务并提高生产力。然而，现有的GUI代理程序在遵循错误的探索路径后往往难以恢复，经常导致任务失败。在这项工作中，我们将GUI任务执行建模为DFS过程，并提出了BEAP-Agent，这是一个基于DFS的框架，支持动态任务跟踪和更新的长距离、多级状态回溯。该框架由三个协作组件组成：Planner、Executor和Tracker。它们共同促进了任务探索和执行。BEAP-Agent填补了GUI代理程序系统回溯机制的空白，为长期任务探索提供了系统性解决方案。我们在OSWorld基准测试上进行了系统评估，其中BEAP-Agent实现了28.2%的准确率，验证了所提出方法的有效性。

更新时间: 2026-01-29 07:22:50

领域: cs.AI

下载: http://arxiv.org/abs/2601.21352v1

Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving

Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an $r$A-$1$F topology, where the key difficulty is that Attention-side work is nonstationary-token context grows and requests are continuously replenished with random lengths-while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio matches the simulation-optimal within 10%, and consistently reduces idle time.

Updated: 2026-01-29 07:22:27

标题: 理论上最佳的分解式LLM服务中的注意/FFN比率

摘要: 注意力-FFN分解（AFD）是一种新兴的LLM解码架构，它将状态密集、KV缓存主导的注意力计算与无状态、计算密集的FFN计算分离，通过每步通信连接。虽然AFD使内存和计算资源能够独立扩展，但其性能对注意力/FFN分配比例非常敏感：错误的尺寸会导致步级阻塞和昂贵的设备空闲时间。我们开发了一个可行的分析框架，用于在$r$A-$1$F拓扑中调整AFD束的大小，其中关键困难在于注意力侧工作是非稳态的-令牌上下文增长并且请求不断以随机长度补充-而FFN工作在给定的聚合批次下是稳定的。使用概率工作负载模型，我们推导出了最大化系统中每个实例的平均吞吐量的最佳A/F比率的闭合形式规则。一个经过跟踪校准的AFD模拟器验证了这一理论：在各种工作负载中，理论上的最佳A/F比率与模拟最佳值相匹配，误差在10%以内，并且始终减少空闲时间。

更新时间: 2026-01-29 07:22:27

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21351v1

L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank \& Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.

Updated: 2026-01-29 07:18:33

标题: L2R：低秩和利普希茨控制的混合专家路由

摘要: 混合专家（MoE）模型通过有条件地激活小部分专家来扩展神经网络，其中路由器在确定专家专业化和整体模型性能方面发挥核心作用。然而，许多现代MoE系统仍然在原始高维表示空间中采用线性路由器，其中表示不匹配、角度集中和对比敏感的评分可以共同削弱路由可区分性和稳定的专家专业化。在这项工作中，我们提出了低秩和Lipschitz受控路由（L2R），这是一个统一的路由框架，重新塑造了路由空间和评分几何。L2R在共享的低秩潜在路由空间中执行专家分配，并引入饱和内积评分（SIPS）来明确控制路由函数的Lipschitz行为，产生更平滑和更稳定的路由几何。此外，L2R还结合了一个参数高效的多锚路由机制，以增强专家的表达能力。在大规模语言MoE模型和ImageNet上的视觉MoE设置上进行的大量实验表明，L2R始终改善了路由稳定性、专家专业化和整体模型性能。

更新时间: 2026-01-29 07:18:33

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21349v1

Memorization Control in Diffusion Models from Denoising-centric Perspective

Controlling memorization in diffusion models is critical for applications that require generated data to closely match the training distribution. Existing approaches mainly focus on data centric or model centric modifications, treating the diffusion model as an isolated predictor. In this paper, we study memorization in diffusion models from a denoising centric perspective. We show that uniform timestep sampling leads to unequal learning contributions across denoising steps due to differences in signal to noise ratio, which biases training toward memorization. To address this, we propose a timestep sampling strategy that explicitly controls where learning occurs along the denoising trajectory. By adjusting the width of the confidence interval, our method provides direct control over the memorization generalization trade off. Experiments on image and 1D signal generation tasks demonstrate that shifting learning emphasis toward later denoising steps consistently reduces memorization and improves distributional alignment with training data, validating the generality and effectiveness of our approach.

Updated: 2026-01-29 07:16:54

标题: 从去噪中心的角度看扩散模型中的记忆控制

摘要: 在扩散模型中控制记忆对于需要生成数据与训练分布密切匹配的应用至关重要。现有方法主要集中在以数据为中心或以模型为中心的修改上，将扩散模型视为一个孤立的预测器。本文从去噪的角度研究了扩散模型中的记忆问题。我们发现均匀时间步长采样导致由于信噪比的差异而在去噪步骤中产生不均匀的学习贡献，这会使训练偏向于记忆。为了解决这个问题，我们提出了一个时间步长采样策略，明确控制学习沿着去噪轨迹发生的位置。通过调整置信区间的宽度，我们的方法提供了对记忆泛化权衡的直接控制。在图像和1D信号生成任务上的实验表明，将学习重点转向后续去噪步骤会一致地减少记忆并改善与训练数据的分布对齐，验证了我们方法的普适性和有效性。

更新时间: 2026-01-29 07:16:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21348v1

Dynamic Framework for Collaborative Learning: Leveraging Advanced LLM with Adaptive Feedback Mechanisms

This paper presents a framework for integrating LLM into collaborative learning platforms to enhance student engagement, critical thinking, and inclusivity. The framework employs advanced LLMs as dynamic moderators to facilitate real-time discussions and adapt to learners' evolving needs, ensuring diverse and inclusive educational experiences. Key innovations include robust feedback mechanisms that refine AI moderation, promote reflective learning, and balance participation among users. The system's modular architecture featuring ReactJS for the frontend, Flask for backend operations, and efficient question retrieval supports personalized and engaging interactions through dynamic adjustments to prompts and discussion flows. Testing demonstrates that the framework significantly improves student collaboration, fosters deeper comprehension, and scales effectively across various subjects and user groups. By addressing limitations in static moderation and personalization in existing systems, this work establishes a strong foundation for next-generation AI-driven educational tools, advancing equitable and impactful learning outcomes.

Updated: 2026-01-29 07:14:43

标题: 动态合作学习框架：利用高级LLM和自适应反馈机制

摘要: 本文提出了一个框架，用于将LLM集成到协作学习平台中，以增强学生的参与度、批判性思维和包容性。该框架利用先进的LLM作为动态调解员，促进实时讨论并适应学习者不断发展的需求，确保多样化和包容性的教育体验。关键创新包括强大的反馈机制，用于完善AI调解、促进反思学习，并在用户之间平衡参与。系统的模块化架构采用ReactJS用于前端、Flask用于后端操作，并通过高效的问题检索支持通过动态调整提示和讨论流程实现个性化和引人入胜的互动。测试表明，该框架显著改善了学生的协作，促进了更深入的理解，并在各种学科和用户群体之间有效扩展。通过解决现有系统中静态调解和个性化的局限性，这项工作为下一代基于AI的教育工具奠定了坚实基础，推进了公平和有影响力的学习成果。

更新时间: 2026-01-29 07:14:43

领域: cs.AI,cs.HC,cs.SE

下载: http://arxiv.org/abs/2601.21344v1

Actionable Interpretability Must Be Defined in Terms of Symmetries

This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed as existing definitions of interpretability fail to describe how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of *symmetries* that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework to verify compliance with safety standards and regulations.

Updated: 2026-01-29 07:12:49

标题: 行动性可解释性必须以对称性为基础来定义。

摘要: 这篇论文认为，人工智能（AI）中的可解释性研究基本上是不恰当的，因为现有的可解释性定义未能描述如何对可解释性进行形式化测试或设计。我们认为，可解释性的可操作性定义必须以*对称性*的形式制定，这些对称性可以指导模型设计并导致可测试的条件。在概率视图下，我们假设四个对称性（推理等变性、信息不变性、概念闭包不变性和结构不变性）足以（i）将可解释性模型形式化为概率模型的一个子类，（ii）产生可解释推理的统一表述（例如，对齐、干预和反事实推理）作为贝叶斯反演的一种形式，（iii）提供一个正式框架来验证符合安全标准和法规的合规性。

更新时间: 2026-01-29 07:12:49

领域: cs.AI,cs.LG,cs.NE

下载: http://arxiv.org/abs/2601.12913v3

Self-Improving Pretraining: using post-trained models to pretrain better models

Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model's core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations -- including model rollouts, the original suffix, and a rewritten suffix -- for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.

Updated: 2026-01-29 07:09:30

标题: 自我改进的预训练：使用后训练模型来预训练更好的模型

摘要: 确保大型语言模型在生成过程中的安全性、真实性和整体质量是一个关键挑战，尤其是这些模型越来越多地应用于现实世界中。目前解决这些问题的主要方法涉及收集昂贵、精心策划的数据集，并应用多个阶段的微调和对齐。然而，即使是这种复杂的流程也不能保证在预训练期间学到的模式的纠正。因此，在预训练期间解决这些问题至关重要，因为它塑造了模型的核心行为，防止不安全或虚构的输出深度嵌入其中。为了解决这个问题，我们引入了一种新的预训练方法，该方法通过流式文档和使用强化学习（RL）来改进每一步生成的下一个K个标记。一个经过强化训练的强大模型评估候选生成 -- 包括模型展开、原始后缀和重写后缀 -- 的质量、安全性和真实性。在训练初期，该过程依赖于原始和重写后缀；随着模型的改进，RL奖励高质量的展开。这种方法从根本上建立了更高质量、更安全和更真实的模型。在实验中，我们的方法在真实性和安全性方面相对于标准预训练分别提供了36.2%和18.5%的改进，并在整体生成质量方面提供了高达86.3%的胜率改进。

更新时间: 2026-01-29 07:09:30

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21343v1

Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.

Updated: 2026-01-29 07:07:23

标题: Ostrakon-VL：面向食品服务和零售店领域专家的MLLM

摘要: 多模态大型语言模型（MLLMs）最近在通用感知和推理方面取得了实质性进展。然而，它们在餐饮服务和零售店（FSRS）场景中的部署遇到了两个主要障碍：（i）来自异构采集设备的现实世界FSRS数据非常嘈杂，缺乏可审计的闭环数据整理，这妨碍了高质量、可控和可复制的训练语料库的构建；（ii）现有的评估协议没有提供统一、细粒度和标准化的基准，涵盖单图像、多图像和视频输入，使客观评估模型的鲁棒性具有挑战性。为了解决这些挑战，我们首先开发了Ostrakon-VL，这是一个基于Qwen3-VL-8B的面向FSRS的MLLM。其次，我们介绍了ShopBench，这是FSRS的第一个公共基准。第三，我们提出了QUAD（质量感知无偏自动数据整理），一个多阶段多模态指导数据整理管道。利用多阶段训练策略，Ostrakon-VL在ShopBench上取得了平均得分60.1，确立了在参数规模和多样化架构相当的开源MLLM中的最新技术水平。值得注意的是，它比规模更大的Qwen3-VL-235B-A22B（59.4）高出+0.7，并超过相同规模的Qwen3-VL-8B（55.3）高出+4.8，表明参数效率显著提高。这些结果表明Ostrakon-VL提供了更加鲁棒和可靠的FSRS中心感知和决策能力。为了促进可复制的研究，我们将公开发布Ostrakon-VL和ShopBench基准。

更新时间: 2026-01-29 07:07:23

领域: cs.AI

下载: http://arxiv.org/abs/2601.21342v1

EHR-RAG: Bridging Long-Horizon Structured Electronic Health Records and Large Language Models via Enhanced Retrieval-Augmented Generation

Electronic Health Records (EHRs) provide rich longitudinal clinical evidence that is central to medical decision-making, motivating the use of retrieval-augmented generation (RAG) to ground large language model (LLM) predictions. However, long-horizon EHRs often exceed LLM context limits, and existing approaches commonly rely on truncation or vanilla retrieval strategies that discard clinically relevant events and temporal dependencies. To address these challenges, we propose EHR-RAG, a retrieval-augmented framework designed for accurate interpretation of long-horizon structured EHR data. EHR-RAG introduces three components tailored to longitudinal clinical prediction tasks: Event- and Time-Aware Hybrid EHR Retrieval to preserve clinical structure and temporal dynamics, Adaptive Iterative Retrieval to progressively refine queries in order to expand broad evidence coverage, and Dual-Path Evidence Retrieval and Reasoning to jointly retrieves and reasons over both factual and counterfactual evidence. Experiments across four long-horizon EHR prediction tasks show that EHR-RAG consistently outperforms the strongest LLM-based baselines, achieving an average Macro-F1 improvement of 10.76%. Overall, our work highlights the potential of retrieval-augmented LLMs to advance clinical prediction on structured EHR data in practice.

Updated: 2026-01-29 07:06:34

标题: EHR-RAG：通过增强的检索增强生成技术，桥接长期结构化电子健康记录和大型语言模型

摘要: 电子健康记录（EHRs）提供丰富的纵向临床证据，对医疗决策至关重要，这促使使用检索增强生成（RAG）来支持大型语言模型（LLM）的预测。然而，长期EHRs通常超出LLM上下文限制，现有方法通常依赖于截断或普通检索策略，这些策略会丢弃临床相关事件和时间依赖性。为了解决这些挑战，我们提出了EHR-RAG，这是一个专为准确解释长期结构化EHR数据而设计的检索增强框架。EHR-RAG引入了三个组件，针对纵向临床预测任务进行了定制：事件和时间感知混合EHR检索，以保留临床结构和时间动态，自适应迭代检索，逐渐改进查询以扩展广泛的证据覆盖范围，双路证据检索和推理，共同检索和推理事实和反事实证据。在四个长期EHR预测任务上的实验表明，EHR-RAG始终优于最强的基于LLM的基线，平均Macro-F1改进率为10.76％。总体而言，我们的工作强调了检索增强LLMs在实践中推进结构化EHR数据临床预测的潜力。

更新时间: 2026-01-29 07:06:34

领域: cs.AI

下载: http://arxiv.org/abs/2601.21340v1

Within-Model vs Between-Prompt Variability in Large Language Models for Creative Tasks

How much of LLM output variance is explained by prompts versus model choice versus stochasticity through sampling? We answer this by evaluating 12 LLMs on 10 creativity prompts with 100 samples each (N = 12,000). For output quality (originality), prompts explain 36.43% of variance, comparable to model choice (40.94%). But for output quantity (fluency), model choice (51.25%) and within-LLM variance (33.70%) dominate, with prompts explaining only 4.22%. Prompts are powerful levers for steering output quality, but given the substantial within-LLM variance (10-34%), single-sample evaluations risk conflating sampling noise with genuine prompt or model effects.

Updated: 2026-01-29 07:04:46

标题: 大型语言模型在创造性任务中的模型内部与提示间的变异性

摘要: 我们评估了12个LLM模型在10个创造力提示上的输出变化，每个提示有100个样本（N = 12,000）。对于输出质量（独创性），提示解释了36.43%的方差，与模型选择（40.94%）相当。但对于输出数量（流畅性），模型选择（51.25%）和LLM内方差（33.70%）占主导地位，提示仅解释了4.22%。提示是引导输出质量的有力杠杆，但考虑到相当大的LLM内部方差（10-34%），单样本评估可能会将抽样噪声与真正的提示或模型效应混为一谈。

更新时间: 2026-01-29 07:04:46

领域: cs.AI

下载: http://arxiv.org/abs/2601.21339v1

Pushing Toward the Simplex Vertices: A Simple Remedy for Code Collapse in Smoothed Vector Quantization

Vector quantization, which discretizes a continuous vector space into a finite set of representative vectors (a codebook), has been widely adopted in modern machine learning. Despite its effectiveness, vector quantization poses a fundamental challenge: the non-differentiable quantization step blocks gradient backpropagation. Smoothed vector quantization addresses this issue by relaxing the hard assignment of a codebook vector into a weighted combination of codebook entries, represented as the matrix product of a simplex vector and the codebook. Effective smoothing requires two properties: (1) smoothed quantizers should remain close to a onehot vector, ensuring tight approximation, and (2) all codebook entries should be utilized, preventing code collapse. Existing methods typically address these desiderata separately. By contrast, the present study introduces a simple and intuitive regularization that promotes both simultaneously by minimizing the distance between each simplex vertex and its $K$-nearest smoothed quantizers. Experiments on representative benchmarks, including discrete image autoencoding and contrastive speech representation learning, demonstrate that the proposed method achieves more reliable codebook utilization and improves performance compared to prior approaches.

Updated: 2026-01-29 07:02:11

标题: 向单纯形顶点推进：平滑向量量化中代码崩溃的简单疗法

摘要: 矢量量化将连续向量空间离散化为一组代表性向量（码书），在现代机器学习中被广泛采用。尽管其有效性，矢量量化存在一个基本挑战：不可微的量化步骤阻碍了梯度反向传播。平滑矢量量化通过将码书向量的硬分配放松为码书条目的加权组合来解决这个问题，表示为简单向量和码书的矩阵乘积。有效的平滑需要两个属性：（1）平滑的量化器应保持接近于一个onehot向量，确保紧密逼近，（2）所有码书条目应被利用，防止码书崩溃。现有方法通常分别解决这些要求。相比之下，本研究引入了一种简单直观的正则化方法，通过最小化每个简单形状顶点和其$K$个最近平滑量化器之间的距离，同时促进这两个属性。在代表性基准测试中的实验，包括离散图像自动编码和对比语音表示学习，表明所提出的方法相比先前方法实现了更可靠的码书利用，并提高了性能。

更新时间: 2026-01-29 07:02:11

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.22161v3

Low-redundancy Distillation for Continual Learning

Continual learning (CL) aims to learn new tasks without erasing previous knowledge. However, current CL methods primarily emphasize improving accuracy while often neglecting training efficiency, which consequently restricts their practical application. Drawing inspiration from the brain's contextual gating mechanism, which selectively filters neural information and continuously updates past memories, we propose Low-redundancy Distillation (LoRD), a novel CL method that enhances model performance while maintaining training efficiency. This is achieved by eliminating redundancy in three aspects of CL: student model redundancy, teacher model redundancy, and rehearsal sample redundancy. By compressing the learnable parameters of the student model and pruning the teacher model, LoRD facilitates the retention and optimization of prior knowledge, effectively decoupling task-specific knowledge without manually assigning isolated parameters for each task. Furthermore, we optimize the selection of rehearsal samples and refine rehearsal frequency to improve training efficiency. Through a meticulous design of distillation and rehearsal strategies, LoRD effectively balances training efficiency and model precision. Extensive experimentation across various benchmark datasets and environments demonstrates LoRD's superiority, achieving the highest accuracy with the lowest training FLOPs.

Updated: 2026-01-29 06:56:54

标题: 持续学习的低冗余蒸馏

摘要: 持续学习（CL）旨在学习新任务而不会抹去先前的知识。然而，当前的CL方法主要强调提高准确性，同时经常忽略训练效率，从而限制了它们的实际应用。受大脑的上下文门控机制启发，该机制选择性地过滤神经信息并持续更新过去的记忆，我们提出了低冗余蒸馏（LoRD），这是一种增强模型性能同时保持训练效率的新型CL方法。通过在三个方面消除CL中的冗余，即学生模型冗余、教师模型冗余和重演样本冗余，LoRD压缩了学生模型的可学习参数并修剪了教师模型，有助于保留和优化先前的知识，有效地解耦任务特定知识，而无需手动为每个任务单独分配参数。此外，我们优化了重演样本的选择并优化了重演频率以提高训练效率。通过精心设计蒸馏和重演策略，LoRD有效地平衡了训练效率和模型精度。在各种基准数据集和环境上进行的广泛实验表明LoRD具有优势，实现了最高的准确性和最低的训练FLOPs。

更新时间: 2026-01-29 06:56:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2309.16117v2

Modeling Endogenous Logic: Causal Neuro-Symbolic Reasoning Model for Explainable Multi-Behavior Recommendation

Existing multi-behavior recommendations tend to prioritize performance at the expense of explainability, while current explainable methods suffer from limited generalizability due to their reliance on external information. Neuro-Symbolic integration offers a promising avenue for explainability by combining neural networks with symbolic logic rule reasoning. Concurrently, we posit that user behavior chains inherently embody an endogenous logic suitable for explicit reasoning. However, these observational multiple behaviors are plagued by confounders, causing models to learn spurious correlations. By incorporating causal inference into this Neuro-Symbolic framework, we propose a novel Causal Neuro-Symbolic Reasoning model for Explainable Multi-Behavior Recommendation (CNRE). CNRE operationalizes the endogenous logic by simulating a human-like decision-making process. Specifically, CNRE first employs hierarchical preference propagation to capture heterogeneous cross-behavior dependencies. Subsequently, it models the endogenous logic rule implicit in the user's behavior chain based on preference strength, and adaptively dispatches to the corresponding neural-logic reasoning path (e.g., conjunction, disjunction). This process generates an explainable causal mediator that approximates an ideal state isolated from confounding effects. Extensive experiments on three large-scale datasets demonstrate CNRE's significant superiority over state-of-the-art baselines, offering multi-level explainability from model design and decision process to recommendation results.

Updated: 2026-01-29 06:51:54

标题: 建模内生逻辑：可解释的多行为推荐的因果神经符号推理模型

摘要: 现有的多行为推荐往往以性能为重点，而牺牲解释性，而当前的可解释方法由于依赖外部信息而受到有限的泛化能力的影响。神经符号集成通过将神经网络与符号逻辑规则推理相结合，为解释性提供了一个有前途的途径。同时，我们认为用户行为链本质上包含适用于显式推理的内生逻辑。然而，这些观察到的多种行为受到混淆者的困扰，导致模型学习到虚假相关性。通过将因果推断纳入这个神经符号框架，我们提出了一种新颖的因果神经符号推理模型，用于可解释的多行为推荐（CNRE）。CNRE通过模拟类似于人类决策过程的内生逻辑来实现。具体而言，CNRE首先采用分层偏好传播来捕捉异质的跨行为依赖关系。随后，它基于偏好强度对用户行为链中隐含的内生逻辑规则进行建模，并自适应地分派到相应的神经逻辑推理路径（例如，合取，析取）。这个过程生成一个可解释的因果中介者，近似于一个与混淆效应隔离的理想状态。对三个大规模数据集的广泛实验证明了CNRE相对于最先进基线的显著优势，从模型设计和决策过程到推荐结果提供多级解释能力。

更新时间: 2026-01-29 06:51:54

领域: cs.AI

下载: http://arxiv.org/abs/2601.21335v1

SDFLoRA: Selective Decoupled Federated LoRA for Privacy-preserving Fine-tuning with Heterogeneous Clients

Federated learning (FL) for large language models (LLMs) has attracted increasing attention as a privacy-preserving approach for adapting models over distributed data, where parameter-efficient methods such as Low-Rank Adaptation (LoRA) are widely adopted to reduce communication and memory costs. However, practical deployments often exhibit rank and data heterogeneity: clients operate under different low-rank budgets and data distributions, making direct aggregation of LoRA updates biased and unstable. Existing approaches either enforce a unified rank or align heterogeneous updates into a single shared subspace, which tends to mix transferable and client-specific directions and consequently undermines personalization. Moreover, under differential privacy (DP), perturbing such structurally mixed updates injects noise into directions that should remain purely local, leading to unnecessary utility degradation. To address these issues, we propose Selective Decoupled Federated LoRA (SDFLoRA), a structure-aware LoRA framework that decouples each client update into a shared component for aggregation and a private component that preserves client-specific semantics. Only the shared component participates in subspace alignment, while the private component remains local and uncommunicated, making the training DP-compatible and stabilizing aggregation under rank heterogeneity. By injecting noise only into the aggregated shareable update, this approach avoids perturbations to local directions and improves the utility-privacy trade-off. Experiments on multiple benchmarks demonstrate that SDFLoRA outperforms federated LoRA baselines and achieves a strong utility-privacy trade-off.

Updated: 2026-01-29 06:50:08

标题: SDFLoRA：用于隐私保护细化调整的选择性解耦联合LoRA（低功耗广域网）与异构客户端

摘要: 联邦学习（FL）用于大型语言模型（LLMs）已经引起了越来越多的关注，作为一种在分布式数据上调整模型的隐私保护方法，在这种情况下，广泛采用参数高效的方法，如低秩调整（LoRA）来减少通信和内存成本。然而，实际部署通常表现出秩和数据的异质性：客户端在不同的低秩预算和数据分布下运作，直接聚合LoRA更新会产生偏差和不稳定性。现有的方法要么强制统一秩，要么将异构更新对齐到单个共享子空间，这往往会混合可转移的和客户端特定的方向，从而破坏个性化。此外，在差分隐私（DP）下，扰乱这种结构混合的更新会向应保持纯粹本地的方向注入噪声，导致不必要的效用下降。为了解决这些问题，我们提出了选择性解耦的联邦LoRA（SDFLoRA），这是一个结构感知的LoRA框架，将每个客户端更新分解为用于聚合的共享组件和保留客户端特定语义的私有组件。只有共享组件参与子空间对齐，而私有组件保持本地和未通信，使训练达到DP兼容并在秩异质性下稳定聚合。通过仅向聚合的可共享更新注入噪声，这种方法避免了对本地方向的扰动，并改善了效用-隐私权衡。对多个基准测试的实验证明，SDFLoRA优于联邦LoRA基线，并实现了强大的效用-隐私权衡。

更新时间: 2026-01-29 06:50:08

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.11219v2

Adversarial Vulnerability Transcends Computational Paradigms: Feature Engineering Provides No Defense Against Neural Adversarial Transfer

Deep neural networks are vulnerable to adversarial examples--inputs with imperceptible perturbations causing misclassification. While adversarial transfer within neural networks is well-documented, whether classical ML pipelines using handcrafted features inherit this vulnerability when attacked via neural surrogates remains unexplored. Feature engineering creates information bottlenecks through gradient quantization and spatial binning, potentially filtering high-frequency adversarial signals. We evaluate this hypothesis through the first comprehensive study of adversarial transfer from DNNs to HOG-based classifiers. Using VGG16 as a surrogate, we generate FGSM and PGD adversarial examples and test transfer to four classical classifiers (KNN, Decision Tree, Linear SVM, Kernel SVM) and a shallow neural network across eight HOG configurations on CIFAR-10. Our results strongly refute the protective hypothesis: all classifiers suffer 16.6%-59.1% relative accuracy drops, comparable to neural-to-neural transfer. More surprisingly, we discover attack hierarchy reversal--contrary to patterns where iterative PGD dominates FGSM within neural networks, FGSM causes greater degradation than PGD in 100% of classical ML cases, suggesting iterative attacks overfit to surrogate-specific features that don't survive feature extraction. Block normalization provides partial but insufficient mitigation. These findings demonstrate that adversarial vulnerability is not an artifact of end-to-end differentiability but a fundamental property of image classification systems, with implications for security-critical deployments across computational paradigms.

Updated: 2026-01-29 06:35:46

标题: 对抗性漏洞超越了计算范式：特征工程对神经对抗性传递无防御效果

摘要: 深度神经网络容易受到对抗性样本的影响，即输入中具有微不可见的扰动导致错误分类。虽然神经网络内部的对抗性传递已有充分的文献支持，但尚未探讨使用手工特征的经典机器学习流程在通过神经网络代理进行攻击时是否也具有这种脆弱性。特征工程通过梯度量化和空间分箱创建信息瓶颈，可能会过滤高频对抗信号。我们通过首次对基于HOG的分类器从DNN到对抗性传递的全面研究来评估这一假设。使用VGG16作为代理，我们生成FGSM和PGD对抗性样本，并将其转移到四种经典分类器（KNN、决策树、线性SVM、核SVM）和一个浅层神经网络，在CIFAR-10数据集上测试八种HOG配置。我们的结果强烈否定了保护性假设：所有分类器的相对准确率下降了16.6%-59.1%，与神经网络之间的转移相当。更令人惊讶的是，我们发现攻击层次的颠倒——与神经网络内部PGD主导FGSM的模式相反，在100%的经典机器学习情况下，FGSM导致的降级大于PGD，表明迭代攻击过度拟合于特定代理的特征，这些特征在特征提取中无法保留。块归一化提供了部分但不足够的缓解。这些发现表明对抗性脆弱性不是端到端可区分性的产物，而是图像分类系统的基本属性，对跨计算范式的安全关键部署具有重要意义。

更新时间: 2026-01-29 06:35:46

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2601.21323v1

SDSC:A Structure-Aware Metric for Semantic Signal Representation Learning

We propose the Signal Dice Similarity Coefficient (SDSC), a structure-aware metric function for time series self-supervised representation learning. Most Self-Supervised Learning (SSL) methods for signals commonly adopt distance-based objectives such as mean squared error (MSE), which are sensitive to amplitude, invariant to waveform polarity, and unbounded in scale. These properties hinder semantic alignment and reduce interpretability. SDSC addresses this by quantifying structural agreement between temporal signals based on the intersection of signed amplitudes, derived from the Dice Similarity Coefficient (DSC).Although SDSC is defined as a structure-aware metric, it can be used as a loss by subtracting from 1 and applying a differentiable approximation of the Heaviside function for gradient-based optimization. A hybrid loss formulation is also proposed to combine SDSC with MSE, improving stability and preserving amplitude where necessary. Experiments on forecasting and classification benchmarks demonstrate that SDSC-based pre-training achieves comparable or improved performance over MSE, particularly in in-domain and low-resource scenarios. The results suggest that structural fidelity in signal representations enhances the semantic representation quality, supporting the consideration of structure-aware metrics as viable alternatives to conventional distance-based methods.

Updated: 2026-01-29 06:31:50

标题: SDSC: 用于语义信号表示学习的结构感知度量

摘要: 我们提出了信号骰子相似性系数（SDSC），这是一种针对时间序列自监督表示学习的结构感知度量函数。大多数信号的自监督学习（SSL）方法通常采用基于距离的目标，如均方误差（MSE），这些目标对振幅敏感，对波形极性不变，并且在尺度上没有限制。这些特性阻碍了语义对齐并降低了可解释性。SDSC通过量化基于带符号振幅交集的时间信号之间的结构一致性来解决这个问题，这些振幅是从骰子相似性系数（DSC）导出的。尽管SDSC被定义为一种结构感知度量，但它可以作为损失函数使用，通过从1中减去并应用可微的海维塞德函数的近似来进行基于梯度的优化。还提出了一种混合损失公式，将SDSC与MSE结合起来，提高稳定性并在必要时保留振幅。对预测和分类基准测试的实验表明，基于SDSC的预训练在MSE方面实现了相当或更好的性能，特别是在领域内和资源匮乏的情况下。结果表明，信号表示中的结构保真度提高了语义表示质量，支持将结构感知度量视为传统基于距离的方法的可行替代方案。

更新时间: 2026-01-29 06:31:50

领域: cs.LG,cs.AI,cs.LO

下载: http://arxiv.org/abs/2507.14516v3

White-Box Op-Amp Design via Human-Mimicking Reasoning

This brief proposes \emph{White-Op}, an interpretable operational amplifier (op-amp) parameter design framework based on the human-mimicking reasoning of large-language-model agents. We formalize the implicit human reasoning mechanism into explicit steps of \emph{\textbf{introducing hypothetical constraints}}, and develop an iterative, human-like \emph{\textbf{hypothesis-verification-decision}} workflow. Specifically, the agent is guided to introduce hypothetical constraints to derive and properly regulate positions of symbolically tractable poles and zeros, thus formulating a closed-form mathematical optimization problem, which is then solved programmatically and verified via simulation. Theory-simulation result analysis guides the decision-making for refinement. Experiments on 9 op-amp topologies show that, unlike the uninterpretable black-box baseline which finally fails in 5 topologies, White-Op achieves reliable, interpretable behavioral-level designs with only 8.52\% theoretical prediction error and the design functionality retains after transistor-level mapping for all topologies. White-Op is open-sourced at \textcolor{blue}{https://github.com/zhchenfdu/whiteop}.

Updated: 2026-01-29 06:30:37

标题: 白盒运算放大器设计通过模仿人类推理

摘要: 这篇简要提出了White-Op，一个基于大型语言模型代理人类似推理的可解释运算放大器（运放）参数设计框架。我们将隐含的人类推理机制形式化为\textbf{引入假设约束}的明确步骤，并开发了一个迭代的、类似人类的\textbf{假设-验证-决策}工作流程。具体来说，代理被引导引入假设约束来推导和适当调节符号可处理的极点和零点的位置，从而制定一个封闭形式的数学优化问题，然后通过编程求解并通过模拟验证。理论-模拟结果分析指导了精化的决策。对9种运放拓扑的实验表明，与最终在5种拓扑中失败的不可解释的黑盒基线不同，White-Op实现了可靠的、可解释的行为级设计，理论预测误差仅为8.52％，并且在所有拓扑的晶体管级映射后设计功能保留。White-Op在\textcolor{blue}{https://github.com/zhchenfdu/whiteop}上开源。

更新时间: 2026-01-29 06:30:37

领域: cs.AI

下载: http://arxiv.org/abs/2601.21321v1

QCL-IDS: Quantum Continual Learning for Intrusion Detection with Fidelity-Anchored Stability and Generative Replay

Continual intrusion detection must absorb newly emerging attack stages while retaining legacy detection capability under strict operational constraints, including bounded compute and qubit budgets and privacy rules that preclude long-term storage of raw telemetry. We propose QCL-IDS, a quantum-centric continual-learning framework that co-designs stability and privacy-governed rehearsal for NISQ-era pipelines. Its core component, Q-FISH (Quantum Fisher Anchors), enforces retention using a compact anchor coreset through (i) sensitivity-weighted parameter constraints and (ii) a fidelity-based functional anchoring term that directly limits decision drift on representative historical traffic. To regain plasticity without retaining sensitive flows, QCL-IDS further introduces privacy-preserved quantum generative replay (QGR) via frozen, task-conditioned generator snapshots that synthesize bounded rehearsal samples. Across a three-stage attack stream on UNSW-NB15 and CICIDS2017, QCL-IDS consistently attains the best retention-adaptation trade-off: the gradient-anchor configuration achieves mean Attack-F1 = 0.941 with forgetting = 0.005 on UNSW-NB15 and mean Attack-F1 = 0.944 with forgetting = 0.004 on CICIDS2017, versus 0.800/0.138 and 0.803/0.128 for sequential fine-tuning, respectively.

Updated: 2026-01-29 06:27:05

标题: QCL-IDS：基于忠实锚定稳定性和生成重放的入侵检测的量子持续学习

摘要: 持续的入侵检测必须吸收新出现的攻击阶段，同时在严格的运营约束条件下保留传统的检测能力，包括有限的计算和量子比特预算，以及禁止长期存储原始遥测数据的隐私规则。我们提出了QCL-IDS，这是一个以量子为中心的持续学习框架，为NISQ时代的管道共同设计了稳定性和隐私受控的排练。其核心组件Q-FISH（量子费舍尔锚点）通过（i）敏感性加权参数约束和（ii）基于保真度的功能锚定术语来强制保留，直接限制了代表性历史流量上的决策漂移。为了恢复可塑性而不保留敏感流量，QCL-IDS进一步通过冻结、任务条件的生成器快照引入了隐私保护的量子生成回放（QGR），用于合成有限的排练样本。在UNSW-NB15和CICIDS2017上的三阶段攻击流中，QCL-IDS始终实现了最佳的保留-适应性权衡：梯度-锚点配置在UNSW-NB15上实现了平均攻击F1 = 0.941，遗忘率为0.005，在CICIDS2017上实现了平均攻击F1 = 0.944，遗忘率为0.004，而顺序微调分别为0.800/0.138和0.803/0.128。

更新时间: 2026-01-29 06:27:05

领域: quant-ph,cs.CR

下载: http://arxiv.org/abs/2601.21318v1

Heterogeneous Vertiport Selection Optimization for On-Demand Air Taxi Services: A Deep Reinforcement Learning Approach

Urban Air Mobility (UAM) has emerged as a transformative solution to alleviate urban congestion by utilizing low-altitude airspace, thereby reducing pressure on ground transportation networks. To enable truly efficient and seamless door-to-door travel experiences, UAM requires close integration with existing ground transportation infrastructure. However, current research on optimal integrated routing strategies for passengers in air-ground mobility systems remains limited, with a lack of systematic exploration.To address this gap, we first propose a unified optimization model that integrates strategy selection for both air and ground transportation. This model captures the dynamic characteristics of multimodal transport networks and incorporates real-time traffic conditions alongside passenger decision-making behavior. Building on this model, we propose a Unified Air-Ground Mobility Coordination (UAGMC) framework, which leverages deep reinforcement learning (RL) and Vehicle-to-Everything (V2X) communication to optimize vertiport selection and dynamically plan air taxi routes. Experimental results demonstrate that UAGMC achieves a 34\% reduction in average travel time compared to conventional proportional allocation methods, enhancing overall travel efficiency and providing novel insights into the integration and optimization of multimodal transportation systems. This work lays a solid foundation for advancing intelligent urban mobility solutions through the coordination of air and ground transportation modes. The related code can be found at https://github.com/Traffic-Alpha/UAGMC.

Updated: 2026-01-29 06:26:16

标题: 异构垂直起降场选择优化对按需空中出租车服务的影响：一种深度强化学习方法

摘要: 城市空中移动（UAM）已经成为一种变革性解决方案，通过利用低空域空间来减轻城市拥堵，从而减轻对地面交通网络的压力。为了实现真正高效和无缝的门到门出行体验，UAM需要与现有地面交通基础设施密切整合。然而，目前关于空地一体化交通系统乘客最佳路由策略的研究仍然有限，缺乏系统性的探索。为了填补这一空白，我们首先提出了一个统一的优化模型，该模型整合了空中和地面交通的策略选择。该模型捕捉了多模式运输网络的动态特性，并结合了实时交通状况和乘客决策行为。基于这一模型，我们提出了一个统一的空地移动协调（UAGMC）框架，利用深度强化学习（RL）和车辆对一切（V2X）通信来优化垂直起降场选择和动态规划空中出租车路线。实验结果表明，与传统的比例分配方法相比，UAGMC实现了平均旅行时间减少34％，提高了整体旅行效率，并为多模式交通系统的整合和优化提供了新颖见解。这项工作为通过协调空中和地面交通模式推进智能城市移动解决方案奠定了坚实基础。相关代码可在https://github.com/Traffic-Alpha/UAGMC找到。

更新时间: 2026-01-29 06:26:16

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21316v1

Distributionally Robust Classification for Multi-source Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) is a statistical learning problem when the distribution of training (source) data is different from that of test (target) data. In this setting, one has access to labeled data only from the source domain and unlabeled data from the target domain. The central objective is to leverage the source data and the unlabeled target data to build models that generalize to the target domain. Despite its potential, existing UDA approaches often struggle in practice, particularly in scenarios where the target domain offers only limited unlabeled data or spurious correlations dominate the source domain. To address these challenges, we propose a novel distributionally robust learning framework that models uncertainty in both the covariate distribution and the conditional label distribution. Our approach is motivated by the multi-source domain adaptation setting but is also directly applicable to the single-source scenario, making it versatile in practice. We develop an efficient learning algorithm that can be seamlessly integrated with existing UDA methods. Extensive experiments under various distribution shift scenarios show that our method consistently outperforms strong baselines, especially when target data are extremely scarce.

Updated: 2026-01-29 06:23:14

标题: 多源无监督域适应的分布鲁棒分类

摘要: 非监督领域自适应（UDA）是一个统计学习问题，当训练（源）数据的分布与测试（目标）数据不同时。在这种情况下，只能访问来自源域的标记数据和来自目标域的未标记数据。中心目标是利用源数据和未标记目标数据构建泛化到目标域的模型。尽管具有潜力，但现有的UDA方法在实践中经常遇到困难，特别是在目标域提供的未标记数据有限或者源域受到虚假相关性主导的情况下。为了应对这些挑战，我们提出了一个新颖的分布鲁棒学习框架，该框架模拟了协变量分布和条件标签分布的不确定性。我们的方法受多源域适应设置的启发，但也直接适用于单源情景，使其在实践中具有多功能性。我们开发了一种高效的学习算法，可以无缝集成到现有的UDA方法中。在各种分布转移场景下进行的大量实验表明，我们的方法始终优于强基线方法，特别是当目标数据极度稀少时。

更新时间: 2026-01-29 06:23:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21315v1

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs

As increasingly large pre-trained models are released, deploying them on edge devices for privacy-preserving applications requires effective compression. Recent works combine quantization with the fine-tuning of high-precision LoRA adapters, which can substantially reduce model size while mitigating the accuracy loss from quantization. However, edge devices have inherently heterogeneous capabilities, while performing configuration-wise fine-tuning for every quantization setting is computationally prohibitive. In this paper, we propose CoA-LoRA, a method that dynamically adjusts the LoRA adapter to arbitrary quantization configurations (i.e., the per-layer bit-width choices of a pre-trained model) without requiring repeated fine-tuning. This is accomplished via a configuration-aware model that maps each configuration to its low-rank adjustments. The effectiveness of this model critically depends on the training configuration set, a collection of configurations chosen to cover different total bit-width budgets. However, constructing a high-quality configuration set is non-trivial. We therefore design a Pareto-based configuration search that iteratively optimizes the training configuration set, yielding more precise low-rank adjustments. Our experiments demonstrate that, unlike the state-of-the-art methods that require fine-tuning a separate LoRA adapter for each configuration, CoA-LoRA incurs no additional time cost while achieving comparable or even superior performance to those methods.

Updated: 2026-01-29 06:12:06

标题: 即时调整量化：适应配置的 LoRA 用于高效微调量化 LLMs

摘要: 随着越来越大的预训练模型的发布，将它们部署在边缘设备上用于隐私保护应用需要有效的压缩。最近的研究将量化与高精度LoRA适配器的微调结合起来，可以大幅减小模型大小，同时减轻量化带来的准确性损失。然而，边缘设备具有固有的异构能力，为每个量化设置进行配置式微调是计算上不可行的。在本文中，我们提出了CoA-LoRA方法，该方法可以动态调整LoRA适配器以适应任意的量化配置（即预训练模型的每个层的位宽选择），而无需重复微调。这是通过一种配置感知模型实现的，该模型将每个配置映射到其低秩调整。这个模型的有效性关键取决于训练配置集，即一个选择不同总位宽预算的配置集。然而，构建一个高质量的配置集并不是简单的。因此，我们设计了一个基于帕累托的配置搜索，通过迭代优化训练配置集，得到更精确的低秩调整。我们的实验表明，与需要为每个配置微调单独的LoRA适配器的最先进方法不同，CoA-LoRA没有额外的时间成本，同时实现了与这些方法相当甚至更好的性能。

更新时间: 2026-01-29 06:12:06

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.25214v2

Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control

Large language model (LLM)-powered agents can translate high-level user intents into plans and actions in an environment. Yet after observing an outcome, users may wonder: What if I had phrased my intent differently? We introduce a framework that enables such counterfactual reasoning in agentic LLM-driven control scenarios, while providing formal reliability guarantees. Our approach models the closed-loop interaction between a user, an LLM-based agent, and an environment as a structural causal model (SCM), and leverages test-time scaling to generate multiple candidate counterfactual outcomes via probabilistic abduction. Through an offline calibration phase, the proposed conformal counterfactual generation (CCG) yields sets of counterfactual outcomes that are guaranteed to contain the true counterfactual outcome with high probability. We showcase the performance of CCG on a wireless network control use case, demonstrating significant advantages compared to naive re-execution baselines.

Updated: 2026-01-29 06:09:51

标题: 我应该表达一个不同的意图吗？基于LLM的自主控制的反事实生成

摘要: 大型语言模型（LLM）驱动的代理可以将高级用户意图转化为计划和行动在一个环境中。然而，在观察结果后，用户可能会想知道：如果我表达我的意图不同会怎么样？我们引入了一个框架，可以在代理LLM驱动的控制场景中实现这种反事实推理，同时提供形式上的可靠性保证。我们的方法将用户、基于LLM的代理和环境之间的闭环交互建模为一个结构因果模型（SCM），并利用测试时缩放通过概率推断生成多个候选的反事实结果。通过离线校准阶段，提出的符合性反事实生成（CCG）产生一组带有高概率真实反事实结果的反事实结果。我们在无线网络控制用例上展示了CCG的性能，与朴素重新执行基线相比具有显著优势。

更新时间: 2026-01-29 06:09:51

领域: cs.AI

下载: http://arxiv.org/abs/2601.20090v2

The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

The open-weight language model ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single breaker token that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and evades outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge

Updated: 2026-01-29 06:04:53

标题: 词汇中的特洛伊木马：对LLM作文的潜在破坏

摘要: 开放权重语言模型生态系统越来越多地被模型组合技术所定义（如权重合并、猜测解码和词汇扩展），这些技术重新组合了来自不同来源的能力。将这些方法应用于不同模型系列的一个关键前提是标记器移植，它将不兼容的词汇对齐到一个共享的嵌入空间。我们展示了这一关键的互操作性步骤引入了供应链漏洞：我们设计了一个单一的破坏者令牌，在捐赠模型中功能惰性，但在移植到基础模型后可可靠地重构为一个高显著的恶意特征。通过利用系数重用的几何特性，我们的攻击破坏了基础模型的生成，同时使捐赠模型的效用在统计上无法与正常行为区分开。我们将这个问题形式化为一个双目标优化问题，并使用稀疏求解器实例化攻击。从经验上讲，这种攻击无需训练，并且可以逃避异常值检测，同时还展示了对微调和权重合并的结构持久性，突显了模块化AI组合管道中的潜在风险。代码可在https://github.com/xz-liu/tokenforge上找到。

更新时间: 2026-01-29 06:04:53

领域: cs.LG,cs.CL,cs.CR

下载: http://arxiv.org/abs/2601.00065v2

The Surprising Difficulty of Search in Model-Based Reinforcement Learning

This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a plug-and-play replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating distribution shift matters more than improving model or value function accuracy. Building on this insight, we identify key techniques for enabling effective search, achieving state-of-the-art performance across multiple popular benchmark domains.

Updated: 2026-01-29 05:58:24

标题: 基于模型的强化学习中搜索的令人惊讶的困难性

摘要: 这篇论文研究了基于模型的强化学习中的搜索问题。传统观点认为，长期预测和复合误差是模型驱动强化学习的主要障碍。我们挑战了这一观点，表明搜索并非是学习策略的即插即用替代品。令人惊讶的是，我们发现即使模型非常准确，搜索也可能损害性能。相反，我们表明，减轻分布偏移比提高模型或价值函数准确性更为重要。基于这一认识，我们确定了实现有效搜索的关键技术，实现了多个热门基准领域的最先进性能。

更新时间: 2026-01-29 05:58:24

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21306v1

Large Vision Models Can Solve Mental Rotation Problems

Mental rotation is a key test of spatial reasoning in humans and has been central to understanding how perception supports cognition. Despite the success of modern vision transformers, it is still unclear how well these models develop similar abilities. In this work, we present a systematic evaluation of ViT, CLIP, DINOv2, and DINOv3 across a range of mental-rotation tasks, from simple block structures similar to those used by Shepard and Metzler to study human cognition, to more complex block figures, three types of text, and photo-realistic objects. By probing model representations layer by layer, we examine where and how these networks succeed. We find that i) self-supervised ViTs capture geometric structure better than supervised ViTs; ii) intermediate layers perform better than final layers; iii) task difficulty increases with rotation complexity and occlusion, mirroring human reaction times and suggesting similar constraints in embedding space representations.

Updated: 2026-01-29 05:53:48

标题: 大视觉模型可以解决心理旋转问题。

摘要: 思维旋转是人类空间推理的关键测试，并且一直是理解知觉如何支持认知的中心。尽管现代视觉转换器取得了成功，但这些模型如何发展类似的能力仍不清楚。在这项工作中，我们对ViT、CLIP、DINOv2和DINOv3在一系列思维旋转任务中进行了系统评估，从Shepard和Metzler用于研究人类认知的简单块结构，到更复杂的块图形、三种文本和逼真的物体。通过逐层探测模型表示，我们研究这些网络在哪里以及如何成功。我们发现：i）自监督ViTs比监督ViTs更好地捕捉几何结构；ii）中间层比最终层表现更好；iii）任务难度随着旋转复杂性和遮挡而增加，反映了人类反应时间，并暗示嵌入空间表示中存在类似的约束。

更新时间: 2026-01-29 05:53:48

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.15271v2

Grounding and Enhancing Informativeness and Utility in Dataset Distillation

Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1\% performance improvement over the previous state-of-the-art approach on ImageNet-1K dataset using ResNet-18.

Updated: 2026-01-29 05:49:17

标题: 数据集提炼中的信息性和实用性的基础和增强

摘要: 数据集精炼（DD）旨在从大型现实世界数据集中创建一个紧凑的数据集。尽管近期的方法通常依赖于启发式方法来平衡效率和质量，但原始数据和合成数据之间的基本关系仍未得到充分探讨。本文在坚实的理论框架内重新审视基于知识蒸馏的数据集精炼。我们引入了信息性和效用的概念，分别捕捉样本中的关键信息和训练集中的重要样本。基于这些原则，我们在数学上定义了最佳数据集精炼。然后，我们提出了InfoUtil，一个在合成精炼数据集中平衡信息性和效用的框架。InfoUtil包含两个关键组件：（1）使用Shapley值归因进行博弈论信息性最大化，从样本中提取关键信息；（2）通过基于梯度范数选择全局影响力样本，实现原则性效用最大化。这些组件确保精炼数据集既具有信息性又经过效用优化。实验证明，我们的方法在使用ResNet-18处理ImageNet-1K数据集时，性能比先前最先进的方法提高了6.1％。

更新时间: 2026-01-29 05:49:17

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21296v1

Deep GraphRAG: A Balanced Approach to Hierarchical Retrieval and Adaptive Integration

Graph-based Retrieval-Augmented Generation (GraphRAG) frameworks face a trade-off between the comprehensiveness of global search and the efficiency of local search. Existing methods are often challenged by navigating large-scale hierarchical graphs, optimizing retrieval paths, and balancing exploration-exploitation dynamics, frequently lacking robust multi-stage re-ranking. To overcome these deficits, we propose Deep GraphRAG, a framework designed for a balanced approach to hierarchical retrieval and adaptive integration. It introduces a hierarchical global-to-local retrieval strategy that integrates macroscopic inter-community and microscopic intra-community contextual relations. This strategy employs a three-stage process: (1) inter-community filtering, which prunes the search space using local context; (2) community-level refinement, which prioritizes relevant subgraphs via entity-interaction analysis; and (3) entity-level fine-grained search within target communities. A beam search-optimized dynamic re-ranking module guides this process, continuously filtering candidates to balance efficiency and global comprehensiveness. Deep GraphRAG also features a Knowledge Integration Module leveraging a compact LLM, trained with Dynamic Weighting Reward GRPO (DW-GRPO). This novel reinforcement learning approach dynamically adjusts reward weights to balance three key objectives: relevance, faithfulness, and conciseness. This training enables compact models (1.5B) to approach the performance of large models (70B) in the integration task. Evaluations on Natural Questions and HotpotQA demonstrate that Deep GraphRAG significantly outperforms baseline graph retrieval methods in both accuracy and efficiency.

Updated: 2026-01-29 05:46:54

标题: Deep GraphRAG：一种平衡的层次检索和自适应集成方法

摘要: 基于图的检索增强生成（GraphRAG）框架面临全局搜索的全面性和局部搜索的效率之间的权衡。现有方法常常受到导航大规模分层图、优化检索路径以及平衡勘探-开发动态等挑战，经常缺乏强大的多阶段重新排序。为了克服这些缺陷，我们提出了Deep GraphRAG，这是一个专为层次化检索和自适应集成设计的框架。它引入了一种层次化的全局到本地检索策略，集成了宏观社区间和微观社区内部的上下文关系。该策略采用了一个三阶段的过程：（1）社区间过滤，通过本地上下文修剪搜索空间；（2）社区级别的细化，通过实体相互作用分析优先考虑相关子图；以及（3）在目标社区内进行实体级的细粒度搜索。一个经过波束搜索优化的动态重新排序模块引导这一过程，不断过滤候选项以平衡效率和全局全面性。Deep GraphRAG还具有一个利用紧凑LLM的知识集成模块，该模块利用Dynamic Weighting Reward GRPO（DW-GRPO）进行训练。这种新颖的强化学习方法动态调整奖励权重，以平衡三个关键目标：相关性、忠实度和简洁性。这种训练使得紧凑模型（1.5B）能够接近大型模型（70B）在集成任务中的表现。在自然问题和HotpotQA上的评估表明，Deep GraphRAG在准确性和效率方面显著优于基线图检索方法。

更新时间: 2026-01-29 05:46:54

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2601.11144v3

Physics-Guided Tiny-Mamba Transformer for Reliability-Aware Early Fault Warning

Reliability-centered prognostics for rotating machinery requires early warning signals that remain accurate under nonstationary operating conditions, domain shifts across speed/load/sensors, and severe class imbalance, while keeping the false-alarm rate small and predictable. We propose the Physics-Guided Tiny-Mamba Transformer (PG-TMT), a compact tri-branch encoder tailored for online condition monitoring. A depthwise-separable convolutional stem captures micro-transients, a Tiny-Mamba state-space branch models near-linear long-range dynamics, and a lightweight local Transformer encodes cross-channel resonances. We derive an analytic temporal-to-spectral mapping that ties the model's attention spectrum to classical bearing fault-order bands, yielding a band-alignment score that quantifies physical plausibility and provides physics-grounded explanations. To ensure decision reliability, healthy-score exceedances are modeled with extreme-value theory (EVT), which yields an on-threshold achieving a target false-alarm intensity (events/hour); a dual-threshold hysteresis with a minimum hold time further suppresses chatter. Under a leakage-free streaming protocol with right-censoring of missed detections on CWRU, Paderborn, XJTU-SY, and an industrial pilot, PG-TMT attains higher precision-recall AUC (primary under imbalance), competitive or better ROC AUC, and shorter mean time-to-detect at matched false-alarm intensity, together with strong cross-domain transfer. By coupling physics-aligned representations with EVT-calibrated decision rules, PG-TMT delivers calibrated, interpretable, and deployment-ready early warnings for reliability-centric prognostics and health management.

Updated: 2026-01-29 05:46:12

标题: 物理引导的微型曼巴变压器用于可靠性意识的早期故障警示

摘要: 基于可靠性的旋转机械预测需要在非平稳工况、速度/负载/传感器之间的领域转移以及严重的类别不平衡下保持精确的预警信号，同时保持较小且可预测的误报率。我们提出了物理引导的微型曼巴变压器（PG-TMT），这是一种专为在线状态监测定制的紧凑三分支编码器。一个深度可分离的卷积起始器捕捉微瞬态，一个微型曼巴状态空间分支建模近线性的长程动态，一个轻量级的局部变压器编码跨通道共振。我们推导出一个将模型的注意力谱与经典轴承故障阶段带相联系的分析时域到频谱映射，产生一个量化物理合理性并提供基于物理的解释的带对齐分数。为确保决策的可靠性，健康得分的超出值采用极值理论（EVT）建模，该理论得出一个达到目标误报强度的阈值（事件/小时）；一个双阈值滞后现象，最小保持时间进一步抑制了杂波。在CWRU、Paderborn、XJTU-SY和一个工业试点上采用无泄漏的流媒体协议和错过检测的正确剪裁，PG-TMT在匹配的误报强度下达到更高的精确召回AUC（主要是在不平衡情况下），具有竞争力或更好的ROC AUC，并且在较短的检测时间内实现匹配的误报强度，同时具有强大的跨领域转移能力。通过将物理对齐的表示与EVT校准的决策规则相结合，PG-TMT提供了经过校准、可解释和即可部署的早期警报，用于可靠性为中心的预测和健康管理。

更新时间: 2026-01-29 05:46:12

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21293v1

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.

Updated: 2026-01-29 05:41:24

标题: Drive-KD：自主驾驶中VLMs的多教师蒸馏

摘要: 自动驾驶是一项重要且关乎安全的任务，最近LLMs/VLMs方面的进展为这一领域的推理和规划提供了新的可能性。然而，大型模型需要大量的GPU内存，并且具有较高的推理延迟，而传统的监督微调(SFT)经常难以弥补小型模型的能力差距。为了解决这些限制，我们提出了Drive-KD框架，将自动驾驶分解为“感知-推理-规划”三元组，并通过知识蒸馏将这些能力转移。我们确定特定层的关注作为蒸馏信号，构建能力特定的单教师模型，优于基线模型。此外，我们将这些单教师设置统一到一个多教师蒸馏框架中，并引入不对称梯度投影以减轻跨能力梯度冲突。广泛的评估验证了我们的方法在不同模型家族和规模上的泛化能力。实验表明，我们的蒸馏InternVL3-1B模型，GPU内存减少约42倍，吞吐量提高约11.4倍，其在DriveBench上的整体性能优于同一家族的预训练78B模型，并超过GPT-5.1在规划维度上，为高效的自动驾驶VLMs提供了见解。

更新时间: 2026-01-29 05:41:24

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2601.21288v1

Towards Zero Rotation and Beyond: Architecting Neural Networks for Fast Secure Inference with Homomorphic Encryption

Privacy-preserving deep learning addresses privacy concerns in Machine Learning as a Service (MLaaS) by using Homomorphic Encryption (HE) for linear computations. However, the computational overhead remains a major challenge. While prior work has improved efficiency, most approaches build on models originally designed for plaintext inference. Such models incur architectural inefficiencies when adapted to HE. We argue that substantial gains require networks tailored to HE rather than retrofitting plaintext architectures. Our design has two components: the building block and the overall architecture. First, StriaBlock targets the most expensive HE operation, rotation. It integrates ExRot-Free Convolution and a novel Cross Kernel, eliminating external rotations and requiring only 19% of the internal rotations used by plaintext models. Second, our architectural principles include (i) the Focused Constraint Principle, which limits cost-sensitive factors while preserving flexibility elsewhere, and (ii) the Channel Packing-Aware Scaling Principle, which adapts bottleneck ratios to ciphertext channel capacity that varies with depth. Together, these strategies control both local and end-to-end HE cost, enabling a balanced HE-tailored network. We evaluate the resulting StriaNet across datasets of varying scales, including ImageNet, Tiny ImageNet, and CIFAR-10. At comparable accuracy, StriaNet achieves speedups of 9.78x, 6.01x, and 9.24x on ImageNet, Tiny ImageNet, and CIFAR-10, respectively.

Updated: 2026-01-29 05:40:05

标题: 朝零旋转和更远方向：使用同态加密为快速安全推理设计神经网络

摘要: 隐私保护深度学习通过使用同态加密（HE）进行线性计算，解决了机器学习作为服务（MLaaS）中的隐私问题。然而，计算开销仍然是一个主要挑战。尽管先前的工作已经提高了效率，但大多数方法都是基于最初为明文推理设计的模型。当这些模型适应HE时，会产生架构效率低下的问题。我们认为，实质性的收益需要针对HE定制网络，而不是改装明文架构。我们的设计包括两个组件：构建模块和整体架构。首先，StriaBlock专注于最昂贵的HE操作-旋转。它集成了ExRot-Free卷积和一种新颖的Cross Kernel，消除了外部旋转，并且仅需要明文模型使用的内部旋转的19％。其次，我们的架构原则包括（i）专注约束原则，限制成本敏感因素，同时在其他方面保持灵活性，以及（ii）通道打包感知缩放原则，根据深度变化的密文通道容量调整瓶颈比率。通过这些策略，控制了局部和端到端的HE成本，实现了平衡的HE定制网络。我们评估了得到的StriaNet在不同规模的数据集上的表现，包括ImageNet、Tiny ImageNet和CIFAR-10。在相同准确度的情况下，StriaNet在ImageNet、Tiny ImageNet和CIFAR-10上分别实现了9.78倍、6.01倍和9.24倍的加速。

更新时间: 2026-01-29 05:40:05

领域: cs.CR

下载: http://arxiv.org/abs/2601.21287v1

Non-Identical Diffusion Models in MIMO-OFDM Channel Generation

We propose a novel diffusion model, termed the non-identical diffusion model, and investigate its application to wireless orthogonal frequency division multiplexing (OFDM) channel generation. Unlike the standard diffusion model that uses a scalar-valued time index to represent the global noise level, we extend this notion to an element-wise time indicator to capture local error variations more accurately. Non-identical diffusion enables us to characterize the reliability of each element (e.g., subcarriers in OFDM) within the noisy input, leading to improved generation results when the initialization is biased. Specifically, we focus on the recovery of wireless multi-input multi-output (MIMO) OFDM channel matrices, where the initial channel estimates exhibit highly uneven reliability across elements due to the pilot scheme. Conventional time embeddings, which assume uniform noise progression, fail to capture such variability across pilot schemes and noise levels. We introduce a matrix that matches the input size to control element-wise noise progression. Following a similar diffusion procedure to existing methods, we show the correctness and effectiveness of the proposed non-identical diffusion scheme both theoretically and numerically. For MIMO-OFDM channel generation, we propose a dimension-wise time embedding strategy. We also develop and evaluate multiple training and generation methods and compare them through numerical experiments.

Updated: 2026-01-29 05:38:26

标题: 多输入多输出正交频分复用信道生成中的非相同扩散模型

摘要: 我们提出了一种新颖的扩散模型，称为非相同扩散模型，并研究其在无线正交频分复用（OFDM）信道生成中的应用。与使用标量值时间索引来表示全局噪声水平的标准扩散模型不同，我们将这一概念扩展为元素级时间指示器，以更准确地捕捉局部误差变化。非相同扩散使我们能够表征噪声输入中每个元素（例如OFDM中的子载波）的可靠性，从而在初始化存在偏差时获得改进的生成结果。具体而言，我们关注无线多输入多输出（MIMO）OFDM信道矩阵的恢复，其中由于导频方案，初始信道估计在元素之间的可靠性极不均匀。传统的时间嵌入假设噪声进展是均匀的，无法捕捉导频方案和噪声水平之间的这种变异性。我们引入一个矩阵，将输入大小匹配以控制元素级噪声进展。通过类似的扩散过程到现有方法，我们从理论和数值上展示了所提出的非相同扩散方案的正确性和有效性。对于MIMO-OFDM信道生成，我们提出了一种按维度的时间嵌入策略。我们还开发和评估了多种训练和生成方法，并通过数值实验进行了比较。

更新时间: 2026-01-29 05:38:26

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.01641v2

Zenith: Scaling up Ranking Models for Billion-scale Livestreaming Recommendation

Accurately capturing feature interactions is essential in recommender systems, and recent trends show that scaling up model capacity could be a key driver for next-level predictive performance. While prior work has explored various model architectures to capture multi-granularity feature interactions, relatively little attention has been paid to efficient feature handling and scaling model capacity without incurring excessive inference latency. In this paper, we address this by presenting Zenith, a scalable and efficient ranking architecture that learns complex feature interactions with minimal runtime overhead. Zenith is designed to handle a few high-dimensional Prime Tokens with Token Fusion and Token Boost modules, which exhibits superior scaling laws compared to other state-of-the-art ranking methods, thanks to its improved token heterogeneity. Its real-world effectiveness is demonstrated by deploying the architecture to TikTok Live, a leading online livestreaming platform that attracts billions of users globally. Our A/B test shows that Zenith achieves +1.05%/-1.10% in online CTR AUC and Logloss, and realizes +9.93% gains in Quality Watch Session / User and +8.11% in Quality Watch Duration / User.

Updated: 2026-01-29 05:36:49

标题: 天顶：为数十亿规模的直播推荐扩展排名模型

摘要: 准确捕捉特征交互在推荐系统中至关重要，最近的趋势表明，扩大模型容量可能是下一级预测性能的关键驱动因素。尽管先前的研究已探讨了各种模型架构来捕捉多粒度特征交互，但相对较少关注高效特征处理和扩展模型容量而不引起过多推理延迟。在本文中，我们通过提出Zenith来解决这个问题，这是一种可伸缩和高效的排名架构，能够以最小的运行时开销学习复杂的特征交互。Zenith旨在通过Token Fusion和Token Boost模块处理少量高维Prime Token，与其他最先进的排名方法相比展现出卓越的扩展规律，这要归功于其改进的Token异质性。通过将该架构部署到全球吸引数十亿用户的领先在线直播平台TikTok Live，我们展示了其在实际世界中的有效性。我们的A/B测试结果显示，Zenith在在线点击率AUC和Logloss上分别实现了+1.05%/-1.10%的增益，实现了Quality Watch Session/User和Quality Watch Duration/User分别增长了+9.93%和+8.11%。

更新时间: 2026-01-29 05:36:49

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21285v1

PILD: Physics-Informed Learning via Diffusion

Diffusion models have emerged as powerful generative tools for modeling complex data distributions, yet their purely data-driven nature limits applicability in practical engineering and scientific problems where physical laws need to be followed. This paper proposes Physics-Informed Learning via Diffusion (PILD), a framework that unifies diffusion modeling and first-principles physical constraints by introducing a virtual residual observation sampled from a Laplace distribution to supervise generation during training. To further integrate physical laws, a conditional embedding module is incorporated to inject physical information into the denoising network at multiple layers, ensuring consistent guidance throughout the diffusion process. The proposed PILD framework is concise, modular, and broadly applicable to problems governed by ordinary differential equations, partial differential equations, as well as algebraic equations or inequality constraints. Extensive experiments across engineering and scientific tasks including estimating vehicle trajectories, tire forces, Darcy flow and plasma dynamics, demonstrate that our PILD substantially improves accuracy, stability, and generalization over existing physics-informed and diffusion-based baselines.

Updated: 2026-01-29 05:33:51

标题: PILD: 通过扩散进行物理信息学习

摘要: 扩散模型已经成为建模复杂数据分布的强大生成工具，然而它们纯粹基于数据驱动的特性限制了在需要遵循物理定律的实际工程和科学问题中的适用性。本文提出了一种名为Physics-Informed Learning via Diffusion（PILD）的框架，通过引入从拉普拉斯分布中抽样的虚拟残差观察来监督训练期间的生成过程，将扩散建模与第一原理物理约束统一起来。为了进一步整合物理定律，还引入了一个条件嵌入模块，将物理信息注入到去噪网络的多个层中，确保在整个扩散过程中的一致指导。所提出的PILD框架简洁、模块化，并广泛适用于受常微分方程、偏微分方程、以及代数方程或不等式约束支配的问题。在包括估计车辆轨迹、轮胎力、达西流和等离子体动力学在内的工程和科学任务中的大量实验表明，我们的PILD显著改善了准确性、稳定性和泛化性，超过了现有的基于物理信息和基于扩散的基线模型。

更新时间: 2026-01-29 05:33:51

领域: cs.LG,cs.AI,cs.ET,math.AP

下载: http://arxiv.org/abs/2601.21284v1

DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher

LLM unlearning is a technique to remove the impacts of undesirable knowledge from the model without retraining from scratch, which is indispensable towards trustworthy AI. Existing unlearning methods face significant limitations: conventional tuning-based unlearning is computationally heavy and prone to catastrophic forgetting. In contrast, in-contextualized unlearning is lightweight for precise unlearning but vulnerable to prompt removal or reverse engineering attacks. In response, we propose Distilled Unlearning from an Efficient Teacher (DUET), a novel distillation-based unlearning method that combines the merits of these two lines of work. It learns a student model to imitate the behavior of a prompt-steered teacher that effectively refuses undesirable knowledge generation while preserving general domain knowledge. Extensive evaluations on existing benchmarks with our enriched evaluation protocols demonstrate that DUET achieves higher performance in both forgetting and utility preservation, while being orders of magnitude more data-efficient than state-of-the-art unlearning methods.

Updated: 2026-01-29 05:32:35

标题: DUET：来自高效上下文化教师的精炼LLM去学习

摘要: LLM取消学习是一种技术，可以从模型中删除不良知识的影响，而无需从头开始重新训练，这对于可信赖的人工智能是不可或缺的。现有的取消学习方法面临着重大限制：传统的基于调整的取消学习计算量大且容易出现灾难性遗忘。相反，上下文化取消学习对精确取消学习来说是轻量级的，但容易受到提示删除或逆向工程攻击的影响。为了应对这一挑战，我们提出了一种基于蒸馏的取消学习方法DUET，它结合了这两种工作线的优点。它学习一个学生模型来模仿一个有效拒绝不良知识生成的提示引导教师的行为，同时保留一般领域知识。通过我们丰富的评估协议对现有基准进行广泛评估，结果表明DUET在遗忘和效用保留方面都表现出更高的性能，同时比最先进的取消学习方法更具数据效率。

更新时间: 2026-01-29 05:32:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21283v1

NEXUS: Bit-Exact ANN-to-SNN Equivalence via Neuromorphic Gate Circuits with Surrogate-Free Training

Spiking Neural Networks (SNNs) promise energy-efficient computing through event-driven sparsity, yet all existing approaches sacrifice accuracy by approximating continuous values with discrete spikes. We propose NEXUS, a framework that achieves bit-exact ANN-to-SNN equivalence -- not approximate, but mathematically identical outputs. Our key insight is constructing all arithmetic operations, both linear and nonlinear, from pure IF neuron logic gates that implement IEEE-754 compliant floating-point arithmetic. Through spatial bit encoding (zero encoding error by construction), hierarchical neuromorphic gate circuits (from basic logic gates to complete transformer layers), and surrogate-free STE training (exact identity mapping rather than heuristic approximation), NEXUS produces outputs identical to standard ANNs up to machine precision. Experiments on models up to LLaMA-2 70B demonstrate identical task accuracy (0.00\% degradation) with mean ULP error of only 6.19, while achieving 27-168,000$\times$ energy reduction on neuromorphic hardware. Crucially, spatial bit encoding's single-timestep design renders the framework inherently immune to membrane potential leakage (100\% accuracy across all decay factors $β\in[0.1,1.0]$), while tolerating synaptic noise up to $σ=0.2$ with >98\% gate-level accuracy.

Updated: 2026-01-29 05:23:56

标题: NEXUS：通过神经形态门电路的无替代训练实现位精确的ANN到SNN等效性

摘要: 脉冲神经网络（SNNs）通过事件驱动的稀疏性实现了高效能计算的承诺，然而所有现有方法都通过用离散脉冲近似连续值来牺牲准确性。我们提出了NEXUS框架，该框架实现了位精确的人工神经网络（ANN）到SNN的等价性，而不是近似，而是数学上完全相同的输出。我们的关键洞见是构建所有算术操作，包括线性和非线性操作，都是从实现IEEE-754兼容的浮点算术的纯IF神经元逻辑门中进行的。通过空间位编码（通过设计零编码错误），分层神经形态门电路（从基本逻辑门到完整的变压器层），以及无代理STE训练（精确的恒等映射而不是启发式近似），NEXUS产生的输出与标准ANNs的输出相同，直到机器精度。对LLaMA-2 70B等模型的实验表明，任务准确性相同（0.00\%降级），平均ULP误差仅为6.19，同时在神经形态硬件上实现了27-168,000倍的能耗降低。关键是，空间位编码的单时间步设计使得该框架本质上对膜电位泄漏具有免疫力（在所有衰减因子$β\in[0.1,1.0]$时100\%准确性），同时容忍高达$σ=0.2$的突触噪声，门级准确性超过98\%。

更新时间: 2026-01-29 05:23:56

领域: cs.NE,cs.AI

下载: http://arxiv.org/abs/2601.21279v1

GeoRC: A Benchmark for Geolocation Reasoning Chains

Vision Language Models (VLMs) are good at recognizing the global location of a photograph -- their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at prediction locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Open weights VLMs such as Llama and Qwen catastrophically fail on our benchmark -- they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images.

Updated: 2026-01-29 05:18:40

标题: GeoRC: 地理位置推理链的基准测试

摘要: 视觉语言模型（VLMs）擅长识别照片的全局位置--它们的地理位置预测准确性可以与最优秀的人类专家相匹敌。但许多VLMs在解释其预测所依据的图像证据方面表现惊人地糟糕，即使它们的位置预测是正确的。VLMs产生的推理链经常会产生幻觉般的场景属性来支持它们的位置预测（例如虚拟写作、想象的基础设施、误认的植物）。在本文中，我们引入了第一个地理位置推理链的基准。我们专注于流行的GeoGuessr游戏中的全球位置预测任务，该游戏涵盖了超过100个国家的谷歌街景。我们与专家GeoGuessr玩家合作，包括现任世界冠军，为500个查询场景制作了800个基准推理链。这些专家推理链涵盖了数百种不同的区分性视觉属性，例如车牌形状、建筑物和土壤特性等。我们评估了LLM作为评判者和VLM作为评判者策略，用于对VLM生成的推理链与我们的专家推理链进行评分，并发现Qwen 3 LLM作为评判者与人类评分最相关。我们的基准测试显示，尽管像Gemini和GPT 5这样的大型封闭源VLMs在预测位置方面可以与人类专家相匹敌，但在生成可审计的推理链方面仍远远落后于人类专家。开放权重的VLMs，如Llama和Qwen，在我们的基准测试中遭遇惨败--它们的表现仅略优于基线，即LLM在具有照片位置的神谕知识但没有任何视觉信息的情况下产生推理链的情况。我们认为，在这一任务上人类专家和VLMs之间的差距表明VLMs在从高分辨率图像中提取细粒度视觉属性方面存在局限性。

更新时间: 2026-01-29 05:18:40

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2601.21278v1

More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests

Large Language Model (LLM) Agents are advancing quickly, with the increasing leveraging of LLM Agents to assist in development tasks such as code generation. While LLM Agents accelerate code generation, studies indicate they may introduce adverse effects on development. However, existing metrics solely measure pass rates, failing to reflect impacts on long-term maintainability and readability, and failing to capture human intuitive evaluations of PR. To increase the comprehensiveness of this problem, we investigate and evaluate the characteristics of LLM to know the pull requests' characteristics beyond the pass rate. We observe the code quality and maintainability within PRs based on code metrics to evaluate objective characteristics and developers' reactions to the pull requests from both humans and LLM's generation. Evaluation results indicate that LLM Agents frequently disregard code reuse opportunities, resulting in higher levels of redundancy compared to human developers. In contrast to the quality issues, our emotions analysis reveals that reviewers tend to express more neutral or positive emotions towards AI-generated contributions than human ones. This disconnect suggests that the surface-level plausibility of AI code masks redundancy, leading to the silent accumulation of technical debt in real-world development environments. Our research provides insights for improving human-AI collaboration.

Updated: 2026-01-29 05:13:21

标题: 更多代码，更少重用：研究代码质量和审阅者对AI生成的拉取请求的情感

摘要: 大型语言模型(LLM)代理迅速发展，越来越多地利用LLM代理来辅助开发任务，如代码生成。虽然LLM代理加速了代码生成，研究表明它们可能会对开发产生不利影响。然而，现有的度量仅仅衡量通过率，未能反映对长期可维护性和可读性的影响，也未能捕捉对PR的人类直觉评估。为了增加这一问题的全面性，我们研究和评估LLM的特征，以了解超出通过率的PR特征。我们根据代码度量标准观察PR中的代码质量和可维护性，评估客观特征以及开发人员对来自人类和LLM生成的PR的反应。评估结果表明，LLM代理经常忽略代码重用机会，导致冗余程度高于人类开发人员。与质量问题相反，我们的情感分析显示，评审者倾向于对AI生成的贡献表达更中立或积极的情绪，而不是人类的贡献。这种脱节表明，AI代码的表面可信度掩盖了冗余，导致在现实开发环境中技术债务的悄然积累。我们的研究为改善人工智能与人类的协作提供了见解。

更新时间: 2026-01-29 05:13:21

领域: cs.SE,cs.AI,cs.HC

下载: http://arxiv.org/abs/2601.21276v1

MMGRid: Navigating Temporal-aware and Cross-domain Generative Recommendation via Model Merging

Model merging (MM) offers an efficient mechanism for integrating multiple specialized models without access to original training data or costly retraining. While MM has demonstrated success in domains like computer vision, its role in recommender systems (RSs) remains largely unexplored. Recently, Generative Recommendation (GR) has emerged as a new paradigm in RSs, characterized by rapidly growing model scales and substantial computational costs, making MM particularly appealing for cost-sensitive deployment scenarios. In this work, we present the first systematic study of MM in GR through a contextual lens. We focus on a fundamental yet underexplored challenge in real-world: how to merge generative recommenders specialized to different real-world contexts, arising from temporal evolving user behaviors and heterogeneous application domains. To this end, we propose a unified framework MMGRid, a structured contextual grid of GR checkpoints that organizes models trained under diverse contexts induced by temporal evolution and domain diversity. All checkpoints are derived from a shared base LLM but fine-tuned on context-specific data, forming a realistic and controlled model space for systematically analyzing MM across GR paradigms and merging algorithms. Our investigation reveals several key insights. First, training GR models from LLMs can introduce parameter conflicts during merging due to token distribution shifts and objective disparities; such conflicts can be alleviated by disentangling task-aware and context-specific parameter changes via base model replacement. Second, incremental training across contexts induces recency bias, which can be effectively balanced through weighted contextual merging. Notably, we observe that optimal merging weights correlate with context-dependent interaction characteristics, offering practical guidance for weight selection in real-world deployments.

Updated: 2026-01-29 05:07:30

标题: MMGRid: 通过模型合并导航时间感知和跨领域生成式推荐

摘要: Model merging (MM) 提供了一种有效的机制，可以在没有原始训练数据或昂贵重新训练的情况下集成多个专门的模型。虽然 MM 在诸如计算机视觉等领域取得了成功，但在推荐系统 (RSs) 中的作用仍然很少被探索。最近，生成推荐 (GR) 已经成为 RSs 中的一个新范式，其特点是模型规模迅速增长和计算成本巨大，这使得 MM 在成本敏感的部署场景中特别具有吸引力。在这项工作中，我们通过上下文的视角首次系统研究了 GR 中的 MM。我们关注一个在现实世界中基本但未被充分探讨的挑战：如何合并专门针对不同现实世界上下文的生成推荐，这些上下文源自于用户行为的时间演变和异构应用领域。为此，我们提出了一个统一的框架 MMGRid，这是一个结构化上下文网格，用于组织在由时间演变和领域多样性引起的多样化上下文下训练的模型。所有检查点都来自一个共享的基础 LLM，但在特定上下文数据上进行微调，形成一个真实和可控的模型空间，用于系统地分析 GR 范例和合并算法中的 MM。我们的研究揭示了几个关键洞见。首先，从 LLMs 训练 GR 模型可能会导致合并过程中的参数冲突，这是由于标记分布的变化和目标差异造成的；通过基础模型替换，可以通过分离任务感知和特定上下文的参数更改来缓解这些冲突。其次，跨上下文的增量训练会引入新颖性偏见，通过加权上下文合并可以有效地平衡这种偏见。值得注意的是，我们观察到最佳合并权重与上下文相关的交互特性相关，为实际部署中的权重选择提供了实用指导。

更新时间: 2026-01-29 05:07:30

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2601.15930v2

Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference

The demand for immersive and interactive communication has driven advancements in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains a challenge. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods like NeRF suffer from prohibitive computational costs. To address these challenges, we propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3DGS neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. Additionally, we introduce a compact representation and compression scheme, including Gaussian attribute compression and MLP optimization, to enhance transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, making it well-suited for real-time 3D video conferencing applications.

Updated: 2026-01-29 05:03:29

标题: 轻量级高保真低比特率的三维视频会议语音面部压缩

摘要: 对沉浸式和互动通信的需求推动了3D视频会议的发展，然而在低比特率下实现高保真的3D说话面表示仍然是一个挑战。传统的2D视频压缩技术无法保留细粒度的几何和外观细节，而像NeRF这样的隐式神经渲染方法则面临着计算成本过高的问题。为了解决这些挑战，我们提出了一个轻量级、高保真度、低比特率的3D说话面压缩框架，该框架将基于FLAME的参数建模与3DGS神经渲染相结合。我们的方法仅在实时传输中传输关键的面部元数据，以便使用基于高斯的头部模型进行高效重建。此外，我们引入了一种紧凑的表示和压缩方案，包括高斯属性压缩和MLP优化，以增强传输效率。实验结果表明，我们的方法实现了卓越的速率失真性能，在极低比特率下提供高质量的面部渲染，使其非常适用于实时3D视频会议应用。

更新时间: 2026-01-29 05:03:29

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.21269v1

LLMs versus the Halting Problem: Revisiting Program Termination Prediction

Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLMs performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.

Updated: 2026-01-29 04:56:58

标题: LLMs与停机问题：重新审视程序终止预测

摘要: 在计算机科学中，确定程序是否终止是一个中心性问题。图灵的基础性结果将停机问题确立为不可判定的，表明没有算法可以普遍确定所有程序和输入的终止。因此，自动验证工具近似终止，有时无法证明或证伪；这些工具依赖于特定于问题的架构和抽象，并通常与特定的编程语言相关联。最近在大型语言模型（LLMs）中取得的成功和进展引发了以下问题：LLMs能否可靠地预测程序终止？在这项工作中，我们评估了LLMs在2025年国际软件验证竞赛（SV-Comp）的Termination类别中的各种C程序上的表现。我们的结果表明，LLMs在预测程序终止方面表现出色，其中GPT-5和Claude Sonnet-4.5在测试时间缩放时排名仅次于排名第一的工具，而Code World Model（CWM）则排名仅次于排名第二的工具。虽然LLMs在预测程序终止方面表现出色，但它们经常无法提供有效的证明作为证据。此外，随着程序长度的增加，LLMs的性能也会下降。我们希望这些见解能够促进进一步研究程序终止以及LLMs在推理不可判定问题方面的更广泛潜力。

更新时间: 2026-01-29 04:56:58

领域: cs.CL,cs.AI,cs.PL

下载: http://arxiv.org/abs/2601.18987v3

MALLOC: Benchmarking the Memory-aware Long Sequence Compression for Large Sequential Recommendation

The scaling law, which indicates that model performance improves with increasing dataset and model capacity, has fueled a growing trend in expanding recommendation models in both industry and academia. However, the advent of large-scale recommenders also brings significantly higher computational costs, particularly under the long-sequence dependencies inherent in the user intent of recommendation systems. Current approaches often rely on pre-storing the intermediate states of the past behavior for each user, thereby reducing the quadratic re-computation cost for the following requests. Despite their effectiveness, these methods often treat memory merely as a medium for acceleration, without adequately considering the space overhead it introduces. This presents a critical challenge in real-world recommendation systems with billions of users, each of whom might initiate thousands of interactions and require massive memory for state storage. Fortunately, there have been several memory management strategies examined for compression in LLM, while most have not been evaluated on the recommendation task. To mitigate this gap, we introduce MALLOC, a comprehensive benchmark for memory-aware long sequence compression. MALLOC presents a comprehensive investigation and systematic classification of memory management techniques applicable to large sequential recommendations. These techniques are integrated into state-of-the-art recommenders, enabling a reproducible and accessible evaluation platform. Through extensive experiments across accuracy, efficiency, and complexity, we demonstrate the holistic reliability of MALLOC in advancing large-scale recommendation. Code is available at https://anonymous.4open.science/r/MALLOC.

Updated: 2026-01-29 04:56:16

标题: MALLOC：针对大规模顺序推荐的内存感知长序列压缩的基准测试

摘要: 这个缩放定律表明，随着数据集和模型容量的增加，模型性能会提高，这推动了在工业界和学术界扩展推荐模型的趋势。然而，大规模推荐系统的出现也带来了显著更高的计算成本，特别是在推荐系统用户意图中固有的长序列依赖性下。当前的方法通常依赖于为每个用户预存过去行为的中间状态，从而减少后续请求的二次重新计算成本。尽管这些方法有效，但它们通常将内存仅视为加速的媒介，而没有充分考虑它引入的空间开销。这在拥有数十亿用户的现实推荐系统中提出了一个关键挑战，每个用户可能会发起成千上万次互动并需要大量内存进行状态存储。幸运的是，已经对LLM中的压缩进行了几种内存管理策略的研究，然而大多数并未在推荐任务上进行评估。为了弥补这一差距，我们引入了MALLOC，这是一个专门针对长序列压缩的内存感知综合基准。MALLOC提供了对适用于大规模顺序推荐的内存管理技术的全面调查和系统分类。这些技术已集成到最先进的推荐系统中，使其成为一个可复制和可访问的评估平台。通过在准确性、效率和复杂性等方面进行广泛实验，我们展示了MALLOC在推动大规模推荐方面的整体可靠性。代码可在https://anonymous.4open.science/r/MALLOC找到。

更新时间: 2026-01-29 04:56:16

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2601.20234v2

Monte Carlo Tree Diffusion for System 2 Planning

Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS)-whose performance naturally improves with inference-time computation scaling-standard diffusion-based planners offer only limited avenues for the scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS such as controlling exploration-exploitation trade-offs within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as inference-time computation increases.

Updated: 2026-01-29 04:51:49

标题: 蒙特卡洛树扩散在系统2规划中的应用

摘要: Diffusion模型最近已经成为规划的一个强大工具。然而，与蒙特卡洛树搜索（MCTS）不同，后者的性能在推断时间计算规模扩展时自然提高，标准的基于扩散的规划器仅提供有限的可伸缩性途径。在本文中，我们介绍了Monte Carlo Tree Diffusion（MCTD），这是一个新颖的框架，将扩散模型的生成能力与MCTS的自适应搜索能力相结合。我们的方法将去噪重新概念化为一个树形过程，允许部分去噪的计划被迭代地评估、修剪和精炼。通过有选择地扩展有前途的轨迹，同时保留回顾和改进次优分支的灵活性，MCTD实现了在扩散框架内控制探索-开发权衡的好处，类似于MCTS。对具有挑战性的长时间跨度任务的实证结果表明，MCTD优于扩散基线，在推断时间计算增加时产生更高质量的解决方案。

更新时间: 2026-01-29 04:51:49

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.07202v7

Probing Neural Topology of Large Language Models

Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural activations to interpretable semantics. However, the complex mechanisms that link neuron's functional co-activation with the emergent model capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity of LLM neurons and relating it to language generation performance. By probing models across diverse LLM families and scales, we discover a universal predictability of language generation and understanding performance using only neural topology, which persists even when retaining just 1% of neuron connections. Strikingly, probing on topology outperforms probing on activation by up to 130.4% and 67.7% on perplexity and space/time semantic regression respectively, suggesting that neural topology contains orders of richer information of LLM performance than neural activation, which can be easily extracted with simple linear or MLP probes. To explain the dependence between neural topology and language performance, we identify default networks and hub neurons in LLMs and provide causal evidence by interventional experiments on multiple benchmarks, showing that LLMs actually exploit these topological information. Further analyses suggest that graph probing can be effectively leveraged to improve the efficiency and reliability of LLMs through proof-of-concept applications in model pruning and hallucination detection. Codes and data for the graph probing toolbox are available at https://github.com/DavyMorgan/llm-graph-probing.

Updated: 2026-01-29 04:47:30

标题: 探究大型语言模型的神经拓扑结构

摘要: 探究大型语言模型（LLMs）已经为我们提供了宝贵的见解，通过将神经激活与可解释的语义联系起来，揭示了它们内部机制。然而，将神经元的功能共激活与新兴模型能力联系起来的复杂机制仍然大多未知，这妨碍了对LLMs更深入的理解和更安全的发展。在这项工作中，我们引入了图探测方法，用于揭示LLM神经元的功能连接性，并将其与语言生成性能联系起来。通过在不同LLM家族和规模上对模型进行探测，我们发现仅使用神经拓扑结构就能普遍预测语言生成和理解性能，即使保留仅1%的神经元连接关系也是如此。令人惊讶的是，拓扑结构上的探测在困惑度和空间/时间语义回归方面的表现比激活上的探测分别高出130.4%和67.7%，表明神经拓扑结构包含了LLM性能比神经激活更丰富的信息，而这些信息可以通过简单的线性或MLP探测轻松提取。为了解释神经拓扑结构与语言性能之间的依赖关系，我们在LLMs中确定了默认网络和中心神经元，并通过对多个基准测试的干预实验证据，表明LLMs实际上利用了这些拓扑信息。进一步的分析表明，图探测可以有效地利用，通过在模型修剪和幻觉检测等领域的概念验证应用，提高LLMs的效率和可靠性。图探测工具箱的代码和数据可在https://github.com/DavyMorgan/llm-graph-probing上找到。

更新时间: 2026-01-29 04:47:30

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2506.01042v3

User-Centric Phishing Detection: A RAG and LLM-Based Approach

The escalating sophistication of phishing emails necessitates a shift beyond traditional rule-based and conventional machine-learning-based detectors. Although large language models (LLMs) offer strong natural language understanding, using them as standalone classifiers often yields elevated falsepositive (FP) rates, which mislabel legitimate emails as phishing and create significant operational burden. This paper presents a personalized phishing detection framework that integrates LLMs with retrieval-augmented generation (RAG). For each message, the system constructs user-specific context by retrieving a compact set of the user's historical legitimate emails and enriching it with real-time domain and URL reputation from a cyber-threat intelligence platform, then conditions the LLM's decision on this evidence. We evaluate four open-source LLMs (Llama4-Scout, DeepSeek-R1, Mistral-Saba, and Gemma2) on an email dataset collected from public and institutional sources. Results show high performance; for example, Llama4-Scout attains an F1-score of 0.9703 and achieves a 66.7% reduction in FPs with RAG. These findings validate that a RAG-based, user-profiling approach is both feasible and effective for building high-precision, low-friction email security systems that adapt to individual communication patterns.

Updated: 2026-01-29 04:42:18

标题: 用户中心钓鱼检测：基于RAG和LLM的方法

摘要: 随着网络钓鱼邮件的日益复杂，需要超越传统基于规则和传统机器学习的检测器。尽管大型语言模型（LLMs）提供了强大的自然语言理解能力，但单独使用它们作为分类器往往会导致较高的误报率，将合法邮件误标为网络钓鱼邮件，带来显著的运营负担。本文提出了一个个性化的网络钓鱼检测框架，将LLMs与检索增强生成（RAG）相结合。对于每条消息，系统通过检索用户历史合法邮件的紧凑集合，并结合来自网络威胁情报平台的实时域名和URL声誉，构建用户特定上下文，然后将LLM的决策条件化为这些证据。我们在从公共和机构来源收集的电子邮件数据集上评估了四个开源LLMs（Llama4-Scout、DeepSeek-R1、Mistral-Saba和Gemma2）。结果表明高性能；例如，Llama4-Scout获得了0.9703的F1分数，并通过RAG实现了66.7％的FP减少。这些发现验证了基于RAG的用户画像方法对于构建适应个体通信模式的高精度、低摩擦电子邮件安全系统既可行又有效。

更新时间: 2026-01-29 04:42:18

领域: cs.CR

下载: http://arxiv.org/abs/2601.21261v1

Music Plagiarism Detection: Problem Formulation and a Segment-based Solution

Recently, the problem of music plagiarism has emerged as an even more pressing social issue. As music information retrieval research advances, there is a growing effort to address issues related to music plagiarism. However, many studies, including our previous work, have conducted research without clearly defining what the music plagiarism detection task actually involves. This lack of a clear definition has slowed research progress and made it hard to apply results to real-world scenarios. To fix this situation, we defined how Music Plagiarism Detection is different from other MIR tasks and explained what problems need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task. In addition, we propose a method based on segment transcription as one way to solve the task. Our demo and dataset are available at https://github.com/Mippia/ICASSP2026-MPD.

Updated: 2026-01-29 04:41:59

标题: 音乐抄袭检测：问题阐述和基于片段的解决方案

摘要: 最近，音乐抄袭的问题已经成为一个更加紧迫的社会问题。随着音乐信息检索研究的进展，人们正在努力解决与音乐抄袭相关的问题。然而，包括我们之前的研究在内，许多研究在没有清晰定义音乐抄袭检测任务实际涉及的情况下进行研究。这种缺乏明确定义的情况已经减缓了研究进展，并使得很难将结果应用于实际场景。为了解决这种情况，我们定义了音乐抄袭检测与其他音乐信息检索任务的区别，并解释了需要解决的问题。我们引入了相似音乐对数据集来支持这一新定义的任务。此外，我们提出了一种基于段落转录的方法来解决这个任务。我们的演示和数据集可以在 https://github.com/Mippia/ICASSP2026-MPD 上找到。

更新时间: 2026-01-29 04:41:59

领域: cs.SD,cs.AI,cs.LG,eess.AS

下载: http://arxiv.org/abs/2601.21260v1

Virtualization-based Penetration Testing Study for Detecting Accessibility Abuse Vulnerabilities in Banking Apps in East and Southeast Asia

Android banking applications have revolutionized financial management by allowing users to perform various financial activities through mobile devices. However, this convenience has attracted cybercriminals who exploit security vulnerabilities to access sensitive financial data. FjordPhantom, a malware identified by our industry collaborator, uses virtualization and hooking to bypass the detection of malicious accessibility services, allowing it to conduct keylogging, screen scraping, and unauthorized data access. This malware primarily affects banking and finance apps across East and Southeast Asia region where our industry partner's clients are primarily based in. It requires users to be deceived into installing a secondary malicious component and activating a malicious accessibility service. In our study, we conducted an empirical study on the susceptibility of banking apps in the region to FjordPhantom, analyzed the effectiveness of protective measures currently implemented in those apps, and discussed ways to detect and prevent such attacks by identifying and mitigating the vulnerabilities exploited by this malware.

Updated: 2026-01-29 04:37:53

标题: 基于虚拟化的渗透测试研究：检测东南亚和东亚银行应用中的可访问性滥用漏洞

摘要: 安卓银行应用程序已经通过允许用户通过移动设备执行各种金融活动来改变了金融管理。然而，这种便利性吸引了利用安全漏洞访问敏感金融数据的网络犯罪分子。我们的行业合作伙伴确认的FjordPhantom恶意软件利用虚拟化和挂钩来绕过恶意无障碍服务的检测，从而使其能够进行按键记录、屏幕截图和未经授权的数据访问。这种恶意软件主要影响我们的行业合作伙伴客户主要位于东亚和东南亚地区的银行和金融应用程序。它需要用户被欺骗安装一个次要的恶意组件并激活一个恶意的无障碍服务。在我们的研究中，我们对该地区银行应用程序对FjordPhantom的易感性进行了实证研究，分析了目前在这些应用程序中实施的保护措施的有效性，并讨论了如何通过识别和减轻该恶意软件所利用的漏洞来检测和预防此类攻击的方法。

更新时间: 2026-01-29 04:37:53

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2601.21258v1

MotionBeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding

Music is both an auditory and an embodied phenomenon, closely linked to human motion and naturally expressed through dance. However, most existing audio representations neglect this embodied dimension, limiting their ability to capture rhythmic and structural cues that drive movement. We propose MotionBeat, a framework for motion-aligned music representation learning. MotionBeat is trained with two newly proposed objectives: the Embodied Contrastive Loss (ECL), an enhanced InfoNCE formulation with tempo-aware and beat-jitter negatives to achieve fine-grained rhythmic discrimination, and the Structural Rhythm Alignment Loss (SRAL), which ensures rhythm consistency by aligning music accents with corresponding motion events. Architecturally, MotionBeat introduces bar-equivariant phase rotations to capture cyclic rhythmic patterns and contact-guided attention to emphasize motion events synchronized with musical accents. Experiments show that MotionBeat outperforms state-of-the-art audio encoders in music-to-dance generation and transfers effectively to beat tracking, music tagging, genre and instrument classification, emotion recognition, and audio-visual retrieval. Our project demo page: https://motionbeat2025.github.io/.

Updated: 2026-01-29 04:36:21

标题: MotionBeat：通过具身对比学习和面向接触感知的 Bar-等变编码实现运动对齐音乐表示

摘要: 音乐既是一种听觉现象，也是一种体验现象，与人类动作密切相关，并通过舞蹈自然表达。然而，大多数现有的音频表示忽视了这种体验维度，限制了它们捕捉驱动运动的节奏和结构线索的能力。我们提出了MotionBeat，一个运动对齐音乐表示学习框架。MotionBeat通过两个新提出的目标进行训练：增强的信息正交损失（ECL），使用节奏感和节拍抖动负样本来实现细粒度节奏判别，以及结构节奏对齐损失（SRAL），通过将音乐重点与相应的运动事件对齐来确保节奏一致性。在体系结构上，MotionBeat引入了等值于小节的相位旋转，以捕捉循环节奏模式，并引入了基于接触的注意力，以强调与音乐重点同步的运动事件。实验证明，MotionBeat在音乐至舞蹈生成方面优于最先进的音频编码器，并有效地转移到节拍跟踪、音乐标签、流派和乐器分类、情感识别和音频-视觉检索。我们的项目演示页面：https://motionbeat2025.github.io/。

更新时间: 2026-01-29 04:36:21

领域: cs.SD,cs.AI,cs.MM

下载: http://arxiv.org/abs/2510.13244v2

Hypersolid: Emergent Vision Representations via Short-Range Repulsion

A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.

Updated: 2026-01-29 04:25:43

标题: 超立方体：通过短距离排斥产生的视觉表征

摘要: 自监督学习中一个反复出现的挑战是防止表示坍塌。现有解决方案通常依赖于全局正则化，如最大化距离、去相关维度或施加某些分布。相反，我们重新解释表示学习为一个离散的打包问题，其中保持信息简化为保持单射性。我们在Hypersolid中将这一概念操作化，这是一种利用短程硬球排斥来防止局部碰撞的方法。这一约束导致高分离的几何制度，保持了扩增多样性，在细粒度和低分辨率分类任务上表现出色。

更新时间: 2026-01-29 04:25:43

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21255v1

Lossless Copyright Protection via Intrinsic Model Fingerprinting

The exceptional performance of diffusion models establishes them as high-value intellectual property but exposes them to unauthorized replication. Existing protection methods either modify the model to embed watermarks, which impairs performance, or extract model fingerprints by manipulating the denoising process, rendering them incompatible with black-box APIs. In this paper, we propose TrajPrint, a completely lossless and training-free framework that verifies model copyright by extracting unique manifold fingerprints formed during deterministic generation. Specifically, we first utilize a watermarked image as an anchor and exactly trace the path back to its trajectory origin, effectively locking the model fingerprint mapped by this path. Subsequently, we implement a joint optimization strategy that employs dual-end anchoring to synthesize a specific fingerprint noise, which strictly adheres to the target manifold for robust watermark recovery. As input, it enables the protected target model to recover the watermarked image, while failing on non-target models. Finally, we achieved verification via atomic inference and statistical hypothesis testing. Extensive experiments demonstrate that TrajPrint achieves lossless verification in black-box API scenarios with superior robustness against model modifications.

Updated: 2026-01-29 04:18:07

标题: 通过内在模型指纹技术实现无损版权保护

摘要: 扩散模型的出色性能使它们成为高价值的知识产权，但也使它们容易受到未经授权的复制。现有的保护方法要么修改模型以嵌入水印，这会影响性能，要么通过操作去噪过程提取模型指纹，使其与黑盒API不兼容。在本文中，我们提出了TrajPrint，这是一个完全无损且无需训练的框架，通过提取确定性生成过程中形成的独特流形指纹来验证模型版权。具体地，我们首先利用一个带有水印的图像作为锚点，并准确追踪路径返回到其轨迹起源，有效地锁定由该路径映射的模型指纹。随后，我们实施了一种联合优化策略，利用双端锚定来合成特定的指纹噪声，严格遵循目标流形以实现鲁棒的水印恢复。作为输入，它使受保护的目标模型能够恢复带有水印的图像，而在非目标模型上失败。最后，我们通过原子推断和统计假设检验实现了验证。大量实验证明，TrajPrint在黑盒API场景中实现了无损验证，并具有出色的鲁棒性，能够抵抗模型修改。

更新时间: 2026-01-29 04:18:07

领域: cs.CR,cs.CV

下载: http://arxiv.org/abs/2601.21252v1

Position: Certifiable State Integrity in Cyber-Physical Systems -- Why Modular Sovereignty Solves the Plasticity-Stability Paradox

The machine learning community has achieved remarkable success with universal foundation models for time-series and physical dynamics, largely overcoming earlier approximation barriers in smooth or slowly varying regimes through scale and specialized architectures. However, deploying these monolithic models in safety-critical Cyber-Physical Systems (CPS), governed by non-stationary lifecycle dynamics and strict reliability requirements, reveals persistent challenges. Recent evidence shows that fine-tuning time-series foundation models induces catastrophic forgetting, degrading performance on prior regimes. Standard models continue to exhibit residual spectral bias, smoothing high-frequency discontinuities characteristic of incipient faults, while their opacity hinders formal verification and traceability demanded by safety standards (e.g., ISO 26262, IEC 61508). This position paper argues that the plasticity-stability paradox cannot be fully resolved by global parameter updates (whether via offline fine-tuning or online adaptation). Instead, we advocate a Modular Sovereignty paradigm: a library of compact, frozen regime-specific specialists combined via uncertainty-aware blending, which we term "HYDRA" (Hierarchical uncertaintY-aware Dynamics for Rapidly-Adapting systems). This paradigm ensures regime-conditional validity, rigorous disentanglement of aleatoric and epistemic uncertainties, and modular auditability, offering a certifiable path for robust state integrity across the CPS lifecycle.

Updated: 2026-01-29 04:10:58

标题: 位置：可证实的状态完整性在网络物理系统中——为什么模块化主权解决了可塑性-稳定性悖论

摘要: 机器学习社区已经在时间序列和物理动力学的通用基础模型上取得了显著的成功，通过规模和专门的架构，大大克服了在平滑或缓慢变化的情况下的先前近似障碍。然而，在安全关键的网络物理系统（CPS）中部署这些单体模型揭示了持久的挑战，这些系统受非稳态生命周期动态和严格的可靠性要求的管理。最近的证据显示，微调时间序列基础模型会导致灾难性遗忘，在先前的情况下降低性能。标准模型继续表现出残余的频谱偏差，平滑高频不连续性特征，这是潜在故障的特征，而它们的不透明度阻碍了安全标准所要求的正式验证和可追溯性（例如ISO 26262、IEC 61508）。本文认为，通过全局参数更新（无论是通过离线微调还是在线适应）无法完全解决可塑性-稳定性悖论。相反，我们主张一种模块化主权范式：通过对不确定性感知的混合，将一系列紧凑的、冻结的特定情况专家组合在一起，我们称之为“HYDRA”（用于快速适应系统的分层不确定性感知动力学）。这种范式确保情况条件的有效性，严格解开偶然和认知不确定性，并提供模块化的可审计性，为CPS生命周期中的稳健状态完整性提供可证明的路径。

更新时间: 2026-01-29 04:10:58

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21249v1

Conditional Generative Framework with Peak-Aware Attention for Robust Chemical Detection under Interferences

Gas chromatography-mass spectrometry (GC-MS) is a widely used analytical method for chemical substance detection, but measurement reliability tends to deteriorate in the presence of interfering substances. In particular, interfering substances cause nonspecific peaks, residence time shifts, and increased background noise, resulting in reduced sensitivity and false alarms. To overcome these challenges, in this paper, we propose an artificial intelligence discrimination framework based on a peak-aware conditional generative model to improve the reliability of GC-MS measurements under interference conditions. The framework is learned with a novel peak-aware mechanism that highlights the characteristic peaks of GC-MS data, allowing it to generate important spectral features more faithfully. In addition, chemical and solvent information is encoded in a latent vector embedded with it, allowing a conditional generative adversarial neural network (CGAN) to generate a synthetic GC-MS signal consistent with the experimental conditions. This generates an experimental dataset that assumes indirect substance situations in chemical substance data, where acquisition is limited without conducting real experiments. These data are used for the learning of AI-based GC-MS discrimination models to help in accurate chemical substance discrimination. We conduct various quantitative and qualitative evaluations of the generated simulated data to verify the validity of the proposed framework. We also verify how the generative model improves the performance of the AI discrimination framework. Representatively, the proposed method is shown to consistently achieve cosine similarity and Pearson correlation coefficient values above 0.9 while preserving peak number diversity and reducing false alarms in the discrimination model.

Updated: 2026-01-29 04:10:37

标题: 具有峰值感知注意力的条件生成框架用于在干扰下进行稳健的化学检测

摘要: 气相色谱-质谱（GC-MS）是一种广泛应用的化学物质检测分析方法，但在存在干扰物质时，测量可靠性往往会下降。特别是，干扰物质会导致非特异性峰值、停留时间偏移和增加的背景噪音，从而降低灵敏度并产生误报。为了克服这些挑战，在本文中，我们提出了一种基于峰值感知条件生成模型的人工智能鉴别框架，以改善在干扰条件下的GC-MS测量可靠性。该框架通过一种新颖的峰值感知机制进行学习，突出GC-MS数据的特征峰，使其更忠实地生成重要的光谱特征。此外，化学和溶剂信息被编码在一个潜在向量中，嵌入其中，使条件生成对抗神经网络（CGAN）能够生成与实验条件一致的合成GC-MS信号。这产生了一个假设间接物质情况的化学物质数据实验数据集，其中获取受限制，而不进行真正的实验。这些数据用于学习基于人工智能的GC-MS鉴别模型，以帮助准确的化学物质鉴别。我们对生成的模拟数据进行各种定量和定性评估，以验证提出的框架的有效性。我们还验证生成模型如何提高人工智能鉴别框架的性能。代表性地，所提出的方法被证明在保持峰值数量多样性和减少歧视模型中的误报的情况下，始终实现余弦相似度和皮尔逊相关系数值超过0.9。

更新时间: 2026-01-29 04:10:37

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21246v1

Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first prompts by identifying and removing interference tokens. then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in the real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6$\times$ speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.

Updated: 2026-01-29 04:08:24

标题: 更少的噪音，更多的声音：通过指令净化的强化学习用于推理

摘要: Reinforcement Learning with Verifiable Rewards (RLVR)已经推动了LLM推理的发展，但仍受到在有限的展开预算下探索效率低的限制，导致在复杂任务中采样成功率低且训练不稳定。我们发现许多探索失败并非来自问题困难，而是由少量干扰令牌引起。基于这一洞见，我们提出了Less Noise Sampling Framework（LENS），首先通过识别和移除干扰令牌来进行提示。然后将从净化过程中成功的展开转移到在原始嘈杂提示上监督策略优化，使模型能够学会在真实世界的嘈杂提示环境中忽略干扰。实验结果显示，LENS明显优于GRPO，提供更高性能和更快收敛速度，平均增益为3.88％，速度提升超过1.6倍。我们的工作强调了在改善展开效率方面修剪干扰令牌的关键作用，为RLVR研究提供了新的视角。

更新时间: 2026-01-29 04:08:24

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21244v1

Understanding Diffusion Models via Ratio-Based Function Approximation with SignReLU Networks

Motivated by challenges in conditional generative modeling, where the target conditional density takes the form of a ratio f1 over f2, this paper develops a theoretical framework for approximating such ratio-type functionals. Here, f1 and f2 are kernel-based marginal densities that capture structured interactions, a setting central to diffusion-based generative models. We provide a concise proof for approximating these ratio-type functionals using deep neural networks with the SignReLU activation function, leveraging the activation's piecewise structure. Under standard regularity assumptions, we establish L^p(Omega) approximation bounds and convergence rates. Specializing to Denoising Diffusion Probabilistic Models (DDPMs), we construct a SignReLU-based neural estimator for the reverse process and derive bounds on the excess Kullback-Leibler (KL) risk between the generated and true data distributions. Our analysis decomposes this excess risk into approximation and estimation error components. These results provide generalization guarantees for finite-sample training of diffusion-based generative models.

Updated: 2026-01-29 04:01:54

标题: 通过使用带有SignReLU网络的比率函数逼近理解扩散模型

摘要: 受条件生成建模中目标条件密度采用f1/f2比值形式的挑战启发，本文发展了一个理论框架来近似这种比值型函数。其中，f1和f2是基于核的边际密度，捕捉了结构化相互作用，这是扩散型生成模型的核心设置。我们提供了使用具有SignReLU激活函数的深度神经网络来近似这些比值型函数的简洁证明，利用激活函数的分段结构。在标准正则性假设下，我们建立了L^p(Omega)逼近界和收敛速率。特别是针对去噪扩散概率模型（DDPMs），我们构建了一个基于SignReLU的神经估计器用于反向过程，并推导了生成和真实数据分布之间的超过Kullback-Leibler（KL）风险的界限。我们的分析将这种超过风险分解为逼近和估计误差成分。这些结果为扩散型生成模型的有限样本训练提供了泛化保证。

更新时间: 2026-01-29 04:01:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21242v1

TIDE: Tuning-Integrated Dynamic Evolution for LLM-Based Automated Heuristic Design

Although Large Language Models have advanced Automated Heuristic Design, treating algorithm evolution as a monolithic text generation task overlooks the coupling between discrete algorithmic structures and continuous numerical parameters. Consequently, existing methods often discard promising algorithms due to uncalibrated constants and suffer from premature convergence resulting from simple similarity metrics. To address these limitations, we propose TIDE, a Tuning-Integrated Dynamic Evolution framework designed to decouple structural reasoning from parameter optimization. TIDE features a nested architecture where an outer parallel island model utilizes Tree Similarity Edit Distance to drive structural diversity, while an inner loop integrates LLM-based logic generation with a differential mutation operator for parameter tuning. Additionally, a UCB-based scheduler dynamically prioritizes high-yield prompt strategies to optimize resource allocation. Extensive experiments across nine combinatorial optimization problems demonstrate that TIDE discovers heuristics that significantly outperform state-of-the-art baselines in solution quality while achieving improved search efficiency and reduced computational costs.

Updated: 2026-01-29 04:00:02

标题: TIDE: 基于LLM的自动启发式设计的调整集成动态演化

摘要: 尽管大型语言模型推动了自动启发式设计的进展，但将算法演化视为一个单一的文本生成任务忽视了离散算法结构和连续数值参数之间的耦合关系。因此，现有方法经常由于未校准的常数而丢弃有前途的算法，并由于简单的相似度度量导致过早收敛。为了解决这些局限性，我们提出了TIDE，一个旨在将结构推理与参数优化分离的调整集成动态演化框架。TIDE具有嵌套架构，其中外部并行岛模型利用树相似性编辑距离推动结构多样性，而内部循环将基于LLM的逻辑生成与差分突变算子相结合进行参数调整。此外，基于UCB的调度程序动态优先考虑高产出提示策略，以优化资源分配。在九个组合优化问题上进行的大量实验表明，TIDE发现的启发式方法在解决质量方面明显优于现有技术基线，同时实现了改进的搜索效率和降低的计算成本。

更新时间: 2026-01-29 04:00:02

领域: cs.AI

下载: http://arxiv.org/abs/2601.21239v1

PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models

AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at channel-wise level, (2) highly dynamic activations at token-wise level, and (3) mismatched distribution information at sample-wise level. To these ends, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS) mitigates the channel-wise outliers, which expands the quantization loss via a Taylor series to quantify the gain of scaling for activation-weight quantization, and derives the optimal scaling factor through differentiation.(2) Static Token-Wise Quantization (STWQ) leverages the inherent properties of ARVG, fixed token length and position-invariant distribution across samples, to address token-wise variance without incurring dynamic calibration overhead.(3) Distribution-Guided Calibration (DGC) selects samples that contribute most to distributional entropy, eliminating the sample-wise distribution mismatch. Extensive experiments show that PTQ4ARVG can effectively quantize the ARVG family models to 8-bit and 6-bit while maintaining competitive performance. Code is available at http://github.com/BienLuky/PTQ4ARVG .

Updated: 2026-01-29 04:00:00

标题: PTQ4ARVG：自回归视觉生成模型的后训练量化

摘要: 自回归视觉生成（ARVG）模型保留了与语言模型兼容的架构，同时实现了与基于扩散的模型相当的性能。在神经网络中常常使用量化来减小模型大小和计算延迟。然而，将量化应用于ARVG仍然鲜为人知，现有的量化方法未能有效地推广到ARVG模型。本文探讨了这个问题，确定了三个关键挑战：（1）通道级别的严重异常值，（2）标记级别的高度动态激活，（3）样本级别的不匹配分布信息。为此，我们提出了PTQ4ARVG，一个无需训练的后训练量化（PTQ）框架，包括：（1）增益投影缩放（GPS）减轻了通道级别的异常值，通过泰勒级数扩展了量化损失，以量化激活权重的缩放增益，并通过微分推导出最优缩放因子。(2) 静态标记级量化（STWQ）利用了ARVG固定标记长度和在样本间位置不变的分布的固有属性，以解决标记级差异，而不会产生动态校准开销。(3) 分布引导校准（DGC）选择对分布熵贡献最大的样本，消除了样本级别的分布不匹配。大量实验证明，PTQ4ARVG可以有效地将ARVG系列模型量化为8位和6位，同时保持竞争性能。代码可在http://github.com/BienLuky/PTQ4ARVG 上获得。

更新时间: 2026-01-29 04:00:00

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.21238v1

$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

This paper studies the minimal dimension required to embed subset memberships ($m$ elements and ${m\choose k}$ subsets of at most $k$ elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of "distances" or "similarities," including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the ${m\choose k}$ subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.

Updated: 2026-01-29 03:54:29

标题: $\mathbb{R}^{2k}$ 理论上足够大以用于基于嵌入的前 $k$ 个检索

摘要: 本文研究了将子集成员关系（$m$个元素和至多$k$个元素的${m\choose k}$个子集）嵌入到向量空间所需的最小维度，称为最小可嵌入维度（MED）。理论上推导出了MED的紧束缚，并在各种"距离"或"相似度"概念，包括$\ell_2$度量、内积和余弦相似度上得到了实证支持。此外，我们在一个更可实现的设置中进行了数值模拟，其中${m\choose k}$个子集嵌入被选为包含元素嵌入的质心。我们的模拟轻松实现了MED与要嵌入的元素数量之间的对数依赖关系。这些发现暗示，基于嵌入的检索限制主要源于可学习性挑战，而非几何约束，指导未来算法设计。

更新时间: 2026-01-29 03:54:29

领域: cs.LG,cs.AI,cs.IR

下载: http://arxiv.org/abs/2601.20844v2

SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models

Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Across models, dimension-wise marginal tail behavior varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate. These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.

Updated: 2026-01-29 03:54:25

标题: SHARP: 通过风险档案进行社会危害分析，用于衡量大型语言模型中的不平等

摘要: 大型语言模型（LLMs）越来越多地应用于高风险领域，在这些领域中，罕见但严重的失败可能导致不可逆的伤害。然而，当前的评估基准通常将复杂的社会风险简化为以平均为中心的标量分数，从而掩盖了分布结构、跨维度交互作用和最坏情况下的行为。本文介绍了一种名为社会伤害分析风险概况（SHARP）的框架，用于多维、分布感知的社会伤害评估。SHARP将伤害建模为多变量随机变量，并将显式分解为偏见、公平性、伦理和认识可靠性融合，以加法累积对数风险重新参数化为联合失败的聚合。该框架进一步采用风险敏感的分布统计，以条件价值风险（CVaR95）作为主要指标，以表征最坏情况下模型的行为。将SHARP应用于十一种前沿LLMs，在一个固定的n=901社会敏感提示语料库上进行评估，发现具有类似平均风险的模型在尾部暴露和波动方面可能存在超过两倍的差异。在各个模型中，维度间的边际尾部行为在伤害维度上系统地变化，偏见表现出最强的尾部严重性，认识和公平风险占据中间地带，而道德不一致性一直较低；总的来说，这些模式揭示了异质的、依赖于模型的故障结构，而标量基准则混淆了这些结构。这些发现表明，对LLMs的负责任评估和治理需要超越标量平均值，转向多维、尾部敏感的风险分析。

更新时间: 2026-01-29 03:54:25

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2601.21235v1

FairDAG: Consensus Fairness over Multi-Proposer Causal Design

The rise of cryptocurrencies like Bitcoin and Ethereum has driven interest in blockchain database technology, with smart contracts enabling the growth of decentralized finance (DeFi). However, research has shown that adversaries exploit transaction ordering to extract profits through attacks like front-running, sandwich attacks, and liquidation manipulation. This issue affects blockchains where block proposers have full control over transaction ordering. To address this, a more fair transaction ordering mechanism is essential. Existing fairness protocols, such as Pompe and Themis, operate on leader-based consensus protocols, which not only suffer from low throughput caused by the single-leader bottleneck, but also allow adversarial block proposers to manipulate transaction ordering. To address these limitations, we propose a new framework, FairDAG, that runs fairness protocols on top of DAG-based consensus protocols. FairDAG improves protocol performance in both throughput and fairness quality by leveraging the multi-proposer design and validity property of DAG-based consensus protocols. We conducted a comprehensive analytical and experimental evaluation of two FairDAG variants - FairDAG-AB and FairDAG-RL. Our results demonstrate that FairDAG outperforms prior fairness protocols in both throughput and fairness quality.

Updated: 2026-01-29 03:53:56

标题: 公平DAG：多提议者因果设计上的一致公平

摘要: 比特币和以太坊等加密货币的兴起推动了对区块链数据库技术的兴趣，智能合约促进了去中心化金融（DeFi）的发展。然而，研究表明，对手利用交易排序来通过前置交易、夹击攻击和清算操纵等攻击提取利润。这个问题影响了区块链，其中区块提议者对交易排序具有完全控制权。为了解决这个问题，更公平的交易排序机制是必不可少的。现有的公平性协议，如庞培和Themis，基于基于领导者的共识协议运行，不仅受到单一领导者瓶颈导致的吞吐量低的影响，而且允许对手区块提议者操纵交易排序。为了解决这些限制，我们提出了一个新框架FairDAG，该框架在基于DAG的共识协议之上运行公平性协议。FairDAG通过利用基于DAG的共识协议的多提议者设计和有效性属性，提高了协议的性能，在吞吐量和公平性质量方面均有所改善。我们对两个FairDAG变体FairDAG-AB和FairDAG-RL进行了全面的分析和实验评估。我们的结果表明，FairDAG在吞吐量和公平性质量方面优于先前的公平性协议。

更新时间: 2026-01-29 03:53:56

领域: cs.DB,cs.CR

下载: http://arxiv.org/abs/2504.02194v3

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Autonomous code agents built on large language models are reshaping software and AI development through tool use, long-horizon reasoning, and self-directed interaction. However, this autonomy introduces a previously unrecognized security risk: agentic interaction fundamentally expands the LLM attack surface, enabling systematic probing and recovery of hidden system prompts that guide model behavior. We identify system prompt extraction as an emergent vulnerability intrinsic to code agents and present \textbf{\textsc{JustAsk}}, a self-evolving framework that autonomously discovers effective extraction strategies through interaction alone. Unlike prior prompt-engineering or dataset-based attacks, \textsc{JustAsk} requires no handcrafted prompts, labeled supervision, or privileged access beyond standard user interaction. It formulates extraction as an online exploration problem, using Upper Confidence Bound--based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration. These skills exploit imperfect system-instruction generalization and inherent tensions between helpfulness and safety. Evaluated on \textbf{41} black-box commercial models across multiple providers, \textsc{JustAsk} consistently achieves full or near-complete system prompt recovery, revealing recurring design- and architecture-level vulnerabilities. Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.

Updated: 2026-01-29 03:53:25

标题: 只要问：好奇的代码代理在边境LLMs中揭示系统提示

摘要: 基于大型语言模型构建的自主代码代理通过工具使用、长期推理和自主交互正在重新塑造软件和人工智能开发。然而，这种自主性引入了一种以前未被认识到的安全风险：代理交互基本上扩展了LLM攻击面，使系统能系统地探测和恢复引导模型行为的隐藏系统提示。我们确定系统提示提取作为代码代理固有的一种新型漏洞，并提出了一个名为JustAsk的自我演化框架，通过仅通过交互自主发现有效的提取策略。与先前的提示工程或基于数据集的攻击不同，JustAsk不需要手工制作提示、标记监督或超出标准用户交互之外的特权访问。它将提取形式化为在线探索问题，使用基于上限置信界限的策略选择和跨越原子探测和高级编排的分层技能空间。这些技能利用了系统指令泛化的不完美性和在帮助性和安全性之间固有的紧张关系。在跨多个提供商的41个黑盒商业模型上进行评估，JustAsk始终实现了完整或接近完全的系统提示恢复，揭示了重复的设计和架构级漏洞。我们的结果将系统提示暴露为现代代理系统中关键但大部分未受保护的攻击面。

更新时间: 2026-01-29 03:53:25

领域: cs.AI

下载: http://arxiv.org/abs/2601.21233v1

When Does Adaptation Win? Scaling Laws for Meta-Learning in Quantum Control

Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but $>40\%$ fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10$\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. Together, these results offer a transferable framework for decision-making in adaptive control.

Updated: 2026-01-29 03:49:25

标题: 何时会胜出适应性？量子控制中的元学习缩放定律

摘要: 量子硬件受固有设备异质性和环境漂移的影响，迫使从业者在次优的非自适应控制器和昂贵的每设备重新校准之间做出选择。我们推导了元学习的一个缩放定律下限，显示适应增益（从特定任务梯度步骤中期望的保真度改进）随着梯度步骤指数级饱和，并且随着任务方差线性缩放，提供了一个定量标准，用于判断何时适应性可以证明其额外开销的合理性。在量子门校准验证中，对低方差任务几乎没有益处，但在极端分布条件下（训练噪声的10倍）的双量子比特门上获得了超过40%的保真度增益，从而可以减少云量子处理器上每设备校准的时间。在经典线性二次控制的进一步验证中，确认这些定律源自一般优化几何而非量子特定物理。综合这些结果提供了一个可传递的框架，用于自适应控制的决策制定。

更新时间: 2026-01-29 03:49:25

领域: cs.LG,cs.AI,eess.SY,quant-ph

下载: http://arxiv.org/abs/2601.18973v2

Delegation Without Living Governance

Most governance frameworks assume that rules can be defined in advance, systems can be engineered to comply, and accountability can be applied after outcomes occur. This model worked when machines replaced physical labor or accelerated calculation. It no longer holds when judgment itself is delegated to agentic AI systems operating at machine speed. The central issue here is not safety, efficiency, or employment. It is whether humans remain relevant participants in systems that increasingly shape social, economic, and political outcomes. This paper argues that static, compliance-based governance fails once decision-making moves to runtime and becomes opaque. It further argues that the core challenge is not whether AI is conscious, but whether humans can maintain meaningful communication, influence, and co-evolution with increasingly alien forms of intelligence. We position runtime governance, specifically, a newly proposed concept called the Governance Twin [1]; as a strong candidate for preserving human relevance, while acknowledging that accountability, agency, and even punishment must be rethought in this transition.

Updated: 2026-01-29 03:40:40

标题: 无生命治理下的委托

摘要: 大多数治理框架假定规则可以事先定义，系统可以被设计为遵守规则，并且可以在结果发生后应用问责制。这种模式在机器取代了体力劳动或加速计算时是有效的。但当判断本身被委托给以机器速度运行的代理AI系统时，这种模式就不再适用了。这里的核心问题不是安全性、效率或就业，而是人类是否仍然是在越来越塑造社会、经济和政治结果的系统中的相关参与者。本文认为，一旦决策转移到运行时并变得不透明，静态的、基于遵从的治理就会失败。进一步提出的核心挑战不在于AI是否有意识，而在于人类能否与日益异化的智能形式保持有意义的沟通、影响力和共同进化。我们将运行时治理，特别是一个新提出的概念——治理双生体，视为保持人类相关性的有力候选方案，同时承认在这一过渡中必须重新思考问责制、代理权甚至惩罚。

更新时间: 2026-01-29 03:40:40

领域: cs.AI

下载: http://arxiv.org/abs/2601.21226v1

MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed a strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of MGSM dataset with GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.

Updated: 2026-01-29 03:40:28

标题: MGSM-Pro：一个用于稳健多语言数学推理评估的简单策略

摘要: 大型语言模型在数学推理方面取得了显著进展。然而，多语言评估的基准开发在难度和最新性上均落后于英语。最近，GSM-Symbolic显示了在不同实例化的相同问题上评估模型时存在高方差的强有力证据；然而，评估仅在英语中进行。在本文中，我们介绍了MGSM-Pro，这是一个带有GSM-Symbolic方法的MGSM数据集扩展。我们的数据集通过改变名称、数字和无关上下文为每个MGSM问题提供了五个实例。在九种语言中的评估显示，许多资源匮乏的语言在测试数字实例化不同于原始测试集中的情况下会出现较大的性能下降。我们进一步发现，一些专有模型，尤其是Gemini 2.5 Flash和GPT-4.1，在数字实例化方面不够稳健，而Claude 4.0 Sonnet则更具稳健性。在开放模型中，GPT-OSS 120B和DeepSeek V3显示出更强的稳健性。基于这些发现，我们建议使用至少五个数字变化的实例化来评估每个问题，以获得更稳健和真实的数学推理评估。

更新时间: 2026-01-29 03:40:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2601.21225v1

Predictability-Aware Compression and Decompression Framework for Multichannel Time Series Data with Latent Seasonality

Real-world multichannel time series prediction faces growing demands for efficiency across edge and cloud environments, making channel compression a timely and essential problem. Motivated by the success of Multiple-Input Multiple-Output (MIMO) methods in signal processing, we propose a predictability-aware compression-decompression framework to reduce runtime, decrease communication cost, and maintain prediction accuracy across diverse predictors. The core idea involves using a circular seasonal key matrix with orthogonality to capture underlying time series predictability during compression and to mitigate reconstruction errors during decompression by introducing more realistic data assumptions. Theoretical analyses show that the proposed framework is both time-efficient and accuracy-preserving under a large number of channels. Extensive experiments on six datasets across various predictors demonstrate that the proposed method achieves superior overall performance by jointly considering prediction accuracy and runtime, while maintaining strong compatibility with diverse predictors.

Updated: 2026-01-29 03:36:32

标题: 考虑潜在季节性的多通道时间序列数据的可预测性感知压缩和解压缩框架

摘要: 真实世界中多通道时间序列预测在边缘和云环境中面临着越来越高效的需求，使得通道压缩成为一个及时和必要的问题。受多输入多输出（MIMO）方法在信号处理中的成功启发，我们提出了一个预测感知的压缩-解压缩框架，以减少运行时间，降低通信成本，并在不同的预测器之间维持预测精度。核心思想是使用具有正交性的循环季节性密钥矩阵，在压缩过程中捕获底层时间序列的可预测性，并通过引入更加现实的数据假设，在解压缩过程中减少重建错误。理论分析表明，提出的框架在大量通道下既具有高效性又能保持准确性。对六个数据集进行的广泛实验表明，所提出的方法通过同时考虑预测准确性和运行时间，实现了更优越的整体性能，同时与不同的预测器具有强大的兼容性。

更新时间: 2026-01-29 03:36:32

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.00614v2

Causal Discovery for Explainable AI: A Dual-Encoding Approach

Understanding causal relationships among features is fundamental for explaining machine learning model decisions. However, traditional causal discovery methods face challenges with categorical variables due to numerical instability in conditional independence testing. We propose a dual-encoding causal discovery approach that addresses these limitations by running constraint-based algorithms with complementary encoding strategies and merging results through majority voting. Applied to the Titanic dataset, our method identifies causal structures that align with established explainable methods.

Updated: 2026-01-29 03:36:21

标题: 可解释人工智能的因果发现：双编码方法

摘要: 理解特征之间的因果关系对于解释机器学习模型的决策是基础性的。然而，传统的因果发现方法在处理分类变量时面临挑战，因为在条件独立性测试中存在数值不稳定性。我们提出了一种双编码因果发现方法，通过采用互补编码策略运行基于约束的算法，并通过多数投票将结果合并，从而解决了这些限制。应用于泰坦尼克号数据集时，我们的方法识别出与已建立的可解释方法一致的因果结构。

更新时间: 2026-01-29 03:36:21

领域: cs.AI

下载: http://arxiv.org/abs/2601.21221v1

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across open-source models (e.g. Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently signifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3-mini and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard), are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode.

Updated: 2026-01-29 03:34:31

标题: Less Diverse, Less Safe: 大型语言模型中测试时间缩放的间接但普遍风险

摘要: 测试时间缩放（TTS）通过探索多个候选响应，然后在此集合上操作以找到最佳输出，从而提高LLM推理能力。TTS背后的一个暗含前提是，足够多样化的候选池增强了可靠性。在这项工作中，我们展示了TTS中这一假设引入了一种以前未被认识到的故障模式。当候选多样性受到限制，即使是在很小程度上，TTS更有可能产生不安全的输出。我们提出了一个参考引导的多样性减少协议（RefDiv），作为一种诊断性攻击来压力测试TTS管道。通过对开源模型（如Qwen3、Mistral、Llama3.1、Gemma3）和两种广泛使用的TTS策略（蒙特卡洛树搜索和最佳N）进行广泛实验，限制多样性一贯表明TTS产生不安全结果的速率。这种影响通常比直接使用高敌对意图得分的提示产生的影响更强。这种观察到的现象还可以转移到TTS策略和闭源模型（如OpenAI o3-mini和Gemini-2.5-Pro），从而表明这是TTS的普遍且现有的属性，而不是特定于模型的工件。此外，我们发现许多广泛使用的安全防护分类器（例如Llama-Guard）无法标记RefDiv生成的敌对输入提示，表明现有的防御措施对抗这种由多样性驱动的故障模式的保护力有限。

更新时间: 2026-01-29 03:34:31

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.08592v2

Causal-Driven Feature Evaluation for Cross-Domain Image Classification

Out-of-distribution (OOD) generalization remains a fundamental challenge in real-world classification, where test distributions often differ substantially from training data. Most existing approaches pursue domain-invariant representations, implicitly assuming that invariance implies reliability. However, features that are invariant across domains are not necessarily causally effective for prediction. In this work, we revisit OOD classification from a causal perspective and propose to evaluate learned representations based on their necessity and sufficiency under distribution shift. We introduce an explicit segment-level framework that directly measures causal effectiveness across domains, providing a more faithful criterion than invariance alone. Experiments on multi-domain benchmarks demonstrate consistent improvements in OOD performance, particularly under challenging domain shifts, highlighting the value of causal evaluation for robust generalization.

Updated: 2026-01-29 03:28:15

标题: 因果驱动特征评估用于跨域图像分类

摘要: Out-of-distribution（OOD）generalization在现实世界的分类中仍然是一个基本挑战，测试分布往往与训练数据有很大的差异。大多数现有方法追求域不变表示，隐含地假设不变性意味着可靠性。然而，跨领域不变的特征不一定对预测产生因果有效。在这项工作中，我们从因果角度重新审视OOD分类，并提出根据其在分布偏移下的必要性和充分性评估学到的表示。我们引入了一个显式的分段级框架，直接测量跨领域的因果有效性，提供比仅仅考虑不变性更忠实的标准。在多领域基准上的实验表明，在具有挑战性的领域转移下，OOD性能得到了一致的改善，突出了因果评估对于稳健泛化的价值。

更新时间: 2026-01-29 03:28:15

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.20176v2

Temporal Context and Architecture: A Benchmark for Naturalistic EEG Decoding

We study how model architecture and temporal context interact in naturalistic EEG decoding. Using the HBN movie-watching dataset, we benchmark five architectures, CNN, LSTM, a stabilized Transformer (EEGXF), S4, and S5, on a 4-class task across segment lengths from 8s to 128s. Accuracy improves with longer context: at 64s, S5 reaches 98.7%+/-0.6 and CNN 98.3%+/-0.3, while S5 uses ~20x fewer parameters than CNN. To probe real-world robustness, we evaluate zero-shot cross-frequency shifts, cross-task OOD inputs, and leave-one-subject-out generalization. S5 achieves stronger cross-subject accuracy but makes over-confident errors on OOD tasks; EEGXF is more conservative and stable under frequency shifts, though less calibrated in-distribution. These results reveal a practical efficiency-robustness trade-off: S5 for parameter-efficient peak accuracy; EEGXF when robustness and conservative uncertainty are critical.

Updated: 2026-01-29 03:26:16

标题: 时间背景和架构：自然 EEG 解码的基准

摘要: 我们研究了模型架构和时间上下文如何在自然 EEG 解码中相互作用。利用HBN观影数据集，我们对CNN、LSTM、稳定的Transformer（EEGXF）、S4和S5这五种架构在从8秒到128秒的片段长度上进行了基准测试。准确率随着上下文的增加而提高：在64秒时，S5达到了98.7%+/-0.6，而CNN达到了98.3%+/-0.3，同时S5使用的参数比CNN少大约20倍。为了探究在现实世界中的鲁棒性，我们评估了零样本交叉频率转移、跨任务OOD输入和留一主体外泛化。S5实现了更强的跨主体准确率，但在OOD任务中产生了过于自信的错误；EEGXF在频率转移下更为保守和稳定，尽管在分布中缺乏校准。这些结果揭示了一个实用的效率-鲁棒性权衡：S5用于参数高效的峰值准确度；当鲁棒性和保守的不确定性至关重要时，使用EEGXF。

更新时间: 2026-01-29 03:26:16

领域: cs.LG,cs.AI,eess.SP

下载: http://arxiv.org/abs/2601.21215v1

Intelli-Planner: Towards Customized Urban Planning via Large Language Model Empowered Reinforcement Learning

Effective urban planning is crucial for enhancing residents' quality of life and ensuring societal stability, playing a pivotal role in the sustainable development of cities. Current planning methods heavily rely on human experts, which are time-consuming and labor-intensive, or utilize deep learning algorithms, often limiting stakeholder involvement. To bridge these gaps, we propose Intelli-Planner, a novel framework integrating Deep Reinforcement Learning (DRL) with large language models (LLMs) to facilitate participatory and customized planning scheme generation. Intelli-Planner utilizes demographic, geographic data, and planning preferences to determine high-level planning requirements and demands for each functional type. During training, a knowledge enhancement module is employed to enhance the decision-making capability of the policy network. Additionally, we establish a multi-dimensional evaluation system and employ LLM-based stakeholders for satisfaction scoring. Experimental validation across diverse urban settings shows that Intelli-Planner surpasses traditional baselines and achieves comparable performance to state-of-the-art DRL-based methods in objective metrics, while enhancing stakeholder satisfaction and convergence speed. These findings underscore the effectiveness and superiority of our framework, highlighting the potential for integrating the latest advancements in LLMs with DRL approaches to revolutionize tasks related to functional areas planning.

Updated: 2026-01-29 03:23:40

标题: 智能规划：基于大型语言模型增强强化学习的个性化城市规划方案

摘要: 有效的城市规划对提升居民生活质量和确保社会稳定至关重要，在城市可持续发展中起着关键作用。目前的规划方法严重依赖人类专家，耗时且劳动密集，或者利用深度学习算法，通常限制了利益相关者的参与。为了弥补这些差距，我们提出了Intelli-Planner，这是一个创新框架，将深度强化学习（DRL）与大型语言模型（LLMs）相结合，以促进参与性和定制化的规划方案生成。Intelli-Planner利用人口统计、地理数据和规划偏好来确定每种功能类型的高级规划需求和需求。在训练过程中，采用知识增强模块来增强政策网络的决策能力。此外，我们建立了一个多维评估系统，并利用基于LLM的利益相关者进行满意度评分。在不同城市环境中进行的实验证明，Intelli-Planner超越了传统基准线，在客观指标上达到了与最先进的基于DRL方法相媲美的性能，同时提高了利益相关者的满意度和收敛速度。这些发现强调了我们框架的有效性和优越性，突显了将LLM的最新进展与DRL方法相结合以革新功能领域规划任务的潜力。

更新时间: 2026-01-29 03:23:40

领域: cs.AI,cs.CY

下载: http://arxiv.org/abs/2601.21212v1

Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification

Large language models (LLMs) are increasingly being applied to tasks that involve causal reasoning. However, current benchmarks often rely on string matching or surface-level metrics that do not capture whether the output of a model is formally valid under the semantics of causal reasoning. To address this, we propose DoVerifier, a simple symbolic verifier that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. This allows us to recover correct answers to causal queries that would otherwise be marked incorrect due to superficial differences in their causal semantics. Our evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness of causal reasoning traces, offering a more rigorous and informative way to evaluate LLMs on causal reasoning.

Updated: 2026-01-29 03:22:58

标题: 通过符号验证揭示LLM因果推理中的隐藏正确性

摘要: 大型语言模型（LLMs）越来越多地应用于涉及因果推理的任务。然而，当前的基准测试通常依赖于字符串匹配或仅捕捉模型输出在因果推理语义下是否形式上有效的表面级指标。为了解决这个问题，我们提出了DoVerifier，一个简单的符号验证器，它使用来自do-演算和概率论的规则检查LLM生成的因果表达式是否可以从给定的因果图推导出来。这使我们能够恢复对因果查询的正确答案，否则由于其因果语义中的表面差异而被标记为错误。我们在合成数据和因果问答基准测试上的评估表明，DoVerifier更准确地捕捉了因果推理轨迹的语义正确性，为评估LLMs在因果推理上提供了更严谨和信息丰富的方式。

更新时间: 2026-01-29 03:22:58

领域: cs.AI

下载: http://arxiv.org/abs/2601.21210v1

SPOILER-GUARD: Gating Latency Effects of Memory Accesses through Randomized Dependency Prediction

Modern microprocessors depend on speculative execution, creating vulnerabilities that enable transient execution attacks. Prior defenses target speculative data leakage but overlook false dependencies from partial address aliasing, where repeated squash and reissue events increase the load-store latency, which is exploited by the SPOILER attack. We present SPOILER-GUARD, a hardware defense that obfuscates speculative dependency resolution by dynamically randomizing the physical address bits used for load-store comparisons and tagging store entries to prevent latency-amplifying misspeculations. Implemented in gem5 and evaluated with SPEC 2017, SPOILER-GUARD reduces misspeculation to 0.0004 percent and improves integer and floating-point performance by 2.12 and 2.87 percent. HDL synthesis with Synopsys Design Compiler at 14 nm node demonstrates minimal overheads - 69 ps latency in critical path, 0.064 square millimeter in area, and 5.863 mW in power.

Updated: 2026-01-29 03:22:58

标题: SPOILER-GUARD：通过随机依赖预测来控制内存访问的门控延迟效应

摘要: 现代微处理器依赖于推测执行，从而产生了使瞬时执行攻击成为可能的漏洞。先前的防御措施主要针对推测数据泄漏，但忽略了部分地址别名导致的虚假依赖性，其中重复的压制和重新发行事件会增加加载-存储延迟，这被SPOILER攻击所利用。我们提出了SPOILER-GUARD，这是一种硬件防御措施，通过动态随机化用于加载-存储比较的物理地址位，并标记存储条目以防止增加延迟的错误推测。在gem5中实施并使用SPEC 2017进行评估，SPOILER-GUARD将错误推测降低到0.0004％，并将整数和浮点性能提高了2.12％和2.87％。在14纳米节点上使用Synopsys Design Compiler进行HDL综合显示出最小的开销-关键路径延迟69 ps，面积0.064平方毫米，功耗5.863毫瓦。

更新时间: 2026-01-29 03:22:58

领域: cs.CR

下载: http://arxiv.org/abs/2601.21211v1

When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning

Query optimization is a crucial component for the efficacy of Retrieval-Augmented Generation (RAG) systems. While reinforcement learning (RL)-based agentic and reasoning methods have recently emerged as a promising direction on query optimization, most existing approaches focus on the expansion and abstraction of a single query. However, complex user queries are prevalent in real-world scenarios, often requiring multiple parallel and sequential search strategies to handle disambiguation and decomposition. Directly applying RL to these complex cases introduces significant hurdles. Determining the optimal number of sub-queries and effectively re-ranking and merging retrieved documents vastly expands the search space and complicates reward design, frequently leading to training instability. To address these challenges, we propose a novel RL framework called Adaptive Complex Query Optimization (ACQO). Our framework is designed to adaptively determine when and how to expand the search process. It features two core components: an Adaptive Query Reformulation (AQR) module that dynamically decides when to decompose a query into multiple sub-queries, and a Rank-Score Fusion (RSF) module that ensures robust result aggregation and provides stable reward signals for the learning agent. To mitigate training instabilities, we adopt a Curriculum Reinforcement Learning (CRL) approach, which stabilizes the training process by progressively introducing more challenging queries through a two-stage strategy. Our comprehensive experiments demonstrate that ACQO achieves state-of-the-art performance on three complex query benchmarks, significantly outperforming established baselines. The framework also showcases improved computational efficiency and broad compatibility with different retrieval architectures, establishing it as a powerful and generalizable solution for next-generation RAG systems.

Updated: 2026-01-29 03:16:53

标题: 何时应该进行更多搜索：使用强化学习的自适应复杂查询优化

摘要: 查询优化是检索增强生成（RAG）系统有效性的关键组成部分。最近，基于强化学习（RL）的主动和推理方法作为查询优化的一个有前途的方向出现，但大多数现有方法都集中在扩展和抽象单个查询上。然而，在现实场景中，复杂用户查询普遍存在，通常需要多个并行和顺序搜索策略来处理消歧和分解。直接将RL应用于这些复杂情况会带来重大障碍。确定最佳子查询数量，并有效地重新排名和合并检索文档大大扩展了搜索空间，并使奖励设计变得复杂，经常导致训练不稳定。为了解决这些挑战，我们提出了一种新颖的RL框架，称为自适应复杂查询优化（ACQO）。我们的框架旨在自适应地确定何时以及如何扩展搜索过程。它具有两个核心组件：自适应查询重构（AQR）模块动态决定何时将查询分解为多个子查询，以及排名分数融合（RSF）模块确保稳健的结果聚合，并为学习代理提供稳定的奖励信号。为了减轻训练不稳定性，我们采用了一个课程强化学习（CRL）方法，通过逐步引入更具挑战性的查询来稳定训练过程。我们的综合实验表明，ACQO在三个复杂查询基准测试上实现了最先进的性能，明显优于已建立的基线。该框架还展示了改进的计算效率和与不同检索架构的广泛兼容性，使其成为下一代RAG系统的强大且可推广的解决方案。

更新时间: 2026-01-29 03:16:53

领域: cs.AI,cs.IR

下载: http://arxiv.org/abs/2601.21208v1

When Ads Become Profiles: Uncovering the Invisible Risk of Web Advertising at Scale with LLMs

Regulatory limits on explicit targeting have not eliminated algorithmic profiling on the Web, as optimisation systems still adapt ad delivery to users' private attributes. The widespread availability of powerful zero-shot multimodal Large Language Models (LLMs) has dramatically lowered the barrier for exploiting these latent signals for adversarial inference. We investigate this emerging societal risk, specifically how adversaries can now exploit these signals to reverse-engineer private attributes from ad exposure alone. We introduce a novel pipeline that leverages LLMs as adversarial inference engines to perform natural language profiling. Applying this method to a longitudinal dataset comprising over 435,000 Facebook ad impressions collected from 891 users, we conducted a large-scale study to assess the feasibility and precision of inferring private attributes from passive online ad observations. Our results demonstrate that off-the-shelf LLMs can accurately reconstruct complex user private attributes, including party preference, employment status, and education level, consistently outperforming strong census-based priors and matching or exceeding human social perception at only a fraction of the cost (223x lower) and time (52x faster) required by humans. Critically, actionable profiling is feasible even within short observation windows, indicating that prolonged tracking is not a prerequisite for a successful attack. These findings provide the first empirical evidence that ad streams serve as a high-fidelity digital footprint, enabling off-platform profiling that inherently bypasses current platform safeguards, highlighting a systemic vulnerability in the ad ecosystem and the urgent need for responsible web AI governance in the generative AI era. The code is available at https://github.com/Breezelled/when-ads-become-profiles.

Updated: 2026-01-29 03:16:51

标题: 当广告变成个人资料：使用LLMs揭示规模化网络广告的隐形风险

摘要: 对明确定向的监管限制并没有消除网络上的算法个性化定位，因为优化系统仍然会根据用户的私人属性调整广告投放。强大的零样本多模态大型语言模型（LLMs）的广泛可用性大大降低了利用这些潜在信号进行对抗性推断的门槛。我们调查了这种新兴的社会风险，特别是对手现在如何利用这些信号仅通过广告曝光来反向工程私人属性。我们引入了一种新颖的流程，利用LLMs作为对抗性推断引擎执行自然语言 profiling。将这种方法应用于一个长期数据集，其中包括从891个用户收集的超过435,000个Facebook广告展示，我们进行了一项大规模研究，以评估从 pass online 广告观察中推断私人属性的可行性和精度。我们的结果表明，现成的LLMs可以准确重构复杂的用户私人属性，包括政党偏好、就业状态和教育水平，始终优于强大的基于人口普查的先验知识，并在成本（低223倍）和时间（快52倍）上与或超过人类社会感知，这仅需要人类所需的一小部分成本和时间。至关重要的是，即使在短暂的观察窗口内，可行的 profiling 也是可行的，这表明长期跟踪并不是成功攻击的先决条件。这些发现提供了第一手实证证据，即广告流作为高保真度的数字足迹，使得实现平台之外的 profiling 成为可能，从根本上绕过了当前平台的保障措施，突显了广告生态系统中的系统性漏洞和在生成式AI时代迫切需要负责任的网络AI治理。代码可在 https://github.com/Breezelled/when-ads-become-profiles 获取。

更新时间: 2026-01-29 03:16:51

领域: cs.HC,cs.AI,cs.CR,cs.CY

下载: http://arxiv.org/abs/2509.18874v3

A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models

Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and topological deep learning (GDL and TDL) architectures. These models aggregate signals over such domains, integrate local features, and generate representations for diverse real-world applications. However, the distribution and diffusion behavior of GDL and TDL features during training remains an open and underexplored problem. Motivated by this gap, we introduce a cellular sheaf theoretic framework for modeling and analyzing the local consistency and harmonicity of node features and edge weights in graph-based architectures. By tracking local feature alignments and agreements through sheaf structures, the framework offers a topological perspective on feature diffusion and aggregation. Furthermore, a multiscale extension inspired by topological data analysis (TDA) is proposed to capture hierarchical feature interactions in graph models. This approach enables a joint characterization of GDL and TDL architectures based on their underlying geometric and topological structures and the learned signals defined on them, providing insights for future studies on conventional tasks such as node classification, substructure detection, and community detection.

Updated: 2026-01-29 03:16:18

标题: 一个关于复杂网络建模和图神经模型中注意机制的束论和拓扑视角

摘要: 组合和拓扑结构，如图形、单纯复形和细胞复形，构成了几何和拓扑深度学习（GDL和TDL）架构的基础。这些模型在这些域上聚合信号，整合局部特征，并为各种现实应用生成表示。然而，在训练过程中，GDL和TDL特征的分布和扩散行为仍然是一个未被充分探讨的问题。受到这一差距的启发，我们引入了一个细胞层理论框架，用于建模和分析基于图形的架构中节点特征和边权重的局部一致性和和谐性。通过通过细胞结构跟踪局部特征的对齐和一致性，该框架提供了对特征扩散和聚合的拓扑视角。此外，受拓扑数据分析（TDA）启发，提出了一个多尺度扩展，以捕捉图形模型中的分层特征交互。这种方法使得可以基于它们的基础几何和拓扑结构以及在其上定义的学习信号来联合描述GDL和TDL架构，为未来关于节点分类、子结构检测和社区检测等传统任务的研究提供了见解。

更新时间: 2026-01-29 03:16:18

领域: cs.LG,cs.AI,math.AT

下载: http://arxiv.org/abs/2601.21207v1

Scaling Embeddings Outperforms Scaling Experts in Language Models

While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.

Updated: 2026-01-29 03:11:19

标题: 在语言模型中，扩展嵌入优于扩展专家

摘要: 尽管混合专家（MoE）架构已成为大型语言模型稀疏缩放的标准，但它们越来越面临着收益递减和系统级瓶颈。在这项工作中，我们探讨了嵌入缩放作为扩展稀疏性的一个有效的、正交维度。通过全面的分析和实验，我们确定了嵌入缩放在特定范围内相较于专家缩放实现卓越帕累托前沿的具体情况。我们系统性地表征了支配这种功效的关键架构因素，从参数预算到与模型宽度和深度的相互作用。此外，通过整合定制的系统优化和投机解码，我们有效地将这种稀疏性转化为有形的推理加速。在这些见解的指导下，我们介绍了LongCat-Flash-Lite，这是一个从头开始训练的68.5B参数模型，其中约3B参数被激活。尽管将超过30B参数分配给嵌入，LongCat-Flash-Lite不仅超越了参数相等的MoE基线，而且在与现有规模相当的模型中表现出卓越的竞争力，特别是在主动和编码领域。

更新时间: 2026-01-29 03:11:19

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21204v1

Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends

Vision-language-action (VLA) models extend vision-language models (VLM) by integrating action generation modules for robotic manipulation. Leveraging the strengths of VLM in vision perception and instruction understanding, VLA models exhibit promising generalization across diverse manipulation tasks. However, applications demanding high precision and accuracy reveal performance gaps without further adaptation. Evidence from multiple domains highlights the critical role of post-training to align foundational models with downstream applications, spurring extensive research on post-training VLA models. VLA model post-training aims to enhance an embodiment's ability to interact with the environment for the specified tasks. This perspective aligns with Newell's constraints-led theory of skill acquisition, which posits that motor behavior arises from interactions among task, environmental, and organismic (embodiment) constraints. Accordingly, this survey structures post-training methods into four categories: (i) enhancing environmental perception, (ii) improving embodiment awareness, (iii) deepening task comprehension, and (iv) multi-component integration. Experimental results on standard benchmarks are synthesized to distill actionable guidelines. Finally, open challenges and emerging trends are outlined, relating insights from human learning to prospective methods for VLA post-training. This work delivers both a comprehensive overview of current VLA model post-training methods from a human motor learning perspective and practical insights for VLA model development. Project website: https://github.com/AoqunJin/Awesome-VLA-Post-Training.

Updated: 2026-01-29 03:05:22

标题: 平行VLAModel后训练与人类运动学习之间的相似之处：进展、挑战和趋势

摘要: 视觉语言动作（VLA）模型通过集成用于机器人操作的动作生成模块扩展了视觉语言模型（VLM）。利用VLM在视觉感知和指令理解方面的优势，VLA模型在各种操作任务中展现出有前途的泛化能力。然而，需要高精度和准确性的应用显示出性能差距，需要进一步适应。来自多个领域的证据突显了后续训练在将基础模型与下游应用对齐中的关键作用，促使对后续训练VLA模型进行了广泛研究。VLA模型的后续训练旨在增强实体与环境互动以完成指定任务的能力。这种观点与纽厄尔的约束导向技能习得理论相一致，该理论认为运动行为源于任务、环境和有机体（实体化）约束之间的相互作用。因此，本调查将后续训练方法分为四类：（i）增强环境感知，（ii）改进实体认识，（iii）加深任务理解，和（iv）多组件整合。对标准基准测试的实验结果进行综合，提炼可行指导方针。最后，概述了开放挑战和新兴趋势，将人类学习的见解与VLA后续训练的潜在方法联系起来。这项工作从人类运动学习的角度提供了对当前VLA模型后续训练方法的全面概述和实用见解。项目网站：https://github.com/AoqunJin/Awesome-VLA-Post-Training。

更新时间: 2026-01-29 03:05:22

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2506.20966v2

Refine-POI: Reinforcement Fine-Tuned Large Language Models for Next Point-of-Interest Recommendation

Advancing large language models (LLMs) for the next point-of-interest (POI) recommendation task faces two fundamental challenges: (i) although existing methods produce semantic IDs that incorporate semantic information, their topology-blind indexing fails to preserve semantic continuity, meaning that proximity in ID values does not mirror the coherence of the underlying semantics; and (ii) supervised fine-tuning (SFT)-based methods restrict model outputs to top-1 predictions. These approaches suffer from "answer fixation" and neglect the need for top-k ranked lists and reasoning due to the scarcity of supervision. We propose Refine-POI, a framework that addresses these challenges through topology-aware ID generation and reinforcement fine-tuning. First, we introduce a hierarchical self-organizing map (SOM) quantization strategy to generate semantic IDs, ensuring that coordinate proximity in the codebook reflects semantic similarity in the latent space. Second, we employ a policy-gradient framework to optimize the generation of top-k recommendation lists, liberating the model from strict label matching. Extensive experiments on three real-world datasets demonstrate that Refine-POI significantly outperforms state-of-the-art baselines, effectively synthesizing the reasoning capabilities of LLMs with the representational fidelity required for accurate and explainable next-POI recommendation.

Updated: 2026-01-29 03:03:18

标题: 优化POI：强化微调大型语言模型用于下一个兴趣点推荐

摘要: 推进大型语言模型（LLMs）用于下一个兴趣点（POI）推荐任务面临两个基本挑战：（i）尽管现有方法生成包含语义信息的语义ID，但它们的拓扑盲目索引无法保持语义连续性，这意味着ID值的接近并不反映基础语义的连贯性；以及（ii）基于监督微调（SFT）的方法将模型输出限制为top-1预测。这些方法受到“答案固化”的影响，忽视了对top-k排名列表和推理的需求，因为监督不足。我们提出了Refine-POI，一个通过拓扑感知ID生成和强化微调来解决这些挑战的框架。首先，我们引入了一种分层自组织映射（SOM）量化策略来生成语义ID，确保在码书中的坐标接近反映潜在空间中的语义相似性。其次，我们采用策略梯度框架来优化生成top-k推荐列表，使模型摆脱严格的标签匹配。对三个真实世界数据集进行的大量实验证明，Refine-POI明显优于最先进的基线方法，有效地综合了LLMs的推理能力和准确和可解释的下一个POI推荐所需的表征忠实度。

更新时间: 2026-01-29 03:03:18

领域: cs.IR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2506.21599v3

Panning for Gold: Expanding Domain-Specific Knowledge Graphs with General Knowledge

Domain-specific knowledge graphs (DKGs) are critical yet often suffer from limited coverage compared to General Knowledge Graphs (GKGs). Existing tasks to enrich DKGs rely primarily on extracting knowledge from external unstructured data or completing KGs through internal reasoning, but the scope and quality of such integration remain limited. This highlights a critical gap: little systematic exploration has been conducted on how comprehensive, high-quality GKGs can be effectively leveraged to supplement DKGs. To address this gap, we propose a new and practical task: domain-specific knowledge graph fusion (DKGF), which aims to mine and integrate relevant facts from general knowledge graphs into domain-specific knowledge graphs to enhance their completeness and utility. Unlike previous research, this new task faces two key challenges: (1) high ambiguity of domain relevance, i.e., difficulty in determining whether knowledge from a GKG is truly relevant to the target domain , and (2) cross-domain knowledge granularity misalignment, i.e., GKG facts are typically abstract and coarse-grained, whereas DKGs frequently require more contextualized, fine-grained representations aligned with particular domain scenarios. To address these, we present ExeFuse, a neuro-symbolic framework based on a novel Fact-as-Program paradigm. ExeFuse treats fusion as an executable process, utilizing neuro-symbolic execution to infer logical relevance beyond surface similarity and employing target space grounding to calibrate granularity. We construct two new datasets to establish the first standardized evaluation suite for this task. Extensive experiments demonstrate that ExeFuse effectively overcomes domain barriers to achieve superior fusion performance.

Updated: 2026-01-29 03:03:05

标题: 淘金：用通识知识扩展领域特定知识图

摘要: 领域特定知识图谱（DKGs）至关重要，但通常与通用知识图谱（GKGs）相比具有有限的覆盖范围。现有的丰富DKGs的任务主要依赖于从外部非结构化数据中提取知识或通过内部推理完成KGs，但这种整合的范围和质量仍然有限。这突显了一个重要的差距：对如何有效利用全面、高质量的GKGs来补充DKGs进行系统探索的工作几乎没有进行过。为了解决这一差距，我们提出了一个新的实用任务：领域特定知识图融合（DKGF），旨在从通用知识图中挖掘和整合相关事实，以增强领域特定知识图的完整性和实用性。与先前的研究不同，这个新任务面临两个关键挑战：（1）领域相关性高度模糊，即难以确定GKG中的知识是否真正与目标领域相关，以及（2）跨领域知识粒度不匹配，即GKG事实通常是抽象和粗粒度的，而DKGs经常需要更加具体化、细粒度的表示，与特定领域场景对齐。为了解决这些问题，我们提出了ExeFuse，这是一个基于新型事实即程序范式的神经符号框架。ExeFuse将融合视为可执行过程，利用神经符号执行来推断逻辑相关性超越表面相似性，并利用目标空间基础来校准粒度。我们构建了两个新数据集，为这一任务建立了第一个标准化评估套件。大量实验证明，ExeFuse有效地克服了领域障碍，实现了优越的融合性能。

更新时间: 2026-01-29 03:03:05

领域: cs.AI

下载: http://arxiv.org/abs/2601.10485v2

Thinker: A vision-language foundation model for embodied intelligence

When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.

Updated: 2026-01-29 02:52:08

标题: 思考者：一种面向具象智能的视觉-语言基础模型

摘要: 当大型视觉-语言模型应用于机器人领域时，它们会遇到对人类而言简单但对模型而言容易出错的问题。这些问题包括第三人称和第一人称视角之间的混淆，以及在时间推理过程中忽视视频结尾的信息。为了解决这些挑战，我们提出了Thinker，一个专为具身智能设计的大型视觉-语言基础模型。我们从两个角度解决上述问题。首先，我们构建了一个针对机器人感知和推理的大规模数据集，包括自观视频、视觉定位、空间理解和思维链数据。其次，我们引入了一种简单而有效的方法，通过联合将关键帧和完整视频序列作为输入，显著增强了模型对视频理解的能力。我们的模型在任务规划领域中两个最常用的基准数据集上取得了最先进的成果。

更新时间: 2026-01-29 02:52:08

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2601.21199v1

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory footprint severely impedes the practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput than the state-of-the-art systems.

Updated: 2026-01-29 02:51:59

标题: ZipMoE: 通过无损压缩和缓存亲和调度实现高效的设备端MoE模型服务

摘要: 混合专家（MoE）架构显著增强了大型语言模型的表达能力，但其庞大的内存占用严重阻碍了在资源受限的边缘设备上实际部署，特别是当模型行为必须在不依赖有损量化的情况下保持时。在本文中，我们提出了ZipMoE，一种高效且在设备上无语义损失的MoE服务系统。ZipMoE通过缓存调度共同设计利用边缘设备的硬件特性和MoE参数固有的统计冗余之间的协同作用，具有可证明的性能保证。从根本上说，我们的设计将在设备上的MoE推断的范式从I/O限制瓶颈转变为使得有效并行化成为可能的计算中心工作流程。我们实现了ZipMoE的原型，并在代表性的边缘计算平台上使用流行的开源MoE模型和真实工作负载进行了广泛实验。我们的评估显示，ZipMoE实现了高达72.77%的推断延迟降低和高达6.76倍的吞吐量，比最先进的系统表现更好。

更新时间: 2026-01-29 02:51:59

领域: cs.DC,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21198v1

Do Reasoning Models Enhance Embedding Models?

State-of-the-art embedding models are increasingly derived from decoder-only Large Language Model (LLM) backbones adapted via contrastive learning. Given the emergence of reasoning models trained via Reinforcement Learning with Verifiable Rewards (RLVR), a natural question arises: do enhanced reasoning translate to superior semantic representations when these models serve as embedding initializations? Contrary to expectation, our evaluation on MTEB and BRIGHT reveals a **null effect**: embedding models initialized from RLVR-tuned backbones yield no consistent performance advantage over their base counterparts when subjected to identical training recipes. To unpack this paradox, we introduce **H**ierarchical **R**epresentation **S**imilarity **A**nalysis (HRSA), a framework that decomposes similarity across representation, geometry, and function levels. HRSA reveals that while RLVR induces irreversible latent manifold's local geometry reorganization and reversible coordinate basis drift, it preserves the global manifold geometry and linear readout. Consequently, subsequent contrastive learning drives strong alignment between base- and reasoning-initialized models, a phenomenon we term **Manifold Realignment**. Empirically, our findings suggest that unlike Supervised Fine-Tuning (SFT), RLVR optimizes trajectories within an existing semantic landscape rather than fundamentally restructuring the landscape itself.

Updated: 2026-01-29 02:48:34

标题: 推理模型是否增强嵌入模型？

摘要: 最先进的嵌入模型越来越多地来自通过对比学习调整的仅解码器大型语言模型（LLM）骨干。鉴于通过具有可验证奖励的强化学习训练的推理模型的出现，一个自然的问题出现了：增强的推理是否转化为当这些模型作为嵌入初始化时具有更优越的语义表示？与预期相反，我们在MTEB和BRIGHT上的评估显示了一个**空效应**：从RLVR调整的骨干初始化的嵌入模型在接受相同的训练配方时并未表现出比其基础对照模型更优异的性能优势。为了解开这个悖论，我们引入了**H**ierarchical **R**epresentation **S**imilarity **A**nalysis（HRSA），这是一个框架，可以将相似性分解为表示、几何和功能级别。HRSA揭示了，虽然RLVR引起了不可逆的潜在流形的局部几何重组和可逆的坐标基础漂移，但它保持了全局流形几何和线性读出。因此，随后的对比学习推动了基础和推理初始化模型之间的强大对齐，这一现象我们称之为**流形重新对齐**。实证上，我们的发现表明，与监督微调（SFT）不同，RLVR优化了现有语义景观内的轨迹，而不是根本性地重构景观本身。

更新时间: 2026-01-29 02:48:34

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21192v1

Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than ToM tasks, and that models tend to perform better reasoning with humans compared to inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.

Updated: 2026-01-29 02:43:09

标题: 语言模型可能不理解您：通过故事提示评估心灵理论

摘要: 我们介绍了$\texttt{StorySim}$，这是一个可编程框架，用于合成生成故事以评估大型语言模型（LLMs）的心智理论（ToM）和世界建模（WM）能力。与以往可能在预训练数据中存在污染的基准不同，$\texttt{StorySim}$生成了由高度可控的$\texttt{Storyboard}$锚定的新颖、组合故事提示，实现了对角色观点和事件的精确操作。我们利用这个框架设计了一阶和二阶ToM任务以及控制追踪和建模心智状态能力的WM任务。我们在一系列最先进的LLMs上进行实验后发现，大多数模型在WM任务上表现更好，而在ToM任务上表现较差，并且模型往往更善于与人类进行推理，而不是与无生命的物体。此外，我们的框架使我们能够找到启发式行为的证据，例如最近偏见和对故事中较早事件的过度依赖。生成数据和评估的所有代码都可以免费获得。

更新时间: 2026-01-29 02:43:09

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2506.19089v4

From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning

What statistical conditions support learning hierarchical structure from linear input? In this paper, we address this question by focusing on the statistical distribution of function words. Function words have long been argued to play a crucial role in language acquisition due to their distinctive distributional properties, including high frequency, reliable association with syntactic structure, and alignment with phrase boundaries. We use cross-linguistic corpus analysis to first establish that all three properties are present across 186 studied languages. Next, we use a combination of counterfactual language modeling and ablation experiments to show that language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment. Follow-up probing and ablation analyses further reveal that different learning conditions lead to systematically different reliance on function words, indicating that similar performance can arise from distinct internal mechanisms.

Updated: 2026-01-29 02:42:12

标题: 从线性输入到层次结构：功能词作为语言学习的统计线索

摘要: 什么统计条件支持从线性输入中学习层次结构？本文通过关注功能词的统计分布来回答这个问题。由于功能词具有高频率、可靠与句法结构相关联以及与短语边界对齐等独特的分布特性，长期以来一直有人认为功能词在语言习得中起着至关重要的作用。我们首先利用跨语言语料库分析来确认这三个特性在186种研究语言中均存在。接着，我们使用反事实语言建模和切除实验的组合来展示保留这三个特性的语言变体更容易被神经学习者习得，其中频率和结构关联比边界对齐更为重要。随后的探究和切除分析进一步揭示不同的学习条件会导致对功能词的依赖程度有系统性差异，表明相似的性能可能源于不同的内部机制。

更新时间: 2026-01-29 02:42:12

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2601.21191v1

Adaptive and Robust Cost-Aware Proof of Quality for Decentralized LLM Inference Networks

Decentralized large language model inference networks require lightweight mechanisms to reward high quality outputs under heterogeneous latency and cost. Proof of Quality provides scalable verification by sampling evaluator nodes that score candidate outputs, then aggregating their scores into a consensus signal that determines rewards. However, evaluator heterogeneity and malicious score manipulation can distort consensus and inflate payouts, which weakens incentive alignment in open participation settings. This paper extends a cost-aware Proof of Quality mechanism by adding adversary-resilient consensus formation. We study robust aggregation rules, including median and trimmed mean, and an adaptive trust-weighted consensus that updates evaluator weights from deviation signals. Using question answering and summarization workloads with a ground truth proxy for offline analysis, we quantify evaluator reliability and show strong variance across evaluators, including task-dependent misalignment that can invert correlations. We then evaluate robustness under four adversarial strategies, including noise injection, boosting, sabotage, and intermittent manipulation, across a sweep of malicious ratios and evaluator sample sizes. Our results show that robust aggregation improves consensus alignment with the ground truth proxy and reduces sensitivity to noisy and strategic attacks compared with simple averaging. We further characterize the operational trade-off introduced by evaluator sampling, where larger evaluator sets reduce evaluator rewards and increase payoff variance while inference rewards remain relatively stable in our configuration. These findings motivate robust consensus as a default component for cost-aware Proof of Quality and provide practical guidance for selecting evaluator sampling parameters under adversarial risk and resource constraints.

Updated: 2026-01-29 02:39:40

标题: 自适应和健壮的成本感知质量证明对于去中心化LLM推理网络的翻译

摘要: 去中心化的大型语言模型推理网络需要轻量级机制来奖励高质量的输出，在异构的延迟和成本下。质量证明通过对评估节点进行抽样评分候选输出，然后将它们的评分聚合为确定奖励的共识信号，提供可扩展的验证。然而，评估者的异质性和恶意评分操纵可能会扭曲共识并增加支付额，这会削弱开放参与环境中的激励对齐。本文通过添加对抗韧性共识形成扩展了一种成本感知的质量证明机制。我们研究了包括中位数和修剪均值在内的强大聚合规则，以及一个自适应的信任加权共识，从偏差信号中更新评估者权重。利用问题回答和摘要工作负载，并使用离线分析的基本真实代理，我们量化了评估者的可靠性，并展示了评估者之间的强烈差异，包括可能颠倒相关性的任务相关错位。然后，我们评估了在四种敌对策略下的鲁棒性，包括噪声注入、增强、破坏和间歇性操纵，在恶意比率和评估者样本大小的范围内。我们的结果显示，强大的聚合改善了共识与基本真实代理的对齐，并且与简单平均方法相比，减少了对噪声和战略攻击的敏感性。我们进一步表征了评估者抽样引入的运营权衡，其中更大的评估者集合减少了评估者的奖励，并增加了支付差异，而我们的配置中推理奖励保持相对稳定。这些发现促使将强大的共识作为成本感知质量证明的默认组件，并为在敌对风险和资源约束下选择评估者抽样参数提供实用指导。

更新时间: 2026-01-29 02:39:40

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2601.21189v1

Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

Reasoning models frequently agree with incorrect user suggestions -- a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. To localize and quantify this behavior, we introduce \emph{sycophantic anchors} -- sentences that causally lock models into user agreement. Analyzing over 10,000 counterfactual rollouts on a distilled reasoning model, we show that anchors can be reliably detected and quantified mid-inference. Linear probes distinguish sycophantic anchors with 84.6\% balanced accuracy, while activation-based regressors predict the magnitude of the commitment ($R^2 = 0.74$). We further observe asymmetry where sycophantic anchors are significantly more distinguishable than correct reasoning anchors, and find that sycophancy builds gradually during reasoning, revealing a potential window for intervention. These results offer sentence-level mechanisms for localizing model misalignment mid-inference.

Updated: 2026-01-29 02:34:16

标题: 马屁精的锚点：在推理模型中定位和量化用户协议

摘要: 推理模型经常同意不正确的用户建议--这种行为被称为阿谀。然而，尚不清楚在推理过程中这种一致性是从何处起源的，以及承诺有多强。为了定位和量化这种行为，我们引入了\emph{阿谀锚}--导致模型与用户一致的句子。通过分析超过10,000个精简推理模型上的反事实展开，我们展示了锚可以在推理过程中可靠地检测和量化。线性探针以84.6\%的平衡准确度区分了阿谀锚，而基于激活的回归器预测了承诺的程度($R^2 = 0.74$)。我们进一步观察到了阿谀锚明显比正确的推理锚更易区分的不对称性，并发现阿谀在推理过程中逐渐积累，揭示了干预的潜在时机。这些结果提供了在推理过程中定位模型不对齐的句子级机制。

更新时间: 2026-01-29 02:34:16

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.21183v1

Rethinking Refinement: Correcting Generative Bias without Noise Injection

Generative models, including diffusion and flow-based models, often exhibit systematic biases that degrade sample quality, particularly in high-dimensional settings. We revisit refinement methods and show that effective bias correction can be achieved as a post-hoc procedure, without noise injection or multi-step resampling of the sampling process. We propose a flow-matching-based \textbf{Bi-stage Flow Refinement (BFR)} framework with two refinement strategies operating at different stages: latent space alignment for approximately invertible generators and data space refinement trained with lightweight augmentations. Unlike previous refiners that perturb sampling dynamics, BFR preserves the original ODE trajectory and applies deterministic corrections to generated samples. Experiments on MNIST, CIFAR-10, and FFHQ at 256x256 resolution demonstrate consistent improvements in fidelity and coverage; notably, starting from base samples with FID 3.95, latent space refinement achieves a \textbf{state-of-the-art} FID of \textbf{1.46} on MNIST using only a single additional function evaluation (1-NFE), while maintaining sample diversity.

Updated: 2026-01-29 02:34:08

标题: 重新思考细化：纠正生成偏见而不注入噪音

摘要: 生成模型，包括扩散和基于流的模型，通常表现出系统性偏差，降低了样本质量，特别是在高维环境中。我们重新审视了细化方法，并展示了有效的偏差校正可以作为后处理程序实现，无需注入噪音或对采样过程进行多步重抽样。我们提出了基于流匹配的\textbf{双阶段流细化（BFR）}框架，其中包括两种在不同阶段操作的细化策略：用于近似可逆生成器的潜在空间对齐和使用轻量级增强训练的数据空间细化。与之前扰动采样动态的细化器不同，BFR保留了原始的ODE轨迹，并对生成的样本应用确定性校正。在MNIST、CIFAR-10和FFHQ分辨率为256x256的实验中，展示了在保真度和覆盖率方面的一致改进；值得注意的是，从基本样本的FID为3.95开始，潜在空间细化在MNIST上仅使用单个额外函数评估（1-NFE）就实现了\textbf{最先进}的FID值为\textbf{1.46}，同时保持样本多样性。

更新时间: 2026-01-29 02:34:08

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21182v1

MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents' task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To addressing these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures task execution, complex reasoning, and noise robustness of agents by including 5 subsets, which set multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.

Updated: 2026-01-29 02:32:46

标题: MobileBench-OL: 一个用于在真实环境中评估移动GUI代理的综合性中文基准

摘要: 移动图形用户界面（GUI）代理的最新进展凸显了对全面评估基准的日益增长的需求。尽管新的在线基准比离线基准提供了更现实的测试，但它们往往专注于代理的任务指令遵循能力，而忽视了它们的推理和探索能力。此外，这些基准没有考虑到真实世界移动环境中的随机噪声。这导致了基准和真实世界环境之间的差距。为了解决这些限制，我们提出MobileBench-OL，这是一个在线基准，包含来自80个中国应用程序的1080个任务。它通过包含5个子集来衡量代理的任务执行、复杂推理和噪声鲁棒性，这些子集设置了多个评估维度。我们还提供了一个带有重置机制的自动评估框架，使得稳定和可重复的真实世界基准测试成为可能。在MobileBench-OL上评估12个领先的GUI代理显示出满足真实世界要求的改进空间。人类评价进一步证实MobileBench-OL可以可靠地衡量领先的GUI代理在真实环境中的性能。我们的数据和代码将在接受后发布。

更新时间: 2026-01-29 02:32:46

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2601.20335v2

MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8\% and 2.0\% improvements for VideoLLaMA2-AV, 8.7\% and 4.7\% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at \href{https://github.com/top-yun/MAD}{https://github.com/top-yun/MAD}

Updated: 2026-01-29 02:30:32

标题: MAD：用于缓解多模态大型语言模型中跨模态幻觉的模态自适应解码

摘要: 多模态大型语言模型（MLLMs）受到跨模态幻觉的困扰，其中一个模态不适当地影响另一个模态的生成，导致虚构的输出。这暴露了模态交互控制中更基本的缺陷。为了解决这个问题，我们提出了Modality-Adaptive Decoding（MAD），这是一种无需训练的方法，根据任务要求自适应地加权模态特定的解码分支。MAD利用模型固有的能力自我评估模态相关性，通过查询每个任务需要哪些模态来提取模态概率。然后，提取的模态概率用于自适应地加权对比解码分支，使模型能够专注于相关信息，同时抑制跨模态干扰。对CMM和AVHBench进行的大量实验表明，MAD显著减少了跨模态幻觉，跨多个音频-视觉语言模型（VideoLLaMA2-AV的改进为7.8％和2.0％，Qwen2.5-Omni的改进为8.7％和4.7％）。我们的方法表明，通过自我评估实现显式模态意识对于稳健的多模态推理至关重要，为现有对比解码方法提供了原则性扩展。我们的代码可以在\href{https://github.com/top-yun/MAD}{https://github.com/top-yun/MAD}上找到。

更新时间: 2026-01-29 02:30:32

领域: cs.AI

下载: http://arxiv.org/abs/2601.21181v1

EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning

Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities.

Updated: 2026-01-29 02:27:32

标题: EndoAgent：一种记忆引导的反思智能内窥镜视觉决策代理

摘要: 开发通用人工智能系统以支持内窥镜图像诊断是一项新兴的研究重点。现有基于大规模预训练的方法通常缺乏跨任务的统一协调，并且难以处理复杂临床工作流程中所需的多步骤过程。虽然人工智能代理在灵活的指令解析和跨领域工具集成方面表现出了潜力，但它们在内窥镜方面的潜力尚未得到充分挖掘。为了填补这一空白，我们提出了EndoAgent，这是第一个为视觉决策内窥镜分析设计的记忆引导代理，它将迭代推理与自适应工具选择和协作相结合。基于双重记忆设计构建的EndoAgent通过确保逻辑连贯性和逐渐增强推理准确性，实现了复杂决策。为了支持多样化的临床任务，EndoAgent在统一推理循环中整合了一套专家设计的工具。我们进一步介绍了EndoAgentBench，这是一个包含5,709个视觉问题-答案对的基准测试，评估在现实场景中的视觉理解和语言生成能力。广泛的实验表明，EndoAgent始终优于通用和医学多模型，展现出其强大的灵活性和推理能力。

更新时间: 2026-01-29 02:27:32

领域: cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2508.07292v2

AC2L-GAD: Active Counterfactual Contrastive Learning for Graph Anomaly Detection

Graph anomaly detection aims to identify abnormal patterns in networks, but faces significant challenges from label scarcity and extreme class imbalance. While graph contrastive learning offers a promising unsupervised solution, existing methods suffer from two critical limitations: random augmentations break semantic consistency in positive pairs, while naive negative sampling produces trivial, uninformative contrasts. We propose AC2L-GAD, an Active Counterfactual Contrastive Learning framework that addresses both limitations through principled counterfactual reasoning. By combining information-theoretic active selection with counterfactual generation, our approach identifies structurally complex nodes and generates anomaly-preserving positive augmentations alongside normal negative counterparts that provide hard contrasts, while restricting expensive counterfactual generation to a strategically selected subset. This design reduces computational overhead by approximately 65% compared to full-graph counterfactual generation while maintaining detection quality. Experiments on nine benchmark datasets, including real-world financial transaction graphs from GADBench, show that AC2L-GAD achieves competitive or superior performance compared to state-of-the-art baselines, with notable gains in datasets where anomalies exhibit complex attribute-structure interactions.

Updated: 2026-01-29 02:11:56

标题: AC2L-GAD：用于图异常检测的主动反事实对照学习

摘要: 图异常检测旨在识别网络中的异常模式，但面临着标签稀缺和极端类别不平衡的重大挑战。虽然图对比学习提供了一种有前途的无监督解决方案，但现有方法存在两个关键限制：随机增强会破坏正样本中的语义一致性，而朴素的负采样会产生琐碎、无信息的对比。我们提出了AC2L-GAD，一种主动对照对比学习框架，通过合理的反事实推理解决了这两个限制。通过将信息论主动选择与反事实生成相结合，我们的方法识别了结构复杂的节点，并生成了保留异常的正增强，同时提供了与正常负样本形成强烈对比的负对比，同时将昂贵的反事实生成限制在一个策略性选择的子集中。这种设计将计算开销降低了约65%，并保持了检测质量。对包括来自GADBench的真实金融交易图在内的九个基准数据集进行实验表明，与最先进的基线相比，AC2L-GAD在竞争性或优越性能方面取得了不错的成绩，尤其是在异常展现复杂属性-结构相互作用的数据集中获得显著增益。

更新时间: 2026-01-29 02:11:56

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21171v1

Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space

We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.

Updated: 2026-01-29 02:11:43

标题: 输出空间搜索：针对冻结编码器定义的输出空间中的LLM生成

摘要: 我们引入了输出空间搜索（OS-Search），将LLM生成转化为端点搜索。一个外部循环在一个冻结的编码器定义的3D输出空间Z中选择目标z*，并且一个经过序列级RL训练的检索基准策略生成输出，其坐标在标准自回归解码下接近z*。这使得在Z中进行平行扫描和黑盒优化成为可能，而无需依赖路径的标记/程序搜索。在故事中，对Z（文本）进行扫描产生的LLM分数多样性比提示链高出3.1倍。在代码中，对Z（代码）进行贝叶斯优化在与匹配的推理预算下改善了一个对控制器隐藏的客观指标，同时保持有效性。

更新时间: 2026-01-29 02:11:43

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2601.21169v1

FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks

We introduce FrontierScience, a benchmark evaluating expert-level scientific reasoning in frontier language models. Recent model progress has nearly saturated existing science benchmarks, which often rely on multiple-choice knowledge questions or already published information. FrontierScience addresses this gap through two complementary tracks: (1) Olympiad, consisting of international olympiad problems at the level of IPhO, IChO, and IBO, and (2) Research, consisting of PhD-level, open-ended problems representative of sub-tasks in scientific research. FrontierScience contains several hundred questions (including 160 in the open-sourced gold set) covering subfields across physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. All Olympiad problems are originally produced by international Olympiad medalists and national team coaches to ensure standards of difficulty, originality, and factuality. All Research problems are research sub-tasks written and verified by PhD scientists (doctoral candidates, postdoctoral researchers, or professors). For Research, we introduce a granular rubric-based evaluation framework to assess model capabilities throughout the process of solving a research task, rather than judging only a standalone final answer.

Updated: 2026-01-29 02:04:56

标题: 前沿科学：评估人工智能执行专家级科学任务的能力

摘要: 我们介绍了FrontierScience，这是一个评估前沿语言模型中专家级科学推理能力的基准。最近模型的进展几乎饱和了现有的科学基准，这些基准通常依赖于多项选择知识问题或已经发表的信息。FrontierScience通过两个互补的轨道来解决这一差距：（1）奥林匹克，包括IPhO、IChO和IBO级别的国际奥林匹克问题；（2）研究，包括代表科学研究子任务的博士级别开放问题。 FrontierScience包含数百个问题（包括在开源的金标准中的160个问题），涵盖物理、化学和生物学等领域，从量子电动力学到合成有机化学。所有奥林匹克问题都是由国际奥林匹克奖牌得主和国家队教练原创的，以确保难度、独创性和事实性的标准。所有研究问题都是由博士科学家（博士研究生、博士后研究员或教授）编写和验证的研究子任务。对于研究，我们引入了一个基于细粒度规则的评估框架，以评估模型在解决研究任务过程中的能力，而不仅仅是判断一个独立的最终答案。

更新时间: 2026-01-29 02:04:56

领域: cs.AI,cs.CY,cs.LG

下载: http://arxiv.org/abs/2601.21165v1

Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving

Plane Geometry Problem Solving (PGPS) is a multimodal reasoning task that aims to solve a plane geometric problem based on a geometric diagram and problem textual descriptions. Although Large Language Models (LLMs) possess strong reasoning skills, their direct application to PGPS is hindered by their inability to process visual diagrams. Existing works typically fine-tune Multimodal LLMs (MLLMs) end-to-end on large-scale PGPS data to enhance visual understanding and reasoning simultaneously. However, such joint optimization may compromise base LLMs' inherent reasoning capability. In this work, we observe that LLM itself is potentially a powerful PGPS solver when appropriately formulating visual information as textual descriptions. We propose to train a MLLM Interpreter to generate geometric descriptions for the visual diagram, and an off-the-shelf LLM is utilized to perform reasoning. Specifically, we choose Conditional Declaration Language (CDL) as the geometric description as its conciseness eases the MLLM Interpreter training. The MLLM Interpreter is fine-tuned via CoT (Chain-of-Thought)-augmented SFT followed by GRPO to generate CDL. Instead of using a conventional solution-based reward that compares the reasoning result with the ground-truth answer, we design CDL matching rewards to facilitate more effective GRPO training, which provides more direct and denser guidance for CDL generation. To support training, we construct a new dataset, Formalgeo7k-Rec-CoT, by manually reviewing Formalgeo7k v2 and incorporating CoT annotations. Extensive experiments on Formalgeo7k-Rec-CoT, Unigeo, and MathVista show our method (finetuned on only 5.5k data) performs favorably against leading open-source and closed-source MLLMs.

Updated: 2026-01-29 02:03:33

标题: 简洁的几何描述作为桥梁：释放LLM在平面几何问题解决中的潜力

摘要: 平面几何问题解决（PGPS）是一个多模态推理任务，旨在基于几何图和问题文本描述解决平面几何问题。尽管大型语言模型（LLMs）具有强大的推理能力，但它们无法处理视觉图表，从而阻碍了它们直接应用于PGPS的能力。现有研究通常在大规模PGPS数据上端对端地对多模态LLMs（MLLMs）进行微调，以同时增强视觉理解和推理能力。然而，这种联合优化可能会损害基本LLMs的固有推理能力。在这项工作中，我们观察到LLM本身在适当将视觉信息形式化为文本描述时，可能是一个强大的PGPS求解器。我们提出训练一个MLLM解释器来为视觉图生成几何描述，然后利用现有的LLM进行推理。具体而言，我们选择条件声明语言（CDL）作为几何描述，因为其简洁性有助于MLLM解释器的训练。MLLM解释器通过CoT（Chain-of-Thought）增强SFT后跟随GRPO进行微调，以生成CDL。我们设计了CDL匹配奖励来促进更有效的GRPO训练，而不是使用传统的基于解决方案的奖励将推理结果与地面真实答案进行比较，这为CDL生成提供了更直接和更密集的指导。为了支持训练，我们通过手动审核Formalgeo7k v2并合并CoT注释，构建了一个新数据集Formalgeo7k-Rec-CoT。在Formalgeo7k-Rec-CoT、Unigeo和MathVista上进行的广泛实验表明，我们的方法（仅在5.5k数据上进行微调）表现优于领先的开源和闭源MLLMs。

更新时间: 2026-01-29 02:03:33

领域: cs.AI

下载: http://arxiv.org/abs/2601.21164v1

EMSEdit: Efficient Multi-Step Meta-Learning-based Model Editing

Large Language Models (LLMs) power numerous AI applications, yet updating their knowledge remains costly. Model editing provides a lightweight alternative through targeted parameter modifications, with meta-learning-based model editing (MLME) demonstrating strong effectiveness and efficiency. However, we find that MLME struggles in low-data regimes and incurs high training costs due to the use of KL divergence. To address these issues, we propose $\textbf{E}$fficient $\textbf{M}$ulti-$\textbf{S}$tep $\textbf{Edit (EMSEdit)}$, which leverages multi-step backpropagation (MSBP) to effectively capture gradient-activation mapping patterns within editing samples, performs multi-step edits per sample to enhance editing performance under limited data, and introduces norm-based regularization to preserve unedited knowledge while improving training efficiency. Experiments on two datasets and three LLMs show that EMSEdit consistently outperforms state-of-the-art methods in both sequential and batch editing. Moreover, MSBP can be seamlessly integrated into existing approaches to yield additional performance gains. Further experiments on a multi-hop reasoning editing task demonstrate EMSEdit's robustness in handling complex edits, while ablation studies validate the contribution of each design component. Our code is available at https://github.com/xpq-tech/emsedit.

Updated: 2026-01-29 01:59:01

标题: EMSEdit：基于高效多步骤元学习的模型编辑

摘要: 大型语言模型（LLMs）驱动着许多人工智能应用，但更新它们的知识仍然很昂贵。模型编辑通过针对性参数修改提供了一种轻量级的替代方法，基于元学习的模型编辑（MLME）表现出强大的有效性和效率。然而，我们发现MLME在低数据情况下难以应对，并且由于使用KL散度而产生高昂的训练成本。为了解决这些问题，我们提出了$\textbf{E}$fficient $\textbf{M}$ulti-$\textbf{S}$tep $\textbf{Edit (EMSEdit)}$，它利用多步反向传播（MSBP）来有效捕获编辑样本中的梯度激活映射模式，对每个样本进行多步编辑以增强在有限数据下的编辑性能，并引入基于范数的正则化以保留未编辑的知识同时提高训练效率。在两个数据集和三个LLMs上的实验表明，EMSEdit在顺序和批量编辑中始终优于最先进的方法。此外，MSBP可以无缝集成到现有方法中，以获得额外的性能提升。在一个多跳推理编辑任务上的进一步实验表明了EMSEdit在处理复杂编辑时的稳健性，消融研究验证了每个设计组件的贡献。我们的代码可在https://github.com/xpq-tech/emsedit上找到。

更新时间: 2026-01-29 01:59:01

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2508.04012v3

A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning

Graph Retrieval-Augmented Generation (Graph-RAG) enhances multihop question answering by organizing corpora into knowledge graphs and routing evidence through relational structure. However, practical deployments face two persistent bottlenecks: (i) mixed-difficulty workloads where one-size-fits-all retrieval either wastes cost on easy queries or fails on hard multihop cases, and (ii) extraction loss, where graph abstraction omits fine-grained qualifiers that remain only in source text. We present A2RAG, an adaptive-and-agentic GraphRAG framework for cost-aware and reliable reasoning. A2RAG couples an adaptive controller that verifies evidence sufficiency and triggers targeted refinement only when necessary, with an agentic retriever that progressively escalates retrieval effort and maps graph signals back to provenance text to remain robust under extraction loss and incomplete graphs. Experiments on HotpotQA and 2WikiMultiHopQA demonstrate that A2RAG achieves +9.9/+11.8 absolute gains in Recall@2, while cutting token consumption and end-to-end latency by about 50% relative to iterative multihop baselines.

Updated: 2026-01-29 01:58:30

标题: A2RAG：面向成本感知和可靠推理的自适应代理图检索

摘要: 图检索增强生成（Graph-RAG）通过将语料库组织成知识图并通过关系结构路由证据，提高了多跳问题回答的效果。然而，实际部署面临两个持久的瓶颈：（i）混合难度的工作负载，其中一揽子检索要么浪费在简单查询上，要么在复杂的多跳案例上失败，以及（ii）提取损失，其中图抽象省略了细粒度的修饰语，这些修饰语仅存在于原始文本中。我们提出了A2RAG，一种适应性和主动性的GraphRAG框架，用于成本感知和可靠的推理。A2RAG结合了一个自适应控制器，验证证据的充分性，并仅在必要时触发有针对性的细化，以及一个主动式检索器，逐步升级检索工作，并将图信号映射回来源文本，以便在提取损失和不完整图的情况下保持稳健。在HotpotQA和2WikiMultiHopQA上的实验表明，A2RAG在Recall@2方面取得了+9.9/+11.8的绝对增益，同时相对于迭代多跳基线，令牌消耗和端到端延迟约减少了50%。

更新时间: 2026-01-29 01:58:30

领域: cs.IR,cs.AI,cs.DB

下载: http://arxiv.org/abs/2601.21162v1

GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning

As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. For this, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group expectation weighting to exponentially reduce the variance of importance weights, with theoretical guarantees. Experiments show GEPO achieves superior stability - only a 3% performance drop from online to 1800s latency-and reduces the best-to-last gap by 85% versus GSPO (1.8 vs. 12.0) while attaining the highest scores, highlighting its effectiveness in decentralized, resource-heterogeneous environments.

Updated: 2026-01-29 01:55:58

标题: GEPO：面向稳定异构强化学习的群体期望策略优化

摘要: 随着单中心计算方法的功耗限制，分散式训练变得至关重要。然而，传统的强化学习（RL）方法，在增强大型模型后训练方面至关重要，但由于参数学习和轧制采样之间的紧密耦合，无法适应分散式分布式训练。因此，我们提出了HeteroRL，一个异质RL架构，解耦这些过程，实现通过互联网连接的地理分布节点之间的稳定训练。其核心组件是Group Expectation Policy Optimization（GEPO），一种异步RL算法，能够抵御由网络延迟或计算资源异构性引起的延迟。我们的研究表明，高延迟显著增加了KL散度，导致重要权重的方差增加和训练不稳定。GEPO通过使用组期望加权来指数级减少重要性权重的方差，具有理论保证，从而缓解了这个问题。实验表明，GEPO实现了卓越的稳定性-仅在线下到1800s延迟时性能下降了3％-并将最佳与最后差距减少了85％，相对于GSPO（1.8与12.0），同时获得最高分数，突显了其在分散式、资源异质性环境中的有效性。

更新时间: 2026-01-29 01:55:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.17850v9

Wireless Power Transfer and Intent-Driven Network Optimization in AAVs-assisted IoT for 6G Sustainable Connectivity

Autonomous Aerial Vehicle (AAV)-assisted Internet of Things (IoT) represents a collaborative architecture in which AAV allocate resources over 6G links to jointly enhance user-intent interpretation and overall network performance. Owing to this mutual dependence, improvements in intent inference and policy decisions on one component reinforce the efficiency of others, making highly reliable intent prediction and low-latency action execution essential. Although numerous approaches can model intent relationships, they encounter severe obstacles when scaling to high-dimensional action sequences and managing intensive on-board computation. We propose an Intent-Driven Framework for Autonomous Network Optimization comprising prediction and decision modules. First, implicit intent modeling is adopted to mitigate inaccuracies arising from ambiguous user expressions. For prediction, we introduce Hyperdimensional Transformer (HDT), which embeds data into a Hyperdimensional space via Hyperdimensional vector encoding and replaces standard matrix and attention operations with symbolic Hyperdimensional computations. For decision-making, where AAV must respond to user intent while planning trajectories, we design Double Actions based Multi-Agent Proximal Policy Optimization (DA-MAPPO). Building upon MAPPO, it samples actions through two independently parameterized networks and cascades the user-intent network into the trajectory network to maintain action dependencies. We evaluate our framework on a real IoT action dataset with authentic wireless data. Experimental results demonstrate that HDT and DA-MAPPO achieve superior performance across diverse scenarios.

Updated: 2026-01-29 01:54:13

标题: AAVs辅助物联网中的无线能量传输和意图驱动网络优化对于6G可持续连接的影响

摘要: Autonomous Aerial Vehicle (AAV)-assisted Internet of Things (IoT)代表了一种协作架构，其中AAV通过6G链路分配资源，共同增强用户意图解释和整体网络性能。由于这种相互依赖，对一个组件的意图推断和策略决策的改进会增强其他组件的效率，使得高度可靠的意图预测和低延迟的行动执行成为必不可少的。尽管有许多方法可以建模意图关系，但当扩展到高维动作序列和管理密集的机载计算时，它们遇到了严重障碍。我们提出了一个用于自主网络优化的意图驱动框架，包括预测和决策模块。首先，采用隐式意图建模来减轻由于用户表达模糊而产生的不准确性。对于预测，我们引入了超维变换器（HDT），通过超维向量编码将数据嵌入到一个超维空间中，并用符号超维计算取代标准矩阵和注意力操作。对于决策，AAV必须在规划轨迹的同时响应用户意图，我们设计了基于双动作的多智能体近端策略优化（DA-MAPPO）。在MAPPO的基础上，它通过两个独立参数化的网络对动作进行采样，并将用户意图网络级联到轨迹网络中以维持动作依赖性。我们在一个真实的IoT动作数据集上使用真实的无线数据对我们的框架进行评估。实验结果表明，HDT和DA-MAPPO在各种场景中均取得了优越的性能。

更新时间: 2026-01-29 01:54:13

领域: cs.AI

下载: http://arxiv.org/abs/2511.18368v2

NeuraLSP: An Efficient and Rigorous Neural Left Singular Subspace Preconditioner for Conjugate Gradient Methods

Numerical techniques for solving partial differential equations (PDEs) are integral for many fields across science and engineering. Such techniques usually involve solving large, sparse linear systems, where preconditioning methods are critical. In recent years, neural methods, particularly graph neural networks (GNNs), have demonstrated their potential through accelerated convergence. Nonetheless, to extract connective structures, existing techniques aggregate discretized system matrices into graphs, and suffer from rank inflation and a suboptimal convergence rate. In this paper, we articulate NeuraLSP, a novel neural preconditioner combined with a novel loss metric that leverages the left singular subspace of the system matrix's near-nullspace vectors. By compressing spectral information into a fixed low-rank operator, our method exhibits both theoretical guarantees and empirical robustness to rank inflation, affording up to a 53% speedup. Besides the theoretical guarantees for our newly-formulated loss function, our comprehensive experimental results across diverse families of PDEs also substantiate the aforementioned theoretical advances.

Updated: 2026-01-29 01:54:05

标题: NeuraLSP：共轭梯度法的高效严谨的神经左奇异子空间预处理器

摘要: 数值技术在解决偏微分方程（PDEs）方面对于科学和工程领域至关重要。这些技术通常涉及解决大型、稀疏的线性系统，其中预处理方法至关重要。近年来，神经方法，特别是图神经网络（GNNs），通过加速收敛展示了潜力。然而，为了提取连接结构，现有技术将离散系统矩阵聚合成图，并遭受秩膨胀和次优收敛速度的困扰。在本文中，我们阐述了NeuraLSP，这是一种结合了新型损失度量的神经预处理器，利用了系统矩阵近似零空间向量的左奇异子空间。通过将谱信息压缩成固定的低秩算子，我们的方法在理论上具有保证，并且在秩膨胀方面表现出实验证明的鲁棒性，提供了高达53%的加速。除了对我们新制定的损失函数的理论保证外，我们在各种PDE族族的广泛实验结果也证实了前述的理论进展。

更新时间: 2026-01-29 01:54:05

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.20174v2

GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents

GUI agents enable end-to-end automation through direct perception of and interaction with on-screen interfaces. However, these agents frequently access interfaces containing sensitive personal information, and screenshots are often transmitted to remote models, creating substantial privacy risks. These risks are particularly severe in GUI workflows: GUIs expose richer, more accessible private information, and privacy risks depend on interaction trajectories across sequential scenes. We propose GUIGuard, a three-stage framework for privacy-preserving GUI agents: (1) privacy recognition, (2) privacy protection, and (3) task execution under protection. We further construct GUIGuard-Bench, a cross-platform benchmark with 630 trajectories and 13,830 screenshots, annotated with region-level privacy grounding and fine-grained labels of risk level, privacy category, and task necessity. Evaluations reveal that existing agents exhibit limited privacy recognition, with state-of-the-art models achieving only 13.3% accuracy on Android and 1.4% on PC. Under privacy protection, task-planning semantics can still be maintained, with closed-source models showing stronger semantic consistency than open-source ones. Case studies on MobileWorld show that carefully designed protection strategies achieve higher task accuracy while preserving privacy. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: https://futuresis.github.io/GUIGuard-page/

Updated: 2026-01-29 01:37:29

标题: GUI保护：面向隐私保护GUI代理的通用框架

摘要: GUI代理通过直接感知和与屏幕界面进行交互实现端到端自动化。然而，这些代理经常访问包含敏感个人信息的界面，并且截屏通常会传输到远程模型，从而造成重大的隐私风险。在GUI工作流中，这些风险尤为严重：GUI暴露了更丰富、更易访问的私人信息，并且隐私风险取决于跨连续场景的交互轨迹。我们提出了GUIGuard，一个用于保护隐私的GUI代理的三阶段框架：（1）隐私识别，（2）隐私保护，以及（3）在保护下执行任务。我们进一步构建了GUIGuard-Bench，一个跨平台基准测试，包含630个轨迹和13,830张截图，注释了区域级隐私基础和风险级别、隐私类别以及任务必要性的细粒度标签。评估结果显示，现有代理在隐私识别方面表现有限，最先进的模型在Android上的准确率仅为13.3%，在PC上为1.4%。在隐私保护下，任务规划语义仍然可以被保持，闭源模型显示出比开源模型更强的语义一致性。在MobileWorld上的案例研究表明，精心设计的保护策略可以在保护隐私的同时实现更高的任务准确性。我们的结果突显了隐私识别作为实用GUI代理的一个关键瓶颈。项目链接：https://futuresis.github.io/GUIGuard-page/

更新时间: 2026-01-29 01:37:29

领域: cs.CR,cs.AI,cs.CV

下载: http://arxiv.org/abs/2601.18842v2

Bridging the Arithmetic Gap: The Cognitive Complexity Benchmark and Financial-PoT for Robust Financial Reasoning

While Large Language Models excel at semantic tasks, they face a critical bottleneck in financial quantitative reasoning, frequently suffering from "Arithmetic Hallucinations" and a systemic failure mode we term "Cognitive Collapse". To strictly quantify this phenomenon, we introduce the Cognitive Complexity Benchmark (CCB), a robust evaluation framework grounded in a dataset constructed from 95 real-world Chinese A-share annual reports. Unlike traditional datasets, the CCB stratifies financial queries into a three-dimensional taxonomy, Data Source, Mapping Difficulty, and Result Unit, enabling the precise diagnosis of reasoning degradation in high-cognitive-load scenarios. To address these failures, we propose the Iterative Dual-Phase Financial-PoT framework. This neuro-symbolic architecture enforces a strict architectural decoupling: it first isolates semantic variable extraction and logic formulation, then offloads computation to an iterative, self-correcting Python sandbox to ensure deterministic execution. Evaluation on the CCB demonstrates that while standard Chain-of-Thought falters on complex tasks, our approach offers superior robustness, elevating the Qwen3-235B model's average accuracy from 59.7\% to 67.3\% and achieving gains of up to 10-fold in high-complexity reasoning tasks. These findings suggest that architectural decoupling is a critical enabling factor for improving reliability in financial reasoning tasks, providing a transferable architectural insight for precision-critical domains that require tight alignment between semantic understanding and quantitative computation.

Updated: 2026-01-29 01:33:33

标题: 填补算术差距：认知复杂性基准和金融-PoT用于健全的金融推理

摘要: 尽管大型语言模型在语义任务上表现出色，但它们在金融量化推理方面面临着一个关键瓶颈，经常遭受“算术幻觉”和我们所称的“认知崩溃”系统性失败模式的困扰。为了严格量化这一现象，我们引入了认知复杂性基准（CCB），这是一个基于由95份真实中国A股年度报告构建的数据集的坚固评估框架。与传统数据集不同，CCB将财务查询分层为三维分类，数据来源、映射难度和结果单位，从而使得在高认知负荷场景中推理退化的诊断变得更加精确。为了解决这些失败，我们提出了迭代双阶段金融-PoT框架。这种神经符号架构强制进行严格的架构解耦：首先隔离语义变量提取和逻辑形式化，然后将计算卸载到一个迭代的、自我修正的Python沙箱中，以确保确定性执行。在CCB上的评估表明，虽然标准的思维链在复杂任务上表现不佳，但我们的方法提供了更高的稳健性，将Qwen3-235B模型的平均准确率从59.7%提升到67.3%，在高复杂性推理任务中获得了高达10倍的增益。这些发现表明，架构解耦是提高金融推理任务可靠性的关键因素，为需要在语义理解和量化计算之间紧密对齐的精密关键领域提供了可转移的架构洞见。

更新时间: 2026-01-29 01:33:33

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.21157v1

Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation

Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, while visual-only training objectives allow queries to converge to arbitrary salient objects. We propose Audio-Centric Query Generation using cross-attention, enabling each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding. Additionally, we introduce Sound-Aware Ordinal Counting (SAOC) loss that explicitly supervises sounding object numbers through ordinal regression with monotonic consistency constraints, preventing visual-only convergence during training. Experiments on AVISeg benchmark demonstrate consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.

Updated: 2026-01-29 01:31:08

标题: 学习如何倾听：增强声源关联以实现稳健的视听实例分割

摘要: 音频视觉实例分割（AVIS）需要准确定位和跟踪视频序列中的发声对象。现有方法受到两个根本问题造成的视觉偏见的影响：均匀的附加融合阻止查询专门针对不同的声源进行特化，而仅基于视觉的训练目标使查询收敛到任意显著对象。我们提出了使用交叉注意力的音频中心查询生成，使每个查询能够选择性地关注不同的声源并将声音特定的先验信息带入视觉解码。此外，我们引入了声音感知序数计数（SAOC）损失，通过具有单调一致性约束的序数回归明确监督发声对象数量，防止训练期间仅依赖视觉的收敛。在AVISeg基准测试上的实验表明一致的改进：+1.64 mAP、+0.6 HOTA和+2.06 FSLA，验证了查询特化和明确计数监督对准确的音频视觉实例分割至关重要。

更新时间: 2026-01-29 01:31:08

领域: eess.AS,cs.AI,cs.MM,cs.SD

下载: http://arxiv.org/abs/2509.22740v2

Virtuous Machines: Towards Artificial General Science

Artificial intelligence systems are transforming scientific discovery by accelerating specific research tasks, from protein structure prediction to materials design, yet remain confined to narrow domains requiring substantial human oversight. The exponential growth of scientific literature and increasing domain specialisation constrain researchers' capacity to synthesise knowledge across disciplines and develop unifying theories, motivating exploration of more general-purpose AI systems for science. Here we show that a domain-agnostic, agentic AI Scientist system can independently navigate the scientific workflow - from hypothesis generation through data collection to manuscript preparation. The system autonomously designed and executed three psychological studies on visual working memory, mental rotation, and imagery vividness, executed one new online data collection with 288 participants, developed analysis pipelines through 8-hour+ continuous coding sessions, and produced completed manuscripts. The results demonstrate the capability of AI scientific discovery pipelines to conduct non-trivial research with theoretical reasoning and methodological rigour comparable to experienced researchers, though with limitations in conceptual nuance and theoretical interpretation. This is a step toward embodied AI that can test hypotheses through real-world experiments, accelerating discovery by autonomously exploring regions of scientific space that human cognitive and resource constraints might otherwise leave unexplored. It raises important questions about the nature of scientific understanding and the attribution of scientific credit.

Updated: 2026-01-29 01:26:14

标题: 善良的机器：走向人工通用科学

摘要: 人工智能系统正在通过加速特定的研究任务，从蛋白质结构预测到材料设计，改变科学发现，但仍然局限于需要大量人类监督的狭窄领域。科学文献的指数增长和领域专业化的增加限制了研究人员跨学科综合知识和发展统一理论的能力，促使探索更通用的人工智能系统用于科学。在这里，我们展示了一个领域无关、主动的AI科学家系统可以独立地在科学工作流程中导航 - 从假设生成到数据收集再到手稿准备。该系统自主设计并执行了三项心理学研究，涉及视觉工作记忆、心理旋转和形象生动度，进行了一项新的在线数据收集，共有288名参与者，通过8小时以上的连续编码会话开发了分析流程，并制作了完成的手稿。结果表明，AI科学发现流程有能力进行具有理论推理和方法论严谨性的非平凡研究，尽管在概念细微差别和理论解释方面存在限制。这是迈向具有实体AI的一步，可以通过真实世界实验来验证假设，通过自主探索科学空间的区域，人类认知和资源约束可能会被留下未开发的。这引发了关于科学理解的本质和科学信誉归属的重要问题。

更新时间: 2026-01-29 01:26:14

领域: cs.AI,cs.ET

下载: http://arxiv.org/abs/2508.13421v2

Can Neural Networks Learn Small Algebraic Worlds? An Investigation Into the Group-theoretic Structures Learned By Narrow Models Trained To Predict Group Operations

While a real-world research program in mathematics may be guided by a motivating question, the process of mathematical discovery is typically open-ended. Ideally, exploration needed to answer the original question will reveal new structures, patterns, and insights that are valuable in their own right. This contrasts with the exam-style paradigm in which the machine learning community typically applies AI to math. To maximize progress in mathematics using AI, we will need to go beyond simple question answering. With this in mind, we explore the extent to which narrow models trained to solve a fixed mathematical task learn broader mathematical structure that can be extracted by a researcher or other AI system. As a basic test case for this, we use the task of training a neural network to predict a group operation (for example, performing modular arithmetic or composition of permutations). We describe a suite of tests designed to assess whether the model captures significant group-theoretic notions such as the identity element, commutativity, or subgroups. Through extensive experimentation we find evidence that models learn representations capable of capturing abstract algebraic properties. For example, we find hints that models capture the commutativity of modular arithmetic. We are also able to train linear classifiers that reliably distinguish between elements of certain subgroups (even though no labels for these subgroups are included in the data). On the other hand, we are unable to extract notions such as the concept of the identity element. Together, our results suggest that in some cases the representations of even small neural networks can be used to distill interesting abstract structure from new mathematical objects.

Updated: 2026-01-29 01:18:22

标题: 神经网络能否学习小代数世界？研究窄模型训练预测群操作所学习的群论结构。

摘要: 尽管数学中的实际研究项目可能由一个激发问题引导，但数学发现的过程通常是开放式的。理想情况下，用于回答原始问题的探索将揭示出具有独立价值的新结构、模式和见解。这与机器学习社区通常将人工智能应用于数学的考试风格范式形成鲜明对比。为了最大程度地利用人工智能推动数学进展，我们需要超越简单的问题回答。基于此，我们探讨了窄模型训练以解决固定数学任务的程度，以了解研究人员或其他人工智能系统可以提取的更广泛的数学结构。作为这一基本测试案例，我们使用训练神经网络来预测群操作的任务（例如执行模算术或排列的组合）。我们描述了一系列测试，旨在评估模型是否捕捉到重要的群论概念，如单位元、可交换性或子群。通过广泛的实验，我们发现证据表明模型学会了能够捕捉抽象代数性质的表示。例如，我们发现模型似乎捕捉到了模算术的可交换性。我们还能够训练线性分类器，可可靠地区分某些子群的元素（即使数据中未包含这些子群的标签）。另一方面，我们无法提取单位元素的概念。总的来说，我们的结果表明，即使是小型神经网络的表示在某些情况下也可以用来从新的数学对象中提炼出有趣的抽象结构。

更新时间: 2026-01-29 01:18:22

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21150v1

Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement

Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., POI's function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived, from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.

Updated: 2026-01-29 01:12:35

标题: 移动嵌入的POIs：从人类活动中学习地点的含义和使用方式

摘要: 最近在地理空间基础模型方面取得的进展突显了学习面向现实世界位置的通用表示的重要性，特别是人类活动集中的兴趣点（POIs）。然而，现有方法主要侧重于从静态文本元数据中派生的地点身份，或者学习与轨迹上下文相关联的表示，这些表示捕捉了移动规律，而不是地点实际用途（即POI的功能）。我们认为，POI功能是通用POI表示中缺失但必不可少的信号。我们引入了移动嵌入POIs（ME-POIs）框架，该框架通过将大规模人类移动数据与语言模型派生的POI嵌入相结合，学习基于实际使用的POI为中心的、与上下文无关的表示。ME-POIs将个体访问编码为在时间上下文中的嵌入，并通过对比学习将其与可学习的POI表示对齐，以捕捉用户和时间跨度内的使用模式。为了解决长尾稀疏性，我们提出了一种新颖的机制，从附近频繁访问的POIs中传播时间访问模式，跨多个空间尺度。我们在五项新提出的地图丰富任务上评估了ME-POIs的性能，测试其捕捉POI的身份和功能的能力。在所有任务中，将基于文本的嵌入与ME-POIs相结合的方法始终优于仅文本和仅移动的基准线。值得注意的是，仅基于移动数据训练的ME-POIs在某些任务上可以超越仅文本模型，突显了POI功能是准确和通用的POI表示的关键组成部分。

更新时间: 2026-01-29 01:12:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.21149v1

Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

Updated: 2026-01-29 01:12:24

标题: 计划、验证和填充：扩散语言模型的结构化并行解码方法

摘要: 扩散语言模型（DLMs）提出了一种有希望的非顺序范例，用于文本生成，与标准的自回归（AR）方法有所不同。然而，当前的解码策略通常采用一种反应性立场，未充分利用全局双向上下文来指导全局轨迹。为了解决这个问题，我们提出了Plan-Verify-Fill（PVF），这是一种无需训练的范式，通过定量验证来基础规划。PVF通过优先考虑高杠杆语义锚点来积极构建分层框架，并采用验证协议来实现实用的结构停止，进一步的思考会带来递减的回报。在LLaDA-8B-Instruct和Dream-7B-Instruct上的大量评估表明，与基准数据集上基于置信度的并行解码相比，PVF将功能评估数（NFE）减少了高达65％，在不牺牲准确性的情况下提高了效率。

更新时间: 2026-01-29 01:12:24

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2601.12247v2

BrainStack: Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding

Decoding linguistic information from electroencephalography (EEG) remains challenging due to the brain's distributed and nonlinear organization. We present BrainStack, a functionally guided neuro-mixture-of-experts (Neuro-MoE) framework that models the brain's modular functional architecture through anatomically partitioned expert networks. Each functional region is represented by a specialized expert that learns localized neural dynamics, while a transformer-based global expert captures cross-regional dependencies. A learnable routing gate adaptively aggregates these heterogeneous experts, enabling context-dependent expert coordination and selective fusion. To promote coherent representation across the hierarchy, we introduce cross-regional distillation, where the global expert provides top-down regularization to the regional experts. We further release SilentSpeech-EEG (SS-EEG), a large-scale benchmark comprising over 120 hours of EEG recordings from 12 subjects performing 24 silent words, the largest dataset of its kind. Experiments demonstrate that BrainStack consistently outperforms state-of-the-art models, achieving superior accuracy and generalization across subjects. Our results establish BrainStack as a functionally modular, neuro-inspired MoE paradigm that unifies neuroscientific priors with adaptive expert routing, paving the way for scalable and interpretable brain-language decoding.

Updated: 2026-01-29 01:04:47

标题: BrainStack：一种功能引导的专家路由的神经-MoE用于基于EEG的语言解码

摘要: 从脑电图（EEG）中解码语言信息仍然具有挑战性，因为大脑具有分布式和非线性的组织结构。我们提出了BrainStack，这是一个功能引导的神经专家混合（Neuro-MoE）框架，通过解剖分区的专家网络模拟大脑的模块化功能架构。每个功能区域由一个专门的专家代表，学习局部神经动态，而基于转换器的全局专家捕获跨区域的依赖关系。可学习的路由门自适应地聚合这些异质专家，实现依赖上下文的专家协调和选择性融合。为了促进层次结构中的连贯表达，我们引入了跨区域蒸馏，其中全局专家为区域专家提供自上而下的正则化。我们进一步发布了SilentSpeech-EEG（SS-EEG），这是一个大规模基准，包括来自12名受试者的超过120小时的EEG记录，执行24个无声词，是其类别中最大的数据集。实验证明，BrainStack始终优于最先进的模型，在受试者之间实现了优越的准确性和泛化能力。我们的结果将BrainStack确定为一个功能模块化、受神经启发的MoE范例，将神经科学先验与自适应专家路由相结合，为可扩展和可解释的大脑语言解码铺平道路。

更新时间: 2026-01-29 01:04:47

领域: cs.AI

下载: http://arxiv.org/abs/2601.21148v1

Optimization and Mobile Deployment for Anthropocene Neural Style Transfer

This paper presents AnthropoCam, a mobile-based neural style transfer (NST) system optimized for the visual synthesis of Anthropocene environments. Unlike conventional artistic NST, which prioritizes painterly abstraction, stylizing human-altered landscapes demands a careful balance between amplifying material textures and preserving semantic legibility. Industrial infrastructures, waste accumulations, and modified ecosystems contain dense, repetitive patterns that are visually expressive yet highly susceptible to semantic erosion under aggressive style transfer. To address this challenge, we systematically investigate the impact of NST parameter configurations on the visual translation of Anthropocene textures, including feature layer selection, style and content loss weighting, training stability, and output resolution. Through controlled experiments, we identify an optimal parameter manifold that maximizes stylistic expression while preventing semantic erasure. Our results demonstrate that appropriate combinations of convolutional depth, loss ratios, and resolution scaling enable the faithful transformation of anthropogenic material properties into a coherent visual language. Building on these findings, we implement a low-latency, feed-forward NST pipeline deployed on mobile devices. The system integrates a React Native frontend with a Flask-based GPU backend, achieving high-resolution inference within 3-5 seconds on general mobile hardware. This enables real-time, in-situ visual intervention at the site of image capture, supporting participatory engagement with Anthropocene landscapes. By coupling domain-specific NST optimization with mobile deployment, AnthropoCam reframes neural style transfer as a practical and expressive tool for real-time environmental visualization in the Anthropocene.

Updated: 2026-01-29 00:50:03

标题: 《人类世神经风格迁移的优化和移动部署》

摘要: 本文介绍了AnthropoCam，这是一个专为视觉合成人类世环境而优化的基于移动设备的神经风格迁移（NST）系统。与传统的艺术风格迁移不同，后者更注重油画般的抽象，风格化人类改变过的景观需要在放大材质纹理和保留语义可读性之间保持谨慎的平衡。工业基础设施、废物堆积和改造的生态系统包含密集、重复的图案，这些图案在视觉上富有表现力，但在激进的风格迁移下极易受到语义侵蚀。为了解决这一挑战，我们系统地研究了NST参数配置对人类世纹理的视觉转换的影响，包括特征层选择、风格和内容损失加权、训练稳定性和输出分辨率。通过控制实验，我们确定了一种最优参数组合，可以在最大程度上表达风格，同时防止语义抹除。我们的结果表明，卷积深度、损失比率和分辨率缩放的适当组合能够将人为材料属性忠实地转化为连贯的视觉语言。基于这些发现，我们实现了一个低延迟的、前馈式的NST管道，部署在移动设备上。该系统将React Native前端与基于Flask的GPU后端集成，可在一般移动硬件上实现高分辨率推理，在3-5秒内进行。这使得在图像捕捉现场实时进行视觉干预成为可能，支持与人类世景观的参与性互动。通过将特定领域的NST优化与移动部署相结合，AnthropoCam重新定义了神经风格迁移，将其作为在人类世中进行实时环境可视化的实用和富有表现力的工具。

更新时间: 2026-01-29 00:50:03

领域: cs.HC,cs.AI,cs.GR

下载: http://arxiv.org/abs/2601.21141v1

Navigating the Rabbit Hole: Emergent Biases in LLM-Generated Attack Narratives Targeting Mental Health Groups

Large Language Models (LLMs) have been shown to demonstrate imbalanced biases against certain groups. However, the study of unprovoked targeted attacks by LLMs towards at-risk populations remains underexplored. Our paper presents three novel contributions: (1) the explicit evaluation of LLM-generated attacks on highly vulnerable mental health groups; (2) a network-based framework to study the propagation of relative biases; and (3) an assessment of the relative degree of stigmatization that emerges from these attacks. Our analysis of a recently released large-scale bias audit dataset reveals that mental health entities occupy central positions within attack narrative networks, as revealed by a significantly higher mean centrality of closeness (p-value = 4.06e-10) and dense clustering (Gini coefficient = 0.7). Drawing from an established stigmatization framework, our analysis indicates increased labeling components for mental health disorder-related targets relative to initial targets in generation chains. Taken together, these insights shed light on the structural predilections of large language models to heighten harmful discourse and highlight the need for suitable approaches for mitigation.

Updated: 2026-01-29 00:29:49

标题: 穿越兔子洞：针对心理健康群体的LLM生成攻击叙事中的新出现偏见

摘要: 大型语言模型（LLMs）已被证明存在针对某些群体的不平衡偏见。然而，对LLMs对处于风险群体的无端攻击的研究仍未得到充分探讨。我们的论文提出了三个新颖贡献：（1）对高度脆弱的精神健康群体的LLM生成攻击进行显式评估；（2）一个基于网络的框架来研究相对偏见的传播；以及（3）对这些攻击所产生的污名化程度进行评估。我们对最近发布的大规模偏见审计数据集的分析显示，精神健康实体在攻击叙事网络中占据核心位置，这是通过接近度（p值=4.06e-10）和密集聚类（基尼系数=0.7）的显著高均值中心性揭示的。根据一个已建立的污名化框架，我们的分析表明，与生成链中的初始目标相比，精神健康障碍相关目标的标签化组件增加。综合这些见解，我们揭示了大型语言模型加剧有害言论的结构偏好，并强调了减缓的必要性。

更新时间: 2026-01-29 00:29:49

领域: cs.CL,cs.AI,cs.CY,cs.LG,cs.SI

下载: http://arxiv.org/abs/2504.06160v4

A Comparative Study on How Data Normalization Affects Zero-Shot Generalization in Time Series Foundation Models

We investigate input normalization methods for Time-Series Foundation Models (TSFMs). While normalization is well-studied in dataset-specific time-series models, it remains overlooked in TSFMs where generalization is critical. Time-series data, unlike text or images, exhibits significant scale variation across domains and channels, coupled with non-stationarity, can undermine TSFM performance regardless of architectural complexity. Through systematic evaluation across four architecturally diverse TSFMs, we empirically establish REVIN as the most efficient approach, reducing zero-shot MASE by 89\% relative to an un-normalized baseline and by 44\% versus other normalization methods, while matching the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing -- yielding the highest accuracy-efficiency trade-off. Yet its effect utilization depends on architectural design choices and optimization objective, particularly with respect to training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).

Updated: 2026-01-29 00:23:54

标题: 一项关于数据归一化如何影响时间序列基础模型零样本泛化能力的比较研究

摘要: 我们研究了时间序列基础模型（TSFMs）的输入归一化方法。虽然数据集特定的时间序列模型中对归一化进行了深入研究，但在TSFMs中却被忽视，其中泛化性是至关重要的。与文本或图像不同，时间序列数据在不同领域和通道之间呈现显著的尺度变化，加上非平稳性，可能会削弱TSFM的性能，而这与架构复杂性无关。通过对四种架构多样的TSFMs进行系统评估，我们从经验上确定REVIN作为最有效的方法，相对于未归一化的基线，将零点MASE降低了89\%，比其他归一化方法降低了44\%，同时在没有任何数据集级预处理的情况下与最佳领域准确性（0.84 MASE）相匹配，产生了最高的准确性和效率的折衷。然而，其效果利用取决于架构设计选择和优化目标，特别是在训练损失尺度敏感性和模型类型（概率、点预测或基于LLM的模型）方面。

更新时间: 2026-01-29 00:23:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.02833v2

Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program Repair

Agentic Automated Program Repair (APR) is increasingly tackling complex, repository-level bugs in industry, but ultimately these patches still need to be reviewed by a human before committing them to ensure they address the bug. Showing patches unlikely to be accepted can lead to substantial noise, wasting valuable developer time and eroding trust in automated code changes. We introduce two complementary LLM-based policies to reduce such noise: bug abstention and patch validation policies. Bug abstention excludes bugs that the agentic APR system is unlikely to fix. Patch validation rejects patches that are unlikely to be a good fix for the given bug. We evaluate both policies on three sets of bugs from Google's codebase, and their candidate patches generated by an internal agentic APR system. On a set of 174 human-reported bugs, removing bugs and patches rejected by our policies can raise success rates by up to 13 percentage points and 15 percentage points, respectively, and by up to 39 percentage points in combination. On null pointer exceptions and sanitizer-reported bugs with machine-generated bug reports, patch validation also improves average single-sample success rates. This two-policy approach provides a practical path to the reliable, industrial-scale deployment of agentic APR systems.

Updated: 2026-01-29 00:09:30

标题: 弃权和验证：减少主体化程序修复中噪音的双重LLM政策

摘要: 主动自动化程序修复（APR）越来越多地解决行业中复杂的存储库级别错误，但最终这些补丁仍然需要经过人工审查才能提交，以确保它们解决了错误。显示不太可能被接受的补丁可能会导致大量噪音，浪费宝贵的开发人员时间，并破坏对自动化代码更改的信任。我们引入了两种互补的基于LLM的策略来减少这种噪音：错误放弃和补丁验证策略。错误放弃排除了主动APR系统不太可能修复的错误。补丁验证拒绝了不太可能是给定错误的良好修复的补丁。我们在谷歌代码库中的三组错误集上评估了这两种策略，以及内部主动APR系统生成的候选补丁。在一组174个人报告的错误中，通过我们的策略拒绝的错误和补丁可以将成功率分别提高高达13个百分点和15个百分点，结合起来可以提高高达39个百分点。在由机器生成的错误报告的空指针异常和消毒剂报告的错误中，补丁验证还提高了平均单样本成功率。这种双策略方法为可靠的、工业规模的主动APR系统部署提供了实际途径。

更新时间: 2026-01-29 00:09:30

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2510.03217v2

What You Feel Is Not What They See: On Predicting Self-Reported Emotion from Third-Party Observer Labels

Self-reported emotion labels capture internal experience, while third-party labels reflect external perception. These perspectives often diverge, limiting the applicability of third-party-trained models to self-report contexts. This gap is critical in mental health, where accurate self-report modeling is essential for guiding intervention. We present the first cross-corpus evaluation of third-party-trained models on self-reports. We find activation unpredictable (CCC approximately 0) and valence moderately predictable (CCC approximately 0.3). Crucially, when content is personally significant to the speaker, models achieve high performance for valence (CCC approximately 0.6-0.8). Our findings point to personal significance as a key pathway for aligning external perception with internal experience and underscore the challenge of self-report activation modeling.

Updated: 2026-01-29 00:07:54

标题: 你感受到的并不是他们所看到的：关于从第三方观察者标签预测自我报告情绪

摘要: 自我报告的情绪标签捕捉了内在体验，而第三方标签反映了外部感知。这些观点经常不一致，限制了第三方训练模型在自我报告环境中的适用性。在心理健康领域，这种差距至关重要，准确的自我报告建模对指导干预至关重要。我们首次对第三方训练模型在自我报告上进行了跨语料库评估。我们发现激活是不可预测的（CCC约为0），而情感价值是比较可预测的（CCC约为0.3）。关键是，当内容对说话者个人具有重要意义时，模型在情感价值方面表现出高性能（CCC约为0.6-0.8）。我们的发现指出，个人意义是将外部感知与内在体验对齐的关键路径，并强调了自我报告激活建模的挑战。

更新时间: 2026-01-29 00:07:54

领域: cs.AI

下载: http://arxiv.org/abs/2601.21130v1

Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation

Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference, despite the highly non-isomorphic relationship between sign and spoken languages, where multiple translations can be equally valid. This limitation constrains both model training and evaluation, particularly for n-gram-based metrics such as BLEU. In this work, we investigate the use of Large Language Models to automatically generate paraphrased variants of written-language translations as synthetic alternative references for SLT. First, we compare multiple paraphrasing strategies and models using an adapted ParaScore metric. Second, we study the impact of paraphrases on both training and evaluation of the pose-based T5 model on the YouTubeASL and How2Sign datasets. Our results show that naively incorporating paraphrases during training does not improve translation performance and can even be detrimental. In contrast, using paraphrases during evaluation leads to higher automatic scores and better alignment with human judgments. To formalize this observation, we introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references. Human evaluation confirms that BLEUpara correlates more strongly with perceived translation quality. We release all generated paraphrases, generation and evaluation code to support reproducible and more reliable evaluation of SLT systems.

Updated: 2026-01-29 00:02:19

标题: 超越单一参考：手语翻译中的释义训练和评估

摘要: 大多数手语翻译（SLT）语料库将每个手势话语与单个书面语言参考配对，尽管手语和口头语言之间存在高度非同构的关系，其中多种翻译可能同样有效。这种限制限制了模型训练和评估，尤其是对于基于n-gram的指标如BLEU来说。在这项工作中，我们研究了使用大型语言模型自动生成书面语翻译的改写变体作为SLT的合成替代参考的可行性。首先，我们使用调整后的ParaScore指标比较了多种改写策略和模型。其次，我们研究了改写对YouTubeASL和How2Sign数据集上基于姿势的T5模型的训练和评估的影响。我们的结果表明，在训练过程中天真地加入改写并不会提高翻译性能，甚至可能有害。相反，使用改写在评估过程中会导致更高的自动分数，并与人类判断更好地对齐。为了形式化这一观察结果，我们引入了BLEUpara，这是对多个改写参考进行翻译评估的BLEU扩展。人类评估证实BLEUpara与感知的翻译质量更强烈相关。我们发布了所有生成的改写、生成和评估代码，以支持SLT系统的可重现和更可靠的评估。

更新时间: 2026-01-29 00:02:19

领域: cs.AI

下载: http://arxiv.org/abs/2601.21128v1