    _              _         ____
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        

Articles: 18

Last Updated: 2025-10-25 23:12:36 (+00:00)


Regret Analysis of Average-Reward Unichain MDPs via an Actor-Critic Approach

Actor-Critic methods are widely used for their scalability, yet existing theoretical guarantees for infinite-horizon average-reward Markov Decision Processes (MDPs) often rely on restrictive ergodicity assumptions. We propose NAC-B, a Natural Actor-Critic with Batching, that achieves order-optimal regret of $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward MDPs under the unichain assumption, which permits both transient states and periodicity. This assumption is among the weakest under which the classic policy gradient theorem remains valid for average-reward settings. NAC-B employs function approximation for both the actor and the critic, enabling scalability to problems with large state and action spaces. The use of batching in our algorithm helps mitigate potential periodicity in the MDP and reduces stochasticity in gradient estimates, and our analysis formalizes these benefits through the introduction of the constants $C_{\text{hit}}$ and $C_{\text{tar}}$, which characterize the rate at which empirical averages over Markovian samples converge to the stationary distribution.
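
The batching idea is easy to see on a toy problem. The sketch below is a minimal tabular stand-in, not the paper's NAC-B: the 2-state unichain MDP, step sizes, and batch size are all illustrative assumptions. It averages average-reward TD errors (critic) and score-function terms (actor) over a batch of B Markovian samples before each update, which is the variance-reduction role batching plays in the analysis.

```python
import math, random

random.seed(0)

# Toy 2-state, 2-action unichain MDP (transition/reward tables are made up).
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P[s][a] = Pr(next state = 1)
R = {0: {0: 0.0, 1: 0.1}, 1: {0: 0.2, 1: 1.0}}   # R[s][a] = reward

theta = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # softmax policy logits (actor)
v = {0: 0.0, 1: 0.0}    # differential state values (critic)
rho = 0.0               # running average-reward estimate

def pi(s):
    z = [math.exp(theta[(s, a)]) for a in (0, 1)]
    t = sum(z)
    return [p / t for p in z]

s, B = 0, 64            # B Markovian samples are averaged before each update
for _ in range(300):
    grad = {k: 0.0 for k in theta}
    dv = {0: 0.0, 1: 0.0}
    drho = 0.0
    for _ in range(B):
        probs = pi(s)
        a = 0 if random.random() < probs[0] else 1
        s2 = 1 if random.random() < P[s][a] else 0
        delta = R[s][a] - rho + v[s2] - v[s]   # average-reward TD error
        drho += delta
        dv[s] += delta
        for b in (0, 1):                       # score-function term, TD-error weighted
            grad[(s, b)] += delta * ((1.0 if b == a else 0.0) - probs[b])
        s = s2
    rho += 0.01 * drho / B
    for st in v:
        v[st] += 0.05 * dv[st] / B
    for k in grad:
        theta[k] += 0.1 * grad[k] / B

print(round(pi(1)[1], 2), round(rho, 2))  # policy should favor action 1 in state 1
```

With batching, each parameter update sees an empirical average over B correlated samples, which is where constants like C_hit and C_tar enter the paper's analysis.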

Updated: 2025-10-25 23:12:36

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2505.19986v2

Evaluating Multimodal Large Language Models on Core Music Perception Tasks

Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet "listen" reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.

Updated: 2025-10-25 23:10:16

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2510.22455v1

On the Contractivity of Stochastic Interpolation Flow

We investigate stochastic interpolation, a recently introduced framework for high dimensional sampling which bears many similarities to diffusion modeling. Stochastic interpolation generates a data sample by first randomly initializing a particle drawn from a simple base distribution, then simulating deterministic or stochastic dynamics such that in finite time the particle's distribution converges to the target. We show that for a Gaussian base distribution and a strongly log-concave target distribution, the stochastic interpolation flow map is Lipschitz with a sharp constant which matches that of Caffarelli's theorem for optimal transport maps. We are further able to construct Lipschitz transport maps between non-Gaussian distributions, generalizing some recent constructions in the literature on transport methods for establishing functional inequalities. We discuss the practical implications of our theorem for the sampling and estimation problems required by stochastic interpolation.
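
For readers unfamiliar with the framework, the interpolant and the Lipschitz conclusion can be written schematically. The notation below is a common convention and an assumption on our part, not copied from the paper:

```latex
% Stochastic interpolant: x_0 ~ N(0, I_d) is the base sample, x_1 ~ \mu the target.
x_t \;=\; \alpha_t\, x_0 \;+\; \beta_t\, x_1,
\qquad \alpha_0 = \beta_1 = 1, \quad \alpha_1 = \beta_0 = 0.
% The deterministic flow transports base to target by integrating
%   dX_t/dt = v_t(X_t),
%   v_t(x) = E[ \dot\alpha_t x_0 + \dot\beta_t x_1 \mid x_t = x ].
% Caffarelli-type conclusion: if \mu is \kappa-strongly log-concave,
% the time-one flow map T satisfies Lip(T) <= \kappa^{-1/2},
% matching the sharp bound for the Brenier optimal transport map
% from a standard Gaussian.
```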

Updated: 2025-10-25 23:04:04

Categories: math.ST,cs.LG,stat.ML,stat.TH

Download: http://arxiv.org/abs/2504.10653v2

Confidence Sets for Multidimensional Scaling

We develop a formal statistical framework for classical multidimensional scaling (CMDS) applied to noisy dissimilarity data. We establish distributional convergence results for the embeddings produced by CMDS for various noise models, which enable the construction of \emph{bona~fide} uniform confidence sets for the latent configuration, up to rigid transformations. We further propose bootstrap procedures for constructing these confidence sets and provide theoretical guarantees for their validity. We find that the multiplier bootstrap adapts automatically to heteroscedastic noise such as multiplicative noise, while the empirical bootstrap seems to require homoscedasticity. Either form of bootstrap, when valid, is shown to substantially improve finite-sample accuracy. The empirical performance of the proposed methods is demonstrated through numerical experiments.
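
The multiplier bootstrap idea is easy to illustrate on a scalar statistic. The sketch below builds a bootstrap confidence interval for a mean under heteroscedastic noise by reweighting centered residuals with i.i.d. Gaussian multipliers; it is a toy stand-in for the paper's CMDS embedding setting, and the data-generating process and interval level are illustrative assumptions:

```python
import random, statistics

random.seed(1)

# Synthetic heteroscedastic sample: noise scale varies per observation.
n = 400
x = [(1.0 + 2.0 * random.random()) * random.gauss(0, 1) + 5.0 for _ in range(n)]

xbar = statistics.fmean(x)
resid = [xi - xbar for xi in x]

def multiplier_bootstrap_stat():
    # Perturb centered residuals with i.i.d. standard-normal multipliers;
    # this mimics the sampling fluctuation of the mean without resampling rows,
    # which is why it adapts to heteroscedasticity.
    return xbar + statistics.fmean(g * r for g, r in zip(
        (random.gauss(0, 1) for _ in range(n)), resid))

B = 2000
boot = sorted(multiplier_bootstrap_stat() for _ in range(B))
lo, hi = boot[int(0.025 * B)], boot[int(0.975 * B)]
print(round(lo, 2), round(hi, 2))   # a 95% confidence interval for the mean
```

The empirical bootstrap would instead resample rows with replacement, which ties each residual to a single shared noise scale and is why it appears to require homoscedasticity.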

Updated: 2025-10-25 23:02:10

Categories: math.ST,cs.LG,stat.ML,stat.TH,62H12, 62F40, 62E20, 62G05, 62R07, 91C15,G.3; F.2.2

Download: http://arxiv.org/abs/2510.22452v1

GraphTOP: Graph Topology-Oriented Prompting for Graph Neural Networks

Graph Neural Networks (GNNs) have revolutionized the field of graph learning by learning expressive graph representations from massive graph data. As a common pattern to train powerful GNNs, the "pre-training, adaptation" scheme first pre-trains GNNs over unlabeled graph data and subsequently adapts them to specific downstream tasks. In the adaptation phase, graph prompting is an effective strategy that modifies input graph data with learnable prompts while keeping pre-trained GNN models frozen. Typically, existing graph prompting studies mainly focus on *feature-oriented* methods that apply graph prompts to node features or hidden representations. However, these studies often achieve suboptimal performance, as they consistently overlook the potential of *topology-oriented* prompting, which adapts pre-trained GNNs by modifying the graph topology. In this study, we conduct a pioneering investigation of graph prompting in terms of graph topology. We propose the first **Graph** **T**opology-**O**riented **P**rompting (GraphTOP) framework to effectively adapt pre-trained GNN models for downstream tasks. More specifically, we reformulate topology-oriented prompting as an edge rewiring problem within multi-hop local subgraphs and relax it into the continuous probability space through reparameterization while ensuring tight relaxation and preserving graph sparsity. Extensive experiments on five graph datasets under four pre-training strategies demonstrate that our proposed GraphTOP outshines six baselines on multiple node classification datasets. Our code is available at https://github.com/xbfu/GraphTOP.

Updated: 2025-10-25 22:50:12

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.22451v1

SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks

The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, SELU) using a differentiable hard-mixture mechanism. In the second phase, each neuron's activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of varying depths. The analysis shows that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures.
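
A minimal sketch of the two phases for a single layer, assuming per-neuron selection logits. The logit values below are hypothetical, and gradient flow through the hard mixture (straight-through style) is omitted since there is no autograd here:

```python
import math

# Candidate pool named in the abstract.
ACTS = {
    "relu":       lambda x: max(0.0, x),
    "sigmoid":    lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh":       math.tanh,
    "leaky_relu": lambda x: x if x > 0 else 0.01 * x,
    "elu":        lambda x: x if x > 0 else math.exp(x) - 1.0,
    "selu":       lambda x: 1.0507 * (x if x > 0 else 1.67326 * (math.exp(x) - 1.0)),
}

def softmax(logits):
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def phase1_forward(x, logits):
    # Phase 1: hard mixture -- forward with the argmax of the soft weights.
    probs = softmax(logits)
    choice = max(probs, key=probs.get)
    return ACTS[choice](x), choice

# Per-neuron selection logits, as if learned in phase 1 (hypothetical values).
neuron_logits = [
    {"relu": 1.2, "sigmoid": -0.3, "tanh": 0.1, "leaky_relu": 0.9, "elu": -1.0, "selu": 0.0},
    {"relu": -0.5, "sigmoid": 0.2, "tanh": 1.8, "leaky_relu": 0.0, "elu": 0.3, "selu": -0.2},
]

y, name = phase1_forward(0.5, neuron_logits[0])

# Phase 2: freeze each neuron's activation to its learned argmax selection,
# so inference uses one fixed, vectorizable activation per neuron.
frozen = [max(softmax(l), key=softmax(l).get) for l in neuron_logits]
print(name, y, frozen)
```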

Updated: 2025-10-25 22:46:37

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.22450v1

Language Model Guided Reinforcement Learning in Quantitative Trading

Algorithmic trading requires short-term tactical decisions consistent with long-term financial objectives. Reinforcement Learning (RL) has been applied to such problems, but adoption is limited by myopic behaviour and opaque policies. Large Language Models (LLMs) offer complementary strategic reasoning and multi-modal signal interpretation when guided by well-structured prompts. This paper proposes a hybrid framework in which LLMs generate high-level trading strategies to guide RL agents. We evaluate (i) the economic rationale of LLM-generated strategies through expert review, and (ii) the performance of LLM-guided agents against unguided RL baselines using Sharpe Ratio (SR) and Maximum Drawdown (MDD). Empirical results indicate that LLM guidance improves both return and risk metrics relative to standard RL.
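
The two evaluation metrics are standard and easy to state in code. The sketch below, with a made-up return series, computes an annualized Sharpe Ratio and the Maximum Drawdown of the cumulative equity curve:

```python
import math, statistics

def sharpe_ratio(returns, periods_per_year=252, rf=0.0):
    # Annualized Sharpe ratio from per-period simple returns.
    excess = [r - rf / periods_per_year for r in returns]
    return statistics.fmean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

def max_drawdown(returns):
    # Largest peak-to-trough decline of the cumulative equity curve.
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        mdd = max(mdd, 1.0 - equity / peak)
    return mdd

rets = [0.01, -0.02, 0.015, 0.03, -0.01, 0.005]   # made-up daily returns
print(round(sharpe_ratio(rets), 2), round(max_drawdown(rets), 3))
```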

Updated: 2025-10-25 22:25:16

Categories: cs.LG,cs.CL,q-fin.TR,I.2.7; I.2.6; J.4

Download: http://arxiv.org/abs/2508.02366v3

RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration

The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing their generalization capabilities. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. Visual autoregressive modeling (VAR), a recently introduced approach for image generation, performs scale-space autoregression and achieves comparable performance to that of state-of-the-art diffusion transformers with drastically reduced computational costs. Moreover, our analysis reveals that coarse scales in VAR primarily capture degradations while finer scales encode scene detail, simplifying the restoration process. Motivated by this, we propose RestoreVAR, a novel VAR-based generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over $10\times$ faster inference. To optimally exploit the advantages of VAR for AiOR, we propose architectural modifications and improvements, including intricately designed cross-attention mechanisms and a latent-space refinement module, tailored for the AiOR task. Extensive experiments show that RestoreVAR achieves state-of-the-art performance among generative AiOR methods, while also exhibiting strong generalization capabilities.

Updated: 2025-10-25 22:06:16

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2505.18047v2

AegisMCP: Online Graph Intrusion Detection for Tool-Augmented LLMs on Edge Devices

In this work, we study the security of Model Context Protocol (MCP) agent toolchains and their applications in smart homes. We introduce AegisMCP, a protocol-level intrusion detector. Our contributions are: (i) a minimal attack suite spanning instruction-driven escalation, chain-of-tool exfiltration, malicious MCP server registration, and persistence; (ii) NEBULA-Schema (Network-Edge Behavioral Learning for Untrusted LLM Agents), a reusable protocol-level instrumentation that represents MCP activity as a streaming heterogeneous temporal graph over agents, MCP servers, tools, devices, remotes, and sessions; and (iii) a CPU-only streaming detector that fuses novelty, session-DAG structure, and attribute cues for near-real-time edge inference, with optional fusion of local prompt-guardrail signals. On an emulated smart-home testbed spanning multiple MCP stacks and a physical bench, AegisMCP achieves sub-second per-window model inference and end-to-end alerting, with latency consistently sub-second on Intel N150-class edge hardware, while outperforming traffic-only and sequence baselines; ablations confirm the importance of DAG and install/permission signals. We release code, schemas, and generators for reproducible evaluation.

Updated: 2025-10-25 22:02:32

Categories: cs.CR

Download: http://arxiv.org/abs/2510.19462v2

Dimension-free Score Matching and Time Bootstrapping for Diffusion Models

Diffusion models generate samples by estimating the score function of the target distribution at various noise levels. The model is trained using samples drawn from the target distribution by progressively adding noise. Previous sample complexity bounds have polynomial dependence on the dimension $d$, apart from a $\log(|\mathcal{H}|)$ term, where $\mathcal{H}$ is the hypothesis class. In this work, we establish the first (nearly) dimension-free sample complexity bounds, modulo the $\log(|\mathcal{H}|)$ dependence, for learning these score functions, achieving a double exponential improvement in the dimension over prior results. A key aspect of our analysis is the use of a single function approximator to jointly estimate scores across noise levels, a practical feature that enables generalization across time steps. We introduce a martingale-based error decomposition and sharp variance bounds, enabling efficient learning from dependent data generated by Markov processes, which may be of independent interest. Building on these insights, we propose Bootstrapped Score Matching (BSM), a variance reduction technique that leverages previously learned scores to improve accuracy at higher noise levels. These results provide insights into the efficiency and effectiveness of diffusion models for generative modeling.
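
Schematically, the shared-approximator objective and the bootstrapping idea can be written as follows. The symbols and the Gaussian noising convention are assumptions for exposition, not notation taken from the paper:

```latex
% Denoising score matching with a single approximator s_\theta(x, t)
% shared across noise levels, for the noising  x_t = x_0 + \sigma_t z,
% z ~ N(0, I):
\mathcal{L}(\theta)
  = \mathbb{E}_{t,\,x_0,\,z}
    \Big\| s_\theta(x_t, t) + \tfrac{z}{\sigma_t} \Big\|^2,
\qquad
\nabla_{x_t} \log p_t(x_t \mid x_0) = -\tfrac{z}{\sigma_t}.
% Bootstrapped Score Matching: at a higher noise level t' > t, replace the
% high-variance regression target -z/\sigma_{t'} with a target constructed
% from the already-learned score s_{\hat\theta}(\cdot, t), trading a small
% bias for reduced variance.
```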

Updated: 2025-10-25 21:54:42

Categories: cs.LG,math.ST,stat.ML,stat.TH

Download: http://arxiv.org/abs/2502.10354v2

Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user's goal/query (e.g. "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.

Updated: 2025-10-25 21:54:01

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.22443v1

Low-Precision Streaming PCA

Low-precision streaming PCA estimates the top principal component in a streaming setting under limited precision. We establish an information-theoretic lower bound on the quantization resolution required to achieve a target accuracy for the leading eigenvector. We study Oja's algorithm for streaming PCA under linear and nonlinear stochastic quantization. The quantized variants use unbiased stochastic quantization of the weight vector and the updates. Under mild moment and spectral-gap assumptions on the data distribution, we show that a batched version achieves the lower bound up to logarithmic factors under both schemes. This leads to a nearly dimension-free quantization error in the nonlinear quantization setting. Empirical evaluations on synthetic streams validate our theoretical findings and demonstrate that our low-precision methods closely track the performance of standard Oja's algorithm.
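
The two ingredients, Oja's update and unbiased stochastic quantization, can be sketched in a few lines. The sketch below is a plain (non-batched) variant with an illustrative synthetic stream; the grid width, step size, and covariance are assumptions for the demo:

```python
import math, random

random.seed(0)

def quantize(v, step):
    # Unbiased stochastic (linear) quantization to a grid of width `step`:
    # round up with probability proportional to the remainder, so E[out] = x.
    out = []
    for x in v:
        lo = math.floor(x / step) * step
        p = (x - lo) / step
        out.append(lo + step if random.random() < p else lo)
    return out

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Synthetic stream with covariance diag(1.0, 0.1, 0.1): top eigenvector is e1.
d, eta, step = 3, 0.05, 1 / 256      # 1/step plays the role of the precision
scales = [1.0, 0.316, 0.316]
w = normalize([random.gauss(0, 1) for _ in range(d)])

for _ in range(4000):
    x = [s * random.gauss(0, 1) for s in scales]
    xw = sum(a * b for a, b in zip(x, w))
    upd = [wi + eta * xw * xi for wi, xi in zip(w, x)]   # Oja update
    w = normalize(quantize(upd, step))                   # store in low precision

print(round(abs(w[0]), 2))   # alignment with the true top eigenvector e1
```

The paper's batched variant would average several updates before quantizing, which is what drives the quantization error toward the information-theoretic lower bound.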

Updated: 2025-10-25 21:48:17

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.22440v1

PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.

Updated: 2025-10-25 21:38:07

Categories: cs.SD,cs.AI,I.2.6, H.5.5

Download: http://arxiv.org/abs/2510.22439v1

Modeling Hierarchical Thinking in Large Reasoning Models

Large Language Models (LLMs) have demonstrated remarkable reasoning abilities when they generate step-by-step solutions, known as chain-of-thought (CoT) reasoning. When trained on chain-of-thought reasoning examples, the resulting models (called Large Reasoning Models, or LRMs) appear to learn hierarchical thinking strategies similar to those used by humans. However, understanding LRMs' emergent reasoning capabilities remains a difficult open problem, with many potentially important applications, including improving training and understanding robustness. In this paper, we adopt a memoryless Finite State Machine (FSM) formulation to approximate an LRM's emergent hierarchical reasoning dynamics as a structured, interpretable abstraction. We identify a small set of discrete reasoning states - initialization, deduction, augmentation-strategy, uncertainty-estimation, backtracking, and final-conclusion - that captures the high-level states present in the model's reasoning process. By annotating each step of a model's CoT with these states, we can represent the reasoning trajectory as a transition sequence through the state graph. This FSM formulation provides a systematic way to analyze, interpret, and visualize how different models approach problems. We describe the FSM model, provide examples of CoT annotations under this scheme, and discuss how it can shed light on differences between available models in their approach to reasoning. Our results demonstrate that this FSM-based analysis reveals distinct reasoning patterns and potential shortcomings, offering a new lens through which to evaluate and improve LLM reasoning.
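
The FSM abstraction is straightforward to operationalize: label each CoT step with one of the six states and read off transition counts on the state graph. The trajectory below is a hypothetical annotation, not one from the paper:

```python
# The six states named in the abstract.
STATES = {"initialization", "deduction", "augmentation-strategy",
          "uncertainty-estimation", "backtracking", "final-conclusion"}

# A hypothetical annotated chain of thought: one state per CoT step.
trajectory = ["initialization", "deduction", "deduction", "deduction",
              "uncertainty-estimation", "backtracking",
              "deduction", "final-conclusion"]

def transition_counts(traj):
    # Represent a CoT trajectory as edge counts on the FSM's state graph.
    counts = {}
    for a, b in zip(traj, traj[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

edges = transition_counts(trajectory)
print(edges)   # self-loops on "deduction" indicate sustained derivation steps
```

Comparing these edge-count profiles across models is one concrete way the formulation surfaces differences in reasoning style, such as how often a model backtracks before concluding.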

Updated: 2025-10-25 21:25:30

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.22437v1

Membership Inference Attacks for Unseen Classes

The state-of-the-art for membership inference attacks on machine learning models is a class of attacks based on shadow models that mimic the behavior of the target model on subsets of held-out nonmember data. However, we find that this class of attacks is fundamentally limited because of a key assumption -- that the shadow models can replicate the target model's behavior on the distribution of interest. As a result, we show that attacks relying on shadow models can fail catastrophically on critical AI safety applications where data access is restricted due to legal, ethical, or logistical constraints, so that the shadow models have no reasonable signal on the query examples. Although this problem seems intractable within the shadow model paradigm, we find that quantile regression attacks are a promising approach in this setting, as these models learn features of member examples that can generalize to unseen classes. We demonstrate this both empirically and theoretically, showing that quantile regression attacks achieve up to 11x the TPR of shadow model-based approaches in practice, and providing a theoretical model that outlines the generalization properties required for this approach to succeed. Our work identifies an important failure mode in existing MIAs and provides a cautionary tale for practitioners that aim to directly use existing tools for real-world applications of AI safety.
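
A quantile-regression attack can be sketched generically: fit a per-example alpha-quantile of the nonmember loss, then flag an example as a member when its observed loss falls below that threshold. Everything below (the loss distributions, the scalar feature, the learning rate) is a made-up toy, not the paper's setup:

```python
import random

random.seed(0)

# Toy world: each example has a feature z; nonmember losses are noisy around
# 2.0 + z, member losses sit systematically lower (hypothetical distributions).
def nonmember_loss(z):
    return 2.0 + z + random.gauss(0, 0.3)

def member_loss(z):
    return 1.0 + z + random.gauss(0, 0.3)

alpha = 0.05                 # target false-positive rate on nonmembers
w, b, lr = 0.0, 0.0, 0.01

# Fit a linear alpha-quantile of the nonmember loss via pinball-loss SGD;
# crucially, this needs only nonmember data, no shadow models.
for _ in range(20000):
    z = random.random()
    y = nonmember_loss(z)
    pred = w * z + b
    grad = (1.0 - alpha) if y < pred else -alpha   # d(pinball)/d(pred)
    w -= lr * grad * z
    b -= lr * grad

# Attack: flag "member" when the loss falls below the per-example threshold.
def is_member(z, loss):
    return loss < w * z + b

zs = [random.random() for _ in range(2000)]
tpr = sum(is_member(z, member_loss(z)) for z in zs) / len(zs)
fpr = sum(is_member(z, nonmember_loss(z)) for z in zs) / len(zs)
print(round(tpr, 2), round(fpr, 2))
```

Because the threshold is a function of example features rather than of a shadow model's behavior on the same class, it can generalize to classes never seen during calibration.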

Updated: 2025-10-25 21:13:10

Categories: cs.LG,cs.CR,stat.ML

Download: http://arxiv.org/abs/2506.06488v2

Self-Supervised Discovery of Neural Circuits in Spatially Patterned Neural Responses with Graph Neural Networks

Inferring synaptic connectivity from neural population activity is a fundamental challenge in computational neuroscience, complicated by partial observability and mismatches between inference models and true circuit dynamics. In this study, we propose a graph-based neural inference model that simultaneously predicts neural activity and infers latent connectivity by modeling neurons as interacting nodes in a graph. The architecture features two distinct modules: one for learning structural connectivity and another for predicting future spiking activity via a graph neural network (GNN). Our model accommodates unobserved neurons through auxiliary nodes, allowing for inference in partially observed circuits. We evaluate this approach using synthetic data generated from ring attractor network models and real spike recordings from head direction cells in mice. Across a wide range of conditions, including varying recurrent connectivity, external inputs, and incomplete observations, our model reliably resolves spurious correlations and recovers accurate weight profiles. When applied to real data, the inferred connectivity aligns with theoretical predictions of continuous attractor models. These results highlight the potential of GNN-based models to infer latent neural circuitry through self-supervised structure learning, while leveraging the spike prediction task to flexibly link connectivity and dynamics across both simulated and biological neural systems.

Updated: 2025-10-25 20:31:23

Categories: q-bio.NC,cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2509.17174v2

ComPO: Preference Alignment via Comparison Oracles

Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from verbosity and likelihood displacement, which can be driven by noisy preference pairs that induce similar likelihoods for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on zeroth-order, comparison-based optimization via comparison oracles and provide convergence guarantees for its basic scheme. Second, we improve our method using some heuristics and conduct experiments to demonstrate the flexibility and compatibility of the practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative for addressing the limitations of existing direct alignment methods. A highlight of our work is that we evidence the importance of designing specialized methods for preference pairs with distinct likelihood margins, which complements the recent findings of Razin et al. (2025).
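
A generic comparison-oracle scheme, stripped to its core, queries only which of two perturbed points is preferred and moves along the perturbation accordingly, extracting one bit of information per query. The quadratic objective and step sizes below are illustrative stand-ins, not ComPO itself:

```python
import math, random

random.seed(0)

def oracle(x, y):
    # Comparison oracle: reports only which point is better, never a value.
    # The hidden objective here is a simple quadratic with optimum at
    # (1, ..., 1) -- an illustrative stand-in for a preference signal.
    f = lambda z: -sum((zi - 1.0) ** 2 for zi in z)
    return f(x) >= f(y)

d, mu, eta = 5, 0.1, 0.05      # dimension, probe radius, step size
x = [0.0] * d
for _ in range(3000):
    u = [random.gauss(0, 1) for _ in range(d)]
    n = math.sqrt(sum(ui * ui for ui in u))
    u = [ui / n for ui in u]                       # random unit direction
    plus  = [xi + mu * ui for xi, ui in zip(x, u)]
    minus = [xi - mu * ui for xi, ui in zip(x, u)]
    sign = 1.0 if oracle(plus, minus) else -1.0    # one bit per query
    x = [xi + eta * sign * ui for xi, ui in zip(x, u)]

print([round(xi, 1) for xi in x])   # should approach the optimum (1, ..., 1)
```

Because only comparisons are used, the scheme is insensitive to the magnitude of the preference signal, which is what makes it attractive for noisy pairs with small likelihood margins.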

Updated: 2025-10-25 20:23:09

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2505.05465v2

Path-specific effects for pulse-oximetry guided decisions in critical care

Identifying and measuring biases associated with sensitive attributes is a crucial consideration in healthcare to prevent treatment disparities. One prominent issue is inaccurate pulse oximeter readings, which tend to overestimate oxygen saturation for dark-skinned patients and misrepresent supplemental oxygen needs. Most existing research has revealed statistical disparities linking device measurement errors to patient outcomes in intensive care units (ICUs) without causal formalization. This study causally investigates how racial discrepancies in oximetry measurements affect invasive ventilation in ICU settings. We employ a causal inference-based approach using path-specific effects to isolate the impact of bias by race on clinical decision-making. To estimate these effects, we leverage a doubly robust estimator, propose its self-normalized variant for improved sample efficiency, and provide novel finite-sample guarantees. Our methodology is validated on semi-synthetic data and applied to two large real-world health datasets: MIMIC-IV and eICU. Contrary to prior work, our analysis reveals minimal impact of racial discrepancies on invasive ventilation rates. However, path-specific effects mediated by oxygen saturation disparity are more pronounced on ventilation duration, and the severity differs by dataset. Our work provides a novel pipeline for investigating potential disparities in clinical decision-making and, more importantly, highlights the necessity of causal methods to robustly assess fairness in healthcare.

Updated: 2025-10-25 20:22:38

标题: 重症监护中脉搏血氧测导向决策的路径特异性效应

摘要: 识别和测量与敏感属性相关的偏差是医疗保健中的一个关键考虑因素,以防止治疗不公平。一个突出的问题是脉搏血氧仪读数不准确,往往会高估黑皮肤患者的血氧饱和度,并且误传补充氧气的需求。大多数现有研究已经揭示了将设备测量误差与重症监护室(ICU)患者结果联系起来的统计差异,但没有因果形式化。本研究因果地调查了氧饱和度测量中的种族差异如何影响ICU环境中的侵入性通气。我们采用了基于因果推断的方法,使用路径特定效应来分离种族偏见对临床决策的影响。为了估计这些效应,我们利用一个双重稳健估计器,提出其自标准化变体以提高样本效率,并提供新颖的有限样本保证。我们的方法在半合成数据上经过验证,并应用于两个大型真实世界的健康数据集:MIMIC-IV和eICU。与之前的工作相反,我们的分析显示种族差异对侵入性通气率的影响很小。然而,通过氧饱和度差异介导的路径特定效应对通气持续时间的影响更为显著,并且严重程度因数据集而异。我们的工作为研究临床决策中潜在的不公平提供了一个新颖的流程,并更重要的是,强调了因果方法在健康医疗领域中稳健评估公平性的必要性。

更新时间: 2025-10-25 20:22:38

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2506.12371v2

No Free Lunch From Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions

Given a fixed budget for total model size, one must choose between training a single large model or combining the predictions of multiple smaller models. We investigate this trade-off for ensembles of random-feature ridge regression models in both the overparameterized and underparameterized regimes. Using deterministic equivalent risk estimates, we prove that when a fixed number of parameters is distributed among $K$ independently trained models, the ridge-optimized test risk increases with $K$. Consequently, a single large model achieves optimal performance. We then ask when ensembles can achieve \textit{near}-optimal performance. In the overparameterized regime, we show that, to leading order, the test error depends on ensemble size and model size only through the total feature count, so that overparameterized ensembles consistently achieve near-optimal performance. To understand underparameterized ensembles, we derive scaling laws for the test risk as a function of total parameter count when the ensemble size and parameters per ensemble member are jointly scaled according to a ``growth exponent'' $\ell$. While the optimal error scaling is always achieved by increasing model size with a fixed ensemble size, our analysis identifies conditions on the kernel and task eigenstructure under which near-optimal scaling laws can be obtained by joint scaling of ensemble size and model size.
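The trade-off under study, one large random-feature ridge model versus an ensemble of smaller ones under a fixed total feature budget, can be made concrete with a small simulation. The ReLU feature map, the synthetic near-linear target, and the fixed ridge penalty are assumptions of this sketch, and a single finite-sample run need not reproduce the paper's ridge-optimized risk ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, total_feats = 200, 5, 64

Xtr = rng.normal(size=(n, p))
ytr = Xtr[:, 0] + 0.1 * rng.normal(size=n)   # simple synthetic target
Xte = rng.normal(size=(500, p))
yte = Xte[:, 0]

def fit_member(W, lam=1e-2):
    # Random-feature ridge regression: frozen ReLU features, ridge on top.
    Phi = np.maximum(Xtr @ W.T, 0.0)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(W.shape[0]), Phi.T @ ytr)

def ensemble_risk(K):
    # Distribute the fixed feature budget across K independent members
    # and average their predictions, as in the paper's ensembling setup.
    preds = np.zeros(len(Xte))
    for _ in range(K):
        W = rng.normal(size=(total_feats // K, p))
        a = fit_member(W)
        preds += np.maximum(Xte @ W.T, 0.0) @ a
    return float(np.mean((preds / K - yte) ** 2))

risk_single = ensemble_risk(1)    # one large model using all 64 features
risk_ensemble = ensemble_risk(8)  # eight members with 8 features each
```

The quantity compared across `K` is exactly the test risk at a fixed total parameter count, which is the object the paper's deterministic equivalent estimates characterize.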

Updated: 2025-10-25 20:21:49

标题: 随机特征集合中没有免费午餐:尺度律和近最优条件

摘要: 在给定总模型大小固定预算的情况下,一个人必须在训练单个大模型或结合多个较小模型的预测之间做出选择。我们研究了在过参数化和欠参数化情况下随机特征岭回归模型集成的这种权衡。使用确定性等效风险估计,我们证明当固定数量的参数分布在$K$个独立训练的模型之间时,岭优化的测试风险随$K$增加而增加。因此,单个大模型实现了最佳性能。然后我们问在什么情况下集成可以实现\textit{接近}最佳性能。在过参数化情况下,我们展示测试误差主要取决于集成大小和模型大小,只通过总特征计数,因此过参数化集成始终实现接近最佳性能。为了理解欠参数化集成,我们推导了测试风险的缩放定律,作为总参数计数的函数,当集成大小和每个集成成员的参数根据“增长指数”$\ell$进行联合缩放时。虽然通过增加模型大小来实现最佳错误缩放,我们的分析确定了核心和任务特征结构条件,在这些条件下,通过集成大小和模型大小的联合缩放可以获得接近最佳的缩放定律。

更新时间: 2025-10-25 20:21:49

领域: cs.LG,cond-mat.dis-nn,stat.ML

下载: http://arxiv.org/abs/2412.05418v2

OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior -- rejecting even benign prompts -- a phenomenon known as $\textit{over-refusal}$ that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT ($\textbf{OVE}$r-$\textbf{R}$efusal evaluation on $\textbf{T}$ext-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety-utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their functionality. As a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts. Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.

Updated: 2025-10-25 20:12:34

标题: OVERT:基于文本到图像模型的过度拒绝评估基准

摘要: 文本到图像(T2I)模型在从文本输入生成视觉内容方面取得了显著的成功。尽管已经提出了多种安全对齐策略来防止有害输出,但它们经常导致过于谨慎的行为 -- 即拒绝甚至是良性提示 -- 这种现象被称为$\textit{过度拒绝}$,这降低了T2I模型的实际效用。尽管在实践中观察到了过度拒绝,但目前没有一个大规模的基准测试来系统评估T2I模型的这种现象。在本文中,我们提出了一个自动工作流程来构建合成评估数据,从而产生OVERT($\textbf{OVE}$r-$\textbf{R}$efusal评估在$\textbf{T}$ext-to-image模型上),这是第一个用于评估T2I模型中过度拒绝行为的大规模基准测试。OVERT包括来自九个安全相关类别的4,600个看似有害但实际是良性的提示,以及1,785个真正有害的提示(OVERT-unsafe)来评估安全效用权衡。使用OVERT,我们评估了几种领先的T2I模型,并发现过度拒绝是一个在各种类别中普遍存在的问题(见图1),强调了进一步研究以增强T2I模型的安全对齐性而不损害其功能性的必要性。作为减少过度拒绝的初步尝试,我们探索了提示重写;然而,我们发现它经常损害了对原始提示含义的忠实性。最后,我们展示了我们的生成框架在适应用户定义政策的情况下生成定制化评估数据的灵活性。

更新时间: 2025-10-25 20:12:34

领域: cs.LG

下载: http://arxiv.org/abs/2505.21347v3

Accelerated Gradient Methods for Nonconvex Optimization: Escape Trajectories From Strict Saddle Points and Convergence to Local Minima

This paper considers the problem of understanding the behavior of a general class of accelerated gradient methods on smooth nonconvex functions. Motivated by some recent works that have proposed effective algorithms, based on Polyak's heavy ball method and the Nesterov accelerated gradient method, to achieve convergence to a local minimum of nonconvex functions, this work proposes a broad class of Nesterov-type accelerated methods and puts forth a rigorous study of these methods encompassing the escape from saddle points and convergence to local minima through both an asymptotic and a non-asymptotic analysis. In the asymptotic regime, this paper answers an open question of whether Nesterov's accelerated gradient method (NAG) with variable momentum parameter avoids strict saddle points almost surely. This work also develops two metrics of asymptotic rates of convergence and divergence, and evaluates these two metrics for several popular standard accelerated methods such as the NAG and Nesterov's accelerated gradient with constant momentum (NCM) near strict saddle points. In the non-asymptotic regime, this work provides an analysis that leads to "linear" exit-time estimates from strict saddle neighborhoods for trajectories of these accelerated methods, as well as the necessary conditions for the existence of such trajectories. Finally, this work studies a sub-class of accelerated methods that can converge in convex neighborhoods of nonconvex functions to a local minimum at a near-optimal rate, while offering superior saddle-escape behavior compared to that of NAG.
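The escape behavior being analyzed can be illustrated with NAG, using the variable momentum parameter k/(k+3), on the textbook strict-saddle function f(x, y) = x^2 - y^2: the stable coordinate contracts toward the saddle while any perturbation along the unstable axis is amplified. The step size, iteration count, and initial point below are illustrative choices, not values from the paper.

```python
def grad(p):
    # f(x, y) = x^2 - y^2 has a strict saddle at the origin:
    # x is the stable direction, y the unstable one.
    x, y = p
    return (2.0 * x, -2.0 * y)

def nag(p0, step=0.01, iters=500):
    # Nesterov's accelerated gradient with variable momentum k / (k + 3):
    # gradient step at a look-ahead point, then momentum extrapolation.
    prev = list(p0)
    cur = list(p0)
    for k in range(iters):
        beta = k / (k + 3.0)
        look = [c + beta * (c - q) for c, q in zip(cur, prev)]
        g = grad(look)
        prev, cur = cur, [li - step * gi for li, gi in zip(look, g)]
    return cur

p = nag((1.0, 1e-3))  # start near the saddle, slightly off the unstable axis
```

The trajectory converges along x but diverges along y, i.e., it escapes the strict saddle, matching the almost-sure avoidance behavior the paper establishes for NAG with variable momentum.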

Updated: 2025-10-25 20:04:55

标题: 非凸优化的加速梯度方法:逃离严格鞍点的轨迹和收敛到局部最小值

摘要: 这篇论文考虑了理解一般加速梯度方法在光滑非凸函数上的行为的问题。受一些最近提出的有效算法的启发,这些算法基于Polyak的重球方法和Nesterov加速梯度方法,以实现对非凸函数局部最小值的收敛,本文提出了一种广泛的Nesterov类型加速方法,并对这些方法进行了严格研究,包括通过渐近和非渐近分析逃离鞍点和收敛到局部最小值。在渐近区域中,这篇论文回答了一个开放问题,即具有可变动量参数的Nesterov加速梯度方法(NAG)是否几乎肯定避开严格鞍点。这项工作还开发了两个渐近收敛和发散速度的度量,并评估了这两个度量对于几种流行的标准加速方法(如NAG和具有恒定动量的Nesterov加速梯度(NCM))在严格鞍点附近的表现。在非渐近区域中,这项工作提供了分析,导致这些加速方法轨迹从严格鞍点邻域“线性”退出的时间估计,以及存在这种轨迹的必要条件。最后,这项工作研究了一类加速方法的子类,这些方法可以以接近最佳速度收敛到非凸函数的凸邻域的局部最小值,同时这个子类相比NAG具有更优越的鞍点逃逸行为。

更新时间: 2025-10-25 20:04:55

领域: math.OC,cs.LG,cs.SY,eess.SY,90C26 (Primary) 65K05, 65K10, 37N40, 34D20 (Secondary)

下载: http://arxiv.org/abs/2307.07030v2

DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models

Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear, but gives little insight into the mechanisms that generate them, leaving existing mitigations largely fragile. In this paper, we conduct a systematic investigation of LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA prompting triggers the latent bias of the LLM, increasing measured unfairness by 391.9%-534.5% in both one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token's influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant under CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads, which identifies bias heads through differential activation analysis between DA and CoT and selectively masks only those heads. DiffHeads reduces unfairness by 49.4% and 40.3% under DA and CoT, respectively, without harming model utility.
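The differential-activation step can be sketched independently of any particular model: given per-head activation statistics collected under DA and under CoT prompting, heads that fire strongly under DA but stay dormant under CoT are flagged and masked at inference time. The synthetic activation tensors and the top-k selection rule below are assumptions of this sketch, not DiffHeads' exact scoring.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_heads = 4, 8
# Hypothetical mean head activations collected under each prompting style.
act_da = rng.random((n_layers, n_heads))
act_cot = act_da.copy()
act_cot[1, 3] -= 0.9  # heads (1,3) and (2,5) fire far more under DA than CoT
act_cot[2, 5] -= 0.8

def bias_head_mask(act_da, act_cot, k=2):
    # Differential activation analysis: rank heads by how much more active
    # they are under DA than under CoT, then mask only the top-k offenders.
    diff = act_da - act_cot
    top = np.argsort(diff, axis=None)[::-1][:k]
    mask = np.ones_like(act_da)
    for idx in top:
        mask[np.unravel_index(idx, act_da.shape)] = 0.0  # zero out bias head
    return mask

mask = bias_head_mask(act_da, act_cot)
```

At inference, such a mask would multiply the corresponding heads' outputs, leaving all other heads (and hence model utility) untouched.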

Updated: 2025-10-25 20:03:37

标题: DiffHeads: 大型语言模型中偏差头部的差异分析和推断时间掩盖

摘要: 大型语言模型(LLMs)越来越多地在领域中调解人口群体不公平待遇不可接受的决策。现有研究探讨了何时出现偏见输出,但对生成这些输出的机制几乎没有深入了解,导致现有的缓解措施很不牢固。在本文中,我们对LLM的不公平性进行了系统调查,并提出了DiffHeads,这是一个针对LLMs的轻量级去偏见框架。我们首先比较了在八个代表性的开源和闭源LLMs中使用的直接回答(DA)提示和思维链(CoT)提示。DA将触发LLM的自然偏见部分,并在单轮和双轮对话中将测得的不公平性提高了534.5%至391.9%。接下来,我们定义了一个令牌到头部贡献分数,追踪每个令牌对个体注意力头部的影响。这揭示了在DA下激活的少量偏见头部群,但在CoT下基本处于休眠状态,首次提供了提示策略和偏见出现之间的因果联系。最后,基于这一洞察,我们提出了DiffHeads,通过在DA和CoT之间进行差异激活分析来识别偏见头部,并选择性地屏蔽这些头部。DiffHeads将不公平性分别减少了49.4%和40.3%,而不损害模型的效用。

更新时间: 2025-10-25 20:03:37

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.10142v2

Reinforcement learning-guided optimization of critical current in high-temperature superconductors

High-temperature superconductors are essential for next-generation energy and quantum technologies, yet their performance is often limited by the critical current density ($J_c$), which is strongly influenced by microstructural defects. Optimizing $J_c$ through defect engineering is challenging due to the complex interplay of defect type, density, and spatial correlation. Here we present an integrated workflow that combines reinforcement learning (RL) with time-dependent Ginzburg-Landau (TDGL) simulations to autonomously identify optimal defect configurations that maximize $J_c$. In our framework, TDGL simulations generate current-voltage characteristics to evaluate $J_c$, which serves as the reward signal that guides the RL agent to iteratively refine defect configurations. We find that the agent discovers optimal defect densities and correlations in two-dimensional thin-film geometries, enhancing vortex pinning and $J_c$ relative to the pristine thin-film, approaching 60\% of theoretical depairing limit with up to 15-fold enhancement compared to random initialization. This RL-driven approach provides a scalable strategy for defect engineering, with broad implications for advancing HTS applications in fusion magnets, particle accelerators, and other high-field technologies.

Updated: 2025-10-25 20:01:33

标题: 强化学习引导的高温超导体临界电流优化

摘要: 高温超导体对于下一代能源和量子技术至关重要,然而它们的性能常常受到临界电流密度($J_c$)的限制,而这又受到微观结构缺陷的强烈影响。通过缺陷工程来优化$J_c$是具有挑战性的,因为缺陷类型、密度和空间相关性之间存在复杂的相互作用。在这里,我们提出了一种集成工作流程,将强化学习(RL)与时间相关的Ginzburg-Landau(TDGL)模拟相结合,以自主识别最大化$J_c$的最佳缺陷配置。在我们的框架中,TDGL模拟生成电流-电压特性来评估$J_c$,这作为引导RL代理迭代优化缺陷配置的奖励信号。我们发现代理程序在二维薄膜几何结构中发现了最佳缺陷密度和相关性,增强了涡旋钉扎和$J_c$,相对于原始薄膜,接近理论解偶极限的60%,与随机初始化相比最多提高了15倍。这种受RL驱动的方法为缺陷工程提供了可扩展的策略,对于推进高温超导体在聚变磁体、粒子加速器和其他高场技术中的应用具有广泛意义。

更新时间: 2025-10-25 20:01:33

领域: cond-mat.mtrl-sci,cond-mat.supr-con,cs.LG

下载: http://arxiv.org/abs/2510.22424v1

Robust Multimodal Learning via Cross-Modal Proxy Tokens

Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning. The code for this paper is available at: https://github.com/CSIPlab/Cross-Modal-Proxy-Tokens.

Updated: 2025-10-25 19:57:55

标题: 通过跨模态代理令牌实现强大的多模态学习

摘要: 多模态模型在推断过程中缺少一个或多个模态时往往会出现显著的性能下降。为了解决这一挑战,我们提出了一种简单而有效的方法,可以在缺少模态的情况下提高鲁棒性,同时在所有模态可用时保持较强的性能。我们的方法引入了跨模态代理标记(CMPTs),通过仅关注可用模态的标记来近似缺失模态的类标记,而无需显式生成模态或辅助网络。为了以最小的计算开销有效地学习这些近似值,我们在冻结的单模态编码器中使用低秩适配器,并联合优化对齐损失和特定任务损失。对五个多模态数据集进行的大量实验表明,我们的方法在各种缺失率下优于现有基线方法,同时在完整模态设置中取得了竞争性结果。总体而言,我们的方法为鲁棒的多模态学习提供了灵活高效的解决方案。本文的代码可以在以下链接找到:https://github.com/CSIPlab/Cross-Modal-Proxy-Tokens。

更新时间: 2025-10-25 19:57:55

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2501.17823v4

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

Updated: 2025-10-25 19:53:24

标题: 泛化还是幻觉?理解变压器中的脱离上下文推理

摘要: 大型语言模型(LLMs)可以通过微调获取新知识,但这个过程表现出令人困惑的二元性:模型可以从新事实中表现出惊人的泛化能力,但也容易产生错误信息的幻觉。然而,这种现象的原因仍然不明确。在这项工作中,我们认为这两种行为都源于一种称为超出上下文推理(OCR)的单一机制:通过关联概念来推断含义的能力,甚至是那些没有因果关系的概念。我们在五个知名的LLMs上的实验证实,OCR确实驱动了泛化和幻觉,具体取决于关联概念是否有因果关系。为了建立对这一现象的严格理论理解,我们将OCR形式化为一个合成的事实回忆任务。我们通过实验证明,一个具有分解输出和值矩阵的一层单头注意力传输器可以学会解决这个任务,而具有组合权重的模型则不能,突出了矩阵分解的关键作用。我们的理论分析表明,OCR能力可以归因于梯度下降的隐式偏好,即偏好最小化组合输出值矩阵的核范数的解决方案。这种数学结构解释了为什么模型学会高效地将事实和推论相关联,无论相关性是因果关系还是仅仅是虚假的。最终,我们的工作为理解OCR现象提供了理论基础,为分析和减轻知识注入中的不良行为提供了一个新的视角。

更新时间: 2025-10-25 19:53:24

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2506.10887v3

PICore: Physics-Informed Unsupervised Coreset Selection for Data Efficient Neural Operator Training

Neural operators offer a powerful paradigm for solving partial differential equations (PDEs) that cannot be solved analytically by learning mappings between function spaces. However, there are two main bottlenecks in training neural operators: they require a significant amount of training data to learn these mappings, and this data needs to be labeled, which can only be accessed via expensive simulations with numerical solvers. To alleviate both of these issues simultaneously, we propose PICore, an unsupervised coreset selection framework that identifies the most informative training samples without requiring access to ground-truth PDE solutions. PICore leverages a physics-informed loss to select unlabeled inputs by their potential contribution to operator learning. After selecting a compact subset of inputs, only those samples are simulated using numerical solvers to generate labels, reducing annotation costs. We then train the neural operator on the reduced labeled dataset, significantly decreasing training time as well. Across four diverse PDE benchmarks and multiple coreset selection strategies, PICore achieves up to 78% average increase in training efficiency relative to supervised coreset selection methods with minimal changes in accuracy. We provide code at https://github.com/Asatheesh6561/PICore.
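The key property PICore exploits is that a physics-informed loss can score an input without any labeled solution. A minimal sketch of that selection step: score candidate inputs by a discrete PDE residual and keep only the top-k for expensive labeling. The steady heat equation, the fixed grid spacing, and the use of the input itself in place of the operator's prediction are simplifying assumptions of this sketch.

```python
import numpy as np

def pde_residual(u, dx=0.1):
    # Physics-informed score for the steady heat equation u'' = 0:
    # mean squared discrete Laplacian, computable with no ground-truth label.
    lap = (u[:-2] - 2.0 * u[1:-1] + u[2:]) / dx**2
    return float(np.mean(lap**2))

def picore_select(candidates, k):
    # Rank unlabeled inputs by the physics-informed loss and keep the top-k;
    # only those would then be sent to the numerical solver for labeling.
    scores = [pde_residual(u) for u in candidates]
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:k]]

x = np.linspace(0.0, 1.0, 50)
smooth = [a * x + b for a, b in [(1.0, 0.0), (-0.5, 1.0)]]   # residual ~ 0
rough = [np.sin(8 * np.pi * f * x) for f in (1.0, 2.0)]      # large residual
idx = picore_select(smooth + rough, k=2)
```

The linear profiles satisfy the PDE exactly and score near zero, so the selector spends its labeling budget on the oscillatory inputs, which is the informativeness criterion the framework formalizes.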

Updated: 2025-10-25 19:49:38

标题: PICore: 物理信息非监督式Coreset选择用于数据高效神经算子训练

摘要: 神经算子提供了一种强大的范式,用于通过学习函数空间之间的映射来解决无法通过解析方法解决的偏微分方程(PDEs)。然而,在训练神经算子时存在两个主要瓶颈:它们需要大量的训练数据来学习这些映射,而这些数据需要标记,只能通过昂贵的数值求解器进行访问。为了同时缓解这两个问题,我们提出了PICore,这是一个无监督的核心集选择框架,可以识别出最具信息量的训练样本,而无需访问地面真相的PDE解决方案。PICore利用基于物理的损失来选择未标记的输入,根据它们对算子学习的潜在贡献。在选择了一个紧凑的输入子集之后,只有这些样本使用数值求解器进行模拟以生成标签,从而降低标注成本。然后我们在减少的标记数据集上训练神经算子,显著减少训练时间。在四个不同的PDE基准测试和多种核心集选择策略中,PICore相对于监督核心集选择方法实现了高达78%的平均训练效率提升,而准确性几乎没有改变。我们在https://github.com/Asatheesh6561/PICore 上提供了代码。

更新时间: 2025-10-25 19:49:38

领域: cs.LG

下载: http://arxiv.org/abs/2507.17151v2

Group size effects and collective misalignment in LLM multi-agent systems

Multi-agent systems of large language models (LLMs) are rapidly expanding across domains, introducing dynamics not captured by single-agent evaluations. Yet, existing work has mostly contrasted the behavior of a single agent with that of a collective of fixed size, leaving open a central question: how does group size shape dynamics? Here, we move beyond this dichotomy and systematically explore outcomes across the full range of group sizes. We focus on multi-agent misalignment, building on recent evidence that interacting LLMs playing a simple coordination game can generate collective biases absent in individual models. First, we show that collective bias is a deeper phenomenon than previously assessed: interaction can amplify individual biases, introduce new ones, or override model-level preferences. Second, we demonstrate that group size affects the dynamics in a non-linear way, revealing model-dependent dynamical regimes. Finally, we develop a mean-field analytical approach and show that, above a critical population size, simulations converge to deterministic predictions that expose the basins of attraction of competing equilibria. These findings establish group size as a key driver of multi-agent dynamics and highlight the need to consider population-level effects when deploying LLM-based systems at scale.

Updated: 2025-10-25 19:45:45

标题: 团队规模效应和LLM多智能体系统中的集体失调

摘要: 大型语言模型(LLMs)的多智能体系统正在不断扩展到各个领域,引入了单一智能体评估无法捕捉的动态。然而,现有工作大多将单一智能体的行为与固定大小的集体进行对比,留下一个核心问题:群体规模如何塑造动态?在这里,我们超越这种二分法,系统地探索了整个群体规模范围内的结果。我们关注多智能体不对齐,借鉴了最近的证据,即相互作用的LLMs玩一个简单的协调游戏可以生成个体模型中不存在的集体偏见。首先,我们展示了集体偏见是一个比以前评估的更深层次的现象:相互作用可以放大个体偏见,引入新的偏见,或者覆盖模型级别的偏好。其次,我们证明了群体规模以非线性方式影响动态,揭示了依赖于模型的动态状态。最后,我们开发了一个均场分析方法,并展示了在超过临界人口规模后,模拟会收敛到暴露竞争均衡的吸引力盆地的确定性预测。这些发现将群体规模确立为多智能体动态的关键驱动因素,并强调在规模部署基于LLM的系统时需要考虑群体级别效果。

更新时间: 2025-10-25 19:45:45

领域: cs.MA,cs.AI,cs.CY,physics.soc-ph

下载: http://arxiv.org/abs/2510.22422v1

Extragradient Method for $(L_0, L_1)$-Lipschitz Root-finding Problems

Introduced by Korpelevich in 1976, the extragradient method (EG) has become a cornerstone technique for solving min-max optimization, root-finding problems, and variational inequalities (VIs). Despite its longstanding presence and significant attention within the optimization community, most works focusing on understanding its convergence guarantees assume the strong L-Lipschitz condition. In this work, building on the assumptions proposed by Zhang et al. [2024b] for minimization and by Vankov et al. [2024] for VIs, we focus on the more relaxed $\alpha$-symmetric $(L_0, L_1)$-Lipschitz condition. This condition generalizes the standard Lipschitz assumption by allowing the Lipschitz constant to scale with the operator norm, providing a more refined characterization of problem structures in modern machine learning. Under the $\alpha$-symmetric $(L_0, L_1)$-Lipschitz condition, we propose a novel step size strategy for EG to solve root-finding problems and establish sublinear convergence rates for monotone operators and linear convergence rates for strongly monotone operators. Additionally, we prove local convergence guarantees for weak Minty operators. We supplement our analysis with experiments validating our theory and demonstrating the effectiveness and robustness of the proposed step sizes for EG.
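The base iteration under discussion, extrapolate with the operator, then update using the operator evaluated at the extrapolated point, can be sketched on a monotone rotation operator where plain gradient descent-ascent diverges. The constant step size below is an illustrative choice; the paper's contribution is precisely a step-size rule adapted to the $(L_0, L_1)$ condition, which this sketch does not implement.

```python
def extragradient(F, z0, step, iters):
    # Classic EG: an extrapolation half-step followed by the actual update,
    # both using the same step size.
    z = list(z0)
    for _ in range(iters):
        g = F(z)
        z_half = [zi - step * gi for zi, gi in zip(z, g)]   # extrapolate
        g_half = F(z_half)
        z = [zi - step * gi for zi, gi in zip(z, g_half)]   # update
    return z

# Monotone rotation operator F(x, y) = (y, -x), the root-finding form of the
# min-max problem f(x, y) = x * y; its unique root is the origin.
F = lambda z: (z[1], -z[0])
z = extragradient(F, (1.0, 1.0), step=0.3, iters=200)
```

On this operator, simultaneous gradient descent-ascent spirals outward for any constant step, while the extra evaluation at the half-step lets EG contract to the root, which is why EG is the natural baseline for monotone root-finding problems.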

Updated: 2025-10-25 19:43:01

标题: Extragradient方法用于$(L_0, L_1)$-利普希茨根查找问题

摘要: 引入Korpelevich于1976年提出的外梯度法(EG)已成为解决极小极大优化、根查找问题和变分不等式(VIs)的基石技术。尽管在优化领域已经存在很长时间并受到重视,但大多数关于理解其收敛保证的工作都假设强L-Lipschitz条件。在本文中,基于Zhang等人提出的最小化和Vankov等人提出的VIs的假设,我们集中研究更宽松的α-对称(L0,L1)-Lipschitz条件。该条件通过允许Lipschitz常数与算子范数成比例地放缩来推广标准Lipschitz假设,提供了对现代机器学习中问题结构更精细的刻画。在α-对称(L0,L1)-Lipschitz条件下,我们提出了一种新颖的步长策略用于解决根查找问题,并建立了单调算子的次线性收敛率和强单调算子的线性收敛率。此外,我们证明了对弱Minty算子的局部收敛保证。我们通过实验证明了我们理论的有效性,并展示了EG提出的步长策略的效果和稳健性。

更新时间: 2025-10-25 19:43:01

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2510.22421v1

Beyond Isotonization: Scalable Non-Crossing Quantile Estimation via Neural Networks for Student Growth Percentiles

Student Growth Percentiles (SGPs), widely adopted across U.S. state assessment systems, employ independent quantile regression followed by post-hoc correction using an isotonic projection method (\texttt{isotonize=TRUE} in the \texttt{SGP} R package) to address quantile crossing. We demonstrate that this approach contains a fundamental methodological inconsistency: interpolation between independently-estimated, potentially crossed quantiles requires monotonicity, yet the post-hoc correction alters estimates in ways that may violate the quantile property $P(Y \leq \hat{Q}_{\tau}(Y|X) \mid X) = \tau$. We term this the \emph{interpolation paradox}. While theoretically sound constrained joint quantile regression (CJQR) eliminates crossing by enforcing non-crossing constraints during optimization, we analyze its computational complexity (often scaling poorly, e.g., $\mathcal{O}((qn)^3)$ for standard LP solvers) rendering it intractable for large-scale educational data ($n > 100{,}000$). We examine the SGP package's switch to the Frisch-Newton interior point method (\texttt{rq.method.for.large.n="fn"}) for large $N$, noting that while efficient for \emph{independent} QR, it doesn't resolve the joint problem's complexity or the paradox. We propose neural network-based multi-quantile regression (NNQR) with shared hidden layers as a practical alternative. Leveraging the convexity of the composite pinball loss, SGD-based optimization used in NN training can reliably approach the global optimum, offering scalability ($O(n)$) and implicitly reducing crossing. Our empirical analysis shows that independent QR yields crossing, while both CJQR and NNQR enforce monotonicity. NNQR emerges as a viable, scalable alternative for operational SGP systems, aligning theoretical validity with computational feasibility.
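The two central objects here, the composite pinball loss that NNQR minimizes jointly over quantile levels, and the crossing diagnostic that motivates the whole paper, can be written down directly. This is a minimal illustration on hand-built predictions, not the paper's shared-hidden-layer network.

```python
def pinball_loss(y_true, y_pred, tau):
    # Quantile (pinball) loss: tau-weighted underestimates plus
    # (1 - tau)-weighted overestimates, averaged over observations.
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)

def composite_pinball(y_true, preds_by_tau):
    # NNQR trains one shared network against the sum of pinball losses
    # over all target quantile levels; each term is convex in the outputs.
    return sum(pinball_loss(y_true, preds_by_tau[t], t) for t in preds_by_tau)

def crossing_rate(preds_by_tau):
    # Fraction of observations where a lower quantile estimate exceeds a
    # higher one, i.e., where the quantile property must be violated.
    taus = sorted(preds_by_tau)
    n = len(next(iter(preds_by_tau.values())))
    crossed = 0
    for i in range(n):
        q = [preds_by_tau[t][i] for t in taus]
        crossed += any(a > b for a, b in zip(q, q[1:]))
    return crossed / n

loss = pinball_loss([0, 1, 2, 3, 4], [2] * 5, tau=0.5)
rate = crossing_rate({0.1: [1.0, 5.0], 0.9: [2.0, 1.0]})
```

Independent QR fits one `tau` at a time, so nothing couples the predictions and `crossing_rate` can be positive; sharing hidden layers across all `tau` in the composite loss is what implicitly discourages crossing.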

Updated: 2025-10-25 19:39:07

标题: 超越等渗化:基于神经网络的可扩展非交叉分位数估计用于学生成长百分位数

摘要: 学生增长百分位数(SGPs)广泛应用于美国各州的评估系统中,使用独立的分位数回归,然后通过后处理校正使用等距投影方法(在SGP R软件包中的isotonize=TRUE)来解决分位数交叉问题。我们展示了这种方法包含一个基本的方法论矛盾:在独立估计的、潜在交叉的分位数之间进行插值需要单调性,然而后处理校正以可能违反分位数属性$P(Y \leq \hat{Q}_{\tau}(Y|X) \mid X) = \tau$的方式改变估计值。我们将这称为“插值悖论”。虽然在理论上受限的联合分位数回归(CJQR)通过在优化过程中强制不交叉约束来消除交叉,但我们分析了其计算复杂性(通常扩展性较差,例如标准LP求解器的$\mathcal{O}((qn)^3)$),使其在大规模教育数据($n > 100{,}000$)中难以处理。我们检查了SGP软件包对大$N$的Frisch-Newton内点方法的转变(rq.method.for.large.n="fn"),指出虽然对于\emph{独立}QR而言效率很高,但并未解决联合问题的复杂性或悖论。我们提出基于神经网络的多分位数回归(NNQR),使用共享隐藏层作为实际替代方案。利用复合损失的凸性,在NN训练中使用的SGD优化可以可靠地接近全局最优解,提供可扩展性($O(n)$)并隐含地减少交叉。我们的实证分析显示,独立QR产生交叉,而CJQR和NNQR都强制单调性。NNQR作为操作SGP系统的一个可行、可扩展的替代方案,使理论有效性与计算可行性相一致。

更新时间: 2025-10-25 19:39:07

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2510.22419v1

Dynamic D2D-Assisted Federated Learning over O-RAN: Performance Analysis, MAC Scheduler, and Asymmetric User Selection

Existing studies on federated learning (FL) are mostly focused on system orchestration for static snapshots of the network and making static control decisions (e.g., spectrum allocation). However, real-world wireless networks are susceptible to temporal variations of wireless channel capacity and users' datasets. In this paper, we incorporate multi-granular system dynamics (MSDs) into FL, including (M1) dynamic wireless channel capacity, captured by a set of discrete-time events, called $\mathscr{D}$-Events, and (M2) dynamic datasets of users. The latter is characterized by (M2-a) modeling the dynamics of user's dataset size via an ordinary differential equation and (M2-b) introducing dynamic model drift, formulated via a partial differential inequality, drawing concrete analytical connections between the dynamics of users' datasets and FL accuracy. We then conduct FL orchestration under MSDs by introducing dynamic cooperative FL with dedicated MAC schedulers (DCLM), exploiting the unique features of open radio access network (O-RAN). DCLM proposes (i) a hierarchical device-to-device (D2D)-assisted model training, (ii) dynamic control decisions through dedicated O-RAN MAC schedulers, and (iii) asymmetric user selection. We provide extensive theoretical analysis to study the convergence of DCLM. We then optimize the degrees of freedom (e.g., user selection and spectrum allocation) in DCLM through a highly non-convex optimization problem. We develop a systematic approach to obtain the solution for this problem, opening the door to solving a broad variety of network-aware FL optimization problems. We show the efficiency of DCLM via numerical simulations and provide a series of future directions.

Updated: 2025-10-25 19:29:29

标题: 动态D2D辅助的O-RAN联合学习:性能分析、MAC调度器和非对称用户选择

摘要: 现有关于联邦学习(FL)的研究主要集中在针对网络的静态快照进行系统编排,并做出静态控制决策(例如,频谱分配)。然而,现实世界中的无线网络容易受到无线信道容量和用户数据集的时间变化的影响。本文将多粒度系统动态(MSDs)纳入FL中,包括(M1)动态无线信道容量,通过一组离散时间事件(称为$\mathscr{D}$-Events)捕获,以及(M2)用户的动态数据集。后者通过(M2-a)利用常微分方程建模用户数据集大小的动态和(M2-b)引入动态模型漂移来表征。通过建立用户数据集动态和FL准确性之间的具体分析联系。然后,我们通过引入动态协作FL与专用MAC调度器(DCLM)在MSDs下进行FL编排,利用开放无线接入网络(O-RAN)的独特特性。DCLM提出(i)一种分层设备对设备(D2D)辅助模型训练,(ii)通过专用O-RAN MAC调度器进行动态控制决策,以及(iii)不对称用户选择。我们提供了广泛的理论分析来研究DCLM的收敛性。然后,我们通过高度非凸优化问题优化DCLM中的自由度(例如,用户选择和频谱分配)。我们开发了一种系统化的方法来解决这个问题,为解决各种网络感知FL优化问题打开了大门。我们通过数值模拟展示了DCLM的效率,并提供了一系列未来方向。

更新时间: 2025-10-25 19:29:29

领域: cs.NI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2404.06324v2

Complexity Scaling Laws for Neural Models using Combinatorial Optimization

Recent work on neural scaling laws demonstrates that model performance scales predictably with compute budget, model size, and dataset size. In this work, we develop scaling laws based on problem complexity. We analyze two fundamental complexity measures: solution space size and representation space size. Using the Traveling Salesman Problem (TSP) as a case study, we show that combinatorial optimization promotes smooth cost trends, and therefore meaningful scaling laws can be obtained even in the absence of an interpretable loss. We then show that suboptimality grows predictably for fixed-size models when scaling the number of TSP nodes or spatial dimensions, independent of whether the model was trained with reinforcement learning or supervised fine-tuning on a static dataset. We conclude with an analogy to problem complexity scaling in local search, showing that a much simpler gradient descent of the cost landscape produces similar trends.
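The closing analogy, that a simple descent of the TSP cost landscape reproduces the complexity-scaling trends, can be made concrete with a minimal 2-opt local search: repeatedly reverse tour segments while any reversal lowers the tour cost. The instance size, random seed, and first-improvement acceptance rule are assumptions of this sketch.

```python
import math
import random

def tour_cost(points, tour):
    # Total length of the closed tour visiting points in the given order.
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(points, tour):
    # Greedy descent on the tour-cost landscape: accept any segment
    # reversal that strictly improves the cost, until none remains.
    improved = True
    while improved:
        improved = False
        n = len(tour)
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # reversing the whole tour changes nothing
                cand = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
                if tour_cost(points, cand) < tour_cost(points, tour) - 1e-12:
                    tour, improved = cand, True
    return tour

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(12)]
start = list(range(12))
best = two_opt(pts, start)
```

Measuring the gap between such locally optimal tours and the optimum as the node count or dimension grows gives the same kind of suboptimality-versus-complexity curve the paper tracks for neural models.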

Updated: 2025-10-25 19:04:34

标题: 使用组合优化的神经模型的复杂性缩放定律

摘要: 最近关于神经网络规模定律的研究表明,模型性能与计算预算、模型大小和数据集大小之间的关系可以预测。在这项研究中,我们基于问题复杂性开发了一种缩放定律。我们分析了两个基本的复杂度度量:解空间大小和表示空间大小。以旅行推销员问题(TSP)为案例研究,我们展示了组合优化促进了平滑的成本趋势,因此即使在没有可解释损失的情况下,也可以获得有意义的缩放定律。然后,我们展示了在固定大小模型中,当扩展TSP节点数量或空间维度时,次优性会按可预测的方式增长,无论模型是通过强化学习还是在静态数据集上进行监督微调训练的。最后,我们通过类比本地搜索中的问题复杂性缩放,展示了成本景观的梯度下降产生了类似的趋势。

更新时间: 2025-10-25 19:04:34

领域: cs.LG

下载: http://arxiv.org/abs/2506.12932v2

Knowledge-guided Continual Learning for Behavioral Analytics Systems

User behavior on online platforms is evolving, reflecting real-world changes in how people post, whether it's helpful messages or hate speech. Models that learn to capture this content can experience a decrease in performance over time due to data drift, which can lead to ineffective behavioral analytics systems. However, fine-tuning such a model over time with new data can be detrimental due to catastrophic forgetting. Replay-based approaches in continual learning offer a simple yet efficient method to update such models, minimizing forgetting by maintaining a buffer of important training instances from past learned tasks. However, the main limitation of this approach is the fixed size of the buffer. External knowledge bases can be utilized to overcome this limitation through data augmentation. We propose a novel augmentation-based approach to incorporate external knowledge in the replay-based continual learning framework. We evaluate several strategies with three datasets from prior studies related to deviant behavior classification to assess the integration of external knowledge in continual learning and demonstrate that augmentation helps outperform baseline replay-based approaches.
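The mechanism described above, a fixed-size replay buffer whose effective diversity is stretched by knowledge-guided augmentation, can be sketched in a few lines. The reservoir-sampling buffer, the toy synonym table standing in for an external knowledge base, and the batch-mixing rule are all illustrative assumptions.

```python
import random

class ReplayBuffer:
    # Fixed-size buffer maintained by reservoir sampling over the task
    # stream, so every past instance has an equal chance of being retained.
    def __init__(self, capacity, seed=0):
        self.capacity, self.items, self.seen = capacity, [], 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

# Toy stand-in for an external knowledge base (e.g., a synonym lexicon).
SYNONYMS = {"awful": ["terrible", "dreadful"]}

def augment(text, rng):
    # Knowledge-guided augmentation: swap in alternatives from the KB,
    # stretching the fixed buffer beyond its literal contents.
    return " ".join(rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
                    for w in text.split())

def replay_batch(buffer, new_batch, k, rng):
    # Mix new task data with augmented replays of past instances.
    picks = rng.sample(buffer.items, min(k, len(buffer.items)))
    return new_batch + [augment(t, rng) for t in picks]

buf = ReplayBuffer(capacity=3)
for t in ["this is awful content", "helpful message", "spam post", "late post"]:
    buf.add(t)
batch = replay_batch(buf, ["fresh example"], k=2, rng=random.Random(1))
```

Each fine-tuning step then trains on `batch`, so the model sees both the drifted distribution and (augmented) reminders of past tasks, which is what limits catastrophic forgetting.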

Updated: 2025-10-25 19:04:14

标题: 基于知识引导的用于行为分析系统的持续学习

摘要: 在线平台上用户行为正在发展,反映了人们发布信息的实际变化,无论是有帮助的信息还是仇恨言论。学习捕捉这种内容的模型可能会因为数据漂移而随着时间的推移而性能下降,这可能导致行为分析系统无效。然而,随着时间的推移通过新数据对这样的模型进行微调可能会有害,因为会发生灾难性遗忘。基于重播的持续学习方法提供了一种简单而有效的方法来更新这样的模型,通过保持来自过去学习任务的重要训练实例的缓冲区,最大限度地减少遗忘。然而,这种方法的主要局限性是缓冲区的固定大小。外部知识库可以通过数据增强来克服这一限制。我们提出了一种新颖的基于增强的方法,将外部知识纳入基于重播的持续学习框架中。我们评估了几种策略,使用三个先前研究中与异常行为分类相关的数据集,以评估外部知识在持续学习中的整合,并证明增强有助于胜过基线的基于重播的方法。

更新时间: 2025-10-25 19:04:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.22405v1

Aligning LLMs for Multilingual Consistency in Enterprise Applications

Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training & deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.
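The batch-wise idea, each batch carries semantically equivalent examples in several languages and a penalty pulls their representations toward agreement, can be sketched with a simple anchor-based squared-distance term. The English-anchor choice, the toy 2-d embeddings, and the squared Euclidean penalty are assumptions of this sketch, not the paper's exact loss.

```python
def alignment_loss(embeddings_by_lang, anchor="en"):
    # Batch-wise alignment penalty: mean squared distance between each
    # non-anchor language's embedding and the anchor (English) embedding
    # of the same semantically equivalent example.
    anchor_vecs = embeddings_by_lang[anchor]
    total, count = 0.0, 0
    for lang, vecs in embeddings_by_lang.items():
        if lang == anchor:
            continue
        for a, v in zip(anchor_vecs, vecs):
            total += sum((ai - vi) ** 2 for ai, vi in zip(a, v))
            count += 1
    return total / count

# One training batch: two examples, each present in three languages.
batch = {
    "en": [[1.0, 0.0], [0.0, 1.0]],
    "de": [[0.9, 0.1], [0.1, 0.9]],
    "hi": [[0.8, 0.0], [0.0, 0.7]],
}
loss = alignment_loss(batch)
```

In training, this term would be added to the ordinary task loss on the same batch, so the model is optimized for the task and for cross-language output consistency simultaneously.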

Updated: 2025-10-25 18:56:44

Categories: cs.CL,cs.AI,68T05, 68T50, 68Q25,I.2.7; I.5.1; I.2.8

Download: http://arxiv.org/abs/2509.23659v2

ProGQL: A Provenance Graph Query System for Cyber Attack Investigation

Provenance analysis (PA) has recently emerged as an important solution for cyber attack investigation. PA leverages system monitoring to record system activities as a series of system audit events and organizes these events as a provenance graph to show the dependencies among system activities, which can reveal steps of cyber attacks. Despite their potential, existing PA techniques face two critical challenges: (1) they are inflexible and non-extensible, making it difficult to incorporate analyst expertise, and (2) they are memory inefficient, often requiring >100GB of RAM to hold entire event streams, which fundamentally limits scalability and deployment in real-world environments. To address these limitations, we propose the PROGQL framework, which provides a domain-specific graph search language with a well-engineered query engine, allowing PA over system audit events and expert knowledge to be jointly expressed as a graph search query and thereby facilitating the investigation of complex cyberattacks. In particular, to support dependency searches from a starting edge required in PA, PROGQL introduces new language constructs for constrained graph traversal, edge weight computation, value propagation along weighted edges, and graph merging to integrate multiple searches. Moreover, the PROGQL query engine is optimized for efficient incremental graph search across heterogeneous database backends, eliminating the need for full in-memory materialization and reducing memory overhead. Our evaluations on real attacks demonstrate the effectiveness of the PROGQL language in expressing a diverse set of complex attacks compared with the state-of-the-art graph query language Cypher, and the comparison with the SOTA PA technique DEPIMPACT further demonstrates the significant scalability improvement brought by the PROGQL framework's design.

Updated: 2025-10-25 18:53:49

Categories: cs.CR,cs.DB

Download: http://arxiv.org/abs/2510.22400v1

NetBurst: Event-Centric Forecasting of Bursty, Intermittent Time Series

Forecasting on widely used benchmark time series data (e.g., ETT, Electricity, Taxi, and Exchange Rate) has favored smooth, seasonal series, but network telemetry time series (traffic measurements at service, IP, or subnet granularity) are instead highly bursty and intermittent, with heavy-tailed bursts and highly variable inactive periods. These properties place the latter in the statistical regimes made famous and popularized more than 20 years ago by B. Mandelbrot. Yet forecasting such time series with modern-day AI architectures remains underexplored. We introduce NetBurst, an event-centric framework that reformulates forecasting as predicting when bursts occur and how large they are, using quantile-based codebooks and dual autoregressors. Across large-scale sets of production network telemetry time series and compared to strong baselines, such as Chronos, NetBurst reduces Mean Absolute Scaled Error (MASE) by 13-605x on service-level time series while preserving burstiness and producing embeddings that cluster 5x more cleanly than Chronos. In effect, our work highlights the benefits that modern AI can reap from leveraging Mandelbrot's pioneering studies for forecasting in bursty, intermittent, and heavy-tailed regimes, where its operational value for high-stakes decision making is of paramount interest.
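MASE, the metric reported above, has a standard definition: the forecast MAE scaled by the in-sample MAE of a naive lag-1 forecast on the training series. A small self-contained computation on toy data (values are illustrative, not from the paper):

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of the naive lag-1 forecast on the training series."""
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

# Toy bursty, intermittent series: mostly zero with occasional spikes.
y_train = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 12.0, 0.0])
y_true = np.array([0.0, 9.0, 0.0])

mase_zero = mase(y_true, np.zeros(3), y_train)                  # always-zero forecast
mase_burst = mase(y_true, np.array([0.0, 8.0, 0.0]), y_train)   # burst-aware forecast
```

On such series, a forecaster that anticipates burst timing and magnitude scores far better than one that predicts the (dominant) zero level, which is the regime the paper targets.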

Updated: 2025-10-25 18:48:17

Categories: cs.NI,cs.LG

Download: http://arxiv.org/abs/2510.22397v1

PortGPT: Towards Automated Backporting Using Large Language Models

Patch backporting, the process of migrating mainline security patches to older branches, is an essential task in maintaining popular open-source projects (e.g., Linux kernel). However, manual backporting can be labor-intensive, while existing automated methods, which heavily rely on predefined syntax or semantic rules, often lack agility for complex patches. In this paper, we introduce PORTGPT, an LLM agent for end-to-end automation of patch backporting in real-world scenarios. PORTGPT enhances an LLM with tools to access code on-demand, summarize Git history, and revise patches autonomously based on feedback (e.g., from compilers), hence simulating human-like reasoning and verification. PORTGPT achieved an 89.15% success rate on existing datasets (1815 cases), and 62.33% on our own dataset of 146 complex cases, both outperforming state-of-the-art backporting tools. We contributed 9 backported patches from PORTGPT to the Linux kernel community and all patches are now merged.

Updated: 2025-10-25 18:46:04

Categories: cs.CR

Download: http://arxiv.org/abs/2510.22396v1

Top-Down Semantic Refinement for Image Captioning

Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.

Updated: 2025-10-25 18:27:00

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.22391v1

SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training

Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs), enabling operations within constrained hardware without sacrificing performance. However, these methods primarily emphasize memory savings, often overlooking potential acceleration in convergence due to their reliance on standard isotropic steepest descent techniques, which can perform suboptimally in the highly anisotropic landscapes typical of deep networks, particularly LLMs. In this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that employs exact singular value decomposition (SVD) for moment orthogonalization within a dynamically adapted low-dimensional subspace, enabling norm-inducing steepest descent optimization steps. By explicitly aligning optimization steps with the spectral characteristics of the loss landscape, SUMO effectively mitigates approximation errors associated with commonly used methods like Newton-Schulz orthogonalization approximation. We theoretically establish an upper bound on these approximation errors, proving their dependence on the condition numbers of moments, conditions we analytically demonstrate are encountered during LLM training. Furthermore, we both theoretically and empirically illustrate that exact orthogonalization via SVD substantially improves convergence rates while reducing overall complexity. Empirical evaluations confirm that SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
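The gap between exact SVD-based orthogonalization and the Newton-Schulz approximation can be seen on a toy ill-conditioned "moment" matrix. Each Newton-Schulz step maps singular values through f(s) = 1.5*s - 0.5*s**3, so values far from 1 converge slowly; the 5-iteration budget and the test matrix below are illustrative, not the paper's setup:

```python
import numpy as np

def orth_svd(M):
    """Exact orthogonal (polar) factor U @ Vt via SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def orth_newton_schulz(M, iters=5):
    """Newton-Schulz approximation of the same factor. Each step maps
    singular values through f(s) = 1.5*s - 0.5*s**3, so values far from 1
    (ill-conditioned moments) converge slowly."""
    X = M / np.linalg.norm(M)     # Frobenius norm bounds the spectral norm
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(6)
A = rng.normal(size=(6, 6))
U, _, Vt = np.linalg.svd(A)
# Ill-conditioned "moment": one singular value much smaller than the rest.
M = U @ np.diag([1.0, 1.0, 1.0, 1.0, 1.0, 1e-4]) @ Vt

Q_svd, Q_ns = orth_svd(M), orth_newton_schulz(M)
err_svd = np.linalg.norm(Q_svd.T @ Q_svd - np.eye(6))
err_ns = np.linalg.norm(Q_ns.T @ Q_ns - np.eye(6))   # much larger
```

This mirrors the paper's point that the approximation error depends on the condition number of the moments: the tiny singular value barely moves in a few iterations, leaving a far-from-orthogonal factor.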

Updated: 2025-10-25 18:18:42

Categories: cs.LG,cs.CL,math.OC

Download: http://arxiv.org/abs/2505.24749v2

Automated Constitutive Model Discovery by Pairing Sparse Regression Algorithms with Model Selection Criteria

The automated discovery of constitutive models from data has recently emerged as a promising alternative to the traditional model calibration paradigm. In this work, we present a fully automated framework for constitutive model discovery that systematically pairs three sparse regression algorithms (Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle Regression (LARS), and Orthogonal Matching Pursuit (OMP)) with three model selection criteria: $K$-fold cross-validation (CV), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). This pairing yields nine distinct algorithms for model discovery and enables a systematic exploration of the trade-off between sparsity, predictive performance, and computational cost. While LARS serves as an efficient path-based solver for the $\ell_1$-constrained problem, OMP is introduced as a tractable heuristic for $\ell_0$-regularized selection. The framework is applied to both isotropic and anisotropic hyperelasticity, utilizing both synthetic and experimental datasets. Results reveal that all nine algorithm-criterion combinations perform consistently well in discovering isotropic and anisotropic materials, yielding highly accurate constitutive models. These findings broaden the range of viable discovery algorithms beyond $\ell_1$-based approaches such as LASSO.
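To make the pairing concrete, here is a numpy-only sketch coupling a hand-rolled OMP solver with BIC-based model selection; the synthetic data, candidate sparsity levels, and BIC form are illustrative (the paper's framework also pairs LASSO and LARS with CV and AIC):

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: greedily add the feature most correlated
    with the residual, then refit the support by least squares."""
    support, residual = [], y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    beta = np.zeros(X.shape[1])
    beta[support] = coef
    return beta

def bic(X, y, beta):
    """Bayesian Information Criterion for a Gaussian linear model."""
    n = len(y)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + np.count_nonzero(beta) * np.log(n)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
true_beta = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0, 0, 0])   # 2 active features
y = X @ true_beta + 0.01 * rng.normal(size=100)

models = {k: omp(X, y, k) for k in (1, 2, 3, 4)}            # candidate sparsities
best_k = min(models, key=lambda k: bic(X, y, models[k]))    # criterion picks k
```

Swapping `omp` for a LASSO/LARS solver and `bic` for CV or AIC yields the other algorithm-criterion combinations the paper enumerates.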

Updated: 2025-10-25 18:16:18

Categories: cs.LG,cond-mat.mtrl-sci,cs.CE

Download: http://arxiv.org/abs/2509.16040v2

Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?

Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.
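The score-averaging finding is easy to reproduce in a toy simulation: if each LLM query returns the true quality plus independent noise, averaging k identical queries shrinks the error by roughly sqrt(k). The `llm_score` stand-in below is hypothetical, not a real model call:

```python
import random
import statistics

random.seed(0)

def llm_score(true_quality, noise_sd=1.0):
    """Stand-in for one LLM quality query: true score plus noise (hypothetical)."""
    return true_quality + random.gauss(0, noise_sd)

true_q = 3.0
single = [llm_score(true_q) for _ in range(200)]
averaged = [statistics.mean(llm_score(true_q) for _ in range(10))
            for _ in range(200)]

err_single = statistics.mean(abs(s - true_q) for s in single)
err_avg = statistics.mean(abs(s - true_q) for s in averaged)
```

The averaged estimator's error is roughly 1/sqrt(10) of the single-query error, which is why averaging multiple identical queries is a "universally successful strategy" in the paper's experiments.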

Updated: 2025-10-25 18:12:41

Categories: cs.DL,cs.AI

Download: http://arxiv.org/abs/2510.22389v1

Privacy-Aware Federated nnU-Net for ECG Page Digitization

Deep neural networks can convert ECG page images into analyzable waveforms, yet centralized training often conflicts with cross-institutional privacy and deployment constraints. A cross-silo federated digitization framework is presented that trains a full-model nnU-Net segmentation backbone without sharing images and aggregates updates across sites under realistic non-IID heterogeneity (layout, grid style, scanner profile, noise). The protocol integrates three standard server-side aggregators--FedAvg, FedProx, and FedAdam--and couples secure aggregation with central, user-level differential privacy to align utility with formal guarantees. Key features include: (i) end-to-end full-model training and synchronization across clients; (ii) secure aggregation so the server only observes a clipped, weighted sum once a participation threshold is met; (iii) central Gaussian DP with Renyi accounting applied post-aggregation for auditable user-level privacy; and (iv) a calibration-aware digitization pipeline comprising page normalization, trace segmentation, grid-leakage suppression, and vectorization to twelve-lead signals. Experiments on ECG pages rendered from PTB-XL show consistently faster convergence and higher late-round plateaus with adaptive server updates (FedAdam) relative to FedAvg and FedProx, while approaching centralized performance. The privacy mechanism maintains competitive accuracy while preventing exposure of raw images or per-client updates, yielding deployable, auditable guarantees suitable for multi-institution settings.
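A sketch of the server-side step combining the ingredients named above (clipping, averaging, central Gaussian noise). The clipping bound, noise scale, and update shapes are illustrative; a real deployment would add secure aggregation and Renyi accounting as the abstract describes:

```python
import numpy as np

def dp_fedavg(updates, clip, sigma, rng):
    """Server-side step: clip each client update to L2 norm `clip`, average,
    then add central Gaussian noise scaled to the clipping bound. Constants
    are illustrative, not a calibrated privacy guarantee."""
    clipped = [u * min(1.0, clip / max(np.linalg.norm(u), 1e-12))
               for u in updates]
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip / len(updates), size=avg.shape)
    return avg + noise

rng = np.random.default_rng(4)
# Three clients with very different update magnitudes (non-IID heterogeneity).
updates = [rng.normal(size=8) * s for s in (1.0, 5.0, 50.0)]
agg = dp_fedavg(updates, clip=1.0, sigma=0.1, rng=rng)
```

Clipping bounds any single client's influence on the aggregate, which is what lets the added Gaussian noise translate into a user-level differential privacy guarantee.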

Updated: 2025-10-25 18:10:05

Categories: cs.CR,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.22387v1

Solving the Unsolvable: Translating Case Law in Hong Kong

This paper addresses the challenges of translating case law under Hong Kong's bilingual legal system. It highlights the initial success of translating all written statutes into Chinese before the 1997 handover, a task mandated by the Basic Law. The effort involved significant collaboration among legal, linguistic, and translation experts, resulting in a comprehensive and culturally appropriate bilingual legal system. However, translating case law remains a significant challenge due to the sheer volume and continuous growth of judicial decisions. The paper critiques the government's and judiciary's sporadic and uncoordinated efforts to translate case law, contrasting it with the thorough approach previously taken for statute translation. Although the government acknowledges the importance of legal bilingualism, it lacks a sustainable strategy for translating case law. The Judiciary's position that translating all judgments is unnecessary, unrealistic, and not cost-effective is analyzed and critiqued for its impact on legal transparency and public trust. A proposed solution involves leveraging machine translation technology through a human-machine interactive translation platform, which undergoes two major transitions. Initially based on a neural model, the platform transitions to using a large language model for improved translation accuracy. Furthermore, it evolves from a single-agent system to a multi-agent system, incorporating Translator, Annotator, and Proofreader agents. This multi-agent approach, supported by a grant, aims to facilitate efficient, high-quality translation of judicial judgments by integrating advanced artificial intelligence and continuous feedback mechanisms, thus better meeting the needs of a bilingual legal system.

Updated: 2025-10-25 18:00:35

Categories: cs.CL,cs.AI,cs.LG,cs.MA

Download: http://arxiv.org/abs/2501.09444v3

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the Transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.
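A Monarch matrix factors into permutation and block-diagonal pieces, which is what makes the sub-quadratic cost above possible. A numpy sketch of one Monarch matrix-vector product under a common P·L·P·R parameterization, with P the reshape-transpose permutation (its own inverse); block shapes are illustrative:

```python
import numpy as np

def monarch_matvec(L, R, x):
    """y = P @ BD(L) @ P @ BD(R) @ x for a Monarch matrix with b blocks of
    size b (N = b*b), where BD(.) is block-diagonal and P is the
    reshape-transpose permutation. Cost is Theta(N*sqrt(N)) versus
    Theta(N^2) for a dense matvec."""
    b = L.shape[0]                                     # L, R: (b, b, b) block stacks
    z = np.einsum("kij,kj->ki", R, x.reshape(b, b))    # block-diagonal R
    z = np.einsum("kij,kj->ki", L, z.T)                # permute, then block-diagonal L
    return z.T.reshape(-1)                             # final permutation

b = 4                                                  # N = 16
rng = np.random.default_rng(5)
L = rng.normal(size=(b, b, b))
R = rng.normal(size=(b, b, b))
x = rng.normal(size=b * b)
y = monarch_matvec(L, R, x)
```

The two batched block multiplies each cost b * b^2 = N*sqrt(N) operations, and the permutations are free reshapes, which is also why the factorization maps well onto tensor-core hardware.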

Updated: 2025-10-25 17:57:42

Categories: cs.LG

Download: http://arxiv.org/abs/2505.18698v2

Dynamic Dropout: Leveraging Conway's Game of Life for Neural Networks Regularization

Regularization techniques play a crucial role in preventing overfitting and improving the generalization performance of neural networks. Dropout, a widely used regularization technique, randomly deactivates units during training to introduce redundancy and prevent co-adaptation among neurons. Despite its effectiveness, dropout has limitations, such as its static nature and lack of interpretability. In this paper, we propose a novel approach to regularization by substituting dropout with Conway's Game of Life (GoL), a cellular automata with simple rules that govern the evolution of a grid of cells. We introduce dynamic unit deactivation during training by representing neural network units as cells in a GoL grid and applying the game's rules to deactivate units. This approach allows for the emergence of spatial patterns that adapt to the training data, potentially enhancing the network's ability to generalize. We demonstrate the effectiveness of our approach on the CIFAR-10 dataset, showing that dynamic unit deactivation using GoL achieves comparable performance to traditional dropout techniques while offering insights into the network's behavior through the visualization of evolving patterns. Furthermore, our discussion highlights the applicability of our proposal in deeper architectures, demonstrating how it enhances the performance of different dropout techniques.
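The core mechanic, deactivating units according to GoL evolution, can be sketched in a few lines. Mapping one unit per cell on a toroidal grid is an assumption made for illustration:

```python
import numpy as np

def gol_step(grid):
    """One Conway's Game of Life step on a toroidal 0/1 grid:
    birth on 3 neighbours, survival on 2 or 3."""
    n = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0))
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(int)

rng = np.random.default_rng(7)
grid = (rng.random((8, 8)) < 0.5).astype(int)   # one cell per unit (assumption)
for _ in range(3):                              # evolve alongside training steps
    grid = gol_step(grid)

mask = grid                                     # live cells = kept units, dead = dropped
activations = rng.normal(size=(8, 8))
dropped = activations * mask                    # GoL-driven "dropout"
```

Unlike a fresh Bernoulli mask each step, successive masks here are spatially and temporally correlated, which is the dynamic, pattern-forming behavior the paper exploits.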

Updated: 2025-10-25 17:55:13

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.22383v1

Efficient Large-Deformation Medical Image Registration via Recurrent Dynamic Correlation

Deformable image registration estimates voxel-wise correspondences between images through spatial transformations, and plays a key role in medical imaging. While deep learning methods have significantly reduced runtime, efficiently handling large deformations remains a challenging task. Convolutional networks aggregate local features but lack direct modeling of voxel correspondences, prompting recent works to explore explicit feature matching. Among them, voxel-to-region matching is more efficient for direct correspondence modeling by computing local correlation features within neighbourhoods, while region-to-region matching incurs higher redundancy due to excessive correlation pairs across large regions. However, the inherent locality of voxel-to-region matching hinders the capture of long-range correspondences required for large deformations. To address this, we propose a Recurrent Correlation-based framework that dynamically relocates the matching region toward more promising positions. At each step, local matching is performed with low cost, and the estimated offset guides the next search region, supporting efficient convergence toward large deformations. In addition, we use a lightweight recurrent update module with memory capacity and decouple motion-related and texture features to suppress semantic redundancy. We conduct extensive experiments on brain MRI and abdominal CT datasets under two settings: with and without affine pre-registration. Results show that our method exhibits a strong accuracy-computation trade-off, surpassing or matching state-of-the-art performance. For example, it achieves comparable performance on the non-affine OASIS dataset, while using only 9.5% of the FLOPs and running 96% faster than RDP, a representative high-performing method.
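A 1-D toy version of the relocation idea, not the paper's model: each step correlates only within a small window around the current estimate, then moves the window by the best local offset, so a displacement well beyond the window radius is still reached through cheap local matches:

```python
import numpy as np

def match_score(fixed, moving, off):
    # Negative SSD between `moving` and `fixed` shifted by `off` (valid overlap only).
    n = len(fixed)
    lo, hi = max(0, off), min(n, n + off)
    if hi - lo < 4:
        return -np.inf
    return -np.mean((moving[lo:hi] - fixed[lo - off:hi - off]) ** 2)

def recurrent_local_match(fixed, moving, radius=2, steps=10):
    """Each step searches only offsets within `radius` of the current
    estimate, then relocates the search window by the best local offset."""
    off = 0
    for _ in range(steps):
        step = max(range(-radius, radius + 1),
                   key=lambda d: match_score(fixed, moving, off + d))
        if step == 0:
            break
        off += step
    return off

fixed = np.sin(np.linspace(0, 4 * np.pi, 64))   # smooth 1-D "image"
moving = np.roll(fixed, 7)                      # true displacement: 7, beyond radius 2
est = recurrent_local_match(fixed, moving)
```

Each iteration evaluates only 2*radius+1 candidate offsets, the 1-D analogue of computing correlation features within a small neighbourhood and letting the estimated offset steer the next search region.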

Updated: 2025-10-25 17:49:29

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.22380v1

TraceTrans: Translation and Spatial Tracing for Surgical Prediction

Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.

Updated: 2025-10-25 17:48:46

Categories: eess.IV,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.22379v1

Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining

Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs). Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing optimizer states. A key challenge in these methods is selecting suitable subspaces to ensure an effective optimization trajectory. Most existing approaches select the dominant subspace to preserve gradient information, as this intuitively provides the best approximation. However, we find that in practice, the dominant subspace stops changing during pretraining, thereby constraining weight updates to similar subspaces. In this paper, we propose importance sampling for low-rank optimization in LLM pretraining with a provable convergence guarantee, which the dominant subspace approach does not have. Empirically, we demonstrate that our method significantly outperforms previous methods in LLM pretraining tasks.
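The contrast between the two subspace choices can be sketched with numpy: the dominant rule always keeps the top-r singular directions (and can therefore freeze), while importance sampling draws directions with probability proportional to their energy, so non-dominant directions are still visited across steps. The sampling rule below is illustrative, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(16, 12))               # a gradient matrix
U, s, Vt = np.linalg.svd(G, full_matrices=False)
r = 4

# Dominant rule: always the same top-r directions (can stop changing).
dominant = U[:, :r]

# Importance sampling: r distinct directions drawn with probability
# proportional to singular-value energy, so non-dominant directions
# still get explored across optimization steps.
p = s ** 2 / np.sum(s ** 2)
idx = rng.choice(len(s), size=r, replace=False, p=p)
sampled = U[:, idx]

G_proj = sampled @ (sampled.T @ G)          # low-rank projected gradient
```

Because the sampled subspace changes from step to step, weight updates are no longer confined to one frozen subspace, which is the failure mode the abstract identifies for the dominant-subspace approach.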

Updated: 2025-10-25 17:48:37

Categories: cs.LG

Download: http://arxiv.org/abs/2502.05790v2

Label Smoothing Improves Gradient Ascent in LLM Unlearning

LLM unlearning has emerged as a promising approach, aiming to enable models to forget hazardous/undesired knowledge at low cost while preserving as much model utility as possible. Among existing techniques, the most straightforward method is performing Gradient Ascent (GA) w.r.t. the forget data, thereby forcing the model to unlearn the forget dataset. However, GA suffers from severe instability, as it drives updates in a divergent direction, often resulting in drastically degraded model utility. To address this issue, we propose Smoothed Gradient Ascent (SGA). SGA combines the forget data with multiple constructed normal data through a tunable smoothing rate. Intuitively, this extends GA from learning solely on the forget data to jointly learning across both forget and normal data, enabling more stable unlearning while better preserving model utility. Theoretically, we provide the theoretical guidance on the selection of the optimal smoothing rate. Empirically, we evaluate SGA on three benchmarks: TOFU, Harry Potter, and MUSE-NEWS. Experimental results demonstrate that SGA consistently outperforms the original Gradient Ascent (GA) method across all metrics and achieves top-2 performance among all baseline methods on several key metrics.
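One plausible reading of the smoothing rule, offered as a sketch only since the abstract does not give the exact objective: interpolate between ascent on the forget loss and descent on the constructed normal-data loss with a rate alpha in [0, 1]:

```python
import numpy as np

def sga_objective(forget_losses, normal_losses, alpha):
    """Objective to MINIMIZE. alpha = 0 recovers pure Gradient Ascent on the
    forget set; alpha > 0 mixes in descent on constructed normal data.
    The linear form is an illustrative reading, not the paper's exact rule."""
    return (-(1.0 - alpha) * np.mean(forget_losses)
            + alpha * np.mean(normal_losses))

forget = np.array([2.0, 4.0])    # per-example losses on forget data (toy)
normal = np.array([1.0, 1.0])    # per-example losses on constructed normal data
pure_ga = sga_objective(forget, normal, alpha=0.0)    # unbounded ascent direction
smoothed = sga_objective(forget, normal, alpha=0.5)   # tempered by normal data
```

Under this reading, the normal-data term anchors the update so the model cannot diverge arbitrarily while unlearning, matching the stability motivation described above.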

Updated: 2025-10-25 17:43:34

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2510.22376v1

VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.
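The two headline numbers, Mean Absolute Error against expert ratings and correlation with human scores, can be computed as below; the toy scores are made up, and Pearson correlation is assumed as the correlation measure:

```python
import numpy as np

def judge_metrics(model_scores, human_scores):
    pred = np.asarray(model_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    mae = float(np.mean(np.abs(pred - human)))    # Mean Absolute Error
    corr = float(np.corrcoef(pred, human)[0, 1])  # Pearson correlation
    return mae, corr

# Hypothetical judge scores vs. expert annotations on a 1-5 quality scale.
mae, corr = judge_metrics([3.0, 4.5, 2.0, 5.0], [3.5, 4.0, 2.5, 4.5])
```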

Updated: 2025-10-25 17:31:02

Categories: cs.CL,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.22373v1

Reasoning Models Reason Well, Until They Don't

Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seem extraordinary, with some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drop abruptly at sufficient complexity and do not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find the majority of real-world examples fall inside the LRMs' success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.
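A generative process with scalable complexity can be sketched for the graph-connectivity task: complexity is controlled by the length of the witness path between source and target, with distractor edges added as noise. Node naming and the exact construction here are illustrative, not the DeepRD specification:

```python
import random

def connectivity_instance(path_len, n_distractors, rng):
    # Witness path of controlled length between source and target.
    path = [f"n{i}" for i in range(path_len + 1)]
    edges = list(zip(path, path[1:]))
    # Distractor edges on separate nodes, disconnected from the path.
    for _ in range(n_distractors):
        edges.append((f"d{rng.randrange(100)}", f"d{rng.randrange(100)}"))
    rng.shuffle(edges)
    question = f"Is {path[0]} connected to {path[-1]}?"
    return edges, question, True   # ground-truth answer: connected

rng = random.Random(0)
edges, question, answer = connectivity_instance(path_len=5, n_distractors=8, rng=rng)
```

Raising `path_len` scales the number of hops a model must chain, which is the knob the benchmark turns to probe where performance drops off.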

Updated: 2025-10-25 17:28:38

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.22371v1

BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles

In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning (RL) framework for autonomous lane-keeping (LK), in which semantic embeddings generated by a vision-language model (VLM) are directly fused with geometric states, LiDAR observations, and Proportional-Integral-Derivative-based (PID) control feedback within the agent observation space. By combining high-level scene understanding from the VLM with low-level control and spatial signals, the proposed method lets the agent learn context-aware, interpretable driving rules. Our architecture brings together semantic, geometric, and control-aware representations to make policy learning more robust. A hybrid reward function that includes semantic alignment, LK accuracy, obstacle avoidance, and speed regulation makes learning more efficient and generalizable. Unlike approaches that use semantic models only to shape rewards, our method embeds semantic features directly into the state representation. This cuts down on expensive runtime inference and ensures that semantic guidance is always available. Simulation results show that the proposed model surpasses the best vision-based and multimodal RL baselines in LK stability and adaptability across a wide range of difficult driving situations. We make our code publicly available.

Updated: 2025-10-25 17:27:08

Categories: cs.RO,cs.AI,cs.CV,cs.LG,cs.SE

Download: http://arxiv.org/abs/2510.22370v1

A Deep Learning-Based CCTV System for Automatic Smoking Detection in Fire Exit Zones

Motivated by critical safety requirements, we propose a deep-learning, real-time smoking detection system for CCTV surveillance of fire-exit areas. The dataset contains 8,124 images from 20 different scenarios along with 2,708 raw samples depicting low-light areas. We evaluated three advanced object detection models, YOLOv8, YOLOv11, and YOLOv12, and then developed a custom model derived from YOLOv8 with added structures for challenging surveillance contexts. The proposed model outperformed the others, achieving a recall of 78.90 percent and an mAP@50 of 83.70 percent, delivering optimal object detection across varied environments. Performance evaluation on multiple edge devices using multithreaded operations showed that the Jetson Xavier NX processed data at 52 to 97 milliseconds per inference, establishing its suitability for time-sensitive operations. This system offers a robust and adaptable platform for monitoring public safety and enabling automatic regulatory compliance.

Updated: 2025-10-25 17:26:43

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.11696v2

T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models

Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image watermarking for diffusion models, particularly Noise-as-Watermark (NaW) methods, encodes the watermark as a specific standard Gaussian noise vector used for image generation, embedding the information seamlessly while maintaining image quality. For detection, the generation process is inverted to recover the initial noise vector containing the watermark before extraction. However, existing NaW methods struggle to balance watermark robustness with generation diversity. Some methods achieve strong robustness by heavily constraining initial noise sampling, which degrades user experience, while others preserve diversity but prove too fragile for real-world deployment. To address this issue, we propose T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Unlike prior methods that simply map bits to positive or negative values, TTS enhances robustness by embedding bits exclusively in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution. Our two-stage framework then ensures sampling diversity by integrating a randomly generated session key into both encryption pipelines. We evaluate T2SMark on diffusion models with both U-Net and DiT backbones. Extensive experiments show that it achieves an optimal balance between robustness and diversity. Our code is available at https://github.com/0xD009/T2SMark.
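The tail-truncation idea can be sketched as follows: bit positions of the initial noise get samples forced into the tail region |z| > tau with the sign carrying the bit, while the remaining positions are ordinary Gaussian draws. The threshold `tau`, the position layout, and the crude tail draw below are illustrative, not T2SMark's actual sampler or encryption pipeline:

```python
import numpy as np

def tail_truncated_noise(bits, n, tau, rng):
    z = rng.standard_normal(n)                        # central-zone samples
    for i, b in enumerate(bits):                      # bit positions: tail samples
        mag = tau + np.abs(rng.standard_normal())     # crude draw with |z| > tau
        z[i] = mag if b else -mag                     # sign encodes the bit
    return z

def extract_bits(z, k):
    # Detection side: after inverting generation, read signs at bit positions.
    return [int(v > 0) for v in z[:k]]

rng = np.random.default_rng(0)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
z = tail_truncated_noise(bits, n=64, tau=1.0, rng=rng)
```

Because bit-carrying coordinates sit away from zero, small perturbations of the recovered noise are less likely to flip their signs than with a plain sign-of-Gaussian embedding.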

Updated: 2025-10-25 16:55:55

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.22366v1

Bias Begins with Data: The FairGround Corpus for Robust and Reproducible Research on Algorithmic Fairness

As machine learning (ML) systems are increasingly adopted in high-stakes decision-making domains, ensuring fairness in their outputs has become a central challenge. At the core of fair ML research are the datasets used to investigate bias and develop mitigation strategies. Yet, much of the existing work relies on a narrow selection of datasets--often arbitrarily chosen, inconsistently processed, and lacking in diversity--undermining the generalizability and reproducibility of results. To address these limitations, we present FairGround: a unified framework, data corpus, and Python package aimed at advancing reproducible research and critical data studies in fair ML classification. FairGround currently comprises 44 tabular datasets, each annotated with rich fairness-relevant metadata. Our accompanying Python package standardizes dataset loading, preprocessing, transformation, and splitting, streamlining experimental workflows. By providing a diverse and well-documented dataset corpus along with robust tooling, FairGround enables the development of fairer, more reliable, and more reproducible ML models. All resources are publicly available to support open and collaborative research.

Updated: 2025-10-25 16:48:33

Categories: cs.LG,cs.CY,stat.ML

Download: http://arxiv.org/abs/2510.22363v1

Mapping Faithful Reasoning in Language Models

Chain-of-thought (CoT) traces promise transparency for reasoning language models, but prior work shows they are not always faithful reflections of internal computation. This raises challenges for oversight: practitioners may misinterpret decorative reasoning as genuine. We introduce Concept Walk, a general framework for tracing how a model's internal stance evolves with respect to a concept direction during reasoning. Unlike surface text, Concept Walk operates in activation space, projecting each reasoning step onto the concept direction learned from contrastive data. This allows us to observe whether reasoning traces shape outcomes or are discarded. As a case study, we apply Concept Walk to the domain of Safety using Qwen 3-4B. We find that in 'easy' cases, perturbed CoTs are quickly ignored, indicating decorative reasoning, whereas in 'hard' cases, perturbations induce sustained shifts in internal activations, consistent with faithful reasoning. The contribution is methodological: Concept Walk provides a lens to re-examine faithfulness through concept-specific internal dynamics, helping identify when reasoning traces can be trusted and when they risk misleading practitioners.
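The core mechanics can be sketched with a difference-of-means concept direction, a common construction for linear probes from contrastive activations; the paper's exact estimator may differ, and the toy activations below are synthetic:

```python
import numpy as np

def concept_direction(pos_acts, neg_acts):
    # Unit direction separating "concept present" from "concept absent" activations.
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def concept_walk(step_acts, direction):
    # Project each reasoning step's activation onto the concept direction,
    # yielding a trace of the model's internal stance across the CoT.
    return step_acts @ direction

rng = np.random.default_rng(0)
pos = rng.standard_normal((32, 16)) + 2.0   # toy "concept present" activations
neg = rng.standard_normal((32, 16))         # toy "concept absent" activations
d = concept_direction(pos, neg)
steps = np.stack([neg[0], pos[0]])          # a toy two-step reasoning trace
trace = concept_walk(steps, d)
```

A faithful trace should move the projection as the reasoning engages the concept; a decorative one would leave it flat after a perturbation.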

Updated: 2025-10-25 16:48:19

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2510.22362v1

E2Former: An Efficient and Equivariant Transformer with Linear-Scaling Tensor Products

Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, the Wigner $6j$ Conv reduces the complexity from $O(|\mathcal{E}|)$ to $O(|\mathcal{V}|)$ while preserving both the model's expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.

Updated: 2025-10-25 16:28:33

Categories: cs.LG,cond-mat.mtrl-sci

Download: http://arxiv.org/abs/2501.19216v3

Policy Compatible Skill Incremental Learning via Lazy Learning Interface

Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy's decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.
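The lazy mapping can be sketched as nearest-neighbor selection over trajectory summaries: a subtask is executed by the skill whose trajectory distribution is most similar to the subtask's reference trajectories. Here similarity is a mean-feature Euclidean distance and the skill names are made up; the paper's actual similarity measure and skill space will differ:

```python
import numpy as np

def select_skill(subtask_trajs, skill_library):
    # Summarize each trajectory set by its mean feature vector and pick the
    # skill closest to the subtask's reference distribution.
    ref = subtask_trajs.mean(axis=0)
    dists = {name: float(np.linalg.norm(trajs.mean(axis=0) - ref))
             for name, trajs in skill_library.items()}
    return min(dists, key=dists.get)

rng = np.random.default_rng(0)
skill_library = {
    "reach": rng.standard_normal((20, 4)) + np.array([1.0, 0.0, 0.0, 0.0]),
    "grasp": rng.standard_normal((20, 4)) + np.array([0.0, 1.0, 0.0, 0.0]),
}
subtask = rng.standard_normal((10, 4)) + np.array([0.0, 1.0, 0.0, 0.0])
chosen = select_skill(subtask, skill_library)
```

Because the mapping is computed lazily at execution time, the skill set can keep evolving without retraining the downstream policy that emits the subtasks.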

Updated: 2025-10-25 16:20:38

Categories: cs.LG

Download: http://arxiv.org/abs/2509.20612v2

Uncertainty quantification in model discovery by distilling interpretable material constitutive models from Gaussian process posteriors

Constitutive model discovery refers to the task of identifying an appropriate model structure, usually from a predefined model library, while simultaneously inferring its material parameters. The data used for model discovery are measured in mechanical tests and are thus inevitably affected by noise which, in turn, induces uncertainties. Previously proposed methods for uncertainty quantification in model discovery either require the selection of a prior for the material parameters, are restricted to the linear coefficients of the model library or are limited in the flexibility of the inferred parameter probability distribution. We therefore propose a four-step partially Bayesian framework for uncertainty quantification in model discovery that does not require prior selection for the material parameters and also allows for the discovery of non-linear constitutive models: First, we augment the available stress-deformation data with a Gaussian process. Second, we approximate the parameter distribution by a normalizing flow, which allows for capturing complex joint distributions. Third, we distill the parameter distribution by matching the distribution of stress-deformation functions induced by the parameters with the Gaussian process posterior. Fourth, we perform a Sobol' sensitivity analysis to obtain a sparse and interpretable model. We demonstrate the capability of our framework for both isotropic and anisotropic experimental data as well as linear and non-linear model libraries.

Updated: 2025-10-25 16:02:03

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.22345v1

FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation

While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.
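The control flow described above can be sketched as a loop in which a Structured Evidence Assessment audits the evidence against a checklist of required findings and the remaining gaps drive targeted sub-queries. The callables are placeholders for the framework's actual components, and the toy two-hop example is invented:

```python
def fair_rag(query, decompose, retrieve, assess, refine, answer, max_rounds=4):
    checklist = decompose(query)               # required findings for the query
    evidence = retrieve([query])
    for _ in range(max_rounds):
        gaps = assess(checklist, evidence)     # SEA: findings not yet confirmed
        if not gaps:
            break                              # evidence verified as sufficient
        evidence += retrieve(refine(gaps))     # targeted sub-queries fill gaps
    return answer(query, evidence)

# Toy components: a two-hop query whose second fact needs a refined sub-query.
corpus = {"q": "fact-a", "find:fact-b": "fact-b"}
result = fair_rag(
    "q",
    decompose=lambda q: ["fact-a", "fact-b"],
    retrieve=lambda qs: [corpus[q] for q in qs if q in corpus],
    assess=lambda checklist, ev: [c for c in checklist if c not in ev],
    refine=lambda gaps: [f"find:{g}" for g in gaps],
    answer=lambda q, ev: sorted(set(ev)),
)
```

The loop terminates either when the checklist is fully confirmed or after a bounded number of refinement rounds, which is what keeps noise from propagating indefinitely.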

Updated: 2025-10-25 15:59:33

Categories: cs.CL,cs.AI,cs.IR,68T50, 68P20,I.2.7; H.3.3

Download: http://arxiv.org/abs/2510.22344v1

Censoring chemical data to mitigate dual use risk

Machine learning models have dual-use potential, potentially serving both beneficial and malicious purposes. The development of open-source models in chemistry has specifically surfaced dual-use concerns around toxicological data and chemical warfare agents. We discuss a chain risk framework identifying three misuse pathways and corresponding mitigation strategies: inference-level, model-level, and data-level. At the data level, we introduce a model-agnostic noising method to increase prediction error in specific desired regions (sensitive regions). Our results show that selective noise induces variance and attenuation bias, whereas simply omitting sensitive data fails to prevent extrapolation. These findings hold for both molecular feature multilayer perceptrons and graph neural networks. Thus, noising molecular structures can enable open sharing of potential dual-use molecular data.
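The data-level mitigation can be sketched as selective label noising: noise is added only inside a designated sensitive region, raising prediction error there while leaving the rest of the data untouched. The 1-D regression setup and the Gaussian noise model below are illustrative assumptions:

```python
import numpy as np

def noise_sensitive(X, y, in_sensitive, sigma, rng):
    # Model-agnostic censoring: perturb labels only in the sensitive region;
    # sigma controls the induced variance (and hence the prediction error there).
    y = np.asarray(y, dtype=float).copy()
    mask = np.array([in_sensitive(x) for x in X])
    y[mask] += rng.normal(0.0, sigma, size=int(mask.sum()))
    return y

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 100)
y = 2.0 * X                                   # toy clean property values
y_noised = noise_sensitive(X, y, in_sensitive=lambda x: x > 0.8,
                           sigma=5.0, rng=rng)
```

Unlike simply omitting the sensitive rows, the noised points remain in the training set, so a fitted model cannot cleanly extrapolate the true trend into the censored region.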

Updated: 2025-10-25 15:56:51

Categories: cs.LG,cs.CR,cs.CY,physics.chem-ph

Download: http://arxiv.org/abs/2304.10510v2

DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry

Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization. The code and dataset are available at https://zgca-ai4edu.github.io/DynaSolidGeo/.
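The seed-question mechanism can be sketched as a template plus a parameter sampler and a solver, so each call yields a fresh instance with a matching ground truth. The cube-volume seed below is an invented toy, far simpler than the expert-curated solid-geometry seeds in DynaSolidGeo:

```python
import random

def instantiate(seed, rng):
    # Dynamic generation: sample fresh parameters, render the question text,
    # and compute the ground truth programmatically.
    params = seed["sample"](rng)
    return seed["template"].format(**params), seed["solve"](**params)

cube_seed = {
    "template": "A cube has edge length {a}. What is its volume?",
    "sample": lambda rng: {"a": rng.randint(2, 9)},
    "solve": lambda a: a ** 3,
}
rng = random.Random(0)
question, gold = instantiate(cube_seed, rng)
```

Because instances are regenerated on demand, a model cannot rely on memorized answers, which is what distinguishes a dynamic benchmark from a static dataset.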

Updated: 2025-10-25 15:49:45

Categories: cs.AI,cs.CL,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.22340v1

Toward Humanoid Brain-Body Co-design: Joint Optimization of Control and Morphology for Fall Recovery

Humanoid robots represent a central frontier in embodied intelligence, as their anthropomorphic form enables natural deployment in humans' workspace. Brain-body co-design for humanoids presents a promising approach to realizing this potential by jointly optimizing control policies and physical morphology. Within this context, fall recovery emerges as a critical capability. It not only enhances safety and resilience but also integrates naturally with locomotion systems, thereby advancing the autonomy of humanoids. In this paper, we propose RoboCraft, a scalable humanoid co-design framework for fall recovery that iteratively improves performance through the coupled updates of control policy and morphology. A shared policy pretrained across multiple designs is progressively finetuned on high-performing morphologies, enabling efficient adaptation without retraining from scratch. Concurrently, morphology search is guided by human-inspired priors and optimization algorithms, supported by a priority buffer that balances reevaluation of promising candidates with the exploration of novel designs. Experiments show that RoboCraft achieves an average performance gain of 44.55% on seven public humanoid robots, with morphology optimization driving at least 40% of the improvement in co-designing four humanoid robots, underscoring the critical role of humanoid co-design.

Updated: 2025-10-25 15:40:18

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.22336v1

Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single high-level embedding, using it as fixed guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively-aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67x faster inference, and more deterministic results than the diffusion-based baselines.

Updated: 2025-10-25 15:40:07

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.22335v1

Multilingual Target-Stance Extraction

Social media enables data-driven analysis of public opinion on contested issues. Target-Stance Extraction (TSE) is the task of identifying the target discussed in a document and the document's stance towards that target. Many works classify stance towards a given target in a multilingual setting, but all prior work in TSE is English-only. This work introduces the first multilingual TSE benchmark, spanning Catalan, Estonian, French, Italian, Mandarin, and Spanish corpora. It extends the original TSE pipeline to a multilingual setting without requiring a separate model for each language. Our model pipeline achieves a modest F1 score of 12.78, underscoring the increased difficulty of the multilingual task relative to English-only setups and highlighting target prediction as the primary bottleneck. We are also the first to demonstrate the sensitivity of TSE's F1 score to different target verbalizations. Together these serve as a much-needed baseline for resources, algorithms, and evaluation criteria in multilingual TSE.

Updated: 2025-10-25 15:38:15

标题: 多语种目标立场提取

摘要: 社交媒体使对有争议问题的公众意见进行数据驱动分析成为可能。目标立场提取(TSE)是识别文档中讨论的目标及文档对该目标立场的任务。许多作品在多语言环境下分类对特定目标的立场,但在TSE领域的所有先前工作仅限于英语。本研究引入了第一个跨加泰罗尼亚语、爱沙尼亚语、法语、意大利语、普通话和西班牙语语料库的多语言TSE基准。它成功将原始TSE流程扩展到多语言环境,而不需要为每种语言单独建模。我们的模型流程实现了一个适度的F1分数为12.78,突显了相对于仅英语设置而言的多语言任务的增加难度,并将目标预测作为主要瓶颈。我们还是第一个展示TSE的F1分数对不同目标表达的敏感性的研究。这些成果共同构成了多语言TSE领域资源、算法和评估标准的急需基准。

更新时间: 2025-10-25 15:38:15

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.22334v1

LIFT: Interpretable truck driving risk prediction with literature-informed fine-tuned LLMs

This study proposes an interpretable prediction framework with literature-informed fine-tuned (LIFT) LLMs for truck driving risk prediction. The framework integrates an LLM-driven Inference Core that predicts and explains truck driving risk, a Literature Processing Pipeline that filters and summarizes domain-specific literature into a literature knowledge base, and a Result Evaluator that evaluates the prediction performance as well as the interpretability of the LIFT LLM. After fine-tuning on a real-world truck driving risk dataset, the LIFT LLM achieved accurate risk prediction, outperforming benchmark models by 26.7% in recall and 10.1% in F1-score. Furthermore, guided by the literature knowledge base automatically constructed from 299 domain papers, the LIFT LLM produced a variable-importance ranking consistent with that derived from the benchmark model, while demonstrating robustness of its interpretation results under various data-sampling conditions. The LIFT LLM also identified potential risky scenarios by detecting key combinations of variables in truck driving risk, which were verified by PERMANOVA tests. Finally, we demonstrated the contribution of the literature knowledge base and the fine-tuning process to the interpretability of the LIFT LLM, and discussed the potential of the LIFT LLM in data-driven knowledge discovery.

Updated: 2025-10-25 15:37:56

标题: LIFT:具有文献信息的微调LLMs的可解释卡车驾驶风险预测

摘要: 这项研究提出了一个可解释的预测框架,利用文献信息进行微调(LIFT)LLMs进行卡车驾驶风险预测。该框架整合了一个LLM驱动的推理核心,用于预测和解释卡车驾驶风险,一个文献处理管道,用于将领域特定文献过滤和总结为文献知识库,以及一个结果评估器,用于评估LIFT LLM的预测性能以及可解释性。在对真实世界的卡车驾驶风险数据集进行微调后,LIFT LLM实现了准确的风险预测,在召回率方面超越基准模型26.7%,在F1分数方面超越基准模型10.1%。此外,根据从299篇领域论文自动构建的文献知识库的指导,LIFT LLM产生了与基准模型推导的变量重要性排名一致的结果,同时展示了对各种数据采样条件的解释结果的稳健性。LIFT LLM还通过检测卡车驾驶风险中变量的关键组合,识别了潜在的高风险情景,这些情景经过PERMANOVA测试验证。最后,我们展示了文献知识库和微调过程对LIFT LLM可解释性的贡献,并讨论了LIFT LLM在数据驱动知识发现中的潜力。

更新时间: 2025-10-25 15:37:56

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.22333v1

CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models

As Vision-Language Models (VLMs) are increasingly deployed in split-DNN configurations--with visual encoders (e.g., ResNet, ViT) operating on user devices and sending intermediate features to the cloud--there is a growing privacy risk from semantic information leakage. Existing approaches to reconstructing images from these intermediate features often result in blurry, semantically ambiguous images. To directly address semantic leakage, we propose CapRecover, a cross-modality inversion framework that recovers high-level semantic content, such as labels or captions, directly from intermediate features without image reconstruction. We evaluate CapRecover on multiple datasets and victim models, demonstrating strong performance in semantic recovery. Specifically, CapRecover achieves up to 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Our analysis further reveals that deeper convolutional layers encode significantly more semantic information compared to shallow layers. To mitigate semantic leakage, we introduce a simple yet effective protection method: adding random noise to intermediate features at each layer and removing the noise in the next layer. Experimental results show that this approach prevents semantic leakage without additional training costs. Our code is available at https://jus1mple.github.io/Image2CaptionAttack.
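The add-then-remove noise protection can be sketched in a minimal form; the actual defense operates on tensor-valued layer features, and in practice the noise would be regenerated on the trusted side from a shared seed rather than transmitted.

```python
import random

def protect(features, seed):
    # Add per-element Gaussian noise before sending features off-device.
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in features]
    return [f + n for f, n in zip(features, noise)], noise

def recover(noisy_features, noise):
    # The next (trusted) layer subtracts the same noise before proceeding.
    return [f - n for f, n in zip(noisy_features, noise)]

features = [0.5, -1.2, 3.3]
noisy, noise = protect(features, seed=42)
restored = recover(noisy, noise)
```

An eavesdropper on the transmitted `noisy` values sees scrambled features, while the legitimate pipeline restores them exactly, matching the no-extra-training-cost property claimed in the abstract.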

Updated: 2025-10-25 15:36:26

标题: CapRecover:一种基于视觉语言模型的跨模态特征反演攻击框架

摘要: 随着视觉-语言模型(VLMs)越来越多地部署在分布式DNN配置中--其中视觉编码器(如ResNet、ViT)在用户设备上运行,并将中间特征发送到云端--存在着因语义信息泄露而日益增加的隐私风险。现有的重建图像的方法往往导致模糊、语义模糊的图像。为了直接解决语义泄漏问题,我们提出了CapRecover,这是一个跨模态反演框架,可以直接从中间特征中恢复高级语义内容,如标签或标题,而无需重建图像。 我们在多个数据集和受害模型上评估了CapRecover,在语义恢复方面表现出强大的性能。具体来说,CapRecover在CIFAR-10上实现了高达92.71%的Top-1标签准确率,并在COCO2017上从ResNet50特征生成流畅的标题,ROUGE-L分数高达0.52。我们的分析进一步揭示,与浅层相比,深层卷积层编码了更多的语义信息。为了减轻语义泄漏,我们提出了一种简单而有效的保护方法:在每一层向中间特征添加随机噪声,并在下一层中去除噪声。实验结果表明,这种方法可以防止语义泄漏,而无需额外的训练成本。我们的代码可在https://jus1mple.github.io/Image2CaptionAttack 获取。

更新时间: 2025-10-25 15:36:26

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.22828v3

Depth-Constrained ASV Navigation with Deep RL and Limited Sensing

Autonomous Surface Vehicles (ASVs) play a crucial role in maritime operations, yet their navigation in shallow-water environments remains challenging due to dynamic disturbances and depth constraints. Traditional navigation strategies struggle with limited sensor information, making safe and efficient operation difficult. In this paper, we propose a reinforcement learning (RL) framework for ASV navigation under depth constraints, where the vehicle must reach a target while avoiding unsafe areas with only a single depth measurement per timestep from a downward-facing Single Beam Echosounder (SBES). To enhance environmental awareness, we integrate Gaussian Process (GP) regression into the RL framework, enabling the agent to progressively estimate a bathymetric depth map from sparse sonar readings. This approach improves decision-making by providing a richer representation of the environment. Furthermore, we demonstrate effective sim-to-real transfer, ensuring that trained policies generalize well to real-world aquatic conditions. Experimental results validate our method's capability to improve ASV navigation performance while maintaining safety in challenging shallow-water environments.
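The GP-based depth-map estimation can be illustrated in one dimension. The kernel, length scale, and readings below are illustrative assumptions (the real problem is a 2-D bathymetric map built from SBES returns):

```python
import math

# Sparse single-beam depth readings along a 1-D track.
xs = [0.0, 1.0, 2.0, 3.0]
depths = [5.0, 4.0, 3.5, 4.5]

def rbf(a, b, length=1.0):
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, y):
    # Gaussian elimination with partial pivoting (fine for tiny systems).
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# Posterior-mean weights: alpha = (K + jitter * I)^{-1} y
K = [[rbf(a, b) + (1e-6 if i == j else 0.0) for j, b in enumerate(xs)]
     for i, a in enumerate(xs)]
alpha = solve(K, depths)

def predict(x_star):
    # GP posterior mean k(x*, X) alpha at an unvisited position.
    return sum(rbf(x_star, xi) * ai for xi, ai in zip(xs, alpha))
```

Queried between sonar readings, `predict` fills in a smooth depth estimate, which is the richer environment representation the RL agent conditions on.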

Updated: 2025-10-25 15:34:56

标题: 深度受限ASV导航与深度强化学习和有限感知

摘要: Autonomous Surface Vehicles (ASVs)在海上作业中扮演着至关重要的角色,然而它们在浅水环境中的导航仍然具有挑战性,这是由于动态干扰和深度限制。传统的导航策略受限于有限的传感器信息,导致安全和高效的运行困难重重。在本文中,我们提出了一种基于强化学习(RL)框架的ASV导航方法,其中车辆必须在仅从向下朝向的单波束回声仪(SBES)获取的每个时间步长的单个深度测量中到达目标并避开不安全区域。为了增强环境意识,我们将高斯过程(GP)回归集成到RL框架中,使代理能够逐渐从稀疏声纳读数中估计出地形深度图。这种方法通过提供更丰富的环境表示来改善决策制定。此外,我们展示了有效的虚拟到真实的转移,确保训练出的策略能够很好地泛化到真实世界的水生条件。实验结果验证了我们的方法在提高ASV导航性能的同时在具有挑战性的浅水环境中保持安全性的能力。

更新时间: 2025-10-25 15:34:56

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2504.18253v3

Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

Recent interpretability work on large language models (LLMs) has been increasingly dominated by a feature-discovery approach that relies on proxy modules: the quality of the features learned by, e.g., sparse autoencoders (SAEs) is then evaluated. This paradigm naturally raises a critical question: do such learned features have better properties than those already represented within the original model parameters? Unfortunately, only a few studies have made such comparisons systematically so far. In this work, we revisit the interpretability of feature vectors stored in feed-forward (FF) layers, from the perspective of FF layers as key-value memories, using modern interpretability benchmarks. Our extensive evaluation revealed that SAEs and FFs exhibit a similar range of interpretability, although SAEs displayed an observable but minimal improvement in some aspects. Surprisingly, in certain aspects even vanilla FFs yielded better interpretability than the SAEs, and the features discovered in SAEs and FFs diverged. These results call into question the advantage of SAEs, in terms of both feature quality and faithfulness, over directly interpreting FF feature vectors, and suggest that FF key-value parameters serve as a strong baseline in modern interpretability research.
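The key-value view of an FF layer, under which each row of the value matrix acts as a candidate feature vector, can be sketched as follows (toy sizes and random weights; a real model's keys and values come from trained parameters):

```python
import random

random.seed(0)
d, m = 4, 6   # hidden size, number of key-value memories

# FF(x) = relu(x K^T) V: row i of K is a "key", row i of V its "value".
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]

def ff(x):
    acts = [max(0.0, sum(xi * ki for xi, ki in zip(x, key))) for key in K]
    out = [sum(a * V[i][j] for i, a in enumerate(acts)) for j in range(d)]
    return out, acts

x = [0.3, -0.1, 0.8, 0.5]
out, acts = ff(x)
# Interpreting the layer means inspecting the value rows with the largest
# key activations, much as one inspects an SAE's most active features.
top = max(range(m), key=lambda i: acts[i])
```

This is why FF parameters can be evaluated on the same interpretability benchmarks as SAE features with no extra training.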

Updated: 2025-10-25 15:34:31

标题: Transformer键值记忆几乎与稀疏自动编码器一样具有可解释性

摘要: 最近关于大型语言模型(LLMs)的可解释性工作越来越多地被特征发现方法主导,借助代理模块。然后,评估了由稀疏自编码器(SAEs)学习的特征的质量。这种范式自然地提出了一个关键问题:这些学习到的特征是否比原始模型参数中已经表示的特征具有更好的属性,不幸的是,到目前为止只有少数研究系统地进行了这种比较。在这项工作中,我们重新审视了存储在前馈(FF)层中的特征向量的可解释性,从FF作为键值记忆的视角,使用现代可解释性基准。我们的广泛评估显示,SAE和FF表现出类似范围的可解释性,尽管在某些方面,SAEs显示出可观但微小的改进。此外,在某些方面,令人惊讶的是,甚至普通的FF比SAEs具有更好的可解释性,而在SAEs和FF中发现的特征有所偏离。这些问题带来了关于SAEs的优势的问题,从特征质量和忠实度的两个角度来看,与直接解释FF特征向量相比,FF键值参数在现代可解释性研究中作为一个强有力的基线。

更新时间: 2025-10-25 15:34:31

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2510.22332v1

BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection

This paper details our submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, secured 5th place. We investigated the effectiveness of three pre-trained transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a surprising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.

Updated: 2025-10-25 15:33:29

标题: 在AraGenEval共享任务中被揭露:基于Transformer模型的阿拉伯语人工智能生成文本检测的比较研究

摘要: 本文详细介绍了我们在AraGenEval共享任务中提交的关于阿拉伯文人工智能生成文本检测的论文,我们的团队BUSTED获得了第五名。我们研究了三种预训练的transformer模型:AraELECTRA、CAMeLBERT和XLM-RoBERTa。我们的方法涉及在所提供的数据集上对每个模型进行微调,用于二元分类任务。我们的研究结果显示了一个令人惊讶的结果:多语言XLM-RoBERTa模型实现了最高的性能,F1分数为0.7701,表现优于专门的阿拉伯模型。这项工作强调了人工智能生成文本检测的复杂性,并突显了多语言模型的强大泛化能力。

更新时间: 2025-10-25 15:33:29

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.20610v2

The dark side of the forces: assessing non-conservative force models for atomistic machine learning

The use of machine learning to estimate the energy of a group of atoms, and the forces that drive them to more stable configurations, has revolutionized the fields of computational chemistry and materials discovery. In this domain, rigorous enforcement of symmetry and conservation laws has traditionally been considered essential. For this reason, interatomic forces are usually computed as the derivatives of the potential energy, ensuring energy conservation. Several recent works have questioned this physically constrained approach, suggesting that directly predicting the forces yields a better trade-off between accuracy and computational efficiency, and that energy conservation can be learned during training. This work investigates the applicability of such non-conservative models in microscopic simulations. We identify and demonstrate several fundamental issues, from ill-defined convergence of geometry optimization to instability in various types of molecular dynamics. Given the difficulty in monitoring and correcting the lack of energy conservation, direct forces should be used with great care. We show that the best approach to exploit the acceleration they afford is to use them in conjunction with conservative forces. A model can be pre-trained efficiently on direct forces, then fine-tuned using backpropagation. At evaluation time, both force types can be used together to avoid unphysical effects while still benefitting almost entirely from the computational efficiency of direct forces.
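The distinction between conservative and directly predicted forces can be illustrated on a toy one-dimensional potential. The perturbed direct-force model is a hypothetical stand-in: because it is not the gradient of any energy, simulations driven by it have no conserved energy to monitor.

```python
def energy(x):
    # Toy 1-D double-well potential standing in for a learned energy model.
    return x ** 4 - 2 * x ** 2

def conservative_force(x, h=1e-5):
    # F = -dE/dx: forces derived from the energy conserve it by construction.
    return -(energy(x + h) - energy(x - h)) / (2 * h)

def direct_force(x):
    # Hypothetical direct-force predictor with a small systematic error; it
    # is not the gradient of any energy, so total energy drifts during MD.
    return -(4 * x ** 3 - 4 * x) + 0.05

x = 0.7
f_cons = conservative_force(x)
f_dir = direct_force(x)
```

The paper's recommended recipe, pre-training on direct forces and fine-tuning or evaluating with conservative ones, amounts to using `direct_force` for speed where safe and `conservative_force` where energy conservation matters.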

Updated: 2025-10-25 15:28:24

标题: 力的暗面:评估非保守力模型用于原子级机器学习

摘要: 机器学习用于估计一组原子的能量以及驱使它们转向更稳定构型的力量,已经彻底改变了计算化学和材料发现领域。在这个领域中,对对称性和守恒定律的严格执行传统上被认为是必不可少的。因此,通常将原子间力计算为势能的导数,以确保能量守恒。一些最近的研究质疑了这种受物理约束的方法,提出直接预测力量可以在精度和计算效率之间取得更好的权衡,并且能量守恒可以在训练过程中学习到。本研究探讨了这种非保守模型在微观模拟中的适用性。我们确定并展示了几个基本问题,从几何优化的不明确收敛到各种类型的分子动力学中的不稳定性。鉴于监测和纠正能量守恒不足的困难,应当谨慎使用直接力。我们展示了利用其提供的加速度的最佳方法是将其与保守力结合使用。模型可以在直接力上进行高效的预训练,然后使用反向传播进行微调。在评估时,可以同时使用两种力类型,以避免非物理效应,同时几乎完全受益于直接力的计算效率。

更新时间: 2025-10-25 15:28:24

领域: physics.chem-ph,cs.LG

下载: http://arxiv.org/abs/2412.11569v6

FlashMD: long-stride, universal prediction of molecular dynamics

Molecular dynamics (MD) provides insights into atomic-scale processes by integrating over time the equations that describe the motion of atoms under the action of interatomic forces. Machine learning models have substantially accelerated MD by providing inexpensive predictions of the forces, but they remain constrained to minuscule time integration steps, which are required by the fast time scale of atomic motion. In this work, we propose FlashMD, a method to predict the evolution of positions and momenta over strides that are between one and two orders of magnitude longer than typical MD time steps. We incorporate considerations on the mathematical and physical properties of Hamiltonian dynamics in the architecture, generalize the approach to allow the simulation of any thermodynamic ensemble, and carefully assess the possible failure modes of such a long-stride MD approach. We validate FlashMD's accuracy in reproducing equilibrium and time-dependent properties, using both system-specific and general-purpose models, extending the ability of MD simulation to reach the long time scales needed to model microscopic processes of high scientific and technological relevance.
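The contrast between many small integration steps and one long-stride prediction can be sketched on a harmonic oscillator, where the long-stride propagator happens to be known in closed form; that closed form is a stand-in for the learned FlashMD predictor, which would produce the strided state in a single model evaluation.

```python
import math

def force(x):
    # Harmonic oscillator (mass = stiffness = 1) standing in for a force field.
    return -x

def velocity_verlet(x, p, dt, n_steps):
    # Conventional MD: many small, explicitly integrated steps.
    for _ in range(n_steps):
        p_half = p + 0.5 * dt * force(x)
        x = x + dt * p_half
        p = p_half + 0.5 * dt * force(x)
    return x, p

def long_stride(x, p, T):
    # A FlashMD-style model predicts (x, p) after a long stride T in one
    # shot; for this toy system the exact propagator is a rotation.
    return (x * math.cos(T) + p * math.sin(T),
            -x * math.sin(T) + p * math.cos(T))

x_md, p_md = velocity_verlet(1.0, 0.0, dt=0.01, n_steps=1000)   # total time 10
x_fl, p_fl = long_stride(1.0, 0.0, T=10.0)
```

One strided evaluation replaces a thousand force calls here, which is the source of the claimed speedup; predicting both positions and momenta is what lets the method respect the Hamiltonian structure.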

Updated: 2025-10-25 15:23:21

标题: FlashMD:长步幅、通用的分子动力学预测

摘要: 分子动力学(MD)通过整合描述原子在相互作用力的作用下运动的方程,随时间提供了原子尺度过程的见解。机器学习模型通过提供廉价的力预测大大加速了MD,但它们仍然受限于微小的时间积分步骤,这是由于原子运动的快速时间尺度所需。在这项工作中,我们提出了FlashMD,一种方法,可以预测位置和动量的演化,其步幅比典型的MD时间步长长一个到两个数量级。我们在架构中考虑了哈密顿动力学的数学和物理特性,将该方法推广到允许模拟任何热力学系。并仔细评估了这种长步长MD方法可能的故障模式。我们验证了FlashMD在重现平衡和时间相关性质方面的准确性,使用系统特定和通用模型,扩展了MD模拟的能力,以达到模拟高科学和技术相关性的微观过程所需的长时间尺度。

更新时间: 2025-10-25 15:23:21

领域: physics.chem-ph,cs.LG

下载: http://arxiv.org/abs/2505.19350v2

Graph-Coarsening Approach for the Capacitated Vehicle Routing Problem with Time Windows

The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) is a fundamental NP-hard optimization problem in logistics. Solving large-scale instances remains computationally challenging for exact solvers. This work introduces a multilevel graph coarsening and refinement framework that aggregates customers into meta-nodes using a spatio-temporal distance metric. The reduced problem is solved with classical heuristics and subsequently expanded back into the original space with feasibility corrections. Preliminary experiments on Solomon benchmark instances show that the proposed method reduces computation time while preserving or improving solution quality, particularly with respect to capacity and time window constraints. The paper also explores the integration of quantum-inspired optimization techniques, highlighting their potential to further accelerate large-scale vehicle routing tasks.
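A minimal sketch of the spatio-temporal coarsening step follows; the distance weighting, threshold, and greedy single-pass aggregation are illustrative assumptions, not the paper's exact procedure.

```python
import math

# Customers as (x, y, tw_start, tw_end): location plus time window.
customers = [(0, 0, 8, 10), (1, 0, 8, 11), (10, 10, 14, 16), (11, 9, 15, 17)]

def st_dist(a, b, alpha=0.5):
    # Spatio-temporal metric: Euclidean distance plus weighted time-window
    # dissimilarity (alpha is an assumed trade-off weight).
    spatial = math.hypot(a[0] - b[0], a[1] - b[1])
    temporal = abs(a[2] - b[2]) + abs(a[3] - b[3])
    return spatial + alpha * temporal

def coarsen(customers, threshold=5.0):
    # Greedy single-pass aggregation of nearby, time-compatible customers
    # into meta-nodes; the reduced instance is then solved heuristically.
    meta = []
    for c in customers:
        for group in meta:
            if st_dist(group[0], c) <= threshold:
                group.append(c)
                break
        else:
            meta.append([c])
    return meta

meta_nodes = coarsen(customers)
```

The reduced instance over `meta_nodes` is what the classical heuristic solves before the solution is expanded back and repaired for feasibility.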

Updated: 2025-10-25 15:13:41

标题: 图形粗化方法用于带时间窗口的带容量车辆路径问题

摘要: 具有时间窗口的有容量车辆路径问题(CVRPTW)是物流中的一个基本的NP困难优化问题。对于精确求解器来说,解决大规模实例仍然具有计算挑战性。本文介绍了一个多层次图粗化和细化框架,使用时空距离度量将客户聚合成元节点。通过经典启发式方法解决减少的问题,随后通过可行性校正将其扩展回原始空间。对Solomon基准实例的初步实验表明,所提出的方法减少了计算时间,同时保持或改善了解决方案质量,特别是在容量和时间窗口约束方面。本文还探讨了量子启发式优化技术的整合,突出它们进一步加速大规模车辆路径任务的潜力。

更新时间: 2025-10-25 15:13:41

领域: cs.AI,math.OC,90C59, 90C27,G.2.2; I.2.8; F.2.2

下载: http://arxiv.org/abs/2510.22329v1

Monitoring State Transitions in Markovian Systems with Sampling Cost

We consider a node-monitor pair, where the node's state varies with time. The monitor needs to track the node's state at all times; however, there is a fixed cost for each state query. So the monitor may instead predict the state using time-series forecasting methods, including time-series foundation models (TSFMs), and query only when prediction uncertainty is high. Since query decisions influence prediction accuracy, determining when to query is nontrivial. A natural approach is a greedy policy that predicts when the expected prediction loss is below the query cost and queries otherwise. We analyze this policy in a Markovian setting, where the optimal (OPT) strategy is a state-dependent threshold policy minimizing the time-averaged sum of query cost and prediction losses. We show that, in general, the greedy policy is suboptimal and can have an unbounded competitive ratio, but under common conditions such as identically distributed transition probabilities, it performs close to OPT. For the case of unknown transition probabilities, we further propose a projected stochastic gradient descent (PSGD)-based learning variant of the greedy policy, which achieves a favorable predict-query tradeoff with improved computational efficiency compared to OPT.
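The greedy predict-or-query rule can be sketched on a two-state chain; the transition probabilities and query cost below are illustrative.

```python
# Two-state Markov chain; each step the monitor either queries (fixed cost)
# or predicts the next state and pays the expected 0/1 prediction loss.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
query_cost = 0.15

def greedy_step(belief):
    # belief: distribution over the current state.
    next_dist = {s: sum(belief[a] * P[a][s] for a in belief) for s in (0, 1)}
    guess = max(next_dist, key=next_dist.get)
    expected_loss = 1.0 - next_dist[guess]          # expected 0/1 loss
    action = "query" if expected_loss > query_cost else "predict"
    return action, guess
```

From state 0 the self-loop is strong enough that predicting beats paying the query cost; from state 1 the chain is less predictable and the greedy rule queries. The abstract's point is that this myopic rule can be far from the state-dependent threshold policy OPT, yet close to it under conditions such as identically distributed transitions.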

Updated: 2025-10-25 15:07:37

标题: 使用采样成本监测马尔可夫系统中的状态转换

摘要: 我们考虑一个节点-监控器对,其中节点的状态随时间变化。监控器需要始终跟踪节点的状态;然而,每次查询状态都有固定成本。因此,监控器可以使用时间序列预测方法,包括时间序列基础模型(TSFMs),仅在预测不确定性高时进行查询。由于查询决策会影响预测准确性,确定何时进行查询并非易事。一种自然的方法是贪婪策略,即在预期预测损失低于查询成本时进行预测,否则进行查询。我们在马尔可夫设置中分析了这种策略,其中最优(OPT)策略是一种依赖状态的阈值策略,最小化查询成本和预测损失的时间平均和。我们表明,一般情况下,贪婪策略是次优的,并且可能具有无界的竞争比率,但在诸如转移概率相同等常见条件下,它表现接近于OPT。对于未知转移概率的情况,我们进一步提出了一种基于投影随机梯度下降(PSGD)的学习变体的贪婪策略,该策略在改善计算效率的同时实现了有利的预测-查询权衡。

更新时间: 2025-10-25 15:07:37

领域: cs.LG,cs.IT,math.IT,stat.ML

下载: http://arxiv.org/abs/2510.22327v1

GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
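Ratio normalization, which restores a mean-one importance ratio before PPO-style clipping, can be sketched with illustrative numbers. With a left-shifted batch of ratios, the upper clip bound never binds; after dividing by the batch mean, ratios straddle 1 again and clipping can constrain overconfident positive updates.

```python
# Importance ratios whose mean has drifted below 1, as the abstract reports;
# the specific numbers are illustrative.
ratios = [0.7, 0.8, 0.85, 0.9, 0.95]
eps = 0.2  # PPO clip range

def clip(r, eps):
    return max(1 - eps, min(1 + eps, r))

mean = sum(ratios) / len(ratios)
normalized = [r / mean for r in ratios]      # restores a mean of exactly 1

clipped_raw = [clip(r, eps) for r in ratios]       # upper bound never binds
clipped_norm = [clip(r, eps) for r in normalized]  # both bounds can bind
```

In GRPO-Guard this normalization is applied per denoising timestep, so the clip range is consistent across the variance differences the authors observe.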

Updated: 2025-10-25 14:51:17

标题: GRPO-Guard:通过受控剪切缓解流匹配中的隐式过度优化

摘要: 最近,基于GRPO的强化学习在优化流匹配模型方面取得了显著进展,有效提高了它们与特定任务奖励的对齐性。在这些框架中,策略更新依赖于重要性比裁剪,以限制过度自信的正负梯度。然而,在实践中,我们观察到重要性比分布存在系统性偏移-其均值低于1,方差在不同时间步骤之间存在显著差异。这种左偏和不一致的分布阻止了正优势样本进入裁剪区域,导致机制无法限制过度自信的正向更新。因此,策略模型不可避免地进入了一个隐式过度优化阶段-虽然代理奖励继续增加,但图像质量和文本提示对齐等关键指标急剧恶化,最终使得学习的策略在实际应用中变得不切实际。为解决这一问题,我们引入了GRPO-Guard,这是对现有GRPO框架的简单而有效的增强。我们的方法包括比例归一化,恢复了平衡和步骤一致的重要性比,确保PPO裁剪正确地约束了去噪时间步骤中的有害更新。此外,梯度重新加权策略使政策梯度在噪声条件下均衡,防止特定时间步骤区域的过度更新。这些设计共同构成了一个受监管的裁剪机制,稳定了优化过程,大大减轻了隐式过度优化,而无需依赖重的KL正则化。对多个扩散骨干(例如SD3.5M、Flux.1-dev)和不同代理任务进行的大量实验表明,GRPO-Guard显著降低了过度优化,同时保持甚至提高了生成质量。

更新时间: 2025-10-25 14:51:17

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2510.22319v1

Harnessing the Power of Large Language Models for Software Testing Education: A Focus on ISTQB Syllabus

Software testing is a critical component in the software engineering field and is important for software engineering education. Thus, it is vital for academia to continuously improve and update educational methods to reflect the current state of the field. The International Software Testing Qualifications Board (ISTQB) certification framework is globally recognized and widely adopted in industry and academia. However, ISTQB-based learning has rarely incorporated recent advances in generative artificial intelligence. Despite the growing capabilities of large language models (LLMs), ISTQB-based learning and instruction with LLMs have not been thoroughly explored. This paper explores and evaluates how LLMs can complement the ISTQB framework for higher education. The findings present four key contributions: (i) the creation of a comprehensive ISTQB-aligned dataset spanning over a decade, consisting of 28 sample exams and 1,145 questions; (ii) the development of a domain-optimized prompt that enhances LLM precision and explanation quality on ISTQB tasks; (iii) a systematic evaluation of state-of-the-art LLMs on this dataset; and (iv) actionable insights and recommendations for integrating LLMs into software testing education. These findings highlight the promise of LLMs in supporting ISTQB certification preparation and offer a foundation for their broader use in software engineering at higher education.

Updated: 2025-10-25 14:45:58

标题: 利用大型语言模型的力量进行软件测试教育:以ISTQB教学大纲为重点

摘要: 软件测试是软件工程领域的一个关键组成部分,对软件工程教育至关重要。因此,学术界需要不断改进和更新教育方法,以反映该领域的最新状态。国际软件测试资格认证委员会(ISTQB)认证框架在工业界和学术界广泛认可和采用。然而,基于ISTQB的学习很少应用于最近的生成人工智能进步。尽管大型语言模型(LLMs)的能力不断增长,但ISTQB基础学习和使用LLMs的指导并未得到深入探讨。本文探讨并评估了LLMs如何与ISTQB框架相结合,用于高等教育。研究结果提出了四个关键贡献:(i)创建了一个跨越十多年的全面的ISTQB对齐数据集,包括28个样本考试和1,145个问题;(ii)开发了一个领域优化的提示,提高了LLMs在ISTQB任务上的精度和解释质量;(iii)对这一数据集上最先进的LLMs进行系统评估;以及(iv)关于如何将LLMs整合到软件测试教育中的可操作见解和建议。这些发现突显了LLMs在支持ISTQB认证准备方面的潜力,并为它们在高等教育中广泛使用的基础奠定了基础。

更新时间: 2025-10-25 14:45:58

领域: cs.SE,cs.AI,K.3.2, D.2.5

下载: http://arxiv.org/abs/2510.22318v1

LacMaterial: Large Language Models as Analogical Chemists for Materials Discovery

Analogical reasoning, the transfer of relational structures across contexts (e.g., planet is to sun as electron is to nucleus), is fundamental to scientific discovery. Yet human insight is often constrained by domain expertise and surface-level biases, limiting access to deeper, structure-driven analogies both within and across disciplines. Large language models (LLMs), trained on vast cross-domain data, present a promising yet underexplored tool for analogical reasoning in science. Here, we demonstrate that LLMs can generate novel battery materials by (1) retrieving cross-domain analogs and analogy-guided exemplars to steer exploration beyond conventional dopant substitutions, and (2) constructing in-domain analogical templates from few labeled examples to guide targeted exploitation. These explicit analogical reasoning strategies yield candidates outside established compositional spaces and outperform standard prompting baselines. Our findings position LLMs as interpretable, expert-like hypothesis generators that leverage analogy-driven generalization for scientific innovation.

Updated: 2025-10-25 14:25:26

标题: LacMaterial:作为材料发现的类比化学家的大型语言模型

摘要: 类比推理,即在不同背景下传递关系结构(例如,行星对太阳如同电子对原子核),是科学发现的基础。然而,人类的洞察力常常受到领域专业知识和表面层面偏见的限制,从而限制了对更深层次、结构驱动类比的访问,无论是在学科内部还是跨学科之间。在广泛跨领域数据上训练的大型语言模型(LLMs)为科学中的类比推理提供了一个有前途但尚未充分挖掘的工具。在这里,我们展示了LLMs可以通过(1)检索跨领域类比和类比引导的示例来引导探索超越传统的掺杂替换,以及(2)从少量标记示例中构建领内类比模板来引导有针对性的开发,从而生成新型电池材料。这些明确的类比推理策略产生了超出已建立的组成空间之外的候选材料,并优于标准提示基线。我们的发现将LLMs定位为可解释的、类似专家的假设生成器,利用以类比驱动的概括来进行科学创新。

更新时间: 2025-10-25 14:25:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.22312v1

LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models

Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems requiring structured reasoning. This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities. However, identifying optimal AP deployment frameworks can be daunting and introduces new challenges. This paper aims to provide a timely survey of the current research with an in-depth analysis, positioning LLMs as tools for formalizing and refining planning specifications to support reliable off-the-shelf AP planners. By systematically reviewing the current state of research, we highlight methodologies, and identify critical challenges and future directions, hoping to contribute to the joint research on NLP and Automated Planning.

Updated: 2025-10-25 14:13:22

标题: LLMs作为规划形式化器:利用大型语言模型构建自动规划模型的调查

摘要: 大型语言模型(LLMs)在各种自然语言任务中表现出色,但通常在需要结构化推理的长期规划问题上往往表现出困难。这种限制引起了在自动规划(AP)和自然语言处理(NLP)社区中整合神经符号方法的兴趣。然而,确定最佳的AP部署框架可能是令人生畏的,并引入了新的挑战。本文旨在及时调查目前的研究情况,并进行深入分析,将LLMs定位为用于规范和完善规划规范以支持可靠的现成AP规划器的工具。通过系统地审查目前的研究状态,我们突出了方法论,并识别了关键挑战和未来方向,希望为NLP和自动规划的联合研究做出贡献。

更新时间: 2025-10-25 14:13:22

领域: cs.AI

下载: http://arxiv.org/abs/2503.18971v2

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to further decrease memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
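The core mechanism, a shared layer stack applied a router-chosen number of times per token, can be sketched as follows; the router rule and the layer itself are toy stand-ins for the learned lightweight router and the shared transformer block.

```python
def shared_layer(h):
    # Stand-in for one shared transformer block, reused at every depth.
    return [0.5 * v + 1.0 for v in h]

def router(h):
    # Toy router: send "harder" (larger-magnitude) tokens through more
    # recursion steps. The real router is a learned, lightweight module.
    return 3 if max(abs(v) for v in h) > 2.0 else 1

def mixture_of_recursions(tokens):
    outputs = []
    for h in tokens:
        depth = router(h)
        for _ in range(depth):          # same parameters reused `depth` times
            h = shared_layer(h)
        outputs.append((depth, h))
    return outputs

out = mixture_of_recursions([[0.1, 0.2], [4.0, -3.0]])
```

Because only tokens still active at a given depth are processed there, attention and KV caching at that depth can be restricted to those tokens, which is where the memory savings come from.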

Updated: 2025-10-25 14:12:56

标题: 混合递归:学习动态递归深度以实现自适应的令牌级计算

摘要: 扩展语言模型解锁了令人印象深刻的能力,但伴随而来的计算和内存需求使训练和部署变得昂贵。现有的效率工作通常针对参数共享或自适应计算,但如何同时实现这两者仍然是一个未解之谜。我们引入了混合递归(MoR),这是一个统一的框架,将效率的两个轴结合在一个单一的递归变换器中。MoR通过在递归步骤中重复使用一组共享的层来实现参数效率,同时轻量级路由器通过动态分配不同的递归深度给个别令牌,实现自适应的令牌级思考。这使得MoR能够仅在给定递归深度仍处于活动状态的令牌之间进行二次关注计算,通过选择性地仅缓存它们的键值对,进一步提高内存访问效率。除了这些核心机制外,我们还提出了一种KV共享变体,可以重复使用第一个递归中的KV对,专门设计用于进一步减少内存占用。在参数规模从135M到1.7B不等的模型尺度上,MoR形成了一个新的帕累托前沿:在相等的训练FLOPs和更小的模型尺寸下,它显著降低了验证困惑度,提高了少样本准确性,同时与普通的和现有的递归基线相比,提高了吞吐量。这些收益表明MoR是通向大型模型质量的有效途径,而不会产生大型模型成本。

更新时间: 2025-10-25 14:12:56

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.10524v3

AnyECG-Lab: An Exploration Study of Fine-tuning an ECG Foundation Model to Estimate Laboratory Values from Single-Lead ECG Signals

Timely access to laboratory values is critical for clinical decision-making, yet current approaches rely on invasive venous sampling and are intrinsically delayed. Electrocardiography (ECG), as a non-invasive and widely available signal, offers a promising modality for rapid laboratory estimation. Recent progress in deep learning has enabled the extraction of latent hematological signatures from ECGs. However, existing models are constrained by low signal-to-noise ratios, substantial inter-individual variability, limited data diversity, and suboptimal generalization, especially when adapted to low-lead wearable devices. In this work, we conduct an exploratory study leveraging transfer learning to fine-tune ECGFounder, a large-scale pre-trained ECG foundation model, on the Multimodal Clinical Monitoring in the Emergency Department (MC-MED) dataset from Stanford. We generated a corpus of more than 20 million standardized ten-second ECG segments to enhance sensitivity to subtle biochemical correlates. On internal validation, the model demonstrated strong predictive performance (area under the curve above 0.65) for thirty-three laboratory indicators, moderate performance (between 0.55 and 0.65) for fifty-nine indicators, and limited performance (below 0.55) for sixteen indicators. This study provides an efficient artificial-intelligence driven solution and establishes the feasibility scope for real-time, non-invasive estimation of laboratory values.
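Building the ten-second segment corpus amounts to windowing the raw signal; the sampling rate below is an assumption for illustration, as the abstract does not state it.

```python
fs = 500                  # Hz; an assumed sampling rate for illustration
window = 10 * fs          # ten-second segment length, in samples

def segment(signal, window):
    # Non-overlapping ten-second windows; partial trailing data is dropped.
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, window)]

signal = [0.0] * (fs * 35)            # 35 s of placeholder samples
segments = segment(signal, window)    # -> three full ten-second segments
```

Standardizing every example to the same window length is what lets a single pre-trained encoder be fine-tuned across all laboratory targets.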

Updated: 2025-10-25 14:04:26

标题: AnyECG-Lab:微调心电图基础模型以从单导联心电图信号估计实验室数值的探索研究

摘要: 及时获得实验室数值对临床决策至关重要,然而当前的方法依赖于侵入性的静脉采样,并且固有地存在延迟。心电图(ECG)作为一种非侵入性且广泛可用的信号,为快速实验室估计提供了有希望的方式。最近深度学习的进展使得可以从ECG中提取潜在的血液学特征。然而,现有模型受到信噪比低、个体间变异性大、数据多样性有限以及泛化不佳等限制,特别是当应用于低导联可穿戴设备时。在这项工作中,我们进行了一项探索性研究,利用迁移学习对大规模预训练的ECG基础模型ECGFounder进行微调,在斯坦福大学的急诊科临床监测多模态数据集(MC-MED)上。我们生成了一个包含2000万多个标准化十秒ECG片段的语料库,以增强对微小生化相关性的敏感性。在内部验证中,该模型对33个实验室指标表现出强大的预测性能(曲线下面积大于0.65),对59个指标表现出中等性能(在0.55和0.65之间),对16个指标表现出有限性能(低于0.55)。这项研究提供了一种高效的人工智能驱动的解决方案,并确立了实时、非侵入性估计实验室数值的可行范围。

更新时间: 2025-10-25 14:04:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.22301v1

MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

Hypothesis ranking is vital for automated scientific discovery, especially in cost-intensive, throughput-limited natural science domains. Current methods focus on pre-experiment ranking, relying solely on language model reasoning without empirical feedback. We introduce experiment-guided ranking, which prioritizes hypotheses based on feedback from prior tests. Due to the impracticality of real experiments, we propose a simulator grounded in domain-specific concepts that models hypothesis performance as a function of similarity to a hidden ground truth, perturbed by noise. Validated against 124 hypotheses with experimentally reported outcomes, the simulator approximates real results with consistent trend alignment. Although deviations exist, they mimic wet-lab noise, promoting more robust ranking strategies. We frame experiment-guided ranking as a sequential decision-making problem and propose an in-context reinforcement learning (ICRL) framework. Our LLM-based policy decomposes hypotheses into functional elements, clusters them by mechanistic roles, and prioritizes recombinations based on feedback. Experiments show our approach significantly outperforms pre-experiment baselines and strong ablations. Our toolkit, comprising the simulator and ICRL framework, enables systematic research on experiment-guided ranking, with the policy serving as a strong proof of concept.
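The simulator's core idea, scoring a hypothesis by its similarity to a hidden ground truth plus noise, can be sketched as below; the set-based Jaccard similarity is an illustrative choice, not the paper's exact concept-grounded measure.

```python
import random

random.seed(7)

# Hidden ground truth as a set of mechanism components; a hypothesis's
# simulated experimental score is its Jaccard similarity to the ground
# truth, perturbed by Gaussian noise (mimicking wet-lab variability).
ground_truth = {"catalyst_A", "solvent_B", "temp_low"}

def simulate(hypothesis, noise_std=0.05):
    overlap = len(hypothesis & ground_truth) / len(hypothesis | ground_truth)
    return overlap + random.gauss(0.0, noise_std)

h_close = {"catalyst_A", "solvent_B", "temp_high"}   # 2 of 3 components right
h_far = {"catalyst_C", "solvent_D"}                  # nothing right
score_close = simulate(h_close)
score_far = simulate(h_far)
```

An experiment-guided ranker queries this simulator sequentially and reorders candidates based on the noisy feedback, which is the decision problem the ICRL policy is trained on.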

Updated: 2025-10-25 14:00:54

标题: MOOSE-Chem3:通过模拟实验反馈实现实验引导的假设排名

摘要: 假设排序对于自动科学发现至关重要,特别是在成本高昂、吞吐量有限的自然科学领域。当前的方法主要关注于实验前的排序,仅依赖于语言模型的推理而没有实证反馈。我们引入了实验引导的排序,根据先前测试的反馈优先考虑假设。由于真实实验的不可行性,我们提出了一个以领域特定概念为基础的模拟器,模拟假设性能作为与隐藏的真相相似性的函数,受噪声干扰。通过对124个已实验证实结果的假设进行验证,模拟器以一致的趋势对齐近似真实结果。尽管存在偏差,但它们模仿湿实验室的噪声,促进了更健壮的排序策略。我们将实验引导的排序框架构建为一个顺序决策问题,并提出了一个基于上下文的强化学习(ICRL)框架。我们基于LLM的策略将假设分解为功能元素,通过机械角色将它们聚类,并基于反馈优先考虑重新组合。实验证明,我们的方法明显优于实验前的基线和强大的削减方法。我们的工具包,包括模拟器和ICRL框架,使得对实验引导排序的系统研究成为可能,而策略作为一个强有力的概念证明。

更新时间: 2025-10-25 14:00:54

领域: cs.CL,cs.AI,cs.CE

下载: http://arxiv.org/abs/2505.17873v3

T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model

Using risky text prompts, such as pornography and violent prompts, to test the safety of text-to-image (T2I) models is a critical task. However, existing risky prompt datasets are limited in three key areas: 1) limited risky categories, 2) coarse-grained annotation, and 3) low effectiveness. To address these limitations, we introduce T2I-RiskyPrompt, a comprehensive benchmark designed for evaluating safety-related tasks in T2I models. Specifically, we first develop a hierarchical risk taxonomy, which consists of 6 primary categories and 14 fine-grained subcategories. Building upon this taxonomy, we construct a pipeline to collect and annotate risky prompts. Finally, we obtain 6,432 effective risky prompts, where each prompt is annotated with both hierarchical category labels and detailed risk reasons. Moreover, to facilitate the evaluation, we propose a reason-driven risky image detection method that explicitly aligns the MLLM with safety annotations. Based on T2I-RiskyPrompt, we conduct a comprehensive evaluation of eight T2I models, nine defense methods, five safety filters, and five attack strategies, offering nine key insights into the strengths and limitations of T2I model safety. Finally, we discuss potential applications of T2I-RiskyPrompt across various research fields. The dataset and code are provided in https://github.com/datar001/T2I-RiskyPrompt.

Updated: 2025-10-25 14:00:26

Categories: cs.CR,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.22300v1

Stable neural networks and connections to continuous dynamical systems

The existence of instabilities, for example in the form of adversarial examples, has given rise to a highly active area of research concerned with understanding and enhancing the stability of neural networks. We focus on a popular branch within this area that draws on connections to continuous dynamical systems and optimal control, giving a bird's-eye view of the field. We identify and describe the fundamental concepts that underlie much of the existing work in this area. Following this, we go into more detail on a specific approach to designing stable neural networks, developing the theoretical background and giving a description of how these networks can be implemented. We provide code that implements the approach and can be adapted and extended by the reader. The code further includes a notebook with a fleshed-out toy example on adversarial robustness of image classification that can be run without heavy requirements on the reader's computer. We finish by discussing this toy example so that the reader can interactively follow along on their computer. This work will be included as a chapter of a book on scientific machine learning, which is currently under revision and aimed at students.

Updated: 2025-10-25 14:00:03

Categories: math.NA,cs.LG,cs.NA

Download: http://arxiv.org/abs/2510.22299v1

MetaCaDI: A Meta-Learning Framework for Scalable Causal Discovery with Unknown Interventions

Uncovering the underlying causal mechanisms of complex real-world systems remains a significant challenge, as these systems often entail high data collection costs and involve unknown interventions. We introduce MetaCaDI, the first framework to cast the joint discovery of a causal graph and unknown interventions as a meta-learning problem. MetaCaDI is a Bayesian framework that learns a shared causal graph structure across multiple experiments and is optimized to rapidly adapt to new, few-shot intervention target prediction tasks. A key innovation is our model's analytical adaptation, which uses a closed-form solution to bypass expensive and potentially unstable gradient-based bilevel optimization. Extensive experiments on synthetic and complex gene expression data demonstrate that MetaCaDI significantly outperforms state-of-the-art methods. It excels at both causal graph recovery and identifying intervention targets from as few as 10 data instances, proving its robustness in data-scarce scenarios.
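
The abstract's key point is replacing inner-loop gradient-based adaptation with a closed-form ("analytical") solution. A minimal sketch of that general idea, using a hypothetical one-dimensional ridge problem (not MetaCaDI's actual model): the per-task objective is solved exactly instead of by iterative gradient steps.

```python
# Sketch of closed-form ("analytical") adaptation: instead of running inner-loop
# gradient steps for each new task, solve the task-level problem exactly.
# Hypothetical 1-D ridge example, not MetaCaDI's model.

def ridge_closed_form(xs, ys, lam=0.1):
    """Return w minimizing sum((w*x - y)^2) + lam*w^2 (exact solution)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def ridge_gradient_descent(xs, ys, lam=0.1, lr=0.01, steps=2000):
    """Same objective solved by iterative gradient steps, for comparison."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) + 2 * lam * w
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]   # roughly y = 2x

w_exact = ridge_closed_form(xs, ys)
w_iter = ridge_gradient_descent(xs, ys)
print(round(w_exact, 3), round(w_iter, 3))
```

Both routes land on the same weight, but the closed form needs one pass over the data rather than thousands of updates, which is what makes analytical adaptation cheap and stable in bilevel settings.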

Updated: 2025-10-25 13:59:42

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.22298v1

IndiSeek learns information-guided disentangled representations

Learning disentangled representations is a fundamental task in multi-modal learning. In modern applications such as single-cell multi-omics, both shared and modality-specific features are critical for characterizing cell states and supporting downstream analyses. Ideally, modality-specific features should be independent of shared ones while also capturing all complementary information within each modality. This tradeoff is naturally expressed through information-theoretic criteria, but mutual-information-based objectives are difficult to estimate reliably, and their variational surrogates often underperform in practice. In this paper, we introduce IndiSeek, a novel disentangled representation learning approach that addresses this challenge by combining an independence-enforcing objective with a computationally efficient reconstruction loss that bounds conditional mutual information. This formulation explicitly balances independence and completeness, enabling principled extraction of modality-specific features. We demonstrate the effectiveness of IndiSeek on synthetic simulations, a CITE-seq dataset and multiple real-world multi-modal benchmarks.

Updated: 2025-10-25 13:49:31

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2509.21584v2

Root Cause Analysis of Outliers with Missing Structural Knowledge

The goal of Root Cause Analysis (RCA) is to explain why an anomaly occurred by identifying where the fault originated. Several recent works model the anomalous event as resulting from a change in the causal mechanism at the root cause, i.e., as a soft intervention. RCA is then the task of identifying which causal mechanism changed. In real-world applications, one often has either few samples or only a single sample from the post-intervention distribution: a severe limitation for most methods, which assume one knows or can estimate the distribution. However, even methods that make no such assumption are statistically ill-posed, due to the need to probe regression models in regions of low probability density. In this paper, we propose simple, efficient methods to overcome both difficulties in the case where there is a single root cause and the causal graph is a polytree. When one knows the causal graph, we give guarantees for a traversal algorithm that requires only marginal anomaly scores and does not depend on specifying an arbitrary anomaly score cut-off. When one does not know the causal graph, we show that the heuristic of identifying root causes as the variables with the highest marginal anomaly scores is causally justified. To this end, we prove that anomalies with small scores are unlikely to cause those with larger scores in polytrees and give upper bounds for the likelihood of causal pathways with non-monotonic anomaly scores.
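
The graph-free heuristic the abstract justifies can be stated in a few lines: score each variable's anomalous observation against its normal behaviour and flag the variable with the highest marginal score. Gaussian z-scores below are a stand-in for whatever anomaly scorer one actually uses, and the three-variable system is made up for illustration.

```python
# Minimal sketch of the marginal-anomaly-score heuristic: the root-cause
# candidate is the variable whose observed value is most anomalous relative
# to its own normal history. Toy data; z-scores stand in for a real scorer.

import statistics

def marginal_scores(normal_data, anomalous_sample):
    """z-score of one anomalous observation per variable."""
    scores = {}
    for var, history in normal_data.items():
        mu = statistics.mean(history)
        sigma = statistics.stdev(history)
        scores[var] = abs(anomalous_sample[var] - mu) / sigma
    return scores

# Hypothetical 3-variable system; the fault is injected at X and propagates
# (attenuated) to its downstream variables Y and Z.
normal = {
    "X": [1.0, 1.1, 0.9, 1.0, 1.05, 0.95],
    "Y": [2.0, 2.2, 1.8, 2.0, 2.1, 1.9],
    "Z": [3.0, 3.1, 2.9, 3.0, 3.05, 2.95],
}
anomaly = {"X": 5.0, "Y": 2.6, "Z": 3.2}

scores = marginal_scores(normal, anomaly)
root_cause = max(scores, key=scores.get)
print(root_cause)  # the injected fault dominates the marginal scores
```

This matches the paper's claim in spirit: in polytrees, a small-score anomaly is unlikely to have caused a large-score one, so the maximal marginal score is a causally defensible pick.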

Updated: 2025-10-25 13:48:02

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2406.05014v3

VietLyrics: A Large-Scale Dataset and Models for Vietnamese Automatic Lyrics Transcription

Automatic Lyrics Transcription (ALT) for Vietnamese music presents unique challenges due to its tonal complexity and dialectal variations, but remains largely unexplored due to the lack of a dedicated dataset. Therefore, we curated the first large-scale Vietnamese ALT dataset (VietLyrics), comprising 647 hours of songs with line-level aligned lyrics and metadata to address these issues. Our evaluation of current ASR-based approaches reveals significant limitations, including frequent transcription errors and hallucinations in non-vocal segments. To improve performance, we fine-tuned Whisper models on the VietLyrics dataset, achieving superior results compared to existing multilingual ALT systems, including LyricWhiz. We publicly release VietLyrics and our models, aiming to advance Vietnamese music computing research while demonstrating the potential of this approach for ALT in low-resource languages and music.
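
Transcription systems like those above are conventionally scored with word error rate (WER), the word-level Levenshtein distance divided by reference length. This is the generic metric, not code from the VietLyrics paper:

```python
# Word error rate (WER): minimum number of word substitutions, insertions,
# and deletions needed to turn the hypothesis into the reference, divided by
# the reference length. Standard dynamic program, generic metric code.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("em oi em oi", "em oi em"))  # one deletion over four words -> 0.25
```

Note that WER can exceed 1.0 when the hypothesis hallucinates many extra words, which is exactly the failure mode the abstract reports in non-vocal segments.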

Updated: 2025-10-25 13:38:20

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.22295v1

Predicting Metabolic Dysfunction-Associated Steatotic Liver Disease using Machine Learning Methods

Background: Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD) affects ~33% of U.S. adults and is the most common chronic liver disease. Although often asymptomatic, progression can lead to cirrhosis. Early detection is important, as lifestyle interventions can prevent disease progression. We developed a fair, rigorous, and reproducible MASLD prediction model and compared it to prior methods using a large electronic health record database. Methods: We evaluated LASSO logistic regression, random forest, XGBoost, and a neural network for MASLD prediction using clinical feature subsets, including the top 10 SHAP-ranked features. To reduce disparities in true positive rates across racial and ethnic subgroups, we applied an equal opportunity postprocessing method. Results: This study included 59,492 patients in the training data, 24,198 in the validation data, and 25,188 in the test data. The LASSO logistic regression model with the top 10 features was selected for its interpretability and comparable performance. Before fairness adjustment, the model achieved AUROC of 0.84, accuracy of 78%, sensitivity of 72%, specificity of 79%, and F1-score of 0.617. After equal opportunity postprocessing, accuracy modestly increased to 81% and specificity to 94%, while sensitivity decreased to 41% and F1-score to 0.515, reflecting the fairness trade-off. Conclusions: We developed the MASER prediction model (MASLD Static EHR Risk Prediction), a LASSO logistic regression model which achieved competitive performance for MASLD prediction (AUROC 0.836, accuracy 77.6%), comparable to previously reported ensemble and tree-based models. Overall, this approach demonstrates that interpretable models can achieve a balance of predictive performance and fairness in diverse patient populations.
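
The equal-opportunity postprocessing mentioned above amounts to choosing a separate decision threshold per group so that every group's true-positive rate hits a common target. A toy sketch with made-up scores and groups (the paper applies the same idea to MASLD risk scores):

```python
# Toy sketch of equal-opportunity postprocessing: pick a per-group decision
# threshold so every group's true-positive rate matches a common target.
# Scores, groups, and the 0.75 target are all made up for illustration.

def threshold_for_tpr(pos_scores, target_tpr):
    """Smallest threshold whose TPR on the group's positives is >= target_tpr."""
    ranked = sorted(pos_scores, reverse=True)
    k = max(1, round(target_tpr * len(ranked)))  # positives to capture
    return ranked[k - 1]

groups = {
    "A": {"pos": [0.9, 0.8, 0.7, 0.4], "neg": [0.3, 0.2]},
    "B": {"pos": [0.6, 0.5, 0.3, 0.2], "neg": [0.4, 0.1]},
}
target = 0.75  # demand a 75% true-positive rate in every group

thresholds = {g: threshold_for_tpr(d["pos"], target) for g, d in groups.items()}

def tpr(pos_scores, thr):
    return sum(s >= thr for s in pos_scores) / len(pos_scores)

for g, d in groups.items():
    print(g, thresholds[g], tpr(d["pos"], thresholds[g]))
```

With a single global threshold of 0.5, group A's TPR would be 3/4 but group B's only 2/4; the per-group thresholds (0.7 and 0.3 here) equalize them at 0.75, which is exactly the disparity the postprocessing removes, at some cost elsewhere, mirroring the sensitivity drop reported in the abstract.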

Updated: 2025-10-25 13:36:18

Categories: cs.LG,cs.CY,q-bio.QM

Download: http://arxiv.org/abs/2510.22293v1

TimeXL: Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop

Time series analysis provides essential insights for real-world system dynamics and informs downstream decision-making, yet most existing methods often overlook the rich contextual signals present in auxiliary modalities. To bridge this gap, we introduce TimeXL, a multi-modal prediction framework that integrates a prototype-based time series encoder with three collaborating Large Language Models (LLMs) to deliver more accurate predictions and interpretable explanations. First, a multi-modal prototype-based encoder processes both time series and textual inputs to generate preliminary forecasts alongside case-based rationales. These outputs then feed into a prediction LLM, which refines the forecasts by reasoning over the encoder's predictions and explanations. Next, a reflection LLM compares the predicted values against the ground truth, identifying textual inconsistencies or noise. Guided by this feedback, a refinement LLM iteratively enhances text quality and triggers encoder retraining. This closed-loop workflow-prediction, critique (reflect), and refinement-continuously boosts the framework's performance and interpretability. Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9% improvement in AUC and produces human-centric, multi-modal explanations, highlighting the power of LLM-driven reasoning for time series prediction.

Updated: 2025-10-25 13:18:20

Categories: cs.LG

Download: http://arxiv.org/abs/2503.01013v3

Does Homophily Help in Robust Test-time Node Classification?

Homophily, the tendency of nodes from the same class to connect, is a fundamental property of real-world graphs, underpinning structural and semantic patterns in domains such as citation networks and social networks. Existing methods exploit homophily through designing homophily-aware GNN architectures or graph structure learning strategies, yet they primarily focus on GNN learning with training graphs. However, in real-world scenarios, test graphs often suffer from data quality issues and distribution shifts, such as domain shifts across users from different regions in social networks and temporal evolution shifts in citation network graphs collected over varying time periods. These factors significantly compromise the pre-trained model's robustness, resulting in degraded test-time performance. With empirical observations and theoretical analysis, we reveal that transforming the test graph structure by increasing homophily in homophilic graphs or decreasing it in heterophilic graphs can significantly improve the robustness and performance of pre-trained GNNs on node classifications, without requiring model training or update. Motivated by these insights, a novel test-time graph structural transformation method grounded in homophily, named GrapHoST, is proposed. Specifically, a homophily predictor is developed to discriminate test edges, facilitating adaptive test-time graph structural transformation by the confidence of predicted homophily scores. Extensive experiments on nine benchmark datasets under a range of test-time data quality issues demonstrate that GrapHoST consistently achieves state-of-the-art performance, with improvements of up to 10.92%. Our code has been released at https://github.com/YanJiangJerry/GrapHoST.
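
The quantity being manipulated, edge homophily, is simply the fraction of edges joining same-label nodes. The toy below uses an oracle (the true labels) in place of GrapHoST's learned homophily predictor, just to show how dropping predicted-heterophilic edges raises homophily on a homophilic graph without touching the model:

```python
# Edge homophily = fraction of edges joining same-label nodes. Toy version of
# the test-time transformation: drop edges deemed heterophilic. Here an oracle
# (true labels) stands in for GrapHoST's learned homophily predictor.

def edge_homophily(edges, labels):
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

labels = {0: "a", 1: "a", 2: "b", 3: "b", 4: "a"}
edges = [(0, 1), (0, 4), (2, 3), (1, 2), (3, 4)]  # last two are cross-class

before = edge_homophily(edges, labels)

# Transformation: keep only edges predicted homophilic (oracle stand-in).
kept = [(u, v) for u, v in edges if labels[u] == labels[v]]
after = edge_homophily(kept, labels)

print(before, after)  # 0.6 -> 1.0
```

On a heterophilic graph the transformation runs the other way (decrease homophily), and in practice the decision uses the predictor's confidence rather than ground-truth labels, which are unavailable at test time.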

Updated: 2025-10-25 13:17:28

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.22289v1

Machine Learning Enabled Early Warning System For Financial Distress Using Real-Time Digital Signals

The growing instability of both global and domestic economic environments has increased the risk of financial distress at the household level. However, traditional econometric models often rely on delayed and aggregated data, limiting their effectiveness. This study introduces a machine learning-based early warning system that utilizes real-time digital and macroeconomic signals to identify financial distress in near real-time. Using a panel dataset of 750 households tracked over three monitoring rounds spanning 13 months, the framework combines socioeconomic attributes, macroeconomic indicators (such as GDP growth, inflation, and foreign exchange fluctuations), and digital economy measures (including ICT demand and market volatility). Through data preprocessing and feature engineering, we introduce lagged variables, volatility measures, and interaction terms to capture both gradual and sudden changes in financial stability. We benchmark baseline classifiers, such as logistic regression and decision trees, against advanced ensemble models including random forests, XGBoost, and LightGBM. Our results indicate that the engineered features from the digital economy significantly enhance predictive accuracy. The system performs reliably for both binary distress detection and multi-class severity classification, with SHAP-based explanations identifying inflation volatility and ICT demand as key predictors. Crucially, the framework is designed for scalable deployment in national agencies and low-bandwidth regional offices, ensuring it is accessible for policymakers and practitioners. By implementing machine learning in a transparent and interpretable manner, this study demonstrates the feasibility and impact of providing near-real-time early warnings of financial distress. This offers actionable insights that can strengthen household resilience and guide preemptive intervention strategies.
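
The feature engineering described above (lagged variables, volatility measures, interaction terms) can be sketched on a single monthly indicator. The series values are hypothetical; the study does the analogous construction over many households and indicators:

```python
# Sketch of the abstract's feature engineering on one monthly indicator:
# lagged values, rolling volatility, and an interaction term. Toy data.

import statistics

def lag(series, k):
    """Series shifted by k periods; the first k entries are undefined (None)."""
    return [None] * k + series[:-k]

def rolling_vol(series, window):
    """Rolling standard deviation over the trailing `window` periods."""
    out = [None] * (window - 1)
    for i in range(window - 1, len(series)):
        out.append(statistics.stdev(series[i - window + 1 : i + 1]))
    return out

inflation = [2.0, 2.1, 2.3, 2.2, 3.1, 4.0]   # hypothetical monthly %
ict_demand = [10, 11, 11, 12, 12, 13]        # hypothetical ICT demand index

inflation_lag1 = lag(inflation, 1)
inflation_vol3 = rolling_vol(inflation, 3)
# Interaction term: inflation volatility scaled by ICT demand, month-aligned.
interaction = [
    None if v is None else v * d
    for v, d in zip(inflation_vol3, ict_demand)
]
print(inflation_lag1[-1], round(inflation_vol3[-1], 3))
```

Lags capture gradual drift, rolling volatility captures sudden shocks, and the interaction lets the model weight a shock differently depending on the digital-economy context, which is the combination the abstract credits for the accuracy gains.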

Updated: 2025-10-25 13:12:45

Categories: cs.LG,cs.CY

Download: http://arxiv.org/abs/2510.22287v1

Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER

We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o used with few-shot in-context learning (ICL) under simple vs. complex prompts, and (iii) GPT-4o with supervised fine-tuning (SFT). All models are evaluated on standard NER metrics over CADEC's five entity types (ADR, Drug, Disease, Symptom, Finding). RoBERTa-large and BioClinicalBERT offer limited improvements over BERT Base, showing the limits of this family of models. Among LLM settings, simple ICL outperforms a longer, instruction-heavy prompt, and SFT achieves the strongest overall performance (F1 $\approx$ 87.1%), albeit at higher cost. We find that the LLM achieves higher accuracy on simplified tasks that restrict classification to two labels.
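
The "standard NER metrics" in question are entity-level precision, recall, and F1: a predicted entity counts only if its span and type both match exactly. Generic metric code (not from the paper), with CADEC-style type labels on hypothetical spans:

```python
# Entity-level precision/recall/F1 for NER: an entity is correct only if its
# (start, end, type) triple matches the gold annotation exactly. Generic code;
# the spans below are hypothetical, with CADEC-style type labels.

def ner_prf(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 2, "ADR"), (5, 6, "Drug"), (8, 10, "Symptom")]
pred = [(0, 2, "ADR"), (5, 6, "Disease"), (8, 10, "Symptom")]  # one type error

p, r, f1 = ner_prf(gold, pred)
print(p, r, round(f1, 3))
```

The strictness of exact matching is one reason collapsing the task to two labels, as in the abstract's last finding, raises measured accuracy: type confusions like Drug vs. Disease stop counting as errors.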

Updated: 2025-10-25 13:08:59

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.22285v1

Dynamic-Aware Spatio-temporal Representation Learning for Dynamic MRI Reconstruction

Dynamic MRI reconstruction, a classic inverse problem, has seen a surge of progress through deep learning techniques. In particular, the practical difficulty of obtaining ground-truth data has led to the emergence of unsupervised learning approaches. A recent promising method among them is implicit neural representation (INR), which defines the data as a continuous function mapping coordinate values to the corresponding signal values. This allows missing information to be filled in from incomplete measurements alone, solving the inverse problem effectively. Nevertheless, previous works incorporating this method have faced drawbacks such as long optimization times and the need for extensive hyperparameter tuning. To address these issues, we propose Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction that captures the spatial and temporal continuity of dynamic MRI data in the image domain and explicitly incorporates the temporal redundancy of the data into the model structure. As a result, DA-INR outperforms other models in reconstruction quality even at extreme undersampling ratios, while significantly reducing optimization time and requiring minimal hyperparameter tuning.

Updated: 2025-10-25 13:01:56

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2501.09049v2

Adapting Noise-Driven PUF and AI for Secure WBG ICS: A Proof-of-Concept Study

Wide-bandgap (WBG) technologies offer unprecedented improvements in power system efficiency, size, and performance, but also introduce unique sensor corruption and cybersecurity risks in industrial control systems (ICS), particularly due to high-frequency noise and sophisticated cyber-physical threats. This proof-of-concept (PoC) study demonstrates the adaptation of a noise-driven physically unclonable function (PUF) and machine learning (ML)-assisted anomaly detection framework to the demanding environment of WBG-based ICS sensor pathways. By extracting entropy from unavoidable WBG switching noise (up to 100 kHz) as a PUF source, and simultaneously using this noise as a real-time threat indicator, the proposed system unites hardware-level authentication and anomaly detection. Our approach integrates hybrid machine learning (ML) models with adaptive Bayesian filtering, providing robust and low-latency detection capabilities resilient to both natural electromagnetic interference (EMI) and active adversarial manipulation. Through detailed simulations of WBG modules under benign and attack scenarios--including EMI injection, signal tampering, and node impersonation--we achieve 95% detection accuracy and sub-millisecond processing latency. These results demonstrate the feasibility of physics-driven, dual-use noise exploitation as a scalable ICS defense primitive. Our findings lay the groundwork for next-generation security strategies that leverage inherent device characteristics, bridging hardware and artificial intelligence (AI) for enhanced protection of critical ICS infrastructure.

Updated: 2025-10-25 12:57:55

Categories: cs.CR,cs.LG,cs.SY,eess.SY,physics.app-ph

Download: http://arxiv.org/abs/2510.22283v1

CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE with emergent reasoning process significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.

Updated: 2025-10-25 12:56:46

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.22282v1

A Lightweight Gradient-based Causal Discovery Framework with Applications to Complex Industrial Processes

With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt the component-wise architecture, necessitating the construction of a separate model for each time series, which results in substantial computational costs. In addition, imposing the sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model's ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies $L_{1}$ regularization to the gradient between model's input and output to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model's effectiveness in reconstructing gene regulatory networks.
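
The core mechanic, reading Granger-causal structure off the input-output gradient of a single forecaster, can be sketched in the linear special case, where that gradient is simply the weight vector. This is an illustrative reduction, not GRNGC itself, which trains neural forecasters (KAN/MLP/LSTM) with an $L_{1}$ penalty on the gradient during training:

```python
# Sketch of gradient-based Granger causality inference in the linear special
# case: fit one model predicting a series from all lagged series, then use the
# magnitude of the input-output gradient (= the weights, for a linear model)
# as causality scores. GRNGC does the neural-network analogue of this.

import random

random.seed(0)

# Two toy series: x2 depends on lagged x1; x1 is autonomous noise.
T = 400
x1 = [random.gauss(0, 1) for _ in range(T)]
x2 = [0.0] + [0.9 * x1[t - 1] + random.gauss(0, 0.1) for t in range(1, T)]

# Fit x2[t] ~ w1*x1[t-1] + w2*x2[t-1] by plain gradient descent on MSE.
w1, w2, lr = 0.0, 0.0, 0.01
for _ in range(300):
    g1 = g2 = 0.0
    for t in range(1, T):
        err = w1 * x1[t - 1] + w2 * x2[t - 1] - x2[t]
        g1 += 2 * err * x1[t - 1] / (T - 1)
        g2 += 2 * err * x2[t - 1] / (T - 1)
    w1 -= lr * g1
    w2 -= lr * g2

# Input-output gradient magnitudes serve as Granger-causality scores:
# |w1| large -> "x1 Granger-causes x2"; |w2| small -> no self-driving here.
print(round(w1, 2), round(w2, 2))
```

Because only one shared model is trained for all series, the per-series model construction that makes component-wise architectures expensive is avoided, which is the computational saving the abstract emphasizes.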

Updated: 2025-10-25 12:46:37

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.11178v2

Distributional Training Data Attribution: What do Influence Functions Sample?

Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact that, due to stochasticity in the initialisation and batching, training on the same dataset can yield different models. In this paper, we address this shortcoming through introducing distributional training data attribution (d-TDA), the goal of which is to predict how the distribution of model outputs (over training runs) depends upon the dataset. Intriguingly, we find that influence functions (IFs), a popular data attribution tool, are 'secretly distributional': they emerge from our framework as the limit to unrolled differentiation, without requiring restrictive convexity assumptions. This provides a new perspective on the effectiveness of IFs in deep learning. We demonstrate the practical utility of d-TDA in experiments, including improving data pruning for vision transformers and identifying influential examples with diffusion models.
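
Influence functions, the classical tool the paper reinterprets distributionally, are easiest to see in the simplest setting: the model is the sample mean (the minimizer of squared loss), and each example's influence predicts the effect of removing it. None of this is d-TDA's actual estimator; it is the textbook first-order approximation the framework generalizes:

```python
# Classical influence functions in the simplest case: the estimator is the
# sample mean, and the influence of example i approximates (to first order)
# the change in the estimate when x_i is removed. Exact leave-one-out change
# is computed alongside for comparison. Illustrative only, not d-TDA.

def mean(xs):
    return sum(xs) / len(xs)

xs = [1.0, 2.0, 3.0, 10.0]
n = len(xs)
theta = mean(xs)

influences, loo_changes = [], []
for i, x in enumerate(xs):
    # First-order influence approximation of removing x_i.
    influences.append((theta - x) / n)
    # Exact leave-one-out change, for comparison.
    loo_changes.append(mean(xs[:i] + xs[i + 1:]) - theta)

for i in range(n):
    print(i, round(influences[i], 4), round(loo_changes[i], 4))
```

For the mean, the exact leave-one-out change is $(\theta - x_i)/(n-1)$, so the influence approximation is off by exactly a factor $n/(n-1)$ but preserves signs and rankings, which is why IFs are useful for attribution even when the first-order approximation is loose.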

Updated: 2025-10-25 12:43:41

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2506.12965v3

SecureLearn - An Attack-agnostic Defense for Multiclass Machine Learning Against Data Poisoning Attacks

Data poisoning attacks are a potential threat to machine learning (ML) models, aiming to manipulate training datasets to disrupt their performance. Existing defenses are mostly designed to mitigate specific poisoning attacks or are aligned with particular ML algorithms. Furthermore, most defenses are developed to secure deep neural networks or binary classifiers. However, traditional multiclass classifiers also need protection from data poisoning attacks, as these models are significant in developing multi-modal applications. Therefore, this paper proposes SecureLearn, a two-layer attack-agnostic defense that protects multiclass models from poisoning attacks. It comprises two components: data sanitization and a new feature-oriented adversarial training. To ascertain the effectiveness of SecureLearn, we propose a 3D evaluation matrix with three orthogonal dimensions: data poisoning attack, data sanitization and adversarial training. Benchmarking SecureLearn in this 3D matrix, a detailed analysis is conducted at different poisoning levels (10%-20%), particularly analysing accuracy, recall, F1-score, detection and correction rates, and false discovery rate. The experimentation covers four ML algorithms, namely Random Forest (RF), Decision Tree (DT), Gaussian Naive Bayes (GNB) and Multilayer Perceptron (MLP), trained with three public datasets, against three poisoning attacks and compared with two existing mitigations. Our results highlight that SecureLearn is effective against the evaluated attacks. SecureLearn has strengthened the resilience and adversarial robustness of traditional multiclass models and neural networks, confirming its generalization beyond algorithm-specific defenses. It consistently maintained accuracy above 90%, and recall and F1-score above 75%. For neural networks, SecureLearn achieved 97% recall and F1-score against all selected poisoning attacks.

Updated: 2025-10-25 12:35:45

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2510.22274v1

SubTrack++ : Gradient Subspace Tracking for Scalable LLM Training

Training large language models (LLMs) is highly resource-intensive due to their massive number of parameters and the overhead of optimizer states. While recent work has aimed to reduce memory consumption, such efforts often entail trade-offs among memory efficiency, training time, and model performance. Yet, true democratization of LLMs requires simultaneous progress across all three dimensions. To this end, we propose SubTrack++ that leverages Grassmannian gradient subspace tracking combined with projection-aware optimizers, enabling Adam's internal statistics to adapt to subspace changes. Additionally, employing recovery scaling, a technique that restores information lost through low-rank projections, further enhances model performance. Our method demonstrates SOTA convergence by exploiting Grassmannian geometry, reducing pre-training wall-time by up to 65% and fine-tuning time by 36% compared to existing SOTA methods, while maintaining the same memory footprint.
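The memory saving behind subspace-tracking optimizers is easiest to see in a bare projection sketch: keep a low-rank basis for the gradient, and let the optimizer state live in that small subspace. This is a generic illustration (GaLore-style projection with a rank and sizes invented here), not SubTrack++'s Grassmannian tracking update:

```python
import numpy as np

def project_grad(G, U):
    """Compress the full gradient G into an r-dim subspace and lift it back;
    the optimizer's moments would be kept on the small (r, n) matrix."""
    low = U.T @ G          # (r, n) compressed gradient
    return U @ low         # rank-r approximation used for the weight update

rng = np.random.default_rng(1)
# a gradient that mostly lives in a 2-D column subspace, plus small noise
B = rng.normal(size=(64, 2))
C = rng.normal(size=(2, 32))
G = B @ C + 0.01 * rng.normal(size=(64, 32))
U, _, _ = np.linalg.svd(G, full_matrices=False)
U2 = U[:, :2]                                    # tracked rank-2 basis
G_hat = project_grad(G, U2)
rel_err = np.linalg.norm(G - G_hat) / np.linalg.norm(G)
```

When the gradient really is near low-rank, the compressed state is 32x smaller here while the reconstruction error stays tiny; the interesting part (which SubTrack++ addresses) is tracking the basis cheaply as it drifts during training.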

Updated: 2025-10-25 12:33:42

Categories: cs.LG

Download: http://arxiv.org/abs/2502.01586v3

A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against closed-source commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial black-box LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we propose to refine semantic clarity by encoding explicit semantic details within local regions, thus ensuring the capture of finer-grained features and inter-model transferability, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective baseline: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. While the naive source-target matching method has been utilized before in the literature, we are the first to provide a tight analysis, which establishes a close connection between perturbation optimization and semantics. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5/3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods with lower $\ell_1/\ell_2$ perturbations.
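The per-step augmentation at the heart of the baseline, a random crop with controlled area scale and aspect ratio resized back to a fixed resolution, fits in a few lines. The parameter ranges and nearest-neighbour resize below are assumptions for illustration:

```python
import numpy as np

def random_crop_resize(img, scale=(0.5, 0.9), ratio=(0.8, 1.25), out=32, rng=None):
    """Crop a random region with controlled area scale and aspect ratio,
    then resize (nearest-neighbour) to a fixed output size."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = img.shape
    area = rng.uniform(*scale) * H * W
    r = rng.uniform(*ratio)
    h = min(int(round(np.sqrt(area / r))), H)
    w = min(int(round(np.sqrt(area * r))), W)
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    patch = img[top:top + h, left:left + w]
    ys = np.arange(out) * h // out               # nearest-neighbour index maps
    xs = np.arange(out) * w // out
    return patch[np.ix_(ys, xs)]

rng = np.random.default_rng(0)
x = rng.random((64, 64))                         # stand-in adversarial image
view = random_crop_resize(x, rng=rng)
```

In the full attack loop, each such local view would be encoded by a surrogate model and its embedding pulled toward the target image's embedding, concentrating the perturbation's semantics in local regions.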

Updated: 2025-10-25 12:23:47

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.10635v2

Hierarchical Optimization via LLM-Guided Objective Evolution for Mobility-on-Demand Systems

Online ride-hailing platforms aim to deliver efficient mobility-on-demand services, often facing challenges in balancing dynamic and spatially heterogeneous supply and demand. Existing methods typically fall into two categories: reinforcement learning (RL) approaches, which suffer from data inefficiency, oversimplified modeling of real-world dynamics, and difficulty enforcing operational constraints; or decomposed online optimization methods, which rely on manually designed high-level objectives that lack awareness of low-level routing dynamics. To address this issue, we propose a novel hybrid framework that integrates a large language model (LLM) with mathematical optimization in a dynamic hierarchical system: (1) it is training-free, removing the need for large-scale interaction data as in RL, and (2) it leverages the LLM to bridge cognitive limitations caused by problem decomposition by adaptively generating high-level objectives. Within this framework, the LLM serves as a meta-optimizer, producing semantic heuristics that guide a low-level optimizer responsible for constraint enforcement and real-time decision execution. These heuristics are refined through a closed-loop evolutionary process, driven by harmony search, which iteratively adapts the LLM prompts based on feasibility and performance feedback from the optimization layer. Extensive experiments based on scenarios derived from both the New York and Chicago taxi datasets demonstrate the effectiveness of our approach, achieving an average improvement of 16% compared to state-of-the-art baselines.
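The outer evolutionary loop can be illustrated with a bare-bones harmony search over high-level objective weights. Here a fixed quadratic stands in for the low-level optimizer's performance feedback, and no LLM is involved; every parameter is invented, so treat this purely as a sketch of the search mechanics:

```python
import random

def harmony_search(evaluate, dim=2, memory=8, iters=200, hmcr=0.9, par=0.3, seed=0):
    """Minimize evaluate() over [0,1]^dim with a minimal harmony search:
    recall values from memory (rate hmcr), pitch-adjust them (rate par),
    and replace the worst stored harmony whenever a candidate beats it."""
    rnd = random.Random(seed)
    new = lambda: [rnd.uniform(0, 1) for _ in range(dim)]
    hm = [new() for _ in range(memory)]                  # harmony memory
    for _ in range(iters):
        cand = []
        for d in range(dim):
            if rnd.random() < hmcr:                      # recall from memory
                v = rnd.choice(hm)[d]
                if rnd.random() < par:                   # pitch adjustment
                    v = min(1.0, max(0.0, v + rnd.uniform(-0.1, 0.1)))
            else:                                        # fresh random value
                v = rnd.uniform(0, 1)
            cand.append(v)
        worst = max(range(memory), key=lambda i: evaluate(hm[i]))
        if evaluate(cand) < evaluate(hm[worst]):
            hm[worst] = cand
    return min(hm, key=evaluate)

# toy high-level objective: weights trading off wait time vs. detour cost,
# with a made-up optimum at (0.7, 0.2) standing in for solver feedback
cost = lambda w: (w[0] - 0.7) ** 2 + (w[1] - 0.2) ** 2
w = harmony_search(cost)
```

In the paper's framework, the candidate generation step is where the LLM proposes semantically-motivated objectives, and evaluate() is replaced by feasibility and performance feedback from the mathematical optimization layer.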

Updated: 2025-10-25 12:20:43

Categories: cs.AI

Download: http://arxiv.org/abs/2510.10644v2

A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata

Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model uniquely integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and director management profiles. A comparative analysis of four ensemble algorithms confirmed the superiority of a Random Forest model, which achieved 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP, which revealed that the school's average socioeconomic level is the most dominant predictor, demonstrating that systemic factors have a greater impact than individual characteristics in isolation. The primary conclusion is that academic performance is a systemic phenomenon deeply tied to the school's ecosystem. This study provides a data-driven, interpretable tool to inform policies aimed at promoting educational equity by addressing disparities between schools.

Updated: 2025-10-25 12:15:30

Categories: cs.LG,cs.AI,cs.CY

Download: http://arxiv.org/abs/2510.22266v1

REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning

Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art rehearsal-free CL methods.
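Adaptive token merging can be pictured with a minimal ToMe-style routine: average the most similar adjacent token pairs so the sequence the prompt-update step must process gets shorter. AToM's actual selection policy is not specified in the abstract, so this is a generic sketch with invented sizes:

```python
import numpy as np

def merge_tokens(tokens, r):
    """Average the r most cosine-similar adjacent token pairs, shrinking
    the sequence; overlapping pairs are resolved left to right."""
    T = len(tokens)
    sims = np.array([
        tokens[i] @ tokens[i + 1]
        / (np.linalg.norm(tokens[i]) * np.linalg.norm(tokens[i + 1]))
        for i in range(T - 1)
    ])
    to_merge = set(np.argsort(sims)[-r:])        # indices i of pairs (i, i+1)
    out, i = [], 0
    while i < T:
        if i in to_merge and i + 1 < T:
            out.append((tokens[i] + tokens[i + 1]) / 2)
            i += 2                               # pair consumed
        else:
            out.append(tokens[i])
            i += 1
    return np.array(out)

rng = np.random.default_rng(0)
seq = rng.normal(size=(10, 4))                   # 10 tokens, dim 4 (toy sizes)
merged = merge_tokens(seq, r=3)
```

Because merged tokens are averages rather than drops, coarse task-relevant signal is retained while the per-layer cost falls with the shorter sequence.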

Updated: 2025-10-25 12:13:18

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2406.04772v4

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS. Demo samples are available at https://emosteer-tts-demo.pages.dev/.
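The activation-steering mechanism has a well-known minimal form: extract a direction from contrastive activations and add it to hidden states at inference. The difference-of-means recipe below is a standard sketch, not necessarily EmoSteer-TTS's exact extraction procedure, and the toy data is invented:

```python
import numpy as np

def steering_vector(acts_emotion, acts_neutral):
    """Difference-of-means steering direction between two activation sets."""
    return acts_emotion.mean(0) - acts_neutral.mean(0)

def steer(h, v, alpha):
    """Training-free inference-time edit: shift a hidden activation along v.
    alpha gives continuous, fine-grained control over the emotion strength."""
    return h + alpha * v

rng = np.random.default_rng(0)
# toy activations: the 'emotional' set is shifted along hidden dim 0
happy = rng.normal(0, 1, (100, 16)) + 2.0 * np.eye(16)[0]
neutral = rng.normal(0, 1, (100, 16))
v = steering_vector(happy, neutral)
h_steered = steer(np.zeros(16), v, alpha=1.0)
```

Sweeping alpha (or interpolating between two vectors) is what makes conversion, interpolation, and erasure continuous rather than label-discrete.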

Updated: 2025-10-25 12:11:44

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2508.03543v3

PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at https://github.com/iliass-y/patenteb. Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.

Updated: 2025-10-25 12:01:46

Categories: cs.CL,cs.AI,cs.IR,H.3.3; I.2.7; I.2.6

Download: http://arxiv.org/abs/2510.22264v1

TreeIRL: Safe Urban Driving with Tree Search and Inverse Reinforcement Learning

We present TreeIRL, a novel planner for autonomous driving that combines Monte Carlo tree search (MCTS) and inverse reinforcement learning (IRL) to achieve state-of-the-art performance in simulation and in real-world driving. The core idea is to use MCTS to find a promising set of safe candidate trajectories and a deep IRL scoring function to select the most human-like among them. We evaluate TreeIRL against both classical and state-of-the-art planners in large-scale simulations and on 500+ miles of real-world autonomous driving in the Las Vegas metropolitan area. Test scenarios include dense urban traffic, adaptive cruise control, cut-ins, and traffic lights. TreeIRL achieves the best overall performance, striking a balance between safety, progress, comfort, and human-likeness. To our knowledge, our work is the first demonstration of MCTS-based planning on public roads and underscores the importance of evaluating planners across a diverse set of metrics and in real-world environments. TreeIRL is highly extensible and could be further improved with reinforcement learning and imitation learning, providing a framework for exploring different combinations of classical and learning-based approaches to solve the planning bottleneck in autonomous driving.
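The division of labor between the two components can be shown in miniature: a hard safety filter over candidate trajectories (standing in for the MCTS search) followed by a learned score that picks the most human-like survivor (standing in for the deep IRL critic). The candidates, safety rule, and comfort score below are toy assumptions:

```python
import numpy as np

def select_trajectory(candidates, is_safe, score):
    """Keep only safe candidates, then return the best-scoring one."""
    safe = [c for c in candidates if is_safe(c)]
    return max(safe, key=score)

rng = np.random.default_rng(0)
# toy 'trajectories': acceleration profiles of increasing aggressiveness
candidates = [rng.normal(0, s, 20) for s in (0.5, 1.0, 2.5, 4.0)]
is_safe = lambda a: np.abs(a).max() < 3.0            # hard constraint
score = lambda a: -np.abs(np.diff(a)).mean()         # smoother = more human-like
best = select_trajectory(candidates, is_safe, score)
```

Separating the two concerns is the design point: safety is enforced by construction in the search, so the learned score only has to rank comfort and human-likeness among already-feasible options.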

Updated: 2025-10-25 12:00:58

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.13579v4

Epistemic Deep Learning: Enabling Machine Learning Models to Know When They Do Not Know

Machine learning has achieved remarkable successes, yet its deployment in safety-critical domains remains hindered by an inherent inability to manage uncertainty, resulting in overconfident and unreliable predictions when models encounter out-of-distribution data, adversarial perturbations, or naturally fluctuating environments. This thesis, titled Epistemic Deep Learning: Enabling Machine Learning Models to 'Know When They Do Not Know', addresses these critical challenges by advancing the paradigm of Epistemic Artificial Intelligence, which explicitly models and quantifies epistemic uncertainty: the uncertainty arising from limited, biased, or incomplete training data, as opposed to the irreducible randomness of aleatoric uncertainty, thereby empowering models to acknowledge their limitations and refrain from overconfident decisions when uncertainty is high. Central to this work is the development of the Random-Set Neural Network (RS-NN), a novel methodology that leverages random set theory to predict belief functions over sets of classes, capturing the extent of epistemic uncertainty through the width of associated credal sets, applications of RS-NN, including its adaptation to Large Language Models (LLMs) and its deployment in weather classification for autonomous racing. In addition, the thesis proposes a unified evaluation framework for uncertainty-aware classifiers. Extensive experiments validate that integrating epistemic awareness into deep learning not only mitigates the risks associated with overconfident predictions but also lays the foundation for a paradigm shift in artificial intelligence, where the ability to 'know when it does not know' becomes a hallmark of robust and dependable systems. The title encapsulates the core philosophy of this work, emphasizing that true intelligence involves recognizing and managing the limits of one's own knowledge.

Updated: 2025-10-25 12:00:19

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.22261v1

VulSolver: Vulnerability Detection via LLM-Driven Constraint Solving

Traditional vulnerability detection methods rely heavily on predefined rule matching, which often fails to capture vulnerabilities accurately. With the rise of large language models (LLMs), leveraging their ability to understand code semantics has emerged as a promising direction for achieving more accurate and efficient vulnerability detection. However, current LLM-based approaches face significant challenges: instability in model outputs, limitations in context length, and hallucination. As a result, many existing solutions either use LLMs merely to enrich predefined rule sets, thereby keeping the detection process fundamentally rule-based, or over-rely on them, leading to poor robustness. To address these challenges, we propose a constraint-solving approach powered by LLMs named VULSOLVER. By modeling vulnerability detection as a constraint-solving problem, and by integrating static application security testing (SAST) with the semantic reasoning capabilities of LLMs, our method enables the LLM to act like a professional human security expert. We assess VULSOLVER on the OWASP Benchmark (1,023 labeled samples), achieving 97.85% accuracy, 97.97% F1-score, and 100% recall. Applied to popular GitHub repositories, VULSOLVER also identified 15 previously unknown high-severity vulnerabilities (CVSS 7.5-9.8), demonstrating its effectiveness in real-world security analysis.

Updated: 2025-10-25 11:39:09

Categories: cs.CR

Download: http://arxiv.org/abs/2509.00882v4

On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling

Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully explain the behavior of practical networks, especially those trained in standard parameterization (SP) meaning He initialization with a global learning rate. For instance, existing theory for SP predicts instability at large learning rates and vanishing feature learning at stable ones. In practice, however, optimal learning rates decay slower than theoretically predicted and networks exhibit both stable training and non-trivial feature learning, even at very large widths. Here, we show that this discrepancy is not fully explained by finite-width phenomena. Instead, we find a resolution through a finer-grained analysis of the regime previously considered unstable and therefore uninteresting. In particular, we show that, under cross-entropy (CE) loss, the unstable regime comprises two distinct sub-regimes: a catastrophically unstable regime and a more benign controlled divergence regime, where logits diverge but gradients and activations remain stable. Moreover, under large learning rates at the edge of the controlled divergence regime, there exists a well-defined infinite width limit where features continue to evolve in all the hidden layers. In experiments across optimizers, architectures, and data modalities, we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically maximal stable learning rate exponents which provide useful guidance on optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scaling for standard initialization.
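The mechanics of the "controlled divergence" regime can be checked in a toy calculation: under cross-entropy, the gradient with respect to the logits is softmax(z) minus the one-hot label, so every entry stays in [-1, 1] even as the logits themselves diverge. This is a numerical illustration of that fact, not the paper's width-scaling analysis:

```python
import numpy as np

def ce_grad(logits, y):
    """d(cross-entropy)/d(logits) = softmax(logits) - onehot(y);
    each entry is bounded in [-1, 1] regardless of the logit scale."""
    z = logits - logits.max()                    # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    g = p.copy()
    g[y] -= 1.0
    return g

# scale the logits up along the correct class: the loss heads to zero
# while the gradient stays bounded (and in fact shrinks)
grads = [ce_grad(s * np.array([5.0, 1.0, 0.5]), y=0) for s in (1.0, 10.0, 1000.0)]
```

Under MSE the error term grows with the diverging outputs, which matches the paper's observation that networks operate in this benign regime under CE loss but not under MSE.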

Updated: 2025-10-25 11:34:31

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2505.22491v2

LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis

Electroencephalography (EEG) offers a non-invasive lens into human brain activity, but building large-scale models is hampered by topological heterogeneity: each public EEG data defines its own electrode layout, limiting generalization. We introduce LUNA (Latent Unified Network Architecture), a self-supervised foundation model that reconciles disparate electrode geometries while scaling linearly -- not quadratically -- with channel count. LUNA compresses multi-channel EEG into a fixed-size, topology-agnostic latent space via learned queries and cross-attention. Downstream transformer blocks then operate exclusively on this latent representation using patch-wise temporal self-attention, decoupling computation from electrode count. Pre-trained on TUEG and Siena (over 21,000 hours of raw EEG across diverse montages) using a masked-patch reconstruction objective, LUNA transfers effectively to four downstream tasks: abnormality detection, artifact rejection, slowing classification, and emotion recognition. It demonstrates highly competitive performance across several benchmarks, achieving state-of-the-art results on TUAR and TUSL, e.g., 0.921 AUROC on TUAR, while reducing FLOPs by 300x and trimming GPU memory use by up to 10x. Critically, these gains are consistent across all evaluated electrode configurations. Code is available at https://github.com/pulp-bio/BioFoundation
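The topology-agnostic compression can be sketched with a single cross-attention from a fixed set of learned queries to a variable number of channel tokens: cost is linear in the channel count, and the output shape never depends on the montage. The latent count, dimensions, and random weights below are invented for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(queries, channel_tokens):
    """Compress C channel tokens into a fixed number of latents with one
    cross-attention; downstream blocks see only the fixed-size latent."""
    d = queries.shape[1]
    A = softmax(queries @ channel_tokens.T / np.sqrt(d))  # (latents, C)
    return A @ channel_tokens                             # (latents, d)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16))                 # 8 learned latent queries (toy sizes)
outs = {C: latent_cross_attention(Q, rng.normal(size=(C, 16)))
        for C in (19, 64, 128)}              # different EEG montages
```

Because every montage lands in the same (8, 16) latent, the temporal transformer that follows is decoupled from the electrode layout entirely, which is what lets one pre-trained model transfer across datasets with different channel sets.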

Updated: 2025-10-25 11:31:27

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.22257v1

PACR: Progressively Ascending Confidence Reward for LLM Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved LLM reasoning, but its sparse, outcome-based reward provides no guidance for intermediate steps, slowing exploration. We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the ground-truth answer should have a generally ascending trend. We provide empirical and theoretical analysis validating that such an inductive bias constrains the exploration search space to regions richer in logically sound reasoning. We demonstrate that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple benchmarks. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
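The shaping idea admits a very small sketch: reward each reasoning step by the change in the model's probability of the ground-truth answer, so an ascending belief trajectory accumulates positive dense reward. This is an illustrative reduction of the inductive bias, and the exact shaping in PACR may differ:

```python
import numpy as np

def pacr_style_reward(p_correct):
    """Dense per-step reward: the increase in the model's probability of
    the ground-truth answer between consecutive reasoning steps."""
    p = np.asarray(p_correct, dtype=float)
    return np.diff(p)

# a well-formed trajectory: belief in the answer climbs monotonically
good = pacr_style_reward([0.1, 0.25, 0.5, 0.8])
# a flawed trajectory: belief dips mid-way, flagging the faulty step
bad = pacr_style_reward([0.1, 0.4, 0.2, 0.6])
```

Unlike the sparse outcome reward, the dips in the second trajectory localize credit to the step where the reasoning went off track, which is the guidance the abstract argues accelerates exploration.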

Updated: 2025-10-25 11:25:35

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.22255v1

You Don't Need Prompt Engineering Anymore: The Prompting Inversion

Prompt engineering, particularly Chain-of-Thought (CoT) prompting, significantly enhances LLM reasoning capabilities. We introduce "Sculpting," a constrained, rule-based prompting method designed to improve upon standard CoT by reducing errors from semantic ambiguity and flawed common sense. We evaluate three prompting strategies (Zero Shot, standard CoT, and Sculpting) across three OpenAI model generations (gpt-4o-mini, gpt-4o, gpt-5) using the GSM8K mathematical reasoning benchmark (1,317 problems). Our findings reveal a "Prompting Inversion": Sculpting provides advantages on gpt-4o (97% vs. 93% for standard CoT), but becomes detrimental on gpt-5 (94.00% vs. 96.36% for CoT on full benchmark). We trace this to a "Guardrail-to-Handcuff" transition where constraints preventing common-sense errors in mid-tier models induce hyper-literalism in advanced models. Our detailed error analysis demonstrates that optimal prompting strategies must co-evolve with model capabilities, suggesting simpler prompts for more capable models.

Updated: 2025-10-25 11:04:01

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.22251v1

FedCVD++: Communication-Efficient Federated Learning for Cardiovascular Risk Prediction with Parametric and Non-Parametric Model Optimization

Cardiovascular diseases (CVD) cause over 17 million deaths annually worldwide, highlighting the urgent need for privacy-preserving predictive systems. We introduce FedCVD++, an enhanced federated learning (FL) framework that integrates both parametric models (logistic regression, SVM, neural networks) and non-parametric models (Random Forest, XGBoost) for coronary heart disease risk prediction. To address key FL challenges, we propose: (1) tree-subset sampling that reduces Random Forest communication overhead by 70%, (2) XGBoost-based feature extraction enabling lightweight federated ensembles, and (3) federated SMOTE synchronization for resolving cross-institutional class imbalance. Evaluated on the Framingham dataset (4,238 records), FedCVD++ achieves state-of-the-art results: federated XGBoost (F1 = 0.80) surpasses its centralized counterpart (F1 = 0.78), and federated Random Forest (F1 = 0.81) matches non-federated performance. Additionally, our communication-efficient strategies reduce bandwidth consumption by 3.2X while preserving 95% accuracy. Compared to existing FL frameworks, FedCVD++ delivers up to 15% higher F1-scores and superior scalability for multi-institutional deployment. This work represents the first practical integration of non-parametric models into federated healthcare systems, providing a privacy-preserving solution validated under real-world clinical constraints.
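Tree-subset sampling is the most mechanical of the three ideas and can be sketched directly: each client uploads only a fraction of its fitted trees, and the server ensembles what it receives. The stand-in constant "trees" and seed below are toy assumptions; keep_frac=0.3 mirrors the claimed 70% communication reduction:

```python
import random

def subsample_trees(forest, keep_frac=0.3, seed=0):
    """Transmit only a random fraction of a client's trees to cut
    communication while preserving ensemble diversity."""
    rnd = random.Random(seed)
    k = max(1, int(len(forest) * keep_frac))
    return rnd.sample(forest, k)

def majority_vote(trees, x):
    votes = [t(x) for t in trees]
    return max(set(votes), key=votes.count)

# stand-in 'trees': constant classifiers in place of fitted decision trees
forest = [(lambda v: (lambda x: v))(v) for v in (1, 1, 1, 0, 1, 1, 0, 1, 1, 0)]
sent = subsample_trees(forest)
pred = majority_vote(sent, x=None)
```

Because a random forest's prediction is already an average over trees, a random subset is an unbiased (if noisier) estimator of the full ensemble's vote, which is why accuracy degrades gracefully as the transmitted fraction shrinks.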

Updated: 2025-10-25 10:38:41

Categories: cs.LG,q-bio.OT

Download: http://arxiv.org/abs/2507.22963v2

KL Penalty Control via Perturbation for Direct Preference Optimization

Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods claim to change this static KL penalty of DPO into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Specifically, $\varepsilon$-DPO adaptively controls $\beta$ for each preference pair based on the monotonicity of logits as a preference model under the perturbation of $\beta$ during training. This is equivalent to adjusting the KL penalty by checking whether the change in training-time temperature can lead to better preference confidence as preference models by simply reusing the logit of the current policy and the reference policy. Experimental results show that the simple criterion of $\varepsilon$-DPO for KL penalty relaxation significantly improves DPO compared to most existing direct alignment algorithms on general chatbot benchmarks and reveal that this KL penalty control criterion can reflect confusion as a preference model and provide an efficient KL trade-off, highlighting the significance of instance-level adaptive KL penalty control in DPO.
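The per-pair control can be sketched around the standard DPO pairwise loss: the preference confidence is sigma(beta * m) where m is the log-ratio margin between the chosen and rejected responses, so perturbing beta up or down and keeping whichever raises confidence adapts the penalty pair by pair. The epsilon value and keep-the-better rule below are a simplification of the paper's criterion, not its exact form:

```python
import math

def dpo_pair_loss(m, beta):
    """DPO loss for one pair, with margin
    m = (log pi/pi_ref)(y_w) - (log pi/pi_ref)(y_l)."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * m)))

def perturbed_beta(m, beta, eps=0.1):
    """Sketch of the eps-DPO idea: try beta*(1+eps) and beta*(1-eps),
    keep whichever gives higher preference confidence for this pair."""
    up, down = beta * (1 + eps), beta * (1 - eps)
    conf = lambda b: 1.0 / (1.0 + math.exp(-b * m))
    return up if conf(up) >= conf(down) else down

b_confident = perturbed_beta(m=0.8, beta=0.1)    # margin agrees: tighten
b_confused = perturbed_beta(m=-0.5, beta=0.1)    # margin disagrees: relax
```

Only the current and reference policies' logits are reused, so the per-pair adjustment costs essentially nothing extra, matching the abstract's emphasis on an efficient KL trade-off.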

Updated: 2025-10-25 10:23:13

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.13177v3

Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide

Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained LLMs provide strong foundations, effective adaptation under data scarcity requires focused and efficient fine-tuning techniques. This paper presents a structured and practical survey of recent methods for fine-tuning LLMs in data-scarce scenarios. We systematically review parameter-efficient fine-tuning techniques that lower training and deployment costs, domain and cross-lingual adaptation methods for both encoder and decoder models, and model specialization strategies. We further examine preference alignment approaches that guide model behavior using limited human or synthetic feedback, emphasizing sample and compute efficiency. Throughout, we highlight empirical trade-offs, selection criteria, and best practices for choosing suitable techniques based on task constraints, including model scaling, data scaling, and the mitigation of catastrophic forgetting. The aim is to equip researchers and practitioners with actionable insights for effectively fine-tuning LLMs when data and resources are limited.

Updated: 2025-10-25 10:17:48

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2411.09539v2

Real-Time Semantic Segmentation on FPGA for Autonomous Vehicles Using LMIINet with the CGRA4ML Framework

Semantic segmentation has emerged as a fundamental problem in computer vision, gaining particular importance in real-time applications such as autonomous driving. The main challenge is achieving high accuracy while operating under computational and hardware constraints. In this research, we present an FPGA-based implementation of real-time semantic segmentation leveraging the lightweight LMIINet architecture and the Coarse-Grained Reconfigurable Array for Machine Learning (CGRA4ML) hardware framework. The model was trained using Quantization-Aware Training (QAT) with 8-bit precision on the Cityscapes dataset, reducing memory footprint by a factor of four while enabling efficient fixed-point computations. Necessary modifications were applied to adapt the model to CGRA4ML constraints, including simplifying skip connections, employing hardware-friendly operations such as depthwise-separable and 1x1 convolutions, and redesigning parts of the Flatten Transformer. Our implementation achieves approximately 90% pixel accuracy and 45% mean Intersection-over-Union (mIoU), operating in real time at 20 frames per second (FPS) with 50.1 ms latency on the ZCU104 FPGA board. The results demonstrate that CGRA4ML, with its flexibility in mapping modern layers and its use of off-chip memory for skip connections, provides a path for implementing advanced semantic segmentation networks on FPGA for real-time applications, outperforming traditional GPU solutions in power efficiency while maintaining competitive accuracy. The code for this project is publicly available at https://github.com/STAmirr/cgra4ml_semantic_segmentation

Updated: 2025-10-25 10:16:22

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.22243v1

PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading

Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions: via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of the relevant literature. Further human analysis attributes these failures to the uncontrolled expansion of retrieved context and the tendency of LLMs to prioritize semantically relevant text over task instructions. Across basic tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds responses rather than risk errors, whereas Gemini produces fluent but fabricated answers. To address these issues, we develop lightweight reliability classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk provides a reproducible and diagnostic framework for advancing the reliability evaluation of LLM-based scholarly assistance systems.

Updated: 2025-10-25 10:11:29

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.22242v1

Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking

The growing spread of online misinformation has created an urgent need for scalable, reliable fact-checking solutions. Crowdsourced fact-checking - where non-experts evaluate claim veracity - offers a cost-effective alternative to expert verification, despite concerns about variability in quality and bias. Encouraged by promising results in certain contexts, major platforms such as X (formerly Twitter), Facebook, and Instagram have begun shifting from centralized moderation to decentralized, crowd-based approaches. In parallel, advances in Large Language Models (LLMs) have shown strong performance across core fact-checking tasks, including claim detection and evidence evaluation. However, their potential role in crowdsourced workflows remains unexplored. This paper investigates whether LLM-powered generative agents - autonomous entities that emulate human behavior and decision-making - can meaningfully contribute to fact-checking tasks traditionally reserved for human crowds. Using the protocol of La Barbera et al. (2024), we simulate crowds of generative agents with diverse demographic and ideological profiles. Agents retrieve evidence, assess claims along multiple quality dimensions, and issue final veracity judgments. Our results show that agent crowds outperform human crowds in truthfulness classification, exhibit higher internal consistency, and show reduced susceptibility to social and cognitive biases. Compared to humans, agents rely more systematically on informative criteria such as Accuracy, Precision, and Informativeness, suggesting a more structured decision-making process. Overall, our findings highlight the potential of generative agents as scalable, consistent, and less biased contributors to crowd-based fact-checking systems.

Updated: 2025-10-25 10:05:59

Categories: cs.CL,cs.AI,cs.MA

Download: http://arxiv.org/abs/2504.19940v2

Efficient Federated Learning against Byzantine Attacks and Data Heterogeneity via Aggregating Normalized Gradients

Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but is vulnerable to Byzantine attacks and data heterogeneity, which can severely degrade performance. Existing Byzantine-robust approaches tackle data heterogeneity, but incur high computational overhead during gradient aggregation, thereby slowing down the training process. To address this issue, we propose a simple yet effective Federated Normalized Gradients Algorithm (Fed-NGA), which performs aggregation by merely computing the weighted mean of the normalized gradients from each client. This approach yields a favorable time complexity of $\mathcal{O}(pM)$, where $p$ is the model dimension and $M$ is the number of clients. We rigorously prove that Fed-NGA is robust to both Byzantine faults and data heterogeneity. For non-convex loss functions, Fed-NGA achieves convergence to a neighborhood of stationary points under general assumptions, and further attains zero optimality gap under some mild conditions, which is an outcome rarely achieved in existing literature. In both cases, the convergence rate is $\mathcal{O}(1/T^{\frac{1}{2} - \delta})$, where $T$ denotes the number of iterations and $\delta \in (0, 1/2)$. Experimental results on benchmark datasets confirm the superior time efficiency and convergence performance of Fed-NGA over existing methods.
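The aggregation rule itself is a one-liner per coordinate. The sketch below (a toy illustration with made-up gradients, not the authors' code) shows both the $\mathcal{O}(pM)$ cost and why normalization confers Byzantine robustness: a malicious client can choose its direction but not its magnitude.

```python
import math

def fed_nga_aggregate(client_grads, weights=None):
    """Weighted mean of the normalized client gradients, as in the Fed-NGA
    rule described above. One pass over each client's p-dimensional gradient,
    so the cost is O(p * M) for M clients."""
    M = len(client_grads)
    if weights is None:
        weights = [1.0 / M] * M
    p = len(client_grads[0])
    agg = [0.0] * p
    for w, g in zip(weights, client_grads):
        norm = math.sqrt(sum(x * x for x in g)) or 1.0  # guard all-zero gradient
        for j in range(p):
            agg[j] += w * g[j] / norm  # unit-length contribution per client
    return agg
```

With two honest clients pointing roughly along (1, 0) and one Byzantine client sending a gradient of magnitude 10^6, every coordinate of the aggregate stays bounded by 1 and the honest direction still dominates.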

Updated: 2025-10-25 10:05:17

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2408.09539v3

Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.

Updated: 2025-10-25 10:01:55

Categories: cs.CL,cs.AI,math.ST,stat.ME,stat.TH

Download: http://arxiv.org/abs/2506.09853v3

Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy

Chromatin-sensitive partial wave spectroscopic (csPWS) microscopy enables label-free detection of nanoscale chromatin packing alterations that occur before visible cellular transformation. However, manual nuclear segmentation limits the population-scale analysis needed for biomarker discovery in early cancer detection. The lack of annotated csPWS imaging data prevents direct use of standard deep learning methods. We present CFU Net, a hierarchical segmentation architecture trained with a three-stage curriculum on synthetic multimodal data. CFU Net achieves near-perfect performance on held-out synthetic test data that represent diverse spectroscopic imaging conditions without manual annotations (Dice 0.9879, IoU 0.9895). Our approach uses physics-based rendering that incorporates empirically supported chromatin packing statistics, Mie scattering models, and modality-specific noise, combined with a curriculum that progresses from adversarial RGB pretraining to spectroscopic fine-tuning and histology validation. CFU Net integrates five architectural elements (ConvNeXt backbone, Feature Pyramid Network, UNet++ dense connections, dual attention, and deep supervision) that together improve Dice over a baseline UNet by 8.3 percent. We demonstrate deployment-ready INT8 quantization with 74.9 percent compression and 0.15-second inference, giving a 240x throughput gain over manual analysis. Applied to more than ten thousand automatically segmented nuclei from synthetic test data, the pipeline extracts chromatin biomarkers that distinguish normal from pre-cancerous tissue with large effect sizes (Cohen's d between 1.31 and 2.98), reaching 94 percent classification accuracy. This work provides a general framework for synthetic-to-real transfer learning in specialized microscopy and open resources for community validation on clinical specimens.
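As a side illustration of the INT8 deployment step mentioned above, here is a generic symmetric post-training quantization sketch. This is the standard technique, not the authors' pipeline, and the sample weights are invented:

```python
def quantize_int8(weights):
    """Map floats to int8 codes with one shared symmetric scale; storing the
    codes instead of float32 is what yields the roughly 4x size reduction."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero input
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights; the error per weight is at most
    about half the quantization step (scale / 2)."""
    return [c * scale for c in codes]
```

A round trip on a tiny weight vector shows the reconstruction error staying below one quantization step.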

Updated: 2025-10-25 10:00:34

Categories: eess.IV,cs.LG,q-bio.QM,68T45, 92C55,I.4.6; I.2.10; I.5.4

Download: http://arxiv.org/abs/2510.22239v1

Better Estimation of the Kullback--Leibler Divergence Between Language Models

Estimating the Kullback-Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao-Blackwellized estimator that is unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially. Additionally, we derive an analogous Rao-Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
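For intuition, the contrast between the naive MC estimator and a Rao-Blackwellized one can be shown on a single categorical step, a toy stand-in for a one-token language model (real LMs apply the conditional expectation per position along sampled prefixes; the distributions below are invented):

```python
import math, random

def mc_kl(p, q, n_samples=10000, seed=0):
    """Naive Monte Carlo: draw x ~ p and average log p(x) - log q(x).
    Unbiased, but high-variance; a given run can even come out negative."""
    rng = random.Random(seed)
    cats = list(range(len(p)))
    total = 0.0
    for _ in range(n_samples):
        x = rng.choices(cats, weights=p)[0]
        total += math.log(p[x]) - math.log(q[x])
    return total / n_samples

def exact_kl(p, q):
    """Closed-form KL for one categorical step. A Rao-Blackwellized
    estimator replaces each sampled log-ratio with this conditional
    expectation, which provably cannot increase the variance."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

On a three-symbol example the exact value is small and positive, and the MC estimate fluctuates around it.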

Updated: 2025-10-25 09:49:22

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.10637v3

Bridging the Perceptual - Statistical Gap in Dysarthria Assessment: Why Machine Learning Still Falls Short

Automated dysarthria detection and severity assessment from speech have attracted significant research attention due to their potential clinical impact. Despite rapid progress in acoustic modeling and deep learning, models still fall short of human expert performance. This manuscript provides a comprehensive analysis of the reasons behind this gap, emphasizing a conceptual divergence we term the "perceptual-statistical gap". We detail human expert perceptual processes, survey machine learning representations and methods, review existing literature on feature sets and modeling strategies, and present a theoretical analysis of limits imposed by label noise and inter-rater variability. We further outline practical strategies to narrow the gap: perceptually motivated features, self-supervised pretraining, ASR-informed objectives, multimodal fusion, human-in-the-loop training, and explainability methods. Finally, we propose experimental protocols and evaluation metrics aligned with clinical goals to guide future research toward clinically reliable and interpretable dysarthria assessment tools.

Updated: 2025-10-25 09:44:31

Categories: eess.AS,cs.LG

Download: http://arxiv.org/abs/2510.22237v1

Guarded Query Routing for Large Language Models

Query routing, the task of routing user queries to different large language model (LLM) endpoints, can be considered a text classification problem. However, out-of-distribution queries must be handled properly, as those could be about unrelated domains, queries in other languages, or even contain unsafe text. Here, we thus study a guarded query routing problem, for which we first introduce the Guarded Query Routing Benchmark (GQR-Bench, released as the Python package gqr), which covers three exemplary target domains (law, finance, and healthcare) and includes seven datasets to test robustness against out-of-distribution queries. We then use GQR-Bench to contrast the effectiveness and efficiency of LLM-based routing mechanisms (GPT-4o-mini, Llama-3.2-3B, and Llama-3.1-8B), standard LLM-based guardrail approaches (LlamaGuard and NVIDIA NeMo Guardrails), continuous bag-of-words classifiers (WideMLP, fastText), and traditional machine learning models (SVM, XGBoost). Our results show that WideMLP, enhanced with out-of-domain detection capabilities, yields the best trade-off between accuracy (88%) and speed (<4ms). The embedding-based fastText excels at speed (<1ms) with acceptable accuracy (80%), whereas LLMs yield the highest accuracy (91%) but are comparatively slow (62ms for local Llama-3.1:8B and 669ms for remote GPT-4o-mini calls). Our findings challenge the automatic reliance on LLMs for (guarded) query routing and provide concrete recommendations for practical applications. Source code is available at https://github.com/williambrach/gqr.
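The guarded-routing decision itself is a thin layer on top of any domain classifier. A minimal sketch follows; the scores, domain names, and threshold are illustrative stand-ins, not values from GQR-Bench:

```python
def route(scores, domains, threshold=0.5):
    """Send the query to the top-scoring domain endpoint only if that score
    clears an out-of-distribution threshold; otherwise reject. The scores
    would come from a classifier such as WideMLP or fastText."""
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return "reject"  # OOD guard: no domain is confident enough
    return domains[best]
```

A confident law-domain score is routed, while a flat, low-confidence score vector is rejected as out-of-distribution.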

Updated: 2025-10-25 09:38:24

Categories: cs.AI

Download: http://arxiv.org/abs/2505.14524v3

Ask for More Than Bayes Optimal: A Theory of Indecisions for Classification

Selective classification is a powerful tool for automated decision-making in high-risk scenarios, allowing classifiers to act only when confident and abstain when uncertainty is high. Given a target accuracy, our goal is to minimize indecisions, observations we do not automate. For difficult problems, the target accuracy may be unattainable without abstention. By using indecisions, we can control the misclassification rate to any user-specified level, even below the Bayes optimal error rate, while minimizing overall indecision mass. We provide a complete characterization of the minimax risk in selective classification, establishing continuity and monotonicity properties that enable optimal indecision selection. We revisit selective inference via the Neyman-Pearson testing framework, where indecision enables control of type 2 error given fixed type 1 error probability. For both classification and testing, we propose a finite-sample calibration method with non-asymptotic guarantees, proving plug-in classifiers remain consistent and that accuracy-based calibration effectively controls indecision mass. In the binary Gaussian mixture model, we uncover the first sharp phase transition in selective inference, showing minimal indecision can yield near-optimal accuracy even under poor class separation. Experiments on Gaussian mixtures and real datasets confirm that small indecision proportions yield substantial accuracy gains, making indecision a principled tool for risk control.
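The abstention mechanism and the accuracy-based calibration can be sketched in a few lines. This is an illustrative plug-in version with an invented validation set and a simple threshold grid, not the paper's estimator:

```python
def selective_predict(probs, threshold):
    """Abstain (return None) when top-class confidence falls below the
    threshold; otherwise return the argmax class. Raising the threshold
    trades more indecisions for fewer errors."""
    conf = max(probs)
    if conf < threshold:
        return None
    return probs.index(conf)

def calibrate_threshold(val_probs, val_labels, target_acc, grid=None):
    """Pick the smallest threshold whose automated (non-abstained)
    predictions reach the target accuracy on a validation set, thereby
    minimizing the indecision mass."""
    if grid is None:
        grid = [i / 100 for i in range(50, 101)]
    for t in grid:
        decided = [(selective_predict(p, t), y)
                   for p, y in zip(val_probs, val_labels)]
        decided = [(pred, y) for pred, y in decided if pred is not None]
        if not decided:
            return t  # everything abstained; accuracy trivially satisfied
        acc = sum(pred == y for pred, y in decided) / len(decided)
        if acc >= target_acc:
            return t
    return 1.0
```

On a toy validation set where the low-confidence examples are exactly the mislabeled ones, the calibrated threshold abstains on those and keeps the confident, correct predictions automated.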

Updated: 2025-10-25 09:34:29

Categories: math.ST,cs.LG,stat.ME,stat.ML,stat.TH

Download: http://arxiv.org/abs/2412.12807v3

Prognostic Framework for Robotic Manipulators Operating Under Dynamic Task Severities

Robotic manipulators are critical in many applications but are known to degrade over time. This degradation is influenced by the nature of the tasks performed by the robot. Tasks with higher severity, such as handling heavy payloads, can accelerate the degradation process. One way this degradation is reflected is in the position accuracy of the robot's end-effector. In this paper, we present a prognostic modeling framework that predicts a robotic manipulator's Remaining Useful Life (RUL) while accounting for the effects of task severity. Our framework represents the robot's position accuracy as a Brownian motion process with a random drift parameter that is influenced by task severity. The dynamic nature of task severity is modeled using a continuous-time Markov chain (CTMC). To evaluate RUL, we discuss two approaches: (1) a novel closed-form expression for the Remaining Lifetime Distribution (RLD), and (2) Monte Carlo simulations, commonly used in the prognostics literature. Theoretical results establish the equivalence between these RUL computation approaches. We validate our framework through experiments using two distinct physics-based simulators for planar and spatial robot fleets. Our findings show that robots in both fleets experience shorter RUL when handling a higher proportion of high-severity tasks.
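The Monte Carlo side of the model lends itself to a compact sketch: Brownian degradation whose drift switches between a low- and a high-severity value via a two-state CTMC, with RUL as the first time the signal crosses a failure level. All drifts, rates, and thresholds below are invented for illustration (the paper additionally derives a closed-form RLD):

```python
import math, random

def simulate_rul(drift, sigma, fail_level, switch_rate, dt=0.01,
                 t_max=1000.0, rng=None):
    """One rollout: Euler discretization of drifted Brownian motion whose
    drift is selected by a two-state severity chain. Returns the first time
    the degradation signal crosses fail_level (the simulated RUL)."""
    rng = rng or random.Random()
    x, t, state = 0.0, 0.0, 0
    while x < fail_level and t < t_max:
        if rng.random() < switch_rate * dt:  # CTMC transition, prob ~ rate*dt
            state = 1 - state
        x += drift[state] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return t

def mean_rul(n, **kw):
    """Monte Carlo estimate of expected RUL over n rollouts."""
    rng = random.Random(0)
    return sum(simulate_rul(rng=rng, **kw) for _ in range(n)) / n
```

With a constant low-severity drift of 0.5 and a failure level of 5, the mean first-passage time sits near 10, and mixing in a high-severity drift shortens it, matching the fleet-level finding above.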

Updated: 2025-10-25 09:30:10

Categories: cs.RO,cs.LG,cs.SY,eess.SY,stat.AP

Download: http://arxiv.org/abs/2412.00538v3

Rational Adversaries and the Maintenance of Fragility: A Game-Theoretic Theory of Rational Stagnation

Cooperative systems often remain in persistently suboptimal yet stable states. This paper explains such "rational stagnation" as an equilibrium sustained by a rational adversary whose utility follows the principle of potential loss, $u_{D} = U_{ideal} - U_{actual}$. Starting from the Prisoner's Dilemma, we show that the transformation $u_{i}' = a\,u_{i} + b\,u_{j}$ and the ratio of mutual recognition $w = b/a$ generate a fragile cooperation band $[w_{\min},\,w_{\max}]$ where both (C,C) and (D,D) are equilibria. Extending to a dynamic model with stochastic cooperative payoffs $R_{t}$ and intervention costs $(C_{c},\,C_{m})$, a Bellman-style analysis yields three strategic regimes: immediate destruction, rational stagnation, and intervention abandonment. The appendix further generalizes the utility to a reference-dependent nonlinear form and proves its stability under reference shifts, ensuring robustness of the framework. Applications to social-media algorithms and political trust illustrate how adversarial rationality can deliberately preserve fragility.
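The fragile cooperation band is elementary to check numerically. A sketch with illustrative Prisoner's Dilemma payoffs $T=4 > R=3 > P=2 > S=0$, chosen (not taken from the paper) so that the band is non-empty:

```python
def transformed(u_self, u_other, w, a=1.0):
    """Mutual-recognition transform u_i' = a*u_i + b*u_j with w = b/a."""
    return a * u_self + (w * a) * u_other

def cc_is_equilibrium(R, S, T, P, w):
    """(C,C) survives iff a unilateral deviation to D does not pay:
    transformed(R, R) >= transformed(T, S)."""
    return transformed(R, R, w) >= transformed(T, S, w)

def dd_is_equilibrium(R, S, T, P, w):
    """(D,D) survives iff a unilateral deviation to C does not pay:
    transformed(P, P) >= transformed(S, T)."""
    return transformed(P, P, w) >= transformed(S, T, w)
```

With these payoffs the band is $w \in [1/3, 1]$: below it only (D,D) is an equilibrium, inside it both (C,C) and (D,D) coexist, and above it mutual defection stops being stable.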

Updated: 2025-10-25 09:28:15

Categories: cs.GT,cs.AI,econ.TH

Download: http://arxiv.org/abs/2510.22232v1

When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs

Layer pruning has emerged as a widely adopted technique for improving the efficiency of large language models (LLMs). Although existing methods demonstrate strong performance retention on general knowledge tasks, their effect on long-chain reasoning, a more brittle yet crucial capability, remains largely unexplored. In this work, we study the impact of layer pruning on long-chain reasoning through the lens of test-time scaling, a key mechanism in modern LLMs that enables strong reasoning capacity by allocating more computation at inference time. With extensive experiments, we demonstrate that pruning even one or two layers can severely impair test-time scaling, with performance collapsing drastically on long reasoning benchmarks even when performance on knowledge-intensive and shallow reasoning tasks remains stable. Furthermore, we find that standard supervised fine-tuning remedies fail to recover test-time scaling once it has deteriorated. Through in-depth analyses, we identify the mechanisms underlying this fragility of test-time scaling and highlight the fundamental risks of applying layer pruning to reasoning-intensive LLMs. These findings call for a rethinking of layer pruning strategies and provide insights for developing methods that preserve the robustness of reasoning. We open-source the codebase at https://github.com/keyu-wang-2002/Layer-Pruning-Harms-Inference-Scaling.

Updated: 2025-10-25 09:22:22

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.22228v1

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.

Updated: 2025-10-25 09:08:18

标题: VEGGIE: 使用基于实例生成的指导性编辑和推理视频概念

摘要: 最近的视频扩散模型提升了视频编辑能力,但在统一框架内处理指令式编辑和多样任务(例如添加、删除、更改)仍然具有挑战性。在本文中,我们介绍了VEGGIE(Video Editor with Grounded Generation from Instructions),一个简单的端到端框架,统一了基于多样化用户指令的视频概念编辑、定位(grounding)和推理。具体而言,给定一个视频和文本查询,VEGGIE首先利用MLLM解释指令中的用户意图,并将其定位到视频上下文中,为像素空间响应生成特定于帧的定位任务查询。随后,扩散模型执行这些规划,生成与用户意图一致的编辑视频。为了支持多样任务和复杂指令,我们采用课程学习策略:先用大规模指令式图像编辑数据对齐MLLM和视频扩散模型,再在高质量多任务视频数据上进行端到端微调。此外,我们引入一种新颖的数据合成流水线,为模型训练生成成对的指令式视频编辑数据:它利用图像到视频模型注入动态,将静态图像数据转化为多样、高质量的视频编辑样本。VEGGIE作为一个多功能模型,在不同编辑技能的指令式视频编辑中表现出色,优于最佳指令式基线,而其他模型难以胜任多任务。VEGGIE还在视频对象定位和推理分割上表现出色,而其他基线则失败。我们进一步揭示了多个任务如何相互促进,并强调了零样本多模态指令编辑和上下文内视频编辑等有前景的应用。

更新时间: 2025-10-25 09:08:18

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2503.14350v3

Taming Silent Failures: A Framework for Verifiable AI Reliability

The integration of Artificial Intelligence (AI) into safety-critical systems introduces a new reliability paradigm: silent failures, where AI produces confident but incorrect outputs that can be dangerous. This paper introduces the Formal Assurance and Monitoring Environment (FAME), a novel framework that confronts this challenge. FAME synergizes the mathematical rigor of offline formal synthesis with the vigilance of online runtime monitoring to create a verifiable safety net around opaque AI components. We demonstrate its efficacy in an autonomous vehicle perception system, where FAME successfully detected 93.5% of critical safety violations that were otherwise silent. By contextualizing our framework within the ISO 26262 and ISO/PAS 8800 standards, we provide reliability engineers with a practical, certifiable pathway for deploying trustworthy AI. FAME represents a crucial shift from accepting probabilistic performance to enforcing provable safety in next-generation systems.
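The abstract does not publish FAME's API; as a rough illustration of the runtime-monitoring half of the idea, the sketch below wraps an opaque perception model with user-supplied safety predicates and reports which ones a confident output violates. The predicate, stub model, and field names (`lidar_min`, `safe`) are invented for the example.

```python
def make_monitor(predicates):
    """Wrap an opaque model with runtime safety checks (names hypothetical)."""
    def monitored(model, observation):
        out = model(observation)
        violations = [name for name, check in predicates.items()
                      if not check(observation, out)]
        return out, violations
    return monitored

# Hypothetical predicate: the output must not claim "safe" while the raw
# lidar reading shows an obstacle closer than 5 metres.
predicates = {
    "min_distance": lambda obs, out: not (obs["lidar_min"] < 5.0 and out["safe"]),
}
monitor = make_monitor(predicates)

# A confidently wrong perception stub: the silent-failure case.
perception = lambda obs: {"safe": True, "confidence": 0.99}
out, violations = monitor(perception, {"lidar_min": 2.0})
```

In a FAME-style pipeline, the predicates would be synthesized offline with formal methods rather than written by hand.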

Updated: 2025-10-25 09:07:47

标题: 驯服静默故障:可验证人工智能可靠性的框架

摘要: 将人工智能(AI)集成到安全关键系统中引入了一种新的可靠性范式:静默故障,即AI产生自信但不正确、可能造成危险的输出。本文介绍了形式保证与监控环境(FAME),一个应对这一挑战的新框架。FAME将离线形式综合的数学严谨性与在线运行时监控的警惕性相结合,为不透明的AI组件构建了一个可验证的安全网。我们在自动驾驶车辆感知系统中展示了其有效性:FAME成功检测到93.5%原本会静默发生的关键安全违规。通过将我们的框架置于ISO 26262和ISO/PAS 8800标准的语境中,我们为可靠性工程师提供了一条部署可信AI的实用且可认证的途径。FAME代表了从接受概率性性能到在下一代系统中强制执行可证明安全的重要转变。

更新时间: 2025-10-25 09:07:47

领域: cs.SE,cs.AI,cs.LG,cs.LO,cs.SY,eess.SY

下载: http://arxiv.org/abs/2510.22224v1

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
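The selection-then-aggregation loop described above can be sketched in a few lines. Everything here is a stand-in: the expert pool, the `(modality, skill)` keys, and the stub LRM (which merely joins rationales, where the real system reasons over them in text).

```python
def mexa_answer(query, modality, skill, experts, lrm):
    """Select (modality, skill)-matched experts, collect their textual
    rationales, and aggregate them with a Large Reasoning Model stub."""
    selected = [fn for (m, s), fn in experts.items()
                if m == modality and s == skill]
    rationales = [fn(query) for fn in selected]
    return lrm(query, rationales)

# Hypothetical expert pool keyed by (modality, skill).
experts = {
    ("video", "temporal"): lambda q: "events ordered: A then B",
    ("video", "spatial"):  lambda q: "object left of door",
    ("audio", "speech"):   lambda q: "speaker says 'open'",
}
# Stub LRM: joins rationales; a real system reasons over them in text.
lrm = lambda q, rs: " | ".join(rs)
ans = mexa_answer("what happened first?", "video", "temporal", experts, lrm)
```

Because routing and aggregation are both text-level, no component needs retraining, which is the training-free property the abstract emphasizes.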

Updated: 2025-10-25 08:57:40

标题: MEXA: 朝着具有动态多专家聚合的通用多模态推理方向

摘要: 合并预训练专家模型具有可扩展的多模态推理潜力,但由于输入模态的增加多样性和任务复杂性,构建统一框架仍具挑战性。例如,医学诊断需要对结构化临床表格进行精确推理,而财务预测则依赖于解释基于图形的数据以进行明智预测。为了解决这一挑战,我们引入了MEXA,这是一个无需训练的框架,可以对多个专家模型进行模态和任务感知的聚合,从而实现跨多样和不同领域的有效多模态推理。MEXA根据输入模态和任务特定推理需求(即技能)动态选择专家模型。每个专家模型专门针对一个模态任务对生成可解释的文本推理输出。然后,MEXA使用大型推理模型(LRM)对这些输出进行聚合和推理以生成最终答案。这种模块化设计可以在不增加额外训练开销的情况下,实现跨不同领域的灵活和透明的多模态推理。我们在包括视频推理、音频推理、3D理解和医学问答在内的多样多模态基准上对我们的方法进行了广泛评估。MEXA始终比强大的多模态基准提供性能改进,突显了我们的专家驱动选择和聚合在多样多模态推理任务中的有效性和广泛适用性。

更新时间: 2025-10-25 08:57:40

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2506.17113v2

OmniFC: Rethinking Federated Clustering via Lossless and Secure Distance Reconstruction

Federated clustering (FC) aims to discover global cluster structures across decentralized clients without sharing raw data, making privacy preservation a fundamental requirement. There are two critical challenges: (1) privacy leakage during collaboration, and (2) robustness degradation due to aggregation of proxy information from non-independent and identically distributed (Non-IID) local data, leading to inaccurate or inconsistent global clustering. Existing solutions typically rely on model-specific local proxies, which are sensitive to data heterogeneity and inherit inductive biases from their centralized counterparts, thus limiting robustness and generality. We propose Omni Federated Clustering (OmniFC), a unified and model-agnostic framework. Leveraging Lagrange coded computing, our method enables clients to share only encoded data, allowing exact reconstruction of the global distance matrix--a fundamental representation of sample relationships--without leaking private information, even under client collusion. This construction is naturally resilient to Non-IID data distributions. This approach decouples FC from model-specific proxies, providing a unified extension mechanism applicable to diverse centralized clustering methods. Theoretical analysis confirms both reconstruction fidelity and privacy guarantees, while comprehensive experiments demonstrate OmniFC's superior robustness, effectiveness, and generality across various benchmarks compared to state-of-the-art methods. Code will be released.
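The key primitive here, exact reconstruction from Lagrange-coded shares, can be illustrated with Shamir-style sharing over a prime field: any `degree + 1` shares reconstruct a value exactly via Lagrange interpolation at zero, while fewer colluding parties learn nothing. This is a toy analogue of the principle, not the paper's full distance-matrix protocol.

```python
import random

P = 2_147_483_647  # prime field modulus (2^31 - 1)

def encode(secret, n_shares, degree, rng):
    """Share `secret` as evaluations of a random degree-`degree` polynomial."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(degree)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n_shares + 1)]

def reconstruct(shares):
    """Exact Lagrange interpolation at x = 0 over GF(P)."""
    total = 0
    for xj, yj in shares:
        num = den = 1
        for xm, _ in shares:
            if xm != xj:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

rng = random.Random(0)
shares = encode(12345, 5, 2, rng)    # degree 2: any 3 of 5 shares suffice
recovered = reconstruct(shares[:3])
```

Reconstruction is exact (no floating-point loss), which mirrors the "lossless" claim: the server recovers the quantity it needs without ever seeing raw inputs.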

Updated: 2025-10-25 08:57:20

标题: OmniFC:通过无损和安全距离重建重新思考联合聚类

摘要: 联合聚类(FC)旨在在分散的客户端之间发现全局集群结构,而无需共享原始数据,使隐私保护成为一个基本要求。存在两个关键挑战:(1)在协作过程中的隐私泄漏,以及(2)由于来自非独立和同分布(Non-IID)本地数据的代理信息聚合而导致的鲁棒性下降,从而导致全局聚类不准确或不一致。现有的解决方案通常依赖于特定于模型的本地代理,这些代理对数据异质性敏感,并从它们的中央对应物继承归纳偏见,从而限制了鲁棒性和普适性。我们提出了Omni联合聚类(OmniFC),这是一个统一的、与模型无关的框架。利用Lagrange编码计算,我们的方法使客户端只能共享编码数据,允许精确重构全局距离矩阵——样本关系的基本表示——而不泄露私人信息,即使在客户端合谋的情况下也是如此。这种构造对非独立和同分布的数据分布具有自然的韧性。这种方法将FC与特定于模型的代理解耦,提供一个统一的扩展机制,适用于各种集中式聚类方法。理论分析证实了重构的忠实度和隐私保证,而全面的实验证明了OmniFC相对于最先进的方法在各种基准测试中的卓越鲁棒性、有效性和普适性。代码将发布。

更新时间: 2025-10-25 08:57:20

领域: cs.LG

下载: http://arxiv.org/abs/2505.13071v2

Provably Efficient Online RLHF with One-Pass Reward Modeling

Reinforcement Learning from Human Feedback (RLHF) has shown remarkable success in aligning Large Language Models (LLMs) with human preferences. Traditional RLHF methods rely on a fixed dataset, which often suffers from limited coverage. To this end, online RLHF has emerged as a promising direction, enabling iterative data collection and refinement. Despite its potential, this paradigm faces a key bottleneck: the requirement to continuously integrate new data into the dataset and re-optimize the model from scratch at each iteration, resulting in computational and storage costs that grow linearly with the number of iterations. In this work, we address this challenge by proposing a one-pass reward modeling method that eliminates the need to store historical data and achieves constant-time updates per iteration. Specifically, we first formalize RLHF as a contextual preference bandit and develop a new algorithm based on online mirror descent with a tailored local norm, replacing the standard maximum likelihood estimation for reward modeling. We then apply it to various online RLHF settings, including passive data collection, active data collection, and deployment-time adaptation. We provide theoretical guarantees showing that our method enhances both statistical and computational efficiency. Finally, we design practical algorithms for LLMs and conduct experiments with the Llama-3-8B-Instruct and Qwen2.5-7B-Instruct models on Ultrafeedback and Mixture2 datasets, validating the effectiveness of our approach.
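A minimal sketch of the constant-time-per-iteration idea: each preference pair triggers one update of a linear reward model under the Bradley-Terry logistic loss, and nothing is stored afterwards. Plain SGD is used here as a stand-in for the paper's online mirror descent with a tailored local norm; the feature stream is synthetic.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_pass_update(theta, phi_diff, preferred, lr=0.5):
    # Constant-time update on one preference pair; no history is stored.
    # phi_diff = phi(chosen) - phi(rejected); preferred = 1 under Bradley-Terry.
    p = sigmoid(sum(t * f for t, f in zip(theta, phi_diff)))
    grad = p - preferred
    return [t - lr * grad * f for t, f in zip(theta, phi_diff)]

theta = [0.0, 0.0]
# Synthetic stream where the first feature coordinate drives preference.
stream = [([1.0, 0.2], 1), ([0.8, -0.1], 1), ([1.2, 0.0], 1), ([0.9, 0.1], 1)]
for phi_diff, y in stream:
    theta = one_pass_update(theta, phi_diff, y)
```

The contrast with re-optimizing from scratch is the point: per-iteration cost stays O(dim) no matter how many pairs have streamed by.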

Updated: 2025-10-25 08:52:03

标题: 具有单遍奖励建模的可证明高效在线RLHF

摘要: 基于人类反馈的强化学习(RLHF)在将大型语言模型(LLMs)与人类偏好对齐方面取得了显著成功。传统RLHF方法依赖固定数据集,往往覆盖范围有限。为此,在线RLHF成为一个有前途的方向,支持迭代式的数据收集和改进。尽管潜力巨大,这一范式面临一个关键瓶颈:每次迭代都需要将新数据并入数据集并从头重新优化模型,导致计算和存储成本随迭代次数线性增长。在这项工作中,我们提出一种单遍奖励建模方法来解决这一挑战,它无需存储历史数据,并实现每次迭代的常数时间更新。具体来说,我们首先将RLHF形式化为上下文偏好赌博机,并基于带有定制局部范数的在线镜像下降开发了一种新算法,以替代奖励建模中的标准最大似然估计。随后,我们将其应用于多种在线RLHF设置,包括被动数据收集、主动数据收集和部署时适应。我们提供的理论保证表明,我们的方法同时提升了统计效率和计算效率。最后,我们为LLMs设计了实用算法,并在Ultrafeedback和Mixture2数据集上使用Llama-3-8B-Instruct和Qwen2.5-7B-Instruct模型进行实验,验证了方法的有效性。

更新时间: 2025-10-25 08:52:03

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2502.07193v3

HPC-Driven Modeling with ML-Based Surrogates for Magnon-Photon Dynamics in Hybrid Quantum Systems

Simulating hybrid magnonic quantum systems remains a challenge due to the large disparity between the timescales of the two systems. We present a massively parallel GPU-based simulation framework that enables fully coupled, large-scale modeling of on-chip magnon-photon circuits. Our approach resolves the dynamic interaction between ferromagnetic and electromagnetic fields with high spatiotemporal fidelity. To accelerate design workflows, we develop a physics-informed machine learning surrogate trained on the simulation data, reducing computational cost while maintaining accuracy. This combined approach reveals real-time energy exchange dynamics and reproduces key phenomena such as anti-crossing behavior and the suppression of ferromagnetic resonance under strong electromagnetic fields. By addressing the multiscale and multiphysics challenges in magnon-photon modeling, our framework enables scalable simulation and rapid prototyping of next-generation quantum and spintronic devices.
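The anti-crossing the abstract mentions follows from textbook coupled-mode theory; the sketch below computes the eigenfrequencies of two linearly coupled modes and is illustrative only (the paper's simulations resolve the full spatiotemporal fields, not a 2x2 model). Frequencies and coupling are in arbitrary units.

```python
import math

def coupled_mode_freqs(omega_m, omega_c, g):
    """Eigenfrequencies of the 2x2 coupled-mode matrix
    [[omega_m, g], [g, omega_c]]: an avoided crossing whose minimum
    splitting is exactly 2g at resonance (omega_m == omega_c)."""
    mean = 0.5 * (omega_m + omega_c)
    half = math.hypot(0.5 * (omega_m - omega_c), g)
    return mean - half, mean + half

lo, hi = coupled_mode_freqs(5.0, 5.0, 0.1)      # on resonance
lo_d, hi_d = coupled_mode_freqs(5.0, 5.3, 0.1)  # detuned: larger splitting
```

Sweeping the magnon frequency through the cavity frequency and plotting both branches reproduces the characteristic avoided-crossing diagram.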

Updated: 2025-10-25 08:51:00

标题: 基于机器学习代理的HPC驱动建模,用于混合量子系统中的磁子光子动力学

摘要: 由于两个系统的时间尺度差异巨大,模拟混合磁子量子系统仍然是一个挑战。我们提出了一个大规模并行的GPU仿真框架,能够对片上磁子-光子电路进行完全耦合的大规模建模。我们的方法以高时空保真度解析铁磁场与电磁场之间的动态相互作用。为加速设计流程,我们基于仿真数据训练了一个物理信息机器学习代理模型,在保持精度的同时降低计算成本。这一组合方法揭示了实时能量交换动态,并重现了反交叉行为以及强电磁场下铁磁共振受抑制等关键现象。通过解决磁子-光子建模中的多尺度、多物理场挑战,我们的框架为下一代量子与自旋电子器件实现了可扩展仿真和快速原型设计。

更新时间: 2025-10-25 08:51:00

领域: quant-ph,cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2510.22221v1

Estimating the Error of Large Language Models at Pairwise Text Comparison

We measure LLMs' output error at pairwise text comparison, noting the probability of error in their preferences. Our method does not rely on the ground truth and supports two scenarios: (i) uniform error rate regardless of the order of comparison, estimated with two comparisons for each text pair with either text placed first; (ii) binary positional bias assuming distinct error rates for the two orders of comparison, estimated with repeated comparisons between the texts. The Copeland counting constructs a ranking over the compared texts from pairwise preferences; the ranking reveals the poor scalability of LLM-based pairwise comparison and helps yield the estimates for LLMs' error rates. We apply the method to six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) with five types of text input and obtain consistent estimates of LLMs' error. In general, the measured two positional bias terms are similar, close to the uniform error. Considering both the error rates and the robustness to the variation of prompts, Claude obtained the most desirable performance in this experiment. Our model outperforms the biased Bradley-Terry model and the commutativity score in indicating LLMs' error at this task.
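Copeland counting, as used above to turn pairwise preferences into a ranking, is straightforward to sketch. The half-point tie convention below is one common choice; the abstract does not specify which variant the authors use, and the example counts are made up.

```python
def copeland_ranking(items, prefs):
    """prefs[(a, b)] = number of times a was preferred over b."""
    score = {i: 0.0 for i in items}
    for a in items:
        for b in items:
            if a < b:
                wins_a = prefs.get((a, b), 0)
                wins_b = prefs.get((b, a), 0)
                if wins_a > wins_b:
                    score[a] += 1
                elif wins_b > wins_a:
                    score[b] += 1
                else:          # tie: half a point each (one common convention)
                    score[a] += 0.5
                    score[b] += 0.5
    return sorted(items, key=lambda i: -score[i]), score

prefs = {("A", "B"): 3, ("B", "A"): 1,   # A preferred over B in 3 of 4 trials
         ("A", "C"): 2, ("C", "A"): 0,
         ("B", "C"): 2, ("C", "B"): 1}
ranking, score = copeland_ranking(["A", "B", "C"], prefs)
```

The quadratic number of comparisons needed to fill `prefs` is exactly the scalability issue the ranking experiment exposes.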

Updated: 2025-10-25 08:39:52

标题: 估计大型语言模型在文本对比中的误差

摘要: 我们测量LLMs在成对文本比较中的输出误差,即其偏好出错的概率。我们的方法不依赖真实标签,并支持两种情形:(i)与比较顺序无关的统一错误率,通过对每个文本对进行两次比较(两个文本各先出现一次)来估计;(ii)二元位置偏差,假设两种比较顺序具有不同的错误率,通过对文本进行重复比较来估计。Copeland计数从成对偏好中构建被比较文本的排名;该排名揭示了基于LLM的成对比较可扩展性较差,并有助于得到LLMs错误率的估计。我们将该方法应用于六个LLMs(ChatGPT、Claude、DeepSeek、Gemini、Grok、Qwen)和五类文本输入,得到了一致的LLMs误差估计。总体而言,测得的两个位置偏差项相近,接近统一错误率。综合考虑错误率和对提示变化的稳健性,Claude在本实验中表现最为理想。在指示LLMs在该任务上的误差方面,我们的模型优于有偏的Bradley-Terry模型和可交换性得分。

更新时间: 2025-10-25 08:39:52

领域: cs.CL,cs.AI,math.PR

下载: http://arxiv.org/abs/2510.22219v1

Preference Optimization by Estimating the Ratio of the Data Distribution

Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as $f$-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu's power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as $f$-DPO or $f$-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-8B-Instruct, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9\% length-controlled win rate on AlpacaEval2. Project page: https://github.com/aailab-kaist/BPO.
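For reference, the DPO special case that BPO subsumes is a one-liner: the loss is the negative log-sigmoid of a scaled difference of policy-vs-reference log-ratios. This sketch shows only that baseline objective, not the Bregman generalization; the log-probabilities are made-up numbers.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * (winner log-ratio - loser log-ratio)).
    BPO generalizes this ratio-matching objective to a Bregman family."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

base = dpo_loss(-1.0, -2.0, -1.0, -2.0)     # policy == reference: loss log 2
better = dpo_loss(-0.5, -2.0, -1.0, -2.0)   # winner log-ratio improved
```

Note the loss depends only on log-probability ratios, which is why no reward model or partition function appears.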

Updated: 2025-10-25 08:32:17

标题: 通过估计数据分布比例进行偏好优化

摘要: 直接偏好优化(DPO)作为一种简单稳定的方法,被广泛用于将大型语言模型(LLMs)与人类偏好对齐。本文从似然比估计的角度研究一种广义DPO损失,使策略模型能够匹配目标策略。目标策略的比率无需依赖奖励模型或配分函数即可唯一确定策略分布。这使得广义损失能够同时保持简单性和理论保证,而先前的工作(如$f$-PO)无法二者兼得。我们提出Bregman偏好优化(BPO),一个比率匹配的广义框架,提供了一族可实现目标策略最优性的目标函数。BPO将DPO纳入为特例,并为所有实例提供可处理的形式,只需几行代码即可实现。我们进一步开发了缩放Basu幂散度(SBA),一种可用于BPO实例的梯度缩放方法。BPO框架与其他DPO变体互补,并适用于由这些变体定义的目标策略。在实验中,与$f$-DPO或$f$-PO等其他概率损失扩展在生成保真度和多样性之间存在权衡不同,BPO的实例相对于DPO同时提升了胜率和熵。应用于Llama-3-8B-Instruct时,BPO在Llama-3-8B主干中取得了最先进的表现,在AlpacaEval2上达到55.9%的长度控制胜率。项目页面:https://github.com/aailab-kaist/BPO。

更新时间: 2025-10-25 08:32:17

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2505.19601v2

Offline Clustering of Linear Bandits: The Power of Clusters under Limited Data

Contextual multi-armed bandit is a fundamental learning framework for making a sequence of decisions, e.g., advertising recommendations for a sequence of arriving users. Recent works have shown that clustering these users based on the similarity of their learned preferences can accelerate the learning. However, prior work has primarily focused on the online setting, which requires continually collecting user data, ignoring the offline data widely available in many applications. To tackle these limitations, we study the offline clustering of bandits (Off-ClusBand) problem, which studies how to use the offline dataset to learn cluster properties and improve decision-making. The key challenge in Off-ClusBand arises from data insufficiency for users: unlike the online case where we continually learn from online data, in the offline case, we have a fixed, limited dataset to work from and thus must determine whether we have enough data to confidently cluster users together. To address this challenge, we propose two algorithms: Off-C2LUB, which we show analytically and experimentally outperforms existing methods under limited offline user data, and Off-CLUB, which may incur bias when data is sparse but performs well and nearly matches the lower bound when data is sufficient. We experimentally validate these results on both real and synthetic datasets.

Updated: 2025-10-25 08:29:46

标题: 线性赌博机的离线聚类:有限数据情况下聚类的能力

摘要: 上下文多臂赌博机是用于做出一系列决策的基本学习框架,例如为一系列到达的用户提供广告推荐。最近的研究表明,基于用户已学习偏好的相似性对用户进行聚类可以加速学习。然而,先前的工作主要集中在在线设置上,需要持续收集用户数据,忽略了许多应用中广泛存在的离线数据。为了解决这些限制,我们研究赌博机的离线聚类(Off-ClusBand)问题,即如何利用离线数据集学习聚类性质并改进决策。Off-ClusBand的关键挑战来自用户数据不足:与可以持续从在线数据中学习的在线情形不同,在离线情形下我们只有一个固定且有限的数据集,因此必须判断是否有足够的数据来自信地将用户聚为一类。为应对这一挑战,我们提出两种算法:Off-C2LUB,我们从理论和实验上证明它在离线用户数据有限时优于现有方法;以及Off-CLUB,它在数据稀疏时可能产生偏差,但在数据充足时表现良好并几乎达到下界。我们在真实和合成数据集上对这些结果进行了实验验证。

更新时间: 2025-10-25 08:29:46

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2505.19043v2

GALA: A GlobAl-LocAl Approach for Multi-Source Active Domain Adaptation

Domain Adaptation (DA) provides an effective way to tackle target-domain tasks by leveraging knowledge learned from source domains. Recent studies have extended this paradigm to Multi-Source Domain Adaptation (MSDA), which exploits multiple source domains carrying richer and more diverse transferable information. However, a substantial performance gap still remains between adaptation-based methods and fully supervised learning. In this paper, we explore a more practical and challenging setting, named Multi-Source Active Domain Adaptation (MS-ADA), to further enhance target-domain performance by selectively acquiring annotations from the target domain. The key difficulty of MS-ADA lies in designing selection criteria that can jointly handle inter-class diversity and multi-source domain variation. To address these challenges, we propose GALA, a simple yet effective GlobAl-LocAl Approach that combines a global k-means clustering step for target-domain samples with a cluster-wise local selection criterion, effectively tackling the above two issues in a complementary manner. Our proposed GALA is plug-and-play and can be seamlessly integrated into existing DA frameworks without introducing any additional trainable parameters. Extensive experiments on three standard DA benchmarks demonstrate that GALA consistently outperforms prior active learning and active DA methods, achieving performance comparable to the fully-supervised upper bound while using only 1% of the target annotations.
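The global-then-local structure can be sketched as plain k-means followed by a per-cluster pick. The local criterion used below (farthest-from-centroid as a crude uncertainty proxy) is a stand-in, since the abstract does not spell out the actual criterion; the points and initial centroids are toy data.

```python
def kmeans(points, centroids, iters=10):
    """Plain global k-means over 2-D points with fixed initial centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
                     for cl, c in zip(clusters, centroids)]
    return centroids, clusters

def select_per_cluster(clusters, centroids, budget=1):
    # Cluster-wise local criterion (a stand-in): pick the sample farthest
    # from its centroid; the paper's criterion is more refined.
    picked = []
    for cl, c in zip(clusters, centroids):
        ranked = sorted(cl, key=lambda p: -sum((a - b) ** 2 for a, b in zip(p, c)))
        picked.extend(ranked[:budget])
    return picked

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (12.0, 10.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
picked = select_per_cluster(clusters, centroids)
```

Spreading the annotation budget across clusters is what lets a tiny budget (1% in the paper) cover both class diversity and source-domain variation.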

Updated: 2025-10-25 08:26:45

标题: GALA:一种面向多源主动域自适应的全局-局部方法

摘要: 域自适应(DA)提供了一种有效的方法,通过利用从源域学到的知识来处理目标域任务。最近的研究已将这一范式扩展到多源域自适应(MSDA),利用携带更丰富和更多样可转移信息的多个源域。然而,适应方法与完全监督学习之间仍存在实质性的性能差距。在本文中,我们探索了一个更实际和具有挑战性的设置,称为多源主动域自适应(MS-ADA),通过有选择地从目标域获取标注来进一步增强目标域性能。MS-ADA的关键困难在于设计能够共同处理类间多样性和多源域变化的选择标准。为了解决这些挑战,我们提出了一种简单而有效的GALA策略(GALA),它将全局k均值聚类步骤与基于簇的局部选择标准相结合,有效地以互补的方式处理上述两个问题。我们提出的GALA是即插即用的,可以无缝地集成到现有的DA框架中,而不引入任何额外的可训练参数。对三个标准DA基准上的广泛实验表明,GALA始终优于先前的主动学习和主动DA方法,在仅使用目标注释的1%的情况下实现了与完全监督上限相当的性能。

更新时间: 2025-10-25 08:26:45

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2510.22214v1

Approximation and Generalization Abilities of Score-based Neural Network Generative Models for Sub-Gaussian Distributions

This paper studies the approximation and generalization abilities of score-based neural network generative models (SGMs) in estimating an unknown distribution $P_0$ from $n$ i.i.d. observations in $d$ dimensions. Assuming merely that $P_0$ is $\alpha$-sub-Gaussian, we prove that for any time step $t \in [t_0, n^{\mathcal{O}(1)}]$, where $t_0 > \mathcal{O}(\alpha^2n^{-2/d}\log n)$, there exists a deep ReLU neural network with width $\leq \mathcal{O}(n^{\frac{3}{d}}\log_2n)$ and depth $\leq \mathcal{O}(\log^2n)$ that can approximate the scores with $\tilde{\mathcal{O}}(n^{-1})$ mean square error and achieve a nearly optimal rate of $\tilde{\mathcal{O}}(n^{-1}t_0^{-d/2})$ for score estimation, as measured by the score matching loss. Our framework is universal and can be used to establish convergence rates for SGMs under milder assumptions than previous work. For example, assuming further that the target density function $p_0$ lies in Sobolev or Besov classes, with an appropriately early stopping strategy, we demonstrate that neural network-based SGMs can attain nearly minimax convergence rates up to logarithmic factors. Our analysis removes several crucial assumptions, such as Lipschitz continuity of the score function or a strictly positive lower bound on the target density.

Updated: 2025-10-25 08:26:45

标题: 基于分数的神经网络生成模型对次高斯分布的逼近和泛化能力

摘要: 本文研究了基于得分的神经网络生成模型(SGM)在估计未知分布$P_0$时的逼近和泛化能力,从$n$个$d$维独立同分布观测中。假设仅仅是$P_0$是$\alpha$-次高斯的,我们证明对于任何时间步$t \in [t_0, n^{\mathcal{O}(1)}]$,其中$t_0 > \mathcal{O}(\alpha^2n^{-2/d}\log n)$,存在一个深度ReLU神经网络,宽度$\leq \mathcal{O}(n^{\frac{3}{d}}\log_2n)$,深度$\leq \mathcal{O}(\log^2n)$,可以用$\tilde{\mathcal{O}}(n^{-1})$均方误差逼近得分,并且通过得分匹配损失实现接近最优率的$\tilde{\mathcal{O}}(n^{-1}t_0^{-d/2})$。我们的框架是通用的,可以用于建立SGMs的收敛速度,比以前的工作更温和的假设。例如,假设目标密度函数$p_0$属于Sobolev或Besov类,并采用适当的早停策略,我们证明基于神经网络的SGMs可以达到接近极小极限收敛速度,直到对数因子。我们的分析消除了一些关键假设,比如得分函数的Lipschitz连续性或目标密度的严格正下界。

更新时间: 2025-10-25 08:26:45

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2505.10880v2

Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

We present Ring-1T, the first open-source, state-of-the-art thinking model with a trillion-scale parameter. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
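The token-level masking-and-clipping mechanism named for IcePop can be sketched as follows. The thresholds (`gap_max`, `ratio_clip`) and the exact weighting rule are hypothetical; the abstract only states that discrepant tokens are masked and the rest are clipped.

```python
import math

def icepop_mask(logp_train, logp_rollout, ratio_clip=0.2, gap_max=1.0):
    """Token-level discrepancy masking and clipping (thresholds hypothetical):
    drop tokens whose training/rollout log-probs disagree by more than
    gap_max; clip the remaining tokens' importance ratios."""
    weights = []
    for lt, lr_ in zip(logp_train, logp_rollout):
        if abs(lt - lr_) > gap_max:
            weights.append(0.0)                      # mask the token entirely
        else:
            r = math.exp(lt - lr_)                   # importance ratio
            weights.append(min(max(r, 1 - ratio_clip), 1 + ratio_clip))
    return weights

weights = icepop_mask([-1.0, -2.0, -0.1], [-1.1, -0.5, -0.1])
```

Tokens whose training and rollout engines disagree badly contribute nothing to the gradient, which is how train-inference mismatch is prevented from destabilizing the update.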

Updated: 2025-10-25 08:21:06

标题: 每一步都在进化:为万亿级思维模型扩展强化学习

摘要: 我们介绍Ring-1T,这是首个具有万亿级参数的开源最先进思维模型。它拥有1万亿总参数,每个token激活约500亿参数。在万亿参数规模下训练此类模型带来了前所未有的挑战,包括训练-推理不一致、rollout处理效率低下以及RL系统瓶颈。为了解决这些问题,我们开创了三个相互关联的创新:(1)IcePop通过token级差异屏蔽和裁剪来稳定RL训练,解决训练-推理不匹配导致的不稳定性;(2)C3PO++通过动态划分,在token预算下提高长rollout的资源利用率,从而获得高时间效率;(3)ASystem,一个旨在克服阻碍万亿参数模型训练的系统性瓶颈的高性能RL框架。Ring-1T在关键基准测试中取得突破性成果:AIME-2025上93.4,HMMT-2025上86.72,CodeForces上2088,ARC-AGI-1上55.94。值得注意的是,它在IMO-2025上取得了银牌水平的成绩,彰显其出色的推理能力。通过向社区发布完整的1T参数MoE模型,我们为研究社区提供了直接获取尖端推理能力的途径。这一贡献是大规模推理智能民主化的重要里程碑,并为开源模型性能确立了新的基准。

更新时间: 2025-10-25 08:21:06

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.18855v2

LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation

Automated unit test generation is essential for robust software development, yet existing approaches struggle to generalize across multiple programming languages and operate within real-time development. While Large Language Models (LLMs) offer a promising solution, their ability to generate high coverage test code depends on prompting a concise context of the focal method. Current solutions, such as Retrieval-Augmented Generation, either rely on imprecise similarity-based searches or demand the creation of costly, language-specific static analysis pipelines. To address this gap, we present LSPRAG, a framework for concise-context retrieval tailored for real-time, language-agnostic unit test generation. LSPRAG leverages off-the-shelf Language Server Protocol (LSP) back-ends to supply LLMs with precise symbol definitions and references in real time. By reusing mature LSP servers, LSPRAG provides an LLM with language-aware context retrieval, requiring minimal per-language engineering effort. We evaluated LSPRAG on open-source projects spanning Java, Go, and Python. Compared to the best performance of baselines, LSPRAG increased line coverage by up to 174.55% for Golang, 213.31% for Java, and 31.57% for Python.
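The LSP wire format that such a framework speaks is standardized: JSON-RPC bodies framed by a `Content-Length` header. The sketch below builds a `textDocument/definition` request of the kind an LSPRAG-style retriever might send; the file URI and cursor position are made up.

```python
import json

def lsp_request(method, params, msg_id=1):
    """Frame a JSON-RPC request using the LSP wire format."""
    body = json.dumps({"jsonrpc": "2.0", "id": msg_id,
                       "method": method, "params": params})
    return f"Content-Length: {len(body)}\r\n\r\n{body}"

# Hypothetical query: where is the symbol at line 41, column 7 defined?
msg = lsp_request("textDocument/definition", {
    "textDocument": {"uri": "file:///src/app.py"},
    "position": {"line": 41, "character": 7},
})
header, body = msg.split("\r\n\r\n")
```

Because every language server speaks this same protocol, the retriever gets precise definitions and references for Java, Go, or Python without any per-language static analysis.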

Updated: 2025-10-25 08:19:21

标题: LSPRAG:LSP引导的用于语言无关实时单元测试生成的RAG

摘要: 自动单元测试生成对稳健的软件开发至关重要,但现有方法难以泛化到多种编程语言,也难以在实时开发中运行。虽然大型语言模型(LLMs)提供了一种有前途的解决方案,但其生成高覆盖率测试代码的能力取决于能否在提示中提供目标方法的简洁上下文。当前的解决方案,如检索增强生成,要么依赖不精确的基于相似性的搜索,要么需要构建昂贵的、特定语言的静态分析流水线。为填补这一空白,我们提出LSPRAG,一个面向实时、语言无关单元测试生成的简洁上下文检索框架。LSPRAG利用现成的语言服务器协议(LSP)后端,实时为LLMs提供精确的符号定义和引用。通过复用成熟的LSP服务器,LSPRAG为LLM提供语言感知的上下文检索,只需极少的针对每种语言的工程工作。我们在涵盖Java、Go和Python的开源项目上评估了LSPRAG。与各基线的最佳性能相比,LSPRAG将行覆盖率在Golang上最多提高174.55%,在Java上提高213.31%,在Python上提高31.57%。

更新时间: 2025-10-25 08:19:21

领域: cs.SE,cs.AI,D.2.5

下载: http://arxiv.org/abs/2510.22210v1

Simplifying Knowledge Transfer in Pretrained Models

Pretrained models are ubiquitous in the current deep learning landscape, offering strong results on a broad range of tasks. Recent works have shown that models differing in various design choices exhibit categorically diverse generalization behavior, resulting in one model grasping distinct data-specific insights unavailable to the other. In this paper, we propose to leverage large publicly available model repositories as an auxiliary source of model improvements. We introduce a data partitioning strategy where pretrained models autonomously adopt either the role of a student, seeking knowledge, or that of a teacher, imparting knowledge. Experiments across various tasks demonstrate the effectiveness of our proposed approach. In image classification, we improved the performance of ViT-B by approximately 1.4% through bidirectional knowledge transfer with ViT-T. For semantic segmentation, our method boosted all evaluation metrics by enabling knowledge transfer both within and across backbone architectures. In video saliency prediction, our approach achieved a new state-of-the-art. We further extend our approach to knowledge transfer between multiple models, leading to considerable performance improvements for all model participants.

Updated: 2025-10-25 08:18:41

标题: 简化预训练模型中的知识转移

摘要: 预训练模型在当前深度学习领域中无处不在,在广泛的任务范围上提供强大的结果。最近的研究表明,不同设计选择的模型表现出明显不同的泛化行为,导致一个模型掌握了另一个模型无法获得的独特数据洞察力。在本文中,我们提出利用大量公开可用的模型库作为模型改进的辅助来源。我们引入了一种数据分区策略,其中预训练模型自主地扮演学生或教师的角色,寻求知识或传授知识。在各种任务上的实验表明了我们提出的方法的有效性。在图像分类中,我们通过与ViT-T的双向知识传递,将ViT-B的性能提高了约1.4%。对于语义分割,我们的方法通过在骨干架构内部和跨越骨干架构之间进行知识传递,提升了所有评估指标。在视频显著性预测中,我们的方法实现了新的最先进水平。我们进一步将我们的方法扩展到多个模型之间的知识传递,大大提高了所有模型参与者的性能。

更新时间: 2025-10-25 08:18:41

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2510.22208v1

Visual Model Selection using Feature Importance Clusters in Fairness-Performance Similarity Optimized Space

In the context of algorithmic decision-making, fair machine learning methods often yield multiple models that balance predictive fairness and performance in varying degrees. This diversity introduces a challenge for stakeholders who must select a model that aligns with their specific requirements and values. To address this, we propose an interactive framework that assists in navigating and interpreting the trade-offs across a portfolio of models. Our approach leverages weakly supervised metric learning to learn a Mahalanobis distance that reflects similarity in fairness and performance outcomes, effectively structuring the feature importance space of the models according to stakeholder-relevant criteria. We then apply clustering technique (k-means) to group models based on their transformed representations of feature importances, allowing users to explore clusters of models with similar predictive behaviors and fairness characteristics. This facilitates informed decision-making by helping users understand how models differ not only in their fairness-performance balance but also in the features that drive their predictions.
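The learned metric at the core of this framework is a standard Mahalanobis form; the sketch below computes it for an explicit matrix `M`. The matrix here is hypothetical, standing in for one learned by weakly supervised metric learning to emphasize fairness-linked importance dimensions.

```python
def mahalanobis_sq(u, v, M):
    """Squared Mahalanobis distance (u - v)^T M (u - v) for a learned PSD M."""
    d = [a - b for a, b in zip(u, v)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return sum(di * mdi for di, mdi in zip(d, Md))

# Hypothetical learned metric: the first feature-importance dimension
# (say, strongly fairness-linked) is weighted five times the second.
M = [[5.0, 0.0], [0.0, 1.0]]
```

Models whose importance vectors differ along heavily weighted dimensions land far apart, so the subsequent k-means step groups models by stakeholder-relevant behavior rather than raw feature geometry.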

Updated: 2025-10-25 08:18:41

标题: 在公平性-性能相似度优化空间中使用特征重要性聚类进行视觉模型选择

摘要: 在算法决策的背景下,公平机器学习方法通常会产生多个在预测公平性与性能之间以不同程度取得平衡的模型。这种多样性给利益相关者带来了挑战:他们必须选择符合其特定要求和价值观的模型。为此,我们提出一个交互式框架,帮助用户在模型组合中导航并解读其中的权衡。我们的方法利用弱监督度量学习来学习一个反映公平性与性能结果相似性的马氏距离,从而按照与利益相关者相关的标准有效地构造模型的特征重要性空间。然后,我们应用聚类技术(k-means),根据模型特征重要性经变换后的表示对模型分组,使用户能够探索具有相似预测行为和公平性特征的模型簇。这有助于用户理解模型不仅在公平性-性能平衡上存在差异,而且在驱动其预测的特征上也各不相同,从而促进知情决策。

更新时间: 2025-10-25 08:18:41

领域: cs.LG

下载: http://arxiv.org/abs/2510.22209v1

The Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I)

Large Language Models (LLMs) can achieve near-optimal lossless compression by acting as powerful probability models. We investigate their use in the lossy domain, where reconstruction fidelity is traded for higher compression ratios. This paper introduces Error-Bounded Predictive Coding (EPC), a lossy text codec that leverages a Masked Language Model (MLM) as a decompressor. Instead of storing a subset of original tokens, EPC allows the model to predict masked content and stores minimal, rank-based corrections only when the model's top prediction is incorrect. This creates a residual channel that offers continuous rate-distortion control. We compare EPC to a simpler Predictive Masking (PM) baseline and a transform-based Vector Quantisation with a Residual Patch (VQ+RE) approach. Through an evaluation that includes precise bit accounting and rate-distortion analysis, we demonstrate that EPC consistently dominates PM, offering superior fidelity at a significantly lower bit rate by more efficiently utilising the model's intrinsic knowledge.
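The residual-channel idea can be shown with a toy codec: keep a subset of tokens, let a deterministic stand-in "MLM" rank candidates for the masked slots, and store a rank correction only where its top-1 guess is wrong. The predictor and token stream below are invented; a real EPC decoder would query an actual masked language model, and dropping corrections beyond an error bound is what makes the scheme lossy.

```python
def epc_encode(tokens, keep_every, predict_ranked):
    """Keep every `keep_every`-th token; for masked slots, store a rank-based
    correction only when the predictor's top-1 guess is wrong."""
    kept, corrections = [], {}
    for i, tok in enumerate(tokens):
        if i % keep_every == 0:
            kept.append(tok)
        else:
            rank = predict_ranked(i).index(tok)
            if rank != 0:              # rank 0 means the model already agrees
                corrections[i] = rank
    return kept, corrections

def epc_decode(n, keep_every, kept, corrections, predict_ranked):
    out, k = [], 0
    for i in range(n):
        if i % keep_every == 0:
            out.append(kept[k]); k += 1
        else:
            out.append(predict_ranked(i)[corrections.get(i, 0)])
    return out

tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Toy deterministic "MLM": ranked candidates per masked position.
ranked = {1: ["cat", "dog"], 3: ["in", "on"], 5: ["mat", "rug"]}
kept, corrections = epc_encode(tokens, 2, lambda i: ranked[i])
decoded = epc_decode(len(tokens), 2, kept, corrections, lambda i: ranked[i])
```

Here only one correction is needed (position 3, rank 1), so most masked slots cost zero bits beyond the shared model.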

Updated: 2025-10-25 08:18:31

标题: 失真地平线:误差有界的预测编码用于失真文本压缩(第一集)

摘要: 大型语言模型(LLMs)作为强大的概率模型,可以实现接近最优的无损压缩。我们研究它们在有损领域的应用,即以重建保真度换取更高的压缩比。本文介绍误差有界预测编码(EPC),一种利用掩码语言模型(MLM)作为解压器的有损文本编解码器。EPC不存储原始token的子集,而是让模型预测被掩码的内容,仅当模型的首位预测不正确时才存储最小的基于排名的修正。这构成一个残差通道,提供连续的率失真控制。我们将EPC与更简单的预测掩码(PM)基线以及基于变换的带残差补丁的矢量量化(VQ+RE)方法进行比较。通过包含精确比特核算和率失真分析的评估,我们证明EPC始终优于PM:通过更有效地利用模型的内在知识,在显著更低的比特率下提供更高的保真度。

更新时间: 2025-10-25 08:18:31

领域: cs.LG,cs.CL,cs.IT,math.IT,94A08, 68P30, 68T50,E.4; I.2.7; I.2.7

下载: http://arxiv.org/abs/2510.22207v1

Evidence Without Injustice: A New Counterfactual Test for Fair Algorithms

The growing philosophical literature on algorithmic fairness has examined statistical criteria such as equalized odds and calibration, causal and counterfactual approaches, and the role of structural and compounding injustices. Yet an important dimension has been overlooked: whether the evidential value of an algorithmic output itself depends on structural injustice. We contrast a predictive policing algorithm, which relies on historical crime data, with a camera-based system that records ongoing offenses, where both are designed to guide police deployment. In evaluating the moral acceptability of acting on a piece of evidence, we must ask not only whether the evidence is probative in the actual world, but also whether it would remain probative in nearby worlds without the relevant injustices. The predictive policing algorithm fails this test, but the camera-based system passes it. When evidence fails the test, it is morally problematic to use it punitively, more so than evidence that passes the test.

Updated: 2025-10-25 08:15:13

标题: 没有不公正的证据:公平算法的新反事实测试

摘要: 关于算法公平性的日益增长的哲学文献已经考察了均等几率和校准等统计标准、因果与反事实方法,以及结构性不公正与复合性不公正的作用。然而,一个重要的维度被忽视了:算法输出本身的证据价值是否取决于结构性不公正。我们将依赖历史犯罪数据的预测性警务算法与记录正在发生的违法行为的摄像头系统进行对比,两者均旨在指导警力部署。在评估依据某项证据采取行动的道德可接受性时,我们不仅要问该证据在现实世界中是否具有证明力,还要问在不存在相关不公正的邻近世界中它是否仍具有证明力。预测性警务算法未能通过这一测试,而摄像头系统通过了。当证据未通过该测试时,将其用于惩罚性目的在道德上是有问题的,其程度超过通过测试的证据。

更新时间: 2025-10-25 08:15:13

领域: cs.CY,cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.12822v2

Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration

Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domain for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM's varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM's performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during target LLM's updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM's capabilities. Our code is available at https://github.com/gjq100/Bohdi.git.
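The SWBLRT component can be illustrated generically: compare the success rate in a recent sliding window against the older stream via a binomial likelihood-ratio statistic. The exact test the paper uses is not specified in the abstract, so the window split and decision threshold here are hypothetical.

```python
import math

def binom_loglik(s, n, p):
    p = min(max(p, 1e-9), 1 - 1e-9)   # clamp to avoid log(0)
    return s * math.log(p) + (n - s) * math.log(1 - p)

def swblrt(outcomes, window):
    """Sliding-window binomial likelihood-ratio statistic: did the success
    rate in the most recent `window` outcomes shift from the older ones?
    Approximately chi-square(1) under the no-shift hypothesis."""
    old, new = outcomes[:-window], outcomes[-window:]
    ll_split = (binom_loglik(sum(old), len(old), sum(old) / len(old))
                + binom_loglik(sum(new), len(new), sum(new) / len(new)))
    ll_pooled = binom_loglik(sum(outcomes), len(outcomes),
                             sum(outcomes) / len(outcomes))
    return 2 * (ll_split - ll_pooled)

stable = [1, 0] * 20              # no capability shift: statistic near 0
shifted = [0] * 20 + [1] * 20     # capability jumped mid-stream
```

A large statistic signals that the target LLM's per-domain capability has moved, which is the trigger for re-allocating sampling proportions.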

Updated: 2025-10-25 08:12:41

Domains: cs.LG

Download: http://arxiv.org/abs/2506.15721v3

Right Place, Right Time: Market Simulation-based RL for Execution Optimisation

Execution algorithms are vital to modern trading: they enable market participants to execute large orders while minimising market impact and transaction costs. As these algorithms grow more sophisticated, optimising them becomes increasingly challenging. In this work, we present a reinforcement learning (RL) framework for discovering optimal execution strategies, evaluated within a reactive agent-based market simulator. This simulator creates reactive order flow and allows us to decompose slippage into its constituent components: market impact and execution risk. We assess the RL agent's performance using the efficient frontier based on work by Almgren and Chriss, measuring its ability to balance risk and cost. Results show that the RL-derived strategies consistently outperform baselines and operate near the efficient frontier, demonstrating a strong ability to optimise for risk and impact. These findings highlight the potential of reinforcement learning as a powerful tool in the trader's toolkit.

Updated: 2025-10-25 08:10:18

Domains: q-fin.CP,cs.AI,cs.LG,q-fin.RM,q-fin.TR

Download: http://arxiv.org/abs/2510.22206v1

Bridging Perception and Reasoning: Dual-Pipeline Neuro-Symbolic Landing for UAVs in Cluttered Environments

Autonomous landing in unstructured (cluttered, uneven, and map-poor) environments is a core requirement for Unmanned Aerial Vehicles (UAVs), yet purely vision-based or deep learning models often falter under covariate shift and provide limited interpretability. We propose NeuroSymLand, a neuro-symbolic framework that tightly couples two complementary pipelines: (i) an offline pipeline, where Large Language Models (LLMs) and human-in-the-loop refinement synthesize Scallop code from diverse landing scenarios, distilling generalizable and verifiable symbolic knowledge; and (ii) an online pipeline, where a compact foundation-based semantic segmentation model generates probabilistic Scallop facts that are composed into semantic scene graphs for real-time deductive reasoning. This design combines the perceptual strengths of lightweight foundation models with the interpretability and verifiability of symbolic reasoning. Node attributes (e.g., flatness, area) and edge relations (adjacency, containment, proximity) are computed with geometric routines rather than learned, avoiding the data dependence and latency of train-time graph builders. The resulting Scallop program encodes landing principles (avoid water and obstacles; prefer large, flat, accessible regions) and yields calibrated safety scores with ranked Regions of Interest (ROIs) and human-readable justifications. Extensive evaluations across datasets, diverse simulation maps, and real UAV hardware show that NeuroSymLand achieves higher accuracy, stronger robustness to covariate shift, and superior efficiency compared with state-of-the-art baselines, while advancing UAV safety and reliability in emergency response, surveillance, and delivery missions.
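
The geometric routines for node attributes can be as simple as the following toy sketch. The depth-map representation, the flatness-as-standard-deviation definition, and the example values are our illustration, not the paper's implementation:

```python
import numpy as np

def region_attrs(depth, mask):
    # Compute geometric attributes for a candidate landing region:
    # area = pixel count, flatness = std of depth inside the region
    # (lower std = flatter surface).
    area = int(mask.sum())
    flatness = float(depth[mask].std()) if area else float("inf")
    return {"area": area, "flatness": flatness}

depth = np.zeros((10, 10))
depth[5:, :] = np.arange(10) * 0.3        # bottom half slopes to one side
flat = np.zeros((10, 10), dtype=bool)
flat[:5, :] = True                        # candidate region 1: the level half
slope = ~flat                             # candidate region 2: the sloped half

a1, a2 = region_attrs(depth, flat), region_attrs(depth, slope)
assert a1["flatness"] < a2["flatness"]    # level region scores flatter
assert a1["area"] == 50
```

Because such attributes are computed analytically rather than learned, they need no training data and add negligible latency, which is the trade-off the abstract highlights.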

Updated: 2025-10-25 08:08:04

Domains: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.22204v1

MMbeddings: Parameter-Efficient, Low-Overfitting Probabilistic Embeddings Inspired by Nonlinear Mixed Models

We present MMbeddings, a probabilistic embedding approach that reinterprets categorical embeddings through the lens of nonlinear mixed models, effectively bridging classical statistical theory with modern deep learning. By treating embeddings as latent random effects within a variational autoencoder framework, our method substantially decreases the number of parameters -- from the conventional embedding approach of cardinality $\times$ embedding dimension, which quickly becomes infeasible with large cardinalities, to a significantly smaller, cardinality-independent number determined primarily by the encoder architecture. This reduction dramatically mitigates overfitting and computational burden in high-cardinality settings. Extensive experiments on simulated and real datasets, encompassing collaborative filtering and tabular regression tasks using varied architectures, demonstrate that MMbeddings consistently outperforms traditional embeddings, underscoring its potential across diverse machine learning applications.
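
The parameter saving is easy to see with back-of-the-envelope numbers; the cardinality and encoder layer sizes below are illustrative choices, not figures from the paper:

```python
# Conventional embedding table: one learned row per category.
cardinality, emb_dim = 1_000_000, 32
table_params = cardinality * emb_dim  # grows linearly with cardinality

# A cardinality-independent variational encoder, e.g. a 64-dim input feature
# through a 128-unit hidden layer to a mean and log-variance per embedding
# dimension (weights + biases). Its size depends only on the architecture.
encoder_params = (64 * 128 + 128) + (128 * 2 * emb_dim + 2 * emb_dim)

assert table_params == 32_000_000
assert encoder_params == 16_576
assert encoder_params < table_params // 1000  # three orders of magnitude smaller
```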

Updated: 2025-10-25 07:35:08

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.22198v1

Multi-dataset Joint Pre-training of Emotional EEG Enables Generalizable Affective Computing

Task-specific pre-training is essential when task representations diverge from generic pre-training features. Existing task-general pre-training EEG models struggle with complex tasks like emotion recognition due to mismatches between task-specific features and broad pre-training approaches. This work aims to develop a task-specific multi-dataset joint pre-training framework for cross-dataset emotion recognition, tackling problems of large inter-dataset distribution shifts, inconsistent emotion category definitions, and substantial inter-subject variability. We introduce a cross-dataset covariance alignment loss to align second-order statistical properties across datasets, enabling robust generalization without the need for extensive labels or per-subject calibration. To capture the long-term dependency and complex dynamics of EEG, we propose a hybrid encoder combining a Mamba-like linear attention channel encoder and a spatiotemporal dynamics model. Our method outperforms state-of-the-art large-scale EEG models by an average of 4.57% in AUROC for few-shot emotion recognition and 11.92% in accuracy for zero-shot generalization to a new dataset. Performance scales with the number of datasets used in pre-training. Multi-dataset joint pre-training achieves a performance gain of 8.55% over single-dataset training. This work provides a scalable framework for task-specific pre-training and highlights its benefit in generalizable affective computing. Our code is available at https://github.com/ncclab-sustech/mdJPT_nips2025.
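
A minimal reading of "aligning second-order statistical properties" is a CORAL-style penalty between feature covariance matrices. The loss below is our sketch of that idea, not the paper's exact formulation:

```python
import numpy as np

def covariance_alignment_loss(X_a, X_b):
    # Squared Frobenius distance between the feature covariances of two
    # datasets; minimizing it pulls their second-order statistics together.
    C_a = np.cov(X_a, rowvar=False)
    C_b = np.cov(X_b, rowvar=False)
    return float(np.sum((C_a - C_b) ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 4))                               # dataset A features
B = rng.normal(size=(500, 4)) @ np.diag([1.0, 2.0, 1.0, 1.0])  # mismatched scale

assert covariance_alignment_loss(A, A) == 0.0        # identical stats: zero loss
assert covariance_alignment_loss(A, B) > 0.0         # scale shift is penalized
```

In practice such a penalty would be computed on encoder outputs per mini-batch and added to the pre-training objective; it needs no emotion labels, matching the abstract's label-free claim.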

Updated: 2025-10-25 07:30:24

Domains: cs.LG,cs.AI,q-bio.NC

Download: http://arxiv.org/abs/2510.22197v1

Scaling Non-Parametric Sampling with Representation

Scaling and architectural advances have produced strikingly photorealistic image generative models, yet their mechanisms still remain opaque. Rather than advancing scaling, our goal is to strip away complicated engineering tricks and propose a simple, non-parametric generative model. Our design is grounded in three principles of natural images, (i) spatial non-stationarity, (ii) low-level regularities, and (iii) high-level semantics, and defines each pixel's distribution from its local context window. Despite its minimal architecture and no training, the model produces high-fidelity samples on MNIST and visually compelling CIFAR-10 images. This combination of simplicity and strong empirical performance points toward a minimal theory of natural-image structure. The model's white-box nature also allows us to have a mechanistic understanding of how the model generalizes and generates diverse images. We study it by tracing each generated pixel back to its source images. These analyses reveal a simple, compositional procedure for "part-whole generalization", suggesting a hypothesis for how large neural network generative models learn to generalize.
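
A toy version of the local-context idea fits in a few lines: each new pixel is copied from the source-image location whose causal (left/top) context best matches the context generated so far. This sketch is ours and omits the paper's multi-scale and semantic components:

```python
import numpy as np

def next_pixel(sources, canvas, y, x):
    # Match the already-generated left/top context against every interior
    # location of the source images; copy the pixel from the best match.
    ctx = np.array([canvas[y, x - 1], canvas[y - 1, x]])
    best, best_d = None, np.inf
    for img in sources:
        H, W = img.shape
        for i in range(1, H):
            for j in range(1, W):
                cand = np.array([img[i, j - 1], img[i - 1, j]])
                d = float(np.sum((cand - ctx) ** 2))
                if d < best_d:
                    best_d, best = d, img[i, j]
    return best

rng = np.random.default_rng(0)
sources = [rng.integers(0, 256, size=(8, 8)).astype(float) for _ in range(3)]
canvas = np.zeros((4, 4))
canvas[0, :] = sources[0][0, :4]   # seed the top border
canvas[:, 0] = sources[0][:4, 0]   # seed the left border
for y in range(1, 4):
    for x in range(1, 4):
        canvas[y, x] = next_pixel(sources, canvas, y, x)
```

Because every output pixel is literally copied from a source location, tracing pixels back to their sources, the provenance analysis the abstract describes, is trivial in this setting.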

Updated: 2025-10-25 07:29:26

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.22196v1

OptiTree: Hierarchical Thoughts Generation with Tree Search for LLM Optimization Modeling

Optimization modeling is one of the most crucial but technical parts of operations research (OR). To automate the modeling process, existing works have leveraged large language models (LLMs), prompting them to break down tasks into steps for generating variables, constraints, and objectives. However, due to the highly complex mathematical structures inherent in OR problems, standard fixed-step decomposition often fails to achieve high performance. To address this challenge, we introduce OptiTree, a novel tree search approach designed to enhance modeling capabilities for complex problems through adaptive problem decomposition into simpler subproblems. Specifically, we develop a modeling tree that organizes a wide range of OR problems based on their hierarchical problem taxonomy and complexity, with each node representing a problem category and containing relevant high-level modeling thoughts. Given a problem to model, we recurrently search the tree to identify a series of simpler subproblems and synthesize the global modeling thoughts by adaptively integrating the hierarchical thoughts. Experiments show that OptiTree significantly improves the modeling accuracy compared to the state-of-the-art, achieving over 10% improvements on the challenging benchmarks. The code is released at https://github.com/MIRALab-USTC/OptiTree/tree/main.

Updated: 2025-10-25 07:19:16

Domains: cs.AI

Download: http://arxiv.org/abs/2510.22192v1

TPPR: APT Tactic / Technique Pattern Guided Attack Path Reasoning for Attack Investigation

Provenance analysis based on system audit data has emerged as a fundamental approach for investigating Advanced Persistent Threat (APT) attacks. Due to the high concealment and long-term persistence of APT attacks, they are represented in only a minimal part of the critical path in the provenance graph. While existing techniques employ behavioral pattern matching and data flow feature matching to uncover latent associations in attack sequences through provenance graph path reasoning, their inability to establish effective attack context associations often leads to the conflation of benign system operations with real attack entities, failing to accurately characterize real APT behaviors. We observe that while the causality of entities in the provenance graph exhibits substantial complexity, attackers often follow specific attack patterns: specifically, clear combinations of tactics and techniques to achieve their goals. Based on these insights, we propose TPPR, a novel framework that first extracts anomaly subgraphs through abnormal node detection, TTP annotation, and graph pruning, then performs attack path reasoning using mined TTP sequential patterns, and finally reconstructs attack scenarios through confidence-based path scoring and merging. Extensive evaluation on real enterprise logs (more than 100 million events) and the DARPA TC dataset demonstrates TPPR's capability to achieve 99.9% graph simplification (700,000 to 20 edges) while preserving 91% of critical attack nodes, outperforming state-of-the-art solutions (SPARSE, DepImpact) by 63.1% and 67.9% in reconstruction precision while maintaining attack scenario integrity.
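
The graph-pruning step can be illustrated with a minimal reachability filter that keeps only entities causally connected to anomalous nodes. This is our simplification; the actual framework combines it with anomaly detection and TTP annotation, and the node names below are invented:

```python
from collections import deque

def prune_to_attack_context(edges, anomalous):
    # Keep only nodes that can reach, or be reached from, an anomalous node;
    # everything else (benign background activity) is pruned.
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)

    def reach(starts, adj):
        seen, q = set(starts), deque(starts)
        while q:
            for w in adj.get(q.popleft(), []):
                if w not in seen:
                    seen.add(w)
                    q.append(w)
        return seen

    return reach(anomalous, fwd) | reach(anomalous, bwd)

edges = [("bash", "wget"), ("wget", "payload.bin"), ("cron", "logrotate")]
kept = prune_to_attack_context(edges, {"wget"})
assert kept == {"bash", "wget", "payload.bin"}  # benign cron branch pruned
```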

Updated: 2025-10-25 07:13:07

Domains: cs.CR

Download: http://arxiv.org/abs/2510.22191v1

Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 1.36 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.
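
At its core a tokenizer of this kind maps continuous features to a finite codebook. The sketch below is plain vector quantization over individual vectors, not VFMTok's region-adaptive variant, and the numbers are arbitrary:

```python
import numpy as np

def quantize(features, codebook):
    # Assign each feature vector to its nearest codebook entry (L2 distance)
    # and return both the discrete token ids and the quantized vectors.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, -0.1], [0.9, 1.2]])
idx, tokens = quantize(feats, codebook)
assert idx.tolist() == [0, 1]
```

The region-adaptive framework in the paper quantizes irregular regions of the frozen foundation-model feature map rather than every cell of the regular 2D grid, which is how it reduces token redundancy.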

Updated: 2025-10-25 07:05:42

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.08441v2

Learning Satellite Pattern-of-Life Identification: A Diffusion-based Approach

As Earth's orbital satellite population grows exponentially, effective space situational awareness becomes critical for collision prevention and sustainable operations. Current approaches to monitor satellite behaviors rely on expert knowledge and rule-based systems that scale poorly. Among essential monitoring tasks, satellite pattern-of-life (PoL) identification, analyzing behaviors like station-keeping maneuvers and drift operations, remains underdeveloped due to aerospace system complexity, operational variability, and inconsistent ephemerides sources. We propose a novel generative approach for satellite PoL identification that significantly eliminates the dependence on expert knowledge. The proposed approach leverages orbital elements and positional data to enable automatic pattern discovery directly from observations. Our implementation uses a diffusion model framework for end-to-end identification without manual refinement or domain expertise. The architecture combines a multivariate time-series encoder to capture hidden representations of satellite positional data with a conditional denoising process to generate accurate PoL classifications. Through experiments across diverse real-world satellite operational scenarios, our approach demonstrates superior identification quality and robustness across varying data quality characteristics. A case study using actual satellite data confirms the approach's transformative potential for operational behavior pattern identification, enhanced tracking, and space situational awareness.

Updated: 2025-10-25 07:01:42

Domains: cs.LG,cs.CE

Download: http://arxiv.org/abs/2412.10814v3

RGC: a radio AGN classifier based on deep learning. I. A semi-supervised model for the VLA images of bent radio AGNs

Wide-angle tail (WAT) and narrow-angle tail (NAT) radio active galactic nuclei (RAGNs) are key tracers of dense environments in galaxy groups and clusters, yet no machine-learning classifier of bent RAGNs has been trained using both unlabeled data and purely visually inspected labels. We release the RGC Python package, which includes two newly preprocessed labeled datasets of 639 WATs and NATs derived from a publicly available catalog of visually inspected sources, along with a semi-supervised RGC model that leverages 20,000 unlabeled RAGNs. The two labeled datasets in RGC were preprocessed using PyBDSF, which retains spurious sources, and Photutils, which removes them. The RGC model integrates the self-supervised framework BYOL (Bootstrap Your Own Latent) with the supervised E2CNN (E2-equivariant Convolutional Neural Network) to form a semi-supervised binary classifier. The RGC model, when trained and evaluated on a dataset devoid of spurious sources, reaches peak performance, attaining an accuracy of 88.88% along with F1-scores of 0.90 for WATs and 0.85 for NATs. The model's attention patterns amid class imbalance suggest that this work can serve as a stepping stone toward developing physics-informed foundation models capable of identifying a broad range of AGN physical properties.

Updated: 2025-10-25 06:55:29

Domains: astro-ph.IM,astro-ph.CO,cs.LG

Download: http://arxiv.org/abs/2510.22190v1

Quantitative Bounds for Sorting-Based Permutation-Invariant Embeddings

We study the sorting-based embedding $\beta_{\mathbf A} : \mathbb R^{n \times d} \to \mathbb R^{n \times D}$, $\mathbf X \mapsto {\downarrow}(\mathbf X \mathbf A)$, where $\downarrow$ denotes column wise sorting of matrices. Such embeddings arise in graph deep learning where outputs should be invariant to permutations of graph nodes. Previous work showed that for large enough $D$ and appropriate $\mathbf A$, the mapping $\beta_{\mathbf A}$ is injective, and moreover satisfies a bi-Lipschitz condition. However, two gaps remain: firstly, the optimal size $D$ required for injectivity is not yet known, and secondly, no estimates of the bi-Lipschitz constants of the mapping are known. In this paper, we make substantial progress in addressing both of these gaps. Regarding the first gap, we improve upon the best known upper bounds for the embedding dimension $D$ necessary for injectivity, and also provide a lower bound on the minimal injectivity dimension. Regarding the second gap, we construct matrices $\mathbf A$, so that the bi-Lipschitz distortion of $\beta_{\mathbf A} $ depends quadratically on $n$, and is completely independent of $d$. We also show that the distortion of $\beta_{\mathbf A}$ is necessarily at least in $\Omega(\sqrt{n})$. Finally, we provide similar results for variants of $\beta_{\mathbf A}$ obtained by applying linear projections to reduce the output dimension of $\beta_{\mathbf A}$.
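
The embedding itself is two lines of NumPy, and its permutation invariance is easy to check directly (the dimensions below are arbitrary):

```python
import numpy as np

def beta(X, A):
    # beta_A(X) = columnwise sort of X @ A; sorting discards row order,
    # so the output is invariant to permutations of the n points.
    return np.sort(X @ A, axis=0)

rng = np.random.default_rng(0)
n, d, D = 5, 3, 8
X = rng.normal(size=(n, d))   # n points in R^d
A = rng.normal(size=(d, D))   # D random directions
P = rng.permutation(n)

assert np.allclose(beta(X, A), beta(X[P], A))  # invariant under row permutation
```

Permuting the rows of X permutes the rows of XA, and a columnwise sort is unaffected by any row permutation; the injectivity and bi-Lipschitz questions studied in the paper concern how large D must be for this map to also separate distinct point sets.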

Updated: 2025-10-25 06:44:08

Domains: cs.LG,cs.IT,math.FA,math.IT,math.MG,46B07 (Primary), 68T07, 54E40 (Secondary)

Download: http://arxiv.org/abs/2510.22186v1

Dopamine-driven synaptic credit assignment in neural networks

Solving the synaptic Credit Assignment Problem (CAP) is central to learning in both biological and artificial neural systems. Finding an optimal solution for synaptic CAP means setting the synaptic weights that assign credit to each neuron for influencing the final output and behavior of neural networks or animals. Gradient-based methods solve this problem in artificial neural networks using back-propagation, however, not in the most efficient way. For instance, back-propagation requires a chain of top-down gradient computations. This leads to an expensive optimization process in terms of computing power and memory, linked with the well-known weight transport and update locking problems. To address these shortcomings, we take a NeuroAI approach and draw inspiration from neural Reinforcement Learning to develop a derivative-free optimizer for training neural networks, Dopamine. Dopamine is developed for Weight Perturbation (WP) learning that exploits stochastic updating of weights towards optima. It achieves this by minimizing the regret, a form of Reward Prediction Error (RPE) between the expected outcome from the perturbed model and the actual outcome from the unperturbed model. We use this RPE to adjust the learning rate in the network (i.e., creating an adaptive learning rate strategy, similar to the role of dopamine in the brain). We tested the Dopamine optimizer for training multi-layered perceptrons for XOR tasks, and recurrent neural networks for chaotic time series forecasting. Dopamine-trained models demonstrate accelerated convergence and outperform standard WP, and give comparable performance to gradient-based algorithms, while consuming significantly less computation and memory. Overall, the Dopamine optimizer not only finds robust solutions and comparable performance to the state-of-the-art Machine Learning optimizers but is also neurobiologically more plausible.
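
The flavor of weight-perturbation learning with a regret-modulated step size can be sketched on a toy quadratic. The antithetic perturbation and the specific learning-rate rule below are our choices for a runnable illustration, not the paper's algorithm:

```python
import numpy as np

def wp_train(loss, w, steps=2000, sigma=0.1, lr0=0.1, seed=0):
    # Derivative-free weight perturbation: estimate a descent direction from
    # the regret (outcome difference) of random weight perturbations.
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        eps = rng.normal(size=w.shape) * sigma
        # Antithetic regret: compare oppositely perturbed models.
        regret = (loss(w + eps) - loss(w - eps)) / 2.0
        # Regret-modulated learning rate: large surprises -> smaller steps,
        # a loose analogue of dopamine-like RPE adaptation.
        lr = lr0 / (1.0 + abs(regret))
        w = w - lr * regret * eps / sigma**2
    return w

target = np.array([1.0, -2.0, 0.5])
sq_loss = lambda w: float(np.sum((w - target) ** 2))
w = wp_train(sq_loss, np.zeros(3))
assert sq_loss(w) < 0.1 * sq_loss(np.zeros(3))  # large reduction, no gradients
```

Note that no back-propagated gradient is ever computed: only forward evaluations of the loss are needed, which is what removes the weight-transport and update-locking issues.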

Updated: 2025-10-25 06:17:49

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.22178v1

Macro2Micro: A Rapid and Precise Cross-modal Magnetic Resonance Imaging Synthesis using Multi-scale Structural Brain Similarity

The human brain is a complex system requiring both macroscopic and microscopic components for comprehensive understanding. However, mapping nonlinear relationships between these scales remains challenging due to technical limitations and the high cost of multimodal Magnetic Resonance Imaging (MRI) acquisition. To address this, we introduce Macro2Micro, a deep learning framework that predicts brain microstructure from macrostructure using a Generative Adversarial Network (GAN). Based on the hypothesis that microscale structural information can be inferred from macroscale structures, Macro2Micro explicitly encodes multiscale brain information into distinct processing branches. To enhance artifact elimination and output quality, we propose a simple yet effective auxiliary discriminator and learning objective. Extensive experiments demonstrated that Macro2Micro faithfully translates T1-weighted MRIs into corresponding Fractional Anisotropy (FA) images, achieving a 6.8% improvement in the Structural Similarity Index Measure (SSIM) compared to previous methods, while retaining the individual biological characteristics of the brain. With an inference time of less than 0.01 seconds per MR modality translation, Macro2Micro introduces the potential for real-time multimodal rendering in medical and research applications. The code will be made available upon acceptance.
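
For reference, the SSIM metric quoted above compares luminance, contrast, and structure between two images. Below is a single-window version of the standard formula; the full metric averages this statistic over local sliding windows:

```python
import numpy as np

def ssim_global(x, y, L=1.0, k1=0.01, k2=0.03):
    # Single-window SSIM over the whole image (the standard metric applies
    # this locally and averages). L is the dynamic range of pixel values.
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
img = rng.random((16, 16))
assert abs(ssim_global(img, img) - 1.0) < 1e-12    # identical images -> 1
assert ssim_global(img, rng.random((16, 16))) < 0.9  # unrelated images score low
```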

Updated: 2025-10-25 06:08:55

Domains: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2412.11277v2

FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA

Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in federated learning (FL). However, when combined with differentially private stochastic gradient descent (DP-SGD), LoRA faces substantial noise amplification: DP-SGD perturbs per-sample gradients, and the matrix multiplication of the LoRA update ($BA$) intensifies this effect. Freezing one matrix (e.g., $A$) reduces the noise but restricts model expressiveness, often resulting in suboptimal adaptation. To address this, we propose $\texttt{FedSVD}$, a simple yet effective method that introduces a global reparameterization based on singular value decomposition (SVD). In our approach, each client optimizes only the $B$ matrix and transmits it to the server. The server aggregates the $B$ matrices, computes the product $BA$ using the previous $A$, and refactorizes the result via SVD. This yields a new adaptive $A$ composed of the orthonormal right singular vectors of $BA$, and an updated $B$ containing the remaining SVD components. This reparameterization avoids quadratic noise amplification, while allowing $A$ to better capture the principal directions of the aggregate updates. Moreover, the orthonormal structure of $A$ bounds the gradient norms of $B$ and preserves more signal under DP-SGD, as confirmed by our theoretical analysis. As a result, $\texttt{FedSVD}$ consistently improves stability and performance across a variety of privacy settings and benchmarks, outperforming relevant baselines under both private and non-private regimes.
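
The server-side SVD refactorization can be sketched as follows; the matrix shapes and variable names are our illustration of the description above:

```python
import numpy as np

def refactorize(B_agg, A_prev, r):
    # Server step: form the full update BA, then SVD it so the new A holds
    # the orthonormal right singular vectors and B absorbs U and the
    # singular values (the "remaining SVD components").
    M = B_agg @ A_prev
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    A_new = Vt[:r]                  # rows are orthonormal: A_new @ A_new.T = I
    B_new = U[:, :r] * S[:r]        # so B_new @ A_new reproduces M (rank <= r)
    return B_new, A_new

rng = np.random.default_rng(1)
r, d_out, d_in = 4, 16, 32
A_prev = rng.normal(size=(r, d_in))
B_agg = rng.normal(size=(d_out, r))   # aggregate of client-optimized B matrices

B_new, A_new = refactorize(B_agg, A_prev, r)
assert np.allclose(B_new @ A_new, B_agg @ A_prev)   # the update is preserved
assert np.allclose(A_new @ A_new.T, np.eye(r))      # new A is orthonormal
```

The orthonormal rows of the new A are what bound the gradient norms of B in the paper's analysis; clients then resume optimizing only B against this refreshed A.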

Updated: 2025-10-25 06:07:40

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.12805v2

ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

Current RLHF methods such as PPO and DPO typically reduce human preferences to binary labels, which are costly to obtain and too coarse to reflect individual variation. We observe that expressions of satisfaction and dissatisfaction follow stable linguistic patterns across users, indicating that more informative supervisory signals can be extracted from free-form feedback. Building on this insight, we introduce Adaptive Reward-Following (ARF), which converts natural feedback into continuous preference trajectories and optimizes them using the novel TraceBias algorithm. Across diverse LLMs and preference domains, ARF consistently outperforms PPO and DPO, improving alignment by up to 7.6%. Our results demonstrate that continuous reward modeling provides a scalable path toward personalized and theoretically grounded RLHF.

Updated: 2025-10-25 05:45:15

Domains: cs.CL,cs.AI,68T05, 68Q25,I.2.6; I.2.7

Download: http://arxiv.org/abs/2507.03069v3

Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests

AI psychometrics evaluates AI systems in roles that traditionally require emotional judgment and ethical consideration. Prior work often reuses human trait inventories (Big Five, HEXACO) or ad hoc personas, limiting behavioral realism and domain relevance. We propose a framework that (1) uses situational judgment tests (SJTs) from realistic scenarios to probe domain-specific competencies; (2) integrates industrial-organizational and personality psychology to design sophisticated personas which include behavioral and psychological descriptors, life history, and social and emotional functions; and (3) employs structured generation with population demographic priors and memoir-inspired narratives, encoded with Pydantic schemas. In a law enforcement assistant case study, we construct a rich dataset of personas drawn across 8 persona archetypes and SJTs across 11 attributes, and analyze behaviors across subpopulation and scenario slices. The dataset spans 8,500 personas, 4,000 SJTs, and 300,000 responses. We will release the dataset and all code to the public.

Updated: 2025-10-25 05:45:10

Domains: cs.AI, I.2.7; I.2.6; H.1.2; J.4

Download: http://arxiv.org/abs/2510.22170v1

Expert Validation of Synthetic Cervical Spine Radiographs Generated with a Denoising Diffusion Probabilistic Model

Machine learning in neurosurgery is limited by challenges in assembling large, high-quality imaging datasets. Synthetic data offers a scalable, privacy-preserving solution. We evaluated the feasibility of generating realistic lateral cervical spine radiographs using a denoising diffusion probabilistic model (DDPM) trained on 4,963 images from the Cervical Spine X-ray Atlas. Model performance was monitored via training/validation loss and Fréchet inception distance, and synthetic image quality was assessed in a blinded "clinical Turing test" with six neuroradiologists and two spine-fellowship trained neurosurgeons. Experts reviewed 50 quartets containing one real and three synthetic images, identifying the real image and rating realism on a 4-point Likert scale. Experts correctly identified the real image in 29% of trials (Fleiss' kappa=0.061). Mean realism scores were comparable between real (3.323) and synthetic images (3.228, 3.258, and 3.320; p=0.383, 0.471, 1.000). Nearest-neighbor analysis found no evidence of memorization. We also provide a dataset of 20,063 synthetic radiographs. These results demonstrate that DDPM-generated cervical spine X-rays are statistically indistinguishable in realism and quality from real clinical images, offering a novel approach to creating large-scale neuroimaging datasets for ML applications in landmarking, segmentation, and classification.
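The agreement statistic reported here (Fleiss' kappa = 0.061, essentially chance-level, alongside a 29% identification rate against the 25% chance baseline for a four-alternative choice) is computed from a matrix of item-by-category rating counts. A minimal numpy sketch of that computation, with a hypothetical ratings matrix rather than the study's data:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a (n_items, n_categories) matrix of rating counts.

    counts[i, j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement: fraction of concordant rater pairs.
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()
    # Chance agreement from the marginal category proportions.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    p_exp = (p_cat ** 2).sum()
    return (p_bar - p_exp) / (1.0 - p_exp)

# Hypothetical 4-alternative identification task: 3 quartets, 2 raters.
ratings = np.array([
    [2, 0, 0, 0],   # both raters picked the same image
    [1, 1, 0, 0],   # raters split
    [0, 0, 1, 1],   # raters split
])
kappa = fleiss_kappa(ratings)   # ≈ 0: no agreement beyond chance
```

Values near zero, as in the study, mean the experts' picks agreed with each other no more than random guessing would.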

Updated: 2025-10-25 05:25:37

Domains: eess.IV, cs.CV, cs.LG

Download: http://arxiv.org/abs/2510.22166v1

Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics

Mean field games (MFGs) have emerged as a powerful framework for modeling interactions in large-scale multi-agent systems. Despite recent advancements in reinforcement learning (RL) for MFGs, existing methods are typically limited to finite spaces or stationary models, hindering their applicability to real-world problems. This paper introduces a novel deep reinforcement learning (DRL) algorithm specifically designed for non-stationary continuous MFGs. The proposed approach builds upon a Fictitious Play (FP) methodology, leveraging DRL for best-response computation and supervised learning for average policy representation. Furthermore, it learns a representation of the time-dependent population distribution using a Conditional Normalizing Flow. To validate the effectiveness of our method, we evaluate it on three different examples of increasing complexity. By addressing critical limitations in scalability and density approximation, this work represents a significant advancement in applying DRL techniques to complex MFG problems, bringing the field closer to real-world multi-agent systems.

Updated: 2025-10-25 04:50:52

Domains: cs.LG, cs.AI, cs.MA, math.OC

Download: http://arxiv.org/abs/2510.22158v1

Learning to Coordinate Bidders in Non-Truthful Auctions

In non-truthful auctions such as first-price and all-pay auctions, the independent strategic behaviors of bidders, with the corresponding Bayes-Nash equilibrium notion, are notoriously difficult to characterize and can cause undesirable outcomes. An alternative approach to achieve better outcomes in non-truthful auctions is to coordinate the bidders: let a mediator make incentive-compatible recommendations of correlated bidding strategies to the bidders, namely, implementing a Bayes correlated equilibrium (BCE). The implementation of BCE, however, requires knowledge of the distributions of bidders' private valuations, which is often unavailable. We initiate the study of the sample complexity of learning Bayes correlated equilibria in non-truthful auctions. We prove that the set of strategic-form BCEs in a large class of non-truthful auctions, including first-price and all-pay auctions, can be learned with a polynomial number $\tilde O(\frac{n}{\varepsilon^2})$ of samples of bidders' values. This moderate number of samples demonstrates the statistical feasibility of learning to coordinate bidders. Our technique is a reduction to the problem of estimating bidders' expected utility from samples, combined with an analysis of the pseudo-dimension of the class of all monotone bidding strategies.
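The reduction described above rests on estimating a bidder's expected utility from value samples. In a first-price auction, a bidder with value v who bids b wins when b exceeds the highest competing bid and then pays b, so the expected utility is (v - b) times the win probability; the empirical estimator replaces that probability with a sample frequency. A toy numpy sketch of this estimation step only (not the paper's full construction):

```python
import numpy as np

def empirical_utility(value, bid, rival_bid_samples):
    """Estimate E[(value - bid) * 1{bid > max rival bid}] from samples."""
    rival = np.asarray(rival_bid_samples, dtype=float)
    win_rate = np.mean(bid > rival)          # empirical win probability
    return (value - bid) * win_rate

# If the rival's highest bid is Uniform[0, 1], the true expected utility of
# bidding b with value v is (v - b) * b; the estimate concentrates around it.
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=200_000)
est = empirical_utility(value=0.8, bid=0.4, rival_bid_samples=samples)
# True value: (0.8 - 0.4) * 0.4 = 0.16
```

The sample-complexity result controls how many such value samples suffice for this kind of estimate to be uniformly accurate over all monotone bidding strategies.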

Updated: 2025-10-25 04:40:44

Domains: cs.GT, cs.LG, econ.TH

Download: http://arxiv.org/abs/2507.02801v2

CCDP: Composition of Conditional Diffusion Policies with Guided Sampling

Imitation Learning offers a promising approach to learn directly from data without requiring explicit models, simulations, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling step until a successful action is obtained can be inefficient. In this work, we propose an enhanced sampling strategy that refines the sampling distribution to avoid previously unsuccessful actions. We demonstrate that by solely utilizing data from successful demonstrations, our method can infer recovery actions without the need for additional exploratory behavior or a high-level controller. Furthermore, we leverage the concept of diffusion model decomposition to break down the primary problem, which may require long-horizon history to manage failures, into multiple smaller, more manageable sub-problems in learning, data collection, and inference, thereby enabling the system to adapt to variable failure counts. Our approach yields a low-level controller that dynamically adjusts its sampling space to improve efficiency when prior samples fall short. We validate our method across several tasks, including door opening with unknown directions, object manipulation, and button-searching scenarios, demonstrating that our approach outperforms traditional baselines.
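The core refinement idea — reshaping the sampling distribution so that previously unsuccessful actions are not re-drawn — can be mimicked in its simplest form by rejecting candidates that fall near past failures. A hypothetical numpy sketch of that idea alone (the paper composes conditional diffusion policies, which is not reproduced here):

```python
import numpy as np

def sample_avoiding_failures(sample_fn, failed_actions, radius, rng, max_tries=1000):
    """Draw an action, rejecting candidates within `radius` of any past failure."""
    failed = np.atleast_2d(np.asarray(failed_actions, dtype=float)) if len(failed_actions) else None
    for _ in range(max_tries):
        a = sample_fn(rng)
        if failed is None or np.min(np.linalg.norm(failed - a, axis=1)) > radius:
            return a
    raise RuntimeError("no acceptable action found")

rng = np.random.default_rng(0)
base_policy = lambda rng: rng.uniform(-1.0, 1.0, size=2)   # stand-in for a learned policy
failures = [np.array([0.5, 0.5]), np.array([-0.2, 0.1])]   # actions that failed before
action = sample_avoiding_failures(base_policy, failures, radius=0.3, rng=rng)
```

The returned action is guaranteed by construction to lie outside every exclusion radius, which is the behavior the refined low-level controller is meant to approximate in distribution.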

Updated: 2025-10-25 04:31:05

Domains: cs.RO, cs.AI

Download: http://arxiv.org/abs/2503.15386v3

Quantifying Dataset Similarity to Guide Transfer Learning

Transfer learning has become a cornerstone of modern machine learning, as it can empower models by leveraging knowledge from related domains to improve learning effectiveness. However, transferring from poorly aligned data can harm rather than help performance, making it crucial to determine whether the transfer will be beneficial before implementation. This work aims to address this challenge by proposing an innovative metric to measure dataset similarity and provide quantitative guidance on transferability. In the literature, existing methods largely focus on feature distributions while overlooking label information and predictive relationships, potentially missing critical transferability insights. In contrast, our proposed metric, the Cross-Learning Score (CLS), measures dataset similarity through bidirectional generalization performance between domains. We provide a theoretical justification for CLS by establishing its connection to the cosine similarity between the decision boundaries for the target and source datasets. Computationally, CLS is efficient and fast to compute as it bypasses the problem of expensive distribution estimation for high-dimensional problems. We further introduce a general framework that categorizes source datasets into positive, ambiguous, or negative transfer zones based on their CLS relative to the baseline error, enabling informed decisions. Additionally, we extend this approach to encoder-head architectures in deep learning to better reflect modern transfer pipelines. Extensive experiments on diverse synthetic and real-world tasks demonstrate that CLS can reliably predict whether transfer will improve or degrade performance, offering a principled tool for guiding data selection in transfer learning.
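The bidirectional idea behind CLS can be illustrated with any cheap classifier: train on one domain and test on the other, in both directions, then average the two generalization accuracies. A nearest-centroid numpy sketch under that reading (the paper's exact estimator, baselines, and transfer-zone thresholds are not reproduced):

```python
import numpy as np

def centroid_fit(X, y):
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def centroid_predict(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

def cross_learning_score(Xs, ys, Xt, yt):
    """Average of source->target and target->source generalization accuracy."""
    acc_st = np.mean(centroid_predict(centroid_fit(Xs, ys), Xt) == yt)
    acc_ts = np.mean(centroid_predict(centroid_fit(Xt, yt), Xs) == ys)
    return 0.5 * (acc_st + acc_ts)

rng = np.random.default_rng(0)
def make_domain(rng, n=200):
    X0 = rng.normal([0.0, 0.0], 0.3, size=(n, 2))
    X1 = rng.normal([4.0, 4.0], 0.3, size=(n, 2))
    return np.vstack([X0, X1]), np.repeat([0, 1], n)

Xs, ys = make_domain(rng)
Xt, yt = make_domain(rng)
aligned = cross_learning_score(Xs, ys, Xt, yt)       # high: transfer should help
hostile = cross_learning_score(Xs, ys, Xt, 1 - yt)   # low: predictive rules disagree
```

A high score indicates the two domains share a predictive relationship, the signal the paper argues feature-distribution metrics miss; note this sketch uses label information in both directions, exactly what CLS adds over distribution-only comparisons.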

Updated: 2025-10-25 04:27:59

Domains: stat.ML, cs.LG

Download: http://arxiv.org/abs/2510.10866v2

Frequency-Spatial Interaction Driven Network for Low-Light Image Enhancement

Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination. With the advent of deep learning, the LLIE technique has achieved significant breakthroughs. However, existing LLIE methods either ignore the important role of frequency-domain information or fail to effectively promote the propagation and flow of information, limiting LLIE performance. In this paper, we develop a novel frequency-spatial interaction-driven network (FSIDNet) for LLIE based on a two-stage architecture. Specifically, the first stage restores the amplitude of low-light images to improve lightness, and the second stage restores phase information to refine fine-grained structures. Considering that frequency-domain and spatial-domain information are complementary and both beneficial for LLIE, we further develop two frequency-spatial interaction blocks that mutually amalgamate the complementary spatial and frequency information to enhance the capability of the model. In addition, we construct the Information Exchange Module (IEM) to link the two stages, adequately incorporating cross-stage and cross-scale features to effectively promote the propagation and flow of information within the two-stage network structure. Finally, we conduct experiments on several widely used benchmark datasets (i.e., LOL-Real, LSRW-Huawei, etc.), which demonstrate that our method achieves excellent performance in terms of visual results and quantitative metrics while preserving good model efficiency.
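The amplitude/phase split that the two stages operate on comes directly from the 2D Fourier transform: amplitude carries overall lightness, phase carries structure, and the two recombine exactly. A minimal numpy illustration of that decomposition (the decomposition only, not the network):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 0.2, size=(8, 8))     # a toy "low-light" image

F = np.fft.fft2(img)
amplitude = np.abs(F)                        # what stage 1 restores (lightness)
phase = np.angle(F)                          # what stage 2 refines (structure)

# Amplitude and phase recombine into the original image exactly.
recon = np.fft.ifft2(amplitude * np.exp(1j * phase)).real

# Scaling only the amplitude brightens the image while the phase
# (structure) is untouched -- by linearity this doubles every pixel.
bright = np.fft.ifft2(2.0 * amplitude * np.exp(1j * phase)).real
```

This is why the paper can treat lightness correction and structure refinement as separate stages: the two Fourier components can be manipulated independently and then recombined losslessly.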

Updated: 2025-10-25 04:17:50

Domains: eess.IV, cs.CV, cs.LG, cs.MM, eess.SP

Download: http://arxiv.org/abs/2510.22154v1

Power to the Clients: Federated Learning in a Dictatorship Setting

Federated learning (FL) has emerged as a promising paradigm for decentralized model training, enabling multiple clients to collaboratively learn a shared model without exchanging their local data. However, the decentralized nature of FL also introduces vulnerabilities, as malicious clients can compromise or manipulate the training process. In this work, we introduce dictator clients, a novel, well-defined, and analytically tractable class of malicious participants capable of entirely erasing the contributions of all other clients from the server model, while preserving their own. We propose concrete attack strategies that empower such clients and systematically analyze their effects on the learning process. Furthermore, we explore complex scenarios involving multiple dictator clients, including cases where they collaborate, act independently, or form an alliance in order to ultimately betray one another. For each of these settings, we provide a theoretical analysis of their impact on the global model's convergence. Our theoretical algorithms and findings for these complex multi-dictator scenarios are further supported by empirical evaluations on both computer vision and natural language processing benchmarks.
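Under plain FedAvg aggregation, the attack class admits a one-line illustration: a dictator who can anticipate the sum of the honest updates sends an update that forces the average to equal any target of its choosing, erasing everyone else's contribution. A hypothetical numpy sketch under that strong omniscience assumption (the paper's strategies are more general):

```python
import numpy as np

def fedavg(updates):
    """Server aggregation: plain average of client updates."""
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
honest = [rng.normal(size=4) for _ in range(9)]      # 9 honest clients' updates
target = np.array([1.0, -1.0, 0.5, 0.0])             # what the dictator wants

# With n clients total, (sum(honest) + u_d) / n == target, so:
n = len(honest) + 1
u_d = n * target - np.sum(honest, axis=0)

aggregated = fedavg(honest + [u_d])                  # equals `target` exactly
```

The honest clients' updates cancel out of the aggregate entirely, which is the "erasure" property that defines a dictator client.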

Updated: 2025-10-25 04:02:04

Domains: cs.LG, cs.AI, cs.CL, cs.CR, cs.CV, cs.DC

Download: http://arxiv.org/abs/2510.22149v1

Temporal Relational Reasoning of Large Language Models for Detecting Stock Portfolio Crashes

Stock portfolios are often exposed to rare consequential events (e.g., 2007 global financial crisis, 2020 COVID-19 stock market crash), as they do not have enough historical information to learn from. Large Language Models (LLMs) now present a possible tool to tackle this problem, as they can generalize across their large corpus of training data and perform zero-shot reasoning on new events, allowing them to detect possible portfolio crash events without requiring specific training data. However, detecting portfolio crashes is a complex problem that requires more than reasoning abilities. Investors need to dynamically process the impact of each new piece of information found in news articles, analyze the relational network of impacts across different events and portfolio stocks, as well as understand the temporal context between impacts across time-steps, in order to obtain the aggregated impact on the target portfolio. In this work, we propose an algorithmic framework named Temporal Relational Reasoning (TRR). It seeks to emulate the spectrum of human cognitive capabilities used for complex problem-solving, which include brainstorming, memory, attention and reasoning. Through extensive experiments, we show that TRR is able to outperform state-of-the-art techniques on detecting stock portfolio crashes, and demonstrate how each of the proposed components help to contribute to its performance through an ablation study. Additionally, we further explore the possible applications of TRR by extending it to other related complex problems, such as the detection of possible global crisis events in Macroeconomics.

Updated: 2025-10-25 03:50:00

Domains: q-fin.RM, cs.AI, cs.CL, cs.LG, q-fin.CP

Download: http://arxiv.org/abs/2410.17266v2

SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.

Updated: 2025-10-25 03:34:09

Domains: cs.CV, cs.AI

Download: http://arxiv.org/abs/2510.03160v2

Vision Transformers Don't Need Trained Registers

We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers - the emergence of high-norm tokens that lead to noisy attention maps (Darcet et al., 2024). We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models, yielding cleaner attention-based, text-to-image attribution. Finally, we outline a simple mathematical model that reflects the observed behavior of register neurons and high norm tokens. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.
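The training-free mechanism can be mimicked on toy activations: take the neurons responsible for the high-norm mass, zero that mass in the patch tokens, and park it in one appended, untrained register token. A schematic numpy sketch of the bookkeeping only (real use hooks into a pre-trained ViT's intermediate activations):

```python
import numpy as np

def add_test_time_register(tokens, register_neurons):
    """Move activation of selected neurons from all tokens into one new token."""
    tokens = tokens.copy()
    register = np.zeros(tokens.shape[1])
    register[register_neurons] = tokens[:, register_neurons].sum(axis=0)
    tokens[:, register_neurons] = 0.0            # patch tokens no longer carry it
    return np.vstack([tokens, register])         # append the register token

rng = np.random.default_rng(0)
tokens = rng.normal(0.0, 0.1, size=(6, 8))       # 6 patch tokens, 8 dims
tokens[3, 5] = 40.0                              # a high-norm outlier activation
out = add_test_time_register(tokens, register_neurons=[5])

# Patch-token norms are now moderate; the outlier mass lives in the register.
```

The attention map over the six patch tokens is then no longer dominated by the outlier, which is the cleaning effect the paper demonstrates without any retraining.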

Updated: 2025-10-25 03:27:47

Domains: cs.CV, cs.AI

Download: http://arxiv.org/abs/2506.08010v5

TransFace++: Rethinking the Face Recognition Paradigm with a Focus on Accuracy, Efficiency, and Security

Face Recognition (FR) technology has made significant strides with the emergence of deep learning. Typically, most existing FR models are built upon Convolutional Neural Networks (CNN) and take RGB face images as the model's input. In this work, we take a closer look at existing FR paradigms from the high-efficiency, security, and precision perspectives, and identify the following three problems: (i) CNN frameworks struggle to capture global facial features and to model the correlations between local facial features. (ii) Selecting RGB face images as the model's input greatly degrades the model's inference efficiency and incurs extra computation costs. (iii) In a real-world FR system that operates on RGB face images, the integrity of user privacy may be compromised if hackers successfully penetrate the system and gain access to the model's input. To solve these three issues, we propose two novel FR frameworks, i.e., TransFace and TransFace++, which successfully explore the feasibility of applying ViTs and image bytes to FR tasks, respectively. Experiments on popular face benchmarks demonstrate the superiority of our TransFace and TransFace++. Code is available at https://github.com/DanJun6737/TransFace_pp.

Updated: 2025-10-25 03:27:01

Domains: cs.CV, cs.AI

Download: http://arxiv.org/abs/2308.10133v2

ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models

Explaining time series classification models is crucial, particularly in high-stakes applications such as healthcare and finance, where transparency and trust play a critical role. Although numerous time series classification methods have identified key subsequences, known as shapelets, as core features for achieving state-of-the-art performance and validating their pivotal role in classification outcomes, existing post-hoc time series explanation (PHTSE) methods primarily focus on timestep-level feature attribution. These explanation methods overlook the fundamental prior that classification outcomes are predominantly driven by key shapelets. To bridge this gap, we present ShapeX, an innovative framework that segments time series into meaningful shapelet-driven segments and employs Shapley values to assess their saliency. At the core of ShapeX lies the Shapelet Describe-and-Detect (SDD) framework, which effectively learns a diverse set of shapelets essential for classification. We further demonstrate that ShapeX produces explanations which reveal causal relationships instead of just correlations, owing to the atomicity properties of shapelets. Experimental results on both synthetic and real-world datasets demonstrate that ShapeX outperforms existing methods in identifying the most relevant subsequences, enhancing both the precision and causal fidelity of time series explanations.
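Working at the level of shapelet-driven segments rather than timesteps is what makes exact Shapley attribution tractable: with a handful of segments, every coalition can be enumerated. A self-contained sketch with a hypothetical value function (here, the "model" just sums the kept segments, so each segment's Shapley value provably equals its own contribution); ShapeX's SDD shapelet learning is not reproduced:

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, n_players):
    """Exact Shapley values by enumerating every coalition (fine for small n)."""
    phi = np.zeros(n_players)
    players = range(n_players)
    for i in players:
        others = [p for p in players if p != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Standard Shapley coalition weight |S|! (n-|S|-1)! / n!
                w = factorial(k) * factorial(n_players - k - 1) / factorial(n_players)
                phi[i] += w * (value_fn(set(S) | {i}) - value_fn(set(S)))
    return phi

# A series split into 4 shapelet-driven segments; the value of a coalition
# is the model output with only those segments kept (others masked out).
segments = [np.array([1.0, 1.0]), np.array([0.0]), np.array([2.0, -1.0]), np.array([3.0])]
value_fn = lambda S: sum(segments[j].sum() for j in S)
phi = exact_shapley(value_fn, len(segments))   # each equals its segment's sum
```

Enumerating coalitions costs O(2^n) value-function calls, which is feasible precisely because segmentation keeps n small; timestep-level attribution would make this intractable.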

Updated: 2025-10-25 03:23:39

Domains: cs.LG, cs.AI

Download: http://arxiv.org/abs/2510.20084v2

Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

Updated: 2025-10-25 03:17:52

Domains: cs.CL, cs.AI

Download: http://arxiv.org/abs/2507.22935v3

SEAL: Steerable Reasoning Calibration of Large Language Models for Free

Large Language Models (LLMs), such as OpenAI's o1-series have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, up to a 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at https://github.com/VITA-Group/SEAL.
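The two-stage recipe — extract a steering direction offline, then intervene on the running representation — reduces mechanically to a difference of class means plus an additive shift. A toy numpy sketch of that mechanism (the hidden states here are synthetic; SEAL extracts them from real reasoning traces of categorized thoughts):

```python
import numpy as np

def extract_steering_vector(execution_states, reflection_states):
    """Offline stage: direction from reflective toward executive thoughts."""
    return execution_states.mean(axis=0) - reflection_states.mean(axis=0)

def calibrate(hidden, steering_vector, alpha=1.0):
    """On-the-fly stage: nudge a hidden state along the steering direction."""
    return hidden + alpha * steering_vector

rng = np.random.default_rng(0)
exec_states = rng.normal([3.0, 0.0], 0.1, size=(50, 2))   # execution thoughts
refl_states = rng.normal([0.0, 3.0], 0.1, size=(50, 2))   # reflection thoughts

v = extract_steering_vector(exec_states, refl_states)
h = refl_states[0]                     # a state drifting into excess reflection
h_cal = calibrate(h, v, alpha=1.0)

# The calibrated state is much closer to the execution cluster mean.
```

This only works because, as the paper's analysis shows, the thought categories are clearly separated in the latent space; otherwise the mean-difference direction would carry no signal.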

Updated: 2025-10-25 03:17:22

Domains: cs.CL, cs.AI

Download: http://arxiv.org/abs/2504.07986v3

HandPass: A Wi-Fi CSI Palm Authentication Approach for Access Control

Wi-Fi Channel State Information (CSI) has been extensively studied for sensing activities. However, its practical application in user authentication still needs to be explored. This study presents a novel approach to biometric authentication using Wi-Fi CSI data for palm recognition. A Raspberry Pi encased in a custom-built box, with antenna power reduced to 1 dBm, was used to capture CSI data from the right hands of 20 participants (10 men and 10 women). The dataset was normalized using MinMax scaling to ensure uniformity and accuracy. By focusing on biophysical aspects such as hand size, shape, angular spread between fingers, and finger phalanx lengths, among other characteristics, the study explores how these features affect electromagnetic signals, which are then reflected in the Wi-Fi CSI, allowing for precise user identification. Five classification algorithms were evaluated, with the Random Forest classifier achieving an average F1-score of 99.82% under 10-fold cross-validation. Amplitude and phase data were used, with each capture session recording approximately 1000 packets per second over five 5-second intervals per user. This high accuracy highlights the potential of Wi-Fi CSI in developing robust and reliable user authentication systems based on palm biometric data.
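The evaluation protocol — MinMax normalization followed by a classifier scored with 10-fold cross-validation — can be sketched end to end in numpy. The classifier below is a nearest-centroid stand-in rather than the paper's Random Forest, and the "CSI amplitude" features are synthetic, so this shows the pipeline shape only:

```python
import numpy as np

def minmax_scale(X):
    """Column-wise MinMax normalization to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def centroid_cv_accuracy(X, y, k=10, seed=0):
    """k-fold cross-validation with a nearest-centroid classifier."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        classes = np.unique(y[train])
        cents = np.stack([X[train][y[train] == c].mean(axis=0) for c in classes])
        d = np.linalg.norm(X[f][:, None] - cents[None], axis=2)
        accs.append(np.mean(classes[d.argmin(axis=1)] == y[f]))
    return float(np.mean(accs))

# Synthetic "CSI amplitude" features: one well-separated cluster per user.
rng = np.random.default_rng(1)
users, feats = 4, 6
X = np.vstack([rng.normal(3.0 * u, 0.2, size=(30, feats)) for u in range(users)])
y = np.repeat(np.arange(users), 30)
acc = centroid_cv_accuracy(minmax_scale(X), y, k=10)
```

On features this separable the cross-validated accuracy approaches 1.0, mirroring (in a toy setting) why hand-geometry-driven CSI signatures can support very high F1 scores.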

Updated: 2025-10-25 03:13:45

Subjects: cs.NI,cs.CR,cs.LG,C.2.0; I.5.4; K.6.5

Download: http://arxiv.org/abs/2510.22133v1

Controllable Mathematical Reasoning via Self-Optimizing Thought Vectors

We present a novel approach for controllable mathematical reasoning that leverages self-optimizing thought vectors with entropy minimization. Our method introduces learnable thought vectors that dynamically modulate the internal reasoning process of large language models. Using Gemma-2-9B on GSM8K, we achieve 90.1% accuracy with a controllability score of 0.42, demonstrating that entropy-based rewards effectively guide focused reasoning patterns without requiring external reward annotations. Our analysis reveals distinct thought vector clusters and consistent low-entropy distributions across control conditions, validating our framework for controllable AI reasoning.
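
The entropy-based reward requires no external annotations: a peaked next-token distribution earns a higher reward than a diffuse one, nudging the thought vectors toward focused reasoning. A minimal sketch of such a reward (the exact reward shaping used in the paper may differ):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def entropy_reward(logits):
    """Reward = negative entropy of softmax(logits): peaked (focused)
    distributions earn a higher reward than diffuse ones."""
    z = logits - logits.max()  # stabilized softmax
    p = np.exp(z) / np.exp(z).sum()
    return -entropy(p)

focused = np.array([8.0, 0.0, 0.0, 0.0])  # near one-hot
diffuse = np.array([1.0, 1.0, 1.0, 1.0])  # uniform
print(entropy_reward(focused) > entropy_reward(diffuse))  # True
```

Maximizing this reward is what "entropy minimization" means operationally: the learnable vectors are optimized so that the modulated model commits to a reasoning path instead of hedging across many.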

Updated: 2025-10-25 03:13:14

Subjects: cs.AI

Download: http://arxiv.org/abs/2510.22132v1

Probing Neural Combinatorial Optimization Models

Neural combinatorial optimization (NCO) has achieved remarkable performance, yet its learned model representations and decision rationale remain a black box. This impedes both academic research and practical deployment, since researchers and stakeholders require deeper insights into NCO models. In this paper, we take the first critical step towards interpreting NCO models by investigating their representations through various probing tasks. Moreover, we introduce a novel probing tool named Coefficient Significance Probing (CS-Probing) to enable deeper analysis of NCO representations by examining the coefficients and statistical significance during probing. Extensive experiments and analysis reveal that NCO models encode low-level information essential for solution construction, while capturing high-level knowledge to facilitate better decisions. Using CS-Probing, we find that prevalent NCO models impose varying inductive biases on their learned representations, uncover direct evidence related to model generalization, and identify key embedding dimensions associated with specific knowledge. These insights can be potentially translated into practice, for example, with minor code modifications, we improve the generalization of the analyzed model. Our work represents a first systematic attempt to interpret black-box NCO models, showcasing probing as a promising tool for analyzing their internal mechanisms and revealing insights for the NCO community. The source code is publicly available.
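
The core of coefficient-significance probing can be sketched with an ordinary least-squares probe plus per-coefficient t-statistics: dimensions with large |t| are candidates for encoding the probed knowledge. This is an illustrative reconstruction of the idea, not the released tool:

```python
import numpy as np

def probe_with_significance(H, y):
    """Fit a linear probe y ~ H @ w and return per-dimension
    coefficients and t-statistics (|t| well above ~2 suggests the
    dimension significantly encodes the probed property)."""
    n, d = H.shape
    X = np.hstack([H, np.ones((n, 1))])        # add intercept column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    sigma2 = resid @ resid / (n - d - 1)       # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance
    t = w / np.sqrt(np.diag(cov))
    return w[:d], t[:d]                        # drop the intercept

rng = np.random.default_rng(1)
n, d = 500, 8
H = rng.normal(size=(n, d))                    # toy "model representations"
y = 3.0 * H[:, 2] + 0.1 * rng.normal(size=n)   # only dim 2 encodes the property

w, t = probe_with_significance(H, y)
print(int(np.abs(t).argmax()))  # dimension 2 carries the knowledge
```

Unlike accuracy-only probing, inspecting `w` and `t` localizes *which* embedding dimensions carry the information, which is the kind of evidence the abstract describes for generalization and inductive-bias analysis.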

Updated: 2025-10-25 03:11:10

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.22131v1

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes. Please see the project page at https://zhoues.github.io/RoboRefer.

Updated: 2025-10-25 03:07:10

Subjects: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2506.04308v3

Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end-to-end autonomous driving (E2E-AD) remains an open problem due to its training difficulty, and IL is still the mainstream paradigm in both academia and industry. Recently, Model-based Reinforcement Learning (MBRL) has demonstrated promising results in neural planning; however, these methods typically require privileged information as input rather than raw sensor data. We fill this gap by designing Raw2Drive, a dual-stream MBRL approach. Initially, we efficiently train an auxiliary privileged world model paired with a neural planner that uses privileged information as input. Subsequently, we introduce a raw sensor world model trained via our proposed Guidance Mechanism, which ensures consistency between the raw sensor world model and the privileged world model during rollouts. Finally, the raw sensor world model combines the prior knowledge embedded in the heads of the privileged world model to effectively guide the training of the raw sensor policy. Raw2Drive is so far the only RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state-of-the-art performance.

Updated: 2025-10-25 02:57:21

Subjects: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2505.16394v2

Efficient Utility-Preserving Machine Unlearning with Implicit Gradient Surgery

Machine unlearning (MU) aims to efficiently remove sensitive or harmful memory from a pre-trained model. The key challenge is to balance the potential tradeoff between unlearning efficacy and utility preservation, which involves forgetting undesirable information as defined while maintaining the model's original performance. One potential way to tackle this problem is to use multi-objective optimization to jointly optimize both the unlearning and utility preservation objectives. However, existing multi-objective methods only guarantee finding a Pareto-optimal solution without fine-grained control, which causes under-optimization of the unlearning objective. To this end, we first model MU as a constrained optimization problem, that is, optimizing the unlearning objective under the constraint of a bounded increase for utility loss. We then show that solving this optimization problem is equivalent to unilateral gradient surgery on the unlearning objective. To resolve the additional computational cost brought by gradient surgery, we propose an implicit gradient surgery method, which approximates the solution to the aforementioned constrained optimization problem via only one backpropagation, thereby achieving efficient utility-preserving MU. Theoretically, we provide a tight convergence analysis of the algorithm. Empirically, our extensive experiments show that the proposed algorithm achieves better tradeoff results than existing baselines. Codes are available at https://github.com/anseryuer/EUPMU-Efficient-Utility-Preserving-Machine-Unlearning.
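
The unilateral gradient surgery underlying the method can be illustrated with an explicit projection: when the unlearning gradient conflicts with the utility-preservation gradient, the conflicting component is removed from the unlearning gradient only. The paper's contribution is approximating this implicitly with a single backpropagation; the explicit two-gradient form below is just for intuition:

```python
import numpy as np

def unilateral_surgery(g_forget, g_utility):
    """Unilateral gradient surgery: if the unlearning gradient conflicts
    with the utility-preservation gradient (negative inner product),
    project the conflicting component out of the unlearning side only."""
    dot = g_forget @ g_utility
    if dot < 0.0:
        g_forget = g_forget - (dot / (g_utility @ g_utility)) * g_utility
    return g_forget

g_f = np.array([1.0, -2.0])  # unlearning direction (conflicts with utility)
g_u = np.array([1.0, 1.0])   # utility-preservation direction
g = unilateral_surgery(g_f, g_u)
print(g)        # conflicting component removed
print(g @ g_u)  # 0.0: the update no longer degrades utility to first order
```

After surgery the update is orthogonal to the utility gradient, which is exactly the "bounded increase for utility loss" constraint enforced at each step.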

Updated: 2025-10-25 02:49:26

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.22124v1

GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation

Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets of higher quality than those produced by existing tools, as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs with questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy, compared to 57.6% on a dataset generated by recent work. Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2 11B, and when trained on all question types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found on our project page: https://ke7.github.io/graid/.
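
Because GRAID operates on 2D boxes alone, qualitative questions reduce to simple geometry on detector outputs. A toy sketch of that idea, with hypothetical object names and boxes; the framework's actual question templates are far richer:

```python
def center(box):
    """Center of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def spatial_qa(name_a, box_a, name_b, box_b):
    """Emit GRAID-style qualitative QA pairs from 2D boxes alone."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return [
        (f"Is the {name_a} left of the {name_b}?",
         "yes" if ax < bx else "no"),
        (f"Does the {name_a} appear larger than the {name_b}?",
         "yes" if area(box_a) > area(box_b) else "no"),
    ]

car = (10, 40, 110, 90)      # hypothetical detector outputs
person = (150, 30, 180, 95)
for q, a in spatial_qa("car", car, "person", person):
    print(q, "->", a)
```

Since every answer is computed deterministically from geometry, there is nothing for a generative model to hallucinate, which is the source of the high human-validation rates quoted above.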

Updated: 2025-10-25 02:07:23

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.22118v1

When UAV Swarm Meets IRS: Collaborative Secure Communications in Low-altitude Wireless Networks

Low-altitude wireless networks (LAWNs) represent a promising architecture that integrates unmanned aerial vehicles (UAVs) as aerial nodes to provide enhanced coverage, reliability, and throughput for diverse applications. However, these networks face significant security vulnerabilities from both known and potential unknown eavesdroppers, which may threaten data confidentiality and system integrity. To solve this critical issue, we propose a novel secure communication framework for LAWNs where the selected UAVs within a swarm function as a virtual antenna array (VAA), complemented by intelligent reflecting surface (IRS) to create a robust defense against eavesdropping attacks. Specifically, we formulate a multi-objective optimization problem that simultaneously maximizes the secrecy rate while minimizing the maximum sidelobe level and total energy consumption, requiring joint optimization of UAV excitation current weights, flight trajectories, and IRS phase shifts. This problem presents significant difficulties due to the dynamic nature of the system and heterogeneous components. Thus, we first transform the problem into a heterogeneous Markov decision process (MDP). Then, we propose a heterogeneous multi-agent control approach (HMCA) that integrates a dedicated IRS control policy with a multi-agent soft actor-critic framework for UAV control, which enables coordinated operation across heterogeneous network elements. Simulation results show that the proposed HMCA achieves superior performance compared to baseline approaches in terms of secrecy rate improvement, sidelobe suppression, and energy efficiency. Furthermore, we find that the collaborative and passive beamforming synergy between VAA and IRS creates robust security guarantees when the number of UAVs increases.

Updated: 2025-10-25 02:02:14

Subjects: cs.NI,cs.AI

Download: http://arxiv.org/abs/2510.22117v1

RLVR-World: Training World Models with Reinforcement Learning

World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with task-specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Despite formulating world modeling as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly. Code, datasets, models, and video samples are available at the project website: https://thuml.github.io/RLVR-World.
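
The verifiable-reward idea amounts to scoring decoded predictions with the task metric itself. A toy sketch, assuming an exact-match metric over decoded transition predictions (tokenization and decoding are elided; perceptual metrics would slot in the same way for video world models):

```python
def verifiable_reward(decoded_prediction, ground_truth):
    """RLVR-style reward: a verifiable task metric computed on the
    decoded prediction, here simple exact match."""
    return 1.0 if decoded_prediction == ground_truth else 0.0

# Hypothetical decoded rollouts paired with ground-truth next states.
rollouts = [
    ("door: open", "door: open"),
    ("door: closed", "door: open"),
]
rewards = [verifiable_reward(pred, truth) for pred, truth in rollouts]
print(rewards)  # [1.0, 0.0]
```

The point of the framing is that this reward is computed on what the model actually outputs after decoding, so the RL objective optimizes the deployment metric directly rather than token-level likelihood.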

Updated: 2025-10-25 02:00:20

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.13934v2

Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction

Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning offers transformative potential for large-scale water quality prediction and scientific insights generation. However, their widespread adoption in high-stakes operational decision-making, such as pollution mitigation and equitable resource allocation, is prevented by unresolved trustworthiness challenges, including performance disparity, robustness, uncertainty, interpretability, generalizability, and reproducibility. In this work, we present a multi-dimensional, quantitative evaluation of trustworthiness, benchmarking three state-of-the-art deep learning architectures: recurrent (LSTM), operator-learning (DeepONet), and transformer-based (Informer), trained on 37 years of data from 482 U.S. basins to predict 20 water quality variables. Our investigation reveals systematic performance disparities tied to process complexity, data availability, and basin heterogeneity. Management-critical variables remain the least predictable and most uncertain. Robustness tests reveal pronounced sensitivity to outliers and corrupted targets; notably, the architecture with the strongest baseline performance (LSTM) proves most vulnerable under data corruption. Attribution analyses align for simple variables but diverge for nutrients, underscoring the need for multi-method interpretability. Spatial generalization to ungauged basins remains poor across all models. This work serves as a timely call to action for advancing trustworthy data-driven methods for water resources management and provides a pathway to offering critical insights for researchers, decision-makers, and practitioners seeking to leverage artificial intelligence (AI) responsibly in environmental management.

Updated: 2025-10-25 01:57:51

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.09947v3

Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.

Updated: 2025-10-25 01:51:37

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.22115v1

DB-FGA-Net: Dual Backbone Frequency Gated Attention Network for Multi-Class Brain Tumor Classification with Grad-CAM Interpretability

Brain tumors are a challenging problem in neuro-oncology, where early and precise diagnosis is important for successful treatment. Deep learning-based brain tumor classification methods often rely on heavy data augmentation, which can limit generalization and trust in clinical applications. In this paper, we propose a double-backbone network integrating VGG16 and Xception with a Frequency-Gated Attention (FGA) Block to capture complementary local and global features. Unlike previous studies, our model achieves state-of-the-art performance without augmentation, which demonstrates robustness to variably sized and distributed datasets. For further transparency, Grad-CAM is integrated to visualize the tumor regions on which the model bases its prediction, bridging the gap between model prediction and clinical interpretability. The proposed framework achieves 99.24% accuracy on the 7K-DS dataset for the 4-class setting, along with 98.68% and 99.85% in the 3-class and 2-class settings, respectively. On the independent 3K-DS dataset, the model generalizes with 95.77% accuracy, outperforming baseline and state-of-the-art methods. To further support clinical usability, we developed a graphical user interface (GUI) that provides real-time classification and Grad-CAM-based tumor localization. These findings suggest that augmentation-free, interpretable, and deployable deep learning models such as DB-FGA-Net hold strong potential for reliable clinical translation in brain tumor diagnosis.

Updated: 2025-10-25 01:40:13

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.20299v2

Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows

Most approaches to long-context processing increase the complexity of the transformer's internal architecture by integrating mechanisms such as recurrence or auxiliary memory modules. In this work, we introduce an alternative approach that modifies the input representation itself, rather than the transformer architecture. Inspired by cognitive models of human memory, our method applies a scale-invariant logarithmic compression to the input tokens. The resulting compressed representation is processed by a standard, unmodified transformer, preserving architectural simplicity. We evaluate this approach on the WikiText-103 and PG-19 language modeling benchmarks, showing a reduction in perplexity compared to uncompressed baselines. Moreover, performance improves consistently with longer compressed temporal contexts, showing that input-level logarithmic compression is a simple and effective way to extend a transformer's long-range memory.
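
The input-level compression can be sketched directly: bin the token history with log-spaced boundaries over distance from the present, so recent tokens keep fine resolution while distant tokens are averaged over ever-wider windows. A NumPy sketch over embedding vectors (the binning details here are assumptions, not the paper's exact scheme):

```python
import numpy as np

def log_compress(embeddings, n_slots):
    """Compress a (T, d) sequence into at most n_slots slots whose
    receptive fields grow logarithmically with distance from the
    most recent token (a scale-invariant compression)."""
    T = embeddings.shape[0]
    # log-spaced boundaries over "distance back from the newest token"
    edges = np.unique(np.geomspace(1, T + 1, n_slots + 1).astype(int))
    slots = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # tokens at distance lo .. hi-1 from the present
        segment = embeddings[T - hi + 1:T - lo + 1]
        slots.append(segment.mean(axis=0))
    return np.stack(slots[::-1])  # oldest slot first

T, d = 1024, 16
rng = np.random.default_rng(0)
seq = rng.normal(size=(T, d))
compressed = log_compress(seq, n_slots=32)
print(compressed.shape)  # at most (32, 16): 1024 positions -> <=32 slots
```

The compressed sequence is then fed to a standard transformer unchanged, which is the architectural point: the context extension lives entirely in this preprocessing step.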

Updated: 2025-10-25 01:29:37

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.22109v1

STAR-RIS-assisted Collaborative Beamforming for Low-altitude Wireless Networks

While low-altitude wireless networks (LAWNs) based on uncrewed aerial vehicles (UAVs) offer high mobility, flexibility, and coverage for urban communications, they face severe signal attenuation in dense environments due to obstructions. To address this critical issue, we consider introducing collaborative beamforming (CB) of UAVs and omnidirectional reconfigurable beamforming (ORB) of simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) to enhance the signal quality and directionality. On this basis, we formulate a joint rate and energy optimization problem (JREOP) to maximize the transmission rate of the overall system, while minimizing the energy consumption of the UAV swarm. Due to the non-convex and NP-hard nature of JREOP, we propose a heterogeneous multi-agent collaborative dynamic (HMCD) optimization framework, which has two core components. The first component is a simulated annealing (SA)-based STAR-RIS control method, which dynamically optimizes reflection and transmission coefficients to enhance signal propagation. The second component is an improved multi-agent deep reinforcement learning (MADRL) control method, which incorporates a self-attention evaluation mechanism to capture interactions between UAVs and an adaptive velocity transition mechanism to enhance training stability. Simulation results demonstrate that HMCD outperforms various baselines in terms of convergence speed, average transmission rate, and energy consumption. Further analysis reveals that the average transmission rate of the overall system scales positively with both UAV count and STAR-RIS element numbers.
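
The SA-based control component can be illustrated with a generic annealing loop over discrete phase shifts. The toy objective below is coherent received power with hypothetical per-element channel phases, a simplified stand-in for the paper's secrecy-rate objective over STAR-RIS reflection/transmission coefficients:

```python
import cmath
import math
import random

def simulated_annealing(objective, n_elements, steps=2000, t0=1.0, cooling=0.995):
    """Generic SA over discrete phase shifts in {0, pi/2, pi, 3pi/2}."""
    choices = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
    phases = [0.0] * n_elements
    best = cur = objective(phases)
    best_phases = phases[:]
    t = t0
    for _ in range(steps):
        cand = phases[:]
        cand[random.randrange(n_elements)] = random.choice(choices)
        val = objective(cand)
        # always accept improvements; accept worse moves with Boltzmann probability
        if val > cur or random.random() < math.exp((val - cur) / t):
            phases, cur = cand, val
            if cur > best:
                best, best_phases = cur, phases[:]
        t *= cooling  # geometric cooling schedule
    return best_phases, best

# Toy objective: received power when phase-shifted paths add coherently.
channel = [0.3, 1.9, 4.1, 5.5, 2.7, 0.9]  # hypothetical per-element channel phases

def received_power(phases):
    s = sum(cmath.exp(1j * (c + p)) for c, p in zip(channel, phases))
    return abs(s) ** 2

random.seed(0)
sol, power = simulated_annealing(received_power, len(channel))
print(power >= received_power([0.0] * len(channel)))  # never worse than the start
```

Early on, the high temperature lets the search escape poor configurations; as `t` decays, it settles into a near-aligned phase pattern. The real system optimizes phases jointly with the MADRL-controlled UAV variables rather than in isolation.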

Updated: 2025-10-25 01:28:37

Subjects: cs.NI,cs.AI

Download: http://arxiv.org/abs/2510.22108v1

Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation

Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate diverse images reflecting this diversity, traditional methods often modify random seeds, making it difficult to discern meaningful differences between samples, or diversify the input prompt, which is limited in verbally interpretable diversity. We propose Rainbow, a novel conditional image generation framework, applicable to any pretrained conditional generative model, that addresses inherent condition/prompt uncertainty and generates diverse plausible images. Rainbow is based on a simple yet effective idea: decomposing the input condition into diverse latent representations, each capturing an aspect of the uncertainty and generating a distinct image. First, we integrate a latent graph, parameterized by Generative Flow Networks (GFlowNets), into the prompt representation computation. Second, leveraging GFlowNets' advanced graph sampling capabilities to capture uncertainty and output diverse trajectories over the graph, we produce multiple trajectories that collectively represent the input condition, leading to diverse condition representations and corresponding output images. Evaluations on natural image and medical image datasets demonstrate Rainbow's improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks.

Updated: 2025-10-25 01:25:50

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.22107v1

Holistic Order Prediction in Natural Scenes

Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, modern methods rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic amount of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder.

Updated: 2025-10-25 01:22:08

Fields: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.01704v2

Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning

Imitation learning has proven effective for training robots to perform complex tasks from expert human demonstrations. However, it remains limited by its reliance on high-quality, task-specific data, restricting adaptability to the diverse range of real-world object configurations and scenarios. In contrast, non-expert data -- such as play data, suboptimal demonstrations, partial task completions, or rollouts from suboptimal policies -- can offer broader coverage and lower collection costs. However, conventional imitation learning approaches fail to utilize this data effectively. To address these challenges, we posit that with the right design decisions, offline reinforcement learning can be used as a tool to harness non-expert data to enhance the performance of imitation learning policies. We show that while standard offline RL approaches can be ineffective at actually leveraging non-expert data under the sparse data coverage settings typically encountered in the real world, simple algorithmic modifications can allow for the utilization of this data, without significant additional assumptions. Our approach shows that broadening the support of the policy distribution can allow imitation algorithms augmented by offline RL to solve tasks robustly, showing considerably enhanced recovery and generalization behavior. In manipulation tasks, these innovations significantly increase the range of initial conditions where learned policies are successful when non-expert data is incorporated. Moreover, we show that these methods are able to leverage all collected data, including partial or suboptimal demonstrations, to bolster task-directed policy performance. This underscores the importance of algorithmic techniques for using non-expert data for robust policy learning in robotics. Website: https://uwrobotlearning.github.io/RISE-offline/
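The abstract does not spell out its algorithmic modifications, but one standard mechanism in this family for "broadening the support of the policy distribution" is advantage-weighted behavior cloning (as in AWR/AWAC): every trajectory, expert or not, contributes to the cloning loss, weighted by how good it looks under a value estimate. The sketch below is illustrative only and is not claimed to be the paper's method:

```python
import math

# Hedged sketch: exponentiated-advantage weights let suboptimal data
# (play data, partial completions) retain nonzero weight, so the cloned
# policy's support covers it instead of collapsing onto expert-only data.

def awbc_weights(returns, value_baseline, beta=1.0):
    """Weight each trajectory by exp((return - baseline) / beta)."""
    return [math.exp((r - value_baseline) / beta) for r in returns]

# expert rollout (return 10), partial completion (4), play data (1)
w = awbc_weights([10.0, 4.0, 1.0], value_baseline=4.0, beta=3.0)
```

Expert data still dominates, but the lower-quality rollouts keep positive weight, which is the support-broadening effect the abstract describes.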

Updated: 2025-10-25 01:18:43

Fields: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.19495v2

When Can Model-Free Reinforcement Learning be Enough for Thinking?

Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to such "thinking" as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enable more data-efficient RL compared to non-thinking agents.
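The defining property of the paper's thought MDP is that thought actions neither produce reward nor change the external world state. A minimal sketch of that structure (the paper's formal definition may differ; the action encoding here is an illustrative choice):

```python
# Minimal thought-MDP sketch: state = (environment state, thought state).
# Thought actions update only the abstract thought state and yield zero
# reward; environment actions can change the world and earn reward.

class ThoughtMDP:
    def __init__(self):
        self.env_state = 0
        self.thought_state = None

    def step(self, action):
        if action.startswith("think:"):
            # Thought action: no reward, external state untouched.
            self.thought_state = action.split(":", 1)[1]
            return 0.0
        # Environment action: advances the world; reward at the goal.
        self.env_state += 1
        return 1.0 if self.env_state >= 2 else 0.0

mdp = ThoughtMDP()
r1 = mdp.step("think:plan")   # no reward, env_state unchanged
r2 = mdp.step("act")
r3 = mdp.step("act")          # goal reached
```

This makes the puzzle concrete: a reward-maximizing agent has no immediate incentive to take "think:" actions, so whether thinking emerges depends on how it is initialized and what it buys later, which is what the paper analyzes.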

Updated: 2025-10-25 01:17:48

Fields: cs.AI

Download: http://arxiv.org/abs/2506.17124v2

Mitigating Coordinate Prediction Bias from Positional Encoding Failures

Multimodal large language models (MLLMs) excel at vision-language tasks such as VQA and document understanding, yet precise coordinate prediction remains challenging. High-resolution inputs exacerbate this difficulty by producing long token sequences that weaken positional encodings and introduce directional biases in coordinate outputs. We investigate this phenomenon by analyzing how MLLMs behave when visual positional encodings (VPEs) are deliberately perturbed through shuffling. Our analysis reveals that such perturbations induce predictable, non-random coordinate biases rather than random errors, suggesting that models rely on internal positional priors when spatial grounding signals are degraded. Crucially, we observe similar directional error patterns in natural high-resolution datasets, indicating that positional encoding failures are a key bottleneck for accurate coordinate prediction at scale. To address this issue, we propose Vision-PE Shuffle Guidance (VPSG), a training-free test-time method that leverages the directional nature of these biases for correction. VPSG runs auxiliary decoding with shuffled VPEs to isolate position-unconditioned tendencies, then uses this as negative evidence to guide digit prediction while preserving coordinate format through a lightweight finite-state machine. Experiments on ScreenSpot-Pro demonstrate reliable improvements, highlighting positional encoding robustness as a critical factor for spatial reasoning in MLLMs.
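The guidance step described in the abstract decodes once with intact visual positional encodings and once with shuffled ones, then uses the shuffled run as negative evidence for digit prediction. A toy sketch of that correction over digit-token logits; the subtraction rule and guidance scale are assumptions, not the paper's specification:

```python
# Hedged sketch of negative-evidence guidance: subtract the position-
# unconditioned (shuffled-VPE) tendency from the normal logits before
# picking the next digit token.

def guided_logits(logits, shuffled_logits, scale=1.0):
    """Push the digit distribution away from the shuffled-VPE bias."""
    return [l - scale * s for l, s in zip(logits, shuffled_logits)]

# toy logits over the ten digit tokens 0-9
logits = [0.1, 0.2, 2.0, 0.1, 0.0, 0.0, 1.9, 0.0, 0.0, 0.0]
shuffled = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0]  # bias toward "2"
g = guided_logits(logits, shuffled)
best = max(range(10), key=lambda i: g[i])
```

Here the uncorrected argmax would be digit 2, which the shuffled run reveals as a positional-prior bias; after guidance the prediction moves to digit 6. In VPSG this correction is applied only to digit tokens, with a finite-state machine keeping the overall coordinate format valid.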

Updated: 2025-10-25 00:58:47

Fields: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.22102v1

Assessing the Feasibility of Early Cancer Detection Using Routine Laboratory Data: An Evaluation of Machine Learning Approaches on an Imbalanced Dataset

The development of accessible screening tools for early cancer detection in dogs represents a significant challenge in veterinary medicine. Routine laboratory data offer a promising, low-cost source for such tools, but their utility is hampered by the non-specificity of individual biomarkers and the severe class imbalance inherent in screening populations. This study assesses the feasibility of cancer risk classification using the Golden Retriever Lifetime Study (GRLS) cohort under real-world constraints, including the grouping of diverse cancer types and the inclusion of post-diagnosis samples. A comprehensive benchmark evaluation was conducted, systematically comparing 126 analytical pipelines that comprised various machine learning models, feature selection methods, and data balancing techniques. Data were partitioned at the patient level to prevent leakage. The optimal model, a Logistic Regression classifier with class weighting and recursive feature elimination, demonstrated moderate ranking ability (AUROC = 0.815; 95% CI: 0.793-0.836) but poor clinical classification performance (F1-score = 0.25, Positive Predictive Value = 0.15). While a high Negative Predictive Value (0.98) was achieved, insufficient recall (0.79) precludes its use as a reliable rule-out test. Interpretability analysis with SHapley Additive exPlanations (SHAP) revealed that predictions were driven by non-specific features like age and markers of inflammation and anemia. It is concluded that while a statistically detectable cancer signal exists in routine lab data, it is too weak and confounded for clinically reliable discrimination from normal aging or other inflammatory conditions. This work establishes a critical performance ceiling for this data modality in isolation and underscores that meaningful progress in computational veterinary oncology will require integration of multi-modal data sources.
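The study's patient-level partitioning is the detail that prevents leakage: all samples from one dog must land on the same side of the split. A self-contained sketch of that idea (dataset fields and split ratio here are illustrative, not from GRLS):

```python
import random

# Sketch of patient-level splitting: shuffle patient IDs, not samples,
# so repeated visits from the same animal never straddle train/test.

def patient_level_split(samples, test_frac=0.25, seed=0):
    patients = sorted({s["patient_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [s for s in samples if s["patient_id"] not in test_ids]
    test = [s for s in samples if s["patient_id"] in test_ids]
    return train, test

# 8 hypothetical dogs, 3 lab visits each
samples = [{"patient_id": p, "visit": v} for p in range(8) for v in range(3)]
train, test = patient_level_split(samples)
overlap = {s["patient_id"] for s in train} & {s["patient_id"] for s in test}
```

A naive per-sample split would almost certainly leak, since a dog's later visits are highly correlated with its earlier ones.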

Updated: 2025-10-25 00:55:35

Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.20209v2

Lightweight and Breach-Resilient Authenticated Encryption Framework for Internet of Things

The Internet of Things (IoT) relies heavily on resource-limited devices to communicate critical (e.g., military data) information under low-energy adversarial environments and low-latency wireless channels. Authenticated Encryption (AE) guarantees confidentiality, authenticity, and integrity, making it a vital security service for IoT. However, currently deployed (lightweight) AE standards lack essential features like key compromise resiliency and compact authentication tags, as well as performance enhancements such as offline-online cryptography. To address these gaps, we propose Graphene, the first (to our knowledge) symmetric Forward-secure and Aggregate Authenticated Encryption (FAAE) framework designed for the performance and security demands of low-end IoT infrastructures. Graphene innovates by synergizing key evolution strategies and offline-online cryptographic processing with Universal Message Authentication Codes (UMACs) to guarantee breach-resiliency, near-optimal online latency, and compactness. We demonstrate Graphene's efficiency through two distinct instantiations, each balancing unique performance trade-offs with extensibility for diverse MACs. Our experimental evaluation on commodity hardware and a 32-bit ARM Cortex-M4 microcontroller shows Graphene's significant performance gains over existing alternatives. Graphene is also backward compatible with standard-compliant cryptographic implementations. We release our implementation as open source for public testing and adaptation.
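The key-evolution idea behind forward security (breach resiliency) can be sketched with a one-way hash ratchet: after each epoch the key is pushed through a one-way function and the old key is deleted, so a later compromise cannot forge or verify earlier tags. This is a generic illustration with stdlib primitives, not Graphene's actual UMAC-based construction:

```python
import hashlib
import hmac

# Hedged sketch of forward-secure key evolution: k_{i+1} = H(k_i), with
# k_i erased after the epoch, so k_{i+1} reveals nothing about k_i.

def evolve(key: bytes) -> bytes:
    return hashlib.sha256(b"evolve" + key).digest()

def tag(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

k0 = b"\x00" * 32             # initial shared secret (illustrative)
t0 = tag(k0, b"epoch-0 data")
k1 = evolve(k0)               # both parties delete k0 at this point
t1 = tag(k1, b"epoch-1 data")
# An attacker who later compromises k1 cannot recompute or verify t0:
# that would require inverting the one-way evolution to recover k0.
```

Graphene additionally aggregates tags for compactness and splits work into offline and online phases; those pieces are not modeled here.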

Updated: 2025-10-25 00:51:34

Fields: cs.CR

Download: http://arxiv.org/abs/2510.22100v1

Deep Learning Atmospheric Models Reliably Simulate Out-of-Sample Land Heat and Cold Wave Frequencies

Deep learning (DL)-based general circulation models (GCMs) are emerging as fast simulators, yet their ability to replicate extreme events outside their training range remains unknown. Here, we evaluate two such models -- the hybrid Neural General Circulation Model (NGCM) and the purely data-driven Deep Learning Earth System Model (DLESyM) -- against a conventional high-resolution land-atmosphere model (HiRAM) in simulating land heatwaves and coldwaves. All models are forced with observed sea surface temperatures and sea ice over 1900-2020, focusing on the out-of-sample early-20th-century period (1900-1960). Both DL models generalize successfully to unseen climate conditions, broadly reproducing the frequency and spatial patterns of heatwave and cold wave events during 1900-1960 with skill comparable to HiRAM. An exception is over portions of North Asia and North America, where all models perform poorly during 1940-1960. Due to excessive temperature autocorrelation, DLESyM tends to overestimate heatwave and cold wave frequencies, whereas the physics-DL hybrid NGCM exhibits persistence more similar to HiRAM.

Updated: 2025-10-25 00:30:47

Fields: physics.ao-ph,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.03176v2

Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies

Brain-Computer Interfaces (BCIs) offer a direct communication pathway between the human brain and external devices, holding significant promise for individuals with severe neurological impairments. However, their widespread adoption is hindered by critical limitations, such as low information transfer rates and extensive user-specific calibration. To overcome these challenges, recent research has explored the integration of Large Language Models (LLMs), extending the focus from simple command decoding to understanding complex cognitive states. Despite these advancements, deploying agentic AI faces technical hurdles and ethical concerns. Due to the lack of comprehensive discussion on this emerging direction, this position paper argues that the field is poised for a paradigm extension from BCI to Brain-Agent Collaboration (BAC). We emphasize reframing agents as active and collaborative partners for intelligent assistance rather than passive brain signal data processors, demanding a focus on ethical data handling, model reliability, and a robust human-agent collaboration framework to ensure these systems are safe, trustworthy, and effective.

Updated: 2025-10-25 00:25:45

Fields: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.22095v1

Operationalising Extended Cognition: Formal Metrics for Corporate Knowledge and Legal Accountability

Corporate responsibility turns on notions of corporate mens rea, traditionally imputed from human agents. Yet these assumptions are under challenge as generative AI increasingly mediates enterprise decision-making. Building on the theory of extended cognition, we argue that in response corporate knowledge may be redefined as a dynamic capability, measurable by the efficiency of its information-access procedures and the validated reliability of their outputs. We develop a formal model that captures epistemic states of corporations deploying sophisticated AI or information systems, introducing a continuous organisational knowledge metric $S_S(\varphi)$ which integrates a pipeline's computational cost and its statistically validated error rate. We derive a thresholded knowledge predicate $\mathsf{K}_S$ to impute knowledge and a firm-wide epistemic capacity index $\mathcal{K}_{S,t}$ to measure overall capability. We then operationally map these quantitative metrics onto the legal standards of actual knowledge, constructive knowledge, wilful blindness, and recklessness. Our work provides a pathway towards creating measurable and justiciable audit artefacts, that render the corporate mind tractable and accountable in the algorithmic age.
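The abstract names the ingredients of $S_S(\varphi)$ (a pipeline's computational cost and its validated error rate) and a thresholded predicate $\mathsf{K}_S$, but not their functional forms. The sketch below is one purely illustrative reading of that structure; the specific formula, scale, and threshold are assumptions, not the paper's definitions:

```python
import math

# Hypothetical reading of the abstract's metrics: a knowledge score that
# rises as the information-access pipeline gets cheaper and its validated
# error rate falls, plus a thresholded predicate that imputes knowledge.

def knowledge_score(cost, error_rate, cost_scale=100.0):
    """Illustrative S_S-style score in [0, 1]."""
    reliability = 1.0 - error_rate          # validated output reliability
    efficiency = math.exp(-cost / cost_scale)  # cheap access scores higher
    return reliability * efficiency

def knows(cost, error_rate, threshold=0.8):
    """Illustrative K_S-style predicate: impute knowledge above a cutoff."""
    return knowledge_score(cost, error_rate) >= threshold

fast_reliable = knowledge_score(cost=5.0, error_rate=0.02)
slow_noisy = knowledge_score(cost=200.0, error_rate=0.3)
```

The mapping to legal standards would then turn on where a pipeline sits relative to such a threshold, e.g. a cheap, validated retrieval procedure scoring high enough to support imputed knowledge, while a costly, error-prone one would not.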

Updated: 2025-10-25 00:17:55

Fields: cs.AI

Download: http://arxiv.org/abs/2510.16193v2

Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context

Transformers have demonstrated remarkable in-context learning (ICL) capabilities, adapting to new tasks by simply conditioning on demonstrations without parameter updates. Compelling empirical and theoretical evidence suggests that ICL, as a general-purpose learner, could outperform task-specific models. However, it remains unclear to what extent the transformers optimally learn in-context compared to principled learning algorithms. To investigate this, we employ a meta ICL framework in which each prompt defines a distinctive regression task whose target function is drawn from a hierarchical distribution, requiring inference over both the latent model class and task-specific parameters. Within this setup, we benchmark sample complexity of ICL against principled learning algorithms, including the Bayes optimal estimator, under varying performance requirements. Our findings reveal a striking dichotomy: while ICL initially matches the efficiency of a Bayes optimal estimator, its efficiency significantly deteriorates in long context. Through an information-theoretic analysis, we show that the diminishing efficiency is inherent to ICL. These results clarify the trade-offs in adopting ICL as a universal problem solver, motivating a new generation of on-the-fly adaptive methods without the diminishing efficiency.
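In the meta-ICL framework the abstract describes, each prompt defines a regression task drawn hierarchically: first a latent model class, then task-specific parameters, then in-context demonstrations from the resulting function. A minimal sketch of that sampling process (the two model classes and parameter ranges below are illustrative choices, not the paper's):

```python
import random

# Sketch of hierarchical task sampling for meta in-context learning:
# latent model class -> task parameters -> demonstration pairs.

def sample_task(rng):
    model_class = rng.choice(["linear", "quadratic"])  # latent class
    a = rng.uniform(-1.0, 1.0)                         # task parameter
    if model_class == "linear":
        f = lambda x: a * x
    else:
        f = lambda x: a * x * x
    return model_class, f

def sample_prompt(rng, n_demos=4):
    """A prompt = in-context demonstrations (x, f(x)) from one task."""
    model_class, f = sample_task(rng)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n_demos)]
    return model_class, [(x, f(x)) for x in xs]

rng = random.Random(0)
model_class, demos = sample_prompt(rng)
```

Solving a prompt optimally requires inferring both the latent class and the parameter from the demonstrations, which is exactly the two-level inference against which the paper benchmarks ICL's sample complexity.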

Updated: 2025-10-25 00:08:41

Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.04580v2

By Xinhai (Sean) Zou.