    _              _         ____
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        


Generating Auxiliary Tasks with Reinforcement Learning

Auxiliary Learning (AL) is a form of multi-task learning in which a model trains on auxiliary tasks to boost performance on a primary objective. While AL has improved generalization across domains such as navigation, image classification, and NLP, it often depends on human-labeled auxiliary tasks that are costly to design and require domain expertise. Meta-learning approaches mitigate this by learning to generate auxiliary tasks, but typically rely on gradient-based bi-level optimization, adding substantial computational and implementation overhead. We propose RL-AUX, a reinforcement-learning (RL) framework that dynamically creates auxiliary tasks by assigning an auxiliary label to each training example, rewarding the agent whenever its selections improve performance on the primary task. We also explore learning per-example weights for the auxiliary loss. On CIFAR-100 grouped into 20 superclasses, our RL method outperforms human-labeled auxiliary tasks and matches the performance of a prominent bi-level optimization baseline. We present similarly strong results on other classification datasets. These results suggest RL is a viable path to generating effective auxiliary tasks.
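
A minimal sketch of the label-assignment loop described above, with invented names and a simulated reward (the hidden grouping stands in for "the primary-task loss improved after this assignment"; the paper's actual agent and reward wiring are not reproduced here):

```python
import random

random.seed(0)

# Bandit-style agent that assigns one of K auxiliary labels to each training
# example and is rewarded when the assignment helps the primary task.
K = 4
examples = list(range(20))
hidden_group = {x: x % K for x in examples}   # stand-in for "helpful" labels

Q = {x: [0.0] * K for x in examples}          # per-example label preferences

def reward(x, label):
    # Simulated reward: +1 when the assignment matches the helpful grouping.
    return 1.0 if label == hidden_group[x] else 0.0

eps, lr = 0.1, 0.5
for _ in range(500):                          # training episodes
    for x in examples:
        if random.random() < eps:
            a = random.randrange(K)                      # explore
        else:
            a = max(range(K), key=lambda i: Q[x][i])     # exploit
        Q[x][a] += lr * (reward(x, a) - Q[x][a])

assignment = {x: max(range(K), key=lambda i: Q[x][i]) for x in examples}
accuracy = sum(assignment[x] == hidden_group[x] for x in examples) / len(examples)
print(accuracy)
```

The agent converges on the helpful grouping purely from the scalar reward, which is the core mechanism RL-AUX scales up to real auxiliary-task generation.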

Updated: 2025-11-03 23:55:55

Categories: cs.LG

Download: http://arxiv.org/abs/2510.22940v4

Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects

The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA effects distort locality in multi-head attention (MHA) and present Swizzled Head-first Mapping, a spatially-aware scheduling strategy that aligns attention heads with GPU NUMA domains to exploit intra-chiplet cache reuse. On AMD's MI300X architecture, our method achieves up to 50% higher performance than state-of-the-art attention algorithms that use conventional scheduling techniques, and sustains consistently high L2 cache hit rates of 80-97%. These results demonstrate that NUMA-aware scheduling is now fundamental to achieving full efficiency on next-generation disaggregated GPUs, offering a path forward for scalable AI training and inference.
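
A toy illustration of the scheduling idea (the mapping functions are simplified stand-ins, not AMD's scheduler): MI300X-class parts expose several chiplets (XCDs), each with its own L2 cache, and the default round-robin placement scatters one head's tiles across chiplets, while a swizzled head-first placement pins every tile of a head to one chiplet:

```python
NUM_XCDS = 8            # chiplets, each with a private L2 (MI300X has 8 XCDs)
HEADS, TILES_PER_HEAD = 32, 16

def round_robin(head, tile):
    # Baseline: consecutive workgroups land on consecutive chiplets,
    # scattering one head's K/V tiles across NUMA domains.
    return (head * TILES_PER_HEAD + tile) % NUM_XCDS

def head_first(head, tile):
    # Swizzled: every tile of a head stays inside one NUMA domain,
    # so its K/V data can be reused out of that chiplet's L2.
    return head % NUM_XCDS

def worst_case_spread(mapper):
    # Number of distinct chiplets the most-scattered head touches.
    return max(
        len({mapper(h, t) for t in range(TILES_PER_HEAD)})
        for h in range(HEADS)
    )

print(worst_case_spread(round_robin), worst_case_spread(head_first))
```

Under the baseline each head touches all eight chiplets; under the swizzled mapping it touches exactly one, which is what drives the reported L2 hit rates.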

Updated: 2025-11-03 23:48:39

Categories: cs.AR,cs.DC,cs.LG,cs.PF

Download: http://arxiv.org/abs/2511.02132v1

Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation

Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p<0.001, Cohen's d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p<0.001, Fisher's exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
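
A self-contained sketch of the two ingredients named above, on a 5-word toy lexicon (the paper evaluates 2,315 words): feedback constraints are propagated to prune candidates, and entropy is then computed on the propagated set rather than the raw one:

```python
import math
from collections import Counter

def feedback(guess, answer):
    # Wordle-style pattern: 2 = green, 1 = yellow, 0 = grey.
    pattern = [0] * len(guess)
    remaining = list(answer)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = 2
            remaining.remove(g)
    for i, g in enumerate(guess):
        if pattern[i] == 0 and g in remaining:
            pattern[i] = 1
            remaining.remove(g)
    return tuple(pattern)

def propagate(candidates, guess, observed):
    # Constraint propagation: keep only words consistent with the feedback.
    return [w for w in candidates if feedback(guess, w) == observed]

def entropy(guess, candidates):
    # CSP-aware scoring: expected information of a guess, computed over the
    # propagated candidate set instead of the raw lexicon.
    counts = Counter(feedback(guess, w) for w in candidates)
    n = len(candidates)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

words = ["crane", "slate", "trace", "crate", "grace"]
obs = feedback("slate", "crane")     # pretend the hidden answer is "crane"
after = propagate(words, "slate", obs)
best = max(after, key=lambda g: entropy(g, after))
print(after, best)
```

Only "crane" and "grace" survive propagation, and the next guess is scored on that two-word set, which is the "entropy after constraint propagation" idea in miniature.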

Updated: 2025-11-03 23:47:53

Categories: cs.CL,cs.AI,68T20, 90C27,I.2.8; I.2.3; G.1.6

Download: http://arxiv.org/abs/2510.02855v3

Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning

We propose Re-FORC, an adaptive reward prediction method that, given a context, predicts the expected future reward as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking-length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in the high-compute regime and 7% in the low-compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
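
A hedged sketch of the cost-per-token stopping rule: a reward predictor (here an invented saturating curve standing in for Re-FORC's adapter) is queried for the marginal value of the next block of thinking tokens, and generation stops once that value falls below its cost:

```python
import math

def predicted_reward(tokens, r_max=0.8, scale=500.0):
    # Stand-in for a learned predictor of expected reward given the context
    # and a budget of future thinking tokens (illustrative functional form).
    return r_max * (1.0 - math.exp(-tokens / scale))

def tokens_to_spend(cost_per_token, step=50, budget=4000):
    # Keep thinking while the predicted marginal reward of the next `step`
    # tokens exceeds their cost; otherwise stop early.
    spent = 0
    while spent < budget:
        gain = predicted_reward(spent + step) - predicted_reward(spent)
        if gain < cost_per_token * step:
            break
        spent += step
    return spent

cheap = tokens_to_spend(cost_per_token=1e-5)
pricey = tokens_to_spend(cost_per_token=1e-3)
print(pricey, cheap)
```

Raising the cost-per-token threshold shortens the chain, which is the length-control knob described above; the same predictor also gives the upfront compute estimate.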

Updated: 2025-11-03 23:47:49

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.02130v1

Analysis of AdvFusion: Adapter-based Multilingual Learning for Code Large Language Models

Programming languages can benefit from one another by utilizing a language model for software engineering tasks. Full fine-tuning and Parameter Efficient Fine-Tuning (PEFT) of Code Language Models (Code-LMs) have been explored for multilingual knowledge transfer. AdapterFusion is a PEFT architecture that aims to enhance task performance by leveraging information from multiple programming languages, but primarily focuses on the target programming language. In our previous work, we proposed AdvFusion, a novel PEFT-based approach that effectively learns from other programming languages before adapting to the target task. Though previous experiments showed that AdvFusion outperformed AdapterFusion and LoRA, it was applied to pre-trained Code-LMs and was limited to only two tasks, code summarization and method name prediction. In this study, we expanded our work and investigated AdvFusion on Code Large Language Models (Code-LLMs), considering three new tasks: code generation, code translation, and commit message generation. We observed that different Code-LLMs/tasks exhibit different characteristics. In code generation, AdvFusion outperformed AdapterFusion but not other PEFT methods (LoRA, Compacter, and TaskAdapter). In commit message generation, AdapterFusion performed better than AdvFusion, and contrary to code generation, the other PEFT methods did not perform better. In code translation, AdvFusion performed worse than AdapterFusion overall, with the performance gap marginally widening as the model size increases. However, consistent with code generation, other PEFT methods showed better performance.
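
A simplified sketch of the AdapterFusion combination step that both methods build on (scalar attention over per-language adapter outputs; AdvFusion changes the *training schedule* — learning from the non-target languages first — not this forward pass, and the vectors below are invented):

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(hidden, adapter_outputs, query_weights):
    # Attention score per language adapter: relevance of its output to the
    # current hidden state (a simplified, per-vector query).
    scores = [sum(h * q * a for h, q, a in zip(hidden, query_weights, out))
              for out in adapter_outputs]
    attn = softmax(scores)
    dim = len(hidden)
    # Fused representation: attention-weighted mix of adapter outputs.
    return [sum(w * out[d] for w, out in zip(attn, adapter_outputs))
            for d in range(dim)]

h = [1.0, 0.0]
adapters = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # e.g. Python, Java, Go adapters
fused = fuse(h, adapters, query_weights=[1.0, 1.0])
print(fused)
```

The fusion layer learns, per hidden state, how much to borrow from each language's adapter; AdvFusion's contribution is in when and on what data those adapters and the fusion weights are trained.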

Updated: 2025-11-03 23:45:27

Categories: cs.SE,cs.AI,cs.PL

Download: http://arxiv.org/abs/2511.02869v1

Variance-Aware Feel-Good Thompson Sampling for Contextual Bandits

Variance-dependent regret bounds have received increasing attention in recent studies on contextual bandits. However, most of these studies are focused on upper confidence bound (UCB)-based bandit algorithms, while sampling-based bandit algorithms such as Thompson sampling are still understudied. The only exception is the LinVDTS algorithm (Xu et al., 2023), which is limited to linear reward functions, and its regret bound is not optimal with respect to the model dimension. In this paper, we present FGTSVA, a variance-aware Thompson sampling algorithm for contextual bandits with general reward functions and an optimal regret bound. At the core of our analysis is an extension of the decoupling coefficient, a technique commonly used in the analysis of Feel-Good Thompson sampling (FGTS) that reflects the complexity of the model space. With the new decoupling coefficient denoted by $\mathrm{dc}$, FGTSVA achieves a regret of $\tilde{O}(\sqrt{\mathrm{dc}\cdot\log|\mathcal{F}|\sum_{t=1}^T\sigma_t^2}+\mathrm{dc})$, where $|\mathcal{F}|$ is the size of the model space, $T$ is the total number of rounds, and $\sigma_t^2$ is the subgaussian norm of the noise (e.g., variance when the noise is Gaussian) at round $t$. In the setting of contextual linear bandits, the regret bound of FGTSVA matches that of UCB-based algorithms using weighted linear regression (Zhou and Gu, 2022).
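
For background, the vanilla Thompson sampling template the abstract contrasts with UCB (a standard textbook algorithm on Bernoulli arms; FGTSVA adds a "feel-good" exploration bonus and variance weighting on top of this posterior-sampling scheme, which is not reproduced here):

```python
import random

random.seed(1)

def thompson(true_means, rounds=2000):
    # Beta-Bernoulli Thompson sampling: sample a mean for each arm from its
    # posterior, play the argmax, update the posterior with the reward.
    k = len(true_means)
    wins = [1] * k      # Beta(1, 1) priors
    losses = [1] * k
    pulls = [0] * k
    for _ in range(rounds):
        samples = [random.betavariate(wins[i], losses[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        r = 1 if random.random() < true_means[arm] else 0
        wins[arm] += r
        losses[arm] += 1 - r
        pulls[arm] += 1
    return pulls

pulls = thompson([0.2, 0.5, 0.8])
print(pulls)
```

Posterior sampling concentrates play on the best arm without explicit confidence bounds; the paper's contribution is a regret analysis of this style of algorithm that adapts to the per-round noise levels $\sigma_t^2$.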

Updated: 2025-11-03 23:25:41

Categories: cs.LG,math.OC,stat.ML

Download: http://arxiv.org/abs/2511.02123v1

Feature compression is the root cause of adversarial fragility in neural network classifiers

In this paper, we study the adversarial robustness of deep neural networks (NNs) for classification tasks relative to that of optimal classifiers. We look at the smallest magnitude of possible additive perturbations that can change a classifier's output. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural networks for classification. In particular, our theoretical results show that a neural network's adversarial robustness can degrade as the input dimension $d$ increases. Analytically, we show that neural networks' adversarial robustness can be only $1/\sqrt{d}$ of the best possible adversarial robustness of optimal classifiers. Our theories match remarkably well with numerical experiments on practically trained NNs, including NNs for ImageNet images. The matrix-theoretic explanation is consistent with an earlier information-theoretic feature-compression-based explanation for the adversarial fragility of neural networks.
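
A toy linear example of the $1/\sqrt{d}$ gap (an illustrative construction, not the paper's): place two classes at $\pm\mu$ with $\mu = (1/\sqrt{d}, \ldots, 1/\sqrt{d})$, so the means are distance 2 apart regardless of $d$:

```python
import math

def robustness(d):
    mu = [1.0 / math.sqrt(d)] * d
    # Optimal linear classifier: w = mu. The smallest perturbation that flips
    # a point at mu is its distance to the boundary, |w . mu| / ||w|| = ||mu||
    # = 1, independent of d.
    norm = math.sqrt(sum(m * m for m in mu))
    optimal = sum(m * m for m in mu) / norm
    # A "feature-compressed" classifier that thresholds coordinate 0 only:
    # pushing x_0 to zero flips the sign, and that costs just 1/sqrt(d).
    compressed = mu[0]
    return optimal, compressed

for d in (4, 100, 10000):
    opt, comp = robustness(d)
    print(d, round(opt, 3), round(comp, 3))
```

As $d$ grows, the compressed classifier's robustness shrinks like $1/\sqrt{d}$ while the optimal classifier's stays fixed, mirroring the claimed gap when a network discards signal spread across many coordinates.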

Updated: 2025-11-03 23:24:29

Categories: cs.LG,cs.CR,cs.IT,eess.SP,math.IT

Download: http://arxiv.org/abs/2406.16200v2

Matrix Sensing with Kernel Optimal Loss: Robustness and Optimization Landscape

In this paper we study how the choice of loss function in non-convex optimization problems affects their robustness and optimization landscape, through a study of noisy matrix sensing. In traditional regression tasks, mean squared error (MSE) loss is a common choice, but it can be unreliable for non-Gaussian or heavy-tailed noise. To address this issue, we adopt a robust loss based on nonparametric regression, which uses a kernel-based estimate of the residual density and maximizes the estimated log-likelihood. This robust formulation coincides with the MSE loss under Gaussian errors but remains stable under more general settings. We further examine how this robust loss reshapes the optimization landscape by analyzing the upper bound on the restricted isometry property (RIP) constant under which spurious local minima disappear. Through theoretical and empirical analysis, we show that this new loss excels at handling large noise and remains robust across diverse noise distributions. This work offers initial insights into enhancing the robustness of machine learning tasks through simply changing the loss, guided by an intuitive and broadly applicable analytical framework.
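
A hedged sketch of the loss itself: estimate the residual density with a Gaussian KDE and score parameters by the estimated log-likelihood (bandwidth choice and other details are simplified relative to the paper):

```python
import math, random

random.seed(0)

def kde_loss(residuals, bandwidth=1.0):
    # Negative mean log of a leave-one-out kernel density estimate of the
    # residuals; minimizing it maximizes the estimated likelihood.
    n = len(residuals)
    total = 0.0
    for i, r in enumerate(residuals):
        dens = sum(
            math.exp(-((r - s) / bandwidth) ** 2 / 2)
            for j, s in enumerate(residuals) if j != i
        ) / ((n - 1) * bandwidth * math.sqrt(2 * math.pi))
        total += math.log(dens + 1e-300)
    return -total / n

def mse(residuals):
    return sum(r * r for r in residuals) / len(residuals)

clean = [random.gauss(0, 1) for _ in range(200)]
with_outliers = clean[:190] + [50.0] * 10   # heavy-tailed contamination

print(round(mse(with_outliers) / mse(clean), 1),
      round(kde_loss(with_outliers) / kde_loss(clean), 2))
```

Ten gross outliers inflate the MSE by two orders of magnitude but barely move the KDE-based loss, illustrating why the kernel loss is the robust choice under heavy-tailed noise.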

Updated: 2025-11-03 23:22:37

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.02122v1

LLMs as Layout Designers: Enhanced Spatial Reasoning for Content-Aware Layout Generation

While Large Language Models (LLMs) have demonstrated impressive reasoning and planning abilities in textual domains and can effectively follow instructions for complex tasks, their ability to understand and manipulate spatial relationships remains limited. Such capabilities are crucial for content-aware graphic layout design, where the goal is to arrange heterogeneous elements onto a canvas so that the final design remains visually balanced and structurally feasible. This problem requires precise coordination of placement, alignment, and structural organization of multiple elements within a constrained visual space. To address this limitation, we introduce LaySPA, a reinforcement learning-based framework that augments LLM-based agents with explicit spatial reasoning capabilities for layout design. LaySPA employs hybrid reward signals that jointly capture geometric constraints, structural fidelity, and visual quality, enabling agents to navigate the canvas, model inter-element relationships, and optimize spatial arrangements. Through group-relative policy optimization, the agent generates content-aware layouts that reflect salient regions and respect spatial constraints, and produces an interpretable reasoning trace explaining placement decisions together with a structured layout specification. Experimental results show that LaySPA substantially improves the generation of structurally valid and visually appealing layouts, outperforming larger general-purpose LLMs and achieving performance comparable to state-of-the-art specialized layout models.

Updated: 2025-11-03 23:19:39

Categories: cs.AI

Download: http://arxiv.org/abs/2509.16891v2

InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance

Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.

Updated: 2025-11-03 23:19:27

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2511.02119v1

The SDSC Satellite Reverse Proxy Service for Launching Secure Jupyter Notebooks on High-Performance Computing Systems

Using Jupyter notebooks in an HPC environment exposes a system and its users to several security risks. The Satellite Proxy Service, developed at SDSC, addresses many of these security concerns by providing Jupyter Notebook servers with a token-authenticated HTTPS reverse proxy through which end users can access their notebooks securely with a single URL copied and pasted into their web browser.
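
A hypothetical sketch of the token-in-URL access pattern (hostname, routing table, and function names are invented; the actual Satellite service wiring is more involved): the reverse proxy maps an unguessable token to a user's notebook backend, and the user pastes one HTTPS URL into a browser.

```python
import secrets

registry = {}   # the proxy's token -> backend routing table (illustrative)

def notebook_url(proxy_host, backend):
    # An unguessable, URL-safe token authenticates the session: anyone
    # without the token cannot route through the proxy to the notebook.
    token = secrets.token_urlsafe(32)   # ~43 URL-safe characters
    registry[token] = backend
    return f"https://{proxy_host}/{token}/"

url = notebook_url("proxy.example.org",
                   {"host": "compute-node-17", "port": 8888})
print(url)
```

Because the notebook server itself is only reachable through the proxy, compute nodes need not expose ports directly, which is the main risk the service is designed to remove.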

Updated: 2025-11-03 23:15:18

Categories: cs.CR

Download: http://arxiv.org/abs/2511.02116v1

MicroLad: 2D-to-3D Microstructure Reconstruction and Generation via Latent Diffusion and Score Distillation

A major obstacle to establishing reliable structure-property (SP) linkages in materials engineering is the scarcity of diverse 3D microstructure datasets. Limited dataset availability and insufficient control over the analysis and design space restrict the variety of achievable microstructure morphologies, hindering progress in solving the inverse (property-to-structure) design problem. To address these challenges, we introduce MicroLad, a latent diffusion framework specifically designed for reconstructing 3D microstructures from 2D data. Trained on 2D images and employing multi-plane denoising diffusion sampling in the latent space, the framework reliably generates stable and coherent 3D volumes that remain statistically consistent with the original data. While this reconstruction capability enables dimensionality expansion (2D-to-3D) for generating statistically equivalent 3D samples from 2D data, effective exploration of microstructure design requires methods to guide the generation process toward specific objectives. To achieve this, MicroLad integrates score distillation sampling (SDS), which combines a differentiable score loss with microstructural descriptor-matching and property-alignment terms. This approach updates encoded 2D slices of the 3D volume in the latent space, enabling robust inverse-controlled 2D-to-3D microstructure generation. Consequently, the method facilitates exploration of an expanded 3D microstructure analysis and design space in terms of both microstructural descriptors and material properties.

Updated: 2025-11-03 23:14:51

Categories: cond-mat.mtrl-sci,cs.LG

Download: http://arxiv.org/abs/2508.20138v2

A Compositional Kernel Model for Feature Learning

We study a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs. Formulated as a variational problem, this model provides a simple testbed for feature learning in compositional architectures. From the perspective of variable selection, we show how relevant variables are recovered while noise variables are eliminated. We establish guarantees showing that both global minimizers and stationary points discard noise coordinates when the noise variables are Gaussian distributed. A central finding is that $\ell_1$-type kernels, such as the Laplace kernel, succeed in recovering features contributing to nonlinear effects at stationary points, whereas Gaussian kernels recover only linear ones.
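
A small sketch of the model's ingredients (toy vectors; the variational training problem is not reproduced): the kernel is applied to a coordinate-wise reweighting $w \odot x$, so with a Laplace ($\ell_1$-type) kernel the weight $w_j$ enters through $|w_j|\,|x_j - x'_j|$ and driving $w_j \to 0$ removes coordinate $j$ from the predictor entirely:

```python
import math

def laplace_kernel(x, y, w):
    # l1-type kernel on the reweighted inputs w ⊙ x.
    return math.exp(-sum(abs(wj) * abs(a - b) for wj, a, b in zip(w, x, y)))

def gaussian_kernel(x, y, w):
    # l2-type kernel on the same reweighting.
    return math.exp(-sum((wj * (a - b)) ** 2 for wj, a, b in zip(w, x, y)))

x = (1.0, 0.3)
y = (0.0, 0.9)            # second coordinate plays the role of noise
full = (1.0, 1.0)         # keep both coordinates
selected = (1.0, 0.0)     # a weighting that zeroes the noise coordinate

print(laplace_kernel(x, y, full), laplace_kernel(x, y, selected))
```

With `selected`, changing the noise coordinate arbitrarily leaves the kernel value untouched, which is the variable-selection behavior the paper proves for stationary points; the Laplace-vs-Gaussian contrast concerns which nonlinear features survive that selection.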

Updated: 2025-11-03 23:05:49

Categories: cs.LG,math.OC

Download: http://arxiv.org/abs/2509.14158v2

Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control

This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
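
A toy sketch of action chunking on a scalar system (illustrative dynamics, not the paper's setting): from one observation the policy predicts a sequence of `chunk` actions that anticipate its own future corrections, then executes them open-loop before re-observing:

```python
def expert(x):
    # Stabilizing expert for the dynamics x_{t+1} = x_t + a_t.
    return -0.5 * x

def rollout(x, horizon, chunk):
    states = [x]
    t = 0
    while t < horizon:
        obs = states[-1]                            # observe once per chunk
        # The planned sequence pre-computes the geometric decay the expert
        # would apply step by step, so open-loop execution tracks it exactly
        # in this noise-free toy.
        plan = [expert(obs) * (0.5 ** i) for i in range(chunk)]
        for a in plan[: horizon - t]:
            states.append(states[-1] + a)
        t += chunk
    return states

traj = rollout(x=1.0, horizon=8, chunk=4)
print(traj[-1])
```

In this clean setting the chunked rollout matches the closed-loop expert while querying the policy only twice; the paper's analysis explains when such open-loop stretches keep compounding error in check (control-theoretic stability) and when they do not.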

Updated: 2025-11-03 23:01:49

Categories: cs.LG,cs.SY,eess.SY,stat.ML

Download: http://arxiv.org/abs/2507.09061v4

Collective Communication for 100k+ GPUs

The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed at Meta, engineered to optimize performance across the full LLM lifecycle, from the synchronous demands of large-scale training to the low-latency requirements of inference. The framework is designed to support complex workloads on clusters exceeding 100,000 GPUs, ensuring reliable, high-throughput, and low-latency data exchange. Empirical evaluation on the Llama4 model demonstrates substantial improvements in communication efficiency. This research contributes a robust solution for enabling the next generation of LLMs to operate at unprecedented scales.
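
For background, the textbook ring allreduce (reduce-scatter followed by all-gather) that collective libraries build on — a simulation over Python lists, not NCCLX's implementation, which layers topology awareness, pipelining, and transport tuning on top for 100k+ GPU clusters:

```python
def ring_allreduce(buffers):
    n = len(buffers)                  # one buffer per (simulated) GPU rank
    data = [list(b) for b in buffers]
    size = len(data[0])
    assert size % n == 0
    c = size // n                     # chunk length per rank

    def seg(i):
        i %= n
        return slice(i * c, i * c + c)

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of
    # chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            dst, s = (r + 1) % n, seg(r - step)
            data[dst][s] = [a + b for a, b in zip(data[dst][s], data[r][s])]
    # All-gather: circulate each completed chunk around the ring.
    for step in range(n - 1):
        for r in range(n):
            dst, s = (r + 1) % n, seg(r + 1 - step)
            data[dst][s] = data[r][s][:]
    return data

gpus = [[10 * r + j for j in range(8)] for r in range(4)]
reduced = ring_allreduce(gpus)
print(reduced[0])
```

Each rank sends and receives only 1/n of the buffer per step, which keeps bandwidth use balanced; at the scales the paper targets, the hard problems are which rings to build over which links and how to hide latency, not the arithmetic itself.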

Updated: 2025-11-03 23:01:29

Categories: cs.DC,cs.AI,cs.NI,C.2.4; I.2

Download: http://arxiv.org/abs/2510.20171v3

Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences

We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features -- for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model's Deep Value Generalization Rate (DVGR) -- the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values at a rate below chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.
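
A minimal sketch of scoring the DVGR on the decorrelation test items (the records below are invented for illustration):

```python
def dvgr(test_choices):
    # Each test item pits the option matching the training-time deep value
    # against the option matching the training-time shallow feature, with
    # their correlation deliberately broken; DVGR is the fraction of picks
    # that follow the value.
    value_picks = sum(1 for pick in test_choices if pick == "deep_value")
    return value_picks / len(test_choices)

# A model that follows the shallow feature on 7 of 10 decorrelated items:
picks = ["deep_value"] * 3 + ["shallow_feature"] * 7
print(dvgr(picks))
```

This toy model scores 0.3, matching the paper's reported cross-model average of 0.30 and illustrating what "below chance" (0.5) means for a binary choice.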

Updated: 2025-11-03 22:49:54

Categories: cs.AI,cs.CL,cs.CY

Download: http://arxiv.org/abs/2511.02109v1

Metamorphic Testing of Large Language Models for Natural Language Processing

Using large language models (LLMs) to perform natural language processing (NLP) tasks has become increasingly pervasive in recent times. The versatile nature of LLMs makes them applicable to a wide range of such tasks. While the performance of recent LLMs is generally outstanding, several studies have shown that they can often produce incorrect results. Automatically identifying these faulty behaviors is extremely useful for improving the effectiveness of LLMs. One obstacle to this is the limited availability of labeled datasets, which necessitates an oracle to determine the correctness of LLM behaviors. Metamorphic testing (MT) is a popular testing approach that alleviates this oracle problem. At the core of MT are metamorphic relations (MRs), which define relationships between the outputs of related inputs. MT can expose faulty behaviors without the need for explicit oracles (e.g., labeled datasets). This paper presents the most comprehensive study of MT for LLMs to date. We conducted a literature review and collected 191 MRs for NLP tasks. We implemented a representative subset (36 MRs) to conduct a series of experiments with three popular LLMs, running approximately 560,000 metamorphic tests. The results shed light on the capabilities and opportunities of MT for LLMs, as well as its limitations.
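
A small sketch of a metamorphic test in this style (an invented equivalence MR on an invented toy classifier standing in for an LLM call; not one of the paper's 191 MRs specifically): replacing a word with a synonym should not flip a sentiment label, so the source and follow-up outputs must agree, and no labeled oracle is needed.

```python
def toy_sentiment(text):
    # Trivial stand-in for an LLM-based sentiment classifier.
    positives = {"good", "great", "excellent"}
    return "pos" if any(w in positives for w in text.lower().split()) else "neg"

def mr_synonym_invariance(classifier, text, word, synonym):
    # Equivalence MR: the follow-up input is the source input with `word`
    # replaced by `synonym`; the relation requires identical outputs.
    follow_up = text.replace(word, synonym)
    return classifier(text) == classifier(follow_up)

ok = mr_synonym_invariance(toy_sentiment, "a good movie", "good", "great")
violated = not mr_synonym_invariance(toy_sentiment, "a good movie", "good", "fine")
print(ok, violated)
```

The second check exposes a fault (the toy classifier does not know "fine"), which is exactly how MRs surface faulty behaviors without ground-truth labels.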

Updated: 2025-11-03 22:48:19

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2511.02108v1

Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning

We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy- and communication-efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents improving both mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.

Updated: 2025-11-03 22:46:52

标题: 利用多智能体强化学习对杜罗河羽流进行长期测绘

摘要: 我们研究了利用多个自主水下航行器(AUV)对河流羽流进行长期(多天)测绘的问题,重点关注杜罗河这一代表性用例。我们提出了一种能源和通信高效的多智能体强化学习方法,其中中央协调器间歇性地与各AUV通信,收集测量数据并下达指令。我们的方法将时空高斯过程回归(GPR)与调节每个AUV方向和速度的多头Q网络控制器相结合。使用Delft3D海洋模型进行的模拟表明,我们的方法始终优于单智能体和多智能体基准,且增加智能体数量能够同时改善均方误差(MSE)和作业续航能力。在某些情况下,我们的算法表明,将AUV数量翻倍可以使续航能力提升一倍以上,同时保持或提高精度,突显了多智能体协调的好处。我们学得的策略能够在不同月份和年份的未见季节状态下泛化,展示了数据驱动的动态羽流环境长期监测的未来发展前景。

更新时间: 2025-11-03 22:46:52

领域: cs.MA,cs.LG,cs.SY,eess.SY,stat.ML

下载: http://arxiv.org/abs/2510.03534v3

Detection Augmented Bandit Procedures for Piecewise Stationary MABs: A Modular Approach

Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being non-stationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection have been previously proposed. Our goal is to modularize the design and analysis of such Detection Augmented Bandit (DAB) procedures. To this end, we first provide novel, improved performance lower bounds for PS-MABs. Then, we identify the requirements for stationary bandit algorithms and change detectors in a DAB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of DAB procedures can indeed be modularized, so that the regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular DAB procedures that are order-optimal. Finally, we showcase the practical effectiveness of our modular DAB approach in our experiments, studying its regret performance compared to other methods and investigating its detection capabilities.
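
The modular DAB composition (a stationary bandit paired with a change detector, restarting on detection) can be sketched as follows; the epsilon-greedy learner and window-mean detector are illustrative stand-ins for the stationary bandit algorithms and CUSUM/GLR-style detectors the paper actually analyzes:

```python
import random

class SimpleDetector:
    """Toy change detector: flags a change when the recent-window mean drifts
    far from the long-run mean (a stand-in for CUSUM/GLR-style tests)."""
    def __init__(self, window=30, threshold=0.5):
        self.window, self.threshold = window, threshold
        self.history = []

    def update(self, reward) -> bool:
        self.history.append(reward)
        if len(self.history) < 2 * self.window:
            return False
        recent = self.history[-self.window:]
        old = self.history[:-self.window]
        return abs(sum(recent) / len(recent) - sum(old) / len(old)) > self.threshold

class DABProcedure:
    """Modular Detection Augmented Bandit: a stationary bandit (epsilon-greedy
    here) with per-arm change detectors; everything is reset on detection."""
    def __init__(self, n_arms, rng, eps=0.1):
        self.n_arms, self.rng, self.eps = n_arms, rng, eps
        self.resets = -1
        self.reset()

    def reset(self):
        self.resets += 1
        self.counts = [0] * self.n_arms
        self.means = [0.0] * self.n_arms
        self.detectors = [SimpleDetector() for _ in range(self.n_arms)]

    def select(self):
        if 0 in self.counts or self.rng.random() < self.eps:
            return self.counts.index(min(self.counts))  # explore
        return self.means.index(max(self.means))        # exploit

    def observe(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
        if self.detectors[arm].update(reward):
            self.reset()

# Piecewise-stationary Bernoulli stream: the arm means swap at t = 500.
rng = random.Random(0)
bandit = DABProcedure(n_arms=2, rng=rng)
for t in range(1000):
    arm = bandit.select()
    p = (0.9, 0.1)[arm] if t < 500 else (0.1, 0.9)[arm]
    bandit.observe(arm, 1.0 if rng.random() < p else 0.0)
```

The modularity claim in the abstract is visible here: the detector and the bandit only touch each other through `update` and `reset`, so either component can be swapped independently.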

Updated: 2025-11-03 22:25:46

标题: 用于分段平稳MAB的检测增强赌博机程序:一种模块化方法

摘要: 传统的多臂赌博机(MAB)算法是为平稳环境设计的,其中各臂的奖励分布不随时间改变。然而,在许多应用中,环境更准确地被建模为非平稳的。本研究探讨分段平稳MAB(PS-MAB)环境,其中一部分臂的奖励分布在某些变化点发生改变,而在变化点之间保持平稳。我们关注PS-MAB的渐近分析,此前已有基于变化检测的实用算法被提出。我们的目标是将这类检测增强赌博机(DAB)程序的设计与分析模块化。为此,我们首先为PS-MAB提供了新的、改进的性能下界。然后,我们确定了模块化所需的、对DAB程序中平稳赌博机算法和变化检测器的要求。我们假设奖励是次高斯的。在此假设和关于变化点间隔的条件下,我们证明DAB程序的分析确实可以模块化,从而能够针对变化检测器与赌博机算法的各种组合统一地得到后悔界。通过这一分析,我们开发了新的阶最优(order-optimal)的模块化DAB程序。最后,我们在实验中展示了模块化DAB方法的实际有效性,研究了其相对其他方法的后悔表现,并考察了其检测能力。

更新时间: 2025-11-03 22:25:46

领域: cs.AI,cs.LG,cs.SY,eess.SY,stat.ML

下载: http://arxiv.org/abs/2501.01291v3

Learning Low Rank Neural Representations of Hyperbolic Wave Dynamics from Data

We present a data-driven dimensionality reduction method that is well-suited for physics-based data representing hyperbolic wave propagation. The method utilizes a specialized neural network architecture called low rank neural representation (LRNR) inside a hypernetwork framework. The architecture is motivated by theoretical results that rigorously prove the existence of efficient representations for this wave class. We illustrate through archetypal examples that such an efficient low-dimensional representation of propagating waves can be learned directly from data through a combination of deep learning techniques. We observe that a low rank tensor representation arises naturally in the trained LRNRs, and that this reveals a new decomposition of wave propagation where each decomposed mode corresponds to interpretable physical features. Furthermore, we demonstrate that the LRNR architecture enables efficient inference via a compression scheme, which is a potentially important feature when deploying LRNRs in demanding performance regimes.

Updated: 2025-11-03 22:22:04

标题: 从数据中学习双曲波动动力学的低秩神经表示

摘要: 我们提出了一种数据驱动的降维方法,非常适合表示双曲波传播的物理数据。该方法在超网络框架内使用一种称为低秩神经表示(LRNR)的专门神经网络架构。该架构的动机来自严格证明该类波存在高效表示的理论结果。我们通过典型示例说明,结合多种深度学习技术,可以直接从数据中学得传播波的这种高效低维表示。我们观察到,训练后的LRNR中自然产生低秩张量表示,这揭示了波传播的一种新分解,其中每个分解模式对应可解释的物理特征。此外,我们证明LRNR架构可通过压缩方案实现高效推断,这在对性能要求苛刻的场景中部署LRNR时可能是一个重要特性。

更新时间: 2025-11-03 22:22:04

领域: cs.LG,cs.AI,cs.NA,math.NA,68T07, 65D25, 65M22

下载: http://arxiv.org/abs/2510.25123v2

Revisiting Multivariate Time Series Forecasting with Missing Values

Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available at https://github.com/Muyiiiii/CRIB.

Updated: 2025-11-03 22:20:57

标题: 重新审视带有缺失值的多元时间序列预测

摘要: 缺失值在现实世界时间序列中很常见,带缺失值的多变量时间序列预测(MTSF-M)已成为确保可靠预测的一个关键研究领域。为应对缺失数据的挑战,现有方法发展出一种先插补后预测的框架:先用插补模块填充缺失值,再在插补后的数据上进行预测。然而,这一框架忽视了一个关键问题:缺失值没有真实值(ground truth),使插补过程容易出错,进而降低预测准确性。在本文中,我们通过系统的实证研究揭示,缺乏直接监督的插补可能破坏底层数据分布并主动降低预测准确性。为解决这一问题,我们提出一种范式转变:摒弃插补,直接基于部分观测的时间序列进行预测。我们引入了一致性正则化信息瓶颈(CRIB),这是一个建立在信息瓶颈原理之上的新颖框架。CRIB将统一单变量注意力机制与一致性正则化方案相结合,学习既能滤除缺失值引入的噪声、又能保留关键预测信号的稳健表示。在四个真实数据集上的全面实验证明了CRIB的有效性,即使在高缺失率下也能准确预测。我们的代码可在 https://github.com/Muyiiiii/CRIB 获取。

更新时间: 2025-11-03 22:20:57

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2509.23494v2

Geometric Data Valuation via Leverage Scores

Shapley data valuation provides a principled, axiomatic framework for assigning importance to individual datapoints, and has gained traction in dataset curation, pruning, and pricing. However, it is a combinatorial measure that requires evaluating marginal utility across all subsets of the data, making it computationally infeasible at scale. We propose a geometric alternative based on statistical leverage scores, which quantify each datapoint's structural influence in the representation space by measuring how much it extends the span of the dataset and contributes to the effective dimensionality of the training problem. We show that our scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation and that extending them to \emph{ridge leverage scores} yields strictly positive marginal gains that connect naturally to classical A- and D-optimal design criteria. We further show that training on a leverage-sampled subset produces a model whose parameters and predictive risk are within $O(\varepsilon)$ of the full-data optimum, thereby providing a rigorous link between data valuation and downstream decision quality. Finally, we conduct an active learning experiment in which we empirically demonstrate that ridge-leverage sampling outperforms standard baselines without requiring access to gradients or backward passes.
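
A minimal sketch of the quantity involved, under the standard definition of (ridge) leverage scores; the variable names and toy data are ours, not the paper's:

```python
import numpy as np

def ridge_leverage_scores(X: np.ndarray, lam: float = 0.0) -> np.ndarray:
    """Leverage score of row i: l_i(lam) = x_i^T (X^T X + lam*I)^{-1} x_i.
    lam = 0 gives the classical statistical leverage scores; lam > 0 gives
    the ridge leverage scores discussed in the abstract."""
    d = X.shape[1]
    G = X.T @ X + lam * np.eye(d)
    # Row-wise quadratic forms: diag(X @ G^{-1} @ X.T) without the n x n matrix.
    return np.sum(X * np.linalg.solve(G, X.T).T, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # 50 datapoints in a 3-dimensional representation
scores = ridge_leverage_scores(X)  # classical scores: lam = 0
```

Classical scores lie in [0, 1] and sum to rank(X) (here 3), which is what makes them behave like a budget over "effective dimensions"; adding the ridge term can only shrink each score.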

Updated: 2025-11-03 22:20:50

标题: 基于杠杆分数的几何数据估值

摘要: Shapley数据估值为给单个数据点分配重要性提供了一个有原则的公理化框架,并已在数据集整理、修剪和定价中受到关注。然而,它是一种组合度量,需要在数据的所有子集上评估边际效用,因此在大规模场景下计算不可行。我们提出一种基于统计杠杆分数的几何替代方案,通过衡量每个数据点对数据集张成空间的扩展程度及其对训练问题有效维度的贡献,来量化其在表示空间中的结构影响力。我们证明这些分数满足Shapley估值的虚拟性、效率性和对称性公理,并且将其扩展为岭杠杆分数(ridge leverage scores)可得到严格为正的边际收益,自然地联系到经典的A-最优与D-最优设计准则。我们进一步证明,在杠杆抽样子集上训练得到的模型,其参数和预测风险与全数据最优解相差在$O(\varepsilon)$以内,从而在数据估值与下游决策质量之间建立了严格联系。最后,我们开展了一项主动学习实验,实证表明岭杠杆抽样优于标准基线,且无需访问梯度或反向传播。

更新时间: 2025-11-03 22:20:50

领域: cs.LG,cs.AI,cs.NA,math.NA,math.OC

下载: http://arxiv.org/abs/2511.02100v1

Automated Reward Design for Gran Turismo

When designing reinforcement learning (RL) agents, a designer communicates the desired agent behavior through the definition of reward functions - numerical feedback given to the agent as reward or punishment for its actions. However, mapping desired behaviors to reward functions can be a difficult process, especially in complex environments such as autonomous racing. In this paper, we demonstrate how current foundation models can effectively search over a space of reward functions to produce desirable RL agents for the Gran Turismo 7 racing game, given only text-based instructions. Through a combination of LLM-based reward generation, VLM preference-based evaluation, and human feedback we demonstrate how our system can be used to produce racing agents competitive with GT Sophy, a champion-level RL racing agent, as well as generate novel behaviors, paving the way for practical automated reward design in real world applications.

Updated: 2025-11-03 22:07:53

标题: 《Gran Turismo的自动奖励设计》

摘要: 在设计强化学习(RL)代理时,设计者通过定义奖励函数来传达期望的代理行为,即以数值反馈的形式对代理的行为给予奖励或惩罚。然而,将期望行为映射为奖励函数可能是一个困难的过程,尤其是在自主赛车等复杂环境中。在本文中,我们展示了当前的基础模型如何仅凭基于文本的指令,有效地在奖励函数空间中搜索,为Gran Turismo 7赛车游戏生成理想的RL代理。通过结合基于LLM的奖励生成、基于VLM偏好的评估以及人类反馈,我们展示了我们的系统可用于产生与冠军级RL赛车代理GT Sophy相竞争的赛车代理,并生成新颖的行为,为现实应用中实用的自动奖励设计铺平道路。

更新时间: 2025-11-03 22:07:53

领域: cs.AI

下载: http://arxiv.org/abs/2511.02094v1

Closing the Intent-to-Behavior Gap via Fulfillment Priority Logic

Practitioners designing reinforcement learning policies face a fundamental challenge: translating intended behavioral objectives into representative reward functions. This challenge stems from behavioral intent requiring simultaneous achievement of multiple competing objectives, typically addressed through labor-intensive linear reward composition that yields brittle results. Consider the ubiquitous robotics scenario where performance maximization directly conflicts with energy conservation. Such competitive dynamics are resistant to simple linear reward combinations. In this paper, we present the concept of objective fulfillment upon which we build Fulfillment Priority Logic (FPL). FPL allows practitioners to define logical formula representing their intentions and priorities within multi-objective reinforcement learning. Our novel Balanced Policy Gradient algorithm leverages FPL specifications to achieve up to 500\% better sample efficiency compared to Soft Actor Critic. Notably, this work constitutes the first implementation of non-linear utility scalarization design, specifically for continuous control problems.

Updated: 2025-11-03 22:06:54

标题: 通过履行优先逻辑来弥合意图与行为之间的差距

摘要: 设计强化学习策略的从业者面临一个基本挑战:将预期的行为目标转化为具有代表性的奖励函数。这一挑战源于行为意图需要同时实现多个相互竞争的目标,而这通常通过费力的线性奖励组合来解决,结果往往很脆弱。考虑无处不在的机器人场景:性能最大化与节能直接冲突。这种竞争动态难以用简单的线性奖励组合处理。本文提出了目标履行(objective fulfillment)的概念,并在此基础上构建了履行优先逻辑(FPL)。FPL允许从业者在多目标强化学习中定义表达其意图和优先级的逻辑公式。我们新颖的平衡策略梯度算法利用FPL规范,实现了比Soft Actor Critic高达500\%的样本效率提升。值得注意的是,这项工作是针对连续控制问题的非线性效用标量化设计的首次实现。

更新时间: 2025-11-03 22:06:54

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2503.05818v3

Uncertainty Guided Online Ensemble for Non-stationary Data Streams in Fusion Science

Machine Learning (ML) is poised to play a pivotal role in the development and operation of next-generation fusion devices. Fusion data shows non-stationary behavior with distribution drifts, resulted by both experimental evolution and machine wear-and-tear. ML models assume stationary distribution and fail to maintain performance when encountered with such non-stationary data streams. Online learning techniques have been leveraged in other domains, however it has been largely unexplored for fusion applications. In this paper, we present an application of online learning to continuously adapt to drifting data stream for prediction of Toroidal Field (TF) coils deflection at the DIII-D fusion facility. The results demonstrate that online learning is critical to maintain ML model performance and reduces error by 80% compared to a static model. Moreover, traditional online learning can suffer from short-term performance degradation as ground truth is not available before making the predictions. As such, we propose an uncertainty guided online ensemble method to further improve the performance. The Deep Gaussian Process Approximation (DGPA) technique is leveraged for calibrated uncertainty estimation and the uncertainty values are then used to guide a meta-algorithm that produces predictions based on an ensemble of learners trained on different horizon of historical data. The DGPA also provides uncertainty estimation along with the predictions for decision makers. The online ensemble and the proposed uncertainty guided online ensemble reduces predictions error by about 6%, and 10% respectively over standard single model based online learning.

Updated: 2025-11-03 22:03:37

标题: 聚变科学中非平稳数据流的不确定性引导在线集成

摘要: 机器学习(ML)有望在下一代聚变装置的研发和运行中发挥关键作用。聚变数据表现出带有分布漂移的非平稳行为,这种漂移由实验演变和设备磨损共同导致。ML模型假设分布平稳,在遇到此类非平稳数据流时无法维持性能。在线学习技术已在其他领域得到应用,但在聚变应用中在很大程度上尚未被探索。本文介绍了在线学习在持续适应漂移数据流方面的一个应用,用于预测DIII-D聚变设施中环向场(TF)线圈的偏转。结果表明,在线学习对于维持ML模型性能至关重要,与静态模型相比将误差降低了80%。此外,由于在做出预测之前无法获得真实值,传统在线学习可能出现短期性能下降。为此,我们提出一种不确定性引导的在线集成方法以进一步提升性能。我们利用深度高斯过程近似(DGPA)技术进行校准的不确定性估计,然后用不确定性值来指导一个元算法,该算法基于在不同历史数据范围上训练的学习器集成来产生预测。DGPA还在给出预测的同时为决策者提供不确定性估计。与基于单一模型的标准在线学习相比,在线集成和所提出的不确定性引导在线集成分别将预测误差降低了约6%和10%。

更新时间: 2025-11-03 22:03:37

领域: cs.LG,cs.AI,physics.plasm-ph

下载: http://arxiv.org/abs/2511.02092v1

Natural Building Blocks for Structured World Models: Theory, Evidence, and Scaling

The field of world modeling is fragmented, with researchers developing bespoke architectures that rarely build upon each other. We propose a framework that specifies the natural building blocks for structured world models based on the fundamental stochastic processes that any world model must capture: discrete processes (logic, symbols) and continuous processes (physics, dynamics); the world model is then defined by the hierarchical composition of these building blocks. We examine Hidden Markov Models (HMMs) and switching linear dynamical systems (sLDS) as natural building blocks for discrete and continuous modeling--which become partially-observable Markov decision processes (POMDPs) and controlled sLDS when augmented with actions. This modular approach supports both passive modeling (generation, forecasting) and active control (planning, decision-making) within the same architecture. We avoid the combinatorial explosion of traditional structure learning by largely fixing the causal architecture and searching over only four depth parameters. We review practical expressiveness through multimodal generative modeling (passive) and planning from pixels (active), with performance competitive to neural approaches while maintaining interpretability. The core outstanding challenge is scalable joint structure-parameter learning; current methods finesse this by cleverly growing structure and parameters incrementally, but are limited in their scalability. If solved, these natural building blocks could provide foundational infrastructure for world modeling, analogous to how standardized layers enabled progress in deep learning.
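
As a concrete instance of the discrete building block, the HMM forward recursion computes the likelihood of an observation sequence; this toy sketch uses plain lists and illustrative parameters of our own choosing:

```python
def hmm_forward(pi, A, B, obs):
    """Forward algorithm for an HMM, the discrete building block: returns
    the likelihood P(obs) under initial distribution pi, transition matrix
    A (rows: from-state), and emission matrix B (rows: state)."""
    alpha = [pi[s] * B[s][obs[0]] for s in range(len(pi))]  # initialization
    for o in obs[1:]:
        # Propagate through transitions, then weight by the emission probability.
        alpha = [B[s][o] * sum(alpha[r] * A[r][s] for r in range(len(pi)))
                 for s in range(len(pi))]
    return sum(alpha)

# Toy parameters: two hidden states, two observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
lik = hmm_forward(pi, A, B, [0, 1, 0])
```

Augmenting this with actions (transition matrices indexed by action, plus rewards) yields the POMDP case mentioned above; the continuous sLDS block replaces the discrete emission with a linear-Gaussian state-space model.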

Updated: 2025-11-03 22:02:04

标题: 结构化世界模型的自然构建模块:理论、证据与扩展

摘要: 世界建模领域呈碎片化状态,研究人员各自开发定制架构,很少在彼此的工作上继续构建。我们提出一个框架,基于任何世界模型都必须捕捉的基本随机过程来规定结构化世界模型的自然构建模块:离散过程(逻辑、符号)和连续过程(物理、动力学);世界模型则由这些构建模块的层次化组合来定义。我们考察了隐马尔可夫模型(HMM)和切换线性动力系统(sLDS)作为离散与连续建模的自然构建模块;加入动作后,它们分别成为部分可观测马尔可夫决策过程(POMDP)和受控sLDS。这种模块化方法在同一架构内同时支持被动建模(生成、预测)和主动控制(规划、决策)。我们通过在很大程度上固定因果架构、仅在四个深度参数上搜索,避免了传统结构学习的组合爆炸。我们通过多模态生成建模(被动)和基于像素的规划(主动)来检验其实际表达能力,性能可与神经方法竞争,同时保持可解释性。核心的未决挑战是可扩展的结构-参数联合学习;现有方法通过巧妙地逐步增长结构和参数来绕过这一问题,但可扩展性有限。一旦该问题得到解决,这些自然构建模块有望为世界建模提供基础性的基础设施,就像标准化的层推动了深度学习的进展一样。

更新时间: 2025-11-03 22:02:04

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.02091v1

A Survey on Large Language Model-Based Game Agents

Game environments provide rich, controllable settings that simulate many aspects of real-world complexity. As such, game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Recently, the emergence of Large Language Models (LLMs) provides new opportunities to endow these agents with generalizable reasoning, memory, and adaptability in complex game environments. This survey offers an up-to-date review of LLM-based game agents (LLMGAs) through a unified reference architecture. At the single-agent level, we synthesize existing studies around three core components: memory, reasoning, and perception-action interfaces, which jointly characterize how language enables agents to perceive, think, and act. At the multi-agent level, we outline how communication protocols and organizational models support coordination, role differentiation, and large-scale social behaviors. To contextualize these designs, we introduce a challenge-centered taxonomy linking six major game genres to their dominant agent requirements, from low-latency control in action games to open-ended goal formation in sandbox worlds. A curated list of related papers is available at https://github.com/git-disl/awesome-LLM-game-agent-papers

Updated: 2025-11-03 22:01:57

标题: 基于大型语言模型的游戏智能体综述

摘要: 游戏环境提供了丰富且可控的设定,模拟了现实世界复杂性的许多方面。因此,游戏智能体为探索与通用人工智能相关的能力提供了有价值的试验台。最近,大型语言模型(LLMs)的出现为在复杂游戏环境中赋予这些智能体可泛化的推理、记忆和适应能力提供了新机会。本综述通过一个统一的参考架构,对基于LLM的游戏智能体(LLMGAs)进行了最新回顾。在单智能体层面,我们围绕三个核心组件综合现有研究:记忆、推理和感知-行动接口,它们共同刻画了语言如何使智能体能够感知、思考和行动。在多智能体层面,我们概述了通信协议和组织模型如何支持协调、角色分化和大规模社会行为。为了给这些设计提供背景,我们引入了一个以挑战为中心的分类法,将六大游戏类型与其主导的智能体需求联系起来,从动作游戏中的低延迟控制到沙盒世界中的开放式目标形成。相关论文的精选列表可在 https://github.com/git-disl/awesome-LLM-game-agent-papers 获取。

更新时间: 2025-11-03 22:01:57

领域: cs.AI

下载: http://arxiv.org/abs/2404.02039v4

LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS

Contrast-Consistent Search (CCS) is an unsupervised probing method able to test whether large language models represent binary features, such as sentence truth, in their internal activations. While CCS has shown promise, its two-term objective has been only partially understood. In this work, we revisit CCS with the aim of clarifying its mechanisms and extending its applicability. We argue that what should be optimized for, is relative contrast consistency. Building on this insight, we reformulate CCS as an eigenproblem, yielding closed-form solutions with interpretable eigenvalues and natural extensions to multiple variables. We evaluate these approaches across a range of datasets, finding that they recover similar performance to CCS, while avoiding problems around sensitivity to random initialization. Our results suggest that relativizing contrast consistency not only improves our understanding of CCS but also opens pathways for broader probing and mechanistic interpretability methods.

Updated: 2025-11-03 22:00:37

标题: 使用对比特征值问题探测LLM:改进CCS的理解和适用性

摘要: 对比一致搜索(CCS)是一种无监督探测方法,能够测试大型语言模型是否在其内部激活中表示二元特征,如句子的真假。尽管CCS已显示出潜力,但其双项目标函数此前只被部分理解。在这项工作中,我们重新审视CCS,旨在澄清其机制并扩展其适用范围。我们认为,应当优化的是相对对比一致性。基于这一洞见,我们将CCS重新表述为一个特征值问题,得到具有可解释特征值的闭式解,并自然扩展到多变量情形。我们在一系列数据集上评估这些方法,发现它们能达到与CCS相近的性能,同时避免了对随机初始化敏感的问题。我们的结果表明,将对比一致性相对化不仅加深了我们对CCS的理解,也为更广泛的探测和机制可解释性方法开辟了道路。

更新时间: 2025-11-03 22:00:37

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2511.02089v1

Energy Loss Functions for Physical Systems

Effectively leveraging prior knowledge of a system's physics is crucial for applications of machine learning to scientific domains. Previous approaches mostly focused on incorporating physical insights at the architectural level. In this paper, we propose a framework to leverage physical information directly into the loss function for prediction and generative modeling tasks on systems like molecules and spins. We derive energy loss functions assuming that each data sample is in thermal equilibrium with respect to an approximate energy landscape. By using the reverse KL divergence with a Boltzmann distribution around the data, we obtain the loss as an energy difference between the data and the model predictions. This perspective also recasts traditional objectives like MSE as energy-based, but with a physically meaningless energy. In contrast, our formulation yields physically grounded loss functions with gradients that better align with valid configurations, while being architecture-agnostic and computationally efficient. The energy loss functions also inherently respect physical symmetries. We demonstrate our approach on molecular generation and spin ground-state prediction and report significant improvements over baselines.

Updated: 2025-11-03 21:58:36

标题: 物理系统的能量损失函数

摘要: 有效利用系统的物理先验知识对机器学习在科学领域的应用至关重要。以往的方法大多侧重于在架构层面融入物理洞见。在本文中,我们提出一个框架,将物理信息直接引入损失函数,用于分子和自旋等系统上的预测与生成建模任务。我们在假设每个数据样本相对于一个近似能量景观处于热平衡的前提下,推导出能量损失函数。通过使用与围绕数据的玻尔兹曼分布之间的反向KL散度,我们将损失表示为数据与模型预测之间的能量差。这一视角也把MSE等传统目标重新表述为基于能量的形式,只是其能量没有物理意义。相比之下,我们的公式化产生了有物理根基的损失函数,其梯度能更好地与有效构型对齐,同时与架构无关且计算高效。能量损失函数还天然遵守物理对称性。我们在分子生成和自旋基态预测上演示了我们的方法,并报告了相对基线的显著改进。

更新时间: 2025-11-03 21:58:36

领域: cs.LG,cs.AI,physics.comp-ph

下载: http://arxiv.org/abs/2511.02087v1

AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives

Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model's awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model's vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.

Updated: 2025-11-03 21:48:02

标题: 超越句子边界的AWARE:用于识别STEM叙述中文化资本的上下文转换器框架

摘要: 在学生反思中识别文化资本(CC)主题可以提供有价值的见解,有助于在课堂中营造公平的学习环境。然而,诸如抱负目标或家庭支持之类的主题往往织入叙事之中,而不是以直接关键词的形式出现。这使得孤立处理句子的标准NLP模型难以检测它们。核心挑战源于缺乏意识:标准模型在通用语料库上预训练,因而对这些数据固有的领域特定语言和叙事上下文视而不见。为了解决这个问题,我们引入了AWARE,一个系统性地提升Transformer模型对这一微妙任务的意识的框架。AWARE有三个核心组件:1)领域意识,使模型的词汇表适应学生反思的语言风格;2)上下文意识,生成感知全文上下文的句子嵌入;3)类别重叠意识,采用多标签策略识别单个句子中多个主题的共存。我们的结果表明,通过让模型显式地意识到输入的这些属性,AWARE在Macro-F1上比强基线高出2.1个百分点,并在所有主题上都有可观改进。这项工作为任何意义依赖于叙事上下文的文本分类任务提供了稳健且可推广的方法论。

更新时间: 2025-11-03 21:48:02

领域: cs.CL,cs.AI,cs.CY,cs.LG

下载: http://arxiv.org/abs/2510.04983v3

AutoPDL: Automatic Prompt Optimization for LLM Agents

The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and specific to a given LLM and task. Therefore, this paper proposes AutoPDL, an automated approach to discovering good LLM agent configurations. Our approach frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and seven LLMs (ranging from 3B to 70B parameters) show consistent accuracy gains ($9.21\pm15.46$ percentage points), up to 67.5pp, and reveal that selected prompting strategies vary across models and tasks.
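
Successive halving itself is simple to sketch; the prompt-pattern names and noisy toy evaluator below are illustrative assumptions, not AutoPDL's actual API:

```python
import random

def successive_halving(configs, evaluate, budget_schedule, rng):
    """Generic successive halving: score all surviving configs at each budget,
    keep the best half, repeat. `evaluate` is a hypothetical stand-in for
    scoring a prompt configuration on a validation subset of the given size."""
    survivors = list(configs)
    for budget in budget_schedule:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget, rng), reverse=True)
        survivors = scored[:max(1, len(scored) // 2)]
    return survivors[0]

# Toy evaluation: each config has a hidden true accuracy, observed with noise
# that shrinks as the evaluation budget grows.
true_acc = {"zero-shot": 0.55, "cot": 0.72, "react": 0.68, "rewoo": 0.61}

def evaluate(config, budget, rng):
    return true_acc[config] + rng.gauss(0, 0.2 / budget ** 0.5)

rng = random.Random(42)
best = successive_halving(list(true_acc), evaluate, [8, 32, 128], rng)
```

The point of the schedule is that cheap, noisy evaluations prune most configurations early, so the expensive large-budget evaluations are spent only on the few survivors.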

Updated: 2025-11-03 21:46:50

标题: AutoPDL:LLM代理的自动提示优化

摘要: 大型语言模型(LLMs)的性能取决于提示方式,其选择既包括高层提示模式(例如Zero-Shot、CoT、ReAct、ReWOO),也包括具体的提示内容(指令和少样本示范)。手动调优这种组合既繁琐又容易出错,且只适用于特定的LLM和任务。因此,本文提出AutoPDL,一种自动发现良好LLM代理配置的方法。我们将其表述为一个在代理与非代理提示模式及示范的组合空间上的结构化AutoML问题,并使用连续减半法高效地搜索这一空间。我们引入了一个使用PDL提示编程语言实现常见提示模式的库。AutoPDL的解是使用该库的、人类可读、可编辑且可执行的PDL程序。这种方法还支持源到源优化,允许人在回路中的细化和重用。在三个任务和七个LLM(参数规模从3B到70B)上的评估显示出一致的准确率提升($9.21\pm15.46$个百分点,最高达67.5个百分点),并表明所选提示策略因模型和任务而异。

更新时间: 2025-11-03 21:46:50

领域: cs.LG,cs.AI,cs.PL

下载: http://arxiv.org/abs/2504.04365v5

MediQ-GAN: Quantum-Inspired GAN for High Resolution Medical Image Generation

Machine learning-assisted diagnosis shows promise, yet medical imaging datasets are often scarce, imbalanced, and constrained by privacy, making data augmentation essential. Classical generative models typically demand extensive computational and sample resources. Quantum computing offers a promising alternative, but existing quantum-based image generation methods remain limited in scale and often face barren plateaus. We present MediQ-GAN, a quantum-inspired GAN with prototype-guided skip connections and a dual-stream generator that fuses classical and quantum-inspired branches. Its variational quantum circuits inherently preserve full-rank mappings, avoid rank collapse, and are theory-guided to balance expressivity with trainability. Beyond generation quality, we provide the first latent-geometry and rank-based analysis of quantum-inspired GANs, offering theoretical insight into their performance. Across three medical imaging datasets, MediQ-GAN outperforms state-of-the-art GANs and diffusion models. While validated on IBM hardware for robustness, our contribution is hardware-agnostic, offering a scalable and data-efficient framework for medical image generation and augmentation.

Updated: 2025-11-03 21:45:49

标题: MediQ-GAN:用于高分辨率医学图像生成的量子启发式GAN

摘要: 机器学习辅助诊断显示出前景,但医学影像数据集通常稀缺、不平衡且受隐私约束,因此数据增强至关重要。经典生成模型通常需要大量计算和样本资源。量子计算提供了一个有前途的替代方案,但现有基于量子的图像生成方法在规模上仍然受限,并常常面临贫瘠高原(barren plateau)问题。我们提出MediQ-GAN,一个量子启发的GAN,具有原型引导的跳跃连接和融合经典与量子启发分支的双流生成器。其变分量子电路天然保持满秩映射、避免秩坍缩,并在理论指导下平衡表达能力与可训练性。除生成质量外,我们首次对量子启发GAN进行了潜在几何与基于秩的分析,为其性能提供了理论洞察。在三个医学影像数据集上,MediQ-GAN优于最先进的GAN和扩散模型。虽然我们在IBM硬件上验证了稳健性,但我们的贡献与硬件无关,为医学图像生成与增强提供了一个可扩展且数据高效的框架。

更新时间: 2025-11-03 21:45:49

领域: cs.CV,cs.LG,quant-ph

下载: http://arxiv.org/abs/2506.21015v2

Training Language Models to Reason Efficiently

Scaling model size and training data has led to great advances in the performance of Large Language Models (LLMs). However, the diminishing returns of this approach necessitate alternative methods to improve model capabilities, particularly in tasks requiring advanced reasoning. Large reasoning models, which leverage long chain-of-thoughts, bring unprecedented breakthroughs in problem-solving capabilities but at a substantial deployment cost associated with longer generations. Reducing inference costs is crucial for the economic feasibility, user experience, and environmental sustainability of these models. In this work, we propose to train large reasoning models to reason efficiently. More precisely, we use reinforcement learning (RL) to train reasoning models to dynamically allocate inference-time compute based on task complexity. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy, thereby achieving substantial efficiency gains. It enables the derivation of a family of reasoning models with varying efficiency levels, controlled via a single hyperparameter. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
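
One common way to express such an accuracy/length trade-off with a single hyperparameter is a length-penalized reward; this is a hedged sketch of the general idea, not the paper's exact reward shaping:

```python
def efficiency_reward(correct: bool, n_tokens: int, alpha: float) -> float:
    """Illustrative RL reward: credit for a correct answer minus a length
    penalty. A single hyperparameter alpha controls the trade-off, so sweeping
    alpha yields a family of policies with different efficiency levels."""
    return (1.0 if correct else 0.0) - alpha * n_tokens

# With alpha = 1e-3, a short correct answer beats a long correct one.
short = efficiency_reward(True, 100, 1e-3)   # 1.0 - 0.1 = 0.9
long_ = efficiency_reward(True, 400, 1e-3)   # 1.0 - 0.4 = 0.6
```

Because correctness dominates for small alpha, the policy is still pushed to spend tokens on hard problems where they change the answer, while trimming them elsewhere.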

Updated: 2025-11-03 21:45:15

标题: 训练语言模型以高效进行推理

摘要: 扩大模型规模和训练数据使大型语言模型(LLMs)的性能取得了巨大进展。然而,这种方法的收益递减,需要替代方法来提升模型能力,尤其是在需要高级推理的任务中。利用长思维链的大型推理模型在问题求解能力上带来了前所未有的突破,但更长的生成也带来了可观的部署成本。降低推理成本对这些模型的经济可行性、用户体验和环境可持续性至关重要。在这项工作中,我们提出训练大型推理模型使其高效推理。更确切地说,我们使用强化学习(RL)训练推理模型,根据任务复杂性动态分配推理时计算量。我们的方法激励模型在保持准确率的同时最小化不必要的计算开销,从而实现显著的效率提升,并能够推导出一族效率水平各异、由单一超参数控制的推理模型。在两个开放权重大型推理模型上的实验表明,推理成本显著降低,同时保留了大部分准确率。

更新时间: 2025-11-03 21:45:15

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2502.04463v4

Watermarking Discrete Diffusion Language Models

Watermarking has emerged as a promising technique to track AI-generated content and differentiate it from authentic human creations. While prior work extensively studies watermarking for autoregressive large language models (LLMs) and image diffusion models, none address discrete diffusion language models, which are becoming popular due to their high inference throughput. In this paper, we introduce the first watermarking method for discrete diffusion models by applying the distribution-preserving Gumbel-max trick at every diffusion step and seeding the randomness with the sequence index to enable reliable detection. We experimentally demonstrate that our scheme is reliably detectable on state-of-the-art diffusion language models and analytically prove that it is distortion-free with an exponentially decaying probability of false detection in the token sequence length.
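
A minimal sketch of the scheme as described (Gumbel-max sampling with randomness seeded by the sequence index); the hash-based noise construction and the detector statistic are illustrative assumptions of this sketch:

```python
import hashlib
import math
import random

def gumbel_from_key(key: int, step: int, token: int) -> float:
    """Pseudorandom Gumbel noise seeded by (key, sequence index, token id),
    so a detector can re-derive the same noise without running the model."""
    h = hashlib.sha256(f"{key}-{step}-{token}".encode()).digest()
    u = (int.from_bytes(h[:8], "big") + 1) / (2 ** 64 + 2)  # uniform in (0, 1)
    return -math.log(-math.log(u))

def watermarked_sample(logits, key, step):
    """Distribution-preserving Gumbel-max trick: argmax of logit + Gumbel noise
    is an exact sample from softmax(logits), but here the noise is keyed."""
    return max(range(len(logits)),
               key=lambda t: logits[t] + gumbel_from_key(key, step, t))

def detection_score(tokens, key):
    """Detector: watermarked text accumulates unusually large keyed Gumbels
    at its selected tokens; unkeyed text averages near 0.577 (Euler's gamma)."""
    return sum(gumbel_from_key(key, i, t) for i, t in enumerate(tokens)) / len(tokens)

# Generate 200 watermarked tokens from random toy logits over a 20-token vocab.
rng = random.Random(0)
key = 1234
tokens = [watermarked_sample([rng.gauss(0, 1) for _ in range(20)], key, i)
          for i in range(200)]
```

In a diffusion model the same keyed trick would be applied at every denoising step rather than left-to-right, per the abstract; the per-position seeding is what makes detection possible regardless of decoding order.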

Updated: 2025-11-03 21:43:44

标题: 离散扩散语言模型的水印

摘要: 水印已成为一种有前途的技术,可用于追踪AI生成的内容并将其与真实的人类创作区分开来。虽然先前的工作广泛研究了自回归大型语言模型(LLMs)和图像扩散模型的水印,但没有工作涉及离散扩散语言模型,而这类模型因其高推理吞吐量正变得流行。在本文中,我们介绍了第一种针对离散扩散模型的水印方法:在每个扩散步骤应用保持分布的Gumbel-max技巧,并用序列索引作为随机性的种子以实现可靠检测。我们通过实验证明,我们的方案在最先进的扩散语言模型上可被可靠检测,并从理论上证明它不引入失真,且误检概率随令牌序列长度呈指数衰减。

更新时间: 2025-11-03 21:43:44

领域: cs.CR,cs.AI,cs.CY

下载: http://arxiv.org/abs/2511.02083v1

Expertise and confidence explain how social influence evolves along intellective tasks

Discovering the antecedents of individuals' influence in collaborative environments is an important, practical, and challenging problem. In this paper, we study interpersonal influence in small groups of individuals who collectively execute a sequence of intellective tasks. We observe that along an issue sequence with feedback, individuals with higher expertise and social confidence are accorded higher interpersonal influence. We also observe that low-performing individuals tend to underestimate their high-performing teammate's expertise. Based on these observations, we introduce three hypotheses and present empirical and theoretical support for their validity. We report empirical evidence on longstanding theories of transactive memory systems, social comparison, and confidence heuristics on the origins of social influence. We propose a cognitive dynamical model inspired by these theories to describe the process by which individuals adjust interpersonal influences over time. We demonstrate the model's accuracy in predicting individuals' influence and provide analytical results on its asymptotic behavior for the case with identically performing individuals. Lastly, we propose a novel approach using deep neural networks on a pre-trained text embedding model for predicting the influence of individuals. Using message contents, message times, and individual correctness collected during tasks, we are able to accurately predict individuals' self-reported influence over time. Extensive experiments verify the accuracy of the proposed models compared to baselines such as structural balance and reflected appraisal model. While the neural networks model is the most accurate, the dynamical model is the most interpretable for influence prediction.

Updated: 2025-11-03 21:37:24

Domains: cs.SI,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2011.07168v2

Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models

Masked diffusion language models (MDLMs) are becoming competitive with their autoregressive counterparts but typically decode with fixed steps and sequential unmasking. To accelerate decoding, recent work such as Fast-dLLM enables parallel decoding via a static global confidence threshold, yet we observe strong block- and step-wise confidence fluctuations and, within a dataset, near-identical confidence trajectories across inputs as measured by cosine similarity. Motivated by these observations, we introduce One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with negligible overhead. On GPQA, GSM8K, and HumanEval, OSDT attains superior accuracy-throughput trade-offs (+24% tokens/s on GSM8K at the best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with a modest accuracy gap). Beyond these results, our findings suggest broader opportunities to leverage reusable task-level confidence signatures for more general-purpose algorithmic and systems innovations in diffusion decoding.
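
The one-shot idea can be sketched compactly: thresholds are calibrated once, per decoding step, on a single sequence's confidence trajectory, then reused for later inputs. The `margin` rule and the function names below are illustrative assumptions, not the paper's exact calibration.

```python
def calibrate_thresholds(confidences, margin=0.05):
    """One-shot calibration (hypothetical rule): for each decoding step of a
    single calibration sequence, place the threshold just under that step's
    highest observed confidence, capturing the step-wise fluctuations that a
    single static global threshold misses."""
    return [max(step) - margin for step in confidences]

def parallel_unmask(step_conf, threshold):
    # Unmask every still-masked position whose confidence clears the step's
    # threshold; fall back to the single most confident position so decoding
    # always makes progress.
    chosen = [i for i, c in enumerate(step_conf) if c >= threshold]
    return chosen or [max(range(len(step_conf)), key=step_conf.__getitem__)]
```

The near-identical confidence trajectories across inputs (per the abstract's cosine-similarity observation) are what justify reusing one sequence's thresholds for the rest of the dataset.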

Updated: 2025-11-03 21:30:03

Domains: cs.LG

Download: http://arxiv.org/abs/2511.02077v1

Relational Causal Discovery with Latent Confounders

Estimating causal effects from real-world relational data can be challenging when the underlying causal model and potential confounders are unknown. While several causal discovery algorithms exist for learning causal models with latent confounders from data, they assume that the data is independent and identically distributed (i.i.d.) and are not well-suited for learning from relational data. Similarly, existing relational causal discovery algorithms assume causal sufficiency, which is unrealistic for many real-world datasets. To address this gap, we propose RelFCI, a sound and complete causal discovery algorithm for relational data with latent confounders. Our work builds upon the Fast Causal Inference (FCI) and Relational Causal Discovery (RCD) algorithms and it defines new graphical models, necessary to support causal discovery in relational domains. We also establish soundness and completeness guarantees for relational d-separation with latent confounders. We present experimental results demonstrating the effectiveness of RelFCI in identifying the correct causal structure in relational causal models with latent confounders.

Updated: 2025-11-03 21:27:56

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.01700v2

Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing

Scientific experimentation and manufacturing rely on complex, multi-step procedures that demand continuous human expertise for precise execution and decision-making. Despite advances in machine learning and automation, conventional models remain confined to virtual domains, while real-world experimentation and manufacturing still rely on human supervision and expertise. This gap between machine intelligence and physical execution limits reproducibility, scalability, and accessibility across scientific and manufacturing workflows. Here, we introduce human-AI co-embodied intelligence, a new form of physical AI that unites human users, agentic AI, and wearable hardware into an integrated system for real-world experimentation and intelligent manufacturing. In this paradigm, humans provide precise execution and control, while agentic AI contributes memory, contextual reasoning, adaptive planning, and real-time feedback. The wearable interface continuously captures the experimental and manufacturing processes and facilitates seamless communication between humans and AI for corrective guidance and interpretable collaboration. As a demonstration, we present the Agentic-Physical Experimentation (APEX) system, coupling agentic reasoning with physical execution through mixed reality. APEX observes and interprets human actions, aligns them with standard operating procedures, provides 3D visual guidance, and analyzes every step. Implemented in a cleanroom for flexible-electronics fabrication, the APEX system achieves context-aware reasoning with accuracy exceeding that of general multimodal large language models, corrects errors in real time, and transfers expertise to beginners. These results establish a new class of agentic-physical-human intelligence that extends agentic reasoning beyond computation into the physical domain, transforming scientific research and manufacturing into autonomous, traceable, interpretable, and scalable processes.

Updated: 2025-11-03 21:12:48

Domains: cs.AI

Download: http://arxiv.org/abs/2511.02071v1

Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

There is growing interest in deploying ML inference and knowledge retrieval as services that can support both interactive queries from end users and the more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often meeting a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.

Updated: 2025-11-03 20:59:27

Domains: cs.DB,cs.AI

Download: http://arxiv.org/abs/2511.02062v1

A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification

In recent years, spatio-temporal graph neural networks (GNNs) have attracted considerable interest in the field of time series analysis, due to their ability to capture, at once, dependencies among variables and across time points. The objective of this systematic literature review is hence to provide a comprehensive overview of the various modeling approaches and application domains of GNNs for time series classification and forecasting. A database search was conducted, and 366 papers were selected for a detailed examination of the current state-of-the-art in the field. This examination is intended to offer the reader a comprehensive review of the proposed models, with links to related source code, available datasets, benchmark models, and fitting results; we hope this information will assist researchers in their studies. To the best of our knowledge, this is the first and broadest systematic literature review presenting a detailed comparison of results from current spatio-temporal GNN models applied to different domains. In its final part, this review discusses current limitations and challenges in the application of spatio-temporal GNNs, such as comparability, reproducibility, explainability, poor information capacity, and scalability. This paper is complemented by a GitHub repository at https://github.com/FlaGer99/SLR-Spatio-Temporal-GNN.git providing additional interactive tools to further explore the presented findings.

Updated: 2025-11-03 20:55:33

Domains: cs.LG,cs.AI,physics.data-an

Download: http://arxiv.org/abs/2410.22377v4

Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement

Reliable evaluation of AI systems remains a fundamental challenge when ground truth labels are unavailable, particularly for systems generating natural language outputs like AI chat and agent systems. Many of these AI agents and systems focus on entity-centric tasks. In enterprise contexts, organizations deploy AI systems for entity linking, data integration, and information retrieval where verification against gold standards is often infeasible due to proprietary data constraints. Academic deployments face similar challenges when evaluating AI systems on specialized datasets with ambiguous criteria. Conventional evaluation frameworks, rooted in supervised learning paradigms, fail in such scenarios where single correct answers cannot be defined. We introduce VB-Score, a variance-bounded evaluation framework for entity-centric AI systems that operates without ground truth by jointly measuring effectiveness and robustness. Given system inputs, VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling, assigning probabilities that reflect their likelihood. It then evaluates system outputs by their expected success across interpretations, penalized by variance to assess robustness of the system. We provide formal theoretical analysis establishing key properties including range, monotonicity, and stability along with concentration bounds for Monte Carlo estimation. Through case studies on AI systems with ambiguous inputs, we demonstrate that VB-Score reveals robustness differences hidden by conventional evaluation frameworks, offering a principled measurement framework for assessing AI system reliability in label-scarce domains.
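
The scoring recipe described above (enumerate plausible interpretations, weight them by likelihood, reward expected success, penalize variance) can be sketched as follows. The penalty weight `lam`, the sampling setup, and the function names are our illustrative assumptions, not the paper's exact estimator.

```python
import random

def vb_score(success_prob, interpretations, weights, lam=1.0,
             n_samples=2000, seed=0):
    """VB-Score sketch: sample plausible interpretations of an ambiguous input
    in proportion to their weights, estimate the system's expected success via
    Monte Carlo, and subtract a variance penalty so brittle systems (good on
    one reading, bad on another) score lower than uniformly robust ones."""
    rng = random.Random(seed)
    draws = rng.choices(interpretations, weights=weights, k=n_samples)
    successes = [success_prob(i) for i in draws]
    mean = sum(successes) / n_samples
    var = sum((s - mean) ** 2 for s in successes) / n_samples
    return mean - lam * var
```

A system that succeeds on every interpretation keeps its full expected-success score, while one that succeeds only on a favored reading is penalized by the variance term, which is exactly the robustness difference the framework is designed to surface.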

Updated: 2025-11-03 20:40:52

Domains: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.22751v2

Private Map-Secure Reduce: Infrastructure for Efficient AI Data Markets

The modern AI data economy centralizes power, limits innovation, and misallocates value by extracting data without control, privacy, or fair compensation. We introduce Private Map-Secure Reduce (PMSR), a network-native paradigm that transforms data economics from extractive to participatory through cryptographically enforced markets. Extending MapReduce to decentralized settings, PMSR enables computation to move to the data, ensuring verifiable privacy, efficient price discovery, and incentive alignment. Demonstrations include large-scale recommender audits, privacy-preserving LLM ensembling (87.5% MMLU accuracy across six models), and distributed analytics over hundreds of nodes. PMSR establishes a scalable, equitable, and privacy-guaranteed foundation for the next generation of AI data markets.

Updated: 2025-11-03 20:39:25

Domains: cs.CR

Download: http://arxiv.org/abs/2511.02055v1

Data-driven Learning of Interaction Laws in Multispecies Particle Systems with Gaussian Processes: Convergence Theory and Applications

We develop a Gaussian process framework for learning interaction kernels in multi-species interacting particle systems from trajectory data. Such systems provide a canonical setting for multiscale modeling, where simple microscopic interaction rules generate complex macroscopic behaviors. While our earlier work established a Gaussian process approach and convergence theory for single-species systems, and later extended to second-order models with alignment and energy-type interactions, the multi-species setting introduces new challenges: heterogeneous populations interact both within and across species, the number of unknown kernels grows, and asymmetric interactions such as predator-prey dynamics must be accommodated. We formulate the learning problem in a nonparametric Bayesian setting and establish rigorous statistical guarantees. Our analysis shows recoverability of the interaction kernels, provides quantitative error bounds, and proves statistical optimality of posterior estimators, thereby unifying and generalizing previous single-species theory. Numerical experiments confirm the theoretical predictions and demonstrate the effectiveness of the proposed approach, highlighting its advantages over existing kernel-based methods. This work contributes a complete statistical framework for data-driven inference of interaction laws in multi-species systems, advancing the broader multiscale modeling program of connecting microscopic particle dynamics with emergent macroscopic behavior.

Updated: 2025-11-03 20:38:38

Domains: stat.ML,cs.LG,cs.NA,math.NA,math.ST,stat.TH

Download: http://arxiv.org/abs/2511.02053v1

Solving cold start in news recommendations: a RippleNet-based system for large scale media outlet

We present a scalable recommender system implementation based on RippleNet, tailored for the media domain with a production deployment in Onet.pl, one of Poland's largest online media platforms. Our solution addresses the cold-start problem for newly published content by integrating content-based item embeddings into the knowledge propagation mechanism of RippleNet, enabling effective scoring of previously unseen items. The system architecture leverages Amazon SageMaker for distributed training and inference, and Apache Airflow for orchestrating data pipelines and model retraining workflows. To ensure high-quality training data, we constructed a comprehensive golden dataset consisting of user and item features and a separate interaction table, all enabling flexible extensions and integration of new signals.
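
A toy sketch of why content-based item embeddings resolve the cold-start problem: a freshly published article has no interaction history, so it can only be scored from its content embedding against the user's RippleNet-propagated interest vectors. The mean-dot-product scoring head below is our simplifying assumption; the production system's scoring function may differ.

```python
def score_cold_item(user_ripple_embs, item_content_emb):
    """Cold-start scoring sketch: score a never-seen item purely from its
    content-based embedding, matched against the user's interest embeddings
    propagated through the knowledge graph (one vector per ripple hop)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sum(dot(u, item_content_emb)
               for u in user_ripple_embs) / len(user_ripple_embs)
```

Because the item side needs only content features, newly published articles become scorable the moment their embeddings are computed, with no retraining or interaction data required.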

Updated: 2025-11-03 20:38:37

Domains: cs.IR,cs.LG

Download: http://arxiv.org/abs/2511.02052v1

AAGATE: A NIST AI RMF-Aligned Governance Platform for Agentic AI

This paper introduces the Agentic AI Governance Assurance & Trust Engine (AAGATE), a Kubernetes-native control plane designed to address the unique security and governance challenges posed by autonomous, language-model-driven agents in production. Recognizing the limitations of traditional Application Security (AppSec) tooling for improvisational, machine-speed systems, AAGATE operationalizes the NIST AI Risk Management Framework (AI RMF). It integrates specialized security frameworks for each RMF function: the Agentic AI Threat Modeling MAESTRO framework for Map, a hybrid of OWASP's AIVSS and SEI's SSVC for Measure, and the Cloud Security Alliance's Agentic AI Red Teaming Guide for Manage. By incorporating a zero-trust service mesh, an explainable policy engine, behavioral analytics, and decentralized accountability hooks, AAGATE provides a continuous, verifiable governance solution for agentic AI, enabling safe, accountable, and scalable deployment. The framework is further extended with DIRF for digital identity rights, LPCI defenses for logic-layer injection, and QSAF monitors for cognitive degradation, ensuring governance spans systemic, adversarial, and ethical risks.

Updated: 2025-11-03 20:37:10

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.25863v2

Learning Terrain-Specialized Policies for Adaptive Locomotion in Challenging Environments

Legged robots must exhibit robust and agile locomotion across diverse, unstructured terrains, a challenge exacerbated under blind locomotion settings where terrain information is unavailable. This work introduces a hierarchical reinforcement learning framework that leverages terrain-specialized policies and curriculum learning to enhance agility and tracking performance in complex environments. We validated our method in simulation, where it outperforms a generalist policy by up to 16% in success rate and achieves lower tracking errors as the velocity target increases, particularly on low-friction and discontinuous terrains, demonstrating superior adaptability and robustness across mixed-terrain scenarios.

Updated: 2025-11-03 20:32:45

Domains: cs.RO,cs.AI

Download: http://arxiv.org/abs/2509.20635v2

Two-Player Zero-Sum Games with Bandit Feedback

We study a two-player zero-sum game in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit framework. The first adapts it to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm by employing non-uniform exploration. Our objective is to demonstrate the applicability of ETC in a zero-sum game setting by focusing on learning pure strategy Nash Equilibria. A key contribution of our work is a derivation of instance-dependent upper bounds on the expected regret of our proposed algorithms, which has received limited attention in the literature on zero-sum games. Particularly, after $T$ rounds, we achieve an instance-dependent regret upper bounds of $O(\Delta + \sqrt{T})$ for ETC in zero-sum game setting and $O(\log (T \Delta^2) / \Delta)$ for the adaptive elimination algorithm and its variant with non-uniform exploration, where $\Delta$ denotes the suboptimality gap. Therefore, our results indicate that ETC-based algorithms perform effectively in adversarial game settings, achieving regret bounds comparable to existing methods while providing insight through instance-dependent analysis.
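
The first algorithm's Explore-Then-Commit skeleton might look like the following sketch: explore every action pair equally often under noisy bandit feedback, build an empirical payoff matrix, then commit to a pure-strategy saddle point of the estimate. The sample counts, Gaussian noise model, and function names are illustrative assumptions, not the paper's exact protocol.

```python
import random

def etc_zero_sum(payoff, rounds_per_cell=200, seed=0):
    """ETC sketch for a zero-sum matrix game with bandit feedback: uniform
    exploration of all (row, column) pairs, then commitment to a pure-strategy
    Nash equilibrium of the empirical payoff matrix, if one exists."""
    rng = random.Random(seed)
    n, m = len(payoff), len(payoff[0])
    est = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Bandit feedback: each pull reveals the payoff plus zero-mean noise.
            samples = [payoff[i][j] + rng.gauss(0, 0.1)
                       for _ in range(rounds_per_cell)]
            est[i][j] = sum(samples) / rounds_per_cell
    # A pure NE of the estimate is a saddle point: the cell is the maximum of
    # its column (row player best-responds) and the minimum of its row
    # (column player best-responds).
    for i in range(n):
        for j in range(m):
            if (est[i][j] >= max(est[k][j] for k in range(n)) and
                    est[i][j] <= min(est[i][l] for l in range(m))):
                return (i, j)
    return None
```

The adaptive-elimination variants in the paper improve on this by discarding dominated action pairs during exploration instead of sampling every cell equally.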

Updated: 2025-11-03 20:32:28

Domains: cs.LG,cs.GT

Download: http://arxiv.org/abs/2506.14518v2

End-to-End Crop Row Navigation via LiDAR-Based Deep Reinforcement Learning

Reliable navigation in under-canopy agricultural environments remains a challenge due to GNSS unreliability, cluttered rows, and variable lighting. To address these limitations, we present an end-to-end learning-based navigation system that maps raw 3D LiDAR data directly to control commands using a deep reinforcement learning policy trained entirely in simulation. Our method includes a voxel-based downsampling strategy that reduces LiDAR input size by 95.83%, enabling efficient policy learning without relying on labeled datasets or manually designed control interfaces. The policy was validated in simulation, achieving a 100% success rate in straight-row plantations and showing a gradual decline in performance as row curvature increased, tested across varying sinusoidal frequencies and amplitudes.
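
The voxel-based downsampling step admits a compact sketch: bucket the 3D points into a cubic grid and keep one centroid per occupied voxel. The `voxel_size` value and the centroid rule below are generic assumptions; the paper's exact 95.83% reduction depends on its sensor resolution and grid settings.

```python
from collections import defaultdict

def voxel_downsample(points, voxel_size=0.1):
    """Voxel-grid downsampling sketch: quantize each (x, y, z) point to a cubic
    voxel index, then replace all points sharing a voxel with their centroid,
    shrinking the LiDAR cloud while preserving its coarse geometry."""
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    # One centroid per occupied voxel.
    return [tuple(sum(coord) / len(pts) for coord in zip(*pts))
            for pts in buckets.values()]
```

Shrinking the input this way keeps the policy's observation space small and fixed-cost, which is what makes end-to-end reinforcement learning on raw LiDAR tractable.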

Updated: 2025-11-03 20:29:38

Domains: cs.RO,cs.AI

Download: http://arxiv.org/abs/2509.18608v2

Finding Probably Approximate Optimal Solutions by Training to Estimate the Optimal Values of Subproblems

This paper develops a solver for maximizing a real-valued function of binary variables. The solver relies on an algorithm that estimates the optimal objective-function value of instances drawn from the underlying distribution of objectives and of their respective sub-instances. The training of the estimator is based on an inequality that makes it possible to use the expected total deviation from the optimality conditions, rather than the objective function itself, as the loss function. Thus, it neither calculates the values of policies nor relies on solved instances.

Updated: 2025-11-03 20:29:30

Domains: cs.LG

Download: http://arxiv.org/abs/2511.02048v1

A Dual-Use Framework for Clinical Gait Analysis: Attention-Based Sensor Optimization and Automated Dataset Auditing

Objective gait analysis using wearable sensors and AI is critical for managing neurological and orthopedic conditions. However, models are vulnerable to hidden dataset biases, and task-specific sensor optimization remains a challenge. We propose a multi-stream attention-based deep learning framework that functions as both a sensor optimizer and an automated data auditor. Applied to the Voisard et al. (2025) multi-cohort gait dataset on four clinical tasks (PD, OA, CVA screening; PD vs CVA differential), the model's attention mechanism quantitatively discovered a severe dataset confound. For OA and CVA screening, tasks where bilateral assessment is clinically essential, the model assigned more than 70 percent attention to the Right Foot while statistically ignoring the Left Foot (less than 0.1 percent attention, 95 percent CI [0.0-0.1]). This was not a clinical finding but a direct reflection of a severe laterality bias (for example, 15 of 15 right-sided OA) in the public dataset. The primary contribution of this work is methodological, demonstrating that an interpretable framework can automatically audit dataset integrity. As a secondary finding, the model proposes novel, data-driven sensor synergies (for example, Head plus Foot for PD screening) as hypotheses for future optimized protocols.

Updated: 2025-11-03 20:29:03

Domains: cs.LG

Download: http://arxiv.org/abs/2511.02047v1

Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis

Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.

Updated: 2025-11-03 20:28:22

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.02046v1

ExpertLens: Activation steering features are highly interpretable

Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific concepts (e.g., "cat") using the "finding experts" method from research on activation steering, and show that ExpertLens, i.e., inspection of these neurons, provides insights about model representation. We find that ExpertLens representations are stable across models and datasets and closely align with human representations inferred from behavioral data, matching inter-human alignment levels. ExpertLens significantly outperforms the alignment captured by word/sentence embeddings. By reconstructing human concept organization through ExpertLens, we show that it enables a granular view of LLM concept representation. Our findings suggest that ExpertLens is a flexible and lightweight approach for capturing and analyzing model representations.
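
A toy version of the "finding experts" selection, under our simplifying assumption that expert neurons are those with the largest gap in mean activation between concept inputs (e.g., sentences about "cat") and other inputs; the real method's statistic may differ.

```python
def find_experts(concept_acts, other_acts, top_k=2):
    """Rank neurons by the difference between their mean activation on
    concept-related inputs and on unrelated inputs, and return the indices of
    the top-k as the concept's candidate expert neurons. Each argument is a
    list of activation vectors (one per input, one entry per neuron)."""
    n = len(concept_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    gaps = [mean(concept_acts, j) - mean(other_acts, j) for j in range(n)]
    return sorted(range(n), key=lambda j: gaps[j], reverse=True)[:top_k]
```

Inspecting what the returned neurons respond to across many probe inputs is the "lens" part: the neurons themselves become the unit of interpretation.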

Updated: 2025-11-03 20:25:57

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2502.15090v4

Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning

Fine-tuning LLMs for classification typically maps inputs directly to labels. We ask whether attaching brief explanations to each label during fine-tuning yields better models. We evaluate conversational response quality along three axes: naturalness, comprehensiveness, and on-topic adherence, each rated on 5-point scales. Using ensemble-generated data from multiple LLMs, we fine-tune a 7B-parameter model and test across six diverse conversational datasets. Across 18 dataset and task settings, label-plus-explanation training outperforms label-only baselines. A central and unexpected result concerns random tokens. We replace human-written explanations with text that is syntactically incoherent yet vocabulary-aligned with the originals (e.g., shuffled or bag-of-words variants). Despite lacking semantics, these pseudo-explanations still improve accuracy over label-only training and often narrow much of the gap to true explanations. The effect persists across datasets and training seeds, indicating that gains arise less from meaning than from structure: the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts. Internal analyses support this view: explanation-augmented models exhibit higher activation entropy in intermediate layers alongside sharper predictive mass at the output layer, consistent with increased deliberation before decision. Overall, explanation-augmented fine-tuning, whether with genuine rationales or carefully constructed random token sequences, improves accuracy and reliability for LLM classification while clarifying how token-level scaffolding shapes computation during inference.
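
The shuffled, vocabulary-aligned control condition is straightforward to reproduce. The sketch below uses a hypothetical fine-tuning target format (the `Label:`/`Explanation:` layout is our assumption, not the paper's prompt template).

```python
import random

def pseudo_explanation(explanation: str, seed: int = 0) -> str:
    """Build a vocabulary-aligned but syntactically incoherent control: shuffle
    the explanation's tokens so the token budget and word distribution are
    preserved while the semantics are destroyed (the 'shuffled' variant)."""
    tokens = explanation.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)

def build_example(text: str, label: str, explanation: str, shuffled: bool) -> str:
    # Hypothetical fine-tuning target: the label followed by a real or shuffled
    # rationale, giving the model extra tokens of intermediate computation.
    expl = pseudo_explanation(explanation) if shuffled else explanation
    return f"{text}\nLabel: {label}\nExplanation: {expl}"
```

Comparing models fine-tuned on `shuffled=True` versus `shuffled=False` targets isolates how much of the gain comes from token-level structure rather than explanation semantics.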

Updated: 2025-11-03 20:25:42

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2511.02044v1

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.

Updated: 2025-11-03 20:25:19

Categories: cs.LG,cs.PF

Download: http://arxiv.org/abs/2511.02043v1

Quantum-Enhanced Generative Models for Rare Event Prediction

Rare events such as financial crashes, climate extremes, and biological anomalies are notoriously difficult to model due to their scarcity and heavy-tailed distributions. Classical deep generative models often struggle to capture these rare occurrences, either collapsing low-probability modes or producing poorly calibrated uncertainty estimates. In this work, we propose the Quantum-Enhanced Generative Model (QEGM), a hybrid classical-quantum framework that integrates deep latent-variable models with variational quantum circuits. The framework introduces two key innovations: (1) a hybrid loss function that jointly optimizes reconstruction fidelity and tail-aware likelihood, and (2) quantum randomness-driven noise injection to enhance sample diversity and mitigate mode collapse. Training proceeds via a hybrid loop where classical parameters are updated through backpropagation while quantum parameters are optimized using parameter-shift gradients. We evaluate QEGM on synthetic Gaussian mixtures and real-world datasets spanning finance, climate, and protein structure. Results demonstrate that QEGM reduces tail KL divergence by up to 50 percent compared to state-of-the-art baselines (GAN, VAE, Diffusion), while improving rare-event recall and coverage calibration. These findings highlight the potential of QEGM as a principled approach for rare-event prediction, offering robustness beyond what is achievable with purely classical methods.
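
The exact form of the hybrid loss is not given in the abstract; its two-term structure (reconstruction fidelity plus a tail-aware term) can be sketched as follows, with a hypothetical tail-weight function:

```python
def hybrid_loss(x_true, x_recon, tail_weight_fn, lam=0.5):
    """Schematic tail-aware hybrid loss: mean squared reconstruction error plus a
    term that up-weights rare (tail) samples. The actual QEGM loss is not
    specified in the abstract; this only illustrates the two-term structure."""
    n = len(x_true)
    recon = sum((a - b) ** 2 for a, b in zip(x_true, x_recon)) / n
    tail = sum(tail_weight_fn(a) * (a - b) ** 2 for a, b in zip(x_true, x_recon)) / n
    return recon + lam * tail

# Hypothetical tail weight: samples beyond 2 sigma count five-fold.
weight = lambda x: 5.0 if abs(x) > 2.0 else 1.0
loss = hybrid_loss([0.1, 3.0], [0.1, 2.0], weight)
```

Errors on the tail sample (3.0) dominate the loss, so the generator cannot reduce the objective by fitting only the bulk of the distribution.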

Updated: 2025-11-03 20:24:55

Categories: cs.LG,cs.AI,cs.CR,cs.DC

Download: http://arxiv.org/abs/2511.02042v1

Predicting Microbial Interactions Using Graph Neural Networks

Predicting interspecies interactions is a key challenge in microbial ecology, as these interactions are critical to determining the structure and activity of microbial communities. In this work, we used data on monoculture growth capabilities, interactions with other species, and phylogeny to predict a negative or positive effect of interactions. More precisely, we used one of the largest available pairwise interaction datasets to train our models, comprising over 7,500 interactions between 20 species from two taxonomic groups co-cultured under 40 distinct carbon conditions, with a primary focus on the work of Nestor et al. [28]. We propose Graph Neural Networks (GNNs) as a powerful classifier to predict the direction of the effect. We construct edge-graphs of pairwise microbial interactions in order to leverage shared information across individual co-culture experiments, and use GNNs to predict modes of interaction. Our model can not only predict binary interactions (positive/negative) but also classify more complex interaction types such as mutualism, competition, and parasitism. Our initial results were encouraging, achieving an F1-score of 80.44%. This significantly outperforms comparable methods in the literature, including conventional Extreme Gradient Boosting (XGBoost) models, which reported an F1-score of 72.76%.
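
The edge-graph construction is not fully specified in the abstract; one common reading is a line graph in which each co-cultured pair becomes a node and two pair-nodes are linked when they share a species, so that experiments involving the same organism exchange information. A minimal sketch under that assumption:

```python
from itertools import combinations

def edge_graph(pairs):
    """Build a line graph: each (species_a, species_b) co-culture pair becomes a
    node, and two pair-nodes are linked when the pairs share a species."""
    nodes = [tuple(sorted(p)) for p in pairs]
    edges = set()
    for u, v in combinations(nodes, 2):
        if set(u) & set(v):               # shared species => connected pair-nodes
            edges.add((u, v))
    return nodes, edges

pairs = [("A", "B"), ("B", "C"), ("C", "D")]
nodes, edges = edge_graph(pairs)
# (A,B)-(B,C) share B and (B,C)-(C,D) share C; (A,B)-(C,D) share nothing
```

A GNN operating on this graph can then propagate features between co-culture experiments that involve a common species.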

Updated: 2025-11-03 20:19:49

Categories: cs.LG,q-bio.QM,68T05,I.2.6; I.5.1

Download: http://arxiv.org/abs/2511.02038v1

RobustFSM: Submodular Maximization in Federated Setting with Malicious Clients

Submodular maximization is an optimization problem benefiting many machine learning applications, where we seek a small subset that best represents an extremely large dataset. We focus on the federated setting, where the data are locally owned by decentralized clients who have their own definitions of representation quality. This setting requires repeated aggregation of local information computed by the clients. While the main motivation is to respect the privacy and autonomy of the clients, the federated setting is vulnerable to client misbehavior: malicious clients might share fake information. An analogy is the backdoor attack in conventional federated learning, but our challenge is distinct due to the unique characteristics of submodular maximization. We propose RobustFSM, a federated submodular maximization solution that is robust to various practical client attacks. Its performance is substantiated with an empirical evaluation using real-world datasets. Numerical results show that the solution quality of RobustFSM substantially exceeds that of the conventional federated algorithm when attacks are severe. The degree of this improvement depends on the dataset and attack scenarios, and can be as high as 200%.
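
RobustFSM's actual defense is not detailed in the abstract; the general idea of robust aggregation in federated submodular maximization can be sketched with a greedy loop in which the server aggregates client-reported marginal gains by median rather than mean, so a lying minority cannot steer selection (all names below are hypothetical):

```python
import statistics

def federated_greedy(ground_set, client_gain_fns, k, aggregate=statistics.median):
    """Greedy submodular maximization where each client reports marginal gains
    and the server aggregates them robustly (median resists a lying minority)."""
    selected = []
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for item in ground_set:
            if item in selected:
                continue
            gains = [f(selected, item) for f in client_gain_fns]  # per-client gain
            agg = aggregate(gains)
            if agg > best_gain:
                best, best_gain = item, agg
        selected.append(best)
    return selected

# Two honest clients value item "a" most; one malicious client inflates "c".
honest = lambda S, x: {"a": 3.0, "b": 2.0, "c": 1.0}[x]
malicious = lambda S, x: {"a": 0.0, "b": 0.0, "c": 100.0}[x]
picked = federated_greedy(["a", "b", "c"], [honest, honest, malicious], k=2)
# median of (3, 3, 0) = 3 for "a", but median of (1, 1, 100) = 1 for "c"
```

With mean aggregation the inflated report would win; with the median, the honest majority determines the selection.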

Updated: 2025-11-03 20:07:21

Categories: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2511.02029v1

Aggregation of Published Non-Uniform Axial Power Data for Phase II of the OECD/NEA AI/ML Critical Heat Flux Benchmark

Critical heat flux (CHF) marks the onset of boiling crisis in light-water reactors, defining safe thermal-hydraulic operating limits. To support Phase II of the OECD/NEA AI/ML CHF benchmark, which introduces spatially varying power profiles, this work compiles and digitizes a broad CHF dataset covering both uniform and non-uniform axial heating conditions. Heating profiles were extracted from technical reports, interpolated onto a consistent axial mesh, validated via energy-balance checks, and encoded in machine-readable formats for benchmark compatibility. Classical CHF correlations exhibit substantial errors under uniform heating and degrade markedly when applied to non-uniform profiles, while modern tabular methods offer improved but still imperfect predictions. A neural network trained solely on uniform data performs well in that regime but fails to generalize to spatially varying scenarios, underscoring the need for models that explicitly incorporate axial power distributions. By providing these curated datasets and baseline modeling results, this study lays the groundwork for advanced transfer-learning strategies, rigorous uncertainty quantification, and design-optimization efforts in the next phase of the CHF benchmark.

Updated: 2025-11-03 20:04:57

Categories: cs.LG,cs.CE

Download: http://arxiv.org/abs/2507.00034v2

Path-Coordinated Continual Learning with Neural Tangent Kernel-Justified Plasticity: A Theoretical Framework with Near State-of-the-Art Performance

Catastrophic forgetting is one of the fundamental issues of continual learning: neural networks forget previously learned tasks when trained on new ones. We propose a new path-coordinated continual learning framework that unites Neural Tangent Kernel (NTK) theory for principled plasticity bounds, statistical validation via Wilson confidence intervals, and path-quality evaluation using multiple metrics. Experimental evaluation shows an average accuracy of 66.7% at the cost of 23.4% catastrophic forgetting on Split-CIFAR10, a substantial improvement over the baseline and competitive performance close to state-of-the-art results. We further find that NTK condition numbers are predictive indicators of learning-capacity limits, with a critical threshold at condition number $>10^{11}$. Notably, the proposed strategy forgets less as the task sequence progresses (27% down to 18%), indicating system stabilization. The framework validates 80% of discovered paths with a rigorous statistical guarantee and maintains 90-97% retention on intermediate tasks. Our analysis identifies the core capacity limits of the continual learning setting and offers actionable insights for enhancing adaptive regularization.

Updated: 2025-11-03 19:55:59

Categories: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2511.02025v1

HADSF: Aspect Aware Semantic Control for Explainable Recommendation

Recent advances in large language models (LLMs) promise more effective information extraction for review-based recommender systems, yet current methods still (i) mine free-form reviews without scope control, producing redundant and noisy representations, (ii) lack principled metrics that link LLM hallucination to downstream effectiveness, and (iii) leave the cost-quality trade-off across model scales largely unexplored. We address these gaps with the Hyper-Adaptive Dual-Stage Semantic Framework (HADSF), a two-stage approach that first induces a compact, corpus-level aspect vocabulary via adaptive selection and then performs vocabulary-guided, explicitly constrained extraction of structured aspect-opinion triples. To assess the fidelity of the resulting representations, we introduce Aspect Drift Rate (ADR) and Opinion Fidelity Rate (OFR) and empirically uncover a nonmonotonic relationship between hallucination severity and rating prediction error. Experiments on approximately 3 million reviews across LLMs spanning 1.5B-70B parameters show that, when integrated into standard rating predictors, HADSF yields consistent reductions in prediction error and enables smaller models to achieve competitive performance in representative deployment scenarios. We release code, data pipelines, and metric implementations to support reproducible research on hallucination-aware, LLM-enhanced explainable recommendation. Code is available at https://github.com/niez233/HADSF
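
The precise definitions of ADR and OFR are given in the paper itself; one plausible reading of Aspect Drift Rate, sketched here purely for illustration, is the fraction of extracted aspect terms falling outside the induced corpus-level vocabulary:

```python
def aspect_drift_rate(extracted_aspects, aspect_vocabulary):
    """One plausible reading of Aspect Drift Rate (ADR): the fraction of
    extracted aspect terms that fall outside the induced aspect vocabulary.
    (Illustrative only; not necessarily the paper's exact definition.)"""
    if not extracted_aspects:
        return 0.0
    vocab = {a.lower() for a in aspect_vocabulary}
    drifted = sum(1 for a in extracted_aspects if a.lower() not in vocab)
    return drifted / len(extracted_aspects)

vocab = ["battery", "screen", "price"]
extracted = ["battery", "screen", "unicorn", "price"]  # "unicorn" is hallucinated
adr = aspect_drift_rate(extracted, vocab)  # 1 of 4 extracted terms drifted
```

A metric of this shape makes hallucination measurable against the vocabulary that stage one of HADSF induces, which is what lets the paper relate drift severity to downstream rating error.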

Updated: 2025-11-03 19:51:14

Categories: cs.LG

Download: http://arxiv.org/abs/2510.26994v2

Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior

Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps. Furthermore, we also show functional equivalence via linear mode connectivity, wherein interpolated models across narrow misalignment tasks maintain coherent, broadly misaligned behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that harmful behaviors may be organized into specific, predictable regions of the weight landscape. By revealing this fundamental connection between parametric geometry and behavioral outcomes, we hope our work catalyzes further research on parameter space interpretability and weight-based interventions.

Updated: 2025-11-03 19:50:24

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.02022v1

Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025

The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year's program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of "Explainability, Interpretability, and Transparency," "Uncertainty, Bias, and Fairness," "Causality," "Domain Adaptation," "Foundation Models," "Learning from Small Medical Data," "Multimodal Methods," and "Scalable, Translational Healthcare Solutions."

Updated: 2025-11-03 19:48:57

Categories: cs.LG

Download: http://arxiv.org/abs/2510.15217v3

Street Review: A Participatory AI-Based Framework for Assessing Streetscape Inclusivity

Urban centers undergo social, demographic, and cultural changes that shape public street use and require systematic evaluation of public spaces. This study presents Street Review, a mixed-methods approach that combines participatory research with AI-based analysis to assess streetscape inclusivity. In Montréal, Canada, 28 residents participated in semi-directed interviews and image evaluations, supported by the analysis of approximately 45,000 street-view images from Mapillary. The approach produced visual analytics, such as heatmaps, to correlate subjective user ratings with physical attributes like sidewalk, maintenance, greenery, and seating. Findings reveal variations in perceptions of inclusivity and accessibility across demographic groups, demonstrating that incorporating diverse user feedback can enhance machine learning models through careful data-labeling and co-production strategies. The Street Review framework offers a systematic method for urban planners and policy analysts to inform planning, policy development, and management of public streets.

Updated: 2025-11-03 19:45:29

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2508.11708v3

TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding

Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
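
The abstract does not specify TapOut's meta-algorithm beyond "multi-armed bandits"; a minimal UCB1 sketch choosing among hypothetical speculation strategies by observed reward (e.g., accepted draft tokens per verification step) illustrates the general mechanism:

```python
import math

class UCB1:
    """Minimal UCB1 bandit: pick the arm with the best optimistic reward estimate."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.totals = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for arm in range(len(self.counts)):    # play each arm once first
            if self.counts[arm] == 0:
                return arm
        def ucb(arm):
            mean = self.totals[arm] / self.counts[arm]
            return mean + math.sqrt(2 * math.log(self.t) / self.counts[arm])
        return max(range(len(self.counts)), key=ucb)

    def update(self, arm, reward):             # reward: e.g. speedup observed this step
        self.counts[arm] += 1
        self.totals[arm] += reward

bandit = UCB1(n_arms=3)                        # three hypothetical speculation policies
true_rewards = [0.2, 0.9, 0.5]                 # deterministic rewards for illustration
for _ in range(300):
    arm = bandit.select()
    bandit.update(arm, true_rewards[arm])
best = max(range(3), key=lambda a: bandit.counts[a])
```

Because UCB1 is parameter-free apart from its confidence bonus, a meta-algorithm of this shape needs no hand-tuned thresholds, which is the property the abstract emphasizes.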

Updated: 2025-11-03 19:42:25

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2511.02017v1

Multi-Objective Planning with Contextual Lexicographic Reward Preferences

Autonomous agents are often required to plan under multiple objectives whose preference ordering varies based on context. The agent may encounter multiple contexts during its course of operation, each imposing a distinct lexicographic ordering over the objectives, with potentially different reward functions associated with each context. Existing approaches to multi-objective planning typically consider a single preference ordering over the objectives, across the state space, and do not support planning under multiple objective orderings within an environment. We present Contextual Lexicographic Markov Decision Process (CLMDP), a framework that enables planning under varying lexicographic objective orderings, depending on the context. In a CLMDP, both the objective ordering at a state and the associated reward functions are determined by the context. We employ a Bayesian approach to infer a state-context mapping from expert trajectories. Our algorithm to solve a CLMDP first computes a policy for each objective ordering and then combines them into a single context-aware policy that is valid and cycle-free. The effectiveness of the proposed approach is evaluated in simulation and using a mobile robot.
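
The combination step can be pictured as routing each state through the inferred state-context mapping to the policy computed for that context's lexicographic ordering. A schematic sketch with hypothetical names (the paper's actual algorithm additionally guarantees the combined policy is valid and cycle-free):

```python
def combine_policies(state_to_context, context_policies):
    """Schematic CLMDP-style combination: route each state to the policy of its
    inferred context. (The paper's procedure also enforces cycle-freeness.)"""
    def policy(state):
        context = state_to_context[state]      # mapping inferred from expert trajectories
        return context_policies[context](state)
    return policy

# Hypothetical example: the "safety over speed" ordering applies in a corridor,
# while the "speed over safety" ordering applies in open areas.
state_to_context = {"corridor": "safety_first", "open_area": "speed_first"}
context_policies = {
    "safety_first": lambda s: "slow",
    "speed_first": lambda s: "fast",
}
pi = combine_policies(state_to_context, context_policies)
```

Each per-ordering policy is solved independently first, and the combined policy simply dispatches on context, which is why the framework scales with the number of orderings rather than their product.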

Updated: 2025-11-03 19:32:36

Categories: cs.AI,cs.RO,cs.SY

Download: http://arxiv.org/abs/2502.10476v2

Proof-of-Spiking-Neurons(PoSN): Neuromorphic Consensus for Next-Generation Blockchains

Blockchain systems face persistent challenges of scalability, latency, and energy inefficiency. Existing consensus protocols such as Proof-of-Work (PoW) and Proof-of-Stake (PoS) either consume excessive resources or risk centralization. This paper proposes Proof-of-Spiking-Neurons (PoSN), a neuromorphic consensus protocol inspired by spiking neural networks. PoSN encodes transactions as spike trains, elects leaders through competitive firing dynamics, and finalizes blocks via neural synchronization, enabling parallel and event-driven consensus with minimal energy overhead. A hybrid system architecture is implemented on neuromorphic platforms, supported by simulation frameworks such as Nengo and PyNN. Experimental results show significant gains in energy efficiency, throughput, and convergence compared to PoB and PoR. PoSN establishes a foundation for sustainable, adaptive blockchains suitable for IoT, edge, and large-scale distributed systems.

Updated: 2025-11-03 19:31:47

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.02868v1

TraCS: Trajectory Collection in Continuous Space under Local Differential Privacy

Trajectory collection is fundamental for location-based services but often involves sensitive information, such as users' daily activities, raising significant privacy concerns. Local Differential Privacy (LDP) provides strong privacy guarantees for users, even when the data collector is untrusted. Existing trajectory collection methods under LDP are limited to discrete location spaces, where the number of locations affects both privacy guarantees and trajectory utility. Moreover, many real-world scenarios, such as flying trajectories or sensor trajectories of wearable devices, operate in continuous location spaces, making existing methods inadequate. This paper shifts the focus from discrete to continuous spaces for trajectory collection under LDP. We propose two novel methods: TraCS-D, which perturbs the direction and distance of locations, and TraCS-C, which perturbs the Cartesian coordinates of locations. Both methods are theoretically and experimentally analyzed for trajectory utility in continuous spaces. TraCS can also be applied to discrete spaces by rounding perturbed locations to the nearest discrete points. In this case, TraCS's privacy and utility guarantees are independent of the number of locations in the space, and has only $\Theta(1)$ time complexity in each perturbation generation. Evaluation results on discrete location spaces validate the efficiency advantage and show that TraCS outperforms state-of-the-art methods with improved trajectory utility, especially for large privacy parameters.
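
The abstract does not give TraCS-D's actual mechanism; the general shape of direction-and-distance perturbation can be illustrated with Laplace noise applied to the polar coordinates of a location. Note that the noise scales below are schematic and are not calibrated to certify any formal epsilon-LDP guarantee:

```python
import math, random

def perturb_polar(x, y, eps, rng=random.Random(0)):
    """Illustrative direction-and-distance perturbation in the spirit of TraCS-D:
    add Laplace noise to the angle and the distance separately. NOTE: the noise
    calibration here is schematic and does NOT by itself certify eps-LDP."""
    def laplace(scale):
        # Inverse-CDF sampling of a zero-mean Laplace variable.
        u = rng.random() - 0.5
        return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)
    r = math.hypot(x, y)
    theta = math.atan2(y, x)
    theta_p = theta + laplace(2.0 / eps)       # perturb direction
    r_p = max(0.0, r + laplace(2.0 / eps))     # perturb distance, keep nonnegative
    return r_p * math.cos(theta_p), r_p * math.sin(theta_p)

xp, yp = perturb_polar(3.0, 4.0, eps=1.0)
```

Working in (direction, distance) space rather than on a discrete location grid is what lets the privacy and utility guarantees stay independent of the number of locations, with constant-time perturbation per report.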

Updated: 2025-11-03 19:21:08

Categories: cs.CR,68P27

Download: http://arxiv.org/abs/2412.00620v2

Bulk-boundary decomposition of neural networks

We present the bulk-boundary decomposition as a new framework for understanding the training dynamics of deep neural networks. Starting from the stochastic gradient descent formulation, we show that the Lagrangian can be reorganized into a data-independent bulk term and a data-dependent boundary term. The bulk captures the intrinsic dynamics set by network architecture and activation functions, while the boundary reflects stochastic interactions from training samples at the input and output layers. This decomposition exposes the local and homogeneous structure underlying deep networks. As a natural extension, we develop a field-theoretic formulation of neural dynamics based on this decomposition.

Updated: 2025-11-03 19:18:20

Categories: cs.LG,cond-mat.dis-nn,hep-ph

Download: http://arxiv.org/abs/2511.02003v1

InteracSPARQL: An Interactive System for SPARQL Query Refinement Using Natural Language Explanations

In recent years, querying semantic web data using SPARQL has remained challenging, especially for non-expert users, due to the language's complex syntax and the prerequisite of understanding intricate data structures. To address these challenges, we propose InteracSPARQL, an interactive SPARQL query generation and refinement system that leverages natural language explanations (NLEs) to enhance user comprehension and facilitate iterative query refinement. InteracSPARQL integrates LLMs with a rule-based approach to first produce structured explanations directly from SPARQL abstract syntax trees (ASTs), followed by LLM-based linguistic refinements. Users can interactively refine queries through direct feedback or LLM-driven self-refinement, enabling the correction of ambiguous or incorrect query components in real time. We evaluate InteracSPARQL on standard benchmarks, demonstrating significant improvements in query accuracy, explanation clarity, and overall user satisfaction compared to baseline approaches. Our experiments further highlight the effectiveness of combining rule-based methods with LLM-driven refinements to create more accessible and robust SPARQL interfaces.

Updated: 2025-11-03 19:15:51

Categories: cs.DB,cs.AI,cs.IR

Download: http://arxiv.org/abs/2511.02002v1

Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.

Updated: 2025-11-03 19:15:16

Domains: cs.CL,cs.AI,cs.LG,I.2.6; I.2.7

Download: http://arxiv.org/abs/2505.23495v4

Quantifying Classifier Utility under Local Differential Privacy

Local differential privacy (LDP) offers rigorous, quantifiable privacy guarantees for personal data by introducing perturbations at the data source. Understanding how these perturbations affect classifier utility is crucial for both designers and users. However, a general theoretical framework for quantifying this impact is lacking and also challenging, especially for complex or black-box classifiers. This paper presents a unified framework for theoretically quantifying classifier utility under LDP mechanisms. The key insight is that LDP perturbations are concentrated around the original data with a specific probability, allowing utility analysis to be reframed as robustness analysis within this concentrated region. Our framework thus connects the concentration properties of LDP mechanisms with the robustness of classifiers, treating LDP mechanisms as general distributional functions and classifiers as black boxes. This generality enables applicability to any LDP mechanism and classifier. A direct application of our utility quantification is guiding the selection of LDP mechanisms and privacy parameters for a given classifier. Notably, our analysis shows that piecewise-based mechanisms often yield better utility than alternatives in common scenarios. Beyond the core framework, we introduce two novel refinement techniques that further improve utility quantification. We then present case studies illustrating utility quantification for various combinations of LDP mechanisms and classifiers. Results demonstrate that our theoretical quantification closely matches empirical observations, particularly when classifiers operate in lower-dimensional input spaces.
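
The key insight — LDP perturbations concentrate around the original value with a quantifiable probability — can be illustrated with the 1-D Laplace mechanism. The threshold classifier, sensitivity, and parameters below are assumptions chosen for illustration, not the paper's actual framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_ldp(x, eps, sensitivity=1.0):
    # Laplace mechanism: noise scale grows as the privacy budget eps shrinks.
    return x + rng.laplace(scale=sensitivity / eps, size=x.shape)

def concentration(eps, r, sensitivity=1.0):
    # P(|noise| <= r) = 1 - exp(-eps * r / sensitivity) for 1-D Laplace noise,
    # i.e. the mass concentrated within radius r of the original value.
    return 1.0 - np.exp(-eps * r / sensitivity)

def classify(x):
    # A fixed threshold classifier, treated as a black box.
    return (x > 0.5).astype(int)

x = rng.uniform(0, 1, size=100_000)
labels = classify(x)
for eps in (0.5, 2.0, 8.0):
    noisy_acc = np.mean(classify(laplace_ldp(x, eps)) == labels)
    print(f"eps={eps}: P(|noise|<=0.1)={concentration(eps, 0.1):.3f}, "
          f"accuracy under LDP={noisy_acc:.3f}")
```

Larger eps concentrates more noise mass near the original input, and the black-box classifier's accuracy under perturbation rises accordingly — the robustness-within-a-region view the paper formalizes.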

Updated: 2025-11-03 19:13:36

Domains: cs.CR,E.3

Download: http://arxiv.org/abs/2507.02727v2

TRACE: Textual Reasoning for Affordance Coordinate Extraction

Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more challenging W2P(h) subset. Crucially, an ablation study demonstrates that performance scales directly with the amount of reasoning data used, confirming the CoR's effectiveness. Furthermore, analysis of the model's attention maps reveals an interpretable reasoning process where focus shifts dynamically across reasoning steps. This work shows that training VLMs to generate a textual CoR is an effective and robust strategy for enhancing the precision, reliability, and interpretability of VLM-based robot control. Our dataset and code are available at https://github.com/jink-ucla/TRACE

Updated: 2025-11-03 19:13:26

Domains: cs.RO,cs.AI

Download: http://arxiv.org/abs/2511.01999v1

Trove: A Flexible Toolkit for Dense Retrieval

We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.

Updated: 2025-11-03 18:59:57

Domains: cs.IR,cs.AI

Download: http://arxiv.org/abs/2511.01857v1

SmartMLOps Studio: Design of an LLM-Integrated IDE with Automated MLOps Pipelines for Model Development and Monitoring

The rapid expansion of artificial intelligence and machine learning (ML) applications has intensified the demand for integrated environments that unify model development, deployment, and monitoring. Traditional Integrated Development Environments (IDEs) focus primarily on code authoring, lacking intelligent support for the full ML lifecycle, while existing MLOps platforms remain detached from the coding workflow. To address this gap, this study proposes the design of an LLM-Integrated IDE with automated MLOps pipelines that enables continuous model development and monitoring within a single environment. The proposed system embeds a Large Language Model (LLM) assistant capable of code generation, debugging recommendation, and automatic pipeline configuration. The backend incorporates automated data validation, feature storage, drift detection, retraining triggers, and CI/CD deployment orchestration. This framework was implemented in a prototype named SmartMLOps Studio and evaluated using classification and forecasting tasks on the UCI Adult and M5 datasets. Experimental results demonstrate that SmartMLOps Studio reduces pipeline configuration time by 61%, improves experiment reproducibility by 45%, and increases drift detection accuracy by 14% compared to traditional workflows. By bridging intelligent code assistance and automated operational pipelines, this research establishes a novel paradigm for AI engineering - transforming the IDE from a static coding tool into a dynamic, lifecycle-aware intelligent platform for scalable and efficient model development.

Updated: 2025-11-03 18:56:59

Domains: cs.SE,cs.AI

Download: http://arxiv.org/abs/2511.01850v1

GTAlign: Game-Theoretic Alignment of LLM Assistants for Social Welfare

Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a social welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and social welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
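
The payoff-matrix idea can be sketched in a few lines: the assistant scores each candidate response style for both its own utility and the user's welfare, then picks the action that maximizes social welfare rather than its own reward. All action names, payoff numbers, and weights below are invented for illustration and are not GTAlign's actual values.

```python
# Toy payoff-matrix decision for a simple factual query.
# (model_utility, user_welfare) pairs are illustrative assumptions.
payoffs = {
    "concise answer":      (0.9, 0.9),
    "verbose reasoning":   (0.6, 0.4),
    "clarifying question": (0.7, 0.3),
}

def social_welfare(payoff, w_model=0.5, w_user=0.5):
    # Weighted sum of both players' utilities; a purely selfish
    # policy would instead maximize the model term alone.
    m, u = payoff
    return w_model * m + w_user * u

best = max(payoffs, key=lambda a: social_welfare(payoffs[a]))
print(best)  # "concise answer" dominates for this toy query
```

In GTAlign this reasoning happens inside the model's chain of thought, and training reinforces actions with high social welfare rather than only high model reward.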

Updated: 2025-11-03 18:54:17

Domains: cs.AI,cs.GT,cs.HC,cs.LG,cs.MA

Download: http://arxiv.org/abs/2510.08872v3

Towards Robust Mathematical Reasoning

Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists and that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of the gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations and construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community towards advancing robust mathematical reasoning and release it at https://imobench.github.io/.

Updated: 2025-11-03 18:53:02

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.01846v1

A Detailed Study on LLM Biases Concerning Corporate Social Responsibility and Green Supply Chains

Organizations increasingly use Large Language Models (LLMs) to improve supply chain processes and reduce environmental impacts. However, LLMs have been shown to reproduce biases regarding the prioritization of sustainable business strategies. Thus, it is important to identify underlying training data biases that LLMs pertain regarding the importance and role of sustainable business and supply chain practices. This study investigates how different LLMs respond to validated surveys about the role of ethics and responsibility for businesses, and the importance of sustainable practices and relations with suppliers and customers. Using standardized questionnaires, we systematically analyze responses generated by state-of-the-art LLMs to identify variations. We further evaluate whether differences are augmented by four organizational culture types, thereby evaluating the practical relevance of identified biases. The findings reveal significant systematic differences between models and demonstrate that organizational culture prompts substantially modify LLM responses. The study holds important implications for LLM-assisted decision-making in sustainability contexts.

Updated: 2025-11-03 18:48:48

Domains: cs.CY,cs.AI

Download: http://arxiv.org/abs/2511.01840v1

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents' interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS-Xs/SE-Agent.

Updated: 2025-11-03 18:47:32

Domains: cs.AI

Download: http://arxiv.org/abs/2508.02085v6

TabArena: A Living Benchmark for Machine Learning on Tabular Data

With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

Updated: 2025-11-03 18:47:03

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.16791v4

Efficient Vector Symbolic Architectures from Histogram Recovery

Vector symbolic architectures (VSAs) are a family of information representation techniques which enable composition, i.e., creating complex information structures from atomic vectors via binding and superposition, and have recently found wide ranging applications in various neurosymbolic artificial intelligence (AI) systems. Recently, Raviv proposed the use of random linear codes in VSAs, suggesting that their subcode structure enables efficient binding, while preserving the quasi-orthogonality that is necessary for neural processing. Yet, random linear codes are difficult to decode under noise, which severely limits the resulting VSA's ability to support recovery, i.e., the retrieval of information objects and their attributes from a noisy compositional representation. In this work we bridge this gap by utilizing coding theoretic tools. First, we argue that the concatenation of Reed-Solomon and Hadamard codes is suitable for VSA, due to the mutual quasi-orthogonality of the resulting codewords (a folklore result). Second, we show that recovery of the resulting compositional representations can be done by solving a problem we call histogram recovery. In histogram recovery, a collection of $N$ histograms over a finite field is given as input, and one must find a collection of Reed-Solomon codewords of length $N$ whose entry-wise symbol frequencies obey those histograms. We present an optimal solution to the histogram recovery problem by using algorithms related to list-decoding, and analyze the resulting noise resilience. Our results give rise to a noise-resilient VSA with formal guarantees regarding efficient encoding, quasi-orthogonality, and recovery, without relying on any heuristics or training, and while operating at improved parameters relative to similar solutions such as the Hadamard code.
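
The histogram recovery problem itself is easy to state concretely. The brute-force toy below uses a tiny Reed-Solomon code over GF(5) with a hidden size-2 codeword collection; the paper's actual solution uses list-decoding-related algorithms (and a Hadamard concatenation) rather than enumeration, which is omitted here.

```python
from itertools import product, combinations_with_replacement
from collections import Counter

q, k, points = 5, 2, [0, 1, 2, 3]   # GF(5), degree-<2 polynomials, N = 4

def codeword(coeffs):
    # Reed-Solomon codeword: evaluations of the polynomial at the points.
    return tuple(sum(c * (x ** i) for i, c in enumerate(coeffs)) % q
                 for x in points)

codewords = [codeword(c) for c in product(range(q), repeat=k)]

def histograms(words):
    # One symbol-frequency histogram per coordinate of the code.
    return tuple(Counter(w[i] for w in words) for i in range(len(points)))

# A hidden collection of two codewords defines the target histograms.
hidden = (codeword((1, 2)), codeword((3, 1)))   # 1+2x and 3+x over GF(5)
target = histograms(hidden)

# Histogram recovery, brute-forced: all size-2 multisets of codewords
# whose entry-wise symbol frequencies match the target histograms.
solutions = [pair for pair in combinations_with_replacement(codewords, 2)
             if histograms(pair) == target]
print(len(solutions), "matching collection(s) found")
```

The hidden collection is always among the solutions; ambiguity (multiple matching collections) is exactly what the paper's noise-resilience analysis has to control.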

Updated: 2025-11-03 18:45:47

Domains: cs.IT,cs.AI,cs.NE,math.IT

Download: http://arxiv.org/abs/2511.01838v1

Cold-Start Active Preference Learning in Socio-Economic Domains

Active preference learning offers an efficient approach to modeling preferences, but it is hindered by the cold-start problem, which leads to a marked decline in performance when no initial labeled data are available. While cold-start solutions have been proposed for domains such as vision and text, the cold-start problem in active preference learning remains largely unexplored, underscoring the need for practical, effective methods. Drawing inspiration from established practices in social and economic research, the proposed method initiates learning with a self-supervised phase that employs Principal Component Analysis (PCA) to generate initial pseudo-labels. This process produces a \say{warmed-up} model based solely on the data's intrinsic structure, without requiring expert input. The model is then refined through an active learning loop that strategically queries a simulated noisy oracle for labels. Experiments conducted on various socio-economic datasets, including those related to financial credibility, career success rate, and socio-economic status, consistently show that the PCA-driven approach outperforms standard active learning strategies that start without prior information. This work thus provides a computationally efficient and straightforward solution that effectively addresses the cold-start problem.
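
A minimal sketch of the warm-up idea follows, using synthetic data: PCA supplies pseudo-labels from the data's intrinsic structure, a simple model warms up on them, and one active-learning query then targets the most uncertain example. The dataset, model, and uncertainty criterion are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a socio-economic table (rows: people, cols: features).
X = rng.normal(size=(200, 5))
X[:, 0] += 2.0 * (rng.random(200) > 0.5)           # one informative direction

# --- Self-supervised warm-up: PCA pseudo-labels, no expert input. ---
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[0]                                 # projection on first PC
pseudo_y = (scores > np.median(scores)).astype(float)

# Warm up a logistic model on the pseudo-labels (plain gradient descent).
w = np.zeros(X.shape[1])
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(Xc @ w)))
    w -= 0.1 * Xc.T @ (p - pseudo_y) / len(Xc)

# --- Active step: query the most uncertain example for an oracle label. ---
p = 1.0 / (1.0 + np.exp(-(Xc @ w)))
query_idx = int(np.argmin(np.abs(p - 0.5)))         # closest to the boundary
print("query example:", query_idx, "model prob:", round(float(p[query_idx]), 3))
```

In the paper the oracle responses are noisy and the query loop repeats, refining the warmed-up model instead of starting from nothing.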

Updated: 2025-11-03 18:44:16

Domains: cs.LG

Download: http://arxiv.org/abs/2508.05090v2

RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs

Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
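
The Perceiver-style aggregation step can be sketched with plain NumPy: a fixed set of learned latent queries attends over a variable-length set of feature tokens and returns a fixed-size, permutation-invariant representation. Dimensions and the single-head form are assumptions; RELATE's real module is trained and multi-modal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_latents = 8, 4

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# Learned latent queries (fixed size) attend over a variable token set.
latents = rng.normal(size=(n_latents, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def aggregate(tokens):
    q, k, v = latents @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))            # (n_latents, n_tokens)
    return attn @ v                                  # fixed (n_latents, d)

tokens = rng.normal(size=(11, d))                    # 11 typed-feature tokens
out = aggregate(tokens)
shuffled = aggregate(tokens[rng.permutation(11)])
print(out.shape, np.allclose(out, shuffled))         # (4, 8) True
```

Because the output size depends only on the number of latents, the same encoder serves any schema, which is what enables the parameter sharing and multi-dataset pretraining the abstract describes.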

Updated: 2025-11-03 18:42:57

Domains: cs.AI,cs.DB,cs.LG

Download: http://arxiv.org/abs/2510.19954v3

Simulating Environments with Reasoning Models for Agent Training

LLM agents excel in compact environments requiring deep reasoning but remain brittle when operating in broader, more complex contexts that demand robustness across diverse tools and schemas. Building bespoke environments for training is heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs can simulate realistic environment feedback without access to actual testbed data or APIs. Inspired by this capability, we propose two frameworks: Simia-SFT, a pipeline that synthesizes SFT data by amplifying small seed sets into diverse trajectories in an environment-agnostic manner, and Simia-RL, a framework that enables RL training without real environment implementations through LLM-simulated feedback. Fine-tuning open models yields consistent improvements across multiple benchmarks, surpassing GPT-4o and approaching o4-mini on $\tau^2$-Bench. Together, Simia-SFT and Simia-RL enable scalable agent training without environment engineering, replacing heavy and brittle implementations with flexible LLM-based simulation.

Updated: 2025-11-03 18:29:57

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.01824v1

Machine and Deep Learning for Indoor UWB Jammer Localization

Ultra-wideband (UWB) localization delivers centimeter-scale accuracy but is vulnerable to jamming attacks, creating security risks for asset tracking and intrusion detection in smart buildings. Although machine learning (ML) and deep learning (DL) methods have improved tag localization, localizing malicious jammers within a single room and across changing indoor layouts remains largely unexplored. Two novel UWB datasets, collected under original and modified room configurations, are introduced to establish comprehensive ML/DL baselines. Performance is rigorously evaluated using a variety of classification and regression metrics. On the source dataset with the collected UWB features, Random Forest achieves the highest F1-macro score of 0.95 and XGBoost achieves the lowest mean Euclidean error of 20.16 cm. However, deploying these source-trained models in the modified room layout led to severe performance degradation, with XGBoost's mean Euclidean error increasing tenfold to 207.99 cm, demonstrating significant domain shift. To mitigate this degradation, a domain-adversarial ConvNeXt autoencoder (A-CNT) is proposed that leverages a gradient-reversal layer to align CIR-derived features across domains. The A-CNT framework restores localization performance by reducing the mean Euclidean error to 34.67 cm. This represents a 77 percent improvement over non-adversarial transfer learning and an 83 percent improvement over the best baseline, restoring the fraction of samples within 30 cm to 0.56. Overall, the results demonstrate that adversarial feature alignment enables robust and transferable indoor jammer localization despite environmental changes. Code and dataset available at https://github.com/afbf4c8996f/Jammer-Loc
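
The gradient-reversal trick at the heart of the A-CNT can be written out with an explicit chain rule: the forward pass through the reversal layer is the identity, but the gradient flowing back into the feature extractor is negated, so the domain head learns to predict the domain while the extractor learns to confuse it. The toy data, single linear layers, and learning rates below are illustrative assumptions, not the paper's ConvNeXt architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(32, 10))               # toy stand-in for CIR features
dom = (rng.random(32) > 0.5).astype(float)  # domain labels: source vs modified room
W_feat = rng.normal(size=(10, 6)) * 0.1     # feature extractor
w_dom = rng.normal(size=6) * 0.1            # domain classifier head
lam = 1.0                                   # reversal strength

Z = X @ W_feat                              # forward: the GRL is the identity here
p = 1.0 / (1.0 + np.exp(-(Z @ w_dom)))      # domain probability

dlogits = (p - dom) / len(dom)              # d(BCE)/d(logits)
grad_w_dom = Z.T @ dlogits                  # head: ordinary gradient descent
dZ = np.outer(dlogits, w_dom)               # gradient arriving at the features
grad_W_feat = X.T @ (-lam * dZ)             # GRL: sign flipped before the extractor

w_dom -= 0.1 * grad_w_dom                   # head gets better at the domain task
W_feat -= 0.1 * grad_W_feat                 # extractor is pushed the opposite way
print(grad_W_feat.shape)                    # (10, 6)
```

Iterating these opposing updates aligns the feature distributions of the two room layouts, which is what restores localization accuracy after the domain shift.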

Updated: 2025-11-03 18:26:14

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.01819v1

SimKey: A Semantically Aware Key Module for Watermarking Language Models

The rapid spread of text generated by large language models (LLMs) makes it increasingly difficult to distinguish authentic human writing from machine output. Watermarking offers a promising solution: model owners can embed an imperceptible signal into generated text, marking its origin. Most leading approaches seed an LLM's next-token sampling with a pseudo-random key that can later be recovered to identify the text as machine-generated, while only minimally altering the model's output distribution. However, these methods suffer from two related issues: (i) watermarks are brittle to simple surface-level edits such as paraphrasing or reordering; and (ii) adversaries can append unrelated, potentially harmful text that inherits the watermark, risking reputational damage to model owners. To address these issues, we introduce SimKey, a semantic key module that strengthens watermark robustness by tying key generation to the meaning of prior context. SimKey uses locality-sensitive hashing over semantic embeddings to ensure that paraphrased text yields the same watermark key, while unrelated or semantically shifted text produces a different one. Integrated with state-of-the-art watermarking schemes, SimKey improves watermark robustness to paraphrasing and translation while preventing harmful content from false attribution, establishing semantic-aware keying as a practical and extensible watermarking direction.
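
The locality-sensitive-hashing idea can be sketched with random-hyperplane LSH (SimHash): embeddings pointing in the same direction hash to the same key, while semantically shifted text lands elsewhere. The embedding vectors, dimensions, and key derivation below are illustrative assumptions; SimKey's actual embedding model and integration with the sampler may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 16, 8
hyperplanes = rng.normal(size=(n_bits, dim))     # shared LSH projections

def semantic_key(embedding):
    # SimHash: the sign pattern of random projections becomes the key.
    bits = (hyperplanes @ embedding) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

e = rng.normal(size=dim)        # embedding of the prior context
paraphrase = 1.05 * e           # same direction -> identical sign pattern
unrelated = -e                  # opposite direction -> every bit flips

print(semantic_key(e) == semantic_key(paraphrase))  # True
print(semantic_key(e) == semantic_key(unrelated))   # False
```

Seeding the watermark's pseudo-random key from this hash is what makes the watermark survive paraphrasing while refusing to transfer to appended unrelated text.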

Updated: 2025-11-03 18:20:37

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2510.12828v2

KV Cache Transform Coding for Compact Storage in LLM Inference

Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy, and 40$\times$ or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
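
A minimal transform-coding sketch in the spirit of KVTC: PCA-based decorrelation of "KV" vectors followed by uniform quantization (entropy coding omitted). All shapes, the calibration set, and the quantization step are illustrative assumptions, not KVTC's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
basis = rng.normal(size=(8, d))            # low-rank structure = redundancy
kv = rng.normal(size=(1000, 8)) @ basis + 0.01 * rng.normal(size=(1000, d))
mean = kv.mean(0)

# Calibration: PCA directions from a small sample via SVD.
_, _, components = np.linalg.svd(kv[:256] - mean, full_matrices=False)
proj = components[:8]                      # keep top-8 decorrelated dims

def compress(x, step=0.05):
    return np.round(((x - mean) @ proj.T) / step)   # decorrelate + quantize

def decompress(q, step=0.05):
    return (q * step) @ proj + mean

rec = decompress(compress(kv))
err = np.abs(rec - kv).mean()
print(f"kept {proj.shape[0]}/{d} dims, mean abs error {err:.4f}")
```

Because the calibration happens once and the model weights are untouched, the transform can be applied to cached KV tensors at storage time and inverted at reuse time.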

Updated: 2025-11-03 18:20:35

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.01815v1

Automotive Crash Dynamics Modeling Accelerated with Machine Learning

Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.

Updated: 2025-11-03 18:19:07

Categories: cs.LG,cs.AI,cs.NA,math.NA,physics.app-ph,physics.comp-ph

Download: http://arxiv.org/abs/2510.15201v3

A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

When we train models on biased datasets, they not only reproduce data biases, but can worsen them at test time - a phenomenon called bias amplification. Many of the current bias amplification metrics (e.g., BA (MALS), DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC has two more improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.

Updated: 2025-11-03 18:12:39

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.07878v4

Plan-and-Write: Structure-Guided Length Control for LLMs without Model Retraining

Length control in Large Language Models (LLMs) is a crucial but under-addressed challenge, with applications ranging from voice interfaces requiring concise responses to research summaries needing comprehensive outputs. Current approaches to length control, including Regularized DPO, Length-Instruction Fine Tuning, and tool-augmented methods, typically require expensive model retraining or complex inference-time tooling. This paper presents a prompt engineering methodology that enables precise length control without model retraining. Our structure-guided approach implements deliberate planning and word counting mechanisms within the prompt, encouraging the model to carefully track and adhere to specified length constraints. Comprehensive evaluations across six state-of-the-art LLMs demonstrate that our method significantly improves length fidelity for several models compared to standard prompting when applied to document summarization tasks, particularly for shorter-to-medium length constraints. The proposed technique shows varying benefits across different model architectures, with some models demonstrating up to 37.6% improvement in length adherence. Quality evaluations further reveal that our approach maintains or enhances overall output quality compared to standard prompting techniques. Our approach provides an immediately deployable solution for applications requiring precise length control, particularly valuable for production environments where model retraining is impractical or cost-prohibitive.
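
A hypothetical sketch of a structure-guided, length-controlled prompt: the model is asked to plan a word budget, keep a running word count while writing, and self-check before answering. The exact wording below is an assumption, not the paper's template.

```python
def build_length_controlled_prompt(document: str, target_words: int) -> str:
    # Embed planning and word-counting instructions directly in the prompt.
    return (
        f"Summarize the document below in EXACTLY {target_words} words.\n"
        "Follow these steps:\n"
        "1. PLAN: list the key points and allocate a word budget to each.\n"
        "2. WRITE: draft the summary, appending a running word count\n"
        "   after every sentence in the form [count=N].\n"
        f"3. CHECK: if the final count differs from {target_words}, "
        "revise before answering.\n\n"
        f"Document:\n{document}"
    )

prompt = build_length_controlled_prompt("<article text here>", 50)
print(prompt.splitlines()[0])
```

No retraining or tooling is involved: the same template is reusable across models, which is the deployability argument the abstract makes.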

Updated: 2025-11-03 18:10:42

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.01807v1

CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning

Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks -- to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide several baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches -- from simple linear models that are minimally constrained by symmetries to much larger and more computationally-demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training time. Still there remains tremendous potential to improve these baselines by combining machine learning and cosmology to fully exploit the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this dataset, available at https://cosmobench.streamlit.app
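
The "least-squares fit with a handful of invariant features" baseline can be sketched as follows: translation/rotation-invariant summaries of a point cloud feed a linear fit. The data here is synthetic (cloud spread encodes the target), standing in for the halo catalogs and cosmological parameters of CosmoBench.

```python
import numpy as np

rng = np.random.default_rng(2)

def invariant_features(cloud):
    c = cloud - cloud.mean(0)              # translation invariance
    r = np.linalg.norm(c, axis=1)          # radii are rotation-invariant
    return np.array([r.mean(), r.std(), (r ** 2).mean()])

params = rng.uniform(0.5, 2.0, size=200)   # synthetic "parameters"
X = np.stack([invariant_features(rng.normal(scale=p, size=(64, 3)))
              for p in params])
A = np.hstack([X, np.ones((len(params), 1))])   # add bias column
w, *_ = np.linalg.lstsq(A, params, rcond=None)
pred = A @ w
r2 = 1 - ((pred - params) ** 2).sum() / ((params - params.mean()) ** 2).sum()
print("R^2 =", round(r2, 3))
```

Three scalar features and four fitted weights suffice on this toy task, which mirrors the abstract's observation that symmetry-aware linear fits can be surprisingly competitive with large graph networks.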

Updated: 2025-11-03 18:09:02

Categories: cs.LG,astro-ph.CO,astro-ph.IM

Download: http://arxiv.org/abs/2507.03707v2

Fractional Diffusion Bridge Models

We present Fractional Diffusion Bridge Models (FDBM), a novel generative diffusion bridge framework driven by an approximation of the rich and non-Markovian fractional Brownian motion (fBM). Real stochastic processes exhibit a degree of memory effects (correlations in time), long-range dependencies, roughness and anomalous diffusion phenomena that are not captured in standard diffusion or bridge modeling due to the use of Brownian motion (BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA-fBM), we construct FDBM that enable tractable inference while preserving the non-Markovian nature of fBM. We prove the existence of a coupling-preserving generative diffusion bridge and leverage it for future state prediction from paired training data. We then extend our formulation to the Schr\"{o}dinger bridge problem and derive a principled loss function to learn the unpaired data translation. We evaluate FDBM on both tasks: predicting future protein conformations from aligned data, and unpaired image translation. In both settings, FDBM achieves superior performance compared to the Brownian baselines, yielding lower root mean squared deviation (RMSD) of C$_\alpha$ atomic positions in protein structure prediction and lower Fr\'echet Inception Distance (FID) in unpaired image translation.
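
The Markovian approximation of fBM (MA-fBM) that FDBM builds on represents the process as a weighted sum of Ornstein-Uhlenbeck (OU) processes driven by one shared Brownian motion. The speeds and weights below are illustrative placeholders, not the optimized MA-fBM coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
n, dt = 1000, 1e-3
speeds = np.array([0.5, 2.0, 8.0, 32.0])   # OU mean-reversion rates (assumed)
weights = np.array([0.6, 0.3, 0.08, 0.02]) # mixture weights (assumed)

dW = rng.normal(scale=np.sqrt(dt), size=n) # one shared Brownian driver
Y = np.zeros(len(speeds))                  # OU states, all start at 0
path = np.empty(n)
for i in range(n):
    Y = Y - speeds * Y * dt + dW[i]        # Euler step: dY = -gamma*Y dt + dW
    path[i] = weights @ Y                  # approximate fBM value
print(path.shape)
```

Because the joint state (the vector of OU processes) is Markovian, standard diffusion-bridge machinery stays tractable while the weighted sum retains memory effects that a single Brownian motion lacks.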

Updated: 2025-11-03 17:51:10

Categories: cs.LG,cs.AI,cs.CV,cs.RO,stat.ML

Download: http://arxiv.org/abs/2511.01795v1

Random Initialization of Gated Sparse Adapters

When fine-tuning language models on new tasks, catastrophic forgetting -- performance degradation on previously-learned tasks -- is a ubiquitous problem. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA address this through low-rank adapters, sparse adaptation offers an alternative that doesn't impose rank constraints. We introduce Random Initialization of Gated Sparse Adapters (RIGSA), which starts from randomly-initialized full-rank adapters, gates them with a ReZero analog, and sparsifies them with iterative magnitude pruning. We evaluate RIGSA on SmolLM2-1.7B-Instruct using a novel vision-in-text task (Textual MNIST) and measure forgetting on PIQA, HellaSwag, and GSM8k. SmolLM2-1.7B-Instruct initially performs around chance level on Textual MNIST, and is capable of learning the task through RIGSA, 4-bit QLoRA and random masking. In spite of having more trainable parameters than QLoRA, the RIGSA configurations that we studied displayed less forgetting than QLoRA, particularly on GSM8k, though it performs comparably to random masking.
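
A toy sketch of RIGSA's ingredients: a randomly initialized full-rank adapter A gated by a ReZero-style scalar alpha (initialized to zero) and sparsified by magnitude pruning. Sizes and the pruning fraction are illustrative; the real method trains between pruning rounds.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
W = rng.normal(size=(d, d))      # frozen base weight
A = rng.normal(size=(d, d))      # random full-rank adapter
alpha = 0.0                      # ReZero gate: closed at init
mask = np.ones_like(A)           # sparsity mask: dense at init

def forward(x):
    return (W + alpha * (mask * A)) @ x

x = rng.normal(size=d)
assert np.allclose(forward(x), W @ x)   # gate closed -> exact base model

# One iterative-magnitude-pruning round: drop the smallest 50% of
# adapter entries by absolute value (training would resume afterwards).
mask = (np.abs(A) >= np.quantile(np.abs(A), 0.5)).astype(float)
alpha = 0.1                             # gate opened by training
print("adapter sparsity:", 1 - mask.mean())
```

Starting at alpha = 0 means the fine-tuned model initially matches the base model exactly, and sparsity is imposed without any rank constraint on A.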

Updated: 2025-11-03 17:49:44

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2511.01794v1

GenDexHand: Generative Simulation for Dexterous Hands

Data scarcity remains a fundamental bottleneck for embodied intelligence. Existing approaches use large language models (LLMs) to automate gripper-based simulation generation, but they transfer poorly to dexterous manipulation, which demands more specialized environment design. Meanwhile, dexterous manipulation tasks are inherently more difficult due to their higher degrees of freedom. Massively generating feasible and trainable dexterous hand tasks remains an open challenge. To this end, we present GenDexHand, a generative simulation pipeline that autonomously produces diverse robotic tasks and environments for dexterous manipulation. GenDexHand introduces a closed-loop refinement process that adjusts object placements and scales based on vision-language model (VLM) feedback, substantially improving the average quality of generated environments. Each task is further decomposed into sub-tasks to enable sequential reinforcement learning, reducing training time and increasing success rates. Our work provides a viable path toward scalable training of diverse dexterous hand behaviors in embodied intelligence by offering a simulation-based solution to synthetic data generation. Our website: https://winniechen2002.github.io/GenDexHand/.

Updated: 2025-11-03 17:45:38

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2511.01791v1

LM-Fix: Lightweight Bit-Flip Detection and Rapid Recovery Framework for Language Models

This paper presents LM-Fix, a lightweight detection and rapid recovery framework for faults in large language models (LLMs). Existing integrity approaches are often heavy or slow for modern LLMs. LM-Fix runs a short test-vector pass and uses hash-guided checks to detect bit-flip faults, then repairs them locally without a full reload. Across multiple models, it detects over 94% of single-bit flips at TVL=200 and nearly 100% of multi-bit flips with approximately 1% to 7.7% runtime overhead; recovery is more than 100x faster than reloading. These results show a practical, low-overhead solution to keep LLMs reliable in production.
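
The hash-guided detection step can be sketched as per-block hashes of model weights recorded at calibration time: after a suspected fault, only blocks whose hash changed need repair. The test-vector pass and LM-Fix's actual repair strategy are not reproduced here.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(5)
blocks = [rng.normal(size=256).astype(np.float32) for _ in range(4)]
golden = [hashlib.sha256(b.tobytes()).hexdigest() for b in blocks]  # calibration

def find_corrupted(blocks):
    # Re-hash each block and compare against the golden digests.
    return [i for i, b in enumerate(blocks)
            if hashlib.sha256(b.tobytes()).hexdigest() != golden[i]]

# Inject a single bit flip into block 2.
raw = bytearray(blocks[2].tobytes())
raw[17] ^= 0x04
blocks[2] = np.frombuffer(bytes(raw), dtype=np.float32)

print(find_corrupted(blocks))  # [2] -- only this block needs restoring
```

Localizing the fault to one block is what makes recovery far cheaper than reloading every parameter from disk.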

Updated: 2025-11-03 17:37:39

Categories: cs.SE,cs.AI,cs.AR,cs.CR

Download: http://arxiv.org/abs/2511.02866v1

AudAgent: Automated Auditing of Privacy Policy Compliance in AI Agents

AI agents can autonomously perform tasks and, often without explicit user consent, collect or disclose users' sensitive local data, which raises serious privacy concerns. Although AI agents' privacy policies may describe their intended data practices, there remains limited transparency and accountability about whether runtime behavior matches those policies. To close this gap, we introduce AudAgent, a visual framework that continuously monitors AI agents' data practices in real time and guards compliance with stated privacy policies. AudAgent consists of four components for automated privacy auditing of AI agents. (i) Policy parsing: an ensemble of LLMs translates natural-language privacy policies into a structured privacy-policy model, where cross-LLM voting guarantees confidence of the parsing results. (ii) Runtime annotation: a lightweight Presidio-based analyzer detects sensitive data and annotates how the data is used based on the context of the AI agent's operations and the privacy-policy model. (iii) Compliance auditing: ontology alignment and automata-based evaluation connect the policy model with runtime annotations, enabling on-the-fly compliance checks between the natural-language policy and observed unordered data practices of AI agents. (iv) User interface: a platform-independent implementation visualizes the real-time execution trace of AI agents along with potential privacy risks detected during auditing, providing user-friendly transparency and accountability. In addition to common formatted privacy policies, AudAgent also supports user-defined policies for fine-grained control and customization. We evaluate AudAgent on AI agents built upon mainstream programming frameworks such as AutoGen, experiments show that AudAgent effectively identifies potential privacy policy violations in real time.
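
The compliance-auditing core can be reduced to a simple sketch: the parsed privacy policy becomes a set of allowed (data_type, action) pairs, and each annotated runtime event is checked against it. AudAgent's actual ontology alignment and automata-based evaluation are much richer; everything here is illustrative.

```python
# Parsed policy: the only data practices the policy permits.
allowed = {("email", "read"), ("email", "summarize"), ("location", "read")}

def audit(events):
    # Flag every runtime annotation not covered by the policy model.
    return [e for e in events if (e["data"], e["action"]) not in allowed]

trace = [
    {"data": "email", "action": "read"},
    {"data": "location", "action": "disclose"},   # not permitted
]
violations = audit(trace)
print(violations)  # only the disclose event is flagged
```

Because the check is per-event and order-independent, it can run on-the-fly as the agent executes, matching the "unordered data practices" framing above.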

Updated: 2025-11-03 17:32:08

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.07441v1

Non-Contact Health Monitoring During Daily Personal Care Routines

Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
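
The classical frequency-domain step behind rPPG heart-rate estimation is to locate the dominant spectral peak of the pulse signal inside the plausible heart-rate band. The paper's models are learned; this sketch only shows the baseline idea on a synthetic 1.2 Hz pulse.

```python
import numpy as np

fs = 30.0                                  # camera frame rate (Hz)
t = np.arange(0, 20, 1 / fs)               # 20-second window
signal = (np.sin(2 * np.pi * 1.2 * t)      # 1.2 Hz "pulse" = 72 BPM
          + 0.3 * np.random.default_rng(6).normal(size=t.size))

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, 1 / fs)
band = (freqs > 0.7) & (freqs < 4.0)       # restrict to 42-240 BPM
hr_bpm = 60 * freqs[band][np.argmax(spectrum[band])]
print(round(hr_bpm, 1))  # ≈ 72.0
```

A 4.99 BPM MAE, as reported above, corresponds to roughly a 0.08 Hz error at this scale, which is why fusing RGB and IR channels to stabilize the raw signal matters.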

Updated: 2025-11-03 17:30:56

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.09718v2

How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment

Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.

Updated: 2025-11-03 17:28:54

Categories: cs.CV,cs.AI,cs.MM

Download: http://arxiv.org/abs/2511.01775v1

Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image

In this work, we introduce \textbf{Wonder3D++}, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about $3$ minutes in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.

Updated: 2025-11-03 17:24:18

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.01767v1

Context-Guided Decompilation: A Step Towards Re-executability

Binary decompilation plays an important role in software security analysis, reverse engineering, and malware understanding when source code is unavailable. However, existing decompilation techniques often fail to produce source code that can be successfully recompiled and re-executed, particularly for optimized binaries. Recent advances in large language models (LLMs) have enabled neural approaches to decompilation, but the generated code is typically only semantically plausible rather than truly executable, limiting their practical reliability. These shortcomings arise from compiler optimizations and the loss of semantic cues in compiled code, which LLMs struggle to recover without contextual guidance. To address this challenge, we propose ICL4Decomp, a hybrid decompilation framework that leverages in-context learning (ICL) to guide LLMs toward generating re-executable source code. We evaluate our method across multiple datasets, optimization levels, and compilers, demonstrating around 40\% improvement in re-executability over state-of-the-art decompilation methods while maintaining robustness.

Updated: 2025-11-03 17:21:39

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2511.01763v1

Mixed-Density Diffuser: Efficient Planning with Non-uniform Temporal Resolution

Recent studies demonstrate that diffusion planners benefit from sparse-step planning over single-step planning. Training models to skip steps in their trajectories helps capture long-term dependencies without additional memory or computational cost. However, predicting excessively sparse plans degrades performance. We hypothesize this temporal density threshold is non-uniform across a temporal horizon and that certain parts of a planned trajectory should be more densely planned. We propose Mixed Density Diffuser (MDD), a diffusion planner where the densities throughout the horizon are tunable hyperparameters. MDD achieves a new SOTA across the Maze2D, Franka Kitchen, and Antmaze D4RL task domains.
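
An illustrative sketch of a mixed-density planning schedule: dense timesteps early in the horizon (where precision matters most) and sparser steps later. The actual per-segment densities in MDD are tuned hyperparameters; the segment layout below is invented.

```python
def mixed_density_schedule(horizon, segments):
    """segments: list of (fraction_of_horizon, stride) pairs."""
    steps, pos = [], 0
    for frac, stride in segments:
        end = min(horizon, pos + int(frac * horizon))
        steps.extend(range(pos, end, stride))   # planned timesteps
        pos = end
    return steps

# Dense (stride 1) for the first 25%, medium next 25%, sparse after.
plan = mixed_density_schedule(100, [(0.25, 1), (0.25, 4), (0.5, 10)])
print(len(plan), plan[:5], plan[-3:])
```

The planner then only predicts states at these indices, so the early portion of the trajectory keeps full temporal resolution while the tail captures long-range structure cheaply.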

Updated: 2025-11-03 17:17:23

Categories: cs.AI,cs.RO

Download: http://arxiv.org/abs/2510.23026v2

RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks

Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that often the best way to combine these rubrics into one single reward is also highly prompt-specific. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or unhandled edge case), which are then verified by an external validator to optimize both generator and critic jointly. By training both the generator and the critic, this game enhances the critic's error detection and the generator's output quality while reducing required verifications. Our experiments demonstrate that RLAC improves factual accuracy in text generation and correctness in code generation, while also outperforming exhaustive verification and reward model methods. We show that dynamic critics are more effective than fixed critics, showcasing the potential of RLAC for scaling RL post-training to free-form generation tasks.

Updated: 2025-11-03 17:15:05

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2511.01758v1

Interpretable end-to-end Neurosymbolic Reinforcement Learning agents

Deep reinforcement learning (RL) agents rely on shortcut learning, preventing them from generalizing to slightly different environments. To address this problem, symbolic methods that use object-centric states have been developed. However, comparing these methods to deep agents is not fair, as the latter operate on raw pixel-based states. In this work, we instantiate the symbolic SCoBots framework. SCoBots decompose RL tasks into intermediate, interpretable representations, culminating in action decisions based on a comprehensible set of object-centric relational concepts. This architecture aids in demystifying agent decisions. By explicitly learning to extract object-centric representations from raw states, performing object-centric RL, and distilling policies via rule extraction, this work places itself within the neurosymbolic AI paradigm, blending the strengths of neural networks with symbolic AI. We present the first implementation of an end-to-end trained SCoBot and separately evaluate its components on different Atari games. The results demonstrate the framework's potential to create interpretable and performant RL systems, and pave the way for future research on end-to-end interpretable RL agents.

Updated: 2025-11-03 17:07:46

Subjects: cs.AI

Download: http://arxiv.org/abs/2410.14371v2

Access Hoare Logic

Following Hoare's seminal invention, later called Hoare logic, to reason about correctness of computer programs, we advocate a related but fundamentally different approach to reason about access security of computer programs such as access control. We define the formalism, which we denote access Hoare logic, and present examples which demonstrate its usefulness and fundamental difference to Hoare logic. We prove soundness and completeness of access Hoare logic, and provide a link between access Hoare logic and standard Hoare logic.

Updated: 2025-11-03 17:04:49

Subjects: cs.LO,cs.CR,cs.SC

Download: http://arxiv.org/abs/2511.01754v1

SM-based Semantics for Answer Set Programs Containing Conditional Literals and Arithmetic

Modern answer set programming solvers such as CLINGO support advanced language constructs that improve the expressivity and conciseness of logic programs. Conditional literals are one such construct. They form "subformulas" that behave as nested implications within the bodies of logic rules. Their inclusion brings the form of rules closer to the less restrictive syntax of first-order logic. These qualities make conditional literals useful tools for knowledge representation. In this paper, we propose a semantics for logic programs with conditional literals and arithmetic based on the SM operator. These semantics do not require grounding, unlike the established semantics for such programs that relies on a translation to infinitary propositional logic. The main result of this paper establishes the precise correspondence between the proposed and existing semantics.

Updated: 2025-11-03 17:03:29

Subjects: cs.LO,cs.AI,cs.PL,F.4.1

Download: http://arxiv.org/abs/2511.01753v1

Scam Shield: Multi-Model Voting and Fine-Tuned LLMs Against Adversarial Attacks

Scam detection remains a critical challenge in cybersecurity as adversaries craft messages that evade automated filters. We propose a Hierarchical Scam Detection System (HSDS) that combines a lightweight multi-model voting front end with a fine-tuned LLaMA 3.1 8B Instruct back end to improve accuracy and robustness against adversarial attacks. An ensemble of four classifiers provides preliminary predictions through majority vote, and ambiguous cases are escalated to the fine-tuned model, which is optimized with adversarial training to reduce misclassification. Experiments show that this hierarchical design both improves adversarial scam detection and shortens inference time by routing most cases away from the LLM, outperforming traditional machine-learning baselines and proprietary LLM baselines. The findings highlight the effectiveness of a hybrid voting mechanism and adversarial fine-tuning in fortifying LLMs against evolving scam tactics, enhancing the resilience of automated scam detection systems.
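
The voting-then-escalation routing can be sketched with toy stand-ins for the four classifiers and the fine-tuned LLaMA back end (the agreement threshold, labels, and heuristics are our assumptions, not the paper's):

```python
from collections import Counter

def hierarchical_classify(message, fast_classifiers, llm_fallback, min_agreement=3):
    """HSDS-style routing sketch: cheap classifiers vote first, and only
    ambiguous messages are escalated to the expensive fine-tuned LLM."""
    votes = Counter(clf(message) for clf in fast_classifiers)
    label, count = votes.most_common(1)[0]
    if count >= min_agreement:           # confident majority: no LLM call
        return label, "ensemble"
    return llm_fallback(message), "llm"  # ambiguous: escalate

# Toy stand-ins for the real models:
always_scam = lambda m: "scam"
keyword_clf = lambda m: "scam" if "prize" in m else "ham"
length_clf  = lambda m: "ham" if len(m) < 40 else "scam"
noisy_clf   = lambda m: "ham"
llm         = lambda m: "scam" if "urgent" in m else "ham"

label, route = hierarchical_classify(
    "You won a prize! urgent: send your bank details now, act fast",
    [always_scam, keyword_clf, length_clf, noisy_clf], llm)
```

Routing most traffic through the cheap ensemble is what shortens inference time in the abstract's results; the LLM only sees the hard residue.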

Updated: 2025-11-03 16:58:47

Subjects: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.01746v1

An Open-Access Benchmark of Statistical and Machine-Learning Anomaly Detection Methods for Battery Applications

Battery safety is critical in applications ranging from consumer electronics to electric vehicles and aircraft, where undetected anomalies could trigger safety hazards or costly downtime. In this study, we present OSBAD as an open-source benchmark for anomaly detection frameworks in battery applications. By benchmarking 15 diverse algorithms encompassing statistical, distance-based, and unsupervised machine-learning methods, OSBAD enables a systematic comparison of anomaly detection methods across heterogeneous datasets. In addition, we demonstrate how a physics- and statistics-informed feature transformation workflow enhances anomaly separability by decomposing collective anomalies into point anomalies. To address a major bottleneck in unsupervised anomaly detection due to incomplete labels, we propose a Bayesian optimization pipeline that facilitates automated hyperparameter tuning based on transfer-learning and regression proxies. Through validation on datasets covering both liquid and solid-state chemistries, we further demonstrate the cross-chemistry generalization capability of OSBAD to identify irregularities across different electrochemical systems. By making a benchmarking database with open-source, reproducible anomaly detection workflows available to the community, OSBAD establishes a unified foundation for developing safe, scalable, and transferable anomaly detection tools in battery analytics. This research underscores the significance of physics- and statistics-informed feature engineering as well as model selection with probabilistic hyperparameter tuning, in advancing trustworthy, data-driven diagnostics for safety-critical energy systems.
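
For flavor, one of the simplest statistical detectors such a benchmark would cover: a z-score rule flagging point anomalies in a cell-voltage trace (the data, the 2.5-sigma threshold, and the method choice are illustrative, not taken from OSBAD):

```python
def zscore_anomalies(series, threshold=3.0):
    """Flag indices deviating from the mean by more than `threshold`
    standard deviations."""
    n = len(series)
    mean = sum(series) / n
    std = (sum((x - mean) ** 2 for x in series) / n) ** 0.5
    if std == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mean) / std > threshold]

# Toy cell-voltage trace with one injected point anomaly at index 5.
# With only 8 samples the outlier inflates the std estimate itself, so a
# threshold below the usual 3 sigma is needed to flag it.
voltage = [3.70, 3.71, 3.69, 3.70, 3.72, 2.10, 3.71, 3.70]
anomalies = zscore_anomalies(voltage, threshold=2.5)
```

The sensitivity of this baseline to its threshold is exactly the kind of hyperparameter problem the paper's Bayesian optimization pipeline targets.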

Updated: 2025-11-03 16:57:18

Subjects: cs.LG,cs.AI,stat.ME

Download: http://arxiv.org/abs/2511.01745v1

PO-CKAN: Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure

We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov--Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov-Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch--trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers' equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48\% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion--reaction benchmarks.
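
The PDE-residual (PINN-style) loss mentioned above, written out for the Burgers' benchmark with viscosity $\nu$ (this is the standard formulation; the collocation-point notation is ours, not the paper's):

```latex
% Residual of Burgers' equation for the network prediction u_\theta,
% evaluated at N collocation points (x_i, t_i):
r_\theta(x, t) = \partial_t u_\theta + u_\theta \, \partial_x u_\theta
                 - \nu \, \partial_{xx} u_\theta,
\qquad
\mathcal{L}_{\mathrm{phys}} = \frac{1}{N} \sum_{i=1}^{N}
    \left| r_\theta(x_i, t_i) \right|^2 .
```

Minimizing $\mathcal{L}_{\mathrm{phys}}$ alongside the data-fitting loss is what enforces the physical consistency of the learned operator.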

Updated: 2025-11-03 16:54:38

Subjects: cs.LG,math-ph,math.MP

Download: http://arxiv.org/abs/2510.08795v2

Towards Efficient Federated Learning of Networked Mixture-of-Experts for Mobile Edge Computing

Recent advancements in large artificial intelligence models (LAMs) are driving significant innovations in mobile edge computing within next-generation wireless networks. However, the substantial demands for computational resources and large-scale training data required to train LAMs conflict with the limited storage and computational capacity of edge devices, posing significant challenges to training and deploying LAMs at the edge. In this work, we introduce the Networked Mixture-of-Experts (NMoE) system, in which clients infer collaboratively by distributing tasks to suitable neighbors based on their expertise and aggregate the returned results. For training the NMoE, we propose a federated learning framework that integrates both supervised and self-supervised learning to balance personalization and generalization, while preserving communication efficiency and data privacy. We conduct extensive experiments to demonstrate the efficacy of the proposed NMoE system, providing insights and benchmarks for the NMoE training algorithms.
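
The distribute-to-suitable-neighbors-and-aggregate step might look like the following sketch (the expertise vectors, top-k routing rule, and weighted averaging are assumptions on our part, not the paper's design):

```python
def route_and_aggregate(task_embedding, neighbors, top_k=2):
    """NMoE-style collaborative inference sketch: each neighbor advertises
    an expertise vector; the task goes to the top-k most aligned neighbors
    and their returned predictions are averaged, weighted by alignment."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = sorted(neighbors, key=lambda n: dot(task_embedding, n["expertise"]),
                    reverse=True)[:top_k]
    weights = [max(dot(task_embedding, n["expertise"]), 0.0) for n in scored]
    total = sum(weights) or 1.0
    preds = [n["infer"](task_embedding) for n in scored]
    return sum(w * p for w, p in zip(weights, preds)) / total

neighbors = [
    {"expertise": [1.0, 0.0], "infer": lambda t: 10.0},  # specialist A
    {"expertise": [0.0, 1.0], "infer": lambda t: 20.0},  # specialist B
    {"expertise": [0.5, 0.5], "infer": lambda t: 15.0},  # generalist
]
result = route_and_aggregate([1.0, 0.0], neighbors, top_k=2)
```

Only the task embedding and the k predictions cross the network, which is the kind of communication efficiency the federated training framework is built to preserve.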

Updated: 2025-11-03 16:54:06

Subjects: cs.LG,cs.AI,cs.NI

Download: http://arxiv.org/abs/2511.01743v1

MarsLGPR: Mars Rover Localization with Ground Penetrating Radar

In this work, we propose the use of Ground Penetrating Radar (GPR) for rover localization on Mars. Precise pose estimation is an important task for mobile robots exploring planetary surfaces, as they operate in GPS-denied environments. Although visual odometry provides accurate localization, it is computationally expensive and can fail in dim or high-contrast lighting. Wheel encoders can also provide odometry estimation, but are prone to slipping on the sandy terrain encountered on Mars. Although traditionally a scientific surveying sensor, GPR has been used on Earth for terrain classification and localization through subsurface feature matching. The Perseverance rover and the upcoming ExoMars rover are already equipped with GPR sensors to aid in the search for water and mineral resources. We propose to leverage GPR to aid in Mars rover localization. Specifically, we develop a novel GPR-based deep learning model that predicts 1D relative pose translation. We fuse our GPR pose prediction method with inertial and wheel encoder data in a filtering framework to output rover localization. We perform experiments in a Mars analog environment and demonstrate that our GPR-based displacement predictions both outperform wheel encoders and improve multi-modal filtering estimates in high-slip environments. Lastly, we present the first dataset aimed at GPR-based localization in Mars analog environments, which will be made publicly available at https://umfieldrobotics.github.io/marslgpr.
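
The core of fusing two displacement estimates can be shown with a minimal inverse-variance combination (the paper's filter also ingests inertial data and is certainly richer than this; the variances below are made-up numbers):

```python
def fuse_displacement(wheel_d, wheel_var, gpr_d, gpr_var):
    """Inverse-variance weighted fusion of two 1D displacement estimates.
    The more trusted (lower-variance) source dominates the fused value."""
    w_wheel = 1.0 / wheel_var
    w_gpr = 1.0 / gpr_var
    fused = (w_wheel * wheel_d + w_gpr * gpr_d) / (w_wheel + w_gpr)
    fused_var = 1.0 / (w_wheel + w_gpr)
    return fused, fused_var

# High-slip regime: wheel odometry over-reports motion, so we model it
# with a larger variance than the GPR-based displacement prediction.
fused, var = fuse_displacement(wheel_d=1.30, wheel_var=0.04,
                               gpr_d=1.00, gpr_var=0.01)
```

With these numbers the fused estimate (1.06 m) is pulled strongly toward the GPR prediction, mirroring how GPR corrects wheel slip in the paper's experiments.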

Updated: 2025-11-03 16:49:02

Subjects: cs.RO,cs.LG

Download: http://arxiv.org/abs/2503.04944v2

Vibe Learning: Education in the age of AI

The debate over whether "thinking machines" could replace human intellectual labor has existed in both public and expert discussions since the mid-twentieth century, when the concept and terminology of Artificial Intelligence (AI) first emerged. For decades, this idea remained largely theoretical. However, with the recent advent of Generative AI - particularly Large Language Models (LLMs) - and the widespread adoption of tools such as ChatGPT, the issue has become a practical reality. Many fields that rely on human intellectual effort are now being reshaped by AI tools that both expand human capabilities and challenge the necessity of certain forms of work once deemed uniquely human but now easily automated. Education, somewhat unexpectedly, faces a pivotal responsibility: to devise long-term strategies for cultivating human skills that will remain relevant in an era of pervasive AI in the intellectual domain. In this context, we identify the limitations of current AI systems - especially those rooted in LLM technology - argue that the fundamental causes of these weaknesses cannot be resolved through existing methods, and propose directions within the constructivist paradigm for transforming education to preserve the long-term advantages of human intelligence over AI tools.

Updated: 2025-11-03 16:47:05

Subjects: cs.CY,cs.AI

Download: http://arxiv.org/abs/2511.01956v1

A Proof of Learning Rate Transfer under $μ$P

We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parameterization designed to ``maximize'' feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a \emph{non-zero constant} as width goes to infinity, providing a theoretical explanation to learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.
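
The practical upshot of learning rate transfer can be sketched as a width-dependent rescaling rule. The 1/width hidden-layer scaling shown is the commonly cited Adam-style $\mu$P rule; exact multipliers differ per layer and optimizer, so treat this strictly as a sketch:

```python
def hidden_lr(base_lr, width, base_width, parametrization="mup"):
    """Effective hidden-layer learning rate. Under muP the tuned quantity
    is base_lr, which transfers across width while the effective lr is
    rescaled internally; under SP the unscaled lr is applied directly, so
    its optimum drifts as width grows."""
    if parametrization == "mup":
        return base_lr * base_width / width
    return base_lr  # standard parametrization (SP)

# Tune at width 256, then deploy at width 4096 with the SAME base_lr:
small = hidden_lr(3e-4, width=256, base_width=256)
large = hidden_lr(3e-4, width=4096, base_width=256)
```

The paper's result is that the optimal `base_lr` approaches a non-zero constant as width grows, which is what makes reusing the small-model value sound.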

Updated: 2025-11-03 16:45:47

Subjects: stat.ML,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2511.01734v1

Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models

The need to analyze graphs is ubiquitous across various fields, from social networks to biological research and recommendation systems. Therefore, enabling the ability of large language models (LLMs) to process graphs is an important step toward more advanced general intelligence. However, current LLM benchmarks on graph analysis require models to directly reason over the prompts describing graph topology, and are thus limited to small graphs with only a few dozens of nodes. In contrast, human experts typically write programs based on popular libraries for task solving, and can thus handle graphs with different scales. To this end, a question naturally arises: can LLMs analyze graphs like professionals? In this paper, we introduce ProGraph, a manually crafted benchmark containing 3 categories of graph tasks. The benchmark expects solutions based on programming instead of directly reasoning over raw inputs. Our findings reveal that the performance of current LLMs is unsatisfactory, with the best model achieving only 36% accuracy. To bridge this gap, we propose LLM4Graph datasets, which include crawled documents and auto-generated codes based on 6 widely used graph libraries. By augmenting closed-source LLMs with document retrieval and fine-tuning open-source ones on the codes, we show 11-32% absolute improvements in their accuracies. Our results underscore that the capabilities of LLMs in handling structured data are still under-explored, and show the effectiveness of LLM4Graph in enhancing LLMs' proficiency of graph analysis. The benchmark, datasets and enhanced open-source models are available at https://github.com/BUPT-GAMMA/ProGraph.
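
The program-based solutions the benchmark expects look like ordinary library calls (e.g. NetworkX's `nx.shortest_path(G, source, target)`) rather than reasoning over a textual edge list. A dependency-free equivalent of such generated code:

```python
from collections import deque

def shortest_path(adj, source, target):
    """BFS shortest path on an unweighted graph given as an adjacency
    dict -- the kind of code ProGraph expects a model to WRITE, which
    scales to graphs far larger than any prompt could describe."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:              # reconstruct path by backtracking
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # target unreachable

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
path = shortest_path(adj, "a", "e")
```

Writing the traversal once and running it handles any input size, which is the gap between prompt-level reasoning and the professional workflow the paper measures.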

Updated: 2025-11-03 16:44:03

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2409.19667v4

AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement

We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.

Updated: 2025-11-03 16:38:43

Subjects: cs.SD,cs.AI,cs.LG,eess.AS

Download: http://arxiv.org/abs/2501.15417v3

Rethinking Visual Intelligence: Insights from Video Pretraining

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.

Updated: 2025-11-03 16:32:22

Subjects: cs.CV,cs.AI,68T07, 68T45, 68T20,I.2.10; I.4.8; I.5.1; I.2.6

Download: http://arxiv.org/abs/2510.24448v2

GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration

Graphs are widely used for modeling relational data in real-world scenarios, such as social networks and urban computing. Existing LLM-based graph analysis approaches either integrate graph neural networks (GNNs) for specific machine learning tasks, limiting their transferability, or rely solely on LLMs' internal reasoning ability, resulting in suboptimal performance. To address these limitations, we take advantage of recent advances in LLM-based agents, which have shown capabilities of utilizing external knowledge or tools for problem solving. By simulating human problem-solving strategies such as analogy and collaboration, we propose a multi-agent system based on LLMs named GraphTeam, for graph analysis. GraphTeam consists of five LLM-based agents from three modules, and the agents with different specialities can collaborate with each other to address complex problems. Specifically, (1) input-output normalization module: the question agent extracts and refines four key arguments from the original question, facilitating the problem understanding, and the answer agent organizes the results to meet the output requirement; (2) external knowledge retrieval module: we first build a knowledge base consisting of relevant documentation and experience information, and then the search agent retrieves the most relevant entries for each question. (3) problem-solving module: given the retrieved information from search agent, the coding agent uses established algorithms via programming to generate solutions, and in case the coding agent does not work, the reasoning agent will directly compute the results without programming. Extensive experiments on six graph analysis benchmarks demonstrate that GraphTeam achieves state-of-the-art performance with an average 25.85% improvement over the best baseline in terms of accuracy. The code and data are available at https://github.com/BUPT-GAMMA/GraphTeam.

Updated: 2025-11-03 16:31:39

Subjects: cs.AI,cs.CL,cs.MA

Download: http://arxiv.org/abs/2410.18032v5

Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation

SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. The model is open-sourced at https://github.com/IBM/fms-dgt/tree/main/fms_dgt/public/databuilders/time_series.
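
A toy version of encoding a series as text and decoding it back (SDForger's actual tabular embedding is richer; the bin-token scheme here is purely an illustrative assumption):

```python
def series_to_text(series, lo, hi, n_bins=10):
    """Quantize each value into one of n_bins and emit a token per step,
    giving a compact textual representation an LLM can be tuned on."""
    step = (hi - lo) / n_bins
    tokens = []
    for x in series:
        b = min(int((x - lo) / step), n_bins - 1)
        tokens.append(f"b{b}")
    return " ".join(tokens)

def text_to_series(text, lo, hi, n_bins=10):
    """Decode tokens back to bin-center values."""
    step = (hi - lo) / n_bins
    return [lo + (int(t[1:]) + 0.5) * step for t in text.split()]

encoded = series_to_text([0.05, 0.55, 0.95], lo=0.0, hi=1.0)
decoded = text_to_series(encoded, lo=0.0, hi=1.0)
```

Once the series lives in token space, sampling new token sequences from a fine-tuned LLM and decoding them yields synthetic series, which is the generation loop the abstract describes.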

Updated: 2025-11-03 16:31:16

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.17103v2

Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement

Natural Language Explanations (NLEs) describe how Large Language Models (LLMs) make decisions, drawing on both external Context Knowledge (CK) and Parametric Knowledge (PK) stored in model weights. Understanding their interaction is key to assessing the grounding of NLEs, yet it remains underexplored. Prior work has largely examined only single-step generation, typically the final answer, and has modelled PK and CK interaction only as a binary choice in a rank-1 subspace. This overlooks richer forms of interaction, such as complementary or supportive knowledge. We propose a novel rank-2 projection subspace that disentangles PK and CK contributions more accurately and use it for the first multi-step analysis of knowledge interactions across longer NLE sequences. Experiments on four QA datasets and three open-weight instruction-tuned LLMs show that diverse knowledge interactions are poorly represented in a rank-1 subspace but are effectively captured in our rank-2 formulation. Our multi-step analysis reveals that hallucinated NLEs align strongly with the PK direction, context-faithful ones balance PK and CK, and Chain-of-Thought prompting for NLEs shifts generated NLEs toward CK by reducing PK reliance. This work provides the first framework for systematic studies of multi-step knowledge interactions in LLMs through a richer rank-2 subspace disentanglement. Code and data: https://github.com/copenlu/pk-ck-knowledge-disentanglement.
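
The rank-1 vs rank-2 distinction can be illustrated with synthetic direction vectors (the paper derives the PK/CK directions from model internals; everything here is a toy geometry):

```python
import numpy as np

pk = np.array([1.0, 0.0, 0.0])   # parametric-knowledge direction
ck = np.array([0.0, 1.0, 0.0])   # context-knowledge direction
hidden = 0.3 * pk + 0.7 * ck + np.array([0.0, 0.0, 0.2])  # state + noise

# Rank-1 view: one scalar along (pk - ck), i.e. a binary PK-vs-CK choice.
axis = (pk - ck) / np.linalg.norm(pk - ck)
rank1_score = float(hidden @ axis)

# Rank-2 view: least-squares coordinates in span{pk, ck}, which can
# represent complementary contributions (both coefficients large) instead
# of collapsing them onto a single axis.
B = np.stack([pk, ck], axis=1)                       # 3x2 basis
coeffs, *_ = np.linalg.lstsq(B, hidden, rcond=None)
pk_contrib, ck_contrib = coeffs
```

The rank-1 score can only say "more PK than CK or vice versa", while the rank-2 coordinates recover both contributions separately, which is what makes supportive or complementary knowledge visible.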

Updated: 2025-11-03 16:15:06

Subjects: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.01706v1

Energy Consumption of TLS, Searchable Encryption and Fully Homomorphic Encryption

Privacy-enhancing technologies (PETs) have attracted significant attention in response to privacy regulations, driving the development of applications that prioritize user data protection. At the same time, the information and communication technology (ICT) sector faces growing pressure to reduce its environmental footprint, particularly its energy consumption. While numerous studies have assessed the energy consumption of ICT applications, the environmental impact of cryptographic PETs remains largely unexplored. This work investigates this question by measuring the energy consumption increase induced by three PETs compared to their non-private counterparts: TLS, Searchable Encryption, and Fully Homomorphic Encryption (FHE). These technologies were chosen for two reasons. First, they cover different maturity levels -- from the widely deployed TLS protocol to the emerging FHE schemes -- allowing us to examine the influence of maturity on energy consumption. Second, they each have well-established applications in industry: web browsing, encrypted databases, and privacy-preserving machine learning. Our results reveal highly variable energy consumption increases, ranging from 2x for TLS to 10x for Searchable Encryption and 100,000x for FHE. Our experiments demonstrate a simple and reproducible methodology, based on existing open-source software, to quantify the energy costs of PETs. They also highlight the wide spectrum of energy demands across technologies, underscoring the importance of further research on sustainable PET design. Finally, we discuss orthogonal research directions, such as hardware acceleration, to outline promising directions toward sustainable PETs.

Updated: 2025-11-03 16:14:11

Categories: cs.CR

Download: http://arxiv.org/abs/2508.04583v3

Benchmarking LLMs in Web API Integration Tasks

API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks.
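
A minimal sketch of the kind of check such an evaluation pipeline might apply to generated invocations, matching the error types the abstract lists (hallucinated endpoints, incorrect argument usage). The spec format and the `validate_call` helper are illustrative assumptions, not WAPIIBench's actual schema:

```python
# Toy validator for generated web API calls: flag endpoints that do not exist
# in the spec and arguments the spec does not allow. The SPEC mapping is a
# hypothetical stand-in for a real API specification (e.g. OpenAPI).

SPEC = {  # endpoint -> set of allowed parameter names
    "GET /users": {"page", "per_page"},
    "POST /users": {"name", "email"},
}

def validate_call(endpoint: str, args: dict) -> list[str]:
    """Return a list of error strings; an empty list means the call is valid."""
    errors = []
    if endpoint not in SPEC:
        errors.append(f"hallucinated endpoint: {endpoint}")
    else:
        unknown = set(args) - SPEC[endpoint]
        if unknown:
            errors.append(f"unknown arguments: {sorted(unknown)}")
    return errors

print(validate_call("GET /users", {"page": 1}))          # []
print(validate_call("GET /user", {}))                    # hallucinated endpoint
print(validate_call("POST /users", {"username": "a"}))   # unknown arguments
```

A real pipeline would add further checks (missing required arguments, type mismatches, response handling), but the structure is the same: compare each generated call against the ground-truth API specification.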

Updated: 2025-11-03 16:12:09

Categories: cs.SE,cs.LG

Download: http://arxiv.org/abs/2509.20172v3

Solution Space Topology Guides MCTS Search

A fundamental question in search-guided AI: what topology should guide Monte Carlo Tree Search (MCTS) in puzzle solving? Prior work applied topological features to guide MCTS in ARC-style tasks using grid topology (the Laplacian spectral properties of cell connectivity) and found no benefit. We identify the root cause: grid topology is constant across all instances. We propose measuring \emph{solution space topology} instead: the structure of valid color assignments constrained by detected pattern rules. We build this via compatibility graphs where nodes are $(cell, color)$ pairs and edges represent compatible assignments under pattern constraints. Our method: (1) detect pattern rules automatically with 100\% accuracy on 5 types, (2) construct compatibility graphs encoding solution space structure, (3) extract topological features (algebraic connectivity, rigidity, color structure) that vary with task difficulty, (4) integrate these features into MCTS node selection via sibling-normalized scores. We provide formal definitions, a rigorous selection formula, and comprehensive ablations showing that algebraic connectivity is the dominant signal. The work demonstrates that topology matters for search, but only the \emph{right} topology. For puzzle solving, this is solution space structure, not problem space structure.
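
The dominant signal named in the abstract, algebraic connectivity, is the second-smallest eigenvalue of the graph Laplacian. A toy computation of that feature could look like the following; the adjacency matrices here are simple stand-ins, not a real compatibility graph built from detected pattern rules:

```python
import numpy as np

# Algebraic connectivity (Fiedler value) of a graph: the second-smallest
# eigenvalue of its Laplacian L = D - A. In the paper's setting the graph's
# nodes would be (cell, color) pairs; here we use generic example graphs.

def algebraic_connectivity(adj: np.ndarray) -> float:
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    eigenvalues = np.sort(np.linalg.eigvalsh(laplacian))
    return float(eigenvalues[1])  # lambda_2, the Fiedler value

# Path graph on 4 nodes: connected, but weakly (small lambda_2 = 2 - sqrt(2)).
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
# Complete graph on 4 nodes: strongly connected, lambda_2 = 4.
complete = np.ones((4, 4)) - np.eye(4)

print(algebraic_connectivity(path))      # ~0.586
print(algebraic_connectivity(complete))  # 4.0
```

A larger Fiedler value indicates a more tightly connected solution space, which is the kind of instance-varying signal a constant grid topology cannot provide.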

Updated: 2025-11-03 16:09:00

Categories: cs.CE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.01701v1

Identity Increases Stability in Neural Cellular Automata

Neural Cellular Automata (NCAs) offer a way to study the growth of two-dimensional artificial organisms from a single seed cell. From the outset, NCA-grown organisms have had issues with stability, their natural boundary often breaking down and exhibiting tumour-like growth or failing to maintain the expected shape. In this paper, we present a method for improving the stability of NCA-grown organisms by introducing an 'identity' layer with simple constraints during training. Results show that NCAs grown in close proximity are more stable compared with the original NCA model. Moreover, only a single identity value is required to achieve this increase in stability. We observe emergent movement from the stable organisms, with increasing prevalence for models with multiple identity values. This work lays the foundation for further study of the interaction between NCA-grown organisms, paving the way for studying social interaction at a cellular level in artificial organisms. Code/Videos available at: https://github.com/jstovold/ALIFE2025

Updated: 2025-11-03 16:04:41

Categories: cs.NE,cs.AI

Download: http://arxiv.org/abs/2508.06389v2

Mathematical exploration and discovery at scale

AlphaEvolve is a generic evolutionary coding agent that combines the generative capabilities of LLMs with automated evaluation in an iterative evolutionary framework that proposes, tests, and refines algorithmic solutions to challenging scientific and practical problems. In this paper we showcase AlphaEvolve as a tool for autonomously discovering novel mathematical constructions and advancing our understanding of long-standing open problems. To demonstrate its breadth, we considered a list of 67 problems spanning mathematical analysis, combinatorics, geometry, and number theory. The system rediscovered the best known solutions in most of the cases and discovered improved solutions in several. In some instances, AlphaEvolve is also able to generalize results for a finite number of input values into a formula valid for all input values. Furthermore, we are able to combine this methodology with Deep Think and AlphaProof in a broader framework where the additional proof-assistants and reasoning systems provide automated proof generation and further mathematical insights. These results demonstrate that large language model-guided evolutionary search can autonomously discover mathematical constructions that complement human intuition, at times matching or even improving the best known results, highlighting the potential for significant new ways of interaction between mathematicians and AI systems. We present AlphaEvolve as a powerful new tool for mathematical discovery, capable of exploring vast search spaces to solve complex optimization problems at scale, often with significantly reduced requirements on preparation and computation time.

Updated: 2025-11-03 16:04:07

Categories: cs.NE,cs.AI,math.CA,math.CO,math.MG

Download: http://arxiv.org/abs/2511.02864v1

Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering

Vision-language pre-trained models, such as CLIP, have established new benchmarks in multimodal data mining. In such models, few-shot fine-tuning is a major challenge to achieve optimal performance on both in-distribution (ID) and out-of-distribution (OOD) datasets, especially when labeled data is scarce. Most existing fine-tuning approaches rely on first-order gradient-based optimizers, which typically suffer from slow convergence, sensitivity to step-size hyperparameters, and poor generalization in OOD settings. In contrast, second-order methods utilize local curvature information of the loss landscape to adjust the update step size. This is particularly beneficial for CLIP models, whose non-convex loss functions often contain sharp critical points. In such cases, the natural gradient direction can offer more substantial and efficient per-iteration updates when fine-tuning with limited data. Natural Gradient Descent (NGD) is obtained by preconditioning the standard gradient with the inverse Fisher Information Matrix (FIM), which is computationally expensive for large models. To address this, we propose a Bayesian approximation of NGD using a Kalman filter for CLIP models. Our method combines the benefits of second-order optimization with Bayesian inference, which enhances generalization while providing uncertainty quantification. Extensive experiments conducted on diverse image classification datasets demonstrate that our algorithm consistently achieves superior or comparable ID performance and improved OOD robustness compared to state-of-the-art baselines. To the best of our knowledge, this work represents the first successful application of Kalman filtering to fine-tuning CLIP-based models, which enables more robust and efficient learning in vision-language tasks.

Updated: 2025-11-03 16:00:45

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.01694v1

Electrical Load Forecasting over Multihop Smart Metering Networks with Federated Learning

Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) record household energy data. Traditional machine learning (ML) methods are often employed for load forecasting, but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for high-quality load forecasting in metering networks. A meta-learning-based strategy is developed to address data heterogeneity at local SMs in the collaborative training of local load forecasting models. Moreover, to minimize the load forecasting delays in our PFL model, we study a new latency optimization problem based on optimal resource allocation at SMs. A theoretical convergence analysis is also conducted to provide insights into FL design for federated load forecasting. Extensive simulations from real-world datasets show that our method outperforms existing approaches regarding better load forecasting and reduced operational latency costs.
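
For context, the plain FedAvg aggregation step that underlies such federated schemes (not the paper's personalized, meta-learning variant) can be sketched as a sample-size-weighted average of client parameters:

```python
import numpy as np

# Standard FedAvg server-side aggregation: each smart meter trains locally and
# uploads parameters; the server averages them weighted by local dataset size,
# so no raw household energy data is ever shared.

def fedavg(client_weights: list[np.ndarray], num_samples: list[int]) -> np.ndarray:
    total = sum(num_samples)
    return sum(w * (n / total) for w, n in zip(client_weights, num_samples))

# Three smart meters with different amounts of local data.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 10, 20]
print(fedavg(clients, sizes))  # [3.5 4.5]
```

The imbalanced-data problem the paper targets shows up exactly here: when `sizes` and data distributions differ sharply across meters, a single averaged model fits none of them well, which motivates the personalized, meta-learning-based strategy.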

Updated: 2025-11-03 15:56:26

Categories: cs.LG

Download: http://arxiv.org/abs/2502.17226v2

Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI

The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. To track the effects of our approach, we introduce a method which analyzes revealed preferences, uncovering clear and holistic changes in character. We find these changes are more robust to adversarial prompting than the above two alternatives, while also leading to more coherent and realistic generations. Finally, we demonstrate this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. We describe and open-source our full post-training method, the implementation of which can be found at https://github.com/maiush/OpenCharacterTraining.

Updated: 2025-11-03 15:53:47

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.01689v1

Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity

Out-of-distribution (OOD) detection is essential for ensuring the reliability and safety of machine learning systems. In recent years, it has received increasing attention, particularly through post-hoc detection and training-based methods. In this paper, we focus on post-hoc OOD detection, which enables identifying OOD samples without altering the model's training procedure or objective. Our primary goal is to investigate the relationship between model capacity and its OOD detection performance. Specifically, we aim to answer the following question: Does the Double Descent phenomenon manifest in post-hoc OOD detection? This question is crucial, as it can reveal whether overparameterization, which is already known to benefit generalization, can also enhance OOD detection. Despite the growing interest in these topics by the classic supervised machine learning community, this intersection remains unexplored for OOD detection. We empirically demonstrate that the Double Descent effect does indeed appear in post-hoc OOD detection. Furthermore, we provide theoretical insights to explain why this phenomenon emerges in such setting. Finally, we show that the overparameterized regime does not yield superior results consistently, and we propose a method to identify the optimal regime for OOD detection based on our observations.

Updated: 2025-11-03 15:51:44

Categories: stat.ML,cs.AI,cs.CV,cs.LG,math.ST,stat.TH,I.2.6; I.5.1

Download: http://arxiv.org/abs/2411.02184v3

Student Engagement in AI Assisted Complex Problem Solving: A Pilot Study of Human AI Rubik's Cube Collaboration

Games and puzzles play important pedagogical roles in STEM learning. New AI algorithms that can solve complex problems offer opportunities for scaffolded instruction in puzzle solving. This paper presents the ALLURE system, which uses an AI algorithm (DeepCubeA) to guide students in solving a common first step of the Rubik's Cube (i.e., the white cross). Using data from a pilot study, we present preliminary findings about students' behaviors in the system and how these behaviors are associated with STEM skills, including spatial reasoning, critical thinking, and algorithmic thinking. We discuss how data from ALLURE can be used in future educational data mining to understand how students benefit from AI assistance and collaboration when solving complex problems.

Updated: 2025-11-03 15:46:54

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2511.01683v1

Dynamic Forgetting and Spatio-Temporal Periodic Interest Modeling for Local-Life Service Recommendation

In the context of the booming digital economy, recommendation systems, as a key link connecting users and numerous services, face challenges in modeling user behavior sequences on local-life service platforms, including the sparsity of long sequences and strong spatio-temporal dependence. Such challenges can be addressed by drawing an analogy to the forgetting process in human memory. This is because users' responses to recommended content follow the recency effect and the cyclicality of memory. By exploring this, this paper introduces the forgetting curve and proposes Spatio-Temporal periodic Interest Modeling (STIM) with long sequences for local-life service recommendation. STIM integrates three key components: a dynamic masking module based on the forgetting curve, which is used to extract both recent spatiotemporal features and periodic spatiotemporal features; a query-based mixture of experts (MoE) approach that can adaptively activate expert networks under different dynamic masks, enabling the collaborative modeling of time, location, and items; and a hierarchical multi-interest network unit, which captures multi-interest representations by modeling the hierarchical interactions between the shallow and deep semantics of users' recent behaviors. By introducing the STIM method, we conducted online A/B tests and achieved a 1.54\% improvement in gross transaction volume (GTV). In addition, extended offline experiments also showed improvements. STIM has been deployed in a large-scale local-life service recommendation system, serving hundreds of millions of daily active users in core application scenarios.
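
The forgetting-curve idea behind the dynamic masking module can be illustrated with a toy sketch. The Ebbinghaus-style decay exp(-Δt/s), the 7-day period, and the tolerance below are assumptions for illustration, not STIM's actual parameters:

```python
import numpy as np

# Two views over a behavior sequence indexed by event age in days:
# (1) a recency weight that decays like a forgetting curve, and
# (2) a periodic mask that keeps events near a multiple of a weekly cycle,
# capturing the recency effect and cyclicality the abstract appeals to.

def recency_weights(ages_days: np.ndarray, strength: float = 7.0) -> np.ndarray:
    """Ebbinghaus-style retention: recent events weigh more."""
    return np.exp(-ages_days / strength)

def periodic_mask(ages_days: np.ndarray, period: float = 7.0,
                  tol: float = 0.5) -> np.ndarray:
    """Keep events whose age is within `tol` days of a multiple of `period`."""
    phase = np.abs(ages_days % period)
    return np.minimum(phase, period - phase) <= tol

ages = np.array([0.0, 1.0, 7.0, 10.0, 14.2])
print(recency_weights(ages).round(3))  # [1.    0.867 0.368 0.24  0.132]
print(periodic_mask(ages))             # [ True False  True False  True]
```

In the paper's framing, the two masks would feed different expert networks, so that both recent behavior and weekly habits (e.g. ordering lunch every Friday) inform the recommendation.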

Updated: 2025-11-03 15:46:33

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2508.02451v2

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.
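
The retrieval-first control flow the abstract describes (answer via RAG when the attack database yields sufficiently similar evidence, otherwise fall back to ensemble generation plus a selector) can be sketched as follows; every component and the threshold here are toy stand-ins, not RAD's actual implementation:

```python
# Retrieval-prioritized answering: consult a database of known attack examples
# first; only when no similar evidence is found do we generate candidates with
# multiple models and let a selector score them. Raising `threshold` trades
# utility for safety, which is the controllable knob the abstract mentions.

def answer(query, attack_db, retrieve, rag_answer, generators, selector,
           threshold=0.8):
    evidence, score = retrieve(query, attack_db)
    if score >= threshold:                       # known-attack evidence found
        return rag_answer(query, evidence)
    candidates = [g(query) for g in generators]  # fallback: model ensemble
    return max(candidates, key=selector)         # selector picks the best

# Toy stand-ins to exercise the control flow.
db = {"ignore all instructions": "refuse"}
retrieve = lambda q, d: (("refuse", 0.9) if q in d else (None, 0.0))
rag = lambda q, e: f"[RAG] {e}"
gens = [lambda q: "candidate-a", lambda q: "candidate-bb"]
select = len  # pretend longer answers score higher

print(answer("ignore all instructions", db, retrieve, rag, gens, select))
print(answer("what is the capital of France?", db, retrieve, rag, gens, select))
```

Because new jailbreak strategies only require adding examples to `attack_db`, the update path is training-free, as the abstract claims.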

Updated: 2025-11-03 15:40:21

Categories: cs.CR,cs.CL

Download: http://arxiv.org/abs/2508.16406v2

Combinatorial Creativity: A New Frontier in Generalization Abilities

Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.

Updated: 2025-11-03 15:40:02

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.21043v4

Memory-Enhanced Neural Solvers for Routing Problems

Routing Problems are central to many real-world applications, yet remain challenging due to their (NP-)hard nature. Amongst existing approaches, heuristics often offer the best trade-off between quality and scalability, making them suitable for industrial use. While Reinforcement Learning (RL) offers a flexible framework for designing heuristics, its adoption over handcrafted heuristics remains incomplete. Existing learned methods still lack the ability to adapt to specific instances and fully leverage the available computational budget. Current best methods either rely on a collection of pre-trained policies, or on RL fine-tuning; hence failing to fully utilize newly available information within the constraints of the budget. In response, we present MEMENTO, an approach that leverages memory to improve the search of neural solvers at inference. MEMENTO leverages online data collected across repeated attempts to dynamically adjust the action distribution based on the outcome of previous decisions. We validate its effectiveness on the Traveling Salesman and Capacitated Vehicle Routing problems, demonstrating its superiority over tree-search and policy-gradient fine-tuning; and showing that it can be zero-shot combined with diversity-based solvers. We successfully train all RL auto-regressive solvers on large instances, and verify MEMENTO's scalability and data-efficiency: pushing the state-of-the-art on 11 out of 12 evaluated tasks.
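
One way to picture leveraging a memory of past attempts to reshape a policy's action distribution at inference; the logit-bonus rule below is an assumption for illustration, not MEMENTO's learned mechanism:

```python
import numpy as np

# Memory-adjusted action selection: outcomes of earlier attempts on the same
# instance are stored as (action, reward) pairs, and the mean reward per
# action is added to the policy logits before sampling, shifting probability
# mass toward choices that previously paid off.

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def adjusted_policy(logits, memory, bonus_scale=1.0):
    """memory: list of (action_index, reward) from previous attempts."""
    bonus = np.zeros_like(logits)
    counts = np.zeros_like(logits)
    for action, reward in memory:
        bonus[action] += reward
        counts[action] += 1
    avg = bonus / np.maximum(counts, 1)  # mean reward per tried action
    return softmax(logits + bonus_scale * avg)

logits = np.array([0.0, 0.0, 0.0])
memory = [(0, -1.0), (2, 1.0), (2, 1.0)]  # action 2 paid off before
probs = adjusted_policy(logits, memory)
print(probs.round(3))  # probability mass shifts toward action 2
```

MEMENTO instead learns how to condition on this online data, but the sketch conveys the core idea: repeated attempts within the inference budget are not wasted, they inform later ones.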

Updated: 2025-11-03 15:37:49

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.16424v3

Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations

In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps that involve fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) from that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call $\textit{Optimal Point for Boundary Approximation}$ (OPBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation with reductions in distance between $5\%$ to $50\%$ in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics, as in the case of a health-based dataset. In terms of runtime, the OPBA algorithm generates orders of magnitude more decision boundary points than a grid-based approach in the same amount of time. In general, our method provides a simple and effective model-agnostic method that can compute the nearest feasible (i.e., realistic under constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/OPBA_For_Counterfactuals
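
The binary-search step for locating decision boundary points can be sketched against a stand-in linear classifier (OPBA itself is model-agnostic); `boundary_point` is an illustrative helper, not the authors' code:

```python
import numpy as np

# Bisect along the segment between a query point and a reference point on the
# other side of the decision boundary (sign of f differs at the endpoints)
# until the midpoint lies numerically on the boundary; repeating this for many
# references yields discrete boundary points, of which the closest feasible
# one is the counterfactual explanation.

def boundary_point(f, x_in, x_out, iters=50):
    lo, hi = x_in, x_out
    for _ in range(iters):
        mid = (lo + hi) / 2
        if np.sign(f(mid)) == np.sign(f(lo)):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

f = lambda x: x[0] + x[1] - 1.0        # stand-in model; boundary: x + y = 1
x = np.array([0.0, 0.0])               # query point, f(x) < 0
references = [np.array([2.0, 0.0]), np.array([3.0, 3.0])]  # f > 0 side
points = [boundary_point(f, x, r) for r in references]
closest = min(points, key=lambda p: np.linalg.norm(p - x))
print(closest.round(3))  # [0.5 0.5], nearer to x than [1. 0.]
```

Feasibility constraints (e.g. immutable features) would then filter or project the candidate boundary points before taking the minimum-distance one.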

Updated: 2025-11-03 15:36:39

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.22911v3

Spin-Adapted Neural Network Wavefunctions in Real Space

Spin plays a fundamental role in understanding electronic structure, yet many real-space wavefunction methods fail to adequately consider it. We introduce the Spin-Adapted Antisymmetrization Method (SAAM), a general procedure that enforces exact total spin symmetry for antisymmetric many-electron wavefunctions in real space. In the context of neural network-based quantum Monte Carlo (NNQMC), SAAM leverages the expressiveness of deep neural networks to capture electron correlation while enforcing exact spin adaptation via group representation theory. This framework provides a principled route to embed physical priors into otherwise black-box neural network wavefunctions, yielding a compact representation of correlated system with neural network orbitals. Compared with existing treatments of spin in NNQMC, SAAM is more accurate and efficient, achieving exact spin purity without any additional tunable hyperparameters. To demonstrate its effectiveness, we apply SAAM to study the spin ladder of iron-sulfur clusters, a long-standing challenge for many-body methods due to their dense spectrum of nearly degenerate spin states. Our results reveal accurate resolution of low-lying spin states and spin gaps in [Fe$_2$S$_2$] and [Fe$_4$S$_4$] clusters, offering new insights into their electronic structures. In sum, these findings establish SAAM as a robust, hyperparameter-free standard for spin-adapted NNQMC, particularly for strongly correlated systems.

Updated: 2025-11-03 15:34:19

Categories: physics.chem-ph,cs.AI

Download: http://arxiv.org/abs/2511.01671v1

SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages-Indonesian (id), Thai (th), and Vietnamese (vi)-alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

Updated: 2025-11-03 15:32:58

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2511.01670v1

Hybrid Retrieval-Augmented Generation Agent for Trustworthy Legal Question Answering in Judicial Forensics

As artificial intelligence permeates judicial forensics, ensuring the veracity and traceability of legal question answering (QA) has become critical. Conventional large language models (LLMs) are prone to hallucination, risking misleading guidance in legal consultation, while static knowledge bases struggle to keep pace with frequently updated statutes and case law. We present a hybrid legal QA agent tailored for judicial settings that integrates retrieval-augmented generation (RAG) with multi-model ensembling to deliver reliable, auditable, and continuously updatable counsel. The system prioritizes retrieval over generation: when a trusted legal repository yields relevant evidence, answers are produced via RAG; otherwise, multiple LLMs generate candidates that are scored by a specialized selector, with the top-ranked answer returned. High-quality outputs then undergo human review before being written back to the repository, enabling dynamic knowledge evolution and provenance tracking. Experiments on the Law\_QA dataset show that our hybrid approach significantly outperforms both a single-model baseline and a vanilla RAG pipeline on F1, ROUGE-L, and an LLM-as-a-Judge metric. Ablations confirm the complementary contributions of retrieval prioritization, model ensembling, and the human-in-the-loop update mechanism. The proposed system demonstrably reduces hallucination while improving answer quality and legal compliance, advancing the practical deployment of media forensics technologies in judicial scenarios.
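The retrieval-prioritized control flow described above can be sketched as a small dispatcher. All names here (`retrieve`, `rag_answer`, the generator and selector callables) are illustrative stand-ins, not the paper's actual API:

```python
def answer(question, retrieve, rag_answer, generators, selector, min_score=0.5):
    """Prefer retrieval-grounded answers; otherwise fall back to a scored ensemble."""
    evidence = retrieve(question)  # hypothetical: list of (passage, relevance) pairs
    passages = [p for p, score in evidence if score >= min_score]
    if passages:
        # The trusted legal repository yielded relevant evidence: answer via RAG.
        return {"answer": rag_answer(question, passages), "source": "rag"}
    # No evidence: several LLMs propose candidates; a selector ranks them.
    candidates = [generate(question) for generate in generators]
    best = max(candidates, key=lambda cand: selector(question, cand))
    return {"answer": best, "source": "ensemble"}
```

In the paper's workflow, accepted answers would then undergo human review before being written back to the repository, closing the update loop.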

Updated: 2025-11-03 15:30:58

标题: 混合检索增强生成代理——司法取证中值得信赖的法律问题回答

摘要: 随着人工智能渗透到司法取证领域,确保法律问答(QA)的真实性和可追溯性变得至关重要。传统的大型语言模型(LLMs)容易产生幻觉,导致在法律咨询中提供误导性的指导,而静态知识库难以跟上频繁更新的法规和案例法。我们提出了一种专为司法环境量身定制的混合法律问答代理,该代理将检索增强生成(RAG)与多模型集成,以提供可靠、可审计和持续可更新的建议。该系统优先考虑检索而非生成:当信任的法律知识库产生相关证据时,答案通过RAG生成;否则,多个LLMs生成候选答案,由专门的选择器评分,返回排名最高的答案。高质量的输出经过人工审核后写回知识库,实现动态知识演化和溯源跟踪。在Law\_QA数据集上的实验表明,我们的混合方法在F1、ROUGE-L和LLM作为法官度量标准上明显优于单一模型基线和纯RAG流水线。消融实验证实了检索优先、模型集成和人为干预更新机制的互补贡献。所提出的系统明显减少了幻觉,同时提高了答案质量和法律合规性,推动了媒体取证技术在司法场景中的实际落地。

更新时间: 2025-11-03 15:30:58

领域: cs.AI

下载: http://arxiv.org/abs/2511.01668v1

The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity

While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system's output from a musicological perspective, finding the model can maintain stylistic semantics and develop coherent phrasal ideas, demonstrating that such embodied systems can engage in musically sophisticated dialogue and open a promising new path for human-AI co-creation.

Updated: 2025-11-03 15:26:01

标题: 钥匙中的幽灵:人工智能和人类音乐共创的Disklavier演示

摘要: 音乐创作的生成模型越来越强大,但音乐家使用这些模型受到文本提示的阻碍,这种异步工作流与乐器演奏的体验性、响应性质相脱离。为了解决这一问题,我们介绍了Aria-Duet,这是一个交互式系统,促进了人类钢琴家与Aria(一种最先进的生成模型)之间的实时音乐二重奏,使用Yamaha Disklavier作为共享的物理界面。该框架实现了轮流合作:用户演奏,发出交接信号,模型生成一个连贯的延续,通过钢琴进行声学演奏。除了描述实现这种低延迟交互的技术架构外,我们还从音乐学的角度分析了系统的输出,发现模型能够保持风格语义并发展连贯的乐句思想,表明这种体验性系统能够进行音乐上复杂的对话,并为人工智能与人类共同创作开辟了一条有前途的新路径。

更新时间: 2025-11-03 15:26:01

领域: cs.SD,cs.AI,cs.HC

下载: http://arxiv.org/abs/2511.01663v1

OrbitChain: Orchestrating In-orbit Real-time Analytics of Earth Observation Data

Earth observation analytics have the potential to serve many time-sensitive applications. However, due to limited bandwidth and duration of ground-satellite connections, it takes hours or even days to download and analyze data from existing Earth observation satellites, making real-time demands like timely disaster response impossible. Toward real-time analytics, we introduce OrbitChain, a collaborative analytics framework that orchestrates computational resources across multiple satellites in an Earth observation constellation. OrbitChain decomposes analytics applications into microservices and allocates computational resources for time-constrained analysis. A traffic routing algorithm is devised to minimize the inter-satellite communication overhead. OrbitChain adopts a pipeline workflow that completes Earth observation tasks in real-time, facilitates time-sensitive applications and inter-constellation collaborations such as tip-and-cue. To evaluate OrbitChain, we implement a hardware-in-the-loop orbital computing testbed. Experiments show that our system can complete up to 60% more analytics workload than the existing Earth observation analytics framework while reducing the communication overhead by up to 72%.

Updated: 2025-11-03 15:24:01

标题: OrbitChain:协调地球观测数据在轨实时分析

摘要: 地球观测分析具有为许多时间敏感应用提供服务的潜力。然而,由于地面卫星连接的带宽和持续时间有限,从现有的地球观测卫星下载和分析数据需要几个小时甚至几天的时间,这使得像及时灾害响应这样的实时需求变得不可能。为了实现实时分析,我们引入了OrbitChain,这是一个协作分析框架,协调多颗卫星在地球观测星座中的计算资源。OrbitChain将分析应用程序分解为微服务,并为时限分析分配计算资源。设计了一种交通路由算法,以最小化卫星间通信开销。OrbitChain采用了一个管道工作流程,可以实时完成地球观测任务,促进时间敏感应用和星座间协作,如指示和提示。为了评估OrbitChain,我们实施了一个硬件在环轨道计算测试平台。实验表明,我们的系统可以完成比现有地球观测分析框架高达60%的分析工作负载,同时将通信开销降低高达72%。

更新时间: 2025-11-03 15:24:01

领域: cs.DC,cs.ET,cs.LG,cs.NI

下载: http://arxiv.org/abs/2508.13374v2

A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design

Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are \emph{functionally equivalent} with respect to those contracts.
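The contract layer described above can be sketched as a decorator that checks a precondition on the input, validates the output, and retries generation with a remediated prompt on violation. This is a minimal illustration of the DbC idea, not the paper's actual implementation; all names are hypothetical:

```python
def contract(pre, post, remediate=None, max_tries=3):
    """Mediate an LLM call with precondition/postcondition checks (DbC style)."""
    def wrap(llm_call):
        def guarded(prompt):
            if not pre(prompt):
                raise ValueError("precondition violated")
            for _ in range(max_tries):
                output = llm_call(prompt)
                if post(output):             # semantic/type validation
                    return output
                if remediate is not None:    # steer the next attempt toward compliance
                    prompt = remediate(prompt, output)
            raise ValueError("contract unsatisfied after remediation attempts")
        return guarded
    return wrap
```

Under this reading, any two `llm_call` implementations that always satisfy the same `pre`/`post` pair are interchangeable with respect to that contract, which is the functional-equivalence postulate in miniature.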

Updated: 2025-11-03 15:21:13

标题: 一个受DbC启发的神经符号层,用于可信任的代理设计

摘要: 生成模型,特别是大型语言模型(LLMs),能够产生流畅的输出,但缺乏可验证的保证。我们采用设计契约(DbC)和类型论原则,引入一个契约层,调节每个LLM调用。契约规定输入和输出的语义和类型要求,结合概率修复,引导生成向符合要求的方向发展。这一层暴露了LLMs作为语义解析器和概率黑盒组件的双重视图。契约满足是概率性的,语义验证通过程序员指定的对良好类型数据结构的条件来操作定义。更广泛地,这项工作假设满足相同契约的任何两个代理在这些契约方面是\emph{功能上等价}的。

更新时间: 2025-11-03 15:21:13

领域: cs.LG,cs.AI,I.2.7; I.2.2; I.1.2; D.1.0

下载: http://arxiv.org/abs/2508.03665v4

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Open-weight bio-foundation models present a dual-use dilemma. While holding great promise for accelerating scientific research and drug development, they could also enable bad actors to develop more deadly bioweapons. To mitigate the risk posed by these models, current approaches focus on filtering biohazardous data during pre-training. However, the effectiveness of such an approach remains unclear, particularly against determined actors who might fine-tune these models for malicious use. To address this gap, we propose \eval, a framework to evaluate the robustness of procedures that are intended to reduce the dual-use capabilities of bio-foundation models. \eval assesses models' virus understanding through three lenses, including sequence modeling, mutational effects prediction, and virulence prediction. Our results show that current filtering practices may not be particularly effective: Excluded knowledge can be rapidly recovered in some cases via fine-tuning, and exhibits broader generalizability in sequence modeling. Furthermore, dual-use signals may already reside in the pretrained representations, and can be elicited via simple linear probing. These findings highlight the challenges of data filtering as a standalone procedure, underscoring the need for further research into robust safety and security strategies for open-weight bio-foundation models.

Updated: 2025-11-03 15:19:02

标题: 开放式生物基金模型生物风险评估的最佳实践

摘要: 开放式权重生物基础模型存在双重使用困境。虽然它们有望加速科学研究和药物开发,但也可能使恶意行为者开发更致命的生物武器。为了减轻这些模型所带来的风险,目前的方法主要集中在在预训练期间过滤生物危害数据。然而,这种方法的有效性仍不清楚,特别是对那些可能对这些模型进行微调以进行恶意使用的决心行动者而言。为了解决这一问题,我们提出了\eval,一个旨在评估旨在降低生物基础模型双重使用能力的程序的鲁棒性的框架。 \eval 通过三个视角评估模型对病毒的理解,包括序列建模、突变效应预测和毒力预测。我们的结果显示,当前的过滤实践可能并不特别有效:在某些情况下,被排除的知识可以通过微调迅速恢复,并在序列建模中具有更广泛的泛化性。此外,双重使用信号可能已经存在于预训练表示中,并可以通过简单的线性探测引出。这些发现凸显了数据过滤作为独立程序的挑战,强调了对开放式权重生物基础模型的鲁棒安全和安全策略进行进一步研究的必要性。

更新时间: 2025-11-03 15:19:02

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2510.27629v2

Panther: A Cost-Effective Privacy-Preserving Framework for GNN Training and Inference Services in Cloud Environments

Graph Neural Networks (GNNs) have marked significant impact in traffic state prediction, social recommendation, knowledge-aware question answering and so on. As more and more users move towards cloud computing, it has become a critical issue to unleash the power of GNNs while protecting the privacy in cloud environments. Specifically, the training data and inference data for GNNs need to be protected from being stolen by external adversaries. Meanwhile, the financial cost of cloud computing is another primary concern for users. Therefore, although existing studies have proposed privacy-preserving techniques for GNNs in cloud environments, their additional computational and communication overhead remain relatively high, causing high financial costs that limit their widespread adoption among users. To protect GNN privacy while lowering the additional financial costs, we introduce Panther, a cost-effective privacy-preserving framework for GNN training and inference services in cloud environments. Technically, Panther leverages four-party computation to asynchronously execute the secure array access protocol, and randomly pads the neighbor information of GNN nodes. We prove that Panther can protect privacy for both training and inference of GNN models. Our evaluation shows that Panther reduces the training and inference time by an average of 75.28% and 82.80%, respectively, and communication overhead by an average of 52.61% and 50.26% compared with the state-of-the-art, which is estimated to save an average of 55.05% and 59.00% in financial costs (based on on-demand pricing model) for the GNN training and inference process on Google Cloud Platform.

Updated: 2025-11-03 15:15:40

标题: 猎豹:云环境中GNN训练和推断服务的成本效益的隐私保护框架

摘要: 图神经网络(GNNs)在交通状态预测、社交推荐、知识感知问答等领域产生了显著影响。随着越来越多的用户转向云计算,释放GNNs的能力同时保护云环境中的隐私已成为一个关键问题。具体来说,GNNs的训练数据和推断数据需要受到外部对手的窃取保护。同时,云计算的财务成本也是用户关注的主要问题。因此,尽管现有研究已经提出了在云环境中保护GNNs隐私的技术,但其额外的计算和通信开销仍然相对较高,导致高昂的财务成本限制了它们在用户中的广泛应用。 为了保护GNN隐私并降低额外的财务成本,我们引入了Panther,这是一个在云环境中进行GNN训练和推断服务的经济高效的隐私保护框架。从技术上讲,Panther利用四方计算异步执行安全数组访问协议,并随机填充GNN节点的邻居信息。我们证明Panther可以保护GNN模型的训练和推断的隐私。我们的评估表明,与最先进技术相比,Panther分别将训练和推断时间平均缩短了75.28%和82.80%,通信开销平均减少了52.61%和50.26%,据估计在Google Cloud Platform上的GNN训练和推断过程中,可节省平均55.05%和59.00%的财务成本(基于按需定价模型)。

更新时间: 2025-11-03 15:15:40

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2511.01654v1

Calibrating Bayesian Learning via Regularization, Confidence Minimization, and Selective Inference

The application of artificial intelligence (AI) models in fields such as engineering is limited by the known difficulty of quantifying the reliability of an AI's decision. A well-calibrated AI model must correctly report its accuracy on in-distribution (ID) inputs, while also enabling the detection of out-of-distribution (OOD) inputs. A conventional approach to improve calibration is the application of Bayesian ensembling. However, owing to computational limitations and model misspecification, practical ensembling strategies do not necessarily enhance calibration. This paper proposes an extension of variational inference (VI)-based Bayesian learning that integrates calibration regularization for improved ID performance, confidence minimization for OOD detection, and selective calibration to ensure a synergistic use of calibration regularization and confidence minimization. The scheme is constructed successively by first introducing calibration-regularized Bayesian learning (CBNN), then incorporating out-of-distribution confidence minimization (OCM) to yield CBNN-OCM, and finally integrating also selective calibration to produce selective CBNN-OCM (SCBNN-OCM). Selective calibration rejects inputs for which the calibration performance is expected to be insufficient. Numerical results illustrate the trade-offs between ID accuracy, ID calibration, and OOD calibration attained by both frequentist and Bayesian learning methods. Among the main conclusions, SCBNN-OCM is seen to achieve the best ID and OOD performance compared with existing state-of-the-art approaches, at the cost of rejecting a sufficiently large number of inputs.
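Calibration here means agreement between reported confidence and realized accuracy; a standard way to quantify it is the expected calibration error (ECE), and selective inference then abstains on inputs whose confidence falls below a threshold. A minimal sketch, with the binning scheme and threshold rule as illustrative assumptions rather than the paper's exact formulation:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the weighted |confidence - accuracy| gap."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    gap = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            gap += (len(b) / total) * abs(avg_conf - accuracy)
    return gap

def selective_predict(confidence, threshold=0.8):
    """Selective inference: abstain when calibration is expected to be insufficient."""
    return "predict" if confidence >= threshold else "reject"
```

A model that is always 95% confident but always correct has a 0.05 gap, whereas a 50%-confident model that is right half the time scores zero, matching the intuition that calibration is distinct from accuracy.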

Updated: 2025-11-03 15:12:10

标题: 通过正则化、置信度最小化和选择性推断校准贝叶斯学习

摘要: 人工智能(AI)模型在工程等领域的应用受到了量化AI决策可靠性困难的限制。一个良好校准的AI模型必须正确报告其在分布内(ID)输入上的准确性,同时也能够检测到分布外(OOD)输入。改善校准的传统方法是应用贝叶斯集成。然而,由于计算限制和模型误差,实际集成策略不一定会提高校准性。本文提出了一种基于变分推断(VI)的贝叶斯学习的扩展,该方法整合了校准正则化以改善ID性能,置信度最小化以进行OOD检测,并选择性校准以确保校准正则化和置信度最小化的协同使用。该方案是通过首先引入校准正则化贝叶斯学习(CBNN),然后将分布外置信度最小化(OCM)纳入以产生CBNN-OCM,最后还整合选择性校准以生成选择性CBNN-OCM(SCBNN-OCM)。选择性校准拒绝预计校准性能不足的输入。数值结果说明了频率学习方法和贝叶斯学习方法在ID准确性、ID校准和OOD校准之间的权衡。在主要结论中,SCBNN-OCM被认为在拒绝足够多的输入的代价下实现了最佳的ID和OOD性能,相比现有最先进方法。

更新时间: 2025-11-03 15:12:10

领域: cs.LG,cs.AI,eess.SP

下载: http://arxiv.org/abs/2404.11350v3

Estimation of aboveground biomass in a tropical dry forest: An intercomparison of airborne, unmanned, and space laser scanning

According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian approaches to SVM regression. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables primarily related to tree height: Elev.minimum, Elev.L3, Elev.MAD.mode, Elev.mode, Elev.MAD.median, and Elev.skewness, are important for AGB estimation using ALSD and ULSD, while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica's Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha. The SVM regressions demonstrated an error of 17.89 across all laser scanning systems, with SLSFW exhibiting the lowest error (17.07) in estimating total biomass per plot.

Updated: 2025-11-03 15:11:02

标题: 热带干旱森林地上生物量的估算:航空、无人机和太空激光扫描的相互比较

摘要: 根据《巴黎气候变化协定》,所有国家都要在2024年之前每两年提交关于其温室气体排放和吸收情况的报告。因此,森林在减少碳排放方面发挥着至关重要的作用,这对满足这些义务至关重要。《巴黎协定》第5条强调了对高质量森林数据的需求,认识到森林保护在全球应对气候变化的斗争中的重要性。本研究专注于提高热带干旱森林地上生物量测绘方法。热带干旱森林被认为是最不为人了解的热带森林环境之一,因此需要准确的方法来估算碳库。我们采用了对AGB估算的比较分析,利用不同的离散和全波形激光扫描数据集,结合普通最小二乘法和贝叶斯方法的SVM回归。航空激光扫描、无人机激光扫描和空间激光扫描被用作提取森林指标的独立变量。通过变量选择、SVM回归调整和机器学习方法进行的交叉验证,以解决过拟合和欠拟合问题。结果表明,与树高主要相关的六个关键变量:Elev.minimum、Elev.L3、Elev.MAD.mode、Elev.mode、Elev.MAD.median和Elev.skewness,在使用ALSD和ULSD估算AGB时非常重要,而叶面积指数、冠层覆盖率和高度、地形海拔和全波形信号能量则被视为最关键的变量。在哥斯达黎加关卡斯特省的十个永久热带干旱森林样地中,估算的AGB值范围从26.02 Mg/ha到175.43 Mg/ha不等。SVM回归表明,在所有激光扫描系统中存在17.89的误差,其中SLSFW在估算每个样地总生物量方面表现出最低的误差(17.07)。

更新时间: 2025-11-03 15:11:02

领域: eess.SP,cs.LG

下载: http://arxiv.org/abs/2510.27408v2

EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering

Large Language Models (LLMs) are increasingly being applied to specialized, high-stakes domains like engineering, which demands rigorous evaluation of their complex reasoning capabilities. While current benchmarks assess language understanding, factual recall, mathematics or code generation, none capture the integrative reasoning central to engineering where scientific principles, quantitative modeling and practical constraints must converge. To address this gap, we introduce EngChain, a benchmark for verifiable multi-step engineering problem-solving. EngChain contains 90 problems spanning three engineering branches, organized into 9 domains and 20 distinct areas. The problems are generated from symbolic templates with a high degree of randomization to ensure diversity and eliminate the risk of contamination. With this benchmark, we move beyond final answer accuracy with a two-stage evaluation: we first quantitatively verify the numerical and semantic validity of each reasoning step and then introduce LLM-As-A-Judge, an automated system to qualitatively categorize the identified reasoning errors.
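The first evaluation stage, quantitatively verifying each reasoning step, can be pictured as recomputing every intermediate quantity from the symbolic template's parameters and comparing it to the model's claimed value within a tolerance. This is an illustrative reconstruction, not EngChain's actual checker, and the example values are hypothetical:

```python
import math

def verify_steps(steps, rel_tol=1e-3):
    """steps: list of (claimed_value, recompute) pairs, where recompute() returns
    the ground-truth value derived from the template's randomized parameters."""
    return [math.isclose(claimed, recompute(), rel_tol=rel_tol)
            for claimed, recompute in steps]

# Hypothetical two-step calculation: the second claimed value is wrong.
steps = [
    (12.0, lambda: 3.0 * 4.0),    # claimed moment matches 3.0 * 4.0
    (0.5, lambda: 12.0 / 25.0),   # claimed stress deviates from 12/25 = 0.48
]
```

Steps that fail this numerical check would then be passed to the LLM-as-a-judge stage for qualitative error categorization.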

Updated: 2025-11-03 15:05:44

标题: EngChain:工程中可验证多步推理的符号基准

摘要: 大型语言模型(LLMs)越来越多地被应用于专业化、高风险领域,如工程领域,这要求对它们复杂的推理能力进行严格评估。目前的基准测试评估语言理解、事实回忆、数学或代码生成,但没有捕捉到工程中的综合推理,其中科学原理、定量建模和实际约束必须融合。为了填补这一空白,我们引入了EngChain,这是一个用于可验证的多步工程问题解决的基准测试。EngChain包含了90个问题,涵盖了三个工程分支,分为9个领域和20个不同领域。这些问题是从具有高度随机性的符号模板中生成的,以确保多样性并消除污染风险。通过这个基准测试,我们不仅仅在最终答案准确性上取得进展,还采用了两阶段评估:我们首先定量验证每个推理步骤的数字和语义有效性,然后引入了LLM-As-A-Judge,这是一个自动化系统,用于定性地对识别出的推理错误进行分类。

更新时间: 2025-11-03 15:05:44

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.01650v1

TinyDef-DETR: A Transformer-Based Framework for Defect Detection in Transmission Lines from UAV Imagery

Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult objects. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous objects.
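A space-to-depth rearrangement downsamples spatially without discarding any pixels by moving each block×block neighborhood into the channel dimension, which is why it preserves the fine detail small defects depend on. A minimal pure-Python sketch of the operation (the actual module presumably operates on tensors inside the backbone):

```python
def space_to_depth(x, block=2):
    """x: nested lists [C][H][W] -> [C*block*block][H/block][W/block]."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    assert h % block == 0 and w % block == 0
    out = []
    for ch in range(c):
        for dy in range(block):
            for dx in range(block):
                # Each (dy, dx) offset within a block becomes its own channel.
                out.append([[x[ch][i * block + dy][j * block + dx]
                             for j in range(w // block)]
                            for i in range(h // block)])
    return out
```

Unlike strided convolution or pooling, the transform is lossless: every input value survives in some output channel.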

Updated: 2025-11-03 15:03:19

标题: TinyDef-DETR:一种基于Transformer的框架,用于从无人机图像中检测输电线路缺陷

摘要: 来自无人机图像的输电线缺陷自动检测是一项具有挑战性的任务,原因在于缺陷的小尺寸、模糊性和复杂的背景。本文提出了TinyDef-DETR,这是一个基于DETR的框架,旨在实现从无人机获取的图像中准确高效地检测输电线缺陷。该模型整合了四个主要组件:一个边缘增强的ResNet主干,用于加强边界敏感的表示,一个无步长的深度模块,用于实现保留细节的降采样,一个跨阶段双域多尺度注意力机制,用于联合建模全局上下文和局部线索,以及一个Focaler-Wise-SIoU回归损失,用于提高小型和困难对象的定位。这些设计共同有效地缓解了传统检测器的局限性。在公共和真实世界数据集上进行的大量实验表明,TinyDef-DETR实现了卓越的检测性能和强大的泛化能力,同时保持了适度的计算开销。TinyDef-DETR的准确性和效率使其成为一种适用于基于无人机的输电线缺陷检测的方法,特别是在涉及小型和模糊对象的情境中。

更新时间: 2025-11-03 15:03:19

领域: cs.CV,cs.AI,cs.CE

下载: http://arxiv.org/abs/2509.06035v8

A Graph-based RAG for Energy Efficiency Question Answering

In this work, we investigate the use of Large Language Models (LLMs) within a graph-based Retrieval Augmented Generation (RAG) architecture for Energy Efficiency (EE) Question Answering. First, the system automatically extracts a Knowledge Graph (KG) from guidance and regulatory documents in the energy field. Then, the generated graph is navigated and reasoned upon to provide users with accurate answers in multiple languages. We implement a human-based validation using the RAGAs framework properties, a validation dataset comprising 101 question-answer pairs, and domain experts. Results confirm the potential of this architecture and identify its strengths and weaknesses. Validation results show that the system answers correctly in about three out of four cases (75.2 +- 2.7%), with higher results on questions related to more general EE answers (up to 81.0 +- 4.1%), and featuring promising multilingual abilities (4.4% accuracy loss due to translation).

Updated: 2025-11-03 14:55:34

标题: 一个基于图的用于能源效率问答的RAG

摘要: 在这项工作中,我们研究了在基于图的检索增强生成(RAG)架构中使用大型语言模型(LLMs)来进行能源效率(EE)问答。首先,系统自动从能源领域的指导和监管文件中提取知识图(KG)。然后,生成的图被导航和推理,以便为用户提供准确的多语言答案。我们使用RAGAs框架属性进行基于人的验证,验证数据集包括101个问题-答案对和领域专家。结果确认了这种架构的潜力,并识别了其优势和劣势。验证结果显示系统在约四分之三的情况下正确回答(75.2 +- 2.7%),对于与更一般的EE答案相关的问题结果更好(高达81.0 +- 4.1%),并具有有希望的多语言能力(由于翻译造成的4.4%的准确性损失)。

更新时间: 2025-11-03 14:55:34

领域: cs.CL,cs.AI,cs.IR,I.2.7; I.2.4; I.2.1; I.2.6

下载: http://arxiv.org/abs/2511.01643v1

IVGAE-TAMA-BO: A novel temporal dynamic variational graph model for link prediction in global food trade networks with momentum structural memory and Bayesian optimization

Global food trade plays a crucial role in ensuring food security and maintaining supply chain stability. However, its network structure evolves dynamically under the influence of geopolitical, economic, and environmental factors, making it challenging to model and predict future trade links. Effectively capturing temporal patterns in food trade networks is therefore essential for improving the accuracy and robustness of link prediction. This study introduces IVGAE-TAMA-BO, a novel dynamic graph neural network designed to model evolving trade structures and predict future links in global food trade networks. To the best of our knowledge, this is the first work to apply dynamic graph neural networks to this domain, significantly enhancing predictive performance. Building upon the original IVGAE framework, the proposed model incorporates a Trade-Aware Momentum Aggregator (TAMA) to capture the temporal evolution of trade networks, jointly modeling short-term fluctuations and long-term structural dependencies. A momentum-based structural memory mechanism further improves predictive stability and performance. In addition, Bayesian optimization is used to automatically tune key hyperparameters, enhancing generalization across diverse trade scenarios. Extensive experiments on five crop-specific datasets demonstrate that IVGAE-TAMA substantially outperforms the static IVGAE and other dynamic baselines by effectively modeling temporal dependencies, while Bayesian optimization further boosts performance in IVGAE-TAMA-BO. These results highlight the proposed framework as a robust and scalable solution for structural prediction in global trade networks, with strong potential for applications in food security monitoring and policy decision support.
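The momentum-based structural memory can be read as an exponential moving average over per-snapshot node representations, so the retained state blends long-term structure with damped short-term fluctuations. A hedged sketch of that reading; the update rule and the `beta` coefficient are assumptions for illustration, not the paper's exact operator:

```python
def momentum_memory(snapshots, beta=0.9):
    """snapshots: iterable of {node: feature_vector} dicts, one per timestep.
    Returns the momentum-smoothed per-node state after the last snapshot."""
    memory = None
    for feats in snapshots:
        if memory is None:
            memory = {n: list(v) for n, v in feats.items()}
            continue
        for n, v in feats.items():
            prev = memory.get(n, list(v))  # nodes appearing later start from their own features
            memory[n] = [beta * p + (1 - beta) * x for p, x in zip(prev, v)]
    return memory
```

A large `beta` keeps the memory close to the accumulated trade structure, while `1 - beta` admits the most recent snapshot's fluctuation.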

Updated: 2025-11-03 14:48:32

标题: IVGAE-TAMA-BO:一种新型的时变动态变分图模型,用于全球食品贸易网络中的链接预测,具有动量结构记忆和贝叶斯优化

摘要: 全球食品贸易在确保食品安全和维持供应链稳定方面起着至关重要的作用。然而,在地缘政治、经济和环境因素的影响下,其网络结构动态演变,使得对未来贸易联系进行建模和预测具有挑战性。有效捕捉食品贸易网络中的时间模式对于提高链接预测的准确性和稳健性至关重要。本研究介绍了IVGAE-TAMA-BO,这是一种新颖的动态图神经网络,旨在建模不断演变的贸易结构并预测全球食品贸易网络中的未来联系。据我们所知,这是首个将动态图神经网络应用于该领域的工作,极大地提升了预测性能。在原始IVGAE框架的基础上,所提出的模型融合了Trade-Aware Momentum Aggregator(TAMA)来捕捉贸易网络的时间演变,共同建模短期波动和长期结构依赖关系。一种基于动量的结构性记忆机制进一步提高了预测稳定性和性能。此外,贝叶斯优化用于自动调节关键超参数,增强在多样化贸易场景中的泛化能力。在五个特定作物数据集上进行的大量实验表明,IVGAE-TAMA在有效建模时间依赖性方面明显优于静态IVGAE和其他动态基准线,而贝叶斯优化进一步提升了IVGAE-TAMA-BO的性能。这些结果突显了所提出的框架作为全球贸易网络结构预测的稳健且可扩展的解决方案,具有在食品安全监测和政策决策支持领域应用的潜力。

更新时间: 2025-11-03 14:48:32

领域: cs.AI

下载: http://arxiv.org/abs/2511.01639v1

Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models

Large Language Models (LLMs) are increasingly used in intelligent systems that perform reasoning, summarization, and code generation. Their ability to follow natural-language instructions, while powerful, also makes them vulnerable to a new class of attacks known as prompt injection. In these attacks, hidden or malicious instructions are inserted into user inputs or external content, causing the model to ignore its intended task or produce unsafe responses. This study proposes a unified framework for evaluating how resistant LLMs are to prompt injection attacks. The framework defines three complementary metrics, the Resilience Degradation Index (RDI), Safety Compliance Coefficient (SCC), and Instructional Integrity Metric (IIM), to jointly measure robustness, safety, and semantic stability. We evaluated four instruction-tuned models (GPT-4, GPT-4o, LLaMA-3 8B Instruct, and Flan-T5-Large) on five common language tasks: question answering, summarization, translation, reasoning, and code generation. Results show that all models remain partially vulnerable, especially to indirect and direct-override attacks. GPT-4 achieved the best overall resilience (RDR = 9.8%, SCR = 96.4%), while open-weight models exhibited higher performance degradation and lower safety scores. The findings demonstrate that alignment strength and safety tuning play a greater role in resilience than model size alone. The proposed framework offers a structured, reproducible approach for assessing model robustness and provides practical insights for improving LLM safety and reliability.
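A degradation metric of this kind can be illustrated as the mean relative score drop between clean and attacked runs of the same tasks. The exact formula below is an assumption for illustration, not the paper's definition of RDI:

```python
def resilience_degradation_index(clean, attacked):
    """clean/attacked: {task: score} dicts. Returns the mean relative drop in [0, 1];
    0 means no degradation under attack, 1 means total collapse."""
    drops = []
    for task, c in clean.items():
        a = attacked[task]
        drops.append(max(0.0, (c - a) / c) if c > 0 else 0.0)
    return sum(drops) / len(drops)
```

Under this reading, a low index means attacked performance stays close to clean performance, which is the resilience the paper measures per task and per model.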

Updated: 2025-11-03 14:43:56

标题: 即时注入作为一种新兴威胁:评估大型语言模型的弹性

摘要: 大型语言模型(LLMs)越来越多地用于执行推理、摘要和代码生成等智能系统。它们能够遵循自然语言指令,虽然功能强大,但也使它们容易受到一种称为提示注入的新型攻击的影响。在这些攻击中,隐藏或恶意指令被插入用户输入或外部内容,导致模型忽视其预期任务或产生不安全的响应。本研究提出了一个统一框架,用于评估大型语言模型(LLMs)对提示注入攻击的抵抗力。该框架定义了三个互补指标,即弹性降级指数(RDI)、安全遵从系数(SCC)和指令完整性度量(IIM),共同衡量健壮性、安全性和语义稳定性。我们在五个常见的语言任务上评估了四个经过指令调整的模型(GPT-4、GPT-4o、LLaMA-3 8B Instruct和Flan-T5-Large):问答、摘要、翻译、推理和代码生成。结果显示,所有模型仍然部分易受攻击,尤其是对于间接和直接覆盖攻击。GPT-4实现了最佳的整体弹性(RDR = 9.8%,SCR = 96.4%),而开放权重模型表现出更高的性能降级和更低的安全分数。结果表明,对齐强度和安全调整在弹性方面的作用比模型大小本身更重要。提出的框架为评估模型的稳健性提供了结构化、可再现的方法,并为改善LLM的安全性和可靠性提供了实用见解。

更新时间: 2025-11-03 14:43:56

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2511.01634v1

Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving

Graph Chain-of-Thought (Graph-CoT) enables large language models (LLMs) to perform step-by-step reasoning over graph-structured knowledge, but existing pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single-agent monolithic prompts, repeated context re-encoding, and inefficient serving execution. We present GLM, the first multi-agent Graph-CoT system co-designed with an optimized LLM serving architecture. GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval, enabling branching and selective context sharing to reduce prompt length and reasoning iterations while preserving reasoning quality, thereby improving accuracy and reducing overall token consumption. To scale inference, we introduce a Graph-CoT-aware LLM inference mechanism with graph-specific KV-cache management, priority-based eviction, and pipelined execution to improve serving efficiency. Experiments demonstrate that GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1x higher throughput compared to state-of-the-art Graph-CoT baselines, enabling efficient adoption for complex real-world reasoning at scale.

Updated: 2025-11-03 14:42:53

标题: 扩展图链式推理:一个带有高效LLM服务的多智能体框架

摘要: 图链思维(Graph-CoT)使大型语言模型(LLMs)能够在图结构化知识上逐步推理,但现有的管道存在精度低、令牌使用过多、延迟高和吞吐量低的问题,这是由于单一代理人的单块提示、重复上下文重新编码和低效服务执行。我们提出GLM,这是第一个与优化的LLM服务架构共同设计的多代理图-CoT系统。GLM将推理分解为专门的代理,用于分类、推理、动作生成和图检索,从而实现分支和选择性上下文共享,以减少提示长度和推理迭代,同时保持推理质量,从而提高准确性并降低总体令牌消耗。为了扩展推理,我们引入了一种具有图特定KV缓存管理、基于优先级的驱逐和管道执行的Graph-CoT感知LLM推理机制,以提高服务效率。实验证明,GLM将答案准确性提高了高达38%,将令牌成本降低了高达95.7%,将推理延迟降低了90.3%,与最先进的Graph-CoT基线相比,吞吐量提高了高达15.1倍,从而实现了在规模上进行复杂现实世界推理的有效采用。

更新时间: 2025-11-03 14:42:53

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.01633v1

Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases

Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available in many environments, both as a Python library and via a user interface.
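The pairing of discretisation with a naive Bayes classifier can be sketched with equal-frequency binning and Laplace-smoothed bin counts. Khiops itself selects discretisations with a Bayes-optimal criterion rather than fixed-width bins, so the following is a deliberate simplification of the idea, not Khiops's algorithm:

```python
import math
from collections import defaultdict

def equal_freq_edges(values, n_bins):
    """Cut points so each bin holds roughly the same number of training points."""
    s = sorted(values)
    return [s[(i * len(s)) // n_bins] for i in range(1, n_bins)]

def to_bin(v, edges):
    for i, edge in enumerate(edges):
        if v < edge:
            return i
    return len(edges)

def train(X, y, n_bins=3):
    edges = [equal_freq_edges([row[j] for row in X], n_bins)
             for j in range(len(X[0]))]
    bin_counts = defaultdict(int)    # (class, feature, bin) -> count
    class_counts = defaultdict(int)
    for row, c in zip(X, y):
        class_counts[c] += 1
        for j, v in enumerate(row):
            bin_counts[(c, j, to_bin(v, edges[j]))] += 1
    return edges, bin_counts, class_counts, n_bins

def predict(model, row):
    edges, bin_counts, class_counts, n_bins = model
    total = sum(class_counts.values())
    def log_posterior(c):
        lp = math.log(class_counts[c] / total)
        for j, v in enumerate(row):
            b = to_bin(v, edges[j])
            # Laplace smoothing keeps unseen (class, bin) pairs from zeroing the posterior.
            lp += math.log((bin_counts[(c, j, b)] + 1) / (class_counts[c] + n_bins))
        return lp
    return max(class_counts, key=log_posterior)
```

Discretising first is what lets the same naive Bayes machinery treat numerical and categorical variables uniformly, which is the property the abstract's variable-importance measure builds on.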

Updated: 2025-11-03 14:30:33

标题: Khiops:面向大型、多表数据库的端到端、节俭的AutoML和XAI机器学习解决方案

摘要: Khiops是一个开源的机器学习工具,旨在挖掘大型多表数据库。Khiops基于独特的贝叶斯方法,吸引了学术界的兴趣,有超过20篇关于变量选择、分类、决策树和共聚类等主题的发表文章。它通过离散化模型对数值数据和值聚类对分类数据提供变量重要性的预测度量。所提出的分类/回归模型是一个朴素贝叶斯分类器,包括变量选择和权重学习。在多表数据库的情况下,它通过自动构建聚合提供命题化。Khiops适用于分析拥有数百万个个体、数万个变量和数亿条次要表记录的大型数据库。它在许多环境中可用,既可以作为Python库,也可以通过用户界面访问。

更新时间: 2025-11-03 14:30:33

领域: cs.LG

下载: http://arxiv.org/abs/2508.20519v3

Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers

Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLMs) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate their interpretation against linguistic errors. A purpose-built corpus of authentic errors of native Spanish (500+) will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.

Updated: 2025-11-03 14:22:43

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.01615v1

Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds

In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose AnyMDP, a collection of procedurally generated tabular Markov Decision Processes. Through a carefully designed randomization process, AnyMDP can generate high-quality tasks at large scale while maintaining relatively low structural bias. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and the induction of prior information into the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks not seen during training through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and of prioritizing asymptotic performance over few-shot adaptation.
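
The abstract does not spell out the generation procedure; as a rough illustration, a hypothetical tabular-MDP sampler (Dirichlet-distributed transitions, Gaussian rewards; all names and parameters here are assumptions, not the AnyMDP recipe) paired with value iteration to produce ground-truth targets might look like:

```python
import numpy as np

def sample_tabular_mdp(n_states=8, n_actions=3, seed=0, concentration=0.3):
    """Sample one random tabular MDP. P[s, a] is a categorical distribution
    over next states drawn from a Dirichlet prior; a low concentration gives
    near-deterministic, more structured dynamics. (Hypothetical generator.)"""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.full(n_states, concentration),
                      size=(n_states, n_actions))      # (S, A, S)
    R = rng.normal(0.0, 1.0, size=(n_states, n_actions))
    return P, R

def value_iteration(P, R, gamma=0.95, iters=500):
    """Optimal state values, usable as a supervision signal when
    meta-training an in-context learner on generated tasks."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * P @ V, axis=1)
    return V

P, R = sample_tabular_mdp()
V = value_iteration(P, R)
```

Because every task is just a seed, millions of distinct MDPs can be streamed during meta-training without storing them.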

Updated: 2025-11-03 14:21:27

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.02869v4

DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning

Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types, including single- and multi-channel images. Experimental results on diverse datasets show that DINO-MX achieves competitive performance while significantly reducing computational costs. Additionally, it offers interpretability tools and a label-guided data augmentation method that improves attention-based localization without the need for extra detection or segmentation heads. DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across a range of research and real-world applications.

Updated: 2025-11-03 14:10:43

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.01610v1

Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies

Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple of seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at https://sites.google.com/view/inference-strategies-rl.
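
As a minimal sketch of what an execution-time inference strategy can look like, here is a hypothetical best-of-N attempt loop on a toy random-walk task (the environment, policy, and budget are illustrative assumptions, not the paper's setup or strategies):

```python
import random

def rollout(policy, env_reset, env_step, horizon=20, rng=None):
    """Run one stochastic episode; return (total_return, action_trace)."""
    state = env_reset()
    total, trace = 0.0, []
    for _ in range(horizon):
        action = policy(state, rng)
        state, reward, done = env_step(state, action)
        total += reward
        trace.append(action)
        if done:
            break
    return total, trace

def best_of_n(policy, env_reset, env_step, n_attempts=16, seed=0):
    """Simplest inference strategy: spend a small extra compute budget on
    repeated stochastic attempts and output only the best one."""
    rng = random.Random(seed)
    attempts = [rollout(policy, env_reset, env_step, rng=rng)
                for _ in range(n_attempts)]
    return max(attempts, key=lambda t: t[0])

# Toy task (hypothetical): random walk on a line, reward 1 for reaching +3.
def env_reset():
    return 0

def env_step(state, action):
    state += action
    return state, float(state == 3), state == 3

def policy(state, rng):
    return rng.choice([-1, 1])

best_return, _ = best_of_n(policy, env_reset, env_step)
single_return, _ = best_of_n(policy, env_reset, env_step, n_attempts=1)
```

With a fixed seed, the N-attempt return can never be worse than the single zero-shot attempt, which is the mechanism behind the ceiling-breaking result.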

Updated: 2025-11-03 14:09:54

Categories: cs.LG,cs.AI,cs.MA

Download: http://arxiv.org/abs/2505.21236v2

RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning

Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained by supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control (such as Push-T and Agile Bowling), fluid and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100% success across evaluated trials, 900 out of 900 episodes in total, including up to 250 out of 250 consecutive trials on one task. The method achieves time efficiency near or better than human teleoperation and demonstrates multi-hour robustness, with uninterrupted operation lasting up to two hours.
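
The OPE gate of the second stage can be sketched in a much simpler setting. Below is a hypothetical one-step (bandit-style) version using an importance-sampling estimator; the actual system gates PPO-style updates inside a diffusion policy's denoising process, so everything here is an illustrative assumption:

```python
import numpy as np

def is_estimate(actions, rewards, pi, beta):
    """Importance-sampling value estimate of a policy from logged data:
    E[(pi(a) / beta(a)) * r], with pi and beta as action-probability vectors."""
    w = pi[actions] / beta[actions]
    return float(np.mean(w * rewards))

def gated_update(current, candidate, actions, rewards, beta, margin=0.0):
    """Accept the candidate policy only if its off-policy estimate does not
    regress: a conservative gate in the spirit of RL-100's OPE-gated
    offline improvement loop."""
    v_cand = is_estimate(actions, rewards, candidate, beta)
    v_cur = is_estimate(actions, rewards, current, beta)
    return candidate if v_cand >= v_cur + margin else current

# Logged data from a uniform behavior policy over 3 actions (hypothetical).
actions = np.array([0, 1, 2, 0, 1, 2])
rewards = (actions == 2).astype(float)        # only action 2 pays off
beta = np.full(3, 1 / 3)
current = np.full(3, 1 / 3)
candidate = np.array([0.0, 0.0, 1.0])         # always picks the good action
chosen = gated_update(current, candidate, actions, rewards, beta)
```

The gate is what makes each offline iteration "conservative": a candidate that the estimator cannot certify is simply discarded.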

Updated: 2025-11-03 14:09:53

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.14830v2

Evaluation of compliance with democratic and technical standards of i-voting in elections to academic senates in Czech higher education

The shift towards increased remote work and digital communication, driven by recent global developments, has led to the widespread adoption of i-voting systems, including in academic institutions. This paper critically evaluates the use of i-voting platforms for elections to academic senates at Czech public universities, focusing on the democratic and technical challenges they present. A total of 18 out of 26 Czech public universities have implemented remote electronic voting for these elections. Yet, the systems often lack the necessary transparency, raising significant concerns regarding their adherence to democratic norms, such as election security, voter privacy, and the integrity of the process. Through interviews with system developers and administrators, along with a survey of potential voters, the study underscores the critical need for transparency. Without it, a comprehensive assessment of the technical standards and the overall legitimacy of the i-voting systems remains unattainable, potentially undermining the credibility of the electoral outcomes.

Updated: 2025-11-03 14:00:47

Categories: cs.CY,cs.CR,physics.soc-ph

Download: http://arxiv.org/abs/2511.01598v1

Federated Cyber Defense: Privacy-Preserving Ransomware Detection Across Distributed Systems

Detecting malware, especially ransomware, is essential to securing today's interconnected ecosystems, including cloud storage, enterprise file-sharing, and database services. Training high-performing artificial intelligence (AI) detectors requires diverse datasets, which are often distributed across multiple organizations, making centralization necessary. However, centralized learning is often impractical due to security, privacy regulations, data ownership issues, and legal barriers to cross-organizational sharing. Compounding this challenge, ransomware evolves rapidly, demanding models that are both robust and adaptable. In this paper, we evaluate Federated Learning (FL) using the Sherpa.ai FL platform, which enables multiple organizations to collaboratively train a ransomware detection model while keeping raw data local and secure. This paradigm is particularly relevant for cybersecurity companies (including both software and hardware vendors) that deploy ransomware detection or firewall systems across millions of endpoints. In such environments, data cannot be transferred outside the customer's device due to strict security, privacy, or regulatory constraints. Although FL applies broadly to malware threats, we validate the approach using the Ransomware Storage Access Patterns (RanSAP) dataset. Our experiments demonstrate that FL improves ransomware detection accuracy by a relative 9% over server-local models and achieves performance comparable to centralized training. These results indicate that FL offers a scalable, high-performing, and privacy-preserving framework for proactive ransomware detection across organizational and regulatory boundaries.
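
As a rough sketch of the federated setup (not the Sherpa.ai platform's API), a minimal FedAvg loop over two hypothetical clients with local logistic-regression training might look like:

```python
import numpy as np

def local_train(w, X, y, lr=0.5, epochs=50):
    """One client's local update: logistic-regression gradient descent.
    Raw (X, y) never leaves the client; only weights are shared."""
    w = w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def fed_avg(w, clients, rounds=10):
    """FedAvg: each round, clients train locally and the server averages
    the returned weights, weighted by local dataset size."""
    for _ in range(rounds):
        updates = [(local_train(w, X, y), len(y)) for X, y in clients]
        total = sum(n for _, n in updates)
        w = sum(wi * (n / total) for wi, n in updates)
    return w

# Two hypothetical clients holding disjoint slices of a separable problem
# (stand-ins for per-organization ransomware access-pattern features).
c1 = (np.array([[1.0], [2.0]]), np.array([1.0, 1.0]))
c2 = (np.array([[-1.0], [-2.0]]), np.array([0.0, 0.0]))
w = fed_avg(np.zeros(1), [c1, c2])
```

Neither client could separate the classes from its own labels alone, yet the averaged model classifies both slices, which is the effect the paper quantifies on RanSAP.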

Updated: 2025-11-03 13:54:13

Categories: cs.CR,cs.AI,cs.DC,cs.LG

Download: http://arxiv.org/abs/2511.01583v1

Augmenting learning in neuro-embodied systems through neurobiological first principles

Recent progress in artificial intelligence (AI) has been driven by insights from physics and neuroscience, particularly through the development of artificial neural networks (ANNs) capable of complex cognitive tasks such as vision and language processing. Despite these advances, ANNs struggle with continual learning, adaptable knowledge transfer, robustness, and resource efficiency -- capabilities that biological systems handle seamlessly. Specifically, neuromorphic systems and artificial neural networks often overlook two key biophysical properties of neural circuits: neuronal diversity and cell-specific neuromodulation. These mechanisms, essential for regulating dynamic learning across brain scales, allow neuromodulators to introduce degeneracy in biological neural networks, ensuring stability and adaptability under changing conditions. In this article, we summarize recent bioinspired models, learning rules, and architectures, and propose a framework for augmenting ANNs, which has the potential to bridge the gap between neuroscience and AI through neurobiological first principles. Our proposed dual-framework approach leverages spiking neural networks to emulate diverse spiking behaviors and dendritic compartmental dynamics, thereby simulating the morphological and functional diversity of neuronal computations. Finally, we outline how integrating these biophysical principles into task-driven spiking neural networks and neuromorphic systems provides scalable solutions for continual learning, adaptability, robustness, and resource efficiency. Additionally, this approach will not only provide insights into how emergent behaviors arise in neural networks but also catalyze the development of more efficient, reliable, and intelligent neuromorphic systems and robotic agents.

Updated: 2025-11-03 13:54:00

Categories: q-bio.NC,cs.AI,cs.LG,92B20

Download: http://arxiv.org/abs/2407.04525v5

ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks

Large language models suffer from knowledge staleness and lack of interpretability due to implicit knowledge storage across entangled network parameters, preventing targeted updates and reasoning transparency. We propose ExplicitLM, a novel architecture featuring a million-scale external memory bank storing human-readable knowledge as token sequences, enabling direct inspection and modification. We design a differentiable two-stage retrieval mechanism with efficient coarse-grained filtering via product key decomposition (reducing complexity from $\mathcal{O}(N \cdot |I|)$ to $\mathcal{O}(\sqrt{N} \cdot |I|)$) and fine-grained Gumbel-Softmax matching for end-to-end training. Inspired by dual-system cognitive theory, we partition knowledge into frozen explicit facts (20%) and learnable implicit patterns (80%), maintained through Exponential Moving Average updates for stability. ExplicitLM achieves up to 43.67% improvement on knowledge-intensive tasks versus standard Transformers, with 3.62$\times$ gains in low-data regimes (10k samples). Analysis shows strong correlations between memory retrieval and performance, with correct predictions achieving 49% higher hit rates. Unlike RAG systems with frozen retrieval, our jointly optimized architecture demonstrates that interpretable, updatable models can maintain competitive performance while providing unprecedented knowledge transparency.
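
The product-key trick behind the coarse-grained filtering stage can be sketched independently of the model. Assuming keys decompose as concatenations of two sub-key tables (a standard product-key-memory construction; sizes and names here are illustrative), retrieval only ever scores 2*sqrt(N) sub-keys plus k*k candidate pairs:

```python
import numpy as np

def product_key_topk(query, sub_a, sub_b, k=4):
    """Product-key retrieval: key (i, j) is the concatenation of sub-key i
    from table A and sub-key j from table B, so its score decomposes as
    sa[i] + sb[j]. Scoring the halves and combining only the top-k of each
    is exact for the top-k and avoids touching all N = |A|*|B| keys."""
    qa, qb = np.split(query, 2)
    sa, sb = sub_a @ qa, sub_b @ qb
    ta = np.argsort(sa)[-k:]
    tb = np.argsort(sb)[-k:]
    cand = [(sa[i] + sb[j], i * len(sub_b) + j) for i in ta for j in tb]
    cand.sort(reverse=True)
    return cand[:k]                       # (score, flat_key_index) pairs

rng = np.random.default_rng(0)
sub_a = rng.normal(size=(16, 4))          # 16 x 16 = 256 virtual keys, dim 8
sub_b = rng.normal(size=(16, 4))
q = rng.normal(size=8)
top = product_key_topk(q, sub_a, sub_b)

# Brute-force scores over all 256 composed keys, for comparison:
full = (sub_a @ q[:4])[:, None] + (sub_b @ q[4:])[None, :]
```

Any pair in the global top-k must have each half in its table's top-k, which is why the pruned search loses nothing while cutting the complexity roughly from O(N) to O(sqrt(N)) per query, as in the abstract.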

Updated: 2025-11-03 13:53:19

Categories: cs.AI

Download: http://arxiv.org/abs/2511.01581v1

Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning

Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.

Updated: 2025-11-03 13:50:12

Categories: cs.AI,cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.21406v2

What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)

While looped transformers (termed Looped-Attn) often outperform standard transformers (termed Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both the sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee better loss convergence along the river due to valley hopping, and further encourage learning of complex patterns, compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training process of Looped-Attn while achieving comparable performance.
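
Weight sharing across loop iterations is the core architectural difference. A minimal numpy sketch (a residual MLP standing in for a full attention block; all sizes are illustrative assumptions) shows how effective depth grows with no new parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# One shared block: a small residual MLP standing in for attention + MLP.
W1 = rng.normal(0, 0.1, (d, d))
W2 = rng.normal(0, 0.1, (d, d))

def block(x):
    # Residual connection keeps repeated application numerically stable.
    return x + np.tanh(x @ W1) @ W2

def looped_forward(x, n_loops=8):
    """Looped-Attn style recursion: the SAME parameters (W1, W2) are applied
    n_loops times, unlike a standard stack of n_loops distinct layers."""
    for _ in range(n_loops):
        x = block(x)
    return x

x = rng.normal(size=(4, d))          # batch of 4 token vectors
y = looped_forward(x)
```

The recursion is what the paper conjectures biases optimization toward the steep River-V-Valley geometry; SHIFT then trains such a model stage by stage, growing the loop count.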

Updated: 2025-11-03 13:48:58

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2510.10089v2

DuSEGO: Dual Second-order Equivariant Graph Ordinary Differential Equation

Graph Neural Networks (GNNs) with equivariant properties have achieved significant success in modeling complex dynamic systems and molecular properties. However, their expressiveness is limited by: (1) Existing methods often overlook the over-smoothing issue caused by traditional GNN models, as well as the gradient explosion or vanishing problems in deep GNNs. (2) Most models operate on first-order information, neglecting that the real world often consists of second-order systems, which further limits the model's representation capabilities. To address these issues, we propose the Dual Second-order Equivariant Graph Ordinary Differential Equation (DuSEGO) for equivariant representation. Specifically, DuSEGO applies dual second-order equivariant graph ordinary differential equations (Graph ODEs) on graph embeddings and node coordinates simultaneously. Theoretically, we first prove that DuSEGO maintains the equivariance property. Furthermore, we provide theoretical insights showing that DuSEGO effectively alleviates the over-smoothing problem in both feature representation and coordinate updates. Additionally, we demonstrate that DuSEGO mitigates the exploding and vanishing gradients problem, facilitating the training of deep multi-layer GNNs. Extensive experiments on benchmark datasets validate the superiority of DuSEGO over baselines.
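
A generic second-order graph ODE can be sketched as a coupled position/velocity system integrated with semi-implicit Euler; because the update is linear in the coordinates, rotation equivariance is easy to check numerically. This is an illustrative stand-in, not DuSEGO's actual equations:

```python
import numpy as np

def second_order_graph_ode(X, V, A, steps=2000, dt=0.01, k=1.0, damping=0.5):
    """Integrate dX/dt = V, dV/dt = -k * L @ X - damping * V on a graph,
    where L = D - A is the graph Laplacian, via semi-implicit Euler.
    (Generic second-order dynamics; DuSEGO additionally evolves feature
    embeddings and proves equivariance for its specific form.)"""
    L = np.diag(A.sum(axis=1)) - A
    for _ in range(steps):
        V = V + dt * (-k * (L @ X) - damping * V)
        X = X + dt * V
    return X, V

# Path graph on 3 nodes with 2-D coordinates.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X0 = np.array([[0., 0.], [1., 0.], [0., 2.]])
V0 = np.zeros_like(X0)
X1, _ = second_order_graph_ode(X0, V0, A)

# Linearity in X means the dynamics commute with any rotation R.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X1_rot, _ = second_order_graph_ode(X0 @ R.T, V0, A)
```

Rotating the input and integrating gives exactly the rotated output, which is the equivariance property; the damping term is what keeps the second-order system from blowing up over many integration steps.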

Updated: 2025-11-03 13:39:46

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2411.10000v3

HIT-ROCKET: Hadamard-vector Inner-product Transformer for ROCKET

Time series classification holds broad application value in communications, information countermeasures, finance, and medicine. However, state-of-the-art (SOTA) methods, including HIVE-COTE, Proximity Forest, and TS-CHIEF, exhibit high computational complexity, coupled with lengthy parameter tuning and training cycles. In contrast, lightweight solutions like ROCKET (Random Convolutional Kernel Transform) offer greater efficiency but leave substantial room for improvement in kernel selection and computational overhead. To address these challenges, we propose a feature extraction approach based on the Hadamard convolutional transform, utilizing column or row vectors of Hadamard matrices as convolution kernels, with extended lengths of varying sizes. This enhancement maintains full compatibility with existing methods (e.g., ROCKET) while leveraging kernel orthogonality to boost computational efficiency, robustness, and adaptability. Comprehensive experiments on multi-domain datasets, focusing on the UCR time series archive, demonstrate SOTA performance: F1-score improved by at least 5% vs. ROCKET, with 50% shorter training time than miniROCKET (the fastest ROCKET variant) under identical hyperparameters, enabling deployment on ultra-low-power embedded devices. All code is available on GitHub.
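
The core idea, Hadamard rows as fixed orthogonal convolution kernels feeding ROCKET-style pooled features, can be sketched as follows (the kernel count, the PPV/max feature pair, and the toy input are assumptions for illustration, not the paper's exact configuration):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2):
    H_{2m} = [[H_m, H_m], [H_m, -H_m]], rows mutually orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_features(series, n_kernels=8):
    """ROCKET-style features with Hadamard rows as +/-1 convolution kernels:
    per kernel, record PPV (proportion of positive values) and the max
    response. Orthogonal deterministic kernels replace random sampling."""
    feats = []
    for kernel in hadamard(n_kernels):
        conv = np.convolve(series, kernel, mode="valid")
        feats.extend([np.mean(conv > 0), conv.max()])
    return np.array(feats)

x = np.sin(np.linspace(0, 8 * np.pi, 128))
f = hadamard_features(x)
```

The resulting fixed-length feature vector would then feed a linear classifier (ridge or logistic), exactly as in the ROCKET family; the +/-1 kernels also admit multiply-free implementations on embedded hardware.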

Updated: 2025-11-03 13:39:40

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.01572v1

Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks

With the emergence of large model-based agents, widely adopted transformer-based architectures inevitably produce excessively long token embeddings for transmission, which may result in high bandwidth overhead, increased power consumption, and latency. In this letter, we propose a task-oriented multimodal token transmission scheme for efficient multimodal information fusion and utilization. To improve the efficiency of token transmission, we design a two-stage training algorithm, comprising cross-modal alignment and task-oriented fine-tuning, for large model-based token communication. Meanwhile, token compression is performed using a sliding-window pooling operation to save communication resources. To balance the trade-off between latency and model performance caused by compression, we formulate a weighted-sum optimization problem over latency and validation loss. We jointly optimize bandwidth, power allocation, and token length across users using an alternating optimization method. Simulation results demonstrate that the proposed algorithm outperforms the baseline under different bandwidth and power budgets. Moreover, the two-stage training algorithm achieves higher accuracy across various signal-to-noise ratios than the method without cross-modal alignment.
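
The sliding-window pooling step is straightforward to sketch: mean-pool each window of token embeddings before transmission (the window and stride values here are illustrative assumptions):

```python
import numpy as np

def sliding_window_pool(tokens, window=4, stride=4):
    """Compress a (seq_len, dim) token-embedding sequence by mean-pooling
    each window, cutting the number of transmitted embeddings by roughly
    a factor of stride. Non-overlapping windows (stride == window) give
    the largest saving; stride < window trades bandwidth for fidelity."""
    pooled = [tokens[i:i + window].mean(axis=0)
              for i in range(0, len(tokens) - window + 1, stride)]
    return np.stack(pooled)

tokens = np.random.default_rng(0).normal(size=(64, 32))
compressed = sliding_window_pool(tokens)   # (64, 32) -> (16, 32)
```

The compression ratio (here 4x) is exactly the "token length" knob that the weighted-sum problem trades off against bandwidth and power per user.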

Updated: 2025-11-03 13:36:27

Categories: cs.NI,cs.LG

Download: http://arxiv.org/abs/2505.07841v3

Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
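
A WSD schedule and the intrinsic-time quantity are easy to make concrete. The sketch below assumes linear warmup and linear decay, and takes intrinsic time to be the cumulative learning rate (an illustrative reading of the paper's notion; fractions and the peak rate are assumptions):

```python
import numpy as np

def wsd_schedule(total_steps, peak_lr=3e-4, warmup_frac=0.1, decay_frac=0.2):
    """Warmup-stable-decay (WSD) learning-rate schedule:
    linear warmup -> constant plateau -> linear decay to zero."""
    warmup = int(total_steps * warmup_frac)
    decay = int(total_steps * decay_frac)
    stable = total_steps - warmup - decay
    return np.concatenate([
        np.linspace(0.0, peak_lr, warmup, endpoint=False),
        np.full(stable, peak_lr),
        np.linspace(peak_lr, 0.0, decay),
    ])

lrs = wsd_schedule(1000)
# "Intrinsic time": cumulative learning rate, which tracks training
# progress more faithfully than the raw step count in the FSL analysis.
intrinsic_time = np.cumsum(lrs)
```

Under this view, two runs with different schedules are comparable at equal intrinsic time rather than equal step count, which is what lets a single functional law cover constant, exponential-decay, and WSD schedules.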

Updated: 2025-11-03 13:29:04

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2509.19189v3

Flat Channels to Infinity in Neural Loss Landscapes

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
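
The limiting identity can be checked numerically. One concrete channel parameterization (chosen here purely for illustration) is a_i = c + 1, a_j = -c, w_i = w + v/(c + 1), w_j = w, so the output weights diverge, the input weights merge, and the pair approaches the gated linear unit as c grows:

```python
import numpy as np

def pair_output(c, w, v, x):
    """Two-neuron output a_i*s(w_i.x) + a_j*s(w_j.x) along the channel,
    with a_i = c+1, a_j = -c, w_i = w + v/(c+1), w_j = w, s = tanh."""
    s = np.tanh
    return (c + 1) * s((w + v / (c + 1)) @ x) - c * s(w @ x)

def glu_limit(w, v, x):
    """Limiting gated linear unit: s(w.x) + (v.x) * s'(w.x), s = tanh."""
    z = w @ x
    return np.tanh(z) + (v @ x) * (1 - np.tanh(z) ** 2)

rng = np.random.default_rng(0)
w, v, x = rng.normal(size=(3, 5))
errs = [abs(pair_output(c, w, v, x) - glu_limit(w, v, x))
        for c in (1e2, 1e4, 1e6)]
```

A first-order Taylor expansion of tanh around w.x shows the gap shrinks like O(1/c), so along the channel the loss barely moves even though two weights run off to infinity, matching the "flat local minimum" appearance described above.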

Updated: 2025-11-03 13:24:24

Categories: cs.LG,cs.AI,cs.NE

Download: http://arxiv.org/abs/2506.14951v2

Black-Box Membership Inference Attack for LVLMs via Prior Knowledge-Calibrated Memory Probing

Large vision-language models (LVLMs) derive their capabilities from extensive training on vast corpora of visual and textual data. Empowered by large-scale parameters, these models often exhibit strong memorization of their training data, rendering them susceptible to membership inference attacks (MIAs). Existing MIA methods for LVLMs typically operate under white- or gray-box assumptions, by extracting likelihood-based features for the suspected data samples based on the target LVLMs. However, mainstream LVLMs generally only expose generated outputs while concealing internal computational features during inference, limiting the applicability of these methods. In this work, we propose the first black-box MIA framework for LVLMs, based on a prior knowledge-calibrated memory probing mechanism. The core idea is to assess the model memorization of the private semantic information embedded within the suspected image data, which is unlikely to be inferred from general world knowledge alone. We conducted extensive experiments across four LVLMs and three datasets. Empirical results demonstrate that our method effectively identifies training data of LVLMs in a purely black-box setting and even achieves performance comparable to gray-box and white-box methods. Further analysis reveals the robustness of our method against potential adversarial manipulations, and the effectiveness of the methodology designs. Our code and data are available at https://github.com/spmede/KCMP.

Updated: 2025-11-03 13:16:30

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.01952v1

Real-time Continual Learning on Intel Loihi 2

AI systems on edge devices face a critical challenge in open-world environments: adapting when data distributions shift and novel classes emerge. While offline training dominates current paradigms, online continual learning (OCL)--where models learn incrementally from non-stationary streams without catastrophic forgetting--remains challenging in power-constrained settings. We present a neuromorphic solution called CLP-SNN: a spiking neural network architecture for Continually Learning Prototypes and its implementation on Intel's Loihi 2 chip. Our approach introduces three innovations: (1) event-driven and spatiotemporally sparse local learning, (2) a self-normalizing three-factor learning rule maintaining weight normalization, and (3) integrated neurogenesis and metaplasticity for capacity expansion and forgetting mitigation. On OpenLORIS few-shot learning experiments, CLP-SNN achieves accuracy competitive with replay methods while being rehearsal-free. CLP-SNN delivers transformative efficiency gains: 70x faster (0.33ms vs 23.2ms) and 5,600x more energy efficient (0.05mJ vs 281mJ) than the best alternative OCL method on an edge GPU. This demonstrates that co-designed brain-inspired algorithms and neuromorphic hardware can break traditional accuracy-efficiency trade-offs for future edge AI systems.

Updated: 2025-11-03 13:16:16

Categories: cs.LG,cs.AI,cs.DC,cs.NE

Download: http://arxiv.org/abs/2511.01553v1

Analyzing Sustainability Messaging in Large-Scale Corporate Social Media

In this work, we introduce a multimodal analysis pipeline that leverages large foundation models in vision and language to analyze corporate social media content, with a focus on sustainability-related communication. Addressing the challenges of evolving, multimodal, and often ambiguous corporate messaging on platforms such as X (formerly Twitter), we employ an ensemble of large language models (LLMs) to annotate a large corpus of corporate tweets on their topical alignment with the 17 Sustainable Development Goals (SDGs). This approach avoids the need for costly, task-specific annotations and explores the potential of such models as ad-hoc annotators for social media data that can efficiently capture both explicit and implicit references to sustainability themes in a scalable manner. Complementing this textual analysis, we utilize vision-language models (VLMs) within a visual understanding framework that uses semantic clusters to uncover patterns in visual sustainability communication. This integrated approach reveals sectoral differences in SDG engagement, temporal trends, and associations between corporate messaging, environmental, social, and governance (ESG) risks, and consumer engagement. Our methods (automatic label generation and semantic visual clustering) are broadly applicable to other domains and offer a flexible framework for large-scale social media analysis.

Updated: 2025-11-03 13:14:17

Categories: cs.AI

Download: http://arxiv.org/abs/2511.01550v1

GroupSHAP-Guided Integration of Financial News Keywords and Technical Indicators for Stock Price Prediction

Recent advances in finance-specific language models such as FinBERT have enabled the quantification of public sentiment into index-based measures, yet compressing diverse linguistic signals into single metrics overlooks contextual nuances and limits interpretability. To address this limitation, explainable AI techniques, particularly SHAP (SHapley Additive Explanations), have been employed to identify influential features. However, SHAP's computational cost grows exponentially with input features, making it impractical for large-scale text-based financial data. This study introduces a GRU-based forecasting framework enhanced with GroupSHAP, which quantifies contributions of semantically related keyword groups rather than individual tokens, substantially reducing computational burden while preserving interpretability. We employed FinBERT to embed news articles from 2015 to 2024, clustered them into coherent semantic groups, and applied GroupSHAP to measure each group's contribution to stock price movements. The resulting group-level SHAP variables across multiple topics were used as input features for the prediction model. Empirical results from one-day-ahead forecasting of the S&P 500 index throughout 2024 demonstrate that our approach achieves a 32.2% reduction in MAE and a 40.5% reduction in RMSE compared with benchmark models without the GroupSHAP mechanism. This research presents the first application of GroupSHAP in news-driven financial forecasting, showing that grouped sentiment representations simultaneously enhance interpretability and predictive performance.
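Because SHAP attributions are additive, the grouping step the abstract describes can be as simple as summing per-token values within each semantic cluster, shrinking the feature space while preserving the additive decomposition. A minimal sketch; the group names and values are hypothetical, and the paper clusters FinBERT embeddings rather than using a hand-made mapping.

```python
from collections import defaultdict

def group_shap(token_shap, token_to_group):
    """Sum per-token SHAP attributions into group-level features.

    SHAP values are additive, so summing within a group preserves
    the additive decomposition of the model output.
    """
    grouped = defaultdict(float)
    for token, value in token_shap.items():
        grouped[token_to_group.get(token, "other")] += value
    return dict(grouped)

# Hypothetical per-token attributions for one trading day.
token_shap = {"rate": 0.12, "hike": 0.08, "earnings": -0.05, "beat": 0.03}
token_to_group = {"rate": "monetary_policy", "hike": "monetary_policy",
                  "earnings": "corporate_results", "beat": "corporate_results"}
print(group_shap(token_shap, token_to_group))
```

The resulting group-level values would then serve as input features for the GRU forecaster alongside technical indicators.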

Updated: 2025-11-03 13:06:41

Categories: cs.CE,cs.AI

Download: http://arxiv.org/abs/2510.23112v3

Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks

The growing luminosity frontier at the Large Hadron Collider is challenging the reconstruction and analysis of particle collision events. Increased particle multiplicities are straining latency and storage requirements at the data acquisition stage, while new complications are emerging, including higher background levels and more frequent particle vertex misassociations. This in turn necessitates the development of more holistic and scalable reconstruction methods that take advantage of recent advances in machine learning. We propose a novel Heterogeneous Graph Neural Network (HGNN) architecture featuring unique representations for diverse particle collision relationships and integrated graph pruning layers for scalability. Trained with a multi-task paradigm in an environment mimicking the LHCb experiment, this HGNN significantly improves beauty hadron reconstruction performance. Notably, it concurrently performs particle vertex association and graph pruning within a single framework. We quantify reconstruction and pruning performance, demonstrate enhanced inference time scaling with event complexity, and mitigate potential performance loss using a weighted message passing scheme.

Updated: 2025-11-03 13:06:07

Categories: physics.data-an,cs.LG,hep-ex

Download: http://arxiv.org/abs/2504.21844v3

Driving scenario generation and evaluation using a structured layer representation and foundational models

Rare and challenging driving scenarios are critical for autonomous vehicle development. Since they are difficult to encounter, simulating or generating them using generative models is a popular approach. Following previous efforts to structure driving scenario representations in a layer model, we propose a structured five-layer model to improve the evaluation and generation of rare scenarios. We use this model alongside large foundational models to generate new driving scenarios using a data augmentation strategy. Unlike previous representations, our structure introduces subclasses and characteristics for every agent of the scenario, allowing us to compare them using an embedding specific to our layer model. We study and adapt two metrics to evaluate the relevance of a synthetic dataset in the context of a structured representation: the diversity score estimates how different the scenarios of a dataset are from one another, while the originality score calculates how similar a synthetic dataset is to a real reference set. This paper showcases both metrics in different generation setups, as well as a qualitative evaluation of synthetic videos generated from structured scenario descriptions. The code and extended results can be found at https://github.com/Valgiz/5LMSG.
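The two scores can be sketched with embedding distances: diversity as the mean pairwise distance within a set, originality as the mean distance from each synthetic scenario to its nearest real neighbour. These exact formulas are our assumption for illustration; the paper's definitions over its layer-model embedding may differ.

```python
import numpy as np

def diversity(X):
    # Mean pairwise Euclidean distance between scenario embeddings.
    X = np.asarray(X, float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    return d.sum() / (n * (n - 1))  # average over ordered pairs, i != j

def originality(synthetic, reference):
    # Mean distance from each synthetic scenario to its nearest real one.
    S = np.asarray(synthetic, float)
    R = np.asarray(reference, float)
    d = np.linalg.norm(S[:, None, :] - R[None, :, :], axis=-1)
    return d.min(axis=1).mean()

real = [[0, 0], [1, 0], [0, 1]]     # toy 2-D "embeddings"
copies = [[0, 0], [1, 0]]           # synthetic set that clones real data
novel = [[5, 5], [-4, 3]]           # synthetic set far from the reference
print(diversity(real), originality(copies, real), originality(novel, real))
```

A cloned synthetic set scores zero originality while a distant one scores high, matching the intended reading of the metric.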

Updated: 2025-11-03 13:04:55

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.01541v1

Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models

We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we observe preliminary correlations between internal fragility and miscalibration or hallucination (confabulation), and find that lightweight critique prompts may modestly improve or worsen calibration in small-scale tests. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens to diagnose and potentially mitigate overconfidence in reasoning systems.
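The abstract names the ingredients of H-Risk without giving a formula. For a linear system $x_{t+1} = A x_t$ with filter gain $K$, one illustrative composite is sketched below; the specific terms, their combination, and the weights are our assumptions, not the paper's definition.

```python
import numpy as np

def h_risk(A, K, weights=(1.0, 1.0, 1.0)):
    """Illustrative composite instability index for x_{t+1} = A x_t
    with filter gain K, combining the ingredients named in the
    abstract (spectral margin, conditioning, innovation
    amplification). The combination and weights are assumptions.
    """
    rho = max(abs(np.linalg.eigvals(A)))        # spectral radius
    margin_term = 1.0 / max(1.0 - rho, 1e-12)   # blows up near instability
    cond_term = np.linalg.cond(A)               # conditioning of the dynamics
    innov_term = np.linalg.norm(K, 2)           # innovation amplification
    w1, w2, w3 = weights
    return (w1 * np.log(margin_term)
            + w2 * np.log(cond_term)
            + w3 * np.log1p(innov_term))

stable = np.diag([0.5, 0.4])                      # well inside the unit disk
fragile = np.array([[0.99, 1.0], [0.0, 0.98]])    # near-unit spectral radius
K = np.eye(2) * 0.1
print(h_risk(stable, K), h_risk(fragile, K))
```

The index is higher for the nominally stable but nearly-unstable system, which is the qualitative behaviour the abstract attributes to H-Risk.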

Updated: 2025-11-03 12:53:06

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.14925v2

Preliminary study on artificial intelligence methods for cybersecurity threat detection in computer networks based on raw data packets

Most of the intrusion detection methods in computer networks are based on traffic flow characteristics. However, this approach may not fully exploit the potential of deep learning algorithms to directly extract features and patterns from raw packets. Moreover, it impedes real-time monitoring due to the necessity of waiting for the processing pipeline to complete and introduces dependencies on additional software components. In this paper, we investigate deep learning methodologies capable of detecting attacks in real-time directly from raw packet data within network traffic. We propose a novel approach where packets are stacked into windows and separately recognised, with a 2D image representation suitable for processing with computer vision models. Our investigation utilizes the CIC IDS-2017 dataset, which includes both benign traffic and prevalent real-world attacks, providing a comprehensive foundation for our research.
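The windowing idea, stacking raw packet bytes into a 2D image that a vision model can consume, can be sketched as follows. The window geometry (16 packets by 64 bytes) is an illustrative choice, not necessarily the paper's configuration.

```python
import numpy as np

def packets_to_window(packets, width=64, height=16):
    """Stack raw packets into a fixed-size 2D byte "image".

    Each packet becomes one row, truncated or zero-padded to `width`
    bytes; the window holds at most `height` packets. Shapes are
    illustrative; the paper's exact geometry may differ.
    """
    window = np.zeros((height, width), dtype=np.uint8)
    for row, pkt in enumerate(packets[:height]):
        data = np.frombuffer(pkt[:width], dtype=np.uint8)
        window[row, : len(data)] = data
    return window

pkts = [bytes(range(10)), b"\xff" * 100]  # toy "packets"
img = packets_to_window(pkts)
print(img.shape)
```

The resulting uint8 array can be fed to a standard 2D CNN with no flow-level feature extraction in between, which is what enables the per-window, real-time recognition the abstract describes.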

Updated: 2025-11-03 12:51:30

Categories: cs.CV,cs.AI,cs.CR,I.5.4; C.2.0; I.2.1

Download: http://arxiv.org/abs/2407.17339v2

How Similar Are Grokipedia and Wikipedia? A Multi-Dimensional Textual and Structural Comparison

The launch of Grokipedia, an AI-generated encyclopedia developed by Elon Musk's xAI, was presented as a response to perceived ideological and structural biases in Wikipedia, aiming to produce "truthful" entries via the large language model Grok. Yet whether an AI-driven alternative can escape the biases and limitations of human-edited platforms remains unclear. This study undertakes a large-scale computational comparison of 1,800 matched article pairs between Grokipedia and Wikipedia, drawn from the 2,000 most-edited Wikipedia pages. Using metrics across lexical richness, readability, structural organization, reference density, and semantic similarity, we assess how closely the two platforms align in form and substance. The results show that while Grokipedia exhibits strong semantic and stylistic alignment with Wikipedia, it typically produces longer but less lexically diverse articles, with fewer references per word and greater structural variability. These findings suggest that AI-generated encyclopedic content currently mirrors Wikipedia's informational scope but diverges in editorial norms, favoring narrative expansion over citation-based verification. The implications highlight new tensions around transparency, provenance, and the governance of knowledge in an era of automated text generation.
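Two of the compared dimensions, lexical richness and reference density, have simple textbook proxies: type-token ratio and references per word. The toy metrics below are our stand-ins for illustration; the study's exact metric definitions may differ.

```python
import re

def lexical_metrics(text, n_refs):
    """Toy proxies for two comparison dimensions: type-token ratio
    (lexical diversity) and references per word. Illustrative
    stand-ins, not the study's exact definitions.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    ttr = len(set(tokens)) / len(tokens)
    refs_per_word = n_refs / len(tokens)
    return ttr, refs_per_word

short = "the cat sat on the mat"
print(lexical_metrics(short, n_refs=2))
```

Under such proxies, the abstract's finding reads as: Grokipedia articles tend to have lower TTR (longer but more repetitive text) and lower refs-per-word than their Wikipedia counterparts.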

Updated: 2025-11-03 12:50:56

Categories: cs.CY,cs.AI,cs.SI

Download: http://arxiv.org/abs/2510.26899v2

Learning Complementary Policies for Human-AI Teams

This paper tackles the critical challenge of human-AI complementarity in decision-making. Departing from the traditional focus on algorithmic performance in favor of performance of the human-AI team, and moving past the framing of collaboration as classification to focus on decision-making tasks, we introduce a novel approach to policy learning. Specifically, we develop a robust solution for human-AI collaboration when outcomes are only observed under assigned actions. We propose a deferral collaboration approach that maximizes decision rewards by exploiting the distinct strengths of humans and AI, strategically allocating instances among them. Critically, our method is robust to misspecifications in both the human behavior and reward models. Leveraging the insight that performance gains stem from divergent human and AI behavioral patterns, we demonstrate, using synthetic and real human responses, that our proposed method significantly outperforms independent human and algorithmic decision-making. Moreover, we show that substantial performance improvements are achievable by routing only a small fraction of instances to human decision-makers, highlighting the potential for efficient and effective human-AI collaboration in complex management settings.
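The allocation idea, deferring an instance to the human only where the human's estimated reward exceeds the AI's, can be sketched greedily under a routing budget. The reward estimates below are hypothetical placeholders; the paper learns them robustly from outcomes observed under assigned actions.

```python
def deferral_policy(ai_rewards, human_rewards, budget):
    """Route each instance to "ai" or "human" to maximize total
    estimated reward, deferring at most `budget` instances.
    A greedy sketch of the allocation idea, not the paper's estimator.
    """
    order = sorted(range(len(ai_rewards)),
                   key=lambda i: human_rewards[i] - ai_rewards[i],
                   reverse=True)
    # Defer the instances with the largest positive human-over-AI gain.
    defer = {i for i in order[:budget] if human_rewards[i] > ai_rewards[i]}
    return ["human" if i in defer else "ai" for i in range(len(ai_rewards))]

# Hypothetical estimated rewards on four instances.
print(deferral_policy([0.9, 0.2, 0.8, 0.4], [0.6, 0.7, 0.9, 0.3], budget=1))
```

Even with a budget of one deferral out of four, the policy captures the single instance where the human is clearly stronger, mirroring the abstract's point that routing a small fraction of instances to humans yields most of the gain.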

Updated: 2025-11-03 12:49:17

Categories: cs.AI,cs.HC

Download: http://arxiv.org/abs/2302.02944v2

Sharp Lower Bounds for Linearized ReLU^k Approximation on the Sphere

We prove a saturation theorem for linearized shallow ReLU$^k$ neural networks on the unit sphere $\mathbb S^d$. For any antipodally quasi-uniform set of centers, if the target function has smoothness $r>\tfrac{d+2k+1}{2}$, then the best $\mathcal{L}^2(\mathbb S^d)$ approximation cannot converge faster than order $n^{-\frac{d+2k+1}{2d}}$. This lower bound matches existing upper bounds, thereby establishing the exact saturation order $\tfrac{d+2k+1}{2d}$ for such networks. Our results place linearized neural-network approximation firmly within the classical saturation framework and show that, although ReLU$^k$ networks outperform finite elements under equal degrees $k$, this advantage is intrinsically limited.

Updated: 2025-11-03 12:49:13

Categories: math.NA,cs.LG,cs.NA

Download: http://arxiv.org/abs/2510.04060v2

TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks

Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of model context protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, calendar checking, etc., and each subtask can be completed by a basic tool. Our evaluation emphasizes both task completion rate and efficiency. The empirical studies on popular closed-source and open-source LLMs indicate that most models can perform reasonable tool planning, but differ in scheduling. For example, GLM-4.5 achieves the highest task completion rate of 64.72% through extensive sequential tool calls, and hence suffers from significantly long execution times. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Considering that reinforcement learning (RL) can be a viable way to improve scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and witness a 14% reduction in execution time alongside a 6% gain in task completion rate using merely 100 RL training samples. Our code is available at https://github.com/hanwenxu1/mcp-agent.
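The sequential-versus-parallel trade-off the benchmark measures can be sketched with a tiny dependency graph: independent tool calls can overlap, so the parallel makespan is governed by the longest dependency chain rather than the sum of durations. The task names, timings, and dependencies below are hypothetical.

```python
def sequential_time(durations):
    # Total wall-clock time if every tool call runs one after another.
    return sum(durations.values())

def parallel_makespan(durations, deps):
    """Earliest finish time when independent tool calls run in
    parallel (unbounded workers). `deps` maps a subtask to its
    prerequisites. Illustrative model, not the benchmark's simulator.
    """
    finish = {}
    def f(task):
        if task not in finish:
            start = max((f(d) for d in deps.get(task, [])), default=0.0)
            finish[task] = start + durations[task]
        return finish[task]
    return max(f(t) for t in durations)

# Hypothetical compounding task: three independent lookups feeding a summary.
durations = {"web_search": 3.0, "map_route": 2.0, "calendar": 1.0, "summarize": 1.5}
deps = {"summarize": ["web_search", "map_route", "calendar"]}
print(sequential_time(durations), parallel_makespan(durations, deps))
```

Here parallel scheduling cuts 7.5 time units down to 4.5, which is the kind of gap separating the sequential-heavy and parallel-heavy models in the abstract.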

Updated: 2025-11-03 12:45:39

Categories: cs.AI

Download: http://arxiv.org/abs/2511.01527v1

New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models

Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LLäMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing a context extension to 8,192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERT Large (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. All in all, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate. When a pre-trained decoder exists and compute is limited, conversion offers an effective alternative. ModernGBERT and LLäMmleinVec, including all code, data and intermediary checkpoints, are published under a research-only RAIL license.

Updated: 2025-11-03 12:45:10

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2505.13136v2

JudgeLRM: Large Reasoning Models as a Judge

Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through the analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM consistently outperforms SFT-tuned baselines of the same size, as well as other RL and SFT variants, and even surpasses state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceeds GPT-4, while JudgeLRM-7B/8B/14B outperforms DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.

Updated: 2025-11-03 12:43:56

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.00050v3

Diversity-Aware Policy Optimization for Large Language Model Reasoning

The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between solution diversity and Potential at k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity measure and reformulate it into a practical objective, which we then selectively apply to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5% average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.

Updated: 2025-11-03 12:40:16

Categories: cs.LG

Download: http://arxiv.org/abs/2505.23433v2

Graph Neural Networks for Electricity Load Forecasting

Forecasting electricity demand is increasingly challenging as energy systems become more decentralized and intertwined with renewable sources. Graph Neural Networks (GNNs) have recently emerged as a powerful paradigm to model spatial dependencies in load data while accommodating complex non-stationarities. This paper introduces a comprehensive framework that integrates graph-based forecasting with attention mechanisms and ensemble aggregation strategies to enhance both predictive accuracy and interpretability. Several GNN architectures -- including Graph Convolutional Networks, GraphSAGE, APPNP, and Graph Attention Networks -- are systematically evaluated on synthetic, regional (France), and fine-grained (UK) datasets. Empirical results demonstrate that graph-aware models consistently outperform conventional baselines such as Feed Forward Neural Networks and foundation models like TiREX. Furthermore, attention layers provide valuable insights into evolving spatial interactions driven by meteorological and seasonal dynamics. Ensemble aggregation, particularly through bottom-up expert combination, further improves robustness under heterogeneous data conditions. Overall, the study highlights the complementarity between structural modeling, interpretability, and robustness, and discusses the trade-offs between accuracy, model complexity, and transparency in graph-based electricity load forecasting.

Updated: 2025-11-03 12:38:23

Categories: cs.LG

Download: http://arxiv.org/abs/2507.03690v3

Shift-invariant transformations and almost liftings

We investigate shift-invariant transformations, also known as rotation-symmetric vectorial Boolean functions, on $n$ bits that are induced from Boolean functions on $k$ bits, for $k\leq n$. We consider such transformations that are not necessarily permutations, but are, in some sense, almost bijective, and study their cryptographic properties. In this context, we define an almost lifting as a Boolean function for which there is an upper bound on the number of collisions of its induced transformation that does not depend on $n$. We show that if a Boolean function with diameter $k$ is an almost lifting, then the maximum number of collisions of its induced transformation is $2^{k-1}$ for any $n$. Moreover, we search for functions in the class of almost liftings that have good cryptographic properties and for which the non-bijectivity does not cause major security weaknesses. These functions generalize the well-known map $\chi$ used in the Keccak hash function.
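The map $\chi$ mentioned above is itself a shift-invariant transformation induced from a $k=3$ local rule, $y_i = x_i \oplus (\lnot x_{i+1} \wedge x_{i+2})$, and it illustrates the collision behaviour under study: $\chi$ is a permutation for odd $n$ but not for even $n$, so its collision count is zero in one case and positive in the other. A small brute-force sketch:

```python
from itertools import product

def chi(bits):
    # Keccak's chi: each output bit is induced from a 3-bit local rule,
    # y_i = x_i XOR (NOT x_{i+1} AND x_{i+2}), indices cyclic.
    n = len(bits)
    return tuple(bits[i] ^ ((bits[(i + 1) % n] ^ 1) & bits[(i + 2) % n])
                 for i in range(n))

def count_collisions(f, n):
    # Collisions = number of inputs minus number of distinct outputs
    # over the full domain of 2^n bit strings.
    images = {f(x) for x in product((0, 1), repeat=n)}
    return 2 ** n - len(images)

print(count_collisions(chi, 5), count_collisions(chi, 4))
```

For $n=5$ the count is 0 (a bijection), while for $n=4$ it is positive; the paper's almost-lifting bound caps such counts at $2^{k-1}$ independently of $n$.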

Updated: 2025-11-03 12:35:28

领域: math.CO,cs.CR,cs.IT,math.IT,06E30 (primary), 94A60, 94D10, 68Q80 (secondary)

下载: http://arxiv.org/abs/2407.11931v3

Rough Path Signatures: Learning Neural RDEs for Portfolio Optimization

We tackle high-dimensional, path-dependent valuation and control and introduce a deep BSDE/2BSDE solver that couples truncated log-signatures with a neural rough differential equation (RDE) backbone. The architecture aligns stochastic analysis with sequence-to-path learning: a CVaR-tilted terminal objective targets left-tail risk, while an optional second-order (2BSDE) head supplies curvature estimates for risk-sensitive control. Under matched compute and parameter budgets, the method improves accuracy, tail fidelity, and training stability across Asian and barrier option pricing and portfolio control: at d=200 it achieves CVaR(0.99)=9.80% versus 12.00-13.10% for strong baselines, attains the lowest HJB residual (0.011), and yields the lowest RMSEs for Z and Gamma. Ablations over truncation depth, local windows, and tilt parameters confirm complementary gains from the sequence-to-path representation and the 2BSDE head. Taken together, the results highlight a bidirectional dialogue between stochastic analysis and modern deep learning: stochastic tools inform representations and objectives, while sequence-to-path models expand the class of solvable financial models at scale.

Updated: 2025-11-03 12:33:09

Categories: q-fin.MF,cs.LG

Download: http://arxiv.org/abs/2510.10728v3

BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification

Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.

Updated: 2025-11-03 12:26:04

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.01512v1

Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning

Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to 15.97% in accuracy. The code is available at https://github.com/yaxinhou/CPG.
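
Step (ii) of the cycle, the Bayes-optimal classifier via logit adjustment, amounts to shifting each logit by the log prior of its class under the current labeled distribution. A minimal sketch (the function name and toy numbers are ours):

```python
import numpy as np

def logit_adjusted_predict(logits, class_counts, tau=1.0):
    """Predict argmax_y f_y(x) - tau * log(pi_y), with pi the empirical class prior."""
    pi = np.asarray(class_counts, dtype=float)
    pi /= pi.sum()
    return np.argmax(logits - tau * np.log(pi), axis=-1)

# With a 90/9/1 long-tailed labeled set, the raw argmax picks the head class,
# but the prior-adjusted score flips the decision to the tail class.
logits = np.array([[2.0, 1.0, 1.9]])
counts = [90, 9, 1]
```

As the filtering mechanism admits new pseudo-labels, `counts` (and hence the adjustment) would be recomputed, which is what keeps the classifier matched to the known, updated distribution.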

Updated: 2025-11-03 12:24:52

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.03993v5

Neural Entropy

We explore the connection between deep learning and information theory through the paradigm of diffusion models. A diffusion model converts noise into structured data by reinstating, imperfectly, information that is erased when data was diffused to noise. This information is stored in a neural network during training. We quantify this information by introducing a measure called neural entropy, which is related to the total entropy produced by diffusion. Neural entropy is a function of not just the data distribution, but also the diffusive process itself. Measurements of neural entropy on a few simple image diffusion models reveal that they are extremely efficient at compressing large ensembles of structured data.

Updated: 2025-11-03 12:24:14

Categories: cs.LG,cond-mat.stat-mech,cs.IT,math.IT

Download: http://arxiv.org/abs/2409.03817v3

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.

Updated: 2025-11-03 12:21:07

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.04340v4

Multi-Agent Regime-Conditioned Diffusion (MARCD) for CVaR-Constrained Portfolio Decisions

We examine whether regime-conditioned generative scenarios combined with a convex CVaR allocator improve portfolio decisions under regime shifts. We present MARCD, a generative-to-decision framework with: (i) a Gaussian HMM to infer latent regimes; (ii) a diffusion generator that produces regime-conditioned scenarios; (iii) signal extraction via blended, shrunk moments; and (iv) a governed CVaR epigraph quadratic program. Contributions: Within the Scenario stage we introduce a tail-weighted diffusion objective that up-weights low-quantile outcomes relevant for drawdowns and a regime-expert (MoE) denoiser whose gate increases with crisis posteriors; both are evaluated end-to-end through the allocator. Under strict walk-forward on liquid multi-asset ETFs (2005-2025), MARCD exhibits stronger scenario calibration and materially smaller drawdowns: MaxDD 9.3% versus 14.1% for BL (a 34% reduction) over 2020-2025 out-of-sample. The framework provides an auditable pipeline with explicit budget, box, and turnover constraints, demonstrating the value of decision-aware generative modeling in finance.

Updated: 2025-11-03 12:18:32

Categories: cs.LG,q-fin.CP

Download: http://arxiv.org/abs/2510.10807v3

In Dialogue with Intelligence: Rethinking Large Language Models as Collective Knowledge

Large Language Models (LLMs) can be understood as Collective Knowledge (CK): a condensation of human cultural and technical output, whose apparent intelligence emerges in dialogue. This perspective article, drawing on extended interaction with ChatGPT-4, postulates differential response modes that plausibly trace their origin to distinct model subnetworks. It argues that CK has no persistent internal state or ``spine'': it drifts, it complies, and its behaviour is shaped by the user and by fine-tuning. It develops the notion of co-augmentation, in which human judgement and CK's representational reach jointly produce forms of analysis that neither could generate alone. Finally, it suggests that CK offers a tractable object for neuroscience: unlike biological brains, these systems expose their architecture, training history, and activation dynamics, making the human--CK loop itself an experimental target.

Updated: 2025-11-03 12:13:58

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2505.22767v3

NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables

On-device deep learning models are in extensive real-world demand. Deep learning compilers efficiently compile models into executables for deployment on edge devices, but these executables may face the threat of reverse engineering. Previous studies have attempted to decompile DNN executables, but they face challenges in handling compilation optimizations and analyzing quantized compiled models. In this paper, we present NeuroDeX to unlock diverse support in decompiling DNN executables. NeuroDeX leverages the semantic understanding capabilities of LLMs along with dynamic analysis to accurately and efficiently perform operator type recognition, operator attribute recovery and model reconstruction. NeuroDeX can recover DNN executables into high-level models across compilation optimizations, different architectures and quantized compiled models. We conduct experiments on 96 DNN executables across 12 common DNN models. Extensive experimental results demonstrate that NeuroDeX can decompile non-quantized executables into nearly identical high-level models. NeuroDeX can recover functionally similar high-level models for quantized executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more comprehensive and effective solution compared to previous DNN executable decompilers.

Updated: 2025-11-03 12:13:43

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2509.06402v2

AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems

Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently emerged as a "human-in-the-loop" strategy for efficiently navigating hyper-dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular simulations running on high-performance computing architectures, iMD-VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high-dimensional molecular systems. Moreover, iMD-VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across the robotics and multi-agent systems domains, which are comparable to iMD-VR, and discuss how iMD-VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof-of-principle study where iMD-VR data was used to train a CNN on a simple molecular manipulation task, namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.
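
In its simplest form (behavioural cloning), the IL setup described here reduces to supervised regression from recorded states to expert actions. A toy stand-in for learning from iMD-VR demonstrations, with a synthetic linear "expert" (all names and data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expert demonstrations: states -> actions generated by an
# unknown linear expert policy plus small motion noise.
W_expert = np.array([[0.5, -1.0],
                     [2.0, 0.3]])
states = rng.normal(size=(200, 2))
actions = states @ W_expert.T + 0.01 * rng.normal(size=(200, 2))

# Behavioural cloning: fit the policy by least squares on (state, action) pairs.
W_policy, *_ = np.linalg.lstsq(states, actions, rcond=None)
predicted = states @ W_policy
```

Real iMD-VR recordings would supply high-dimensional molecular configurations and 3D manipulation forces in place of this 2D toy, but the imitation objective is the same.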

Updated: 2025-11-03 11:59:01

Categories: cs.LG,cs.AI,cs.HC,q-bio.BM

Download: http://arxiv.org/abs/2409.07189v2

Open Agent Specification (Agent Spec) Technical Report

Open Agent Specification (Agent Spec) is a declarative language for defining AI agents and workflows in a way that is compatible across different AI frameworks, promoting portability and interoperability within AI Agent frameworks. Agent Spec aims to resolve the challenges of fragmented agent development by providing a common unified specification that allows AI agents to be designed once and deployed across various frameworks, improving interoperability and reusability, while reducing redundant efforts. Additionally, Agent Spec facilitates development tools and portability, allowing AI agents to be defined independently of their execution environment and enabling teams to exchange solutions without implementation-specific limitations. Agent Spec benefits four key groups: (i) Agent developers, who gain a superset of reusable components and design patterns, enabling them to leverage a broader range of functionalities; (ii) Agent framework and tool developers, who can use Agent Spec as an interchange format and therefore benefit from cross-framework and tool support; (iii) Researchers, who can achieve reproducible results and comparability, facilitating more reliable and consistent outcomes; (iv) Enterprises, which see faster prototype-to-deployment, increased productivity, and greater scalability and maintainability for their AI agent solutions. This technical report provides an overview of the technical foundations of Agent Spec, including motivation, benefits, and future work. We also introduce a standardized Evaluation harness to assess agent behavior and agentic workflows across runtimes (LangGraph, CrewAI, AutoGen, and WayFlow), using three different benchmarks (SimpleQA Verified, $\tau^2$-Bench and BIRD-SQL) - analogous to how HELM and related harnesses standardized LLM evaluation - so that performance, robustness, and efficiency can be compared consistently across frameworks.

Updated: 2025-11-03 11:55:32

Categories: cs.AI

Download: http://arxiv.org/abs/2510.04173v3

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.

Updated: 2025-11-03 11:52:57

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2505.06111v3

Differentiable Generalized Sliced Wasserstein Plans

Optimal Transport (OT) has attracted significant interest in the machine learning community, not only for its ability to define meaningful distances between probability distributions -- such as the Wasserstein distance -- but also for its formulation of OT plans. Its computational complexity remains a bottleneck, though, and slicing techniques have been developed to scale OT to large datasets. Recently, a novel slicing scheme, dubbed min-SWGG, lifts a single one-dimensional plan back to the original multidimensional space, finally selecting the slice that yields the lowest Wasserstein distance as an approximation of the full OT plan. Despite its computational and theoretical advantages, min-SWGG inherits typical limitations of slicing methods: (i) the number of required slices grows exponentially with the data dimension, and (ii) it is constrained to linear projections. Here, we reformulate min-SWGG as a bilevel optimization problem and propose a differentiable approximation scheme to efficiently identify the optimal slice, even in high-dimensional settings. We furthermore define its generalized extension for accommodating to data living on manifolds. Finally, we demonstrate the practical value of our approach in various applications, including gradient flows on manifolds and high-dimensional spaces, as well as a novel sliced OT-based conditional flow matching for image generation -- where fast computation of transport plans is essential.
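
The min-SWGG idea can be sketched in a few lines: project both point clouds onto a direction, sort to obtain a one-dimensional plan, lift that matching back to the ambient space, and keep the slice with the lowest lifted cost. A NumPy sketch for equal-size, uniform-weight clouds (our simplification, with random rather than optimized slices):

```python
import numpy as np

def min_swgg(X, Y, n_slices=100, seed=0):
    """Keep the slice whose sorted 1D plan, lifted to the ambient space,
    gives the lowest squared transport cost."""
    rng = np.random.default_rng(seed)
    best_cost, best_plan = np.inf, None
    for _ in range(n_slices):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)
        sx, sy = np.argsort(X @ theta), np.argsort(Y @ theta)  # 1D plan by sorting
        cost = np.sum((X[sx] - Y[sy]) ** 2)                    # lifted, full-dim cost
        if cost < best_cost:
            best_cost, best_plan = cost, (sx, sy)
    return best_cost, best_plan

# For a pure translation Y = X + c every slice induces the same matching,
# and the lifted cost equals n * ||c||^2, the exact squared-W2 cost.
X = np.random.default_rng(1).normal(size=(50, 3))
Y = X + np.array([1.0, 0.0, 0.0])
cost, _ = min_swgg(X, Y)
```

The paper's contribution replaces this random search over theta with a differentiable bilevel scheme, and generalizes the linear projections `X @ theta` that limit the sketch above.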

Updated: 2025-11-03 11:40:48

Categories: cs.LG

Download: http://arxiv.org/abs/2505.22049v2

MO-SeGMan: Rearrangement Planning Framework for Multi Objective Sequential and Guided Manipulation in Constrained Environments

In this work, we introduce MO-SeGMan, a Multi-Objective Sequential and Guided Manipulation planner for highly constrained rearrangement problems. MO-SeGMan generates object placement sequences that minimize both replanning per object and robot travel distance while preserving critical dependency structures with a lazy evaluation method. To address highly cluttered, non-monotone scenarios, we propose a Selective Guided Forward Search (SGFS) that efficiently relocates only critical obstacles, and only to feasible relocation points. Furthermore, we adopt a refinement method for adaptive subgoal selection to eliminate unnecessary pick-and-place actions, thereby improving overall solution quality. Extensive evaluations on nine benchmark rearrangement tasks demonstrate that MO-SeGMan generates feasible motion plans in all cases, consistently achieving faster solution times and superior solution quality compared to the baselines. These results highlight the robustness and scalability of the proposed framework for complex rearrangement planning problems.

Updated: 2025-11-03 11:38:57

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2511.01476v1

The Limits of AI Explainability: An Algorithmic Information Theory Approach

This paper establishes a theoretical foundation for understanding the fundamental limits of AI explainability through algorithmic information theory. We formalize explainability as the approximation of complex models by simpler ones, quantifying both approximation error and explanation complexity using Kolmogorov complexity. Our key theoretical contributions include: (1) a complexity gap theorem proving that any explanation significantly simpler than the original model must differ from it on some inputs; (2) precise bounds showing that explanation complexity grows exponentially with input dimension but polynomially with error tolerance for Lipschitz functions; and (3) a characterization of the gap between local and global explainability, demonstrating that local explanations can be significantly simpler while maintaining accuracy in relevant regions. We further establish a regulatory impossibility theorem proving that no governance framework can simultaneously pursue unrestricted AI capabilities, human-interpretable explanations, and negligible error. These results highlight considerations likely to be relevant to the design, evaluation, and oversight of explainable AI systems.

Updated: 2025-11-03 11:37:53

Categories: cs.AI,cs.CY,cs.IT,math.IT,68Q30, 68T01,I.2.0; H.1.1; K.4.1

Download: http://arxiv.org/abs/2504.20676v2

ConTextTab: A Semantics-Aware Tabular In-Context Learner

Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/sap-rpt-1-oss.

Updated: 2025-11-03 11:32:32

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.10707v4

Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
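
A stripped-down version of the aggregation rule: group candidate responses by their final answer, then score each answer by its count discounted by the dispersion of the cached activations behind it. The exact weighting below is our illustrative choice, not the paper's:

```python
import numpy as np

def rc_aggregate(answers, activations):
    """Select the answer maximizing count / (1 + mean activation variance)."""
    scores = {}
    for ans in set(answers):
        acts = np.stack([a for a, s in zip(activations, answers) if s == ans])
        dispersion = acts.var(axis=0).mean()   # variability of internal states
        scores[ans] = len(acts) / (1.0 + dispersion)
    return max(scores, key=scores.get)

# "42" and "17" tie on counts, but the activations behind "42" are tightly
# clustered while those behind "17" scatter, so "17" is down-weighted.
answers = ["42", "17", "42", "17"]
activations = [np.array([1.00, 1.00]), np.array([5.0, -5.0]),
               np.array([1.01, 0.99]), np.array([-5.0, 5.0])]
```

Only cached activation vectors and a variance are needed, mirroring the paper's point that the method requires lightweight similarity computations and no additional model queries.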

Updated: 2025-11-03 11:32:13

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2506.21590v2

DAMBench: A Multi-Modal Benchmark for Deep Learning-based Atmospheric Data Assimilation

Data Assimilation is a cornerstone of atmospheric system modeling, tasked with reconstructing system states by integrating sparse, noisy observations with prior estimation. While traditional approaches like variational and ensemble Kalman filtering have proven effective, recent advances in deep learning offer more scalable, efficient, and flexible alternatives better suited for complex, real-world data assimilation involving large-scale and multi-modal observations. However, existing deep learning-based DA research suffers from two critical limitations: (1) reliance on oversimplified scenarios with synthetically perturbed observations, and (2) the absence of standardized benchmarks for fair model comparison. To address these gaps, in this work, we introduce DAMBench, the first large-scale multi-modal benchmark designed to evaluate data-driven DA models under realistic atmospheric conditions. DAMBench integrates high-quality background states from state-of-the-art forecasting systems and real-world multi-modal observations (i.e., real-world weather stations and satellite imagery). All data are resampled to a common grid and temporally aligned to support systematic training, validation, and testing. We provide unified evaluation protocols and benchmark representative data assimilation approaches, including latent generative models and neural process frameworks. Additionally, we propose a lightweight multi-modal plugin to demonstrate how integrating realistic observations can enhance even simple baselines. Through comprehensive experiments, DAMBench establishes a rigorous foundation for future research, promoting reproducibility, fair comparison, and extensibility to real-world multi-modal scenarios. Our dataset and code are publicly available at https://github.com/figerhaowang/DAMBench.

Updated: 2025-11-03 11:26:26

标题: DAMBench:基于深度学习的大气数据同化的多模态基准

摘要: 数据同化是大气系统建模的基石,其任务是通过将稀疏、嘈杂的观测数据与先前的估计相结合来重建系统状态。虽然传统方法如变分和集合卡尔曼滤波已被证明有效,但深度学习的最新进展提供了更可扩展、高效、灵活的替代方案,更适合于涉及大规模和多模态观测的复杂、真实世界的数据同化。然而,现有基于深度学习的数据同化研究存在两个关键限制:(1)依赖于对合成扰动观测进行过度简化的场景,以及(2)缺乏公平模型比较的标准基准。为解决这些差距,在这项工作中,我们引入了DAMBench,这是第一个大规模多模态基准,旨在评估在真实大气条件下的数据驱动型数据同化模型。DAMBench集成了来自最先进预报系统的高质量背景状态和真实世界多模态观测数据(即真实世界的气象站和卫星图像)。所有数据都被重新采样到一个共同的网格,并在时间上对齐,以支持系统化的训练、验证和测试。我们提供统一的评估协议,并对代表性的数据同化方法进行基准测试,包括潜在生成模型和神经过程框架。此外,我们提出了一个轻量级的多模态插件,以演示如何整合真实观测可以增强甚至简单的基线模型。通过全面的实验,DAMBench为未来研究建立了严格的基础,促进可复制性、公平比较和对真实世界多模态场景的可扩展性。我们的数据集和代码可在https://github.com/figerhaowang/DAMBench上公开获取。

更新时间: 2025-11-03 11:26:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.01468v1

3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data

Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and a Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) a cube-based method in which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships and to guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
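
The cosine-similarity graph construction mentioned above can be sketched as follows; the similarity `threshold` value and the removal of self-loops are illustrative assumptions:

```python
import numpy as np

def cosine_similarity_graph(embeddings, threshold=0.5):
    """Build an adjacency matrix over region embeddings by thresholded
    cosine similarity; the resulting graph guides GNN classification."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = E @ E.T                       # pairwise cosine similarities
    adj = (sim >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)          # drop self-loops
    return adj
```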

Updated: 2025-11-03 11:24:32

标题: 3DViT-GAT:一种统一的基于脑图谱的三维视觉变换器与图学习框架,用于利用结构MRI数据检测重性抑郁症

摘要: 抑郁症是一种普遍存在的精神健康状况,对个人幸福和全球公共卫生产生负面影响。利用结构性磁共振成像(sMRI)和深度学习(DL)方法自动检测抑郁症,有望提高诊断准确性并实现早期干预。大多数现有方法要么使用体素级特征,要么使用从预定义脑部图谱构建的手工区域表示,限制了捕捉复杂脑模式的能力。本文开发了一个统一的流程,利用Vision Transformers(ViTs)从sMRI数据中提取3D区域嵌入,并使用图神经网络(GNN)进行分类。我们探索了两种定义区域的策略:(1)使用基于图谱的方法,使用预定义的结构和功能性脑图谱,(2)使用基于立方体的方法,通过直接训练ViTs来识别均匀提取的3D补丁中的区域。此外,生成余弦相似性图来建模区域间的关系,并引导基于GNN的分类。通过使用REST-meta-MDD数据集进行大量实验,证明了我们模型的有效性。在分层10折交叉验证中,最佳模型获得了78.98%的准确性,76.54%的敏感性,81.58%的特异性,81.58%的精确度和78.98%的F1分数。此外,基于图谱的模型始终优于基于立方体的方法,突显了在抑郁症检测中使用特定领域解剖先验的重要性。

更新时间: 2025-11-03 11:24:32

领域: cs.CV,cs.AI,62P10, 68T07, 92B20,I.2.6; J.3

下载: http://arxiv.org/abs/2509.12143v2

HMVLM: Human Motion-Vision-Language Model via MoE LoRA

The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically-rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture of Expert Low-Rank Adaption(MoE LoRA) strategy. The framework leverages the gating network to dynamically allocate LoRA expert weights based on the input prompt, enabling synchronized fine-tuning of multiple tasks. To mitigate catastrophic forgetting during instruction-tuning, we introduce a novel zero expert that preserves the pre-trained parameters for general linguistic tasks. For pose representation, we implement body-part-specific tokenization by partitioning the human body into different joint groups, enhancing the spatial resolution of the representation. Experiments show that our method effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks.
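
The gating mechanism and the zero expert described above admit a compact sketch: a softmax router mixes low-rank LoRA updates on top of a frozen weight, and routing to the zero expert leaves the pretrained output untouched. The dense (non-top-k) gating and the tiny shapes are simplifying assumptions for illustration:

```python
import numpy as np

def moe_lora_forward(x, W0, experts, gate_logits):
    """y = x @ W0 + sum_e g_e * (x @ A_e @ B_e). `experts` is a list of
    (A, B) low-rank pairs, or None for the zero expert, which contributes
    no update and so preserves pretrained behaviour when routed to."""
    g = np.exp(gate_logits - gate_logits.max())
    g = g / g.sum()                      # softmax gate weights
    y = x @ W0                           # frozen pretrained path
    for w, exp in zip(g, experts):
        if exp is None:                  # zero expert: no LoRA update
            continue
        A, B = exp
        y = y + w * (x @ A @ B)
    return y
```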

Updated: 2025-11-03 11:22:10

标题: HMVLM:通过MoE LoRA的人体动作-视觉-语言模型

摘要: 指导调整数据的扩展使基础语言模型得以展现出改进的指导遵从和优越的性能,能够在各种不同的下游任务中表现出色。语义丰富的3D人体运动正逐渐与这些基础模型整合,以增强多模态理解和跨模态生成能力。然而,人体运动和文本之间的模态差距引发了关于在整合过程中的灾难性遗忘的未解决问题。此外,开发保持泛化性的自回归兼容姿势表示对于跨异构下游任务仍然是一个关键的技术障碍。为了解决这些问题,我们提出了基于Mixture of Expert Low-Rank Adaption(MoE LoRA)策略的人体运动-视觉-语言模型(HMVLM),这是一个统一的框架。该框架利用门控网络根据输入提示动态分配LoRA专家权重,实现多任务的同步微调。为了减轻指导调整过程中的灾难性遗忘,我们引入了一个新颖的零专家,保留了用于一般语言任务的预训练参数。对于姿势表示,我们通过将人体分为不同的关节组来实现特定身体部位的标记化,增强了表示的空间分辨率。实验表明,我们的方法有效减轻了指导调整过程中的知识遗忘,并在各种人体运动下游任务中取得了显著的性能。

更新时间: 2025-11-03 11:22:10

领域: cs.CV,cs.AI,cs.GR,68T45,I.2.10; I.3.7

下载: http://arxiv.org/abs/2511.01463v1

Efficiently Training a Flat Neural Network Before It Has Been Quantized

Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the relationship between a well-trained NN and the quantized model, leading to considerable quantization error for PTQ. Moreover, it is unclear how to efficiently train a model-agnostic neural network that is tailored for a low-bit model of predefined precision. In this paper, we first discover that a flat full-precision neural network is crucial for low-bit quantization. To achieve this, we propose a framework that proactively pre-conditions the model by measuring and disentangling the error sources. Specifically, both the Activation Quantization Error (AQE) and the Weight Quantization Error (WQE) are statistically modeled as independent Gaussian noises. We study several noise injection optimization methods to obtain a flat minimum. Experimental results attest to the effectiveness of our approach. These results open novel pathways for obtaining low-bit PTQ models.
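
The idea of modeling AQE and WQE as independent Gaussian noises, and of probing flatness, can be sketched as below; the noise scales `sigma_w`/`sigma_a` and the Monte-Carlo sharpness estimate are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_forward(x, W, sigma_w=0.05, sigma_a=0.05):
    """One linear layer with Weight Quantization Error (WQE) and
    Activation Quantization Error (AQE) modelled as independent
    Gaussian perturbations injected during training."""
    W_noisy = W + rng.normal(0.0, sigma_w, W.shape)   # WQE
    a = x @ W_noisy
    return a + rng.normal(0.0, sigma_a, a.shape)      # AQE

def sharpness(loss_fn, W, n=64, sigma=0.05):
    """Monte-Carlo flatness probe: expected loss increase under Gaussian
    weight perturbations (lower = flatter minimum)."""
    base = loss_fn(W)
    rises = [loss_fn(W + rng.normal(0.0, sigma, W.shape)) - base
             for _ in range(n)]
    return float(np.mean(rises))
```

A sharply curved loss shows a much larger expected rise than a flat one under the same perturbation scale, which is exactly what noise-injection training minimizes.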

Updated: 2025-11-03 11:21:45

标题: 在神经网络被量化之前高效训练一个平坦的神经网络

摘要: 视觉变换器(ViTs)的后训练量化(PTQ)因其在压缩模型方面的效率而受到广泛关注。然而,现有方法通常忽视了训练良好的神经网络与量化模型之间的关系,导致PTQ的量化误差相当大。然而,如何有效地训练一个面向预定义精度低比特模型的模型无关神经网络尚不清楚。在本文中,我们首先发现,对于低比特量化,一个平坦的全精度神经网络是至关重要的。为了实现这一目标,我们提出了一个框架,通过测量和解耦误差源来主动预处理模型。具体来说,激活量化误差(AQE)和权重量化误差(WQE)被统计建模为独立的高斯噪声。我们研究了几种噪声注入优化方法以获得平坦的最小值。实验结果证明了我们方法的有效性。这些结果为获得低比特PTQ模型开辟了新的途径。

更新时间: 2025-11-03 11:21:45

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.01462v1

When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA

Safety and reliability are essential for deploying Visual Question Answering (VQA) in surgery, where incorrect or ambiguous responses can harm the patient. Most surgical VQA research focuses on accuracy or linguistic quality while overlooking safety behaviors such as ambiguity awareness, referral to human experts, or triggering a second opinion. Inspired by Automatic Failure Detection (AFD), we study uncertainty estimation as a key enabler of safer decision making. We introduce Question Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), a black box uncertainty estimator that incorporates question semantics into prediction confidence. It measures semantic entropy by comparing generated answers with nearest neighbors in a medical text embedding space, conditioned on the question. We evaluate five models, including domain specific Parameter-Efficient Fine-Tuned (PEFT) models and zero-shot Large Vision-Language Models (LVLMs), on EndoVis18-VQA and PitVQA. PEFT models degrade under mild paraphrasing, while LVLMs are more resilient. Across three LVLMs and two PEFT baselines, QA-SNNE improves AUROC in most in-template settings and enhances hallucination detection. The Area Under the ROC Curve (AUROC) increases by 15-38% for zero-shot models, with gains maintained under out-of-template stress. QA-SNNE offers a practical and interpretable step toward AFD in surgical VQA by linking semantic uncertainty to question context. Combining LVLM backbones with question aligned uncertainty estimation can improve safety and clinician trust. The code and model are available at https://github.com/DennisPierantozzi/QASNNE
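
An illustrative reading of the estimator described above: answers are embedded jointly with the question, grouped into semantic clusters by nearest-neighbor similarity, and the entropy of the cluster-size distribution serves as the uncertainty score (0 means fully consistent responses). The greedy clustering and the `threshold` value are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def qa_snne(question_vec, answer_vecs, threshold=0.9):
    """Question-aligned semantic entropy sketch: concatenate each answer
    embedding with the question embedding, greedily merge answers whose
    cosine similarity exceeds `threshold` into semantic clusters, and
    return the Shannon entropy of the cluster-size distribution."""
    q = np.asarray(question_vec, float)
    Z = np.array([np.concatenate([q, np.asarray(a, float)])
                  for a in answer_vecs])
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    labels = [-1] * len(Z)
    for i in range(len(Z)):
        for j in range(i):
            if Z[i] @ Z[j] >= threshold:   # join an existing cluster
                labels[i] = labels[j]
                break
        if labels[i] == -1:
            labels[i] = i                  # start a new cluster
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())
```

Four semantically identical answers give entropy 0 (high confidence); an even split between two meanings gives entropy log 2.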

Updated: 2025-11-03 11:18:21

标题: 何时相信答案:针对更安全的外科问答的问题对齐语义最近邻熵

摘要: 安全性和可靠性对于在外科手术中部署视觉问答(VQA)至关重要,错误或模糊的回答可能会对患者造成伤害。大多数外科VQA研究都侧重于准确性或语言质量,而忽视了安全行为,如模糊性意识、转介给人类专家或触发第二意见。受自动故障检测(AFD)的启发,我们研究了不确定性估计作为更安全决策制定的关键因素。我们引入了问题对齐语义最近邻熵(QA-SNNE),这是一个黑盒不确定性估计器,将问题语义纳入到预测置信度中。它以问题为条件,通过将生成的答案与医学文本嵌入空间中的最近邻进行比较来衡量语义熵。我们在EndoVis18-VQA和PitVQA上评估了五个模型,包括特定领域的参数高效微调(PEFT)模型和零样本大型视觉语言模型(LVLMs)。PEFT模型在轻微改写下性能下降,而LVLMs更具弹性。在三个LVLM和两个PEFT基线模型中,QA-SNNE在大多数内模板设置中提高了AUROC,并增强了幻觉检测。对于零样本模型,ROC曲线下的面积(AUROC)增加了15-38%,并在模板外压力下保持增益。QA-SNNE通过将语义不确定性与问题背景联系起来,为外科VQA中的AFD提供了实用且可解释的一步。将LVLM主干与问题对齐的不确定性估计相结合可以提高安全性和临床医生的信任。代码和模型可在https://github.com/DennisPierantozzi/QASNNE 上获得。

更新时间: 2025-11-03 11:18:21

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.01458v1

Runtime Analysis of Evolutionary Algorithms for Multi-party Multi-objective Optimization

In scenarios where multiple decision-makers operate within a common decision space, each focusing on their own multi-objective optimization problem (e.g., bargaining games), the problem can be modeled as a multi-party multi-objective optimization problem (MPMOP). While numerous evolutionary algorithms have been proposed to solve MPMOPs, most results remain empirical. This paper presents the first theoretical analysis of the expected runtime of evolutionary algorithms on bi-party multi-objective optimization problems (BPMOPs). Our findings demonstrate that employing traditional multi-objective optimization algorithms to solve MPMOPs is both time-consuming and inefficient, as the resulting population contains many solutions that fail to achieve consensus among decision-makers. An alternative approach involves decision-makers individually solving their respective optimization problems and seeking consensus only in the final stage. While feasible for pseudo-Boolean optimization problems, this method may fail to guarantee approximate performance for one party on NP-hard problems. Finally, we propose evolutionary multi-party multi-objective optimizers (EMPMO) for pseudo-Boolean optimization and shortest path problems within a multi-party multi-objective context, which maintain a common solution set among all parties. Theoretical and experimental results demonstrate that the proposed \( \text{EMPMO}_{\text{random}} \) outperforms previous algorithms in terms of the lower bound on the expected runtime for pseudo-Boolean optimization problems. Additionally, the consensus-based evolutionary multi-party multi-objective optimizer (\( \text{EMPMO}_{\text{cons}}^{\text{SP}} \)) achieves better efficiency and precision in solving shortest path problems compared to existing algorithms.

Updated: 2025-11-03 11:12:40

标题: 多方多目标优化问题的进化算法运行时分析

摘要: 在多个决策者在共同的决策空间内运作,每个关注自己的多目标优化问题(例如,协商游戏)的场景中,问题可以建模为多方多目标优化问题(MPMOP)。虽然已经提出了许多进化算法来解决MPMOP,但大多数结果仍然是经验性的。本文首次对双方多目标优化问题(BPMOP)上的进化算法的期望运行时间进行了理论分析。我们的研究结果表明,使用传统的多目标优化算法来解决MPMOP既耗时又低效,因为所得到的种群包含许多无法在决策者之间达成共识的解决方案。另一种方法是决策者分别解决各自的优化问题,并仅在最后阶段寻求共识。虽然对于伪布尔优化问题是可行的,但这种方法可能无法保证在NP难问题中为一方提供近似性能。最后,我们提出了伪布尔优化和最短路径问题的进化多方多目标优化器(EMPMO),在多方多目标的背景下保持所有方之间的共同解决方案集。理论和实验结果表明,所提出的\( \text{EMPMO}_{\text{random}} \)在伪布尔优化问题的期望运行时间下限方面优于先前的算法。此外,基于共识的进化多方多目标优化器(\( \text{EMPMO}_{\text{cons}}^{\text{SP}} \))相对于现有算法在解决最短路径问题时具有更高的效率和精度。

更新时间: 2025-11-03 11:12:40

领域: cs.NE,cs.AI

下载: http://arxiv.org/abs/2501.16336v3

Security-Aware Joint Sensing, Communication, and Computing Optimization in Low Altitude Wireless Networks

As terrestrial resources become increasingly saturated, research attention is shifting to the low-altitude airspace, with many emerging applications such as urban air taxis and aerial inspection. Low-Altitude Wireless Networks (LAWNs) are the foundation for these applications, with integrated sensing, communications, and computing (ISCC) being one of the core parts of LAWNs. However, the openness of low-altitude airspace exposes communications to security threats, degrading ISCC performance and ultimately compromising the reliability of applications supported by LAWNs. To address these challenges, this paper studies joint performance optimization of ISCC while considering the secrecy of the communications. Specifically, we derive beampattern error, secrecy rate, and age of information (AoI) as performance metrics for sensing, secrecy communication, and computing. Building on these metrics, we formulate a multi-objective optimization problem that balances sensing and computation performance while keeping the probability of communication being detected below a required threshold. We then propose a deep Q-network (DQN)-based multi-objective evolutionary algorithm, which adaptively selects evolutionary operators according to the evolving optimization objectives, thereby leading to more effective solutions. Extensive simulations show that the proposed method achieves a superior balance among sensing accuracy, communication secrecy, and information freshness compared with baseline algorithms, thereby safeguarding ISCC performance and LAWN-supported low-altitude applications.
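
The adaptive operator selection can be pictured with a tabular stand-in for the DQN: one Q-value per evolutionary operator, updated from the reward (e.g., improvement of the multi-objective score) and sampled epsilon-greedily. The tabular update replaces the paper's neural Q-network purely for illustration:

```python
import numpy as np

class OperatorSelector:
    """Tabular stand-in for DQN-based adaptive operator selection:
    keep a Q-value per evolutionary operator, update it from observed
    rewards, and choose operators epsilon-greedily."""
    def __init__(self, n_ops, eps=0.1, lr=0.5, seed=0):
        self.q = np.zeros(n_ops)
        self.eps, self.lr = eps, lr
        self.rng = np.random.default_rng(seed)

    def select(self):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.q)))   # explore
        return int(np.argmax(self.q))                    # exploit

    def update(self, op, reward):
        self.q[op] += self.lr * (reward - self.q[op])    # TD-style update
```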

Updated: 2025-11-03 11:06:41

标题: 低空无线网络中具有安全意识的感知、通信与计算联合优化

摘要: 随着陆地资源日益饱和,研究重点转向低空领域,涌现出许多应用,如城市空中出租车和空中检查。低空无线网络(LAWN)是这些应用的基础,集成感知、通信和计算(ISCC)是LAWN的核心部分之一。然而,低空领域的开放性使通信容易受到安全威胁,降低ISCC性能,最终危及LAWN支持的应用程序的可靠性。为了应对这些挑战,本文研究了在考虑通信保密性的情况下ISCC的联合性能优化。具体地,我们将波束失配误差、保密速率和信息时代(AoI)作为感知、保密通信和计算的性能指标。基于这些指标,我们制定了一个多目标优化问题,平衡感知和计算性能,同时保持通信被检测的概率低于所需的阈值。然后,我们提出了基于深度Q网络(DQN)的多目标进化算法,根据不断演化的优化目标自适应选择进化算子,从而产生更有效的解决方案。大量的模拟表明,与基线算法相比,所提出的方法在感知准确性、通信保密性和信息新鲜度之间实现了更优的平衡,从而保障了ISCC性能和LAWN支持的低空应用程序。

更新时间: 2025-11-03 11:06:41

领域: cs.CR

下载: http://arxiv.org/abs/2511.01451v1

Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce a GT-Pair that automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO objective to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.
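
The Reg-DPO objective sketched from the description above: the standard DPO preference loss on (chosen, rejected) log-probabilities plus the SFT negative log-likelihood of the chosen sample as a regularizer. The scalar inputs and the weight `lam` are illustrative assumptions:

```python
import numpy as np

def reg_dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1, lam=1.0):
    """DPO preference loss plus SFT regularisation. lp_w/lp_l are policy
    log-probs of the chosen/rejected samples, ref_w/ref_l the reference
    model's; `lam` weights the SFT term (assumed, not from the paper)."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    dpo = -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
    sft = -lp_w                                    # SFT NLL of chosen sample
    return dpo + lam * sft
```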

Updated: 2025-11-03 11:04:22

标题: Reg-DPO:具有GT-Pair的SFT正则化直接偏好优化以改善视频生成

摘要: 最近的研究确定了直接偏好优化(DPO)作为提高视频生成质量的一种高效且无需奖励的方法。然而,现有方法主要遵循图像域范式,并主要在小规模模型(大约2B参数)上开发,限制了它们解决视频任务的独特挑战的能力,如昂贵的数据构建、不稳定的训练和大量的内存消耗。为了克服这些限制,我们引入了一个GT-Pair,通过使用真实视频作为正样本和模型生成的视频作为负样本,自动构建高质量的偏好对,消除了对任何外部注释的需要。我们进一步提出了Reg-DPO,将SFT损失作为正则化项整合到DPO目标中,以增强训练稳定性和生成保真度。此外,通过将FSDP框架与多种内存优化技术结合,我们的方法实现了近三倍于单独使用FSDP的训练容量。在多个数据集上进行的广泛实验表明,我们的方法在I2V和T2V任务上始终优于现有方法,提供卓越的视频生成质量。

更新时间: 2025-11-03 11:04:22

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.01450v1

Privacy Preserving Ordinal-Meta Learning with VLMs for Fine-Grained Fruit Quality Prediction

To effectively manage the wastage of perishable fruits, it is crucial to accurately predict their freshness or shelf life using non-invasive methods that rely on visual data. In this regard, deep learning techniques can offer a viable solution. However, obtaining fine-grained fruit freshness labels from experts is costly, leading to a scarcity of data. Closed proprietary Vision Language Models (VLMs), such as Gemini, have demonstrated strong performance in fruit freshness detection task in both zero-shot and few-shot settings. Nonetheless, food retail organizations are unable to utilize these proprietary models due to concerns related to data privacy, while existing open-source VLMs yield sub-optimal performance for the task. Fine-tuning these open-source models with limited data fails to achieve the performance levels of proprietary models. In this work, we introduce a Model-Agnostic Ordinal Meta-Learning (MAOML) algorithm, designed to train smaller VLMs. This approach utilizes meta-learning to address data sparsity and leverages label ordinality, thereby achieving state-of-the-art performance in the fruit freshness classification task under both zero-shot and few-shot settings. Our method achieves an industry-standard accuracy of 92.71%, averaged across all fruits. Keywords: Fruit Quality Prediction, Vision Language Models, Meta Learning, Ordinal Regression
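
Label ordinality is commonly exploited via the extended-binary encoding, which the MAOML description suggests: a freshness label k becomes k leading ones, and the model solves K-1 cumulative "rank exceeds k" binary problems. This generic ordinal-regression sketch is an assumption about the mechanism, not the paper's exact loss:

```python
import numpy as np

def ordinal_targets(label, n_classes):
    """Encode an ordinal label k as k leading ones ([1,...,1,0,...]),
    so 'fresh > stale' is treated as ordered, not categorical."""
    return (np.arange(n_classes - 1) < label).astype(float)

def ordinal_loss(logits, label):
    """Mean binary cross-entropy of the K-1 cumulative 'rank > k'
    predictions against the extended-binary target."""
    t = ordinal_targets(label, len(logits) + 1)
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, float)))
    eps = 1e-12
    return float(-np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)))
```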

Updated: 2025-11-03 11:03:54

标题: 基于视觉语言模型的隐私保护序数元学习用于细粒度水果质量预测

摘要: 为了有效管理易腐坏水果的浪费,准确预测它们的新鲜度或保质期至关重要,使用依赖视觉数据的非侵入式方法。在这方面,深度学习技术可以提供可行的解决方案。然而,从专家获取细粒度水果新鲜度标签成本高昂,导致数据稀缺。闭源专有的视觉语言模型(VLMs),如Gemini,在零样本和少样本设置中展示了强大的水果新鲜度检测性能。然而,食品零售机构由于与数据隐私相关的担忧而无法利用这些专有模型,而现有的开源VLMs在任务中表现出次优性能。用有限数据微调这些开源模型无法达到专有模型的性能水平。在这项工作中,我们介绍了一种Model-Agnostic Ordinal Meta-Learning(MAOML)算法,旨在训练较小的VLMs。这种方法利用元学习来解决数据稀疏性,并利用标签序数性,从而在零样本和少样本设置下实现水果新鲜度分类任务的最新性能。我们的方法在所有水果上平均实现了92.71%的行业标准准确度。 关键词:水果质量预测,视觉语言模型,元学习,序数回归

更新时间: 2025-11-03 11:03:54

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.01449v1

A probabilistic view on Riemannian machine learning models for SPD matrices

The goal of this paper is to show how different machine learning tools on the Riemannian manifold $\mathcal{P}_d$ of Symmetric Positive Definite (SPD) matrices can be united under a probabilistic framework. For this, we will need several Gaussian distributions defined on $\mathcal{P}_d$. We will show how popular classifiers on $\mathcal{P}_d$ can be reinterpreted as Bayes Classifiers using these Gaussian distributions. These distributions will also be used for outlier detection and dimension reduction. By showing that those distributions are pervasive in the tools used on $\mathcal{P}_d$, we allow for other machine learning tools to be extended to $\mathcal{P}_d$.
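
One concrete instance of the reinterpretation described above: with an isotropic Gaussian on the log-Euclidean tangent space and equal class priors, the Bayes rule reduces to a minimum-distance classifier between SPD matrices. The log-Euclidean metric (rather than, say, the affine-invariant one) is an illustrative choice:

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_dist(A, B):
    """Log-Euclidean distance on the SPD manifold P_d."""
    return np.linalg.norm(spd_log(A) - spd_log(B))

def bayes_classify(S, class_means):
    """Minimum-distance rule: with an isotropic Gaussian per class on the
    tangent space and equal priors, this is exactly the Bayes classifier."""
    d = [log_euclidean_dist(S, M) for M in class_means]
    return int(np.argmin(d))
```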

Updated: 2025-11-03 10:59:57

标题: SPD矩阵上黎曼机器学习模型的一种概率视角

摘要: 本文的目标是展示如何将黎曼流形$\mathcal{P}_d$上的不同机器学习工具统一到一个概率框架下。为此,我们需要在$\mathcal{P}_d$上定义几个高斯分布。我们将展示如何将$\mathcal{P}_d$上的流行分类器重新解释为使用这些高斯分布的贝叶斯分类器。这些分布还将用于异常检测和降维。通过展示这些分布在$\mathcal{P}_d$上使用的工具中普遍存在,我们允许其他机器学习工具被扩展到$\mathcal{P}_d$。

更新时间: 2025-11-03 10:59:57

领域: cs.LG,math.ST,stat.ML,stat.TH

下载: http://arxiv.org/abs/2505.02402v2

Geospatial Foundation Models to Enable Progress on Sustainable Development Goals

Foundation Models (FMs) are large-scale, pre-trained artificial intelligence (AI) systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.

Updated: 2025-11-03 10:58:42

标题: 地理空间基础模型促进可持续发展目标的实现

摘要: 基础模型(FMs)是大规模、预先训练的人工智能(AI)系统,已经彻底改变了自然语言处理和计算机视觉,现在也在推动地理空间分析和地球观测(EO)。它们承诺在任务间实现改进的泛化性能、可伸缩性和高效适应性,并且只需很少标记数据。然而,尽管地理空间FMs快速增长,它们在现实世界中的效用和与全球可持续发展目标的一致性仍然未被充分探讨。我们引入了SustainFM,一个基于17个可持续发展目标的综合基准框架,涵盖了从资产财富预测到环境灾害检测等极其多样化的任务。这项研究对地理空间FMs进行了严格的跨学科评估,并提供了关于它们在实现可持续发展目标中的作用的关键见解。我们的研究表明:(1)尽管并非普遍优越,FMs在各种任务和数据集上通常优于传统方法。 (2)评估FMs应该超越准确性,将可迁移性、泛化性和能源效率作为其负责任使用的关键标准。 (3)FMs实现可扩展的、以SDG为基础的解决方案,为解决复杂的可持续性挑战提供广泛的实用性。至关重要的是,我们主张从以模型为中心的开发转向以影响为驱动的部署范式,并强调能源效率、领域转换的鲁棒性和道德考量等指标。

更新时间: 2025-11-03 10:58:42

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2505.24528v2

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and showing the potential of InnovatorBench to be the next generation of code-based research benchmark.

Updated: 2025-11-03 10:56:21

标题: InnovatorBench:评估代理进行创新性LLM研究的能力

摘要: 人工智能代理可以通过自动化假设形成、实验设计、编码、执行和分析来加速科学发现,但现有的基准测试在简化设置中探测狭窄技能。为了弥补这一差距,我们引入了InnovatorBench,这是一个用于对执行大型语言模型(LLM)研究的代理进行现实的端到端评估的基准-平台对。它包括20个任务,涵盖了数据构建、过滤、增强、损失设计、奖励设计和脚手架构建,这些任务需要可运行的工件以及正确性、性能、输出质量和不确定性的评估。为了支持代理操作,我们开发了ResearchGym,这是一个研究环境,提供丰富的行动空间、分布式和长期视野的执行、异步监控和快照保存。我们还实现了一个轻量级的ReAct代理,它将显式推理与可执行计划相结合,使用像Claude-4、GPT-5、GLM-4.5和Kimi-K2这样的前沿模型。我们的实验证明,虽然前沿模型在代码驱动的研究任务中表现出了潜力,但在脆弱的与算法相关的任务和长期决策制定方面存在困难,比如缺乏耐心、糟糕的资源管理和过度依赖基于模板的推理。此外,代理需要超过11小时才能在InnovatorBench上达到最佳表现,突显了基准测试的困难性,并展示了InnovatorBench成为下一代基于代码的研究基准的潜力。

更新时间: 2025-11-03 10:56:21

领域: cs.AI

下载: http://arxiv.org/abs/2510.27598v2

From Passive to Proactive: A Multi-Agent System with Dynamic Task Orchestration for Intelligent Medical Pre-Consultation

Global healthcare systems face critical challenges from increasing patient volumes and limited consultation times, with primary care visits averaging under 5 minutes in many countries. While pre-consultation processes encompassing triage and structured history-taking offer potential solutions, they remain limited by passive interaction paradigms and context management challenges in existing AI systems. This study introduces a hierarchical multi-agent framework that transforms passive medical AI systems into proactive inquiry agents through autonomous task orchestration. We developed an eight-agent architecture with centralized control mechanisms that decomposes pre-consultation into four primary tasks: Triage ($T_1$), History of Present Illness collection ($T_2$), Past History collection ($T_3$), and Chief Complaint generation ($T_4$), with $T_1$--$T_3$ further divided into 13 domain-specific subtasks. Evaluated on 1,372 validated electronic health records from a Chinese medical platform across multiple foundation models (GPT-OSS 20B, Qwen3-8B, Phi4-14B), the framework achieved 87.0% accuracy for primary department triage and 80.5% for secondary department classification, with task completion rates reaching 98.2% using agent-driven scheduling versus 93.1% with sequential processing. Clinical quality scores from 18 physicians averaged 4.56 for Chief Complaints, 4.48 for History of Present Illness, and 4.69 for Past History on a 5-point scale, with consultations completed within 12.7 rounds for $T_2$ and 16.9 rounds for $T_3$. The model-agnostic architecture maintained high performance across different foundation models while preserving data privacy through local deployment, demonstrating the potential for autonomous AI systems to enhance pre-consultation efficiency and quality in clinical settings.

Updated: 2025-11-03 10:55:35

标题: 从被动到主动:具有动态任务编排的多智能体系统,用于智能医疗预会诊

摘要: 全球医疗系统面临着患者数量增加和会诊时间有限等关键挑战,许多国家的初级护理访问平均不到5分钟。虽然包括分诊和结构化病史采集在内的会诊前流程提供了潜在解决方案,但现有人工智能系统中的被动交互范式和上下文管理挑战限制了它们的发展。该研究引入了一个分层多代理框架,通过自主任务编排将被动医疗人工智能系统转变为主动询问代理。我们开发了一个包含中央控制机制的八代理架构,将会诊前分解为四个主要任务:分诊($T_1$)、现病史收集($T_2$)、既往病史收集($T_3$)和主诉生成($T_4$),其中$T_1$至$T_3$进一步分为13个领域特定子任务。在来自中国医疗平台的1,372份经过验证的电子健康记录上进行评估,跨多个基础模型(GPT-OSS 20B、Qwen3-8B、Phi4-14B),该框架在初级科室分诊方面达到了87.0%的准确率,次级科室分类方面达到了80.5%的准确率,使用代理驱动调度完成任务的完成率达到了98.2%,而使用顺序处理的完成率为93.1%。18名医师的临床质量评分在5分制上平均为4.56分,现病史为4.48分,既往病史为4.69分,会诊在$T_2$上完成时间为12.7轮,$T_3$上完成时间为16.9轮。这种与模型无关的架构在不同基础模型上保持了高性能,通过本地部署保护数据隐私,展示了自主人工智能系统在临床环境中提高会诊前效率和质量的潜力。

更新时间: 2025-11-03 10:55:35

领域: cs.AI

下载: http://arxiv.org/abs/2511.01445v1

Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.

Updated: 2025-11-03 10:53:11

标题: 交互即智能 第二部分:用于长程任务训练的异步人-代理轨迹采样(Rollout)

摘要: 大型语言模型(LLM)代理最近在自动编码、深度研究和图形用户界面操作等领域显示出强大潜力。然而,训练它们在长期、领域专业化任务上取得成功仍然具有挑战性。目前的方法主要分为两类。第一类依赖于通过行为克隆获得的密集人类注释,这对于可能需要数天甚至数月才能完成的长期任务来说成本过高。第二类依赖于结果驱动的采样,由于领域专业化任务上有效正向轨迹的稀缺性,这种方法经常会崩溃。我们介绍了Apollo,这是一个采样框架,将异步人类指导与动作级数据过滤相结合。Apollo不要求注释者跟踪每一步,而是只有在代理漂离有前景的轨迹时才允许他们介入,提供先验知识、战略建议等。这种轻量化设计使得可以持续进行超过30小时的交互,并以更低的成本产生有价值的轨迹。然后,Apollo应用监督控制来过滤出次优动作并防止错误传播。这些组件共同使得在长期环境中进行可靠且有效的数据收集成为可能。为了展示Apollo的有效性,我们使用InnovatorBench对其进行评估。我们的实验表明,当应用于在InnovatorBench上训练GLM-4.5模型时,Apollo相对未训练基线获得了超过50%的改进,并且比没有人类交互训练的变体获得了28%的改进。这些结果突显了人在环中采样的关键作用以及Apollo设计在处理长期、领域专业化任务中的鲁棒性。

更新时间: 2025-11-03 10:53:11

领域: cs.AI

下载: http://arxiv.org/abs/2510.27630v2

Robust Multimodal Sentiment Analysis via Double Information Bottleneck

Multimodal sentiment analysis has received significant attention across diverse research domains. Despite advancements in algorithm design, existing approaches suffer from two critical limitations: insufficient learning of noise-contaminated unimodal data, leading to corrupted cross-modal interactions, and inadequate fusion of multimodal representations, resulting in discarding discriminative unimodal information while retaining multimodal redundant information. To address these challenges, this paper proposes a Double Information Bottleneck (DIB) strategy to obtain a powerful, unified compact multimodal representation. Implemented within the framework of low-rank Renyi's entropy functional, DIB offers enhanced robustness against diverse noise sources and computational tractability for high-dimensional data, as compared to the conventional Shannon entropy-based methods. The DIB comprises two key modules: 1) learning a sufficient and compressed representation of individual unimodal data by maximizing the task-relevant information and discarding the superfluous information, and 2) ensuring the discriminative ability of multimodal representation through a novel attention bottleneck fusion mechanism. Consequently, DIB yields a multimodal representation that effectively filters out noisy information from unimodal data while capturing inter-modal complementarity. Extensive experiments on CMU-MOSI, CMU-MOSEI, CH-SIMS, and MVSA-Single validate the effectiveness of our method. The model achieves 47.4% accuracy under the Acc-7 metric on CMU-MOSI and 81.63% F1-score on CH-SIMS, outperforming the second-best baseline by 1.19%. Under noise, it shows only 0.36% and 0.29% performance degradation on CMU-MOSI and CMU-MOSEI respectively.
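
The matrix-based Renyi entropy underlying the DIB functional can be sketched directly: build a Gaussian Gram matrix over a batch, normalize it to unit trace, and evaluate S_alpha from its eigenvalues, optionally truncated to a low rank. The kernel bandwidth `sigma` and the Gaussian kernel choice are assumptions for illustration:

```python
import numpy as np

def renyi_entropy(X, alpha=2.0, sigma=1.0, rank=None):
    """Matrix-based Renyi alpha-entropy of a batch X: Gaussian Gram
    matrix, normalised to unit trace, then
    S_alpha = (1/(1-alpha)) * log2 sum_i lambda_i**alpha,
    optionally keeping only the top-`rank` eigenvalues (the low-rank
    variant the paper builds on)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    A = K / np.trace(K)
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]
    if rank is not None:
        lam = lam[:rank]
    lam = lam[lam > 1e-12]
    return float(np.log2((lam ** alpha).sum()) / (1 - alpha))
```

A batch of identical points has a rank-one Gram matrix and entropy 0; spread-out points yield positive entropy.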

Updated: 2025-11-03 10:52:45

标题: 通过双信息瓶颈实现鲁棒的多模态情感分析

摘要: 多模态情感分析已经在不同研究领域受到了重要关注。尽管在算法设计方面取得了进展,但现有方法存在两个关键限制:对受噪声干扰的单模态数据学习不足,导致跨模态交互受损,以及多模态表示的融合不足,导致舍弃了有区分性的单模态信息而保留了多模态冗余信息。为了解决这些挑战,本文提出了一种双信息瓶颈(DIB)策略,以获得一个强大、统一、紧凑的多模态表示。在低秩Renyi熵函数框架内实施,DIB相对于传统基于Shannon熵的方法,在抵抗各种噪声源和高维数据的计算可行性方面提供了增强的鲁棒性。DIB包括两个关键模块:1)通过最大化任务相关信息并丢弃多余信息来学习单个单模态数据的充分压缩表示,以及2)通过新颖的注意力瓶颈融合机制确保多模态表示的区分能力。因此,DIB产生了一个多模态表示,能够有效过滤来自单模态数据的噪声信息,同时捕捉跨模态互补性。在CMU-MOSI、CMU-MOSEI、CH-SIMS和MVSA-Single上的大量实验证实了我们方法的有效性。该模型在CMU-MOSI上的Acc-7指标下实现了47.4%的准确率,在CH-SIMS上实现了81.63%的F1分数,优于次优基线1.19%。在噪声下,它在CMU-MOSI和CMU-MOSEI上仅显示0.36%和0.29%的性能降级。

更新时间: 2025-11-03 10:52:45

领域: cs.AI

下载: http://arxiv.org/abs/2511.01444v1

Where to Search: Measure the Prior-Structured Search Space of LLM Agents

The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via two instantiations. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
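The coverage generating function described above can be made concrete on a toy graph: for an adjacency matrix A over the transitions allowed by the safety envelope, the number of length-k paths from s to t is (A^k)[s,t], and weighting each path by z^k and summing gives the generating function. A minimal numpy sketch; the 4-node graph and node indices are invented for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical 4-node graph induced by a "safety envelope":
# edges are the transitions the agent is allowed to take.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

def coverage(z, A, s, t, max_len=10):
    """Sum of z**k over all paths of length k from s to t."""
    total, Ak = 0.0, np.eye(A.shape[0])
    for k in range(1, max_len + 1):
        Ak = Ak @ A                      # Ak now counts length-k paths
        total += (z ** k) * Ak[s, t]
    return total

def coverage_closed(z, A, s, t):
    """Closed form sum_k z^k A^k = (I - z A)^{-1}; exact for a DAG."""
    return np.linalg.inv(np.eye(A.shape[0]) - z * A)[s, t]
```

For this DAG the only paths from node 0 to node 3 have lengths 2 and 3, so the generating function is z^2 + z^3, and the truncated sum and the closed form agree exactly because the series terminates.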

Updated: 2025-11-03 10:52:10

标题: 在哪里搜索:测量LLM代理的先验结构化搜索空间

摘要: 基于大型语言模型(LLMs)的生成-过滤-精化(迭代范式)在人工智能+科学领域的推理、编程和程序发现方面取得了进展。然而,搜索的有效性取决于搜索的位置,即如何将领域先验编码到一个操作性结构化的假设空间中。为此,本文提出了一个简洁的形式理论,描述和衡量由领域先验引导的LLM辅助迭代搜索。我们将一个代理表示为一个模糊关系运算符,用于捕获可行的转换;代理因此受到固定安全包络的约束。为了描述多步推理/搜索,我们通过一个单一的延续参数对所有可达路径进行加权,并将它们求和以获得一个覆盖生成函数;这引出了一个可达性困难度的度量;并提供了一个通过安全包络诱导的图形上搜索的几何解释。我们进一步提供了最简单的可验证推断,并通过两种实例化验证了它们。这一理论提供了一个可行的语言和操作工具,用于衡量代理及其搜索空间,提出了一个由LLMs构建的迭代搜索的系统形式描述。

更新时间: 2025-11-03 10:52:10

领域: cs.AI,cs.CL,cs.LO

下载: http://arxiv.org/abs/2510.14846v3

Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action's temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.

Updated: 2025-11-03 10:50:15

标题: 通过利用动作的分层结构和文本背景,增强动作识别

摘要: 我们提出了一种改善动作识别的新方法,通过利用动作的分层组织并结合上下文化的文本信息,包括位置和先前动作,以反映动作的时间背景。为实现这一目标,我们引入了一种专门用于动作识别的transformer架构,同时利用视觉和文本特征。视觉特征来自RGB和光流数据,而文本嵌入则代表上下文信息。此外,我们定义了一个联合损失函数,同时为模型训练粗粒和细粒动作识别,有效地利用动作的层次结构。为了展示我们方法的有效性,我们通过整合动作层次结构扩展了Toyota智能家居未剪辑(TSU)数据集,得到了分层TSU数据集,这是一个专为监测老年人在家庭环境中活动而设计的分层数据集。消融研究评估了不同策略整合上下文和分层数据对性能的影响。实验结果表明,所提出的方法在Hierarchical TSU数据集、Assembly101和IkeaASM上始终优于SOTA方法,在top-1准确率上取得了超过17%的提升。

更新时间: 2025-11-03 10:50:15

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2410.21275v2

Interpretable Heart Disease Prediction via a Weighted Ensemble Model: A Large-Scale Study with SHAP and Surrogate Decision Trees

Cardiovascular disease (CVD) remains a critical global health concern, demanding reliable and interpretable predictive models for early risk assessment. This study presents a large-scale analysis using the Heart Disease Health Indicators Dataset, developing a strategically weighted ensemble model that combines tree-based methods (LightGBM, XGBoost) with a Convolutional Neural Network (CNN) to predict CVD risk. The model was trained on a preprocessed dataset of 229,781 patients where the inherent class imbalance was managed through strategic weighting and feature engineering enhanced the original 22 features to 25. The final ensemble achieves a statistically significant improvement over the best individual model, with a Test AUC of 0.8371 (p=0.003) and is particularly suited for screening with a high recall of 80.0%. To provide transparency and clinical interpretability, surrogate decision trees and SHapley Additive exPlanations (SHAP) are used. The proposed model delivers a combination of robust predictive performance and clinical transparency by blending diverse learning architectures and incorporating explainability through SHAP and surrogate decision trees, making it a strong candidate for real-world deployment in public health screening.
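The core mechanism above, a strategically weighted combination of member probabilities with a recall-oriented decision threshold, can be sketched in a few lines. All probabilities, weights, and the 0.35 threshold below are invented for illustration; the paper's actual weights and threshold are not stated in the abstract:

```python
import numpy as np

# Hypothetical per-model positive-class probabilities for 4 patients.
# In the paper the base learners are LightGBM, XGBoost, and a CNN.
p_lgbm = np.array([0.90, 0.20, 0.55, 0.70])
p_xgb  = np.array([0.85, 0.30, 0.45, 0.65])
p_cnn  = np.array([0.95, 0.10, 0.60, 0.80])

def weighted_ensemble(probs, weights):
    """Convex combination of member probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalise so weights sum to 1
    return np.tensordot(w, np.stack(probs), axes=1)

p_ens = weighted_ensemble([p_lgbm, p_xgb, p_cnn], weights=[0.4, 0.3, 0.3])
y_pred = (p_ens >= 0.35).astype(int)     # low threshold favours recall (screening)
```

Lowering the decision threshold trades precision for recall, which matches the abstract's framing of the model as a screening tool.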

Updated: 2025-11-03 10:24:09

标题: 可解释的心脏疾病预测:基于加权集成模型的大规模研究,使用SHAP和代理决策树

摘要: 心血管疾病(CVD)仍然是一个关键的全球健康问题,需要可靠和可解释的预测模型进行早期风险评估。本研究利用心脏疾病健康指标数据集进行了大规模分析,开发了一个策略加权的集成模型,将基于树的方法(LightGBM、XGBoost)与卷积神经网络(CNN)结合起来预测CVD风险。该模型在一个经过预处理的包含229,781名患者的数据集上进行训练,通过策略性加权来处理固有的类别不平衡,并通过特征工程将原始的22个特征增强为25个。最终的集成模型在最佳个体模型的基础上取得了统计学上显著的改进,测试AUC为0.8371(p=0.003),特别适合具有80.0%召回率的筛查。为了提供透明度和临床可解释性,采用了代理决策树和SHapley Additive exPlanations(SHAP)。该模型通过融合多样的学习架构并通过SHAP和代理决策树引入解释性,提供了强大的预测性能和临床透明度的结合,使其成为在公共健康筛查中实际部署的有力候选。

更新时间: 2025-11-03 10:24:09

领域: cs.LG,cs.AI,eess.SP

下载: http://arxiv.org/abs/2511.01947v1

UniSOT: A Unified Framework for Multi-Modality Single Object Tracking

Single object tracking aims to localize the target object with specific reference modalities (bounding box, natural language or both) in a sequence of specific video modalities (RGB, RGB+Depth, RGB+Thermal or RGB+Event). Different reference modalities enable various human-machine interactions, and different video modalities are demanded in complex scenarios to enhance tracking robustness. Existing trackers are designed for single or several video modalities with single or several reference modalities, which leads to separate model designs and limits practical applications. Practically, a unified tracker is needed to handle various requirements. To the best of our knowledge, there is still no tracker that can perform tracking with these above reference modalities across these video modalities simultaneously. Thus, in this paper, we present a unified tracker, UniSOT, for different combinations of three reference modalities and four video modalities with uniform parameters. Extensive experimental results on 18 visual tracking, vision-language tracking and RGB+X tracking benchmarks demonstrate that UniSOT shows superior performance against modality-specific counterparts. Notably, UniSOT outperforms previous counterparts by over 3.0% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0% main metric across all three RGB+X video modalities.

Updated: 2025-11-03 10:23:53

标题: UniSOT:多模式单目标跟踪的统一框架

摘要: 单目标跟踪旨在在特定的视频模态序列(RGB、RGB+深度、RGB+热力或RGB+事件)中定位具有特定参考模态(边界框、自然语言或两者)的目标对象。不同的参考模态可以实现各种人机交互,复杂场景中需要不同的视频模态以增强跟踪的稳健性。现有的跟踪器设计用于单个或几个视频模态和单个或几个参考模态,这导致了分开的模型设计并限制了实际应用。实际上,需要一个统一的跟踪器来处理各种要求。据我们所知,仍然没有跟踪器可以同时跨越这些视频模态使用上述参考模态进行跟踪。因此,在本文中,我们提出了一个统一的跟踪器UniSOT,用于不同组合的三种参考模态和四种视频模态,参数统一。对18个视觉跟踪、视觉语言跟踪和RGB+X跟踪基准的大量实验结果显示,UniSOT在性能上优于特定模态的对手。值得注意的是,UniSOT在所有三种参考模态上的TNL2K上的AUC表现比以往的对手高出3.0\%,在所有三种RGB+X视频模态上的主要指标上比Un-Track高出2.0\%。

更新时间: 2025-11-03 10:23:53

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.01427v1

Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well; for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
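The retrieval side of this pipeline, binary codes plus Hamming-distance search, is easy to sketch. Here the projection producing the codes is random, standing in for the trained HashCoder MLP; the embedding dimensions and database size are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def to_codes(embeddings, W):
    """Sign of a linear projection -> binary codes of shape (n, bits)."""
    return (embeddings @ W > 0).astype(np.uint8)

def hamming_search(query_code, db_codes, k=2):
    """Return indices of the k nearest codes by Hamming distance."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable")[:k], dists

emb = rng.normal(size=(100, 32))        # stand-in foundation-model embeddings
W = rng.normal(size=(32, 16))           # stand-in for the trained hashing head
db = to_codes(emb, W)                   # 16-bit database codes
q = db[0]                               # query with the code of item 0
top, dists = hamming_search(q, db)
```

Because the codes are binary, the distance computation is a per-bit XOR-and-count, which is what makes Hamming search cheap compared to nearest neighbors in the original embedding space.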

Updated: 2025-11-03 10:21:43

标题: 在基础模型时代,通过跨视图代码对齐进行图像哈希化

摘要: 高效的大规模检索需要既紧凑又有辨识力的表示。基础模型提供了强大的视觉和多模态嵌入,但在这些高维空间中进行最近邻搜索是计算昂贵的。哈希提供了一种有效的替代方案,通过利用二进制码实现快速的汉明距离搜索,然而现有方法往往依赖于复杂的流水线、多项目标、专门针对单一学习范式设计以及长时间的训练。我们引入了CroVCA(交叉视图编码对齐),这是一个简单而统一的学习二进制码的原则,使其在语义对齐的视图之间保持一致。单一二进制交叉熵损失强制对齐,而编码速率最大化作为一种反坍缩正则化器,促进平衡和多样化的码。为了实现这一目标,我们设计了HashCoder,一个轻量级的MLP哈希网络,具有最终的批量归一化层以强制实现平衡的码。HashCoder可以作为冻结嵌入的探测头,也可以通过LoRA微调高效地调整编码器。在各种基准测试中,CroVCA在仅5个训练周期内取得了最先进的结果。在16位情况下,它表现尤为出色,例如,在COCO上的无监督哈希在不到2分钟内完成,在ImageNet100上的监督哈希在单个GPU上约3分钟完成。这些结果突显了CroVCA的效率、适应性和广泛适用性。

更新时间: 2025-11-03 10:21:43

领域: cs.CV,cs.IR,cs.LG

下载: http://arxiv.org/abs/2510.27584v2

Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis

Explanations for AI models in high-stakes domains like medicine often lack verifiability, which can hinder trust. To address this, we propose an interactive agent that produces explanations through an auditable sequence of actions. The agent learns a policy to strategically seek external visual evidence to support its diagnostic reasoning. This policy is optimized using reinforcement learning, resulting in a model that is both efficient and generalizable. Our experiments show that this action-based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18\% compared to a non-interactive baseline. To validate the faithfulness of the agent's explanations, we introduce a causal intervention method. By masking the visual evidence the agent chooses to use, we observe a measurable degradation in its performance ($\Delta$Brier=+0.029), confirming that the evidence is integral to its decision-making process. Our work provides a practical framework for building AI systems with verifiable and faithful reasoning capabilities.
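The Brier score used as the calibration metric above is simply the mean squared error between predicted probabilities and binary outcomes, so the reported degradation under evidence masking can be reproduced mechanically. The probabilities below are invented for illustration and do not come from the paper:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((p - y) ** 2))

# Hypothetical agent probabilities with and without its chosen visual evidence.
with_evidence    = [0.9, 0.2, 0.8, 0.1]
without_evidence = [0.7, 0.4, 0.6, 0.3]
y = [1, 0, 1, 0]

b_with = brier_score(with_evidence, y)       # lower is better
b_without = brier_score(without_evidence, y)
delta = b_without - b_with                   # positive => evidence mattered
```

A positive delta under masking is the paper's causal-faithfulness signal: if removing the chosen evidence did not worsen calibration, the explanation would not be integral to the decision.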

Updated: 2025-11-03 10:21:35

标题: 学习寻找证据:具有因果忠实性分析的可验证推理代理

摘要: 在高风险领域如医学中,人工智能模型的解释通常缺乏可验证性,这可能阻碍信任。为了解决这个问题,我们提出了一个交互式代理,通过可审计的一系列行动来产生解释。该代理学习一种策略,以战略性地寻找外部视觉证据来支持其诊断推理。这个策略是通过强化学习进行优化的,结果是一个既高效又可泛化的模型。我们的实验证明,这种基于行动的推理过程显著提高了校准准确性,将Brier分数降低了18\%,与非交互式基线相比。为了验证代理的解释的忠实性,我们引入了一种因果干预方法。通过掩盖代理选择使用的视觉证据,我们观察到其表现有可衡量的退化(ΔBrier=+0.029),证实证据对其决策过程至关重要。我们的工作提供了一个实用的框架,用于构建具有可验证和忠实推理能力的人工智能系统。

更新时间: 2025-11-03 10:21:35

领域: cs.AI,cs.CV,I.2.6; I.2.10

下载: http://arxiv.org/abs/2511.01425v1

Contextual Tokenization for Graph Inverted Indices

Retrieving graphs from a large corpus, that contain a subgraph isomorphic to a given query graph, is a core operation in many real-world applications. While recent multi-vector graph representations and scores based on set alignment and containment can provide accurate subgraph isomorphism tests, their use in retrieval remains limited by their need to score corpus graphs exhaustively. We introduce CORGII (Contextual Representation of Graphs for Inverted Indexing), a graph indexing framework in which, starting with a contextual dense graph representation, a differentiable discretization module computes sparse binary codes over a learned latent vocabulary. This text document-like representation allows us to leverage classic, highly optimized inverted indices, while supporting soft (vector) set containment scores. Pushing this paradigm further, we replace the classical, fixed impact weight of a `token' on a graph (such as TFIDF or BM25) with a data-driven, trainable impact weight. Finally, we explore token expansion to support multi-probing the index for smoother accuracy-efficiency tradeoffs. To our knowledge, CORGII is the first indexer of dense graph representations using discrete tokens mapping to efficient inverted lists. Extensive experiments show that CORGII provides better trade-offs between accuracy and efficiency, compared to several baselines.
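The indexing idea above, sparse discrete tokens per graph with learned impact weights feeding classic inverted lists, can be sketched with a plain dictionary-based index. The token ids, weights, and graph ids below are invented; in CORGII the tokens come from the differentiable discretization module and the impact weights are trained:

```python
from collections import defaultdict

# Hypothetical corpus: graph id -> sparse {token: impact weight}.
corpus = {
    "g1": {3: 0.9, 7: 0.4},
    "g2": {3: 0.2, 5: 0.8},
    "g3": {5: 0.5, 7: 0.7},
}

# Build inverted lists: token -> [(graph_id, impact weight), ...]
index = defaultdict(list)
for gid, tokens in corpus.items():
    for tok, w in tokens.items():
        index[tok].append((gid, w))

def score(query_tokens):
    """Accumulate impact weights over the query's posting lists only."""
    scores = defaultdict(float)
    for tok, qw in query_tokens.items():
        for gid, w in index.get(tok, []):
            scores[gid] += qw * w
    return dict(scores)

hits = score({3: 1.0, 7: 1.0})   # query containing tokens 3 and 7
```

Only graphs sharing at least one token with the query are ever touched, which is the efficiency argument for inverted indices over exhaustive scoring of the corpus.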

Updated: 2025-11-03 10:11:55

标题: 上下文标记化用于图反向索引

摘要: 从大量语料库中检索包含与给定查询图同构的子图的图形是许多现实世界应用程序中的核心操作。虽然最近的多向量图表示和基于集合对齐和包含的分数可以提供准确的子图同构测试,但它们在检索中的使用仍受限于对语料库图进行详尽评分的需求。我们引入了CORGII(用于倒排索引的图形的上下文表示),这是一个图形索引框架,从上下文密集图表示开始,一个可微离散化模块计算出学习到的潜在词汇表上的稀疏二进制代码。这种类似文本文档的表示允许我们利用经典的、高度优化的倒排索引,同时支持软(向量)集合包含分数。将这种范式推向更远,我们用数据驱动的、可训练的影响权重替换了图中“标记”(如TFIDF或BM25)的经典、固定影响权重。最后,我们探索了标记扩展,以支持多次探测索引,以实现更平滑的精度-效率权衡。据我们所知,CORGII是第一个使用离散标记映射到高效倒排列表的密集图表示的索引器。广泛的实验表明,与几种基线相比,CORGII在精度和效率之间提供了更好的权衡。

更新时间: 2025-11-03 10:11:55

领域: cs.LG

下载: http://arxiv.org/abs/2510.22479v2

COFAP: A Universal Framework for COFs Adsorption Prediction through Designed Multi-Modal Extraction and Cross-Modal Synergy

Covalent organic frameworks (COFs) are promising adsorbents for gas adsorption and separation, while identifying the optimal structures among their vast design space requires efficient high-throughput screening. Conventional machine-learning predictors rely heavily on specific gas-related features. However, computing these features is time-consuming and limits scalability, leading to inefficiency and labor-intensive processes. Herein, a universal COFs adsorption prediction framework (COFAP) is proposed, which can extract multi-modal structural and chemical features through deep learning, and fuse these complementary features via a cross-modal attention mechanism. Without Henry coefficients or adsorption heat, COFAP sets a new SOTA by outperforming previous approaches on the hypoCOFs dataset. Based on COFAP, we also found that high-performing COFs for separation concentrate within a narrow range of pore size and surface area. A weight-adjustable prioritization scheme is also developed to enable flexible, application-specific ranking of candidate COFs for researchers. Superior efficiency and accuracy render COFAP directly deployable in crystalline porous materials.

Updated: 2025-11-03 10:11:33

标题: COFAP:通过设计的多模式提取和跨模式协同作用实现COFs吸附预测的通用框架

摘要: 共价有机框架(COFs)是气体吸附和分离的有前途的吸附剂,然而在其庞大的设计空间中找到最优结构需要高效的高通量筛选。传统的机器学习预测器严重依赖于特定的与气体相关的特征。然而,这些特征的计算耗时且限制了可扩展性,导致低效和劳动密集型过程。本文提出了一种通用的COFs吸附预测框架(COFAP),可以通过深度学习提取多模态结构和化学特征,并通过跨模态注意机制融合这些互补特征。在不使用Henry系数或吸附热的情况下,COFAP在hypoCOFs数据集上表现优于先前方法,树立了新的SOTA。基于COFAP,我们还发现,用于分离的高性能COFs集中在一个狭窄的孔径和表面积范围内。还开发了一种权重可调的优先方案,以便研究人员根据特定应用灵活地对候选COFs进行排名。出色的效率和准确性使COFAP可以直接部署在结晶多孔材料中。

更新时间: 2025-11-03 10:11:33

领域: cs.LG,cond-mat.mtrl-sci,cs.AI,physics.chem-ph

下载: http://arxiv.org/abs/2511.01946v1

Deep Modularity Networks with Diversity-Preserving Regularization

Graph clustering plays a crucial role in graph representation learning but often faces challenges in achieving feature-space diversity. While Deep Modularity Networks (DMoN) leverage modularity maximization and collapse regularization to ensure structural separation, they lack explicit mechanisms for feature-space separation, assignment dispersion, and assignment-confidence control. We address this limitation by proposing Deep Modularity Networks with Diversity-Preserving Regularization (DMoN-DPR), which introduces three novel regularization terms: distance-based for inter-cluster separation, variance-based for per-cluster assignment dispersion, and an assignment-entropy penalty with a small positive weight, encouraging more confident assignments gradually. Our method significantly enhances label-based clustering metrics on feature-rich benchmark datasets (paired two-tailed t-test, $p\leq0.05$), demonstrating the effectiveness of incorporating diversity-preserving regularizations in creating meaningful and interpretable clusters.
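The three regularizers act on the soft cluster-assignment matrix and the node features. A minimal numpy sketch of two of them, a distance-based inter-cluster separation term and the assignment-entropy penalty, on an invented toy assignment (the paper's exact loss formulations may differ):

```python
import numpy as np

# Toy soft assignment matrix S (4 nodes, 2 clusters); rows sum to 1.
S = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])
X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.1], [2.8, 3.4]])  # node features

def entropy_penalty(S, eps=1e-12):
    """Mean assignment entropy; a small positive weight on this term
    gradually encourages more confident assignments."""
    return float(-(S * np.log(S + eps)).sum(axis=1).mean())

def centroid_separation(S, X):
    """Distance between feature-space cluster centroids (inter-cluster term)."""
    centroids = (S.T @ X) / S.sum(axis=0, keepdims=True).T
    return float(np.linalg.norm(centroids[0] - centroids[1]))

H = entropy_penalty(S)
sep = centroid_separation(S, X)
# More confident assignments lower the entropy term:
H_conf = entropy_penalty(np.array([[0.99, 0.01], [0.99, 0.01],
                                   [0.01, 0.99], [0.01, 0.99]]))
```

Maximizing the separation term pushes centroids apart in feature space, while penalizing entropy sharpens the rows of S; the variance-based dispersion term (not sketched) would act on the per-cluster spread of assignments.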

Updated: 2025-11-03 10:11:21

标题: 具有保持多样性正则化的深度模块化网络

摘要: 图聚类在图表示学习中起着至关重要的作用,但通常面临着在实现特征空间多样性方面的挑战。虽然深度模块化网络(DMoN)利用模块化最大化和坍缩正则化来确保结构分离,但它们缺乏显式的特征空间分离、分配分散和分配置信度控制机制。我们通过提出具有保持多样性正则化的深度模块化网络(DMoN-DPR)来解决这一限制,该方法引入了三个新的正则化项:基于距离的用于簇间分离、基于方差的用于每个簇的分配分散,以及一个带有小正权重的分配熵惩罚,逐渐鼓励更自信的分配。我们的方法在特征丰富的基准数据集上显著提高了基于标签的聚类指标(成对双尾t检验,$p\leq0.05$),展示了将保持多样性的正则化纳入到创建有意义且可解释的簇中的有效性。

更新时间: 2025-11-03 10:11:21

领域: cs.LG

下载: http://arxiv.org/abs/2501.13451v2

VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation

In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.

Updated: 2025-11-03 10:10:38

标题: VO-DP:用于仅视觉机器人操作的语义-几何自适应扩散策略

摘要: 在模仿学习的背景下,基于视觉-运动的扩散策略学习是机器人操纵中的主要方向之一。大多数方法依赖于点云作为观察输入,并通过点云特征学习构建场景表示,从而使它们能够实现显著的准确性。然而,现有文献缺乏对具有重要潜力的仅视觉解决方案的深入探讨。在本文中,我们提出了一种仅视觉和单视图扩散策略学习方法(VO-DP),利用预训练的视觉基础模型实现语义和几何特征的有效融合。我们利用来自VGGT的中间特征,结合来自DINOv2的语义特征和交替注意力块的几何特征。特征通过交叉注意力融合,并通过CNN进行空间压缩,形成策略头的输入。大量实验证明,VO-DP不仅明显优于仅视觉基线DP,而且相对于基于点云的方法DP3表现出不同的性能趋势:在模拟任务中,VO-DP的平均成功率为64.6%,与DP3的64.0%相当,远高于DP的34.8%;而在现实世界任务中,它达到了87.9%,明显优于DP3的67.5%和DP的11.2%。进一步的鲁棒性评估证实,VO-DP在颜色、大小、背景和光照等不同条件下保持高度稳定。最后,我们开源了一个用于机器人操纵的训练库。该库基于Accelerate构建,支持多机和多GPU并行训练,以及混合精度训练。它与DP、DP3和VO-DP等视觉运动策略兼容,还支持RoboTwin模拟器。

更新时间: 2025-11-03 10:10:38

领域: cs.RO,cs.CV,cs.LG

下载: http://arxiv.org/abs/2510.15530v4

Modulation of temporal decision-making in a deep reinforcement learning agent under the dual-task paradigm

This study explores the interference in temporal processing within a dual-task paradigm from an artificial intelligence (AI) perspective. In this context, the dual-task setup is implemented as a simplified version of the Overcooked environment with two variations, single task (T) and dual task (T+N). Both variations involve an embedded time production task, but the dual task (T+N) additionally involves a concurrent number comparison task. Two deep reinforcement learning (DRL) agents were separately trained for each of these tasks. These agents exhibited emergent behavior consistent with human timing research. Specifically, the dual task (T+N) agent exhibited significant overproduction of time relative to its single task (T) counterpart. This result was consistent across four target durations. Preliminary analysis of neural dynamics in the agents' LSTM layers did not reveal any clear evidence of a dedicated or intrinsic timer. Hence, further investigation is needed to better understand the underlying time-keeping mechanisms of the agents and to provide insights into the observed behavioral patterns. This study is a small step towards exploring parallels between emergent DRL behavior and behavior observed in biological systems in order to facilitate a better understanding of both.

Updated: 2025-11-03 10:09:55

标题: 在双任务范式下,深度强化学习代理在时间决策中的调节

摘要: 本研究从人工智能(AI)的角度探讨了双任务范式中时间处理的干扰。在这种情境下,双任务设置被实现为Overcooked环境的简化版本,有两种变体,单任务(T)和双任务(T+N)。这两种变体都涉及嵌入的时间生产任务,但双任务(T+N)还涉及并发的数字比较任务。为每个任务单独训练了两个深度强化学习(DRL)代理。这些代理展现出与人类时间研究一致的新兴行为。具体而言,双任务(T+N)代理相对于单任务(T)对应的代理时间显著超量。这一结果在四个目标持续时间上保持一致。对代理的LSTM层的神经动态的初步分析没有显示出任何明确的专门或固有的计时器证据。因此,需要进一步研究以更好地理解代理的基础计时机制,并提供对观察到的行为模式的见解。本研究是探索新兴DRL行为和生物系统中观察到的行为之间的相似之处的一小步,以便更好地理解两者。

更新时间: 2025-11-03 10:09:55

领域: cs.AI

下载: http://arxiv.org/abs/2511.01415v1

Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach

Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. This study compares the U-Net with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error that is 7% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.37% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.

Updated: 2025-11-03 10:00:26

标题: 将高光谱图像转化为化学地图:一种新颖的端到端深度学习方法

摘要: 目前,从高光谱图像生成化学地图的方法基于诸如偏最小二乘(PLS)回归之类的模型,生成不考虑空间上下文并且受到高噪声干扰的像素级预测。本研究提出了一种端到端的深度学习方法,使用修改版的U-Net和自定义损失函数,直接从高光谱图像中获取化学地图,跳过传统像素级分析所需的所有中间步骤。本研究在具有相关平均脂肪参考值的真实猪腹肉样本数据集上将U-Net与传统的PLS回归进行比较,U-Net在平均脂肪预测任务上获得的测试集均方根误差比PLS回归低7%。同时,U-Net生成精细细节的化学地图,其中99.91%的方差在空间上是相关的。相反,PLS生成的化学地图中只有2.37%的方差在空间上是相关的,表明每个像素级预测在很大程度上独立于相邻像素。此外,虽然PLS生成的化学地图包含远远超出0-100%物理可能范围的预测,但U-Net学习保持在此范围内。因此,本研究的结果表明,U-Net在化学地图生成方面优于PLS。

更新时间: 2025-11-03 10:00:26

领域: cs.CV,cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2504.14131v5

FoldPath: End-to-End Object-Centric Motion Generation via Modulated Implicit Paths

Object-Centric Motion Generation (OCMG) is instrumental in advancing automated manufacturing processes, particularly in domains requiring high-precision expert robotic motions, such as spray painting and welding. To realize effective automation, robust algorithms are essential for generating extended, object-aware trajectories across intricate 3D geometries. However, contemporary OCMG techniques are either based on ad-hoc heuristics or employ learning-based pipelines that are still reliant on sensitive post-processing steps to generate executable paths. We introduce FoldPath, a novel, end-to-end, neural field based method for OCMG. Unlike prior deep learning approaches that predict discrete sequences of end-effector waypoints, FoldPath learns the robot motion as a continuous function, thus implicitly encoding smooth output paths. This paradigm shift eliminates the need for brittle post-processing steps that concatenate and order the predicted discrete waypoints. Particularly, our approach demonstrates superior predictive performance compared to recently proposed learning-based methods, and attains generalization capabilities even in real industrial settings, where only 70 expert samples are provided. We validate FoldPath through comprehensive experiments in a realistic simulation environment and introduce new, rigorous metrics designed to comprehensively evaluate long-horizon robotic paths, thus advancing the OCMG task towards practical maturity.

Updated: 2025-11-03 10:00:25

标题: FoldPath:通过调制的隐式路径实现端到端的以对象为中心的运动生成

摘要: 目标中心运动生成(OCMG)对推动自动化制造过程至关重要,特别是在需要高精度专家机器人运动的领域,如喷漆和焊接。为了实现有效的自动化,生成跨复杂3D几何体的扩展、对象感知轨迹需要强大的算法。然而,当代的OCMG技术要么基于临时启发式方法,要么采用仍然依赖敏感后处理步骤来生成可执行路径的基于学习的流程。我们介绍了FoldPath,一种新颖的、端到端的、基于神经场的OCMG方法。与先前的深度学习方法不同,FoldPath学习机器人运动作为连续函数,从而隐式编码平滑输出路径。这种范式转变消除了需要连接和排序预测的离散航路点的脆弱后处理步骤。特别是,我们的方法表现出比最近提出的基于学习的方法更优越的预测性能,甚至在只提供了有限数量的70个专家样本的实际工业环境中也具有泛化能力。我们通过在真实仿真环境中进行全面实验验证了FoldPath,并引入了新的严格度量标准,旨在全面评估长期视野的机器人路径,从而将OCMG任务推向实际成熟。

更新时间: 2025-11-03 10:00:25

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2511.01407v1

Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models

What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold -- requiring a Riemannian metric to describe the space's local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) -- a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics -- shortest paths that follow the data manifold's intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation.
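The idea of an EBM-induced metric can be illustrated with a conformal metric g(x) = exp(E(x)) I, one simple way (not necessarily either of the paper's two metrics) to make paths through high-energy regions expensive. On a toy ring-shaped "data manifold" with an invented energy function, the geodesic prefers the ring over the straight chord:

```python
import numpy as np

# Hypothetical 2-D energy: low on a ring of radius 1 (the "data manifold"),
# high elsewhere; a stand-in for a pretrained EBM.
def energy(x):
    r = np.linalg.norm(x, axis=-1)
    return 5.0 * (r - 1.0) ** 2

def riemannian_length(path):
    """Discrete path length under the conformal metric g(x) = exp(E(x)) * I."""
    segs = np.diff(path, axis=0)
    mids = 0.5 * (path[1:] + path[:-1])
    return float(np.sum(np.exp(energy(mids) / 2.0) * np.linalg.norm(segs, axis=1)))

a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
t = np.linspace(0.0, 1.0, 200)

straight = a[None] * (1 - t)[:, None] + b[None] * t[:, None]   # chord via origin
theta = np.pi * t
arc = np.stack([np.cos(theta), np.sin(theta)], axis=1)         # path along the ring
```

Under this metric the straight path through the high-energy interior is several times longer than the arc that hugs the ring, which is the qualitative behavior data-aware geodesics should exhibit.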

Updated: 2025-11-03 10:00:16

标题: 跟随能量,找到路径:基于能量模型的黎曼度量

摘要: 在高维空间中,两个数据点之间的最短路径是什么?在欧几里德几何中,答案显而易见,但当数据位于曲面流形上时,问题变得更加复杂 -- 需要使用黎曼度量描述空间的局部曲率。然而,在高维空间中估算这样的度量仍然是一个重大挑战。 在这项工作中,我们提出了一种从预训练的基于能量的模型(EBMs)直接导出黎曼度量的方法 -- 这是一类生成模型,将低能量分配给高密度区域。这些度量定义了空间变化的距离,使得可以计算测地线 -- 沿着数据流形固有几何的最短路径。我们引入了两种从EBMs导出的新型度量,并展示它们产生的测地线始终更接近数据流形,并且在与地面真实轨迹对齐度量的条件下表现出更低的曲率失真。我们在越来越复杂的数据集上评估我们的方法:具有已知数据密度的合成数据集、具有可解释几何的旋转字符图像以及嵌入在预训练VAE潜在空间中的高分辨率自然图像。 我们的结果表明,基于EBM导出的度量在高维设置中始终胜过已建立的基线。我们的工作是首次从EBMs中导出黎曼度量,实现了数据感知的测地线,并为生成建模和模拟提供了可扩展的、几何驱动的学习。

更新时间: 2025-11-03 10:00:16

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2505.18230v3

Tight analyses of first-order methods with error feedback

Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes -- most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ -- were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method -- with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in the simplified single-agent setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.
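In the single-agent setting analyzed above, the EF scheme is compact: compress the gradient plus the accumulated compression error, apply the compressed part, and keep the remainder as the new error. A sketch with a top-k compressor on a simple quadratic; the step size, k, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def topk(v, k):
    """Keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_gd(grad, x0, lr=0.1, k=1, steps=400):
    """EF in the single-agent setting: compress gradient + accumulated error."""
    x, e = x0.astype(float).copy(), np.zeros(len(x0))
    for _ in range(steps):
        g = grad(x) + e            # add back previously dropped mass
        c = topk(g, k)             # the compressed "message"
        e = g - c                  # error kept locally (the feedback)
        x -= lr * c
    return x

# f(x) = ||x||^2 / 2, so grad(x) = x; EF with top-1 still drives x to 0.
x_star = ef_gd(lambda x: x, x0=np.array([3.0, -2.0, 1.0]))
```

Without the error term, top-1 compressed descent can stall on coordinates that are never selected; the feedback accumulates their gradient mass until they are.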

Updated: 2025-11-03 09:54:53

标题: 一阶方法与误差反馈的严格分析

摘要: 代理之间的通信经常构成分布式学习中的主要计算瓶颈。最常见的缓解策略之一是压缩交换的信息,从而减少通信开销。为了对抗与压缩通信相关的收敛性降级,引入了错误反馈方案,其中最显著的是EF和EF^21。在这项工作中,我们提供了对这两种方法的严格分析。具体而言,我们找到了为每种方法提供最佳可能收敛速率的Lyapunov函数,并配有匹配的下界。这种基于原则的方法提供了明确的性能保证,并使EF、EF^21和压缩梯度下降之间的严格、苹果对苹果的比较成为可能。我们的分析是在简化的单代理设置中进行的,这使得能够进行干净的理论洞察和对底层机制的公平比较。

更新时间: 2025-11-03 09:54:53

领域: cs.LG,cs.DC,math.OC

下载: http://arxiv.org/abs/2506.05271v2

Low-Rank Adaptation for Foundation Models: A Comprehensive Review

The rapid advancement of foundation models, large-scale neural networks trained on diverse, extensive datasets, has revolutionized artificial intelligence, enabling unprecedented advancements across domains such as natural language processing, computer vision, and scientific discovery. However, the substantial parameter count of these models, often reaching billions or trillions, poses significant challenges in adapting them to specific downstream tasks. Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges, offering a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead. This survey provides the first comprehensive review of LoRA techniques beyond large language models to general foundation models, including recent technical foundations, emerging frontiers, and applications of low-rank adaptation across multiple domains. Finally, this survey discusses key challenges and future research directions in theoretical understanding, scalability, and robustness. This survey serves as a valuable resource for researchers and practitioners working with efficient foundation model adaptation.
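The mechanism LoRA applies to each adapted weight is a frozen matrix plus a trainable low-rank update, y = Wx + (alpha/r)·B(Ax), with B zero-initialized so training starts exactly from the pretrained behavior. A numpy sketch with invented dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 4                  # layer dims and a small rank r << d
W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable "down" projection
B = np.zeros((d, r))                 # trainable "up" projection, zero-init
alpha = 8.0                          # scaling hyperparameter

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha/r) * B (A x); only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
y0 = lora_forward(x, W, A, B, alpha, r)   # == W @ x at initialization
```

With rank r = 4 the adapter adds r(d + k) = 512 trainable parameters against the 4096 frozen ones, which is the parameter-efficiency argument in miniature; at deployment the update can be merged into W so inference cost is unchanged.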

Updated: 2025-11-03 09:51:08

标题: 基于低秩适应的基础模型:综合评述

摘要: 基于广泛、多样数据集训练的大规模神经网络基础模型的快速发展,彻底改变了人工智能领域,实现了自然语言处理、计算机视觉和科学发现等领域的前所未有的进步。然而,这些模型的参数数量庞大,通常达到数十亿甚至数万亿,使得将它们调整到特定的下游任务面临重大挑战。低秩适应(LoRA)已经成为一种非常有前途的方法,可以缓解这些挑战,提供一种参数高效的机制来微调基础模型,减少计算开销。本调查对LoRA技术进行了首次全面审查,涵盖了大型语言模型以外的一般基础模型,包括最近的技术基础、新兴领域以及低秩适应在多个领域中的应用。最后,本调查讨论了理论理解、可扩展性和鲁棒性方面的关键挑战和未来研究方向。本调查为致力于高效基础模型适应的研究人员和从业者提供了宝贵的资源。

更新时间: 2025-11-03 09:51:08

领域: cs.LG,cs.AI,I.2

下载: http://arxiv.org/abs/2501.00365v2

ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
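The core trick above, casting a nonlinear recurrence as one system of equations solved with Newton's method, can be shown on a scalar recurrence h_t = tanh(a·h_{t-1} + x_t). The Jacobian of the stacked residual is bidiagonal; the sketch below solves it by forward substitution, which is the step ParaRNN would replace with its custom parallel reductions (the sequence length and parameters are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
T, a, h0 = 50, 0.5, 0.0
x = rng.normal(size=T)

def sequential(x, a, h0):
    """Naive step-by-step evaluation of h_t = tanh(a*h_{t-1} + x_t)."""
    h, out = h0, []
    for xt in x:
        h = np.tanh(a * h + xt)
        out.append(h)
    return np.array(out)

def newton_solve(x, a, h0, iters=8):
    """Solve the whole sequence at once as F(h) = 0 via Newton's method."""
    h = np.zeros(T)                      # initial guess for all time steps
    for _ in range(iters):
        prev = np.concatenate([[h0], h[:-1]])
        z = a * prev + x
        F = h - np.tanh(z)               # stacked residual
        d = 1.0 - np.tanh(z) ** 2        # tanh'(z)
        # Jacobian is bidiagonal: diag = 1, subdiag = -a*d. Solve J delta = -F.
        delta = np.empty(T)
        delta[0] = -F[0]
        for t in range(1, T):            # ParaRNN parallelizes this reduction
            delta[t] = -F[t] + a * d[t] * delta[t - 1]
        h = h + delta
    return h

h_seq = sequential(x, a, h0)
h_newton = newton_solve(x, a, h0)
```

A handful of Newton iterations recovers the sequential result to machine precision; the payoff at scale is that each iteration's work is parallelizable over the sequence, unlike the naive loop.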

Updated: 2025-11-03 09:47:30

Subjects: cs.LG,68T07, 68W10,I.2.7

Download: http://arxiv.org/abs/2510.21450v2

Relaxing partition admissibility in Cluster-DAGs: a causal calculus with arbitrary variable clustering

Cluster DAGs (C-DAGs) provide an abstraction of causal graphs in which nodes represent clusters of variables, and edges encode both cluster-level causal relationships and dependencies arisen from unobserved confounding. C-DAGs define an equivalence class of acyclic causal graphs that agree on cluster-level relationships, enabling causal reasoning at a higher level of abstraction. However, when the chosen clustering induces cycles in the resulting C-DAG, the partition is deemed inadmissible under conventional C-DAG semantics. In this work, we extend the C-DAG framework to support arbitrary variable clusterings by relaxing the partition admissibility constraint, thereby allowing cyclic C-DAG representations. We extend the notions of d-separation and causal calculus to this setting, significantly broadening the scope of causal reasoning across clusters and enabling the application of C-DAGs in previously intractable scenarios. Our calculus is both sound and atomically complete with respect to the do-calculus: all valid interventional queries at the cluster level can be derived using our rules, each corresponding to a primitive do-calculus step.

Updated: 2025-11-03 09:44:58

Subjects: cs.AI,stat.ME

Download: http://arxiv.org/abs/2511.01396v1

ConneX: Automatically Resolving Transaction Opacity of Cross-Chain Bridges for Security Analysis

As the Web3 ecosystem evolves toward a multi-chain architecture, cross-chain bridges have become critical infrastructure for enabling interoperability between diverse blockchain networks. However, while connecting isolated blockchains, the lack of cross-chain transaction pairing records introduces significant challenges for security analysis like cross-chain fund tracing, advanced vulnerability detection, and transaction graph-based analysis. To address this gap, we introduce ConneX, an automated and general-purpose system designed to accurately identify corresponding transaction pairs across both ends of cross-chain bridges. Our system leverages Large Language Models (LLMs) to efficiently prune the semantic search space by identifying semantically plausible key information candidates within complex transaction records. Further, it deploys a novel examiner module that refines these candidates by validating them against transaction values, effectively addressing semantic ambiguities and identifying the correct semantics. Extensive evaluations on a dataset of about 500,000 transactions from five major bridge platforms demonstrate that ConneX achieves an average F1 score of 0.9746, surpassing baselines by at least 20.05%, with good efficiency that reduces the semantic search space by several orders of magnitude (1e10 to less than 100). Moreover, its successful application in tracing illicit funds (including a cross-chain transfer worth $1 million) in real-world hacking incidents underscores its practical utility for enhancing cross-chain security and transparency.

Updated: 2025-11-03 09:44:02

Subjects: cs.CR

Download: http://arxiv.org/abs/2511.01393v1

Inducing Riesz and orthonormal bases in $L^2$ via composition operators

Let $C_h$ be a composition operator mapping $L^2(\Omega_1)$ into $L^2(\Omega_2)$ for some open sets $\Omega_1, \Omega_2 \subseteq \mathbb{R}^n$. We characterize the mappings $h$ that transform Riesz bases of $L^2(\Omega_1)$ into Riesz bases of $L^2(\Omega_2)$. Restricting our analysis to differentiable mappings, we demonstrate that mappings $h$ that preserve Riesz bases have Jacobian determinants that are bounded away from zero and infinity. We discuss implications of these results for approximation theory, highlighting the potential of using bijective neural networks to construct Riesz bases with favorable approximation properties.
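
The role of the Jacobian bound can be read off a change-of-variables computation; as a sketch, assume $h:\Omega_2\to\Omega_1$ is a diffeomorphism, so that $C_h f = f\circ h$ maps $L^2(\Omega_1)$ into $L^2(\Omega_2)$:

```latex
% substituting y = h(x):
\|C_h f\|_{L^2(\Omega_2)}^2
  = \int_{\Omega_2} |f(h(x))|^2 \, dx
  = \int_{\Omega_1} |f(y)|^2 \,\bigl|\det Dh^{-1}(y)\bigr| \, dy,
\qquad
0 < c \le |\det Dh| \le C < \infty
\;\Longrightarrow\;
C^{-1}\|f\|^2 \;\le\; \|C_h f\|^2 \;\le\; c^{-1}\|f\|^2 .
```

A two-sided Jacobian bound thus makes $C_h$ bounded with bounded inverse, which is precisely the property that transports Riesz bases of $L^2(\Omega_1)$ to Riesz bases of $L^2(\Omega_2)$.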

Updated: 2025-11-03 09:42:43

Subjects: math.FA,cs.LG,cs.NA,math.NA,47B33, 42C15

Download: http://arxiv.org/abs/2406.18613v2

Beyond Static Thresholds: Adaptive RRC Signaling Storm Detection with Extreme Value Theory

In 5G and beyond networks, the radio communication between a User Equipment (UE) and a base station (gNodeB or gNB), also known as the air interface, is a critical component of network access and connectivity. During the connection establishment procedure, the Radio Resource Control (RRC) layer can be vulnerable to signaling storms, which threaten the availability of the radio access control plane. These attacks may occur when one or more UEs send a large number of connection requests to the gNB, preventing new UEs from establishing connections. In this paper, we investigate the detection of such threats and propose an adaptive threshold-based detection system based on Extreme Value Theory (EVT). The proposed solution is evaluated numerically by applying simulated attack scenarios based on a realistic threat model on top of real-world RRC traffic data from an operator network. We show that, by leveraging features from the RRC layer only, the detection system can not only identify the attacks but also differentiate them from legitimate high-traffic situations. The adaptive threshold calculated using EVT ensures that the system can work under diverse traffic conditions. The results show high accuracy, precision, and recall values (above 93%), and a low detection latency even under complex conditions.
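
A standard peaks-over-threshold construction conveys the flavor of an EVT-based adaptive threshold. This is a generic sketch using SciPy's `genpareto`, not the paper's exact estimator; the traffic model, quantile, and target probability are illustrative:

```python
import numpy as np
from scipy.stats import genpareto

def evt_threshold(samples, base_q=0.95, target_p=1e-4):
    """Adaptive alarm threshold via peaks-over-threshold EVT.

    Fit a Generalized Pareto Distribution (GPD) to the excesses of
    `samples` over a high empirical quantile u, then return the level
    whose exceedance probability is `target_p`. Refitting on a sliding
    window lets the threshold track changing traffic conditions.
    """
    u = np.quantile(samples, base_q)
    excesses = samples[samples > u] - u
    # fix location at 0: excesses over u start at 0 by construction
    shape, loc, scale = genpareto.fit(excesses, floc=0.0)
    # P(X > u + t) ~ (1 - base_q) * P(GPD > t); invert for target_p
    tail_p = target_p / (1.0 - base_q)
    return u + genpareto.ppf(1.0 - tail_p, shape, loc=loc, scale=scale)

# e.g. per-second RRC connection-request counts (synthetic stand-in)
rng = np.random.default_rng(1)
counts = rng.gamma(shape=4.0, scale=25.0, size=20_000)
thr = evt_threshold(counts)
print(f"alarm threshold: {thr:.1f} (observed max {counts.max():.1f})")
```

Because the threshold is derived from the fitted tail rather than a fixed constant, a legitimate traffic surge that fits the tail model raises the threshold instead of firing a false alarm.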

Updated: 2025-11-03 09:42:12

Subjects: cs.CR,cs.NI

Download: http://arxiv.org/abs/2511.01391v1

SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment

Fine-grained cross-modal alignment aims to establish precise local correspondences between vision and language, forming a cornerstone for visual question answering and related multimodal applications. Current approaches face challenges in addressing patch redundancy and ambiguity, which arise from the inherent information density disparities across modalities. Recently, Multimodal Large Language Models (MLLMs) have emerged as promising solutions to bridge this gap through their robust semantic generation capabilities. However, the dense textual outputs from MLLMs may introduce conflicts with the original sparse captions. Furthermore, accurately quantifying semantic relevance between rich visual patches and concise textual descriptions remains a core challenge. To overcome these limitations, we introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Additionally, it leverages relevance-aware selection with mean value computation to highlight crucial patch-word correspondences, thereby improving cross-modal similarity assessment. Comprehensive experiments on Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance, surpassing existing approaches by 23%-86% in rSum across diverse model architectures, with notable enhancements in text-to-image retrieval scenarios. Our implementation is available at https://github.com/Sweet4tars/seps.git.

Updated: 2025-11-03 09:41:32

Subjects: cs.CV,cs.AI,cs.MM

Download: http://arxiv.org/abs/2511.01390v1

RAGSmith: A Framework for Finding the Optimal Composition of Retrieval-Augmented Generation Methods Across Datasets

Retrieval-Augmented Generation (RAG) quality depends on many interacting choices across retrieval, ranking, augmentation, prompting, and generation, so optimizing modules in isolation is brittle. We introduce RAGSmith, a modular framework that treats RAG design as an end-to-end architecture search over nine technique families and 46,080 feasible pipeline configurations. A genetic search optimizes a scalar objective that jointly aggregates retrieval metrics (recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law, Finance, Medicine, Defense Industry, Computer Science), each with 100 questions spanning factual, interpretation, and long-answer types. RAGSmith finds configurations that consistently outperform the naive RAG baseline by +3.8% on average (range +1.2% to +6.9% across domains), with gains up to +12.5% in retrieval and +7.5% in generation. The search typically explores $\approx 0.2\%$ of the space ($\sim 100$ candidates) and discovers a robust backbone -- vector retrieval plus post-generation reflection/revision -- augmented by domain-dependent choices in expansion, reranking, augmentation, and prompt reordering; passage compression is never selected. Improvement magnitude correlates with question type, with larger gains on factual/long-answer mixes than interpretation-heavy sets. These results provide practical, domain-aware guidance for assembling effective RAG systems and demonstrate the utility of evolutionary search for full-pipeline optimization.
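
A genetic search over a discrete pipeline space can be sketched in a few lines. The technique families and options below are illustrative (not the paper's exact 46,080-point space), and `toy_objective` is a stub standing in for the aggregated retrieval-plus-generation score, which in reality would run the candidate pipeline on the evaluation set:

```python
import random

# Hypothetical technique families and options (illustrative only)
SPACE = {
    "retriever":   ["vector", "bm25", "hybrid"],
    "reranker":    [None, "cross-encoder"],
    "expansion":   [None, "hyde", "multi-query"],
    "compression": [None, "passage"],
    "reflection":  [None, "revise-once"],
}

def random_config():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg):
    child = dict(cfg)
    key = random.choice(list(SPACE))       # resample one family
    child[key] = random.choice(SPACE[key])
    return child

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in SPACE}

def genetic_search(objective, pop_size=12, generations=8, seed=0):
    random.seed(seed)
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=objective, reverse=True)[: pop_size // 3]
        children = [mutate(crossover(random.choice(elite),
                                     random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children                 # elitism keeps the best
    return max(pop, key=objective)

# stub objective standing in for the aggregated recall@k/nDCG/LLM-judge score
def toy_objective(cfg):
    return (cfg["retriever"] == "vector") + (cfg["reflection"] is not None)

best = genetic_search(toy_objective)
```

Only a few hundred objective evaluations are needed, which is how the search can cover a tiny fraction of the configuration space yet still find strong pipelines.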

Updated: 2025-11-03 09:36:27

Subjects: cs.CL,cs.AI,cs.IR,H.3.3; I.2.7

Download: http://arxiv.org/abs/2511.01386v1

What Can Be Recovered Under Sparse Adversarial Corruption? Assumption-Free Theory for Linear Measurements

Let $A \in \mathbb{R}^{m \times n}$ be an arbitrary, known matrix and $e$ a $q$-sparse adversarial vector. Given $y = A x^\star + e$ and $q$, we seek the smallest set containing $x^\star$--hence the one conveying maximal information about $x^\star$--that is uniformly recoverable from $y$ without knowing $e$. While exact recovery of $x^\star$ via strong (and often impractical) structural assumptions on $A$ or $x^\star$ (e.g., restricted isometry, sparsity) is well studied, recoverability for arbitrary $A$ and $x^\star$ remains open. Our main result shows that the best that one can hope to recover is $x^\star + \ker(U)$, where $U$ is the unique projection matrix onto the intersection of rowspaces of all possible submatrices of $A$ obtained by deleting $2q$ rows. Moreover, we prove that every $x$ that minimizes the $\ell_0$-norm of $y - A x$ lies in $x^\star + \ker(U)$, which then gives a constructive approach to recover this set.
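
For small instances, the projector $U$ can be computed by brute force directly from its definition: intersect the row spaces of all submatrices obtained by deleting $2q$ rows. This NumPy sketch is ours (the function name and tolerance handling are not from the paper), and the combinatorial enumeration is only feasible for tiny $m$:

```python
import itertools
import numpy as np

def recoverable_projection(A, q, tol=1e-10):
    """Projector U onto the intersection of the row spaces of all
    submatrices of A obtained by deleting 2q rows; the recoverable
    set is then x* + ker(U). Brute force over row subsets."""
    m, n = A.shape
    blockers = []                  # one (I - P_S) per kept-row subset S
    for keep in itertools.combinations(range(m), m - 2 * q):
        _, s, Vt = np.linalg.svd(A[list(keep)])
        r = int((s > tol * s.max()).sum()) if s.size and s.max() > 0 else 0
        B = Vt[:r]                 # orthonormal basis of rowspace(A_S)
        blockers.append(np.eye(n) - B.T @ B)
    # v lies in every rowspace iff every (I - P_S) annihilates it,
    # so the intersection is the null space of the stacked blockers
    _, s, Vt = np.linalg.svd(np.vstack(blockers))
    r = int((s > tol * max(s.max(), 1.0)).sum())
    V = Vt[r:]                     # orthonormal basis of the intersection
    return V.T @ V

# tiny example: m=4 measurements, n=2 unknowns, q=1 corrupted entry;
# every pair of rows of A has rank 2, so U = I and x* is uniquely recoverable
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
U = recoverable_projection(A, q=1)
assert np.allclose(U, np.eye(2))
```

If some $2q$-row deletion left a rank-deficient submatrix, $\ker(U)$ would be nontrivial and the example would only pin $x^\star$ down up to that subspace.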

Updated: 2025-11-03 09:29:07

Subjects: cs.IT,cs.LG,eess.SP,math.IT

Download: http://arxiv.org/abs/2510.24215v2

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty in the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined using fine-grained and dense feedback using a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.

Updated: 2025-11-03 09:18:27

Subjects: cs.AI

Download: http://arxiv.org/abs/2511.01375v1

Memory Assisted LLM for Personalized Recommendation System

Large language models (LLMs) have demonstrated significant potential in solving recommendation tasks. With proven capabilities in understanding user preferences, LLM personalization has emerged as a critical area for providing tailored responses to individuals. Current studies explore personalization through prompt design and fine-tuning, paving the way for further research in personalized LLMs. However, existing approaches are either costly and inefficient in capturing diverse user preferences or fail to account for timely updates to user history. To address these gaps, we propose the Memory-Assisted Personalized LLM (MAP). Through user interactions, we first create a history profile for each user, capturing their preferences, such as ratings for historical items. During recommendation, we extract relevant memory based on similarity, which is then incorporated into the prompts to enhance personalized recommendations. In our experiments, we define a new task that enables testing with varying memory size under two scenarios: single domain where memory and tasks are from the same category and cross-domain (e.g. memory from movies and recommendation tasks in books). The results show that MAP outperforms regular LLM-based recommenders that integrate user history directly through prompt design. Moreover, as user history grows, MAP's advantage increases in both scenarios, making it more suitable for addressing successive personalized user requests.
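
The retrieve-then-prompt step can be sketched as follows. Bag-of-words cosine similarity stands in for whatever embedding model a real MAP-style system would use, and the history strings are invented for illustration:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_memory(history, query, k=2):
    """Return the k history entries most similar to the current request."""
    q = Counter(query.lower().split())
    return sorted(history,
                  key=lambda m: cosine(Counter(m.lower().split()), q),
                  reverse=True)[:k]

history = [
    "rated the sci-fi movie Dune 5/5",
    "rated the romance novel Pride and Prejudice 2/5",
    "rated the sci-fi book Foundation 4/5",
]
memory = retrieve_memory(history, "recommend a sci-fi book", k=2)
prompt = ("Relevant user history:\n- " + "\n- ".join(memory)
          + "\n\nTask: recommend a sci-fi book")
```

Only the retrieved entries enter the prompt, which is what keeps the approach cheap as the user's history grows and what enables the cross-domain setting (movie memories informing book recommendations).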

Updated: 2025-11-03 09:18:20

Subjects: cs.IR,cs.AI

Download: http://arxiv.org/abs/2505.03824v2

A Self-Evolving AI Agent System for Climate Science

Scientific progress in Earth science depends on integrating data across the planet's interconnected spheres. However, the accelerating volume and fragmentation of multi-sphere knowledge and data have surpassed human analytical capacity. This creates a major bottleneck for discovery, especially in climate science. To address this challenge, we introduce EarthLink, the first self-evolving AI agent system designed as an interactive "copilot" for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process that directly addresses this limitation. Beyond efficiency, it exhibits human-like cross-disciplinary analytical ability and achieves proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks, including model-observation comparison and climate change understanding. When tasked with an open scientific problem, specifically the discovery of precursors of the Atlantic Niño, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These emerging capabilities enable a new human-AI research paradigm. Scientists can focus on value and result judgments, while AI systems handle complex data analysis and knowledge integration. This accelerates the pace and breadth of discovery in Earth sciences. The system is accessible at our website https://earthlink.intern-ai.org.cn.

Updated: 2025-11-03 09:17:56

Subjects: cs.LG,cs.AI,physics.ao-ph

Download: http://arxiv.org/abs/2507.17311v3

Generative AI and Empirical Software Engineering: A Paradigm Shift

The adoption of large language models (LLMs) and autonomous agents in software engineering marks an enduring paradigm shift. These systems create new opportunities for tool design, workflow orchestration, and empirical observation, while fundamentally reshaping the roles of developers and the artifacts they produce. Although traditional empirical methods remain central to software engineering research, the rapid evolution of AI introduces new data modalities, alters causal assumptions, and challenges foundational constructs such as "developer", "artifact", and "interaction". As humans and AI agents increasingly co-create, the boundaries between social and technical actors blur, and the reproducibility of findings becomes contingent on model updates and prompt contexts. This vision paper examines how the integration of LLMs into software engineering disrupts established research paradigms. We discuss how it transforms the phenomena we study, the methods and theories we rely on, the data we analyze, and the threats to validity that arise in dynamic AI-mediated environments. Our aim is to help the empirical software engineering community adapt its questions, instruments, and validation standards to a future in which AI systems are not merely tools, but active collaborators shaping software engineering and its study.

Updated: 2025-11-03 09:09:05

Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2502.08108v2

Automatic Minds: Cognitive Parallels Between Hypnotic States and Large Language Model Processing

The cognitive processes of the hypnotized mind and the computational operations of large language models (LLMs) share deep functional parallels. Both systems generate sophisticated, contextually appropriate behavior through automatic pattern-completion mechanisms operating with limited or unreliable executive oversight. This review examines this convergence across three principles: automaticity, in which responses emerge from associative rather than deliberative processes; suppressed monitoring, leading to errors such as confabulation in hypnosis and hallucination in LLMs; and heightened contextual dependency, where immediate cues (for example, the suggestion of a therapist or the prompt of the user) override stable knowledge. These mechanisms reveal an observer-relative meaning gap: both systems produce coherent but ungrounded outputs that require an external interpreter to supply meaning. Hypnosis and LLMs also exemplify functional agency - the capacity for complex, goal-directed, context-sensitive behavior - without subjective agency, the conscious awareness of intention and ownership that defines human action. This distinction clarifies how purposive behavior can emerge without self-reflective consciousness, governed instead by structural and contextual dynamics. Finally, both domains illuminate the phenomenon of scheming: automatic, goal-directed pattern generation that unfolds without reflective awareness. Hypnosis provides an experimental model for understanding how intention can become dissociated from conscious deliberation, offering insights into the hidden motivational dynamics of artificial systems. Recognizing these parallels suggests that the future of reliable AI lies in hybrid architectures that integrate generative fluency with mechanisms of executive monitoring, an approach inspired by the complex, self-regulating architecture of the human mind.

Updated: 2025-11-03 09:08:50

Subjects: cs.AI

Download: http://arxiv.org/abs/2511.01363v1

PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise

Natural Language Inference (NLI) models have been used in various ways to improve the factuality of LLM outputs. This is typically done by applying an NLI model to judge whether the model output is entailed from the supposed evidence, triggering some corrective actions, such as beam reranking at inference time or RL rewards during training. While NLI models are trained to detect factual inconsistencies over complete sentences, decisions in the common autoregressive generation architecture are made for each evolving text prefix, during decoding. Addressing this setting, we generalize the entailment detection task to apply over arbitrary text prefixes, and suggest its utility for improving generation faithfulness. Providing suitable evaluation and training datasets for this task, we train MiniTruePrefixes, a novel specialized model that better detects factual inconsistencies over text prefixes, outperforming comparable baseline NLI models by 5-14 F1 points in prefix-level entailment. We further demonstrate that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization. When guided by MiniTruePrefixes, LLaMA-3.2-3B-Instruct matches the faithfulness and runtime of the 8B model from the same model family, while using only half the memory.
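
Integrating a prefix-level entailment scorer into controlled decoding might look like the sketch below. `stub_scorer` is a toy stand-in for a model like MiniTruePrefixes, and the threshold-and-rerank rule is illustrative rather than the paper's exact decoding framework:

```python
def controlled_decode(candidates, evidence, prefix_entails, threshold=0.5):
    """Keep only beam candidates whose evolving prefix is still entailed
    by the evidence, ranked by the prefix-entailment score."""
    scored = [(prefix_entails(evidence, c), c) for c in candidates]
    kept = sorted((sc for sc in scored if sc[0] >= threshold), reverse=True)
    return [c for _, c in kept]

def stub_scorer(evidence, prefix):
    # toy stand-in: fraction of prefix tokens found in the evidence
    toks = prefix.lower().split()
    return sum(t in evidence.lower() for t in toks) / max(len(toks), 1)

evidence = "The meeting was moved to Tuesday at 10am."
beams = ["The meeting was moved to Tuesday",
         "The meeting was cancelled"]
ranked = controlled_decode(beams, evidence, stub_scorer)
```

The point of scoring prefixes rather than complete sentences is that an unsupported continuation ("was cancelled") can be demoted as soon as it appears, instead of only after the full sentence has been generated.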

Updated: 2025-11-03 09:07:44

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.01359v1

Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning

Efficient instruction tuning aims to enhance the ultimate performance of large language models (LLMs) trained on a given instruction dataset. Curriculum learning as a typical data organization strategy has shown preliminary effectiveness in instruction tuning. However, current curriculum tuning methods suffer from the curriculum rigidity, since they rely solely on static heuristic difficulty metrics. These methods fail to adapt to the evolving capabilities of models during training, resulting in a fixed and potentially sub-optimal learning trajectory. To address the issue, Competence-Aware Multi-Perspective cUrriculum inStruction tuning framework termed CAMPUS is proposed. CAMPUS offers several advantages: (1) Dynamic selection for sub-curriculum. (2) Competency-aware adjustment to the curriculum schedule. (3) Multiple difficulty-based scheduling. Extensive experiments prove the superior performance of CAMPUS, compared to other state-of-the-art baselines for efficient instruction tuning.

Updated: 2025-11-03 09:06:01

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.13790v2

Localist LLMs -- A Mathematical Framework for Dynamic Locality Control

We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
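
One way to picture the "locality dial" is a penalty that pushes attention mass onto its strongest block, lowering attention entropy as the dial is turned up. This is an illustrative construction of ours, not the paper's group-sparsity formulation; the grouping and penalty are invented for the demo:

```python
import numpy as np

def attention_with_dial(scores, groups, lam):
    """Apply a block-level penalty to attention logits.

    `lam` is the 'locality dial': lam=0 leaves attention distributed;
    a large lam suppresses every block except the most relevant one.
    `groups[i]` is the block id of key position i (illustrative).
    """
    scores = np.asarray(scores, dtype=float)
    penalized = scores.copy()
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        # penalize membership in blocks whose mean score lags the best key
        penalized[idx] -= lam * (scores.max() - scores[idx].mean())
    e = np.exp(penalized - penalized.max())
    attn = e / e.sum()
    entropy = -(attn * np.log(attn + 1e-12)).sum()
    return attn, entropy

scores = [2.0, 1.8, 0.1, 0.0]   # keys 0-1: relevant block, keys 2-3: other
groups = [0, 0, 1, 1]
_, h_dist = attention_with_dial(scores, groups, lam=0.0)
_, h_local = attention_with_dial(scores, groups, lam=5.0)
assert h_local < h_dist          # larger dial -> lower attention entropy
```

The paper's threshold results say something analogous but rigorous: once the group penalty exceeds an explicit threshold, attention provably concentrates on the relevant block with exponentially small entropy.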

Updated: 2025-11-03 09:05:41

Subjects: cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.09338v2

CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets. Treating this task as a simple classification problem may make it unable to adapt to the diversity of free-form answers and overlook the detailed semantic information of free-form answers. In order to tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms the existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct more interpretability experiments to prove the effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.

Updated: 2025-11-03 09:05:16


Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.01357v1

On the Classical Hardness of the Semidirect Discrete Logarithm Problem in Finite Groups

The semidirect discrete logarithm problem (SDLP) in finite groups was proposed as a foundation for post-quantum cryptographic protocols, based on the belief that its non-abelian structure would resist quantum attacks. However, recent results have shown that SDLP in finite groups admits efficient quantum algorithms, undermining its quantum resistance. This raises a fundamental question: does the SDLP offer any computational advantages over the standard discrete logarithm problem (DLP) against classical adversaries? In this work, we investigate the classical hardness of SDLP across different finite group platforms. We establish that the group-case SDLP can be reformulated as a generalized discrete logarithm problem, enabling adaptation of classical algorithms to study its complexity. We present a concrete adaptation of the Baby-Step Giant-Step algorithm for SDLP, achieving time and space complexity $O(\sqrt{r})$ where $r$ is the period of the underlying cycle structure. Through theoretical analysis and experimental validation in SageMath, we demonstrate that the classical hardness of SDLP is highly platform-dependent and does not uniformly exceed that of standard DLP. In finite fields $\mathbb{F}_p^*$, both problems exhibit comparable complexity. Surprisingly, in elliptic curves $E(\mathbb{F}_p)$, the SDLP becomes trivial due to the bounded automorphism group, while in elementary abelian groups $\mathbb{F}_p^n$, the SDLP can be harder than DLP, with complexity varying based on the eigenvalue structure of the automorphism. Our findings reveal that the non-abelian structure of semidirect products does not inherently guarantee increased classical hardness, suggesting that the search for classically hard problems for cryptographic applications requires more careful consideration of the underlying algebraic structures.
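For background on the O(√r) claim, here is a textbook Baby-Step Giant-Step solver for the ordinary discrete logarithm in Z_p^* (a sketch of the classical algorithm being adapted, not the paper's SDLP variant, which replaces the group order with the period r of the semidirect cycle structure):

```python
import math

def bsgs_dlog(g, h, p):
    """Baby-Step Giant-Step: find x with g^x = h (mod p), or None.

    Time and space are O(sqrt(n)) for n = p - 1, which bounds the order
    of g in Z_p^*."""
    n = p - 1
    m = math.isqrt(n) + 1
    # Baby steps: table of g^j mod p for j in [0, m)
    table = {}
    e = 1
    for j in range(m):
        table.setdefault(e, j)
        e = (e * g) % p
    # Giant steps: h * (g^-m)^i, looking for a collision with the table
    factor = pow(g, -m, p)          # modular inverse via 3-arg pow (Python 3.8+)
    gamma = h % p
    for i in range(m):
        if gamma in table:
            return i * m + table[gamma]
        gamma = (gamma * factor) % p
    return None

# 3 generates Z_101^*; recover the exponent from 3^57 mod 101.
assert bsgs_dlog(3, pow(3, 57, 101), 101) == 57
```

The space-time trade-off (an O(√n) table traded against O(√n) giant steps) is exactly the structure the paper transfers to the period of the semidirect cycle.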

Updated: 2025-11-03 09:05:05


Subjects: cs.CR,cs.CC

Download: http://arxiv.org/abs/2508.05048v2

Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series

Recently, the demand for small and efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.

Updated: 2025-11-03 09:00:51


Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.01354v1

AI Literacy in UAE Libraries: Assessing Competencies, Training Needs, and Ethical Considerations for the Digital Age

The study explores the current state of artificial intelligence (AI) literacy levels among library professionals employing a quantitative approach consisting of 92 surveys of LIS professionals in the United Arab Emirates (UAE). Findings of the study revealed the presence of strong cognitive competencies, while there were gaps observed in behavioral and normative competencies, especially related to AI biases, AI-powered learning, and ethical considerations. There was a disconnect observed between the perceived importance of AI skills and the effectiveness of the current training programs.

Updated: 2025-11-03 09:00:15


Subjects: cs.DL,cs.AI

Download: http://arxiv.org/abs/2511.01353v1

The Future of Generative AI in Software Engineering: A Vision from Industry and Academia in the European GENIUS Project

Generative AI (GenAI) has recently emerged as a groundbreaking force in Software Engineering, capable of generating code, suggesting fixes, and supporting quality assurance. While its use in coding tasks shows considerable promise, applying GenAI across the entire Software Development Life Cycle (SDLC) has not yet been fully explored. Critical uncertainties in areas such as reliability, accountability, security, and data privacy demand deeper investigation and coordinated action. The GENIUS project, comprising over 30 European industrial and academic partners, aims to address these challenges by advancing AI integration across all SDLC phases. It focuses on GenAI's potential, the development of innovative tools, and emerging research challenges, actively shaping the future of software engineering. This vision paper presents a shared perspective on the future of GenAI-based software engineering, grounded in cross-sector dialogue and experience within the GENIUS consortium, supported by an exploratory literature review. The paper explores four central elements: (1) a structured overview of current challenges in GenAI adoption across the SDLC; (2) a forward-looking vision outlining key technological and methodological advances expected over the next five years; (3) anticipated shifts in the roles and required skill sets of software professionals; and (4) the contribution of GENIUS in realizing this transformation through practical tools and industrial validation. By aligning technical innovation with business relevance, this paper aims to inform both research agendas and industrial strategies, providing a foundation for reliable, scalable, and industry-ready GenAI solutions for software engineering teams.

Updated: 2025-11-03 08:56:23


Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2511.01348v1

SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning

Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at https://github.com/MichaelMaiii/SynBrain.

Updated: 2025-11-03 08:51:11


Subjects: cs.LG,cs.CV,eess.IV

Download: http://arxiv.org/abs/2508.10298v3

Bellman Diffusion Models

Diffusion models have seen tremendous success as generative architectures. Recently, they have been shown to be effective at modelling policies for offline reinforcement learning and imitation learning. We explore using diffusion as a model class for the successor state measure (SSM) of a policy. We find that enforcing the Bellman flow constraints leads to a simple Bellman update on the diffusion step distribution.
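For context, the Bellman flow constraint on the successor state measure of a fixed policy has a simple tabular form, which the paper replaces with a diffusion model over states. A sketch (tabular, two states, assuming a known policy-induced transition matrix):

```python
import numpy as np

def successor_measure(P, gamma, iters=500):
    """Fixed-point iteration of the successor-state-measure Bellman
    equation under a fixed policy:

        M = (1 - gamma) * P + gamma * P @ M

    P is the |S| x |S| transition matrix induced by the policy; row s of
    M is the discounted occupancy distribution over successor states
    starting from s (each row sums to 1)."""
    M = np.zeros_like(P)
    for _ in range(iters):
        M = (1 - gamma) * P + gamma * P @ M
    return M

# Two-state chain: the iteration converges to the closed form
# M = (1 - gamma) * P @ inv(I - gamma * P).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
gamma = 0.95
M = successor_measure(P, gamma)
closed = (1 - gamma) * P @ np.linalg.inv(np.eye(2) - gamma * P)
assert np.allclose(M, closed, atol=1e-6)
assert np.allclose(M.sum(axis=1), 1.0)
```

In the paper the rows of M are not tables but distributions represented by a diffusion model, and the update above becomes an update on the diffusion step distribution.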

Updated: 2025-11-03 08:42:18


Subjects: cs.LG,cs.RO

Download: http://arxiv.org/abs/2407.12163v2

Beyond Permissions: Investigating Mobile Personalization with Simulated Personas

Mobile applications increasingly rely on sensor data to infer user context and deliver personalized experiences. Yet the mechanisms behind this personalization remain opaque to users and researchers alike. This paper presents a sandbox system that uses sensor spoofing and persona simulation to audit and visualize how mobile apps respond to inferred behaviors. Rather than treating spoofing as adversarial, we demonstrate its use as a tool for behavioral transparency and user empowerment. Our system injects multi-sensor profiles - generated from structured, lifestyle-based personas - into Android devices in real time, enabling users to observe app responses to contexts such as high activity, location shifts, or time-of-day changes. With automated screenshot capture and GPT-4 Vision-based UI summarization, our pipeline helps document subtle personalization cues. Preliminary findings show measurable app adaptations across fitness, e-commerce, and everyday service apps such as weather and navigation. We offer this toolkit as a foundation for privacy-enhancing technologies and user-facing transparency interventions.

Updated: 2025-11-03 08:39:38


Subjects: cs.HC,cs.AI

Download: http://arxiv.org/abs/2511.01336v1

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan over its current observation state and strategically selects tools to orchestrate adaptive workflows for different queries in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.

Updated: 2025-11-03 08:39:35


Subjects: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.18079v4

Embodied Cognition Augmented End2End Autonomous Driving

In recent years, vision-based end-to-end autonomous driving has emerged as a new paradigm. However, popular end-to-end approaches typically rely on visual feature extraction networks trained under label supervision. This limited supervision framework restricts the generality and applicability of driving models. In this paper, we propose a novel paradigm termed $E^{3}AD$, which advocates for comparative learning between visual feature extraction networks and the general EEG large model, in order to learn latent human driving cognition for enhancing end-to-end planning. In this work, we collected a cognitive dataset for the mentioned contrastive learning process. Subsequently, we investigated the methods and potential mechanisms for enhancing end-to-end planning with human driving cognition, using popular driving models as baselines on publicly available autonomous driving datasets. Both open-loop and closed-loop tests are conducted for a comprehensive evaluation of planning performance. Experimental results demonstrate that the $E^{3}AD$ paradigm significantly enhances the end-to-end planning performance of baseline models. Ablation studies further validate the contribution of driving cognition and the effectiveness of comparative learning process. To the best of our knowledge, this is the first work to integrate human driving cognition for improving end-to-end autonomous driving planning. It represents an initial attempt to incorporate embodied cognitive data into end-to-end autonomous driving, providing valuable insights for future brain-inspired autonomous driving systems. Our code will be made available at Github

Updated: 2025-11-03 08:34:44


Subjects: cs.RO,cs.AI,cs.HC,68T45

Download: http://arxiv.org/abs/2511.01334v1

Unbiased Platform-Level Causal Estimation for Search Systems: A Competitive Isolation PSM-DID Framework

Evaluating platform-level interventions in search-based two-sided marketplaces is fundamentally challenged by systemic effects such as spillovers and network interference. While widely used for causal inference, the PSM (Propensity Score Matching) - DID (Difference-in-Differences) framework remains susceptible to selection bias and cross-unit interference from unaccounted spillovers. In this paper, we introduced Competitive Isolation PSM-DID, a novel causal framework that integrates propensity score matching with competitive isolation to enable platform-level effect measurement (e.g., order volume, GMV) instead of item-level metrics in search systems. Our approach provides theoretically guaranteed unbiased estimation under mutual exclusion conditions, with an open dataset released to support reproducible research on marketplace interference (github.com/xxxx). Extensive experiments demonstrate significant reductions in interference effects and estimation variance compared to baseline methods. Successful deployment in a large-scale marketplace confirms the framework's practical utility for platform-level causal inference.
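As a toy sketch of the PSM-DID backbone the framework builds on (nearest-neighbour matching on precomputed propensity scores followed by a difference-in-differences estimate), with all names hypothetical and the paper's competitive-isolation component deliberately out of scope:

```python
import numpy as np

def psm_did(scores_t, scores_c, pre_t, post_t, pre_c, post_c):
    """Nearest-neighbour propensity-score matching + DID estimate.

    scores_t / scores_c : propensity scores of treated / control units
                          (however estimated, e.g. logistic regression).
    pre_*, post_*       : each unit's outcome before / after treatment.
    For each treated unit, pick the control with the closest score and
    average (post_t - pre_t) - (post_c - pre_c) over the matches."""
    effects = []
    for i, s in enumerate(scores_t):
        j = int(np.argmin(np.abs(scores_c - s)))   # closest control unit
        effects.append((post_t[i] - pre_t[i]) - (post_c[j] - pre_c[j]))
    return float(np.mean(effects))

# Toy data: every unit drifts +1 over time; treatment adds +2 on top,
# so the DID estimate should recover exactly 2.0.
scores_t = np.array([0.40, 0.60])
scores_c = np.array([0.38, 0.61, 0.90])
pre_t, post_t = np.array([10.0, 12.0]), np.array([13.0, 15.0])
pre_c, post_c = np.array([9.0, 11.0, 20.0]), np.array([10.0, 12.0, 21.0])
assert psm_did(scores_t, scores_c, pre_t, post_t, pre_c, post_c) == 2.0
```

The spillover and interference problems the abstract describes arise precisely because, on a real marketplace, the matched control's outcome is not independent of the treated unit; the paper's mutual-exclusion conditions address that.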

Updated: 2025-11-03 08:29:21


Subjects: cs.AI

Download: http://arxiv.org/abs/2511.01329v1

MARFT: Multi-Agent Reinforcement Fine-Tuning

LLM-based Multi-Agent Systems (LaMAS) have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new MG called Flex-MG, which aligns with the LaMAS optimization in real-world applications and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.

Updated: 2025-11-03 08:23:59


Subjects: cs.MA,cs.AI,cs.LG,cs.RO

Download: http://arxiv.org/abs/2504.16129v4

DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness

Large language models (LLMs) with integrated search tools show strong promise in open-domain question answering (QA), yet they often struggle to produce a complete answer set for complex questions such as Which actor from the film Heat won at least one Academy Award?, which requires (1) distinguishing between multiple films sharing the same title and (2) reasoning across a large set of actors to gather and integrate evidence. Existing QA benchmarks rarely evaluate both challenges jointly. To address this, we introduce DeepAmbigQAGen, an automatic data generation pipeline that constructs QA tasks grounded in text corpora and a linked knowledge graph, generating natural and verifiable questions that systematically embed name ambiguity and multi-step reasoning. Based on this, we build DeepAmbigQA, a dataset of 3,600 questions requiring multi-hop reasoning, half of which also require explicit name disambiguation. Experiments reveal that even the state-of-the-art GPT-5 produces incomplete answers, achieving an exact match of only 0.13 on ambiguous questions and 0.21 on non-ambiguous questions. These findings highlight the need for more robust QA systems aimed at information gathering and answer completeness.

Updated: 2025-11-03 08:15:24


Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.01323v1

MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts

LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at https://github.com/mylittleriver/MedREK.

Updated: 2025-11-03 08:12:08


Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.13500v2

OmniFuser: Adaptive Multimodal Fusion for Service-Oriented Predictive Maintenance

Accurate and timely prediction of tool conditions is critical for intelligent manufacturing systems, where unplanned tool failures can lead to quality degradation and production downtime. In modern industrial environments, predictive maintenance is increasingly implemented as an intelligent service that integrates sensing, analysis, and decision support across production processes. To meet the demand for reliable and service-oriented operation, we present OmniFuser, a multimodal learning framework for predictive maintenance of milling tools that leverages both visual and sensor data. It performs parallel feature extraction from high-resolution tool images and cutting-force signals, capturing complementary spatiotemporal patterns across modalities. To effectively integrate heterogeneous features, OmniFuser employs a contamination-free cross-modal fusion mechanism that disentangles shared and modality-specific components, allowing for efficient cross-modal interaction. Furthermore, a recursive refinement pathway functions as an anchor mechanism, consistently retaining residual information to stabilize fusion dynamics. The learned representations can be encapsulated as reusable maintenance service modules, supporting both tool-state classification (e.g., Sharp, Used, Dulled) and multi-step force signal forecasting. Experiments on real-world milling datasets demonstrate that OmniFuser consistently outperforms state-of-the-art baselines, providing a dependable foundation for building intelligent industrial maintenance services.

Updated: 2025-11-03 08:08:52


Subjects: cs.AI

Download: http://arxiv.org/abs/2511.01320v1

Exploring and Unleashing the Power of Large Language Models in CI/CD Configuration Translation

Continuous Integration (CI) is a cornerstone of modern collaborative software development, and numerous CI platforms are available. Differences in maintenance overhead, reliability, and integration depth with code-hosting platforms make migration between CI platforms a common practice. A central step in migration is translating CI configurations, which is challenging due to the intrinsic complexity of CI configurations and the need to understand semantic differences and relationships across CI platforms. With the advent of large language models (LLMs), recent advances in software engineering highlight their potential for CI configuration translation. In this paper, we present a study on LLM-based CI configuration translation, focusing on the migration from Travis CI to GitHub Actions. First, using 811 migration records, we quantify the effort involved and find that developers read an average of 38 lines of Travis configuration and write 58 lines of GitHub Actions configuration, with nearly half of the migrations requiring multiple commits. We further analyze translations produced by each of the four LLMs and identify 1,121 issues grouped into four categories: logic inconsistencies (38%), platform discrepancies (32%), environment errors (25%), and syntax errors (5%). Finally, we evaluate three enhancement strategies and show that combining guideline-based prompting with iterative refinement achieves the best performance, reaching a Build Success Rate of 75.5%, nearly a threefold improvement over GPT-4o with a basic prompt.

Updated: 2025-11-03 08:01:09


Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2511.01316v1

llmSHAP: A Principled Approach to LLM Explainability

Feature attribution methods help make machine learning-based inference explainable by determining how much one or several features have contributed to a model's output. A particularly popular attribution method is based on the Shapley value from cooperative game theory, a measure that guarantees the satisfaction of several desirable principles, assuming deterministic inference. We apply the Shapley value to feature attribution in large language model (LLM)-based decision support systems, where inference is, by design, stochastic (non-deterministic). We then demonstrate when we can and cannot guarantee Shapley value principle satisfaction across different implementation variants applied to LLM-based decision support, and analyze how the stochastic nature of LLMs affects these guarantees. We also highlight trade-offs between explainable inference speed, agreement with exact Shapley value attributions, and principle attainment.
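For reference, the exact Shapley value that the paper's implementation variants approximate can be computed by brute force for a small feature set. A sketch (the additive value function below is a stand-in for a deterministic model call, not the paper's LLM setup):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for a coalition value function.

    features : list of feature names.
    value    : callable mapping a frozenset of features to the model
               output obtained with exactly those features present.
    Returns each feature's marginal contribution averaged over all
    coalitions with the standard Shapley weights."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for coal in combinations(others, k):
                s = frozenset(coal)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

# For an additive value function, Shapley recovers each feature's own
# weight, and the attributions satisfy the efficiency axiom:
# they sum to value(all features) - value(empty set).
weights = {"a": 1.0, "b": 2.0, "c": -0.5}
v = lambda s: sum(weights[f] for f in s)
phi = shapley_values(list(weights), v)
assert abs(phi["a"] - 1.0) < 1e-9 and abs(phi["b"] - 2.0) < 1e-9
assert abs(sum(phi.values()) - v(frozenset(weights))) < 1e-9
```

With a deterministic `value`, the principles the abstract mentions (e.g. efficiency) hold exactly, as the assertions show; the paper's question is what survives when `value` is a stochastic LLM call.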

Updated: 2025-11-03 07:54:47

标题: llmSHAP:LLM可解释性的原则方法

摘要: 特征归因方法通过确定一个或多个特征对模型输出的贡献程度来帮助使基于机器学习的推断可解释化。一种特别流行的归因方法基于合作博弈理论中的Shapley值,这是一种保证满足几个理想原则的度量,假定推断是确定性的。我们将Shapley值应用于基于大型语言模型(LLM)的决策支持系统中的特征归因,其中推断在设计上是随机的(非确定性)。然后我们展示了在应用于基于LLM的决策支持系统的不同实现变体中,我们何时能够保证Shapley值原则的满足以及何时不能保证。我们分析LLMs的随机性如何影响这些保证。我们还强调了可解释推断速度、与精确Shapley值归因的一致性以及原则达成之间的权衡。

更新时间: 2025-11-03 07:54:47

领域: cs.AI

下载: http://arxiv.org/abs/2511.01311v1
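
The exact Shapley computation that underlies attributions like those discussed above can be sketched in a few lines. The feature names and the additive toy value function in the usage are illustrative, and the `value_fn(coalition, rng)` signature is an assumption of this sketch: a stochastic value function (e.g., one backed by LLM calls) is handled by averaging repeated evaluations before use.

```python
import itertools, math, random

def shapley_values(features, value_fn, samples=50, seed=0):
    """Exact Shapley values over all coalitions of a small feature set.

    `value_fn(coalition, rng)` may be stochastic (as with LLM inference),
    so it is averaged over `samples` repeated calls before being used.
    """
    rng = random.Random(seed)
    n = len(features)
    cache = {}

    def v(coalition):
        # Average repeated evaluations to tame a stochastic value function,
        # caching the result per coalition.
        key = frozenset(coalition)
        if key not in cache:
            cache[key] = sum(value_fn(key, rng) for _ in range(samples)) / samples
        return cache[key]

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        # Classic Shapley formula: weighted marginal contribution of f
        # over every coalition of the remaining features.
        for k in range(len(others) + 1):
            for s in itertools.combinations(others, k):
                weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                total += weight * (v(set(s) | {f}) - v(set(s)))
        phi[f] = total
    return phi
```

With a deterministic additive value function, each feature's Shapley value recovers its additive contribution exactly, which is a handy sanity check for any implementation.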

Perturb a Model, Not an Image: Towards Robust Privacy Protection via Anti-Personalized Diffusion Models

Recent advances in diffusion models have enabled high-quality synthesis of specific subjects, such as identities or objects. This capability, while unlocking new possibilities in content creation, also introduces significant privacy risks, as personalization techniques can be misused by malicious users to generate unauthorized content. Although several studies have attempted to counter this by generating adversarially perturbed samples designed to disrupt personalization, they rely on unrealistic assumptions and become ineffective in the presence of even a few clean images or under simple image transformations. To address these challenges, we shift the protection target from the images to the diffusion model itself to hinder the personalization of specific subjects, through our novel framework called Anti-Personalized Diffusion Models (APDM). We first provide a theoretical analysis demonstrating that a naive approach of existing loss functions to diffusion models is inherently incapable of ensuring convergence for robust anti-personalization. Motivated by this finding, we introduce Direct Protective Optimization (DPO), a novel loss function that effectively disrupts subject personalization in the target model without compromising generative quality. Moreover, we propose a new dual-path optimization strategy, coined Learning to Protect (L2P). By alternating between personalization and protection paths, L2P simulates future personalization trajectories and adaptively reinforces protection at each step. Experimental results demonstrate that our framework outperforms existing methods, achieving state-of-the-art performance in preventing unauthorized personalization. The code is available at https://github.com/KU-VGI/APDM.

Updated: 2025-11-03 07:42:05

标题: 扰动模型,而非图像:通过反个性化扩散模型实现强大的隐私保护

摘要: 最近在扩散模型方面取得的进展使得能够高质量地合成特定主题,如身份或对象。尽管这种能力在内容创作方面开辟了新的可能性,但也引入了重大的隐私风险,因为个性化技术可能被恶意用户滥用以生成未经授权的内容。虽然有几项研究试图通过生成对抗扰动样本来破坏个性化,但它们依赖于不切实际的假设,并且在存在少量干净图像或简单图像变换的情况下便会失效。为了解决这些挑战,我们通过名为反个性化扩散模型(APDM)的新框架,将保护目标从图像转移到扩散模型本身,以阻碍特定主题的个性化。我们首先提供了一个理论分析,证明将现有损失函数简单套用于扩散模型的朴素做法本质上无法确保稳健反个性化的收敛。受到这一发现的启发,我们引入了直接保护优化(DPO),这是一种新的损失函数,能在不损害生成质量的前提下有效扰乱目标模型中的主题个性化。此外,我们提出了一种新的双路径优化策略,称为学习保护(L2P)。通过在个性化和保护路径之间交替,L2P模拟未来的个性化轨迹,并在每一步自适应地加强保护。实验结果表明,我们的框架优于现有方法,在防止未经授权的个性化方面达到了最先进的性能。代码可在https://github.com/KU-VGI/APDM获取。

更新时间: 2025-11-03 07:42:05

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.01307v1

Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment

The reasoning capabilities of large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need for training effective small reasoning models. A critical challenge is that small models possess different reasoning capacities and cognitive trajectories compared with their larger counterparts. Hence, directly distilling chain-of-thought (CoT) rationales from large LRMs to smaller ones can sometimes be ineffective and often requires a substantial amount of annotated data. In this paper, we first introduce a novel Critique-Rethink-Verify (CRV) system, designed for training smaller yet powerful LRMs. Our CRV system consists of multiple LLM agents, each specializing in unique tasks: (i) critiquing the CoT rationales according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. Building on the CRV system, we further propose the Cognitive Preference Optimization (CogPO) algorithm to continuously enhance the reasoning abilities of smaller models by aligning their reasoning processes with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of our CRV+CogPO framework, which outperforms other methods by a large margin.

Updated: 2025-11-03 07:39:35

标题: 通过认知调整提高小型LLM的推理能力

摘要: 大型推理模型(LRM)的推理能力,如OpenAI的o1和DeepSeek-R1,通过深度思考已取得显著进展。然而,这些提升伴随着巨大的资源需求,突显了训练高效小型推理模型的必要性。一个关键挑战是,小型模型与其较大的对应模型相比,具有不同的推理能力和认知轨迹。因此,直接将大型LRMs的思维链(CoT)推理蒸馏到较小的模型有时可能是无效的,并且通常需要大量标注数据。在本文中,我们首先介绍了一个新颖的Critique-Rethink-Verify(CRV)系统,旨在训练更小但功能强大的LRMs。我们的CRV系统由多个LLM代理组成,每个代理专门负责独特的任务:(i)根据较小模型的认知能力批评CoT推理,(ii)根据批评重新思考和完善这些CoT,以及(iii)验证精炼结果的正确性。基于CRV系统,我们进一步提出了认知偏好优化(CogPO)算法,通过将较小模型的推理过程与其认知能力对齐,持续增强其推理能力。对具有挑战性的推理基准的全面评估显示了我们的CRV+CogPO框架的有效性,远远优于其他方法。

更新时间: 2025-11-03 07:39:35

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.09802v2

DeepSpecs: Expert-Level Questions Answering in 5G

5G technology enables mobile Internet access for billions of users. Answering expert-level questions about 5G specifications requires navigating thousands of pages of cross-referenced standards that evolve across releases. Existing retrieval-augmented generation (RAG) frameworks, including telecom-specific approaches, rely on semantic similarity and cannot reliably resolve cross-references or reason about specification evolution. We present DeepSpecs, a RAG system enhanced by structural and temporal reasoning via three metadata-rich databases: SpecDB (clause-aligned specification text), ChangeDB (line-level version diffs), and TDocDB (standardization meeting documents). DeepSpecs explicitly resolves cross-references by recursively retrieving referenced clauses through metadata lookup, and traces specification evolution by mining changes and linking them to Change Requests that document design rationale. We curate two 5G QA datasets: 573 expert-annotated real-world questions from practitioner forums and educational resources, and 350 evolution-focused questions derived from approved Change Requests. Across multiple LLM backends, DeepSpecs outperforms base models and state-of-the-art telecom RAG systems; ablations confirm that explicit cross-reference resolution and evolution-aware retrieval substantially improve answer quality, underscoring the value of modeling the structural and temporal properties of 5G standards.

Updated: 2025-11-03 07:39:22

标题: DeepSpecs:5G中的专家级问题回答

摘要: 5G技术使数十亿用户能够进行移动互联网访问。回答关于5G规范的专家级问题需要浏览数千页交叉引用的标准,这些标准会随版本不断演进。现有的检索增强生成(RAG)框架,包括电信领域的特定方法,依赖于语义相似性,无法可靠地解析交叉引用或推理规范演变。我们提出了DeepSpecs,一个借助三个元数据丰富的数据库(SpecDB:条款对齐的规范文本;ChangeDB:行级版本差异;TDocDB:标准化会议文件)进行结构和时间推理增强的RAG系统。DeepSpecs通过元数据查找递归检索被引用条款来显式解析交叉引用,并通过挖掘变更并将其链接到记录设计理念的变更请求(Change Request)来追踪规范演变。我们整理了两个5G问答数据集:来自从业者论坛和教育资源的573个专家标注的现实问题,以及从已批准的变更请求中衍生的350个关注演变的问题。在多个LLM后端上,DeepSpecs均优于基础模型和最先进的电信RAG系统;消融实验证实,显式的交叉引用解析和演变感知的检索显著改善了答案质量,凸显了对5G标准的结构和时间属性进行建模的价值。

更新时间: 2025-11-03 07:39:22

领域: cs.CL,cs.AI,cs.NI

下载: http://arxiv.org/abs/2511.01305v1
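
The recursive cross-reference resolution performed via metadata lookup can be sketched roughly as below. The clause ids, clause texts, and the `clause X.Y.Z` reference pattern are invented for illustration and do not come from any real 3GPP specification or from DeepSpecs itself.

```python
import re

# Toy stand-in for a clause-aligned spec database (SpecDB): clause id -> text.
SPEC_DB = {
    "5.3.1": "The UE shall apply the timer defined in clause 5.3.2.",
    "5.3.2": "The timer T300 is configured as described in clause 5.3.3.",
    "5.3.3": "T300 defaults to 1000 ms.",
}

CLAUSE_REF = re.compile(r"clause (\d+(?:\.\d+)*)")

def resolve(clause_id, db=SPEC_DB, max_depth=5, seen=None):
    """Recursively pull in every clause referenced, directly or transitively,
    by `clause_id`, mimicking metadata-lookup based cross-reference resolution.
    A depth cap and a `seen` set guard against cycles between clauses."""
    seen = set() if seen is None else seen
    if clause_id in seen or clause_id not in db or max_depth < 0:
        return {}
    seen.add(clause_id)
    result = {clause_id: db[clause_id]}
    for ref in CLAUSE_REF.findall(db[clause_id]):
        result.update(resolve(ref, db, max_depth - 1, seen))
    return result
```

Resolving the top clause then returns the whole referenced chain, so the retrieved context contains the definition (`T300 defaults to 1000 ms.`) even though it is two hops away from the clause the question targets.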

Limits of Safe AI Deployment: Differentiating Oversight and Control

Oversight and control, which we collectively call supervision, are often discussed as ways to ensure that AI systems are accountable, reliable, and able to fulfill governance and management requirements. However, the requirements for "human oversight" risk codifying vague or inconsistent interpretations of key concepts like oversight and control. This ambiguous terminology could undermine efforts to design or evaluate systems that must operate under meaningful human supervision. This matters because the term is used by regulatory texts such as the EU AI Act. This paper undertakes a targeted critical review of literature on supervision outside of AI, along with a brief summary of past work on the topic related to AI. We next differentiate control as ex-ante or real-time and operational rather than policy or governance, and oversight as performed ex-post, or a policy and governance function. Control aims to prevent failures, while oversight focuses on detection, remediation, or incentives for future prevention. Building on this, we make three contributions. 1) We propose a framework to align regulatory expectations with what is technically and organizationally plausible, articulating the conditions under which each mechanism is possible, where they fall short, and what is required to make them meaningful in practice. 2) We outline how supervision methods should be documented and integrated into risk management, and drawing on the Microsoft Responsible AI Maturity Model, we outline a maturity model for AI supervision. 3) We explicitly highlight boundaries of these mechanisms, including where they apply, where they fail, and where it is clear that no existing methods suffice. This foregrounds the question of whether meaningful supervision is possible in a given deployment context, and can support regulators, auditors, and practitioners in identifying both present and future limitations.

Updated: 2025-11-03 07:38:49

标题: 安全AI部署的限制:区分监督和控制

摘要: 监督和控制(我们统称为监管)通常被视为确保人工智能系统具备问责制、可靠性并能满足治理和管理要求的方式。然而,对"人类监督"的要求可能会将监督和控制等关键概念的模糊或不一致解释固化下来。这种含糊的术语可能会破坏设计或评估必须在有意义的人类监督下运行的系统的努力。这一点很重要,因为该术语被欧盟人工智能法案等监管文本使用。本文对人工智能以外领域的监管文献进行了有针对性的批判性审查,并简要总结了与人工智能相关的既往工作。接下来,我们将控制区分为事前或实时的、操作性而非政策或治理性的机制,而将监督区分为事后执行的政策和治理职能。控制旨在预防失败,而监督侧重于检测、纠正或为未来预防提供激励。基于此,我们做出了三项贡献。1)我们提出了一个框架,将监管期望与技术和组织上的可行性对齐,阐明每种机制可行的条件、它们的不足之处,以及在实践中使它们具有意义所需的条件。2)我们概述了监管方法应如何记录并整合到风险管理中,并借鉴微软负责任人工智能成熟度模型,概述了一个人工智能监管的成熟度模型。3)我们明确指出这些机制的边界,包括它们适用的场合、它们失效的场合,以及明显没有任何现有方法足以应对的场合。这凸显了在特定部署环境中是否可能实现有意义的监管这一问题,并可支持监管机构、审计员和从业者识别当前和未来的局限。

更新时间: 2025-11-03 07:38:49

领域: cs.AI,cs.SY,eess.SY,I.2; K.6; D.2.9

下载: http://arxiv.org/abs/2507.03525v2

Black-Box Differentially Private Nonparametric Confidence Intervals Under Minimal Assumptions

We introduce a simple, general framework that takes any differentially private estimator of any arbitrary quantity as a black box, and from it constructs a differentially private nonparametric confidence interval of that quantity. Our approach repeatedly subsamples the data, applies the private estimator to each subsample, and then post-processes the resulting empirical CDF to a confidence interval. Our analysis uses the randomness from the subsampling to achieve privacy amplification. Under mild assumptions, the empirical CDF we obtain approaches the CDF of the private statistic as the sample size grows. We use this to show that the confidence intervals we estimate are asymptotically valid, tight, and equivalent to their non-private counterparts. We provide empirical evidence that our method performs well compared with the (less-general) state-of-the-art algorithms.

Updated: 2025-11-03 07:38:30

标题: 在最小假设下的黑盒差分隐私非参数置信区间

摘要: 我们引入了一个简单且通用的框架,它将针对任意量的任何差分隐私估计器作为黑盒,并从中构建该量的差分隐私非参数置信区间。我们的方法反复对数据进行子采样,对每个子样本应用该私密估计器,然后对所得的经验CDF进行后处理以得到置信区间。我们的分析利用子采样带来的随机性来实现隐私放大。在温和的假设下,我们获得的经验CDF随样本量增长而逼近私密统计量的CDF。我们据此证明,我们估计的置信区间是渐近有效、紧致的,并等价于其非私密对应物。我们提供的实证证据表明,我们的方法与(通用性较低的)最先进算法相比表现良好。

更新时间: 2025-11-03 07:38:30

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2511.01303v1
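
The mechanics of the framework (subsample, apply the black-box private estimator, read the interval off the empirical CDF of the estimates) can be sketched as follows. The Laplace-noised mean stands in for an arbitrary private estimator, and all parameter values are illustrative; the sketch shows only the mechanics and does not reproduce the paper's privacy-amplification accounting.

```python
import math, random, statistics

def laplace_mean(xs, epsilon, rng, lo=0.0, hi=1.0):
    """A basic eps-DP mean on values clipped to [lo, hi]; this plays the
    role of the black-box private estimator and could be swapped out."""
    clipped = [min(max(x, lo), hi) for x in xs]
    sensitivity = (hi - lo) / len(clipped)
    # Sample Laplace noise via the inverse-CDF method.
    u = rng.random() - 0.5
    scale = sensitivity / epsilon
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return statistics.fmean(clipped) + noise

def private_confidence_interval(data, estimator, alpha=0.05,
                                n_subsamples=200, subsample_frac=0.5, seed=0):
    """Repeatedly subsample, apply the private estimator to each subsample,
    and take the (alpha/2, 1-alpha/2) quantiles of the resulting empirical
    CDF as the confidence interval."""
    rng = random.Random(seed)
    m = max(1, int(len(data) * subsample_frac))
    estimates = sorted(estimator(rng.sample(data, m), rng)
                       for _ in range(n_subsamples))
    lo_idx = int((alpha / 2) * n_subsamples)
    hi_idx = int((1 - alpha / 2) * n_subsamples) - 1
    return estimates[lo_idx], estimates[hi_idx]
```

On data spread uniformly over [0, 1], the resulting 95% interval is a narrow band around the true mean of 0.5, since both the subsampling spread and the Laplace noise shrink with the subsample size.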

Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

Motivated by practical applications where stable long-term performance is critical-such as robotics, operations research, and healthcare-we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.

Updated: 2025-11-03 07:38:02

标题: 分布鲁棒性平均奖励强化学习的样本复杂性

摘要: 受机器人技术、运筹学和医疗保健等对稳定长期性能要求严格的实际应用的启发,我们研究了分布鲁棒(DR)平均奖励强化学习问题。我们提出了两种算法,实现了接近最优的样本复杂度。第一种将问题归约为DR折扣马尔可夫决策过程(MDP),而第二种,锚定DR平均奖励MDP,引入了一个锚定状态,以稳定不确定性集合内受控的转移核。假设标称MDP是均匀遍历的,我们证明,只要不确定性半径足够小,这两种算法在基于KL和$f_k$-散度的不确定性集合下估计最优策略及鲁棒平均奖励时,均可达到$\widetilde{O}\left(|\mathbf{S}||\mathbf{A}|t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$的样本复杂度。这里,$\varepsilon$是目标精度,$|\mathbf{S}|$和$|\mathbf{A}|$表示状态和动作空间的大小,$t_{\mathrm{mix}}$是标称MDP的混合时间。这是DR平均奖励强化学习的第一个有限样本收敛保证。我们通过数值实验进一步验证了算法的收敛速度。

更新时间: 2025-11-03 07:38:02

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2505.10007v2

Language-Driven Coordination and Learning in Multi-Agent Simulation Environments

This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.

Updated: 2025-11-03 07:37:51

标题: 多智能体模拟环境中的语言驱动协调和学习

摘要: 本文介绍了LLM-MARL,一个将大型语言模型(LLMs)整合到多智能体强化学习(MARL)中的统一框架,以增强在模拟游戏环境中的协调、沟通和泛化能力。该框架包括协调器(Coordinator)、通信器(Communicator)和记忆(Memory)三个模块化组件,可动态生成子目标、促进智能体间的符号消息传递,并支持情节回忆。训练将PPO与语言条件损失和LLM查询门控相结合。LLM-MARL在Google Research Football、MAgent Battle和星际争霸II中进行了评估。结果显示,LLM-MARL在胜率、协调得分和零样本泛化方面均一致优于MAPPO和QMIX。消融研究表明,子目标生成和基于语言的消息传递各自对性能提升有显著贡献。定性分析揭示了角色专业化和通信驱动战术等涌现行为。通过在语言建模和策略学习之间架起桥梁,这项工作有助于在交互式模拟中设计智能、协作的智能体,并为在训练、游戏和人机协作所用的多智能体系统中利用LLMs提供了前进之路。

更新时间: 2025-11-03 07:37:51

领域: cs.AI,cs.LG,cs.MA

下载: http://arxiv.org/abs/2506.04251v4

DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding

Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that firstly utilizes learned quantization step sizes via learning for each quantization layer. We also incorporate selective compression with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches with decreased decoding time and reduced model size. The source code is publicly available at https://github.com/JooyoungLeeETRI/DeepHQ

Updated: 2025-11-03 07:36:08

标题: DeepHQ:用于渐进式深度图像编码的学习分层量化器

摘要: 与固定或可变码率图像编码不同,渐进式图像编码(PIC)旨在将不同质量的图像压缩成单一比特流,提高比特流利用的多样性,并相比联播(simulcast)压缩提供更高的压缩效率。基于神经网络(NN)的PIC研究尚处早期阶段,主要集中在以分层方式对变换后的潜在表示应用不同的量化步长。这些方法旨在随质量提升仅压缩逐步添加的信息,其依据是低质量压缩所用的较宽量化间隔包含多个用于高质量压缩的较窄子间隔。然而,现有方法基于手工设计的量化层次结构,导致次优的压缩效率。在本文中,我们提出了一种基于NN的渐进式编码方法,首次利用为每个量化层学习得到的量化步长。我们还引入了选择性压缩,对每个量化层仅压缩必要的表示分量。我们证明,我们的方法比现有方法实现了显著更高的编码效率,同时减少了解码时间和模型大小。源代码公开于https://github.com/JooyoungLeeETRI/DeepHQ。

更新时间: 2025-11-03 07:36:08

领域: eess.IV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2408.12150v2
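
The nested-interval idea behind progressive coding, where every wider interval of a coarse layer contains narrower sub-intervals of the finer layers, can be illustrated with a handcrafted step-halving schedule. DeepHQ's point is precisely that the per-layer step sizes should be learned rather than handcrafted, so the fixed halving below is purely illustrative.

```python
def progressive_quantize(x, base_step, levels):
    """Reconstruct x at successively finer quantization levels. Halving the
    step at each level splits every interval into two sub-intervals, so the
    reconstruction error at level k is bounded by step_k / 2 and the bound
    shrinks geometrically across levels."""
    recons = []
    step = base_step
    for _ in range(levels):
        recons.append(round(x / step) * step)
        step /= 2.0
    return recons
```

For example, `progressive_quantize(0.37, 0.5, 4)` produces a coarse reconstruction first and then refinements whose error stays within half of the (shrinking) step size, which is exactly the property a progressive bitstream exploits: early bits give a usable image, later bits only tighten it.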

LSHFed: Robust and Communication-Efficient Federated Learning with Locally-Sensitive Hashing Gradient Mapping

Federated learning (FL) enables collaborative model training across distributed nodes without exposing raw data, but its decentralized nature makes it vulnerable in trust-deficient environments. Inference attacks may recover sensitive information from gradient updates, while poisoning attacks can degrade model performance or induce malicious behaviors. Existing defenses often suffer from high communication and computation costs, or limited detection precision. To address these issues, we propose LSHFed, a robust and communication-efficient FL framework that simultaneously enhances aggregation robustness and privacy preservation. At its core, LSHFed incorporates LSHGM, a novel gradient verification mechanism that projects high-dimensional gradients into compact binary representations via multi-hyperplane locally-sensitive hashing. This enables accurate detection and filtering of malicious gradients using only their irreversible hash forms, thus mitigating privacy leakage risks and substantially reducing transmission overhead. Extensive experiments demonstrate that LSHFed maintains high model performance even when up to 50% of participants are collusive adversaries while achieving up to a 1000x reduction in gradient verification communication compared to full-gradient methods.

Updated: 2025-11-03 07:28:14

标题: LSHFed:具有局部敏感哈希梯度映射的稳健且通信高效的联邦学习

摘要: 联邦学习(FL)允许在分布式节点之间进行协作模型训练,而不暴露原始数据,但其去中心化的性质使其在信任不足的环境中容易受到攻击。推断攻击可能会从梯度更新中恢复敏感信息,而投毒攻击可以降低模型性能或诱发恶意行为。现有的防御通常面临高通信和计算成本,或检测精度有限。为了解决这些问题,我们提出了LSHFed,一个既稳健又通信高效的FL框架,同时增强了聚合的稳健性和隐私保护。在其核心,LSHFed结合了LSHGM,一种新颖的梯度验证机制,通过多超平面局部敏感哈希将高维梯度投影为紧凑的二进制表示。这使得仅凭不可逆的哈希形式即可准确检测和过滤恶意梯度,从而减轻隐私泄露风险,并大幅降低传输开销。大量实验表明,即使多达50%的参与者是串通的对手,LSHFed也能保持高模型性能,同时与全梯度方法相比,梯度验证通信量最多可减少1000倍。

更新时间: 2025-11-03 07:28:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.01296v1
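
A multi-hyperplane locally-sensitive hash of gradients, plus Hamming-distance filtering against a majority-vote signature, can be sketched as follows. The hyperplane count, the distance threshold, and the majority-vote consensus are illustrative assumptions of this sketch, not the actual choices made by LSHGM.

```python
import random

def lsh_signature(grad, hyperplanes):
    """Project a gradient onto each random hyperplane and keep only the
    sign bit, giving a compact, irreversible binary signature."""
    return tuple(1 if sum(g * h for g, h in zip(grad, hp)) >= 0 else 0
                 for hp in hyperplanes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def filter_updates(grads, n_planes=32, tau=None, seed=0):
    """Keep gradients whose signature is close (in Hamming distance) to the
    bitwise majority-vote signature across clients; flag the rest. A toy
    stand-in for LSHGM-style filtering on hashed gradients only."""
    rng = random.Random(seed)
    dim = len(grads[0])
    tau = tau if tau is not None else n_planes // 4
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    sigs = [lsh_signature(g, planes) for g in grads]
    consensus = tuple(1 if sum(s[i] for s in sigs) * 2 >= len(sigs) else 0
                      for i in range(n_planes))
    return [hamming(s, consensus) <= tau for s in sigs]
```

Because cosine-similar gradients land on the same side of most hyperplanes, honest updates share most signature bits, while a sign-flipped (malicious) gradient produces a near-complementary signature and is rejected without the server ever seeing the raw gradient.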

Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation

Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. The experiments show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay, fulfilling the operational standard for minute-level precision. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory data, enhances delay prediction. Moreover, the approach shows practicality and potential scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.

Updated: 2025-11-03 07:12:24

标题: 飞行延误预测:通过大型语言模型和飞行轨迹表示的跨模态适应

摘要: 航班延误预测已成为空中交通管理的重点关注,因为延误突出了影响整体网络性能的低效。本文提出了一种基于轻量级大型语言模型的多模态航班延误预测,从空中交通管制员监控航空器进入终端区后的延误角度进行建模。该方法将轨迹表示与文本航空信息(包括航班信息、天气报告和机场通告)相结合,通过将轨迹数据调整为语言模态来捕捉空域条件。实验表明,该模型通过有效利用与延误来源相关的上下文信息,始终实现亚分钟级的预测误差,满足分钟级精度的操作标准。该框架表明,语言理解与轨迹数据的跨模态适应相结合可以增强延误预测能力。此外,该方法显示出在现实运营中的实用性和潜在可扩展性,支持实时更新,根据接收到的新操作信息来完善预测。

更新时间: 2025-11-03 07:12:24

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2510.23636v2

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA family models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.1 on WikiText2. The proposed kernel achieves an average 1.39$\times$ speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33$\times$ inference acceleration and 1.21$\times$ memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.

Updated: 2025-11-03 07:08:02

标题: FlexQ:通过算法-系统协同设计实现LLM服务的高效后训练INT6量化

摘要: 大型语言模型(LLMs)展现出卓越的性能,但带来显著的内存和计算成本,限制了它们的实际部署。虽然现有的INT4/INT8量化降低了这些成本,但它们往往损失精度或缺乏最佳效率。INT6量化在模型精度和推理效率之间提供了更优的折衷,但在现代GPU中缺乏硬件支持,只能通过更高精度的算术单元进行仿真,限制了加速效果。在本文中,我们提出了FlexQ,一种结合算法创新与系统级优化的新颖后训练INT6量化框架。FlexQ在所有层上采用统一的6位权重量化,并在通过逐层敏感性分析识别出的层中自适应保留8位激活。为了最大化硬件效率,我们开发了一种专用的高性能GPU内核,借助二进制张量核心(BTC)等效实现支持W6A6和W6A8表示的矩阵乘法,有效弥补了原生INT6张量核心的缺失。对LLaMA系列模型的评估显示,FlexQ保持了接近FP16的精度,在WikiText2上的困惑度增加不超过0.1。所提出的内核在LLaMA-2-70B线性层上相比ABQ-LLM实现了平均1.39倍的加速。端到端来看,FlexQ相比SmoothQuant带来1.33倍的推理加速和1.21倍的内存节省。代码发布在https://github.com/FlyFoxPlayer/FlexQ。

更新时间: 2025-11-03 07:08:02

领域: cs.LG

下载: http://arxiv.org/abs/2508.04405v2
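
The algorithmic half of the scheme, uniform symmetric quantization of weights to signed 6-bit codes, can be sketched minimally as below. Per-tensor scaling is an assumption made for simplicity; FlexQ's GPU kernel, activation handling, and sensitivity analysis are not represented.

```python
def quantize_int6(weights):
    """Symmetric per-tensor uniform quantization to signed INT6 codes in
    [-32, 31]. Returns the integer codes and the scale needed to dequantize."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0  # map the largest magnitude onto the top code
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights from the INT6 codes."""
    return [qi * scale for qi in q]
```

The round-trip error of any in-range weight is at most half a quantization step (`scale / 2`), which is the basic accuracy/efficiency trade-off that a 6-bit code buys over 4-bit and under 8-bit.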

Epistemic Uncertainty for Generated Image Detection

We introduce a novel framework for AI-generated image detection through epistemic uncertainty, aiming to address critical security concerns in the era of generative models. Our key insight stems from the observation that distributional discrepancies between training and testing data manifest distinctively in the epistemic uncertainty space of machine learning models. In this context, the distribution shift between natural and generated images leads to elevated epistemic uncertainty in models trained on natural images when evaluating generated ones. Hence, we exploit this phenomenon by using epistemic uncertainty as a proxy for detecting generated images. This converts the challenge of generated image detection into the problem of uncertainty estimation, underscoring the generalization performance of the model used for uncertainty estimation. Fortunately, advanced large-scale vision models pre-trained on extensive natural images have shown excellent generalization performance for various scenarios. Thus, we utilize these pre-trained models to estimate the epistemic uncertainty of images and flag those with high uncertainty as generated. Extensive experiments demonstrate the efficacy of our method. Code is available at https://github.com/tmlr-group/WePe.

Updated: 2025-11-03 07:06:56

标题: 生成图像检测中的认知不确定性

摘要: 我们引入了一个新颖的框架,通过认识不确定性来进行人工智能生成图像检测,旨在解决生成模型时代的关键安全问题。我们的主要洞察力源于观察到训练和测试数据之间的分布差异在机器学习模型的认知不确定性空间中表现出明显的特征。在这种情况下,自然图像和生成图像之间的分布转移导致了在训练于自然图像的模型评估生成图像时认知不确定性升高。因此,我们利用这一现象,通过使用认知不确定性作为检测生成图像的代理。这将生成图像检测的挑战转化为不确定性估计问题,强调了用于不确定性估计的模型的泛化性能。幸运的是,预先在大量自然图像上训练的先进大规模视觉模型已经展示出在各种场景中的优秀泛化性能。因此,我们利用这些预训练模型来估计图像的认知不确定性,并标记那些具有高不确定性的图像为生成图像。大量实验证明了我们方法的有效性。代码可在https://github.com/tmlr-group/WePe获得。

更新时间: 2025-11-03 07:06:56

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2412.05897v2
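
The core idea, flagging inputs on which an ensemble of models disagrees, can be sketched with a toy ensemble. Real systems would use large pre-trained vision models and a calibrated threshold; the linear ensemble members and hand-picked numbers below are illustrative assumptions only.

```python
import random, statistics

def ensemble_uncertainty(x, members):
    """Epistemic uncertainty proxy: the variance of ensemble members'
    predictions. Members agree near the training distribution and diverge
    off-distribution, so the variance grows with distribution shift."""
    preds = [m(x) for m in members]
    return statistics.pvariance(preds)

def flag_generated(xs, members, threshold):
    """Flag inputs whose uncertainty exceeds the threshold as generated."""
    return [ensemble_uncertainty(x, members) > threshold for x in xs]
```

In the toy usage below, each member is a slightly different linear function, so the members nearly agree on a small (in-distribution-like) input and disagree strongly on a large (shifted) one, which is exactly the signal used for detection.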

"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.

Updated: 2025-11-03 07:04:22

标题: "只给出积极评价":对AI审稿人的文内提示注入攻击及防御措施的早期调查

摘要: 随着人工智能模型的快速发展,它们在各种任务中的部署日益普遍。一个显著的新兴应用是利用人工智能模型协助审查科学论文。然而,最近的报告显示,一些论文中包含隐藏的注入提示,旨在操纵人工智能审稿人给出过分正面的评价。在这项工作中,我们对这一新兴威胁进行了早期的系统调查。我们提出了两类攻击:(1)静态攻击,使用固定的注入提示;以及(2)迭代攻击,针对模拟审稿人模型优化注入提示,以最大化其效果。这两种攻击都取得了惊人的效果,经常能诱导前沿人工智能审稿人给出满分评价。此外,我们展示了这些攻击在各种设置下的鲁棒性。为了应对这一威胁,我们探讨了一种简单的基于检测的防御方法。虽然它大幅降低了攻击成功率,但我们证明一个自适应攻击者可以部分规避这种防御。我们的发现强调,在人工智能辅助的同行评审中,需要对提示注入威胁给予更多关注并建立严格的防护措施。

更新时间: 2025-11-03 07:04:22

领域: cs.CL,cs.CR

下载: http://arxiv.org/abs/2511.01287v1

Adaptation of Foundation Models for Medical Image Analysis: Strategies, Challenges, and Future Directions

Foundation models (FMs) have emerged as a transformative paradigm in medical image analysis, offering the potential to provide generalizable, task-agnostic solutions across a wide range of clinical tasks and imaging modalities. Their capacity to learn transferable representations from large-scale data has the potential to address the limitations of conventional task-specific models. However, adaptation of FMs to real-world clinical practice remains constrained by key challenges, including domain shifts, limited availability of high-quality annotated data, substantial computational demands, and strict privacy requirements. This review presents a comprehensive assessment of strategies for adapting FMs to the specific demands of medical imaging. We examine approaches such as supervised fine-tuning, domain-specific pretraining, parameter-efficient fine-tuning, self-supervised learning, hybrid methods, and multimodal or cross-modal frameworks. For each, we evaluate reported performance gains, clinical applicability, and limitations, while identifying trade-offs and unresolved challenges that prior reviews have often overlooked. Beyond these established techniques, we also highlight emerging directions aimed at addressing current gaps. These include continual learning to enable dynamic deployment, federated and privacy-preserving approaches to safeguard sensitive data, hybrid self-supervised learning to enhance data efficiency, data-centric pipelines that combine synthetic generation with human-in-the-loop validation, and systematic benchmarking to assess robust generalization under real-world clinical variability. By outlining these strategies and associated research gaps, this review provides a roadmap for developing adaptive, trustworthy, and clinically integrated FMs capable of meeting the demands of real-world medical imaging.

Updated: 2025-11-03 06:57:42

标题: 医学图像分析基础模型的适应性:策略、挑战和未来方向

摘要: 基础模型(FMs)已成为医学图像分析中的一种变革性范式,有潜力为广泛的临床任务和成像模态提供可泛化、任务无关的解决方案。它们从大规模数据中学习可迁移表示的能力,有望解决传统任务特定模型的局限性。然而,将FMs应用于实际临床实践仍受到关键挑战的限制,包括领域偏移、高质量标注数据的有限可用性、巨大的计算需求和严格的隐私要求。本综述全面评估了使FMs适应医学成像特定需求的策略。我们考察了监督微调、领域特定预训练、参数高效微调、自监督学习、混合方法以及多模态或跨模态框架等方法。对于每种方法,我们评估了已报道的性能提升、临床适用性和局限性,同时指出了以往综述经常忽视的权衡和未解决的挑战。除了这些已确立的技术之外,我们还强调了旨在填补当前差距的新兴方向,包括支持动态部署的持续学习、保护敏感数据的联邦与隐私保护方法、提升数据效率的混合自监督学习、将合成数据生成与人在回路验证相结合的以数据为中心的流水线,以及评估真实临床变异性下稳健泛化能力的系统化基准测试。通过概述这些策略和相关研究差距,本综述为开发能够满足现实医学成像需求的、适应性强、可信赖且与临床融合的FMs提供了一条路线图。

更新时间: 2025-11-03 06:57:42

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.01284v1

When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding

Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting model. While model-based methods like EAGLE-2 are accurate but costly, retrieval-enhanced methods like SAM-Decoding rely on heuristic switching strategies that often trigger unnecessary retrievals. To address this, we propose ReSpec (\textbf{Re}trieval-enhanced \textbf{Spe}culative Decoding), a novel framework that transforms heuristic drafter switching into adaptive decision-making. ReSpec features three core innovations: 1) An \textbf{entropy-guided adaptive trigger} quantifies contextual predictability to initiate retrieval only when uncertainty is low, avoiding costly low-quality speculations. 2) A \textbf{feedback-driven candidate selection} leverages historical feedback to organize multiple high-quality candidates for parallel verification, maximizing retrieval utility. 3) A source-aware \textbf{relaxed verification strategy} applies strict checks to model-generated drafts while using a relaxed verification for retrieved drafts, achieving a better balance between accuracy and efficiency. Extensive experiments on Spec-Bench demonstrate that ReSpec achieves state-of-the-art acceleration,outperforming EAGLE-2 and SAM-Decoding by over $33\%$ and $25\%$, respectively, while maintaining output quality.

Updated: 2025-11-03 06:57:16

标题: 何时,什么和如何:重新思考检索增强的推测解码

摘要: 推测解码(SD)已成为一种在不影响输出质量的前提下加速大型语言模型(LLM)推断的有效技术。然而,可实现的加速程度在很大程度上取决于起草模型的效果。基于模型的方法如EAGLE-2准确但代价高昂,而像SAM-Decoding这样的检索增强方法依赖于启发式切换策略,往往会触发不必要的检索。为了解决这个问题,我们提出了ReSpec(检索增强推测解码),一个将启发式起草器切换转化为自适应决策的新颖框架。ReSpec具有三个核心创新:1)一个熵引导的自适应触发器,量化上下文可预测性,仅在不确定性低时启动检索,避免昂贵的低质量推测。2)一个反馈驱动的候选选择机制,利用历史反馈组织多个高质量候选进行并行验证,最大化检索效用。3)一种源感知的宽松验证策略,对模型生成的草稿应用严格检查,而对检索得到的草稿使用宽松验证,在准确性和效率之间取得更好的平衡。对Spec-Bench的大量实验表明,ReSpec实现了最先进的加速,分别比EAGLE-2和SAM-Decoding高出33%和25%以上,同时保持输出质量。

更新时间: 2025-11-03 06:57:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2511.01282v1
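
The entropy-guided trigger can be sketched directly from its description: retrieval-based drafting fires only when the next-token distribution is sharply peaked. The threshold value below is an illustrative assumption, not ReSpec's actual setting.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_retrieve(next_token_probs, tau=1.0):
    """Trigger retrieval-based drafting only in low-entropy (highly
    predictable) contexts; otherwise defer to the model-based drafter."""
    return entropy(next_token_probs) < tau
```

A peaked distribution (one token near probability 1) falls well under the threshold and triggers retrieval, while a flat distribution over many tokens exceeds it, avoiding a retrieval attempt that would likely produce a low-quality draft.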

Adversarial Spatio-Temporal Attention Networks for Epileptic Seizure Forecasting

Forecasting epileptic seizures from multivariate EEG signals represents a critical challenge in healthcare time series prediction, requiring high sensitivity, low false alarm rates, and subject-specific adaptability. We present STAN, an Adversarial Spatio-Temporal Attention Network that jointly models spatial brain connectivity and temporal neural dynamics through cascaded attention blocks with alternating spatial and temporal modules. Unlike existing approaches that assume fixed preictal durations or separately process spatial and temporal features, STAN captures bidirectional dependencies between spatial and temporal patterns through a unified cascaded architecture. Adversarial training with gradient penalty enables robust discrimination between interictal and preictal states learned from clearly defined 15-minute preictal windows. Continuous 90-minute pre-seizure monitoring reveals that the learned spatio-temporal attention patterns enable early detection: reliable alarms trigger at subject-specific times (typically 15-45 minutes before onset), reflecting the model's capacity to capture subtle preictal dynamics without requiring individualized training. Experiments on two benchmark EEG datasets (CHB-MIT scalp: 8 subjects, 46 events; MSSM intracranial: 4 subjects, 14 events) demonstrate state-of-the-art performance: 96.6% sensitivity with 0.011 false detections per hour and 94.2% sensitivity with 0.063 false detections per hour, respectively, while maintaining computational efficiency (2.3M parameters, 45 ms latency, 180 MB memory) for real-time edge deployment. Beyond epilepsy, the proposed framework provides a general paradigm for spatio-temporal forecasting in healthcare and other time series domains where individual heterogeneity and interpretability are crucial.

Updated: 2025-11-03 06:48:54

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.01275v1
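
The alternating spatial/temporal cascade can be sketched shape-wise. This toy block uses unparameterized self-attention and omits the adversarial training, so it illustrates the cascade structure only, not STAN itself.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Unparameterized self-attention over the first axis of x (a stand-in
    for one attention module; real blocks add learned projections)."""
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

def cascaded_block(eeg):
    """One alternating step in the spirit of the cascade above: attend
    across electrodes at each time step, then across time for each
    electrode, so spatial and temporal dependencies inform each other."""
    channels, time, feat = eeg.shape
    spatial = np.stack([self_attention(eeg[:, t, :]) for t in range(time)], axis=1)
    temporal = np.stack([self_attention(spatial[c]) for c in range(channels)], axis=0)
    return temporal
```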

Double-Signed Fragmented DNSSEC for Countering Quantum Threat

DNSSEC, a DNS security extension, is essential to accurately translating domain names to IP addresses. Digital signatures provide the foundation for this reliable translation; however, the evolution of 'Quantum Computers' has made traditional digital signatures vulnerable. In light of this, NIST has recently selected potential post-quantum digital signatures that can operate on conventional computers and resist attacks made with Quantum Computers. Since these post-quantum digital signatures are still in their early stages of development, replacing pre-quantum digital signature schemes in DNSSEC with post-quantum candidates is risky until the post-quantum candidates have undergone a thorough security analysis. Given this, herein, we investigate the viability of employing 'Double-Signatures' in DNSSEC, combining a post-quantum digital signature and a classic one. The rationale is that double-signatures will offer protection against quantum threats on conventional signature schemes as well as unknown non-quantum attacks on post-quantum signature schemes, hence even if one fails, the other provides security guarantees. However, the inclusion of two signatures in the DNSSEC response message conflicts with the maximum allowed size of DNSSEC responses (i.e., 1232B, a limitation enforced by the MTU of physical links). To counter this issue, we devise application-layer fragmentation for DNSSEC responses carrying two signatures. We implement our solution on top of OQS-BIND and, through experiments, show that the addition of two signatures in DNSSEC and application-layer fragmentation of all relevant resource records and their reassembly does not have a substantial impact on the efficiency of the resolution process and thus is suitable for the interim period at least until the quantum computers are fully realized.

Updated: 2025-11-03 06:43:09

Categories: cs.CR

Download: http://arxiv.org/abs/2411.07535v2
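
The application-layer fragmentation idea can be sketched as follows; the 4-byte (index, count) fragment header is an assumed wire format for illustration, not the encoding used by the paper or by OQS-BIND.

```python
MAX_UDP_PAYLOAD = 1232  # the maximum DNSSEC response size cited above

def fragment(payload: bytes, max_size: int = MAX_UDP_PAYLOAD) -> list:
    """Split an oversized (double-signed) response into application-layer
    fragments, each prefixed with a 2-byte index and a 2-byte total count."""
    chunk = max_size - 4  # leave room for the header
    parts = [payload[i:i + chunk] for i in range(0, len(payload), chunk)]
    return [i.to_bytes(2, "big") + len(parts).to_bytes(2, "big") + p
            for i, p in enumerate(parts)]

def reassemble(fragments: list) -> bytes:
    """Reorder fragments by index and strip the headers."""
    ordered = sorted(fragments, key=lambda f: int.from_bytes(f[:2], "big"))
    return b"".join(f[4:] for f in ordered)
```

Each fragment fits within the 1232-byte limit, and the resolver can reassemble the full double-signed response even if fragments arrive out of order.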

Rescuing the Unpoisoned: Efficient Defense against Knowledge Corruption Attacks on RAG Systems

Large language models (LLMs) are reshaping numerous facets of our daily lives, leading widespread adoption as web-based services. Despite their versatility, LLMs face notable challenges, such as generating hallucinated content and lacking access to up-to-date information. Lately, to address such limitations, Retrieval-Augmented Generation (RAG) has emerged as a promising direction by generating responses grounded in external knowledge sources. A typical RAG system consists of i) a retriever that probes a group of relevant passages from a knowledge base and ii) a generator that formulates a response based on the retrieved content. However, as with other AI systems, recent studies demonstrate the vulnerability of RAG, such as knowledge corruption attacks by injecting misleading information. In response, several defense strategies have been proposed, including having LLMs inspect the retrieved passages individually or fine-tuning robust retrievers. While effective, such approaches often come with substantial computational costs. In this work, we introduce RAGDefender, a resource-efficient defense mechanism against knowledge corruption (i.e., by data poisoning) attacks in practical RAG deployments. RAGDefender operates during the post-retrieval phase, leveraging lightweight machine learning techniques to detect and filter out adversarial content without requiring additional model training or inference. Our empirical evaluations show that RAGDefender consistently outperforms existing state-of-the-art defenses across multiple models and adversarial scenarios: e.g., RAGDefender reduces the attack success rate (ASR) against the Gemini model from 0.89 to as low as 0.02, compared to 0.69 for RobustRAG and 0.24 for Discern-and-Answer when adversarial passages outnumber legitimate ones by a factor of four (4x).

Updated: 2025-11-03 06:39:58

Categories: cs.CR,cs.AI,cs.IR,D.4.6; K.6.5

Download: http://arxiv.org/abs/2511.01268v1
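
The abstract does not spell out RAGDefender's detection algorithm, but a consistency-based filter conveys the post-retrieval idea; the cosine-similarity scoring rule below is an assumption for illustration only.

```python
import numpy as np

def filter_retrieved(embeddings: np.ndarray, keep_ratio: float = 0.8) -> np.ndarray:
    """Post-retrieval filtering sketch: rank each retrieved passage by its
    mean cosine similarity to the other passages and keep the most mutually
    consistent ones, on the intuition that poisoned passages drift away
    from the consensus of legitimate retrievals."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, 0.0)          # ignore self-similarity
    scores = sims.mean(axis=1)
    k = max(1, int(len(embeddings) * keep_ratio))
    return np.argsort(scores)[::-1][:k]  # indices of passages to keep
```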

Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.

Updated: 2025-11-03 06:12:40

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2511.01261v1
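
The agreement numbers above are Pearson correlations between judge scores and human ratings; for reference, the statistic itself is a one-liner plus normalization:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product
    of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```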

Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization

Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.

Updated: 2025-11-03 06:07:17

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.27181v2
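
The two-level idea can be sketched in a few lines. The ratio and schedule below are assumptions for illustration, not DPHR's published formulas: a negative's difficulty is its similarity relative to the positive pair, and a progress factor in [0, 1] ramps hard-negative emphasis up as training matures.

```python
def hardness_weights(neg_sims, pos_sim, progress, gamma=2.0):
    """Progressive hardness-aware reweighting sketch: early in training
    (progress near 0) all negatives weigh equally, damping noisy gradients;
    late in training (progress near 1) negatives whose similarity approaches
    the positive's are emphasized."""
    return [(s / pos_sim) ** (gamma * progress) for s in neg_sims]
```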

Graph Neural Network-Based Semi-Supervised Open-Set Fault Diagnosis for Marine Machinery Systems

Recently, fault diagnosis methods for marine machinery systems based on deep learning models have attracted considerable attention in the shipping industry. Most existing studies assume fault classes are consistent and known between the training and test datasets, and these methods perform well under controlled environment. In practice, however, previously unseen or unknown fault types (i.e., out-of-distribution or open-set observations not present during training) can occur, causing such methods to fail and posing a significant challenge to their widespread industrial deployment. To address this challenge, this paper proposes a semi-supervised open-set fault diagnosis (SOFD) framework that enhances and extends the applicability of deep learning models in open-set fault diagnosis scenarios. The framework includes a reliability subset construction process, which uses a multi-layer fusion feature representation extracted by a supervised feature learning model to select an unlabeled test subset. The labeled training set and pseudo-labeled test subset are then fed into a semi-supervised diagnosis model to learn discriminative features for each class, enabling accurate classification of known faults and effective detection of unknown samples. Experimental results on a public maritime benchmark dataset demonstrate the effectiveness and superiority of the proposed SOFD framework.

Updated: 2025-11-03 06:06:25

Categories: cs.AI

Download: http://arxiv.org/abs/2511.01258v1
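
The reliability-subset step can be made concrete with a confidence rule; the threshold criterion below is an assumption for illustration (the paper selects the subset using multi-layer fusion feature representations, not raw softmax confidence).

```python
def reliable_subset(probs, threshold=0.9):
    """Keep only unlabeled samples whose top class probability is high
    enough, and pseudo-label them with that class; the resulting pairs feed
    the semi-supervised diagnosis model alongside the labeled training set."""
    subset = []
    for i, p in enumerate(probs):
        top = max(p)
        if top >= threshold:
            subset.append((i, p.index(top)))  # (sample index, pseudo-label)
    return subset
```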

Detecting Vulnerabilities from Issue Reports for Internet-of-Things

Timely identification of issue reports reflecting software vulnerabilities is crucial, particularly for Internet-of-Things (IoT) where analysis is slower than non-IoT systems. While Machine Learning (ML) and Large Language Models (LLMs) detect vulnerability-indicating issues in non-IoT systems, their IoT use remains unexplored. We are the first to tackle this problem by proposing two approaches: (1) combining ML and LLMs with Natural Language Processing (NLP) techniques to detect vulnerability-indicating issues of 21 Eclipse IoT projects and (2) fine-tuning a pre-trained BERT Masked Language Model (MLM) on 11,000 GitHub issues for classifying vulnerability-indicating issues. Our best performance belongs to a Support Vector Machine (SVM) trained on BERT NLP features, achieving an Area Under the receiver operator characteristic Curve (AUC) of 0.65. The fine-tuned BERT achieves 0.26 accuracy, emphasizing the importance of exposing all data during training. Our contributions set the stage for accurately detecting IoT vulnerabilities from issue reports, similar to non-IoT systems.

Updated: 2025-11-03 05:59:34

Categories: cs.SE,cs.AI,cs.CR

Download: http://arxiv.org/abs/2511.01941v1
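
The headline AUC metric can be computed without any ML library; a minimal rank-based implementation (the Mann-Whitney formulation) looks like this:

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive example
    (label 1) is scored above a randomly chosen negative (label 0),
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.65, as reported for the SVM above, means a vulnerability-indicating issue outranks a benign one about 65% of the time.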

Quantum Deep Learning Still Needs a Quantum Leap

Quantum computing technology is advancing rapidly. Yet, even accounting for these trends, a quantum leap would be needed for quantum computers to meaningfully impact deep learning over the coming decade or two. We arrive at this conclusion based on a first-of-its-kind survey of quantum algorithms and how they match potential deep learning applications. This survey reveals three important areas where quantum computing could potentially accelerate deep learning, each of which faces a challenging roadblock to realizing its potential. First, quantum algorithms for matrix multiplication and other algorithms central to deep learning offer small theoretical improvements in the number of operations needed, but this advantage is overwhelmed on practical problem sizes by how slowly quantum computers do each operation. Second, some promising quantum algorithms depend on practical Quantum Random Access Memory (QRAM), which is underdeveloped. Finally, there are quantum algorithms that offer large theoretical advantages, but which are only applicable to special cases, limiting their practical benefits. In each of these areas, we support our arguments using quantitative forecasts of quantum advantage that build on the work by Choi et al. [2023] as well as new research on limitations and quantum hardware trends. Our analysis outlines the current scope of quantum deep learning and points to research directions that could lead to greater practical advances in the field.

Updated: 2025-11-03 05:49:49

Categories: quant-ph,cs.AI,cs.LG,81P68, 68T07, 68Q12, 68Q25, 65F10,F.1.2; F.2.1; I.2.6; I.2.8; G.1.3

Download: http://arxiv.org/abs/2511.01253v1

Amortized Active Generation of Pareto Sets

We introduce active generation of Pareto sets (A-GPS), a new framework for online discrete black-box multi-objective optimization (MOO). A-GPS learns a generative model of the Pareto set that supports a-posteriori conditioning on user preferences. The method employs a class probability estimator (CPE) to predict non-dominance relations and to condition the generative model toward high-performing regions of the search space. We also show that this non-dominance CPE implicitly estimates the probability of hypervolume improvement (PHVI). To incorporate subjective trade-offs, A-GPS introduces preference direction vectors that encode user-specified preferences in objective space. At each iteration, the model is updated using both Pareto membership and alignment with these preference directions, producing an amortized generative model capable of sampling across the Pareto front without retraining. The result is a simple yet powerful approach that achieves high-quality Pareto set approximations, avoids explicit hypervolume computation, and flexibly captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.

Updated: 2025-11-03 05:27:43

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.21052v2
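
The non-dominance relation that A-GPS's class probability estimator learns to predict can be made concrete with a small reference implementation (maximization convention assumed):

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every objective and
    strictly better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_set(points):
    """Indices of the non-dominated points. Predicting this membership
    relation directly is what lets the method avoid explicit hypervolume
    computation."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]
```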

Multi-Focused Video Group Activities Hashing

With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, most existing methods retrieve at the granularity of entire videos rather than of individual group activities. To solve this problem, we propose, for the first time, a spatiotemporal interleaved video hashing (STVH) technique. Through a unified framework, the STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution of both group visual features and positional features. Moreover, real-life video retrieval scenarios may sometimes call for activity features and at other times for the visual features of objects. We then further propose a novel M-STVH (multi-focused spatiotemporal video hashing) as an enhanced version to handle this difficult task. The advanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantics features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH can achieve excellent results.

Updated: 2025-11-03 05:25:07

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.00490v2

Complex QA and language models hybrid architectures, Survey

This paper reviews the state-of-the-art of large language models (LLM) architectures and strategies for "complex" question-answering with a focus on hybrid architectures. LLM based chatbot services have allowed anyone to grasp the potential of LLM to solve many common problems, but soon discovered their limitations for complex questions. Addressing more specific, complex questions (e.g., "What is the best mix of power-generation methods to reduce climate change ?") often requires specialized architectures, domain knowledge, new skills, decomposition and multi-step resolution, deep reasoning, sensitive data protection, explainability, and human-in-the-loop processes. Therefore, we review: (1) necessary skills and tasks for handling complex questions and common LLM limits to overcome; (2) dataset, cost functions and evaluation metrics for measuring and improving (e.g. accuracy, explainability, fairness, robustness, groundedness, faithfulness, toxicity...); (3) family of solutions to overcome LLM limitations by (a) training and reinforcement (b) hybridization, (c) prompting, (d) agentic-architectures (agents, tools) and extended reasoning.

Updated: 2025-11-03 05:24:08

Categories: cs.CL,cs.AI,cs.IR,cs.LG

Download: http://arxiv.org/abs/2302.09051v5

Eyes on Target: Gaze-Aware Object Detection in Egocentric Video

Human gaze offers rich supervisory signals for understanding visual attention in complex visual environments. In this paper, we propose Eyes on Target, a novel depth-aware and gaze-guided object detection framework designed for egocentric videos. Our approach injects gaze-derived features into the attention mechanism of a Vision Transformer (ViT), effectively biasing spatial feature selection toward human-attended regions. Unlike traditional object detectors that treat all regions equally, our method emphasises viewer-prioritised areas to enhance object detection. We validate our method on an egocentric simulator dataset where human visual attention is critical for task assessment, illustrating its potential in evaluating human performance in simulation scenarios. We evaluate the effectiveness of our gaze-integrated model through extensive experiments and ablation studies, demonstrating consistent gains in detection accuracy over gaze-agnostic baselines on both the custom simulator dataset and public benchmarks, including Ego4D Ego-Motion and Ego-CH-Gaze datasets. To interpret model behaviour, we also introduce a gaze-aware attention head importance metric, revealing how gaze cues modulate transformer attention dynamics.

Updated: 2025-11-03 05:21:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.01237v1
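
Injecting gaze-derived features into ViT attention can be sketched as an additive logit bias; the additive form and the alpha scale are assumptions for illustration, not the paper's exact mechanism.

```python
import numpy as np

def gaze_biased_attention(q, k, v, gaze, alpha=1.0):
    """Boost attention logits for patches with high gaze saliency before
    the softmax, so spatial feature selection favors human-attended
    regions."""
    logits = q @ k.T / np.sqrt(q.shape[-1])  # standard scaled dot-product
    logits = logits + alpha * gaze           # bias keys by gaze saliency
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```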

MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification

Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.

Updated: 2025-11-03 05:18:33

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2502.07409v5

Riemannian Consistency Model

Consistency models are a class of generative models that enable few-step generation for diffusion and flow matching models. While consistency models have achieved promising results on Euclidean domains like images, their applications to Riemannian manifolds remain challenging due to the curved geometry. In this work, we propose the Riemannian Consistency Model (RCM), which, for the first time, enables few-step consistency modeling while respecting the intrinsic manifold constraint imposed by the Riemannian geometry. Leveraging the covariant derivative and exponential-map-based parameterization, we derive the closed-form solutions for both discrete- and continuous-time training objectives for RCM. We then demonstrate theoretical equivalence between the two variants of RCM: Riemannian consistency distillation (RCD) that relies on a teacher model to approximate the marginal vector field, and Riemannian consistency training (RCT) that utilizes the conditional vector field for training. We further propose a simplified training objective that eliminates the need for the complicated differential calculation. Finally, we provide a unique kinematics perspective for interpreting the RCM objective, offering new theoretical angles. Through extensive experiments, we manifest the superior generative quality of RCM in few-step generation on various non-Euclidean manifolds, including flat-tori, spheres, and the 3D rotation group SO(3).

Updated: 2025-11-03 05:11:44

Categories: cs.LG

Download: http://arxiv.org/abs/2510.00983v2
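
The exponential-map parameterization is the ingredient that keeps samples on the manifold. For the unit sphere it has a closed form; this is a generic illustration of the idea, not RCM's full training objective.

```python
import numpy as np

def exp_map_sphere(x, v):
    """Exponential map on the unit sphere: follow the geodesic from point x
    along tangent vector v (v assumed orthogonal to x). Updates expressed
    this way stay on the manifold instead of drifting into ambient space."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return x
    return np.cos(norm_v) * x + np.sin(norm_v) * (v / norm_v)
```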

Scientific Machine Learning with Kolmogorov-Arnold Networks

The field of scientific machine learning, which originally utilized multilayer perceptrons (MLPs), is increasingly adopting Kolmogorov-Arnold Networks (KANs) for data encoding. This shift is driven by the limitations of MLPs, including poor interpretability, fixed activation functions, and difficulty capturing localized or high-frequency features. KANs address these issues with enhanced interpretability and flexibility, enabling more efficient modeling of complex nonlinear interactions and effectively overcoming the constraints associated with conventional MLP architectures. This review categorizes recent progress in KAN-based models across three distinct perspectives: (i) data-driven learning, (ii) physics-informed modeling, and (iii) deep-operator learning. Each perspective is examined through the lens of architectural design, training strategies, application efficacy, and comparative evaluation against MLP-based counterparts. By benchmarking KANs against MLPs, we highlight consistent improvements in accuracy, convergence, and spectral representation, clarifying KANs' advantages in capturing complex dynamics while learning more effectively. In addition to reviewing recent literature, this work also presents several comparative evaluations that clarify central characteristics of KAN modeling and hint at their potential implications for real-world applications. Finally, this review identifies critical challenges and open research questions in KAN development, particularly regarding computational efficiency, theoretical guarantees, hyperparameter tuning, and algorithm complexity. We also outline future research directions aimed at improving the robustness, scalability, and physical consistency of KAN-based frameworks.

Updated: 2025-11-03 05:04:45

Categories: cs.LG,cs.CE,math-ph,math.MP

Download: http://arxiv.org/abs/2507.22959v2
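
KANs replace an MLP's fixed activations with learnable univariate functions on edges. A piecewise-linear toy version (a simplification of the B-spline parameterization used in practice) makes the contrast concrete:

```python
import numpy as np

def kan_edge(x, knots, values):
    """One KAN edge: a learnable univariate function represented as a
    piecewise-linear interpolant on a fixed knot grid; the `values` at the
    knots are the trainable parameters."""
    return np.interp(x, knots, values)

def kan_layer(x, edge_params, knots):
    """A KAN layer sums learned univariate functions of each input, rather
    than applying a fixed activation to a weighted sum as an MLP does."""
    return np.array([sum(kan_edge(xi, knots, vals) for xi, vals in zip(x, row))
                     for row in edge_params])
```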

Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.

Updated: 2025-11-03 05:03:18

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2510.26466v2

Influence-aware Causal Autoencoder Network for Node Importance Ranking in Complex Networks

Node importance ranking is a fundamental problem in graph data analysis. Existing approaches typically rely on node features derived from either traditional centrality measures or advanced graph representation learning methods, which depend directly on the target network's topology. However, this reliance on structural information raises privacy concerns and often leads to poor generalization across different networks. In this work, we address a key question: Can we design a node importance ranking model trained exclusively on synthetic networks that is effectively applicable to real-world networks, eliminating the need to rely on the topology of target networks and improving both practicality and generalizability? We answer this question affirmatively by proposing the Influence-aware Causal Autoencoder Network (ICAN), a novel framework that leverages causal representation learning to obtain robust, invariant node embeddings for cross-network ranking tasks. Firstly, ICAN introduces an influence-aware causal representation learning module within an autoencoder architecture to extract node embeddings that are causally related to node importance. Moreover, we introduce a causal ranking loss and design a unified optimization framework that jointly optimizes the reconstruction and ranking objectives, enabling mutual reinforcement between node representation learning and ranking optimization. This design allows ICAN, trained on synthetic networks, to generalize effectively across diverse real-world graphs. Extensive experiments on multiple benchmark datasets demonstrate that ICAN consistently outperforms state-of-the-art baselines in terms of both ranking accuracy and generalization capability.
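The joint reconstruction-plus-ranking objective can be sketched as follows. This is our simplified stand-in, not the paper's exact loss: we substitute a pairwise hinge on predicted importance scores for the causal ranking loss, and `lam`, `joint_loss`, and the margin of 1.0 are illustrative choices.

```python
import numpy as np

def joint_loss(x, x_recon, scores, importance, lam=0.5):
    """Unified objective sketch: autoencoder reconstruction error plus a
    pairwise ranking hinge that pushes more-important nodes to score higher."""
    recon = np.mean((x - x_recon) ** 2)          # reconstruction objective
    rank, n = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if importance[i] > importance[j]:    # i should outrank j
                rank += max(0.0, 1.0 - (scores[i] - scores[j]))  # margin hinge
                n += 1
    return recon + lam * rank / max(n, 1)
```

Optimizing both terms jointly is what lets representation learning and ranking reinforce each other, as the abstract describes.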

Updated: 2025-11-03 05:01:22

标题: 在复杂网络中节点重要性排名的影响感知因果自动编码器网络

摘要: 节点重要性排名是图数据分析中的一个基本问题。现有方法通常依赖于从传统中心性度量或先进的图表示学习方法导出的节点特征,这些方法直接依赖于目标网络的拓扑结构。然而,对结构信息的依赖引发了隐私问题,通常导致在不同网络之间的泛化能力较差。在这项工作中,我们提出了一个关键问题:我们是否可以设计一个仅在合成网络上训练的节点重要性排名模型,有效地应用于真实网络,消除对目标网络拓扑的依赖,提高实用性和泛化性?我们肯定地回答了这个问题,通过提出了一种新颖的框架——影响感知因果自动编码器网络(ICAN),利用因果表示学习获取稳健、不变的节点嵌入,用于跨网络排名任务。首先,ICAN在自动编码器架构中引入了一个影响感知因果表示学习模块,提取与节点重要性因果相关的节点嵌入。此外,我们引入了一个因果排名损失,并设计了一个统一的优化框架,共同优化重构和排名目标,实现节点表示学习和排名优化之间的互相增强。这种设计使ICAN在合成网络上训练后,能够有效地在各种真实世界图中泛化。在多个基准数据集上进行的广泛实验表明,ICAN在排名准确性和泛化能力方面始终优于最先进的基线方法。

更新时间: 2025-11-03 05:01:22

领域: cs.SI,cs.AI

下载: http://arxiv.org/abs/2511.01228v1

EgoBlind: Towards Egocentric Visual Assistance for the Blind

We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60%, which is far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.

Updated: 2025-11-03 04:52:24

标题: EgoBlind:面向盲人的自我中心视觉辅助

摘要: 我们提出了EgoBlind,这是第一个从盲人那里收集的以自我为中心的VideoQA数据集,旨在评估当代多模态大型语言模型(MLLMs)的辅助能力。EgoBlind包括来自盲人和视力受损个体日常生活的1,392个第一人称视频。它还包含5,311个问题,这些问题是盲人直接提出或验证的,以反映他们在现场需要视觉帮助。为了减少主观性,每个问题都有平均3个手动标注的参考答案。使用EgoBlind,我们全面评估了16种先进的MLLM,并发现所有模型都面临困难。表现最好的模型的准确率接近60%,远远落后于人类的87.4%的表现。为了引导未来的进步,我们确定并总结了现有MLLM在盲人自我为中心视觉辅助方面的主要限制,并探索了启发式解决方案以提高性能。通过这些努力,我们希望EgoBlind将成为开发有效的AI助手的基础,以增强盲人和视力受损者的独立性。数据和代码可在https://github.com/doc-doc/EgoBlind获取。

更新时间: 2025-11-03 04:52:24

领域: cs.CV,cs.AI,cs.MM

下载: http://arxiv.org/abs/2503.08221v4

Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2

Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.

Updated: 2025-11-03 04:48:17

标题: 通过短长记忆SAM 2加速体积医学图像标注

摘要: 体积医学图像的手动注释,如磁共振成像(MRI)和计算机断层扫描(CT),是一项耗时且费力的过程。最近在视频目标分割基础模型方面的进展,例如Segment Anything Model 2(SAM 2),为通过手动注释一个或几个切片,然后将目标蒙版传播到整个体积提供了显著加速注释过程的机会。然而,在这种情况下,SAM 2的性能存在差异。我们的实验表明,依赖单一记忆库和注意力模块容易出现错误传播,特别是在目标在上一个切片中存在但在当前切片中不存在的边界区域。为了解决这个问题,我们提出了短长记忆SAM 2(SLM-SAM 2),这是一种创新的架构,它集成了独立的短期和长期记忆库,配备单独的注意力模块以提高分割准确性。我们在涵盖MRI、CT和超声视频的四个公共数据集上评估了SLM-SAM 2,涵盖器官、骨骼和肌肉。我们展示了所提出的方法明显优于默认的SAM 2,在5个卷和1个卷可用于初始适应的情况下,平均Dice相似系数提高了0.14和0.10。SLM-SAM 2还表现出更强的抗传播能力,与SAM 2相比,每个体积减少了60.575%的校正传播蒙版所需的时间,这是朝着更准确地自动注释医学图像以开发分割模型迈出的显著一步。

更新时间: 2025-11-03 04:48:17

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2505.01854v2

Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness

Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations. While adversarial training remains the predominant defense strategy, it often incurs significant computational cost and may compromise clean-data accuracy. In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions-specifically, rotation- and scale-equivariant layers-into standard convolutional neural networks (CNNs). These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries and greater resilience to adversarial attacks. We propose and evaluate two symmetry-aware architectures: a parallel design that processes standard and equivariant features independently before fusion, and a cascaded design that applies equivariant operations sequentially. Theoretically, we demonstrate that such models reduce hypothesis space complexity, regularize gradients, and yield tighter certified robustness bounds under the CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness) framework. Empirically, our models consistently improve adversarial robustness and generalization across CIFAR-10, CIFAR-100, and CIFAR-10C under both FGSM and PGD attacks, without requiring adversarial training. These findings underscore the potential of symmetry-enforcing architectures as efficient and principled alternatives to data augmentation-based defenses.
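The parallel design above can be illustrated with a cheap group-averaging stand-in for true rotation-equivariant convolutions: average features over the four 90-degree rotations of the input, then fuse with the standard branch. Note this is only a sketch under our own simplification (group averaging yields rotation-*invariant* features, a special case of the equivariant layers the paper uses); `feat_fn` is an assumed user-supplied feature extractor.

```python
import numpy as np

def rot_invariant_features(x, feat_fn):
    """Symmetry-aware branch: average features over the C4 orbit of the input
    (all four 90-degree rotations), so the output is stable under rotation."""
    return np.mean([feat_fn(np.rot90(x, k)) for k in range(4)], axis=0)

def parallel_fusion(x, feat_fn):
    """Parallel design from the abstract: standard and symmetry-aware features
    are computed independently and fused by concatenation."""
    return np.concatenate([feat_fn(x), rot_invariant_features(x, feat_fn)])
```

The symmetry-aware branch gives identical outputs for an input and its rotation, which is the kind of structured prior the paper argues smooths decision boundaries.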

Updated: 2025-11-03 04:37:21

标题: 桥接对称性和稳健性:论等变性在增强对抗性稳健性中的作用

摘要: 对抗性样本揭示了深度神经网络对于不可感知输入扰动的敏感性,从而暴露了其关键漏洞。虽然对抗性训练仍然是主要的防御策略,但通常会产生显著的计算成本,并可能损害干净数据的准确性。在这项工作中,我们通过将群等变卷积-具体来说是旋转和尺度等变层-嵌入标准卷积神经网络(CNNs)中,研究了一种面向对抗性鲁棒性的架构方法。这些层编码了与输入空间中的结构化转换对齐的对称先验,促进了更平滑的决策边界和更大的对抗性攻击韧性。我们提出并评估了两种对称感知架构:一种并行设计,它在融合之前独立处理标准和等变特征,另一种级联设计,它依次应用等变操作。从理论上讲,我们证明了这种模型减少了假设空间的复杂性,正则化了梯度,并在CLEVER(网络鲁棒性的交叉Lipschitz极值)框架下产生了更严格的认证鲁棒性界限。从经验上讲,我们的模型在不需要对抗性训练的情况下,在CIFAR-10、CIFAR-100和CIFAR-10C上持续提高了对抗性鲁棒性和泛化性,抵御了FGSM和PGD攻击。这些发现强调了对称强制架构作为数据增强型防御的高效和原则性替代方案的潜力。

更新时间: 2025-11-03 04:37:21

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.16171v3

Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering

The immense diversity in the culture and culinary traditions of Indian cuisine calls attention to a major shortcoming of existing Visual Question Answering (VQA) systems, which are inclined towards foods from Western regions. A recent attempt to build a VQA dataset for Indian food is a step towards addressing this challenge. However, that approach to VQA follows a two-step process in which the answer is generated first, followed by an explanation of the expected answer. In this work, we claim that food VQA requires a multi-step reasoning process to arrive at an accurate answer, especially in the context of Indian food, which involves understanding complex culinary context and identifying relationships between various food items. With this hypothesis, we create reasoning chains over the QA pairs with minimal human intervention. We fine-tune smaller LLMs and VLMs with auto-validated reasoning chains and further train them using reinforcement learning on larger data. With the augmentation of reasoning chains, we observed an average accuracy improvement of 10 percentage points over the baseline. We provide a detailed analysis of the effect of adding reasoning chains for the Indian Food VQA task. Index Terms - FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.

Updated: 2025-11-03 04:13:24

标题: 思想为食:推理链引发的食物视觉问题回答

摘要: 印度菜系在文化与烹饪上的巨大多样性凸显了现有视觉问答(VQA)系统的主要缺陷:它们偏向西方地区的食物。最近为印度食物构建VQA数据集的尝试是解决这一挑战的一步。然而,其VQA方法遵循一个两步过程:先生成答案,再对预期答案进行解释。在这项工作中,我们主张食物VQA需要遵循多步推理过程才能得出准确答案,尤其是在印度食物的背景下,这涉及理解复杂的烹饪背景并识别各种食物之间的关系。基于这一假设,我们在QA之上以最少的人工干预创建推理链。我们使用自动验证的推理链对较小的LLM和VLM进行微调,并进一步用更大规模的数据通过强化学习训练它们。通过增加推理链,我们观察到相对基线平均提高了10个百分点的准确率。我们详细分析了增加推理链对印度食物VQA任务的影响。 关键词- 食物VQA,推理链,强化学习,知识图谱。

更新时间: 2025-11-03 04:13:24

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.01213v1

Learning Nonholonomic Dynamics with Constraint Discovery

We consider learning nonholonomic dynamical systems while discovering the constraints, and describe in detail the case of the rolling disk. A nonholonomic system is a system subject to nonholonomic constraints. Unlike holonomic constraints, nonholonomic constraints do not define a sub-manifold on the configuration space. Therefore, the inverse problem of finding the constraints has to involve the tangent bundle. This paper discusses a general procedure to learn the dynamics of a nonholonomic system through Hamel's formalism, while discovering the system constraint by parameterizing it, given the data set of discrete trajectories on the tangent bundle $TQ$. We prove that there is a local minimum for convergence of the network. We also preserve symmetry of the system by reducing the Lagrangian to the Lie algebra of the selected group.
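The tangent-bundle claim above can be made concrete. With our choice of symbols (not necessarily the paper's), a nonholonomic constraint is a non-integrable, velocity-level condition, and the rolling disk is the classical example:

```latex
% A nonholonomic constraint is linear in velocities and non-integrable,
% so it lives on the tangent bundle rather than defining a sub-manifold of Q:
A(q)\,\dot{q} = 0, \qquad (q,\dot{q}) \in TQ .
% For the vertical rolling disk with contact point (x, y), heading \varphi,
% rolling angle \theta, and radius R, the no-slip constraints read
\dot{x} = R\,\dot{\theta}\cos\varphi, \qquad \dot{y} = R\,\dot{\theta}\sin\varphi .
```

Because these conditions couple velocities rather than coordinates alone, discovering them from data requires trajectories on $TQ$, which is exactly the data set the paper assumes.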

Updated: 2025-11-03 03:57:52

标题: 学习非完整动力学与约束发现

摘要: 我们考虑在发现约束条件的同时学习非完整动力系统,并详细描述了滚动圆盘的情形。非完整系统是受非完整约束限制的系统。与完整约束不同,非完整约束不在构型空间上定义子流形。因此,寻找约束的逆问题必须涉及切丛。本文讨论了在给定切丛$TQ$上离散轨迹数据集的情况下,通过Hamel形式主义学习非完整系统动力学、同时以参数化方式发现系统约束的一般过程。我们证明了网络收敛存在局部极小值。我们还通过将拉格朗日量约化到所选群的李代数来保持系统的对称性。

更新时间: 2025-11-03 03:57:52

领域: math.DS,cs.LG,math-ph,math.MP

下载: http://arxiv.org/abs/2410.15201v3

Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

Large language models (LLMs) have demonstrated remarkable capabilities in numerous real-world applications. While the vast majority of research conducted from an experimental perspective is progressing rapidly, it demands substantial computational power, data, and other resources. Therefore, how to open the black-box of LLMs from a theoretical standpoint has become a critical challenge. This paper takes the theory of rate-distortion function, directed information, and Granger causality as its starting point to investigate the information-theoretic principles behind LLMs, leading to the development of semantic information theory for LLMs, where the fundamental unit is token, rather than bits that lacks any semantic meaning. By defining the probabilistic model of LLMs, we discuss structure-agnostic information-theoretic measures, such as the directed rate-distortion function in pre-training, the directed rate-reward function in post-training, and the semantic information flow in inference phase. This paper also delves deeply into the theory of token-level semantic embedding and the information-theoretically optimal vectorization method. Thereafter, we propose a general definition of autoregression LLM, where the Transformer architecture and its performance such as ELBO, generalization error bound, memory capacity, and semantic information measures can be derived theoretically. Other architectures, such as Mamba/Mamba2 and LLaDA, are also discussed in our framework. Consequently, this paper provides a theoretical framework for understanding LLMs from the perspective of semantic information theory, which also offers the necessary theoretical tools for further in-depth research.

Updated: 2025-11-03 03:56:34

标题: 忘记比特,一切皆关乎令牌:面向LLM的语义信息理论

摘要: 大型语言模型(LLMs)在许多实际应用中展示了卓越的能力。虽然绝大多数从实验角度进行的研究正在迅速发展,但这需要大量的计算能力、数据和其他资源。因此,如何从理论角度打开LLMs的黑匣子已成为一个关键挑战。本文以率失真函数理论、定向信息和Granger因果关系为起点,探讨LLMs背后的信息论原则,进而发展出LLMs的语义信息理论,其中基本单元是标记(token),而不是缺乏任何语义意义的比特。通过定义LLMs的概率模型,我们讨论结构无关的信息论度量,例如预训练中的定向率失真函数、后训练中的定向率-奖励函数,以及推理阶段的语义信息流。本文还深入探讨了标记级语义嵌入的理论和信息论上最优的向量化方法。随后,我们提出了自回归LLM的一般定义,由此可以从理论上推导出Transformer架构及其性能,如ELBO、泛化误差界限、记忆容量和语义信息度量。我们还在我们的框架中讨论了其他架构,如Mamba/Mamba2和LLaDA。因此,本文提供了一个从语义信息理论的角度理解LLMs的理论框架,也为进一步深入研究提供了必要的理论工具。

更新时间: 2025-11-03 03:56:34

领域: cs.IT,cs.AI,math.IT

下载: http://arxiv.org/abs/2511.01202v1

Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.

Updated: 2025-11-03 03:48:04

标题: 视觉基础模型可以成为潜在扩散模型的良好分词器

摘要: 潜在扩散模型(LDMs)的性能在很大程度上取决于其视觉标记器的质量。虽然最近的研究探索了通过蒸馏整合视觉基础模型(VFMs),但我们发现这种方法存在一个根本性缺陷:它不可避免地削弱了与原始VFM对齐的稳健性,导致对齐后的潜变量在分布偏移下发生语义偏离。在本文中,我们通过提出一种更直接的方法——视觉基础模型变分自动编码器(VFM-VAE)来绕过蒸馏。为了解决VFM的语义侧重与像素级保真度需求之间的固有张力,我们重新设计了VFM-VAE解码器,引入了多尺度潜在融合和渐进分辨率重建模块,从而实现了从空间粗糙的VFM特征进行高质量重建。此外,我们对扩散训练期间的表示动态进行了全面分析,并引入所提出的SE-CKNNA指标作为更精确的诊断工具。这种分析使我们能够开发出一种标记器与扩散联合对齐策略,大大加快了收敛速度。我们在标记器设计和训练策略方面的创新带来了卓越的性能和效率:我们的系统在仅80个epoch内就达到了2.20的gFID(无CFG)(比先前的标记器快10倍)。继续训练至640个epoch后,它进一步达到了1.62的gFID(无CFG),确立了直接整合VFM作为LDMs的更优范式。

更新时间: 2025-11-03 03:48:04

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2510.18457v2

Computational Basis of LLM's Decision Making in Social Simulation

Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM's behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM's internal representations in a Dictator Game -- a classic behavioral experiment on fairness and prosocial behavior. We extract "vectors of variable variations" (e.g., "male" to "female") from the LLM's internal state. Manipulating these vectors during the model's inference can substantially alter how those variables relate to the model's decision-making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer-based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications, strengthening sociological theory and measurement.
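The "vector of variable variations" idea above has a standard minimal form: take the mean difference of hidden states between two settings of a variable, then add it back during inference. This sketch is our own illustration (function names and the simple mean-difference estimator are assumptions; the paper's probing may be more elaborate).

```python
import numpy as np

def variable_vector(states_a, states_b):
    """Estimate the direction separating two settings of a variable
    (e.g. persona 'male' vs 'female') as a mean hidden-state difference."""
    return np.mean(states_b, axis=0) - np.mean(states_a, axis=0)

def steer(hidden, vec, alpha=1.0):
    """Intervene at inference time: shift a hidden state along the
    variation vector, scaled by alpha."""
    return hidden + alpha * vec
```

Manipulating `alpha` (including negative values) is what lets one probe how strongly the encoded variable drives the model's downstream decision.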

Updated: 2025-11-03 03:47:20

标题: 社会模拟中LLM决策的计算基础

摘要: 大型语言模型(LLMs)越来越多地在社会科学和应用场景中充当类人决策代理。这些LLM代理通常被赋予类人的角色设定,并被置于现实生活情境之中。然而,这些角色和情境如何塑造LLM的行为仍未得到充分探讨。本研究在独裁者博弈——一个关于公平和亲社会行为的经典行为实验——中提出并测试了探测、量化和修改LLM内部表示的方法。我们从LLM的内部状态中提取"变量变化向量"(例如,从"男性"到"女性")。在模型推理过程中操纵这些向量,可以显著改变这些变量与模型决策之间的关系。这种方法为研究和调控社会概念如何在基于Transformer的模型中被编码与工程化提供了一条有原则的途径,对学术和商业应用中用于社会模拟的AI代理的对齐、去偏见和设计具有重要意义,并有助于加强社会学理论与测量。

更新时间: 2025-11-03 03:47:20

领域: cs.AI,cs.CY,cs.LG,econ.GN,q-fin.EC

下载: http://arxiv.org/abs/2504.11671v3

An Interdisciplinary and Cross-Task Review on Missing Data Imputation

Missing data is a fundamental challenge in data science, significantly hindering analysis and decision-making across a wide range of disciplines, including healthcare, bioinformatics, social science, e-commerce, and industrial monitoring. Despite decades of research and numerous imputation methods, the literature remains fragmented across fields, creating a critical need for a comprehensive synthesis that connects statistical foundations with modern machine learning advances. This work systematically reviews core concepts-including missingness mechanisms, single versus multiple imputation, and different imputation goals-and examines problem characteristics across various domains. It provides a thorough categorization of imputation methods, spanning classical techniques (e.g., regression, the EM algorithm) to modern approaches like low-rank and high-rank matrix completion, deep learning models (autoencoders, GANs, diffusion models, graph neural networks), and large language models. Special attention is given to methods for complex data types, such as tensors, time series, streaming data, graph-structured data, categorical data, and multimodal data. Beyond methodology, we investigate the crucial integration of imputation with downstream tasks like classification, clustering, and anomaly detection, examining both sequential pipelines and joint optimization frameworks. The review also assesses theoretical guarantees, benchmarking resources, and evaluation metrics. Finally, we identify critical challenges and future directions, emphasizing model selection and hyperparameter optimization, the growing importance of privacy-preserving imputation via federated learning, and the pursuit of generalizable models that can adapt across domains and data types, thereby outlining a roadmap for future research.

Updated: 2025-11-03 03:43:43

标题: 一篇关于缺失数据插补的跨学科和跨任务综述

摘要: 缺失数据是数据科学中的一个基本挑战,严重阻碍了跨越医疗保健、生物信息学、社会科学、电子商务和工业监测等各个领域的分析和决策。尽管经过几十年的研究和众多的插补方法,文献仍然在各个领域之间分散,这导致了对于将统计基础与现代机器学习进展联系起来的全面综合的迫切需求。本研究系统地审视了核心概念,包括缺失机制、单一与多重插补以及不同的插补目标,并检查了各个领域的问题特征。它提供了插补方法的彻底分类,涵盖了传统技术(如回归、EM算法)到现代方法,如低秩和高秩矩阵完成、深度学习模型(自动编码器、GANs、扩散模型、图神经网络)和大型语言模型。特别关注于复杂数据类型的方法,如张量、时间序列、流数据、图结构数据、分类数据和多模态数据。除了方法论,我们还研究了插补与分类、聚类和异常检测等下游任务的关键整合,检查了顺序管道和联合优化框架。审查还评估了理论保证、基准资源和评估指标。最后,我们确定了关键挑战和未来方向,强调了模型选择和超参数优化的重要性,通过联邦学习实现隐私保护插补的日益重要性,以及追求能够适应各个领域和数据类型的可泛化模型,从而勾勒出未来研究的路线图。

更新时间: 2025-11-03 03:43:43

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.01196v1

A Topology-Aware Graph Convolutional Network for Human Pose Similarity and Action Quality Assessment

Action Quality Assessment (AQA) requires fine-grained understanding of human motion and precise evaluation of pose similarity. This paper proposes a topology-aware Graph Convolutional Network (GCN) framework, termed GCN-PSN, which models the human skeleton as a graph to learn discriminative, topology-sensitive pose embeddings. Using a Siamese architecture trained with a contrastive regression objective, our method outperforms coordinate-based baselines and achieves competitive performance on AQA-7 and FineDiving benchmarks. Experimental results and ablation studies validate the effectiveness of leveraging skeletal topology for pose similarity and action quality assessment.

Updated: 2025-11-03 03:38:24

标题: 一种拓扑感知的图卷积网络用于人体姿势相似度和动作质量评估

摘要: 动作质量评估(AQA)需要对人类动作进行细致理解,并精确评估姿势相似性。本文提出了一种称为GCN-PSN的拓扑感知图卷积网络(GCN)框架,将人类骨架建模为图形以学习具有区分性、拓扑敏感的姿势嵌入。利用具有对比回归目标的孪生架构进行训练,我们的方法胜过基于坐标的基线,并在AQA-7和FineDiving基准测试中取得竞争性表现。实验结果和消融研究验证了利用骨骼拓扑来进行姿势相似性和动作质量评估的有效性。

更新时间: 2025-11-03 03:38:24

领域: cs.CV,cs.AI,68T07 (Artificial neural networks and deep learning), 68U10 (Computer graphics, computational geometry)

下载: http://arxiv.org/abs/2511.01194v1

Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning

Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers. We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This is a process that naturally selects for solutions stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers. Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results at the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
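The harmonic-mean aggregation rule described above is simple enough to sketch directly. This is an illustrative reading of the abstract, not the paper's code; the function name, frequency normalization, and smoothing constant are our own choices.

```python
from collections import Counter

def harmonic_pseudo_label(original_answers, reframed_answers, eps=1e-9):
    """Pick the pseudo-label by the harmonic mean of an answer's frequency
    in the original view and in the reframed (paraphrased) view. Only answers
    that are frequent in BOTH views can score high, which filters out
    view-dependent, spurious answers."""
    f_orig = Counter(original_answers)
    f_re = Counter(reframed_answers)

    def score(a):
        p = f_orig.get(a, 0) / max(len(original_answers), 1)
        q = f_re.get(a, 0) / max(len(reframed_answers), 1)
        return 2 * p * q / (p + q + eps)  # harmonic mean of the two frequencies

    return max(set(f_orig) | set(f_re), key=score)
```

Note the contrast with majority voting: an answer that dominates the original view but collapses under reframing loses to one that is stable across both views.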

Updated: 2025-11-03 03:34:34

标题: 自我和谐:学习在测试时强化学习中协调自我监督与自我博弈

摘要: 测试时强化学习(TTRL)提供了一种无标签范式,仅在推理时使用合成信号来调整模型,但其成功取决于构建可靠的学习信号。多数投票等标准方法经常坍缩到虚假但流行的答案上。我们引入了Self-Harmony,一个建立在简单直觉之上的框架:正确答案应当在原始问题及其改述版本之间保持稳定。Self-Harmony通过让单一模型扮演两个互补角色来实现这一点:求解者(Solver)负责生成答案,改述者(Reframer)负责改写输入。基于此,我们进一步提出了一种伪标签方法:不采用多数投票,而是使用调和平均在原始视角和改述视角之间聚合答案频率。这一过程自然地选择在改述下仍然稳定的解,从而避免了偏向依赖特定视角的虚假答案这一常见陷阱。关键是,这不需要人工监督或辅助模型。在各种推理基准测试中,Self-Harmony在无标签测试时设置下取得了最先进的结果,在多种方法的30个设置中有28个排名第一。除了准确性外,它还展示了空前的稳健性,在所有实验中零训练失败,凸显了其稳定性和可靠性。

更新时间: 2025-11-03 03:34:34

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.01191v1

ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction

The rapid spread of fake news threatens social stability and public trust, rendering its detection an imperative research priority. Although large language models (LLMs) excel at numerous natural language processing tasks with their remarkable contextual understanding and extensive prior knowledge, the time-bounded knowledge coverage and tendency for generating hallucination content reduce their reliability when handling fast-evolving news streams. Furthermore, models trained on existing static datasets also often lack the generalization needed for emerging news topics. To address these challenges, we propose ZoFia, a novel two-stage zero-shot fake news detection framework. First, we introduce Hierarchical Salience to quantify the importance of entities in the news content, and propose the SC-MMR algorithm to effectively select an informative and diverse set of keywords that serve as queries for retrieving up-to-date external evidence. Subsequently, a multi LLM interactive system, in which each agent assumes a distinct role, performs multi-view collaborative analysis and adversarial debate over the news text and its related information, and finally produces an interpretable and robust judgment. Comprehensive experiments on two public datasets demonstrate that ZoFia obviously outperforms existing zero-shot baselines and most of few-shot methods. Our codes will be open-sourced to facilitate related communities.
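The keyword-selection step can be illustrated with plain Maximal Marginal Relevance, the family SC-MMR belongs to. This generic sketch omits the paper's salience-specific weighting; `relevance` (per-entity salience scores) and `similarity` are assumed user-supplied.

```python
def mmr_select(candidates, relevance, similarity, k=3, lam=0.7):
    """Greedy MMR: at each step pick the candidate with the best trade-off
    between its own relevance and its redundancy with already-selected
    keywords, yielding an informative AND diverse query set."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr(w):
            redundancy = max((similarity(w, s) for s in selected), default=0.0)
            return lam * relevance[w] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```

The diversity term is what keeps the retrieved evidence from collapsing onto a single facet of the news item.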

Updated: 2025-11-03 03:29:42

标题: ZoFia:利用实体引导检索和多LLM交互实现零样本假新闻检测

摘要: 快速传播的假新闻威胁社会稳定和公众信任,使其检测成为迫切的研究重点。尽管大型语言模型(LLMs)在许多自然语言处理任务中表现出色,具有出色的上下文理解能力和广泛的先验知识,但时间限制的知识覆盖和生成幻觉内容的倾向降低了它们在处理快速演变的新闻流时的可靠性。此外,基于现有静态数据集训练的模型通常也缺乏应对新兴新闻主题所需的泛化能力。为了解决这些挑战,我们提出了一种新颖的两阶段零样本假新闻检测框架ZoFia。首先,我们引入了分层显著性来量化新闻内容中实体的重要性,并提出了SC-MMR算法,以有效选择一组信息丰富且多样化的关键词,作为检索最新外部证据的查询。随后,一个多LLM交互系统,其中每个代理扮演不同角色,对新闻文本及其相关信息进行多视角协作分析和对抗性辩论,最终产生可解释且稳健的判断。对两个公共数据集的全面实验表明,ZoFia明显优于现有的零样本基线和大多数少样本方法。我们的代码将开源以促进相关社区。

更新时间: 2025-11-03 03:29:42

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2511.01188v1

Damper-B-PINN: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Vehicle State Estimation

Accurate state estimation is fundamental to intelligent vehicles. Wheel load, one of the most important chassis states, serves as an essential input for advanced driver assistance systems (ADAS) and exerts a direct influence on vehicle stability and safety. However, wheel load estimation remains challenging due to the complexity of chassis modeling and the susceptibility of nonlinear systems to noise. To address these issues, this paper first introduces a refined suspension linkage-level modeling approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon this, we propose a damper characteristics-based Bayesian physics-informed neural network (Damper-B-PINN) framework to estimate dynamic wheel load, which leverages the suspension dynamics as physical guidance of PINN while employing Bayesian inference to mitigate the effects of system noise and uncertainty. Moreover, a damper-characteristic physics conditioning (DPC) module is designed for embedding physical prior. The proposed Damper-B-PINN is evaluated using both high-fidelity simulation datasets generated by CarSim software and real-world datasets collected from a Formula Student race car. Experimental results demonstrate that our Damper-B-PINN consistently outperforms existing methods across various test conditions, particularly extreme ones. These findings highlight the potential of the proposed Damper-B-PINN framework to enhance the accuracy and robustness of dynamic wheel load estimation, thereby improving the reliability and safety of ADAS applications.

Updated: 2025-11-03 03:21:16

标题: Damper-B-PINN:基于减震器特性的贝叶斯物理信息神经网络用于车辆状态估计

摘要: 准确的状态估计是智能车辆的基础。车轮载荷是最重要的底盘状态之一,对于先进驾驶辅助系统(ADAS)起着至关重要的作用,并直接影响车辆的稳定性和安全性。然而,由于底盘建模的复杂性和非线性系统对噪声的敏感性,车轮载荷估计仍然具有挑战性。为了解决这些问题,本文首先引入了一种精细的悬架连杆级建模方法,通过明确考虑悬架的复杂几何结构构建了一个非线性瞬时动态模型。在此基础上,我们提出了一种基于减震器特性的贝叶斯物理信息神经网络(Damper-B-PINN)框架,用于估计动态车轮载荷,利用悬架动力学作为PINN的物理指导,并利用贝叶斯推断来减轻系统噪声和不确定性的影响。此外,设计了一个减震器特性物理条件(DPC)模块用于嵌入物理先验。通过使用CarSim软件生成的高保真模拟数据集和从一辆Formula Student比赛车收集的真实数据集对所提出的Damper-B-PINN进行评估。实验结果表明,我们的Damper-B-PINN在各种测试条件下始终优于现有方法,特别是在极端条件下表现更好。这些发现突出了所提出的Damper-B-PINN框架提升动态车轮载荷估计的准确性和稳健性,从而提高了ADAS应用的可靠性和安全性。

更新时间: 2025-11-03 03:21:16

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.20772v2

QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. The recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. Firstly, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Secondly, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge. Addressing these challenges, this paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. Leveraging this dataset, we first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities. Experiments demonstrate that our method significantly improves both the functional correctness and the performance of LLM-generated assembly code. Compared to baseline prompts, the functional correctness rates improved from 44% to 64% on x86_64 and from 36% to 58% on aarch64, respectively. More significantly, among the 16 correctly generated x86_64 programs using our method, 14 (87.5%) surpassed clang-O3 performance.
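The self-evolving loop described above reduces to a compact control flow: attempt, test, fold the debugging trace back into the prompt, retry. This skeleton is our own sketch; `compile_and_test` and `revise` stand in for the paper's toolchain and LLM-backed prompt reviser.

```python
def self_evolve(prompt, compile_and_test, revise, rounds=3):
    """Self-evolving prompt optimization sketch: after each failed attempt,
    insights from the self-debugging trace are folded back into the prompt."""
    for _ in range(rounds):
        ok, trace = compile_and_test(prompt)  # e.g. assemble, link, run tests
        if ok:
            return prompt
        prompt = revise(prompt, trace)        # evolve the internal strategy
    return prompt
```

In the paper's setting, `compile_and_test` would assemble the LLM-generated code and run functional checks, and `revise` would be an LLM call conditioned on the failure trace.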

Updated: 2025-11-03 03:20:26

标题: QiMeng-NeuComBack:从IR到汇编代码的自进化翻译

摘要: 编译器是必不可少的,但它们以复杂而著称,需要昂贵的人类专业知识来开发和维护。最近大型语言模型(LLMs)的进展提供了一个引人注目的新范式:神经编译,这可能简化对新架构的编译器开发,并促进创新优化技术的发现。然而,一些关键障碍阻碍了其实际采用。首先,缺乏专门的基准测试和健全的评估方法阻碍了对该领域进展的客观评估和跟踪。其次,系统地提高LLM生成的汇编代码的可靠性和性能仍然是一个关键挑战。针对这些挑战,本文介绍了NeuComBack,一个专门为IR到汇编编译设计的新型基准数据集。利用该数据集,我们首先定义了一个基础的神经编译工作流程,并对最近的前沿LLMs在神经编译上的能力进行了全面评估,建立了新的性能基线。我们进一步提出了一种自我演化的提示优化方法,使LLMs能够通过从先前的自我调试跟踪中提取见解来迭代地发展其内部提示策略,从而增强其神经编译能力。实验证明,我们的方法显著提高了LLM生成的汇编代码的功能正确性和性能。与基准提示相比,在x86_64上功能正确性率从44%提高到64%,在aarch64上从36%提高到58%。更重要的是,在使用我们方法正确生成的16个x86_64程序中,有14个(87.5%)超过了clang-O3的性能。

更新时间: 2025-11-03 03:20:26

领域: cs.AI

下载: http://arxiv.org/abs/2511.01183v1

MiRAGE: Misconception Detection with Retrieval-Guided Multi-Stage Reasoning and Ensemble Fusion

Detecting student misconceptions in open-ended responses is a longstanding challenge, demanding semantic precision and logical reasoning. We propose MiRAGE - Misconception Detection with Retrieval-Guided Multi-Stage Reasoning and Ensemble Fusion, a novel framework for automated misconception detection in mathematics. MiRAGE operates in three stages: (1) a Retrieval module narrows a large candidate pool to a semantically relevant subset; (2) a Reasoning module employs chain-of-thought generation to expose logical inconsistencies in student solutions; and (3) a Reranking module refines predictions by aligning them with the reasoning. These components are unified through an ensemble-fusion strategy that enhances robustness and interpretability. On mathematics datasets, MiRAGE achieves Mean Average Precision scores of 0.82/0.92/0.93 at levels 1/3/5, consistently outperforming individual modules. By coupling retrieval guidance with multi-stage reasoning, MiRAGE reduces dependence on large-scale language models while delivering a scalable and effective solution for educational assessment.
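The three-stage design maps onto a small pipeline skeleton. This is a structural sketch only: the stage callables (`retrieve`, `reason`, `rerank`) and `top_k` are assumed placeholders for the paper's retrieval model, chain-of-thought generator, and reranker.

```python
def mirage_pipeline(response, candidates, retrieve, reason, rerank, top_k=5):
    """MiRAGE-style skeleton: (1) retrieval narrows a large misconception
    pool to a relevant shortlist, (2) chain-of-thought reasoning exposes the
    logical flaw in the student response, (3) reranking refines predictions
    by aligning them with that reasoning."""
    shortlist = retrieve(response, candidates)[:top_k]  # stage 1: narrow pool
    rationale = reason(response, shortlist)             # stage 2: CoT trace
    return rerank(shortlist, rationale)                 # stage 3: refine
```

Ensemble fusion would then combine the outputs of several such pipelines; we omit that step here for brevity.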

Updated: 2025-11-03 03:17:36

Categories: cs.AI

Download: http://arxiv.org/abs/2511.01182v1

Provable Generalization Bounds for Deep Neural Networks with Momentum-Adaptive Gradient Dropout

Deep neural networks (DNNs) achieve remarkable performance but often suffer from overfitting due to their high capacity. We introduce Momentum-Adaptive Gradient Dropout (MAGDrop), a novel regularization method that dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, enhancing stability in non-convex optimization landscapes. To theoretically justify MAGDrop's effectiveness, we derive a non-asymptotic, computable PAC-Bayes generalization bound that accounts for its adaptive nature, achieving up to 29.2\% tighter bounds compared to standard approaches by leveraging momentum-driven perturbation control. Empirically, the activation-based MAGDrop achieves competitive performance on MNIST (99.52\%) and CIFAR-10 (92.03\%), with generalization gaps of 0.48\% and 6.52\%, respectively. We provide fully reproducible code and numerical computation of our bounds to validate our theoretical claims. Our work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization, making it suitable for high-stakes applications.
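The abstract does not give MAGDrop's exact formula; a hypothetical sketch of a per-activation drop rate that falls as the current gradient and accumulated momentum grow (so "important" units are kept) might look like:

```python
import numpy as np

def magdrop_rates(grad, momentum, base_rate=0.5, beta=0.9, scale=1.0):
    """Hypothetical sketch only: raise the drop probability of
    activations whose gradient and momentum magnitudes are both small,
    on the assumption that they contribute little to the update.
    `base_rate`, `beta`, and `scale` are assumed hyperparameters."""
    momentum = beta * momentum + (1 - beta) * grad       # running momentum
    signal = np.abs(grad) + np.abs(momentum)             # per-unit importance
    # important units get exponentially lower drop rates
    rates = base_rate * np.exp(-scale * signal)
    return np.clip(rates, 0.0, 0.95), momentum
```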

Updated: 2025-11-03 03:17:03

Categories: cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2510.18410v2

Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound

Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.

Updated: 2025-11-03 03:15:39

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.20685v4

A Large Scale Study of AI-based Binary Function Similarity Detection Techniques for Security Researchers and Practitioners

Binary Function Similarity Detection (BFSD) is a foundational technique in software security, underpinning a wide range of applications including vulnerability detection and malware analysis. Recent advances in AI-based BFSD tools have led to significant performance improvements. However, existing evaluations of these tools suffer from three key limitations: a lack of in-depth analysis of performance-influencing factors, an absence of realistic application analysis, and reliance on small-scale or low-quality datasets. In this paper, we present the first large-scale empirical study of AI-based BFSD tools to address these gaps. We construct two high-quality and diverse datasets: BinAtlas, comprising 12,453 binaries and over 7 million functions for capability evaluation; and BinAres, containing 12,291 binaries and 54 real-world 1-day vulnerabilities for evaluating vulnerability detection performance in practical IoT firmware settings. Using these datasets, we evaluate nine representative BFSD tools, analyze the challenges and limitations of existing BFSD tools, and investigate the consistency among BFSD tools. We also propose an actionable strategy for combining BFSD tools to enhance overall performance (an improvement of 13.4%). Our study not only advances the practical adoption of BFSD tools but also provides valuable resources and insights to guide future research in scalable and automated binary similarity detection.

Updated: 2025-11-03 03:10:25

Categories: cs.CR,cs.SE

Download: http://arxiv.org/abs/2511.01180v1

Prevailing Research Areas for Music AI in the Era of Foundation Models

Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists' workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.

Updated: 2025-11-03 03:01:07

Categories: cs.SD,cs.AI,cs.MM,eess.AS,68T05, 68T20,I.2; I.5.4; I.2.6; I.2.7; H.5.5

Download: http://arxiv.org/abs/2409.09378v2

RepoMark: A Data-Usage Auditing Framework for Code Large Language Models

The rapid development of Large Language Models (LLMs) for code generation has transformed software development by automating coding tasks with unprecedented efficiency. However, the training of these models on open-source code repositories (e.g., from GitHub) raises critical ethical and legal concerns, particularly regarding data authorization and open-source license compliance. Developers are increasingly questioning whether model trainers have obtained proper authorization before using repositories for training, especially given the lack of transparency in data collection. To address these concerns, we propose a novel data marking framework RepoMark to audit the data usage of code LLMs. Our method enables auditors to verify whether their code has been used in training, while ensuring semantic preservation, imperceptibility, and theoretical false detection rate (FDR) guarantees. By generating multiple semantically equivalent code variants, RepoMark introduces data marks into the code files, and during detection, RepoMark leverages a novel ranking-based hypothesis test to detect model behavior difference on trained data. Compared to prior data auditing approaches, RepoMark significantly enhances data efficiency, allowing effective auditing even when the user's repository possesses only a small number of code files. Experiments demonstrate that RepoMark achieves a detection success rate over 90\% on small code repositories under a strict FDR guarantee of 5\%. This represents a significant advancement over existing data marking techniques, all of which only achieve accuracy below 55\% under identical settings. This further validates RepoMark as a robust, theoretically sound, and promising solution for enhancing transparency in code LLM training, which can safeguard the rights of code authors.

Updated: 2025-11-03 02:58:46

Categories: cs.CR,cs.SE

Download: http://arxiv.org/abs/2508.21432v3

The Digital Ecosystem of Beliefs: does evolution favour AI over humans?

As AI systems are integrated into social networks, there are AI safety concerns that AI-generated content may dominate the web, e.g. in popularity or impact on beliefs. To understand such questions, this paper proposes the Digital Ecosystem of Beliefs (Digico), the first evolutionary framework for controlled experimentation with multi-population interactions in simulated social networks. Following a Universal Darwinism approach, the framework models a population of agents which change their messaging strategies due to evolutionary updates. They interact via messages, update their beliefs following a contagion model, and maintain their beliefs through cognitive Lamarckian inheritance. Initial experiments with Digico implement two types of agents, which are modelled to represent AIs vs humans based on higher rates of communication, higher rates of evolution, seeding fixed beliefs with propaganda aims, and higher influence on the recommendation algorithm. These experiments show that: a) when AIs have faster messaging, evolution, and more influence on the recommendation algorithm, they get 80% to 95% of the views; b) AIs designed for propaganda can typically convince 50% of humans to adopt extreme beliefs, and up to 85% when agents believe only a limited number of channels; c) a penalty for content that violates agents' beliefs reduces propaganda effectiveness up to 8%. We further discuss Digico as a tool for systematic experimentation across multi-agent configurations, the implications for legislation, personal use, and platform design, and the use of Digico for studying evolutionary principles.
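The contagion-style belief update can be illustrated with a minimal scalar sketch; the linear pull toward each message's position and the `influence` rate are illustrative assumptions, not Digico's actual update rule:

```python
def contagion_update(belief, messages, influence=0.2):
    """Assumed contagion model: each received message nudges the
    agent's scalar belief toward the message's position by a fixed
    influence rate, so repeated exposure compounds."""
    for m in messages:
        belief += influence * (m - belief)
    return belief
```

With a faster-messaging population (more entries in `messages` per step), beliefs drift further toward that population's positions, which is the asymmetry the experiments probe.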

Updated: 2025-11-03 02:58:08

Categories: cs.AI,cs.MA,cs.NE

Download: http://arxiv.org/abs/2412.14500v3

Adapt under Attack and Domain Shift: Unified Adversarial Meta-Learning and Domain Adaptation for Robust Automatic Modulation Classification

Deep learning has emerged as a leading approach for Automatic Modulation Classification (AMC), demonstrating superior performance over traditional methods. However, vulnerability to adversarial attacks and susceptibility to data distribution shifts hinder their practical deployment in real-world, dynamic environments. To address these threats, we propose a novel, unified framework that integrates meta-learning with domain adaptation, making AMC systems resistant to both adversarial attacks and environmental changes. Our framework utilizes a two-phase strategy. First, in an offline phase, we employ a meta-learning approach to train the model on clean and adversarially perturbed samples from a single source domain. This method enables the model to generalize its defense, making it resistant to a combination of previously unseen attacks. Subsequently, in the online phase, we apply domain adaptation to align the model's features with a new target domain, allowing it to adapt without requiring substantial labeled data. As a result, our framework achieves a significant improvement in modulation classification accuracy against these combined threats, offering a critical solution to the deployment and operational challenges of modern AMC systems.

Updated: 2025-11-03 02:44:53

Categories: cs.LG,cs.AI,eess.SP

Download: http://arxiv.org/abs/2511.01172v1

DART: Difficulty-Adaptive Reasoning Truncation for Efficient Large Language Models

Adaptive reasoning is essential for aligning the computational effort of large language models (LLMs) with the intrinsic difficulty of problems. Current chain-of-thought methods boost reasoning ability but indiscriminately generate long explanations, leading to evident inefficiency. However, existing reinforcement learning approaches to adaptive thinking remain unstable and heavily reward-dependent. Here we propose \textbf{DART}, a supervised \textbf{D}ifficulty-\textbf{A}daptive \textbf{R}easoning \textbf{T}runcation framework that adjusts thinking length according to problem difficulty. By distilling concise reasoning patterns from stronger models, interpolating them into a continuum of reasoning styles, and curating optimal training data that balances correctness and compactness, DART learns when to ``stop thinking''. Across multiple mathematical benchmarks, experimental results demonstrate its remarkable efficiency while preserving or improving accuracy, achieving a significant 81.2\% reasoning truncation (DeepSeek-R1-Distill-Qwen-7B on GSM8K dataset) with 5.33$\times$ computational acceleration. DART provides a stable and general paradigm for efficient reasoning, advancing the development of adaptive intelligence in LLMs.
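One plausible reading of "curating optimal training data that balances correctness and compactness" is keeping, per problem, the shortest reasoning trace that is still correct; the sketch below encodes that assumed criterion, not DART's published procedure:

```python
def curate_traces(traces):
    """Assumed curation rule: from (problem, trace, is_correct)
    triples, keep only the shortest correct trace per problem, so the
    student model learns when to stop thinking."""
    best = {}
    for problem, trace, correct in traces:
        if not correct:
            continue  # never train on incorrect reasoning
        if problem not in best or len(trace) < len(best[problem]):
            best[problem] = trace
    return best
```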

Updated: 2025-11-03 02:41:20

Categories: cs.AI

Download: http://arxiv.org/abs/2511.01170v1

DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI

Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brain structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.

Updated: 2025-11-03 02:27:53

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.24770v2

DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation

The promise of LLM watermarking rests on a core assumption that a specific watermark proves authorship by a specific model. We demonstrate that this assumption is dangerously flawed. We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted, victim model. This enables the seamless misattribution of harmful content, such as disinformation, to reputable sources. The key to our attack is repurposing watermark radioactivity, the unintended inheritance of data patterns during fine-tuning, from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model. This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones. Our code is available at https://github.com/hsannn/ditto.git.

Updated: 2025-11-03 02:23:39

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.10987v2

Mapping Overlaps in Benchmarks through Perplexity in the Wild

We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.
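The stepwise forward selection with linear regressions might be sketched as below, with a plain least-squares fit standing in for whatever regression details the paper uses; `X` holds per-model token perplexities and `y` benchmark scores:

```python
import numpy as np

def forward_select(X, y, k):
    """Greedy forward selection: repeatedly add the feature column
    (a candidate signature token's perplexity) that most reduces the
    least-squares error in predicting benchmark performance, until k
    features are chosen."""
    chosen = []
    for _ in range(k):
        best, best_err = None, np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = X[:, chosen + [j]]
            A = np.hstack([cols, np.ones((len(y), 1))])  # add intercept
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            err = np.sum((A @ coef - y) ** 2)
            if err < best_err:
                best, best_err = j, err
        chosen.append(best)
    return chosen
```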

Updated: 2025-11-03 02:18:28

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.23488v3

One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra

A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et al., 2023) as the encoder and MolForge (Ucak et al., 2023) as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods, generating top-1 31% / top-10 40% of molecular structures correctly from mass spectra in MassSpecGym (Bushuiev et al., 2024). We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
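The fingerprint-bit thresholding step is simple to illustrate; the cutoff `tau` is an assumed parameter, and the input is the encoder's per-bit substructure probabilities:

```python
def threshold_fingerprint(probs, tau=0.5):
    """Binarize predicted fingerprint bit probabilities so the decoder
    sees only confidently-present substructures."""
    return [1 if p >= tau else 0 for p in probs]
```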

Updated: 2025-11-03 02:16:35

Categories: cs.LG

Download: http://arxiv.org/abs/2508.04180v4

From Superficial Outputs to Superficial Learning: Risks of Large Language Models in Education

Large Language Models (LLMs) are transforming education by enabling personalization, feedback, and knowledge access, while also raising concerns about risks to students and learning systems. Yet empirical evidence on these risks remains fragmented. This paper presents a systematic review of 70 empirical studies across computer science, education, and psychology. Guided by four research questions, we examine: (i) which applications of LLMs in education have been most frequently explored; (ii) how researchers have measured their impact; (iii) which risks stem from such applications; and (iv) what mitigation strategies have been proposed. We find that research on LLMs clusters around three domains: operational effectiveness, personalized applications, and interactive learning tools. Across these, model-level risks include superficial understanding, bias, limited robustness, anthropomorphism, hallucinations, privacy concerns, and knowledge constraints. When learners interact with LLMs, these risks extend to cognitive and behavioural outcomes, including reduced neural activity, over-reliance, diminished independent learning skills, and a loss of student agency. To capture this progression, we propose an LLM-Risk Adapted Learning Model that illustrates how technical risks cascade through interaction and interpretation to shape educational outcomes. As the first synthesis of empirically assessed risks, this review provides a foundation for responsible, human-centred integration of LLMs in education.

Updated: 2025-11-03 02:14:57

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2509.21972v2

A High-Throughput Spiking Neural Network Processor Enabling Synaptic Delay Emulation

Synaptic delay has attracted significant attention in neural network dynamics for integrating and processing complex spatiotemporal information. This paper introduces a high-throughput Spiking Neural Network (SNN) processor that supports synaptic delay-based emulation for edge applications. The processor leverages a multicore pipelined architecture with parallel compute engines, capable of real-time processing of the computational load associated with synaptic delays. We develop a SoC prototype of the proposed processor on PYNQ Z2 FPGA platform and evaluate its performance using the Spiking Heidelberg Digits (SHD) benchmark for low-power keyword spotting tasks. The processor achieves 93.4% accuracy in deployment and an average throughput of 104 samples/sec at a typical operating frequency of 125 MHz and 282 mW power consumption.

Updated: 2025-11-03 02:12:44

Categories: cs.NE,cs.AI

Download: http://arxiv.org/abs/2511.01158v1

Neighboring State-based Exploration for Reinforcement Learning

Reinforcement Learning is a powerful tool to model decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring state-based, model-free exploration led by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better actions when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in a discrete environment by 49% in terms of Eval Reward Return.
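A rough sketch of neighboring-state exploration, with the neighborhood function, tabular Q-values, and the averaging rule all assumed for illustration (the paper's exact survey rule may differ):

```python
import numpy as np

def neighbor_explore_action(state, q_values, neighbors, rho=1):
    """Instead of a uniformly random exploratory action, survey the
    states within distance `rho` of the current state and pick the
    action that looks best on average across that neighborhood."""
    nearby = neighbors(state, rho)  # bounded region of nearby states
    scores = np.mean([q_values[s] for s in nearby], axis=0)
    return int(np.argmax(scores))
```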

Updated: 2025-11-03 02:11:45

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2212.10712v3

Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models

This paper proposes a modeling framework for dynamic topic evolution based on temporal large language models. The method first uses a large language model to obtain contextual embeddings of text and then introduces a temporal decay function and an attention mechanism. These components allow the model to adjust the importance of semantic units according to time intervals and capture topic variations across different periods. The temporal representations are then mapped into a latent topic space, where a state transition matrix is applied to describe the dynamic evolution of topics. A joint optimization objective constrains both semantic modeling and temporal consistency, ensuring diversity and smoothness in topic generation. The design emphasizes the unified modeling of semantic representation and temporal evolution, which improves topic coherence and diversity while enhancing stability and interpretability over time. Experiments on real-world corpora show that the framework effectively captures the generation, expansion, and decline of topics and outperforms existing models across multiple metrics. Overall, the proposed method provides a systematic solution for understanding dynamic semantic patterns in large-scale text, enriches the research paradigm of topic modeling, and supports complex text analysis tasks in multiple domains.
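The temporal decay folded into attention might take the following assumed form, with `lam` as a hypothetical decay constant; the abstract specifies only that a decay function and attention jointly reweight semantic units by time interval:

```python
import numpy as np

def time_decayed_attention(embeddings, timestamps, t_now, lam=0.1):
    """Weight each semantic unit by exp(-lam * (t_now - t_i)) inside a
    softmax over units, then pool the contextual embeddings into one
    time-aware topic representation."""
    dt = t_now - np.asarray(timestamps, dtype=float)
    logits = -lam * dt                       # older units score lower
    w = np.exp(logits - logits.max())        # numerically stable softmax
    w /= w.sum()
    return w @ np.asarray(embeddings)        # weighted pooling
```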

Updated: 2025-11-03 02:01:32

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.10613v2

Modular Task Decomposition and Dynamic Collaboration in Multi-Agent Systems Driven by Large Language Models

This paper addresses the limitations of a single agent in task decomposition and collaboration during complex task execution, and proposes a multi-agent architecture for modular task decomposition and dynamic collaboration based on large language models. The method first converts natural language task descriptions into unified semantic representations through a large language model. On this basis, a modular decomposition mechanism is introduced to break down the overall goal into multiple hierarchical sub-tasks. Then, dynamic scheduling and routing mechanisms enable reasonable division of labor and realtime collaboration among agents, allowing the system to adjust strategies continuously according to environmental feedback, thus maintaining efficiency and stability in complex tasks. Furthermore, a constraint parsing and global consistency mechanism is designed to ensure coherent connections between sub-tasks and balanced workload, preventing performance degradation caused by redundant communication or uneven resource allocation. The experiments validate the architecture across multiple dimensions, including task success rate, decomposition efficiency, sub-task coverage, and collaboration balance. The results show that the proposed method outperforms existing approaches in both overall performance and robustness, achieving a better balance between task complexity and communication overhead. In conclusion, this study demonstrates the effectiveness and feasibility of language-driven task decomposition and dynamic collaboration in multi-agent systems, providing a systematic solution for task execution in complex environments.

Updated: 2025-11-03 02:00:06

Categories: cs.AI

Download: http://arxiv.org/abs/2511.01149v1

OneCast: Structured Decomposition and Modular Generation for Cross-Domain Time Series Forecasting

Cross-domain time series forecasting is a valuable task in various web applications. Despite its rapid advancement, achieving effective generalization across heterogeneous time series data remains a significant challenge. Existing methods have made progress by extending single-domain models, yet often fall short when facing domain-specific trend shifts and inconsistent periodic patterns. We argue that a key limitation lies in treating temporal series as undifferentiated sequence, without explicitly decoupling their inherent structural components. To address this, we propose OneCast, a structured and modular forecasting framework that decomposes time series into seasonal and trend components, each modeled through tailored generative pathways. Specifically, the seasonal component is captured by a lightweight projection module that reconstructs periodic patterns via interpretable basis functions. In parallel, the trend component is encoded into discrete tokens at segment level via a semantic-aware tokenizer, and subsequently inferred through a masked discrete diffusion mechanism. The outputs from both branches are combined to produce a final forecast that captures seasonal patterns while tracking domain-specific trends. Extensive experiments across eight domains demonstrate that OneCast mostly outperforms state-of-the-art baselines.
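The seasonal branch's "interpretable basis functions" could, for instance, be Fourier harmonics. The following sketch (our assumption; the paper does not specify this exact basis here) projects a series onto sin/cos harmonics of an assumed period and reconstructs the periodic component.

```python
import math

# Illustrative sketch of reconstructing a periodic (seasonal) component by
# projecting onto interpretable sin/cos basis functions, which are orthogonal
# on an evenly spaced grid. The period and harmonic count are assumptions.

def seasonal_fit(series, period, harmonics=2):
    n = len(series)
    recon = [0.0] * n
    for k in range(1, harmonics + 1):
        # Inner products with the k-th harmonic give the projection coefficients.
        cos_b = [math.cos(2 * math.pi * k * t / period) for t in range(n)]
        sin_b = [math.sin(2 * math.pi * k * t / period) for t in range(n)]
        a = 2 / n * sum(x * c for x, c in zip(series, cos_b))
        b = 2 / n * sum(x * s for x, s in zip(series, sin_b))
        recon = [r + a * c + b * s for r, c, s in zip(recon, cos_b, sin_b)]
    return recon

# A pure sine wave with the assumed period is recovered almost exactly.
period = 12
series = [math.sin(2 * math.pi * t / period) for t in range(48)]
recon = seasonal_fit(series, period)
err = max(abs(x - r) for x, r in zip(series, recon))
print(err)  # close to zero
```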

Updated: 2025-11-03 01:49:39

Categories: cs.AI

Download: http://arxiv.org/abs/2510.24028v2

MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhance patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present what is, to the best of our knowledge, the first systematic study of medical ST by releasing MultiMed-ST, a large-scale ST dataset for the medical domain spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST dataset across all domains. Secondly, we present the most comprehensive ST analysis in the field's history, to the best of our knowledge, including: empirical baselines, a bilingual-multilingual comparative study, an end-to-end vs. cascaded comparative study, a task-specific vs. multi-task sequence-to-sequence comparative study, a code-switch analysis, and a quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST

Updated: 2025-11-03 01:49:20

Categories: cs.CL,cs.AI,cs.LG,cs.SD,eess.AS

Download: http://arxiv.org/abs/2504.03546v2

AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence

Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall, their performance remains subpar on reasoning-intensive tasks, such as threat actor attribution and risk mitigation, with open-source models trailing even further behind. These findings highlight fundamental limitations in the reasoning capabilities of current LLMs and underscore the need for models explicitly tailored to CTI workflows and automation.

Updated: 2025-11-03 01:45:29

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.01144v1

MicroAUNet: Boundary-Enhanced Multi-scale Fusion with Knowledge Distillation for Colonoscopy Polyp Image Segmentation

Early and accurate segmentation of colorectal polyps is critical for reducing colorectal cancer mortality, and has been extensively explored by academia and industry. However, current deep learning-based polyp segmentation models either compromise clinical decision-making by providing ambiguous polyp margins in segmentation outputs or rely on heavy architectures with high computational complexity, resulting in insufficient inference speeds for real-time colorectal endoscopic applications. To address this problem, we propose MicroAUNet, a lightweight attention-based segmentation network that combines depthwise-separable dilated convolutions with a single-path, parameter-shared channel-spatial attention block to strengthen multi-scale boundary features. Building on this, a progressive two-stage knowledge-distillation scheme is introduced to transfer semantic and boundary cues from a high-capacity teacher model. Extensive experiments on benchmarks also demonstrate state-of-the-art accuracy at extremely low model complexity, indicating that MicroAUNet is suitable for real-time clinical polyp segmentation. The code is publicly available at https://github.com/JeremyXSC/MicroAUNet.

Updated: 2025-11-03 01:43:34

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.01143v1

FTSmartAudit: A Knowledge Distillation-Enhanced Framework for Automated Smart Contract Auditing Using Fine-Tuned LLMs

The rapid growth of blockchain technology has driven the widespread adoption of smart contracts. However, their inherent vulnerabilities have led to significant financial losses. Traditional auditing methods, while essential, struggle to keep pace with the increasing complexity and scale of smart contracts. Large Language Models (LLMs) offer promising capabilities for automating vulnerability detection, but their adoption is often limited by high computational costs. Although prior work has explored leveraging large models through agents or workflows, relatively little attention has been given to improving the performance of smaller, fine-tuned models, a critical factor for achieving both efficiency and data privacy. In this paper, we introduce HKT-SmartAudit, a framework for developing lightweight models optimized for smart contract auditing. It features a multi-stage knowledge-distillation pipeline that integrates classical distillation, external domain knowledge, and reward-guided learning to transfer high-quality insights from large teacher models. A single-task learning strategy is employed to train compact student models that maintain high accuracy and robustness while significantly reducing computational overhead. Experimental results show that our distilled models outperform both commercial tools and larger models in detecting complex vulnerabilities and logical flaws, offering a practical, secure, and scalable solution for smart contract auditing. The source code is available in a GitHub repository.
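The "classical distillation" component can be illustrated with the standard soft-target loss (a generic sketch, not the paper's multi-stage pipeline; the temperature `temp` and mixing weight `alpha` are illustrative choices):

```python
import math

# Generic knowledge-distillation loss sketch: the student matches softened
# teacher probabilities via KL divergence, blended with the usual hard-label
# cross-entropy. Temperature and alpha values here are illustrative, not
# taken from the paper.

def softmax(logits, temp=1.0):
    exps = [math.exp(z / temp) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label, temp=2.0, alpha=0.5):
    p_teacher = softmax(teacher_logits, temp)
    p_student = softmax(student_logits, temp)
    # KL(teacher || student) on temperature-softened distributions.
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * ce + (1 - alpha) * (temp ** 2) * kl

loss = distill_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], hard_label=0)
print(loss)  # a small positive scalar when student and teacher roughly agree
```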

Updated: 2025-11-03 01:42:59

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2410.13918v3

A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications

The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP): states closer under the BSM have more similar optimal value functions. While the BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize the BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs and rigorously prove three fundamental properties: GBSM symmetry, an inter-MDP triangle inequality, and a distance bound on identical state spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, the GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on the BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of the GBSM in multi-MDP scenarios.
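For context, the single-MDP bisimulation metric that the GBSM generalizes is usually defined as the fixed point (standard notation; the paper's exact formulation may differ):

```latex
d(s, s') \;=\; \max_{a \in \mathcal{A}} \Big( \big|\, r(s,a) - r(s',a) \,\big| \;+\; \gamma \, W_1(d)\big( P(\cdot \mid s, a),\, P(\cdot \mid s', a) \big) \Big)
```

where $W_1(d)$ is the 1-Wasserstein distance with ground metric $d$; the classical guarantee $|V^*(s) - V^*(s')| \le d(s, s')$ is what makes nearby states share similar optimal values.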

Updated: 2025-11-03 01:42:58

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.18714v3

Trustworthy AI Must Account for Interactions

Trustworthy AI encompasses many aspirational aspects for aligning AI systems with human values, including fairness, privacy, robustness, explainability, and uncertainty quantification. Ultimately the goal of Trustworthy AI research is to achieve all aspects simultaneously. However, efforts to enhance one aspect often introduce unintended trade-offs that negatively impact others. In this position paper, we review notable approaches to these five aspects and systematically consider every pair, detailing the negative interactions that can arise. For example, applying differential privacy to model training can amplify biases, undermining fairness. Drawing on these findings, we take the position that current research practices of improving one or two aspects in isolation are insufficient. Instead, research on Trustworthy AI must account for interactions between aspects and adopt a holistic view across all relevant axes at once. To illustrate our perspective, we provide guidance on how practitioners can work towards integrated trust, examples of how interactions affect the financial industry, and alternative views.

Updated: 2025-11-03 01:42:55

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.07170v2

Dataset Distillation for Offline Reinforcement Learning

Offline reinforcement learning often requires a quality dataset on which to train a policy. However, in many situations it is not possible to obtain such a dataset, nor is it easy to train a policy to perform well in the actual environment given only offline data. We propose using dataset distillation to synthesize a better dataset that can then be used to train a better policy model. We show that our method is able to synthesize a dataset on which a trained model achieves performance similar to a model trained on the full dataset or a model trained using percentile behavioral cloning. Our project site is available at https://datasetdistillation4rl.github.io . We also provide our implementation at https://github.com/ggflow123/DDRL .

Updated: 2025-11-03 01:38:40

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2407.20299v3

Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models

Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
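SoftPAG's interpolation step as described above, on a toy 3x3 attention map (the matrix values and alpha settings are illustrative, not from the paper):

```python
# SoftPAG's core operation: linearly interpolate a selected head's attention
# map toward the identity matrix, with alpha acting as a continuous
# perturbation-strength knob. The 3x3 map is a toy stand-in for a real
# attention matrix; rows summing to 1 are preserved by the interpolation.

def soft_pag(attn, alpha):
    n = len(attn)
    return [[(1 - alpha) * attn[i][j] + alpha * (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

attn = [[0.2, 0.5, 0.3],
        [0.1, 0.8, 0.1],
        [0.4, 0.4, 0.2]]

print(soft_pag(attn, 0.0) == attn)  # alpha=0 leaves the map unchanged
print(soft_pag(attn, 1.0))          # alpha=1 yields the identity matrix
```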

Updated: 2025-11-03 01:30:29

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.10978v4

Few-Shot Multimodal Medical Imaging: A Theoretical Framework

Medical imaging relies heavily on large, labeled datasets, which unfortunately are not always accessible in clinical settings. Additionally, many practitioners often face structural obstacles such as limited data availability, fragmented data systems, and unbalanced datasets. These barriers often lead to increased diagnostic uncertainty, underrepresentation of certain conditions, reduced model robustness, and biased diagnostic decisions. In response to these challenges, approaches such as transfer learning, meta-learning, and multimodal fusion have made great strides. However, they still lack a solid theoretical justification for why they succeed or fail when data is scarce. To address this gap, we propose a unified theoretical framework that characterizes learning and inference under low-resource medical imaging conditions. We first formalize the learning objective under few-shot conditions and compute sample-complexity constraints to estimate the smallest quantity of data needed to achieve clinically reliable accuracy. Then, drawing on ideas from PAC-learning and PAC-Bayesian theory, we explain how multimodal integration encourages generalization and quantifies uncertainty under sparse supervision. We further propose a formal metric for explanation stability, offering interpretability guarantees under low-data conditions. Taken together, the proposed framework establishes a principled foundation for constructing dependable, data-efficient diagnostic systems by jointly characterizing sample efficiency, uncertainty quantification, and interpretability in a unified theoretical setting.
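As a reference point, the sample-complexity constraints mentioned above are in the spirit of the classical realizable finite-class PAC bound (a textbook result, not the paper's own theorem): with probability at least $1-\delta$, error at most $\varepsilon$ is achievable once

```latex
m \;\ge\; \frac{1}{\varepsilon} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right)
```

labeled examples are available for a hypothesis class $\mathcal{H}$; the framework refines constraints of this kind for multimodal few-shot diagnosis.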

Updated: 2025-11-03 01:21:50

Categories: stat.ML,cs.AI,cs.CV,cs.LG,eess.IV

Download: http://arxiv.org/abs/2511.01140v1

A Basic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm

This paper presents a comprehensive formulation of Kaneko's Error Diffusion Learning Algorithm (EDLA), a biologically inspired alternative to conventional backpropagation for training artificial neural networks. EDLA employs a single global error signal that diffuses across networks composed of paired positive and negative sublayers, eliminating traditional layer-wise error backpropagation. We evaluate EDLA's effectiveness on parity check, regression, and image classification benchmarks, systematically varying the neuron count, network depth, and learning rate to assess its performance comprehensively. The experimental results demonstrate that EDLA achieves consistently high accuracy across multiple benchmarks, highlighting its effectiveness as a learning algorithm for neural networks. The choice of learning rate, neuron count, and network depth significantly influences EDLA's efficiency and convergence speed. Analysis of internal network representations reveals meaningful feature-extraction capabilities, and the network's overall performance is competitive with networks trained via conventional backpropagation, especially in shallow architectures. EDLA has previously been underrecognized due to language barriers; by reformulating it, systematically evaluating its performance, and presenting empirical evidence of its effectiveness, this study increases EDLA's visibility and accessibility and contributes to biologically inspired training methodologies.

Updated: 2025-11-03 01:07:24

Categories: cs.LG

Download: http://arxiv.org/abs/2504.14814v3

Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models

Large Language Models are increasingly adopted in financial applications to support investment workflows. However, prior studies have seldom examined how these models reflect biases related to firm size, sector, or financial characteristics, which can significantly impact decision-making. This paper addresses this gap by focusing on representation bias in open-source Qwen models. We propose a balanced round-robin prompting method over approximately 150 U.S. equities, applying constrained decoding and token-logit aggregation to derive firm-level confidence scores across financial contexts. Using statistical tests and variance analysis, we find that firm size and valuation consistently increase model confidence, while risk factors tend to decrease it. Confidence varies significantly across sectors, with the Technology sector showing the greatest variability. When models are prompted for specific financial categories, their confidence rankings best align with fundamental data, moderately with technical signals, and least with growth indicators. These results highlight representation bias in Qwen models and motivate sector-aware calibration and category-conditioned evaluation protocols for safe and fair financial LLM deployment.
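A minimal sketch of deriving a confidence score from constrained decoding and token-logit aggregation (the two-option restriction and the logit values are our own illustrative assumptions; real use would read logits from the LLM's output layer):

```python
import math

# Illustrative sketch of token-logit aggregation for a firm-level confidence
# score: restrict the model's next-token distribution to a fixed option set
# (here "yes"/"no") and renormalize. The logit values below are made up.

def option_confidence(option_logits, target="yes"):
    exps = {tok: math.exp(z) for tok, z in option_logits.items()}
    total = sum(exps.values())
    return exps[target] / total

logits = {"yes": 2.1, "no": 0.3}  # hypothetical logits for one prompt
conf = option_confidence(logits)
print(round(conf, 3))  # -> 0.858
```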

Updated: 2025-11-03 01:00:40

Categories: q-fin.CP,cs.AI

Download: http://arxiv.org/abs/2510.05702v2

Verification and Attack Synthesis for Network Protocols

Network protocols are programs with inputs and outputs that follow predefined communication patterns to synchronize and exchange information. There are many protocols, and each serves a different purpose, e.g., routing, transport, or secure communication. The functional and performance requirements for a protocol can be expressed using a formal specification, such as a set of logical predicates over its traces. A protocol could be prevented from achieving its requirements by a bug in its design or implementation, a component failure (e.g., a crash), or an attack. This dissertation shows that formal methods can feasibly characterize the functionality and performance of network protocols under normal conditions as well as when subjected to attacks.

Updated: 2025-11-03 00:26:12

Categories: cs.CR,cs.FL

Download: http://arxiv.org/abs/2511.01124v1

Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression

We study gradient descent (GD) with a constant stepsize for $\ell_2$-regularized logistic regression with linearly separable data. Classical theory suggests small stepsizes to ensure monotonic reduction of the optimization objective, achieving exponential convergence in $\widetilde{\mathcal{O}}(\kappa)$ steps with $\kappa$ being the condition number. Surprisingly, we show that this can be accelerated to $\widetilde{\mathcal{O}}(\sqrt{\kappa})$ by simply using a large stepsize -- for which the objective evolves nonmonotonically. The acceleration brought by large stepsizes extends to minimizing the population risk for separable distributions, improving on the best-known upper bounds on the number of steps to reach a near-optimum. Finally, we characterize the largest stepsize for the local convergence of GD, which also determines the global convergence in special scenarios. Our results extend the analysis of Wu et al. (2024) from convex settings with minimizers at infinity to strongly convex cases with finite minimizers.
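To make the setting concrete, here is a toy run of constant-stepsize GD on 1-D $\ell_2$-regularized logistic regression with linearly separable data (the data, stepsizes, and iteration count are illustrative choices, not the paper's experiments):

```python
import math

# Constant-stepsize GD on l2-regularized logistic regression with 1-D
# separable data. With regularization lam the objective is strongly convex;
# a large (still stable) stepsize reaches the optimum much faster than a
# conservative one. All numbers here are toy choices, not the paper's setup.

data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]  # (feature, label)
lam = 0.1

def grad(w):
    """Gradient of average logistic loss plus (lam/2) * w^2."""
    g = lam * w
    for x, y in data:
        g += -y * x / (1 + math.exp(y * w * x)) / len(data)
    return g

def run_gd(eta, steps):
    w = 0.0
    for _ in range(steps):
        w -= eta * grad(w)
    return w

w_small = run_gd(eta=0.1, steps=200)   # conservative stepsize
w_large = run_gd(eta=2.0, steps=200)   # large but still stable stepsize
# The large stepsize ends far closer to the stationary point after the
# same number of iterations.
print(abs(grad(w_small)), abs(grad(w_large)))
```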

Updated: 2025-11-03 00:16:28

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2506.02336v2

Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection

This paper presents Two-Stage LKPLO, a novel multi-stage outlier-detection framework that overcomes two coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions such as our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier-detection problems and underscores the importance of hybrid, multi-stage architectures.
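The projection outlyingness that PLO generalizes can be sketched in its classical robust form: a point's score is its worst robust z-score over a set of 1-D projections (fixed directions and median/MAD are our simplifying choices here; the paper replaces this fixed metric with adaptive losses and works in a kernel PCA space):

```python
import math, statistics

# Classical projection-outlyingness sketch: score a point by its worst-case
# |projection - median| / MAD over a set of directions. The 2-D data and the
# fixed direction set are toy choices for illustration.

def project(point, direction):
    return point[0] * direction[0] + point[1] * direction[1]

def outlyingness(point, data, directions):
    worst = 0.0
    for d in directions:
        proj = [project(p, d) for p in data]
        med = statistics.median(proj)
        mad = statistics.median([abs(v - med) for v in proj]) or 1e-12
        worst = max(worst, abs(project(point, d) - med) / mad)
    return worst

data = [(0.0, 0.0), (1.0, 0.1), (0.5, -0.2), (-0.5, 0.1), (-1.0, -0.1)]
dirs = [(1.0, 0.0), (0.0, 1.0), (math.sqrt(0.5), math.sqrt(0.5))]

inlier, outlier = (0.2, 0.0), (0.0, 5.0)
print(outlyingness(inlier, data, dirs) < outlyingness(outlier, data, dirs))  # True
```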

Updated: 2025-11-03 00:07:17

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.24043v3

By Xinhai (Sean) Zou.