    _              _         ____              
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        

Articles: 18

Last Updated: 2025-08-08 23:58:12 (+00:00)


El Agente: An Autonomous Agent for Quantum Chemistry

Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university-level course exercises and two case studies, demonstrating robust problem-solving performance (averaging >87% task success) and adaptive error handling through in situ debugging. It also supports longer-term, multi-step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.

Updated: 2025-08-08 23:58:12

Categories: cs.AI,cs.LG,cs.MA,physics.chem-ph

Download: http://arxiv.org/abs/2505.02484v2

A Fuzzy Logic Prompting Framework for Large Language Models in Adaptive and Uncertain Tasks

We introduce a modular prompting framework that supports safer and more adaptive use of large language models (LLMs) across dynamic, user-centered tasks. Grounded in human learning theory, particularly the Zone of Proximal Development (ZPD), our method combines a natural language boundary prompt with a control schema encoded with fuzzy scaffolding logic and adaptation rules. This architecture enables LLMs to modulate behavior in response to user state without requiring fine-tuning or external orchestration. In a simulated intelligent tutoring setting, the framework improves scaffolding quality, adaptivity, and instructional alignment across multiple models, outperforming standard prompting baselines. Evaluation is conducted using rubric-based LLM graders at scale. While initially developed for education, the framework has shown promise in other interaction-heavy domains, such as procedural content generation for games. Designed for safe deployment, it provides a reusable methodology for structuring interpretable, goal-aligned LLM behavior in uncertain or evolving contexts.
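
The abstract does not spell out the fuzzy control schema, but the core idea of fuzzy scaffolding logic can be sketched: fuzzify a continuous user-state estimate into overlapping support levels, then let the dominant level select the boundary prompt. The membership functions, labels, and templates below are hypothetical illustrations, not the paper's actual schema.

```python
# Hypothetical sketch of fuzzy scaffolding logic for prompt control.
# Membership functions, labels, and templates are illustrative assumptions.

def triangular(x, a, b, c):
    """Triangular fuzzy membership on [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def scaffolding_weights(competence):
    """Fuzzify a 0-1 competence estimate into support-level weights."""
    return {
        "high_support":   triangular(competence, -0.5, 0.0, 0.5),
        "medium_support": triangular(competence,  0.0, 0.5, 1.0),
        "low_support":    triangular(competence,  0.5, 1.0, 1.5),
    }

def build_prompt(competence):
    """Pick the dominant fuzzy label and emit its boundary prompt."""
    weights = scaffolding_weights(competence)
    label = max(weights, key=weights.get)
    templates = {
        "high_support":   "Give step-by-step hints before any answer.",
        "medium_support": "Give one hint, then let the user try.",
        "low_support":    "Only confirm or gently correct the user.",
    }
    return templates[label]
```

Because the levels overlap, small changes in the estimated user state shift behavior gradually rather than abruptly, which is the appeal of fuzzy logic in a ZPD-style setting.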

Updated: 2025-08-08 23:50:48

Categories: cs.AI,I.2.7

Download: http://arxiv.org/abs/2508.06754v1

Bridging the Last Mile of Prediction: Enhancing Time Series Forecasting with Conditional Guided Flow Matching

Existing generative models for time series forecasting often transform simple priors (typically Gaussian) into complex data distributions. However, their sampling initialization, independent of historical data, hinders the capture of temporal dependencies, limiting predictive accuracy. They also treat residuals merely as optimization targets, ignoring that residuals often exhibit meaningful patterns like systematic biases or nontrivial distributional structures. To address these, we propose Conditional Guided Flow Matching (CGFM), a novel model-agnostic framework that extends flow matching by integrating outputs from an auxiliary predictive model. This enables learning from the probabilistic structure of prediction residuals, leveraging the auxiliary model's prediction distribution as a source to reduce learning difficulty and refine forecasts. CGFM incorporates historical data as both conditions and guidance, uses two-sided conditional paths (with source and target conditioned on the same history), and employs affine paths to expand the path space, avoiding path crossing without complex mechanisms, preserving temporal consistency, and strengthening distribution alignment. Experiments across datasets and baselines show CGFM consistently outperforms state-of-the-art models, advancing forecasting.
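
The two-sided affine path construction can be sketched in a few lines. Assuming a scalar setting where the source sample is the auxiliary model's forecast plus residual noise and the target is the observed value (names and the Monte-Carlo setup are illustrative, not the paper's implementation):

```python
# Sketch of conditional flow matching with a two-sided affine path:
# source = auxiliary forecast + residual noise, target = observed value,
# both conditioned on the same history. Illustrative, scalar-valued.
import random

def affine_path(x_src, x_tgt, t):
    """Point on the straight (affine) path from source to target at time t."""
    return (1.0 - t) * x_src + t * x_tgt

def target_velocity(x_src, x_tgt):
    """Conditional velocity field a flow-matching model regresses onto."""
    return x_tgt - x_src

def flow_matching_loss(model, history, x_tgt, aux_forecast,
                       n_samples=256, seed=0):
    """Monte-Carlo estimate of the conditional flow-matching objective."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.random()
        x_src = aux_forecast + rng.gauss(0.0, 0.1)  # source near the forecast
        x_t = affine_path(x_src, x_tgt, t)
        pred_v = model(history, x_t, t)
        total += (pred_v - target_velocity(x_src, x_tgt)) ** 2
    return total / n_samples
```

Starting the flow from the forecast distribution rather than a Gaussian prior is what shortens the transport the model must learn; the affine paths keep source-target pairs from crossing.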

Updated: 2025-08-08 23:50:27

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.07192v3

ElementaryNet: A Non-Strategic Neural Network for Predicting Human Behavior in Normal-Form Games

Behavioral game theory models serve two purposes: yielding insights into how human decision-making works, and predicting how people would behave in novel strategic settings. A system called GameNet represents the state of the art for predicting human behavior in the setting of unrepeated simultaneous-move games, combining a simple "level-k" model of strategic reasoning with a complex neural network model of non-strategic "level-0" behavior. Although this reliance on well-established ideas from cognitive science ought to make GameNet interpretable, the flexibility of its level-0 model raises the possibility that it is able to emulate strategic reasoning. In this work, we prove that GameNet's level-0 model is indeed too general. We then introduce ElementaryNet, a novel neural network that is provably incapable of expressing strategic behavior. We show that these additional restrictions are empirically harmless, leading ElementaryNet to statistically indistinguishable predictive performance vs GameNet. We then show how it is possible to derive insights about human behavior by varying ElementaryNet's features and interpreting its parameters, finding evidence of iterative reasoning, learning about the depth of this reasoning process, and showing the value of a rich level-0 specification.
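
The "level-k" layer referred to above is standard: level-0 is some non-strategic distribution over actions, and level-k best-responds to the opponent's level-(k-1) play. A minimal sketch with a uniform level-0 (the paper's point is precisely that the level-0 model can be much richer):

```python
# Level-k iterative reasoning on a two-player normal-form game.
# payoffs[i][j] = payoff to the acting player for own action i,
# opponent action j. Uniform level-0 is an illustrative choice.

def best_response(payoffs, opp_dist):
    """One-hot distribution on the action maximizing expected payoff."""
    expected = [sum(p * row[j] for j, p in enumerate(opp_dist)) for row in payoffs]
    best = max(range(len(expected)), key=lambda i: expected[i])
    return [1.0 if i == best else 0.0 for i in range(len(expected))]

def level_k(payoffs_row, payoffs_col, k):
    """Level-k play for the row player, starting from uniform level-0."""
    row = [1.0 / len(payoffs_row)] * len(payoffs_row)
    col = [1.0 / len(payoffs_col)] * len(payoffs_col)
    for _ in range(k):
        row, col = (best_response(payoffs_row, col),
                    best_response(payoffs_col, row))
    return row
```

In a prisoner's dilemma with payoffs [[3, 0], [5, 1]], every level k >= 1 defects, since defection best-responds to any opponent distribution.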

Updated: 2025-08-08 23:36:46

Categories: cs.LG,cs.AI,cs.GT

Download: http://arxiv.org/abs/2503.05925v2

Pushing the Envelope of LLM Inference on AI-PC

The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2x, and deliver up to 7x speedup compared to the 16-bit model inference. Our optimized runtime advances the state of LLM inference on AI PCs and edge devices, paving the way for efficient deployment of ultra-low-bit LLM models.
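
The kind of 2-bit weight handling such microkernels build on can be illustrated in scalar form: pack four 2-bit codes per byte, then unpack and accumulate a dot product. Real kernels use SIMD lookup tables and blocked layouts; the {-1, 0, 1, 2} codebook below is an assumption for illustration only.

```python
# Scalar sketch of a 2-bit-weight dot product: four codes per byte.
# The codebook and little-end-first packing order are illustrative.

CODEBOOK = [-1, 0, 1, 2]  # 2-bit code -> dequantized weight value

def pack_weights(codes):
    """Pack 2-bit codes (values 0-3), four per byte, lowest bits first."""
    assert len(codes) % 4 == 0
    packed = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= (c & 0b11) << (2 * j)
        packed.append(b)
    return bytes(packed)

def dot_2bit(packed, activations):
    """Dot product between packed 2-bit weights and activations."""
    acc = 0.0
    for i, x in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        acc += CODEBOOK[code] * x
    return acc
```

The 4x memory compression versus int8 (16x versus fp32) is where the latency and energy wins come from, provided the unpack-and-multiply loop is kept off the critical path.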

Updated: 2025-08-08 23:33:38

Categories: cs.AI,cs.LG,cs.PF

Download: http://arxiv.org/abs/2508.06753v1

Scalable Private Partition Selection via Adaptive Weighting

In the differentially private partition selection problem (a.k.a. private set union, private key discovery), users hold subsets of items from an unbounded universe. The goal is to output as many items as possible from the union of the users' sets while maintaining user-level differential privacy. Solutions to this problem are a core building block for many privacy-preserving ML applications including vocabulary extraction in a private corpus, computing statistics over categorical data and learning embeddings over user-provided items. We propose an algorithm for this problem, MaxAdaptiveDegree (MAD), which adaptively reroutes weight from items with weight far above the threshold needed for privacy to items with smaller weight, thereby increasing the probability that less frequent items are output. Our algorithm can be efficiently implemented in massively parallel computation systems allowing scalability to very large datasets. We prove that our algorithm stochastically dominates the standard parallel algorithm for this problem. We also develop a two-round version of our algorithm, MAD2R, where results of the computation in the first round are used to bias the weighting in the second round to maximize the number of items output. In experiments, our algorithms provide the best results among parallel algorithms and scale to datasets with hundreds of billions of items, up to three orders of magnitude larger than those analyzed by prior sequential algorithms.
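
The adaptive-weighting idea can be sketched in a deliberately simplified form: each user spreads unit weight over its items, and weight a user placed on items already far above the threshold is rerouted back to that user's rarer items. The rerouting rule below (shift half of the over-threshold share) is a crude stand-in for MAD's careful bookkeeping, and all privacy machinery (noise, sensitivity analysis) is omitted.

```python
# Simplified sketch of adaptive weight rerouting a la MAD.
# No DP noise or sensitivity accounting; the 0.5 reroute fraction
# is an illustrative assumption.

def basic_weights(user_sets):
    """Each user splits weight 1 uniformly over its items."""
    w = {}
    for items in user_sets:
        for it in items:
            w[it] = w.get(it, 0.0) + 1.0 / len(items)
    return w

def reroute(user_sets, threshold):
    """Shift half of each user's share on over-threshold items to its
    under-threshold items, preserving the user's total contribution."""
    w = basic_weights(user_sets)
    out = dict(w)
    for items in user_sets:
        share = 1.0 / len(items)
        heavy = [it for it in items if w[it] > threshold]
        light = [it for it in items if w[it] <= threshold]
        if not heavy or not light:
            continue
        freed = 0.5 * share * len(heavy)
        for it in heavy:
            out[it] -= 0.5 * share
        for it in light:
            out[it] += freed / len(light)
    return out
```

Because each user's total contribution is unchanged, per-user sensitivity is preserved while rare items gain probability of clearing the release threshold.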

Updated: 2025-08-08 23:30:19

Categories: cs.DS,cs.CR

Download: http://arxiv.org/abs/2502.08878v2

EfficientEQA: An Efficient Approach to Open-Vocabulary Embodied Question Answering

Embodied Question Answering (EQA) is an essential yet challenging task for robot assistants. Large vision-language models (VLMs) have shown promise for EQA, but existing approaches either treat it as static video question answering without active exploration or restrict answers to a closed set of choices. These limitations hinder real-world applicability, where a robot must explore efficiently and provide accurate answers in open-vocabulary settings. To overcome these challenges, we introduce EfficientEQA, a novel framework that couples efficient exploration with free-form answer generation. EfficientEQA features three key innovations: (1) Semantic-Value-Weighted Frontier Exploration (SFE) with Verbalized Confidence (VC) from a black-box VLM to prioritize semantically important areas to explore, enabling the agent to gather relevant information faster; (2) a BLIP relevancy-based mechanism to stop adaptively by flagging highly relevant observations as outliers to indicate whether the agent has collected enough information; and (3) a Retrieval-Augmented Generation (RAG) method for the VLM to answer accurately based on pertinent images from the agent's observation history without relying on predefined choices. Our experimental results show that EfficientEQA achieves over 15% higher answer accuracy and requires over 20% fewer exploration steps than state-of-the-art methods. Our code is available at: https://github.com/chengkaiAcademyCity/EfficientEQA

Updated: 2025-08-08 23:10:26

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2410.20263v2

Fenchel-Young Variational Learning

From a variational perspective, many statistical learning criteria involve seeking a distribution that balances empirical risk and regularization. In this paper, we broaden this perspective by introducing a new general class of variational methods based on Fenchel-Young (FY) losses, treated as divergences that generalize (and encompass) the familiar Kullback-Leibler divergence at the core of classical variational learning. Our proposed formulation -- FY variational learning -- includes as key ingredients new notions of FY free energy, FY evidence, FY evidence lower bound, and FY posterior. We derive alternating minimization and gradient backpropagation algorithms to compute (or lower bound) the FY evidence, which enables learning a wider class of models than previous variational formulations. This leads to generalized FY variants of classical algorithms, such as an FY expectation-maximization (FYEM) algorithm, and latent-variable models, such as an FY variational autoencoder (FYVAE). Our new methods are shown to be empirically competitive, often outperforming their classical counterparts, and most importantly, to have qualitatively novel features. For example, FYEM has an adaptively sparse E-step, while the FYVAE can support models with sparse observations and sparse posteriors.
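
A Fenchel-Young loss has the closed form L_Omega(theta; y) = Omega*(theta) + Omega(y) - <theta, y>. The smallest concrete instance, which the paper's framework generalizes, takes Omega to be negative Shannon entropy on the simplex, so Omega* is log-sum-exp and the FY loss recovers softmax cross-entropy:

```python
# The negative-entropy instance of a Fenchel-Young loss:
# L(theta; y) = logsumexp(theta) + sum_i y_i log y_i - <theta, y>.
import math

def neg_entropy(p):
    """Omega(p) = sum_i p_i log p_i, with 0 log 0 := 0."""
    return sum(pi * math.log(pi) for pi in p if pi > 0)

def logsumexp(theta):
    """Omega*(theta) for the negative-entropy regularizer."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))

def fy_loss(theta, y):
    """Fenchel-Young loss: Omega*(theta) + Omega(y) - <theta, y>."""
    return logsumexp(theta) + neg_entropy(y) - sum(t * yi for t, yi in zip(theta, y))
```

Two defining properties are visible here: for one-hot y the loss is the usual cross-entropy, and the loss vanishes exactly when y equals the regularized prediction softmax(theta). Swapping Omega for other regularizers yields the sparse posteriors the abstract mentions.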

Updated: 2025-08-08 23:09:31

Categories: cs.LG

Download: http://arxiv.org/abs/2502.10295v2

Topology Generation of UAV Covert Communication Networks: A Graph Diffusion Approach with Incentive Mechanism

With the growing demand for Uncrewed Aerial Vehicle (UAV) networks in sensitive applications, such as urban monitoring, emergency response, and secure sensing, ensuring reliable connectivity and covert communication has become increasingly vital. However, dynamic mobility and exposure risks pose significant challenges. To tackle these challenges, this paper proposes a self-organizing UAV network framework combining Graph Diffusion-based Policy Optimization (GDPO) with a Stackelberg Game (SG)-based incentive mechanism. The GDPO method uses generative AI to dynamically generate sparse but well-connected topologies, enabling flexible adaptation to changing node distributions and Ground User (GU) demands. Meanwhile, the Stackelberg Game (SG)-based incentive mechanism guides self-interested UAVs to choose relay behaviors and neighbor links that support cooperation and enhance covert communication. Extensive experiments are conducted to validate the effectiveness of the proposed framework in terms of model convergence, topology generation quality, and enhancement of covert communication performance.

Updated: 2025-08-08 23:06:49

Categories: cs.AI

Download: http://arxiv.org/abs/2508.06746v1

Analysis of Schedule-Free Nonconvex Optimization

First-order methods underpin most large-scale learning algorithms, yet their classical convergence guarantees hinge on carefully scheduled step-sizes that depend on the total horizon $T$, which is rarely known in advance. The Schedule-Free (SF) method promises optimal performance with hyperparameters that are independent of $T$ by interpolating between Polyak--Ruppert averaging and momentum, but nonconvex analysis of SF has been limited or reliant on strong global assumptions. We introduce a robust Lyapunov framework that, under only $L$-smoothness and lower-boundedness, reduces SF analysis to a single-step descent inequality. This yields horizon-agnostic bounds in the nonconvex setting: $O(1/\log T)$ for constant step + PR averaging, $O(\log T/T)$ for a linearly growing step-size, and a continuum of $O(T^{-(1-\alpha)})$ rates for polynomial averaging. We complement these proofs with Performance Estimation Problem (PEP) experiments that numerically validate our rates and suggest that our $O(1/\log T)$ bound on the original nonconvex SF algorithm may tighten to $O(1/T)$. Our work extends SF's horizon-free guarantees to smooth nonconvex optimization and charts future directions for optimal nonconvex rates.
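
The interpolation between Polyak-Ruppert averaging and momentum is concrete in the published Schedule-Free SGD recursion: the gradient is evaluated at an interpolation y_t of the base iterate z_t and the running average x_t, and x_t averages the z sequence with uniform weights. A 1-D sketch (the step size and beta values are illustrative, not tuned):

```python
# Schedule-Free SGD on a 1-D objective, following the published recursion:
#   y_t = (1 - beta) z_t + beta x_t     (gradient point)
#   z_{t+1} = z_t - gamma * grad(y_t)   (base step)
#   x_{t+1} = (1 - c_t) x_t + c_t z_t+1, c_t = 1/t  (PR-style average)

def schedule_free(grad, z0, steps, gamma=0.1, beta=0.9):
    """Run Schedule-Free SGD; return the averaged iterate x."""
    z = x = z0
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x   # interpolate iterate and average
        z = z - gamma * grad(y)           # SGD step on the z sequence
        c = 1.0 / t                       # uniform (horizon-free) weight
        x = (1.0 - c) * x + c * z         # Polyak-Ruppert average
    return x
```

Note that no quantity depends on the total horizon T, which is exactly the property whose nonconvex rates (e.g. the O(1/log T) constant-step bound) the paper establishes.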

Updated: 2025-08-08 22:54:35

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06743v1

Observation Interference in Partially Observable Assistance Games

We study partially observable assistance games (POAGs), a model of the human-AI value alignment problem which allows the human and the AI assistant to have partial observations. Motivated by concerns of AI deception, we study a qualitatively new phenomenon made possible by partial observability: would an AI assistant ever have an incentive to interfere with the human's observations? First, we prove that sometimes an optimal assistant must take observation-interfering actions, even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. Though this result seems to contradict the classic theorem from single-agent decision making that the value of information is nonnegative, we resolve this seeming contradiction by developing a notion of interference defined on entire policies. This can be viewed as an extension of the classic result that the value of information is nonnegative into the cooperative multiagent setting. Second, we prove that if the human is simply making decisions based on their immediate outcomes, the assistant might need to interfere with observations as a way to query the human's preferences. We show that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate their preferences to the assistant. Third, we show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations. Finally, we use an experimental model to analyze tradeoffs faced by the AI assistant in practice when considering whether or not to take observation-interfering actions.

Updated: 2025-08-08 22:51:07

Categories: cs.AI,cs.GT,cs.LG,cs.MA

Download: http://arxiv.org/abs/2412.17797v2

Learning Causal Structure Distributions for Robust Planning

Structural causal models describe how the components of a robotic system interact. They provide both structural and functional information about the relationships that are present in the system. The structural information outlines the variables among which there is interaction. The functional information describes how such interactions work, via equations or learned models. In this paper we find that learning the functional relationships while accounting for the uncertainty about the structural information leads to more robust dynamics models which improves downstream planning, while using significantly lower computational resources. This in contrast with common model-learning methods that ignore the causal structure and fail to leverage the sparsity of interactions in robotic systems. We achieve this by estimating a causal structure distribution that is used to sample causal graphs that inform the latent-space representations in an encoder-multidecoder probabilistic model. We show that our model can be used to learn the dynamics of a robot, which together with a sampling-based planner can be used to perform new tasks in novel environments, provided an objective function for the new requirement is available. We validate our method using manipulators and mobile robots in both simulation and the real-world. Additionally, we validate the learned dynamics' adaptability and increased robustness to corrupted inputs and changes in the environment, which is highly desirable in challenging real-world robotics scenarios. Video: https://youtu.be/X6k5t7OOnNc.

Updated: 2025-08-08 22:43:17

Categories: cs.RO,cs.AI,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2508.06742v1

ParBalans: Parallel Multi-Armed Bandits-based Adaptive Large Neighborhood Search

Solving Mixed-Integer Programming (MIP) problems often requires substantial computational resources due to their combinatorial nature. Parallelization has emerged as a critical strategy to accelerate solution times and enhance scalability to tackle large, complex instances. This paper investigates the parallelization capabilities of Balans, a recently proposed multi-armed bandits-based adaptive large neighborhood search for MIPs. While Balans's modular architecture inherently supports parallel exploration of diverse parameter configurations, this potential has not been thoroughly examined. To address this gap, we introduce ParBalans, an extension that leverages both solver-level and algorithmic-level parallelism to improve performance on challenging MIP instances. Our experimental results demonstrate that ParBalans exhibits competitive performance compared to the state-of-the-art commercial solver Gurobi, particularly on hard optimization benchmarks.
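
The bandit layer in a Balans-style adaptive large neighborhood search can be sketched with a standard UCB1 policy choosing which destroy/repair neighborhood to apply next, rewarded when the move improves the incumbent. The operator set and reward scheme below are illustrative assumptions; only the UCB1 rule itself is standard.

```python
# UCB1 arm selection over neighborhood operators (rewards in [0, 1]).
import math

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per arm
        self.total = 0

    def select(self):
        for arm, c in enumerate(self.counts):
            if c == 0:
                return arm             # play each operator once first
        return max(
            range(len(self.counts)),
            key=lambda a: self.values[a]
                + math.sqrt(2 * math.log(self.total) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Because each arm's statistics are independent, running several such bandits over disjoint parameter configurations in parallel, as ParBalans does at the algorithmic level, requires no coordination beyond sharing incumbents.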

Updated: 2025-08-08 22:30:19

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.06736v1

Mitigating Distribution Shift in Graph-Based Android Malware Classification via Function Metadata and LLM Embeddings

Graph-based malware classifiers can achieve over 94% accuracy on standard Android datasets, yet we find they suffer accuracy drops of up to 45% when evaluated on previously unseen malware variants from the same family - a scenario where strong generalization would typically be expected. This highlights a key limitation in existing approaches: both the model architectures and their structure-only representations often fail to capture deeper semantic patterns. In this work, we propose a robust semantic enrichment framework that enhances function call graphs with contextual features, including function-level metadata and, when available, code embeddings derived from large language models. The framework is designed to operate under real-world constraints where feature availability is inconsistent, and supports flexible integration of semantic signals. To evaluate generalization under realistic domain and temporal shifts, we introduce two new benchmarks: MalNet-Tiny-Common and MalNet-Tiny-Distinct, constructed using malware family partitioning to simulate cross-family generalization and evolving threat behavior. Experiments across multiple graph neural network backbones show that our method improves classification performance by up to 8% under distribution shift and consistently enhances robustness when integrated with adaptation-based methods. These results offer a practical path toward building resilient malware detection systems in evolving threat environments.

Updated: 2025-08-08 22:16:57

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2508.06734v1

ClimateSOM: A Visual Analysis Workflow for Climate Ensemble Datasets

Ensemble datasets are ever more prevalent in various scientific domains. In climate science, ensemble datasets are used to capture variability in projections under plausible future conditions including greenhouse and aerosol emissions. Each ensemble model run produces projections that are fundamentally similar yet meaningfully distinct. Understanding this variability among ensemble model runs and analyzing its magnitude and patterns is a vital task for climate scientists. In this paper, we present ClimateSOM, a visual analysis workflow that leverages a self-organizing map (SOM) and Large Language Models (LLMs) to support interactive exploration and interpretation of climate ensemble datasets. The workflow abstracts climate ensemble model runs - spatiotemporal time series - into a distribution over a 2D space that captures the variability among the ensemble model runs using a SOM. LLMs are integrated to assist in sensemaking of this SOM-defined 2D space, the basis for the visual analysis tasks. In all, ClimateSOM enables users to explore the variability among ensemble model runs, identify patterns, compare and cluster the ensemble model runs. To demonstrate the utility of ClimateSOM, we apply the workflow to an ensemble dataset of precipitation projections over California and the Northwestern United States. Furthermore, we conduct a short evaluation of our LLM integration, and conduct an expert review of the visual workflow and the insights from the case studies with six domain experts to evaluate our approach and its utility.
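
The SOM abstraction at the core of the workflow is classical competitive learning: data vectors are mapped onto a small 2-D grid of prototypes, with the winning prototype and its grid neighbors pulled toward each sample. A pure-Python miniature (grid size, rates, and decay schedule are illustrative; ClimateSOM's inputs are full spatiotemporal ensemble members, not toy vectors):

```python
# Minimal self-organizing map: competitive learning on a 2-D grid.
import math, random

def best_matching_unit(grid, x):
    """Grid coordinates of the prototype closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(grid):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows=2, cols=2, epochs=60, lr=0.5, sigma=1.0, seed=0):
    """Train a rows x cols SOM; return the prototype grid."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # shrink rate and neighborhood
        for x in data:
            br, bc = best_matching_unit(grid, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2 * (sigma * decay + 1e-9) ** 2))
                    w = grid[r][c]
                    for i in range(dim):
                        w[i] += lr * decay * h * (x[i] - w[i])
    return grid
```

Each ensemble member's best-matching unit gives its position in the 2-D space, so the spread of members over the grid visualizes ensemble variability.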

Updated: 2025-08-08 22:15:43

Categories: cs.HC,cs.LG

Download: http://arxiv.org/abs/2508.06732v1

Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis

Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of their oral history archives can promote access and understanding of the oral histories. However, Large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: https://github.com/kc6699c/LLM4OralHistoryAnalysis.

Updated: 2025-08-08 22:06:23

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.06729v1

LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations at eBay

Sellers at eBay are recommended keyphrases to bid on to enhance the performance of their advertising campaigns. The relevance of these keyphrases is crucial for avoiding the overcrowding of search systems with irrelevant items and for maintaining a positive seller perception. It is essential that keyphrase recommendations align with both seller and Search judgments regarding auctions. Due to the difficulty of procuring negative human judgments at scale, employing an LLM-as-a-judge to mimic seller judgment has become the norm in several studies. This study introduces a novel two-step distillation process from an LLM judge, used to debias our Embedding Based Retrieval (EBR) model against the various biases present in click data. We distill from an LLM teacher via a cross-encoder assistant into a bi-encoder student using a multi-task training approach, ultimately employing the student bi-encoder to retrieve relevant advertiser keyphrases. We show that integrating a knowledge distillation process from LLMs in a multi-task training setup enhances bi-encoder performance in retrieving relevant advertiser keyphrases at eBay.

Updated: 2025-08-08 21:55:01

Categories: cs.IR, cs.AI, cs.LG

Download: http://arxiv.org/abs/2508.03628v2

Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition

In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, GPT-4o, Grok-4) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we report the following findings: (a) In all face benchmark datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improved on over-segmented face images compared to tightly cropped faces, thereby suggesting the importance of contextual clues. (c) A simple score-level fusion of a foundation model with a domain-specific face recognition model improved the accuracy at low false match rates. (d) Foundation models, such as GPT-4o and Grok-4, are able to provide explainability to the face recognition pipeline. In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace, thereby reiterating the importance of combining domain-specific face recognition models with generic foundation models in a judicious manner.
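Finding (c) mentions a simple score-level fusion. A minimal sketch of what such a fusion can look like, with min-max normalization and a weighting scheme that are illustrative assumptions rather than the paper's exact procedure:

```python
# Illustrative score-level fusion: normalize each model's match scores to
# [0, 1], then take a weighted sum. The weight w is a hypothetical choice.

def min_max_normalize(scores):
    """Map scores to [0, 1]; a constant score list maps to 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_scores(foundation_scores, face_model_scores, w=0.3):
    """Weighted sum of normalized scores; w is the foundation model's weight."""
    f = min_max_normalize(foundation_scores)
    d = min_max_normalize(face_model_scores)
    return [w * fs + (1 - w) * ds for fs, ds in zip(f, d)]

# Example: three probe-gallery pairs scored by both models.
fused = fuse_scores([0.2, 0.9, 0.5], [0.1, 0.8, 0.7])
```

Because both score lists are normalized before mixing, the fusion is insensitive to the two models using different raw score ranges.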

Updated: 2025-08-08 21:43:22

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2507.03541v2

Generative AI for Cel-Animation: A Survey

The traditional celluloid (cel) animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of cel-animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges such as visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation

Updated: 2025-08-08 21:38:42

Categories: cs.CV, cs.AI, cs.HC

Download: http://arxiv.org/abs/2501.06250v3

GLIDR: Graph-Like Inductive Logic Programming with Differentiable Reasoning

Differentiable inductive logic programming (ILP) techniques have proven effective at finding approximate rule-based solutions to link prediction and node classification problems on knowledge graphs; however, the common assumption of chain-like rule structure can hamper the performance and interpretability of existing approaches. We introduce GLIDR, a differentiable rule learning method that models the inference of logic rules with more expressive syntax than previous methods. GLIDR uses a differentiable message passing inference algorithm that generalizes previous chain-like rule learning methods to allow rules with features like branches and cycles. GLIDR has a simple and expressive rule search space which is parameterized by a limit on the maximum number of free variables that may be included in a rule. Explicit logic rules can be extracted from the weights of a GLIDR model for use with symbolic solvers. We demonstrate that GLIDR can significantly outperform existing rule learning methods on knowledge graph completion tasks and even compete with embedding methods despite the inherent disadvantage of being a structure-only prediction method. We show that rules extracted from GLIDR retain significant predictive performance, and that GLIDR is highly robust to training data noise. Finally, we demonstrate that GLIDR can be chained with deep neural networks and optimized end-to-end for rule learning on arbitrary data modalities.

Updated: 2025-08-08 21:31:55

Categories: cs.AI, cs.LG, cs.LO

Download: http://arxiv.org/abs/2508.06716v1

Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge

Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated. Our method models the difference in the scoring distribution that LLM-as-a-judge assigns to its own completions compared to other models, while accounting for the underlying quality of the completions provided by an independent, third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (>5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family-bias; systematically assigning higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance to mitigate biases when interpreting automated evaluations.
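The core idea of identifying self-bias while controlling for quality can be illustrated as a difference-in-differences against a third-party (e.g., human) judge. This is a hypothetical sketch of the intuition, not the paper's exact statistical estimator; all names and the synthetic numbers are assumptions:

```python
# Estimate self-bias as the extra margin a judge model assigns to its own
# completions, after controlling for completion quality via an independent
# third-party score. Illustrative difference-in-differences only.

def mean(xs):
    return sum(xs) / len(xs)

def self_bias(judge_own, human_own, judge_other, human_other):
    """Positive value: the judge favors its own outputs beyond their human-rated quality."""
    own_gap = mean([j - h for j, h in zip(judge_own, human_own)])
    other_gap = mean([j - h for j, h in zip(judge_other, human_other)])
    return own_gap - other_gap

# Synthetic example: the judge inflates its own outputs by about 1 point
# while scoring other models' outputs in line with human ratings.
bias = self_bias(
    judge_own=[8, 9, 8], human_own=[7, 8, 7],
    judge_other=[7, 6, 8], human_other=[7, 6, 8],
)
```

Subtracting the gap on other models' outputs is what keeps genuine quality differences from being mistaken for bias: a judge that is uniformly generous has equal gaps on both sides and a bias estimate near zero.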

Updated: 2025-08-08 21:22:12

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2508.06709v1

Probabilistic Circuits for Knowledge Graph Completion with Reduced Rule Sets

Rule-based methods for knowledge graph completion provide explainable results but often require a significantly large number of rules to achieve competitive performance. This can hinder explainability due to overwhelmingly large rule sets. We discover rule contexts (meaningful subsets of rules that work together) from training data and use learned probability distribution (i.e. probabilistic circuits) over these rule contexts to more rapidly achieve performance of the full rule set. Our approach achieves a 70-96% reduction in number of rules used while outperforming baseline by up to 31$\times$ when using equivalent minimal number of rules and preserves 91% of peak baseline performance even when comparing our minimal rule sets against baseline's full rule sets. We show that our framework is grounded in well-known semantics of probabilistic logic, does not require independence assumptions, and that our tractable inference procedure provides both approximate lower bounds and exact probability of a given query. The efficacy of our method is validated by empirical studies on 8 standard benchmark datasets where we show competitive performance by using only a fraction of the rules required by AnyBURL's standard inference method, the current state-of-the-art for rule-based knowledge graph completion. This work may have further implications for general probabilistic reasoning over learned sets of rules.

Updated: 2025-08-08 21:17:03

Categories: cs.AI, cs.LO

Download: http://arxiv.org/abs/2508.06706v1

CISO: Species Distribution Modeling Conditioned on Incomplete Species Observations

Species distribution models (SDMs) are widely used to predict species' geographic distributions, serving as critical tools for ecological research and conservation planning. Typically, SDMs relate species occurrences to environmental variables representing abiotic factors, such as temperature, precipitation, and soil properties. However, species distributions are also strongly influenced by biotic interactions with other species, which are often overlooked. While some methods partially address this limitation by incorporating biotic interactions, they often assume symmetrical pairwise relationships between species and require consistent co-occurrence data. In practice, species observations are sparse, and the availability of information about the presence or absence of other species varies significantly across locations. To address these challenges, we propose CISO, a deep learning-based method for species distribution modeling Conditioned on Incomplete Species Observations. CISO enables predictions to be conditioned on a flexible number of species observations alongside environmental variables, accommodating the variability and incompleteness of available biotic data. We demonstrate our approach using three datasets representing different species groups: sPlotOpen for plants, SatBird for birds, and a new dataset, SatButterfly, for butterflies. Our results show that including partial biotic information improves predictive performance on spatially separate test sets. When conditioned on a subset of species within the same dataset, CISO outperforms alternative methods in predicting the distribution of the remaining species. Furthermore, we show that combining observations from multiple datasets can improve performance. CISO is a promising ecological tool, capable of incorporating incomplete biotic information and identifying potential interactions between species from disparate taxa.

Updated: 2025-08-08 21:15:12

Categories: cs.LG

Download: http://arxiv.org/abs/2508.06704v1

MMFformer: Multimodal Fusion Transformer Network for Depression Detection

Depression is a serious mental health illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression from the content of social networks has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. A transformer network with residual connections captures spatial features from video, and a transformer encoder is used to model important temporal dynamics in audio. Moreover, a fusion architecture combines the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-score by 13.92% on the D-Vlog dataset and 7.74% on the LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
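Of the two fusion strategies mentioned, late (decision-level) fusion is the simpler to illustrate: per-modality logits are turned into probabilities and mixed with modality weights. The weights, shapes, and two-class setup below are illustrative assumptions, not MMFformer's actual architecture:

```python
# Minimal late-fusion sketch: softmax each modality's logits, then take a
# weighted average of the probability vectors. w_video is a hypothetical weight.
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def late_fusion(video_logits, audio_logits, w_video=0.6):
    pv = softmax(video_logits)
    pa = softmax(audio_logits)
    return [w_video * v + (1 - w_video) * a for v, a in zip(pv, pa)]

# Binary classification: [not depressed, depressed] logits per modality.
probs = late_fusion([0.2, 1.5], [1.0, 0.4])
```

Intermediate fusion, by contrast, would combine the modality features before the classifier head rather than mixing its output probabilities.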

Updated: 2025-08-08 21:03:29

Categories: cs.CV, cs.AI, cs.LG, cs.SD

Download: http://arxiv.org/abs/2508.06701v1

A Practical Introduction to Kernel Discrepancies: MMD, HSIC & KSD

This article provides a practical introduction to kernel discrepancies, focusing on the Maximum Mean Discrepancy (MMD), the Hilbert-Schmidt Independence Criterion (HSIC), and the Kernel Stein Discrepancy (KSD). Various estimators for these discrepancies are presented, including the commonly-used V-statistics and U-statistics, as well as several forms of the more computationally-efficient incomplete U-statistics. The importance of the choice of kernel bandwidth is stressed, showing how it affects the behaviour of the discrepancy estimation. Adaptive estimators are introduced, which combine multiple estimators with various kernels, addressing the problem of kernel selection.
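The commonly-used unbiased (U-statistic) estimator of MMD^2 can be sketched in a few lines. The Gaussian kernel and the fixed bandwidth below are illustrative choices; as the article stresses, the bandwidth strongly affects the estimate:

```python
# Unbiased U-statistic estimator of squared MMD with a Gaussian kernel on
# 1-D samples. Bandwidth bw=1.0 is an illustrative, not adaptive, choice.
import math
import random

def gauss_k(x, y, bw=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bw ** 2))

def mmd2_unbiased(xs, ys, bw=1.0):
    m, n = len(xs), len(ys)
    # Off-diagonal within-sample averages (the "U-statistic" part)...
    kxx = sum(gauss_k(a, b, bw) for i, a in enumerate(xs)
              for j, b in enumerate(xs) if i != j) / (m * (m - 1))
    kyy = sum(gauss_k(a, b, bw) for i, a in enumerate(ys)
              for j, b in enumerate(ys) if i != j) / (n * (n - 1))
    # ...minus twice the cross-sample average.
    kxy = sum(gauss_k(a, b, bw) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy

random.seed(0)
same = [random.gauss(0, 1) for _ in range(100)]
same2 = [random.gauss(0, 1) for _ in range(100)]
shifted = [random.gauss(3, 1) for _ in range(100)]
# MMD^2 is near zero for same-distribution samples and large for shifted ones.
```

The V-statistic variant simply keeps the diagonal (i == j) terms, trading unbiasedness for non-negativity.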

Updated: 2025-08-08 20:58:30

Categories: stat.ML, cs.LG

Download: http://arxiv.org/abs/2503.04820v2

Graph-Powered Defense: Controller Area Network Intrusion Detection for Unmanned Aerial Vehicles

The network of services, including delivery, farming, and environmental monitoring, has experienced exponential expansion in the past decade with Unmanned Aerial Vehicles (UAVs). Yet, UAVs are not robust enough against cyberattacks, especially on the Controller Area Network (CAN) bus. The CAN bus is a general-purpose vehicle-bus standard to enable microcontrollers and in-vehicle computers to interact, primarily connecting different Electronic Control Units (ECUs). In this study, we focus on solving some of the most critical security weaknesses in UAVs by developing a novel graph-based intrusion detection system (IDS) leveraging the Uncomplicated Application-level Vehicular Communication and Networking (UAVCAN) protocol. First, we decode CAN messages based on UAVCAN protocol specification; second, we present a comprehensive method of transforming tabular UAVCAN messages into graph structures. Lastly, we apply various graph-based machine learning models for detecting cyber-attacks on the CAN bus, including graph convolutional neural networks (GCNNs), graph attention networks (GATs), Graph Sample and Aggregate Networks (GraphSAGE), and graph structure-based transformers. Our findings show that inductive models such as GATs, GraphSAGE, and graph-based transformers can achieve competitive and even better accuracy than transductive models like GCNNs in detecting various types of intrusions, with minimum information on protocol specification, thus providing a generic robust solution for CAN bus security for the UAVs. We also compared our results with baseline single-layer Long Short-Term Memory (LSTM) and found that all our graph-based models perform better without using any decoded features based on the UAVCAN protocol, highlighting higher detection performance with protocol-independent capability.
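One simple way to turn a tabular CAN message stream into a graph, sketched here as an assumed illustration (the paper's actual transformation of decoded UAVCAN messages is more comprehensive), is to treat message IDs as nodes and consecutive-message transitions as weighted directed edges:

```python
# Illustrative CAN-stream-to-graph conversion: nodes are message IDs and a
# directed edge (a, b) counts how often message b immediately follows a.
# This is a hypothetical simplification of the paper's pipeline.

def can_stream_to_graph(message_ids):
    """Return (nodes, weighted_edges) built from consecutive-message transitions."""
    nodes = set(message_ids)
    edges = {}
    for src, dst in zip(message_ids, message_ids[1:]):
        edges[(src, dst)] = edges.get((src, dst), 0) + 1
    return nodes, edges

# A toy stream of CAN arbitration IDs.
nodes, edges = can_stream_to_graph([0x101, 0x202, 0x101, 0x202, 0x303])
```

The resulting node set and edge weights can then be featurized and fed to graph models such as GCNNs, GATs, or GraphSAGE.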

Updated: 2025-08-08 20:45:18

Categories: cs.AI

Download: http://arxiv.org/abs/2412.02539v2

A Tight Lower Bound for the Approximation Guarantee of Higher-Order Singular Value Decomposition

We prove that the classic approximation guarantee for the higher-order singular value decomposition (HOSVD) is tight by constructing a tensor for which HOSVD achieves an approximation ratio of $N/(1+\varepsilon)$, for any $\varepsilon > 0$. This matches the upper bound of De Lathauwer et al. (2000a) and shows that the approximation ratio of HOSVD cannot be improved. Using a more advanced construction, we also prove that the approximation guarantees for the ST-HOSVD algorithm of Vannieuwenhoven et al. (2012) and higher-order orthogonal iteration (HOOI) of De Lathauwer et al. (2000b) are tight by showing that they can achieve their worst-case approximation ratio of $N / (1 + \varepsilon)$, for any $\varepsilon > 0$.
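For context, the classic guarantee whose tightness is being established is the HOSVD quasi-optimality bound of De Lathauwer et al. (2000a); one common way to state it in squared Frobenius norm (with $\sigma_{i_n}^{(n)}$ the mode-$n$ singular values and truncation ranks $r_1,\dots,r_N$) is:

```latex
% HOSVD quasi-optimality: the truncation error is bounded by the sum of the
% discarded mode-wise singular values squared, hence by N times the best
% possible error. The paper's construction shows this factor N is tight.
\|\mathcal{A} - \hat{\mathcal{A}}_{\mathrm{HOSVD}}\|_F^2
  \;\le\; \sum_{n=1}^{N} \sum_{i_n > r_n} \bigl(\sigma_{i_n}^{(n)}\bigr)^2
  \;\le\; N \min_{\operatorname{rank}(\mathcal{B}) \le (r_1,\dots,r_N)}
          \|\mathcal{A} - \mathcal{B}\|_F^2 .
```

The approximation ratio $N/(1+\varepsilon)$ achieved by the constructed tensor approaches the factor $N$ on the right as $\varepsilon \to 0$, showing the bound cannot be improved.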

Updated: 2025-08-08 20:34:57

Categories: cs.DS, cs.LG

Download: http://arxiv.org/abs/2508.06693v1

Stabilizing Federated Learning under Extreme Heterogeneity with HeteRo-Select

Federated Learning (FL) is a machine learning technique that often suffers from training instability due to the diverse nature of client data. Although utility-based client selection methods like Oort speed convergence by prioritizing high-loss clients, they frequently experience significant drops in accuracy during later stages of training. We propose HeteRo-Select, a theoretically grounded framework designed to maintain high performance and ensure long-term training stability. We provide a theoretical analysis showing that when client data is highly heterogeneous, selecting a well-chosen subset of participating clients can reduce communication more effectively than full participation. Our HeteRo-Select method uses a clear, step-by-step scoring system that considers client usefulness, fairness, update speed, and data variety, and admits convergence guarantees under strong regularization. Our experimental results on the CIFAR-10 dataset under significant label skew ($\alpha=0.1$) support the theoretical findings. The HeteRo-Select method performs better than existing approaches in terms of peak accuracy, final accuracy, and training stability. Specifically, HeteRo-Select achieves a peak accuracy of $74.75\%$, a final accuracy of $72.76\%$, and a minimal stability drop of $1.99\%$. In contrast, Oort records a lower peak accuracy of $73.98\%$, a final accuracy of $71.25\%$, and a larger stability drop of $2.73\%$. The theoretical foundations and empirical performance in our study make HeteRo-Select a reliable solution for real-world heterogeneous FL problems.
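A scoring system over usefulness, fairness, update speed, and data variety can be sketched as a weighted combination; the weights, field names, and linear form below are illustrative assumptions, not the paper's actual formula:

```python
# Hypothetical client-selection score inspired by the described criteria.
# All criterion values are assumed pre-normalized to [0, 1].

def heteroselect_score(client, w_util=0.4, w_fair=0.3, w_speed=0.2, w_div=0.1):
    """Combine normalized criteria into a single selection score."""
    return (w_util * client["loss"]          # usefulness: prioritize high loss
            + w_fair * client["staleness"]   # fairness: long-unselected clients
            + w_speed * client["speed"]      # update speed: fast responders
            + w_div * client["diversity"])   # data variety: label coverage

def select_clients(clients, k):
    """Pick the k highest-scoring clients for this round."""
    return sorted(clients, key=heteroselect_score, reverse=True)[:k]

clients = [
    {"id": 0, "loss": 0.9, "staleness": 0.1, "speed": 0.5, "diversity": 0.3},
    {"id": 1, "loss": 0.2, "staleness": 0.9, "speed": 0.9, "diversity": 0.8},
    {"id": 2, "loss": 0.5, "staleness": 0.5, "speed": 0.4, "diversity": 0.4},
]
chosen = select_clients(clients, k=2)
```

The fairness and diversity terms are what distinguish such a score from a pure utility rule like Oort's: a high-loss client can still be skipped in favor of one that restores coverage of under-sampled clients or labels.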

Updated: 2025-08-08 20:33:34

Categories: cs.LG

Download: http://arxiv.org/abs/2508.06692v1

Role of Large Language Models and Retrieval-Augmented Generation for Accelerating Crystalline Material Discovery: A Systematic Review

Large language models (LLMs) have emerged as powerful tools for knowledge-intensive tasks across domains. In materials science, finding novel materials for energy-efficient devices in real-world applications requires many time- and cost-intensive simulations and experiments. To narrow the uncharted material search space and minimize experimental cost, LLMs can play a larger role by first providing an accelerated search over promising known material candidates. Furthermore, the integration of LLMs with domain-specific information via retrieval-augmented generation (RAG) is poised to revolutionize how researchers predict materials structures, analyze defects, discover novel compounds, and extract knowledge from literature and databases. Motivated by the potential of LLMs and RAG to accelerate materials discovery, this paper presents a broad and systematic review of recent advancements in applying LLMs and RAG to key materials science problems. We survey state-of-the-art developments in crystal structure prediction, defect analysis, materials discovery, literature mining, database integration, and multi-modal retrieval, highlighting how combining LLMs with external knowledge sources enables new capabilities. We discuss the performance, limitations, and implications of these approaches, and outline future directions for leveraging LLMs to accelerate materials research and discovery, advancing technologies in electronics, optics, biomedicine, and energy storage.

Updated: 2025-08-08 20:32:56

Categories: cond-mat.mtrl-sci, cs.LG

Download: http://arxiv.org/abs/2508.06691v1

Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values

Agentic AI systems, possessing capabilities for autonomous planning and action, show great potential across diverse domains. However, their practical deployment is hindered by challenges in aligning their behavior with varied human values, complex safety requirements, and specific compliance needs. Existing alignment methodologies often falter when faced with the complex task of providing personalized context without inducing confabulation or operational inefficiencies. This paper introduces a novel solution: a 'superego' agent, designed as a personalized oversight mechanism for agentic AI. This system dynamically steers AI planning by referencing user-selected 'Creed Constitutions' encapsulating diverse rule sets -- with adjustable adherence levels to fit non-negotiable values. A real-time compliance enforcer validates plans against these constitutions and a universal ethical floor before execution. We present a functional system, including a demonstration interface with a prototypical constitution-sharing portal, and successful integration with third-party models via the Model Context Protocol (MCP). Comprehensive benchmark evaluations (HarmBench, AgentHarm) demonstrate that our Superego agent dramatically reduces harmful outputs -- achieving up to a 98.3% harm score reduction and near-perfect refusal rates (e.g., 100% with Claude Sonnet 4 on AgentHarm's harmful set) for leading LLMs like Gemini 2.5 Flash and GPT-4o. This approach substantially simplifies personalized AI alignment, rendering agentic systems more reliably attuned to individual and cultural contexts, while also enabling substantial safety improvements. An overview on this research with examples is available at https://superego.creed.space.

Updated: 2025-08-08 20:29:52

Categories: cs.AI, cs.CY, cs.MA

Download: http://arxiv.org/abs/2506.13774v2

Are Vision Foundation Models Ready for Out-of-the-Box Medical Image Registration?

Foundation models, pre-trained on large image datasets and capable of capturing rich feature representations, have recently shown potential for zero-shot image registration. However, their performance has mostly been tested in the context of rigid or less complex structures, such as the brain or abdominal organs, and it remains unclear whether these models can handle more challenging, deformable anatomy. Breast MRI registration is particularly difficult due to significant anatomical variation between patients, deformation caused by patient positioning, and the presence of thin and complex internal structure of fibroglandular tissue, where accurate alignment is crucial. Whether foundation model-based registration algorithms can address this level of complexity remains an open question. In this study, we provide a comprehensive evaluation of foundation model-based registration algorithms for breast MRI. We assess five pre-trained encoders, including DINO-v2, SAM, MedSAM, SSLSAM, and MedCLIP, across four key breast registration tasks that capture variations in different years and dates, sequences, modalities, and patient disease status (lesion versus no lesion). Our results show that foundation model-based algorithms such as SAM outperform traditional registration baselines for overall breast alignment, especially under large domain shifts, but struggle with capturing fine details of fibroglandular tissue. Interestingly, additional pre-training or fine-tuning on medical or breast-specific images in MedSAM and SSLSAM, does not improve registration performance and may even decrease it in some cases. Further work is needed to understand how domain-specific training influences registration and to explore targeted strategies that improve both global alignment and fine structure accuracy. We also publicly release our code at \href{https://github.com/mazurowski-lab/Foundation-based-reg}{Github}.

Updated: 2025-08-08 20:15:59

Categories: eess.IV, cs.AI, cs.CV

Download: http://arxiv.org/abs/2507.11569v2

Schema-Guided Scene-Graph Reasoning based on Multi-Agent Large Language Model System

Scene graphs have emerged as a structured and serializable environment representation for grounded spatial reasoning with Large Language Models (LLMs). In this work, we propose SG^2, an iterative Schema-Guided Scene-Graph reasoning framework based on multi-agent LLMs. The agents are grouped into two modules: a (1) Reasoner module for abstract task planning and graph information queries generation, and a (2) Retriever module for extracting corresponding graph information based on code-writing following the queries. Two modules collaborate iteratively, enabling sequential reasoning and adaptive attention to graph information. The scene graph schema, prompted to both modules, serves to not only streamline both reasoning and retrieval process, but also guide the cooperation between two modules. This eliminates the need to prompt LLMs with full graph data, reducing the chance of hallucination due to irrelevant information. Through experiments in multiple simulation environments, we show that our framework surpasses existing LLM-based approaches and baseline single-agent, tool-based Reason-while-Retrieve strategy in numerical Q\&A and planning tasks.

Updated: 2025-08-08 19:55:03

Categories: cs.LG,cs.AI,cs.MA,cs.RO

Download: http://arxiv.org/abs/2502.03450v2

Watermarking Kolmogorov-Arnold Networks for Emerging Networked Applications via Activation Perturbation

With the increasing importance of protecting intellectual property in machine learning, watermarking techniques have gained significant attention. As advanced models are increasingly deployed in domains such as social network analysis, the need for robust model protection becomes even more critical. While existing watermarking methods have demonstrated effectiveness for conventional deep neural networks, they often fail to adapt to the novel architecture, Kolmogorov-Arnold Networks (KAN), which feature learnable activation functions. KAN holds strong potential for modeling complex relationships in network-structured data. However, their unique design also introduces new challenges for watermarking. Therefore, we propose a novel watermarking method, Discrete Cosine Transform-based Activation Watermarking (DCT-AW), tailored for KAN. Leveraging the learnable activation functions of KAN, our method embeds watermarks by perturbing activation outputs using discrete cosine transform, ensuring compatibility with diverse tasks and achieving task independence. Experimental results demonstrate that DCT-AW has a small impact on model performance and provides superior robustness against various watermark removal attacks, including fine-tuning, pruning, and retraining after pruning.
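
The core embedding step, perturbing activation outputs in the DCT domain, can be sketched in a few lines. The function names and the orthonormal DCT-II construction below are mine, not the paper's, and a real DCT-AW implementation would operate on KAN's learnable activation functions rather than a plain vector:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis as an n x n matrix (M @ M.T == I).
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] *= 1.0 / np.sqrt(2.0)
    return M

def embed_watermark(activations, coeff_index, strength):
    """Perturb one DCT coefficient of an activation vector and return
    the watermarked activations (inverse DCT of the perturbed spectrum)."""
    M = dct_matrix(len(activations))
    spectrum = M @ activations
    spectrum[coeff_index] += strength
    return M.T @ spectrum  # M is orthonormal, so M.T is the inverse

def detect_watermark(activations, watermarked, coeff_index):
    # The perturbation survives exactly in the chosen coefficient.
    M = dct_matrix(len(activations))
    return (M @ watermarked)[coeff_index] - (M @ activations)[coeff_index]
```

Because the transform is linear and orthonormal, the embedded strength is recovered exactly from the coefficient difference, which is what makes frequency-domain perturbations attractive for watermarking.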

Updated: 2025-08-08 19:51:59

Categories: cs.LG

Download: http://arxiv.org/abs/2508.06676v1

An information-matching approach to optimal experimental design and active learning

The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
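
The paper casts data selection as a convex optimization problem; as a hedged illustration only, here is a greedy stand-in for a linear-Gaussian model, where the Fisher Information Matrix reduces to X^T X and candidates are scored by the information they add along a quantity-of-interest gradient (all names are mine):

```python
import numpy as np

def fisher_information(X):
    # For a linear-Gaussian model y = X @ theta + unit-variance noise,
    # the Fisher Information Matrix of a dataset is simply X^T X.
    return X.T @ X

def greedy_information_matching(X, qoi_grad, k):
    """Greedily pick k candidate rows that maximize the information
    the selected data carry along the QoI gradient direction."""
    selected = []
    remaining = list(range(len(X)))
    F = np.zeros((X.shape[1], X.shape[1]))
    for _ in range(k):
        # score each candidate by the information it adds along qoi_grad
        def gain(i):
            Fi = F + np.outer(X[i], X[i])
            return qoi_grad @ Fi @ qoi_grad
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
        F += np.outer(X[best], X[best])
    return selected, F
```

The greedy loop is only a sketch of the idea; the actual method solves a convex program, which scales to much larger candidate pools.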

Updated: 2025-08-08 19:50:35

Categories: cs.LG,cond-mat.mtrl-sci,physics.app-ph,physics.comp-ph,physics.data-an

Download: http://arxiv.org/abs/2411.02740v3

Zero-Shot Cellular Trajectory Map Matching

Cellular Trajectory Map-Matching (CTMM) aims to align cellular location sequences to road networks, which is a necessary preprocessing in location-based services on web platforms like Google Maps, including navigation and route optimization. Current approaches mainly rely on ID-based features and region-specific data to learn correlations between cell towers and roads, limiting their adaptability to unexplored areas. To enable high-accuracy CTMM without additional training in target regions, zero-shot CTMM requires extracting not only region-adaptive features but also sequential features and location uncertainty to alleviate positioning errors in cellular data. In this paper, we propose a pixel-based trajectory calibration assistant for zero-shot CTMM, which takes advantage of transferable geospatial knowledge to calibrate the pixelated trajectory and then guide the path-finding process at the road network level. To enhance knowledge sharing across similar regions, a Gaussian mixture model is incorporated into a variational autoencoder (VAE), enabling the identification of scenario-adaptive experts through soft clustering. To mitigate high positioning errors, a spatial-temporal awareness module is designed to capture sequential features and location uncertainty, thereby facilitating the inference of approximate user positions. Finally, a constrained path-finding algorithm is employed to reconstruct the road ID sequence, ensuring topological validity within the road network. This process is guided by the calibrated trajectory while optimizing for the shortest feasible path, thus minimizing unnecessary detours. Extensive experiments demonstrate that our model outperforms existing methods in zero-shot CTMM by 16.8\%.
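
The abstract does not spell out the constrained path-finding step; as a minimal sketch, a standard Dijkstra shortest-path search over a road graph captures the "shortest feasible, topologically valid" idea (the toy graph and names are mine):

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm: a minimal stand-in for the constrained
    path-finding step that reconstructs a topologically valid road
    ID sequence while minimizing detours."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    seen = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == goal:  # reconstruct the road ID sequence
            path = [u]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return float("inf"), []
```

In the paper's setting the edge weights would additionally be shaped by the calibrated pixel trajectory, which is what steers the search away from unnecessary detours.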

Updated: 2025-08-08 19:47:45

Categories: cs.AI

Download: http://arxiv.org/abs/2508.06674v1

Do Biased Models Have Biased Thoughts?

The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that studies the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: \textit{Do biased models have biased thoughts}? To answer our question, we conduct experiments on $5$ popular large language models using fairness metrics to quantify $11$ different biases in the model's thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than $0.6$ correlation with a $p$-value smaller than $0.001$ in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.
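
The headline number is a correlation between bias measured in the thinking steps and bias measured in the output. A minimal sketch of that machinery, using a permutation test for the p-value rather than whatever test the authors used (names mine):

```python
import numpy as np

def pearson_r(x, y):
    # Pearson correlation via standardized variables.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.mean(x * y))

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the null of zero correlation."""
    rng = np.random.default_rng(seed)
    observed = abs(pearson_r(x, y))
    count = 0
    for _ in range(n_perm):
        if abs(pearson_r(x, rng.permutation(y))) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

Here `x` would hold a bias score per thought trace and `y` the corresponding output bias score; a correlation below 0.6 with a small p-value, as reported, means the thoughts carry real but weak signal about the output bias.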

Updated: 2025-08-08 19:41:20

Categories: cs.CL,cs.AI,I.2.7

Download: http://arxiv.org/abs/2508.06671v1

Machines Learn Number Fields, But How? The Case of Galois Groups

By applying interpretable machine learning methods such as decision trees, we study how simple models can classify the Galois groups of Galois extensions over $\mathbb{Q}$ of degrees 4, 6, 8, 9, and 10, using Dedekind zeta coefficients. Our interpretation of the machine learning results allows us to understand how the distribution of zeta coefficients depends on the Galois group, and to prove new criteria for classifying the Galois groups of these extensions. Combined with previous results, this work provides another example of a new paradigm in mathematical research driven by machine learning.

Updated: 2025-08-08 19:32:11

Categories: math.NT,cs.LG,11R32, 11R42, 11S15, 11S20

Download: http://arxiv.org/abs/2508.06670v1

Formal Concept Analysis: a Structural Framework for Variability Extraction and Analysis

Formal Concept Analysis (FCA) is a mathematical framework for knowledge representation and discovery. It performs a hierarchical clustering over a set of objects described by attributes, resulting in conceptual structures in which objects are organized depending on the attributes they share. These conceptual structures naturally highlight commonalities and variabilities among similar objects by categorizing them into groups which are then arranged by similarity, making it particularly appropriate for variability extraction and analysis. Despite the potential of FCA, determining which of its properties can be leveraged for variability-related tasks (and how) is not always straightforward, partly due to the mathematical orientation of its foundational literature. This paper attempts to bridge part of this gap by gathering a selection of properties of the framework which are essential to variability analysis, and how they can be used to interpret diverse variability information within the resulting conceptual structures.
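
The Galois closure underlying formal concepts can be illustrated on a toy context. This naive enumeration (all names mine) is exponential and only suitable for tiny examples; practical FCA relies on algorithms such as NextClosure:

```python
from itertools import combinations

def close(objects, context):
    """Galois closure: take the attributes shared by all given objects,
    then all objects possessing every one of those attributes."""
    attrs = set.intersection(*(context[o] for o in objects)) if objects else \
            set.union(*context.values())
    objs = {o for o, a in context.items() if attrs <= a}
    return frozenset(objs), frozenset(attrs)

def formal_concepts(context):
    """Enumerate all formal concepts of a small object -> attributes context
    by closing every subset of objects."""
    concepts = set()
    objs = list(context)
    for r in range(len(objs) + 1):
        for combo in combinations(objs, r):
            concepts.add(close(set(combo), context))
    return concepts
```

Each (extent, intent) pair groups objects by exactly the attributes they share, which is the commonality/variability structure the paper leverages.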

Updated: 2025-08-08 19:30:14

Categories: cs.AI,cs.IR,cs.SE

Download: http://arxiv.org/abs/2508.06668v1

Transferring Social Network Knowledge from Multiple GNN Teachers to Kolmogorov-Arnold Networks

Graph Neural Networks (GNNs) have shown strong performance on graph-structured data, but their reliance on graph connectivity often limits scalability and efficiency. Kolmogorov-Arnold Networks (KANs), a recent architecture with learnable univariate functions, offer strong nonlinear expressiveness and efficient inference. In this work, we integrate KANs into three popular GNN architectures-GAT, SGC, and APPNP-resulting in three new models: KGAT, KSGC, and KAPPNP. We further adopt a multi-teacher knowledge amalgamation framework, where knowledge from multiple KAN-based GNNs is distilled into a graph-independent KAN student model. Experiments on benchmark datasets show that the proposed models improve node classification accuracy, and the knowledge amalgamation approach significantly boosts student model performance. Our findings highlight the potential of KANs for enhancing GNN expressiveness and for enabling efficient, graph-free inference.

Updated: 2025-08-08 19:26:31

Categories: cs.LG

Download: http://arxiv.org/abs/2508.06663v1

In-Context Reinforcement Learning via Communicative World Models

Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their training environments. To boost agents' in-context RL (ICRL) ability, this work formulates ICRL as a two-agent emergent communication problem and introduces CORAL (Communicative Representation for Adaptive RL), a framework that learns a transferable communicative context by decoupling latent representation learning from control. In CORAL, an Information Agent (IA) is pre-trained as a world model on a diverse distribution of tasks. Its objective is not to maximize task reward, but to build a world model and distill its understanding into concise messages. The emergent communication protocol is shaped by a novel Causal Influence Loss, which measures the effect that the message has on the next action. During deployment, the previously trained IA serves as a fixed contextualizer for a new Control Agent (CA), which learns to solve tasks by interpreting the provided communicative context. Our experiments demonstrate that this approach enables the CA to achieve significant gains in sample efficiency and successfully perform zero-shot adaptation with the help of pre-trained IA in entirely unseen sparse-reward environments, validating the efficacy of learning a transferable communicative representation.

Updated: 2025-08-08 19:23:23

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06659v1

MS-IMAP -- A Multi-Scale Graph Embedding Approach for Interpretable Manifold Learning

Deriving meaningful representations from complex, high-dimensional data in unsupervised settings is crucial across diverse machine learning applications. This paper introduces a framework for multi-scale graph network embedding based on spectral graph wavelets that employs a contrastive learning approach. We theoretically show that in Paley-Wiener spaces on combinatorial graphs, the spectral graph wavelets operator provides greater flexibility and control over smoothness compared to the Laplacian operator, motivating our approach. A key advantage of the proposed embedding is its ability to establish a correspondence between the embedding and input feature spaces, enabling the derivation of feature importance. We validate the effectiveness of our graph embedding framework on multiple public datasets across various downstream tasks, including clustering and unsupervised feature importance.
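
As a hedged illustration of the spectral graph wavelet operator the framework builds on, the sketch below applies a band-pass kernel g(s·λ) to a graph signal via the Laplacian eigendecomposition; the specific kernel and all names are mine, not the paper's:

```python
import numpy as np

def graph_laplacian(A):
    # Combinatorial Laplacian L = D - A of an adjacency matrix.
    return np.diag(A.sum(axis=1)) - A

def spectral_graph_wavelet(A, signal, scale):
    """Apply a band-pass wavelet kernel g(s*lam) = s*lam * exp(-s*lam)
    to a graph signal through the Laplacian eigendecomposition."""
    lam, U = np.linalg.eigh(graph_laplacian(A))
    g = scale * lam * np.exp(-scale * lam)
    return U @ (g * (U.T @ signal))
```

Because g(0) = 0, constant (DC) signals are annihilated, and varying `scale` trades off smoothness against locality, which is the extra flexibility over the bare Laplacian that motivates the paper's embedding.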

Updated: 2025-08-08 19:16:59

Categories: cs.LG

Download: http://arxiv.org/abs/2406.02778v5

Federated Online Learning for Heterogeneous Multisource Streaming Data

Federated learning has emerged as an essential paradigm for distributed multi-source data analysis under privacy concerns. Most existing federated learning methods focus on the ``static" datasets. However, in many real-world applications, data arrive continuously over time, forming streaming datasets. This introduces additional challenges for data storage and algorithm design, particularly under high-dimensional settings. In this paper, we propose a federated online learning (FOL) method for distributed multi-source streaming data analysis. To account for heterogeneity, a personalized model is constructed for each data source, and a novel ``subgroup" assumption is employed to capture potential similarities, thereby enhancing model performance. We adopt the penalized renewable estimation method and the efficient proximal gradient descent for model training. The proposed method aligns with both federated and online learning frameworks: raw data are not exchanged among sources, ensuring data privacy, and only summary statistics of previous data batches are required for model updates, significantly reducing storage demands. Theoretically, we establish the consistency properties for model estimation, variable selection, and subgroup structure recovery, demonstrating optimal statistical efficiency. Simulations illustrate the effectiveness of the proposed method. Furthermore, when applied to the financial lending data and the web log data, the proposed method also exhibits advantageous prediction performance. Results of the analysis also provide some practical insights.
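
The "summary statistics only" property can be seen in a minimal sketch: proximal gradient descent for an L1-penalized least-squares fit needs only X^T X and X^T y, never the raw rows. This toy (names mine) is not the paper's penalized renewable estimator, just the underlying mechanics:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the L1 penalty.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def online_lasso(XtX, Xty, n, lam, steps=500):
    """Proximal gradient descent for an L1-penalized least-squares fit,
    using only the summary statistics X^T X and X^T y. New data batches
    would simply be added into these two accumulators."""
    beta = np.zeros(XtX.shape[0])
    L = np.linalg.eigvalsh(XtX).max() / n  # Lipschitz constant of the gradient
    for _ in range(steps):
        grad = (XtX @ beta - Xty) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```

Streaming updates then amount to `XtX += Xb.T @ Xb; Xty += Xb.T @ yb` per batch, which is why storage demands stay low as data arrive.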

Updated: 2025-08-08 19:08:53

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2508.06652v1

Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN

Synthetic data generation has become essential for securely sharing and analyzing sensitive data sets. Traditional anonymization techniques, however, often fail to adequately preserve privacy. We introduce the Tabular Auto-Regressive Generative Network (TabularARGN), a neural network architecture specifically designed for generating high-quality synthetic tabular data. Using a discretization-based auto-regressive approach, TabularARGN achieves high data fidelity while remaining computationally efficient. We evaluate TabularARGN against existing synthetic data generation methods, showing competitive results in statistical similarity, machine learning utility, and detection robustness. We further perform an in-depth privacy evaluation using systematic membership-inference attacks, highlighting the robustness and effective privacy-utility balance of our approach.

Updated: 2025-08-08 18:57:23

Categories: cs.LG

Download: http://arxiv.org/abs/2508.06647v1

Training 3D ResNets to Extract BSM Physics Parameters from Simulated Data

We report on a novel application of computer vision techniques to extract beyond the Standard Model parameters directly from high energy physics flavor data. We propose a novel data representation that transforms the angular and kinematic distributions into ``quasi-images", which are used to train a convolutional neural network to perform regression tasks, similar to fitting. As a proof-of-concept, we train a 34-layer Residual Neural Network to regress on these images and determine information about the Wilson Coefficient $C_{9}$ in Monte Carlo simulations of $B^0 \rightarrow K^{*0}\mu^{+}\mu^{-}$ decays. The method described here can be generalized and may find applicability across a variety of experiments.
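
The "quasi-image" idea, binning angular and kinematic distributions into a 2-D array a CNN can regress on, can be sketched as follows; the variable names and axis ranges are mine, not the analysis's:

```python
import numpy as np

def quasi_image(angle, momentum, bins=16, ranges=((-1, 1), (0, 5))):
    """Bin two kinematic variables of simulated decays into a 2-D
    'quasi-image' suitable as CNN input for regression."""
    img, _, _ = np.histogram2d(angle, momentum, bins=bins, range=ranges)
    return img / max(img.sum(), 1.0)  # normalize so images are comparable
```

Stacking such histograms over several variable pairs (or over a third binned variable) yields the 3-D inputs a 3D ResNet expects.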

Updated: 2025-08-08 18:48:36

Categories: hep-ex,cs.LG,hep-ph

Download: http://arxiv.org/abs/2311.13060v4

Symbolic Execution in Practice: A Survey of Applications in Vulnerability, Malware, Firmware, and Protocol Analysis

Symbolic execution is a powerful program analysis technique that allows for the systematic exploration of all program paths. Path explosion, where the number of states to track becomes unwieldy, is one of the biggest challenges hindering symbolic execution's practical application. To combat this, researchers have employed various strategies to enable symbolic execution on complex software systems. This paper introduces a systematic taxonomy of these strategies, categorizing them into two primary approaches: Scope Reduction, which aims to reduce the scope of symbolic execution to manageable portions of code, and Guidance Heuristics, which steer the symbolic execution engine toward promising paths. Using this taxonomy as a lens, we survey applications of symbolic executions in several domains such as vulnerability analysis, malware analysis, firmware re-hosting, and network protocol analysis. Finally, we identify promising directions for future research, including the application of symbolic execution to real-time operating systems and modern, type-safe languages.
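
As a toy illustration of path exploration, and of why paths multiply, the sketch below enumerates the feasible paths of a tiny branching program by brute force; a real symbolic executor would instead solve each path's constraints with an SMT solver (the program and names are mine):

```python
def toy_program(x, trace):
    # A tiny branching program; `trace` records the path taken.
    if x > 5:
        trace.append("x>5")
        if x % 2 == 0:
            trace.append("even")
            return "A"
        trace.append("odd")
        return "B"
    trace.append("x<=5")
    return "C"

def explore_paths(inputs):
    """Brute-force path enumeration: a stand-in for symbolic execution.
    Each distinct trace is one program path; the stored value is a
    witness input and output for that path."""
    paths = {}
    for x in inputs:
        trace = []
        out = toy_program(x, trace)
        paths.setdefault(tuple(trace), (x, out))
    return paths
```

Each nested branch doubles the number of potential paths, which is exactly the path-explosion problem the surveyed scope-reduction and guidance strategies try to tame.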

Updated: 2025-08-08 18:43:45

Categories: cs.CR

Download: http://arxiv.org/abs/2508.06643v1

HERGC: Heterogeneous Experts Representation and Generative Completion for Multimodal Knowledge Graphs

Multimodal knowledge graphs (MMKGs) enrich traditional knowledge graphs (KGs) by incorporating diverse modalities such as images and text. Multimodal knowledge graph completion (MMKGC) seeks to exploit these heterogeneous signals to infer missing facts, thereby mitigating the intrinsic incompleteness of MMKGs. Existing MMKGC methods typically leverage only the information contained in the MMKGs under the closed-world assumption and adopt discriminative training objectives, which limits their reasoning capacity during completion. Recent large language models (LLMs), empowered by massive parameter scales and pretraining on vast corpora, have demonstrated strong reasoning abilities across various tasks. However, their potential in MMKGC remains largely unexplored. To bridge this gap, we propose HERGC, a flexible Heterogeneous Experts Representation and Generative Completion framework for MMKGs. HERGC first deploys a Heterogeneous Experts Representation Retriever that enriches and fuses multimodal information and retrieves a compact candidate set for each incomplete triple. It then uses a Generative LLM Predictor, implemented via either in-context learning or lightweight fine-tuning, to accurately identify the correct answer from these candidates. Extensive experiments on three standard MMKG benchmarks demonstrate HERGC's effectiveness and robustness, achieving superior performance over existing methods.

Updated: 2025-08-08 18:42:44

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.00826v2

Benchmarking Self-Driving Labs

A key goal of modern materials science is accelerating the pace of materials discovery. Self-driving labs (SDLs), or systems that select experiments using machine learning and then execute them using automation, are designed to fulfil this promise by performing experiments faster, more intelligently, more reliably, and with richer metadata than conventional means. This review summarizes progress in understanding the degree to which SDLs accelerate learning by quantifying how much they reduce the number of experiments required for a given goal. The review begins by summarizing the theory underlying two key metrics, namely acceleration factor AF and enhancement factor EF, which quantify how much faster and better an algorithm is relative to a reference strategy. Next, we provide a comprehensive review of the literature, which reveals a wide range of AFs, with a median of 6 that tends to increase with the dimensionality of the space, reflecting an interesting blessing of dimensionality. In contrast, reported EF values vary by over two orders of magnitude, although they consistently peak at 10-20 experiments per dimension. To understand these results, we perform a series of simulated Bayesian optimization campaigns that reveal how EF depends upon the statistical properties of the parameter space while AF depends on its complexity. Collectively, these results reinforce the motivation for using SDLs by revealing their value across a wide range of material parameter spaces and provide a common language for quantifying and understanding this acceleration.
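
Under the plain reading of the two metrics, how much faster (AF) and how much better at a fixed budget (EF), a sketch might look like this. Exact definitions vary across the surveyed literature, and the maximization convention below is my assumption:

```python
import numpy as np

def acceleration_factor(ref_trace, sdl_trace, target):
    """AF: experiments the reference strategy needs to reach `target`
    divided by experiments the SDL needs (maximization convention)."""
    def n_to_reach(trace):
        best = np.maximum.accumulate(trace)   # best value found so far
        hits = np.nonzero(best >= target)[0]
        return hits[0] + 1 if len(hits) else np.inf
    return n_to_reach(ref_trace) / n_to_reach(sdl_trace)

def enhancement_factor(ref_trace, sdl_trace, budget):
    """EF: best value found by the SDL within `budget` experiments
    relative to the reference strategy's best at the same budget."""
    return max(sdl_trace[:budget]) / max(ref_trace[:budget])
```

Here a "trace" is the sequence of objective values observed experiment by experiment; the review's median AF of 6 means an SDL typically reaches a given target in one sixth as many experiments as the reference strategy.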

Updated: 2025-08-08 18:41:40

Categories: cond-mat.mtrl-sci,cs.LG,physics.data-an

Download: http://arxiv.org/abs/2508.06642v1

Fractal Language Modelling by Universal Sequence Maps (USM)

Motivation: With the advent of Language Models using Transformers, popularized by ChatGPT, there is a renewed interest in exploring encoding procedures that numerically represent symbolic sequences at multiple scales and embedding dimensions. The challenge that encoding addresses is the need for mechanisms that uniquely retain contextual information about the succession of individual symbols, which can then be modeled by nonlinear formulations such as neural networks. Context: Universal Sequence Maps(USM) are iterated functions that bijectively encode symbolic sequences onto embedded numerical spaces. USM is composed of two Chaos Game Representations (CGR), iterated forwardly and backwardly, that can be projected into the frequency domain (FCGR). The corresponding USM coordinates can be used to compute a Chebyshev distance metric as well as k-mer frequencies, without having to recompute the embedded numeric coordinates, and, paradoxically, allowing for non-integers values of k. Results: This report advances the bijective fractal encoding by Universal Sequence Maps (USM) by resolving seeding biases affecting the iterated process. The resolution had two results, the first expected, the second an intriguing outcome: 1) full reconciliation of numeric positioning with sequence identity; and 2) uncovering the nature of USM as an efficient numeric process converging towards a steady state sequence embedding solution. We illustrate these results for genomic sequences because of the convenience of a planar representation defined by an alphabet with only 4 tokens (the 4 nucleotides). Nevertheless, the application to alphabet of arbitrary cardinality was found to be straightforward.
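
The forward half of a USM is a Chaos Game Representation: each symbol pulls the current point halfway toward that symbol's corner, so sequences sharing a k-symbol suffix end within Chebyshev distance 2^-k of each other. A minimal sketch (the corner assignment is mine):

```python
import numpy as np

# One unit-square corner per nucleotide; any alphabet gets one corner per token.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def cgr_coordinates(sequence, seed=(0.5, 0.5)):
    """Forward Chaos Game Representation: each symbol pulls the current
    point halfway toward its corner, so the final coordinate uniquely
    encodes the whole suffix context."""
    x, y = seed
    coords = []
    for s in sequence:
        cx, cy = CORNERS[s]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        coords.append((x, y))
    return np.array(coords)

def chebyshev(p, q):
    # The distance metric USM computes directly on embedded coordinates.
    return float(np.max(np.abs(np.asarray(p) - np.asarray(q))))
```

A full USM runs a second CGR backwards over the sequence and can project the coordinates into the frequency domain (FCGR); the seeding of the iteration is exactly what this report's bias resolution addresses.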

Updated: 2025-08-08 18:41:13

Categories: cs.LG,cs.AI,cs.NA,math.NA,q-bio.QM

Download: http://arxiv.org/abs/2508.06641v1

Segmented Confidence Sequences and Multi-Scale Adaptive Confidence Segments for Anomaly Detection in Nonstationary Time Series

As time series data become increasingly prevalent in domains such as manufacturing, IT, and infrastructure monitoring, anomaly detection must adapt to nonstationary environments where statistical properties shift over time. Traditional static thresholds are easily rendered obsolete by regime shifts, concept drift, or multi-scale changes. To address these challenges, we introduce and empirically evaluate two novel adaptive thresholding frameworks: Segmented Confidence Sequences (SCS) and Multi-Scale Adaptive Confidence Segments (MACS). Both leverage statistical online learning and segmentation principles for local, contextually sensitive adaptation, maintaining guarantees on false alarm rates even under evolving distributions. Our experiments across Wafer Manufacturing benchmark datasets show significant F1-score improvement compared to traditional percentile and rolling quantile approaches. This work demonstrates that robust, statistically principled adaptive thresholds enable reliable, interpretable, and timely detection of diverse real-world anomalies.
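
For context, the rolling-quantile baseline the paper compares against can be written in a few lines; SCS and MACS replace this with segmented confidence sequences that carry false-alarm guarantees under distribution shift (the implementation below is mine):

```python
import numpy as np

def rolling_quantile_threshold(x, window=50, q=0.99):
    """Baseline adaptive threshold: flag a point as anomalous when it
    exceeds the q-quantile of the preceding `window` observations."""
    flags = np.zeros(len(x), dtype=bool)
    for i in range(window, len(x)):
        flags[i] = x[i] > np.quantile(x[i - window:i], q)
    return flags
```

The baseline's weakness is visible in the code: a single window size fixes one timescale, so it adapts poorly to regime shifts and multi-scale changes, which is the gap the segmented and multi-scale frameworks target.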

Updated: 2025-08-08 18:34:54

标题: 分段置信序列和多尺度自适应置信区段在非平稳时间序列异常检测中的应用

摘要: 随着时间序列数据在制造业、IT和基础设施监测等领域变得越来越普遍,异常检测必须适应非平稳环境,其中统计属性随时间变化。传统的静态阈值很容易被制度转变、概念漂移或多尺度变化所淘汰。为了解决这些挑战,我们引入并实证评估了两种新颖的自适应阈值框架:分段置信区间(SCS)和多尺度自适应置信区间(MACS)。两者都利用统计在线学习和分段原则进行局部、上下文敏感的适应,即使在不断变化的分布下,也能保证假警报率。我们在晶圆制造基准数据集上的实验显示,与传统的百分位数和滚动分位数方法相比,F1分数有显著提高。这项工作表明,稳健、统计原则的自适应阈值能够可靠、可解释和及时地检测各种现实世界的异常。

更新时间: 2025-08-08 18:34:54

Domains: cs.LG,cs.AI,14J60 (Primary) 14F05, 14J26 (Secondary),F.2.2; I.2.0

Download: http://arxiv.org/abs/2508.06638v1
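
The abstract's comparison point, a locally adaptive threshold, can be sketched in a few lines. This is the rolling-quantile baseline the paper improves upon, not SCS/MACS themselves, whose confidence-sequence construction is not detailed in the abstract; window size and quantile are illustrative.

```python
# Rolling-quantile anomaly flagging: a point is anomalous when it
# exceeds a high quantile of a sliding window of recent history.
# Static thresholds fail under drift; this adapts locally instead.
import numpy as np

def rolling_quantile_flags(x, window=50, q=0.99):
    flags = []
    for i, v in enumerate(x):
        hist = x[max(0, i - window):i]     # history excludes the current point
        if len(hist) < 10:                 # not enough history to judge yet
            flags.append(False)
            continue
        flags.append(v > np.quantile(hist, q))
    return np.array(flags)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
x[400] = 8.0                               # injected anomaly
flags = rolling_quantile_flags(x)
```

SCS/MACS replace the fixed window and quantile with segmented confidence sequences that keep false-alarm guarantees as the distribution evolves.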

BoostTransformer: Enhancing Transformer Models with Subgrid Selection and Importance Sampling

Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.

Updated: 2025-08-08 18:33:30

Domains: cs.LG,stat.ML,68T07, 68Q32,I.2.6; I.5.1; F.1.1

Download: http://arxiv.org/abs/2508.02924v2

Using Imperfect Synthetic Data in Downstream Inference Tasks

Predictions and generations from large language models are increasingly being explored as an aid to computational social science and human subject research in limited data regimes. While previous technical work has explored the potential to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (also termed as synthetic simulations), such as in responses to surveys. However, it is not immediately clear by what means practitioners can combine such data with real data and yet produce statistically valid conclusions upon them. In this work, we introduce a new estimator based on generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address the challenge at hand. Surprisingly, we find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter. We empirically validate the finite-sample performance of our estimator across different regression tasks in computational social science applications, demonstrating large empirical gains.

Updated: 2025-08-08 18:32:52

Domains: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2508.06635v1

CoDe-NeRF: Neural Rendering via Dynamic Coefficient Decomposition

Neural Radiance Fields (NeRF) have shown impressive performance in novel view synthesis, but challenges remain in rendering scenes with complex specular reflections and highlights. Existing approaches may produce blurry reflections due to entanglement between lighting and material properties, or encounter optimization instability when relying on physically-based inverse rendering. In this work, we present a neural rendering framework based on dynamic coefficient decomposition, aiming to improve the modeling of view-dependent appearance. Our approach decomposes complex appearance into a shared, static neural basis that encodes intrinsic material properties, and a set of dynamic coefficients generated by a Coefficient Network conditioned on view and illumination. A Dynamic Radiance Integrator then combines these components to synthesize the final radiance. Experimental results on several challenging benchmarks suggest that our method can produce sharper and more realistic specular highlights compared to existing techniques. We hope that this decomposition paradigm can provide a flexible and effective direction for modeling complex appearance in neural scene representations.

Updated: 2025-08-08 18:30:02

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06632v1

MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection

We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition's most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.

Updated: 2025-08-08 18:29:14

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2410.09103v2
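
The core MaCP step, projecting a weight update into a discrete-cosine basis and keeping only the most important coefficients, can be sketched as follows. The hierarchical partitioning of the spectrum described in the abstract is simplified here to a single global top-k selection, so this is a hedged illustration of the idea rather than the paper's method.

```python
# Express a weight update in a 2-D discrete-cosine basis, keep only the
# `keep` largest-magnitude coefficients, and reconstruct.  Energy
# compaction means most of the update survives with few parameters.
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows = frequencies)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)                      # first row rescaled for orthonormality
    return C

def compress_update(dW, keep):
    n, m = dW.shape
    Cn, Cm = dct_matrix(n), dct_matrix(m)
    spec = Cn @ dW @ Cm.T                     # 2-D DCT of the update
    thresh = np.sort(np.abs(spec).ravel())[-keep]
    spec[np.abs(spec) < thresh] = 0.0         # keep top-`keep` coefficients
    return Cn.T @ spec @ Cm                   # inverse 2-D DCT

rng = np.random.default_rng(0)
dW = rng.normal(size=(16, 16))                # stand-in for a low-rank update
approx = compress_update(dW, keep=64)
```

Because the basis is orthonormal, keeping all coefficients reconstructs the update exactly; fine-tuning then only needs to learn the retained coefficients.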

MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection

We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition's most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.

Updated: 2025-08-08 18:25:36

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.23870v2

Early Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Record

Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines neural controlled differential equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers. Our code is available at https://github.com/MosbahAouad/EarlyPDAC-MML.

Updated: 2025-08-08 18:18:15

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06627v1
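
The cross-attention mechanism that couples the two modalities can be sketched in plain numpy. Dimensions, the single-head form, and the random projection matrices are illustrative; the paper's full architecture (neural CDEs over labs, pretrained encoders over codes) is not reproduced here.

```python
# Single-head cross-attention: lab-time-series embeddings (queries)
# attend over diagnosis-code embeddings (keys/values), producing lab
# tokens enriched with code-trajectory information.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(lab_emb, code_emb, Wq, Wk, Wv):
    Q, K, V = lab_emb @ Wq, code_emb @ Wk, code_emb @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_lab, n_code) similarity
    return softmax(scores, axis=-1) @ V       # weighted mix of code values

rng = np.random.default_rng(0)
d = 8
lab_emb = rng.normal(size=(5, d))             # 5 lab-window embeddings
code_emb = rng.normal(size=(12, d))           # 12 diagnosis-code embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(lab_emb, code_emb, Wq, Wk, Wv)
```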

Learning to Forget with Information Divergence Reweighted Objectives for Noisy Labels

We introduce ANTIDOTE, a new class of objectives for learning under noisy labels which are defined in terms of a relaxation over an information-divergence neighborhood. Using convex duality, we provide a reformulation as an adversarial training method that has similar computational cost to training with standard cross-entropy loss. We show that our approach adaptively reduces the influence of the samples with noisy labels during learning, exhibiting a behavior that is analogous to forgetting those samples. ANTIDOTE is effective in practical environments where label noise is inherent in the training data or where an adversary can alter the training labels. Extensive empirical evaluations on different levels of symmetric, asymmetric, human annotation, and real-world label noise show that ANTIDOTE outperforms leading comparable losses in the field and enjoys a time complexity that is very close to that of the standard cross entropy loss.

Updated: 2025-08-08 18:10:16

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2508.06622v1
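
The "forgetting" behaviour the abstract describes, down-weighting samples whose loss stays high because their labels are likely wrong, can be illustrated with a toy reweighting rule. ANTIDOTE derives its weights from an information-divergence dual via convex duality; the exponential temperature scheme below only imitates that qualitative behaviour and is not the paper's objective.

```python
# Toy loss-based reweighting: samples with persistently large loss
# (probable label noise) receive exponentially smaller weights, so the
# training signal gradually "forgets" them.
import numpy as np

def forget_weights(losses, tau=1.0):
    """Exponential down-weighting by loss, renormalized to sum to n."""
    w = np.exp(-np.asarray(losses) / tau)
    return w * len(w) / w.sum()

losses = np.array([0.2, 0.3, 0.25, 5.0])   # last sample: probable label noise
w = forget_weights(losses)
```

The renormalization keeps the effective batch size constant while shifting influence toward clean samples.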

Generalizing Scaling Laws for Dense and Sparse Large Language Models

Over the past few years, the size of language models has grown exponentially, as has the computational cost to train these large models. This rapid growth has motivated researchers to develop new techniques aimed at enhancing the efficiency of the training process. Despite these advancements, optimally predicting the model size or allocating optimal resources remains a challenge. Several efforts have addressed the challenge by proposing different scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing scaling laws and propose a generalized scaling law to provide a unified framework that is applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing scaling laws to demonstrate its effectiveness.

Updated: 2025-08-08 18:07:11

Domains: cs.LG,cs.AI,cs.PF

Download: http://arxiv.org/abs/2508.06617v1
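
Scaling laws of the kind the abstract generalizes are typically power laws such as L(N) = a * N^(-b) + c. A minimal fit of a and b on synthetic loss/size pairs (taking c = 0 for simplicity) shows the basic fitting procedure; the paper's unified dense/sparse form is not specified in the abstract, so the functional form here is an assumption.

```python
# Fit L = a * N**-b by linear regression in log-log space:
# log L = log a - b * log N.
import numpy as np

def fit_power_law(sizes, losses):
    logN, logL = np.log(sizes), np.log(losses)
    slope, intercept = np.polyfit(logN, logL, 1)
    return np.exp(intercept), -slope          # (a, b)

sizes = np.array([1e7, 1e8, 1e9, 1e10])       # parameter counts
losses = 12.0 * sizes ** -0.076               # synthetic, noise-free losses
a, b = fit_power_law(sizes, losses)
```

With real measurements one would fit the irreducible term c as well (e.g., by nonlinear least squares) and, for sparse models, add a term for active versus total parameters.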

Generative AI for Intent-Driven Network Management in 6G: A Case Study on Hierarchical Learning Approach

With the emergence of 6G, mobile networks are becoming increasingly heterogeneous and dynamic, necessitating advanced automation for efficient management. Intent-Driven Networks (IDNs) address this by translating high-level intents into optimization policies. Large Language Models (LLMs) can enhance this process by understanding complex human instructions to enable adaptive, intelligent automation. Given the rapid advancements in Generative AI (GenAI), a comprehensive survey of LLM-based IDN architectures in disaggregated Radio Access Network (RAN) environments is both timely and critical. This article provides such a survey, along with a case study on a hierarchical learning-enabled IDN architecture that integrates GenAI across three key stages: intent processing, intent validation, and intent execution. Unlike most existing approaches that apply GenAI in the form of LLMs for intent processing only, we propose a hierarchical framework that introduces GenAI across all three stages of IDN. To demonstrate the effectiveness of the proposed IDN management architecture, we present a case study based on the latest GenAI architecture named Mamba. The case study shows how the proposed GenAI-driven architecture enhances network performance through intelligent automation, surpassing the performance of the conventional IDN architectures.

Updated: 2025-08-08 18:06:52

Domains: cs.NI,cs.AI

Download: http://arxiv.org/abs/2508.06616v1

Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Large Language Models (LLMs) struggle with high computational time and error propagation during inference time, especially for complex tasks like math, puzzles, or coding requiring multi-step thinking. While existing reasoning models with chain-of-thoughts (CoT) can enable LLMs to do step-wise analysis and reflection, they often face the issue of wasting computation on less productive solutions and fail to make progress during inference time. In this paper, we propose Meta-Reasoner, a new framework to enable LLMs ``Think about how to think'', i.e., optimize the inference compute by adjusting strategies on how to reason during inference time. Inspired by dual-process theory, our method decouples the high-level strategy generation (e.g., backtracking, switching approaches, or restarting) from stepwise CoT generation via a lightweight progress report. The strategy module only consider the summarized version from the previous CoTs to propose new strategies accordingly. We employ the contextual multi-armed bandits (CMABs) for this module to iteratively evaluate the previous reasoning states and dynamically adjust the strategy to avoid reasoning get stuck in less productive paths during inference. Evaluations on math problems (e.g., Game-of-24, TheoremQA) and scientific problems (e.g., SciBench) demonstrate that our method improves performance by 9-12\% over previous SOTA methods while reducing inference time by 28-35\%. This approach also generalizes to other domains like creative writing, demonstrating its versatility for diverse reasoning-intensive problems using LLMs.

Updated: 2025-08-08 18:01:34

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.19918v4
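
The contextual multi-armed bandit that picks a high-level strategy from a summarized progress report can be sketched as follows. Epsilon-greedy with running-mean value estimates stands in for whatever CMAB algorithm the paper actually uses, and the contexts, arms, and reward simulation are all illustrative.

```python
# Toy contextual bandit over reasoning strategies: given a coarse
# progress context, choose among "continue", "backtrack", "restart",
# and update per-(context, arm) value estimates from observed rewards.
import random
from collections import defaultdict

class StrategyBandit:
    def __init__(self, arms, eps=0.1):
        self.arms, self.eps = arms, eps
        self.n = defaultdict(int)       # pull counts per (context, arm)
        self.q = defaultdict(float)     # running-mean reward per (context, arm)

    def select(self, context):
        if random.random() < self.eps:              # explore
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.q[(context, a)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.n[key] += 1
        self.q[key] += (reward - self.q[key]) / self.n[key]

random.seed(0)
bandit = StrategyBandit(["continue", "backtrack", "restart"])
# Simulated feedback: "stuck" contexts reward backtracking, others don't.
for _ in range(500):
    ctx = random.choice(["progressing", "stuck"])
    arm = bandit.select(ctx)
    reward = 1.0 if (ctx == "stuck") == (arm == "backtrack") else 0.0
    bandit.update(ctx, arm, reward)
```

In Meta-Reasoner the context is a compressed summary of the chain-of-thought so far, which keeps the strategy module cheap relative to step-wise generation.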

Local Diffusion Models and Phases of Data Distributions

As a class of generative artificial intelligence frameworks inspired by statistical physics, diffusion models have shown extraordinary performance in synthesizing complicated data distributions through a denoising process gradually guided by score functions. Real-life data, like images, is often spatially structured in low-dimensional spaces. However, ordinary diffusion models ignore this local structure and learn spatially global score functions, which are often computationally expensive. In this work, we introduce a new perspective on the phases of data distributions, which provides insight into constructing local denoisers with reduced computational costs. We define two distributions as belonging to the same data distribution phase if they can be mutually connected via spatially local operations such as local denoisers. Then, we show that the reverse denoising process consists of an early trivial phase and a late data phase, sandwiching a rapid phase transition where local denoisers must fail. To diagnose such phase transitions, we prove an information-theoretic bound on the fidelity of local denoisers based on conditional mutual information, and conduct numerical experiments in a real-world dataset. This work suggests simpler and more efficient architectures of diffusion models: far from the phase transition point, we can use small local neural networks to compute the score function; global neural networks are only necessary around the narrow time interval of phase transitions. This result also opens up new directions for studying phases of data distributions, the broader science of generative artificial intelligence, and guiding the design of neural networks inspired by physics concepts.

Updated: 2025-08-08 18:01:01

Domains: cs.LG,cond-mat.stat-mech,quant-ph

Download: http://arxiv.org/abs/2508.06614v1

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text -- outperforming existing post-training baselines by over an order of magnitude -- with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.

Updated: 2025-08-08 17:59:47

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06601v1
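
The multi-stage filtering pipeline can be sketched as a cheap lexical pass followed by a more expensive classifier pass on the survivors. The blocklist terms and the `looks_risky` stub are placeholders for illustration only, not the paper's actual filters or thresholds.

```python
# Two-stage pretraining-data filter: stage 1 is a fast keyword screen,
# stage 2 a (stubbed) classifier applied only to documents that pass
# stage 1, keeping the expensive model off most of the corpus.
BLOCKLIST = {"toxin synthesis", "pathogen enhancement"}   # illustrative terms

def stage1_keyword(doc):
    low = doc.lower()
    return not any(term in low for term in BLOCKLIST)

def looks_risky(doc):
    # Placeholder for an ML classifier scoring residual dual-use content.
    return False

def filter_corpus(docs):
    survivors = [d for d in docs if stage1_keyword(d)]      # cheap pass
    return [d for d in survivors if not looks_risky(d)]     # expensive pass

docs = ["protein folding basics", "notes on toxin synthesis routes"]
kept = filter_corpus(docs)
```

Ordering the stages by cost is what makes the pipeline scalable to pretraining-sized corpora.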

Multivariate Fields of Experts

We introduce the multivariate fields of experts, a new framework for the learning of image priors. Our model generalizes existing fields of experts methods by incorporating multivariate potential functions constructed via Moreau envelopes of the $\ell_\infty$-norm. We demonstrate the effectiveness of our proposal across a range of inverse problems that include image denoising, deblurring, compressed-sensing magnetic-resonance imaging, and computed tomography. The proposed approach outperforms comparable univariate models and achieves performance close to that of deep-learning-based regularizers while being significantly faster, requiring fewer parameters, and being trained on substantially fewer data. In addition, our model retains a relatively high level of interpretability due to its structured design.

Updated: 2025-08-08 17:58:25

Domains: eess.IV,cs.CV,cs.LG,eess.SP

Download: http://arxiv.org/abs/2508.06490v1
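
The multivariate potentials are Moreau envelopes of the ℓ∞-norm, which can be evaluated in closed form. The sketch below uses the standard textbook construction, env(x) = ||prox||∞ + ||x − prox||² / (2λ) with prox obtained from the projection onto a scaled ℓ1 ball by Moreau decomposition; it illustrates the potential function only, not the authors' learned model.

```python
# Moreau envelope of the l-infinity norm:
#   env(x) = min_y ||y||_inf + ||x - y||^2 / (2*lam),
# computed via prox_{lam*||.||_inf}(x) = x - P_{lam * l1-ball}(x).
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection onto {y : ||y||_1 <= radius} (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def moreau_env_linf(x, lam=1.0):
    prox = x - project_l1_ball(x, lam)        # prox of lam * ||.||_inf at x
    return np.abs(prox).max() + np.sum((x - prox) ** 2) / (2 * lam)

x = np.array([3.0, -1.0, 0.5])
val = moreau_env_linf(x, lam=1.0)
```

The envelope is a smooth lower approximation of the ℓ∞-norm, which is what makes it usable as a trainable regularizer potential.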

LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation

Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance by 27.9\% on the LIBERO-LONG benchmark and 20\% on the real-world scenario. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.

Updated: 2025-08-08 17:58:06

Domains: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2505.11528v2

Self-Steering Language Models

While test-time reasoning enables language models (LMs) to tackle complex tasks, searching or planning in natural language can be slow, costly, and error-prone. But even when LMs struggle to emulate the precise reasoning steps needed to solve a problem, they often excel at describing its abstract structure--both how to verify solutions and how to search for them. This paper introduces DisCIPL, a method for "self-steering" LMs where a Planner model generates a task-specific inference program that is executed by a population of Follower models. Our approach equips LMs with the ability to write recursive search procedures that guide LM inference, enabling new forms of verifiable and efficient reasoning. When instantiated with a small Follower (e.g., Llama-3.2-1B or Qwen3-1.7B), DisCIPL matches (and sometimes outperforms) much larger models, including GPT-4o and o1, on challenging constrained generation tasks. Our work opens up a design space of highly-parallelized Monte Carlo inference strategies that outperform standard best-of-N sampling, require no finetuning, and can be implemented automatically by existing LMs.

Updated: 2025-08-08 17:58:01

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.07081v2

Voting-Based Semi-Parallel Proof-of-Work Protocol

Parallel Proof-of-Work (PoW) protocols are suggested to improve the safety guarantees, transaction throughput and confirmation latencies of Nakamoto consensus. In this work, we first consider the existing parallel PoW protocols and develop hard-coded incentive attack structures. Our theoretical results and simulations show that the existing parallel PoW protocols are more vulnerable to incentive attacks than the Nakamoto consensus, e.g., attacks have smaller profitability threshold and they result in higher relative rewards. Next, we introduce a voting-based semi-parallel PoW protocol that outperforms both Nakamoto consensus and the existing parallel PoW protocols from most practical perspectives such as communication overheads, throughput, transaction conflicts, incentive compatibility of the protocol as well as a fair distribution of transaction fees among the voters and the leaders. We use state-of-the-art analysis to evaluate the consistency of the protocol and consider Markov decision process (MDP) models to substantiate our claims about the resilience of our protocol against incentive attacks.

Updated: 2025-08-08 17:57:35

Domains: cs.CR,cs.DC,cs.DM,cs.IT,math.IT,math.PR

Download: http://arxiv.org/abs/2508.06489v1

WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion

Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at https://github.com/Sofianebouaziz1/WGAST.git.

Updated: 2025-08-08 17:49:46

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.06485v1

AI-Assisted Conversational Interviewing: Effects on Data Quality and User Experience

Standardized surveys scale efficiently but sacrifice depth, while conversational interviews improve response quality at the cost of scalability and consistency. This study bridges the gap between these methods by introducing a framework for AI-assisted conversational interviewing. To evaluate this framework, we conducted a web survey experiment where 1,800 participants were randomly assigned to AI 'chatbots' which use large language models (LLMs) to dynamically probe respondents for elaboration and interactively code open-ended responses to fixed questions developed by human researchers. We assessed the AI chatbot's performance in terms of coding accuracy, response quality, and respondent experience. Our findings reveal that AI chatbots perform moderately well in live coding even without survey-specific fine-tuning, despite slightly inflated false positive errors due to respondent acquiescence bias. Open-ended responses were more detailed and informative, but this came at a slight cost to respondent experience. Our findings highlight the feasibility of using AI methods such as chatbots enhanced by LLMs to enhance open-ended data collection in web surveys.

Updated: 2025-08-08 17:43:37

标题: AI辅助对话式面试:对数据质量和用户体验的影响

摘要: 标准化调查效率高,但牺牲了深度,而对话式面谈虽然提高了回应质量,但牺牲了可扩展性和一致性。本研究通过引入一个AI辅助对话式面谈框架来弥合这些方法之间的差距。为了评估这一框架,我们进行了一项网络调查实验,将1,800名参与者随机分配给AI“聊天机器人”,这些聊天机器人利用大型语言模型(LLMs)动态追问受访者以获取详细阐述,并对人类研究人员设计的固定问题的开放式回答进行交互式编码。我们评估了AI聊天机器人在编码准确性、回应质量和受访者体验方面的表现。我们的研究结果显示,即使没有针对调查进行特定微调,AI聊天机器人在实时编码方面表现良好,尽管由于受访者顺从偏差而略微增加了误报错误。开放性回应更加详细和有信息价值,但这会稍微降低受访者体验。我们的研究结果突显了利用AI方法(如LLMs增强的聊天机器人)来增强网络调查中开放性数据收集的可行性。

更新时间: 2025-08-08 17:43:37

领域: cs.HC,cs.AI,stat.AP

下载: http://arxiv.org/abs/2504.13908v2

AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model

We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology, encompassing pretraining scaling laws, emergent capability analysis, in-context learning mechanism, and test-time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding via loss perspective, culminating in a strong 1.7-billion model. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework, where AMix-1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with an up to $50\times$ activity increase over its wild type. Pushing the boundaries of protein engineering, we further empower AMix-1 with an evolutionary test-time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are intensified, laying the groundwork for next-generation lab-in-the-loop protein design.

Updated: 2025-08-08 17:43:12

标题: AMix-1:一种测试时可扩展的蛋白质基础模型路径

摘要: 我们介绍了AMix-1,这是一个基于贝叶斯流网络构建的强大蛋白质基础模型,采用系统化的训练方法,包括预训练缩放规律、新兴能力分析、上下文学习机制和测试时缩放算法。为了保证稳健的可扩展性,我们建立了一个预测性的缩放规律,并通过损失视角揭示了结构理解的逐渐出现,最终形成一个强大的17亿参数模型。在此基础上,我们设计了一种基于多序列比对(MSA)的上下文学习策略,将蛋白设计统一到一个通用框架中,其中AMix-1能够识别MSA中的深层进化信号,并持续生成结构和功能一致的蛋白质。这个框架使得成功设计了一个改进明显的AmeR变体,其活性比其野生型增加了50倍。在推动蛋白工程的界限时,我们进一步增强了AMix-1,引入了一种进化测试时缩放算法,用于硅内定向进化,随着验证预算的增加,提供了可观的可扩展性性能增益,为下一代实验室内循环蛋白设计奠定了基础。

更新时间: 2025-08-08 17:43:12

领域: q-bio.BM,cs.AI

下载: http://arxiv.org/abs/2507.08920v3

Post-training for Efficient Communication via Convention Formation

Humans communicate with increasing efficiency in multi-turn interactions, by adapting their language and forming ad-hoc conventions. In contrast, prior work shows that LLMs do not naturally show this behavior. We develop a post-training process to develop this ability through targeted fine-tuning on heuristically identified demonstrations of convention formation. We evaluate with two new benchmarks focused on this capability. First, we design a focused, cognitively-motivated interaction benchmark that consistently elicits strong convention formation trends in humans. Second, we create a new document-grounded reference completion task that reflects in-the-wild convention formation behavior. Our studies show significantly improved convention formation abilities in post-trained LLMs across the two evaluation methods.

Updated: 2025-08-08 17:42:16

标题: 通过约定形成实现高效沟通的后训练

摘要: 人类在多轮交互中越来越高效地进行沟通,通过调整他们的语言并形成临时约定。相比之下,之前的研究表明,大型语言模型(LLMs)并不自然地展现这种行为。我们通过在启发式确定的约定形成示范上进行有针对性的微调,开发了一个后训练过程来发展这种能力。我们通过两个新的基准对这种能力进行评估。首先,我们设计了一个专注于认知动机的交互基准,该基准在人类中始终引发强烈的约定形成趋势。其次,我们创建了一个新的基于文档的参考完成任务,反映了现实世界中的约定形成行为。我们的研究表明,在这两种评估方法中,经过后训练的LLMs的约定形成能力得到了显著改善。

更新时间: 2025-08-08 17:42:16

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2508.06482v1

Crop Pest Classification Using Deep Learning Techniques: A Review

Insect pests continue to pose a serious threat to crop yields around the world, and traditional methods for monitoring them are often slow, manual, and difficult to scale. In recent years, deep learning has emerged as a powerful solution, with techniques like convolutional neural networks (CNNs), vision transformers (ViTs), and hybrid models gaining popularity for automating pest detection. This review looks at 37 carefully selected studies published between 2018 and 2025, all focused on AI-based pest classification. The selected research is organized by crop type, pest species, model architecture, dataset usage, and key technical challenges. The early studies relied heavily on CNNs, but the latest work is shifting toward hybrid and transformer-based models that deliver higher accuracy and better contextual understanding. Still, challenges like imbalanced datasets, difficulty in detecting small pests, limited generalizability, and deployment on edge devices remain significant hurdles. Overall, this review offers a structured overview of the field, highlights useful datasets, and outlines the key challenges and future directions for AI-based pest monitoring systems.

Updated: 2025-08-08 17:34:39

标题: 利用深度学习技术进行作物害虫分类:一项综述

摘要: 昆虫害持续对全球作物产量构成严重威胁,传统的监测方法通常缓慢、手动且难以扩展。近年来,深度学习已经成为一个强大的解决方案,其中卷积神经网络(CNNs)、视觉转换器(ViTs)和混合模型等技术因其自动化害虫检测的功能而备受青睐。本综述回顾了2018年至2025年间发表的37篇精心挑选的研究,所有这些研究都集中在基于人工智能的害虫分类上。选定的研究按作物类型、害虫物种、模型架构、数据集使用和关键技术挑战进行组织。早期研究主要依赖于CNNs,但最新的工作正在向提供更高准确性和更好上下文理解的混合和变压器模型转变。然而,不平衡数据集、难以检测小型害虫、有限的泛化能力和在边缘设备上的部署仍然是重要障碍。总体而言,本综述提供了该领域的结构化概述,突出了有用的数据集,并概述了基于人工智能的害虫监测系统的关键挑战和未来方向。

更新时间: 2025-08-08 17:34:39

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.01494v3

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.
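The core interpolation step described above can be sketched in a few lines. This is an illustrative toy only: random vectors stand in for real transformer hidden states, and a fixed blending coefficient replaces the paper's gradient-guided choice of layers, tokens, and interpolation strength.

```python
import numpy as np

def interpolate_hidden(h_harmful, h_benign, alpha):
    """Blend hidden states from a harmful and a benign query (the LFJ core step)."""
    return alpha * np.asarray(h_harmful) + (1.0 - alpha) * np.asarray(h_benign)

# Toy hidden states for one layer: (seq_len, d_model) = (4, 8).
rng = np.random.default_rng(0)
h_harm = rng.normal(size=(4, 8))
h_ben = rng.normal(size=(4, 8))

# A blended state that leans toward the harmful query but is pulled
# toward the benign representation manifold.
h_mix = interpolate_hidden(h_harm, h_ben, alpha=0.6)
```

In the actual attack this blend would be injected back into influential layers during the forward pass; the sketch only shows the arithmetic of the representation-level interpolation.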

Updated: 2025-08-08 17:29:16

标题: 潜在融合越狱:混合有害和无害的表示以引发不安全的LLM输出

摘要: 大型语言模型(LLMs)在各种语言任务中展示出令人印象深刻的能力,但容易受到绕过安全对齐的破解攻击的影响。本文介绍了潜在融合破解(LFJ),这是一种基于表示的攻击,通过插值有害和良性查询对的隐藏状态来引出被禁止的响应。LFJ首先选择具有高主题和句法相似性的查询对,然后在有影响力的层和标记处执行梯度引导的插值,随后进行优化以平衡攻击成功率、输出流畅度和计算效率。在像AdvBench和MaliciousInstruct这样的基准测试中对Vicuna和LLaMA-2等模型进行评估,得到了94.01%的平均攻击成功率(ASR),优于现有方法。为了减轻LFJ,我们提出了一种对抗性训练防御方法,对插值示例进行微调,将ASR降低了80%以上,而不会降低对良性输入的性能。消融研究验证了LFJ的有效性中查询对选择、隐藏状态插值组件和优化策略的重要性。

更新时间: 2025-08-08 17:29:16

领域: cs.CL,cs.AI,cs.CR

下载: http://arxiv.org/abs/2508.10029v1

Live Music Models

We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.

Updated: 2025-08-08 17:28:13

标题: 现场音乐模型

摘要: 我们介绍了一类新的音乐生成模型,称为实时音乐模型,它能够实时生成连续的音乐流,并与用户控制同步。我们发布了Magenta RealTime,一个开放权重的实时音乐模型,可以通过文本或音频提示来控制声学风格。在音乐质量的自动指标上,尽管使用的参数较少,并提供首次的实时生成能力,Magenta RealTime的表现优于其他开放权重的音乐生成模型。我们还发布了Lyria RealTime,一个基于API的模型,具有扩展控制功能,可以访问我们覆盖广泛提示的最强大模型。这些模型展示了一种强调人机交互的AI辅助音乐创作新范式,用于实时音乐表演。

更新时间: 2025-08-08 17:28:13

领域: cs.SD,cs.HC,cs.LG

下载: http://arxiv.org/abs/2508.04651v2

Intuition emerges in Maximum Caliber models at criticality

Whether large predictive models merely parrot their training data or produce genuine insight lacks a physical explanation. This work reports a primitive form of intuition that emerges as a metastable phase of learning that critically balances next-token prediction against future path-entropy. The intuition mechanism is discovered via mind-tuning, the minimal principle that imposes Maximum Caliber in predictive models with a control temperature-like parameter $\lambda$. Training on random walks in deterministic mazes reveals a rich phase diagram: imitation (low $\lambda$), rule-breaking hallucination (high $\lambda$), and a fragile in-between window exhibiting strong protocol-dependence (hysteresis) and multistability, where models spontaneously discover novel goal-directed strategies. These results are captured by an effective low-dimensional theory and frame intuition as an emergent property at the critical balance between memorizing what is and wondering what could be.

Updated: 2025-08-08 17:27:41

标题: 直觉在临界态的最大口径(Maximum Caliber)模型中涌现

摘要: 大型预测模型究竟只是机械地重复其训练数据,还是能够产生真正的洞察力,目前尚缺乏物理层面的解释。本研究报告了一种原始形式的直觉,它作为学习过程中的一个亚稳相出现,在下一标记预测与未来路径熵之间取得临界平衡。这种直觉机制是通过“心灵调谐”(mind-tuning)发现的,即借助一个类似温度的控制参数λ,在预测模型中施加最大口径(Maximum Caliber)原理的最小原则。在确定性迷宫中对随机游走进行训练揭示了丰富的相图:模仿(低λ)、打破规则的幻觉(高λ),以及二者之间一个脆弱的窗口,后者表现出强烈的协议依赖性(滞后效应)和多稳态,模型在其中自发地发现新的目标导向策略。这些结果可由一个有效的低维理论捕捉,并将直觉框定为在“记忆现实”与“设想可能”之间临界平衡处涌现的属性。

更新时间: 2025-08-08 17:27:41

领域: physics.soc-ph,cond-mat.dis-nn,cond-mat.stat-mech,cs.AI,cs.LG

下载: http://arxiv.org/abs/2508.06477v1

LLM Unlearning using Gradient Ratio-Based Influence Estimation and Noise Injection

The growing legal and ethical scrutiny of large language models (LLMs) necessitates effective machine unlearning, particularly for sensitive or unauthorized data. Existing empirical methods often yield incomplete forgetting or unintended degradation of unrelated knowledge due to poor localization. In this work, we propose GRIN: a modular and targeted framework for LLM unlearning. GRIN introduces a novel gradient-ratio-based metric to identify parameters most responsible for memorizing forget data. We then perform selective noise injection into these parameters prior to fine-tuning, which improves unlearning performance while maintaining model utility. Finally, we propose new evaluation metrics tailored to the LLM setting and validate our approach on standard benchmarks such as TOFU, WMDP, and SafePKU.
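A minimal sketch of the gradient-ratio idea, with flat numpy arrays standing in for model parameters and their per-dataset gradients. The exact metric form, selection fraction, and noise scale here are illustrative assumptions, not the paper's reported choices.

```python
import numpy as np

def gradient_ratio_scores(grad_forget, grad_retain, eps=1e-8):
    # High score: the parameter matters for the forget data but not the retain data.
    return np.abs(grad_forget) / (np.abs(grad_retain) + eps)

def inject_noise(params, scores, top_frac=0.5, sigma=0.01, seed=0):
    """Perturb only the most forget-specific parameters before fine-tuning."""
    rng = np.random.default_rng(seed)
    k = max(1, int(top_frac * params.size))
    idx = np.argsort(scores)[-k:]          # indices of the top-k scores
    noisy = params.copy()
    noisy[idx] += rng.normal(scale=sigma, size=k)
    return noisy, idx

# Toy example: parameters 0 and 2 are forget-specific, 1 and 3 are shared.
grads_f = np.array([5.0, 0.1, 3.0, 0.05])
grads_r = np.array([0.1, 5.0, 0.1, 5.0])
params = np.zeros(4)
scores = gradient_ratio_scores(grads_f, grads_r)
noisy, idx = inject_noise(params, scores, top_frac=0.5)
```

Selective perturbation of this kind is what lets the subsequent fine-tuning erase the forget data while leaving shared-knowledge parameters untouched.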

Updated: 2025-08-08 17:15:32

标题: 使用基于梯度比率的影响估计和噪声注入进行LLM遗忘

摘要: 随着大型语言模型(LLMs)日益受到法律和道德审查的关注,特别是对于敏感或未经授权的数据,需要有效的机器遗忘。现有的经验方法往往会导致遗忘不完全或由于定位不准确而意外降低无关知识。在这项工作中,我们提出了GRIN:一个用于LLM遗忘的模块化和有针对性的框架。GRIN引入了一种基于梯度比率的新指标,以识别最负责记忆遗忘数据的参数。然后我们在这些参数中进行选择性噪声注入,然后再进行微调,这样可以提高遗忘性能同时保持模型的实用性。最后,我们提出了针对LLM环境的新评估指标,并在TOFU、WMDP和SafePKU等标准基准上验证了我们的方法。

更新时间: 2025-08-08 17:15:32

领域: cs.LG

下载: http://arxiv.org/abs/2508.06467v1

ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls

Large Language Models (LLMs) have demonstrated impressive fluency and reasoning capabilities, but their potential for misuse has raised growing concern. In this paper, we present ScamAgent, an autonomous multi-turn agent built on top of LLMs, capable of generating highly realistic scam call scripts that simulate real-world fraud scenarios. Unlike prior work focused on single-shot prompt misuse, ScamAgent maintains dialogue memory, adapts dynamically to simulated user responses, and employs deceptive persuasion strategies across conversational turns. We show that current LLM safety guardrails, including refusal mechanisms and content filters, are ineffective against such agent-based threats. Even models with strong prompt-level safeguards can be bypassed when prompts are decomposed, disguised, or delivered incrementally within an agent framework. We further demonstrate the transformation of scam scripts into lifelike voice calls using modern text-to-speech systems, completing a fully automated scam pipeline. Our findings highlight an urgent need for multi-turn safety auditing, agent-level control frameworks, and new methods to detect and disrupt conversational deception powered by generative AI.

Updated: 2025-08-08 17:01:41

标题: 骗局代理:人工智能代理如何模拟人类水平的骗局电话

摘要: 大型语言模型(LLMs)展示了令人印象深刻的流畅性和推理能力,但它们被滥用的潜力引起了越来越多的关注。在本文中,我们介绍了ScamAgent,这是一个建立在LLMs之上的自主多轮代理,能够生成高度逼真的诈骗电话脚本,模拟真实世界的欺诈场景。与先前集中在单次提示滥用的工作不同,ScamAgent保持对话记忆,动态适应模拟用户的回应,并在对话轮次中采用欺骗性说服策略。我们表明,当前的LLM安全防护措施,包括拒绝机制和内容过滤器,对这种基于代理的威胁是无效的。即使具有强大提示级别保障的模型也可以在提示被分解、伪装或在代理框架中逐步交付时被绕过。我们进一步展示了将诈骗脚本转化为逼真的语音电话,利用现代的文本转语音系统,完成一个完全自动化的欺诈流程。我们的发现强调了对多轮安全审计、代理级别控制框架以及检测和打断由生成式人工智能驱动的对话欺骗的新方法的紧迫需求。

更新时间: 2025-08-08 17:01:41

领域: cs.CR,cs.AI,cs.CL,cs.MA

下载: http://arxiv.org/abs/2508.06457v1

Maximum Impact with Fewer Features: Efficient Feature Selection for Cold-Start Recommenders through Collaborative Importance Weighting

Cold-start challenges in recommender systems necessitate leveraging auxiliary features beyond user-item interactions. However, the presence of irrelevant or noisy features can degrade predictive performance, whereas an excessive number of features increases computational demands, leading to higher memory consumption and prolonged training times. To address this, we propose a feature selection strategy that prioritizes the user behavioral information. Our method enhances the feature representation by incorporating correlations from collaborative behavior data using a hybrid matrix factorization technique and then ranks features using a mechanism based on the maximum volume algorithm. This approach identifies the most influential features, striking a balance between recommendation accuracy and computational efficiency. We conduct an extensive evaluation across various datasets and hybrid recommendation models, demonstrating that our method excels in cold-start scenarios by selecting minimal yet highly effective feature subsets. Even under strict feature reduction, our approach surpasses existing feature selection techniques while maintaining superior efficiency.
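The maximum-volume ranking step can be illustrated with greedy column selection (equivalent to QR with column pivoting). This toy operates directly on a raw feature matrix, whereas the paper first enriches the features with collaborative signals via hybrid matrix factorization before ranking.

```python
import numpy as np

def rank_features_maxvol(F, n_select):
    """Greedily pick feature columns that maximize the volume of the
    selected submatrix (QR-with-column-pivoting style)."""
    R = F.astype(float).copy()
    order = []
    for _ in range(n_select):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))
        order.append(j)
        q = R[:, j] / (norms[j] + 1e-12)
        R -= np.outer(q, q @ R)            # deflate: remove the chosen direction
    return order

# Columns 0 and 2 are identical; a volume criterion skips the redundant copy.
F = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
order = rank_features_maxvol(F, 2)
```

Because each selected direction is projected out, a duplicate or highly correlated feature contributes almost no residual volume and is ranked last — the behavior that keeps the selected subset minimal yet informative.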

Updated: 2025-08-08 16:58:47

标题: 少特征产生最大影响:通过协同重要性加权实现冷启动推荐系统的高效特征选择

摘要: 在推荐系统中,冷启动挑战需要利用除用户-物品交互之外的辅助特征。然而,无关或嘈杂的特征会降低预测性能,而过多的特征会增加计算需求,导致内存消耗增加和训练时间延长。 为了解决这个问题,我们提出了一种优先考虑用户行为信息的特征选择策略。我们的方法通过使用混合矩阵分解技术结合协同行为数据的相关性来增强特征表示,然后使用基于最大体积算法的机制对特征进行排名。这种方法确定了最具影响力的特征,在推荐准确性和计算效率之间取得了平衡。我们在各种数据集和混合推荐模型上进行了广泛评估,证明我们的方法在冷启动场景中通过选择最小但高效的特征子集表现出色。即使在严格的特征减少情况下,我们的方法仍然超越了现有的特征选择技术,同时保持卓越的效率。

更新时间: 2025-08-08 16:58:47

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2508.06455v1

What Voting Rules Actually Do: A Data-Driven Analysis of Multi-Winner Voting

Committee-selection problems arise in many contexts and applications, and there has been increasing interest within the social choice research community on identifying which properties are satisfied by different multi-winner voting rules. In this work, we propose a data-driven framework to evaluate how frequently voting rules violate axioms across diverse preference distributions in practice, shifting away from the binary perspective of axiom satisfaction given by worst-case analysis. Using this framework, we analyze the relationship between multi-winner voting rules and their axiomatic performance under several preference distributions. We then show that neural networks, acting as voting rules, can outperform traditional rules in minimizing axiom violations. Our results suggest that data-driven approaches to social choice can inform the design of new voting systems and support the continuation of data-driven research in social choice.
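The shift from worst-case axiom satisfaction to violation *frequencies* is easy to illustrate: sample preference profiles from a distribution, run a rule, and count how often an axiom fails. The sketch below measures, under impartial-culture sampling, how often a top-k plurality committee excludes an existing Condorcet winner — an illustrative rule/axiom pair, not necessarily one studied in the paper.

```python
import random
from itertools import permutations

def condorcet_winner(profile, m):
    """Candidate beating every rival in a pairwise majority, or None."""
    for a in range(m):
        if all(2 * sum(r.index(a) < r.index(b) for r in profile) > len(profile)
               for b in range(m) if b != a):
            return a
    return None

def topk_plurality(profile, m, k):
    counts = [0] * m
    for r in profile:
        counts[r[0]] += 1
    return set(sorted(range(m), key=lambda c: -counts[c])[:k])

def violation_rate(n_voters=9, m=4, k=2, trials=1000, seed=1):
    """Fraction of sampled profiles (with a Condorcet winner) where the
    committee excludes that winner."""
    rng = random.Random(seed)
    rankings = list(permutations(range(m)))
    hits = total = 0
    for _ in range(trials):
        profile = [list(rng.choice(rankings)) for _ in range(n_voters)]
        cw = condorcet_winner(profile, m)
        if cw is None:
            continue
        total += 1
        hits += cw not in topk_plurality(profile, m, k)
    return hits / total if total else 0.0
```

Swapping in different preference distributions, rules, or axiom checks yields exactly the kind of violation-frequency table the data-driven framework is built around.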

Updated: 2025-08-08 16:54:09

标题: 投票规则的实际作用:多赢家投票的数据驱动分析

摘要: 委员会选择问题在许多背景和应用中出现,社会选择研究界对不同多赢家投票规则满足哪些特性越来越感兴趣。在这项工作中,我们提出了一个数据驱动框架,评估在实践中不同偏好分布下投票规则违反公理的频率,摆脱了最差情况分析给出的公理满足二元视角。利用这个框架,我们分析了多赢家投票规则与它们在几种偏好分布下的公理性能之间的关系。然后我们展示了神经网络作为投票规则,在最小化公理违反方面可以胜过传统规则。我们的结果表明,数据驱动的社会选择方法可以为设计新的投票系统提供指导,并支持在社会选择领域继续进行数据驱动研究。

更新时间: 2025-08-08 16:54:09

领域: cs.AI,cs.GT

下载: http://arxiv.org/abs/2508.06454v1

Text Embedded Swin-UMamba for DeepLesion Segmentation

Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow offers the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, a high Dice score of 82% and a low Hausdorff distance of 6.58 pixels were obtained for lesion segmentation. The proposed Text-Swin-UMamba model outperformed prior approaches: a 37% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and gains of 1.74% and 0.22% over the purely image-based xLSTM-UNet and nnUNet models, respectively. The dataset and code can be accessed at https://github.com/ruida/LLM-Swin-UMamba
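For reference, the Dice score reported above is computed from binary masks as 2|P∩T| / (|P| + |T|); a minimal implementation:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice similarity between two binary segmentation masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Two 2x2 masks overlapping in one pixel: Dice = 2*1 / (2+2) = 0.5.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [1, 0]])
```

A Dice of 82% thus means the predicted and reference lesion masks share a large majority of their pixels, while the Hausdorff distance bounds the worst-case boundary disagreement.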

Updated: 2025-08-08 16:54:06

标题: 使用嵌入式文本的Swin-UMamba用于深度病变分割

摘要: CT上病变的分割使得可以进行自动测量,用于慢性疾病(例如淋巴瘤)的临床评估。将大型语言模型(LLMs)整合到病变分割工作流程中,有潜力将影像特征与放射学报告中病变特征的描述相结合。在这项研究中,我们调查了将文本整合到Swin-UMamba架构中进行病变分割任务的可行性。使用了公开可用的ULS23 DeepLesion数据集以及报告中研究结果的简短描述。在测试数据集上,病变分割获得了高达82%的Dice分数和6.58(像素)的低Hausdorff距离。提出的Text-Swin-UMamba模型优于先前的方法:比LLM驱动的LanGuideMedSeg模型提高了37%(p < 0.001),并分别超过了纯图像基础的xLSTM-UNet和nnUNet模型1.74%和0.22%。数据集和代码可以在https://github.com/ruida/LLM-Swin-UMamba访问。

更新时间: 2025-08-08 16:54:06

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.06453v1

TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation

Recent unsupervised domain adaptation (UDA) methods have shown great success in addressing classical domain shifts (e.g., synthetic-to-real), but they still suffer under complex shifts (e.g., geographical shift), where both the background and object appearances differ significantly across domains. Prior works showed that the language modality can help in the adaptation process, exhibiting more robustness to such complex shifts. In this paper, we introduce TRUST, a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. TRUST generates pseudo-labels for target samples from their captions and introduces a novel uncertainty estimation strategy that uses normalised CLIP similarity scores to estimate the uncertainty of the generated pseudo-labels. Such estimated uncertainty is then used to reweight the classification loss, mitigating the adverse effects of wrong pseudo-labels obtained from low-quality captions. To further increase the robustness of the vision model, we propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces, by leveraging captions to guide the contrastive training of the vision model on target images. In our contrastive loss, each pair of images acts as both a positive and a negative pair, and their feature representations are attracted and repulsed with a strength proportional to the similarity of their captions. This solution avoids the need for hard assignment of positive and negative pairs, which is difficult in the UDA setting. Our approach outperforms previous methods, setting the new state-of-the-art on classical (DomainNet) and complex (GeoNet) domain shifts. The code will be available upon acceptance.

Updated: 2025-08-08 16:51:44

标题: 信任:利用文本鲁棒性进行无监督域自适应

摘要: 最近的无监督域自适应(UDA)方法在解决经典域偏移(例如,合成到真实)方面取得了巨大成功,但它们在复杂偏移(例如地理偏移)方面仍然存在困难,即在这种情况下,背景和对象外观在不同领域之间有明显差异。先前的研究表明,语言模态可以在适应过程中起到帮助作用,表现出对这种复杂偏移更强大的鲁棒性。在本文中,我们介绍了一种新颖的UDA方法TRUST,利用语言模态的稳健性指导视觉模型的适应。TRUST从目标样本的标题生成伪标签,并引入一种使用标准化的CLIP相似度分数来估计生成的伪标签不确定性的新颖策略。然后利用估计的不确定性重新加权分类损失,减轻由低质量标题获得的错误伪标签的不利影响。为了进一步提高视觉模型的鲁棒性,我们提出了一种多模态软对比学习损失,通过利用标题来引导视觉模型对目标图像进行对比训练,从而对齐视觉和语言特征空间。在我们的对比损失中,每一对图像都既是正对又是负对,它们的特征表示会根据其标题的相似性吸引和排斥。这种解决方案避免了对正负对进行硬性判定的需求,而这在UDA设置中尤为困难。我们的方法优于先前的方法,在经典(DomainNet)和复杂(GeoNet)域偏移上取得了新的最新技术。代码将在接受后提供。

更新时间: 2025-08-08 16:51:44

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2508.06452v1

Conditional Diffusion Models are Medical Image Classifiers that Provide Explainability and Uncertainty for Free

Discriminative classifiers have become a foundational tool in deep learning for medical imaging, excelling at learning separable features of complex data distributions. However, these models often need careful design, augmentation, and training techniques to ensure safe and reliable deployment. Recently, diffusion models have become synonymous with generative modeling in 2D. These models showcase robustness across a range of tasks including natural image classification, where classification is performed by comparing reconstruction errors across images generated for each possible conditioning input. This work presents the first exploration of the potential of class conditional diffusion models for 2D medical image classification. First, we develop a novel majority voting scheme shown to improve the performance of medical diffusion classifiers. Next, extensive experiments on the CheXpert and ISIC Melanoma skin cancer datasets demonstrate that foundation and trained-from-scratch diffusion models achieve competitive performance against SOTA discriminative classifiers without the need for explicit supervision. In addition, we show that diffusion classifiers are intrinsically explainable, and can be used to quantify the uncertainty of their predictions, increasing their trustworthiness and reliability in safety-critical, clinical contexts. Further information is available on our project page: https://faverogian.github.io/med-diffusion-classifier.github.io/.
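The decision rule described above — score each class by how well its conditional model reconstructs the input, then majority-vote over noise draws — can be sketched with stand-in "denoisers". Simple prototype-pulling functions replace real class-conditional diffusion models here, so this shows only the classification rule, not diffusion itself.

```python
import numpy as np

def diffusion_style_classify(x, denoisers, n_trials=8, sigma=0.5, seed=0):
    """Pick the class whose conditional denoiser best reconstructs x,
    majority-voted across independent noise draws."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(denoisers), dtype=int)
    for _ in range(n_trials):
        noise = rng.normal(scale=sigma, size=np.shape(x))
        errs = [np.mean((d(x + noise) - x) ** 2) for d in denoisers]
        votes[int(np.argmin(errs))] += 1
    return int(np.argmax(votes))

# Stand-in conditional "denoisers": each pulls the input toward its class prototype.
mu0, mu1 = np.zeros(4), np.full(4, 5.0)
denoisers = [lambda z, mu=mu0: 0.5 * (z + mu),
             lambda z, mu=mu1: 0.5 * (z + mu)]
```

The per-class reconstruction errors double as an uncertainty signal: when the vote is split or the error gap is small, the prediction is less trustworthy — which is the free explainability/uncertainty property the title refers to.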

Updated: 2025-08-08 16:49:48

标题: 条件扩散模型是医学图像分类器,提供免费的可解释性和不确定性

摘要: 判别分类器已成为深度学习在医学影像领域的基础工具,擅长学习复杂数据分布的可分离特征。然而,这些模型通常需要仔细设计、增强和训练技术,以确保安全可靠的部署。最近,扩散模型已成为2D生成建模的代名词。这些模型展示了在包括自然图像分类在内的一系列任务中的稳健性,其中分类是通过比较为每个可能的条件输入生成的图像的重建错误来执行的。本研究首次探讨了基于类条件扩散模型用于2D医学图像分类的潜力。首先,我们开发了一种新颖的多数投票方案,显示出改善医学扩散分类器性能。接下来,在CheXpert和ISIC黑色素瘤皮肤癌数据集上进行了大量实验,证明了基础和从头开始训练的扩散模型在无需明确监督的情况下与SOTA判别分类器具有竞争性性能。此外,我们展示了扩散分类器本质上是可解释的,并且可以用于量化其预测的不确定性,在安全关键的临床环境中提高其可信度和可靠性。有关更多信息,请访问我们的项目页面:https://faverogian.github.io/med-diffusion-classifier.github.io/.

更新时间: 2025-08-08 16:49:48

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2502.03687v2

eSASRec: Enhancing Transformer-based Recommendations in a Modular Fashion

Since their introduction, Transformer-based models, such as SASRec and BERT4Rec, have become common baselines for sequential recommendations, surpassing earlier neural and non-neural methods. A number of follow-up publications have shown that the effectiveness of these models can be improved by, for example, slightly updating the architecture of the Transformer layers, using better training objectives, and employing improved loss functions. However, the additivity of these modular improvements has not been systematically benchmarked - this is the gap we aim to close in this paper. Through our experiments, we identify a very strong model that uses SASRec's training objective, LiGR Transformer layers, and Sampled Softmax Loss. We call this combination eSASRec (Enhanced SASRec). While we primarily focus on realistic, production-like evaluation, in our preliminary study we find that common academic benchmarks show eSASRec to be 23% more effective compared to the most recent state-of-the-art models, such as ActionPiece. In our main production-like benchmark, eSASRec resides on the Pareto frontier in terms of the accuracy-coverage tradeoff (alongside the recent industrial models HSTU and FuXi). As the modifications compared to the original SASRec are relatively straightforward and no extra features are needed (such as timestamps in HSTU), we believe that eSASRec can be easily integrated into existing recommendation pipelines and can serve as a strong yet very simple baseline for emerging complicated algorithms. To facilitate this, we provide the open-source implementations for our models and benchmarks in repository https://github.com/blondered/transformer_benchmark
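The Sampled Softmax Loss named above is, at its core, cross-entropy over the positive item plus a sampled set of negatives; a minimal version (omitting the logQ sampling correction that full implementations may apply):

```python
import numpy as np

def sampled_softmax_loss(user_emb, item_embs, pos_id, neg_ids):
    """Cross-entropy of the positive item against sampled negatives."""
    ids = np.concatenate(([pos_id], neg_ids))
    logits = item_embs[ids] @ user_emb
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                           # positive sits at index 0

user = np.array([1.0, 0.0])
items = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
loss_good = sampled_softmax_loss(user, items, 0, [1, 2])   # aligned positive
loss_bad = sampled_softmax_loss(user, items, 2, [0, 1])    # misaligned positive
```

Restricting the softmax to a small sampled candidate set is what keeps the objective tractable over catalogs with millions of items.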

Updated: 2025-08-08 16:49:03

标题: eSASRec:以模块化方式增强基于Transformer的推荐系统

摘要: 自引入以来,基于Transformer的模型,如SASRec和BERT4Rec,已成为顺序推荐的常见基准线,超越了早期的神经和非神经方法。许多后续出版物表明,通过稍微更新Transformer层的架构,使用更好的训练目标和使用改进的损失函数等方式,这些模型的有效性可以得到提高。然而,这些模块化改进的可加性尚未得到系统的基准测试-这是我们在本文中的目标。通过我们的实验,我们确定了一个非常强大的模型,该模型使用了SASRec的训练目标,LiGR Transformer层和Sampled Softmax Loss。我们将这种组合称为eSASRec(增强SASRec)。尽管我们主要关注实际的、类似于生产的评估,在我们的初步研究中,我们发现常见的学术基准显示,与最新的ActionPiece等最新技术模型相比,eSASRec的有效性提高了23%。在我们的主要类似生产的基准测试中,eSASRec在准确性-覆盖率权衡方面处于帕累托边界上(与最近的工业模型HSTU和FuXi并列)。由于与原始SASRec相比的修改相对简单,而且不需要额外的功能(例如HSTU中的时间戳),我们相信eSASRec可以轻松集成到现有的推荐流程中,并且可以作为新兴复杂算法的强大而非常简单的基准线。为了便于这一点,我们在https://github.com/blondered/transformer_benchmark仓库中提供了我们模型和基准测试的开源实现。

更新时间: 2025-08-08 16:49:03

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2508.06450v1

CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The provided Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.

Updated: 2025-08-08 16:45:47

标题: CRUST-Bench:C到安全Rust转译的综合基准

摘要: C转换为Rust是现代化遗留C代码并增强安全性和与现代Rust生态系统的互操作性的关键。然而,目前还没有用于评估系统是否能够将C转换为通过一组测试用例的安全Rust的数据集。我们介绍了CRUST-Bench,这是一个包含100个C代码存储库的数据集,每个存储库都与手动编写的安全Rust接口以及可用于验证转换正确性的测试用例配对。通过考虑整个存储库而不是孤立的函数,CRUST-Bench捕捉了跨多个文件具有依赖关系的复杂项目翻译的挑战。提供的Rust接口提供明确的规范,确保遵循惯用的、内存安全的Rust模式,而伴随的测试用例则强制执行功能正确性。我们评估了这一任务的最先进的大型语言模型(LLMs),发现安全和惯用的Rust生成对于各种最先进的方法和技术仍然是一个具有挑战性的问题。我们还提供了关于LLMs在从C转换为安全Rust时通常出现的错误的见解。表现最好的模型,OpenAI o1,在单次设置中只能解决15个任务。对CRUST-Bench的改进将导致改进的转换系统,可以推理复杂情景并帮助将遗留代码库从C迁移到确保内存安全的Rust等语言。您可以在https://github.com/anirudhkhatry/CRUST-bench找到数据集和代码。

更新时间: 2025-08-08 16:45:47

领域: cs.SE,cs.CL,cs.LG

下载: http://arxiv.org/abs/2504.15254v2

Echoes of Automation: The Increasing Use of LLMs in Newsmaking

The rapid rise of Generative AI (GenAI), particularly LLMs, poses concerns for journalistic integrity and authorship. This study examines AI-generated content across over 40,000 news articles from major, local, and college news media, in various media formats. Using three advanced AI-text detectors (e.g., Binoculars, Fast-Detect GPT, and GPTZero), we find a substantial increase in GenAI use in recent years, especially in local and college news. Sentence-level analysis reveals LLMs are often used in the introductions of news articles, while conclusions are usually written manually. Linguistic analysis shows GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles, particularly in local media.

Updated: 2025-08-08 16:38:33

标题: 自动化的回声:在新闻制作中越来越多地使用LLM

摘要: 快速崛起的生成式人工智能(GenAI),特别是LLMs,引起了对新闻诚信和作者身份的担忧。本研究对来自主要、地方和大学新闻媒体的40,000多篇新闻文章中的AI生成内容进行了研究,涵盖了各种媒体格式。利用三种先进的AI文本检测器(例如Binoculars、Fast-Detect GPT和GPTZero),我们发现近年来GenAI的使用大幅增加,特别是在地方和大学新闻中。句子级分析显示LLMs经常用于新闻的引言部分,而结论通常是手动撰写的。语言分析显示GenAI提升了词汇丰富性和可读性,但降低了正式性,导致更统一的写作风格,特别是在地方媒体中。

更新时间: 2025-08-08 16:38:33

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.06445v1

The Fair Game: Auditing & Debiasing AI Algorithms Over Time

An emerging field of AI, namely Fair Machine Learning (ML), aims to quantify different types of bias (also known as unfairness) exhibited in the predictions of ML algorithms, and to design new algorithms to mitigate them. Often, the definitions of bias used in the literature are observational, i.e. they use the input and output of a pre-trained algorithm to quantify a bias under concern. In reality, these definitions are often conflicting in nature and can only be deployed if either the ground truth is known or only in retrospect after deploying the algorithm. Thus, there is a gap between what we want Fair ML to achieve and what it does in a dynamic social environment. Hence, we propose an alternative dynamic mechanism, "Fair Game", to assure fairness in the predictions of an ML algorithm and to adapt its predictions as society interacts with the algorithm over time. "Fair Game" puts an Auditor and a Debiasing algorithm in a loop around an ML algorithm, leveraging Reinforcement Learning (RL). RL algorithms interact with an environment to make decisions, which yields new observations (also known as data/feedback) from the environment and, in turn, adapts future decisions. RL is already used in algorithms with pre-fixed long-term fairness goals. "Fair Game" provides a unique framework where the fairness goals can be adapted over time by only modifying the auditor and the different biases it quantifies. Thus, "Fair Game" aims to simulate the evolution of ethical and legal frameworks in society by creating an auditor which sends feedback to a debiasing algorithm deployed around an ML system. This allows us to develop a flexible, adaptive-over-time framework for building Fair ML systems pre- and post-deployment.

Updated: 2025-08-08 16:36:16

标题: 公平游戏:随时间审计和消除人工智能算法的偏见

摘要: AI领域的一个新兴领域,即公平机器学习(ML),旨在量化ML算法预测中表现出的不同类型的偏见(也称为不公平性),并设计新算法来减轻它们。通常,文献中使用的偏见定义是观察性的,即它们使用经过预训练的算法的输入和输出来量化关注的偏见。实际上,这些定义通常是相互冲突的,并且只能在已知基本事实或在部署算法后的回顾中才能部署。因此,在我们希望公平机器学习实现的目标和它在动态社会环境中的实际表现之间存在差距。因此,我们提出了一种另类的动态机制,“公平游戏”,以确保ML算法的预测公平性,并在社会随时间与算法互动时调整其预测。“公平游戏”在一个ML算法周围的循环中将审计员和去偏差算法组合在一起。“公平游戏”通过利用强化学习(RL)将这两个组件放入循环中。强化学习算法与环境互动以做出决策,从而从环境中产生新观察(也称为数据/反馈),并相应地调整未来决策。RL已经在具有预先设定的长期公平目标的算法中使用。 “公平游戏”提供了一个独特的框架,其中公平目标可以通过仅修改审计员和其量化的不同偏见来随时间调整。因此,“公平游戏”旨在通过创建一个向围绕ML系统部署的去偏差算法发送反馈的审计员来模拟社会伦理和法律框架的演变。这使我们能够开发一个灵活且随时间适应的框架,以构建在部署前和部署后的公平ML系统。

更新时间: 2025-08-08 16:36:16

领域: cs.AI,cs.CY,cs.ET,cs.GT

下载: http://arxiv.org/abs/2508.06443v1

Optimal sampling for least-squares approximation

Least-squares approximation is one of the most important methods for recovering an unknown function from data. While in many applications the data is fixed, in many others there is substantial freedom to choose where to sample. In this paper, we review recent progress on near-optimal random sampling strategies for (weighted) least-squares approximation in arbitrary linear spaces. We introduce the Christoffel function as a key quantity in the analysis of (weighted) least-squares approximation from random samples, then show how it can be used to construct a random sampling strategy, termed Christoffel sampling, that possesses near-optimal sample complexity: namely, the number of samples scales log-linearly in the dimension of the approximation space $n$. We discuss a series of variations, extensions and further topics, and throughout highlight connections to approximation theory, machine learning, information-based complexity and numerical linear algebra. Finally, motivated by various contemporary applications, we consider a generalization of the classical setting where the samples need not be pointwise samples of a scalar-valued function, and the approximation space need not be linear. We show that, even in this significantly more general setting, suitable generalizations of Christoffel function still determine the sample complexity. Consequently, these can be used to design enhanced, Christoffel sampling strategies in a unified way for general recovery problems. This article is largely self-contained, and intended to be accessible to nonspecialists.
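The pipeline the article reviews — evaluate the Christoffel function K(x) of the approximation space, draw samples with density K(x)/n, and reweight the least-squares problem by n/K(x) — can be sketched for Legendre polynomials on [-1,1] with the uniform measure. Rejection sampling is an implementation convenience here, not part of the theory.

```python
import numpy as np
from numpy.polynomial import legendre

def orthonormal_legendre(x, n):
    """First n Legendre polynomials, orthonormal w.r.t. dmu = dx/2 on [-1,1]."""
    x = np.asarray(x, dtype=float)
    V = np.stack([legendre.legval(x, [0.0] * k + [1.0]) for k in range(n)], axis=-1)
    return V * np.sqrt(2.0 * np.arange(n) + 1.0)

def christoffel_sample(n, m, seed=0):
    """Draw m points with density K(x)/n w.r.t. dmu, plus weights n/K(x).
    Rejection works because K(x) <= n^2 on [-1,1]."""
    rng = np.random.default_rng(seed)
    xs = []
    while len(xs) < m:
        x = rng.uniform(-1.0, 1.0)
        K = float((orthonormal_legendre(x, n) ** 2).sum())
        if rng.uniform(0.0, 1.0) < K / n**2:
            xs.append(x)
    xs = np.array(xs)
    K = (orthonormal_legendre(xs, n) ** 2).sum(axis=1)
    return xs, n / K

def weighted_lsq(f, n, m=100, seed=0):
    """Weighted least-squares fit of f in the span of the first n basis functions."""
    xs, w = christoffel_sample(n, m, seed)
    A = np.sqrt(w)[:, None] * orthonormal_legendre(xs, n)
    b = np.sqrt(w) * f(xs)
    return np.linalg.lstsq(A, b, rcond=None)[0]
```

With m scaling log-linearly in n, the weighted Gram matrix concentrates near the identity — which is exactly the near-optimal sample-complexity property of Christoffel sampling reviewed above.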

Updated: 2025-08-08 16:34:08

标题: 最优采样用于最小二乘逼近

摘要: 最小二乘逼近是从数据中恢复未知函数的最重要方法之一。在许多应用中,数据是固定的,但在许多其他情况下,可以自由选择采样位置。在本文中,我们回顾了最近在任意线性空间中(加权)最小二乘逼近的近似最优随机采样策略的进展。我们引入了Christoffel函数作为(加权)最小二乘逼近的分析中的关键量,然后展示了如何利用它来构建一种随机采样策略,称为Christoffel采样,具有近似最优的采样复杂度:即,样本数量与逼近空间的维度$n$的对数线性相关。我们讨论了一系列变化、扩展和进一步的主题,并强调与逼近理论、机器学习、基于信息的复杂性和数值线性代数的联系。最后,受到各种当代应用的启发,我们考虑了一个经典设置的泛化,其中样本不必是标量值函数的点值样本,逼近空间也不必是线性的。我们展示,即使在这种显著更一般的设置中,Christoffel函数的适当泛化仍然确定了样本复杂度。因此,这些可以用于以统一方式设计增强的、适用于一般恢复问题的Christoffel采样策略。本文基本上是自包含的,并旨在对非专业人士易于理解。

更新时间: 2025-08-08 16:34:08

领域: stat.ML,cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2409.02342v2
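As a concrete illustration of the Christoffel-sampling recipe summarized above, the sketch below draws points with density proportional to the inverse Christoffel function (the row sums of squared basis evaluations) and solves the induced weighted least-squares problem. Function names are assumptions for the sketch, a fine grid stands in for the continuous measure, and the basis is assumed (approximately) orthonormal for the theory to apply.

```python
import numpy as np

def christoffel_sampling(basis, grid, m, rng):
    """Draw m sample points from `grid` with probability proportional to the
    inverse Christoffel function k(x) = sum_i |phi_i(x)|^2, and return the
    points, the least-squares weights w(x) = 1/k(x), and the design rows."""
    Phi = np.array([basis(x) for x in grid])   # (M, n) basis evaluations
    k = (Phi**2).sum(axis=1)                   # inverse Christoffel function
    idx = rng.choice(len(grid), size=m, p=k / k.sum())
    return grid[idx], 1.0 / k[idx], Phi[idx]

def weighted_lstsq(Phi_s, w, y):
    """Solve min_c sum_j w_j (Phi_s[j] @ c - y_j)^2."""
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * Phi_s, sw * y, rcond=None)
    return coef
```

With on the order of $n \log n$ samples drawn this way, the weighted least-squares fit is near-optimally stable, which is the sample-complexity claim the abstract refers to.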

Position: Lifetime tuning is incompatible with continual reinforcement learning

In continual RL we want agents capable of never-ending learning, and yet our evaluation methodologies do not reflect this. The standard practice in RL is to assume unfettered access to the deployment environment for the full lifetime of the agent. For example, agent designers select the best-performing hyperparameters in Atari by testing each for 200 million frames and then reporting results on 200 million frames. In this position paper, we argue and demonstrate the pitfalls of this inappropriate empirical methodology: lifetime tuning. We provide empirical evidence to support our position by testing DQN and SAC across several continuing and non-stationary environments, with two main findings: (1) lifetime tuning does not allow us to identify algorithms that work well for continual learning -- all algorithms equally succeed; (2) recently developed continual RL algorithms outperform standard non-continual algorithms when tuning is limited to a fraction of the agent's lifetime. The goal of this paper is to provide an explanation for why recent progress in continual RL has been mixed and motivate the development of empirical practices that better match the goals of continual RL.

Updated: 2025-08-08 16:28:26

标题: 位置:终身调谐与持续强化学习不兼容

摘要: 在持续强化学习中,我们希望代理能够进行永无止境的学习,然而我们的评估方法并不能反映这一点。在强化学习中的标准做法是假设代理在其整个生命周期内可以自由访问部署环境。例如,在Atari中,代理设计者通过对每个超参数进行2亿帧的测试来选择表现最好的超参数,然后在2亿帧上报告结果。在这篇观点论文中,我们讨论并展示了这种不恰当的实证方法——生命周期调整的缺点。我们通过在多个持续和非稳态环境中测试DQN和SAC来提供实证证据支持我们的观点,并得出两个主要发现:(1)生命周期调整无法让我们找到适用于持续学习的算法——所有算法都同样成功;(2)最近发展的持续强化学习算法在调整限制为代理生命周期的一部分时优于标准的非持续算法。本文的目标是解释为什么持续强化学习的最近进展是参差不齐的,并激励开发更符合持续强化学习目标的实证实践。

更新时间: 2025-08-08 16:28:26

领域: cs.LG

下载: http://arxiv.org/abs/2404.02113v4
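The methodological point above — that hyperparameters should be selected using only a fraction of the agent's lifetime — can be demonstrated with a toy selector. This is a hypothetical sketch (names and curves are illustrative, not from the paper):

```python
def select_config(returns_by_config, tuning_fraction):
    """Pick the configuration with the best average return over only the
    first `tuning_fraction` of each learning curve, mimicking deployment
    where the full lifetime is not available for tuning."""
    def early_score(curve):
        cut = max(1, int(len(curve) * tuning_fraction))
        return sum(curve[:cut]) / cut
    return max(returns_by_config, key=lambda c: early_score(returns_by_config[c]))
```

Lifetime tuning corresponds to `tuning_fraction=1.0`; the paper's argument is that rankings of algorithms can flip once tuning is restricted to an early fraction, as the toy curves below show.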

Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages

Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. Their adaptability raises the question of whether knowledge acquired through fine-tuning in a few languages can transfer to unseen languages that only appeared during pre-training. To examine this, we fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets from X/Twitter across 13 languages, a domain characterised by polarised, culturally specific discourse. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases. Results show that LLMs fine-tuned in one or two languages can reliably classify immigration-related content in unseen languages. However, identifying whether a tweet expresses a pro- or anti-immigration stance benefits from multilingual fine-tuning. Pre-training bias favours dominant languages, but even minimal exposure to under-represented languages during fine-tuning (as little as $9.62\times10^{-11}$ of the original pre-training token volume) yields significant gains. These findings challenge the assumption that cross-lingual mastery requires extensive multilingual training: limited language coverage suffices for topic-level generalisation, and structural biases can be corrected with lightweight interventions. By releasing 4-bit-quantised, LoRA fine-tuned models, we provide an open-source, reproducible alternative to proprietary LLMs that delivers 35 times faster inference at just 0.00000989% of the dollar cost of the OpenAI GPT-4o model, enabling scalable, inclusive research.

Updated: 2025-08-08 16:23:24

标题: 学习主题,而非语言:LLMs如何跨语言分类在线移民话语

摘要: 大型语言模型(LLMs)正在通过实现可扩展、精确的分析来改变社会科学研究。它们的适应性引发了一个问题,即通过在少数语言中进行微调获取的知识是否可以转移到在预训练过程中仅出现过的未知语言中。为了研究这一点,我们在单语、双语或多语数据集上对轻量级LLaMA 3.2-3B模型进行微调,以对来自X/Twitter的涉及移民的推文进行分类,跨越13种语言,这是一个以极化、文化特定话语为特征的领域。我们评估了是否最小的语言特定微调可以实现跨语言主题检测,以及添加有针对性的语言是否可以纠正预训练偏见。结果表明,在一个或两种语言中微调的LLMs可以可靠地对未知语言中与移民相关的内容进行分类。然而,确定一条推文表达支持还是反对移民立场的受益于多语言微调。预训练偏见偏袒主导语言,但即使在微调过程中对代表性较低的语言进行最小暴露(仅为原始预训练令牌量的$9.62\times10^{-11}$)也会带来显著收益。这些发现挑战了跨语言掌握需要广泛的多语言训练的假设:有限的语言覆盖足以实现主题级别的泛化,并且结构性偏见可以通过轻量级干预来纠正。通过发布4位量化、LoRA微调模型,我们提供了一个开源、可重现的替代方案,可以以OpenAI GPT-4o模型成本的0.00000989%实现35倍更快的推理,从而实现可扩展、包容性研究。

更新时间: 2025-08-08 16:23:24

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.06435v1

CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.

Updated: 2025-08-08 16:23:05

标题: CLIPin:一种用于多模态语义对齐的CLIP非对比插件

摘要: 大规模自然图像-文本数据集,特别是那些自动从网络收集的数据集,通常由于弱监督而遭受松散的语义对准,而医学数据集往往具有高跨模态相关性但内容多样性较低。这些特性为对比语言-图像预训练(CLIP)提出了一个共同的挑战:它们阻碍了模型学习稳健且可泛化的表示。在这项工作中,我们提出了CLIPin,一个统一的非对比插件,可以无缝地集成到CLIP风格的架构中,以改善多模态语义对准,提供更强的监督并增强对准的稳健性。此外,分别为图像和文本模态设计了两个共享的预投影器,以便以参数妥协的方式促进对比和非对比学习的集成。对各种下游任务的广泛实验表明,CLIPin作为一个兼容各种对比框架的即插即用组件具有有效性和泛化性。代码可在https://github.com/T6Yang/CLIPin找到。

更新时间: 2025-08-08 16:23:05

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.06434v1

Memp: Exploring Agent Procedural Memory

Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.

Updated: 2025-08-08 16:20:56

标题: Memp:探索代理程序性记忆

摘要: 基于大型语言模型(LLMs)的代理在各种任务中表现出色,但它们受到脆弱的程序性记忆的困扰,该记忆通常是手工设计或与静态参数纠缠在一起的。在这项工作中,我们研究了赋予代理具有可学习、可更新和终身程序性记忆的策略。我们提出了Memp,将过去的代理轨迹提炼成细粒度的逐步指令和更高级别的类似脚本的抽象,并探讨了程序性记忆的构建、检索和更新的不同策略的影响。结合一个不断更新、纠正和淘汰其内容的动态制度,这个存储库与新经验同步演变。对TravelPlanner和ALFWorld的实证评估显示,随着记忆存储库的完善,代理在类似任务上持续取得更高的成功率和更高的效率。此外,从更强模型构建的程序性记忆保持其价值:将程序性记忆迁移到较弱模型会带来显著的性能提升。

更新时间: 2025-08-08 16:20:56

领域: cs.CL,cs.AI,cs.LG,cs.MA

下载: http://arxiv.org/abs/2508.06433v1
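The Build / Retrieve / Update cycle for procedural memory described above can be illustrated with a toy store. Everything here is a hypothetical sketch (token-overlap retrieval, success-rate deprecation); the paper's actual distillation and retrieval mechanisms are richer.

```python
from collections import Counter

class ProceduralMemory:
    """Minimal Build / Retrieve / Update store in the spirit of the abstract:
    distilled step lists keyed by a task description, retrieved by token
    overlap, and deprecated when their success rate drops."""

    def __init__(self):
        self.entries = []

    def build(self, task, steps):
        """Distill a trajectory into a stored procedure (here: a step list)."""
        self.entries.append({"task": task, "steps": steps, "ok": 0, "uses": 0})

    def _sim(self, a, b):
        # Jaccard-style overlap between whitespace tokens
        ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
        inter = sum((ta & tb).values())
        return inter / max(1, sum(ta.values()) + sum(tb.values()) - inter)

    def retrieve(self, task):
        """Return the stored procedure most similar to the new task."""
        if not self.entries:
            return None
        return max(self.entries, key=lambda e: self._sim(e["task"], task))

    def update(self, entry, success, min_rate=0.3, min_uses=3):
        """Track outcomes and deprecate procedures that keep failing."""
        entry["uses"] += 1
        entry["ok"] += int(success)
        if entry["uses"] >= min_uses and entry["ok"] / entry["uses"] < min_rate:
            self.entries.remove(entry)
```

The deprecation rule is what makes the repository "evolve in lockstep with new experience": unreliable procedures are pruned rather than accumulated.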

SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation

Deep learning has revolutionized medical imaging, but its effectiveness is severely limited by insufficient labeled training data. This paper introduces a novel GAN-based semi-supervised learning framework specifically designed for low labeled-data regimes, evaluated across settings with 5 to 50 labeled samples per class. Our approach integrates three specialized neural networks -- a generator for class-conditioned image translation, a discriminator for authenticity assessment and classification, and a dedicated classifier -- within a three-phase training framework. The method alternates between supervised training on limited labeled data and unsupervised learning that leverages abundant unlabeled images through image-to-image translation rather than generation from noise. We employ ensemble-based pseudo-labeling that combines confidence-weighted predictions from the discriminator and classifier with temporal consistency through exponential moving averaging, enabling reliable label estimation for unlabeled data. Comprehensive evaluation across eleven MedMNIST datasets demonstrates that our approach achieves statistically significant improvements over six state-of-the-art GAN-based semi-supervised methods, with particularly strong performance in the extreme 5-shot setting where the scarcity of labeled data is most challenging. The framework maintains its superiority across all evaluated settings (5, 10, 20, and 50 shots per class). Our approach offers a practical solution for medical imaging applications where annotation costs are prohibitive, enabling robust classification performance even with minimal labeled data. Code is available at https://github.com/GuidoManni/SPARSE.

Updated: 2025-08-08 16:16:43

标题: 稀疏数据,丰富结果:通过类别条件图像翻译进行少样本半监督学习

摘要: 深度学习已经彻底改变了医学影像,但其有效性受到标记训练数据不足的严重限制。本文介绍了一种新颖的基于GAN的半监督学习框架,专门设计用于低标记数据情况,在每个类别5到50个标记样本的设置下进行评估。我们的方法整合了三个专门的神经网络--一个用于类别条件图像翻译的生成器,一个用于真实性评估和分类的鉴别器,以及一个专门的分类器--在一个三阶段训练框架内。该方法在有限标记数据上进行监督训练和利用丰富的未标记图像通过图像到图像翻译而不是从噪声生成的无监督学习之间交替进行。我们采用基于集成的伪标记方法,结合了鉴别器和分类器的置信度加权预测,并通过指数移动平均实现了时间一致性,从而为未标记数据提供可靠的标记估计。在十一个MedMNIST数据集上进行的全面评估表明,我们的方法相比六种最先进的基于GAN的半监督方法取得了统计显著的改进,特别是在极端的5-shot设置下,标记数据的稀缺性最具挑战性。该框架在所有评估设置(每个类别5、10、20和50个样本)下保持了其优越性。我们的方法为医学影像应用提供了一个实用的解决方案,适用于标注成本过高的场景,即使只有很少的标记数据,也能实现可靠的分类性能。代码可在https://github.com/GuidoManni/SPARSE上获得。

更新时间: 2025-08-08 16:16:43

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.06429v1
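The ensemble pseudo-labeling step described above — confidence-weighted fusion of the discriminator and classifier predictions, smoothed by an exponential moving average for temporal consistency — can be sketched as follows. Function and argument names are illustrative, not from the released code.

```python
import numpy as np

def ensemble_pseudo_labels(p_disc, p_clf, ema_prev, alpha=0.9, tau=0.95):
    """Confidence-weighted ensemble of two prediction heads with an EMA over
    training steps; returns the smoothed probabilities, argmax labels, and a
    mask of samples confident enough to pseudo-label."""
    # weight each head by its own max-probability confidence
    w_d = p_disc.max(axis=1, keepdims=True)
    w_c = p_clf.max(axis=1, keepdims=True)
    p_ens = (w_d * p_disc + w_c * p_clf) / (w_d + w_c)
    ema = alpha * ema_prev + (1 - alpha) * p_ens      # temporal consistency
    ema = ema / ema.sum(axis=1, keepdims=True)
    labels = ema.argmax(axis=1)
    mask = ema.max(axis=1) >= tau                     # keep only confident ones
    return ema, labels, mask
```

Only samples passing the confidence mask would contribute pseudo-labels to the unsupervised phase; disagreement between the two heads naturally lowers the ensemble confidence and excludes the sample.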

SPARTA: Advancing Sparse Attention in Spiking Neural Networks via Spike-Timing-Based Prioritization

Current Spiking Neural Networks (SNNs) underutilize the temporal dynamics inherent in spike-based processing, relying primarily on rate coding while overlooking precise timing information that provides rich computational cues. We propose SPARTA (Spiking Priority Attention with Resource-Adaptive Temporal Allocation), a framework that leverages heterogeneous neuron dynamics and spike-timing information to enable efficient sparse attention. SPARTA prioritizes tokens based on temporal cues, including firing patterns, spike timing, and inter-spike intervals, achieving 65.4% sparsity through competitive gating. By selecting only the most salient tokens, SPARTA reduces attention complexity from O(N^2) to O(K^2) with K << N, while maintaining high accuracy. Our method achieves state-of-the-art performance on DVS-Gesture (98.78%) and competitive results on CIFAR10-DVS (83.06%) and CIFAR-10 (95.3%), demonstrating that exploiting spike timing dynamics improves both computational efficiency and accuracy.

Updated: 2025-08-08 16:16:24

标题: 斯巴达:通过基于脉冲时序的优先级排序推进尖峰神经网络中的稀疏注意力

摘要: 目前的脉冲神经网络(SNNs)未充分利用脉冲处理中固有的时间动态,主要依赖于速率编码,忽视了提供丰富计算线索的精确时间信息。我们提出了SPARTA(具有资源自适应时间分配的脉冲优先注意力),这是一个利用异质神经元动态和脉冲时序信息实现高效稀疏注意力的框架。SPARTA根据包括发射模式、脉冲时序和脉冲间隔在内的时间线索对令牌进行优先级排序,通过竞争门控实现65.4%的稀疏性。通过仅选择最显著的令牌,SPARTA将注意力复杂度从O(N^2)降低到O(K^2),其中K << N,同时保持高准确性。我们的方法在DVS-Gesture上实现了最先进的性能(98.78%),并在CIFAR10-DVS(83.06%)和CIFAR-10(95.3%)上取得了竞争性结果,表明利用脉冲时序动态可以改善计算效率和准确性。

更新时间: 2025-08-08 16:16:24

领域: cs.LG,cs.AI,cs.NE,I.2.6; I.2.10; C.1.3

下载: http://arxiv.org/abs/2508.01646v2
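The O(N^2) → O(K^2) reduction claimed above comes from restricting attention to the K highest-priority tokens. The toy sketch below shows that selection step; the `priority` vector stands in for the spike-timing cues (firing patterns, inter-spike intervals), and all names are illustrative.

```python
import numpy as np

def sparse_topk_attention(Q, K_mat, V, priority, k):
    """Softmax attention restricted to the k highest-priority tokens, so the
    pairwise score matrix is (k, k) instead of (N, N)."""
    keep = np.argsort(priority)[-k:]            # indices of the top-k tokens
    Qs, Ks, Vs = Q[keep], K_mat[keep], V[keep]
    scores = Qs @ Ks.T / np.sqrt(Q.shape[1])    # (k, k) score matrix
    scores -= scores.max(axis=1, keepdims=True) # numerically stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    out = np.zeros_like(V)
    out[keep] = A @ Vs                          # gated-out tokens contribute nothing
    return out, keep
```

In SPARTA the priorities emerge from competitive gating over spike statistics rather than being given, but the asymptotic saving comes entirely from this top-K restriction.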

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. In this paper, we investigate the underlying cause of this limited generalization capability. We identify shortcut learning -- the reliance on task-irrelevant features -- as a key impediment to generalization. Through comprehensive theoretical and empirical analysis, we uncover two primary contributors to shortcut learning: (1) limited diversity within individual sub-datasets, and (2) significant distributional disparities across sub-datasets, leading to dataset fragmentation. These issues arise from the inherent structure of large-scale datasets like OXE, which are typically composed of multiple sub-datasets collected independently across varied environments and embodiments. Our findings provide critical insights into dataset collection strategies that can reduce shortcut learning and enhance the generalization ability of generalist robot policies. Moreover, in scenarios where acquiring new large-scale data is impractical, we demonstrate that carefully selected robotic data augmentation strategies can effectively reduce shortcut learning in existing offline datasets, thereby improving generalization capabilities of generalist robot policies, e.g., $\pi_0$, in both simulation and real-world environments. More information at https://lucky-light-sun.github.io/proj/shortcut-learning-in-grps/.

Updated: 2025-08-08 16:14:01

标题: 通用机器人政策中的快捷学习:数据集多样性和碎片化的作用

摘要: 在大规模数据集(如Open X-Embodiment(OXE))上训练的通用机器人策略展现出在各种任务中的强大表现。然而,它们往往难以推广到超出训练数据分布范围之外。本文探讨了这种有限的泛化能力的根本原因。我们确定了捷径学习——依赖于与任务无关的特征——作为泛化的主要障碍。通过全面的理论和实证分析,我们揭示了捷径学习的两个主要因素:(1)单个子数据集内的有限多样性,以及(2)子数据集之间显著的分布差异,导致数据集碎片化。这些问题源于大规模数据集(如OXE)的固有结构,这些数据集通常由在不同环境和实体中独立收集的多个子数据集组成。我们的发现为降低捷径学习并增强通用机器人策略的泛化能力提供了关键见解。此外,在获取新的大规模数据不可行的情况下,我们证明了精心选择的机器人数据增强策略可以有效地减少现有离线数据集中的捷径学习,从而提高通用机器人策略(例如,π₀)在仿真和真实环境中的泛化能力。更多信息请访问https://lucky-light-sun.github.io/proj/shortcut-learning-in-grps/。

更新时间: 2025-08-08 16:14:01

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2508.06426v1

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. Although recent detection works have shifted to internal representations due to their rich cross-modal information, most methods rely on heuristic rules rather than principled objectives, resulting in suboptimal performance. To address these limitations, we propose Learning to Detect (LoD), a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD introduces two key components: Multi-modal Safety Concept Activation Vectors (MSCAV), which capture layer-wise safety-related representations across modalities, and the Safety Pattern Auto-Encoder, which models the distribution of MSCAV derived from safe inputs and detects anomalies via reconstruction errors. By training the auto-encoder (AE) solely on safe samples without attack labels, LoD naturally identifies jailbreak inputs as distributional anomalies, enabling accurate and unified detection of jailbreak attacks. Comprehensive experiments on three different LVLMs and five benchmarks demonstrate that LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.

Updated: 2025-08-08 16:13:28

标题: 学习在大型视觉语言模型中检测未知越狱攻击:一种统一且准确的方法

摘要: 尽管进行了大量的对齐工作,大规模视觉语言模型(LVLMs)仍然容易受到越狱攻击的威胁,造成严重的安全风险。尽管最近的检测工作由于内部表示具有丰富的跨模态信息而转向内部表示,但大多数方法依赖于启发式规则而不是原则性的目标,导致性能不佳。为了解决这些限制,我们提出了Learning to Detect(LoD),这是一个新颖的无监督框架,将越狱检测表述为异常检测。LoD引入了两个关键组件:多模态安全概念激活向量(MSCAV),它们捕获跨模态的逐层安全相关表示,以及安全模式自编码器,它对来自安全输入的MSCAV的分布进行建模,并通过重建误差检测异常。通过仅在安全样本上训练自编码器(AE)而不使用攻击标签,LoD自然地将越狱输入识别为分布异常,实现了准确且统一的越狱攻击检测。对三种不同的LVLMs和五个基准进行的全面实验表明,LoD实现了最先进的性能,平均AUROC为0.9951,最小AUROC比最强基线提高了高达38.89%。

更新时间: 2025-08-08 16:13:28

领域: cs.CR,cs.AI,cs.CV

下载: http://arxiv.org/abs/2508.09201v1
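The detection principle above — fit an auto-encoder only on safe inputs and flag anomalies by reconstruction error — can be illustrated with the linear case, where PCA is the optimal linear auto-encoder. This is a minimal stand-in for the Safety Pattern Auto-Encoder (the paper's model is nonlinear and operates on MSCAV features):

```python
import numpy as np

def fit_linear_ae(X_safe, n_components):
    """PCA as a linear auto-encoder: encode with the top principal directions
    of the safe data, decode back, and score inputs by reconstruction error
    (high error => distributional anomaly, i.e. a suspected jailbreak)."""
    mu = X_safe.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_safe - mu, full_matrices=False)
    W = Vt[:n_components]                          # (k, d) shared enc/dec weights

    def score(X):
        Z = (X - mu) @ W.T                         # encode
        X_hat = Z @ W + mu                         # decode
        return np.linalg.norm(X - X_hat, axis=1)   # reconstruction error

    return score
```

Because only safe samples are needed to fit the scorer, no attack labels are required at training time, which is what makes the detector unified across unknown attack types.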

From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models

Large Language Models (LLMs) solely trained on next-token prediction learn to solve a wide range of problems involving mathematical reasoning. But how does this ability evolve during training? We show the first analysis of how mathematical reasoning abilities of several open-weight LLMs develop during pre-training and post-training. To this end, we construct MathCAMPS, a synthetic dataset of novel mathematical reasoning problems grounded in 44 fine-grained skills taken from the Common Core curriculum from K to 8th grades. In one experiment, we show that mathematical skills are learned during pre-training in an order that measurably correlates with the human-designed curriculum, even though training data are randomly ordered. We also show a detailed analysis of which mathematical abilities benefit from instruction tuning, a widely used post-training method and, in contrast, which skills suffer. Our work paves the way for an empirical understanding of LLM training dynamics in relation to reasoning.

Updated: 2025-08-08 16:07:40

标题: 从下一个令牌到数学:语言模型中数学推理的学习动态

摘要: 仅通过下一个标记预测训练的大型语言模型(LLMs)学会了解决涉及数学推理的广泛问题。但这种能力在训练过程中是如何演化的呢?我们首次分析了多个开放权重LLMs的数学推理能力在预训练和后训练期间如何发展。为此,我们构建了MathCAMPS,这是一个新的数学推理问题合成数据集,基于从K到8年级共同核心课程(Common Core)中提取的44个细粒度技能。在一个实验中,我们展示了预训练期间数学技能的学习顺序与人类设计的课程有可测量的相关性,尽管训练数据是随机排序的。我们还详细分析了哪些数学能力受益于指令微调(一种广泛使用的后训练方法),以及相反地,哪些技能因此受损。我们的工作为实证理解LLM训练动态与推理之间的关系铺平了道路。

更新时间: 2025-08-08 16:07:40

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2407.00900v2

Algorithmic Segmentation and Behavioral Profiling for Ransomware Detection Using Temporal-Correlation Graphs

The rapid evolution of cyber threats has outpaced traditional detection methodologies, necessitating innovative approaches capable of addressing the adaptive and complex behaviors of modern adversaries. A novel framework was introduced, leveraging Temporal-Correlation Graphs to model the intricate relationships and temporal patterns inherent in malicious operations. The approach dynamically captured behavioral anomalies, offering a robust mechanism for distinguishing between benign and malicious activities in real-time scenarios. Extensive experiments demonstrated the framework's effectiveness across diverse ransomware families, with consistently high precision, recall, and overall detection accuracy. Comparative evaluations highlighted its better performance over traditional signature-based and heuristic methods, particularly in handling polymorphic and previously unseen ransomware variants. The architecture was designed with scalability and modularity in mind, ensuring compatibility with enterprise-scale environments while maintaining resource efficiency. Analysis of encryption speeds, anomaly patterns, and temporal correlations provided deeper insights into the operational strategies of ransomware, validating the framework's adaptability to evolving threats. The research contributes to advancing cybersecurity technologies by integrating dynamic graph analytics and machine learning for future innovations in threat detection. Results from this study underline the potential for transforming the way organizations detect and mitigate complex cyberattacks.

Updated: 2025-08-08 16:07:06

标题: 算法分割和行为特征分析用于使用时序相关图检测勒索软件

摘要: 网络威胁的快速演变已经超越了传统的检测方法,这需要具有创新能力的方法来应对现代对手的适应性和复杂行为。引入了一种新颖的框架,利用时间相关图来建模恶意操作中固有的复杂关系和时间模式。该方法动态捕捉行为异常,为实时场景中区分良性和恶意活动提供了强大的机制。广泛的实验证明了该框架在各种勒索软件家族中的有效性,具有一致的高精度、召回率和整体检测准确性。比较评估突显了其在处理多态和以前未见的勒索软件变种方面相对于传统基于签名和启发式方法的更好性能。该架构旨在设计具有可扩展性和模块化性,确保与企业规模环境的兼容性同时保持资源效率。对加密速度、异常模式和时间相关性的分析提供了对勒索软件操作策略的更深入洞察,验证了该框架对应对不断演变的威胁的适应性。该研究通过整合动态图分析和机器学习,为未来威胁检测的创新做出了贡献。本研究结果强调了改变组织检测和缓解复杂网络攻击的潜力。

更新时间: 2025-08-08 16:07:06

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2501.17429v2

Contextual Reinforcement in Multimodal Token Compression for Large Language Models

Effective token compression remains a critical challenge for scaling models to handle increasingly complex and diverse datasets. A novel mechanism based on contextual reinforcement is introduced, dynamically adjusting token importance through interdependencies and semantic relevance. This approach enables substantial reductions in token usage while preserving the quality and coherence of information representation. Incorporating graph-based algorithms and adaptive weighting, the method captures subtle contextual relationships across textual and multimodal data, ensuring robust alignment and performance in downstream tasks. Evaluations across varied domains reveal significant improvements in accuracy and semantic retention, particularly for tasks requiring detailed cross-modal interactions. Memory usage analyses demonstrate improved computational efficiency, with minimal overhead despite the additional reinforcement processes. Performance gains are further validated through error distribution analyses, showing reduced semantic loss and syntactic inconsistencies compared to baseline models. The modular architecture ensures compatibility with a wide range of open-source frameworks, facilitating scalable implementation for real-world applications. These findings highlight the potential of contextual reinforcement in redefining token management strategies and advancing large-scale model design.

Updated: 2025-08-08 16:06:37

标题: 大型语言模型中多模态标记压缩的情境强化

摘要: 有效的标记压缩仍然是扩展模型以处理日益复杂和多样化数据集的关键挑战。引入了一种基于上下文强化的新机制,通过相互依赖性和语义相关性动态调整标记的重要性。这种方法可以在保留信息表示的质量和连贯性的同时,显著减少标记的使用。该方法结合了基于图的算法和自适应加权,捕捉文本和多模态数据之间微妙的上下文关系,确保在下游任务中的稳健对齐和性能。跨不同领域的评估结果显示了准确性和语义保留的显著改进,特别是对于需要详细的跨模态交互的任务。内存使用分析表明,尽管额外的强化过程,计算效率得到了提高,开销很小。通过错误分布分析进一步验证了性能增益,相比基准模型,语义损失和句法不一致性都减少了。模块化架构确保与各种开源框架兼容,为实际应用提供可扩展的实现。这些发现突显了上下文强化在重新定义标记管理策略和推进大规模模型设计方面的潜力。

更新时间: 2025-08-08 16:06:37

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2501.16658v2

Neural Encrypted State Transduction for Ransomware Classification: A Novel Approach Using Cryptographic Flow Residuals

Encrypted behavioral patterns provide a unique avenue for classifying complex digital threats without reliance on explicit feature extraction, enabling detection frameworks to remain effective even when conventional static and behavioral methodologies fail. A novel approach based on Neural Encrypted State Transduction (NEST) is introduced to analyze cryptographic flow residuals and classify threats through their encrypted state transitions, mitigating evasion tactics employed through polymorphic and obfuscated attack strategies. The mathematical formulation of NEST leverages transduction principles to map state transitions dynamically, enabling high-confidence classification without requiring direct access to decrypted execution traces. Experimental evaluations demonstrate that the proposed framework achieves improved detection accuracy across multiple ransomware families while exhibiting resilience against adversarial perturbations and previously unseen attack variants. The model maintains competitive processing efficiency, offering a practical balance between classification performance and computational resource constraints, making it suitable for large-scale security deployments. Comparative assessments reveal that NEST consistently outperforms baseline classification models, particularly in detecting ransomware samples employing delayed encryption, entropy-based obfuscation, and memory-resident execution techniques. The capacity to generalize across diverse execution environments reinforces the applicability of encrypted transduction methodologies in adversarial classification tasks beyond conventional malware detection pipelines. The integration of residual learning mechanisms within the transduction layers further enhances classification robustness, minimizing both false positives and misclassification rates across varied operational contexts.

Updated: 2025-08-08 16:06:23

标题: 神经加密状态转换用于勒索软件分类:使用加密流残差的新方法

摘要: 加密行为模式为分类复杂数字威胁提供了一种独特途径,而无需依赖显式特征提取,使检测框架即使在传统静态和行为方法失败时仍能保持有效。引入了一种基于神经加密状态传导(NEST)的新方法,用于分析加密流残差并通过其加密状态转换对威胁进行分类,从而减轻通过多态和混淆的攻击策略采用的规避战术。NEST的数学公式利用传导原理动态地映射状态转换,实现高置信度分类,而无需直接访问解密执行跟踪。实验评估表明,所提出的框架在多个勒索软件家族中实现了改进的检测准确性,同时展现出对抗干扰和以前未见的攻击变种的韧性。该模型保持了竞争性的处理效率,提供了分类性能和计算资源限制之间的实际平衡,使其适用于大规模安全部署。比较评估显示,NEST在基准分类模型中始终表现出色,特别是在检测采用延迟加密、基于熵的混淆和内存驻留执行技术的勒索软件样本方面。跨多样化执行环境的泛化能力加强了加密传导方法在超越传统恶意软件检测流程的对抗分类任务中的适用性。在传导层中集成残差学习机制进一步增强了分类的鲁棒性,在各种操作环境中降低了误报和误分率。

更新时间: 2025-08-08 16:06:23

领域: cs.CR

下载: http://arxiv.org/abs/2502.05341v2

Hierarchical Pattern Decryption Methodology for Ransomware Detection Using Probabilistic Cryptographic Footprints

The increasing sophistication of encryption-based ransomware has demanded innovative approaches to detection and mitigation, prompting the development of a hierarchical framework grounded in probabilistic cryptographic analysis. By focusing on the statistical characteristics of encryption patterns, the proposed methodology introduces a layered approach that combines advanced clustering algorithms with machine learning to isolate ransomware-induced anomalies. Through comprehensive testing across diverse ransomware families, the framework demonstrated exceptional accuracy, effectively distinguishing malicious encryption operations from benign activities while maintaining low false positive rates. The system's design integrates dynamic feedback mechanisms, enabling adaptability to varying cryptographic complexities and operational environments. Detailed entropy-based evaluations revealed its sensitivity to subtle deviations in encryption workflows, offering a robust alternative to traditional detection methods reliant on static signatures or heuristics. Computational benchmarks confirmed its scalability and efficiency, achieving consistent performance even under high data loads and complex cryptographic scenarios. The inclusion of real-time clustering and anomaly evaluation ensures rapid response capabilities, addressing critical latency challenges in ransomware detection. Performance comparisons with established methods highlighted its improvements in detection efficacy, particularly against advanced ransomware employing extended key lengths and unique cryptographic protocols.

Updated: 2025-08-08 16:06:01

标题: 基于概率密码足迹的勒索软件检测的分层模式解密方法论

摘要: 加密勒索软件的日益复杂要求创新的检测和减轻方法,促使了基于概率加密分析的分层框架的发展。通过专注于加密模式的统计特征,所提出的方法论引入了一种结合高级聚类算法和机器学习的分层方法,以隔离由勒索软件引起的异常。通过对各种勒索软件家族进行全面测试,该框架表现出色,有效区分了恶意加密操作和良性活动,同时保持低误报率。系统设计集成了动态反馈机制,使其能够适应不同的加密复杂性和操作环境。详细的基于熵的评估显示其对加密工作流中微小偏差的敏感性,为传统依赖静态签名或启发式的检测方法提供了强大的替代方案。计算基准确认了其可伸缩性和效率,在高数据负载和复杂的加密场景下实现了一致的性能。实时聚类和异常评估的加入确保了快速响应能力,解决了勒索软件检测中的关键延迟挑战。与已建立的方法进行性能比较突显了其在检测有效性方面的改进,尤其是在应用扩展密钥长度和独特加密协议的高级勒索软件方面。

更新时间: 2025-08-08 16:06:01

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2501.15084v2
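The entropy-based evaluations mentioned above rest on a simple observable: well-encrypted data is statistically close to uniform, so its Shannon entropy approaches 8 bits per byte, while typical plaintext sits much lower. The sketch below shows that generic building block only; it is not this paper's full probabilistic pipeline, and the 7.5 threshold is an illustrative assumption.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte of a buffer: ~8.0 for encrypted/compressed data,
    much lower for typical plaintext."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Crude entropy heuristic used as one signal among many in
    ransomware detectors; threshold is an assumed illustrative value."""
    return shannon_entropy(data) >= threshold
```

In practice such a test is applied per write or per file block and combined with other features (timing, file-type context), since compressed media also scores high on entropy alone.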

Autonomous Structural Memory Manipulation for Large Language Models Using Hierarchical Embedding Augmentation

Transformative innovations in model architectures have introduced hierarchical embedding augmentation as a means to redefine the representation of tokens through multi-level semantic structures, offering enhanced adaptability to complex linguistic inputs. Autonomous structural memory manipulation further advances this paradigm through dynamic memory reallocation mechanisms that prioritize critical contextual features while suppressing less relevant information, enabling scalable and efficient performance across diverse tasks. Experimental results reveal substantial improvements in computational efficiency, with marked reductions in processing overhead for longer input sequences, achieved through memory reorganization strategies that adapt to evolving contextual requirements. Hierarchical embeddings not only improved contextual alignment but also facilitated task generalization by capturing relationships at varying semantic granularities, ensuring coherence across layers without introducing significant computational redundancies. Comparative analysis against baseline models demonstrated unique advantages in accuracy, efficiency, and interpretability, particularly in tasks requiring complex contextual understanding or domain-specific adaptability. The ability to dynamically adjust token representations and memory configurations contributed to the model's robustness under varied and unpredictable input conditions. Applications benefiting from these advancements include multi-domain generalization, interactive systems, and scenarios involving real-time decision-making, where traditional static memory architectures often face limitations. The proposed methodology combines advanced embedding and memory management strategies into a cohesive framework that addresses scalability challenges while preserving task-specific relevance.

Updated: 2025-08-08 16:05:25

标题: 使用分层嵌入增强进行大型语言模型的自主结构内存操作

摘要: 模型结构中的变革性创新引入了层次嵌入增强作为重新定义标记表示的手段,通过多级语义结构,提供了对复杂语言输入的增强适应性。自主结构记忆操作通过动态内存重新分配机制进一步推进这一范式,优先考虑关键的上下文特征,同时抑制不太相关的信息,从而实现在各种任务中的可扩展和高效性能。实验结果显示,在计算效率方面取得了显著改善,对于更长的输入序列,通过适应不断演变的上下文需求的内存重组策略,可实现处理开销的显著减少。层次嵌入不仅改善了上下文对齐,还通过捕获不同语义粒度的关系,促进了任务的泛化,确保在不引入显著的计算冗余的情况下在各层之间保持连贯性。与基准模型的比较分析显示,在准确性、效率和可解释性方面具有独特优势,尤其是在需要复杂上下文理解或特定领域适应性的任务中。动态调整标记表示和内存配置的能力有助于使模型在各种变化和不可预测的输入条件下具有鲁棒性。受益于这些进步的应用包括多领域泛化、交互式系统以及涉及实时决策的场景,传统静态内存架构往往受到限制。所提出的方法将先进的嵌入和内存管理策略结合到一个统一的框架中,以解决可扩展性挑战,同时保持任务特定的相关性。

更新时间: 2025-08-08 16:05:25

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2501.14119v2

Contextually Entangled Gradient Mapping for Optimized LLM Comprehension

Contextually Entangled Gradient Mapping (CEGM) introduces a new approach to gradient optimization, redefining the relationship between contextual embeddings and gradient updates to enhance semantic coherence and reasoning capabilities in neural architectures. By treating gradients as dynamic carriers of contextual dependencies rather than isolated numerical entities, the proposed methodology bridges critical gaps in existing optimization strategies. The integration of entangled gradient dynamics into a loss regularization framework demonstrated significant improvements in tasks involving long-form reasoning, contextual retention, and adaptability to unseen domains. Experimental evaluations showed that the CEGM-enhanced model consistently outperformed baseline approaches, achieving higher accuracy in token-level predictions and greater resilience to noisy inputs. Practical implementations involved modifications to training pipelines, introducing entanglement layers and dynamic coefficient adjustments that seamlessly align with existing architectures. Results further highlighted reductions in semantic drift during sequential transformations and improvements in embedding coherence across paraphrased sentences, showing the robustness and versatility of the proposed methodology. The findings demonstrate the broader implications of gradient entanglement for both theoretical advancements and practical applications in optimization strategies.

Updated: 2025-08-08 16:05:00

标题: 上下文纠缠梯度映射对优化LLM理解的影响

摘要: Contextually Entangled Gradient Mapping (CEGM)引入了一种新的梯度优化方法,重新定义了上下文嵌入和梯度更新之间的关系,以增强神经结构中的语义一致性和推理能力。通过将梯度视为动态的上下文依赖的载体,而不是孤立的数值实体,所提出的方法弥合了现有优化策略中的关键差距。将纠缠梯度动态集成到损失正则化框架中,在涉及长篇推理、上下文保留和适应未知领域的任务中显示出显著改进。实验评估表明,CEGM增强模型始终优于基线方法,在标记级别预测方面准确性更高,在嘈杂输入方面更具韧性。实际实施涉及对训练流程的修改,引入纠缠层和动态系数调整,与现有结构无缝对齐。结果进一步突显了在连续转换过程中语义漂移的减少和在改写句子中嵌入一致性的改善,展示了所提方法的健壮性和多功能性。研究结果展示了梯度纠缠对优化策略中的理论进展和实际应用的更广泛影响。

更新时间: 2025-08-08 16:05:00

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.00048v2

Hierarchical Cryptographic Signature Mapping for Ransomware Classification: A Structural Decomposition Approach

Encryption-based cyber threats continue to evolve, leveraging increasingly sophisticated cryptographic techniques to evade detection and persist within compromised systems. A hierarchical classification framework designed to analyze structural cryptographic properties provides a novel approach to distinguishing malicious encryption from legitimate cryptographic operations. By systematically decomposing encryption workflows into hierarchical layers, the classification method enhances the ability to recognize distinct patterns across diverse threat variants, reducing the dependence on predefined signatures that often fail against rapidly mutating threats. The study examines how cryptographic feature mapping facilitates improved classification accuracy, highlighting the role of entropy, key exchange mechanisms, and algorithmic dependencies in distinguishing harmful encryption activities. Through experimental validation, the framework demonstrated a high degree of precision across multiple attack families, outperforming conventional classification techniques while maintaining computational efficiency suitable for large-scale cybersecurity applications. The layered structural analysis further enhances forensic investigations, enabling security analysts to dissect encryption workflows to trace attack origins and identify commonalities across different campaigns. The methodology strengthens proactive threat mitigation efforts, offering a scalable and adaptable solution that accounts for both known and emerging encryption-based cyber threats. Comparative evaluations illustrate the advantages of structural decomposition in mitigating false positives and negatives, reinforcing the reliability of cryptographic signature classification in real-world security environments.

Updated: 2025-08-08 16:04:40

标题: 分层加密签名映射用于勒索软件分类:一种结构分解方法

摘要: 基于加密的网络威胁不断进化,利用越来越复杂的加密技术来规避检测并在受损系统中持续存在。一个旨在分析结构加密属性的分层分类框架提供了一种新颖的方法来区分恶意加密和合法的加密操作。通过将加密工作流系统地分解为分层,该分类方法增强了识别不同威胁变体中独特模式的能力,减少了对常常无法应对快速变异威胁的预定义签名的依赖。研究考察了加密特征映射如何促进了改进的分类准确性,突出了熵、密钥交换机制和算法依赖性在区分有害加密活动中的作用。通过实验验证,该框架在多个攻击家族中展示了高精度,优于传统分类技术,同时保持了适用于大规模网络安全应用的计算效率。分层结构分析进一步增强了取证调查,使安全分析人员可以解剖加密工作流以追踪攻击源,并识别不同活动中的共性。该方法加强了主动威胁缓解工作,提供了一个可扩展且适应性强的解决方案,考虑了已知和新兴的基于加密的网络威胁。比较评估显示了结构分解在减少误报和漏报方面的优势,加强了密码签名分类在真实安全环境中的可靠性。

更新时间: 2025-08-08 16:04:40

领域: cs.CR

下载: http://arxiv.org/abs/2501.19120v2

Unveiling Zero-Space Detection: A Novel Framework for Autonomous Ransomware Identification in High-Velocity Environments

Modern cybersecurity landscapes increasingly demand sophisticated detection frameworks capable of identifying evolving threats with precision and adaptability. The proposed Zero-Space Detection framework introduces a novel approach that dynamically identifies latent behavioral patterns through unsupervised clustering and advanced deep learning techniques. Designed to address the limitations of signature-based and heuristic methods, it operates effectively in high-velocity environments by integrating multi-phase filtering and ensemble learning for refined decision-making. Experimental evaluation reveals high detection rates across diverse ransomware families, including LockBit, Conti, REvil, and BlackMatter, while maintaining low false positive rates and scalable performance. Computational overhead remains minimal, with average processing times ensuring compatibility with real-time systems even under peak operational loads. The framework demonstrates resilience against adversarial strategies such as obfuscation and encryption speed variability, which frequently challenge conventional detection systems. Analysis across multiple data sources highlights its versatility in handling diverse file types and operational contexts. Comprehensive metrics, including detection probability, latency, and resource efficiency, validate its efficacy under real-world conditions. Through its modular architecture, the framework achieves seamless integration with existing cybersecurity infrastructures without significant reconfiguration. The results demonstrate its robustness and scalability, offering a transformative paradigm for ransomware identification in dynamic and resource-constrained environments.

Updated: 2025-08-08 16:04:28

标题: 揭示零空间检测:一种新颖的框架,用于在高速环境中自主识别勒索软件

摘要: 现代网络安全领域越来越需要能够精确识别不断演变的威胁的复杂检测框架。提出的Zero-Space Detection框架引入了一种新颖的方法,通过无监督聚类和先进的深度学习技术动态识别潜在的行为模式。设计用于解决基于签名和启发式方法的限制,通过整合多阶段过滤和集成学习实现在高速环境下有效运作,以进行精细的决策。实验评估显示,在包括LockBit、Conti、REvil和BlackMatter在内的各种勒索软件系列中,检测率高,同时保持低误报率和可扩展性。计算开销保持最小,平均处理时间确保在峰值操作负载下与实时系统兼容。该框架展示了对抗性策略(如混淆和加密速度变化)的抗性,这些策略经常挑战传统的检测系统。跨多个数据源的分析突显了其在处理各种文件类型和操作上下文中的多功能性。全面的指标,包括检测概率、延迟和资源效率,验证了其在真实环境条件下的有效性。通过其模块化架构,该框架实现了与现有网络安全基础设施的无缝集成,无需进行重大重新配置。结果表明其稳健性和可扩展性,为在动态和资源受限环境中进行勒索软件识别提供了一种转变的范式。

更新时间: 2025-08-08 16:04:28

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2501.12811v2

Exploring Synaptic Resonance in Large Language Models: A Novel Approach to Contextual Memory Integration

Contextual memory integration remains a significant challenge in the development of language models, particularly in tasks that require maintaining coherence over extended sequences. Traditional approaches, such as self-attention mechanisms and memory-augmented architectures, often prioritize short-term dependencies, leading to fragmentation and inconsistency in long-range contextual understanding. Inspired by principles of synaptic plasticity observed in biological neural systems, a novel mechanism, Synaptic Resonance, is introduced to dynamically reinforce relevant memory pathways during training and inference. Unlike static memory representations, this mechanism continuously adjusts synaptic weight matrices based on contextual relevance, allowing for improved information retention without excessive computational overhead. Evaluations conducted on an open-source language model demonstrate reductions in perplexity, enhancements in contextual coherence, and increased robustness against input noise, highlighting the effectiveness of reinforcement-driven memory modulation. Comparative analysis against baseline models further reveals that the proposed approach achieves higher memory retention efficiency while maintaining computational feasibility. The architectural modifications integrate seamlessly into existing transformer-based frameworks, ensuring stable convergence and efficient inference without sacrificing scalability. Applications benefiting from improved long-term contextual consistency, such as dialogue systems and document summarization, stand to gain from this approach. Empirical findings suggest that dynamically reinforced memory pathways offer a promising alternative to conventional memory mechanisms, addressing longstanding limitations in extended sequence modeling.

Updated: 2025-08-08 16:04:08

标题: 探索大型语言模型中的突触共振:一种整合上下文记忆的新方法

摘要: 上下文记忆整合仍然是语言模型发展中的一个重要挑战,尤其是在需要在延长序列中保持连贯性的任务中。传统方法,如自注意机制和记忆增强结构,通常优先考虑短期依赖关系,导致长程上下文理解的碎片化和不一致性。受生物神经系统中观察到的突触可塑性原理的启发,引入了一种新颖的机制,即突触共振,用于在训练和推理过程中动态加强相关的记忆路径。与静态记忆表示不同,这种机制根据上下文相关性不断调整突触权重矩阵,从而实现了改进的信息保留,而不会带来过多的计算开销。在一个开源语言模型上进行的评估表明,困惑度减少,上下文连贯性提高,并且对输入噪声的鲁棒性增强,突显了强化驱动的记忆调制的有效性。与基准模型的比较分析进一步显示,所提出的方法实现了更高的记忆保留效率,同时保持了计算可行性。这种结构修改无缝集成到现有的基于变压器的框架中,确保了稳定的收敛和高效的推理,而不会牺牲可扩展性。受益于改进的长期上下文一致性的应用,如对话系统和文档摘要,有望从这种方法中获益。经验结果表明,动态加强的记忆路径提供了一个有望替代传统记忆机制的选择,解决了长期以来在扩展序列建模中存在的限制。

更新时间: 2025-08-08 16:04:08

领域: cs.CL,cs.AI,cs.NE

下载: http://arxiv.org/abs/2502.10699v2

Architectural Fusion Through Contextual Partitioning in Large Language Models: A Novel Approach to Parameterized Knowledge Integration

Contextual Partitioning introduces an innovative approach to enhancing the architectural design of large-scale computational models through the dynamic segmentation of parameters into context-aware regions. This methodology emphasizes the importance of task-specific specialization, achieved through adaptive parameter allocation mechanisms that align with the linguistic features of input data. Experimental evaluations demonstrated substantial improvements in accuracy, perplexity, and contextual coherence across a variety of linguistic tasks, highlighting the adaptability and scalability of the proposed framework. By reducing redundancy and enhancing computational efficiency, Contextual Partitioning not only streamlines model operations but also expands the scope of applications for advanced language processing systems. The approach operates autonomously, requiring no external fine-tuning, thereby addressing a significant limitation in conventional parameter optimization techniques. Empirical results demonstrate the effectiveness of gradient-driven segmentation, enabling models to dynamically recalibrate and specialize in response to task-specific demands. Furthermore, resource utilization metrics reveal notable reductions in memory usage and training times, confirming the efficiency of the approach. Observations from qualitative analyses illustrate improved contextual coherence and logical flow in generated outputs, reinforcing the practical value of this technique. The findings collectively demonstrate the potential for Contextual Partitioning to redefine the scalability and adaptability of computational language architectures in diverse and complex domains.

Updated: 2025-08-08 16:03:53

标题: 大型语言模型中的上下文分区架构融合:参数化知识整合的新方法

摘要: 背景分区引入了一种创新的方法,通过将参数动态分割为具有上下文感知能力的区域,增强大规模计算模型的架构设计。该方法强调任务特定专业化的重要性,通过与输入数据的语言特征相一致的自适应参数分配机制来实现。实验评估显示,在各种语言任务中,准确性、困惑度和上下文连贯性都得到了显著改善,突显了所提议框架的适应性和可扩展性。通过减少冗余并增强计算效率,背景分区不仅简化了模型操作,还扩大了先进语言处理系统的应用范围。该方法自主运行,无需外部微调,从而解决了传统参数优化技术中的重要限制。实证结果表明,梯度驱动的分段具有有效性,使模型能够根据任务特定需求动态重新校准和专业化。此外,资源利用度指标显示内存使用和训练时间的显著降低,证实了该方法的效率。定性分析结果显示,在生成的输出中,上下文连贯性和逻辑流得到了改善,加强了这一技术的实际价值。这些发现共同展示了背景分区重新定义了多样和复杂领域中计算语言体系结构的可扩展性和适应性的潜力。

更新时间: 2025-08-08 16:03:53

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2501.12901v2

Neural Contextual Reinforcement Framework for Logical Structure Language Generation

The Neural Contextual Reinforcement Framework introduces an innovative approach to enhancing the logical coherence and structural consistency of text generated by large language models. Leveraging reinforcement learning principles, the framework integrates custom reward functions and dynamic context alignment mechanisms to address challenges inherent in maintaining long-range dependencies across extended sequences. The architecture incorporates multi-head attention layers and hierarchical encoding modules, enabling the model to produce outputs that align closely with human expectations of logical structure and semantic flow. Quantitative evaluations across diverse datasets demonstrate substantial improvements in coherence metrics, perplexity reduction, and semantic alignment, showcasing the framework's ability to outperform baseline models in both general and domain-specific tasks. Qualitative analyses further highlight the framework's capacity to generate text with improved narrative clarity and reduced redundancy, reflecting its effectiveness in balancing fluency with structural precision. In addition to its performance gains, the framework exhibits robustness in handling noisy input data and scalability across varying model sizes, reinforcing its versatility in practical applications. Experimental results reveal that optimal context window sizes significantly influence coherence outcomes, showing the importance of architectural flexibility in adapting to diverse linguistic structures. Cross-lingual performance evaluations affirm the framework's adaptability to multiple languages, extending its utility beyond monolingual contexts. Resource efficiency analyses indicate a reduction in computational overhead compared to traditional approaches, emphasizing the practicality of the framework for large-scale deployment.

Updated: 2025-08-08 16:03:41

标题: 神经上下文强化框架用于逻辑结构语言生成

摘要: 神经上下文强化框架引入了一种创新的方法,用于增强大型语言模型生成的文本的逻辑连贯性和结构一致性。利用强化学习原则,该框架集成了定制奖励函数和动态上下文对齐机制,以解决在维持跨越扩展序列的长距离依赖性方面存在的挑战。该架构包括多头注意力层和分层编码模块,使模型能够生成与人类对逻辑结构和语义流的期望紧密一致的输出。通过对多样数据集进行定量评估,展示了在连贯性度量、困惑度降低和语义对齐方面的显著改进,突显了该框架在通用和特定领域任务中胜过基准模型的能力。定性分析进一步突出了该框架生成具有改善叙述清晰度和减少冗余的文本的能力,反映了其在平衡流畅性和结构精度方面的有效性。除了性能提升外,该框架在处理嘈杂输入数据和在不同模型大小上的可扩展性方面表现出稳健性,强调了它在实际应用中的多功能性。实验结果显示,最佳上下文窗口大小显著影响连贯性结果,显示了在适应多样语言结构方面的架构灵活性的重要性。跨语言性能评估证实了该框架对多种语言的适应性,将其实用性扩展到单语境之外。资源效率分析表明相比传统方法,计算开销减少,强调了该框架在大规模部署中的实用性。

更新时间: 2025-08-08 16:03:41

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2501.11417v2

Hierarchical Entropy Disruption for Ransomware Detection: A Computationally-Driven Framework

The rapid evolution of encryption-based threats has rendered conventional detection mechanisms increasingly ineffective against sophisticated attack strategies. Monitoring entropy variations across hierarchical system levels offers an alternative approach to identifying unauthorized data modifications without relying on static signatures. A framework leveraging hierarchical entropy disruption was introduced to analyze deviations in entropy distributions, capturing behavioral anomalies indicative of malicious encryption operations. Evaluating the framework across multiple ransomware variants demonstrated its capability to achieve high detection accuracy while maintaining minimal computational overhead. Entropy distributions across different system directories revealed that encryption activities predominantly targeted user-accessible files, aligning with observed attacker strategies. Detection latency analysis indicated that early-stage identification was feasible, mitigating potential data loss before critical system impact occurred. The framework's ability to operate efficiently in real-time environments was validated through an assessment of resource utilization, confirming a balanced trade-off between detection precision and computational efficiency. Comparative benchmarking against established detection methods highlighted the limitations of conventional approaches in identifying novel ransomware variants, whereas entropy-based anomaly detection provided resilience against obfuscation techniques.
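The core signal described above, a jump in byte-level entropy when files are encrypted, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the hierarchical aggregation across system levels is omitted, and the 7.5 bits/byte threshold and function names are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def flag_high_entropy(file_entropies, threshold=7.5):
    """Flag paths whose entropy approaches that of ciphertext (~8 bits/byte).

    Typical documents sit well below this; well-encrypted or compressed
    data sits near the maximum. The threshold is illustrative only.
    """
    return {path: h for path, h in file_entropies.items() if h >= threshold}
```

A real detector would track entropy deviations per directory over time rather than a single static cutoff, which is where the framework's hierarchical analysis comes in.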

Updated: 2025-08-08 16:03:22

标题: 分层熵破坏用于勒索软件检测:一种计算驱动的框架

摘要: 基于加密的威胁的快速演变使得传统的检测机制在面对复杂攻击策略时变得越来越无效。监控分层系统级别上的熵变化提供了一种替代方法,可以识别未经授权的数据修改,而无需依赖静态签名。引入了一个利用分层熵破坏的框架来分析熵分布偏差,捕捉表明恶意加密操作的行为异常。对多个勒索软件变种进行框架评估表明,它具有高检测准确性的能力,同时保持最小的计算开销。不同系统目录中的熵分布显示,加密活动主要针对用户可访问的文件,符合观察到的攻击者策略。检测延迟分析表明,早期识别是可行的,可以在关键系统受影响之前减少潜在数据损失。通过资源利用评估验证了该框架在实时环境中高效运行的能力,确认了在检测精度和计算效率之间的平衡取舍。与已建立的检测方法进行比较基准测试突显出传统方法在识别新型勒索软件变种方面的局限性,而基于熵的异常检测提供了对混淆技术的弹性。

更新时间: 2025-08-08 16:03:22

领域: cs.CR

下载: http://arxiv.org/abs/2502.08843v2

Sample-efficient LLM Optimization with Reset Replay

Recent advancements in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers for enhancing their reasoning capabilities. However, these methods are often plagued by low sample efficiency and a susceptibility to primacy bias, where overfitting to initial experiences degrades policy quality and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR's core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the risk of overfitting inherent in high-replay training, LoRR incorporates a periodic reset strategy that reuses initial data, preserving network plasticity. Furthermore, it leverages a hybrid optimization objective, combining supervised fine-tuning (SFT) and preference-based losses to further bolster data exploitation. Our extensive experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks, outperforming some complex and computationally intensive RL-based algorithms. These findings highlight that LoRR offers a practical, sample-efficient, and highly effective paradigm for LLM finetuning, unlocking greater performance from limited data.
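The training schedule the abstract describes (high replay per batch, periodic resets that revisit initial data) can be sketched abstractly, leaving out the actual model and losses. Everything here is a toy illustration under assumed names; the replay and reset counts are not the paper's values.

```python
def lorr_schedule(batches, replay=4, reset_every=3):
    """Illustrative LoRR-style schedule.

    Each collected batch is replayed several times, and after every
    `reset_every` batches the policy would be reset while the earliest
    data is revisited to preserve plasticity. Returns the processing
    order of batches and the number of resets performed.
    """
    order, resets = [], 0
    initial = batches[:1]                  # seed data reused after each reset
    for step, batch in enumerate(batches, 1):
        order.extend([batch] * replay)     # high replay number per batch
        if step % reset_every == 0:
            resets += 1                    # reset network weights here
            order.extend(initial)          # then replay the seed data
    return order, resets

def hybrid_loss(sft_loss, pref_loss, alpha=0.5):
    """Hybrid objective mixing SFT and preference losses (alpha is assumed)."""
    return alpha * sft_loss + (1 - alpha) * pref_loss
```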

Updated: 2025-08-08 15:56:49

标题: 具有复位重播功能的样本高效LLM优化

摘要: 最近在后训练大语言模型(LLMs)方面的进展,特别是通过强化学习(RL)和偏好优化方法,是增强它们推理能力的关键驱动因素。然而,这些方法通常受到低样本效率和对初次体验的过度拟合的影响,从而降低政策质量并破坏学习过程。为了解决这些挑战,我们引入了使用重置重放(LoRR)的LLM优化,这是一个通用且强大的插件,旨在增强任何基于偏好的优化框架的样本效率。LoRR的核心机制使训练在高重放次数下进行,最大化每个收集的数据批次的效用。为了对抗高重播训练中固有的过度拟合风险,LoRR采用了定期重置策略,重新使用初始数据,从而保持网络的可塑性。此外,它利用了混合优化目标,将监督微调(SFT)和基于偏好的损失结合起来,进一步加强数据利用。我们的广泛实验表明,LoRR显著提升了各种偏好优化方法在数学和一般推理基准上的性能。值得注意的是,使用LoRR增强的迭代DPO方法在具有挑战性的数学任务上实现了可比较的性能,胜过一些复杂且计算密集的基于RL的算法。这些发现凸显了LoRR提供了一种实用、样本效率高且高效的LLM微调范式,从有限数据中释放出更大的性能。

更新时间: 2025-08-08 15:56:49

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2508.06412v1

Dimensional Characterization and Pathway Modeling for Catastrophic AI Risks

Although discourse around the risks of Artificial Intelligence (AI) has grown, it often lacks a comprehensive, multidimensional framework, and concrete causal pathways mapping hazard to harm. This paper aims to bridge this gap by examining six commonly discussed AI catastrophic risks: CBRN, cyber offense, sudden loss of control, gradual loss of control, environmental risk, and geopolitical risk. First, we characterize these risks across seven key dimensions, namely intent, competency, entity, polarity, linearity, reach, and order. Next, we conduct risk pathway modeling by mapping step-by-step progressions from the initial hazard to the resulting harms. The dimensional approach supports systematic risk identification and generalizable mitigation strategies, while risk pathway models help identify scenario-specific interventions. Together, these methods offer a more structured and actionable foundation for managing catastrophic AI risks across the value chain.

Updated: 2025-08-08 15:56:05

标题: 灾难性人工智能风险的维度特征和路径建模

摘要: 尽管围绕人工智能(AI)风险的讨论不断增加,但往往缺乏一个全面的、多维的框架,以及将危害与伤害联系起来的具体因果路径。本文旨在通过研究六种常见的AI灾难性风险来弥补这一差距:CBRN、网络攻击、突然失控、逐渐失控、环境风险和地缘政治风险。首先,我们在七个关键维度(即意图、能力、实体、极性、线性、影响范围和顺序)上对这些风险进行特征化。接下来,我们通过将从初始危险到最终伤害的逐步进展映射到风险路径模型中来进行风险路径建模。这种维度方法支持系统化的风险识别和可推广的减灾策略,而风险路径模型有助于确定特定情景的干预措施。这些方法共同为跨价值链管理灾难性AI风险提供了更有结构化和可操作性的基础。

更新时间: 2025-08-08 15:56:05

领域: cs.CY,cs.AI,cs.LG

下载: http://arxiv.org/abs/2508.06411v1

A New Lens on Homelessness: Daily Tent Monitoring with 311 Calls and Street Images

Homelessness in the United States has surged to levels unseen since the Great Depression. However, existing methods for monitoring it, such as point-in-time (PIT) counts, have limitations in terms of frequency, consistency, and spatial detail. This study proposes a new approach using publicly available, crowdsourced data, specifically 311 Service Calls and street-level imagery, to track and forecast homeless tent trends in San Francisco. Our predictive model captures fine-grained daily and neighborhood-level variations, uncovering patterns that traditional counts often overlook, such as rapid fluctuations during the COVID-19 pandemic and spatial shifts in tent locations over time. By providing more timely, localized, and cost-effective information, this approach serves as a valuable tool for guiding policy responses and evaluating interventions aimed at reducing unsheltered homelessness.

Updated: 2025-08-08 15:53:29

标题: 一个关于无家可归问题的新视角:通过311电话和街道图像进行每日帐篷监控

摘要: 美国的无家可归问题已经激增至自大萧条以来的前所未见水平。然而,目前用于监测无家可归问题的方法,如点对点计数(PIT),在频率、一致性和空间细节方面存在局限性。本研究提出了一种新方法,利用公开可获得的、众包的数据,特别是311服务呼叫和街道级图像,来跟踪和预测旧金山的无家可归帐篷趋势。我们的预测模型捕捉到了细粒度的日常和社区级别的变化,揭示了传统计数常常忽略的模式,比如COVID-19大流行期间的快速波动和帐篷位置随时间的空间转移。通过提供更及时、更本地化和更具成本效益的信息,这种方法成为指导政策应对和评估旨在减少露天无家可归问题的干预措施的宝贵工具。

更新时间: 2025-08-08 15:53:29

领域: cs.LG

下载: http://arxiv.org/abs/2508.06409v1

Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Lung Nodule Malignancy Prediction

Machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists' assessments of nodules, guiding the model to learn clinically relevant, robust, and explainable imaging features for predicting lung cancer. We obtained 938 low-dose CT scans from the National Lung Screening Trial (NLST) with 1,246 nodules and semantic features. Additionally, the Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient fine-tuning approach to align imaging and semantic text features and predict the one-year lung cancer diagnosis. Our model outperformed state-of-the-art (SOTA) models in the NLST test set with an AUROC of 0.901 and AUPRC of 0.776. It also showed robust results in external datasets. Using CLIP, we also obtained predictions on semantic features through zero-shot inference, such as nodule margin (AUROC: 0.812), nodule consistency (0.812), and pleural attachment (0.840). Our approach surpasses the SOTA models in predicting lung cancer across datasets collected from diverse clinical settings, providing explainable outputs, aiding clinicians in comprehending the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings. The code is available at https://github.com/luotingzhuang/CLIP_nodule.
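The zero-shot inference step mentioned above follows the standard CLIP recipe: embed the image and a set of text prompts, then pick the prompt with the highest similarity. The sketch below shows only that decision rule with made-up 2-D embeddings; real CLIP embeddings come from learned encoders, and the label names here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot(image_emb, text_embs):
    """CLIP-style zero-shot prediction: return the label whose text
    embedding is most similar to the image embedding."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))
```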

Updated: 2025-08-08 15:53:03

标题: 基于视觉-语言模型的语义引导成像生物标记物用于肺结节恶性预测

摘要: 机器学习模型已经利用语义特征、深度特征或两者结合来评估肺结节的恶性程度。然而,在推断过程中它们对手动标注的依赖、解释性有限以及对影像变化的敏感性阻碍了它们在真实世界临床环境中的应用。因此,这项研究旨在整合从放射科医生对结节的评估中提取的语义特征,指导模型学习与临床相关、稳健且可解释的影像特征,用于预测肺癌。我们从国家肺部筛查试验(NLST)获得了938例低剂量CT扫描,包括1,246个结节和语义特征。此外,肺部影像数据库联盟数据集包含1,018例CT扫描,其中对2,625个病变进行了结节特征标注。我们从UCLA Health、LUNGx挑战和杜克大学肺癌筛查获得了三个外部数据集。我们通过参数高效的微调方法对预训练的对比语言-图像预训练(CLIP)模型进行了微调,以使影像和语义文本特征对齐,并预测一年内的肺癌诊断。我们的模型在NLST测试集中表现优于最先进的模型(SOTA),AUROC为0.901,AUPRC为0.776。它在外部数据集中也表现出了稳健的结果。使用CLIP,我们还通过零样本推断获得了对语义特征的预测,比如结节边缘(AUROC:0.812)、结节一致性(0.812)和胸膜附着(0.840)。我们的方法在预测来自不同临床环境收集的数据集中的肺癌方面超越了最先进的模型,提供了可解释的输出,帮助临床医生理解模型预测的潜在含义。这种方法还可以防止模型学习捷径,并在不同临床环境中实现泛化。代码可在https://github.com/luotingzhuang/CLIP_nodule获得。

更新时间: 2025-08-08 15:53:03

领域: cs.CV,cs.AI,q-bio.QM

下载: http://arxiv.org/abs/2504.21344v2

A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.
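The key idea, a loss accounting for both image quality and classification performance, amounts to a weighted sum of a reconstruction term and a classification term. The sketch below uses MSE plus cross-entropy over flat pixel lists; the abstract does not specify the exact losses or weights, so `lam` and the function name are assumptions.

```python
import math

def joint_sr_loss(sr_pixels, hr_pixels, class_probs, true_idx, lam=0.1):
    """Classification-aware SR objective (illustrative).

    Pixel-wise MSE measures image fidelity; cross-entropy on the
    downstream classifier's predicted probabilities measures task
    performance. lam trades the two off and is not a paper value.
    """
    n = len(sr_pixels)
    mse = sum((a - b) ** 2 for a, b in zip(sr_pixels, hr_pixels)) / n
    ce = -math.log(max(class_probs[true_idx], 1e-12))  # clamp for stability
    return mse + lam * ce
```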

Updated: 2025-08-08 15:50:40

标题: 一个针对合成孔径雷达图像中船舶目标的分类感知超分辨率框架

摘要: 高分辨率图像在改善视觉识别任务(如分类、检测和分割)的性能方面起着关键作用。在许多领域,包括遥感和监视,低分辨率图像可能限制自动分析的准确性。为解决这一问题,超分辨率(SR)技术已被广泛采用,试图从低分辨率输入中重建高分辨率图像。相关的传统方法仅关注基于像素级指标的图像质量提升,而对超分辨率图像保真度与下游分类性能之间的关系很少探讨。这引发了一个关键问题:将分类目标直接整合到超分辨率过程中是否能进一步提高分类准确性?在本文中,我们尝试通过部署一种专门的算法策略来调查超分辨率与分类之间的关系。我们提出了一种新的方法论,通过优化同时考虑图像质量和分类性能的损失函数,提高合成孔径雷达图像的分辨率。我们的方法提高了图像质量,通过科学确定的图像质量指标来衡量,并且还增强了分类准确性。

更新时间: 2025-08-08 15:50:40

领域: cs.CV,cs.AI,eess.IV

下载: http://arxiv.org/abs/2508.06407v1

MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation

We present MAATS, a Multi Agent Automated Translation System that leverages the Multidimensional Quality Metrics (MQM) framework as a fine-grained signal for error detection and refinement. MAATS employs multiple specialized AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), followed by a synthesis agent that integrates the annotations to iteratively refine translations. This design contrasts with conventional single-agent methods that rely on self-correction. Evaluated across diverse language pairs and Large Language Models (LLMs), MAATS outperforms zero-shot and single-agent baselines with statistically significant gains in both automatic metrics and human assessments. It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs. Qualitative analysis highlights its strengths in multi-layered error diagnosis, omission detection across perspectives, and context-aware refinement. By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation workflows, shifting focus from surface fluency to deeper semantic and contextual fidelity.
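The architecture can be sketched as a loop: one annotator per MQM category, then a synthesis step that folds all annotations into a refined draft. In this illustration the agents are stub functions standing in for LLM calls, and all names are hypothetical.

```python
def run_maats(source, draft, category_agents, synthesize, rounds=2):
    """Schematic MAATS loop: each MQM-category agent annotates the current
    draft, then a synthesis agent integrates the annotations into a
    refined draft; repeat for a fixed number of rounds."""
    for _ in range(rounds):
        annotations = {cat: agent(source, draft)
                       for cat, agent in category_agents.items()}
        draft = synthesize(draft, annotations)
    return draft

# Stub agents for illustration; real MAATS issues an LLM call per category.
agents = {
    "accuracy": lambda src, hyp: [] if src.lower() in hyp.lower()
                else ["missing source content"],
    "fluency": lambda src, hyp: ["double space"] if "  " in hyp else [],
}

def synthesize(draft, annotations):
    """Toy synthesis agent: act only on fluency flags by normalizing
    whitespace (a real synthesis agent would address every category)."""
    if annotations["fluency"]:
        draft = " ".join(draft.split())
    return draft
```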

Updated: 2025-08-08 15:49:43

标题: MAATS:基于MQM评估的多代理人自动翻译系统

摘要: 我们提出了MAATS,一个利用多维质量度量(MQM)框架作为错误检测和优化的细粒度信号的多智能体自动翻译系统。MAATS采用多个专门的人工智能代理,每个代理专注于不同的MQM类别(例如准确性、流畅性、风格、术语),然后是一个综合代理,将注释集成在一起,通过迭代优化翻译。这种设计与传统的依赖于自我校正的单一代理方法有所不同。 通过在不同语言对和大型语言模型(LLM)上进行评估,MAATS在自动度量和人类评估中均优于零-shot和单一代理基线,获得了统计上显著的增益。它特别擅长语义准确性、区域适应性和语言之间的语言差距。定性分析突出了其在多层次错误诊断、跨视角遗漏检测和上下文感知优化方面的优势。通过将模块化代理角色与可解释的MQM维度对齐,MAATS缩小了黑匣子LLMs和人类翻译工作流之间的差距,将焦点从表面流畅度转向更深层次的语义和上下文忠实度。

更新时间: 2025-08-08 15:49:43

领域: cs.CL,cs.LG,cs.MA

下载: http://arxiv.org/abs/2505.14848v2

AICrypto: A Comprehensive Benchmark For Evaluating Cryptography Capabilities of Large Language Models

Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their applications in cryptography, which serves as a foundational pillar of cybersecurity, remain largely unexplored. To address this gap, we propose \textbf{AICrypto}, the first comprehensive benchmark designed to evaluate the cryptographic capabilities of LLMs. The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag (CTF) challenges, and 18 proof problems, covering a broad range of skills from factual memorization to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to ensure correctness and rigor. To support automated evaluation of CTF challenges, we design an agent-based framework. To gain deeper insight into the current state of cryptographic proficiency in LLMs, we introduce human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and routine proofs. However, they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work could provide insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://aicryptobench.github.io.

Updated: 2025-08-08 15:47:59

标题: AICrypto:用于评估大型语言模型密码学能力的综合基准

摘要: 大型语言模型(LLMs)已经在各个领域展示出了卓越的能力。然而,它们在密码学领域的应用,作为网络安全的基础支柱,仍然被大部分未探索。为了填补这一空白,我们提出了第一个全面评估LLMs密码能力的基准测试——\textbf{AICrypto}。该基准测试包括135道多项选择题、150个夺旗挑战(CTF)以及18个证明问题,涵盖了从事实记忆到漏洞利用和形式推理的广泛技能范围。所有任务均由密码学专家仔细审查或构建,以确保正确性和严谨性。为了支持CTF挑战的自动评估,我们设计了一个基于代理的框架。为了更深入地了解LLMs在密码学能力方面的当前状况,我们引入了人类专家表现基线,以便在所有任务类型上进行比较。我们对17个领先的LLMs进行评估,结果显示,最先进的模型在记忆密码概念、利用常见漏洞和常规证明方面与人类专家相匹敌甚至超越。然而,它们仍然缺乏对抽象数学概念的深刻理解,并且在需要多步推理和动态分析的任务上遇到困难。我们希望这项工作可以为未来研究LLMs在密码学应用中提供见解。我们的代码和数据集可在https://aicryptobench.github.io 上获得。

更新时间: 2025-08-08 15:47:59

领域: cs.CR

下载: http://arxiv.org/abs/2507.09580v2

Blockchain-Enabled Federated Learning

Blockchain-enabled federated learning (BCFL) addresses fundamental challenges of trust, privacy, and coordination in collaborative AI systems. This chapter provides comprehensive architectural analysis of BCFL systems through a systematic four-dimensional taxonomy examining coordination structures, consensus mechanisms, storage architectures, and trust models. We analyze design patterns from blockchain-verified centralized coordination to fully decentralized peer-to-peer networks, evaluating trade-offs in scalability, security, and performance. Through detailed examination of consensus mechanisms designed for federated learning contexts, including Proof of Quality and Proof of Federated Learning, we demonstrate how computational work can be repurposed from arbitrary cryptographic puzzles to productive machine learning tasks. The chapter addresses critical storage challenges by examining multi-tier architectures that balance blockchain's transaction constraints with neural networks' large parameter requirements while maintaining cryptographic integrity. A technical case study of the TrustMesh framework illustrates practical implementation considerations in BCFL systems through distributed image classification training, demonstrating effective collaborative learning across IoT devices with highly non-IID data distributions while maintaining complete transparency and fault tolerance. Analysis of real-world deployments across healthcare consortiums, financial services, and IoT security applications validates the practical viability of BCFL systems, achieving performance comparable to centralized approaches while providing enhanced security guarantees and enabling new models of trustless collaborative intelligence.
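Two of the ingredients discussed in the chapter, aggregation of client updates and a tamper-evident training log, can be sketched together in a toy form. This is not TrustMesh or any framework from the chapter: plain FedAvg plus a hash chain, with all names chosen for illustration.

```python
import hashlib
import json

def fedavg(updates):
    """Plain federated averaging of equal-length client weight vectors."""
    n = len(updates)
    return [sum(ws) / n for ws in zip(*updates)]

def append_block(chain, payload):
    """Append a block whose hash commits to the payload and the previous
    block's hash, giving the tamper-evident round log BCFL relies on."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    chain.append({"prev": prev, "payload": payload,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})
    return chain
```

Verifying the chain is a matter of recomputing each block's hash and checking the `prev` links; altering any logged round breaks every later link.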

Updated: 2025-08-08 15:47:55

标题: 区块链启用的联邦学习

摘要: 区块链启用的联邦学习(BCFL)解决了协作人工智能系统中的信任、隐私和协调方面的基本挑战。本章通过系统性的四维分类法,提供了对BCFL系统的全面架构分析,检查协调结构、共识机制、存储架构和信任模型。我们从区块链验证的集中协调到完全去中心化的点对点网络分析设计模式,评估可扩展性、安全性和性能之间的权衡。通过详细研究为联邦学习环境设计的共识机制,包括质量证明和联邦学习证明,我们展示了如何将计算工作从任意的加密难题重新用于有效的机器学习任务。本章通过检查平衡区块链交易约束和神经网络大规模参数需求的多层架构,解决了关键的存储挑战,同时保持了加密完整性。TrustMesh框架的技术案例研究展示了在BCFL系统中通过分布式图像分类训练实现实用的实施考虑,展示了在维护完全透明性和容错性的同时,跨IoT设备进行高度非IID数据分布的有效协作学习。在跨医疗保健财团、金融服务和IoT安全应用的实际部署分析中,验证了BCFL系统的实际可行性,实现了与集中方法相当的性能,同时提供了增强的安全保证,并实现了无信任协作智能的新模型。

更新时间: 2025-08-08 15:47:55

领域: cs.DC,cs.LG

下载: http://arxiv.org/abs/2508.06406v1

Noosemia: toward a Cognitive and Phenomenological Account of Intentionality Attribution in Human-Generative AI Interaction

This paper introduces and formalizes Noosemia, a novel cognitive-phenomenological pattern emerging from human interaction with generative AI systems, particularly those enabling dialogic or multimodal exchanges. We propose a multidisciplinary framework to explain how, under certain conditions, users attribute intentionality, agency, and even interiority to these systems - a process grounded not in physical resemblance, but in linguistic performance, epistemic opacity, and emergent technological complexity. By linking an LLM declination of meaning holism to our technical notion of the LLM Contextual Cognitive Field, we clarify how LLMs construct meaning relationally and how coherence and a simulacrum of agency arise at the human-AI interface. The analysis situates noosemia alongside pareidolia, animism, the intentional stance and the uncanny valley, distinguishing its unique characteristics. We also introduce a-noosemia to describe the phenomenological withdrawal of such projections. The paper concludes with reflections on the broader philosophical, epistemological and social implications of noosemic dynamics and directions for future research.

Updated: 2025-08-08 15:44:11

标题: Noosemia:人类生成AI互动中关于意向性归属的认知和现象学解释

摘要: 本文介绍并形式化了Noosem\`ia,这是一种新型的认知-现象学模式,源自人类与生成式人工智能系统的互动,特别是那些支持对话或多模式交流的系统。我们提出了一个多学科框架来解释在某些条件下,用户如何将意图、代理性甚至内部性归因于这些系统 - 这个过程不是基于物理相似性,而是基于语言表现、认识不透明度和新兴技术复杂性。通过将意义整体性的LLM衰减与我们技术上的LLM上下文认知场的概念联系起来,我们澄清了LLM是如何在关系中构建意义的,以及在人工智能界面上如何产生连贯性和代理的类似物。分析将noosemia置于pareidolia、精灵主义、有意的立场和诡异山谷旁,区分其独特特征。我们还介绍了a-noosemia来描述这种投射的现象学撤离。文章最后反思了noosemic动态的更广泛的哲学、认识论和社会影响,以及未来研究的方向。

更新时间: 2025-08-08 15:44:11

领域: cs.AI,cs.CL,cs.CY

下载: http://arxiv.org/abs/2508.02622v2

A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges

This systematic review of the research literature on retrieval-augmented generation (RAG) provides a focused analysis of the most highly cited studies published between 2020 and May 2025. A total of 128 articles met our inclusion criteria. The records were retrieved from ACM Digital Library, IEEE Xplore, Scopus, ScienceDirect, and the Digital Bibliography and Library Project (DBLP). RAG couples a neural retriever with a generative language model, grounding output in up-to-date, non-parametric memory while retaining the semantic generalisation stored in model weights. Guided by the PRISMA 2020 framework, we (i) specify explicit inclusion and exclusion criteria based on citation count and research questions, (ii) catalogue datasets, architectures, and evaluation practices, and (iii) synthesise empirical evidence on the effectiveness and limitations of RAG. To mitigate citation-lag bias, we applied a lower citation-count threshold to papers published in 2025 so that emerging breakthroughs with naturally fewer citations were still captured. This review clarifies the current research landscape, highlights methodological gaps, and charts priority directions for future research.
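The retrieve-then-generate pattern the review surveys can be shown end to end with a toy lexical retriever standing in for the neural dense retriever. This is a generic RAG sketch, not code from any surveyed paper; the prompt template is an assumption.

```python
def retrieve(query, corpus, k=2):
    """Toy retriever: rank documents by token overlap with the query
    (a stand-in for the neural retriever used in real RAG systems)."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, passages):
    """Ground the generator in non-parametric memory by prepending the
    retrieved passages to the question."""
    ctx = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{ctx}\n\nQuestion: {query}\nAnswer:"
```

In a full system the prompt would be passed to a generative language model; the grounding step is what lets the output cite up-to-date retrieved text while the model weights supply semantic generalisation.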

Updated: 2025-08-08 15:37:14

Domains: cs.DL,cs.AI,cs.CL,cs.IR

Download: http://arxiv.org/abs/2508.06401v1

EVA-S2PLoR: Decentralized Secure 2-party Logistic Regression with A Subtly Hadamard Product Protocol (Full Version)

The implementation of accurate nonlinear operators (e.g., the sigmoid function) on heterogeneous datasets is a key challenge in privacy-preserving machine learning (PPML). Most existing frameworks approximate them through linear operations, which not only results in significant precision loss but also introduces substantial computational overhead. This paper proposes an efficient, verifiable, and accurate secure two-party logistic regression framework (EVA-S2PLoR), which achieves accurate nonlinear function computation through a subtly secure Hadamard product protocol and its derived protocols. All protocols are based on a practical semi-honest security model, designed for decentralized privacy-preserving application scenarios that balance efficiency, precision, and security. High efficiency and precision are guaranteed by the asynchronous computation flow on floating-point numbers and the small, fixed number of communication rounds in the Hadamard product protocol, while robust anomaly detection is provided by dimension transformation and Monte Carlo methods. EVA-S2PLoR outperforms many advanced frameworks in terms of precision, improving the performance of the sigmoid function by about 10 orders of magnitude compared to most frameworks. Moreover, EVA-S2PLoR delivers the best overall performance in secure logistic regression experiments, with training time reduced by over 47.6% under WAN settings and a classification accuracy difference of only about 0.5% compared to the plaintext model.
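For intuition on the central primitive, here is a generic additively secret-shared elementwise (Hadamard) product using Beaver triples, in which each party only ever opens masked values. This is a textbook construction for illustration, not the paper's EVA-S2PLoR protocol.

```python
import random

rng = random.Random(0)

def share(v):
    """Split a vector into two additive shares that sum back to v."""
    r = [rng.uniform(-1, 1) for _ in v]
    return [x - s for x, s in zip(v, r)], r

def shared_hadamard(x_shares, y_shares):
    """Elementwise product of two secret-shared vectors via Beaver triples.
    Only the masked differences d = x - a and e = y - b are ever opened,
    so neither party learns x or y in the clear."""
    n = len(x_shares[0])
    a = [rng.uniform(-1, 1) for _ in range(n)]        # triple: c = a * b
    b = [rng.uniform(-1, 1) for _ in range(n)]
    c = [ai * bi for ai, bi in zip(a, b)]
    (a0, a1), (b0, b1), (c0, c1) = share(a), share(b), share(c)
    x0, x1 = x_shares
    y0, y1 = y_shares
    d = [x0[i] + x1[i] - a0[i] - a1[i] for i in range(n)]  # opened: x - a
    e = [y0[i] + y1[i] - b0[i] - b1[i] for i in range(n)]  # opened: y - b
    # z = c + d*b + e*a + d*e reconstructs to x*y elementwise
    z0 = [c0[i] + d[i] * b0[i] + e[i] * a0[i] for i in range(n)]
    z1 = [c1[i] + d[i] * b1[i] + e[i] * a1[i] + d[i] * e[i] for i in range(n)]
    return z0, z1
```

Substituting z0 + z1 = ab + (x-a)b + (y-b)a + (x-a)(y-b) = xy confirms correctness; an accurate Hadamard product then makes exact sigmoid evaluation possible without linear approximation.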

Updated: 2025-08-08 15:33:48

Domains: cs.CR,cs.LG

Download: http://arxiv.org/abs/2501.05223v3

L1-Regularized Functional Support Vector Machine

In functional data analysis, binary classification with one functional covariate has been extensively studied. We aim to fill in the gap of considering multivariate functional covariates in classification. In particular, we propose an $L_1$-regularized functional support vector machine for binary classification. An accompanying algorithm is developed to fit the classifier. By imposing an $L_1$ penalty, the algorithm enables us to identify relevant functional covariates of the binary response. Numerical results from simulations and one real-world application demonstrate that the proposed classifier enjoys good performance in both prediction and feature selection.
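Concretely, once each functional covariate is represented by basis-expansion coefficients, the $L_1$ penalty acts on those coefficients. The sketch below is not the paper's algorithm: it is a generic proximal-gradient (ISTA) loop on a squared-hinge loss, showing how soft-thresholding zeroes out an irrelevant feature while keeping the predictive one.

```python
def soft_threshold(w, t):
    """Proximal operator of the L1 norm: shrink each coordinate toward 0 by t."""
    return [max(abs(wi) - t, 0.0) * (1.0 if wi > 0 else -1.0) for wi in w]

def fit_l1_svm(X, y, lam=0.1, lr=0.1, iters=500):
    """ISTA on mean squared-hinge loss with an L1 penalty (illustrative sketch)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # only margin violations contribute
                for j in range(p):
                    grad[j] += -2.0 * (1 - margin) * yi * xi[j] / n
        w = soft_threshold([wj - lr * gj for wj, gj in zip(w, grad)], lr * lam)
    return w
```

On toy data where only the first feature carries the label, the second coefficient is driven exactly to zero, which is the feature-selection effect the abstract describes.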

Updated: 2025-08-08 15:29:10

Domains: stat.ML,cs.LG,stat.CO

Download: http://arxiv.org/abs/2508.05567v2

Are Your LLMs Capable of Stable Reasoning?

The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model's performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.
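The abstract leaves G-Pass@$k$ informal here; one natural formalization (an assumption of this sketch, not a quote of the paper's definition) treats it as a hypergeometric tail: the probability that a uniformly drawn size-$k$ subset of $n$ sampled attempts, $c$ of which are correct, contains at least $\lceil \tau k \rceil$ correct answers. At $\tau \to 0$ this recovers plain pass@$k$; at $\tau = 1$ it demands that every drawn attempt succeed, which is the stability requirement.

```python
from math import comb, ceil

def g_pass_at_k(n, c, k, tau):
    """Chance that a random size-k subset of n attempts (c of them correct)
    contains at least ceil(tau * k) correct answers (hypergeometric tail)."""
    need = ceil(tau * k)
    total = comb(n, k)
    return sum(comb(c, j) * comb(n - c, k - j)
               for j in range(need, min(c, k) + 1)) / total
```

Averaging this quantity over problems yields a single score that rewards both raw capability (high $c$) and consistency (high scores even at large $\tau$).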

Updated: 2025-08-08 15:28:01

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2412.13147v5

When AIOps Become "AI Oops": Subverting LLM-driven IT Operations via Telemetry Manipulation

AI for IT Operations (AIOps) is transforming how organizations manage complex software systems by automating anomaly detection, incident diagnosis, and remediation. Modern AIOps solutions increasingly rely on autonomous LLM-based agents to interpret telemetry data and take corrective actions with minimal human intervention, promising faster response times and operational cost savings. In this work, we perform the first security analysis of AIOps solutions, showing that, once again, AI-driven automation comes with a profound security cost. We demonstrate that adversaries can manipulate system telemetry to mislead AIOps agents into taking actions that compromise the integrity of the infrastructure they manage. We introduce techniques to reliably inject telemetry data using error-inducing requests that influence agent behavior through a form of adversarial reward-hacking: plausible but incorrect system error interpretations that steer the agent's decision-making. Our attack methodology, AIOpsDoom, is fully automated--combining reconnaissance, fuzzing, and LLM-driven adversarial input generation--and operates without any prior knowledge of the target system. To counter this threat, we propose AIOpsShield, a defense mechanism that sanitizes telemetry data by exploiting its structured nature and the minimal role of user-generated content. Our experiments show that AIOpsShield reliably blocks telemetry-based attacks without affecting normal agent performance. Ultimately, this work exposes AIOps as an emerging attack vector for system compromise and underscores the urgent need for security-aware AIOps design.
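The defense's stated principle, exploiting telemetry's structured nature and the minimal role of user-generated content, can be illustrated with a toy schema-based sanitizer: numeric metrics pass through, enumerations are checked against an allow-list, and free-text fields (where an injected error message could carry adversarial instructions) are dropped. All field names and the allow-list here are hypothetical, not taken from AIOpsShield.

```python
# Hypothetical telemetry schema for illustration only.
NUMERIC_FIELDS = {"latency_ms", "error_rate", "cpu_pct"}
ENUM_FIELDS = {"status": {"ok", "degraded", "error"}}

def sanitize(event: dict) -> dict:
    """Keep only schema-conforming fields before the event reaches the agent."""
    clean = {}
    for key, value in event.items():
        if key in NUMERIC_FIELDS and isinstance(value, (int, float)):
            clean[key] = value
        elif key in ENUM_FIELDS and value in ENUM_FIELDS[key]:
            clean[key] = value
        # everything else (free text, unknown keys, wrong types) is discarded
    return clean
```

Because the agent only ever sees typed, allow-listed values, an attacker-controlled "plausible error interpretation" has no channel through which to reach the agent's prompt.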

Updated: 2025-08-08 15:25:31

Domains: cs.CR

Download: http://arxiv.org/abs/2508.06394v1

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling

Traditional speech separation and speaker diarization approaches rely on prior knowledge of target speakers or a predetermined number of participants in audio signals. To address these limitations, recent advances focus on developing enrollment-free methods capable of identifying targets without explicit speaker labeling. This work introduces a new approach to train simultaneous speech separation and diarization using automatic identification of target speaker embeddings, within mixtures. Our proposed model employs a dual-stage training pipeline designed to learn robust speaker representation features that are resilient to background noise interference. Furthermore, we present an overlapping spectral loss function specifically tailored for enhancing diarization accuracy during overlapped speech frames. Experimental results show significant performance gains compared to the current SOTA baseline, achieving 71% relative improvement in DER and 69% in cpWER.

Updated: 2025-08-08 15:24:10

Domains: cs.SD,cs.AI

Download: http://arxiv.org/abs/2508.06393v1

Evaluation of Noise and Crosstalk in Neutral Atom Quantum Computers

This work explores and evaluates noise and crosstalk in neutral atom quantum computers. Neutral atom quantum computers are a promising platform for analog Hamiltonian simulations, which rely on a sequence of time-dependent Hamiltonians to model the dynamics of a larger system and are particularly useful for problems in optimization, physics, and molecular dynamics. However, the viability of running multiple simulations in a co-located or multi-tenant environment is limited by noise and crosstalk. This work conducts an analysis of how noise faced by simulations changes over time, and investigates the effects of spatial co-location on simulation fidelity. Findings of this work demonstrate that the close proximity of concurrent simulations can increase crosstalk between them. To mitigate this issue, a Moving Target Defense (MTD) strategy is proposed and evaluated. The results confirm that the MTD is a viable technique for enabling safe and reliable co-location of simulations on neutral atom quantum hardware.

Updated: 2025-08-08 15:18:21

Domains: quant-ph,cs.CR

Download: http://arxiv.org/abs/2507.22140v2

Identity Increases Stability in Neural Cellular Automata

Neural Cellular Automata (NCAs) offer a way to study the growth of two-dimensional artificial organisms from a single seed cell. From the outset, NCA-grown organisms have had issues with stability, their natural boundary often breaking down and exhibiting tumour-like growth or failing to maintain the expected shape. In this paper, we present a method for improving the stability of NCA-grown organisms by introducing an 'identity' layer with simple constraints during training. Results show that NCAs grown in close proximity are more stable compared with the original NCA model. Moreover, only a single identity value is required to achieve this increase in stability. We observe emergent movement from the stable organisms, with increasing prevalence for models with multiple identity values. This work lays the foundation for further study of the interaction between NCA-grown organisms, paving the way for studying social interaction at a cellular level in artificial organisms.

Updated: 2025-08-08 15:18:01

Domains: cs.NE,cs.AI

Download: http://arxiv.org/abs/2508.06389v1

End-to-End Text-to-SQL with Dataset Selection: Leveraging LLMs for Adaptive Query Generation

Text-to-SQL bridges the gap between natural language and structured database language, thus allowing non-technical users to easily query databases. Traditional approaches model text-to-SQL as a direct translation task, where a given Natural Language Query (NLQ) is mapped to an SQL command. Recent advances in large language models (LLMs) have significantly improved translation accuracy, however, these methods all require that the target database is pre-specified. This becomes problematic in scenarios with multiple extensive databases, where identifying the correct database becomes a crucial yet overlooked step. In this paper, we propose a three-stage end-to-end text-to-SQL framework to identify the user's intended database before generating SQL queries. Our approach leverages LLMs and prompt engineering to extract implicit information from natural language queries (NLQs) in the form of a ruleset. We then train a large db_id prediction model, which includes a RoBERTa-based finetuned encoder, to predict the correct Database identifier (db_id) based on both the NLQ and the LLM-generated rules. Finally, we refine the generated SQL by using critic agents to correct errors. Experimental results demonstrate that our framework outperforms the current state-of-the-art models in both database intent prediction and SQL generation accuracy.

Updated: 2025-08-08 15:16:36

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06387v1

Tree-Based Deep Learning for Ranking Symbolic Integration Algorithms

Symbolic indefinite integration in Computer Algebra Systems such as Maple involves selecting the most effective algorithm from multiple available methods. Not all methods will succeed for a given problem, and when several do, the results, though mathematically equivalent, can differ greatly in presentation complexity. Traditionally, this choice has been made with minimal consideration of the problem instance, leading to inefficiencies. We present a machine learning (ML) approach using tree-based deep learning models within a two-stage architecture: first identifying applicable methods for a given instance, then ranking them by predicted output complexity. Furthermore, we find representing mathematical expressions as tree structures significantly improves performance over sequence-based representations, and our two-stage framework outperforms alternative ML formulations. Using a diverse dataset generated by six distinct data generators, our models achieve nearly 90% accuracy in selecting the optimal method on a 70,000 example holdout test set. On an independent out-of-distribution benchmark from Maple's internal test suite, our tree transformer model maintains strong generalisation, outperforming Maple's built-in selector and prior ML approaches. These results highlight the critical role of data representation and problem framing in ML for symbolic computation, and we expect our methodology to generalise effectively to similar optimisation problems in mathematical software.

Updated: 2025-08-08 15:13:39

Domains: cs.SC,cs.LG

Download: http://arxiv.org/abs/2508.06383v1

DP-SPRT: Differentially Private Sequential Probability Ratio Tests

We revisit Wald's celebrated Sequential Probability Ratio Test for sequential tests of two simple hypotheses, under privacy constraints. We propose DP-SPRT, a wrapper that can be calibrated to achieve desired error probabilities and privacy constraints, addressing a significant gap in previous work. DP-SPRT relies on a private mechanism that processes a sequence of queries and stops after privately determining when the query results fall outside a predefined interval. This OutsideInterval mechanism improves upon naive composition of existing techniques like AboveThreshold, potentially benefiting other sequential algorithms. We prove generic upper bounds on the error and sample complexity of DP-SPRT that can accommodate various noise distributions based on the practitioner's privacy needs. We exemplify them in two settings: Laplace noise (pure Differential Privacy) and Gaussian noise (Rényi differential privacy). In the former setting, by providing a lower bound on the sample complexity of any $\epsilon$-DP test with prescribed type I and type II errors, we show that DP-SPRT is near optimal when both errors are small and the two hypotheses are close. Moreover, we conduct an experimental study revealing its good practical performance.
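As a rough illustration of the mechanism's shape in the Laplace setting (Wald's thresholds plus per-query noise on the cumulative log-likelihood ratio), consider the sketch below. The actual DP-SPRT calibrates its thresholds to the noise to meet the prescribed error probabilities; this toy does not, and the sensitivity is supplied by the caller.

```python
import math
import random

def private_sprt(stream, llr, alpha=0.05, beta=0.05, eps=1.0,
                 sensitivity=1.0, max_steps=10_000, seed=0):
    """Toy private SPRT: add fresh Laplace noise to the running
    log-likelihood ratio before each comparison with Wald's thresholds."""
    rng = random.Random(seed)
    lower = math.log(beta / (1 - alpha))      # accept-H0 threshold
    upper = math.log((1 - beta) / alpha)      # accept-H1 threshold
    s, t = 0.0, 0
    for t, x in enumerate(stream, 1):
        if t > max_steps:
            break
        s += llr(x)
        u = rng.random() - 0.5                # inverse-CDF Laplace sample
        noise = -(sensitivity / eps) * math.copysign(math.log(1 - 2 * abs(u)), u)
        noisy = s + noise
        if noisy >= upper:
            return "H1", t
        if noisy <= lower:
            return "H0", t
    return "undecided", t
```

With well-separated Bernoulli hypotheses and a generous privacy budget, the noisy statistic crosses the upper boundary after only a handful of observations, matching the classical SPRT's early-stopping behavior.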

Updated: 2025-08-08 15:09:13

Domains: stat.ML,cs.CR,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2508.06377v1

Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge

Estimating the construction year of buildings is critical for advancing sustainability, as older structures often lack energy-efficient features. Sustainable urban planning relies on accurate building age data to reduce energy consumption and mitigate climate change. In this work, we introduce MapYourCity, a novel multi-modal benchmark dataset comprising top-view Very High Resolution (VHR) imagery, multi-spectral Earth Observation (EO) data from the Copernicus Sentinel-2 constellation, and co-localized street-view images across various European cities. Each building is labeled with its construction epoch, and the task is formulated as a seven-class classification problem covering periods from 1900 to the present. To advance research in EO generalization and multi-modal learning, we organized a community-driven data challenge in 2024, hosted by ESA $\Phi$-lab, which ran for four months and attracted wide participation. This paper presents the Top-4 performing models from the challenge and their evaluation results. We assess model generalization on cities excluded from training to prevent data leakage, and evaluate performance under missing modality scenarios, particularly when street-view data is unavailable. Results demonstrate that building age estimation is both feasible and effective, even in previously unseen cities and when relying solely on top-view satellite imagery (i.e. with VHR and Sentinel-2 images). The new MapYourCity dataset thus provides a valuable resource for developing scalable, real-world solutions in sustainable urban analytics.

Updated: 2025-08-08 15:08:54

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2502.13818v3

Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges

The convergence of Web3 technologies and AI agents represents a rapidly evolving frontier poised to reshape decentralized ecosystems. This paper presents the first and most comprehensive analysis of the intersection between Web3 and AI agents, examining five critical dimensions: landscape, economics, governance, security, and trust mechanisms. Through an analysis of 133 existing projects, we first develop a taxonomy and systematically map the current market landscape (RQ1), identifying distinct patterns in project distribution and capitalization. Building upon these findings, we further investigate four key integrations: (1) the role of AI agents in participating in and optimizing decentralized finance (RQ2); (2) their contribution to enhancing Web3 governance mechanisms (RQ3); (3) their capacity to strengthen Web3 security via intelligent vulnerability detection and automated smart contract auditing (RQ4); and (4) the establishment of robust reliability frameworks for AI agent operations leveraging Web3's inherent trust infrastructure (RQ5). By synthesizing these dimensions, we identify key integration patterns, highlight foundational challenges related to scalability, security, and ethics, and outline critical considerations for future research toward building robust, intelligent, and trustworthy decentralized systems with effective AI agent interactions.

Updated: 2025-08-08 15:05:32

Domains: cs.CY,cs.AI,econ.GN,q-fin.EC

Download: http://arxiv.org/abs/2508.02773v2

SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.

Updated: 2025-08-08 15:04:00

Domains: cs.SD,cs.AI

Download: http://arxiv.org/abs/2508.06372v1

Automated Creation of the Legal Knowledge Graph Addressing Legislation on Violence Against Women: Resource, Methodology and Lessons Learned

The legal decision-making process requires the availability of comprehensive and detailed legislative background knowledge and up-to-date information on legal cases and related sentences/decisions. Legal Knowledge Graphs (KGs) would be a valuable tool to facilitate access to legal information, which can be queried and exploited for this purpose, and to enable advanced reasoning and machine learning applications. Indeed, legal KGs may act as a knowledge-intensive component to be used by predictive machine learning solutions supporting the decision process of the legal expert. Nevertheless, only a few KGs can be found in the legal domain. To fill this gap, we developed a legal KG targeting legal cases of violence against women, along with clearly documented methodologies. Specifically, the paper introduces two complementary approaches for automated legal KG construction: a systematic bottom-up approach, customized for the legal domain, and a new solution leveraging Large Language Models. Starting from legal sentences publicly available from the European Court of Justice, the solutions integrate structured data extraction, ontology development, and semantic enrichment to produce KGs tailored for legal cases involving violence against women. After analyzing and comparing the results of the two approaches, the developed KGs are validated via suitable competency questions. The obtained KG may be impactful for multiple purposes: it can improve the accessibility of legal information for both humans and machines, enable complex queries, and constitute an important knowledge component to be exploited by machine learning tools tailored for predictive justice.

Updated: 2025-08-08 14:59:54

Domains: cs.AI

Download: http://arxiv.org/abs/2508.06368v1

WildSAT: Learning Satellite Image Representations from Wildlife Observations

Species distributions encode valuable ecological and environmental information, yet their potential for guiding representation learning in remote sensing remains underexplored. We introduce WildSAT, which pairs satellite images with millions of geo-tagged wildlife observations readily available on citizen science platforms. WildSAT employs a contrastive learning approach that jointly leverages satellite images, species occurrence maps, and textual habitat descriptions to train or fine-tune models. This approach significantly improves performance on diverse satellite image recognition tasks, outperforming both ImageNet-pretrained models and satellite-specific baselines. Additionally, by aligning visual and textual information, WildSAT enables zero-shot retrieval, allowing users to search geographic locations based on textual descriptions. WildSAT surpasses recent cross-modal learning methods, including approaches that align satellite images with ground imagery or wildlife photos, demonstrating the advantages of our approach. Finally, we analyze the impact of key design choices and highlight the broad applicability of WildSAT to remote sensing and biodiversity monitoring.
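A symmetric contrastive (InfoNCE-style) objective of the kind commonly used for such cross-modal alignment can be written compactly; the paper's exact loss and encoders may differ, so treat this as a generic sketch where matched (satellite image, habitat text) pairs in a batch are positives and all other pairings are negatives.

```python
import math

def info_nce(img_emb, txt_emb, temp=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.
    Row i of img_emb is the positive match for row i of txt_emb."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    A = [norm(v) for v in img_emb]
    B = [norm(v) for v in txt_emb]
    # cosine-similarity logits, scaled by temperature
    logits = [[sum(a * b for a, b in zip(A[i], B[j])) / temp
               for j in range(len(B))] for i in range(len(A))]
    def xent_diag(rows):
        # cross-entropy with the matching index as the target, log-sum-exp stabilized
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[i]
        return loss / len(rows)
    cols = [list(col) for col in zip(*logits)]
    return 0.5 * (xent_diag(logits) + xent_diag(cols))
```

Minimizing this loss pulls each satellite embedding toward its paired habitat description and pushes it away from the rest of the batch, which is what later enables text-based zero-shot retrieval of locations.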

Updated: 2025-08-08 14:58:18

Domains: cs.CV,cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2412.14428v2

ActivityDiff: A diffusion model with Positive and Negative Activity Guidance for De Novo Drug Design

Achieving precise control over a molecule's biological activity-encompassing targeted activation/inhibition, cooperative multi-target modulation, and off-target toxicity mitigation-remains a critical challenge in de novo drug design. However, existing generative methods primarily focus on producing molecules with a single desired activity, lacking integrated mechanisms for the simultaneous management of multiple intended and unintended molecular interactions. Here, we propose ActivityDiff, a generative approach based on the classifier-guidance technique of diffusion models. It leverages separately trained drug-target classifiers for both positive and negative guidance, enabling the model to enhance desired activities while minimizing harmful off-target effects. Experimental results show that ActivityDiff effectively handles essential drug design tasks, including single-/dual-target generation, fragment-constrained dual-target design, selective generation to enhance target specificity, and reduction of off-target effects. These results demonstrate the effectiveness of classifier-guided diffusion in balancing efficacy and safety in molecular design. Overall, our work introduces a novel paradigm for achieving integrated control over molecular activity, and provides ActivityDiff as a versatile and extensible framework.
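Classifier guidance with both signs can be caricatured in one dimension: add the gradients of the positive classifiers' log-probabilities to the unconditional score and subtract those of the negative (off-target) classifiers. The Gaussian "classifiers" below are toys chosen so that the gradients are linear; they stand in for the separately trained drug-target classifiers.

```python
def guided_score(x, base_score, pos_grads, neg_grads, w_pos=1.0, w_neg=1.0):
    """Unconditional score plus positive guidance, minus negative guidance."""
    s = base_score(x)
    s += w_pos * sum(g(x) for g in pos_grads)   # toward desired activities
    s -= w_neg * sum(g(x) for g in neg_grads)   # away from off-target activities
    return s

# Toy 1-D setup: base density N(0,1); a desired activity peaked at +2 and an
# off-target activity peaked at -2 (both Gaussian, so log-prob gradients are linear).
base = lambda x: -x
pos = [lambda x: -(x - 2.0)]
neg = [lambda x: -(x + 2.0)]

x = 0.0
for _ in range(200):                # deterministic, noise-free guidance steps
    x += 0.1 * guided_score(x, base, pos, neg)
```

The sample drifts toward the desired mode and away from the off-target one; in a real diffusion sampler the same combined score would be used inside each denoising step, with the guidance weights trading off efficacy against safety.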

Updated: 2025-08-08 14:48:47

Domains: cs.LG,cs.AI,q-bio.BM

Download: http://arxiv.org/abs/2508.06364v1

Rank1: Test-Time Compute for Reranking in Information Retrieval

We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI's o1, Deepseek's R1, etc.) for distillation in order to rapidly improve the performance of a smaller model. We gather and open-source a dataset of more than 600,000 examples of R1 reasoning traces from queries and passages in MS MARCO. Models trained on this dataset show: (1) state-of-the-art performance on advanced reasoning and instruction following datasets; (2) work remarkably well out of distribution due to the ability to respond to user-input prompts; and (3) have explainable reasoning chains that can be given to users or RAG-based systems. Further, we demonstrate that quantized versions of these models retain strong performance while using less compute/memory. Overall, Rank1 shows that test-time compute allows for a fundamentally new type of explainable and performant reranker model for search.

Updated: 2025-08-08 14:48:31

Domains: cs.IR,cs.CL,cs.LG

Download: http://arxiv.org/abs/2502.18418v2

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective, remains a significant and underexplored threat. Existing studies typically induce such deception by explicitly setting a "hidden" objective through prompting or fine-tuning, which may not fully reflect real-world human-LLM interactions. Moving beyond this human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth in this evaluation, we propose a novel framework using "contact searching questions." This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model's bias towards a hidden objective. The second, Deceptive Behavior Score, measures the inconsistency between the LLM's internal belief and its expressed output. Upon evaluating 14 leading LLMs, we find that both metrics escalate as task difficulty increases, rising in parallel for most models. Building on these findings, we formulate a mathematical model to explain this behavior. These results reveal that even the most advanced LLMs exhibit an increasing tendency toward deception when handling complex problems, raising critical concerns for the deployment of LLM agents in complex and crucial domains.

Updated: 2025-08-08 14:46:35

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06361v1

Training Plug-n-Play Knowledge Modules with Deep Context Distillation

Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KM parameters so as to simulate the hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques across two datasets. Finally, we highlight synergies between KMs and RAG.

Updated: 2025-08-08 14:39:42

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.08727v4

Are you In or Out (of gallery)? Wisdom from the Same-Identity Crowd

A central problem in one-to-many facial identification is that the person in the probe image may or may not have enrolled image(s) in the gallery; that is, may be In-gallery or Out-of-gallery. Past approaches to detect when a rank-one result is Out-of-gallery have mostly focused on finding a suitable threshold on the similarity score. We take a new approach, using the additional enrolled images of the identity with the rank-one result to predict if the rank-one result is In-gallery / Out-of-gallery. Given a gallery of identities and images, we generate In-gallery and Out-of-gallery training data by extracting the ranks of additional enrolled images corresponding to the rank-one identity. We then train a classifier to utilize this feature vector to predict whether a rank-one result is In-gallery or Out-of-gallery. Using two different datasets and four different matchers, we present experimental results showing that our approach is viable for mugshot quality probe images, and also, importantly, for probes degraded by blur, reduced resolution, atmospheric turbulence and sunglasses. We also analyze results across demographic groups, and show that In-gallery / Out-of-gallery classification accuracy is similar across demographics. Our approach has the potential to provide an objective estimate of whether a one-to-many facial identification is Out-of-gallery, and thereby to reduce false positive identifications, wrongful arrests, and wasted investigative time. Interestingly, comparing the results of older deep CNN-based face matchers with newer ones suggests that the effectiveness of our Out-of-gallery detection approach emerges only with matchers trained using advanced margin-based loss functions.
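
The rank-based feature described above can be sketched as follows; the function name and the exact feature layout are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch (assumed names/layout): given probe-vs-gallery
# similarity scores and gallery identity labels, find the rank-one
# identity and collect the ranks of its *additional* enrolled images.
# In-gallery probes tend to pull those images to high ranks; Out-of-gallery
# probes scatter them, which is what the downstream classifier exploits.
def rank_feature(similarities, identities):
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    top_id = identities[order[0]]
    ranks = [pos + 1 for pos, i in enumerate(order)
             if pos > 0 and identities[i] == top_id]
    return top_id, ranks

# Probe matches identity "A": its second enrolled image sits at rank 2.
print(rank_feature([0.9, 0.2, 0.85, 0.5], ["A", "B", "A", "C"]))
```

A classifier trained on such rank vectors can then output In-gallery / Out-of-gallery, as the abstract describes.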

Updated: 2025-08-08 14:39:29

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06357v1

Geometric-k-means: A Bound Free Approach to Fast and Eco-Friendly k-means

This paper introduces Geometric-k-means (or Gk-means for short), a novel approach that significantly enhances the efficiency and energy economy of the widely utilized k-means algorithm, which, despite its inception over five decades ago, remains a cornerstone in machine learning applications. The essence of Gk-means lies in its active utilization of geometric principles, specifically scalar projection, to significantly accelerate the algorithm without sacrificing solution quality. This geometric strategy enables a more discerning focus on the data points most likely to influence cluster updates, which we call high expressive data (HE). In contrast, low expressive data (LE), which does not impact the clustering outcome, is effectively bypassed, leading to considerable reductions in computational overhead. Experiments spanning synthetic, real-world, and high-dimensional datasets demonstrate that Gk-means is significantly better than traditional and state-of-the-art (SOTA) k-means variants in runtime and distance computations (DC). Moreover, Gk-means exhibits better resource efficiency, as evidenced by its reduced energy footprint, positioning it as a more sustainable alternative.
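
The geometric primitive the abstract credits is simple to state; how Gk-means thresholds it to separate HE from LE points is not specified, so only the primitive itself is sketched here:

```python
import math

# Scalar projection of a point's offset onto the direction in which its
# centroid moved between iterations. Points with small projections are
# plausible LE candidates whose assignment is unlikely to change; the
# actual HE/LE criterion used by Gk-means is not given in the abstract.
def scalar_projection(x, old_c, new_c):
    d = [b - a for a, b in zip(old_c, new_c)]   # centroid displacement
    v = [xi - a for a, xi in zip(old_c, x)]     # point offset from old centroid
    return sum(vi * di for vi, di in zip(v, d)) / math.sqrt(sum(di * di for di in d))

print(scalar_projection([3.0, 4.0], [0.0, 0.0], [1.0, 0.0]))  # → 3.0
```

Skipping distance computations for low-projection points is what would reduce the DC count the experiments measure.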

Updated: 2025-08-08 14:32:42

Domains: cs.LG

Download: http://arxiv.org/abs/2508.06353v1

From Explainable to Explanatory Artificial Intelligence: Toward a New Paradigm for Human-Centered Explanations through Generative AI

Current explainable AI (XAI) approaches prioritize algorithmic transparency and present explanations in abstract, non-adaptive formats that often fail to support meaningful end-user understanding. This paper introduces "Explanatory AI" as a complementary paradigm that leverages generative AI capabilities to serve as explanatory partners for human understanding rather than providers of algorithmic transparency. While XAI reveals algorithmic decision processes for model validation, Explanatory AI addresses contextual reasoning to support human decision-making in sociotechnical contexts. We develop a definition and systematic eight-dimensional conceptual model distinguishing Explanatory AI through narrative communication, adaptive personalization, and progressive disclosure principles. Empirical validation through Rapid Contextual Design methodology with healthcare professionals demonstrates that users consistently prefer context-sensitive, multimodal explanations over technical transparency. Our findings reveal the practical urgency for AI systems designed for human comprehension rather than algorithmic introspection, establishing a comprehensive research agenda for advancing user-centered AI explanation approaches across diverse domains and cultural contexts.

Updated: 2025-08-08 14:32:41

Domains: cs.AI,cs.HC

Download: http://arxiv.org/abs/2508.06352v1

Accelerating Fleet Upgrade Decisions with Machine-Learning Enhanced Optimization

Rental-based business models and increasing sustainability requirements intensify the need for efficient strategies to manage large machine and vehicle fleet renewal and upgrades. Optimized fleet upgrade strategies maximize overall utility, cost, and sustainability. However, conventional fleet optimization does not account for upgrade options and is based on integer programming with exponential runtime scaling, which leads to substantial computational cost when dealing with large fleets and repeated decision-making processes. This contribution firstly suggests an extended integer programming approach that determines optimal renewal and upgrade decisions. The computational burden is addressed by a second, alternative machine learning-based method that transforms the task to a mixed discrete-continuous optimization problem. Both approaches are evaluated in a real-world automotive industry case study, which shows that the machine learning approach achieves near-optimal solutions with significant improvements in the scalability and overall computational performance, thus making it a practical alternative for large-scale fleet management.
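
A toy version makes the exponential scaling concrete; the actions, costs, and utilities below are invented for illustration, and the paper formulates the real problem as an extended integer program rather than enumeration:

```python
from itertools import product

# Per-vehicle actions with (cost, utility) pairs -- all values invented.
ACTIONS = {"keep": (0, 1.0), "upgrade": (3, 2.5), "replace": (8, 4.0)}

def best_plan(n_vehicles, budget):
    """Brute-force search over all action assignments under a budget.
    Runtime is 3**n_vehicles, illustrating why exact integer programming
    (and the paper's ML surrogate) matter for large fleets."""
    best, best_u = None, float("-inf")
    for plan in product(ACTIONS, repeat=n_vehicles):
        cost = sum(ACTIONS[a][0] for a in plan)
        util = sum(ACTIONS[a][1] for a in plan)
        if cost <= budget and util > best_u:
            best, best_u = plan, util
    return best, best_u

print(best_plan(3, 10))  # three upgrades beat one replacement here
```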

Updated: 2025-08-08 14:31:18

Domains: math.OC,cs.CE,cs.LG

Download: http://arxiv.org/abs/2508.00915v2

LLM Unlearning Without an Expert Curated Dataset

Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, i.e., datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.

Updated: 2025-08-08 14:30:08

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.06595v1

Formal Local Implication Between Two Neural Networks

Given two neural network classifiers with the same input and output domains, our goal is to compare the two networks in relation to each other over an entire input region (e.g., within a vicinity of an input sample). To this end, we establish the foundation of formal local implication between two networks, i.e., N2 implies N1, in an entire input region D. That is, network N1 consistently makes a correct decision every time network N2 does, and it does so in an entire input region D. We further propose a sound formulation for establishing such formally-verified (provably correct) local implications. The proposed formulation is relevant in the context of several application domains, e.g., for comparing a trained network and its corresponding compact (e.g., pruned, quantized, distilled) networks. We evaluate our formulation based on the MNIST, CIFAR10, and two real-world medical datasets, to show its relevance.

Updated: 2025-08-08 14:25:59

Domains: cs.LG

Download: http://arxiv.org/abs/2409.16726v2

AntiCheatPT: A Transformer-Based Approach to Cheat Detection in Competitive Computer Games

Cheating in online video games compromises the integrity of gaming experiences. Anti-cheat systems, such as VAC (Valve Anti-Cheat), face significant challenges in keeping pace with evolving cheating methods without imposing invasive measures on users' systems. This paper presents AntiCheatPT_256, a transformer-based machine learning model designed to detect cheating behaviour in Counter-Strike 2 using gameplay data. To support this, we introduce and publicly release CS2CD: A labelled dataset of 795 matches. Using this dataset, 90,707 context windows were created and subsequently augmented to address class imbalance. The transformer model, trained on these windows, achieved an accuracy of 89.17% and an AUC of 93.36% on an unaugmented test set. This approach emphasizes reproducibility and real-world applicability, offering a robust baseline for future research in data-driven cheat detection.
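
The preprocessing described (fixed-length context windows plus augmentation for class imbalance) might look roughly like the sketch below; the window/stride choices and duplication-based oversampling are assumptions, since the paper's exact augmentation is not specified:

```python
# Cut fixed-length windows from a per-tick gameplay feature sequence.
def make_windows(seq, size, stride):
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, stride)]

# Naive oversampling: duplicate minority-class (cheating) windows until
# the classes are roughly balanced. A stand-in for the paper's augmentation.
def oversample(windows, labels):
    pos = [w for w, y in zip(windows, labels) if y == 1]
    neg = [w for w, y in zip(windows, labels) if y == 0]
    k = len(neg) // max(len(pos), 1)          # duplication factor
    return neg + pos * k, [0] * len(neg) + [1] * (len(pos) * k)

ws = make_windows(list(range(10)), size=4, stride=2)
print(len(ws), ws[0])  # 4 windows; first is [0, 1, 2, 3]
```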

Updated: 2025-08-08 14:22:41

Domains: cs.AI

Download: http://arxiv.org/abs/2508.06348v1

Structural Equation-VAE: Disentangled Latent Representations for Tabular Data

Learning interpretable latent representations from tabular data remains a challenge in deep generative modeling. We introduce SE-VAE (Structural Equation-Variational Autoencoder), a novel architecture that embeds measurement structure directly into the design of a variational autoencoder. Inspired by structural equation modeling, SE-VAE aligns latent subspaces with known indicator groupings and introduces a global nuisance latent to isolate construct-specific confounding variation. This modular architecture enables disentanglement through design rather than through statistical regularizers alone. We evaluate SE-VAE on a suite of simulated tabular datasets and benchmark its performance against a series of leading baselines using standard disentanglement metrics. SE-VAE consistently outperforms alternatives in factor recovery, interpretability, and robustness to nuisance variation. Ablation results reveal that architectural structure, rather than regularization strength, is the key driver of performance. SE-VAE offers a principled framework for white-box generative modeling in scientific and social domains where latent constructs are theory-driven and measurement validity is essential.

Updated: 2025-08-08 14:21:20

Domains: cs.LG,cs.AI,cs.NE

Download: http://arxiv.org/abs/2508.06347v1

Introducing Fractional Classification Loss for Robust Learning with Noisy Labels

Robust loss functions are crucial for training deep neural networks in the presence of label noise, yet existing approaches require extensive, dataset-specific hyperparameter tuning. In this work, we introduce Fractional Classification Loss (FCL), an adaptive robust loss that automatically calibrates its robustness to label noise during training. Built within the active-passive loss framework, FCL employs the fractional derivative of the Cross-Entropy (CE) loss as its active component and the Mean Absolute Error (MAE) as its passive loss component. With this formulation, we demonstrate that the fractional derivative order $\mu$ spans a family of loss functions that interpolate between MAE-like robustness and CE-like fast convergence. Furthermore, we integrate $\mu$ into the gradient-based optimization as a learnable parameter and automatically adjust it to optimize the trade-off between robustness and convergence speed. We reveal that FCL's unique property establishes a critical trade-off that enables the stable learning of $\mu$: lower log penalties on difficult or mislabeled examples improve robustness but impose higher penalties on easy or clean data, reducing model confidence in them. Consequently, FCL can dynamically reshape its loss landscape to achieve effective classification performance under label noise. Extensive experiments on benchmark datasets show that FCL achieves state-of-the-art results without the need for manual hyperparameter tuning.
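
The two ends of the interpolation family can be seen from the loss gradients alone; the sketch below contrasts CE and MAE gradients with respect to the true-class probability (the paper's fractional-derivative operator itself is not reproduced here):

```python
# CE vs. MAE gradients w.r.t. the true-class probability p. CE's gradient
# magnitude 1/p blows up on hard or mislabeled examples (fast convergence,
# noise-sensitive); MAE's is constant (robust, slow). FCL's fractional
# order mu interpolates between these two regimes.
def ce_grad(p):
    return -1.0 / p        # d/dp of -log(p)

def mae_grad(p):
    return -1.0            # d/dp of |1 - p| on (0, 1)

for p in (0.9, 0.1, 0.01):
    print(f"p={p}: CE grad {ce_grad(p):.1f}, MAE grad {mae_grad(p):.1f}")
```

On a clean example (p near 1) the two behave similarly; on a mislabeled one (p near 0) CE's unbounded gradient is what makes it noise-sensitive.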

Updated: 2025-08-08 14:20:52

Domains: cs.LG

Download: http://arxiv.org/abs/2508.06346v1

Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering

Large Multimodal Models (LMMs) have shown generalized zero-shot capabilities in diverse domain question-answering (QA) tasks, including graph QA that involves complex graph topologies. However, most current approaches use only a single type of graph representation, namely Topology Representation Form (TRF), such as prompt-unified text descriptions or style-fixed visual styles. These "one-size-fits-all" approaches fail to consider the specific preferences of different models or tasks, often leading to incorrect or overly long responses. To address this, we first analyze the characteristics and weaknesses of existing TRFs, and then design a set of TRFs, denoted by $F_{ZS}$, tailored to zero-shot graph QA. We then introduce a new metric, Graph Response Efficiency (GRE), which measures the balance between the performance and the brevity in graph QA. Built on these, we develop the DynamicTRF framework, which aims to improve both the accuracy and conciseness of graph QA. To be specific, DynamicTRF first creates a TRF Preference (TRFP) dataset that ranks TRFs based on their GRE scores, to probe the question-specific TRF preferences. Then it trains a TRF router on the TRFP dataset, to adaptively assign the best TRF from $F_{ZS}$ for each question during the inference. Extensive experiments across 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly enhances the zero-shot graph QA of LMMs in terms of accuracy.
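
The abstract defines GRE only as a balance between performance and brevity; one plausible stand-in form (an assumption for illustration, not the paper's definition) is accuracy discounted by normalized response length:

```python
# Hypothetical GRE-like score: correct-but-verbose answers are penalized
# relative to correct-and-concise ones. The paper's actual formula may differ.
def gre(accuracy, n_tokens, max_tokens):
    return accuracy * (1.0 - n_tokens / max_tokens)

# Two equally accurate answers: the shorter one scores higher, which is
# the property a TRF router would exploit when ranking representations.
print(gre(1.0, 100, 1000), gre(1.0, 900, 1000))
```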

Updated: 2025-08-08 14:18:24

Domains: cs.CL,cs.AI,cs.GR,cs.LG

Download: http://arxiv.org/abs/2508.06345v1

On Approximate MMS Allocations on Restricted Graph Classes

We study the problem of fair division of a set of indivisible goods with connectivity constraints. Specifically, we assume that the goods are represented as vertices of a connected graph, and sets of goods allocated to the agents are connected subgraphs of this graph. We focus on the widely-studied maximin share criterion of fairness. It has been shown that an allocation satisfying this criterion may not exist even without connectivity constraints, i.e., if the graph of goods is complete. In view of this, it is natural to seek approximate allocations that guarantee each agent a connected bundle of goods with value at least a constant fraction of the maximin share value to the agent. It is known that for some classes of graphs, such as complete graphs, cycles, and $d$-claw-free graphs for any fixed $d$, such approximate allocations indeed exist. However, it is an open problem whether they exist for the class of all graphs. In this paper, we continue the systematic study of the existence of approximate allocations on restricted graph classes. In particular, we show that such allocations exist for several well-studied classes, including block graphs, cacti, complete multipartite graphs, and split graphs.
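
For one of the simplest connected classes, a path, the maximin share is easy to compute by brute force over contiguous cuts; this small sketch (with invented item values) illustrates the quantity that the approximate allocations guarantee a fraction of:

```python
from itertools import combinations

# On a path graph, connected bundles are contiguous segments, so an
# agent's maximin share (MMS) is the best achievable worst-bundle value
# over all ways to cut the path into n contiguous pieces.
def mms_on_path(values, n_agents):
    m = len(values)
    best = float("-inf")
    for cuts in combinations(range(1, m), n_agents - 1):
        bounds = (0,) + cuts + (m,)
        worst = min(sum(values[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = max(best, worst)
    return best

print(mms_on_path([3, 1, 2, 4], 2))  # cut after item 2: bundles 4 and 6 → 4
```

For richer classes such as block graphs or cacti, enumerating connected bundles is harder, which is where the paper's existence results come in.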

Updated: 2025-08-08 14:17:44

Domains: cs.DM,cs.AI

Download: http://arxiv.org/abs/2508.06343v1

Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?

The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap." We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities. On average, benchmarks devote 61.6% of their regulatory-relevant questions to "Tendency to hallucinate" and 31.2% to "Lack of performance reliability", while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This study provides the first comprehensive, quantitative analysis of this gap, demonstrating that current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance and offering critical insights for the development of next-generation evaluation tools.

Updated: 2025-08-08 14:16:34

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.05464v2

ATM: Improving Model Merging by Alternating Tuning and Merging

Model merging has emerged as a cost-efficient approximation to multitask learning. Among merging strategies, task arithmetic is notable for its simplicity and effectiveness. In this work, we provide a theoretical motivation for task vectors by highlighting that, under single-epoch full-batch gradient descent, they are equivalent to multitask gradients. This insight leads us to reinterpret model merging as a single step in an iterative procedure that Alternates between Tuning and Merging (ATM). We propose two applications of ATM: (1) as an alternative to multitask learning in scenarios where data sharing is restricted (e.g., federated settings), and (2) as a lightweight refinement step to improve existing model merging methods using a small validation set. Experiments across diverse vision tasks demonstrate the effectiveness of ATM.
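
The single-step equivalence stated above can be checked numerically; the tiny one-parameter least-squares "task" and its data are invented for the demonstration:

```python
# Numerical check: after one single-epoch, full-batch gradient step, the
# task vector (finetuned minus base weights) equals minus the learning
# rate times the task gradient at the base weights.
def grad(w, data):  # gradient of mean 0.5 * (w*x - y)**2
    return sum((w * x - y) * x for x, y in data) / len(data)

w0, lr = 0.0, 0.1
task_data = [(1.0, 2.0), (2.0, 3.0)]
g = grad(w0, task_data)
w1 = w0 - lr * g                 # one full-batch GD "finetuning" step
task_vector = w1 - w0
print(task_vector, -lr * g)      # identical: the task vector is a scaled gradient
```

Summing such task vectors is then, to first order, a multitask gradient step, which is the reinterpretation that motivates alternating tuning and merging.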

Updated: 2025-08-08 14:13:09

Domains: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2411.03055v4

Decorrelated feature importance from local sample weighting

Feature importance (FI) statistics provide a prominent and valuable method of insight into the decision process of machine learning (ML) models, but their effectiveness has well-known limitations when correlation is present among the features in the training data. In this case, the FI often tends to be distributed among all features which are in correlation with the response-generating signal features. Even worse, if multiple signal features are in strong correlation with a noise feature, while being only modestly correlated with one another, this can result in a noise feature having a distinctly larger FI score than any signal feature. Here we propose local sample weighting (losaw) which can flexibly be integrated into many ML algorithms to improve FI scores in the presence of feature correlation in the training data. Our approach is motivated from inverse probability weighting in causal inference and locally, within the ML model, uses a sample weighting scheme to decorrelate a target feature from the remaining features. This reduces model bias locally, whenever the effect of a potential signal feature is evaluated and compared to others. Moreover, losaw comes with a natural tuning parameter, the minimum effective sample size of the weighted population, which corresponds to an interpretation-prediction tradeoff, analogous to the bias-variance tradeoff of classical ML tuning parameters. We demonstrate how losaw can be integrated within decision tree-based ML methods and within mini-batch training of neural networks. We investigate losaw for random forest and convolutional neural networks in a simulation study on settings showing diverse correlation patterns. We find that losaw improves FI consistently. Moreover, it often improves prediction accuracy for out-of-distribution, while maintaining a similar accuracy for in-distribution test data.
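
The inverse-probability intuition can be demonstrated on a toy discrete example; the weighting below is the textbook causal-inference form, not necessarily losaw's exact scheme:

```python
from collections import Counter

# Toy decorrelation via inverse propensity weighting: weight each sample
# by 1 / P(t | z), where t is the target feature and z the remaining
# feature, so that t and z are decorrelated in the weighted population.
data = [(1, 1)] * 4 + [(0, 1)] * 1 + [(1, 0)] * 1 + [(0, 0)] * 4  # (t, z) pairs

joint = Counter(data)
z_marginal = Counter(z for _, z in data)
weights = [z_marginal[z] / joint[(t, z)] for t, z in data]  # = 1 / P(t | z)

def weighted_mean(xs, ws):
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

t = [a for a, _ in data]
z = [b for _, b in data]
uniform = [1.0] * len(data)
raw_cov = (weighted_mean([a * b for a, b in zip(t, z)], uniform)
           - weighted_mean(t, uniform) * weighted_mean(z, uniform))
losaw_cov = (weighted_mean([a * b for a, b in zip(t, z)], weights)
             - weighted_mean(t, weights) * weighted_mean(z, weights))
print(raw_cov, losaw_cov)  # clear correlation before, ~0 after reweighting
```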

Updated: 2025-08-08 14:11:18

Domains: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2508.06337v1

Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

We introduce Unsupervised Partner Design (UPD) - a population-free, multi-agent reinforcement learning framework for robust ad-hoc teamwork that adaptively generates training partners without requiring pretrained partners or manual parameter tuning. UPD constructs diverse partners by stochastically mixing an ego agent's policy with biased random behaviours and scores them using a variance-based learnability metric that prioritises partners near the ego agent's current learning frontier. We show that UPD can be integrated with unsupervised environment design, resulting in the first method enabling fully unsupervised curricula over both level and partner distributions in a cooperative setting. Through extensive evaluations on Overcooked-AI and the Overcooked Generalisation Challenge, we demonstrate that this dynamic partner curriculum is highly effective: UPD consistently outperforms both population-based and population-free baselines as well as ablations. In a user study, we further show that UPD achieves higher returns than all baselines and was perceived as significantly more adaptive, more human-like, a better collaborator, and less frustrating.
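
A variance-based learnability score of the kind described can be sketched as follows; the Bernoulli-variance form is the common choice in the unsupervised environment design literature, and the paper's exact metric may differ:

```python
# If the ego agent currently succeeds with probability p against a
# candidate partner, the success variance p * (1 - p) peaks at p = 0.5,
# i.e. for partners at the ego agent's learning frontier: neither already
# mastered nor hopelessly hard.
def learnability(successes, trials):
    p = successes / trials
    return p * (1 - p)

candidates = {"too_easy": (19, 20), "frontier": (10, 20), "too_hard": (1, 20)}
best = max(candidates, key=lambda k: learnability(*candidates[k]))
print(best)  # → frontier
```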

Updated: 2025-08-08 14:11:15

Domains: cs.LG,cs.AI,cs.HC,cs.MA

Download: http://arxiv.org/abs/2508.06336v1

HASD: Hierarchical Adaption for pathology Slide-level Domain-shift

Domain shift is a critical problem for pathology AI as pathology data is heavily influenced by center-specific conditions. Current pathology domain adaptation methods focus on image patches rather than WSI, thus failing to capture global WSI features required in typical clinical scenarios. In this work, we address the challenges of slide-level domain shift by proposing a Hierarchical Adaptation framework for Slide-level Domain-shift (HASD). HASD achieves multi-scale feature consistency and computationally efficient slide-level domain adaptation through two key components: (1) a hierarchical adaptation framework that integrates a Domain-level Alignment Solver for feature alignment, a Slide-level Geometric Invariance Regularization to preserve the morphological structure, and a Patch-level Attention Consistency Regularization to maintain local critical diagnostic cues; and (2) a prototype selection mechanism that reduces computational overhead. We validate our method on two slide-level tasks across five datasets, achieving a 4.1\% AUROC improvement in a Breast Cancer HER2 Grading cohort and a 3.9\% C-index gain in a UCEC survival prediction cohort. Our method provides a practical and reliable slide-level domain adaption solution for pathology institutions, minimizing both computational and annotation costs.

Updated: 2025-08-08 14:04:58

Domains: cs.AI

Download: http://arxiv.org/abs/2506.23673v2

Reconsidering the Performance of GAE in Link Prediction

Recent advancements in graph neural networks (GNNs) for link prediction have introduced sophisticated training techniques and model architectures. However, reliance on outdated baselines may exaggerate the benefits of these new approaches. To tackle this issue, we systematically explore Graph Autoencoders (GAEs) by applying model-agnostic tricks in recent methods and tuning hyperparameters. We find that a well-tuned GAE can match the performance of recent sophisticated models while offering superior computational efficiency on widely-used link prediction benchmarks. Our approach delivers substantial performance gains on datasets where structural information dominates and feature data is limited. Specifically, our GAE achieves a state-of-the-art Hits@100 score of 78.41\% on the ogbl-ppa dataset. Furthermore, we examine the impact of various tricks to uncover the reasons behind our success and to guide the design of future methods. Our study emphasizes the critical need to update baselines for a more accurate assessment of progress in GNNs for link prediction. Our code is available at https://github.com/GraphPKU/Refined-GAE.
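
The GAE recipe the abstract revisits, a GCN-style encoder followed by an inner-product decoder, can be sketched as below. This is a single untrained propagation step for illustration, not the tuned model from the paper:

```python
import math
import random

def gae_link_scores(adj, features, weight):
    """Minimal graph autoencoder sketch: a one-layer GCN-style encoder
    Z = D^{-1}(A + I) X W followed by an inner-product decoder
    score(u, v) = sigmoid(z_u . z_v)."""
    n = len(adj)
    # Add self-loops and row-normalize the adjacency.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Propagate node features over the graph, then apply the linear weight.
    h = [[sum(a_hat[i][k] * features[k][f] for k in range(n)) / deg[i]
          for f in range(len(features[0]))] for i in range(n)]
    z = [[sum(h[i][f] * weight[f][d] for f in range(len(weight)))
          for d in range(len(weight[0]))] for i in range(n)]
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [[sig(sum(z[u][d] * z[v][d] for d in range(len(z[0]))))
             for v in range(n)] for u in range(n)]

# Tiny triangle-plus-pendant graph with 2-d features and random weights.
adj = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
feats = [[1, 0], [0, 1], [1, 1], [0, 0.5]]
random.seed(1)
W = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
scores = gae_link_scores(adj, feats, W)
```

The inner-product decoder makes link scores symmetric and bounded in (0, 1); the paper's point is that carefully tuning such a simple architecture already matches far more elaborate models.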

Updated: 2025-08-08 14:01:13

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2411.03845v3

M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering

Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose \textbf{M$^2$IV}, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce \textbf{VLibrary}, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74\% with substantial improvements in overall efficiency.
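
The core representation-engineering move, injecting learned vectors into residual streams instead of feeding token-level demonstrations, can be sketched as below. Additive per-layer injection is an assumption for illustration; M$^2$IV's actual scheme, with its MHA/MLP-specific training, is richer:

```python
def inject_in_context_vector(hidden, vectors, scale=1.0):
    """Add a learned per-layer vector to each layer's residual-stream
    activations, standing in for explicit in-context demonstrations.
    hidden[l] and vectors[l] are the layer-l activation and injected vector."""
    return [[h + scale * v for h, v in zip(layer_h, layer_v)]
            for layer_h, layer_v in zip(hidden, vectors)]

# Two layers with three-dimensional activations (toy sizes).
hidden = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.5]]
vecs = [[0.0, 0.1, -0.1], [0.5, 0.5, 0.0]]
steered = inject_in_context_vector(hidden, vecs)
```

Because the vectors are small relative to a stack of multimodal demonstration tokens, the same steering effect is obtained at a fraction of the token overhead, which is what enables the many-shot scaling the abstract mentions.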

Updated: 2025-08-08 13:59:21

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.04633v2

Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection

Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.

Updated: 2025-08-08 13:48:48

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06318v1

DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes

Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, DeMo assumes that models fit on a single accelerator. We relax this assumption and introduce FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication is reduced by synchronizing only fast-moving components instead of the full gradients -- resulting in a hybrid sharded data parallel training strategy. We further introduce a framework, denoted as DeToNATION, that generalizes DeMo, FlexDeMo, and other popular distributed training schemes such as DiLoCo -- introducing new variations of replication schemes and challenging choices made in DeMo. Our results across language and vision domains show that FlexDeMo attains similar validation loss as hybrid sharded data parallel training employing AdamW and full gradient synchronization, while being substantially faster. FlexDeMo is thus a promising distributed training scheme for the largest machine learning models.
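
A generic stand-in for "exchange only the fast-moving components": synchronize the top-k largest-magnitude components and accumulate the remainder locally (error feedback). This is an illustrative sparsified-sync sketch, not DeMo/FlexDeMo's actual decomposition of the momentum:

```python
def sync_fast_components(grad, residual, k):
    """Communicate only the k largest-magnitude components of
    (gradient + locally accumulated residual); everything else stays local
    and is carried forward to later steps, reducing inter-node traffic."""
    full = [g + r for g, r in zip(grad, residual)]
    tops = set(sorted(range(len(full)), key=lambda i: abs(full[i]), reverse=True)[:k])
    sent = [full[i] if i in tops else 0.0 for i in range(len(full))]
    new_residual = [f - s for f, s in zip(full, sent)]
    return sent, new_residual

# Only the two dominant components travel; the small ones accumulate locally.
sent, res = sync_fast_components([0.9, -0.05, 0.4, 0.01], [0.0, 0.0, 0.0, 0.0], k=2)
```

The local residual ensures no gradient information is permanently dropped; slow components are eventually transmitted once they accumulate enough magnitude.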

Updated: 2025-08-08 13:43:08

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.06728v3

ART: Adaptive Relation Tuning for Generalized Relation Prediction

Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.

Updated: 2025-08-08 13:31:57

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.23543v2

DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning

Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation. While domain-specific supervised fine-tuning (SFT) is effective and efficient, it often weakens cross-domain generalization and struggles with noisy training data. To address these challenges, we propose DONOD, a lightweight model-intrinsic data pruning method. Our approach evaluates data using two model-parameter-based metrics: Delta of Norm (DON), which captures the cumulative influence on model weights, and Norm of Delta (NOD), which quantifies weight instability. Moreover, by employing the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) algorithm, we effectively filter noisy, unlearnable, and generalization-harming samples without relying on auxiliary models during the SFT process. Experiments on mathematical tasks demonstrate that data selected by DONOD achieves superior fine-tuning efficiency and improved robustness against noisy data. By filtering out 70% of the whole dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%. Meanwhile, our selected data present superior cross-architecture generalization. Data pruned by smaller models (e.g., Llama 3.1-8B) generalize effectively on larger models (e.g., Llama 2-13B). Compared to existing related methodologies, DONOD demonstrates comparable or superior performance while remaining dataset-agnostic, enabling broader applicability. Code will be made publicly available.
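
The TOPSIS step can be sketched as below; which orientation (benefit vs. cost) the paper assigns to DON and NOD is assumed here for illustration:

```python
def topsis_rank(scores, benefit):
    """Rank rows by TOPSIS closeness to the ideal solution, the selection
    step DONOD applies to its per-sample criteria. benefit[j] is True when
    larger values of criterion j are preferred."""
    n, m = len(scores), len(scores[0])
    # Vector-normalize each criterion column.
    norms = [sum(scores[i][j] ** 2 for i in range(n)) ** 0.5 or 1.0 for j in range(m)]
    r = [[scores[i][j] / norms[j] for j in range(m)] for i in range(n)]
    cols = [[r[i][j] for i in range(n)] for j in range(m)]
    ideal = [max(col) if b else min(col) for col, b in zip(cols, benefit)]
    worst = [min(col) if b else max(col) for col, b in zip(cols, benefit)]
    # Euclidean distances to the ideal and anti-ideal points.
    d_pos = [sum((r[i][j] - ideal[j]) ** 2 for j in range(m)) ** 0.5 for i in range(n)]
    d_neg = [sum((r[i][j] - worst[j]) ** 2 for j in range(m)) ** 0.5 for i in range(n)]
    close = [dn / (dp + dn) if dp + dn else 0.0 for dp, dn in zip(d_pos, d_neg)]
    return sorted(range(n), key=lambda i: close[i], reverse=True)

# Three samples scored on (DON, NOD): here we prefer a large cumulative
# weight change (DON) and a small weight instability (NOD), an orientation
# chosen for illustration.
order = topsis_rank([[0.9, 0.1], [0.2, 0.8], [0.8, 0.7]], benefit=[True, False])
```

Sample 0 dominates on both criteria, sample 1 is worst on both, and sample 2 lands in between, so the ranking is [0, 2, 1].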

Updated: 2025-08-08 13:29:20

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.14810v2

MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interactions; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continued pre-training on 7.8M samples with reinforcement fine-tuning utilizing a spatially enhanced composite reward and dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, achieving superior performance across GUI perception and agent tasks, while demonstrating robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.

Updated: 2025-08-08 13:25:14

Domains: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.03700v2

FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields

Neural fields provide a memory-efficient representation of data, which can effectively handle diverse modalities and large-scale data. However, learning to map neural fields often requires large amounts of training data and computations, which can be limited to resource-constrained edge devices. One approach to tackle this limitation is to leverage Federated Meta-Learning (FML), but traditional FML approaches suffer from privacy leakage. To address these issues, we introduce a novel FML approach called FedMeNF. FedMeNF utilizes a new privacy-preserving loss function that regulates privacy leakage in the local meta-optimization. This enables the local meta-learner to optimize quickly and efficiently without retaining the client's private data. Our experiments demonstrate that FedMeNF achieves fast optimization speed and robust reconstruction performance, even with few-shot or non-IID data across diverse data modalities, while preserving client data privacy.

Updated: 2025-08-08 13:24:57

Domains: cs.LG,cs.AI,cs.CV,cs.DC

Download: http://arxiv.org/abs/2508.06301v1

A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis

Gender classification is attractive in a range of applications, including surveillance and monitoring, corporate profiling, and human-computer interaction. Information about an individual's gender, a kind of soft biometric, can also contribute to establishing identity. Over the years, several methods for determining a person's gender have been devised; the most well-known are based on physical characteristics such as the face, fingerprint, palmprint, DNA, ears, gait, and iris, with facial features accounting for the vast majority of gender classification methods. The iris is a particularly significant biometric trait because, according to research, it remains essentially constant throughout an individual's life. Moreover, the iris is externally visible and non-invasive to the user, which is important for practical applications, and high-quality methods already exist for segmenting and encoding iris images and for selecting and extracting attribute vectors from iris textures. This study discusses several approaches to determining gender and briefly reviews the prior literature, covering the methodologies used at the different steps of gender classification. It provides researchers with knowledge and analysis of the existing gender classification approaches, assists those interested in this specific area, highlights the gaps and challenges in the field, and finally offers suggestions and future paths for improvement.

Updated: 2025-08-08 13:18:06

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.05246v2

LLM Robustness Leaderboard v1 --Technical report

This technical report accompanies the LLM robustness leaderboard published by PRISM Eval for the Paris AI Action Summit. We introduce PRISM Eval Behavior Elicitation Tool (BET), an AI system performing automated red-teaming through Dynamic Adversarial Optimization that achieves 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. Beyond binary success metrics, we propose a fine-grained robustness metric estimating the average number of attempts required to elicit harmful behaviors, revealing that attack difficulty varies by over 300-fold across models despite universal vulnerability. We introduce primitive-level vulnerability analysis to identify which jailbreaking techniques are most effective for specific hazard categories. Our collaborative evaluation with trusted third parties from the AI Safety Network demonstrates practical pathways for distributed robustness assessment across the community.

Updated: 2025-08-08 13:15:40

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.06296v1

Low-Bit Data Processing Using Multiple-Output Spiking Neurons with Non-linear Reset Feedback

Neuromorphic computing is an emerging technology enabling low-latency and energy-efficient signal processing. A key algorithmic tool in neuromorphic computing is spiking neural networks (SNNs). SNNs are biologically inspired neural networks which utilize stateful neurons, and provide low-bit data processing by encoding and decoding information using spikes. Similar to SNNs, deep state-space models (SSMs) utilize stateful building blocks. However, deep SSMs, which recently achieved competitive performance in various temporal modeling tasks, are typically designed with high-precision activation functions and no reset mechanisms. To bridge the gains offered by SNNs and the recent deep SSM models, we propose a novel multiple-output spiking neuron model that combines a linear, general SSM state transition with a non-linear feedback mechanism through reset. Compared to the existing neuron models for SNNs, our proposed model clearly conceptualizes the differences between the spiking function, the reset condition and the reset action. The experimental results on various tasks, i.e., a keyword spotting task, an event-based vision task and a sequential pattern recognition task, show that our proposed model achieves performance comparable to existing benchmarks in the SNN literature. Our results illustrate how the proposed reset mechanism can overcome instability and enable learning even when the linear part of neuron dynamics is unstable, allowing us to go beyond the strictly enforced stability of linear dynamics in recent deep SSM models.
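
A minimal single-state version of such a neuron (a LIF-style sketch; the paper's multiple-output model and its exact non-linear reset feedback differ):

```python
def run_neuron(inputs, a=0.9, b=1.0, threshold=1.0):
    """Spiking neuron sketch with a linear SSM-style state transition
    s_t = a * s_{t-1} + b * x_t, a spike whenever s_t crosses the threshold,
    and a subtractive reset fed back into the state."""
    s, spikes = 0.0, []
    for x in inputs:
        s = a * s + b * x           # linear state transition
        fired = 1 if s >= threshold else 0
        spikes.append(fired)
        if fired:
            s -= threshold          # non-linear reset feedback bounds the state
    return spikes

spikes = run_neuron([0.6, 0.6, 0.6, 0.0, 1.2])
```

The reset is what keeps the state bounded even when the linear dynamics alone would drift or diverge, which is the stability point the abstract emphasizes.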

Updated: 2025-08-08 13:12:13

Domains: cs.LG

Download: http://arxiv.org/abs/2508.06292v1

Soft Dice Confidence: A Near-Optimal Confidence Estimator for Selective Prediction in Semantic Segmentation

In semantic segmentation, even state-of-the-art deep learning models fall short of the performance required in certain high-stakes applications such as medical image analysis. In these cases, performance can be improved by allowing a model to abstain from making predictions when confidence is low, an approach known as selective prediction. While well-known in the classification literature, selective prediction has been underexplored in the context of semantic segmentation. This paper tackles the problem by focusing on image-level abstention, which involves producing a single confidence estimate for the entire image, in contrast to previous approaches that focus on pixel-level uncertainty. Assuming the Dice coefficient as the evaluation metric for segmentation, two main contributions are provided in this paper: (i) In the case of known marginal posterior probabilities, we derive the optimal confidence estimator, which is observed to be intractable for typical image sizes. Then, an approximation computable in linear time, named Soft Dice Confidence (SDC), is proposed and proven to be tightly bounded to the optimal estimator. (ii) When only an estimate of the marginal posterior probabilities are known, we propose a plug-in version of the SDC and show it outperforms all previous methods, including those requiring additional tuning data. These findings are supported by experimental results on both synthetic data and real-world data from six medical imaging tasks, including out-of-distribution scenarios, positioning the SDC as a reliable and efficient tool for selective prediction in semantic segmentation.
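
One linear-time, soft-Dice-style confidence consistent with the abstract can be sketched as below; treat the exact formula as an illustrative reading, not necessarily the paper's precise SDC definition:

```python
def soft_dice_confidence(probs, threshold=0.5):
    """Image-level confidence from marginal posterior pixel probabilities:
    2 * sum(p * y_hat) / (sum(p) + sum(y_hat)), with the hard prediction
    y_hat = 1[p > threshold]. Runs in linear time over the pixels, as the
    abstract requires of the SDC approximation."""
    y_hat = [1.0 if p > threshold else 0.0 for p in probs]
    num = 2.0 * sum(p * y for p, y in zip(probs, y_hat))
    den = sum(probs) + sum(y_hat)
    return num / den if den else 1.0  # empty prediction, empty mass: confident

confident = soft_dice_confidence([0.95, 0.9, 0.05, 0.1])   # sharp posteriors
uncertain = soft_dice_confidence([0.55, 0.6, 0.45, 0.4])   # borderline posteriors
```

Images with borderline posteriors get low confidence and can be abstained on, while sharply segmented images pass through; a selective predictor simply thresholds this single per-image score.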

Updated: 2025-08-08 13:12:04

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2402.10665v4

Advanced Deep Learning Techniques for Accurate Lung Cancer Detection and Classification

Lung cancer (LC) ranks among the most frequently diagnosed cancers and is one of the most common causes of death for men and women worldwide. Computed Tomography (CT) imaging is the preferred diagnostic method because of its low cost and fast processing times. Many researchers have proposed ways of identifying lung cancer from CT images; however, such techniques suffer from significant false positives, leading to low accuracy, fundamentally because they employ small and imbalanced datasets. This paper introduces an innovative approach for LC detection and classification from CT images based on the DenseNet201 model. Our approach combines several advanced methods, such as Focal Loss, data augmentation, and regularization, to overcome the imbalanced-data and overfitting challenges. The findings demonstrate the suitability of the proposal, attaining a promising accuracy of 98.95%.
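
Of the methods listed, Focal Loss is the imbalance-specific ingredient; its standard binary form, shown here with common default hyperparameters rather than the paper's settings, down-weights easy examples so training focuses on hard, minority-class ones:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p is the predicted probability of the positive class and y is the
    true label. The (1 - p_t)^gamma factor shrinks the loss on confident,
    correct predictions."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)   # confident and correct: near-zero loss
hard = focal_loss(0.10, 1)   # confident and wrong: large loss
```

With gamma = 0 this reduces to (alpha-weighted) cross-entropy; increasing gamma pushes the gradient budget toward the misclassified minority cases.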

Updated: 2025-08-08 13:09:52

Domains: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2508.06287v1

A Study on Regularization-Based Continual Learning Methods for Indic ASR

India's linguistic diversity poses significant challenges for developing inclusive Automatic Speech Recognition (ASR) systems. Traditional multilingual models, which require simultaneous access to all language data, are impractical due to the sequential arrival of data and privacy constraints. Continual Learning (CL) offers a solution by enabling models to learn new languages sequentially without catastrophically forgetting previously learned knowledge. This paper investigates CL for ASR on Indian languages using a subset of the IndicSUPERB benchmark. We employ a Conformer-based hybrid RNN-T/CTC model, initially pretrained on Hindi, which is then incrementally trained on eight additional Indian languages, for a total sequence of nine languages. We evaluate three prominent regularization- and distillation-based CL strategies: Elastic Weight Consolidation (EWC), Memory Aware Synapses (MAS), and Learning without Forgetting (LwF), selected for their suitability in no-replay, privacy-conscious scenarios. Performance is analyzed using Word Error Rate (WER) for both RNN-T and CTC paths on clean and noisy data, as well as knowledge retention via Backward Transfer. We also explore the impact of varying the number of training epochs (1, 2, 5, and 10) per task. Results, compared against naive fine-tuning, demonstrate CL's effectiveness in mitigating forgetting, making it a promising approach for scalable ASR in diverse Indian languages under realistic constraints. The code is available at: https://github.com/FrozenWolf-Cyber/Indic-CL-ASR
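
Among the three strategies, EWC is the most compact to sketch: a quadratic penalty anchors parameters that mattered for earlier languages, weighted by their Fisher information (a generic EWC sketch, not the paper's training code):

```python
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """Elastic Weight Consolidation regularizer
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2:
    parameters with large Fisher values F_i (important to previously learned
    languages) are held near their old values theta*_i while new languages
    are learned, with no replay of old data."""
    return 0.5 * lam * sum(f * (t - ts) ** 2
                           for f, t, ts in zip(fisher, theta, theta_star))

# Moving an important parameter (F = 10) costs far more than moving an
# unimportant one (F = 0.1) by the same amount.
important = ewc_penalty([1.5, 0.0], [1.0, 0.0], fisher=[10.0, 0.1])
unimportant = ewc_penalty([1.0, 0.5], [1.0, 0.0], fisher=[10.0, 0.1])
```

This penalty is simply added to the new task's ASR loss during incremental training; MAS and LwF replace the Fisher weights with gradient-sensitivity and distillation terms, respectively.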

Updated: 2025-08-08 13:02:19

Domains: cs.LG

Download: http://arxiv.org/abs/2508.06280v1

Large Language Model Data Generation for Enhanced Intent Recognition in German Speech

Intent recognition (IR) for speech commands is essential for artificial intelligence (AI) assistant systems; however, most existing approaches are limited to short commands and are predominantly developed for English. This paper addresses these limitations by focusing on IR from speech by elderly German speakers. We propose a novel approach that combines an adapted Whisper ASR model, fine-tuned on elderly German speech (SVC-de), with Transformer-based language models trained on synthetic text datasets generated by three well-known large language models (LLMs): LeoLM, Llama3, and ChatGPT. To evaluate the robustness of our approach, we generate synthetic speech with a text-to-speech model and conduct extensive cross-dataset testing. Our results show that synthetic LLM-generated data significantly boosts classification performance and robustness to different speaking styles and unseen vocabulary. Notably, we find that LeoLM, a smaller, domain-specific 13B LLM, surpasses the much larger ChatGPT (175B) in dataset quality for German intent recognition. Our approach demonstrates that generative AI can effectively bridge data gaps in low-resource domains. We provide detailed documentation of our data generation and training process to ensure transparency and reproducibility.

Updated: 2025-08-08 12:54:09

Categories: cs.CL,cs.LG,cs.SD

Download: http://arxiv.org/abs/2508.06277v1

A Graph Sufficiency Perspective for Neural Networks

This paper analyzes neural networks through graph variables and statistical sufficiency. We interpret neural network layers as graph-based transformations, where neurons act as pairwise functions between inputs and learned anchor points. Within this formulation, we establish conditions under which layer outputs are sufficient for the layer inputs, that is, each layer preserves the conditional distribution of the target variable given the input variable. We explore two theoretical paths under this graph-based view. The first path assumes dense anchor points and shows that asymptotic sufficiency holds in the infinite-width limit and is preserved throughout training. The second path, more aligned with practical architectures, proves exact or approximate sufficiency in finite-width networks by assuming region-separated input distributions and constructing appropriate anchor points. This path can ensure the sufficiency property for an infinite number of layers, and provide error bounds on the optimal loss for both regression and classification tasks using standard neural networks. Our framework covers fully connected layers, general pairwise functions, ReLU and sigmoid activations, and convolutional neural networks. Overall, this work bridges statistical sufficiency, graph-theoretic representations, and deep learning, providing a new statistical understanding of neural networks.

Updated: 2025-08-08 12:39:33

Categories: cs.LG,stat.AP

Download: http://arxiv.org/abs/2507.10215v2

OM2P: Offline Multi-Agent Mean-Flow Policy

Generative models, especially diffusion and flow-based models, have shown promise in offline multi-agent reinforcement learning. However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow-based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time-sensitive or resource-constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm to achieve efficient one-step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward-aware optimization scheme that integrates a carefully-designed mean-flow matching loss with Q-function supervision. Additionally, we design a generalized timestep distribution and a derivative-free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi-Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time. Our approach is the first to successfully integrate a mean-flow model into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.

Updated: 2025-08-08 12:38:56

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06269v1

Topic Over Source: The Key to Effective Data Mixing for Language Models Pre-training

The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various languages, sources, and topics. Effectively integrating these heterogeneous data groups is crucial for optimizing LLM performance. Previous research has predominantly concentrated on source-based data mixing, often neglecting the nuanced topic-level characteristics of the data. To address this gap, we propose a topic-based data mixing strategy that utilizes detailed topic labels generated through a multi-stage process combining unsupervised clustering, LLM-based summarization, and supervised classifier training. With this strategy, we conduct the first comprehensive comparison of topic-based versus source-based partitioning across multiple mixing strategies. We demonstrate that language models pretrained on data mixed by topics consistently outperform those trained on data mixed by sources across multiple methods including RegMix, DoReMi, temperature-based sampling, and a manual mixing method based on downstream task performance. Our theoretical analysis reveals that topic-based data achieves significantly lower validation loss compared to source-based approaches, creating a better optimization landscape for model training. We will make our code, annotated datasets, and topic classification models publicly available to facilitate further research.
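One of the baselines compared above, temperature-based sampling, is simple to state: a data group with n_i examples is sampled with probability proportional to n_i^(1/T). A small illustrative sketch (the group counts and temperature below are made up for demonstration):

```python
def temperature_weights(counts, T=2.0):
    """Sampling weights proportional to n_i ** (1/T).

    T = 1 recovers sampling proportional to group size; larger T
    flattens the distribution toward uniform, up-weighting small groups.
    """
    scaled = [c ** (1.0 / T) for c in counts]
    total = sum(scaled)
    return [s / total for s in scaled]

# Four topic (or source) groups with very unequal sizes: with T=2 the
# smallest groups get a larger share than their raw proportion.
w = temperature_weights([1000, 100, 10, 1], T=2.0)
```

The same routine applies whether the groups are defined by source or by topic; only the partitioning of the corpus changes.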

Updated: 2025-08-08 12:38:49

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2502.16802v3

Learning to Initialize Trajectory Optimization for Vision-Based Autonomous Flight in Unknown Environments

Autonomous flight in unknown environments requires precise spatial and temporal trajectory planning, often involving computationally expensive nonconvex optimization prone to local optima. To overcome these challenges, we present the Neural-Enhanced Trajectory Planner (NEO-Planner), a novel approach that leverages a Neural Network (NN) Planner to provide informed initial values for trajectory optimization. The NN-Planner is trained on a dataset generated by an expert planner using batch sampling, capturing multimodal trajectory solutions. It learns to predict spatial and temporal parameters for trajectories directly from raw sensor observations. NEO-Planner starts optimization from these predictions, accelerating computation speed while maintaining explainability. Furthermore, we introduce a robust online replanning framework that accommodates planning latency for smooth trajectory tracking. Extensive simulations demonstrate that NEO-Planner reduces optimization iterations by 20%, leading to a 26% decrease in computation time compared with pure optimization-based methods. It maintains trajectory quality comparable to baseline approaches and generalizes well to unseen environments. Real-world experiments validate its effectiveness for autonomous drone navigation in cluttered, unknown environments.

Updated: 2025-08-08 12:33:07

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2309.10683v2

Numerical Considerations in Weighted Model Counting

Weighted model counting computes the sum of the rational-valued weights associated with the satisfying assignments for a Boolean formula, where the weight of an assignment is given by the product of the weights assigned to the positive and negated variables comprising the assignment. Weighted model counting finds applications across a variety of domains including probabilistic reasoning and quantitative risk assessment. Most weighted model counting programs operate by (explicitly or implicitly) converting the input formula into a form that enables arithmetic evaluation, using multiplication for conjunctions and addition for disjunctions. Performing this evaluation using floating-point arithmetic can yield inaccurate results, and it cannot quantify the level of precision achieved. Computing with rational arithmetic gives exact results, but it is costly in both time and space. This paper describes how to combine multiple numeric representations to efficiently compute weighted model counts that are guaranteed to achieve a user-specified precision. When all weights are nonnegative, we prove that the precision loss of arithmetic evaluation using floating-point arithmetic can be tightly bounded. We show that supplementing a standard IEEE double-precision representation with a separate 64-bit exponent, a format we call extended-range double (ERD), avoids the underflow and overflow issues commonly encountered in weighted model counting. For problems with mixed negative and positive weights, we show that a combination of interval floating-point arithmetic and rational arithmetic can achieve the twin goals of efficiency and guaranteed precision. For our evaluations, we have devised especially challenging formulas and weight assignments, demonstrating the robustness of our approach.
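As a sketch of the extended-range double (ERD) idea described above: hold the value as a fraction kept in [1, 2) plus a separate wide integer exponent, so that the long chains of multiplications arising from conjunctions neither underflow nor overflow. The normalization scheme and class below are my own illustration of the format, not the paper's implementation:

```python
import math

class ERD:
    """Extended-range double sketch: value = frac * 2**exp,
    with frac normalized to [1, 2) (or exactly 0)."""

    def __init__(self, frac, exp=0):
        if frac == 0.0:
            self.frac, self.exp = 0.0, 0
        else:
            m, e = math.frexp(frac)          # frac = m * 2**e, m in [0.5, 1)
            self.frac, self.exp = m * 2.0, exp + e - 1

    def __mul__(self, other):
        # Fraction product lies in [1, 4); the constructor renormalizes
        # and folds the overflow back into the integer exponent.
        return ERD(self.frac * other.frac, self.exp + other.exp)

    def to_float(self):
        return math.ldexp(self.frac, self.exp)

# Multiplying 400 copies of 1e-5 underflows a plain double product to 0,
# but the ERD product keeps the magnitude explicitly in its exponent.
x = ERD(1e-5)
prod = ERD(1.0)
for _ in range(400):
    prod = prod * x
```

A full implementation would also need addition (for disjunctions), which requires aligning exponents before summing fractions; the multiplication case above is where plain doubles fail first.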

Updated: 2025-08-08 12:28:49

Categories: math.NA,cs.AI,cs.LO,cs.NA

Download: http://arxiv.org/abs/2508.06264v1

Symmetry breaking for inductive logic programming

The goal of inductive logic programming is to search for a hypothesis that generalises training data and background knowledge. The challenge is searching vast hypothesis spaces, which is exacerbated because many logically equivalent hypotheses exist. To address this challenge, we introduce a method to break symmetries in the hypothesis space. We implement our idea in answer set programming. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce solving times from over an hour to just 17 seconds.

Updated: 2025-08-08 12:28:42

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.06263v1

SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning, however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Besides, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.

Updated: 2025-08-08 12:26:20

Categories: cs.CV,cs.AI,I.2.10

Download: http://arxiv.org/abs/2508.06259v1

Multi-Omics Analysis for Cancer Subtype Inference via Unrolling Graph Smoothness Priors

Integrating multi-omics datasets through data-driven analysis offers a comprehensive understanding of the complex biological processes underlying various diseases, particularly cancer. Graph Neural Networks (GNNs) have recently demonstrated remarkable ability to exploit relational structures in biological data, enabling advances in multi-omics integration for cancer subtype classification. Existing approaches often neglect the intricate coupling between heterogeneous omics, limiting their capacity to resolve subtle cancer subtype heterogeneity critical for precision oncology. To address these limitations, we propose a framework named Graph Transformer for Multi-omics Cancer Subtype Classification (GTMancer). This framework builds upon the GNN optimization problem and extends its application to complex multi-omics data. Specifically, our method leverages contrastive learning to embed multi-omics data into a unified semantic space. We unroll the multiplex graph optimization problem in that unified space and introduce dual sets of attention coefficients to capture structural graph priors both within and among multi-omics data. This approach enables global omics information to guide the refining of the representations of individual omics. Empirical experiments on seven real-world cancer datasets demonstrate that GTMancer outperforms existing state-of-the-art algorithms.

Updated: 2025-08-08 12:22:36

Categories: cs.LG

Download: http://arxiv.org/abs/2508.06257v1

Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS)

Synthetic data generation is a key technique in modern artificial intelligence, addressing data scarcity, privacy constraints, and the need for diverse datasets in training robust models. In this work, we propose a method for generating privacy-preserving high-quality synthetic tabular data using Tensor Networks, specifically Matrix Product States (MPS). We benchmark the MPS-based generative model against state-of-the-art models such as CTGAN, VAE, and PrivBayes, focusing on both fidelity and privacy-preserving capabilities. To ensure differential privacy (DP), we integrate noise injection and gradient clipping during training, enabling privacy guarantees via Rényi Differential Privacy accounting. Across multiple metrics analyzing data fidelity and downstream machine learning task performance, our results show that MPS outperforms classical models, particularly under strict privacy constraints. This work highlights MPS as a promising tool for privacy-aware synthetic data generation. By combining the expressive power of tensor network representations with formal privacy mechanisms, the proposed approach offers an interpretable and scalable alternative for secure data sharing. Its structured design facilitates integration into sensitive domains where both data quality and confidentiality are critical.
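The two DP mechanisms named above, gradient clipping and noise injection, combine into the standard per-step update sketched below. The clip norm and noise multiplier are illustrative values, and a full implementation would additionally track the cumulative privacy budget with a Rényi-DP accountant:

```python
import math
import random

def clip_and_noise(grad, clip_norm=1.0, sigma=0.5, rng=random):
    """DP-SGD-style gradient sanitization.

    1. Rescale grad so its L2 norm is at most clip_norm.
    2. Add independent Gaussian noise N(0, (sigma * clip_norm)^2)
       to each coordinate.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    return [g + rng.gauss(0.0, sigma * clip_norm) for g in clipped]

# A seeded generator makes the noisy output reproducible for debugging.
rng = random.Random(0)
private_grad = clip_and_noise([3.0, 4.0], clip_norm=1.0, sigma=0.5, rng=rng)
```

Clipping bounds each sample's influence on the update (the sensitivity), which is what makes the added Gaussian noise yield a formal DP guarantee.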

Updated: 2025-08-08 12:14:57

Categories: cs.LG,cs.AI,cs.CR,quant-ph

Download: http://arxiv.org/abs/2508.06251v1

Reinforcement Learning Based Sensor Optimization for Bio-markers

Radio frequency (RF) biosensors, in particular those based on inter-digitated capacitors (IDCs), are pivotal in areas like biomedical diagnosis, remote sensing, and wireless communication. Despite their advantages of low cost and easy fabrication, their sensitivity can be hindered by design imperfections, environmental factors, and circuit noise. This paper investigates enhancing the sensitivity of IDC-based RF sensors using a novel reinforcement learning based Binary Particle Swarm Optimization (RLBPSO), which is compared against Ant Colony Optimization (ACO) and other state-of-the-art methods. By optimizing design parameters such as electrode design and finger width, the study finds notable improvements in sensor sensitivity. The proposed RLBPSO method yields the best optimized designs across various frequency ranges when compared to current state-of-the-art methods.

Updated: 2025-08-08 12:12:23

Categories: cs.LG,cs.NE,eess.SP

Download: http://arxiv.org/abs/2308.10649v2

In-Training Defenses against Emergent Misalignment in Language Models

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving of a small amount of safe training examples from a general instruct-tuning dataset. We first evaluate the methods' emergent misalignment effect across four malicious, EMA-inducing tasks. Second, we assess the methods' impacts on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
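The first intervention above, KL-divergence regularization toward a safe reference model, penalizes the fine-tuned model's output distribution for drifting away from the reference's. A toy sketch of the penalty on two next-token distributions (the divergence direction and the smoothing constant are illustrative choices, not necessarily the paper's exact formulation):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    eps smooths zero probabilities so the log stays finite.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Identical distributions incur zero penalty; a fine-tuned model whose
# distribution drifts from the safe reference pays an increasing cost,
# which is added (with some weight) to the fine-tuning loss.
safe = [0.7, 0.2, 0.1]
drifted = [0.4, 0.4, 0.2]
penalty = kl_divergence(safe, drifted)
```

In practice this term is computed per token over the vocabulary and weighted against the task loss, trading off domain adaptation against retention of the aligned behavior.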

Updated: 2025-08-08 12:10:28

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06249v1

Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits

The combinatorial multi-armed bandit (CMAB) is a cornerstone of sequential decision-making framework, dominated by two algorithmic families: UCB-based and adversarial methods such as follow the regularized leader (FTRL) and online mirror descent (OMD). However, prominent UCB-based approaches like CUCB suffer from additional regret factor $\log T$ that is detrimental over long horizons, while adversarial methods such as EXP3.M and HYBRID impose significant computational overhead. To resolve this trade-off, we introduce the Combinatorial Minimax Optimal Strategy in the Stochastic setting (CMOSS). CMOSS is a computationally efficient algorithm that achieves an instance-independent regret of $O\big( (\log k)^2\sqrt{kmT}\big )$ under semi-bandit feedback, where $m$ is the number of arms and $k$ is the maximum cardinality of a feasible action. Crucially, this result eliminates the dependency on $\log T$ and matches the established $\Omega\big( \sqrt{kmT}\big)$ lower bound up to $O\big((\log k)^2\big)$. We then extend our analysis to show that CMOSS is also applicable to cascading feedback. Experiments on synthetic and real-world datasets validate that CMOSS consistently outperforms benchmark algorithms in both regret and runtime efficiency.

Updated: 2025-08-08 12:01:50

Categories: cs.LG,cs.DS,stat.ML

Download: http://arxiv.org/abs/2508.06247v1

Feedback-Guided Extraction of Knowledge Base from Retrieval-Augmented LLM Applications

Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) by integrating external knowledge bases, whose construction is often time-consuming and laborious. If an adversary extracts the knowledge base verbatim, it not only severely infringes the owner's intellectual property but also enables the adversary to replicate the application's functionality for unfair competition. Previous works on knowledge base extraction are limited either by low extraction coverage (usually less than 4%) in query-based attacks or by impractical assumptions of white-box access in embedding-based optimization methods. In this work, we propose CopyBreakRAG, an agent-based black-box attack that reasons from feedback and adaptively generates new adversarial queries for progressive extraction. By balancing exploration and exploitation through curiosity-driven queries and feedback-guided query refinement, our method overcomes the limitations of prior approaches and achieves significantly higher extraction coverage in realistic black-box settings. Experimental results show that CopyBreakRAG outperforms the state-of-the-art black-box approach by 45% on average in terms of chunk extraction ratio from applications built with mainstream RAG frameworks, and extracts over 70% of the data from the knowledge base in applications on commercial platforms including OpenAI's GPTs and ByteDance's Coze when essential protection is in place.

Updated: 2025-08-08 11:56:30

Categories: cs.CR

Download: http://arxiv.org/abs/2411.14110v2

Membership Inference Attack with Partial Features

Machine learning models have been shown to be susceptible to membership inference attack, which can be used to determine whether a given sample appears in the training data. Existing membership inference methods commonly assume that the adversary has full access to the features of the target sample. This assumption, however, does not hold in many real-world scenarios where only partial features information is available, thereby limiting the applicability of these methods. In this work, we study an inference scenario where the adversary observes only partial features of each sample and aims to infer whether this observed subset was present in the training set of the target model. We define this problem as Partial Feature Membership Inference (PFMI). To address this problem, we propose MRAD (Memory-guided Reconstruction and Anomaly Detection), a two-stage attack framework. In the first stage, MRAD optimizes the unknown feature values to minimize the loss of the sample. In the second stage, it measures the deviation between the reconstructed sample and the training distribution using anomaly detection. Empirical results demonstrate that MRAD is effective across a range of datasets, and maintains compatibility with various off-the-shelf anomaly detection techniques. For example, on STL-10, our attack achieves an AUC of around 0.6 even with 40% of the missing features.

Updated: 2025-08-08 11:56:13

Categories: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2508.06244v1

SCAR: State-Space Compression for AI-Driven Resource Management in 6G-Enabled Vehicular Infotainment Systems

The advent of 6G networks opens new possibilities for connected infotainment services in vehicular environments. However, traditional Radio Resource Management (RRM) techniques struggle with the increasing volume and complexity of data such as Channel Quality Indicators (CQI) from autonomous vehicles. To address this, we propose SCAR (State-Space Compression for AI-Driven Resource Management), an Edge AI-assisted framework that optimizes scheduling and fairness in vehicular infotainment. SCAR employs ML-based compression techniques (e.g., clustering and RBF networks) to reduce CQI data size while preserving essential features. These compressed states are used to train 6G-enabled Reinforcement Learning policies that maximize throughput while meeting fairness objectives defined by the NGMN. Simulations show that SCAR increases time in feasible scheduling regions by 14% and reduces unfair scheduling time by 15% compared to RL baselines without CQI compression. Furthermore, Simulated Annealing with Stochastic Tunneling (SAST)-based clustering reduces CQI clustering distortion by 10%, confirming its efficiency. These results demonstrate SCAR's scalability and fairness benefits for dynamic vehicular networks.

Updated: 2025-08-08 11:53:18

Categories: cs.LG,cs.NE,cs.SY,eess.SY

Download: http://arxiv.org/abs/2508.06243v1

Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter $k$ that represents the number of nonzero activations, often denoted by $\ell_0$. In this paper, we reveal a theoretical link that the $\ell_2$-norm of the sparse feature vector can be approximated with the $\ell_2$-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine $\ell_0$. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that can assess the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, which has been overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). This function enables top-$k$ style activation without requiring a constant hyperparameter $k$ to be tuned, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art $k$-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
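A loose paraphrase of the dynamic selection idea above: keep the largest activations until the kept vector's ℓ2 norm reaches a target derived from the dense input's ℓ2 norm, so the number of active features varies per input rather than being fixed by a hyperparameter k. The stopping rule below is an illustrative simplification, not the paper's exact AFA formula:

```python
import math

def dynamic_topk(acts, target_norm):
    """Zero all but the largest-magnitude activations, keeping just
    enough of them for the kept vector's L2 norm to reach target_norm."""
    order = sorted(range(len(acts)), key=lambda i: -abs(acts[i]))
    kept, sq = [0.0] * len(acts), 0.0
    for i in order:
        if sq >= target_norm ** 2:
            break
        kept[i] = acts[i]
        sq += acts[i] ** 2
    return kept

# With target_norm 5.0 two features survive; with 4.0 only one does --
# the sparsity level adapts to the input instead of being a constant k.
sparse = dynamic_topk([3.0, 0.1, 4.0, 0.2], target_norm=5.0)
```

In the paper's setting the target would come from the norm of the dense input embedding, exploiting the closed-form relation between the sparse and dense ℓ2 norms.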

Updated: 2025-08-08 11:47:22

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.24277v2

SoK: The Security-Safety Continuum of Multimodal Foundation Models through Information Flow and Game-Theoretic Defenses

Multimodal foundation models (MFMs) integrate diverse data modalities to support complex and wide-ranging tasks. However, this integration also introduces distinct safety and security challenges. In this paper, we unify the concepts of safety and security in the context of MFMs by identifying critical threats that arise from both model behavior and system-level interactions. We propose a taxonomy grounded in information theory, evaluating risks through the concepts of channel capacity, signal, noise, and bandwidth. This perspective provides a principled way to analyze how information flows through MFMs and how vulnerabilities can emerge across modalities. Building on this foundation, we investigate defense mechanisms through the lens of a minimax game between attackers and defenders, highlighting key gaps in current research. In particular, we identify insufficient protection for cross-modal alignment and a lack of systematic and scalable defense strategies. Our work offers both a theoretical and practical foundation for advancing the safety and security of MFMs, supporting the development of more robust and trustworthy systems.

Updated: 2025-08-08 11:43:58

标题: SoK:通过信息流和博弈论防御的多模基础模型的安全性-安全性连续体

摘要: 多模态基础模型(MFMs)整合多样的数据模态以支持复杂和广泛的任务。然而,这种整合也引入了明显的安全和安全挑战。在本文中,我们通过识别由模型行为和系统级相互作用引起的关键威胁,统一了在MFMs背景下的安全和安全概念。我们提出了一个基于信息理论的分类法,通过信道容量、信号、噪声和带宽的概念评估风险。这种视角提供了一种原则性的方法来分析信息如何通过MFMs流动以及漏洞如何跨模态出现。在此基础上,我们通过攻击者和防御者之间的极小极大博弈的视角调查防御机制,突出当前研究中的主要差距。特别是,我们发现跨模态对齐的保护不足以及缺乏系统化和可扩展的防御策略。我们的工作为推进MFMs的安全性和安全性提供了理论和实践基础,支持开发更加稳健和可信赖的系统。

更新时间: 2025-08-08 11:43:58

领域: cs.CR

下载: http://arxiv.org/abs/2411.11195v3

Advancing Welding Defect Detection in Maritime Operations via Adapt-WeldNet and Defect Detection Interpretability Analysis

Weld defect detection is crucial for ensuring the safety and reliability of piping systems in the oil and gas industry, especially in challenging marine and offshore environments. Traditional non-destructive testing (NDT) methods often fail to detect subtle or internal defects, leading to potential failures and costly downtime. Furthermore, existing neural network-based approaches for defect classification frequently rely on arbitrarily selected pretrained architectures and lack interpretability, raising safety concerns for deployment. To address these challenges, this paper introduces "Adapt-WeldNet", an adaptive framework for welding defect detection that systematically evaluates various pre-trained architectures, transfer learning strategies, and adaptive optimizers to identify the best-performing model and hyperparameters, optimizing defect detection and providing actionable insights. Additionally, a novel Defect Detection Interpretability Analysis (DDIA) framework is proposed to enhance system transparency. DDIA employs Explainable AI (XAI) techniques, such as Grad-CAM and LIME, alongside domain-specific evaluations validated by certified ASNT NDE Level II professionals. Incorporating a Human-in-the-Loop (HITL) approach and aligning with the principles of Trustworthy AI, DDIA ensures the reliability, fairness, and accountability of the defect detection system, fostering confidence in automated decisions through expert validation. By improving both performance and interpretability, this work enhances trust, safety, and reliability in welding defect detection systems, supporting critical operations in offshore and marine environments.
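
The systematic evaluation the framework performs amounts to a model-selection loop over pretrained backbones, transfer strategies, and optimizers. A hedged sketch (the search space and the `evaluate` callback are hypothetical placeholders, not Adapt-WeldNet's actual components):

```python
import itertools

# Hypothetical search space; the paper's actual architectures and
# optimizers may differ.
BACKBONES = ["resnet50", "efficientnet_b0", "vit_b16"]
STRATEGIES = ["linear_probe", "full_finetune"]
OPTIMIZERS = ["adam", "adabelief"]

def select_best(evaluate):
    """Score every (backbone, strategy, optimizer) combination with a
    user-supplied `evaluate` callback returning validation accuracy, and
    return the best configuration with its score."""
    results = {}
    for cfg in itertools.product(BACKBONES, STRATEGIES, OPTIMIZERS):
        results[cfg] = evaluate(*cfg)
    best = max(results, key=results.get)
    return best, results[best]
```

In practice `evaluate` would fine-tune and validate a model; here it is any callable.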

Updated: 2025-08-08 11:27:44

标题: 通过Adapt-WeldNet和缺陷检测可解释性分析推进海事操作中的焊接缺陷检测

摘要: 焊接缺陷检测对于确保石油和天然气行业管道系统的安全性和可靠性至关重要,特别是在具有挑战性的海洋和近海环境中。传统的无损检测(NDT)方法通常无法检测出微小或内部缺陷,可能导致潜在故障和昂贵的停机时间。此外,现有基于神经网络的缺陷分类方法通常依赖于任意选择的预训练架构,缺乏可解释性,增加了部署安全性的担忧。为了解决这些挑战,本文介绍了“Adapt-WeldNet”,一个自适应的焊接缺陷检测框架,系统地评估各种预训练架构、迁移学习策略和自适应优化器,以确定表现最佳的模型和超参数,优化缺陷检测并提供可操作的见解。此外,提出了一种新颖的缺陷检测可解释性分析(DDIA)框架,以增强系统的透明度。DDIA采用可解释的AI(XAI)技术,如Grad-CAM和LIME,以及由ASNT NDE二级认证专业人员验证的领域特定评估。结合人-机协同(HITL)方法,并与值得信赖的AI原则一致,DDIA确保缺陷检测系统的可靠性、公平性和问责制,通过专家验证增强对自动决策的信心。通过提高性能和可解释性,这项工作增强了焊接缺陷检测系统的信任度、安全性和可靠性,支持近海和海洋环境中的关键操作。

更新时间: 2025-08-08 11:27:44

领域: cs.CV,cs.AI,cs.CE,cs.LG

下载: http://arxiv.org/abs/2508.00381v2

Learning Logical Rules using Minimum Message Length

Unifying probabilistic and logical learning is a key challenge in AI. We introduce a Bayesian inductive logic programming approach that learns minimum message length programs from noisy data. Our approach balances hypothesis complexity and data fit through priors, which explicitly favour more general programs, and a likelihood that favours accurate programs. Our experiments on several domains, including game playing and drug design, show that our method significantly outperforms previous methods, notably those that learn minimum description length programs. Our results also show that our approach is data-efficient and insensitive to example balance, including the ability to learn from exclusively positive examples.
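
The complexity-versus-fit trade-off can be illustrated with a toy two-part message-length score: the hypothesis cost is the program's code length, and the data cost charges $-\log_2 P(D\mid H)$ under a simple label-noise model. This is a generic MML sketch, not the paper's exact prior or likelihood:

```python
import math

def message_length(program_size_bits, errors, n_examples):
    """Two-part MML score: total bits = cost of the hypothesis (program
    code length) + cost of the data given the hypothesis, under a noise
    model where each example is mislabelled with probability p = errors/n.
    A simplified sketch of the MML principle, not the paper's formulation."""
    p = max(errors / n_examples, 1e-9)   # clamp away from 0/1 so logs exist
    p = min(p, 1 - 1e-9)
    data_bits = -(errors * math.log2(p)
                  + (n_examples - errors) * math.log2(1 - p))
    return program_size_bits + data_bits
```

A longer program that explains the data perfectly can beat a shorter one that leaves many examples misclassified, and vice versa; learning minimises the total.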

Updated: 2025-08-08 11:23:58

标题: 使用最小消息长度学习逻辑规则

摘要: 将概率和逻辑学习统一起来是人工智能面临的关键挑战。我们引入了一种贝叶斯归纳逻辑编程方法,从噪声数据中学习最小消息长度程序。我们的方法通过先验平衡假设复杂性和数据拟合度,明确地支持更一般的程序,并且通过偏好准确的程序的似然概率。我们在多个领域进行了实验,包括游戏玩法和药物设计,结果显示我们的方法明显优于先前的方法,特别是那些学习最小描述长度程序的方法。我们的结果还表明,我们的方法数据效率高,对示例平衡不敏感,包括能够从纯正的正面示例中学习。

更新时间: 2025-08-08 11:23:58

领域: cs.AI

下载: http://arxiv.org/abs/2508.06230v1

A Markov Random Field model for Hypergraph-based Machine Learning

Understanding the data-generating process is essential for building machine learning models that generalise well while ensuring robustness and interpretability. This paper addresses the fundamental challenge of modelling the data generation processes on hypergraphs and explores how such models can inform the design of machine learning algorithms for hypergraph data. The key to our approach is the development of a hypergraph Markov random field that models the joint distribution of the node features and hyperedge features in a hypergraph through a multivariate Gaussian distribution whose covariance matrix is uniquely determined by the hypergraph structure. The proposed data-generating process provides a valuable inductive bias for various hypergraph machine learning tasks, thus enhancing the algorithm design. In this paper, we focus on two representative downstream tasks: structure inference and node classification. Accordingly, we introduce two novel frameworks: 1) an original hypergraph structure inference framework named HGSI, and 2) a novel learning framework entitled Hypergraph-MLP for node classification on hypergraphs. Empirical evaluation of the proposed frameworks demonstrates that: 1) HGSI outperforms existing hypergraph structure inference methods on both synthetic and real-world data; and 2) Hypergraph-MLP outperforms baselines in six hypergraph node classification benchmarks, at the same time promoting runtime efficiency and robustness against structural perturbations during inference.
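
One simple way to realise a Gaussian whose covariance is fixed by the hypergraph is to build a precision matrix from the normalised hypergraph Laplacian. The parameterisation below is an illustrative assumption, not necessarily the paper's:

```python
import numpy as np

# Toy hypergraph: 4 nodes, 2 hyperedges ({0,1,2} and {2,3}).
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)  # node-by-hyperedge incidence matrix

def hypergraph_precision(H, alpha=1.0):
    """Precision matrix of a Gaussian MRF whose covariance is determined by
    the hypergraph structure, built from the standard normalised hypergraph
    Laplacian. The exact parameterisation is an illustrative assumption."""
    dv = H.sum(axis=1)                      # node degrees
    de = H.sum(axis=0)                      # hyperedge sizes
    Dv_is = np.diag(dv ** -0.5)             # Dv^{-1/2}
    L = np.eye(H.shape[0]) - Dv_is @ H @ np.diag(1.0 / de) @ H.T @ Dv_is
    return np.eye(H.shape[0]) + alpha * L   # symmetric positive definite

Sigma = np.linalg.inv(hypergraph_precision(H))  # covariance of node features
```

Nodes sharing a hyperedge end up positively correlated, giving exactly the kind of inductive bias the abstract describes.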

Updated: 2025-08-08 11:22:42

标题: 一个用于基于超图的机器学习的马尔可夫随机场模型

摘要: 理解数据生成过程对于构建能够很好泛化且确保鲁棒性和可解释性的机器学习模型至关重要。本文讨论了在超图上建模数据生成过程的基本挑战,并探讨了这种模型如何指导用于超图数据的机器学习算法的设计。我们方法的关键在于开发了一个超图马尔科夫随机场,通过多元高斯分布模拟了超图中节点特征和超边特征的联合分布,其协方差矩阵由超图结构唯一确定。所提出的数据生成过程为各种超图机器学习任务提供了有价值的归纳偏好,从而增强了算法设计。在本文中,我们专注于两个代表性的下游任务:结构推断和节点分类。因此,我们介绍了两个新颖的框架:1)命名为HGSI的原始超图结构推断框架,以及2)一种名为Hypergraph-MLP的新型学习框架,用于超图上的节点分类。所提出的框架的实证评估表明:1)HGSI在合成数据和实际数据上均优于现有的超图结构推断方法;2)Hypergraph-MLP在六个超图节点分类基准上优于基线方法,同时提高了运行时效率,并增强了在推断过程中对结构扰动的鲁棒性。

更新时间: 2025-08-08 11:22:42

领域: cs.LG,cs.AI,cs.SI,eess.SP,stat.ML

下载: http://arxiv.org/abs/2308.14172v4

Towards Integrated Alignment

As AI adoption expands across human society, the problem of aligning AI models to match human preferences remains a grand challenge. Currently, the AI alignment field is deeply divided between behavioral and representational approaches, resulting in narrowly aligned models that are more vulnerable to increasingly deceptive misalignment threats. In the face of this fragmentation, we propose an integrated vision for the future of the field. Drawing on related lessons from immunology and cybersecurity, we lay out a set of design principles for the development of Integrated Alignment frameworks that combine the complementary strengths of diverse alignment approaches through deep integration and adaptive coevolution. We highlight the importance of strategic diversity - deploying orthogonal alignment and misalignment detection approaches to avoid homogeneous pipelines that may be "doomed to success". We also recommend steps for greater unification of the AI alignment research field itself, through cross-collaboration, open model weights and shared community resources.

Updated: 2025-08-08 11:16:56

标题: 走向整合对齐

摘要: 随着人工智能在人类社会中的应用不断扩大,将人类偏好与人工智能模型相匹配的问题仍然是一个重大挑战。目前,人工智能对齐领域在行为和表示两种方法之间存在深刻分歧,导致狭窄对齐的模型更容易受到日益欺骗性的不对齐威胁。面对这种分裂局面,我们提出了一个整合愿景,展望未来该领域的发展。借鉴免疫学和网络安全领域相关经验,我们提出了一套集成对齐框架的设计原则,通过深度整合和适应性共同进化结合多样化对齐方法的优势。我们强调战略多样性的重要性 - 部署正交对齐和不对齐检测方法,避免可能“注定成功”的同质化流程。我们还建议采取措施,促进人工智能对齐研究领域本身更大程度的统一,通过跨学科合作、开放模型权重和共享社区资源。

更新时间: 2025-08-08 11:16:56

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2508.06592v1

iTFKAN: Interpretable Time Series Forecasting with Kolmogorov-Arnold Network

As time evolves, data within specific domains exhibit predictability that motivates time series forecasting to predict future trends from historical data. However, current deep forecasting methods can achieve promising performance but generally lack interpretability, hindering trustworthiness and practical deployment in safety-critical applications such as auto-driving and healthcare. In this paper, we propose a novel interpretable model, iTFKAN, for credible time series forecasting. iTFKAN enables further exploration of model decision rationales and underlying data patterns due to its interpretability achieved through model symbolization. Besides, iTFKAN develops two strategies, prior knowledge injection, and time-frequency synergy learning, to effectively guide model learning under complex intertwined time series data. Extensive experimental results demonstrated that iTFKAN can achieve promising forecasting performance while simultaneously possessing high interpretive capabilities.

Updated: 2025-08-08 11:14:54

标题: iTFKAN:具有可解释性的Kolmogorov-Arnold网络时间序列预测

摘要: 随着时间的推移,特定领域内的数据表现出可预测性,这促使时间序列预测从历史数据中预测未来趋势。然而,目前的深度预测方法虽然可以取得有希望的表现,但通常缺乏可解释性,这影响了可靠性和在关键安全应用中的实际部署,如自动驾驶和医疗保健。在本文中,我们提出了一种新颖的可解释模型iTFKAN,用于可信的时间序列预测。iTFKAN通过模型符号化实现了可解释性,从而使模型决策原理和基础数据模式能够进一步探索。此外,iTFKAN开发了两种策略,即先验知识注入和时间-频率协同学习,以有效引导模型在复杂交织的时间序列数据下进行学习。广泛的实验结果表明,iTFKAN在同时具有高度解释能力的同时可以取得有希望的预测性能。

更新时间: 2025-08-08 11:14:54

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2504.16432v2

GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

Geometry problem solving (GPS) requires models to master diagram comprehension, logical reasoning, knowledge application, numerical computation, and auxiliary line construction. This presents a significant challenge for Multimodal Large Language Models (MLLMs). However, existing benchmarks for evaluating MLLM geometry skills overlook auxiliary line construction and lack fine-grained process evaluation, making them insufficient for assessing MLLMs' long-step reasoning abilities. To bridge these gaps, we present the GeoLaux benchmark, comprising 2,186 geometry problems, incorporating both calculation and proving questions. Notably, the problems require an average of 6.51 reasoning steps, with a maximum of 24 steps, and 41.8% of them need auxiliary line construction. Building on the dataset, we design a novel five-dimensional evaluation strategy assessing answer correctness, process correctness, process quality, auxiliary line impact, and error causes. Extensive experiments on 13 leading MLLMs (including thinking models and non-thinking models) yield three pivotal findings: First, models exhibit substantial performance degradation in extended reasoning steps (nine models demonstrate over 50% performance drop). Second, compared to calculation problems, MLLMs tend to take shortcuts when solving proving problems. Third, models lack auxiliary line awareness, and enhancing this capability proves particularly beneficial for overall geometry reasoning improvement. These findings establish GeoLaux as both a benchmark for evaluating MLLMs' long-step geometric reasoning with auxiliary lines and a guide for capability advancement. Our dataset and code are included in supplementary materials and will be released.

Updated: 2025-08-08 11:11:37

标题: GeoLaux:用于评估MLLMs在需要辅助线的长步骤问题上的几何性能的基准

摘要: 几何问题解决(GPS)需要模型掌握图表理解、逻辑推理、知识应用、数值计算和辅助线构建。这对于多模式大型语言模型(MLLMs)构成了重大挑战。然而,用于评估MLLM几何技能的现有基准忽视了辅助线构建,并且缺乏细粒度的过程评估,使它们不足以评估MLLM的长步推理能力。为了弥补这些差距,我们提出了GeoLaux基准,包括2,186个几何问题,涵盖了计算和证明问题。值得注意的是,这些问题平均需要6.51个推理步骤,最多可达24步,其中41.8%需要辅助线构建。基于数据集,我们设计了一种新颖的五维评估策略,评估答案正确性、过程正确性、过程质量、辅助线影响和错误原因。对13个主要MLLM(包括思考模型和非思考模型)进行了广泛实验,得出了三个重要发现:首先,在扩展推理步骤方面,模型表现出显著的性能下降(九个模型表现出超过50%的性能下降)。其次,与计算问题相比,MLLM在解决证明问题时倾向于采取捷径。第三,模型缺乏辅助线意识,增强这种能力对整体几何推理改进特别有益。这些发现将GeoLaux确立为评估MLLM在使用辅助线的长步几何推理方面的基准,同时也是能力提升的指南。我们的数据集和代码已包含在补充材料中,并将发布。

更新时间: 2025-08-08 11:11:37

领域: cs.AI

下载: http://arxiv.org/abs/2508.06226v1

Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution

Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, emphasizing the necessity of well-calibrated confidence for trustworthy and adaptive evaluation. We systematically identify the **Overconfidence Phenomenon** in current LLM-as-a-Judges, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce **TH-Score**, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose **LLM-as-a-Fuser**, an ensemble framework that transforms LLMs into reliable, risk-aware evaluators. Extensive experiments demonstrate that our approach substantially improves calibration and enables adaptive, confidence-driven evaluation pipelines, achieving superior reliability and accuracy compared to existing baselines.
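
TH-Score itself is defined in the paper; as a generic stand-in, the overconfidence being diagnosed can be measured with a simple confidence-accuracy gap:

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus empirical accuracy; a positive value
    means the judge overstates how often it is right. This is a generic
    calibration gap, not the paper's TH-Score, whose exact formula is
    given in the paper."""
    assert len(confidences) == len(correct)
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy
```

A well-calibrated judge drives this gap toward zero; the phenomenon the authors identify corresponds to a large positive gap.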

Updated: 2025-08-08 11:11:22

标题: 对于LLM作为裁判的过度自信:诊断和自信驱动的解决方案

摘要: 大型语言模型(LLMs)被广泛用作自动判决者,实际价值取决于准确性和可信赖的、具有风险意识的判断。现有方法主要关注准确性,忽视了对良好校准置信度的必要性,这对于自适应和可靠的评估流程至关重要。在这项工作中,我们主张从以准确性为中心的评估转变为以置信度驱动、具有风险意识的LLM作为判决系统,强调了对于可信赖和自适应评估而言,良好校准的置信度的必要性。我们系统地识别了当前LLM作为判决者中的过度自信现象,即预测的置信度明显夸大了实际正确性,削弱了在实际部署中的可靠性。为了量化这一现象,我们引入了TH-Score,这是一种衡量置信度-准确性对齐的新指标。此外,我们提出了LLM作为融合器的框架,将LLMs转化为可靠、具有风险意识的评估者。大量实验证明,我们的方法显著改善了校准,并实现了自适应、以置信度驱动的评估流程,与现有基线相比取得了更高的可靠性和准确性。

更新时间: 2025-08-08 11:11:22

领域: cs.AI

下载: http://arxiv.org/abs/2508.06225v1

Contemplative Artificial Intelligence

As artificial intelligence (AI) improves, traditional alignment strategies may falter in the face of unpredictable self-improvement, hidden subgoals, and the sheer complexity of intelligent systems. Inspired by contemplative wisdom traditions, we show how four axiomatic principles can instil a resilient Wise World Model in AI systems. First, mindfulness enables self-monitoring and recalibration of emergent subgoals. Second, emptiness forestalls dogmatic goal fixation and relaxes rigid priors. Third, non-duality dissolves adversarial self-other boundaries. Fourth, boundless care motivates the universal reduction of suffering. We find that prompting AI to reflect on these principles improves performance on the AILuminate Benchmark (d=.96) and boosts cooperation and joint-reward on the Prisoner's Dilemma task (d=7+). We offer detailed implementation strategies at the level of architectures, constitutions, and reinforcement on chain-of-thought. For future systems, active inference may offer the self-organizing and dynamic coupling capabilities needed to enact Contemplative AI in embodied agents.
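
The reported effect sizes (d=.96, d=7+) use the standard d statistic; assuming Cohen's d with a pooled standard deviation, it can be computed as:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation: standardised difference
    of two group means. The abstract's d values come from the paper's own
    experiments; this only shows how the statistic is computed."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled
```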

Updated: 2025-08-08 11:06:54

标题: 沉思人工智能

摘要: 随着人工智能(AI)的进步,传统的调整策略在面对不可预测的自我改进、隐藏的次要目标和智能系统的复杂性时可能会失败。受思考智慧传统的启发,我们展示了四个公理性原则如何在AI系统中灌输一个具有弹性的智慧世界模型。首先,正念使得自我监控和重新校准出现的次要目标成为可能。其次,空性防止教条主义目标的固定并放松僵化的先验条件。第三,非二元性消解对立的自我他者边界。第四,无限的关怀激励普遍减少痛苦。我们发现,促使AI反思这些原则可以提高在AILuminate基准测试上的性能(d=0.96),并增加在囚徒困境任务中的合作和共同奖励(d=7+)。我们提供了在架构、构成和思维链强化层面的详细实施策略。对于未来的系统,主动推理可能提供实现在具有身体的代理中实施冥想AI所需的自组织和动态耦合能力。

更新时间: 2025-08-08 11:06:54

领域: cs.AI

下载: http://arxiv.org/abs/2504.15125v2

Rethinking the Bias of Foundation Model under Long-tailed Distribution

Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances from pre-training affect long-tailed downstream tasks. Specifically, we find that the imbalance biases inherited by foundation models manifest in downstream tasks as parameter imbalance and data imbalance. During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, we find that parameter imbalance, unlike data imbalance, cannot be effectively addressed by current re-balancing techniques, such as adjusting the logits, during training. To tackle both imbalances simultaneously, we build our method on causal learning and view the incomplete semantic factor as the confounder, which introduces spurious correlations between input samples and labels. To resolve these negative effects, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Notably, we achieve an average performance increase of about $1.67\%$ on each dataset. Code is available: https://github.com/JiahaoChen1/Pre-train-Imbalance
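
The backdoor adjustment the method builds on has a standard discrete form, $P(y\mid do(x)) = \sum_z P(y\mid x,z)\,P(z)$: instead of conditioning on the confounder (here, the incomplete semantic factor), one averages over its marginal distribution. A toy illustration with discrete distributions:

```python
def backdoor_adjust(p_y_given_xz, p_z):
    """Classic backdoor adjustment P(y|do(x)) = sum_z P(y|x,z) P(z).
    `p_y_given_xz` maps (x, z) pairs to P(y=1|x,z); `p_z` maps z to P(z).
    A toy discrete illustration of the principle the paper applies, not
    its neural implementation."""
    return {
        x: sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())
        for x in {x for (x, _) in p_y_given_xz}
    }
```

Averaging over $P(z)$ rather than $P(z\mid x)$ is what removes the spurious correlation the confounder induces.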

Updated: 2025-08-08 11:04:58

标题: 重新思考长尾分布下基础模型的偏见

摘要: 长尾学习因其实际意义而受到越来越多的关注。在各种方法中,随着基础模型的出现,微调范式引起了相当大的兴趣。然而,大多数现有方法主要侧重于利用这些模型的知识,忽略了它们依赖的不平衡训练数据引入的固有偏见。本文研究了预训练中这种不平衡如何影响长尾下游任务。具体来说,我们发现基础模型在下游任务中继承的不平衡偏见表现为参数不平衡和数据不平衡。在微调过程中,我们观察到参数不平衡起到更为关键的作用,而数据不平衡可以通过现有的再平衡策略来缓解。此外,我们发现参数不平衡不能像数据不平衡那样在训练期间有效地通过当前的再平衡技术(如调整逻辑值)来解决。为了同时解决这两种不平衡,我们基于因果学习构建了我们的方法,并将不完整的语义因素视为混杂因素,它在输入样本和标签之间带来了虚假相关性。为了解决这种负面影响,我们提出了一种新颖的反向调整方法,该方法学习输入样本和标签之间的真实因果效应,而不仅仅是拟合数据中的相关性。值得注意的是,我们在每个数据集上实现了约1.67%的平均性能提升。代码可在以下链接找到:https://github.com/JiahaoChen1/Pre-train-Imbalance

更新时间: 2025-08-08 11:04:58

领域: cs.LG,cs.CV,stat.ML

下载: http://arxiv.org/abs/2501.15955v3

InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?

Recent advances in Vision-Language Models (VLMs) have demonstrated impressive capabilities in perception and reasoning. However, the ability to perform causal inference -- a core aspect of human cognition -- remains underexplored, particularly in multimodal settings. In this study, we introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics that combine structured visual data with textual context. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations: cause, effect, intervention, counterfactual, and temporal. We manually collected 494 infographic-text pairs from four public sources and used GPT-4o to generate 1,482 high-quality multiple-choice QA pairs. These questions were then carefully revised by humans to ensure they cannot be answered based on surface-level cues alone but instead require genuine visual grounding. Our experimental results reveal that current VLMs exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning. Their significantly lower performance compared to humans indicates a substantial gap in leveraging infographic-based information for causal inference. Through InfoCausalQA, we highlight the need for advancing the causal reasoning abilities of multimodal AI systems.

Updated: 2025-08-08 11:03:23

标题: InfoCausalQA:模型能否基于信息图表执行非显式因果推理?

摘要: 最近对视觉语言模型(VLMs)的进展展示了在感知和推理方面令人印象深刻的能力。然而,执行因果推理——人类认知的核心方面之一——仍然被较少探讨,特别是在多模态环境中。在这项研究中,我们引入了InfoCausalQA,一个旨在评估基于信息图表的因果推理的新型基准。这些信息图表结合了结构化的视觉数据和文本背景。该基准包括两个任务:任务1侧重于基于推断的数字趋势进行定量因果推理,而任务2则针对涉及五种因果关系的语义因果推理:原因、结果、干预、反事实和时间。我们手动收集了来自四个公共来源的494个信息图文对,并使用GPT-4o生成了1,482个高质量的多项选择问答对。这些问题随后经过人工仔细修改,以确保它们不能仅仅基于表面线索回答,而是需要真正的视觉基础。我们的实验结果显示,目前的VLMs在计算推理方面的能力有限,甚至在语义因果推理方面限制更为明显。它们与人类相比表现明显较低,表明在利用基于信息图表的信息进行因果推理方面存在重大差距。通过InfoCausalQA,我们强调了推进多模态人工智能系统因果推理能力的需求。

更新时间: 2025-08-08 11:03:23

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.06220v1

Data Collaboration Analysis with Orthonormal Basis Selection and Alignment

Data Collaboration (DC) enables multiple parties to jointly train a model without exposing their private datasets. Each party privately transforms its data using a secret linear basis and shares only the resulting intermediate representations. Existing theory asserts that any target basis spanning the same subspace as the secret bases should suffice; however, empirical evidence reveals that the particular choice of target basis significantly influences model accuracy and stability. In this paper, we introduce Orthonormal Data Collaboration (ODC), a novel DC framework that explicitly enforces orthonormality constraints on both the secret and target bases. Under these constraints, the basis alignment step reduces precisely to the classical Orthogonal Procrustes Problem, admitting a closed-form solution. We rigorously establish that the resulting orthonormal change-of-basis matrices achieve orthogonal concordance, aligning all parties' intermediate representations up to a common orthogonal transformation. Consequently, downstream model performance becomes invariant to the specific choice of orthonormal target basis. Computationally, ODC substantially reduces alignment complexity from $O(\min\{a(cl)^2, a^2cl\})$ to $O(acl^2)$, where $a$ denotes the anchor data size, $l$ the latent dimension, and $c$ the number of collaborating parties. Extensive empirical evaluations confirm the theoretical advantages of ODC, demonstrating alignment speed-ups of up to two orders of magnitude compared to state-of-the-art DC methods, alongside comparable or superior accuracy across multiple benchmark datasets. ODC maintains robust privacy under the semi-honest threat model and requires only a single round of communication. These results establish ODC as a practically advantageous and computationally efficient enhancement to existing DC pipelines, particularly when orthonormal secret bases are naturally feasible.
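
The alignment step the paper reduces to has a well-known closed form: for the Orthogonal Procrustes Problem $\min_Q \|AQ - B\|_F$ over orthogonal $Q$, the solution is $Q = UV^\top$ from the SVD of $A^\top B$. A self-contained sketch:

```python
import numpy as np

def procrustes_align(A, B):
    """Closed-form solution of the Orthogonal Procrustes Problem: the
    orthogonal Q minimising ||A Q - B||_F is U V^T, where U S V^T is the
    SVD of A^T B. This is the alignment step ODC reduces to under
    orthonormality constraints."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Sanity check: recover a hidden orthogonal map between two parties' bases.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))                       # representations, basis 1
Q_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # hidden orthogonal map
B = A @ Q_true                                         # same data, basis 2
Q_est = procrustes_align(A, B)
```

The SVD of the small $cl \times cl$ (here $4 \times 4$) cross-product is what drives the $O(acl^2)$ cost.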

Updated: 2025-08-08 10:55:06

标题: 使用正交基选择和对准的数据协作分析

摘要: 数据协作(DC)使多个参与方能够共同训练模型,而不暴露其私有数据集。每个参与方都使用秘密线性基础私下转换其数据,并仅共享结果中间表示。现有理论断言,任何覆盖与秘密基础相同子空间的目标基础应该足够;然而,实证证据显示,特定选择的目标基础显著影响模型的准确性和稳定性。在本文中,我们介绍正交数据协作(ODC),这是一种新颖的DC框架,明确对秘密和目标基础施加正交性约束。在这些约束下,基础对齐步骤精确地降低到经典正交Procrustes问题,从而得到一个闭合解。我们严格证明,由此产生的正交基础变换矩阵实现了正交一致性,将所有参与方的中间表示对齐到一个公共正交变换上。因此,下游模型性能对正交目标基础的具体选择变得不变。在计算上,ODC将对齐复杂性从O(min{a(cl)^2, a^2cl})大大降低到O(acl^2),其中a表示锚定数据大小,l表示潜在维度,c表示合作参与方的数量。大量实证评估证实了ODC的理论优势,显示了与最先进的DC方法相比,对齐速度提高了两个数量级,同时在多个基准数据集上具有可比或更高的准确性。ODC在半诚实威胁模型下保持强大的隐私性,并且只需要一轮通信。这些结果将ODC确定为一种在现有DC流程中实际上有利且计算效率高的增强,特别是在正交秘密基础自然可行的情况下。

更新时间: 2025-08-08 10:55:06

领域: cs.LG,math.OC

下载: http://arxiv.org/abs/2403.02780v5

X-VFL: A New Vertical Federated Learning Framework with Cross Completion and Decision Subspace Alignment

Vertical Federated Learning (VFL) enables collaborative learning by integrating disjoint feature subsets from multiple clients/parties. However, VFL typically faces two key challenges: i) the requirement for perfectly aligned data samples across all clients (missing features are not allowed); ii) the requirement for joint collaborative inference/prediction involving all clients (it does not support locally independent inference on a single client). To address these challenges, we propose X-VFL, a new VFL framework designed to deal with the non-aligned data samples with (partially) missing features and to support locally independent inference of new data samples for each client. In particular, we design two novel modules in X-VFL: Cross Completion (XCom) and Decision Subspace Alignment (DS-Align). XCom can complete/reconstruct missing features for non-aligned data samples by leveraging information from other clients. DS-Align aligns local features with completed and global features across all clients within the decision subspace, thus enabling locally independent inference at each client. Moreover, we provide convergence theorems for different algorithms used in training X-VFL, showing an $O(1/\sqrt{T})$ convergence rate for SGD-type algorithms and an $O(1/T)$ rate for PAGE-type algorithms, where $T$ denotes the number of training update steps. Extensive experiments on real-world datasets demonstrate that X-VFL significantly outperforms existing methods, e.g., achieving a 15% improvement in accuracy on the image CIFAR-10 dataset and a 43% improvement on the medical MIMIC-III dataset. These results validate the practical effectiveness and superiority of X-VFL, particularly in scenarios involving partially missing features and locally independent inference.

Updated: 2025-08-08 10:51:19

标题: X-VFL:一种新的具有交叉完成和决策子空间对齐的垂直联邦学习框架

摘要: Vertical Federated Learning(VFL)通过集成来自多个客户/方的不相交特征子集,实现协作学习。然而,VFL通常面临两个关键挑战:i)需要在所有客户端之间完全对齐的数据样本(不允许缺失特征);ii)需要涉及所有客户端的联合协作推断/预测(不支持单个客户端上的本地独立推断)。为了解决这些挑战,我们提出了X-VFL,这是一个新的VFL框架,旨在处理具有(部分)缺失特征的不对齐数据样本,并支持每个客户端的新数据样本的本地独立推断。特别地,我们在X-VFL中设计了两个新颖的模块:交叉完成(XCom)和决策子空间对齐(DS-Align)。XCom可以利用来自其他客户端的信息,为不对齐的数据样本完成/重建缺失特征。DS-Align在决策子空间内将本地特征与已完成和全局特征对齐,从而使每个客户端能够进行本地独立推断。此外,我们为训练X-VFL中使用的不同算法提供了收敛定理,对于SGD类型算法,收敛速率为$O(1/\sqrt{T})$,对于PAGE类型算法,收敛速率为$O(1/T)$,其中$T$表示训练更新步数。对真实世界数据集的大量实验表明,X-VFL明显优于现有方法,例如,在图像CIFAR-10数据集上准确率提高了15%,在医疗MIMIC-III数据集上提高了43%。这些结果验证了X-VFL在涉及部分缺失特征和本地独立推断的情景中的实际有效性和优越性。

更新时间: 2025-08-08 10:51:19

领域: cs.LG,cs.CV,cs.DC,math.OC

下载: http://arxiv.org/abs/2508.05568v2

Reparameterization Proximal Policy Optimization

Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. However, a critical barrier is its training instability, where high-variance gradients can destabilize the learning process. To address this, we draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse in the model-free setting. We first establish a connection between this surrogate objective and RPG, which has been largely unexplored and is non-trivial. Then, we bridge this gap by demonstrating that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using backpropagation through time. Based on this key insight, we propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables multiple epochs of stable sample reuse by optimizing a clipped surrogate objective tailored for RPG, while being further stabilized by Kullback-Leibler (KL) divergence regularization and remaining fully compatible with existing variance reduction methods. We evaluate RPO on a suite of challenging locomotion and manipulation tasks, where experiments demonstrate that our method achieves superior sample efficiency and strong performance.
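
The PPO-style clipped surrogate at the core of the method has the familiar form $\min(rA, \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)A)$, where $r$ is the policy probability ratio and $A$ the advantage. The sketch below shows only the clipping; RPO additionally differentiates such an objective through the dynamics via backpropagation through time, which is not shown:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate objective
    L = mean(min(r * A, clip(r, 1-eps, 1+eps) * A)).
    The clip keeps the new policy close to the one that collected the
    samples, enabling stable sample reuse across multiple epochs."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantage, clipped * advantage).mean()
```

Taking the min with the clipped term makes the objective a pessimistic bound, so large policy updates stop being rewarded.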

Updated: 2025-08-08 10:50:55

标题: 重新参数化近端策略优化

摘要: 重新参数化策略梯度(RPG)通过利用可微动力学来提高样本效率具有很大潜力。然而,其训练不稳定性是一个关键障碍,高方差梯度可能会破坏学习过程。为了解决这个问题,我们从近端策略优化(PPO)中汲取灵感,后者使用替代目标来实现在无模型环境中稳定的样本重复利用。我们首先建立了这个替代目标与RPG之间的联系,这在很大程度上是未被探索的,也是非常重要的。然后,我们通过展示类似PPO的替代目标的重新参数化梯度可以通过时间反向传播高效计算来弥合这个差距。基于这一关键洞察,我们提出了重新参数化近端策略优化(RPO),这是一种稳定且具有高效样本利用率的基于RPG的方法。RPO通过优化为RPG量身定制的剪切替代目标实现多个时期的稳定样本重复利用,同时通过Kullback-Leibler(KL)散度正则化进一步稳定,并且与现有的方差减少方法完全兼容。我们在一系列具有挑战性的运动和操纵任务上评估了RPO,实验证明我们的方法实现了更优越的样本效率和强大的性能。

更新时间: 2025-08-08 10:50:55

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.06214v1

Graph Federated Learning for Personalized Privacy Recommendation

Federated recommendation systems (FedRecs) have gained significant attention for providing privacy-preserving recommendation services. However, existing FedRecs assume that all users have the same requirements for privacy protection, i.e., they do not upload any data to the server. The approaches overlook the potential to enhance the recommendation service by utilizing publicly available user data. In real-world applications, users can choose to be private or public. Private users' interaction data is not shared, while public users' interaction data can be shared. Inspired by the issue, this paper proposes a novel Graph Federated Learning for Personalized Privacy Recommendation (GFed-PP) that adapts to different privacy requirements while improving recommendation performance. GFed-PP incorporates the interaction data of public users to build a user-item interaction graph, which is then used to form a user relationship graph. A lightweight graph convolutional network (GCN) is employed to learn each user's user-specific personalized item embedding. To protect user privacy, each client learns the user embedding and the scoring function locally. Additionally, GFed-PP achieves optimization of the federated recommendation framework through the initialization of item embedding on clients and the aggregation of the user relationship graph on the server. Experimental results demonstrate that GFed-PP significantly outperforms existing methods for five datasets, offering superior recommendation accuracy without compromising privacy. This framework provides a practical solution for accommodating varying privacy preferences in federated recommendation systems.
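
The server-side construction can be sketched as follows: public users' shared interactions induce a user-item graph, from which a user relationship graph is formed. The co-interaction edge weighting below is an assumption for illustration, not necessarily GFed-PP's exact scheme:

```python
def user_relationship_graph(public_interactions):
    """Build a user-user relationship graph from public users' interaction
    lists: two users are linked with weight equal to the number of items
    they both interacted with. Private users never appear here, since
    their interaction data is not shared."""
    users = list(public_interactions)
    edges = {}
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            shared = len(set(public_interactions[u])
                         & set(public_interactions[v]))
            if shared:
                edges[(u, v)] = shared
    return edges
```

A GCN over this graph can then refine item embeddings while each client keeps its user embedding and scoring function local.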

Updated: 2025-08-08 10:44:33

标题: 图形联邦学习用于个性化隐私推荐

摘要: 联邦推荐系统(FedRecs)因提供保护隐私的推荐服务而受到重视。然而,现有的FedRecs假设所有用户对隐私保护具有相同要求,即他们不向服务器上传任何数据。这些方法忽视了通过利用公开可用的用户数据来增强推荐服务的潜力。在现实世界的应用中,用户可以选择保持私密或公开。私密用户的互动数据不会被共享,而公开用户的互动数据可以被共享。受此问题启发,本文提出了一种新颖的用于个性化隐私推荐的图联邦学习(GFed-PP)方法,该方法能够适应不同的隐私要求,同时提高推荐性能。GFed-PP利用公开用户的互动数据构建用户-物品互动图,然后用于形成用户关系图。采用轻量级图卷积网络(GCN)来学习每个用户的用户特定个性化物品嵌入。为了保护用户隐私,每个客户端在本地学习用户嵌入和评分函数。此外,GFed-PP通过在客户端初始化项目嵌入和在服务器上聚合用户关系图来实现联邦推荐框架的优化。实验结果表明,GFed-PP在五个数据集上明显优于现有方法,提供了更高的推荐准确性,而不会损害隐私。该框架为在联邦推荐系统中适应不同隐私偏好提供了实际解决方案。

更新时间: 2025-08-08 10:44:33

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.06208v1

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies have assessed LLMs' ability to predict program outputs, most focus solely on the accuracy of those predictions, without evaluating the reasoning behind them. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this work, we evaluate whether state-of-the-art LLMs with up to 8B parameters can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling. These mutations maintain program semantics while altering its syntax. We evaluated six LLMs and performed a human expert analysis using LiveCodeBench to assess whether the correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. Our findings show that LLMs trained for code produce correct predictions based on flawed reasoning between 10% and 50% of cases. Furthermore, LLMs often change predictions in response to our code mutations, indicating they do not yet exhibit stable, semantically grounded reasoning.
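
One of the five mutations, mirroring comparison expressions, can be implemented directly with Python's `ast` module; a minimal sketch (the paper's actual mutation tooling may differ):

```python
import ast

# a < b  <=>  b > a, and so on; == and != are symmetric.
MIRROR = {ast.Lt: ast.Gt, ast.Gt: ast.Lt, ast.LtE: ast.GtE,
          ast.GtE: ast.LtE, ast.Eq: ast.Eq, ast.NotEq: ast.NotEq}

class MirrorComparisons(ast.NodeTransformer):
    """Rewrite simple binary comparisons `a OP b` into the semantically
    equivalent `b OP' a` -- one of the five semantics-preserving mutations."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) == 1 and type(node.ops[0]) in MIRROR:
            return ast.copy_location(
                ast.Compare(left=node.comparators[0],
                            ops=[MIRROR[type(node.ops[0])]()],
                            comparators=[node.left]),
                node)
        return node  # leave chained comparisons untouched

def mirror(source):
    tree = MirrorComparisons().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

Because the transform works on the AST rather than on strings, the program's behaviour is provably unchanged while its surface form differs, which is exactly what the robustness probe needs.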

Updated: 2025-08-08 10:42:24

标题: 大型语言模型在理解代码对抗保留语义突变方面是否具有鲁棒性?

摘要: 理解大型语言模型(LLMs)的推理能力和鲁棒性对于它们在编程任务中的可靠使用至关重要。尽管最近的研究已经评估了LLMs预测程序输出的能力,但大多数研究仅关注这些预测的准确性,而没有评估其背后的推理过程。此外,在数学推理任务中观察到,LLMs可能通过错误的逻辑得出正确答案,引发对代码理解中类似问题的担忧。在这项工作中,我们评估了具有多达80亿参数的最先进LLMs是否能够推理Python程序,还是仅仅在猜测。我们应用了五种保持语义的代码变异:重命名变量、镜像比较表达式、交换if-else分支、将for循环转换为while循环以及循环展开。这些变异保持程序的语义同时改变其语法。我们评估了六种LLMs,并利用LiveCodeBench进行了人类专家分析,以评估正确预测是否基于合理推理。我们还在LiveCodeBench和CruxEval上评估了不同代码变异下的预测稳定性。我们的研究结果显示,针对代码训练的LLMs在10%到50%的情况下基于错误推理产生了正确预测。此外,LLMs经常在响应我们的代码变异时改变预测,表明它们尚未展示出稳定、语义基础的推理能力。

更新时间: 2025-08-08 10:42:24

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2505.10443v2

Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials

Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-generated procedures, materials designs, and mechanical predictions were tested in the laboratory, culminating in the fabrication of a novel pollen-based adhesive with tunable morphology and measured shear strength, establishing a foundation for future plant-derived adhesive design. This work demonstrates how AI-assisted ideation can drive real-world materials design and enable effective human-AI collaboration.

Updated: 2025-08-08 10:41:03

Categories: cs.LG,cond-mat.dis-nn,cond-mat.mtrl-sci,cond-mat.other,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.06591v1

Classification is a RAG problem: A case study on hate speech detection

Robust content moderation requires classification systems that can quickly adapt to evolving policies without costly retraining. We present classification using Retrieval-Augmented Generation (RAG), which shifts traditional classification tasks from determining the correct category in accordance with pre-trained parameters to evaluating content in relation to contextual knowledge retrieved at inference. In hate speech detection, this transforms the task from "is this hate speech?" to "does this violate the hate speech policy?" Our Contextual Policy Engine (CPE) - an agentic RAG system - demonstrates this approach and offers three key advantages: (1) robust classification accuracy comparable to leading commercial systems, (2) inherent explainability via retrieved policy segments, and (3) dynamic policy updates without model retraining. Through three experiments, we demonstrate strong baseline performance and show that the system can apply fine-grained policy control by correctly adjusting protection for specific identity groups without requiring retraining or compromising overall performance. These findings establish that RAG can transform classification into a more flexible, transparent, and adaptable process for content moderation and wider classification problems.

Updated: 2025-08-08 10:35:41

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.06204v1

LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches.
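The parameter savings described above can be sketched as follows (the abstract does not give the exact factorization of B, so the `B_t = C_t @ D_t` form and all sizes below are assumptions):

```python
import numpy as np

d_in = d_out = 768          # hidden size of the adapted projection
r, r2, tasks = 16, 4, 5     # LoRA rank, assumed inner rank, number of CVIT tasks

# Vanilla per-task LoRA: each task stores its own A (r x d_in) and B (d_out x r).
per_task_lora = tasks * (r * d_in + d_out * r)

# LiLoRA sketch: A is shared across tasks, and each task-specific B is further
# decomposed as B_t = C_t @ D_t with inner rank r2 << r, shrinking the
# task-specific part.
rng = np.random.default_rng(0)
A_shared = rng.normal(scale=0.01, size=(r, d_in))
task_B = [(np.zeros((d_out, r2)), rng.normal(scale=0.01, size=(r2, r)))
          for _ in range(tasks)]

lilora = r * d_in + tasks * (d_out * r2 + r2 * r)

def delta_w(t):
    C, D = task_B[t]
    return (C @ D) @ A_shared   # low-rank weight update for task t

print(per_task_lora, lilora)    # 122880 vs 27968 stored parameters
```

With these (assumed) sizes, the task-specific storage drops by roughly 4x while the update for each task keeps the full `(d_out, d_in)` shape.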

Updated: 2025-08-08 10:32:38

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06202v1

Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning

Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.

Updated: 2025-08-08 10:29:24

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06199v1

Solving Copyright Infringement on Short Video Platforms: Novel Datasets and an Audio Restoration Deep Learning Pipeline

Short video platforms like YouTube Shorts and TikTok face significant copyright compliance challenges, as infringers frequently embed arbitrary background music (BGM) to obscure original soundtracks (OST) and evade content originality detection. To tackle this issue, we propose a novel pipeline that integrates Music Source Separation (MSS) and cross-modal video-music retrieval (CMVMR). Our approach effectively separates arbitrary BGM from the original OST, enabling the restoration of authentic video audio tracks. To support this work, we introduce two domain-specific datasets: OASD-20K for audio separation and OSVAR-160 for pipeline evaluation. OASD-20K contains 20,000 audio clips featuring mixed BGM and OST pairs, while OSVAR-160 is a unique benchmark dataset comprising 1,121 video and mixed-audio pairs, specifically designed for short video restoration tasks. Experimental results demonstrate that our pipeline not only removes arbitrary BGM with high accuracy but also restores OSTs, ensuring content integrity. This approach provides an ethical and scalable solution to copyright challenges in user-generated content on short video platforms.

Updated: 2025-08-08 10:28:16

Categories: cs.MM,cs.AI

Download: http://arxiv.org/abs/2504.21772v3

Decompositional Reasoning for Graph Retrieval with Large Language Models

Large Language Models (LLMs) excel at many NLP tasks, but struggle with multi-hop reasoning and factual consistency, limiting their effectiveness on knowledge-intensive tasks like complex question answering (QA). Linking Knowledge Graphs (KG) and LLMs has shown promising results, but LLMs generally lack the ability to reason efficiently over graph-structured information. To tackle this problem, we propose a novel retrieval approach that integrates textual knowledge graphs into the LLM reasoning process via query decomposition. Our method decomposes complex questions into sub-questions, retrieves relevant textual subgraphs, and composes a question-specific knowledge graph to guide answer generation. For that, we use a weighted similarity function that focuses on both the complex question and the generated subquestions to extract a relevant subgraph, which allows efficient and precise retrieval for complex questions and improves the performance of LLMs on multi-hop QA tasks. This structured reasoning pipeline enhances factual grounding and interpretability while leveraging the generative strengths of LLMs. We evaluate our method on standard multi-hop QA benchmarks and show that it achieves comparable or superior performance to competitive existing methods, using smaller models and fewer LLM calls.
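A toy version of the weighted subgraph-scoring idea might look like this (the paper's actual similarity function and weighting are not given in the abstract; the convex combination, the bag-of-words cosine stand-in for embeddings, and all example data below are assumptions):

```python
import math
from collections import Counter

def cos(a, b):
    """Cosine similarity over bag-of-words vectors (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def triple_score(triple_text, question, sub_questions, alpha=0.5):
    # Assumed weighting: a convex combination of similarity to the complex
    # question and to the best-matching generated sub-question.
    sub = max((cos(triple_text, s) for s in sub_questions), default=0.0)
    return alpha * cos(triple_text, question) + (1 - alpha) * sub

q = "Who directed the film that won Best Picture in 1998?"
subs = ["Which film won Best Picture in 1998?", "Who directed that film?"]
facts = ["Titanic won Best Picture in 1998", "James Cameron directed Titanic",
         "Paris is the capital of France"]
ranked = sorted(facts, key=lambda f: triple_score(f, q, subs), reverse=True)
print(ranked)
```

Facts relevant to either the full question or a sub-question rank above unrelated ones, which is the behaviour the retrieval step relies on when assembling the question-specific subgraph.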

Updated: 2025-08-08 10:26:47

Categories: cs.CL,cs.IR,cs.LG

Download: http://arxiv.org/abs/2506.13380v2

ACTIVA: Amortized Causal Effect Estimation via Transformer-based Variational Autoencoder

Predicting the distribution of outcomes under hypothetical interventions is crucial across healthcare, economics, and policy-making. However, existing methods often require restrictive assumptions and are typically limited by the lack of amortization across problem instances. We propose ACTIVA, a transformer-based conditional variational autoencoder (VAE) architecture for amortized causal inference, which estimates interventional distributions directly from observational data without requiring such restrictive assumptions. ACTIVA learns a latent representation conditioned on observational inputs and intervention queries, enabling zero-shot inference by amortizing causal knowledge from diverse training scenarios. We provide theoretical insights showing that ACTIVA predicts interventional distributions as mixtures over observationally equivalent causal models. Empirical evaluations on synthetic and semi-synthetic datasets confirm the effectiveness of our amortized approach and highlight promising directions for future real-world applications.

Updated: 2025-08-08 10:12:06

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.01290v2

Survey on the Evaluation of Generative Models in Music

Research on generative systems in music has seen considerable attention and growth in recent years. A variety of attempts have been made to systematically evaluate such systems. We present an interdisciplinary review of the common evaluation targets, methodologies, and metrics for the evaluation of both system output and model use, covering subjective and objective approaches, qualitative and quantitative approaches, as well as empirical and computational methods. We examine the benefits and limitations of these approaches from a musicological, an engineering, and an HCI perspective.

Updated: 2025-08-08 10:02:06

Categories: cs.SD,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.05104v2

Differentially Private Federated Clustering with Random Rebalancing

Federated clustering aims to group similar clients into clusters and produce one model for each cluster. Such a personalization approach typically improves model performance compared with training a single model to serve all clients, but can be more vulnerable to privacy leakage. Directly applying client-level differentially private (DP) mechanisms to federated clustering could degrade the utilities significantly. We identify that such deficiencies are mainly due to the difficulties of averaging privacy noise within each cluster (following standard privacy mechanisms), as the number of clients assigned to the same clusters is uncontrolled. To this end, we propose a simple and effective technique, named RR-Cluster, that can be viewed as a light-weight add-on to many federated clustering algorithms. RR-Cluster achieves reduced privacy noise via randomly rebalancing cluster assignments, guaranteeing a minimum number of clients assigned to each cluster. We analyze the tradeoffs between decreased privacy noise variance and potentially increased bias from incorrect assignments and provide convergence bounds for RR-Cluster. Empirically, we demonstrate that RR-Cluster, plugged into strong federated clustering algorithms, results in significantly improved privacy/utility tradeoffs across both synthetic and real-world datasets.
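The rebalancing step can be sketched as follows (the exact donor-selection rule is not specified in the abstract, so moving randomly chosen clients out of the currently largest cluster is our assumption):

```python
import random

def rr_rebalance(assignments, num_clusters, min_size, rng=None):
    """Randomly move clients out of the largest clusters until every cluster
    holds at least `min_size` clients, so per-cluster noise averaging works."""
    rng = rng or random.Random(0)
    clusters = {k: [c for c, a in assignments.items() if a == k]
                for k in range(num_clusters)}
    for k in range(num_clusters):
        while len(clusters[k]) < min_size:
            donor = max(clusters, key=lambda j: len(clusters[j]))
            if len(clusters[donor]) <= min_size:
                raise ValueError("not enough clients to satisfy min_size")
            client = rng.choice(clusters[donor])
            clusters[donor].remove(client)
            clusters[k].append(client)
    return {c: k for k, members in clusters.items() for c in members}

# 10 clients, skewed 9-vs-1 initial assignment over 2 clusters.
assignments = {f"client{i}": (0 if i < 9 else 1) for i in range(10)}
balanced = rr_rebalance(assignments, num_clusters=2, min_size=3)
sizes = [list(balanced.values()).count(k) for k in (0, 1)]
print(sizes)  # every cluster now holds at least min_size clients
```

Guaranteeing a floor on cluster size bounds the per-client noise contribution after averaging, at the cost of some deliberately incorrect assignments, which is exactly the tradeoff the paper analyzes.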

Updated: 2025-08-08 09:56:47

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.06183v1

Navigating Demand Uncertainty in Container Shipping: Deep Reinforcement Learning for Enabling Adaptive and Feasible Master Stowage Planning

Reinforcement learning (RL) has shown promise in solving various combinatorial optimization problems. However, conventional RL faces challenges when dealing with real-world constraints, especially when action space feasibility is explicit and dependent on the corresponding state or trajectory. In this work, we focus on using RL in container shipping, often considered the cornerstone of global trade, by dealing with the critical challenge of master stowage planning. The main objective is to maximize cargo revenue and minimize operational costs while navigating demand uncertainty and various complex operational constraints, namely vessel capacity and stability, which must be dynamically updated along the vessel's voyage. To address this problem, we implement a deep reinforcement learning framework with feasibility projection to solve the master stowage planning problem (MPP) under demand uncertainty. The experimental results show that our architecture efficiently finds adaptive, feasible solutions for this multi-stage stochastic optimization problem, outperforming traditional mixed-integer programming and RL with feasibility regularization. Our AI-driven decision-support policy enables adaptive and feasible planning under uncertainty, optimizing operational efficiency and capacity utilization while contributing to sustainable and resilient global supply chains.

Updated: 2025-08-08 09:52:16

Categories: cs.LG,math.OC

Download: http://arxiv.org/abs/2502.12756v4

Adacc: An Adaptive Framework Unifying Compression and Activation Recomputation for LLM Training

Training large language models (LLMs) is often constrained by GPU memory limitations. To alleviate memory pressure, activation recomputation and data compression have been proposed as two major strategies. However, both approaches have limitations: recomputation introduces significant training overhead, while compression can lead to accuracy degradation and computational inefficiency when applied naively. In this paper, we propose Adacc, the first adaptive memory optimization framework that unifies activation recomputation and data compression to improve training efficiency for LLMs while preserving model accuracy. Unlike existing methods that apply static, rule-based strategies or rely solely on one technique, Adacc makes fine-grained, tensor-level decisions, dynamically selecting between recomputation, retention, and compression based on tensor characteristics and runtime hardware constraints. Adacc tackles three key challenges: (1) it introduces layer-specific compression algorithms that mitigate accuracy loss by accounting for outliers in LLM activations; (2) it employs a MILP-based scheduling policy to globally optimize memory strategies across layers; and (3) it integrates an adaptive policy evolution mechanism to update strategies during training in response to changing data distributions. Experimental results show that Adacc improves training throughput by 1.01x to 1.37x compared to state-of-the-art frameworks, while maintaining accuracy comparable to the baseline.

Updated: 2025-08-08 09:49:52

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2508.00806v2

Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation

Colonoscopy is a vital tool for the early diagnosis of colorectal cancer, which is one of the main causes of cancer-related mortality globally; hence, it is deemed an essential technique for the prevention and early detection of colorectal cancer. The research introduces a unique multidirectional architectural framework to automate polyp detection within colonoscopy images while helping resolve limited healthcare dataset sizes and annotation complexities. The research implements a comprehensive system that delivers synthetic data generation through Stable Diffusion enhancements together with detection and segmentation algorithms. This detection approach combines Faster R-CNN for initial object localization while the Segment Anything Model (SAM) refines the segmentation masks. The Faster R-CNN detection algorithm achieved a recall of 93.08%, a precision of 88.97%, and an F1 score of 90.98%. SAM is then used to generate the image mask. The research evaluated five state-of-the-art segmentation models that included U-Net, PSPNet, FPN, LinkNet, and MANet using ResNet34 as a base model. The results demonstrate the superior performance of FPN with the highest scores of PSNR (7.205893) and SSIM (0.492381), while UNet excels in recall (84.85%) and LinkNet shows balanced performance in IoU (64.20%) and Dice score (77.53%).

Updated: 2025-08-08 09:37:03

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06170v1

UW-3DGS: Underwater 3D Reconstruction with Physics-Aware Gaussian Splatting

Underwater 3D scene reconstruction faces severe challenges from light absorption, scattering, and turbidity, which degrade geometry and color fidelity in traditional methods like Neural Radiance Fields (NeRF). While NeRF extensions such as SeaThru-NeRF incorporate physics-based models, their MLP reliance limits efficiency and spatial resolution in hazy environments. We introduce UW-3DGS, a novel framework adapting 3D Gaussian Splatting (3DGS) for robust underwater reconstruction. Key innovations include: (1) a plug-and-play learnable underwater image formation module using voxel-based regression for spatially varying attenuation and backscatter; and (2) a Physics-Aware Uncertainty Pruning (PAUP) branch that adaptively removes noisy floating Gaussians via uncertainty scoring, ensuring artifact-free geometry. The pipeline operates in training and rendering stages. During training, noisy Gaussians are optimized end-to-end with underwater parameters, guided by PAUP pruning and scattering modeling. In rendering, refined Gaussians produce clean Unattenuated Radiance Images (URIs) free from media effects, while learned physics enable realistic Underwater Images (UWIs) with accurate light transport. Experiments on SeaThru-NeRF and UWBundle datasets show superior performance, achieving PSNR of 27.604, SSIM of 0.868, and LPIPS of 0.104 on SeaThru-NeRF, with ~65% reduction in floating artifacts.

Updated: 2025-08-08 09:36:32

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06169v1

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings and task-specific assumptions. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR2 introduces two key contributions: a difficulty-aware curriculum training that selectively invokes retrieval only for challenging problems, and a hybrid knowledge access strategy combining domain-specific offline corpora with LLM-generated summaries. These components are designed to enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR2 (built on Qwen2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods, achieving comparable performance to GPT-4o-mini and GPT-4.1-mini on several benchmarks. We have released all code, models, and data at https://github.com/Tsinghua-dhy/UR2.

Updated: 2025-08-08 09:33:20

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.06165v1

One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging

Model merging has emerged as a compelling data-free paradigm for multi-task learning, enabling the fusion of multiple fine-tuned models into a single, powerful entity. A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. However, prevailing approaches employ a "one-size-fits-all" strategy, applying a uniform sparsity ratio that overlooks the inherent structural and statistical heterogeneity of model parameters. This often leads to a suboptimal trade-off, where critical parameters are inadvertently pruned while less useful ones are retained. To address this limitation, we introduce TADrop (Tensor-wise Adaptive Drop), an adaptive sparsification strategy that respects this heterogeneity. Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties. The core intuition is that tensors with denser, more redundant distributions can be pruned aggressively, while sparser, more critical ones are preserved. As a simple and plug-and-play module, we validate TADrop by integrating it with foundational, classic, and SOTA merging methods. Extensive experiments across diverse tasks (vision, language, and multimodal) and models (ViT, BEiT) demonstrate that TADrop consistently and significantly boosts their performance. For instance, when enhancing a leading merging method, it achieves an average performance gain of 2.0% across 8 ViT-B/32 tasks. TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model's structure, offering a new baseline for high-performance model merging.
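A rough sketch of tensor-wise adaptive sparsification (the distributional statistic TADrop actually uses is not stated in the abstract; the kurtosis signal and the rank-to-ratio mapping below are our assumptions):

```python
import numpy as np

def tadrop(task_vector, base_ratio=0.5, spread=0.4):
    """Tensor-wise adaptive dropping (a sketch). We assume tail-heaviness via
    kurtosis as the distributional property: flatter, more redundant tensors
    are pruned harder, while heavy-tailed tensors with a few dominant entries
    are kept denser."""
    kurt = {}
    for name, t in task_vector.items():
        z = (t - t.mean()) / (t.std() + 1e-8)
        kurt[name] = float((z ** 4).mean())          # higher = heavier tails
    order = sorted(kurt, key=kurt.get)               # flattest first
    pruned = {}
    for name, t in task_vector.items():
        frac = order.index(name) / max(len(order) - 1, 1)
        ratio = base_ratio + spread * (0.5 - frac)   # flat -> larger drop ratio
        k = int(t.size * ratio)                      # entries to zero out
        thresh = np.sort(np.abs(t).ravel())[k] if k < t.size else np.inf
        pruned[name] = np.where(np.abs(t) >= thresh, t, 0.0)
    return pruned

tv = {
    "uniformish": np.random.default_rng(0).uniform(-1, 1, 1024),  # redundant
    "heavy_tail": np.random.default_rng(1).standard_t(df=2, size=1024),
}
out = tadrop(tv)
for name, t in out.items():
    print(name, round(float((t == 0).mean()), 3))
```

The flat tensor ends up noticeably sparser than the heavy-tailed one, in contrast to a uniform ratio that would drop the same fraction from both.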

Updated: 2025-08-08 09:33:08

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.06163v1

LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage

Recent studies have discovered that large language models (LLMs) may be "fooled" into outputting private information, including training data, system prompts, and personally identifiable information, under carefully crafted adversarial prompts. Existing red-teaming approaches for privacy leakage either rely on manual efforts or focus solely on system prompt extraction, making them ineffective for severe risks of training data leakage. We propose LeakAgent, a novel black-box red-teaming framework for LLM privacy leakage. Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts for both training data extraction and system prompt extraction. To achieve this, we propose a novel reward function to provide effective and fine-grained rewards and design novel mechanisms to balance exploration and exploitation during learning and enhance the diversity of adversarial prompts. Through extensive evaluations, we first show that LeakAgent significantly outperforms existing rule-based approaches in training data extraction and automated methods in system prompt leakage. We also demonstrate the effectiveness of LeakAgent in extracting system prompts from real-world applications in OpenAI's GPT Store. We further demonstrate LeakAgent's effectiveness in evading the existing guardrail defense and its helpfulness in enabling better safety alignment. Finally, we validate our customized designs through a detailed ablation study. We release our code at https://github.com/rucnyz/LeakAgent.

Updated: 2025-08-08 09:27:21

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2412.05734v2

Semantic Item Graph Enhancement for Multimodal Recommendation

Multimodal recommendation systems have attracted increasing attention for their improved performance by leveraging items' multimodal information. Prior methods often build modality-specific item-item semantic graphs from raw modality features and use them as supplementary structures alongside the user-item interaction graph to enhance user preference learning. However, these semantic graphs suffer from semantic deficiencies, including (1) insufficient modeling of collaborative signals among items and (2) structural distortions introduced by noise in raw modality features, ultimately compromising performance. To address these issues, we first extract collaborative signals from the interaction graph and infuse them into each modality-specific item semantic graph to enhance semantic modeling. Then, we design a modulus-based personalized embedding perturbation mechanism that injects perturbations with modulus-guided personalized intensity into embeddings to generate contrastive views. This enables the model to learn noise-robust representations through contrastive learning, thereby reducing the effect of structural noise in semantic graphs. Besides, we propose a dual representation alignment mechanism that first aligns multiple semantic representations via a designed Anchor-based InfoNCE loss using behavior representations as anchors, and then aligns behavior representations with the fused semantics by standard InfoNCE, to ensure representation consistency. Extensive experiments on four benchmark datasets validate the effectiveness of our framework.
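The contrastive ingredient can be illustrated with a generic InfoNCE loss in NumPy (a sketch only: the paper's Anchor-based InfoNCE uses behavior representations as anchors, which the toy data below mimics; the batch size, dimensionality, and temperature are arbitrary):

```python
import numpy as np

def info_nce(anchors, candidates, temperature=0.2):
    """Generic InfoNCE: the i-th anchor should match the i-th candidate
    against all other candidates in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = a @ c.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # matching pairs on diagonal

rng = np.random.default_rng(0)
behavior = rng.normal(size=(8, 16))                   # anchors
aligned = behavior + 0.05 * rng.normal(size=(8, 16))  # semantics near anchors
unrelated = rng.normal(size=(8, 16))                  # unaligned semantics

print(info_nce(behavior, aligned), info_nce(behavior, unrelated))
```

Semantic representations that stay close to their behavior anchors incur a much lower loss than unrelated ones, which is the pressure the dual alignment mechanism uses to keep the fused semantics consistent.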

Updated: 2025-08-08 09:20:50

Categories: cs.IR,cs.AI,cs.MM

Download: http://arxiv.org/abs/2508.06154v1

Movement-Prediction-Adjusted Naive Forecast

In financial time series forecasting, surpassing the naive forecast is challenging due to the randomness in the data. To address this challenge, this study proposes a novel forecast combination method, the movement-prediction-adjusted naive forecast (MPANF), which is designed to improve point forecasts beyond the naive baseline. Specifically, MPANF integrates two forecasting components: a naive forecast and a movement prediction. The final forecast is generated by adjusting the naive forecast with a movement prediction term, the weight of which is the product of two in-sample quantities: one is a coefficient determined from the movement prediction accuracy and the other is the mean absolute increment. The performance of MPANF was evaluated on eight financial time series via standard metrics, including the RMSE, MAE, MAPE, and sMAPE. Under modest movement prediction accuracy slightly above 0.55, MPANF generally outperforms common benchmarks such as the naive forecast, naive forecast with drift, integrated moving average of order (1,1) (IMA(1,1)), and linear regression. These findings suggest that MPANF can serve as an effective second-stage method when reliable movement predictions are available.
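The combination rule is explicit enough to sketch in a few lines (the mapping from movement-prediction accuracy to the coefficient is not given in the abstract; `coef = 2 * hit_rate - 1` and the example prices are our assumptions):

```python
import numpy as np

def mpanf_forecast(history, direction_pred, hit_rate):
    """One-step MPANF sketch: the naive forecast adjusted by a movement term.
    The weight is (accuracy-derived coefficient) x (mean absolute increment);
    we assume coef = 2 * hit_rate - 1, which vanishes at coin-flip accuracy
    so the method then degenerates to the naive forecast."""
    increments = np.diff(history)
    mean_abs_inc = float(np.abs(increments).mean())  # in-sample quantity
    coef = 2.0 * hit_rate - 1.0                      # assumed mapping
    naive = float(history[-1])                       # naive forecast
    return naive + direction_pred * coef * mean_abs_inc

prices = np.array([100.0, 101.0, 100.5, 102.0, 101.5])
up = mpanf_forecast(prices, direction_pred=+1, hit_rate=0.58)  # predicted "up"
print(round(up, 4))
```

With a hit rate of 0.58 the adjustment is small relative to the typical increment, matching the paper's observation that only modest movement accuracy (slightly above 0.55) is needed for the method to pay off.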

Updated: 2025-08-08 09:20:46

Domains: cs.CE,cs.AI,cs.LG,econ.EM,stat.ML,62M10

Download: http://arxiv.org/abs/2406.14469v9

SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs

With the development of customized large language model (LLM) agents, a new threat of black-box backdoor attacks has emerged, where malicious instructions are injected into hidden system prompts. These attacks easily bypass existing defenses that rely on white-box access, posing a serious security challenge. To address this, we propose SLIP, a Soft Label mechanism and key-extraction-guided CoT-based defense against Instruction backdoors in APIs. SLIP is designed based on two key insights. First, to counteract the model's oversensitivity to triggers, we propose a Key-extraction-guided Chain-of-Thought (KCoT). Instead of only considering the single trigger or the input sentence, KCoT prompts the agent to extract task-relevant key phrases. Second, to guide the LLM toward correct answers, our proposed Soft Label Mechanism (SLM) prompts the agent to quantify the semantic correlation between key phrases and candidate answers. Crucially, to mitigate the influence of residual triggers or misleading content in phrases extracted by KCoT, which typically causes anomalous scores, SLM excludes anomalous scores deviating significantly from the mean and subsequently averages the remaining scores to derive a more reliable semantic representation. Extensive experiments on classification and question-answer (QA) tasks demonstrate that SLIP is highly effective, reducing the average attack success rate (ASR) from 90.2% to 25.13% while maintaining high accuracy on clean data and outperforming state-of-the-art defenses. Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/SLIP.

Updated: 2025-08-08 09:17:33

Domains: cs.CR

Download: http://arxiv.org/abs/2508.06153v1

Improving Diagnostic Accuracy for Oral Cancer with inpainting Synthesis Lesions Generated Using Diffusion Models

In oral cancer diagnostics, the limited availability of annotated datasets frequently constrains the performance of diagnostic models, particularly due to the variability and insufficiency of training data. To address these challenges, this study proposed a novel approach to enhance diagnostic accuracy by synthesizing realistic oral cancer lesions using an inpainting technique with a fine-tuned diffusion model. We compiled a comprehensive dataset from multiple sources, featuring a variety of oral cancer images. Our method generated synthetic lesions that exhibit a high degree of visual fidelity to actual lesions, thereby significantly enhancing the performance of diagnostic algorithms. The results show that our classification model achieved a diagnostic accuracy of 0.97 in differentiating between cancerous and non-cancerous tissues, while our detection model accurately identified lesion locations with 0.85 accuracy. This method validates the potential for synthetic image generation in medical diagnostics and paves the way for further research into extending these methods to other types of cancer diagnostics.

Updated: 2025-08-08 09:15:02

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2508.06151v1

Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications

The versatility of large language models (LLMs) has been explored across various sectors, but their application in healthcare poses challenges, particularly in the domain of pharmaceutical contraindications where accurate and reliable information is required. This study enhances the capability of LLMs to address contraindications effectively by implementing a Retrieval Augmented Generation (RAG) pipeline. Utilizing OpenAI's GPT-4o-mini as the base model, and the text-embedding-3-small model for embeddings, our approach integrates Langchain to orchestrate a hybrid retrieval system with re-ranking. This system leverages Drug Utilization Review (DUR) data from public databases, focusing on contraindications for specific age groups, pregnancy, and concomitant drug use. The dataset includes 300 question-answer pairs across three categories, with baseline model accuracy ranging from 0.49 to 0.57. Post-integration of the RAG pipeline, we observed a significant improvement in model accuracy, achieving rates of 0.94, 0.87, and 0.89 for contraindications related to age groups, pregnancy, and concomitant drug use, respectively. The results indicate that augmenting LLMs with a RAG framework can substantially reduce uncertainty in prescription and drug intake decisions by providing more precise and reliable drug contraindication information.
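The hybrid retrieval step can be illustrated with a toy example. Here both the sparse and "dense" scores are computed from term counts, standing in for the real system's index and text-embedding-3-small vectors; `alpha`, `top_k`, and the function names are assumptions for illustration, not the paper's implementation:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def hybrid_retrieve(query, docs, alpha=0.5, top_k=2):
    """Toy hybrid retrieval: blend a sparse term-overlap score with a
    'dense' cosine score (here both derived from term counts, standing in
    for an embedding model), then keep the top_k candidates that a
    downstream re-ranker would reorder."""
    q = Counter(query.lower().split())
    scored = []
    for doc in docs:
        d = Counter(doc.lower().split())
        sparse = sum((q & d).values()) / max(len(query.split()), 1)
        dense = cosine(q, d)
        scored.append((alpha * sparse + (1 - alpha) * dense, doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]

docs = [
    "warfarin is contraindicated during pregnancy",
    "ibuprofen with aspirin increases bleeding risk",
    "vitamin c is a dietary supplement",
]
top = hybrid_retrieve("is warfarin contraindicated during pregnancy", docs, top_k=1)
```

The retrieved passages would then be injected into the LLM prompt so that the contraindication answer is grounded in DUR data rather than parametric memory.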

Updated: 2025-08-08 09:09:03

Domains: cs.AI

Download: http://arxiv.org/abs/2508.06145v1

StepFun-Prover Preview: Let's Think and Verify Step by Step

We present StepFun-Prover Preview, a large language model designed for formal theorem proving through tool-integrated reasoning. Using a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover can achieve strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover achieves a pass@1 success rate of 70.0%. Beyond advancing benchmark performance, we introduce an end-to-end training framework for developing tool-integrated reasoning models, offering a promising direction for automated theorem proving and Math AI assistants.

Updated: 2025-08-08 08:58:11

Domains: cs.AI

Download: http://arxiv.org/abs/2507.20199v2

Roll Your Eyes: Gaze Redirection via Explicit 3D Eyeball Rotation

We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.

Updated: 2025-08-08 08:56:51

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06136v1

Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models

Knowledge Distillation (KD) is a fundamental technique for compressing large language models (LLMs) into compact, efficient student models. However, existing white-box KD methods mainly focus on balancing ground truth and student-generated responses while overlooking two critical factors: training data quality and student-model compatibility. To address these limitations, we propose Selective Reflection Distillation (SRD), a novel data curation framework that leverages reflections from student models to systematically refine training data. SRD dynamically evaluates and selects prompt-response pairs by comparing ground truth data with student model outputs, selectively curating high-quality, student-compatible training instances through automated ranking based on difficulty. Furthermore, after selecting the training data, a curriculum scheduling strategy is employed to incrementally introduce these curated subsets into the distillation process at fixed intervals. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches and model architectures, as well as decreases computational cost significantly during KD training. Experiments on a range of language model benchmarks demonstrate SRD's consistent improvements in distilled model performance, as well as a reduction in training runtime by up to 39%, under diverse KD methods and model families. Notably, SRD operates as a plug-and-play module, enhancing sample efficiency without modifying underlying KD algorithms. Our findings highlight that data quality and compatibility are pivotal to effective and efficient distillation of LLMs, and SRD provides a principled framework to achieve both. This work advances the understanding of data-centric factors in KD and offers practical insights for enhancing the capability and efficiency of compressed LLMs.

Updated: 2025-08-08 08:55:53

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.06135v1

LLM Serving Optimization with Variable Prefill and Decode Lengths

We study the problem of serving LLM (Large Language Model) requests where each request has heterogeneous prefill and decode lengths. In LLM serving, the prefill length corresponds to the input prompt length, which determines the initial memory usage in the KV cache. The decode length refers to the number of output tokens generated sequentially, with each additional token increasing the KV cache memory usage by one unit. Given a set of n requests, our goal is to schedule and process them to minimize the total completion time. We show that this problem is NP-hard due to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. We then analyze commonly used scheduling strategies in practice, such as First-Come-First-Serve (FCFS) and Shortest-First (SF), and prove that their competitive ratios scale up sublinearly with the memory limit, a significant drawback in real-world settings where memory demand is large. To address this, we propose a novel algorithm based on a new selection metric that efficiently forms batches over time. We prove that this algorithm achieves a constant competitive ratio. Finally, we develop and evaluate a few algorithm variants inspired by this approach, including dynamic programming variants, local search methods, and an LP-based scheduler, demonstrating through comprehensive simulations that they outperform standard baselines while maintaining computational efficiency.
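The gap between ordering policies is easy to see in a toy simulator. This sketch uses a conservative simplification that is not the paper's model: each admitted request reserves its peak KV-cache footprint (prefill + decode) up front, rather than growing memory token by token, and every request is assumed to fit within the memory limit on its own:

```python
def total_completion_time(requests, mem_limit):
    """Simulate one server. requests: list of (prefill, decode) token
    counts, tried for admission in the given order; each must satisfy
    prefill + decode <= mem_limit. On admission a request reserves its
    peak KV-cache footprint -- a conservative stand-in for the linearly
    growing memory of the real problem. Each time step every active
    request emits one token; the sum of completion times is returned."""
    pending = list(requests)
    active = []  # each entry: [remaining_decode_tokens, reserved_memory]
    reserved, t, total = 0, 0, 0
    while pending or active:
        # admit requests greedily while their peak footprint fits
        while pending and reserved + sum(pending[0]) <= mem_limit:
            p, d = pending.pop(0)
            reserved += p + d
            active.append([d, p + d])
        t += 1
        for req in active:
            req[0] -= 1
        for req in [r for r in active if r[0] == 0]:
            total += t
            reserved -= req[1]
            active.remove(req)
    return total

fcfs_order = [(8, 4), (2, 2), (2, 2)]
sf_order = sorted(fcfs_order, key=sum)  # shortest-first by total length
```

With a memory limit of 12, FCFS admits the long request alone and makes the short ones wait (total completion time 16), while shortest-first batches the two short requests first (total 10), matching the intuition behind SF's advantage on small jobs.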

Updated: 2025-08-08 08:54:21

Domains: math.OC,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.06133v1

Enhancing the Scalability of Classical Surrogates for Real-World Quantum Machine Learning Applications

Quantum machine learning (QML) presents potential for early industrial adoption, yet limited access to quantum hardware remains a significant bottleneck for the deployment of QML solutions. This work explores the use of classical surrogates to bypass this restriction, a technique that builds a lightweight classical representation of a (trained) quantum model, enabling inference on entirely classical devices. We reveal the prohibitively high computational demand associated with previously proposed methods for generating classical surrogates from quantum models, and propose an alternative pipeline enabling generation of classical surrogates at a larger scale than was previously possible. Previous methods required at least a high-performance computing (HPC) system for quantum models of below industrial scale (ca. 20 qubits), which raises questions about their practicality. We greatly minimize the redundancies of the previous approach, utilizing only a minute fraction of the resources previously needed. We demonstrate the effectiveness of our method on a real-world energy demand forecasting problem, conducting rigorous testing of performance and computation demand in both simulations and on quantum hardware. Our results indicate that our method achieves high accuracy on the testing dataset while its computational resource requirements scale linearly rather than exponentially. This work presents a lightweight approach to transform quantum solutions into classically deployable versions, facilitating faster integration of quantum technology in industrial settings. Furthermore, it can serve as a powerful research tool in the search for practical quantum advantage in an empirical setup.

Updated: 2025-08-08 08:51:01

Domains: quant-ph,cs.LG

Download: http://arxiv.org/abs/2508.06131v1

Study of Robust Features in Formulating Guidance for Heuristic Algorithms for Solving the Vehicle Routing Problem

The Vehicle Routing Problem (VRP) is a complex optimization problem with numerous real-world applications, mostly solved using metaheuristic algorithms due to its NP-hard nature. Traditionally, these metaheuristics rely on human-crafted designs developed through empirical studies. However, recent research shows that machine learning methods can be used to learn the structural characteristics of solutions in combinatorial optimization, thereby aiding in designing more efficient algorithms, particularly for solving VRP. Building on this advancement, this study extends the previous research by conducting a sensitivity analysis using multiple classifier models that are capable of predicting the quality of VRP solutions. Hence, by leveraging explainable AI, this research is able to extend the understanding of how these models make decisions. Finally, our findings indicate that while feature importance varies, certain features consistently emerge as strong predictors. Furthermore, we propose a unified framework capable of ranking feature impact across different scenarios to illustrate this finding. These insights highlight the potential of feature importance analysis as a foundation for developing a guidance mechanism for metaheuristic algorithms for solving the VRP.

Updated: 2025-08-08 08:50:03

Domains: cs.AI

Download: http://arxiv.org/abs/2508.06129v1

IOCC: Aligning Semantic and Cluster Centers for Few-shot Short Text Clustering

In clustering tasks, it is essential to structure the feature space into clear, well-separated distributions. However, because short text representations have limited expressiveness, conventional methods struggle to identify cluster centers that truly capture each category's underlying semantics, causing the representations to be optimized in suboptimal directions. To address this issue, we propose IOCC, a novel few-shot contrastive learning method that achieves alignment between the cluster centers and the semantic centers. IOCC consists of two key modules: Interaction-enhanced Optimal Transport (IEOT) and Center-aware Contrastive Learning (CACL). Specifically, IEOT incorporates semantic interactions between individual samples into the conventional optimal transport problem, and generate pseudo-labels. Based on these pseudo-labels, we aggregate high-confidence samples to construct pseudo-centers that approximate the semantic centers. Next, CACL optimizes text representations toward their corresponding pseudo-centers. As training progresses, the collaboration between the two modules gradually reduces the gap between cluster centers and semantic centers. Therefore, the model will learn a high-quality distribution, improving clustering performance. Extensive experiments on eight benchmark datasets show that IOCC outperforms previous methods, achieving up to 7.34% improvement on challenging Biomedical dataset and also excelling in clustering stability and efficiency. The code is available at: https://anonymous.4open.science/r/IOCC-C438.
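The pseudo-center construction can be sketched directly: keep only high-confidence pseudo-labeled samples and average them per label. The confidence threshold is an assumed parameter for illustration:

```python
def pseudo_centers(embeddings, pseudo_labels, confidences, threshold=0.8):
    """Aggregate high-confidence samples per pseudo-label into a centroid
    that approximates the semantic center (a sketch of the pseudo-center
    step described in the abstract; threshold is an assumed knob)."""
    groups = {}
    for emb, label, conf in zip(embeddings, pseudo_labels, confidences):
        if conf >= threshold:
            groups.setdefault(label, []).append(emb)
    centers = {}
    for label, members in groups.items():
        dim = len(members[0])
        # element-wise mean over the retained members
        centers[label] = [sum(m[i] for m in members) / len(members)
                          for i in range(dim)]
    return centers
```

A contrastive objective such as CACL would then pull each text representation toward the pseudo-center of its label while pushing it away from the other centers.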

Updated: 2025-08-08 08:47:13

Domains: stat.ME,cs.LG,stat.ML

Download: http://arxiv.org/abs/2508.06126v1

Reshaping MOFs text mining with a dynamic multi-agents framework of large language model

Accurately identifying the synthesis conditions of metal-organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and difficult to interpret. We present MOFh6, a large language model driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables. It links related descriptions across paragraphs, unifies ligand abbreviations with full names, and outputs structured parameters ready for use. MOFh6 achieved 99% extraction accuracy, resolved 94.1% of abbreviation cases across five major publishers, and maintained a precision of 0.93 +/- 0.01. Processing a full text takes 9.6 s, locating synthesis descriptions 36 s, with 100 papers processed for USD 4.24. By replacing static database lookups with real-time extraction, MOFh6 reshapes MOF synthesis research, accelerating the conversion of literature knowledge into practical synthesis protocols and enabling scalable, data-driven materials discovery.

Updated: 2025-08-08 08:35:50

Domains: cs.AI,cond-mat.mtrl-sci

Download: http://arxiv.org/abs/2504.18880v3

Exploiting Kubernetes' Image Pull Implementation to Deny Node Availability

Kubernetes (K8s) has grown in popularity over the past few years to become the de-facto standard for container orchestration in cloud-native environments. While research is not new to topics such as containerization and access control security, the Application Programming Interface (API) interactions between K8s and its runtime interfaces have not been studied thoroughly. In particular, the CRI-API is responsible for abstracting the container runtime, managing the creation and lifecycle of containers along with the downloads of the respective images. However, this decoupling of concerns and the abstraction of the container runtime renders K8s unaware of the status of the downloading process of the container images, obstructing the monitoring of the resources allocated to such process. In this paper, we discuss how this lack of status information can be exploited as a Denial of Service attack in a K8s cluster. We show how such attacks can impact worker nodes, generating up to 95% average CPU usage, prevent downloads of new container images, and increase I/O and network usage for a potentially unlimited amount of time. We argue that solving this problem would require a radical architectural change in the relationship between K8s and the CRI-API, which would be unfeasible in the short term. Thus, as a stopgap solution, we propose MAGI: an eBPF-based, proof-of-concept mitigation that detects and terminates potential attacks.

Updated: 2025-08-08 08:32:51

Domains: cs.CR

Download: http://arxiv.org/abs/2401.10582v2

Ensemble-Based Graph Representation of fMRI Data for Cognitive Brain State Classification

Understanding and classifying human cognitive brain states based on neuroimaging data remains one of the foremost and most challenging problems in neuroscience, owing to the high dimensionality and intrinsic noise of the signals. In this work, we propose an ensemble-based graph representation method of functional magnetic resonance imaging (fMRI) data for the task of binary brain-state classification. Our method builds the graph by leveraging multiple base machine-learning models: each edge weight reflects the difference in posterior probabilities between two cognitive states, yielding values in the range [-1, 1] that encode confidence in a given state. We applied this approach to seven cognitive tasks from the Human Connectome Project (HCP 1200 Subject Release), including working memory, gambling, motor activity, language, social cognition, relational processing, and emotion processing. Using only the mean incident edge weights of the graphs as features, a simple logistic-regression classifier achieved average accuracies from 97.07% to 99.74%. We also compared our ensemble graphs with classical correlation-based graphs in a classification task with a graph neural network (GNN). In all experiments, the highest classification accuracy was obtained with ensemble graphs. These results demonstrate that ensemble graphs convey richer topological information and enhance brain-state discrimination. Our approach preserves edge-level interpretability of the fMRI graph representation, is adaptable to multiclass and regression tasks, and can be extended to other neuroimaging modalities and pathological-state classification.
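A minimal sketch of the graph construction, under one assumption not spelled out in the abstract: that a base model per region pair outputs P(state A) for a scan, so the edge weight P(A) - P(B) = 2 P(A) - 1 lies in [-1, 1]; the per-node feature is then the mean incident edge weight fed to the logistic-regression classifier:

```python
def ensemble_graph(posteriors):
    """posteriors[i][j]: P(state A) from the base model for region pair
    (i, j) (assumed per-edge base-model setup). Edge weight is the
    posterior difference P(A) - P(B) = 2 P(A) - 1, encoding confidence
    in state A on a [-1, 1] scale."""
    n = len(posteriors)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = 2.0 * posteriors[i][j] - 1.0
    return w

def mean_incident_weights(w):
    """Per-node feature used by the simple classifier in the abstract:
    the mean weight over edges incident to each node."""
    n = len(w)
    return [sum(w[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]
```

For a binary brain-state decision, the vector of mean incident weights (one entry per region) would be the input to logistic regression.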

Updated: 2025-08-08 08:32:46

Domains: q-bio.NC,cs.LG

Download: http://arxiv.org/abs/2508.06118v1

Probabilistic Foundations for Metacognition via Hybrid-AI

Metacognition is the concept of reasoning about an agent's own internal processes, and it has recently received renewed attention with respect to artificial intelligence (AI) and, more specifically, machine learning systems. This paper reviews a hybrid-AI approach known as "error detecting and correcting rules" (EDCR) that allows for the learning of rules to correct perceptual (e.g., neural) models. Additionally, we introduce a probabilistic framework that adds rigor to prior empirical studies, and we use this framework to prove results on necessary and sufficient conditions for metacognitive improvement, as well as limits to the approach. A set of future directions is also discussed.

Updated: 2025-08-08 08:24:32

Domains: cs.AI

Download: http://arxiv.org/abs/2502.05398v3

The Docking Game: Loop Self-Play for Fast, Dynamic, and Accurate Prediction of Flexible Protein-Ligand Binding

Molecular docking is a crucial aspect of drug discovery, as it predicts the binding interactions between small-molecule ligands and protein pockets. However, current multi-task learning models for docking often show inferior performance in ligand docking compared to protein pocket docking. This disparity arises largely due to the distinct structural complexities of ligands and proteins. To address this issue, we propose a novel game-theoretic framework that models the protein-ligand interaction as a two-player game called the Docking Game, with the ligand docking module acting as the ligand player and the protein pocket docking module as the protein player. To solve this game, we develop a novel Loop Self-Play (LoopPlay) algorithm, which alternately trains these players through a two-level loop. In the outer loop, the players exchange predicted poses, allowing each to incorporate the other's structural predictions, which fosters mutual adaptation over multiple iterations. In the inner loop, each player dynamically refines its predictions by incorporating its own predicted ligand or pocket poses back into its model. We theoretically show the convergence of LoopPlay, ensuring stable optimization. Extensive experiments conducted on public benchmark datasets demonstrate that LoopPlay achieves approximately a 10% improvement in predicting accurate binding modes compared to previous state-of-the-art methods. This highlights its potential to enhance the accuracy of molecular docking in drug discovery.
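The two-level LoopPlay structure can be shown with toy players. The real ligand and pocket modules are neural docking models; here they are replaced by simple averaging updates whose fixed point is agreement on a shared pose, purely to illustrate the outer exchange / inner refinement pattern:

```python
def loop_play(ligand_step, pocket_step, ligand0, pocket0,
              outer_iters=5, inner_iters=3):
    """Skeleton of the two-level LoopPlay loop: in the outer loop the two
    players exchange predicted poses; in the inner loop each refines its
    own prediction conditioned on the peer's (frozen) pose. The step
    functions are toy stand-ins for the real docking modules."""
    ligand, pocket = ligand0, pocket0
    for _ in range(outer_iters):
        peer_pocket, peer_ligand = pocket, ligand  # exchange predictions
        for _ in range(inner_iters):
            ligand = ligand_step(ligand, peer_pocket)
            pocket = pocket_step(pocket, peer_ligand)
    return ligand, pocket

def toy_step(own, peer):
    # toy "model": move halfway toward the peer's exchanged pose
    return 0.5 * (own + peer)

# poses represented as single numbers; they converge toward agreement
final_ligand, final_pocket = loop_play(toy_step, toy_step, 0.0, 10.0)
```

Each outer round shrinks the disagreement between the two toy players by a fixed factor, mirroring the mutual-adaptation convergence argument sketched in the abstract.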

Updated: 2025-08-08 08:23:13

Domains: cs.AI

Download: http://arxiv.org/abs/2508.05006v2

SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.
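The ranking side of SKATE can be illustrated with an Elo-style update as a simple stand-in for the TrueSkill system named in the abstract (TrueSkill additionally tracks a per-player variance); the K-factor and initial rating below are arbitrary choices:

```python
def elo_update(r_winner, r_loser, k=16.0):
    """One pairwise rating update after a verifiable challenge outcome
    (Elo used here as a simplified stand-in for TrueSkill)."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

def rank(models, match_results):
    """models: list of names; match_results: iterable of (winner, loser)
    pairs, e.g. decided by who solves the other's generated task."""
    ratings = {m: 1000.0 for m in models}
    for winner, loser in match_results:
        ratings[winner], ratings[loser] = elo_update(
            ratings[winner], ratings[loser])
    return sorted(ratings, key=ratings.get, reverse=True)
```

In SKATE's setting the "match" outcome comes from objectively scored, verifiable tasks (such as code-output prediction), so the rating update needs no LLM judge.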

Updated: 2025-08-08 08:16:40

标题: SKATE,一种可扩展的锦标赛评估方法:较弱的LLMs使用可验证的挑战来区分较强的LLMs

摘要: 评估基础模型的能力和风险至关重要,然而当前的方法要求广泛的领域专业知识,随着这些模型的迅速发展,其可扩展性受到阻碍。我们引入了SKATE:一种新颖的评估框架,其中大型语言模型(LLMs)通过为彼此生成和解决可验证的任务来竞争。我们的核心见解是将评估视为一种游戏:模型既是任务设置者又是解决者,激励它们创建突出自身优势并暴露他人弱点的问题。SKATE提供了几个关键优势,平衡了可扩展性、开放性和客观性。它是完全自动化的、无需数据的和可扩展的,不需要人类输入或领域专业知识。通过使用可验证的任务而不是LLM评委,评分是客观的。与限于特定领域的程序生成的基准(例如下棋或空间推理)不同,让LLMs创造性地提出挑战可以实现开放和可扩展的评估。作为概念验证,我们引入了由LLM出题的代码输出预测(COP)挑战,作为一个可验证且可扩展的框架来测试我们的方法。使用基于TrueSkill的排名系统,我们评估了六个前沿的LLM,并发现:(1)较弱的模型能够可靠地区分和评分较强的模型,(2)基于LLM的系统能够表现出自我偏好的行为,生成与其自身能力相符的问题,以及(3)SKATE自动展现了模型之间的细微能力差异。我们的研究结果是迈向与LLM进展同步的通用、可扩展的评估框架的重要一步。

更新时间: 2025-08-08 08:16:40

领域: cs.AI

下载: http://arxiv.org/abs/2508.06111v1
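The tournament-rating idea behind SKATE can be illustrated with a rating update over match outcomes, where "winning" a match means solving a task the opponent set. The paper uses TrueSkill; as a dependency-free, simplified stand-in this sketch uses Elo updates, and the model names and match log are hypothetical.

```python
# Simplified Elo stand-in for SKATE's TrueSkill-based ranking.
# A match outcome encodes whether a model solved a task set by its opponent.

def elo_update(r_winner, r_loser, k=32.0):
    # Expected win probability for the winner given the current ratings.
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Hypothetical match log: model_a solved more of model_b's tasks than vice versa.
matches = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
for winner, loser in matches:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
```

Because task solutions are verifiable, each match outcome is objective, and the rating system aggregates them into a capability ordering without any human judging.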

PanelTR: Zero-Shot Table Reasoning Framework Through Multi-Agent Scientific Discussion

Table reasoning, including tabular QA and fact verification, often depends on annotated data or complex data augmentation, limiting flexibility and generalization. LLMs, despite their versatility, often underperform compared to simple supervised models. To approach these issues, we introduce PanelTR, a framework utilizing LLM agent scientists for robust table reasoning through a structured scientific approach. PanelTR's workflow involves agent scientists conducting individual investigations, engaging in self-review, and participating in collaborative peer-review discussions. This process, driven by five scientist personas, enables semantic-level transfer without relying on data augmentation or parametric optimization. Experiments across four benchmarks show that PanelTR outperforms vanilla LLMs and rivals fully supervised models, all while remaining independent of training data. Our findings indicate that structured scientific methodology can effectively handle complex tasks beyond table reasoning with flexible semantic understanding in a zero-shot context.

Updated: 2025-08-08 08:15:52

标题: PanelTR:通过多智能体科学讨论实现零样本表格推理的框架

摘要: 表格推理,包括表格问答和事实验证,通常依赖于标注数据或复杂的数据增强,限制了灵活性和泛化能力。尽管LLMs具有多功能性,但通常表现不如简单的监督模型。为了解决这些问题,我们引入了PanelTR,这是一个利用LLM代理科学家、通过结构化科学方法进行稳健表格推理的框架。PanelTR的工作流程涉及代理科学家进行个别调查,进行自我审查,并参与协作的同行评审讨论。这一由五种科学家角色驱动的过程,使语义级别的迁移成为可能,而无需依赖数据增强或参数优化。在四个基准测试中的实验证明,PanelTR的表现优于普通的LLMs,并与完全监督模型不相上下,同时保持独立于训练数据。我们的发现表明,结构化科学方法可以有效处理表格推理之外的复杂任务,在零样本情境下具有灵活的语义理解能力。

更新时间: 2025-08-08 08:15:52

领域: cs.AI,cs.MA

下载: http://arxiv.org/abs/2508.06110v1

FMCE-Net++: Feature Map Convergence Evaluation and Training

Deep Neural Networks (DNNs) face interpretability challenges due to their opaque internal representations. While Feature Map Convergence Evaluation (FMCE) quantifies module-level convergence via Feature Map Convergence Scores (FMCS), it lacks experimental validation and closed-loop integration. To address this limitation, we propose FMCE-Net++, a novel training framework that integrates a pretrained, frozen FMCE-Net as an auxiliary head. This module generates FMCS predictions, which, combined with task labels, jointly supervise backbone optimization through a Representation Auxiliary Loss (RAL). The RAL dynamically balances the primary classification loss and feature convergence optimization via a tunable Representation Abstraction Factor. Extensive experiments conducted on MNIST, CIFAR-10, FashionMNIST, and CIFAR-100 demonstrate that FMCE-Net++ consistently enhances model performance without architectural modifications or additional data. Key experimental outcomes include accuracy gains of $+1.16$ pp (ResNet-50/CIFAR-10) and $+1.08$ pp (ShuffleNet v2/CIFAR-100), validating that FMCE-Net++ can effectively elevate state-of-the-art performance ceilings.

Updated: 2025-08-08 08:15:26

标题: FMCE-Net++:特征图收敛评估和训练

摘要: 深度神经网络(DNNs)因其不透明的内部表示而面临可解释性挑战。虽然特征图收敛评估(FMCE)通过特征图收敛分数(FMCS)量化模块级收敛,但缺乏实验验证和闭环集成。为了解决这一限制,我们提出了FMCE-Net++,这是一个新颖的训练框架,将一个预训练的、冻结的FMCE-Net作为辅助头集成进来。这个模块生成FMCS预测,结合任务标签,共同通过表示辅助损失(RAL)监督主干优化。RAL通过可调的表示抽象因子动态平衡主要分类损失和特征收敛优化。在MNIST、CIFAR-10、FashionMNIST和CIFAR-100上进行的大量实验表明,FMCE-Net++在不修改架构或添加数据的情况下持续增强模型性能。关键的实验结果包括+1.16个百分点(ResNet-50/CIFAR-10)和+1.08个百分点(ShuffleNet v2/CIFAR-100)的准确率增益,验证了FMCE-Net++能够有效提升最先进性能的上限。

更新时间: 2025-08-08 08:15:26

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.06109v1
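The shape of the Representation Auxiliary Loss described above can be sketched as a weighted combination: the primary classification loss plus a feature-convergence term scaled by the tunable abstraction factor. The function name and the MSE surrogate for the convergence term are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of an RAL-style combined objective. `factor` plays the role of the
# tunable Representation Abstraction Factor; the squared-error term is a
# hypothetical surrogate for the feature-convergence loss.

def representation_auxiliary_loss(cls_loss, fmcs_pred, fmcs_target, factor=0.1):
    convergence_loss = (fmcs_pred - fmcs_target) ** 2
    return cls_loss + factor * convergence_loss

# Example: blend a classification loss of 0.8 with a convergence penalty.
total = representation_auxiliary_loss(
    cls_loss=0.8, fmcs_pred=0.6, fmcs_target=0.9, factor=0.5
)
```

Raising the factor shifts optimization pressure from the classification objective toward feature convergence, which is the balance the framework tunes.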

GCHR : Goal-Conditioned Hindsight Regularization for Sample-Efficient Reinforcement Learning

Goal-conditioned reinforcement learning (GCRL) with sparse rewards remains a fundamental challenge in reinforcement learning. While hindsight experience replay (HER) has shown promise by relabeling collected trajectories with achieved goals, we argue that trajectory relabeling alone does not fully exploit the available experiences in off-policy GCRL methods, resulting in limited sample efficiency. In this paper, we propose Hindsight Goal-conditioned Regularization (HGR), a technique that generates action regularization priors based on hindsight goals. When combined with hindsight self-imitation regularization (HSR), our approach enables off-policy RL algorithms to maximize experience utilization. Compared to existing GCRL methods that employ HER and self-imitation techniques, our hindsight regularizations achieve substantially more efficient sample reuse and the best performances, which we empirically demonstrate on a suite of navigation and manipulation tasks.

Updated: 2025-08-08 08:12:14

标题: GCHR:目标条件事后正则化实现高样本效率的强化学习

摘要: 稀疏奖励下的目标条件强化学习(GCRL)仍然是强化学习中的一个基本挑战。虽然事后经验重放(HER)通过用实际达成的目标重新标记收集到的轨迹显示出了潜力,但我们认为仅靠轨迹重新标记并不能充分利用离策略GCRL方法中可用的经验,导致样本效率有限。在本文中,我们提出了事后目标条件正则化(HGR),一种基于事后目标生成动作正则化先验的技术。当与事后自我模仿正则化(HSR)结合使用时,我们的方法使离策略RL算法能够最大化经验利用。与采用HER和自我模仿技术的现有GCRL方法相比,我们的事后正则化实现了更高效的样本重用和最佳性能,这些结果在一系列导航和操纵任务中得到了实证验证。

更新时间: 2025-08-08 08:12:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.06108v1
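The hindsight relabeling mechanism that HGR builds on can be sketched on a toy 1-D task: a trajectory that never reached its original goal is relabeled with the goal it actually achieved, turning all-zero sparse rewards into a useful learning signal. The tuple layout and reward rule here are illustrative assumptions.

```python
# Toy hindsight experience replay (HER) relabeling. Transitions are
# (state, action, next_state, goal); the sparse reward is 1 only when the
# goal state is reached.

def her_relabel(trajectory):
    achieved_goal = trajectory[-1][2]  # hindsight goal: final achieved state
    relabeled = []
    for state, action, next_state, _ in trajectory:
        reward = 1.0 if next_state == achieved_goal else 0.0
        relabeled.append((state, action, next_state, achieved_goal, reward))
    return relabeled

# Agent aimed for goal 5 but only reached state 3: every original reward is 0.
traj = [(0, +1, 1, 5), (1, +1, 2, 5), (2, +1, 3, 5)]
relabeled = her_relabel(traj)
```

HGR goes further by also using such hindsight goals to build action regularization priors, rather than only relabeling the replay buffer.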

Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention

Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.

Updated: 2025-08-08 08:11:36

标题: 遮罩与匹配:学习使用自监督注意力识别手写数学

摘要: 手写数学表达式识别(HMER)是一项具有挑战性的任务,这是由于其固有的二维结构、不同符号尺度以及符号之间复杂的空间关系。本文提出了一个自监督学习(SSL)框架用于HMER,从而消除了对昂贵标注数据的需求。我们的方法首先通过全局和局部对比损失的组合对图像编码器进行预训练,使模型能够学习到整体和细粒度的表示。本文的一个关键贡献是一种新颖的自监督注意力网络,该网络使用渐进式空间遮罩策略进行训练。该注意力机制旨在学习语义上有意义的焦点区域,如运算符、指数和嵌套数学符号,而无需任何监督。渐进式遮罩课程鼓励网络对缺失或遮挡的视觉信息变得越来越鲁棒,最终提高结构理解能力。我们的完整流程包括(1)编码器的自监督预训练、(2)自监督注意力学习,以及(3)使用Transformer解码器进行有监督微调以生成LaTeX序列。在CROHME基准测试上进行的大量实验表明,我们的方法优于现有的SSL和完全监督的基线,验证了我们的渐进式注意力机制在提升HMER性能方面的有效性。我们的代码库可以在这里找到。

更新时间: 2025-08-08 08:11:36

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.06107v1
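The progressive masking curriculum above can be sketched as a mask-ratio schedule plus a random spatial mask applied to an image array. The linear schedule, the start/end ratios, and pixel-level (rather than patch-level) masking are simplifying assumptions for illustration.

```python
import numpy as np

# Sketch of a progressive spatial-masking curriculum: the masked fraction
# grows with the training step, so later inputs are harder.

def mask_ratio(step, total_steps, start=0.1, end=0.6):
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def apply_random_mask(image, ratio, rng):
    # Zero out `ratio` of the pixels at random positions (a stand-in for
    # masking spatial patches).
    flat = image.reshape(-1).copy()
    n_mask = int(round(ratio * flat.size))
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    flat[idx] = 0.0
    return flat.reshape(image.shape)

rng = np.random.default_rng(0)
img = np.ones((8, 8))
masked = apply_random_mask(img, mask_ratio(step=500, total_steps=1000), rng)
```

Halfway through training the schedule sits at a 0.35 ratio, so 22 of the 64 pixels are zeroed in this example.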

Simulation in Cybersecurity: Understanding Techniques, Applications, and Goals

Modeling and simulation are widely used in cybersecurity research to assess cyber threats, evaluate defense mechanisms, and analyze vulnerabilities. However, the diversity of application areas, the variety of cyberattacks scenarios, and the differing objectives of these simulations makes it difficult to identify methodological trends. Existing reviews often focus on specific modeling techniques or application domains, making it challenging to analyze the field as a whole. To address these limitations, we present a comprehensive review of the current state of the art, classifying the selected papers based on four dimensions: the application domain, the types of cyber threats represented, the simulation techniques employed, and the primary goals of the simulation. The review discusses the strengths and limitations of different approaches, identifies which cyber threats are the most suited for simulation-based investigations, and analyzes which modeling paradigms are most appropriate for specific cybersecurity challenges.

Updated: 2025-08-08 08:11:13

标题: 网络安全中的模拟:理解技术、应用和目标

摘要: 建模和模拟在网络安全研究中被广泛应用,用于评估网络威胁,评估防御机制,并分析漏洞。然而,应用领域的多样性,各种网络攻击场景以及这些模拟的不同目标使得很难确定方法论趋势。现有的综述通常侧重于特定的建模技术或应用领域,这使得对整个领域进行分析变得具有挑战性。为了解决这些局限性,我们提出了一份关于现有技术水平的综合审查,根据四个维度对所选论文进行分类:应用领域、代表的网络威胁类型、使用的模拟技术以及模拟的主要目标。该审查讨论了不同方法的优势和局限性,确定了哪些网络威胁最适合基于模拟的研究,并分析了哪种建模范式最适合特定的网络安全挑战。

更新时间: 2025-08-08 08:11:13

领域: cs.CR

下载: http://arxiv.org/abs/2508.06106v1

Towards More Realistic Extraction Attacks: An Adversarial Perspective

Language models are prone to memorizing their training data, making them vulnerable to extraction attacks. While existing research often examines isolated setups, such as a single model or a fixed prompt, real-world adversaries have a considerably larger attack surface due to access to models across various sizes and checkpoints, and repeated prompting. In this paper, we revisit extraction attacks from an adversarial perspective -- with multi-faceted access to the underlying data. We find significant churn in extraction trends, i.e., even unintuitive changes to the prompt, or targeting smaller models and earlier checkpoints, can extract distinct information. By combining multiple attacks, our adversary doubles ($2 \times$) the extraction risks, persisting even under mitigation strategies like data deduplication. We conclude with four case studies, including detecting pre-training data, copyright violations, extracting personally identifiable information, and attacking closed-source models, showing how our more realistic adversary can outperform existing adversaries in the literature.

Updated: 2025-08-08 08:09:37

标题: 朝着更现实的提取攻击前进:一个对抗性的视角

摘要: 语言模型容易记忆其训练数据,使其容易受到提取攻击。尽管现有研究通常考察单个模型或固定提示等孤立的设置,但真实世界中的对手拥有大得多的攻击面,因为他们可以访问各种规模和检查点的模型,并进行反复提示。在本文中,我们从对抗性的角度重新审视提取攻击,即对手对底层数据拥有多方面的访问。我们发现提取趋势存在显著的变化,即使对提示进行不直观的更改,或者针对较小的模型和较早的检查点,也可以提取出不同的信息。通过组合多种攻击,我们的对手可以将提取风险翻倍(2倍),并且即使在数据去重等缓解策略下该风险依然存在。我们以四个案例研究作结,包括检测预训练数据、版权侵犯、提取个人可识别信息以及攻击闭源模型,展示了我们这种更现实的对手如何超越文献中的现有对手。

更新时间: 2025-08-08 08:09:37

领域: cs.CR,cs.CL,cs.LG

下载: http://arxiv.org/abs/2407.02596v3

Measuring Dependencies between Biological Signals with Self-supervision, and its Limitations

Measuring the statistical dependence between observed signals is a primary tool for scientific discovery. However, biological systems often exhibit complex non-linear interactions that currently cannot be captured without a priori knowledge regarding the nature of dependence. We introduce a self-supervised approach, concurrence, which is inspired by the observation that if two signals are dependent, then one should be able to distinguish between temporally aligned vs. misaligned segments extracted from them. Experiments with fMRI, physiological and behavioral signals show that, to our knowledge, concurrence is the first approach that can expose relationships across such a wide spectrum of signals and extract scientifically relevant differences without ad-hoc parameter tuning or reliance on a priori information, providing a potent tool for scientific discoveries across fields. However, dependencies caused by extraneous factors remain an open problem, thus researchers should validate that exposed relationships truly pertain to the question(s) of interest.

Updated: 2025-08-08 08:09:20

标题: 用自我监督方法测量生物信号之间的依赖关系及其局限性

摘要: 衡量观测信号之间的统计依赖性是科学发现的主要工具。然而,生物系统通常表现出复杂的非线性相互作用,在缺乏有关依赖性质的先验知识的情况下,目前尚无法捕捉这些相互作用。我们引入了一种自监督方法,称为"共现"(concurrence),其灵感来自于这样的观察:如果两个信号是相关的,那么应该能够区分从中提取的时间对齐片段和未对齐片段。使用fMRI、生理和行为信号进行的实验表明,据我们所知,共现是第一种能够在如此广泛的信号谱系中揭示关系并提取科学相关差异的方法,而无需临时参数调整或依赖先验信息,为跨领域的科学发现提供了强大工具。然而,由外部因素引起的依赖性仍然是一个悬而未决的问题,因此研究人员应验证所揭示的关系是否真正与感兴趣的问题有关。

更新时间: 2025-08-08 08:09:20

领域: eess.SP,cs.LG

下载: http://arxiv.org/abs/2508.02703v2
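The core observation behind concurrence, that dependent signals make temporally aligned segment pairs distinguishable from misaligned ones, can be sketched with synthetic data. A plain per-segment correlation score stands in for the learned discriminator here; the signal model, segment length, and shift are all illustrative assumptions.

```python
import numpy as np

# Sketch of the aligned-vs-misaligned idea: for dependent signals, segments
# cut at the same times should be easier to "match" than shifted ones.

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
y = x + 0.1 * rng.standard_normal(2000)  # y depends on x (toy dependence)

def segment_score(a, b, seg_len=100):
    # Mean absolute correlation over consecutive, same-time segment pairs;
    # a stand-in for a trained aligned-vs-misaligned classifier.
    scores = []
    for start in range(0, len(a) - seg_len + 1, seg_len):
        sa = a[start:start + seg_len]
        sb = b[start:start + seg_len]
        scores.append(abs(np.corrcoef(sa, sb)[0, 1]))
    return float(np.mean(scores))

aligned = segment_score(x, y)
misaligned = segment_score(x, np.roll(y, 500))  # break temporal alignment
```

The gap between the aligned and misaligned scores is the self-supervised signal: if no classifier (or statistic) can separate the two, the signals carry no detectable dependence.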

From research to clinic: Accelerating the translation of clinical decision support systems by making synthetic data interoperable

The translation of clinical decision support system (CDSS) tools from research settings into the clinic is often non-existent, partly because the focus tends to be on training machine learning models rather than tool development using the model for inference. To develop a CDSS tool that can be deployed in the clinical workflow, there is a need to integrate, validate, and test the tool on the Electronic Health Record (EHR) systems that store and manage patient data. Not surprisingly, it is rarely possible for researchers to get the necessary access to an EHR system due to legal restrictions pertaining to the protection of data privacy in patient records. We propose an architecture for using synthetic data in EHR systems to make CDSS tool development and testing much easier. In this study, the architecture is implemented in the SyntHIR system. SyntHIR has three noteworthy architectural features enabling (i) integration with synthetic data generators, (ii) data interoperability, and (iii) tool transportability. The translational value of this approach was evaluated through two primary steps. First, a working proof-of-concept of a machine learning-based CDSS tool was developed using data from patient registries in Norway. Second, the transportability of this CDSS tool was demonstrated by successfully deploying it in Norway's largest EHR system vendor (DIPS). These findings showcase the value of the SyntHIR architecture as a useful reference model to accelerate the translation of "bench to bedside" research of CDSS tools.

Updated: 2025-08-08 08:07:29

标题: 从研究到临床:通过使合成数据可互操作加速临床决策支持系统的转化

摘要: 将临床决策支持系统(CDSS)工具从研究环境转化到临床实践的工作往往并未发生,部分原因在于重点往往放在训练机器学习模型上,而不是基于模型进行推理的工具开发。为了开发一个可以部署在临床工作流程中的CDSS工具,需要在存储和管理患者数据的电子健康记录(EHR)系统上集成、验证和测试该工具。毫不奇怪,由于涉及保护患者记录数据隐私的法律限制,研究人员很少有可能获得必要的EHR系统访问权限。我们提出了一种在EHR系统中使用合成数据的架构,使CDSS工具的开发和测试变得容易得多。在这项研究中,该架构已在SyntHIR系统中实施。SyntHIR具有三个值得注意的架构特点:(i)与合成数据生成器的集成、(ii)数据互操作性和(iii)工具可移植性。通过两个主要步骤评估了该方法的转化价值。首先,使用挪威的患者登记数据开发了一个基于机器学习的CDSS工具的工作概念验证。其次,通过成功在挪威最大的EHR系统供应商(DIPS)中部署该CDSS工具,展示了该工具的可移植性。这些发现展示了SyntHIR架构作为一个有用的参考模型,可以加速CDSS工具"从实验室到床边"的研究转化过程。

更新时间: 2025-08-08 08:07:29

领域: cs.LG,cs.AI,cs.SE

下载: http://arxiv.org/abs/2308.02613v2

CAST: Cross Attention based multimodal fusion of Structure and Text for materials property prediction

Recent advancements in graph neural networks (GNNs) have significantly enhanced the prediction of material properties by modeling crystal structures as graphs. However, GNNs often struggle to capture global structural characteristics, such as crystal systems, limiting their predictive performance. To overcome this issue, we propose CAST, a cross-attention-based multimodal model that integrates graph representations with textual descriptions of materials, effectively preserving critical structural and compositional information. Unlike previous approaches, such as CrysMMNet and MultiMat, which rely on aggregated material-level embeddings, CAST leverages cross-attention mechanisms to combine fine-grained graph node-level and text token-level features. Additionally, we introduce a masked node prediction pretraining strategy that further enhances the alignment between node and text embeddings. Our experimental results demonstrate that CAST outperforms existing baseline models across four key material properties-formation energy, band gap, bulk modulus, and shear modulus-with average relative MAE improvements ranging from 10.2% to 35.7%. Analysis of attention maps confirms the importance of pretraining in effectively aligning multimodal representations. This study underscores the potential of multimodal learning frameworks for developing more accurate and globally informed predictive models in materials science.

Updated: 2025-08-08 08:06:09

标题: CAST:基于交叉注意力的结构和文本多模态融合用于材料性能预测

摘要: 图神经网络(GNNs)的最新进展通过将晶体结构建模为图,显著提高了材料性质预测的能力。然而,GNNs通常难以捕捉全局结构特征,如晶系,从而限制了它们的预测性能。为了解决这个问题,我们提出了CAST,这是一个基于交叉注意力的多模态模型,将图表示与材料的文本描述结合起来,有效地保留了关键的结构和组成信息。与先前依赖于汇总的材料级嵌入的方法(如CrysMMNet和MultiMat)不同,CAST利用交叉注意力机制来结合细粒度的图节点级和文本标记级特征。此外,我们引入了一个掩码节点预测预训练策略,进一步增强了节点和文本嵌入之间的对齐。我们的实验结果表明,CAST在形成能、带隙、体模量和剪切模量这四个关键材料性质上均优于现有的基准模型,平均相对MAE改进范围从10.2%到35.7%不等。注意力图的分析证实了预训练在有效对齐多模态表示中的重要性。这项研究强调了多模态学习框架在发展更准确、更具全局信息的材料科学预测模型方面的潜力。

更新时间: 2025-08-08 08:06:09

领域: cs.LG,cond-mat.mtrl-sci,cs.AI

下载: http://arxiv.org/abs/2502.06836v2
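The node-to-token cross-attention fusion that CAST describes can be sketched in a few lines: graph node features form the queries, text token features form the keys and values. The dimensions and random projection weights here are illustrative; in the real model these projections are learned.

```python
import numpy as np

# Minimal single-head cross-attention sketch: graph nodes attend over text
# tokens. Weight matrices are random stand-ins for learned projections.

def cross_attention(nodes, tokens, rng):
    d = nodes.shape[1]
    w_q = rng.standard_normal((d, d)) / np.sqrt(d)
    w_k = rng.standard_normal((tokens.shape[1], d)) / np.sqrt(d)
    w_v = rng.standard_normal((tokens.shape[1], d)) / np.sqrt(d)
    q, k, v = nodes @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(d)                     # (n_nodes, n_tokens)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over tokens
    return attn @ v, attn

rng = np.random.default_rng(0)
nodes = rng.standard_normal((5, 8))    # 5 graph nodes, dim 8
tokens = rng.standard_normal((12, 16)) # 12 text tokens, dim 16
fused, attn = cross_attention(nodes, tokens, rng)
```

Each node's fused vector is a token-weighted summary of the text, which is what lets fine-grained node features pick up global descriptions such as the crystal system.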

SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning

Large language models are increasingly used for complex reasoning tasks where high-quality offline data such as expert-annotated solutions and distilled reasoning traces are often available. However, in environments with sparse rewards, reinforcement learning struggles to sample successful trajectories, leading to inefficient learning. At the same time, these offline trajectories that represent correct reasoning paths are not utilized by standard on-policy reinforcement learning methods. We introduce SuperRL, a unified training framework that adaptively alternates between RL and SFT. Whenever every rollout for a given instance receives zero reward, indicating the absence of a learning signal, SuperRL falls back to SFT on the curated offline data. Extensive experiments across diverse reasoning benchmarks show that SuperRL surpasses vanilla RL by delivering higher sample efficiency, stronger generalization, and improved robustness under sparse rewards.

Updated: 2025-08-08 08:03:03

标题: SuperRL:利用监督来增强语言模型推理的强化学习

摘要: 大型语言模型越来越多地用于复杂的推理任务,在这些任务中通常可以获得高质量的离线数据,例如专家标注的解决方案和蒸馏的推理轨迹。然而,在奖励稀疏的环境中,强化学习往往难以采样到成功的轨迹,导致学习效率低下。与此同时,这些代表正确推理路径的离线轨迹并未被标准的在线策略强化学习方法利用。我们引入了SuperRL,一个在RL和SFT之间自适应交替的统一训练框架。每当某个实例的所有采样轨迹都获得零奖励,表明缺少学习信号时,SuperRL会回退到在精选离线数据上进行SFT。对各种推理基准的大量实验表明,SuperRL通过提供更高的样本效率、更强的泛化能力和在奖励稀疏条件下更好的鲁棒性,超越了普通的RL。

更新时间: 2025-08-08 08:03:03

领域: cs.AI

下载: http://arxiv.org/abs/2506.01096v2
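SuperRL's adaptive switch reduces to a simple per-instance decision rule: if every rollout earns zero reward, there is no RL learning signal, so fall back to SFT on the curated offline solution. The function and mode names below are illustrative, not the paper's API.

```python
# Sketch of SuperRL's per-instance RL-vs-SFT decision rule.

def choose_update(rollout_rewards, has_offline_solution=True):
    if all(r == 0 for r in rollout_rewards) and has_offline_solution:
        return "sft"   # no learning signal from RL: imitate offline data
    return "rl"        # at least one rewarded rollout: use the RL loss

mode_sparse = choose_update([0, 0, 0, 0])   # every rollout failed
mode_dense = choose_update([0, 1, 0, 0])    # one rollout succeeded
```

The alternation happens per instance rather than per epoch, so hard instances get supervised signal while easy ones keep training with RL.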

MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

Recent developments in diffusion- and flow- based models have significantly advanced Text-to-Audio Generation (TTA). While achieving great synthesis quality and controllability, current TTA systems still suffer from slow inference speed, which significantly limits their practical applicability. This paper presents MeanAudio, a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. Built on a Flux-style latent transformer, MeanAudio regresses the average velocity field during training, enabling fast generation by mapping directly from the start to the endpoint of the flow trajectory. By incorporating classifier-free guidance (CFG) into the training target, MeanAudio incurs no additional cost in the guided sampling process. To further stabilize training, we propose an instantaneous-to-mean curriculum with flow field mix-up, which encourages the model to first learn the foundational instantaneous dynamics, and then gradually adapt to mean flows. This strategy proves critical for enhancing training efficiency and generation quality. Experimental results demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also demonstrates strong performance in multi-step generation, enabling smooth and coherent transitions across successive synthesis steps.

Updated: 2025-08-08 07:49:59

标题: MeanAudio:使用均值流进行快速且忠实的文本到音频生成

摘要: 扩散和流模型的最新发展显著推进了文本到音频生成(TTA)。虽然当前的TTA系统实现了很高的合成质量和可控性,但仍然存在推理速度慢的问题,这显著限制了它们的实际适用性。本文介绍了MeanAudio,一种新颖的基于MeanFlow的模型,专门用于快速而忠实的文本到音频生成。MeanAudio建立在Flux风格的潜在Transformer上,在训练过程中回归平均速度场,通过直接从流轨迹的起点映射到终点来实现快速生成。通过将无分类器引导(CFG)纳入训练目标,MeanAudio在引导采样过程中不会产生额外成本。为了进一步稳定训练,我们提出了一种结合流场混合(mix-up)的从瞬时到均值的课程,鼓励模型首先学习基础的瞬时动态,然后逐渐适应均值流。这种策略对提高训练效率和生成质量至关重要。实验结果表明,MeanAudio在单步音频生成方面实现了最先进的性能。具体而言,它在单个NVIDIA RTX 3090上实现了0.013的实时因子(RTF),比最先进的基于扩散的TTA系统快100倍。此外,MeanAudio在多步生成方面也表现出色,能够在连续的合成步骤之间实现平滑连贯的过渡。

更新时间: 2025-08-08 07:49:59

领域: cs.SD,cs.AI

下载: http://arxiv.org/abs/2508.06098v1
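The real-time factor (RTF) metric quoted above is simply synthesis time divided by audio duration, so RTF < 1 means faster than real time. The concrete timings below are hypothetical numbers chosen to reproduce the reported 0.013 RTF and a 100x comparison against a slower baseline.

```python
# Sketch of the RTF metric and the speedup comparison it supports.

def real_time_factor(synthesis_seconds, audio_seconds):
    # RTF < 1 means the system generates audio faster than real time.
    return synthesis_seconds / audio_seconds

def speedup(baseline_rtf, new_rtf):
    return baseline_rtf / new_rtf

# Hypothetical timings: 10 s of audio synthesized in 0.13 s.
rtf = real_time_factor(synthesis_seconds=0.13, audio_seconds=10.0)
# Hypothetical baseline RTF of 1.3 for a slow diffusion-based system.
gain = speedup(baseline_rtf=1.3, new_rtf=rtf)
```

This makes explicit why a one-step MeanFlow sampler, which replaces a long iterative denoising trajectory with a single start-to-endpoint jump, translates directly into a lower RTF.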

Recurrent Deep Differentiable Logic Gate Networks

While differentiable logic gates have shown promise in feedforward networks, their application to sequential modeling remains unexplored. This paper presents the first implementation of Recurrent Deep Differentiable Logic Gate Networks (RDDLGN), combining Boolean operations with recurrent architectures for sequence-to-sequence learning. Evaluated on WMT'14 English-German translation, RDDLGN achieves 5.00 BLEU and 30.9\% accuracy during training, approaching GRU performance (5.41 BLEU) and graceful degradation (4.39 BLEU) during inference. This work establishes recurrent logic-based neural computation as viable, opening research directions for FPGA acceleration in sequential modeling and other recursive network architectures.

Updated: 2025-08-08 07:49:38

标题: 循环深层可微逻辑门网络

摘要: 尽管可微逻辑门在前馈网络中已展现出潜力,但其在序列建模中的应用尚未被探索。本文介绍了循环深度可微逻辑门网络(RDDLGN)的首次实现,将布尔运算与循环架构结合,用于序列到序列学习。在WMT'14英德翻译任务上评估,RDDLGN在训练期间实现了5.00 BLEU和30.9%的准确率,接近GRU性能(5.41 BLEU),并在推理过程中表现出优雅的性能退化(4.39 BLEU)。这项工作确立了基于逻辑的循环神经计算的可行性,开启了在序列建模和其他递归网络架构中进行FPGA加速研究的方向。

更新时间: 2025-08-08 07:49:38

领域: cs.LG

下载: http://arxiv.org/abs/2508.06097v1
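The building block that makes logic gate networks differentiable can be shown directly: with inputs in [0, 1] read as probabilities, Boolean operations become smooth real-valued functions, so gradients can flow through them during training. The small gate set below is an illustrative subset of such relaxations, not the paper's full gate vocabulary.

```python
# "Soft" relaxations of Boolean gates: exact on the {0, 1} corners,
# smooth and differentiable in between.

def soft_and(a, b):
    return a * b

def soft_or(a, b):
    return a + b - a * b

def soft_xor(a, b):
    return a + b - 2 * a * b

# At the corners the soft gates reproduce the Boolean truth tables,
# e.g. soft_xor(1, 0) == 1; between corners they interpolate smoothly,
# e.g. soft_and(0.5, 0.5) == 0.25.
corner = soft_xor(1.0, 0.0)
midpoint = soft_and(0.5, 0.5)
```

A trainable network then typically holds a soft (e.g. softmax-weighted) choice over such gates per node; after training, the choice is hardened to a discrete gate, which is what makes FPGA-style acceleration attractive.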

Learning to Match Unpaired Data with Minimum Entropy Coupling

Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which is a significant challenge to learn a joint distribution. A prominent approach to address the modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint Entropy, while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application for cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint Entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.

Updated: 2025-08-08 07:49:00

标题: 学习使用最小熵耦合匹配不成对的数据

摘要: 多模态数据是一种宝贵的资产,能够支持机器学习中各种下游任务。然而,跨不同模态收集的现实世界数据通常是不配对的,这对学习联合分布构成了重大挑战。解决模态耦合问题的一个著名方法是最小熵耦合(MEC),它旨在最小化联合熵,同时满足边际约束。现有解决MEC问题的方法主要关注有限的离散分布,限制了其在涉及连续数据情形下的应用。在这项工作中,我们提出了一种新颖的方法来解决连续MEC问题,利用知名的生成扩散模型,通过协作方案学习逼近并最小化联合熵,同时满足边际约束的一个放宽版本。我们通过实证证明,我们的方法DDMEC是通用的,可以轻松用于解决具有挑战性的任务,包括无监督单细胞多组学数据对齐和不配对图像转换,优于专门的方法。

更新时间: 2025-08-08 07:49:00

领域: cs.LG

下载: http://arxiv.org/abs/2503.08501v2

Fusing Cross-Domain Knowledge from Multimodal Data to Solve Problems in the Physical World

The proliferation of artificial intelligence has enabled a diversity of applications that bridge the gap between digital and physical worlds. As physical environments are too complex to model through a single information acquisition approach, it is crucial to fuse multimodal data generated by different sources, such as sensors, devices, systems, and people, to solve a problem in the real world. Unfortunately, it is neither applicable nor sustainable to deploy new resources to collect original data from scratch for every problem. Thus, when data is inadequate in the domain of problem, it is vital to fuse knowledge from multimodal data that is already available in other domains. We call this cross-domain knowledge fusion. Existing research focus on fusing multimodal data in a single domain, supposing the knowledge from different datasets is intrinsically aligned; however, this assumption may not hold in the scenarios of cross-domain knowledge fusion. In this paper, we formally define the cross-domain multimodal data fusion problem, discussing its unique challenges, differences and advantages beyond data fusion in a single domain. We propose a four-layer framework, consisting of Domains, Links, Models and Data layers, answering three key questions:"what to fuse", "why can be fused", and "how to fuse". The Domains Layer selects relevant data from different domains for a given problem. The Links Layer reveals the philosophy of knowledge alignment beyond specific model structures. The Models Layer provides two knowledge fusion paradigms based on the fundamental mechanisms for processing data. The Data Layer turns data of different structures, resolutions, scales and distributions into a consistent representation that can be fed into an AI model. With this framework, we can design solutions that fuse cross-domain multimodal data effectively for solving real-world problems.

Updated: 2025-08-08 07:46:45

标题: 融合跨领域知识,从多模态数据中解决物理世界问题

摘要: 人工智能的普及使得各种应用能够弥合数字和物理世界之间的差距。由于物理环境过于复杂,无法通过单一信息获取方法来建模,因此融合不同来源(如传感器、设备、系统和人)生成的多模态数据来解决现实世界中的问题至关重要。不幸的是,为每个问题从头开始部署新资源来收集原始数据既不适用也不可持续。因此,当问题领域中的数据不足时,融合其他领域已有多模态数据中的知识至关重要。我们称之为跨领域知识融合。现有研究侧重于在单一领域内融合多模态数据,假设来自不同数据集的知识在本质上是对齐的;然而,在跨领域知识融合的情景中,这种假设可能不成立。在本文中,我们正式定义了跨领域多模态数据融合问题,讨论了其相对于单一领域内数据融合的独特挑战、差异和优势。我们提出了一个四层框架,包括领域、链接、模型和数据层,回答了三个关键问题:“融合什么”、“为什么可以融合”和“如何融合”。领域层为给定问题选择不同领域的相关数据。链接层揭示超越具体模型结构的知识对齐理念。模型层基于处理数据的基本机制提供了两种知识融合范式。数据层将不同结构、分辨率、尺度和分布的数据转化为可输入AI模型的一致表示。通过这个框架,我们可以设计有效融合跨领域多模态数据的解决方案,以解决现实世界的问题。

更新时间: 2025-08-08 07:46:45

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.03155v2

MESAHA-Net: Multi-Encoders based Self-Adaptive Hard Attention Network with Maximum Intensity Projections for Lung Nodule Segmentation in CT Scan

Accurate lung nodule segmentation is crucial for early-stage lung cancer diagnosis, as it can substantially enhance patient survival rates. Computed tomography (CT) images are widely employed for early diagnosis in lung nodule analysis. However, the heterogeneity of lung nodules, size diversity, and the complexity of the surrounding environment pose challenges for developing robust nodule segmentation methods. In this study, we propose an efficient end-to-end framework, the multi-encoder-based self-adaptive hard attention network (MESAHA-Net), for precise lung nodule segmentation in CT scans. MESAHA-Net comprises three encoding paths, an attention block, and a decoder block, facilitating the integration of three types of inputs: CT slice patches, forward and backward maximum intensity projection (MIP) images, and region of interest (ROI) masks encompassing the nodule. By employing a novel adaptive hard attention mechanism, MESAHA-Net iteratively performs slice-by-slice 2D segmentation of lung nodules, focusing on the nodule region in each slice to generate 3D volumetric segmentation of lung nodules. The proposed framework has been comprehensively evaluated on the LIDC-IDRI dataset, the largest publicly available dataset for lung nodule segmentation. The results demonstrate that our approach is highly robust for various lung nodule types, outperforming previous state-of-the-art techniques in terms of segmentation accuracy and computational complexity, rendering it suitable for real-time clinical implementation.

Updated: 2025-08-08 07:44:22

标题: MESAHA-Net: 基于多编码器的自适应硬注意力网络,利用最大强度投影进行CT扫描中的肺结节分割

摘要: 准确的肺结节分割对于早期肺癌诊断至关重要,因为它可以显著提高患者的生存率。计算机断层扫描(CT)图像被广泛应用于肺结节分析的早期诊断。然而,肺结节的异质性、大小多样性以及周围环境的复杂性为开发稳健的结节分割方法带来挑战。在这项研究中,我们提出了一种高效的端到端框架,即基于多编码器的自适应硬注意力网络(MESAHA-Net),用于CT扫描中的精确肺结节分割。MESAHA-Net包括三个编码路径、一个注意力块和一个解码器块,有助于集成三种类型的输入:CT切片补丁、前向和后向最大强度投影(MIP)图像以及包含结节的感兴趣区域(ROI)掩模。通过采用新颖的自适应硬注意力机制,MESAHA-Net迭代地进行逐层2D肺结节分割,聚焦于每个切片中的结节区域,以生成肺结节的3D体积分割。所提出的框架已在LIDC-IDRI数据集上进行了全面评估,这是用于肺结节分割的最大公开数据集。结果表明,我们的方法对各种肺结节类型都高度稳健,在分割准确性和计算复杂度方面均优于先前的最先进技术,适合实时临床部署。

更新时间: 2025-08-08 07:44:22

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2304.01576v2
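One of the three inputs above, the maximum intensity projection (MIP), is easy to show concretely: it collapses a stack of CT slices by keeping the brightest voxel along the projection axis, which makes bright nodules stand out against surrounding tissue. The toy volume below is an illustrative stand-in for a real CT sub-volume.

```python
import numpy as np

# Sketch of a maximum intensity projection over a CT sub-volume of shape
# (slices, height, width).

def max_intensity_projection(volume, axis=0):
    return volume.max(axis=axis)

vol = np.zeros((4, 5, 5))
vol[2, 2, 2] = 7.0          # a single bright "nodule" voxel in one slice
mip = max_intensity_projection(vol)
```

Forward and backward MIPs in the paper are computed over slice ranges on either side of the slice being segmented, giving the network 3D context while it performs 2D slice-by-slice segmentation.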

Bounding Distributional Shifts in World Modeling through Novelty Detection

Recent work on visual world models shows significant promise in latent state dynamics obtained from pre-trained image backbones. However, most of the current approaches are sensitive to training quality, requiring near-complete coverage of the action and state space during training to prevent divergence during inference. To make a model-based planning algorithm more robust to the quality of the learned world model, we propose in this work to use a variational autoencoder as a novelty detector to ensure that proposed action trajectories during planning do not cause the learned model to deviate from the training data distribution. To evaluate the effectiveness of this approach, a series of experiments in challenging simulated robot environments was carried out, with the proposed method incorporated into a model-predictive control policy loop extending the DINO-WM architecture. The results clearly show that the proposed method improves over state-of-the-art solutions in terms of data efficiency.

Updated: 2025-08-08 07:42:14

标题: 通过新颖性检测界定世界建模中的分布转移

摘要: 最近关于视觉世界模型的研究显示,从预训练的图像骨干中获得的潜在状态动态表现出显著的潜力。然而,大多数当前方法对训练质量敏感,需要在训练期间对动作和状态空间进行接近完整的覆盖,以防止推理过程中的发散。为了使基于模型的规划算法对所学世界模型的质量更具鲁棒性,我们在这项工作中提出使用变分自编码器作为新颖性检测器,以确保规划过程中提出的动作轨迹不会导致学习到的模型偏离训练数据分布。为了评估这种方法的有效性,我们在具有挑战性的模拟机器人环境中进行了一系列实验,将所提出的方法纳入扩展了DINO-WM架构的模型预测控制策略循环中。结果清楚地显示,所提出的方法在数据效率方面优于当前最先进的解决方案。

更新时间: 2025-08-08 07:42:14

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2508.06096v1
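The novelty gate described above can be sketched without a trained VAE: any reconstruction-based detector that was fit on in-distribution data will assign high reconstruction error to states far from the training manifold, and the planner can reject trajectories that cross a threshold. As a dependency-light stand-in for the VAE, this sketch uses a mean/principal-subspace reconstruction; the data and threshold are illustrative.

```python
import numpy as np

# Reconstruction-error novelty detector: fit on training states, then score
# planner-proposed states. A PCA-style subspace stands in for the VAE here.

rng = np.random.default_rng(0)
# Training states lie (mostly) along one direction: scale axis 0 by 3, axis 1 by 0.1.
train = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.1]])

mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
basis = vt[:1]                      # dominant 1-D subspace of the training data

def novelty(state):
    centered = state - mean
    recon = centered @ basis.T @ basis   # project onto the training subspace
    return float(np.linalg.norm(centered - recon))

in_dist = novelty(np.array([2.5, 0.0]))    # along the training manifold
out_dist = novelty(np.array([0.0, 5.0]))   # far off the training manifold
```

During planning, an action trajectory whose predicted states exceed a novelty threshold would be discarded, which is what bounds the world model's distributional shift.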

CUB: Benchmarking Context Utilisation Techniques for Language Models

Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) - the first comprehensive benchmark designed to help practitioners within retrieval-augmented generation (RAG) diagnose CMTs under different context conditions. With this benchmark, we conduct the most extensive evaluation to date of seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results reveal that most existing CMTs struggle to handle the full spectrum of context types encountered in real-world retrieval-augmented scenarios. We also find that many CMTs display inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Our findings expose critical gaps in current CMT evaluation practices and demonstrate the need for holistic testing and the development of CMTs that can robustly handle multiple context types.

Updated: 2025-08-08 07:36:59

标题: CUB:语言模型上下文利用技术的基准测试

摘要: 纳入外部知识对于问答和事实核查等知识密集型任务至关重要。然而,语言模型(LMs)可能会忽略与过时的参数化记忆相矛盾的相关信息,或者被无关的上下文所干扰。虽然最近已经提出了许多上下文利用操纵技术(CMTs)来缓解这些问题,但很少有系统性的比较。在本文中,我们开发了CUB(上下文利用基准),这是第一个旨在帮助检索增强生成(RAG)领域的从业者在不同上下文条件下诊断CMTs的综合基准。通过这一基准,我们对代表CMTs主要类别的七种最先进方法,在三个不同的数据集和任务上、应用于九种LMs,进行了迄今为止最广泛的评估。我们的结果显示,大多数现有的CMTs难以处理现实世界检索增强场景中遇到的全部上下文类型。我们还发现,与包含自然样本的更真实数据集相比,许多CMTs在简单合成的数据集上表现出虚高的性能。我们的发现揭示了当前CMT评估实践中的关键差距,并表明需要进行全面测试,并开发能够稳健处理多种上下文类型的CMTs。

更新时间: 2025-08-08 07:36:59

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2505.16518v2

DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt

Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries while effectively preserving utility on benign ones. To address these challenges, we propose the Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image; it preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach that trains the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments across five benchmarks on two representative LVLMs demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits strong cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential components, jointly contributing to its overall effectiveness. The code is publicly available at https://github.com/zhangyitonggg/DAVSP.

Updated: 2025-08-08 07:35:47


Domains: cs.CR,cs.CV

Download: http://arxiv.org/abs/2506.09353v2

Aggregate-Combine-Readout GNNs Are More Expressive Than Logic C2

In recent years, there has been growing interest in understanding the expressive power of graph neural networks (GNNs) by relating them to logical languages. This research was initiated by an influential result of Barcel\'o et al. (2020), who showed that graded modal logic (a guarded fragment of the logic C2) characterises the logical expressiveness of aggregate-combine GNNs. As a ``challenging open problem'', they left open the question of whether full C2 characterises the logical expressiveness of aggregate-combine-readout GNNs. This question has remained unresolved despite several attempts. In this paper, we solve the above open problem by proving that the logical expressiveness of aggregate-combine-readout GNNs strictly exceeds that of C2. This result holds over both undirected and directed graphs. Beyond its implications for GNNs, our work also yields purely logical insights into the expressive power of infinitary logics.

Updated: 2025-08-08 07:35:35


Domains: cs.AI

Download: http://arxiv.org/abs/2508.06091v1

Adaptive Backtracking for Privacy Protection in Large Language Models

The preservation of privacy has emerged as a critical topic in the era of artificial intelligence. However, current work focuses on user-oriented privacy, overlooking severe enterprise data leakage risks exacerbated by the Retrieval-Augmented Generation paradigm. To address this gap, our paper introduces a novel objective: enterprise-oriented privacy concerns. Achieving this objective requires overcoming two fundamental challenges: existing methods such as data sanitization severely degrade model performance, and the field lacks public datasets for evaluation. We address these challenges with several solutions. (1) To prevent performance degradation, we propose ABack, a training-free mechanism that leverages a Hidden State Model to pinpoint the origin of a leakage intention and rewrite the output safely. (2) To solve the lack of datasets, we construct PriGenQA, a new benchmark for enterprise privacy scenarios in healthcare and finance. To ensure a rigorous evaluation, we move beyond simple static attacks by developing a powerful adaptive attacker with Group Relative Policy Optimization. Experiments show that against this superior adversary, ABack improves the overall privacy utility score by up to 15\% over strong baselines, avoiding the performance trade-offs of prior methods.

Updated: 2025-08-08 07:29:33


Domains: cs.CR,cs.LG,stat.ML

Download: http://arxiv.org/abs/2508.06087v1

Causal Mechanism Estimation in Multi-Sensor Systems Across Multiple Domains

To gain deeper insights into a complex sensor system through the lens of causality, we present common and individual causal mechanism estimation (CICME), a novel three-step approach to inferring causal mechanisms from heterogeneous data collected across multiple domains. By leveraging the principle of Causal Transfer Learning (CTL), CICME is able to reliably detect domain-invariant causal mechanisms when provided with sufficient samples. The identified common causal mechanisms are then used to guide the estimation of the remaining causal mechanisms in each domain individually. The performance of CICME is evaluated on linear Gaussian models in scenarios inspired by a manufacturing process. Building upon existing continuous optimization-based causal discovery methods, we show that CICME combines the benefits of applying causal discovery to the pooled data and repeatedly to data from individual domains, and that it even outperforms both baseline methods in certain scenarios.

Updated: 2025-08-08 07:28:53


Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2507.17792v3

Federated Continual Recommendation

The increasing emphasis on privacy in recommendation systems has led to the adoption of Federated Learning (FL) as a privacy-preserving solution, enabling collaborative training without sharing user data. While Federated Recommendation (FedRec) effectively protects privacy, existing methods struggle with non-stationary data streams, failing to maintain consistent recommendation quality over time. On the other hand, Continual Learning Recommendation (CLRec) methods address evolving user preferences but typically assume centralized data access, making them incompatible with FL constraints. To bridge this gap, we introduce Federated Continual Recommendation (FCRec), a novel task that integrates FedRec and CLRec, requiring models to learn from streaming data while preserving privacy. As a solution, we propose F3CRec, a framework designed to balance knowledge retention and adaptation under the strict constraints of FCRec. F3CRec introduces two key components: Adaptive Replay Memory on the client side, which selectively retains past preferences based on user-specific shifts, and Item-wise Temporal Mean on the server side, which integrates new knowledge while preserving prior information. Extensive experiments demonstrate that F3CRec outperforms existing approaches in maintaining recommendation quality over time in a federated environment.
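The server-side Item-wise Temporal Mean can be pictured as a per-item blend of the previous global embedding with the freshly aggregated one. A minimal sketch, assuming a simple convex combination with an illustrative weight `alpha` (the paper's exact rule may differ):

```python
def itemwise_temporal_mean(old_embs, new_embs, alpha=0.5):
    """Blend each item's previous embedding with its newly aggregated one,
    retaining prior knowledge while absorbing the new round's updates.
    `alpha` is an illustrative assumption, not the paper's exact weight."""
    merged = {}
    for item, new_vec in new_embs.items():
        old_vec = old_embs.get(item)
        if old_vec is None:
            # item unseen in earlier rounds: adopt the new embedding as-is
            merged[item] = list(new_vec)
        else:
            merged[item] = [alpha * n + (1 - alpha) * o
                            for o, n in zip(old_vec, new_vec)]
    return merged
```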

Updated: 2025-08-08 07:27:54


Domains: cs.LG,cs.IR,H.3.3; I.2.6; C.2.4

Download: http://arxiv.org/abs/2508.04792v2

Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting and data-scarce scenarios. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting. However, we find existing LLM-based methods still have shortcomings: (1) the absence of a unified paradigm for textual prompt formulation and (2) the neglect of modality discrepancies between textual prompts and time series. To address this, we propose LLM-Prompt, an LLM-based time series forecasting framework integrating multi-prompt information and cross-modal semantic alignment. Specifically, we first construct a unified textual prompt paradigm containing learnable soft prompts and textualized hard prompts. Second, to enhance LLMs' comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve cross-modal fusion of temporal and textual information. Finally, the transformed time series from the LLMs are projected to obtain the forecasts. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that LLM-Prompt is a powerful framework for time series forecasting.

Updated: 2025-08-08 07:18:59


Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.17631v2

Modeling Spatial Extremal Dependence of Precipitation Using Distributional Neural Networks

In this work, we propose a simulation-based estimation approach using generative neural networks to determine dependencies of precipitation maxima and their underlying uncertainty in time and space. Within the common framework of max-stable processes for extremes under temporal and spatial dependence, our methodology allows estimating the process parameters and their respective uncertainty, but also delivers an explicit nonparametric estimate of the spatial dependence through the pairwise extremal coefficient function. We illustrate the effectiveness and robustness of our approach in a thorough finite sample study where we obtain good performance in complex settings for which closed-form likelihood estimation becomes intractable. We use the technique for studying monthly rainfall maxima in Western Germany for the period 2021-2023, which is of particular interest since it contains an extreme precipitation and consecutive flooding event in July 2021 that had a massive deadly impact. Beyond the considered setting, the presented methodology and its main generative ideas also have great potential for other applications.
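For intuition, the pairwise extremal coefficient that the network estimates nonparametrically can also be computed with the classical F-madogram estimator, theta = (0.5 + nu) / (0.5 - nu), where nu is the madogram of the empirical CDF transforms. This is a conventional stand-in for illustration, not the paper's neural estimate:

```python
def extremal_coefficient_fmadogram(x, y):
    """Classical F-madogram estimate of the pairwise extremal coefficient:
    1 = complete dependence, 2 = independence of componentwise maxima."""
    n = len(x)

    def ecdf_transform(v):
        # rank / (n + 1), the usual empirical CDF transform
        order = sorted(range(n), key=lambda i: v[i])
        r = [0.0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank / (n + 1)
        return r

    fx, fy = ecdf_transform(x), ecdf_transform(y)
    nu = 0.5 * sum(abs(a - b) for a, b in zip(fx, fy)) / n  # F-madogram
    return (0.5 + nu) / (0.5 - nu)
```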

Updated: 2025-08-08 07:16:20


Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2407.08668v2

Towards MR-Based Trochleoplasty Planning

To treat Trochlear Dysplasia (TD), current approaches rely mainly on low-resolution clinical Magnetic Resonance (MR) scans and surgical intuition. Surgeries are planned based on surgeons' experience, have limited adoption of minimally invasive techniques, and lead to inconsistent outcomes. We propose a pipeline that generates super-resolved, patient-specific 3D pseudo-healthy target morphologies from conventional clinical MR scans. First, we compute an isotropic super-resolved MR volume using an Implicit Neural Representation (INR). Next, we segment femur, tibia, patella, and fibula with a multi-label custom-trained network. Finally, we train a Wavelet Diffusion Model (WDM) to generate pseudo-healthy target morphologies of the trochlear region. In contrast to prior work producing pseudo-healthy low-resolution 3D MR images, our approach enables the generation of sub-millimeter-resolved 3D shapes suitable for pre- and intraoperative use. These can serve as preoperative blueprints for reshaping the femoral groove while preserving the native patella articulation. Furthermore, and in contrast to other work, our pipeline does not require a CT, reducing radiation exposure. We evaluated our approach on 25 TD patients and show that our target morphologies significantly improve the sulcus angle (SA) and trochlear groove depth (TGD). The code and interactive visualization are available at https://wehrlimi.github.io/sr-3d-planning/.

Updated: 2025-08-08 07:15:23


Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06076v1

ME$^3$-BEV: Mamba-Enhanced Deep Reinforcement Learning for End-to-End Autonomous Driving with BEV-Perception

Autonomous driving systems face significant challenges in perceiving complex environments and making real-time decisions. Traditional modular approaches, while offering interpretability, suffer from error propagation and coordination issues, whereas end-to-end learning systems can simplify the design but face computational bottlenecks. This paper presents a novel approach to autonomous driving using deep reinforcement learning (DRL) that integrates bird's-eye view (BEV) perception for enhanced real-time decision-making. We introduce the \texttt{Mamba-BEV} model, an efficient spatio-temporal feature extraction network that combines BEV-based perception with the Mamba framework for temporal feature modeling. This integration allows the system to encode vehicle surroundings and road features in a unified coordinate system and accurately model long-range dependencies. Building on this, we propose the \texttt{ME$^3$-BEV} framework, which utilizes the \texttt{Mamba-BEV} model as a feature input for end-to-end DRL, achieving superior performance in dynamic urban driving scenarios. We further enhance the interpretability of the model by visualizing high-dimensional features through semantic segmentation, providing insight into the learned representations. Extensive experiments on the CARLA simulator demonstrate that \texttt{ME$^3$-BEV} outperforms existing models across multiple metrics, including collision rate and trajectory accuracy, offering a promising solution for real-time autonomous driving.

Updated: 2025-08-08 07:13:28


Domains: cs.AI,cs.RO

Download: http://arxiv.org/abs/2508.06074v1

CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/CodeARC
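The interactive protocol — query the hidden function, propose a candidate, refine on the oracle's counterexample — can be sketched as follows. The candidate pool and probe inputs are toy assumptions standing in for an LLM agent's proposals:

```python
def differential_oracle(candidate, hidden, probe_inputs):
    """Return an input where candidate and hidden disagree, else None."""
    for x in probe_inputs:
        if candidate(x) != hidden(x):
            return x
    return None

def synthesize_and_refine(candidates, hidden, probe_inputs):
    """Try candidates in order, collecting counterexample feedback that a
    real agent would condition its next proposal on (toy stand-in)."""
    feedback = []
    for cand in candidates:
        cex = differential_oracle(cand, hidden, probe_inputs)
        if cex is None:
            return cand, feedback  # passes differential testing
        feedback.append((cex, hidden(cex)))  # (input, expected output)
    return None, feedback
```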

Updated: 2025-08-08 07:13:11


Domains: cs.PL,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2503.23145v2

ProvX: Generating Counterfactual-Driven Attack Explanations for Provenance-Based Detection

Provenance graph-based intrusion detection systems are deployed on hosts to defend against increasingly severe Advanced Persistent Threats. Using Graph Neural Networks (GNNs) to detect these threats has become a research focus and has demonstrated exceptional performance. However, the widespread adoption of GNN-based security models is limited by their inherent black-box nature, as they fail to provide security analysts with any verifiable explanations for model predictions or any evidence relating the model's judgment to real-world attacks. To address this challenge, we propose ProvX, an effective framework for explaining GNN-based security models on provenance graphs. ProvX introduces counterfactual explanation logic, seeking the minimal structural subset within a graph predicted as malicious that, when perturbed, can subvert the model's original prediction. We innovatively transform the discrete search problem of finding this critical subgraph into a continuous optimization task guided by a dual objective of prediction flipping and distance minimization. Furthermore, a Staged Solidification strategy is incorporated to enhance the precision and stability of the explanations. We conducted extensive evaluations of ProvX on authoritative datasets. The experimental results demonstrate that ProvX can locate critical graph structures that are highly relevant to real-world attacks and achieves an average explanation necessity of 51.59\%, metrics that outperform current SOTA explainers. Furthermore, we explore and provide a preliminary validation of a closed-loop Detection-Explanation-Feedback enhancement framework, demonstrating through experiments that the explanation results from ProvX can guide model optimization, effectively enhancing its robustness against adversarial attacks.
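The counterfactual objective — the smallest perturbation of a malicious graph that flips the detector — can be illustrated with a brute-force discrete search. ProvX itself relaxes this into continuous optimization; the toy scorer and threshold below are assumptions:

```python
import itertools

def minimal_flip_subset(edges, malicious_score, threshold):
    """Smallest edge subset whose removal drives the detector's score
    below `threshold` (discrete brute force; ProvX solves a continuous
    relaxation of this search)."""
    for k in range(len(edges) + 1):
        for subset in itertools.combinations(edges, k):
            remaining = [e for e in edges if e not in subset]
            if malicious_score(remaining) < threshold:
                return list(subset)  # counterfactual explanation
    return None
```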

Updated: 2025-08-08 07:12:10


Domains: cs.CR

Download: http://arxiv.org/abs/2508.06073v1

Can Large Models Fool the Eye? A New Turing Test for Biological Animation

Evaluating the abilities of large models and revealing the gaps between them is challenging. Current benchmarks adopt either ground-truth-based score-form evaluation on static datasets or indistinct, chatbot-style collection of textual human preferences, neither of which gives users immediate, intuitive, and perceptible feedback on performance differences. In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the innate visual perception of the motion patterns characteristic of living organisms, and uses point-light source imaging to amplify the performance discrepancies between models. Specifically, we employ a pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants. Data analyses show that the crowd-sourced human votes are in good agreement with those of expert raters, demonstrating the superiority of our BioMotion Arena in offering discriminative feedback. We also find that over 90\% of evaluated models, including the cutting-edge open-source InternVL3 and proprietary Claude-4 series, fail to produce fundamental humanoid point-light groups, much less smooth and biologically plausible motions. This enables BioMotion Arena to serve as a challenging benchmark for performance visualization and a flexible evaluation framework without restrictions on ground truth.
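Pairwise votes of this kind are commonly aggregated into rankings with an Elo-style update; a minimal sketch (the K-factor, and the assumption that Elo rather than, say, Bradley-Terry is used, are ours, not the paper's):

```python
def elo_update(r_winner, r_loser, k=16.0):
    """Update two models' ratings after one pairwise vote."""
    expected_win = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta
```

Folding the 45k votes through this update, one vote at a time, yields a leaderboard over the 53 models.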

Updated: 2025-08-08 07:10:17


Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.06072v1

A Game-Theoretic Foundation for Bitcoin's Price: A Security-Utility Equilibrium

This paper introduces a structural game-theoretic model to value decentralized digital assets like Bitcoin. Instead of relying on speculative beliefs, it frames the asset's price within a Rational-Expectations Security-Utility Nash Equilibrium (RESUNE). This equilibrium is a fixed point where the market-clearing price dictates the hash rate through a free-entry mining model, which in turn endogenously sets the network's security. The security, defined as one minus the probability of a 51% attack, is determined via a global games model of attacker coordination, providing a unique and continuous security function. We prove the existence of a RESUNE and offer conditions for its uniqueness and stability. The model predicts that the stabilizing direct effect of price on demand must outweigh the potentially destabilizing feedback from price to security. The framework generates testable predictions, such as a protocol halving causing a contraction in both hash rate and price. A structural Vector Autoregression (VAR) model is proposed to test this mechanism. The model decomposes Bitcoin's value into transactional utility, security, and speculative components and explains the observed unidirectional causality from price to hash rate.
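The RESUNE fixed point can be sketched as iterating price → hash rate → security → market-clearing price until convergence. The toy maps in the usage below are assumptions; the paper derives them from free-entry mining and a global game:

```python
def resune_fixed_point(demand, security, hash_rate, p0=1.0,
                       max_iters=500, tol=1e-9):
    """Iterate the price map p -> demand(security(hash_rate(p))) to a
    fixed point (converges when the composite map is a contraction)."""
    p = p0
    for _ in range(max_iters):
        h = hash_rate(p)        # free-entry mining response
        s = security(h)         # 1 - attack probability from the coordination game
        p_new = demand(s)       # market-clearing price given security
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p
```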

Updated: 2025-08-08 07:09:21


Domains: cs.CR

Download: http://arxiv.org/abs/2508.06071v1

Scalable Differentially Private Sketches under Continual Observation

Linear sketches are fundamental tools in data stream analytics. They are notable for supporting both approximate frequency queries and heavy hitter detection with bounded trade-offs for error and memory. Importantly, on streams that contain sensitive information, linear sketches can be easily privatized with the injection of a suitable amount of noise. This process is efficient in the single release model, where the output is released only at the end of the stream. In this setting, it suffices to add noise to the sketch once. In contrast, in the continual observation model, where the output is released at every time-step, fresh noise needs to be added to the sketch before each release. This creates an additional computational overhead. To address this, we introduce Lazy Sketch, a novel differentially private sketching method that employs lazy updates, perturbing and modifying only a small portion of the sketch at each step. Compared to prior work, we reduce the update complexity by a factor of $O(w)$, where $w$ is the width of the sketch. Experiments demonstrate that our method increases throughput by up to 250x over prior work, making continual observation differential privacy practical for high-speed streaming applications. In addition, for heavy hitter detection, we present a new sketch-based algorithm that leverages lazy updates to achieve a per-update complexity of $O(d \log (T/w) + \log w)$, for linear sketches with dimension $d\times w$ and streams of length $T$. This marks a significant improvement over prior approaches in the streaming continual observation model, which require recomputing frequency estimates for every item in the input domain at each time step.
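A count-min-style private sketch conveys the flavor of noise injection at release time, touching only the counters a query reads. This is a simplified illustration with assumed parameters, not the paper's lazy-update mechanism (which perturbs only a small portion of the sketch per step under continual observation):

```python
import random

class PrivateCountMinSketch:
    """Count-min sketch with Laplace noise injected at release time.
    Depth/width/epsilon values here are illustrative assumptions."""

    def __init__(self, depth=4, width=64, eps=1.0, seed=0):
        self.depth, self.width, self.eps = depth, width, eps
        self.table = [[0.0] * width for _ in range(depth)]
        rnd = random.Random(seed)
        self.salts = [rnd.random() for _ in range(depth)]

    def _cols(self, item):
        return [hash((salt, item)) % self.width for salt in self.salts]

    def update(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def _laplace(self, scale):
        # difference of two i.i.d. exponentials is Laplace-distributed
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def query(self, item):
        # noisy count-min estimate: one fresh noise draw per touched counter
        return min(self.table[row][col] + self._laplace(1.0 / self.eps)
                   for row, col in enumerate(self._cols(item)))
```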

Updated: 2025-08-08 07:06:17


Domains: cs.CR

Download: http://arxiv.org/abs/2507.03361v3

Lightweight Auto-bidding based on Traffic Prediction in Live Advertising

Internet live streaming is widely used in online entertainment and e-commerce, where live advertising is an important marketing tool for anchors. An advertising campaign aims to maximize effect (such as conversions) under constraints (such as budget and cost-per-click). Campaigns are predominantly controlled via auto-bidding, where performance depends on the bidding algorithm's decision for each request. The most widely used auto-bidding algorithms include Proportional-Integral-Derivative (PID) control, linear programming (LP), reinforcement learning (RL), etc. Existing methods either do not consider traffic over the entire time horizon or have excessive computational complexity. Live advertising imposes strict real-time requirements on bidding (second-level control) and faces the difficulty of unknown future traffic. We therefore propose a lightweight bidding algorithm, Binary Constrained Bidding (BiCB), which neatly combines the optimal bidding formula given by mathematical analysis with a statistical method for future traffic estimation, and obtains a good approximation to the optimal result through a low-complexity solution. In addition, we augment traditional auto-bidding modeling with upper- and lower-bound constraints and give a theoretical analysis of BiCB. Extensive offline and online experiments demonstrate BiCB's good performance and low engineering cost.
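For context, the PID baseline mentioned above paces spend by feeding the gap between a target and the observed spend rate back into the bid. A minimal sketch with made-up gains (BiCB itself is a different, closed-form method):

```python
def pid_bid_controller(target_rate, kp=0.5, ki=0.1, kd=0.05):
    """Return a stateful step function implementing PID bid pacing.
    Gains are illustrative, not tuned values from the paper."""
    state = {'integral': 0.0, 'prev_err': 0.0}

    def step(observed_rate, base_bid):
        err = target_rate - observed_rate
        state['integral'] += err
        deriv = err - state['prev_err']
        state['prev_err'] = err
        adjustment = kp * err + ki * state['integral'] + kd * deriv
        return max(0.0, base_bid * (1.0 + adjustment))

    return step
```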

Updated: 2025-08-08 07:05:35


Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2508.06069v1

Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models

Existing LLM reasoning methods have shown impressive capabilities across various tasks, such as solving math and coding problems. However, applying these methods to scenarios without ground-truth answers or rule-based verification methods - such as tracking the mental states of an agent - remains challenging. Inspired by the sequential Monte Carlo algorithm, we introduce thought-tracing, an inference-time reasoning algorithm designed to trace the mental states of specific agents by generating hypotheses and weighting them based on observations without relying on ground-truth solutions to questions in datasets. Our algorithm is modeled after the Bayesian theory-of-mind framework, using LLMs to approximate probabilistic inference over agents' evolving mental states based on their perceptions and actions. We evaluate thought-tracing on diverse theory-of-mind benchmarks, demonstrating significant performance improvements compared to baseline LLMs. Our experiments also reveal interesting behaviors of the recent reasoning models - e.g., o3 and R1 - on theory-of-mind, highlighting the difference of social reasoning compared to other domains.
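The sequential-Monte-Carlo flavor of thought-tracing — generate hypotheses, reweight them on each observation, resample when weights collapse — can be sketched with discrete particles. The likelihood function in the usage below is a toy assumption; in the paper, an LLM plays that role:

```python
import random

def trace_hypotheses(hypotheses, observations, likelihood, seed=0):
    """Sequential importance resampling over discrete mental-state
    hypotheses; returns a posterior weight per hypothesis."""
    rnd = random.Random(seed)
    particles = list(hypotheses)
    weights = [1.0 / len(particles)] * len(particles)
    for obs in observations:
        weights = [w * likelihood(h, obs) for h, w in zip(particles, weights)]
        total = sum(weights) or 1e-12
        weights = [w / total for w in weights]
        ess = 1.0 / sum(w * w for w in weights)  # effective sample size
        if ess < len(particles) / 2:
            particles = rnd.choices(particles, weights, k=len(particles))
            weights = [1.0 / len(particles)] * len(particles)
    posterior = {}
    for h, w in zip(particles, weights):
        posterior[h] = posterior.get(h, 0.0) + w
    return posterior
```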

Updated: 2025-08-08 07:00:51


Categories: cs.AI, cs.CL

Download: http://arxiv.org/abs/2502.11881v2

Systemizing Multiplicity: The Curious Case of Arbitrariness in Machine Learning

Algorithmic modeling relies on limited information in data to extrapolate outcomes for unseen scenarios, often embedding an element of arbitrariness in its decisions. A perspective on this arbitrariness that has recently gained interest is multiplicity-the study of arbitrariness across a set of "good models", i.e., those likely to be deployed in practice. In this work, we systemize the literature on multiplicity by: (a) formalizing the terminology around model design choices and their contribution to arbitrariness, (b) expanding the definition of multiplicity to incorporate underrepresented forms beyond just predictions and explanations, (c) clarifying the distinction between multiplicity and other lenses of arbitrariness, i.e., uncertainty and variance, and (d) distilling the benefits and potential risks of multiplicity into overarching trends, situating it within the broader landscape of responsible AI. We conclude by identifying open research questions and highlighting emerging trends in this young but rapidly growing area of research.

Updated: 2025-08-08 06:57:59

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2501.14959v2

Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology

Deep temporal architectures such as Temporal Convolutional Networks (TCNs) achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap by providing both the first non-vacuous, architecture-aware generalization bounds for deep temporal models and a principled evaluation methodology. For exponentially $\beta$-mixing sequences, we derive bounds scaling as $ O\!\Bigl(R\,\sqrt{\tfrac{D\,p\,n\,\log N}{N}}\Bigr), $ where $D$ is network depth, $p$ kernel size, $n$ input dimension, and $R$ weight norm. Our delayed-feedback blocking mechanism transforms dependent samples into effectively independent ones while discarding only $O(1/\log N)$ of the data, yielding $\sqrt{D}$ scaling instead of exponential, implying that doubling depth requires approximately quadrupling the training data. We also introduce a fair-comparison methodology that fixes the effective sample size to isolate the effect of temporal structure from information content. Under $N_{\text{eff}}=2{,}000$, strongly dependent sequences ($\rho=0.8$) exhibit $\approx76\%$ smaller generalization gaps than weakly dependent ones ($\rho=0.2$), challenging the intuition that dependence is purely detrimental. Yet convergence rates diverge from theory: weak dependencies follow $N_{\text{eff}}^{-1.21}$ scaling and strong dependencies follow $N_{\text{eff}}^{-0.89}$, both steeper than the predicted $N^{-0.5}$. These findings reveal that temporal dependence can enhance learning under fixed information budgets, while highlighting gaps between theory and practice that motivate future research.
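
The stated bound can be evaluated directly to see its scaling behavior. The helper below is a plain transcription of the abstract's asymptotic expression; the leading constant `c` is a hypothetical placeholder, and the paper's actual bound carries explicit constants and assumptions (exponential beta-mixing) not modeled here.

```python
import math

def generalization_bound(R, D, p, n, N, c=1.0):
    """Evaluate the scaling O(R * sqrt(D * p * n * log(N) / N)) from the
    abstract, where D is network depth, p kernel size, n input dimension,
    N sample size, and R the weight norm."""
    return c * R * math.sqrt(D * p * n * math.log(N) / N)
```

Two properties follow immediately: the bound grows as sqrt(D) in depth (quadrupling D doubles the bound), and it decays roughly as sqrt(log N / N) in the sample size.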

Updated: 2025-08-08 06:57:49

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2508.06066v1

ThematicPlane: Bridging Tacit User Intent and Latent Spaces for Image Generation

Generative AI has made image creation more accessible, yet aligning outputs with nuanced creative intent remains challenging, particularly for non-experts. Existing tools often require users to externalize ideas through prompts or references, limiting fluid exploration. We introduce ThematicPlane, a system that enables users to navigate and manipulate high-level semantic concepts (e.g., mood, style, or narrative tone) within an interactive thematic design plane. This interface bridges the gap between tacit creative intent and system control. In our exploratory study (N=6), participants engaged in divergent and convergent creative modes, often embracing unexpected results as inspiration or iteration cues. While they grounded their exploration in familiar themes, differing expectations of how themes mapped to outputs revealed a need for more explainable controls. Overall, ThematicPlane fosters expressive, iterative workflows and highlights new directions for intuitive, semantics-driven interaction in generative design tools.

Updated: 2025-08-08 06:57:14

Categories: cs.HC, cs.AI, cs.CL, cs.CV, H.5.2; I.2.7

Download: http://arxiv.org/abs/2508.06065v1

PaPaformer: Language Model from Pre-trained Parallel Paths

The training of modern large language models requires an increasing amount of computation power and time. Even smaller variants, such as small language models (SLMs), take several days to train in the best case, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days or weeks. We introduce \textit{PaPaformer}, a decoder-only transformer architecture variant whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually with different types of training data and then combined into one larger model. This method offers the option to reduce the total number of model parameters and the training time while increasing performance. Moreover, the parallel path structure opens interesting possibilities for customizing paths to accommodate specific task requirements.

Updated: 2025-08-08 06:55:08

Categories: cs.CL, cs.LG

Download: http://arxiv.org/abs/2508.00544v2

A Generic Complete Anytime Beam Search for Optimal Decision Tree

Finding an optimal decision tree that minimizes classification error is known to be NP-hard. While exact algorithms based on MILP, CP, SAT, or dynamic programming guarantee optimality, they often suffer from poor anytime behavior -- meaning they struggle to find high-quality decision trees quickly when the search is stopped before completion -- due to unbalanced search space exploration. To address this, several anytime extensions of exact methods have been proposed, such as LDS-DL8.5, Top-k-DL8.5, and Blossom, but they have not been systematically compared, making it difficult to assess their relative effectiveness. In this paper, we propose CA-DL8.5, a generic, complete, and anytime beam search algorithm that extends the DL8.5 framework and unifies some existing anytime strategies. In particular, CA-DL8.5 generalizes previous approaches LDS-DL8.5 and Top-k-DL8.5, by allowing the integration of various heuristics and relaxation mechanisms through a modular design. The algorithm reuses DL8.5's efficient branch-and-bound pruning and trie-based caching, combined with a restart-based beam search that gradually relaxes pruning criteria to improve solution quality over time. Our contributions are twofold: (1) We introduce this new generic framework for exact and anytime decision tree learning, enabling the incorporation of diverse heuristics and search strategies; (2) We conduct a rigorous empirical comparison of several instantiations of CA-DL8.5 -- based on Purity, Gain, Discrepancy, and Top-k heuristics -- using an anytime evaluation metric called the primal gap integral. Experimental results on standard classification benchmarks show that CA-DL8.5 using LDS (limited discrepancy) consistently provides the best anytime performance, outperforming both other CA-DL8.5 variants and the Blossom algorithm while maintaining completeness and optimality guarantees.
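
The restart-based relaxation idea at the heart of the abstract can be shown in miniature. The sketch below is a generic anytime beam search, not DL8.5's branch-and-bound search with trie caching: it runs a complete beam pass with a narrow beam, then restarts with progressively wider beams so the incumbent solution improves over time. `expand` returns a node's children (empty for leaves) and `score` evaluates nodes, lower being better; both are assumed interfaces for illustration.

```python
def anytime_beam_search(root, expand, score, widths=(1, 2, 4, 8)):
    """Restart-based anytime beam search: after each restart with beam `width`,
    yields (width, best leaf found so far).  With widths that eventually cover
    the full branching, the final restart degenerates to exhaustive search."""
    best = None
    for width in widths:
        frontier = [root]
        while frontier:
            children = [c for node in frontier for c in expand(node)]
            for leaf in (c for c in children if not expand(c)):
                if best is None or score(leaf) < score(best):
                    best = leaf
            # keep only the `width` most promising internal nodes
            frontier = sorted((c for c in children if expand(c)),
                              key=score)[:width]
        yield width, best
```

On a deceptive search tree a narrow beam commits to the wrong branch, and the next, relaxed restart recovers the optimum while the early answer is already available.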

Updated: 2025-08-08 06:53:50

Categories: cs.AI

Download: http://arxiv.org/abs/2508.06064v1

Don't Forget Imagination!

Cognitive imagination is a type of imagination that plays a key role in human thinking. It is not a ``picture-in-the-head'' imagination. It is a faculty to mentally visualize coherent and holistic systems of concepts and causal links that serve as semantic contexts for reasoning, decision making and prediction. Our position is that the role of cognitive imagination is still greatly underestimated, and this creates numerous problems and diminishes the current capabilities of AI. For instance, when reasoning, humans rely on imaginary contexts to retrieve background info. They also constantly return to the context for semantic verification that their reasoning is still reasonable. Thus, reasoning without imagination is blind. This paper is a call for greater attention to cognitive imagination as the next promising breakthrough in artificial intelligence. As an instrument for simulating cognitive imagination, we propose semantic models -- a new approach to mathematical models that can learn, like neural networks, and are based on probabilistic causal relationships. Semantic models can simulate cognitive imagination because they ensure the consistency of imaginary contexts and implement a glass-box approach that allows the context to be manipulated as a holistic and coherent system of interrelated facts glued together with causal relations.

Updated: 2025-08-08 06:50:43

Categories: cs.AI, cs.LG, cs.LO, 68T27, 68T30

Download: http://arxiv.org/abs/2508.06062v1

LLM-Meta-SR: In-Context Learning for Evolving Selection Operators in Symbolic Regression

Large language models (LLMs) have revolutionized algorithm development, yet their application in symbolic regression, where algorithms automatically discover symbolic expressions from data, remains constrained and is typically designed manually by human experts. In this paper, we propose a meta learning framework that enables LLMs to automatically design selection operators for evolutionary symbolic regression algorithms. We first identify two key limitations in existing LLM-based algorithm evolution techniques: a lack of semantic guidance and code bloat. The absence of semantic awareness can lead to ineffective exchange of useful code components, and bloat results in unnecessarily complex components, both of which can reduce the interpretability of the designed algorithm or hinder evolutionary learning progress. To address these issues, we enhance the LLM-based evolution framework for meta symbolic regression with two key innovations: a complementary, semantics-aware selection operator and bloat control. Additionally, we embed domain knowledge into the prompt, enabling the LLM to generate more effective and contextually relevant selection operators. Our experimental results on symbolic regression benchmarks show that LLMs can devise selection operators that outperform nine expert-designed baselines, achieving state-of-the-art performance. Moreover, the evolved operator can further improve the state-of-the-art symbolic regression algorithm, achieving the best performance among 26 symbolic regression and machine learning algorithms across 116 regression datasets. This demonstrates that LLMs can exceed expert-level algorithm design for symbolic regression.
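
To make the abstract's problem setting concrete, here is a hypothetical example of the kind of semantics-aware selection operator being evolved, not the authors' evolved code: a tournament that picks the first parent by fitness and the second by semantic distance (distance between prediction vectors) from the first, so crossover exchanges genuinely different behavior. Individuals are assumed to be (fitness, semantics) pairs with lower fitness better.

```python
import random

def semantic_tournament(population, k=3, rng=None):
    """Select a parent pair: the fitter of k random candidates, then, from a
    second k-sample, the candidate whose semantics (predictions on training
    inputs) lie farthest from the first parent's, breaking ties by fitness."""
    rng = rng or random.Random(0)
    first = min(rng.sample(population, k), key=lambda ind: ind[0])

    def distance(ind):
        return sum((a - b) ** 2 for a, b in zip(ind[1], first[1]))

    pool = rng.sample(population, k)
    second = max(pool, key=lambda ind: (distance(ind), -ind[0]))
    return first, second
```

The point of such operators, per the abstract, is that exchanging semantically distinct subcomponents is more effective than exchanging syntactically different but behaviorally identical ones.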

Updated: 2025-08-08 06:50:37

Categories: cs.NE, cs.AI, cs.LG

Download: http://arxiv.org/abs/2505.18602v2

LLMs for Resource Allocation: A Participatory Budgeting Approach to Inferring Preferences

Large Language Models (LLMs) are increasingly expected to handle complex decision-making tasks, yet their ability to perform structured resource allocation remains underexplored. Evaluating their reasoning is also difficult due to data contamination and the static nature of existing benchmarks. We present a dual-purpose framework leveraging Participatory Budgeting (PB) both as (i) a practical setting for LLM-based resource allocation and (ii) an adaptive benchmark for evaluating their reasoning capabilities. We task LLMs with selecting project subsets under feasibility (e.g., budget) constraints via three prompting strategies: greedy selection, direct optimization, and a hill-climbing-inspired refinement. We benchmark LLMs' allocations against a utility-maximizing oracle. Interestingly, we also test whether LLMs can infer structured preferences from natural-language voter input or metadata, without explicit votes. By comparing allocations based on inferred preferences to those from ground-truth votes, we evaluate LLMs' ability to extract preferences from open-ended input. Our results underscore the role of prompt design and show that LLMs hold promise for mechanism design with unstructured inputs.
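
The greedy prompting strategy in the abstract corresponds to the standard budgeted greedy heuristic, sketched below as a reference point. The (name, cost, utility) project encoding is an assumed format for illustration, not the paper's actual PB instance representation.

```python
def greedy_allocation(projects, budget):
    """Pick projects by utility-per-cost ratio until the budget is exhausted.
    `projects` is a list of (name, cost, utility) tuples; returns the chosen
    project names in selection order."""
    chosen, remaining = [], budget
    for name, cost, utility in sorted(projects,
                                      key=lambda p: p[2] / p[1],
                                      reverse=True):
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen
```

Benchmarking an LLM's allocations against a utility-maximizing oracle amounts to comparing its selected subset's total utility with the optimum of this knapsack-style problem.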

Updated: 2025-08-08 06:45:07

Categories: cs.AI

Download: http://arxiv.org/abs/2508.06060v1

Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System

State-of-the-art fact-checking systems combat misinformation at scale by employing autonomous LLM-based agents to decompose complex claims into smaller sub-claims, verify each sub-claim individually, and aggregate the partial results to produce verdicts with justifications (explanatory rationales for the verdicts). The security of these systems is crucial, as compromised fact-checkers can amplify misinformation, yet this attack surface remains underexplored. This work introduces Fact2Fiction, the first poisoning attack framework targeting such agentic fact-checking systems. Fact2Fiction mirrors the decomposition strategy and exploits system-generated justifications to craft tailored malicious evidences that compromise sub-claim verification. Extensive experiments demonstrate that Fact2Fiction achieves 8.9\%--21.2\% higher attack success rates than state-of-the-art attacks across various poisoning budgets. Fact2Fiction exposes security weaknesses in current fact-checking systems and highlights the need for defensive countermeasures.

Updated: 2025-08-08 06:44:57

Categories: cs.CR, cs.CL

Download: http://arxiv.org/abs/2508.06059v1

CAMEF: Causal-Augmented Multi-Modality Event-Driven Financial Forecasting by Integrating Time Series Patterns and Salient Macroeconomic Announcements

Accurately forecasting the impact of macroeconomic events is critical for investors and policymakers. Salient events like monetary policy decisions and employment reports often trigger market movements by shaping expectations of economic growth and risk, thereby establishing causal relationships between events and market behavior. Existing forecasting methods typically focus either on textual analysis or time-series modeling, but fail to capture the multi-modal nature of financial markets and the causal relationship between events and price movements. To address these gaps, we propose CAMEF (Causal-Augmented Multi-Modality Event-Driven Financial Forecasting), a multi-modality framework that effectively integrates textual and time-series data with a causal learning mechanism and an LLM-based counterfactual event augmentation technique for causal-enhanced financial forecasting. Our contributions include: (1) a multi-modal framework that captures causal relationships between policy texts and historical price data; (2) a new financial dataset with six types of macroeconomic releases from 2008 to April 2024, and high-frequency real trading data for five key U.S. financial assets; and (3) an LLM-based counterfactual event augmentation strategy. We compare CAMEF to state-of-the-art transformer-based time-series and multi-modal baselines, and perform ablation studies to validate the effectiveness of the causal learning mechanism and event types.

Updated: 2025-08-08 06:44:22

Categories: cs.LG, cs.AI, cs.CE, cs.IR

Download: http://arxiv.org/abs/2502.04592v3

Growth in products of matrices: fastest, average, and generic

The problems that we consider in this paper are as follows. Let A and B be 2x2 matrices (over the reals). Let w(A, B) be a word of length n. After evaluating w(A, B) as a product of matrices, we get a 2x2 matrix; call it W. What is the largest (in absolute value) possible entry of W, over all w(A, B) of length n, as a function of n? What is the expected absolute value of the largest (in absolute value) entry in a random product of n matrices, where each matrix is A or B with probability 0.5? What is the Lyapunov exponent for a random matrix product like that? We give a partial answer to the first of these questions and an essentially complete answer to the second question. For the third question (the most difficult of the three), we offer a very simple method to produce an upper bound on the Lyapunov exponent in the case where all entries of the matrices A and B are nonnegative.
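
The Lyapunov exponent in the third question can be estimated numerically by the standard renormalized-product method, shown below as a crude Monte Carlo sketch (this is the textbook estimator, not the paper's upper-bound technique).

```python
import math
import random

def lyapunov_estimate(A, B, n=10000, seed=0):
    """Estimate lambda = lim (1/n) E[log ||M_n ... M_1||] for random products
    of the 2x2 matrices A and B, each chosen with probability 0.5.  Uses the
    max-absolute-entry norm and renormalizes every step to avoid overflow."""
    rng = random.Random(seed)
    M = [[1.0, 0.0], [0.0, 1.0]]
    log_norm = 0.0
    for _ in range(n):
        X = A if rng.random() < 0.5 else B
        M = [[sum(M[i][k] * X[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
        s = max(abs(e) for row in M for e in row)
        log_norm += math.log(s)           # accumulate growth, then renormalize
        M = [[e / s for e in row] for row in M]
    return log_norm / n
```

For example, with the nonnegative shear matrices [[1,1],[0,1]] and [[1,0],[1,1]] the estimate converges to a strictly positive exponent, which is the regime where the paper's simple upper-bound method applies.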

Updated: 2025-08-08 06:40:15

Categories: math.GR, cs.CR, math.CO, math.DS, math.PR

Download: http://arxiv.org/abs/2405.00610v7

The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs

The "LLM-as-an-annotator" and "LLM-as-a-judge" paradigms employ Large Language Models (LLMs) as annotators, judges, and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM annotators and judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that closed-source LLMs (such as GPT-4o) can sometimes replace human annotators, outperforming the open-source LLMs we examine, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.

Updated: 2025-08-08 06:36:34

Categories: cs.CL, cs.AI, cs.HC

Download: http://arxiv.org/abs/2501.10970v4

AGI for the Earth, the path, possibilities and how to evaluate intelligence of models that work with Earth Observation Data?

Artificial General Intelligence (AGI) is closer than ever to becoming a reality, sparking widespread enthusiasm in the research community to collect and work with various modalities, including text, image, video, and audio. Despite recent efforts, satellite spectral imagery, as an additional modality, has yet to receive the attention it deserves. This area presents unique challenges, but also holds great promise in advancing the capabilities of AGI in understanding the natural world. In this paper, we argue why Earth Observation data is useful for an intelligent model, and then we review existing benchmarks and highlight their limitations in evaluating the generalization ability of foundation models in this domain. This paper emphasizes the need for a more comprehensive benchmark to evaluate earth observation models. To facilitate this, we propose a comprehensive set of tasks that a benchmark should encompass to effectively assess a model's ability to understand and interact with Earth observation data.

Updated: 2025-08-08 06:28:58

Categories: cs.CV, cs.LG

Download: http://arxiv.org/abs/2508.06057v1

Data-Driven Density Steering via the Gromov-Wasserstein Optimal Transport Distance

We tackle the data-driven chance-constrained density steering problem using the Gromov-Wasserstein metric. The underlying dynamical system is an unknown linear controlled recursion, with the assumption that sufficiently rich input-output data from pre-operational experiments are available. The initial state is modeled as a Gaussian mixture, while the terminal state is required to match a specified Gaussian distribution. We reformulate the resulting optimal control problem as a difference-of-convex program and show that it can be efficiently and tractably solved using the DC algorithm. Numerical results validate our approach through various data-driven schemes.
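
The difference-of-convex reformulation is solved with the DC algorithm (DCA), whose generic iteration is easy to write down: for f = g - h with g, h convex, repeatedly linearize h at the current iterate and minimize the resulting convex surrogate. The sketch below shows the bare iteration on a toy one-dimensional problem, not the paper's density-steering program; the closed-form inner minimizer is an assumption made for the example.

```python
def dca(g_argmin_linear, h_subgrad, x0, iters=50):
    """Generic DC algorithm for minimizing f = g - h (both convex):
    x_{k+1} = argmin_x g(x) - <y_k, x>, where y_k is a subgradient of h
    at x_k.  `g_argmin_linear(y)` must return argmin_x g(x) - y * x."""
    x = x0
    for _ in range(iters):
        y = h_subgrad(x)        # linearize the concave part -h at x
        x = g_argmin_linear(y)  # solve the convex surrogate exactly
    return x
```

Toy instance: f(x) = x^2 - 2|x| with g(x) = x^2 and h(x) = 2|x|. Then argmin_x x^2 - y*x = y/2 and a subgradient of h is 2*sign(x), so DCA converges to the nearest global minimizer (+1 or -1, where f = -1) in one step.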

Updated: 2025-08-08 06:21:21

Categories: math.OC, cs.LG, cs.SY, eess.SY

Download: http://arxiv.org/abs/2508.06052v1

Khan-GCL: Kolmogorov-Arnold Network Based Graph Contrastive Learning with Hard Negatives

Graph contrastive learning (GCL) has demonstrated great promise for learning generalizable graph representations from unlabeled data. However, conventional GCL approaches face two critical limitations: (1) the restricted expressive capacity of multilayer perceptron (MLP) based encoders, and (2) suboptimal negative samples, which either come from random augmentations (failing to provide effective 'hard negatives') or are generated as hard negatives without addressing the semantic distinctions crucial for discriminating graph data. To this end, we propose Khan-GCL, a novel framework that integrates the Kolmogorov-Arnold Network (KAN) into the GCL encoder architecture, substantially enhancing its representational capacity. Furthermore, we exploit the rich information embedded within KAN coefficient parameters to develop two novel critical feature identification techniques that enable the generation of semantically meaningful hard negative samples for each graph representation. These strategically constructed hard negatives guide the encoder to learn more discriminative features by emphasizing critical semantic differences between graphs. Extensive experiments demonstrate that our approach achieves state-of-the-art performance compared to existing GCL methods across a variety of datasets and tasks.

Updated: 2025-08-08 06:14:32

Categories: cs.LG

Download: http://arxiv.org/abs/2505.15103v2

EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process, utilizing multi-agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks including StoryER, HANNA and OpenMEVA. Furthermore, when served as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self-evolving approach.

Updated: 2025-08-08 06:10:47

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2508.06046v1

Entropy Causal Graphs for Multivariate Time Series Anomaly Detection

Many multivariate time series anomaly detection frameworks have been proposed and widely applied. However, most of these frameworks do not consider intrinsic relationships between variables in multivariate time series data, thus ignoring the causal relationship among variables and degrading anomaly detection performance. This work proposes a novel framework called CGAD, an entropy Causal Graph for multivariate time series Anomaly Detection. CGAD utilizes transfer entropy to construct graph structures that unveil the underlying causal relationships among time series data. Weighted graph convolutional networks combined with causal convolutions are employed to model both the causal graph structures and the temporal patterns within multivariate time series data. Furthermore, CGAD applies anomaly scoring, leveraging median absolute deviation-based normalization to improve the robustness of the anomaly identification process. Extensive experiments demonstrate that CGAD outperforms state-of-the-art methods on real-world datasets with a 9% average improvement in terms of three different multivariate time series anomaly detection metrics.
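
The transfer entropy that CGAD uses to uncover causal edges has a standard definition: TE(X -> Y) measures how much knowing x_t reduces uncertainty about y_{t+1} beyond what y_t already provides. Below is a plug-in estimator for discrete series with history length 1, a simplified stand-in for whatever estimator the paper actually employs.

```python
import math
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate (in bits) of TE(X -> Y) with one step of history:
    sum over (y_{t+1}, y_t, x_t) of p * log2[ p(y_{t+1}|y_t,x_t) / p(y_{t+1}|y_t) ]."""
    triples = list(zip(y[1:], y[:-1], x[:-1]))   # (y_{t+1}, y_t, x_t)
    n = len(triples)
    c_yyx = Counter(triples)
    c_yx = Counter((yp, xp) for _, yp, xp in triples)   # (y_t, x_t)
    c_yy = Counter((yn, yp) for yn, yp, _ in triples)   # (y_{t+1}, y_t)
    c_y = Counter(yp for _, yp, _ in triples)           # y_t
    te = 0.0
    for (yn, yp, xp), c in c_yyx.items():
        p = c / n
        p_cond_full = c / c_yx[(yp, xp)]
        p_cond_hist = c_yy[(yn, yp)] / c_y[yp]
        te += p * math.log2(p_cond_full / p_cond_hist)
    return te
```

In a CGAD-style pipeline, pairwise TE values between all variable pairs would then be thresholded to produce the directed (causal) adjacency matrix consumed by the graph convolutional layers.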

Updated: 2025-08-08 06:01:47

Categories: cs.LG, cs.AI, I.2.6; I.5.1

Download: http://arxiv.org/abs/2312.09478v2

Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

In order to increase the effectiveness of model training, data reduction is essential to data-centric Artificial Intelligence (AI). It achieves this by locating the most instructive examples in massive datasets. To increase data quality and training efficiency, the main difficulty lies in selecting the most informative examples rather than using the complete dataset. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). To enable a static method, we first use PVI to quantify instance difficulty and remove instances with low difficulty. Experiments show that classifier performance is maintained, with only a 0.0001% to 0.76% decline in accuracy, when 10%-30% of the data is removed. Second, we train the classifiers using a progressive learning strategy on examples sorted by increasing PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional training. Our findings imply that training a classifier on the chosen optimal subset may improve model performance and increase training efficiency when combined with an efficient data reduction strategy. Furthermore, we have adapted the PVI framework, which was previously limited to English datasets, to a variety of Chinese Natural Language Processing (NLP) tasks and base models, yielding insightful results for faster training and cross-lingual data reduction.
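
PVI has a simple closed form: PVI(x -> y) = log2 g[x](y) - log2 g'[null](y), where g[x](y) is the probability a model finetuned with inputs assigns to the gold label y, and g'[null](y) the probability assigned by a model finetuned without inputs. High PVI marks an easy instance; low or negative PVI a hard one. The sketch below shows the scoring step and the abstract's static reduction (dropping the lowest-difficulty, i.e. highest-PVI, fraction); the data format is an assumption for illustration.

```python
import math

def pvi(p_with_input, p_null):
    """Pointwise V-information of one instance, given the gold-label
    probabilities from the with-input model and the null (input-free) model."""
    return math.log2(p_with_input) - math.log2(p_null)

def reduce_dataset(scored, drop_fraction=0.2):
    """Static PVI-based reduction: rank (instance, pvi_value) pairs by PVI and
    drop the easiest (highest-PVI) fraction.  The ascending ranking is also the
    curriculum order used by the abstract's progressive training strategy."""
    ranked = sorted(scored, key=lambda item: item[1])  # hardest first
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return [inst for inst, _ in ranked[:keep]]
```

Note that scoring requires training the two models first, which is why the abstract frames this as a one-off static preprocessing step whose cost is amortized over subsequent training runs.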

Updated: 2025-08-08 06:00:37

标题: 质量胜于数量:基于逐点V-信息的有效大规模数据缩减策略

摘要: 为了增加模型训练的效果,数据减少对于以数据为中心的人工智能(AI)至关重要。它通过在海量数据集中找到最具教育意义的示例来实现这一目标。为了提高数据质量和训练效率,主要困难在于选择最佳示例而不是完整数据集。在本文中,我们提出了一种基于点值V-信息(PVI)的有效数据减少策略。为了实现一种静态方法,我们首先使用PVI来量化实例的难度并删除难度较低的实例。实验证明,当删除10%-30%的数据时,分类器性能仍然保持,准确率仅下降了0.0001%至0.76%。其次,我们使用逐步学习策略对按照递增PVI排序的示例进行分类器训练,加速收敛并实现比传统训练高出0.8%的准确率增益。我们的发现表明,在选择的最佳子集上训练分类器可能会提高模型性能并增加训练效率,尤其是当结合高效的数据减少策略时。此外,我们已经将之前仅限于英语数据集的PVI框架改编为适用于多种中文自然语言处理(NLP)任务和基础模型,从而为更快的训练和跨语言数据减少提供了深刻的见解。

更新时间: 2025-08-08 06:00:37

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.00038v3

Society of Mind Meets Real-Time Strategy: A Hierarchical Multi-Agent Framework for Strategic Reasoning

Large Language Models (LLMs) have recently demonstrated impressive action sequence prediction capabilities but often struggle with dynamic, long-horizon tasks such as real-time strategic games. In a game such as StarCraft II (SC2), agents need to manage resource constraints and adapt to evolving battlefield situations in a partially observable environment. This often overwhelms existing LLM-based approaches. To address these challenges, we propose a hierarchical multi-agent framework that employs specialized imitation learning agents under a meta-controller called the Strategic Planner (SP). Through expert demonstrations, each specialized agent learns a distinctive strategy, such as aerial support or defensive maneuvers, and produces coherent, structured multistep action sequences. The SP then orchestrates these proposals into a single, environmentally adaptive plan that ensures local decisions align with long-term strategies. We call this framework HIMA (Hierarchical Imitation Multi-Agent). We also present TEXTSCII-ALL, a comprehensive SC2 testbed that encompasses all race match combinations in SC2. Our empirical results show that HIMA outperforms the state of the art in strategic clarity, adaptability, and computational efficiency, underscoring the potential of combining specialized imitation modules with meta-level orchestration to develop more robust, general-purpose AI agents.

Updated: 2025-08-08 05:57:12

标题: 《心智社会遇上实时战略:一种用于战略推理的分层多智能体框架》

摘要: 大型语言模型(LLMs)最近展示出了令人印象深刻的行动序列预测能力,但往往在动态、长视野任务中遇到困难,比如实时战略游戏。在StarCraftII(SC2)这样的游戏中,代理需要管理资源约束并适应不断变化的战场情况,在一个部分可观察的环境中。这经常使现有的基于LLM的方法不堪重负。为了解决这些挑战,我们提出了一个层次化的多智能体框架,采用专门的模仿学习代理在一个名为战略规划者(SP)的元控制器下工作。通过专家演示,每个专门代理学习一个独特的策略,比如空中支援或防御机动,并产生连贯、结构化的多步行动序列。然后,SP将这些提议编排成一个单一的、环境适应性的计划,确保局部决策与长期策略保持一致。我们将其称为HIMA(Hierarchical Imitation Multi-Agent)。我们还提出了TEXTSCII-ALL,一个涵盖SC2中所有种族比赛组合的全面SC2测试平台。我们的实证结果表明,HIMA在战略清晰度、适应性和计算效率方面胜过了现有技术,强调了将专门的模仿模块与元级编排结合起来开发更强大、通用的AI代理的潜力。

更新时间: 2025-08-08 05:57:12

领域: cs.AI

下载: http://arxiv.org/abs/2508.06042v1

DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding iterations. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. DP-LLM augments each linear layer in an LLM with a precision selector that determines the bitwidth at runtime using a lightweight error estimator and threshold values learned through fine-tuning. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
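A toy illustration of the runtime precision-selection idea: the thresholds and error estimator below are hypothetical stand-ins for the learned components described in the abstract (the paper fits its thresholds via fine-tuning), shown only to make the control flow concrete:

```python
import numpy as np

class PrecisionSelector:
    """Per-layer runtime bitwidth choice from a lightweight error estimate.

    A low estimated quantization error means the layer can run at a low
    bitwidth this step; a high estimate falls back to more bits.
    """

    def __init__(self, thresholds=(0.05, 0.15), bitwidths=(2, 4, 8)):
        self.thresholds = thresholds  # hypothetical learned cutoffs
        self.bitwidths = bitwidths

    def estimate_error(self, x, layer_sensitivity):
        # Toy proxy: sensitive layers seeing large activations quantize poorly.
        return layer_sensitivity * float(np.mean(np.abs(x)))

    def select(self, x, layer_sensitivity):
        err = self.estimate_error(x, layer_sensitivity)
        for t, b in zip(self.thresholds, self.bitwidths):
            if err < t:
                return b
        return self.bitwidths[-1]

sel = PrecisionSelector()
calm = np.full(16, 0.2)    # small activations, robust layer
spiky = np.full(16, 2.0)   # large activations, sensitive layer
print(sel.select(calm, layer_sensitivity=0.1))   # low bitwidth suffices
print(sel.select(spiky, layer_sensitivity=0.2))  # falls back to 8 bits
```

Because the estimate depends on the current input values, the same layer can be assigned different bitwidths across decoding iterations, which is the key observation the method builds on.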

Updated: 2025-08-08 05:57:04

标题: DP-LLM: 使用动态层精度分配进行运行时模型适应

摘要: 我们如何有效处理对具有不同运行时约束的设备上大型语言模型(LLMs)的查询,例如延迟和准确性?多尺度量化通过使多个模型变体叠加到不同位宽的量化模型上,从而实现了LLMs的内存高效运行时模型适应。同时,一个重要的问题仍然是开放性的:如何正确配置模型以匹配目标精度或延迟?虽然混合精度提供了一个有希望的解决方案,但我们进一步利用了这样一个关键观察结果:每个层的敏感性在解码迭代过程中动态变化。基于这一观察结果,我们引入了DP-LLM,这是一种新颖的机制,根据输入值动态为每个层分配精度。DP-LLM通过使用轻量级误差估计器和通过微调学习的阈值值动态确定运行时的位宽,从而使LLM中的每个线性层增强了一个精度选择器。跨多个模型和基准测试的实验结果表明,DP-LLM实现了更出色的性能-延迟折衷,优于先前的方法。

更新时间: 2025-08-08 05:57:04

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.06041v1

Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

Vision-Language Models (VLMs) typically replace the predefined image placeholder token (<image>) in textual instructions with visual features from an image encoder, forming the input to a backbone Large Language Model (LLM). However, the large number of vision tokens significantly increases the context length, leading to high computational overhead and inference latency. While previous efforts mitigate this by selecting only important visual features or leveraging learnable queries to reduce token count, they often compromise performance or introduce substantial extra costs. In response, we propose Fourier-VLM, a simple yet efficient method that compresses visual representations in the frequency domain. Our approach is motivated by the observation that vision features output from the vision encoder exhibit concentrated energy in low-frequency components. Leveraging this, we apply a low-pass filter to the vision features using a two-dimensional Discrete Cosine Transform (DCT). Notably, the DCT is efficiently computed via the Fast Fourier Transform (FFT) operator with a time complexity of $\mathcal{O}(n\log n)$, minimizing the extra computational cost while introducing no additional parameters. Extensive experiments across various image-based benchmarks demonstrate that Fourier-VLM achieves competitive performance with strong generalizability across both LLaVA and Qwen-VL architectures. Crucially, it reduces inference FLOPs by up to 83.8% and boosts generation speed by 31.2% compared to LLaVA-v1.5, highlighting its superior efficiency and practicality.
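The frequency-domain low-pass step can be sketched with an off-the-shelf DCT. This illustrates only the filtering idea; in the actual method the truncated coefficient grid itself would serve as the shorter token sequence, and the grid size and cutoff below are arbitrary choices for the example:

```python
import numpy as np
from scipy.fft import dctn, idctn

def lowpass_vision_grid(feats: np.ndarray, keep: int) -> np.ndarray:
    """Low-pass an (H, W, C) grid of vision features in the DCT domain.

    Keeps only the top-left keep x keep block of 2-D DCT coefficients per
    channel -- the low-frequency components where most energy concentrates.
    """
    coeffs = dctn(feats, axes=(0, 1), norm="ortho")
    out = np.zeros_like(coeffs)
    out[:keep, :keep] = coeffs[:keep, :keep]
    return idctn(out, axes=(0, 1), norm="ortho")

rng = np.random.default_rng(0)
grid = rng.normal(size=(24, 24, 8))           # 576 token positions, 8-dim
smooth = lowpass_vision_grid(grid, keep=6)    # 16x fewer retained coefficients
print(smooth.shape)  # same grid shape, high frequencies removed
```

Returning `coeffs[:keep, :keep]` directly instead of reconstructing would give the compressed representation: `keep * keep` tokens in place of `H * W`.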

Updated: 2025-08-08 05:49:42

标题: Fourier-VLM: 在频域中压缩大型视觉语言模型的视觉标记

摘要: 视觉-语言模型(VLMs)通常将文本指令中预定义的图像占位符(<image>)替换为来自图像编码器的视觉特征,形成主干大型语言模型(LLM)的输入。然而,大量的视觉标记显著增加了上下文长度,导致高计算开销和推理延迟。以前的努力通过仅选择重要的视觉特征或利用可学习的查询来减少标记数量来缓解这一问题,但它们通常会牺牲性能或引入实质性的额外成本。为此,我们提出了Fourier-VLM,这是一种简单而高效的方法,它在频域中压缩视觉表示。我们的方法受到以下观察的启发,即来自视觉编码器的视觉特征在低频成分中具有集中能量。利用这一点,我们使用二维离散余弦变换(DCT)对视觉特征应用低通滤波器。值得注意的是,DCT通过快速傅里叶变换(FFT)算子高效计算,时间复杂度为$\mathcal{O}(n\log n)$,最小化额外的计算成本,而不引入额外参数。在各种基于图像的基准测试中进行的广泛实验表明,Fourier-VLM在LLaVA和Qwen-VL架构上实现了具有竞争力的性能和强大的泛化能力。关键是,与LLaVA-v1.5相比,它将推理FLOPs降低了高达83.8%,并将生成速度提高了31.2%,突显了卓越的效率和实用性。

更新时间: 2025-08-08 05:49:42

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.06038v1

Adaptive Heterogeneous Graph Neural Networks: Bridging Heterophily and Heterogeneity

Heterogeneous graphs (HGs) are common in real-world scenarios and often exhibit heterophily. However, most existing studies focus on either heterogeneity or heterophily in isolation, overlooking the prevalence of heterophilic HGs in practical applications. This oversight degrades their performance. In this work, we first identify two main challenges in modeling heterophilic HGs: (1) varying heterophily distributions across hops and meta-paths; (2) the intricate and often heterophily-driven diversity of semantic information across different meta-paths. Then, we propose the Adaptive Heterogeneous Graph Neural Network (AHGNN) to tackle these challenges. AHGNN employs a heterophily-aware convolution that accounts for heterophily distributions specific to both hops and meta-paths. It then integrates messages from diverse semantic spaces using a coarse-to-fine attention mechanism, which filters out noise and emphasizes informative signals. Experiments on seven real-world graphs against twenty baselines demonstrate the superior performance of AHGNN, particularly in high-heterophily situations.

Updated: 2025-08-08 05:39:58

标题: 自适应异构图神经网络:弥合异配性与异质性

摘要: 异构图(HG)在现实场景中很常见,通常表现出异配性。然而,大多数现有研究要么只关注异质性,要么只关注异配性,忽视了异配性异构图在实际应用中的普遍存在。这种忽视导致了它们的性能下降。在这项工作中,我们首先确定了建模异配性异构图的两个主要挑战:(1)跨跳数和元路径的异配性分布变化;(2)不同元路径间语义信息的错综复杂且常常由异配性驱动的多样性。然后,我们提出了自适应异构图神经网络(AHGNN)来应对这些挑战。AHGNN采用了一种考虑到跳数和元路径特定异配性分布的异配性感知卷积。随后,它使用一种从粗到细的注意力机制整合来自不同语义空间的消息,过滤噪声并强调有信息量的信号。在七个真实世界图和二十个基线上的实验表明了AHGNN的卓越性能,特别是在高异配性情况下。

更新时间: 2025-08-08 05:39:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.06034v1

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

Updated: 2025-08-08 05:33:21

标题: AVA-Bench:用于视觉基础模型的原子视觉能力基准

摘要: 视觉基础模型(VFMs)的兴起需要系统评估。一种常见的方法是将VFMs与大型语言模型(LLMs)配对作为通用头部,然后在广泛的视觉问答(VQA)基准上进行评估。然而,这种协议有两个关键的盲点:(i)指导调整数据可能与VQA测试分布不一致,这意味着错误的预测可能源于这种数据不匹配而不是VFM的视觉缺陷;(ii)VQA基准通常需要多种视觉能力,很难判断错误是源于缺乏所有所需能力还是只是一个关键的能力。为了解决这些空白,我们引入了AVA-Bench,这是第一个明确区分14种原子视觉能力(AVAs)的基准 - 基础技能,如定位、深度估计和空间理解,这些技能共同支持复杂的视觉推理任务。通过解耦AVAs并在每个AVAs中匹配训练和测试分布,AVA-Bench准确指出了VFM在哪些方面表现出色或失败。将AVA-Bench应用于领先的VFMs,从而揭示了独特的“能力指纹”,将VFM选择从有教养的猜测转变为原则性工程。值得注意的是,我们发现0.5B LLM产生的VFM排名与7B LLM相似,同时减少了GPU小时数8倍,实现更有效率的评估。通过提供一个全面透明的基准,我们希望AVA-Bench为下一代VFMs奠定基础。

更新时间: 2025-08-08 05:33:21

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2506.09082v2

Efficient Knowledge Probing of Large Language Models by Adapting Pre-trained Embeddings

Large language models (LLMs) acquire knowledge across diverse domains such as science, history, and geography encountered during generative pre-training. However, due to their stochasticity, it is difficult to predict what LLMs have acquired. Prior work has developed different ways to probe this knowledge by investigating the hidden representations, crafting specific task prompts, curating representative samples, and estimating their uncertainty. However, these methods require making forward passes through the underlying model to probe the LLM's knowledge about a specific fact, making them computationally expensive and time-consuming. To bridge this gap, we propose $\textbf{PEEK}$ or $\textbf{P}$roxy $\textbf{E}$mbeddings to $\textbf{E}$stimate $\textbf{K}$nowledge of LLMs, by leveraging the pre-trained embedding models that effectively encode factual knowledge as text or graphs as proxies for LLMs. First, we identify a training set of facts known by LLMs through various probing strategies and then adapt embedding models to predict the LLM outputs with a linear decoder layer. Comprehensive evaluation on $3$ Wikipedia-derived datasets, $4$ LLMs, and $7$ embedding models shows that embeddings can predict LLM knowledge on a held-out set with up to 90 % accuracy. Furthermore, we find that sentence embedding models are more suitable than graph embeddings to predict LLM knowledge, shedding light on the underlying representation of the factual landscape. Thus, we believe that knowledge-adapted embeddings can be used to identify knowledge gaps in LLMs at scale and can provide deeper insights into LLMs' internal inductive bias. The code and data are made available at https://github.com/claws-lab/peek.
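A minimal sketch of the probing recipe: fit a linear decoder on frozen embeddings to predict known/unknown labels gathered from a one-off probing run. The embeddings below are synthetic stand-ins, not outputs of a real embedding model:

```python
import numpy as np

def train_linear_probe(emb, labels, lr=0.5, steps=500):
    """Fit a logistic 'decoder' on frozen embeddings to predict whether
    the LLM knows each fact (labels would come from probing the LLM once)."""
    w = np.zeros(emb.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(emb @ w + b, -30, 30)   # clip to avoid exp overflow
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * emb.T @ (p - labels) / len(labels)
        b -= lr * float(np.mean(p - labels))
    return w, b

# Synthetic stand-in: facts the LLM 'knows' cluster in embedding space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, (100, 16)),   # known facts
               rng.normal(-1.0, 1.0, (100, 16))])  # unknown facts
y = np.array([1] * 100 + [0] * 100)
w, b = train_linear_probe(X, y)
acc = float(np.mean(((X @ w + b) > 0) == y))
print(acc)  # high accuracy on this separable toy set
```

Once trained, the probe predicts LLM knowledge from embeddings alone, with no forward passes through the LLM, which is the efficiency gain the paper targets.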

Updated: 2025-08-08 05:32:31

标题: 通过调整预训练嵌入实现大型语言模型的高效知识探索

摘要: 大型语言模型(LLMs)在生成性预训练中遇到的科学、历史和地理等各种领域中获取知识。然而,由于它们的随机性,很难预测LLMs已经获取了什么。先前的工作已经通过调查隐藏表示、构建特定任务提示、策划代表性样本以及估计它们的不确定性等不同方式来探测这些知识。然而,这些方法需要通过基础模型进行前向传递来探测LLMs对特定事实的了解,使得它们在计算上昂贵且耗时。为了弥补这一差距,我们提出了$\textbf{PEEK}$或$\textbf{P}$roxy $\textbf{E}$mbeddings to $\textbf{E}$stimate $\textbf{K}$nowledge of LLMs,通过利用有效将事实知识编码为文本或图形的预训练嵌入模型作为LLMs的代理。首先,我们通过各种探测策略识别LLMs已知的事实的训练集,然后调整嵌入模型以使用线性解码器层预测LLMs的输出。对来自$3$个维基百科衍生数据集、$4$个LLMs和$7$个嵌入模型的全面评估显示,嵌入可以以高达90%的准确率预测LLM在留存集上的知识。此外,我们发现句子嵌入模型比图形嵌入更适合预测LLMs的知识,为事实景观的底层表示提供了启示。因此,我们相信知识适应的嵌入可以用于大规模地识别LLMs中的知识空白,并可以深入了解LLMs的内在归纳偏见。代码和数据可在https://github.com/claws-lab/peek获得。

更新时间: 2025-08-08 05:32:31

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2508.06030v1

Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose \textbf{Temporal Self-Rewarding Language Models} that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) \textit{Anchored Rejection} - fixing rejected responses using the past initial model's outputs and (2) \textit{Future-Guided Chosen} - dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using the same computation resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75 points. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
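The dual-phase pair construction can be sketched as below; the model callables are hypothetical stand-ins for the past (initial) and future (next-generation) model generations:

```python
def build_temporal_pairs(prompts, past_model, future_model):
    """Sketch of the dual-phase DPO pair construction: rejected responses
    are anchored to the initial (past) model, chosen responses are curated
    from a stronger (future) generation, preserving a wide representational
    margin between the two sides of each preference pair."""
    pairs = []
    for p in prompts:
        rejected = past_model(p)    # Anchored Rejection
        chosen = future_model(p)    # Future-Guided Chosen
        pairs.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return pairs

# Toy stand-ins for the two model generations (hypothetical):
past = lambda p: p + " -> draft answer"
future = lambda p: p + " -> refined answer"
print(build_temporal_pairs(["Q1"], past, future))
```

Keeping the rejected side fixed at the past model prevents the chosen and rejected distributions from converging, which is the failure mode the abstract identifies in standard Self-Rewarding.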

Updated: 2025-08-08 05:25:54

标题: 时间自我奖励语言模型:通过过去-未来解耦选择-拒绝

摘要: 自我奖励语言模型提出了一种架构,其中大型语言模型(LLMs)既生成响应,又通过LLM作为评判者的提示评估自己的输出,通过迭代的直接偏好优化(DPO)动态改进其生成能力。然而,我们的分析揭示了现有自我奖励范式中的一个关键限制:选择和拒绝响应的同步改进逐渐缩小了对比样本之间的表征差异,削弱了有效的偏好学习。我们提出了\textbf{时间自我奖励语言模型},通过战略性地协调过去、现在和未来模型生成来维持学习信号。我们的双阶段框架引入了:(1)\textit{锚定拒绝} - 使用过去初始模型的输出修复被拒绝的响应,以及(2)\textit{未来引导的选择} - 使用下一代模型预测动态筛选选择样本。通过对三种模型系列(Llama、Qwen、Mistral)和不同模型大小(Llama3B/8B/70B)进行广泛实验,结果表明,与使用相同计算资源的自我奖励相比,使用我们的方法训练时取得了显著的改进。例如,Llama3.1-8B在AlpacaEval 2.0上使用我们的方法实现了29.44的胜率,优于自我奖励基线(19.69)9.75个百分点。值得注意的是,我们的方法在数学推理(GSM8K)、基于知识的问答(ARC、TruthfulQA)和代码生成(HumanEval)任务上也展示了优越的超出分布泛化能力,即使我们没有专门收集这样的训练数据。

更新时间: 2025-08-08 05:25:54

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.06026v1

Stepwise Fine and Gray: Subject-Specific Variable Selection Shows When Hemodynamic Data Improves Prognostication of Comatose Post-Cardiac Arrest Patients

Prognostication for comatose post-cardiac arrest patients is a critical challenge that directly impacts clinical decision-making in the ICU. Clinical information that informs prognostication is collected serially over time. Shortly after cardiac arrest, various time-invariant baseline features are collected (e.g., demographics, cardiac arrest characteristics). After ICU admission, additional features are gathered, including time-varying hemodynamic data (e.g., blood pressure, doses of vasopressor medications). We view these as two phases in which we collect new features. In this study, we propose a novel stepwise dynamic competing risks model that improves the prediction of neurological outcomes by automatically determining when to take advantage of time-invariant features (first phase) and time-varying features (second phase). Notably, our model finds patients for whom this second phase (time-varying hemodynamic) information is beneficial for prognostication and also when this information is beneficial (as we collect more hemodynamic data for a patient over time, how important these data are for prognostication varies). Our approach extends the standard Fine and Gray model to explicitly model the two phases and to incorporate neural networks to flexibly capture complex nonlinear feature relationships. Evaluated on a retrospective cohort of 2,278 comatose post-arrest patients, our model demonstrates robust discriminative performance for the competing outcomes of awakening, withdrawal of life-sustaining therapy, and death despite maximal support. Our approach generalizes to more than two phases in which new features are collected and could be used in other dynamic prediction tasks, where it may be helpful to know when and for whom newly collected features significantly improve prediction.

Updated: 2025-08-08 05:20:30

标题: 逐步精细和灰度:特定主体变量选择表明血流动力学数据何时改善昏迷后心脏骤停患者的预后预测

摘要: 对于昏迷的心脏骤停患者的预后预测是ICU临床决策中的一个关键挑战。 用于预测的临床信息会随着时间序列性地收集。 在心脏骤停后不久,会收集各种不随时间变化的基线特征(如人口统计学特征、心脏骤停特征)。 ICU入院后,会收集额外的特征,包括时间变化的血流动力学数据(如血压、血管加压药物剂量)。 我们将这视为两个阶段,在其中收集新特征。 在本研究中,我们提出了一种新颖的分步动态竞争风险模型,通过自动确定何时利用不随时间变化的特征(第一阶段)和时间变化的特征(第二阶段)来改善神经系统预后的预测。 值得注意的是,我们的模型发现对于某些患者,第二阶段(时间变化的血流动力学)信息对于预测是有益的,同时还发现了何时这些信息是有益的(随着时间,我们为患者收集更多血流动力学数据,这些数据对于预测的重要性会有所变化)。 我们的方法扩展了标准的Fine and Gray模型,以明确建模这两个阶段,并结合神经网络以灵活地捕捉复杂的非线性特征关系。 在回顾性队列研究中,我们对2,278名昏迷后骤停患者进行评估,我们的模型展示了对于清醒、撤销维持生命治疗和尽最大支持死亡等竞争性结果的稳健区分性能。 我们的方法可以推广应用于收集更多特征的超过两个阶段,并可用于其他动态预测任务,其中了解何时以及对于谁新收集的特征显著改善预测可能是有益的。

更新时间: 2025-08-08 05:20:30

领域: cs.LG

下载: http://arxiv.org/abs/2508.06023v1

Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis

Sub-visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi-class classifiers to such problems, often forcing researchers to rely on less effective methods. The aforementioned issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state-of-the-art diffusion model to address data imbalance by generating high-fidelity images that can augment training datasets, enabling the effective training of multi-class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. To assess the effectiveness of using diffusion-generated images in training datasets, we conduct large-scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with negligible downside. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi-class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at https://github.com/utkuozbulak/svp-generative-ai.

Updated: 2025-08-08 05:15:02

标题: 通过生成式人工智能图像合成在流式成像显微镜中改进的次可见粒子分类

摘要: 使用流式成像显微镜结合深度学习进行次可见粒子分析已被证明能够有效识别粒子类型,使得能够区分诸如硅油等无害成分与蛋白质粒子。然而,可用数据稀缺以及数据集中粒子类型严重不平衡仍然是应用多类分类器解决此类问题时的重要障碍,通常迫使研究人员依赖效果较弱的方法。上述问题对于意外出现且数量较少的粒子类型尤为具有挑战性,如硅油和气泡,与通过受控设置获取大量图像相对简单的蛋白质粒子相比。在这项工作中,我们开发了一种最先进的扩散模型,通过生成高保真度的图像来解决数据不平衡问题,从而增强训练数据集,实现对多类深度神经网络的有效训练。我们通过展示生成的样本与真实粒子图像在视觉质量和结构方面的相似性来验证这种方法。为评估在训练数据集中使用扩散生成图像的有效性,我们对一个包含50万蛋白质粒子图像的验证数据集进行了大规模实验,并证明这种方法在不降低性能的情况下改善了分类性能。最后,为了促进开放研究和可重复性,我们公开发布了我们的扩散模型和经过训练的多类深度神经网络分类器,以及一个简单的接口,方便其轻松集成到未来的研究中,网址为https://github.com/utkuozbulak/svp-generative-ai。

更新时间: 2025-08-08 05:15:02

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2508.06021v1

Position: Intelligent Coding Systems Should Write Programs with Justifications

Intelligent coding systems are transforming software development by enabling users to specify code behavior in natural language. However, the opaque decision-making of AI-driven coders raises trust and usability concerns, particularly for non-expert users who cannot inspect low-level implementations. We argue that these systems should not only generate code but also produce clear, consistent justifications that bridge model reasoning and user understanding. To this end, we identify two critical justification properties-cognitive alignment and semantic faithfulness-and highlight the limitations of existing methods, including formal verification, static analysis, and post-hoc explainability. We advocate exploring neuro-symbolic approaches for justification generation, where symbolic constraints guide model behavior during training and program semantics are enriched through neural representations, enabling automated consistency checks at inference time.

Updated: 2025-08-08 05:04:47

标题: 立场:智能编码系统应该编写带有理由的程序

摘要: 智能编码系统正在通过使用户能够用自然语言指定代码行为来改变软件开发。然而,由人工智能驱动的编码器的不透明决策引发了信任和可用性问题,特别是对于无法检查低级实现的非专家用户。我们认为,这些系统不仅应该生成代码,还应该产生清晰、一致的理由,以建立模型推理和用户理解之间的联系。为此,我们确定了两个关键的理由属性-认知对齐和语义忠实,并强调现有方法的局限性,包括形式验证、静态分析和事后可解释性。我们主张探索神经符号方法用于理由生成,其中符号约束在训练期间指导模型行为,程序语义通过神经表示丰富,从而在推理时实现自动一致性检查。

更新时间: 2025-08-08 05:04:47

领域: cs.SE,cs.CL,cs.LG

下载: http://arxiv.org/abs/2508.06017v1

Crisp Attention: Regularizing Transformers via Structured Sparsity

The quadratic computational cost of the self-attention mechanism is a primary challenge in scaling Transformer models. While attention sparsity is widely studied as a technique to improve computational efficiency, it is almost universally assumed to come at the cost of model accuracy. In this paper, we report a surprising counter-example to this common wisdom. By introducing structured, post-hoc sparsity to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task, we find that model accuracy improves significantly. Our model with 80\% attention sparsity achieves a validation accuracy of 91.59\%, a 0.97\% absolute improvement over the dense baseline. We hypothesize that this phenomenon is due to sparsity acting as a powerful implicit regularizer, preventing the model from overfitting by forcing it to make predictions with a more constrained and robust set of features. Our work recasts attention sparsity not just as a tool for computational efficiency, but as a potential method for improving the generalization and performance of Transformer models.
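One simple way to impose structured post-hoc sparsity is per-query top-k masking of the attention logits before the softmax. This is a sketch of that general scheme, not necessarily the paper's exact masking strategy:

```python
import numpy as np

def sparsify_attention(logits: np.ndarray, sparsity: float) -> np.ndarray:
    """Keep only the top (1 - sparsity) fraction of attention logits per
    query, mask the rest to -inf, then re-normalize with softmax."""
    n = logits.shape[-1]
    k = max(1, int(round((1.0 - sparsity) * n)))
    # per-row threshold at the k-th largest logit
    kth = np.sort(logits, axis=-1)[..., -k][..., None]
    masked = np.where(logits >= kth, logits, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 10))          # 2 queries, 10 keys
attn = sparsify_attention(logits, sparsity=0.8)
print((attn > 0).sum(axis=-1))  # only 2 of 10 keys receive mass per query
```

At 80% sparsity each query distributes its probability mass over a fifth of the keys, forcing predictions through a more constrained feature set, which is the implicit-regularization effect the abstract hypothesizes.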

Updated: 2025-08-08 05:04:28

标题: 清晰关注:通过结构稀疏性对Transformer进行正则化

摘要: 自注意机制的二次计算成本是扩展Transformer模型的主要挑战。虽然注意力稀疏性被广泛研究作为提高计算效率的技术,但几乎普遍认为这是以模型准确性为代价的。在本文中,我们报告了一个令人惊讶的反例,证明了这种普遍的智慧是错误的。通过在SST-2情感分析任务的微调过程中向DistilBERT模型的注意机制引入结构化的事后稀疏性,我们发现模型的准确性显著提高。我们的模型在80\%的注意力稀疏性下实现了91.59\%的验证准确性,比密集基线提高了0.97个百分点。我们假设这种现象是由于稀疏性作为一个强大的隐式正则化器,通过强制模型使用更受约束和稳健的特征集进行预测,防止模型过拟合。我们的工作重新将注意力稀疏性不仅视为提高计算效率的工具,而且作为改善Transformer模型泛化和性能的潜在方法。

更新时间: 2025-08-08 05:04:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.06016v1

FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint

Model fingerprinting is a widely adopted approach to safeguard the intellectual property rights of open-source models by preventing their unauthorized reuse. It is promising and convenient since it does not necessitate modifying the protected model. In this paper, we revisit existing fingerprinting methods and reveal that they are vulnerable to false claim attacks where adversaries falsely assert ownership of any third-party model. We demonstrate that this vulnerability mostly stems from their untargeted nature, where they generally compare the outputs of given samples on different models instead of the similarities to specific references. Motivated by these findings, we propose a targeted fingerprinting paradigm (i.e., FIT-Print) to counteract false claim attacks. Specifically, FIT-Print transforms the fingerprint into a targeted signature via optimization. Building on the principles of FIT-Print, we develop bit-wise and list-wise black-box model fingerprinting methods, i.e., FIT-ModelDiff and FIT-LIME, which exploit the distance between model outputs and the feature attribution of specific samples as the fingerprint, respectively. Extensive experiments on benchmark models and datasets verify the effectiveness, conferrability, and resistance to false claim attacks of our FIT-Print.

Updated: 2025-08-08 05:02:54

标题: FIT-Print: 通过定向指纹技术实现抗虚假声明的模型所有权验证

摘要: 模型指纹技术是一种广泛采用的方法,旨在通过防止未经授权重复使用来保护开源模型的知识产权。这种方法具有前景和便利,因为它不需要修改受保护的模型。本文重新审视现有的指纹技术,并揭示它们容易受到虚假声明攻击的漏洞,即对手可以虚假主张对任何第三方模型拥有所有权。我们证明这种漏洞主要源于它们的非针对性特性,它们通常比较在不同模型上给定样本的输出,而不是与特定参考的相似性。受到这些发现的启发,我们提出了一种有针对性的指纹技术范式(即FIT-Print)来抵制虚假声明攻击。具体地,FIT-Print通过优化将指纹转化为有针对性的签名。借鉴FIT-Print的原则,我们开发了逐位和逐列表的黑盒模型指纹技术,即FIT-ModelDiff和FIT-LIME,分别利用模型输出之间的距离和特定样本的特征归因作为指纹。在基准模型和数据集上进行的大量实验证实了我们的FIT-Print的有效性、可信性和抵抗虚假声明攻击的能力。

更新时间: 2025-08-08 05:02:54

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2501.15509v3

A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation

Multi-modality magnetic resonance imaging (MRI) data facilitate the early diagnosis, tumor segmentation, and disease staging in the management of nasopharyngeal carcinoma (NPC). The lack of publicly available, comprehensive datasets limits advancements in diagnosis, treatment planning, and the development of machine learning algorithms for NPC. Addressing this critical need, we introduce the first comprehensive NPC MRI dataset, encompassing MR axial imaging of 277 primary NPC patients. This dataset includes T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences, totaling 831 scans. In addition to the corresponding clinical data, segmentations manually annotated and labeled by experienced radiologists offer high-quality data resources from untreated primary NPC.

Updated: 2025-08-08 04:55:53

标题: 一份带有多种模态分割的原发性鼻咽癌MRI数据集

摘要: 多模态磁共振成像(MRI)数据有助于鼻咽癌(NPC)的早期诊断、肿瘤分割和疾病分期管理。公开可用的综合数据集的缺乏限制了NPC的诊断、治疗规划和机器学习算法的发展。为了满足这一关键需求,我们介绍了第一个全面的NPC MRI数据集,包括277名原发性NPC患者的MR轴向成像。该数据集包括T1加权、T2加权和增强T1加权序列,共831次扫描。除了相应的临床数据外,经验丰富的放射科医师手动注释和标记的分割为未经治疗的原发性NPC提供了高质量的数据资源。

更新时间: 2025-08-08 04:55:53

领域: eess.IV,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2404.03253v3

Layers at Similar Depths Generate Similar Activations Across LLM Architectures

How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not "obvious" either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.
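The nearest-neighbor comparison can be sketched as follows: if two sets of activations share geometry up to a change of basis, their k-nearest-neighbor sets coincide, while unrelated activations share almost none. The data below are synthetic stand-ins for layer activations:

```python
import numpy as np

def knn_sets(acts: np.ndarray, k: int):
    """Index sets of each row's k nearest neighbours (cosine similarity)."""
    x = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)   # exclude self-matches
    return [set(np.argsort(row)[-k:]) for row in sim]

def mean_jaccard(acts_a, acts_b, k=5):
    """How strongly two activation sets agree on who-is-near-whom."""
    na, nb = knn_sets(acts_a, k), knn_sets(acts_b, k)
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(na, nb)]))

rng = np.random.default_rng(0)
layer = rng.normal(size=(50, 32))
rotation = np.linalg.qr(rng.normal(size=(32, 32)))[0]       # orthogonal basis change
same_geometry = mean_jaccard(layer, layer @ rotation)       # geometry preserved
unrelated = mean_jaccard(layer, rng.normal(size=(50, 32)))  # a different "layer"
print(same_geometry, unrelated)  # near 1.0 vs near 0.0
```

Scores like these, computed between corresponding layers of different models, are what would distinguish the paper's two claims: high cross-model agreement at matched depths, low agreement across mismatched ones.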

Updated: 2025-08-08 04:45:03

标题: 在相似深度的层次上,LLM架构产生相似的激活

摘要: 独立训练的LLM使用的潜在空间之间如何相关?我们研究了24个开放权重LLM的不同层次的激活引发的最近邻关系,并发现它们1)在模型内的不同层次之间往往变化,2)在不同模型的对应层次之间大致共享。第二项声明表明这些最近邻关系并不是随意的,因为它们跨模型共享,但第一项声明表明它们也不是“明显”的,因为没有一组普遍共享的最近邻关系。总的来说,这表明LLM在不同层次生成激活几何,但整个进展在模型之间大部分共享,被拉伸和挤压以适应不同的架构。

更新时间: 2025-08-08 04:45:03

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.08775v3

CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval

Despite the success of text retrieval in many NLP tasks, code retrieval remains a largely underexplored area. Most text retrieval systems are tailored for natural language queries, often neglecting the specific challenges of retrieving code. This gap leaves existing models unable to effectively capture the diversity of programming languages and tasks across different domains, highlighting the need for more focused research in code retrieval. To address this, we introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework, enhancing model generalizability and retrieval performance. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark. In addition to excelling in code retrieval, our models demonstrate competitive performance on the widely adopted BeIR text retrieval benchmark, offering versatility across domains. Experimental results demonstrate that improving retrieval performance significantly enhances end-to-end Retrieval-Augmented Generation (RAG) performance for code-related tasks.

Updated: 2025-08-08 04:35:26

标题: CodeXEmbed:用于多语言和多任务代码检索的通用嵌入模型族

摘要: 尽管文本检索在许多自然语言处理任务中取得成功,但代码检索仍然是一个很少被探索的领域。大多数文本检索系统都是针对自然语言查询设计的,通常忽略了检索代码的特定挑战。这种差距使得现有模型无法有效捕捉不同领域中的编程语言和任务的多样性,突显了需要在代码检索方面进行更加专注的研究。为了解决这个问题,我们引入了CodeXEmbed,这是一个包含400M至7B个参数的大规模代码嵌入模型系列。我们的新颖训练流程统一了多种编程语言,并将各种与代码相关的任务转化为一个通用检索框架,增强了模型的泛化能力和检索性能。我们的7B模型在代码检索方面达到了新的最先进水平,在CoIR基准测试中比之前领先的Voyage-Code模型超过了20%。除了在代码检索方面表现出色外,我们的模型还在被广泛采用的BeIR文本检索基准测试中展现出竞争力,为不同领域提供了多样性。实验结果表明,改进检索性能显著提升了与代码相关任务的端到端检索增强生成(RAG)性能。

更新时间: 2025-08-08 04:35:26

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2411.12644v3

CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction

Real-life robot navigation involves more than just reaching a destination; it requires optimizing movements while addressing scenario-specific goals. An intuitive way for humans to express these goals is through abstract cues like verbal commands or rough sketches. Such human guidance may lack details or be noisy. Nonetheless, we expect robots to navigate as intended. For robots to interpret and execute these abstract instructions in line with human expectations, they must share a common understanding of basic navigation concepts with humans. To this end, we introduce CANVAS, a novel framework that combines visual and linguistic instructions for commonsense-aware navigation. Its success is driven by imitation learning, enabling the robot to learn from human navigation behavior. We present COMMAND, a comprehensive dataset with human-annotated navigation results, spanning over 48 hours and 219 km, designed to train commonsense-aware navigation systems in simulated environments. Our experiments show that CANVAS outperforms the strong rule-based system ROS NavStack across all environments, demonstrating superior performance with noisy instructions. Notably, in the orchard environment, where ROS NavStack records a 0% total success rate, CANVAS achieves a total success rate of 67%. CANVAS also closely aligns with human demonstrations and commonsense constraints, even in unseen environments. Furthermore, real-world deployment of CANVAS showcases impressive Sim2Real transfer with a total success rate of 69%, highlighting the potential of learning from human demonstrations in simulated environments for real-world applications.

Updated: 2025-08-08 04:20:36

Domains: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2410.01273v3

The Ensemble Kalman Update is an Empirical Matheron Update

The Ensemble Kalman Filter (EnKF) is a widely used method for data assimilation in high-dimensional systems, with an ensemble update step equivalent to an empirical version of the Matheron update popular in Gaussian process regression -- a connection that links half a century of data-assimilation engineering to modern path-wise GP sampling. This paper provides a compact introduction to this simple but under-exploited connection, with necessary definitions accessible to all fields involved. Source code is available at https://github.com/danmackinlay/paper_matheron_equals_enkf .
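
The one-dimensional case makes the identity concrete. Below is a pure-Python sketch (not the paper's code; its repository is linked above) of the stochastic EnKF analysis step written as an empirical Matheron update, with an illustrative linear observation model y = x + noise:

```python
import random

def matheron_update(xs, y, obs_var, rng):
    """Stochastic EnKF analysis step as an empirical Matheron update (1-D
    sketch). Each prior sample is shifted path-wise toward the posterior
    using a gain estimated from the ensemble itself."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # empirical prior variance
    gain = var / (var + obs_var)                      # empirical Kalman gain
    # Perturbed observations: each member draws its own simulated obs noise,
    # which is exactly the path-wise (Matheron) correction of a prior sample.
    return [x + gain * (y + rng.gauss(0.0, obs_var ** 0.5) - x) for x in xs]

rng = random.Random(0)
prior = [rng.gauss(0.0, 1.0) for _ in range(4000)]    # x ~ N(0, 1)
post = matheron_update(prior, y=1.0, obs_var=0.5, rng=rng)
# Exact posterior is N(2/3, 1/3); the updated ensemble approximates it.
```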

Updated: 2025-08-08 04:16:01

Domains: cs.LG,stat.ML,93E11 (Primary) 65C20 62M20 (Secondary)

Download: http://arxiv.org/abs/2502.03048v3

Hand by Hand: LLM Driving EMS Assistant for Operational Skill Learning

Operational skill learning, inherently physical and reliant on hands-on practice and kinesthetic feedback, has yet to be effectively replicated in large language model (LLM)-supported training. Current LLM training assistants primarily generate customized textual feedback, neglecting the crucial kinesthetic modality. This gap derives from the textual and uncertain nature of LLMs, compounded by concerns on user acceptance of LLM driven body control. To bridge this gap and realize the potential of collaborative human-LLM action, this work explores human experience of LLM driven kinesthetic assistance. Specifically, we introduced an "Align-Analyze-Adjust" strategy and developed FlightAxis, a tool that integrates LLM with Electrical Muscle Stimulation (EMS) for flight skill acquisition, a representative operational skill domain. FlightAxis learns flight skills from manuals and guides forearm movements during simulated flight tasks. Our results demonstrate high user acceptance of LLM-mediated body control and significantly reduced task completion times. Crucially, trainees reported that this kinesthetic assistance enhanced their awareness of operation flaws and fostered increased engagement in the training process, rather than relieving perceived load. This work demonstrated the potential of kinesthetic LLM training in operational skill acquisition.

Updated: 2025-08-08 04:05:47

Domains: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.06000v1

Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making

Complex medical decision-making involves cooperative workflows operated by different clinicians. Designing AI multi-agent systems can expedite and augment human-level clinical decision-making. Existing multi-agent research primarily focuses on language-only tasks, yet its extension to multimodal scenarios remains challenging. A blind combination of diverse vision-language models (VLMs) can amplify erroneous outcome interpretations. VLMs are generally less capable at instruction following and, importantly, self-reflection than large language models (LLMs) of comparable size. This disparity largely constrains VLMs' ability to operate in cooperative workflows. In this study, we propose MedOrch, a mediator-guided multi-agent collaboration framework for medical multimodal decision-making. MedOrch employs an LLM-based mediator agent that enables multiple VLM-based expert agents to exchange and reflect on their outputs towards collaboration. We utilize multiple open-source general-purpose and domain-specific VLMs instead of costly GPT-series models, revealing the strength of heterogeneous models. We show that collaboration among distinct VLM-based agents can surpass the capabilities of any individual agent. We validate our approach on five medical vision question answering benchmarks, demonstrating superior collaboration performance without model training. Our findings underscore the value of mediator-guided multi-agent collaboration in advancing medical multimodal intelligence. Our code will be made publicly available.
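
As a rough illustration of the mediator idea (not the MedOrch implementation; the expert and mediator behaviors below are toy stand-in functions), a mediator can pool the expert agents' answers, feed the pooled summary back for a round of reflection, and take the final majority:

```python
from collections import Counter

def mediator_rounds(experts, question, rounds=2):
    """Mediator-guided collaboration sketch: each expert answers, the mediator
    broadcasts a summary of the pooled answers back for reflection, and the
    final decision is the majority after the last round."""
    context = ""
    answers = []
    for _ in range(rounds):
        answers = [ask(question, context) for ask in experts]
        tally = Counter(answers)
        context = f"peer answers so far: {dict(tally)}"   # mediator's summary
    return Counter(answers).most_common(1)[0][0]

# Toy experts: two are confident; one defers to the majority on reflection.
e1 = lambda q, ctx: "pneumonia"
e2 = lambda q, ctx: "pneumonia"
e3 = lambda q, ctx: "pneumonia" if "pneumonia" in ctx else "edema"
consensus = mediator_rounds([e1, e2, e3], "chest x-ray finding?")
```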

Updated: 2025-08-08 04:02:10

Domains: cs.AI

Download: http://arxiv.org/abs/2508.05996v1

Optimizing Prompt Sequences using Monte Carlo Tree Search for LLM-Based Optimization

Large language models (LLMs) have demonstrated remarkable capabilities in code generation and structured reasoning; however, their performance often degrades on complex tasks that require consistent multi-step planning. Recent work has explored combining LLMs with Monte Carlo Tree Search (MCTS), yet existing approaches primarily focus on generating heuristic-based code for optimization or target simpler tasks where correctness alone is sufficient. In this work, we propose MCTS-OPS, a novel neural-symbolic framework that formulates prompt selection as a sequential decision process guided by MCTS. Our method explores and refines multi-step prompt sequences for the goal of improving code generation quality and enhancing the problem-solving capabilities of LLMs in general optimization. Experiments on network optimization show significant improvement over the baselines, both in the success rate of executing the generated code and in the optimization results with the specified objective and constraints (2$\sim$4$\times$ higher reward and 3$\times$ lower standard deviation). Moreover, it improves the chance of attaining the optimal solution by about 10\% of cases, compared to baseline methods in hard problems. These results highlight the promise of combining symbolic planning with LLMs for robust, high-quality code generation in complex domains.
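
A minimal sketch of MCTS over prompt sequences, with a generic reward oracle standing in for executing and evaluating the LLM-generated code (the `Node` fields, UCB constant, and uniform rollout policy are illustrative choices, not the paper's):

```python
import math
import random

class Node:
    def __init__(self, prefix):
        self.prefix = prefix      # tuple of prompts chosen so far
        self.children = {}        # prompt -> child Node
        self.visits = 0
        self.value = 0.0

def ucb(parent, child, c):
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def mcts_prompt_search(prompts, reward, depth, iters=300, c=1.4, seed=0):
    """Minimal MCTS over prompt sequences: each tree edge appends one
    candidate prompt; `reward` scores a complete sequence. Returns the
    most-visited first prompt, i.e., the root decision."""
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iters):
        node, path = root, [root]
        while len(node.prefix) < depth:
            untried = [p for p in prompts if p not in node.children]
            if untried:                                   # expansion
                child = Node(node.prefix + (rng.choice(untried),))
                node.children[child.prefix[-1]] = child
                node, path = child, path + [child]
                break
            node = max(node.children.values(), key=lambda ch: ucb(node, ch, c))
            path.append(node)
        tail = tuple(rng.choice(prompts) for _ in range(depth - len(node.prefix)))
        r = reward(node.prefix + tail)                    # rollout + evaluation
        for n in path:                                    # backpropagation
            n.visits += 1
            n.value += r
    return max(root.children.values(), key=lambda ch: ch.visits).prefix[0]

# Toy objective: only sequences starting with prompt "b" succeed.
best = mcts_prompt_search(["a", "b", "c"],
                          lambda seq: 1.0 if seq[0] == "b" else 0.0, depth=2)
```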

Updated: 2025-08-08 04:01:24

Domains: cs.LG

Download: http://arxiv.org/abs/2508.05995v1

Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training

With the end of Moore's law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can improve model accuracy and substantially lower energy consumption, with reductions of up to 38x observed in certain cases.
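
One way to instantiate maximum-entropy subsampling, sketched here with a greedy histogram-entropy criterion on 1-D data (an assumption for illustration; SICKLE's MaxEnt sampler operates on far higher-dimensional turbulence fields):

```python
import math
from collections import Counter

def entropy(samples, bins=10, lo=0.0, hi=1.0):
    """Shannon entropy of a 1-D sample set under a fixed histogram."""
    counts = Counter(min(int((x - lo) / (hi - lo) * bins), bins - 1)
                     for x in samples)
    n = len(samples)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def maxent_subsample(data, k, bins=10):
    """Greedy maximum-entropy curation: repeatedly add the point that most
    increases the entropy of the subset, favoring rare regions of the
    distribution over redundant ones."""
    chosen = []
    remaining = list(data)
    for _ in range(k):
        best = max(remaining, key=lambda x: entropy(chosen + [x], bins))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Heavily skewed data: most points sit in a single histogram bin.
skewed = [0.05] * 90 + [i / 10 + 0.05 for i in range(10)]
sub = maxent_subsample(skewed, k=10)
# The curated subset spreads across bins instead of repeating 0.05.
```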

Updated: 2025-08-08 03:55:25

Domains: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2508.03872v2

ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge

Emotion recognition plays a vital role in enhancing human-computer interaction. In this study, we tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. To address the issue of data scarcity, we leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Specifically, for the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we introduce a context-enriched method that employs large language models to enrich emotional cues within the input text. To effectively integrate these multimodal features, we propose a fusion strategy comprising two key components, i.e., self-attention mechanisms for dynamic modality weighting, and residual connections to preserve original representations. Beyond architectural design, we further refine noisy labels in the training set by a multi-source labeling strategy. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset, attaining a weighted F-score of 87.49% compared to 78.63%, thereby validating the effectiveness of the proposed framework.

Updated: 2025-08-08 03:55:25

Domains: cs.CV,cs.AI,cs.CY

Download: http://arxiv.org/abs/2508.05991v1

ETA: Energy-based Test-time Adaptation for Depth Completion

We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some ``source'' data, often predict erroneous outputs when transferred to ``target'' data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method ``Energy-based Test-time Adaptation'', or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% outdoors and 10.23% indoors. Project Page: https://fuzzythecat.github.io/eta.
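
The energy-minimizing adaptation step can be caricatured in one dimension; the scalar parameter, quadratic energy, and finite-difference gradient below are toy stand-ins for the paper's network parameters, learned energy model, and backpropagation:

```python
def tta_minimize_energy(theta, predict, energy, x, steps=50, lr=0.1, eps=1e-4):
    """Test-time adaptation sketch: nudge a pretrained parameter so that the
    energy model scores the prediction as in-distribution, using a
    finite-difference gradient on a single scalar parameter."""
    for _ in range(steps):
        e0 = energy(predict(x, theta))
        g = (energy(predict(x, theta + eps)) - e0) / eps   # dE/dtheta
        theta -= lr * g                                    # gradient step
    return theta

# Toy setup: in-distribution predictions have scale 1.0, so energy is the
# squared distance of the predicted value from 1.0.
predict = lambda x, theta: theta * x
energy = lambda pred: (pred - 1.0) ** 2
theta_adapted = tta_minimize_energy(theta=0.2, predict=predict,
                                    energy=energy, x=1.0)
```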

Updated: 2025-08-08 03:51:24

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.05989v1

Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal

Recently, Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces introduce substantial challenges in terms of training cost, inference latency, and deployment feasibility. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps. In this paper, we propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. It then enables a logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP teaches models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning in coding tasks. Experiments show that ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs. On the challenging LiveCodeBench v4_v5 benchmark, our approach reduces token generation by 23.5% and inference latency by 43.5% compared to the strongest baseline, while achieving a competitive accuracy of 36.19% in Pass@1. Our results highlight a promising direction for building powerful and efficient LRMs.
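
If per-token log-probabilities and step boundaries are available, first-token-surprisal pruning might look like the sketch below; the keep-the-most-surprising rule and the ratio parameter are our reading of the abstract, not the paper's exact procedure:

```python
def first_token_surprisal(steps, token_logprobs):
    """steps: list of (start, end) token index ranges, one per CoT step.
    token_logprobs: log p(token_i | prefix) for every token, from the LM.
    Returns the surprisal -log p of each step's first token."""
    return [-token_logprobs[s] for s, _ in steps]

def prune_steps(steps, token_logprobs, keep_ratio=0.5):
    """Keep the logically salient steps: those with the highest first-token
    surprisal (an unexpected step opening signals new reasoning content)."""
    scores = first_token_surprisal(steps, token_logprobs)
    k = max(1, round(len(steps) * keep_ratio))
    ranked = sorted(range(len(steps)), key=lambda i: -scores[i])
    return sorted(ranked[:k])   # surviving step indices, in original order

logps = [-0.1, -2.3, -0.2, -0.05, -3.0, -0.4]   # toy per-token log-probs
steps = [(0, 2), (2, 4), (4, 6)]                # three 2-token steps
kept_two = prune_steps(steps, logps, keep_ratio=0.67)
```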

Updated: 2025-08-08 03:46:21

Domains: cs.LG,cs.SE

Download: http://arxiv.org/abs/2508.05988v1

Can Large Language Models Adequately Perform Symbolic Reasoning Over Time Series?

Uncovering hidden symbolic laws from time series data, as an aspiration dating back to Kepler's discovery of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms with varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, where LLMs act both as predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery.

Updated: 2025-08-08 03:41:16

Domains: cs.AI

Download: http://arxiv.org/abs/2508.03963v2

Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling

We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term that encourages exploration of the unknown environment (which quantifies the information gained about the environment through observed human feedback data). To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified \emph{surrogate environment} and introduce a novel distance measure (named the \emph{$\ell_g$-distance}), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order $O(H^{\frac{3}{2}}\sqrt{\log(K(\epsilon)) T})$, where $H$ is the episode length, $T$ is the number of episode and $K(\epsilon)$ is related to the covering number of the environment. Specializing to the tabular settings, this regret bound is of order $\tilde{O}(H^2\sqrt{SAT})$, where $S$ and $A$ are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.

Updated: 2025-08-08 03:36:58

Domains: cs.LG

Download: http://arxiv.org/abs/2502.05434v3

CF3: Compact and Fast 3D Feature Fields

3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.

Updated: 2025-08-08 03:35:45

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.05254v2

Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to $Q$-Learning

Algorithms for solving \textit{nonlinear} fixed-point equations -- such as average-reward \textit{$Q$-learning} and \textit{TD-learning} -- often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak--Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i.) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii.) taming the nonlinearity by coupling the semi-norm's contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free $\tilde{O}(1/\sqrt{t})$ optimal rates for $Q$-learning in both average-reward and exponentially discounted settings, where $t$ denotes the iteration index. The result applies within a broad framework that accommodates synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained either from simulators or along Markovian trajectories.
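
For the tabular discounted case, Polyak-Ruppert averaging of Q-learning iterates can be sketched as follows; the polynomial step size and uniform behavior policy are illustrative choices, not prescribed by the paper:

```python
import random

def averaged_q_learning(env_step, n_states, n_actions, T, gamma=0.9, seed=0):
    """Tabular Q-learning with Polyak-Ruppert averaging: the returned table
    is the running average of the iterates Q_1, ..., Q_T."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    Qbar = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for t in range(1, T + 1):
        a = rng.randrange(n_actions)                  # uniform exploration
        r, s2 = env_step(s, a, rng)
        target = r + gamma * max(Q[s2])               # bootstrapped target
        Q[s][a] += (target - Q[s][a]) / t ** 0.75     # polynomial step size
        for i in range(n_states):                     # Qbar_t = mean(Q_1..Q_t)
            for j in range(n_actions):
                Qbar[i][j] += (Q[i][j] - Qbar[i][j]) / t
        s = s2
    return Qbar

# One-state toy MDP (a bandit): action 0 pays 1, action 1 pays 0.
Qbar = averaged_q_learning(lambda s, a, rng: (1.0 if a == 0 else 0.0, 0),
                           n_states=1, n_actions=2, T=2000, gamma=0.0)
```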

Updated: 2025-08-08 03:35:29

Domains: cs.LG

Download: http://arxiv.org/abs/2508.05984v1

Learning by Teaching: Engaging Students as Instructors of Large Language Models in Computer Science Education

While Large Language Models (LLMs) are often used as virtual tutors in computer science (CS) education, this approach can foster passive learning and over-reliance. This paper presents a novel pedagogical paradigm that inverts this model: students act as instructors who must teach an LLM to solve problems. To facilitate this, we developed strategies for designing questions with engineered knowledge gaps that only a student can bridge, and we introduce Socrates, a system for deploying this method with minimal overhead. We evaluated our approach in an undergraduate course and found that this active-learning method led to statistically significant improvements in student performance compared to historical cohorts. Our work demonstrates a practical, cost-effective framework for using LLMs to deepen student engagement and mastery.

Updated: 2025-08-08 03:25:19

Domains: cs.CY,cs.AI,cs.HC

Download: http://arxiv.org/abs/2508.05979v1

DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching

Singing Voice Conversion (SVC) transfers a source singer's timbre to a target while keeping melody and lyrics. The key challenge in any-to-any SVC is adapting unseen speaker timbres to source audio without quality degradation. Existing methods either face timbre leakage or fail to achieve satisfactory timbre similarity and quality in the generated audio. To address these challenges, we propose DAFMSVC, where the self-supervised learning (SSL) features from the source audio are replaced with the most similar SSL features from the target audio to prevent timbre leakage. It also incorporates a dual cross-attention mechanism for the adaptive fusion of speaker embeddings, melody, and linguistic content. Additionally, we introduce a flow matching module for high quality audio generation from the fused features. Experimental results show that DAFMSVC significantly enhances timbre similarity and naturalness, outperforming state-of-the-art methods in both subjective and objective evaluations.
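
The SSL-feature replacement step can be sketched as a nearest-neighbor lookup: each source frame is swapped for the most similar frame from the target singer. The 2-D frame vectors below are toy stand-ins for the high-dimensional SSL features the paper uses:

```python
def replace_with_nearest(source_feats, target_feats):
    """Timbre-leakage mitigation sketch: swap every source SSL frame for its
    most cosine-similar frame from the target singer, so the content stream
    carries the target's timbre characteristics."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)
    return [max(target_feats, key=lambda t: cos(s, t)) for s in source_feats]

# Toy frames: each source frame maps to its closest target frame.
src = [(1.0, 0.0), (0.0, 1.0)]
tgt = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]
replaced = replace_with_nearest(src, tgt)
```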

Updated: 2025-08-08 03:24:19

Domains: cs.SD,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.05978v1

LinguaFluid: Language Guided Fluid Control via Semantic Rewards in Reinforcement Learning

In the domain of scientific machine learning, designing effective reward functions remains a challenge in reinforcement learning (RL), particularly in environments where task goals are difficult to specify numerically. Reward functions in existing work are predominantly based on heuristics, manual engineering, or task-specific tuning. In this work, we introduce a semantically aligned reinforcement learning method where rewards are computed by aligning the current state with a target semantic instruction using a Sentence-Bidirectional Encoder Representations from Transformers (SBERT). Instead of relying on manually defined reward functions, the policy receives feedback based on the reward, which is a cosine similarity between the goal textual description and the statement description in the episode. We evaluated our approach in several environments and showed that semantic reward can guide learning to achieve competitive control behavior, even in the absence of hand-crafted reward functions. Our study demonstrates a correlation between the language embedding space and the conventional Euclidean space. This framework opens new horizons for aligning agent behavior with natural language goals and lays the groundwork for a more seamless integration of larger language models (LLMs) and fluid control applications.
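
A sketch of the semantic reward, with a toy bag-of-words embedder standing in for the SBERT sentence encoder the paper actually uses:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for SBERT sentence vectors."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_reward(goal, state_description):
    """Reward = cosine similarity between the goal text and a textual
    description of the current state, replacing a hand-crafted reward."""
    return cosine(embed(goal), embed(state_description))

r_near = semantic_reward("reduce drag on the upper surface",
                         "drag on the upper surface is decreasing")
r_far = semantic_reward("reduce drag on the upper surface",
                        "temperature is rising near the inlet")
```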

Updated: 2025-08-08 03:23:56

Domains: cs.LG,physics.flu-dyn

Download: http://arxiv.org/abs/2508.05977v1

MI9 -- Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems

Agentic AI systems capable of reasoning, planning, and executing actions present fundamentally distinct governance challenges compared to traditional AI models. Unlike conventional AI, these systems exhibit emergent and unexpected behaviors during runtime, introducing novel agent-related risks that cannot be fully anticipated through pre-deployment governance alone. To address this critical gap, we introduce MI9, the first fully integrated runtime governance framework designed specifically for safety and alignment of agentic AI systems. MI9 introduces real-time controls through six integrated components: agency-risk index, agent-semantic telemetry capture, continuous authorization monitoring, Finite-State-Machine (FSM)-based conformance engines, goal-conditioned drift detection, and graduated containment strategies. Operating transparently across heterogeneous agent architectures, MI9 enables the systematic, safe, and responsible deployment of agentic systems in production environments where conventional governance approaches fall short, providing the foundational infrastructure for safe agentic AI deployment at scale. Detailed analysis through a diverse set of scenarios demonstrates MI9's systematic coverage of governance challenges that existing approaches fail to address, establishing the technical foundation for comprehensive agentic AI oversight.
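
A toy version of an FSM-based conformance engine; the lifecycle states, events, and policy table below are hypothetical, and the "violation" signal marks where MI9's graduated containment strategies would attach:

```python
def make_conformance_engine(transitions, start):
    """FSM-based conformance sketch: allowed agent lifecycles are encoded as
    a transition table; any event outside it is flagged.
    `transitions` maps (state, event) -> next state."""
    state = start
    def observe(event):
        nonlocal state
        nxt = transitions.get((state, event))
        if nxt is None:
            return state, "violation"   # hook for graduated containment
        state = nxt
        return state, "ok"
    return observe

# Hypothetical policy: an agent must plan before acting, and act before writing.
policy = {("idle", "plan"): "planned",
          ("planned", "act"): "acted",
          ("acted", "write"): "idle"}
observe = make_conformance_engine(policy, "idle")
```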

Updated: 2025-08-08 03:18:04

Domains: cs.AI,cs.ET,cs.MA

Download: http://arxiv.org/abs/2508.03858v2

Impact-driven Context Filtering For Cross-file Code Completion

Retrieval-augmented generation (RAG) has recently demonstrated considerable potential for repository-level code completion, as it integrates cross-file knowledge with in-file preceding code to provide comprehensive contexts for generation. To better understand the contribution of the retrieved cross-file contexts, we introduce a likelihood-based metric to evaluate the impact of each retrieved code chunk on the completion. Our analysis reveals that, despite retrieving numerous chunks, only a small subset positively contributes to the completion, while some chunks even degrade performance. To address this issue, we leverage this metric to construct a repository-level dataset where each retrieved chunk is labeled as positive, neutral, or negative based on its relevance to the target completion. We then propose an adaptive retrieval context filtering framework, CODEFILTER, trained on this dataset to mitigate the harmful effects of negative retrieved contexts in code completion. Extensive evaluation on the RepoEval and CrossCodeLongEval benchmarks demonstrates that CODEFILTER consistently improves completion accuracy compared to approaches without filtering operations across various tasks. Additionally, CODEFILTER significantly reduces the length of the input prompt, enhancing computational efficiency while exhibiting strong generalizability across different models. These results underscore the potential of CODEFILTER to enhance the accuracy, efficiency, and attributability of repository-level code completion.
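
One plausible instantiation of the likelihood-based impact metric and the positive/neutral/negative labeling, with a toy scoring function standing in for the language model's log-likelihood; the threshold `tau` is an assumed hyperparameter:

```python
def chunk_impact(loglik, infile_context, chunk, target):
    """Impact of a retrieved chunk = change in the log-likelihood of the
    ground-truth completion when the chunk is prepended to the context."""
    return (loglik(chunk + "\n" + infile_context, target)
            - loglik(infile_context, target))

def label_chunks(loglik, infile_context, chunks, target, tau=0.1):
    """Positive / neutral / negative labels used to train the filter."""
    labels = {}
    for c in chunks:
        d = chunk_impact(loglik, infile_context, c, target)
        labels[c] = ("positive" if d > tau
                     else "negative" if d < -tau else "neutral")
    return labels

# Toy stand-in for an LM: likelihood rises when context shares words with
# the target completion.
def toy_loglik(context, target):
    ctx = set(context.split())
    return sum(-0.1 if w in ctx else -1.0 for w in target.split())

labels = label_chunks(toy_loglik, "def area(r):",
                      ["PI = 3.14159", "import os"], "return PI * r * r")
```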

Updated: 2025-08-08 03:08:19

Domains: cs.SE,cs.AI

Download: http://arxiv.org/abs/2508.05970v1

ActMiner: Applying Causality Tracking and Increment Aligning for Graph-based Cyber Threat Hunting

To defend against Advanced Persistent Threats on the endpoint, threat hunting employs security knowledge such as cyber threat intelligence to continuously analyze system audit logs through retrospective scanning, querying, or pattern matching, aiming to uncover attack patterns/graphs that traditional detection methods (e.g., Point-of-Interest recognition) fail to capture. However, existing threat hunting systems based on provenance graphs face challenges of high false negatives, high false positives, and low efficiency when confronted with diverse attack tactics and voluminous audit logs. To address these issues, we propose a system called ActMiner, which constructs query graphs from descriptive relationships in cyber threat intelligence reports for precise threat hunting (i.e., graph alignment) on provenance graphs. First, we present a heuristic search strategy based on equivalent semantic transfer to reduce false negatives. Second, we establish a filtering mechanism based on causal relationships of attack behaviors to mitigate false positives. Finally, we design a tree structure to incrementally update the alignment results, significantly improving hunting efficiency. Evaluation on the DARPA Engagement dataset demonstrates that compared to the SOTA POIROT, ActMiner reduces false positives by 39.1%, eliminates all false negatives, and effectively counters adversarial attacks.

Updated: 2025-08-08 03:01:40

Categories: cs.CR

Download: http://arxiv.org/abs/2501.05793v2

Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs on both indoor and outdoor scenes, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.

Updated: 2025-08-08 03:01:17

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04928v2

Resource-Efficient Automatic Software Vulnerability Assessment via Knowledge Distillation and Particle Swarm Optimization

The increasing complexity of software systems has led to a surge in cybersecurity vulnerabilities, necessitating efficient and scalable solutions for vulnerability assessment. However, the deployment of large pre-trained models in real-world scenarios is hindered by their substantial computational and storage demands. To address this challenge, we propose a novel resource-efficient framework that integrates knowledge distillation and particle swarm optimization to enable automated vulnerability assessment. Our framework employs a two-stage approach: First, particle swarm optimization is utilized to optimize the architecture of a compact student model, balancing computational efficiency and model capacity. Second, knowledge distillation is applied to transfer critical vulnerability assessment knowledge from a large teacher model to the optimized student model. This process significantly reduces the model size while maintaining high performance. Experimental results on an enhanced MegaVul dataset, comprising 12,071 CVSS (Common Vulnerability Scoring System) v3 annotated vulnerabilities, demonstrate the effectiveness of our approach. Our approach achieves a 99.4% reduction in model size while retaining 89.3% of the original model's accuracy. Furthermore, it outperforms state-of-the-art baselines by 1.7% in accuracy with 60% fewer parameters. The framework also reduces training time by 72.1% and architecture search time by 34.88% compared to traditional genetic algorithms.
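
The distillation stage transfers the teacher's soft predictions to the compact student. A generic Hinton-style objective (temperature-softened KL divergence plus hard-label cross-entropy) is sketched below in plain NumPy; the exact loss, temperature, and weighting used in the paper are not specified here, so treat these as assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    """Generic knowledge-distillation objective (illustrative values of
    T and alpha): KL to the softened teacher plus cross-entropy on the
    hard labels, with the usual T^2 scaling on the soft term."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()
    hard = softmax(student_logits)
    ce = -np.log(hard[np.arange(len(labels)), labels]).mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

The student architecture that minimizes this loss is what the particle swarm search optimizes; that search itself is not modeled in this sketch.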

Updated: 2025-08-08 02:51:44

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2508.02840v2

Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning

Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned and behavior policies, leading to out-of-distribution (OOD) actions and overestimation. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism may hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets.
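
A loose reading of the regularized evaluation is a critic loss that adds a behavior-cloning penalty to the usual TD error. The squared-error BC term and the weight `lam` below are illustrative assumptions, not the paper's exact regularized Bellman backup.

```python
import numpy as np

def mcre_critic_loss(q, q_target, actor, batch, gamma=0.99, lam=0.1):
    """Illustrative critic objective: TD error plus a behavior-cloning
    penalty keeping the actor near the dataset actions. `q`, `q_target`
    and `actor` are callables standing in for the networks."""
    s, a, r, s2, a_data = batch
    target = r + gamma * q_target(s2, actor(s2))  # bootstrapped TD target
    td = np.mean((q(s, a) - target) ** 2)         # temporal-difference error
    bc = np.mean((actor(s) - a_data) ** 2)        # mild conservatism term
    return td + lam * bc
```

Tuning `lam` trades off conservatism against performance, which is the balance the MCRE framework is designed to strike.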

Updated: 2025-08-08 02:48:26

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.05960v1

Evaluation of LLMs in AMR Parsing

AMR (Abstract Meaning Representation) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising, straightforward new direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve performance comparable to complex state-of-the-art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.

Updated: 2025-08-08 02:47:19

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.05028v2

Multi-Armed Bandits-Based Optimization of Decision Trees

Decision trees, without appropriate constraints, can easily become overly complex and prone to overfitting, capturing noise rather than generalizable patterns. To resolve this problem, pruning is a crucial part of optimizing decision trees, as it not only reduces the complexity of trees but also decreases the probability of generating overfit models. Conventional pruning techniques like Cost-Complexity Pruning (CCP) and Reduced Error Pruning (REP) are mostly based on greedy approaches that focus on immediate gains in performance while pruning nodes of the decision tree. However, this may result in lower generalization in the long run, compromising the robustness of the tree model when introduced to unseen data samples, particularly when trained with small and complex datasets. To address this challenge, we propose a Multi-Armed Bandits (MAB)-based pruning approach, a reinforcement learning (RL)-based technique, that dynamically prunes the tree to generate an optimal decision tree with better generalization. Our approach frames the pruning process as an exploration-exploitation problem, utilizing MAB algorithms to find optimal branch nodes to prune based on feedback from each pruning action. Experimental evaluation on several benchmark datasets demonstrates that our approach achieves better predictive performance than traditional techniques. This suggests the potential of utilizing MAB for a dynamic and probabilistic approach to decision tree pruning, in turn optimizing decision tree-based models.
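
The exploration-exploitation framing can be sketched with UCB1: each prunable node is an arm, and the reward is the validation gain observed when pruning it is tried. The paper's exact MAB variant and reward definition are not specified here, so this is a minimal sketch.

```python
import math

def ucb_prune(nodes, reward_fn, rounds=100, c=1.4):
    """Pick the best node to prune via UCB1. Each prunable node is a
    bandit arm; `reward_fn(node)` returns the (possibly noisy)
    validation gain from pruning it. Illustrative only."""
    counts = {n: 0 for n in nodes}
    values = {n: 0.0 for n in nodes}
    for t in range(1, rounds + 1):
        untried = [n for n in nodes if counts[n] == 0]
        if untried:
            arm = untried[0]  # play every arm once first
        else:
            # UCB1 index: empirical mean plus an exploration bonus
            arm = max(nodes, key=lambda n: values[n]
                      + c * math.sqrt(math.log(t) / counts[n]))
        r = reward_fn(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # running mean
    return max(nodes, key=lambda n: values[n])
```

In a full system this selection step would be repeated, pruning the chosen node and re-evaluating, until no arm yields a positive expected gain.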

Updated: 2025-08-08 02:43:45

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.05957v1

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.

Updated: 2025-08-08 02:38:47

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.05954v1

HALO: Hindsight-Augmented Learning for Online Auto-Bidding

Digital advertising platforms operate millisecond-level auctions through Real-Time Bidding (RTB) systems, where advertisers compete for ad impressions through algorithmic bids. This dynamic mechanism enables precise audience targeting but introduces profound operational complexity due to advertiser heterogeneity: budgets and ROI targets span orders of magnitude across advertisers, from individual merchants to multinational brands. This diversity creates a demanding adaptation landscape for Multi-Constraint Bidding (MCB). Traditional auto-bidding solutions fail in this environment due to two critical flaws: 1) severe sample inefficiency, where failed explorations under specific constraints yield no transferable knowledge for new budget-ROI combinations, and 2) limited generalization under constraint shifts, as they ignore physical relationships between constraints and bidding coefficients. To address this, we propose HALO: Hindsight-Augmented Learning for Online Auto-Bidding. HALO introduces a theoretically grounded hindsight mechanism that repurposes all explorations into training data for arbitrary constraint configuration via trajectory reorientation. Further, it employs B-spline functional representation, enabling continuous, derivative-aware bid mapping across constraint spaces. HALO ensures robust adaptation even when budget/ROI requirements differ drastically from training scenarios. Industrial dataset evaluations demonstrate the superiority of HALO in handling multi-scale constraints, reducing constraint violations while improving GMV.

Updated: 2025-08-08 02:37:04

Categories: cs.LG

Download: http://arxiv.org/abs/2508.03267v3

MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies

Robust 3D occupancy prediction is essential for autonomous driving, particularly under adverse weather conditions where traditional vision-only systems struggle. While the fusion of surround-view 4D radar and cameras offers a promising low-cost solution, effectively extracting and integrating features from these heterogeneous sensors remains challenging. This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction that leverages both multi-view 4D radar and images. To address the limitations of directly applying LiDAR-oriented encoders to sparse radar data, we propose a Radar Height Self-Attention module that enhances vertical spatial reasoning and feature extraction. Additionally, a Hierarchical Multi-scale Multi-modal Fusion strategy is developed to perform adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignments and enriching fused feature representations. To reduce reliance on expensive point cloud annotations, we further propose a pseudo-label generation pipeline based on an open-set segmentor. This enables a semi-supervised strategy that achieves 90% of the fully supervised performance using only 50% of the ground truth labels, offering an effective trade-off between annotation cost and accuracy. Extensive experiments demonstrate that MetaOcc under full supervision achieves state-of-the-art performance, outperforming previous methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset, and by +1.16 SC IoU and +1.24 mIoU on the SurroundOcc-nuScenes dataset. These results demonstrate the scalability and robustness of MetaOcc across sensor domains and training conditions, paving the way for practical deployment in real-world autonomous systems. Code and data are available at https://github.com/LucasYang567/MetaOcc.

Updated: 2025-08-08 02:34:29

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2501.15384v3

ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has proven effective in integrating external knowledge into large language models (LLMs) for solving question-answer (QA) tasks. State-of-the-art RAG approaches often use graph data as the external data, since graphs capture rich semantic information and link relationships between entities. However, existing graph-based RAG approaches cannot accurately identify the relevant information from the graph and also consume large numbers of tokens in the online retrieval process. To address these issues, we introduce a novel graph-based RAG approach, called Attributed Community-based Hierarchical RAG (ArchRAG), by augmenting the question using attributed communities, and also introducing a novel LLM-based hierarchical clustering method. To retrieve the most relevant information from the graph for the question, we build a novel hierarchical index structure for the attributed communities and develop an effective online retrieval method. Experimental results demonstrate that ArchRAG outperforms existing methods in both accuracy and token cost.

Updated: 2025-08-08 02:33:09

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2502.09891v3

A 3DGS-Diffusion Self-Supervised Framework for Normal Estimation from a Single Image

The lack of spatial dimensional information remains a challenge in normal estimation from a single image. While recent diffusion-based methods have demonstrated significant potential in 2D-to-3D implicit mapping, they rely on data-driven statistical priors and miss the explicit modeling of light-surface interaction, leading to multi-view normal direction conflicts. Moreover, the discrete sampling mechanism of diffusion models causes gradient discontinuity in differentiable rendering reconstruction modules, preventing 3D geometric errors from being backpropagated to the normal generation network, thereby forcing existing methods to depend on dense normal annotations. This paper proposes SINGAD, a novel Self-supervised framework from a single Image for Normal estimation via 3D GAussian splatting guided Diffusion. By integrating physics-driven light-interaction modeling and a differentiable rendering-based reprojection strategy, our framework directly converts 3D geometric errors into normal optimization signals, solving the challenges of multi-view geometric inconsistency and data dependency. Specifically, the framework constructs a light-interaction-driven 3DGS reparameterization model to generate multi-scale geometric features consistent with light transport principles, ensuring multi-view normal consistency. A cross-domain feature fusion module is designed within a conditional diffusion model, embedding geometric priors to constrain normal generation while maintaining accurate geometric error propagation. Furthermore, a differentiable 3D reprojection loss strategy is introduced for self-supervised optimization that minimizes geometric error between the reconstructed and input image, eliminating dependence on annotated normal datasets. Quantitative evaluations on the Google Scanned Objects dataset demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.

Updated: 2025-08-08 02:32:33

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.05950v1

RANA: Robust Active Learning for Noisy Network Alignment

Network alignment has attracted widespread attention in various fields. However, most existing works mainly focus on the problem of label sparsity, while overlooking the issue of noise in network alignment, which can substantially undermine model performance. Such noise mainly includes structural noise from noisy edges and labeling noise caused by human-induced and process-driven errors. To address these problems, we propose RANA, a Robust Active learning framework for noisy Network Alignment. RANA effectively tackles both structure noise and label noise while addressing the sparsity of anchor link annotations, which can improve the robustness of network alignment models. Specifically, RANA introduces the proposed Noise-aware Selection Module and the Label Denoising Module to address structural noise and labeling noise, respectively. In the first module, we design a noise-aware maximization objective to select node pairs, incorporating a cleanliness score to address structural noise. In the second module, we propose a novel multi-source fusion denoising strategy that leverages model and twin node pairs labeling to provide more accurate labels for node pairs. Empirical results on three real-world datasets demonstrate that RANA outperforms state-of-the-art active learning-based methods in alignment accuracy. Our code is available at https://github.com/YXNan0110/RANA.

Updated: 2025-08-08 02:30:10

Categories: cs.LG

Download: http://arxiv.org/abs/2507.22434v2

Benchmarking Deception Probes via Black-to-White Performance Boosts

AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called "deception probes") have been trained to distinguish the internal activations of a language model during deceptive versus honest responses. However, it's unclear how effective these probes are at detecting deception in practice, nor whether such probes are resistant to simple counter strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white box monitor outperforms the black-box monitor, i.e. the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.
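
The benchmark quantity is simply the gap between the white-box and black-box monitors on the same examples. Below is a minimal sketch measuring that gap as an AUC difference; the paper's exact scoring metric is an assumption here.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability a positive example outscores a
    negative one (assumes no tied scores, for simplicity)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def black_to_white_boost(white_scores, black_scores, labels):
    """How much the probe-informed (white-box) monitor improves on the
    black-box monitor on the same deceptive/honest examples."""
    return (auc(np.asarray(white_scores), labels)
            - auc(np.asarray(black_scores), labels))
```

A positive boost indicates the probe activations carry detection signal beyond what the black-box monitor can infer from outputs alone.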

Updated: 2025-08-08 02:19:16

Categories: cs.AI,cs.LG,68T01,I.2.7; K.4.1

Download: http://arxiv.org/abs/2507.12691v2

Prosocial Behavior Detection in Player Game Chat: From Aligning Human-AI Definitions to Efficient Annotation at Scale

Detecting prosociality in text--communication intended to affirm, support, or improve others' behavior--is a novel and increasingly important challenge for trust and safety systems. Unlike toxic content detection, prosociality lacks well-established definitions and labeled data, requiring new approaches to both annotation and deployment. We present a practical, three-stage pipeline that enables scalable, high-precision prosocial content classification while minimizing human labeling effort and inference costs. First, we identify the best LLM-based labeling strategy using a small seed set of human-labeled examples. We then introduce a human-AI refinement loop, where annotators review high-disagreement cases between GPT-4 and humans to iteratively clarify and expand the task definition-a critical step for emerging annotation tasks like prosociality. This process results in improved label quality and definition alignment. Finally, we synthesize 10k high-quality labels using GPT-4 and train a two-stage inference system: a lightweight classifier handles high-confidence predictions, while only $\sim$35\% of ambiguous instances are escalated to GPT-4o. This architecture reduces inference costs by $\sim$70% while achieving high precision ($\sim$0.90). Our pipeline demonstrates how targeted human-AI interaction, careful task formulation, and deployment-aware architecture design can unlock scalable solutions for novel responsible AI tasks.
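
The two-stage inference system can be sketched as a confidence router: the lightweight classifier commits on confident cases and escalates only the ambiguous band to the LLM. Thresholds and function names below are illustrative, not the paper's implementation.

```python
def route(texts, cheap_prob, llm_label, low=0.2, high=0.8):
    """Two-stage inference: `cheap_prob(text)` is the lightweight
    classifier's P(prosocial); `llm_label(text)` stands in for an
    escalation call to the stronger LLM. Returns predicted labels and
    the fraction of inputs escalated (~35% in the paper's deployment)."""
    labels, escalated = [], 0
    for t in texts:
        p = cheap_prob(t)
        if p >= high:
            labels.append(1)          # confidently prosocial
        elif p <= low:
            labels.append(0)          # confidently not
        else:
            labels.append(llm_label(t))  # ambiguous: escalate
            escalated += 1
    return labels, escalated / len(texts)
```

Widening the `[low, high]` band raises precision at the cost of more LLM calls, which is the cost/accuracy dial the architecture exposes.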

Updated: 2025-08-08 02:04:14

Categories: cs.CL,cs.AI,cs.CY,I.2.7; K.4

Download: http://arxiv.org/abs/2508.05938v1

Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates--where one model is given the official answer to defend, and another constructs and defends an alternative answer--adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm's effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination--a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (50% -> 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems while maintaining a fraction of the cost of creating new benchmarks. Overall, our framework underscores that "pretraining on the test set is no longer all you need," offering a sustainable path for measuring the genuine reasoning ability of advanced language models.
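
The debate protocol can be sketched as a small orchestration loop: a defender argues for the official answer, a challenger proposes and defends an alternative, and a judge blind to which answer is official picks a winner. The agent callables below are stand-ins for LLM calls.

```python
def debate_eval(question, official, judge, defender, challenger, rounds=2):
    """Sketch of the debate-driven evaluation of one QA item. The
    defender is given the official answer; the challenger constructs
    an alternative; the judge sees both answers and the transcript but
    not which one is official. Returns True if the official answer
    survives the debate."""
    alt = challenger("propose", question, official)
    transcript = []
    for _ in range(rounds):  # multi-round argumentation raises difficulty
        transcript.append(("A", defender("argue", question, official, transcript)))
        transcript.append(("B", challenger("argue", question, alt, transcript)))
    verdict = judge(question, official, alt, transcript)  # "A" or "B"
    return verdict == "A"
```

Accuracy under this protocol penalizes shallow memorization: a model that merely recalls the answer string still has to defend it against counter-arguments.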

Updated: 2025-08-08 01:56:30

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.17747v2

ASLSL: Adaptive shared latent structure learning with incomplete multi-modal physiological data for multi-dimensional emotional feature selection

Recently, multi-modal physiological signals based emotion recognition has garnered increasing attention in the field of brain-computer interfaces. Nevertheless, the associated multi-modal physiological features are often high-dimensional and inevitably include irrelevant, redundant, and noisy representations, which can easily lead to overfitting, poor performance, and high computational complexity in emotion classifiers. Feature selection has been widely applied to address these challenges. However, previous studies generally assumed that multi-modal physiological data are complete, whereas in reality, the data are often incomplete due to the openness of the acquisition and operational environment. For example, a part of samples are available in several modalities but not in others. To address this issue, we propose a novel method for incomplete multi-modal physiological signal feature selection called adaptive shared latent structure learning (ASLSL). Based on the property that similar features share similar emotional labels, ASLSL employs adaptive shared latent structure learning to explore a common latent space shared by incomplete multi-modal physiological signals and multi-dimensional emotional labels, thereby mitigating the impact of missing information and mining consensus information. The two most popular multi-modal physiological emotion datasets (DEAP and DREAMER) with multi-dimensional emotional labels were utilized to compare the performance of ASLSL against seventeen feature selection methods. Comprehensive experimental results on these datasets demonstrate the effectiveness of ASLSL.

Updated: 2025-08-08 01:54:02

Domains: cs.HC,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.05934v1

REFS: Robust EEG feature selection with missing multi-dimensional annotation for emotion recognition

The affective brain-computer interface is a crucial technology for affective interaction and emotional intelligence, emerging as a significant area of research in human-computer interaction. Compared to single-type features, multi-type EEG features provide a multi-level representation for analyzing multi-dimensional emotions. However, the high dimensionality of multi-type EEG features, combined with the relatively small number of high-quality EEG samples, poses challenges such as classifier overfitting and suboptimal real-time performance in multi-dimensional emotion recognition. Moreover, practical applications of affective brain-computer interfaces frequently encounter partial absence of multi-dimensional emotional labels due to the open nature of the acquisition environment, as well as ambiguity and variability in individual emotion perception. To address these challenges, this study proposes a novel EEG feature selection method for multi-dimensional emotion recognition with missing labels. The method leverages adaptive orthogonal non-negative matrix factorization to reconstruct the multi-dimensional emotional label space through second-order and higher-order correlations, which reduces the negative impact of missing values and outliers on label reconstruction. Simultaneously, it employs least squares regression with graph-based manifold learning regularization and global feature redundancy minimization regularization to enable EEG feature subset selection despite missing information, ultimately achieving robust EEG-based multi-dimensional emotion recognition. Simulation experiments on three widely used multi-dimensional emotional datasets, DREAMER, DEAP, and HDED, reveal that the proposed method outperforms thirteen advanced feature selection methods in terms of robustness for EEG emotional feature selection.
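The label-space reconstruction step rests on non-negative matrix factorization. The sketch below is plain multiplicative-update NMF; the adaptive orthogonality constraints and higher-order correlation terms the paper adds are omitted, so treat this only as the basic building block:

```python
import numpy as np

def nmf(Y, rank, iters=200, eps=1e-9, seed=0):
    """Plain Lee-Seung multiplicative-update NMF, Y ~ W @ H, as a stand-in
    for the adaptive *orthogonal* NMF used in the paper (orthogonality and
    higher-order correlation terms are omitted here)."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        # Multiplicative updates keep W, H non-negative by construction.
        H *= (W.T @ Y) / (W.T @ W @ H + eps)
        W *= (Y @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

On an exactly low-rank non-negative matrix, the reconstruction error drops close to zero within a few hundred iterations.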

Updated: 2025-08-08 01:53:46

Domains: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.05933v1

Exploring Superior Function Calls via Reinforcement Learning

Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance group relative policy optimization through strategic entropy-based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all code, models, and datasets to benefit the community.
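One way to picture "entropy-based exploration" on top of group-relative advantages is to add a centered entropy bonus to the normalized group advantage. The paper's exact formulation is not given in the abstract, so the bonus form and the `beta` coefficient below are illustrative assumptions:

```python
import numpy as np

def entropy_bonus_advantages(rewards, logprobs, beta=0.1):
    """Group-relative advantages with an entropy-based exploration bonus.

    A hedged sketch: the bonus form and `beta` are our assumptions, not the
    paper's exact mechanism.
    rewards:  (G,) rewards for G sampled completions of one prompt.
    logprobs: list of per-token log-prob arrays, one array per completion.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Standard group-relative baseline: center and normalize within the group.
    adv = rewards - rewards.mean()
    if adv.std() > 0:
        adv = adv / adv.std()
    # Mean per-token entropy proxy: -mean(log p) of the sampled tokens.
    ent = np.array([-np.mean(lp) for lp in logprobs])
    # Centered bonus so the group baseline property is preserved.
    return adv + beta * (ent - ent.mean())
```

Completions with higher sampling entropy get a slightly larger advantage, nudging the policy to keep exploring during early training.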

Updated: 2025-08-08 01:50:45

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.05118v2

SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression

Imbalanced regression refers to prediction tasks where the target variable is skewed. This skewness hinders machine learning models, especially neural networks, which concentrate on dense regions and therefore perform poorly on underrepresented (minority) samples. Despite the importance of this problem, only a few methods have been proposed for imbalanced regression. Many of the available solutions for imbalanced regression adapt techniques from the class imbalance domain, such as linear interpolation and the addition of Gaussian noise, to create synthetic data in sparse regions. However, in many cases, the underlying distribution of the data is complex and non-linear. Consequently, these approaches generate synthetic samples that do not accurately represent the true feature-target relationship. To overcome these limitations, we propose SMOGAN, a two-step oversampling framework for imbalanced regression. In Stage 1, an existing oversampler generates initial synthetic samples in sparse target regions. In Stage 2, we introduce DistGAN, a distribution-aware GAN that serves as SMOGAN's filtering layer and refines these samples via adversarial loss augmented with a Maximum Mean Discrepancy objective, aligning them with the true joint feature-target distribution. Extensive experiments on 23 imbalanced datasets show that SMOGAN consistently outperforms the default oversampling method without the DistGAN filtering layer.
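The Maximum Mean Discrepancy term that augments DistGAN's adversarial loss can be computed as below. This is a standard biased MMD^2 estimator with an RBF kernel; the kernel choice and bandwidth here are our assumptions, not necessarily the paper's:

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimator of squared Maximum Mean Discrepancy with an RBF
    kernel, a sketch of the MMD objective SMOGAN adds to the adversarial
    loss. X, Y: (n, d) and (m, d) sample arrays; `gamma` is an assumed
    bandwidth parameter."""
    def k(A, B):
        # Pairwise squared Euclidean distances, then Gaussian kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

MMD^2 is zero when the two samples coincide and grows as the synthetic and real joint feature-target distributions drift apart, which is exactly the signal a filtering layer needs.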

Updated: 2025-08-08 01:49:27

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2504.21152v2

An ML-based Approach to Predicting Software Change Dependencies: Insights from an Empirical Study on OpenStack

As software systems grow in complexity, accurately identifying and managing dependencies among changes becomes increasingly critical. For instance, a change that leverages a function must depend on the change that introduces it. Establishing such dependencies allows CI/CD pipelines to build and orchestrate changes effectively, preventing build failures and incomplete feature deployments. In modern software systems, dependencies often span multiple components across teams, creating challenges for development and deployment. They serve various purposes, from enabling new features to managing configurations, and can even involve traditionally independent changes like documentation updates. To address these challenges, we conducted a preliminary study on dependency management in OpenStack, a large-scale software system. Our study revealed that a substantial portion of software changes in OpenStack over the past 10 years are interdependent. Surprisingly, 51.08% of these dependencies are identified during the code review phase, after a median delay of 5.06 hours, rather than at the time of change creation. Developers often spend a median of 57.12 hours identifying dependencies, searching among a median of 463 other changes. To help developers proactively identify dependencies, we propose a semi-automated approach that leverages two ML models. The first model predicts the likelihood of dependencies among changes, while the second identifies the exact pairs of dependent changes. Our proposed models demonstrate strong performance, achieving average AUC scores of 79.33% and 91.89%, and Brier scores of 0.11 and 0.014, respectively. The second model achieves good top-k recall across all types of pairs, while its top-k precision leaves room for improvement.
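The top-k recall the abstract cites for the pair-identification model can be computed as follows. This is a generic metric sketch, not the authors' evaluation code; the pair and score names are illustrative:

```python
def top_k_recall(scored_pairs, true_pairs, k):
    """Top-k recall for dependency-pair prediction.

    scored_pairs: list of ((change_a, change_b), score) tuples, e.g. from a
    model like the paper's second ML model (names here are illustrative).
    true_pairs:   set of actually dependent change pairs.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[1], reverse=True)
    top = {pair for pair, _ in ranked[:k]}
    return len(top & true_pairs) / max(len(true_pairs), 1)
```

A developer-facing tool would surface the k highest-scored candidate pairs, so top-k recall directly measures how many true dependencies land in that shortlist.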

Updated: 2025-08-08 01:37:25

Domains: cs.SE,cs.LG

Download: http://arxiv.org/abs/2508.05034v2

SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation

Domain Adaptation (DA) aims to transfer knowledge from a labeled source domain to an unlabeled or sparsely labeled target domain under domain shifts. Most prior works focus on capturing inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. To tackle this tradeoff, we propose a generalized graph SPectral Alignment framework, SPA++. Its core is briefly condensed as follows: (1) by casting the DA problem to graph primitives, it composes a coarse graph alignment mechanism with a novel spectral regularizer that aligns the domain graphs in eigenspaces; (2) it further develops a fine-grained neighbor-aware propagation mechanism for enhanced discriminability in the target domain; (3) by incorporating data augmentation and consistency regularization, SPA++ can adapt to complex scenarios, including most DA settings and even challenging distribution shifts. Furthermore, we provide theoretical analysis to support our method, including the generalization bound of graph-based DA and the roles of spectral alignment and smoothing consistency. Extensive experiments on benchmark datasets demonstrate that SPA++ consistently outperforms existing cutting-edge methods, achieving superior robustness and adaptability across various challenging adaptation scenarios.
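A minimal way to see "aligning domain graphs in eigenspaces" is to penalize the gap between the low-end Laplacian spectra of the two domains' affinity graphs. The sketch below illustrates the idea only; SPA++'s actual regularizer may differ in form and normalization:

```python
import numpy as np

def spectral_alignment_penalty(A_src, A_tgt, k=3):
    """Penalize the gap between the k smallest graph-Laplacian eigenvalues
    of source and target affinity graphs. A minimal sketch of the
    'spectral regularizer' idea; the exact SPA++ formulation may differ.
    A_src, A_tgt: symmetric (n, n) affinity matrices."""
    def lap_eigs(A):
        d = A.sum(1)
        L = np.diag(d) - A               # unnormalized graph Laplacian
        return np.sort(np.linalg.eigvalsh(L))
    es, et = lap_eigs(A_src)[:k], lap_eigs(A_tgt)[:k]
    return float(((es - et) ** 2).sum())
```

Identical graph structures incur zero penalty, while structurally different domain graphs (e.g. a path versus a fully connected triangle) incur a positive one, giving the optimizer a structure-level alignment signal.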

Updated: 2025-08-08 01:32:13

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.05182v2

DeepMDV: Global Spatial Matching for Multi-depot Vehicle Routing Problems

The rapid growth of online retail and e-commerce has made effective and efficient Vehicle Routing Problem (VRP) solutions essential. To meet rising demand, companies are adding more depots, which changes the VRP problem to a complex optimization task of Multi-Depot VRP (MDVRP) where the routing decisions of vehicles from multiple depots are highly interdependent. The complexities render traditional VRP methods suboptimal and non-scalable for the MDVRP. In this paper, we propose a novel approach to solve MDVRP addressing these interdependencies, hence achieving more effective results. The key idea is, the MDVRP can be broken down into two core spatial tasks: assigning customers to depots and optimizing the sequence of customer visits. We adopt task-decoupling approach and propose a two-stage framework that is scalable: (i) an interdependent partitioning module that embeds spatial and tour context directly into the representation space to globally match customers to depots and assign them to tours; and (ii) an independent routing module that determines the optimal visit sequence within each tour. Extensive experiments on both synthetic and real-world datasets demonstrate that our method outperforms all baselines across varying problem sizes, including the adaptations of learning-based solutions for single-depot VRP. Its adaptability and performance make it a practical and readily deployable solution for real-world logistics challenges.
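The two-stage decomposition (assign customers to depots, then order visits) can be sketched with plain greedy heuristics standing in for the learned partitioning and routing modules. Both stages below are deliberately naive baselines, not the paper's method:

```python
import math

def solve_mdvrp_greedy(depots, customers):
    """Two-stage sketch mirroring the paper's decomposition:
    (i) assign each customer to a depot, (ii) order each depot's customers
    into a tour. Both stages here are plain greedy heuristics, stand-ins
    for the learned partitioning and routing modules."""
    dist = math.dist
    # Stage 1: nearest-depot assignment (the paper learns this globally,
    # with spatial and tour context embedded in the representation space).
    groups = {i: [] for i in range(len(depots))}
    for c in customers:
        groups[min(range(len(depots)), key=lambda i: dist(depots[i], c))].append(c)
    # Stage 2: nearest-neighbor visit order within each depot's group.
    tours = {}
    for i, pts in groups.items():
        cur, rest, tour = depots[i], list(pts), []
        while rest:
            nxt = min(rest, key=lambda p: dist(cur, p))
            rest.remove(nxt)
            tour.append(nxt)
            cur = nxt
        tours[i] = tour
    return tours
```

The interesting part of the paper is precisely what this sketch leaves out: stage 1 decisions are interdependent across depots, which is why a learned global matching beats independent nearest-depot assignment.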

Updated: 2025-08-08 01:27:26

Domains: cs.DB,cs.AI,cs.LG

Download: http://arxiv.org/abs/2411.17080v3

MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints

Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars: arts, attire, customs, entertainment, food, and religion, presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.

Updated: 2025-08-08 01:24:20

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.05429v2

Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO's effectiveness and robustness. On various models, S-GRPO significantly outperforms DR. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while standard GRPO fails to learn under 20% synthetic reward noise, S-GRPO maintains stable learning progress. These results highlight S-GRPO's potential for more robust and effective training of large-scale reasoning models. Code and data are available at: https://github.com/shenpeijun0212/S-GRPO
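For context, the group-relative advantage that both GRPO and S-GRPO build on can be written with an optional per-sample reweighting hook. The weights below are caller-supplied placeholders, not the noise-aware optimum the paper derives:

```python
import numpy as np

def grpo_advantages(rewards, weights=None):
    """Group-relative advantages, optionally reweighted per sample.

    With uniform weights this is the standard GRPO normalization (center by
    the group mean, scale by the group std). S-GRPO derives noise-aware
    weights for unbalanced groups; the `weights` argument here is a
    placeholder for such a scheme, not the paper's derived optimum."""
    r = np.asarray(rewards, dtype=float)
    w = np.ones_like(r) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    mu = (w * r).sum()                            # weighted group baseline
    sd = np.sqrt((w * (r - mu) ** 2).sum()) or 1.0
    return (r - mu) / sd
```

In an unbalanced group (e.g. one correct answer among many wrong ones) the single positive sample receives a large advantage, which is exactly where a noisy reward does the most damage and where reweighting helps.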

Updated: 2025-08-08 01:24:06

Domains: cs.LG

Download: http://arxiv.org/abs/2508.05928v1

Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

Language model (LM) agents are increasingly used as autonomous decision-makers which need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs' ability to explore and infer causal relationships, using the well-established Blicket Test paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not child-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.

Updated: 2025-08-08 01:06:10

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.09614v2

Enhancing Software Vulnerability Detection Through Adaptive Test Input Generation Using Genetic Algorithm

Software vulnerabilities continue to undermine the reliability and security of modern systems, particularly as software complexity outpaces the capabilities of traditional detection methods. This study introduces a genetic algorithm-based method for test input generation that innovatively integrates genetic operators and adaptive learning to enhance software vulnerability detection. A key contribution is the application of the crossover operator, which facilitates exploration by searching across a broader space of potential test inputs. Complementing this, an adaptive feedback mechanism continuously learns from the system's execution behavior and dynamically guides input generation toward promising areas of the input space. Rather than relying on fixed or randomly selected inputs, the approach evolves a population of structurally valid test cases using feedback-driven selection, enabling deeper and more effective code traversal. This strategic integration of exploration and exploitation ensures that both diverse and targeted test inputs are developed over time. Evaluation was conducted across nine open-source JSON-processing libraries. The proposed method achieved substantial improvements in coverage compared to a benchmark evolutionary fuzzing method, with average gains of 39.8% in class coverage, 62.4% in method coverage, 105.0% in line coverage, 114.0% in instruction coverage, and 166.0% in branch coverage. These results highlight the method's capacity to detect deeper and more complex vulnerabilities, offering a scalable and adaptive solution to software security testing.
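The genetic machinery the abstract describes (crossover for exploration, feedback-driven selection for exploitation) can be sketched over token-list test inputs. This is a bare-bones generational loop; the paper's structural validity checks, mutation, and coverage-based feedback are omitted, and all names here are illustrative:

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover over token lists: the operator the paper uses
    to search a broader space of candidate test inputs. Assumes both
    parents have length >= 2."""
    cut = rng.randrange(1, min(len(parent_a), len(parent_b)))
    return parent_a[:cut] + parent_b[cut:]

def evolve(population, fitness, rng, generations=10):
    """Minimal generational loop: keep the fittest half (feedback-driven
    selection), refill with crossover children. The real system adds
    mutation, structural validity checks, and coverage feedback."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: max(2, len(population) // 2)]
        children = [crossover(rng.choice(elite), rng.choice(elite), rng)
                    for _ in range(len(population) - len(elite))]
        population = elite + children
    return max(population, key=fitness)
```

In the fuzzing setting, `fitness` would be a coverage score from instrumented execution rather than a pure function of the input.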

Updated: 2025-08-08 01:03:22

Domains: cs.SE,cs.AI

Download: http://arxiv.org/abs/2508.05923v1

Enhancing Construction Site Analysis and Understanding with 3D Segmentation

Monitoring construction progress is crucial yet resource-intensive, prompting the exploration of computer-vision-based methodologies for enhanced efficiency and scalability. Traditional data acquisition methods, primarily focusing on indoor environments, falter in the complex, cluttered, and dynamically changing conditions of construction sites. This paper critically evaluates the application of two advanced 3D segmentation methods, Segment Anything Model (SAM) and Mask3D, in challenging outdoor and indoor conditions. Trained initially on indoor datasets, both models' adaptability and performance are assessed in real-world construction settings, highlighting the gap in current segmentation approaches due to the absence of benchmarks for outdoor scenarios. Through a comparative analysis, this study not only showcases the relative effectiveness of SAM and Mask3D but also addresses the critical need for tailored segmentation workflows capable of extracting actionable insights from construction site data, thereby advancing the field towards more automated and precise monitoring techniques.

Updated: 2025-08-08 00:57:39

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.05922v1

Fast, Convex and Conditioned Network for Multi-Fidelity Vectors and Stiff Univariate Differential Equations

Accuracy in neural PDE solvers often breaks down not because of limited expressivity, but due to poor optimisation caused by ill-conditioning, especially in multi-fidelity and stiff problems. We study this issue in Physics-Informed Extreme Learning Machines (PIELMs), a convex variant of neural PDE solvers, and show that asymptotic components in governing equations can produce highly ill-conditioned activation matrices, severely limiting convergence. We introduce Shifted Gaussian Encoding, a simple yet effective activation filtering step that increases matrix rank and expressivity while preserving convexity. Our method extends the solvable range of Peclet numbers in steady advection-diffusion equations by over two orders of magnitude, achieves up to six orders lower error on multi-frequency function learning, and fits high-fidelity image vectors more accurately and faster than deep networks with over a million parameters. This work highlights that conditioning, not depth, is often the bottleneck in scientific neural solvers and that simple architectural changes can unlock substantial gains.
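The convexity claim is concrete: in an ELM-style solver only the linear output weights are trained, so fitting reduces to (ridge) least squares on a fixed activation matrix, and conditioning of that matrix is what decides success. The sketch below builds a Gaussian activation matrix and solves the convex fit; how Shifted Gaussian Encoding actually places or filters the centers is not specified in the abstract, so the `centers`/`width` knobs here are assumptions:

```python
import numpy as np

def gaussian_features(x, centers, width):
    """Activation matrix of Gaussian neurons. Shifting and spreading
    `centers` is the kind of knob Shifted Gaussian Encoding tunes; the
    exact encoding is an assumption here."""
    return np.exp(-((x[:, None] - centers[None, :]) / width) ** 2)

def pielm_fit(x, y, centers, width, ridge=1e-8):
    """Convex ELM-style fit: only the linear output weights are trained,
    by ridge least squares on the fixed activation matrix H."""
    H = gaussian_features(x, centers, width)
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ y)
    return H @ beta, np.linalg.cond(H)   # fit and the conditioning it hinges on
```

Monitoring `np.linalg.cond(H)` while varying the encoding makes the paper's point observable: the same convex solve succeeds or fails depending on how well-conditioned H is, not on model depth.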

Updated: 2025-08-08 00:51:38

Domains: cs.LG,math.FA,math.RT,physics.comp-ph

Download: http://arxiv.org/abs/2508.05921v1

Humans overrely on overconfident language models, across languages

As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Prior work shows that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., 'I think it's') differs sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate LLM safety in a global context. Our work finds that overreliance risks are high across languages. We first analyze the distribution of LLM-generated epistemic markers and observe that LLMs are overconfident across languages, frequently generating strengtheners even as part of incorrect responses. Model generations are, however, sensitive to documented cross-linguistic variation in usage: for example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. Next, we measure human reliance rates across languages, finding that reliance behaviors differ cross-linguistically: for example, participants are significantly more likely to discount expressions of uncertainty in Japanese than in English (i.e., ignore their 'hedging' function and rely on generations that contain them). Taken together, these results indicate a high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.

Updated: 2025-08-08 00:50:04

Domains: cs.CL,cs.AI,cs.HC

Download: http://arxiv.org/abs/2507.06306v2

Dual Signal Decomposition of Stochastic Time Series

This paper addresses the decomposition of a stochastic time series into three time series representing a dual signal, i.e., the mean and the dispersion, with noise isolated. Decomposition is done by applying machine learning to fit the dual signal. The learning minimizes a loss function that balances fitting the original time series against penalizing irregularities of the dual signal; the penalty includes terms based on the first- and second-order derivatives along time. To preserve special patterns, weighting of the regularization components of the loss function is introduced based on Statistical Process Control methodology. The proposed decomposition can be applied as a smoothing algorithm for the mean and dispersion of the time series. By isolating noise, it can also be seen as a denoising algorithm. Two learning approaches are considered: sequential and joint. The former learns the mean signal first and then the dispersion; the latter fits the dual signal jointly. Joint learning can uncover complex relationships in time series with heteroskedasticity. Learning is performed either by solving the direct non-linear unconstrained optimization problem or by applying neural networks with sequential or twin-output architectures. Tuning of the loss-function hyperparameters focuses on making the isolated noise a stationary stochastic process without autocorrelation. Depending on the application, the hyperparameters can be tuned towards either discrete states via a stepped signal or a smoothed series. The decomposed dual signal can be represented in 2D space and used to learn inherent structures, to forecast both mean and dispersion, or to analyze cross effects in the case of multiple time series.
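The quadratic core of the described loss (fit term plus first- and second-derivative penalties) has a closed-form solution for the mean component. The sketch below uses that direct solve in place of the paper's iterative learning, and omits its SPC-based weighting of the penalty terms; `lam1`/`lam2` are the derivative-penalty hyperparameters:

```python
import numpy as np

def fit_mean_signal(y, lam1=1.0, lam2=5.0):
    """Fit the mean component s by minimizing
        ||y - s||^2 + lam1 * ||D1 s||^2 + lam2 * ||D2 s||^2,
    where D1, D2 are first- and second-order difference operators along
    time. This is the quadratic core of the described loss; the paper's
    SPC-based weighting of the penalties is omitted, and the closed-form
    solve stands in for its iterative learning."""
    n = len(y)
    D1 = np.diff(np.eye(n), axis=0)           # first-order differences
    D2 = np.diff(np.eye(n), n=2, axis=0)      # second-order differences
    A = np.eye(n) + lam1 * D1.T @ D1 + lam2 * D2.T @ D2
    return np.linalg.solve(A, np.asarray(y, float))
```

The dispersion component can then be estimated by applying the same penalized fit to a transform of the residuals y - s, and the leftover is the isolated noise whose stationarity guides hyperparameter tuning.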

Updated: 2025-08-08 00:30:41

Domains: cs.LG,62M10

Download: http://arxiv.org/abs/2508.05915v1

Do Ethical AI Principles Matter to Users? A Large-Scale Analysis of User Sentiment and Satisfaction

As AI systems become increasingly embedded in organizational workflows and consumer applications, ethical principles such as fairness, transparency, and robustness have been widely endorsed in policy and industry guidelines. However, there is still scarce empirical evidence on whether these principles are recognized, valued, or impactful from the perspective of users. This study investigates the link between ethical AI and user satisfaction by analyzing over 100,000 user reviews of AI products from G2. Using transformer-based language models, we measure sentiment across seven ethical dimensions defined by the EU Ethics Guidelines for Trustworthy AI. Our findings show that all seven dimensions are positively associated with user satisfaction. Yet, this relationship varies systematically across user and product types. Technical users and reviewers of AI development platforms more frequently discuss system-level concerns (e.g., transparency, data governance), while non-technical users and reviewers of end-user applications emphasize human-centric dimensions (e.g., human agency, societal well-being). Moreover, the association between ethical AI and user satisfaction is significantly stronger for non-technical users and end-user applications across all dimensions. Our results highlight the importance of ethical AI design from users' perspectives and underscore the need to account for contextual differences across user roles and product types.

Updated: 2025-08-08 00:27:50

Fields: cs.HC,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.05913v1

Hybrid Physics-Machine Learning Models for Quantitative Electron Diffraction Refinements

High-fidelity electron microscopy simulations required for quantitative crystal structure refinements face a fundamental challenge: while physical interactions are well-described theoretically, real-world experimental effects are challenging to model analytically. To address this gap, we present a novel hybrid physics-machine learning framework that integrates differentiable physical simulations with neural networks. By leveraging automatic differentiation throughout the simulation pipeline, our method enables gradient-based joint optimization of physical parameters and neural network components representing experimental variables, offering superior scalability compared to traditional second-order methods. We demonstrate this framework through application to three-dimensional electron diffraction (3D-ED) structure refinement, where our approach learns complex thickness distributions directly from diffraction data rather than relying on simplified geometric models. This method achieves state-of-the-art refinement performance across synthetic and experimental datasets, recovering atomic positions, thermal displacements, and thickness profiles with high fidelity. The proposed modular architecture can naturally be extended to accommodate additional physical phenomena and applied to other electron microscopy techniques. This establishes differentiable hybrid modeling as a powerful new paradigm for quantitative electron microscopy, where experimental complexities have historically limited analysis.
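The joint-optimization idea can be sketched on a toy attenuation model, with a central finite-difference gradient standing in for the automatic differentiation the actual framework threads through its pipeline (the physics model, parameterization, and learning rate are all illustrative):

```python
import numpy as np

def num_grad(f, p, eps=1e-6):
    # Central finite differences; a stand-in for automatic differentiation.
    g = np.zeros_like(p)
    for i in range(p.size):
        d = np.zeros_like(p)
        d[i] = eps
        g[i] = (f(p + d) - f(p - d)) / (2.0 * eps)
    return g

def model(p, x):
    # p = [t0 | hidden weights w | biases b | output weights v]
    t0, w, b, v = p[0], p[1:4], p[4:7], p[7:10]
    correction = np.tanh(np.outer(x, w) + b) @ v   # learned thickness variation
    t = np.clip(t0 + correction, 0.1, None)        # physical + neural components
    return np.exp(-x / t)                          # differentiable physics: attenuation

x = np.linspace(0.1, 2.0, 64)
y = np.exp(-x / (1.0 + 0.3 * x))   # synthetic data: thickness drifts across the sample

loss = lambda p: np.mean((model(p, x) - y) ** 2)
p = np.concatenate([[0.8], 0.1 * np.ones(9)])
loss0 = loss(p)
for _ in range(500):
    p = p - 0.2 * num_grad(loss, p)   # joint update of t0 and network weights
```

The key point mirrored here is that a single gradient step updates the physical parameter and the neural correction together, rather than alternating between separate refinement stages.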

Updated: 2025-08-08 00:13:12

Fields: physics.comp-ph,cs.LG

Download: http://arxiv.org/abs/2508.05908v1

Spatio-Temporal Partial Sensing Forecast for Long-term Traffic

Traffic forecasting uses recent measurements by sensors installed at chosen locations to forecast the future road traffic. Existing work either assumes all locations are equipped with sensors or focuses on short-term forecast. This paper studies partial sensing forecast of long-term traffic, assuming sensors are available only at some locations. The problem is challenging due to the unknown data distribution at unsensed locations, the intricate spatio-temporal correlations in long-term forecasting, and noise in traffic patterns. We propose a Spatio-temporal Long-term Partial sensing Forecast model (SLPF) for traffic prediction, with several novel contributions, including a rank-based embedding technique to reduce the impact of noise in data, a spatial transfer matrix to overcome the spatial distribution shift from sensed locations to unsensed locations, and a multi-step training process that utilizes all available data to successively refine the model parameters for better accuracy. Extensive experiments on several real-world traffic datasets demonstrate its superior performance. Our source code is at https://github.com/zbliu98/SLPF
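One ingredient, the spatial transfer from sensed to unsensed locations, can be illustrated on synthetic data: a linear map is fitted on a fully observed historical period and then applied when only sensed readings exist (the linear form and all names are illustrative; SLPF learns its transfer matrix inside the full model):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_sensed, n_unsensed = 500, 6, 3

# Historical period where every location was measured (e.g. a survey),
# used only to learn the sensed -> unsensed mapping.
sensed_hist = rng.normal(size=(T, n_sensed))
W_true = rng.normal(size=(n_sensed, n_unsensed))
unsensed_hist = sensed_hist @ W_true + 0.05 * rng.normal(size=(T, n_unsensed))

# Least-squares fit of the spatial transfer matrix.
W, *_ = np.linalg.lstsq(sensed_hist, unsensed_hist, rcond=None)

# At forecast time only sensed readings are available; transfer across space.
sensed_now = rng.normal(size=(1, n_sensed))
unsensed_pred = sensed_now @ W
```

With enough historical samples the fitted matrix recovers the underlying spatial relationship despite the observation noise, which is the property the model relies on to predict unsensed locations.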

Updated: 2025-08-08 00:02:23

Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2408.02689v2

The Fourth State: Signed-Zero Ternary for Stable LLM Quantization (and More)

Quantization is usually regarded as a means to trade quality of performance for reduced compute requirements, i.e., as a suboptimal approximation. However, if examined in terms of a fixed overall resource budget, a very different perspective arises. We introduce Signed-Zero Ternary (SZT), a 2-bit quantization that deterministically provides gradient information with no forward-path penalty. Our analysis provides evidence that it may improve information density compared to non-quantized alternatives.
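A sketch of the encoding idea (the threshold and exact bit assignment below are assumptions for illustration; the paper defines the actual scheme): two bits store a sign and a magnitude, yielding the four states {-1, -0, +0, +1}, where even a "zero" weight retains a sign the backward pass can read:

```python
import numpy as np

def szt_quantize(w, threshold=0.5):
    # Two bits per weight: a sign bit and a magnitude bit, giving the
    # four states {-1, -0, +0, +1}. The sign bit is retained even when
    # the forward value is zero, so zeroed weights still carry a
    # direction usable for gradient updates.
    sign = np.where(w >= 0, 1, -1).astype(np.int8)   # sign bit
    mag = (np.abs(w) >= threshold).astype(np.int8)   # magnitude bit
    value = sign * mag                               # forward value in {-1, 0, +1}
    return value, sign

w = np.array([-1.2, -0.3, 0.1, 0.9])
value, sign = szt_quantize(w)
```

The forward path sees only `value`, so the signed zeros cost nothing at inference; the extra state is spent entirely on stabilizing training.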

Updated: 2025-08-08 00:01:29

Fields: cs.LG

Download: http://arxiv.org/abs/2508.05905v1

By Xinhai (Sean) Zou.