Arxiv Day: Article

Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification

Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduce a novel diffusion model conditioned on class auxiliary data. Synthetic embeddings generated by the diffusion model are combined with seen class embeddings to train a classifier. Experiments are conducted on five environmental audio datasets, ESC-50, ARCA23K-FSD, FSC22, UrbanSound8k and TAU Urban Acoustics 2019, and one music classification dataset, GTZAN. Results show that the diffusion model outperforms all baseline methods on average across the six audio datasets. This work establishes the diffusion model as a promising approach for zero-shot learning and introduces the first benchmark of generative methods for zero-shot environmental sound classification, providing a foundation for future research.

Updated: 2025-07-01 23:59:37

Categories: cs.SD,cs.LG,eess.AS

Download: http://arxiv.org/abs/2412.03771v2

Jump-Start Reinforcement Learning with Self-Evolving Priors for Extreme Monopedal Locomotion

Reinforcement learning (RL) has shown great potential in enabling quadruped robots to perform agile locomotion. However, directly training policies to simultaneously handle dual extreme challenges, i.e., extreme underactuation and extreme terrains, as in monopedal hopping tasks, remains highly challenging due to unstable early-stage interactions and unreliable reward feedback. To address this, we propose JumpER (jump-start reinforcement learning via self-evolving priors), an RL training framework that structures policy learning into multiple stages of increasing complexity. By dynamically generating self-evolving priors through iterative bootstrapping of previously learned policies, JumpER progressively refines and enhances guidance, thereby stabilizing exploration and policy optimization without relying on external expert priors or handcrafted reward shaping. Specifically, when integrated with a structured three-stage curriculum that incrementally evolves action modality, observation space, and task objective, JumpER enables quadruped robots to achieve robust monopedal hopping on unpredictable terrains for the first time. Remarkably, the resulting policy effectively handles challenging scenarios that traditional methods struggle to conquer, including wide gaps up to 60 cm, irregularly spaced stairs, and stepping stones with distances varying from 15 cm to 35 cm. JumpER thus provides a principled and scalable approach for addressing locomotion tasks under the dual challenges of extreme underactuation and extreme terrains.

Updated: 2025-07-01 23:31:36

Categories: cs.RO,cs.LG

Download: http://arxiv.org/abs/2507.01243v1

Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW

Stochastic gradient descent (SGD) methods have long been central to training large language models (LLMs). However, their effectiveness is increasingly being questioned, particularly in large-scale applications where empirical evidence suggests potential performance limitations. In response, this paper proposes a stochastic conjugate subgradient method together with adaptive sampling tailored specifically for training LLMs. The method not only achieves faster convergence per iteration but also demonstrates improved scalability compared to traditional SGD techniques. It leverages sample complexity analysis to adaptively choose the sample size, employs a stochastic conjugate subgradient approach to determine search directions, and utilizes an AdamW-like algorithm to adaptively adjust step sizes. This approach preserves the key advantages of first-order methods while effectively addressing the nonconvexity and non-smoothness inherent in LLM training. Additionally, we provide a detailed analysis of the advantages of the algorithm. Experimental results show that the proposed method not only maintains, but in many cases surpasses, the scalability of traditional SGD techniques, significantly enhancing both the speed and accuracy of the optimization process.
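
The abstract says step sizes are adjusted by an "AdamW-like" rule but gives no update formula. As a point of reference only, here is a minimal sketch of one standard AdamW step for a single scalar parameter. The function name `adamw_step` and the default hyperparameters are illustrative assumptions; how the paper's variant differs (for example, in how the search direction is formed) is not specified in the abstract.

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One standard AdamW update for a scalar parameter (illustrative only)."""
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for the zero-initialised averages (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled weight decay: applied to theta directly, not mixed into g.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Toy usage: minimise f(x) = x^2 from x = 1.0; the gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adamw_step(x, 2.0 * x, m, v, t, lr=0.05)
print(x)
```

On the first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly `lr` regardless of the gradient scale.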

Updated: 2025-07-01 23:30:15

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.01241v1

GenBFA: An Evolutionary Optimization Approach to Bit-Flip Attacks on LLMs

Large Language Models (LLMs) have revolutionized natural language processing (NLP), excelling in tasks like text generation and summarization. However, their increasing adoption in mission-critical applications raises concerns about hardware-based threats, particularly bit-flip attacks (BFAs). BFAs, enabled by fault injection methods such as Rowhammer, target model parameters in memory, compromising both integrity and performance. Identifying critical parameters for BFAs in the vast parameter space of LLMs poses significant challenges. While prior research suggests transformer-based architectures are inherently more robust to BFAs compared to traditional deep neural networks, we challenge this assumption. For the first time, we demonstrate that as few as three bit-flips can cause catastrophic performance degradation in an LLM with billions of parameters. Current BFA techniques are inadequate for exploiting this vulnerability due to the difficulty of efficiently identifying critical parameters within the immense parameter space. To address this, we propose AttentionBreaker, a novel framework tailored for LLMs that enables efficient traversal of the parameter space to identify critical parameters. Additionally, we introduce GenBFA, an evolutionary optimization strategy designed to refine the search further, isolating the most critical bits for an efficient and effective attack. Empirical results reveal the profound vulnerability of LLMs to AttentionBreaker. For example, merely three bit-flips (4.129 x 10^-9% of total parameters) in the LLaMA3-8B-Instruct 8-bit quantized (W8) model result in a complete performance collapse: accuracy on MMLU tasks drops from 67.3% to 0%, and Wikitext perplexity skyrockets from 12.6 to 4.72 x 10^5. These findings underscore the effectiveness of AttentionBreaker in uncovering and exploiting critical vulnerabilities within LLM architectures.
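
To make concrete why a handful of bit-flips can matter in an 8-bit quantized model, the sketch below flips one bit of a signed 8-bit weight. The helper `flip_bit` is hypothetical, and two's-complement int8 storage is an assumption, since the abstract does not say how AttentionBreaker or GenBFA represent or select bits.

```python
def flip_bit(weight: int, bit: int) -> int:
    """Flip one bit of a signed 8-bit integer (two's complement) and return the result."""
    if not -128 <= weight <= 127:
        raise ValueError("weight must fit in int8")
    if not 0 <= bit <= 7:
        raise ValueError("bit index must be in 0..7")
    raw = weight & 0xFF          # view the int8 as an unsigned byte
    raw ^= 1 << bit              # flip the chosen bit
    return raw - 256 if raw >= 128 else raw  # reinterpret as signed

w = 3
low = flip_bit(w, 0)    # least significant bit: small change
high = flip_bit(w, 7)   # most significant (sign) bit: large change
print(w, low, high)
```

Flipping the least significant bit moves the weight by one quantization step, whereas flipping the most significant bit changes its sign and moves it by 128 steps.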

Updated: 2025-07-01 23:27:52

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2411.13757v4

Quantum Machine Learning in Transportation: A Case Study of Pedestrian Stress Modelling

Quantum computing has opened new opportunities to tackle complex machine learning tasks, for instance, high-dimensional data representations commonly required in intelligent transportation systems. We explore quantum machine learning to model complex skin conductance response (SCR) events that reflect pedestrian stress in a virtual reality road crossing experiment. For this purpose, a Quantum Support Vector Machine (QSVM) with an eight-qubit ZZ feature map and a Quantum Neural Network (QNN) using a Tree Tensor Network ansatz and an eight-qubit ZZ feature map were developed in PennyLane. The dataset consists of SCR measurements along with features such as the response amplitude and elapsed time, which have been categorized into amplitude-based classes. The QSVM achieved good training accuracy but overfit, with a low test accuracy of 45% that undermines the reliability of the classification model. The QNN reached a higher test accuracy of 55%, making it a better classification model than the QSVM and the classical versions.

Updated: 2025-07-01 23:18:50

Categories: cs.LG,quant-ph

Download: http://arxiv.org/abs/2507.01235v1

Learning Beyond Euclid: Curvature-Adaptive Generalization for Neural Networks on Manifolds

In this work, we develop new generalization bounds for neural networks trained on data supported on Riemannian manifolds. Existing generalization theories often rely on complexity measures derived from Euclidean geometry, which fail to account for the intrinsic structure of non-Euclidean spaces. Our analysis introduces a geometric refinement: we derive covering number bounds that explicitly incorporate manifold-specific properties such as sectional curvature, volume growth, and injectivity radius. These geometric corrections lead to sharper Rademacher complexity bounds for classes of Lipschitz neural networks defined on compact manifolds. The resulting generalization guarantees recover standard Euclidean results when curvature is zero but improve substantially in settings where the data lies on curved, low-dimensional manifolds embedded in high-dimensional ambient spaces. We illustrate the tightness of our bounds in negatively curved spaces, where the exponential volume growth leads to provably higher complexity, and in positively curved spaces, where the curvature acts as a regularizing factor. This framework provides a principled understanding of how intrinsic geometry affects learning capacity, offering both theoretical insight and practical implications for deep learning on structured data domains.
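
The contrast between negatively and positively curved spaces can be illustrated with the area of a geodesic disc of radius r on the three constant-curvature model surfaces. This sketch uses the standard closed forms for curvature -1, 0, and +1 in two dimensions; it is a textbook illustration rather than the paper's bound, and the function name `disc_area` is an assumption.

```python
import math

def disc_area(r: float, curvature: int) -> float:
    """Area of a geodesic disc of radius r on a surface of constant curvature -1, 0 or +1."""
    if curvature == -1:
        return 2.0 * math.pi * (math.cosh(r) - 1.0)   # hyperbolic plane
    if curvature == 0:
        return math.pi * r * r                         # Euclidean plane
    if curvature == 1:
        if r > math.pi:
            raise ValueError("radius exceeds the diameter of the unit sphere")
        return 2.0 * math.pi * (1.0 - math.cos(r))    # unit sphere
    raise ValueError("curvature must be -1, 0 or +1")

# Compare the three geometries at a few radii.
for r in (0.5, 1.0, 2.0, 3.0):
    print(r, [round(disc_area(r, k), 3) for k in (-1, 0, 1)])
```

At every radius the hyperbolic disc is the largest and the spherical disc the smallest, and the gap widens quickly with r because cosh grows exponentially. Since covering numbers at a fixed scale grow roughly with volume, this is the kind of geometric effect the abstract refers to.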

Updated: 2025-07-01 23:16:49

Categories: cs.LG,math.DG,stat.ML

Download: http://arxiv.org/abs/2507.02999v1

A Weakly Supervised Transformer to Support Rare Disease Diagnosis from Electronic Health Records: Methods and Applications in Rare Pulmonary Disease

Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions often remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. While computational phenotyping algorithms show promise for automating rare disease detection, their development is hindered by the scarcity of labeled data and biases in existing label sources. Gold-standard labels from registries and expert chart reviews are highly accurate but constrained by selection bias and the cost of manual review. In contrast, labels derived from electronic health records (EHRs) cover a broader range of patients but can introduce substantial noise. To address these challenges, we propose a weakly supervised, transformer-based framework that combines a small set of gold-standard labels with a large volume of iteratively updated silver-standard labels derived from EHR data. This hybrid approach enables the training of a highly accurate and generalizable phenotyping model that scales rare disease detection beyond the scope of individual clinical expertise. Our method is initialized by learning embeddings of medical concepts based on their semantic meaning or co-occurrence patterns in EHRs, which are then refined and aggregated into patient-level representations via a multi-layer transformer architecture. Using two rare pulmonary diseases as a case study, we validate our model on EHR data from Boston Children's Hospital. Our framework demonstrates notable improvements in phenotype classification, identification of clinically meaningful subphenotypes through patient clustering, and prediction of disease progression compared to baseline methods. These results highlight the potential of our approach to enable scalable identification and stratification of rare disease patients for clinical care and research applications.

Updated: 2025-07-01 23:11:20

Categories: cs.LG,cs.CL,stat.ML

Download: http://arxiv.org/abs/2507.02998v1

Rethinking the Illusion of Thinking

Earlier this year, Apple ignited controversy by publishing "The Illusion of Thinking," prompting heated debate within the AI community. Critics seized upon the findings as conclusive evidence that Large Reasoning Models (LRMs) lack genuine reasoning capabilities, branding them as mere stochastic parrots. Meanwhile, defenders, spearheaded by Lawsen et al. (2025), fired back, condemning the experimental setup as flawed and the conclusions overstated. We clarify this debate by replicating and refining two of the original study's most contentious benchmarks: Towers of Hanoi and River Crossing. By introducing incremental stepwise prompting and agentic collaborative dialogue, we show that previously reported failures in solving the Towers of Hanoi were not purely a result of output constraints, but also partly a result of cognition limitations: LRMs still stumble when complexity rises moderately (around 8 disks). Moreover, the River Crossing results initially heralded as catastrophic failures turn out to hinge upon testing unsolvable configurations. Once we limit tests strictly to solvable problems, LRMs effortlessly solve large instances involving over 100 agent pairs. Our findings ultimately defy simplistic narratives: today's LRMs are stochastic, RL-tuned searchers in a discrete state space we barely understand. Real progress in symbolic, long-horizon reasoning demands mapping that terrain through fine-grained ablations like those introduced here.
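
For scale, the optimal Towers of Hanoi solution for n disks takes 2^n - 1 moves, so the 8-disk instances mentioned above need 255 moves. The recursive solver below is the textbook algorithm, included only to make that growth concrete; it is not the prompting setup used in the paper, and the peg labels and the `is_valid` checker are illustrative assumptions.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the optimal list of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then restack.
    return (hanoi(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi(n - 1, spare, target, source))

def is_valid(n, moves):
    """Replay the moves and check that no larger disk is placed on a smaller one."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(disk)
    return pegs["C"] == list(range(n, 0, -1))

moves = hanoi(8)
print(len(moves))  # 2**8 - 1 = 255
```

Each extra disk doubles the length of the move sequence, which is why output length and step-by-step correctness both become harder to separate as n grows.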

Updated: 2025-07-01 23:10:02

Categories: cs.AI

Download: http://arxiv.org/abs/2507.01231v1

Capacity Planning and Scheduling for Jobs with Uncertainty in Resource Usage and Duration

Organizations around the world schedule jobs (programs) regularly to perform various tasks dictated by their end users. With the major movement towards using a cloud computing infrastructure, our organization follows a hybrid approach with both cloud and on-prem servers. The objective of this work is to perform capacity planning, i.e., estimate resource requirements, and job scheduling for on-prem grid computing environments. A key contribution of our approach is handling uncertainty in both resource usage and duration of the jobs, a critical aspect in the finance industry where stochastic market conditions significantly influence job characteristics. For capacity planning and scheduling, we simultaneously balance two conflicting objectives: (a) minimize resource usage, and (b) provide high quality-of-service to the end users by completing jobs by their requested deadlines. We propose approximate approaches using deterministic estimators and pair sampling-based constraint programming. Our best approach (pair sampling-based) achieves much lower peak resource usage compared to manual scheduling without compromising on the quality-of-service.
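
The first objective, minimizing resource usage, is reported in the abstract as peak resource usage. As a minimal sketch of how that quantity could be computed for a fixed schedule, the code below sweeps over job start and end events. The `(start, duration, usage)` tuple layout and the function name `peak_usage` are assumptions for illustration and do not reflect the paper's constraint programming model or its handling of uncertainty.

```python
def peak_usage(jobs):
    """Peak total resource usage for jobs given as (start, duration, usage) tuples."""
    events = []
    for start, duration, usage in jobs:
        events.append((start, usage))              # resources acquired at start
        events.append((start + duration, -usage))  # resources released at end
    # Sort by time; at equal times, releases (negative deltas) come first,
    # so a job ending at t does not overlap a job starting at t.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

# Three jobs that all start at time 0 versus the same jobs run back to back.
stacked = [(0, 4, 8), (0, 2, 6), (0, 3, 5)]
staggered = [(0, 4, 8), (4, 2, 6), (6, 3, 5)]
print(peak_usage(stacked), peak_usage(staggered))
```

In this toy case, staggering lowers the peak from 19 to 8 units but pushes the last completion from time 4 to time 9, which is the tension with deadlines that the abstract describes.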

Updated: 2025-07-01 22:56:08

Categories: cs.DC,cs.AI

Download: http://arxiv.org/abs/2507.01225v1

What to Do Next? Memorizing skills from Egocentric Instructional Video

Learning to perform activities through demonstration requires extracting meaningful information about the environment from observations. In this research, we investigate the challenge of planning high-level goal-oriented actions in a simulation setting from an egocentric perspective. We present a novel task, interactive action planning, and propose an approach that combines topological affordance memory with a transformer architecture. The process of memorizing the environment's structure through extracting affordances facilitates selecting appropriate actions based on the context. Moreover, the memory model allows us to detect action deviations while accomplishing specific objectives. To assess the method's versatility, we evaluate it in a realistic interactive simulation environment. Our experimental results demonstrate that the proposed approach learns meaningful representations, resulting in improved performance and robustness when action deviations occur.

Updated: 2025-07-01 22:53:41

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.02997v1

CAM-NET: An AI Model for Whole Atmosphere with Thermosphere and Ionosphere Extension

We present Compressible Atmospheric Model-Network (CAM-NET), an AI model designed to predict neutral atmospheric variables from the Earth's surface to the ionosphere with high accuracy and computational efficiency. Accurate modeling of the entire atmosphere is critical for understanding the upward propagation of gravity waves, which influence upper-atmospheric dynamics and coupling across atmospheric layers. CAM-NET leverages the Spherical Fourier Neural Operator (SFNO) to capture global-scale atmospheric dynamics while preserving the Earth's spherical structure. Trained on a decade of datasets from the Whole Atmosphere Community Climate Model with thermosphere and ionosphere eXtension (WACCM-X), CAM-NET demonstrates accuracy comparable to WACCM-X while achieving a speedup of over 1000x in inference time; once trained, it can provide a one-year simulation within a few minutes. The model effectively predicts key atmospheric parameters, including zonal and meridional winds, temperature, and time rate of pressure. Inspired by traditional modeling approaches that use external couplers to simulate tracer transport, CAM-NET introduces a modular architecture that explicitly separates tracer prediction from core dynamics. The core backbone of CAM-NET focuses on forecasting primary physical variables (e.g., temperature, wind velocity), while tracer variables are predicted through a lightweight, fine-tuned model. This design allows for efficient adaptation to specific tracer scenarios with minimal computational cost, avoiding the need to retrain the entire model. We have validated this approach on the $O^2$ tracer, demonstrating strong performance and generalization capabilities.

Updated: 2025-07-01 22:43:36

Categories: physics.space-ph,cs.LG

Download: http://arxiv.org/abs/2506.19340v3

2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos

When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class-labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands co-ordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios. Project-website: https://sites.google.com/view/2handedafforder

Updated: 2025-07-01 22:27:41

Categories: cs.CV,cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.09320v3

PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning

There is a huge gap between the numerous intriguing applications fostered by on-device large language model (LLM) fine-tuning (FT) from fresh mobile data and the limited resources of a mobile device. While existing server-assisted methods (e.g., split learning or side-tuning) may enable LLM FT on the local mobile device, they suffer from heavy communication burdens of activation transmissions, and may disclose data, labels or fine-tuned models to the server. To address those issues, we develop PAE MobiLLM, a privacy-aware and efficient LLM FT method which can be deployed on the mobile device via server-assisted additive side-tuning. To further accelerate FT convergence and improve computing efficiency, PAE MobiLLM integrates activation caching on the server side, which allows the server to reuse historical activations and saves the mobile device from repeatedly computing forward passes for the recurring data samples. In addition, to reduce communication cost, PAE MobiLLM develops a one-token (i.e., "pivot" token) activation shortcut that transmits only a single activation dimension instead of full activation matrices to guide the side network tuning. Finally, PAE MobiLLM introduces the additive adapter side-network design, which makes the server train the adapter modules based on device-defined prediction differences rather than raw ground-truth labels. In this way, the server can only assist device-defined side-network computing, and learns nothing about data, labels or fine-tuned models.

Updated: 2025-07-01 22:27:21

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2507.01216v1

FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images

The rapid advancement of diffusion models, particularly Stable Diffusion 3.5, has enabled the generation of highly photorealistic synthetic images that pose significant challenges to existing detection methods. This paper presents FreqCross, a novel multi-modal fusion network that combines spatial RGB features, frequency domain artifacts, and radial energy distribution patterns to achieve robust detection of AI-generated images. Our approach leverages a three-branch architecture: (1) a ResNet-18 backbone for spatial feature extraction, (2) a lightweight CNN for processing 2D FFT magnitude spectra, and (3) a multi-layer perceptron for analyzing radial energy profiles. We introduce a novel radial energy distribution analysis that captures characteristic frequency artifacts inherent in diffusion-generated images, and fuse it with spatial and spectral cues via simple feature concatenation followed by a compact classification head. Extensive experiments on a dataset of 10,000 paired real (MS-COCO) and synthetic (Stable Diffusion 3.5) images demonstrate that FreqCross achieves 97.8% accuracy, outperforming state-of-the-art baselines by 5.2%. The frequency analysis further reveals that synthetic images exhibit distinct spectral signatures in the 0.1-0.4 normalised frequency range, providing a theoretical foundation for our approach. Code and pre-trained models are publicly available to facilitate reproducible research.

Updated: 2025-07-01 22:12:35

Categories: cs.CV,cs.CR

Download: http://arxiv.org/abs/2507.02995v1

A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions

The remarkable performance of large language models (LLMs) in content generation, coding, and common-sense reasoning has spurred widespread integration into many facets of society. However, integration of LLMs raises valid questions on their reliability and trustworthiness, given their propensity to generate hallucinations: plausible, factually-incorrect responses, which are expressed with striking confidence. Previous work has shown that hallucinations and other non-factual responses generated by LLMs can be detected by examining the uncertainty of the LLM in its response to the pertinent prompt, driving significant research efforts devoted to quantifying the uncertainty of LLMs. This survey seeks to provide an extensive review of existing uncertainty quantification methods for LLMs, identifying their salient features, along with their strengths and weaknesses. We present existing methods within a relevant taxonomy, unifying ostensibly disparate methods to aid understanding of the state of the art. Furthermore, we highlight applications of uncertainty quantification methods for LLMs, spanning chatbot and textual applications to embodied artificial intelligence applications in robotics. We conclude with open research challenges in uncertainty quantification of LLMs, seeking to motivate future research.

Updated: 2025-07-01 22:08:39

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2412.05563v2

Deep Learning-Based Intrusion Detection for Automotive Ethernet: Evaluating & Optimizing Fast Inference Techniques for Deployment on Low-Cost Platform

Modern vehicles are increasingly connected, and in this context, automotive Ethernet is one of the technologies that promise to provide the necessary infrastructure for intra-vehicle communication. However, these systems are subject to attacks that can compromise safety, including flow injection attacks. Deep Learning-based Intrusion Detection Systems (IDS) are often designed to combat this problem, but they require expensive hardware to run in real time. In this work, we propose to evaluate and apply fast neural network inference techniques such as distillation and pruning for deploying IDS models on low-cost platforms in real time. The results show that these techniques can achieve intrusion detection times of up to 727 µs using a Raspberry Pi 4, with AUCROC values of 0.9890.

Updated: 2025-07-01 22:05:02

Categories: cs.LG,cs.CR,C.2.0; I.2.0

Download: http://arxiv.org/abs/2507.01208v1

DGenNO: A Novel Physics-aware Neural Operator for Solving Forward and Inverse PDE Problems based on Deep, Generative Probabilistic Modeling

Solving parametric partial differential equations (PDEs) and associated PDE-based inverse problems is a central task in engineering and physics, yet existing neural operator methods struggle with high-dimensional, discontinuous inputs and require large amounts of labeled training data. We propose the Deep Generative Neural Operator (DGenNO), a physics-aware framework that addresses these challenges by leveraging a deep, generative, probabilistic model in combination with a set of lower-dimensional latent variables that simultaneously encode PDE inputs and PDE outputs. This formulation can make use of unlabeled data and significantly improves inverse problem-solving, particularly for discontinuous or discrete-valued input functions. DGenNO enforces physics constraints without labeled data by incorporating weak-form residuals, based on compactly supported radial basis functions (CSRBFs), as virtual observables. These relax regularity constraints and eliminate higher-order derivatives from the objective function. We also introduce MultiONet, a novel neural operator architecture that is a more expressive generalization of the popular DeepONet and significantly enhances the approximating power of the proposed model. These innovations make DGenNO particularly effective for challenging forward and inverse PDE-based problems, such as those involving multi-phase media. Numerical experiments demonstrate that DGenNO achieves higher accuracy across multiple benchmarks while exhibiting robustness to noise and strong generalization to out-of-distribution cases. Its adaptability and its ability to handle sparse, noisy data while providing probabilistic estimates make DGenNO a powerful tool for scientific and engineering applications.

Updated: 2025-07-01 22:02:00

Categories: cs.LG,math-ph,math.MP

Download: http://arxiv.org/abs/2502.06250v3

Discovery of Fatigue Strength Models via Feature Engineering and automated eXplainable Machine Learning applied to the welded Transverse Stiffener

This research introduces a unified approach combining Automated Machine Learning (AutoML) with Explainable Artificial Intelligence (XAI) to predict fatigue strength in welded transverse stiffener details. It integrates expert-driven feature engineering with algorithmic feature creation to enhance accuracy and explainability. Based on an extensive fatigue test database, regression models (gradient boosting, random forests, and neural networks) were trained using AutoML under three feature schemes: domain-informed, algorithmic, and combined. This allowed a systematic comparison of expert-based versus automated feature selection. Ensemble methods (e.g. CatBoost, LightGBM) delivered top performance. The domain-informed model $\mathcal M_2$ achieved the best balance: test RMSE $\approx$ 30.6 MPa and $R^2 \approx 0.780$ over the full $\Delta \sigma_{c,50\%}$ range, and RMSE $\approx$ 13.4 MPa and $R^2 \approx 0.527$ within the engineering-relevant 0 - 150 MPa domain. The denser-feature model ($\mathcal M_3$) showed minor gains during training but poorer generalization, while the simpler base-feature model ($\mathcal M_1$) performed comparably, confirming the robustness of minimalist designs. XAI methods (SHAP and feature importance) identified stress ratio $R$, stress range $\Delta \sigma_i$, yield strength $R_{eH}$, and post-weld treatment (TIG dressing vs. as-welded) as dominant predictors. Secondary geometric factors (plate width, throat thickness, stiffener height) also significantly affected fatigue life. This framework demonstrates that integrating AutoML with XAI yields accurate, interpretable, and robust fatigue strength models for welded steel structures. It bridges data-driven modeling with engineering validation, enabling AI-assisted design and assessment. Future work will explore probabilistic fatigue life modeling and integration into digital twin environments.

Updated: 2025-07-01 21:57:12

Categories: cs.CE,cs.AI

Download: http://arxiv.org/abs/2507.02005v1

Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.

Updated: 2025-07-01 21:56:44

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.14988v2

MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization

Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce spatial relationships of these regions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic Rewards, which combine spatial accuracy reward and semantic consistency reward to provide nuanced feedback for both spatially positive and negative completions. Additionally, we propose to use the Chain-of-Box template, which integrates visual information of referring bounding boxes into the <think> reasoning process, enabling the model to explicitly reason about spatial regions during intermediate steps. Experiments on three datasets MS-CXR, ChestX-ray8, and M3D-RefSeg demonstrate that our method achieves state-of-the-art performance in Medical Image Grounding. Ablation studies further validate the effectiveness of each component in our approach. Code, checkpoints, and datasets are available at https://github.com/bio-mlhui/MedGround-R1

Updated: 2025-07-01 21:51:42

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2507.02994v1

Search-Based Robot Motion Planning With Distance-Based Adaptive Motion Primitives

This work proposes a motion planning algorithm for robotic manipulators that combines sampling-based and search-based planning methods. The core contribution of the proposed approach is the use of burs of free configuration space (C-space) as adaptive motion primitives within the graph search algorithm. Because they expand adaptively in free C-space, burs enable more efficient exploration of the configuration space than fixed-sized motion primitives, significantly reducing the time to find a valid path and the number of required expansions. The algorithm is implemented within the existing Search-Based Motion Planning Library (SMPL) and evaluated in a series of scenarios involving manipulators with varying numbers of degrees of freedom (DoF) and varying environment complexity. Results demonstrate that the bur-based approach outperforms fixed-primitive planning in complex scenarios, particularly for high-DoF manipulators, while achieving comparable performance in simpler scenarios.

Updated: 2025-07-01 21:33:33

Categories: cs.RO,cs.AI,cs.CG

Download: http://arxiv.org/abs/2507.01198v1

Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know?

Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans' innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
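The abstract's question of whether models are "well-calibrated" is commonly made concrete with the expected calibration error (ECE). The following is a minimal sketch of that standard metric, not the paper's evaluation code; the function name, the ten equal-width bins, and the toy inputs are assumptions chosen for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted average gap between mean confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in [0, n_bins - 1];
    # a confidence of exactly 1.0 falls into the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Hypothetical overconfident model: states 95% confidence, is right 1 time in 4.
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 0, 0])
# Hypothetical calibrated model: states 75% confidence, is right 3 times in 4.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```

A larger value indicates a wider gap between stated confidence and observed accuracy, which is the pattern the abstract describes for incorrect responses.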

Updated: 2025-07-01 21:33:32

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2506.18183v2

Are Large Brainwave Foundation Models Capable Yet? Insights from Fine-tuning

Foundation Models have demonstrated significant success across various domains in Artificial Intelligence (AI), yet their capabilities for brainwave modeling remain unclear. In this paper, we comprehensively evaluate current Large Brainwave Foundation Models (LBMs) through systematic fine-tuning experiments across multiple Brain-Computer Interface (BCI) benchmark tasks, including memory tasks and sleep stage classification. Our extensive analysis shows that state-of-the-art LBMs achieve only marginal improvements (0.9%-1.2%) over traditional deep architectures while requiring significantly more parameters (millions vs thousands), raising important questions about their efficiency and applicability in BCI contexts. Moreover, through detailed ablation studies and Low-Rank Adaptation (LoRA), we significantly reduce trainable parameters without performance degradation, while demonstrating that architectural and training inefficiencies limit LBMs' current capabilities. Our experiments span both full model fine-tuning and parameter-efficient adaptation techniques, providing insights into optimal training strategies for BCI applications. We pioneer the application of LoRA to LBMs, revealing that performance benefits generally emerge when adapting multiple neural network components simultaneously. These findings highlight the critical need for domain-specific development strategies to advance LBMs, suggesting that current architectures may require redesign to fully leverage the potential of foundation models in brainwave analysis.
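To illustrate why Low-Rank Adaptation reduces trainable parameters, here is a minimal sketch of a generic LoRA-style linear layer in NumPy. It is not the authors' implementation: the layer sizes, the rank, the scaling `alpha / r`, and the zero initialisation of `B` follow common LoRA practice and are assumptions here.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Compute x @ (W + (alpha / r) * B @ A).T with W frozen and A, B trainable."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def lora_trainable_params(d_out, d_in, r):
    """Number of parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4            # hypothetical layer sizes and rank
W = rng.normal(size=(d_out, d_in))    # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))              # zero init: the adapted layer starts equal to W
x = rng.normal(size=(5, d_in))

y = lora_forward(x, W, A, B)
full_params = d_out * d_in                          # 2048 for full fine-tuning
lora_params = lora_trainable_params(d_out, d_in, r)  # 384 for the adapter
```

In this toy setting the adapter trains 384 parameters instead of 2048, and the saving grows with layer width for a fixed rank.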

Updated: 2025-07-01 21:21:42

Categories: cs.LG,cs.AI,cs.HC

Download: http://arxiv.org/abs/2507.01196v1

Distributional Information Embedding: A Framework for Multi-bit Watermarking

This paper introduces a novel problem, distributional information embedding, motivated by the practical demands of multi-bit watermarking for large language models (LLMs). Unlike traditional information embedding, which embeds information into a pre-existing host signal, LLM watermarking actively controls the text generation process--adjusting the token distribution--to embed a detectable signal. We develop an information-theoretic framework to analyze this distributional information embedding problem, characterizing the fundamental trade-offs among three critical performance metrics: text quality, detectability, and information rate. In the asymptotic regime, we demonstrate that the maximum achievable rate with vanishing error corresponds to the entropy of the LLM's output distribution and increases with higher allowable distortion. We also characterize the optimal watermarking scheme to achieve this rate. Extending the analysis to the finite-token case with non-i.i.d. tokens, we identify schemes that maximize detection probability while adhering to constraints on false alarm and distortion.

Updated: 2025-07-01 21:09:55

Categories: cs.CR,cs.IT,cs.LG,math.IT

Download: http://arxiv.org/abs/2501.16558v2

STELLA: Self-Evolving LLM Agent for Biomedical Research

The rapid growth of biomedical data, tools, and literature has created a fragmented research landscape that outpaces human expertise. While AI agents offer a solution, they typically rely on static, manually curated toolsets, limiting their ability to adapt and scale. Here, we introduce STELLA, a self-evolving AI agent designed to overcome these limitations. STELLA employs a multi-agent architecture that autonomously improves its own capabilities through two core mechanisms: an evolving Template Library for reasoning strategies and a dynamic Tool Ocean that expands as a Tool Creation Agent automatically discovers and integrates new bioinformatics tools. This allows STELLA to learn from experience. We demonstrate that STELLA achieves state-of-the-art accuracy on a suite of biomedical benchmarks, scoring approximately 26% on Humanity's Last Exam: Biomedicine, 54% on LAB-Bench: DBQA, and 63% on LAB-Bench: LitQA, outperforming leading models by up to 6 percentage points. More importantly, we show that its performance systematically improves with experience; for instance, its accuracy on the Humanity's Last Exam benchmark almost doubles with increased trials. STELLA represents a significant advance towards AI Agent systems that can learn and grow, dynamically scaling their expertise to accelerate the pace of biomedical discovery.

Updated: 2025-07-01 20:52:01

Categories: cs.AI,cs.CL,q-bio.BM

Download: http://arxiv.org/abs/2507.02004v1

Rewind-to-Delete: Certified Machine Unlearning for Nonconvex Functions

Machine unlearning algorithms aim to efficiently remove data from a model without retraining it from scratch, in order to remove corrupted or outdated data or respect a user's "right to be forgotten." Certified machine unlearning is a strong theoretical guarantee based on differential privacy that quantifies the extent to which an algorithm erases data from the model weights. In contrast to existing works in certified unlearning for convex or strongly convex loss functions, or nonconvex objectives with limiting assumptions, we propose the first first-order, black-box (i.e., applicable to models pretrained with vanilla gradient descent) algorithm for unlearning on general nonconvex loss functions. It unlearns by "rewinding" to an earlier step of the learning process and then performing gradient descent on the loss function of the retained data points. We prove $(\epsilon, \delta)$ certified unlearning and performance guarantees that establish the privacy-utility-complexity tradeoff of our algorithm, and we prove generalization guarantees for functions that satisfy the Polyak-Lojasiewicz inequality. Finally, we demonstrate the superior performance of our algorithm compared to existing methods, within a new experimental framework that more accurately reflects unlearning user data in practice.
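The rewind-then-retrain idea described above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the paper's algorithm: the least-squares loss is convex (the paper targets nonconvex losses), the function names are hypothetical, and the Gaussian perturbation with a hand-picked `sigma` stands in for noise that the paper would calibrate to obtain the $(\epsilon, \delta)$ guarantee.

```python
import numpy as np

def gradient_descent(w, X, y, steps, lr=0.1):
    """Plain full-batch gradient descent on mean squared error."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def rewind_and_unlearn(checkpoint, X_retain, y_retain, k_steps, sigma, rng, lr=0.1):
    """Restart from an earlier checkpoint, train on retained data only, then perturb."""
    w = gradient_descent(checkpoint.copy(), X_retain, y_retain, k_steps, lr)
    # Assumption: uncalibrated Gaussian noise; a certified method would derive sigma.
    return w + rng.normal(0.0, sigma, size=w.shape)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
total_steps, k_steps = 100, 20

# Original training: keep the iterate from step total_steps - k_steps.
checkpoint = gradient_descent(np.zeros(3), X, y, total_steps - k_steps)
w_full = gradient_descent(checkpoint, X, y, k_steps)

# Unlearn the first 5 rows by rewinding and continuing on the other 45.
w_unlearned = rewind_and_unlearn(checkpoint, X[5:], y[5:], k_steps, sigma=0.01, rng=rng)
```

The only state needed beyond the final model is the saved earlier iterate, which is consistent with the abstract's description of the method as black-box with respect to how the model was originally trained.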

Updated: 2025-07-01 20:11:03

Categories: cs.LG

Download: http://arxiv.org/abs/2409.09778v5

ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention maps. ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 15 other zero-shot interpretability methods on the ImageNet-Segmentation dataset. ConceptAttention works for popular image models and even seamlessly generalizes to video generation. Our work contributes the first evidence that the representations of multi-modal DiTs are highly transferable to vision tasks like segmentation.

Updated: 2025-07-01 20:00:27

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2502.04320v2

Squat: Quant Small Language Models on the Edge

A growing trend has emerged in designing high-quality Small Language Models (SLMs) with a few million parameters. This trend is driven by increasing concerns over cloud costs, privacy, and latency. Considering that full parameter training is feasible for SLMs on mobile devices, Quantization-Aware Training (QAT) is employed to improve efficiency by reducing computational overhead and memory footprint. However, previous QAT works adopt fine-grained quantization methods to compress models with billions of parameters on GPUs, which is incompatible with current commodity hardware, such as mobile and edge devices, that relies on Single Instruction Multiple Data (SIMD) instructions. Thus, the generalization of these methods to SLMs on mobile devices is limited. In this paper, we propose Squat, an effective QAT framework with deployable quantization for SLMs on mobile devices. Specifically, we propose entropy-guided and distribution-aligned distillation to mitigate the distortion of attention information from quantization. In addition, we employ sub-8-bit token adaptive quantization, assigning varying bit widths to different tokens based on their importance. Furthermore, we develop a SIMD-based Multi-Kernel Mixed-Precision (MKMP) multiplier to support sub-8-bit mixed-precision MAC on mobile devices. Our extensive experiments verify the substantial improvements of our method compared to other QAT methods across various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x compared with its FP16 counterparts. Code: https://github.com/shawnricecake/squant

Updated: 2025-07-01 19:43:45

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2402.10787v2

Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?

To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM's internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. To support the development of this universal form of LLM uncertainties, we publish our metric at https://github.com/apple/ml-selfreflect

Updated: 2025-07-01 19:41:58

Categories: cs.CL,cs.AI,cs.LG,stat.ML

Download: http://arxiv.org/abs/2505.20295v2

Divergent Creativity in Humans and Large Language Models

The recent surge of Large Language Models (LLMs) has led to claims that they are approaching a level of creativity akin to human capabilities. This idea has sparked a blend of excitement and apprehension. However, a critical piece that has been missing in this discourse is a systematic evaluation of LLMs' semantic diversity, particularly in comparison to human divergent thinking. To bridge this gap, we leverage recent advances in computational creativity to analyze semantic divergence in both state-of-the-art LLMs and a substantial dataset of 100,000 humans. We found evidence that LLMs can surpass average human performance on the Divergent Association Task, and approach human creative writing abilities, though they fall short of the typical performance of highly creative humans. Notably, even the top performing LLMs are still largely surpassed by highly creative individuals, underscoring a ceiling that current LLMs still fail to surpass. Our human-machine benchmarking framework addresses the polemic surrounding the imminent replacement of human creative labour by AI, disentangling the quality of the respective creative linguistic outputs using established objective measures. While prompting deeper exploration of the distinctive elements of human inventive thought compared to those of AI systems, we lay out a series of techniques to improve their outputs with respect to semantic diversity, such as prompt design and hyper-parameter tuning.

Updated: 2025-07-01 19:34:19

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2405.13012v2

Vehicle-group-based Crash Risk Prediction and Interpretation on Highways

Previous studies on predicting crash risks primarily associated the number or likelihood of crashes on a road segment with traffic parameters or geometric characteristics, usually neglecting the impact of vehicles' continuous movement and interactions with nearby vehicles. Recent technology advances, such as Connected and Automated Vehicles (CAVs) and Unmanned Aerial Vehicles (UAVs), make it possible to collect high-resolution trajectory data, which enables trajectory-based risk analysis. This study investigates a new vehicle group (VG) based risk analysis method and explores risk evolution mechanisms considering VG features. An impact-based vehicle grouping method is proposed to cluster vehicles into VGs by evaluating their responses to the erratic behaviors of nearby vehicles. The risk of a VG is aggregated based on the risk between each vehicle pair in the VG, measured by inverse Time-to-Collision (iTTC). A Logistic Regression and a Graph Neural Network (GNN) are then employed to predict VG risks using aggregated and disaggregated VG information. Both methods achieve excellent performance with AUC values exceeding 0.93. For the GNN model, GNNExplainer with feature perturbation is applied to identify critical individual vehicle features and their directional impact on VG risks. Overall, this research contributes a new perspective for identifying, predicting, and interpreting traffic risks.
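The pairwise risk measure named above, inverse Time-to-Collision, can be sketched for the simple case of a follower and a leader in the same lane. This is an illustrative sketch only: the one-dimensional car-following setup, the function names, and the choice of `max` as the group aggregation are assumptions, since the abstract does not state how pair risks are combined.

```python
def inverse_ttc(gap_m, follower_speed, leader_speed):
    """iTTC in 1/s: closing speed divided by gap; 0 when the gap is not shrinking."""
    if gap_m <= 0:
        raise ValueError("gap must be positive")
    closing_speed = follower_speed - leader_speed
    return max(closing_speed, 0.0) / gap_m

def group_risk(pairs, aggregate=max):
    """Aggregate pairwise iTTC over (gap_m, follower_speed, leader_speed) tuples."""
    return aggregate(inverse_ttc(*pair) for pair in pairs)

# Hypothetical vehicle group with two follower-leader pairs (metres, m/s).
pairs = [(20.0, 30.0, 25.0),   # closing at 5 m/s over a 20 m gap
         (10.0, 20.0, 20.0)]   # equal speeds, no closing
risk = group_risk(pairs)
```

Using the inverse rather than TTC itself keeps the measure finite and equal to zero when vehicles are not approaching each other, which makes it straightforward to aggregate across pairs.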

Updated: 2025-07-01 19:32:16

Categories: cs.LG,cs.CY

Download: http://arxiv.org/abs/2402.12415v3

FlashDP: Private Training Large Language Models with Efficient DP-SGD

As large language models (LLMs) increasingly underpin technological advancements, the privacy of their training data emerges as a critical concern. Differential Privacy (DP) serves as a rigorous mechanism to protect this data, yet its integration via Differentially Private Stochastic Gradient Descent (DP-SGD) introduces substantial challenges, primarily due to the complexities of per-sample gradient clipping. Current explicit methods, such as Opacus, necessitate extensive storage for per-sample gradients, significantly inflating memory requirements. Conversely, implicit methods like GhostClip reduce storage needs by recalculating gradients multiple times, which leads to inefficiencies due to redundant computations. This paper introduces FlashDP, an innovative cache-friendly per-layer DP-SGD that consolidates necessary operations into a single task, calculating gradients only once in a fused manner. This approach not only diminishes memory movement by up to 50% but also cuts down redundant computations by 20%, compared to previous methods. Consequently, FlashDP does not increase memory demands and achieves a 90% throughput compared to the Non-DP method on a four-A100 system during the pre-training of the Llama-13B model, while maintaining parity with standard per-layer clipped DP-SGD in terms of accuracy. These advancements establish FlashDP as a pivotal development for efficient and privacy-preserving training of LLMs. FlashDP's code has been open-sourced at https://github.com/kaustpradalab/flashdp.
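For context on why per-sample gradient clipping is costly, the following minimal sketch shows one textbook DP-SGD aggregation step in the explicit style, where every per-sample gradient is materialised. It is not FlashDP's fused per-layer kernel; the function name, the flat `(batch, dim)` gradient layout, and the toy values are assumptions for illustration.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each sample's gradient to clip_norm, sum, add Gaussian noise, average."""
    g = np.asarray(per_sample_grads, dtype=float)         # shape (batch, dim)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = g * scale                                    # each row now has norm <= clip_norm
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape[1])
    return (clipped.sum(axis=0) + noise) / g.shape[0]

rng = np.random.default_rng(0)
grads = [[3.0, 4.0],    # norm 5.0, scaled down to norm 1.0
         [0.3, 0.4]]    # norm 0.5, left unchanged
noiseless = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
private = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

The `(batch, dim)` array is the storage cost the abstract attributes to explicit methods: for a large model, `dim` is the full parameter count and the array is held once per sample in the batch.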

Updated: 2025-07-01 19:28:37

Categories: cs.LG, cs.CR

Download: http://arxiv.org/abs/2507.01154v1
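
To make the per-sample gradient clipping mentioned in the FlashDP abstract concrete, here is a minimal NumPy sketch of one textbook DP-SGD step for least-squares regression. It is the plain "explicit" variant that materializes every per-sample gradient, which is the memory cost the abstract attributes to explicit methods. It is not FlashDP's fused per-layer kernel, and the function name and parameters are assumptions for illustration.

```python
import numpy as np


def dp_sgd_step(w, X, y, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD step for least squares with explicit per-sample gradients."""
    residual = X @ w - y                        # shape (n,)
    # Gradient of 0.5 * (x . w - y)^2 for each sample: shape (n, d).
    per_sample_grads = residual[:, None] * X
    norms = np.linalg.norm(per_sample_grads, axis=1)
    # Scale each sample's gradient so its L2 norm is at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale[:, None]
    # Gaussian noise calibrated to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(X)
    return w - lr * noisy_mean, clipped


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)
w_new, clipped = dp_sgd_step(w, X, y, clip_norm=1.0,
                             noise_multiplier=1.0, lr=0.1, rng=rng)
print(np.linalg.norm(clipped, axis=1).max())  # never exceeds clip_norm
```

The `(n, d)` array of per-sample gradients is what grows with batch size and model size in the explicit approach.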

BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining

Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset drawn from over 10,000 scientific articles, textbooks, and medical websites. We also introduce BioParsQA, which consists of 5,231 Persian medical questions and answers, to evaluate the proposed model. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs on three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to improve LLM capabilities in bioinformatics tasks. To our knowledge, BioPars is the first application of LLMs to Persian medical QA, especially for generating long answers. Evaluation on four selected medical QA datasets shows that BioPars achieves remarkable results compared to the other approaches. On BioParsQA, the model achieved a ROUGE-L score of 29.99, which is an improvement over GPT-4 1.0. The model achieved a BERTScore of 90.87 with the MMR method. Its MoverScore (60.43) and BLEURT (50.78) values were also higher than those of the other three models. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: https://github.com/amirap80/BioPars.

Updated: 2025-07-01 19:14:32

Categories: cs.CL, cs.AI, cs.LG

Download: http://arxiv.org/abs/2506.21567v2

LZ Penalty: An information-theoretic repetition penalty for autoregressive language models

We introduce the LZ penalty, a penalty specialized for reducing degenerate repetitions in autoregressive language models without loss of capability. The penalty is based on the codelengths in the LZ77 universal lossless compression algorithm. Through the lens of the prediction-compression duality, decoding the LZ penalty has the interpretation of sampling from the residual distribution after removing the information that is highly compressible. We demonstrate the LZ penalty enables state-of-the-art open-source reasoning models to operate with greedy (temperature zero) decoding without loss of capability and without instances of degenerate repetition. Both the industry-standard frequency penalty and repetition penalty are ineffective, incurring degenerate repetition rates of up to 4%.

Updated: 2025-07-01 19:06:53

Categories: cs.LG, cs.AI, cs.IT, math.IT

Download: http://arxiv.org/abs/2504.20131v2
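
For context on the two baselines the LZ penalty abstract compares against, the sketch below shows the commonly used forms of the frequency penalty (subtract a multiple of each token's count) and the repetition penalty (shrink the logits of tokens already generated). It does not implement the LZ penalty itself, whose LZ77 codelength computation the abstract does not spell out. Function names and the toy logits are assumptions.

```python
from collections import Counter

import numpy as np


def frequency_penalty(logits, generated_ids, alpha):
    """Subtract alpha times each token's count so far from its logit."""
    out = logits.astype(float).copy()
    for token_id, count in Counter(generated_ids).items():
        out[token_id] -= alpha * count
    return out


def repetition_penalty(logits, generated_ids, theta):
    """Shrink logits of seen tokens: divide positives by theta, multiply negatives."""
    out = logits.astype(float).copy()
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= theta
        else:
            out[token_id] *= theta
    return out


logits = np.array([2.0, -1.0, 0.5, 3.0])
history = [0, 0, 1]  # token 0 generated twice, token 1 once
print(frequency_penalty(logits, history, alpha=0.5))   # [ 1.  -1.5  0.5  3. ]
print(repetition_penalty(logits, history, theta=2.0))  # [ 1.  -2.   0.5  3. ]
```

Both baselines look only at which tokens occurred and how often, not at the order in which they occurred.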

A Review on Sound Source Localization in Robotics: Focusing on Deep Learning Methods

Sound source localization (SSL) adds a spatial dimension to auditory perception, allowing a system to pinpoint the origin of speech, machinery noise, warning tones, or other acoustic events, capabilities that facilitate robot navigation, human-machine dialogue, and condition monitoring. While existing surveys provide valuable historical context, they typically address general audio applications and do not fully account for robotic constraints or the latest advancements in deep learning. This review addresses these gaps by offering a robotics-focused synthesis, emphasizing recent progress in deep learning methodologies. We start by reviewing classical methods such as Time Difference of Arrival (TDOA), beamforming, Steered-Response Power (SRP), and subspace analysis. Subsequently, we delve into modern machine learning (ML) and deep learning (DL) approaches, discussing traditional ML and neural networks (NNs), convolutional neural networks (CNNs), convolutional recurrent neural networks (CRNNs), and emerging attention-based architectures. We then explore data and training strategies, the two cornerstones of DL-based SSL. Studies are further categorized by robot type and application domain to help researchers identify relevant work for their specific contexts. Finally, we highlight current challenges in SSL in general, namely environmental robustness, sound source multiplicity, and implementation constraints specific to robotics, as well as data and learning strategies in DL-based SSL. We also sketch promising directions that offer an actionable roadmap toward robust, adaptable, efficient, and explainable DL-based SSL for next-generation robots.

Updated: 2025-07-01 19:00:50

Categories: cs.RO, cs.LG, cs.SD, eess.AS

Download: http://arxiv.org/abs/2507.01143v1
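
Among the classical methods the review lists, Time Difference of Arrival (TDOA) is the simplest to illustrate. The sketch below estimates the delay between two microphone signals from the peak of their cross-correlation and converts it to a far-field bearing. The sampling rate, microphone spacing, speed of sound, and function name are assumed values for illustration, and the signal is synthetic noise rather than recorded audio.

```python
import numpy as np


def estimate_delay_samples(sig_a, sig_b):
    """Return the lag (in samples) by which sig_b trails sig_a."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    # In "full" mode the lags run from -(len(sig_a) - 1) to len(sig_b) - 1.
    lags = np.arange(-(len(sig_a) - 1), len(sig_b))
    return int(lags[np.argmax(corr)])


rng = np.random.default_rng(0)
source = rng.normal(size=1000)
true_delay = 7
mic_a = source
mic_b = np.concatenate([np.zeros(true_delay), source[:-true_delay]])

delay = estimate_delay_samples(mic_a, mic_b)
print(delay)  # 7

# Far-field bearing from the delay (all three constants are assumptions).
fs = 16000.0           # sampling rate in Hz
spacing = 0.2          # microphone spacing in metres
speed_of_sound = 343.0
tau = delay / fs
angle_deg = np.degrees(np.arcsin(np.clip(speed_of_sound * tau / spacing, -1.0, 1.0)))
```

Practical systems usually add a weighting such as GCC-PHAT to make the correlation peak robust to reverberation; that step is omitted here.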

Spectral Manifold Harmonization for Graph Imbalanced Regression

Graph-structured data is ubiquitous in scientific domains, where models often face imbalanced learning settings. In imbalanced regression, domain preferences focus on specific target value ranges representing the most scientifically valuable cases; we observe a significant lack of research. In this paper, we present Spectral Manifold Harmonization (SMH), a novel approach for addressing this imbalanced regression challenge on graph-structured data by generating synthetic graph samples that preserve topological properties while focusing on often underrepresented target distribution regions. Conventional methods fail in this context because they either ignore graph topology in case generation or do not target specific domain ranges, resulting in models biased toward average target values. Experimental results demonstrate the potential of SMH on chemistry and drug discovery benchmark datasets, showing consistent improvements in predictive performance for target domain ranges.

Updated: 2025-07-01 18:48:43

Categories: cs.LG, q-bio.MN

Download: http://arxiv.org/abs/2507.01132v1

Local Frames: Exploiting Inherited Origins to Bypass Content Blockers

We present a study of how local frames (i.e., iframes loading content like "about:blank") are mishandled by a wide range of popular Web security and privacy tools. As a result, users of these tools remain vulnerable to the very attack techniques against which they seek to protect themselves, including browser fingerprinting, cookie-based tracking, and data exfiltration. The tools we study are vulnerable in different ways, but all share a root cause: legacy Web functionality interacts with browser privacy boundaries in unexpected ways, leading to systemic vulnerabilities in tools developed, maintained, and recommended by privacy experts and activists. We consider four core capabilities supported by most privacy tools and develop tests to determine whether each can be evaded through the use of local frames. We apply our tests to six popular Web privacy and security tools -- identifying at least one vulnerability in each for a total of 19 -- and extract common patterns regarding their mishandling of local frames. Our measurement of popular websites finds that 56% employ local frames and that 73.7% of the requests made by these local frames should be blocked by popular filter lists but instead trigger the vulnerabilities we identify. From another perspective, 14.3% of all sites that we crawl make requests that should be blocked inside of local frames. We disclosed these vulnerabilities to the tool authors and discuss both our experiences working with them to patch their products and the implications of our findings for other privacy and security research.

Updated: 2025-07-01 18:48:25

Categories: cs.CR

Download: http://arxiv.org/abs/2506.00317v2

Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations

$\rm{SO}(3)$-equivariant networks are the dominant models for machine learning interatomic potentials (MLIPs). The key operation of such networks is the Clebsch-Gordan (CG) tensor product, which is computationally expensive. To accelerate the computation, we develop tensor decomposition networks (TDNs) as a class of approximately equivariant networks whose CG tensor products are replaced by low-rank tensor decompositions, such as the CANDECOMP/PARAFAC (CP) decomposition. With the CP decomposition, we prove (i) a uniform bound on the induced error of $\rm{SO}(3)$-equivariance, and (ii) the universality of approximating any equivariant bilinear map. To further reduce the number of parameters, we propose path-weight sharing that ties all multiplicity-space weights across the $O(L^3)$ CG paths into a single path without compromising equivariance, where $L$ is the maximum angular degree. The resulting layer acts as a plug-and-play replacement for tensor products in existing networks, and the computational complexity of tensor products is reduced from $O(L^6)$ to $O(L^4)$. We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots. We also use existing datasets, including OC20, and OC22. Results show that TDNs achieve competitive performance with dramatic speedup in computations.

Updated: 2025-07-01 18:46:27

Categories: cs.LG, physics.comp-ph

Download: http://arxiv.org/abs/2507.01131v1
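
The tensor decomposition abstract replaces an expensive bilinear operation with a low-rank CP decomposition. The sketch below shows the generic mechanism on an arbitrary bilinear map: when the third-order tensor has CP rank R, the map can be evaluated through its three factor matrices without forming the tensor. The dimensions and random factors are assumptions, and this is not the Clebsch-Gordan tensor product or the paper's TDN layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in1, d_in2, d_out, rank = 6, 5, 4, 3

# Hypothetical CP factors: T[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r].
A = rng.normal(size=(d_in1, rank))
B = rng.normal(size=(d_in2, rank))
C = rng.normal(size=(d_out, rank))
T = np.einsum("ir,jr,kr->ijk", A, B, C)


def bilinear_dense(x, y):
    """Contract the full tensor: on the order of d_in1 * d_in2 * d_out multiplies."""
    return np.einsum("i,j,ijk->k", x, y, T)


def bilinear_cp(x, y):
    """Same map through the factors: on the order of rank * (d_in1 + d_in2 + d_out)."""
    return ((x @ A) * (y @ B)) @ C.T


x = rng.normal(size=d_in1)
y = rng.normal(size=d_in2)
print(np.allclose(bilinear_dense(x, y), bilinear_cp(x, y)))  # True
```

Here the two functions agree exactly because `T` was built from the factors. For a general tensor, truncating to a low rank gives only an approximation, which is presumably why the abstract describes the resulting networks as approximately equivariant and bounds the induced error.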

Physics Augmented Machine Learning Discovery of Composition-Dependent Constitutive Laws for 3D Printed Digital Materials

Multi-material 3D printing, particularly through polymer jetting, enables the fabrication of digital materials by mixing distinct photopolymers at the micron scale within a single build to create a composite with tunable mechanical properties. This work presents an integrated experimental and computational investigation into the composition-dependent mechanical behavior of 3D printed digital materials. We experimentally characterize five formulations, combining soft and rigid UV-cured polymers under uniaxial tension and torsion across three strain and twist rates. The results reveal nonlinear and rate-dependent responses that strongly depend on composition. To model this behavior, we develop a physics-augmented neural network (PANN) that combines a partially input convex neural network (pICNN) for learning the composition-dependent hyperelastic strain energy function with a quasi-linear viscoelastic (QLV) formulation for time-dependent response. The pICNN ensures convexity with respect to strain invariants while allowing non-convex dependence on composition. To enhance interpretability, we apply $L_0$ sparsification. For the time-dependent response, we introduce a multilayer perceptron (MLP) to predict viscoelastic relaxation parameters from composition. The proposed model accurately captures the nonlinear, rate-dependent behavior of 3D printed digital materials in both uniaxial tension and torsion, achieving high predictive accuracy for interpolated material compositions. This approach provides a scalable framework for automated, composition-aware constitutive model discovery for multi-material 3D printing.

Updated: 2025-07-01 18:45:34

Categories: cs.LG, physics.comp-ph

Download: http://arxiv.org/abs/2507.02991v1

On Design Principles for Private Adaptive Optimizers

The spherical noise added to gradients in differentially private (DP) training undermines the performance of adaptive optimizers like AdaGrad and Adam, and hence many recent works have proposed algorithms to address this challenge. However, the empirical results in these works focus on simple tasks and models and the conclusions may not generalize to model training in practice. In this paper we survey several of these variants, and develop better theoretical intuition for them as well as perform empirical studies comparing them. We find that a common intuition of aiming for unbiased estimates of second moments of gradients in adaptive optimizers is misguided, and instead that a simple technique called scale-then-privatize (which does not achieve unbiased second moments) has more desirable theoretical behaviors and outperforms all other variants we study on a small-scale language model training task. We additionally argue that scale-then-privatize causes the noise addition to better match the application of correlated noise mechanisms which are more desirable to use in practice.

Updated: 2025-07-01 18:44:35

Categories: cs.LG, cs.CR

Download: http://arxiv.org/abs/2507.01129v1

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Despite recent advances in Vision-Language Models (VLMs), long-video understanding remains a challenging problem. Although state-of-the-art long-context VLMs can process around 1000 input frames, they still struggle to effectively leverage this sequence length, and succumb to irrelevant distractors within the context window. We present Temporal Chain of Thought, an inference strategy for video question-answering that curates the model's input context. We use the VLM itself to iteratively identify and extract the most relevant frames from the video, which are then used for answering. We demonstrate how leveraging more computation at inference-time to select the most relevant context leads to improvements in accuracy, in agreement with recent work on inference-time scaling of LLMs. Moreover, we achieve state-of-the-art results on 4 diverse video question-answering datasets, showing consistent improvements with 3 different VLMs. In particular, our method shines on longer videos which would not otherwise fit within the model's context window: On longer videos of more than 1 hour on LVBench, our approach using a context window of 32K outperforms the same VLM using standard inference with a 700K context window by 2.8 points.

Updated: 2025-07-01 18:39:26

Categories: cs.LG

Download: http://arxiv.org/abs/2507.02001v1

Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions

Landslides pose severe threats to infrastructure, economies, and human lives, necessitating accurate detection and predictive mapping across diverse geographic regions. With advancements in deep learning and remote sensing, automated landslide detection has become increasingly effective. This study presents a comprehensive approach integrating multi-source satellite imagery and deep learning models to enhance landslide identification and prediction. We leverage Sentinel-2 multispectral data and ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers to capture critical environmental features influencing landslide occurrences. Various geospatial analysis techniques are employed to assess the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy. Additionally, we evaluate the performance of multiple state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, and ResNet, to determine their effectiveness in landslide detection. The proposed framework contributes to the development of reliable early warning systems, improved disaster risk management, and sustainable land-use planning. Our findings provide valuable insights into the potential of deep learning and multi-source remote sensing in creating robust, scalable, and transferable landslide prediction models.

Updated: 2025-07-01 18:34:42

Categories: cs.CV, cs.LG, eess.IV

Download: http://arxiv.org/abs/2507.01123v1

Quasi-twisted codes: decoding and applications in code-based cryptography

Quasi-twisted (QT) codes generalize several important families of linear codes, including cyclic, constacyclic, and quasi-cyclic codes. Despite their potential, to the best of our knowledge, there exists no efficient decoding algorithm for QT codes. In this work, we propose a syndrome-based decoding method capable of efficiently correcting up to (d* - 1)/2 errors, where d* denotes an HT-like lower bound on the minimum distance of QT codes, which we formalize here. Additionally, we introduce a Niederreiter-like cryptosystem constructed from QT codes. This cryptosystem is resistant to some classical attacks as well as some quantum attacks based on Quantum Fourier Sampling.

Updated: 2025-07-01 18:26:27

Categories: cs.CR, cs.IT, math.IT

Download: http://arxiv.org/abs/2507.01118v1

Why Neural Network Can Discover Symbolic Structures with Gradient-based Training: An Algebraic and Geometric Foundation for Neurosymbolic Reasoning

We develop a theoretical framework that explains how discrete symbolic structures can emerge naturally from continuous neural network training dynamics. By lifting neural parameters to a measure space and modeling training as Wasserstein gradient flow, we show that under geometric constraints, such as group invariance, the parameter measure $\mu_t$ undergoes two concurrent phenomena: (1) a decoupling of the gradient flow into independent optimization trajectories over some potential functions, and (2) a progressive contraction on the degree of freedom. These potentials encode algebraic constraints relevant to the task and act as ring homomorphisms under a commutative semi-ring structure on the measure space. As training progresses, the network transitions from a high-dimensional exploration to compositional representations that comply with algebraic operations and exhibit a lower degree of freedom. We further establish data scaling laws for realizing symbolic tasks, linking representational capacity to the group invariance that facilitates symbolic solutions. This framework charts a principled foundation for understanding and designing neurosymbolic systems that integrate continuous learning with discrete algebraic reasoning.

Updated: 2025-07-01 18:25:12

Categories: cs.LG

Download: http://arxiv.org/abs/2506.21797v2

A Neural Operator based on Dynamic Mode Decomposition

The development of scientific computing methods in conjunction with artificial intelligence technologies remains a hot research topic. Finding a balance between lightweight and accurate computations is a solid foundation for this direction. This study presents a neural operator based on the dynamic mode decomposition (DMD) algorithm that maps between function spaces and combines DMD with deep learning (DL) for efficient modeling of spatiotemporal processes. Solving PDEs for various initial and boundary conditions requires significant computational resources. The proposed method automatically extracts key modes and system dynamics and uses them to construct predictions, reducing computational costs compared to traditional numerical methods. The approach demonstrates its efficiency in a comparative performance analysis against its closest analogues, DeepONet and FNO, on approximating solutions of the heat equation, Laplace's equation, and Burgers' equation, where it achieves high reconstruction accuracy.

Updated: 2025-07-01 18:23:28

Categories: cs.LG, 68T07, 35A99

Download: http://arxiv.org/abs/2507.01117v1
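
The DMD-based neural operator above builds on dynamic mode decomposition, which extracts modes and their growth rates from a sequence of snapshots. As background, here is a minimal sketch of standard exact DMD via a truncated SVD, checked on a small linear system whose eigenvalues are known. It covers only the classical algorithm, not the paper's neural operator, and the test system and function name are assumptions.

```python
import numpy as np


def dmd(snapshots, rank):
    """Exact DMD: eigenvalues and modes of the best-fit map X[:, k+1] ~ A @ X[:, k]."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # Reduced operator U^H Y V S^-1; dividing by s rescales the columns.
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s
    eigvals, W = np.linalg.eig(A_tilde)
    # Exact DMD modes: Y V S^-1 W.
    modes = Y @ Vh.conj().T / s @ W
    return eigvals, modes


# Synthetic linear system with known eigenvalues 0.9 and 0.5.
A_true = np.array([[0.9, 0.2], [0.0, 0.5]])
states = [np.array([1.0, 1.0])]
for _ in range(9):
    states.append(A_true @ states[-1])
snapshots = np.stack(states, axis=1)  # shape (2, 10)

eigvals, modes = dmd(snapshots, rank=2)
print(np.sort(eigvals.real))  # approximately [0.5 0.9]
```

Each column of `modes` is an eigenvector of the fitted linear map, and the matching entry of `eigvals` gives its per-step growth or decay factor.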

Geometry-aware 4D Video Generation for Robot Manipulation

Understanding and predicting the dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of videos by supervising the model with cross-view pointmap alignment during training. This geometric supervision enables the model to learn a shared 3D representation of the scene, allowing it to predict future video sequences from novel viewpoints based solely on the given RGB-D observations, without requiring camera poses as inputs. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, supporting robust robot manipulation and generalization to novel camera viewpoints.

Updated: 2025-07-01 18:01:41

Categories: cs.CV, cs.AI, cs.LG, cs.RO

Download: http://arxiv.org/abs/2507.01099v1

Proof of a perfect platonic representation hypothesis

In this note, we elaborate on and explain in detail the proof given by Ziyin et al. (2025) of the "perfect" Platonic Representation Hypothesis (PRH) for the embedded deep linear network model (EDLN). We show that if trained with SGD, two EDLNs with different widths and depths and trained on different data will become Perfectly Platonic, meaning that every possible pair of layers will learn the same representation up to a rotation. Because most of the global minima of the loss function are not Platonic, that SGD only finds the perfectly Platonic solution is rather extraordinary. The proof also suggests at least six ways the PRH can be broken. We also show that in the EDLN model, the emergence of the Platonic representations is due to the same reason as the emergence of progressive sharpening. This implies that these two seemingly unrelated phenomena in deep learning can, surprisingly, have a common cause. Overall, the theory and proof highlight the importance of understanding emergent "entropic forces" due to the irreversibility of SGD training and their role in representation learning. The goal of this note is to be instructive and avoid lengthy technical details.

Updated: 2025-07-01 18:01:32

Categories: cs.LG, cond-mat.dis-nn, q-bio.NC, stat.ML

Download: http://arxiv.org/abs/2507.01098v1

'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health, for (i) suicide and (ii) self-harm, that use multi-step, prompt-level jailbreaking to bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings, and the multilayered ethical tensions they present, for their implications for prompt-response filtering and context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.

Updated: 2025-07-01 18:00:04

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2507.02990v1

AI-guided digital intervention with physiological monitoring reduces intrusive memories after experimental trauma

Trauma prevalence is vast globally. Evidence-based digital treatments can help, but most require human guidance. Human guides provide tailored instructions and responsiveness to internal cognitive states, but limit scalability. Can generative AI and neurotechnology provide a scalable alternative? Here we test ANTIDOTE, combining AI guidance and pupillometry to automatically deliver and monitor an evidence-based digital treatment, specifically the Imagery Competing Task Intervention (ICTI), to reduce intrusive memories after psychological trauma. One hundred healthy volunteers were exposed to videos of traumatic events and randomly assigned to an intervention or active control condition. As predicted, intervention participants reported significantly fewer intrusive memories over the following week. Post-hoc assessment against clinical rubrics confirmed the AI guide delivered the intervention successfully. Additionally, pupil size tracked intervention engagement and predicted symptom reduction, providing a candidate biomarker of intervention effectiveness. These findings open a path toward rigorous AI-guided digital interventions that can scale to trauma prevalence.

Updated: 2025-07-01 17:59:01

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2507.01081v1

STONet: A neural operator for modeling solute transport in micro-cracked reservoirs

In this work, we introduce a novel neural operator, the Solute Transport Operator Network (STONet), to efficiently model contaminant transport in micro-cracked porous media. STONet's model architecture is specifically designed for this problem and uniquely integrates an enriched DeepONet structure with a transformer-based multi-head attention mechanism, enhancing performance without incurring additional computational overhead compared to existing neural operators. The model combines different networks to encode heterogeneous properties effectively and predict the rate of change of the concentration field to accurately model the transport process. The training data is obtained using finite element (FEM) simulations by random sampling of micro-fracture distributions and applied pressure boundary conditions, which capture diverse scenarios of fracture densities, orientations, apertures, lengths, and balance of pressure-driven to density-driven flow. Our numerical experiments demonstrate that, once trained, STONet achieves accurate predictions, with relative errors typically below 1% compared with FEM simulations while reducing runtime by approximately two orders of magnitude. This type of computational efficiency facilitates building digital twins for rapid assessment of subsurface contamination risks and optimization of environmental remediation strategies. The data and code for the paper will be published at https://github.com/ehsanhaghighat/STONet.

Updated: 2025-07-01 17:55:45

Categories: cs.LG,cs.CE,cs.NE,physics.flu-dyn

Download: http://arxiv.org/abs/2412.05576v2

Description of the Training Process of Neural Networks via Ergodic Theorem: Ghost nodes

Recent studies have proposed interpreting the training process from an ergodic perspective. Building on this foundation, we present a unified framework for understanding and accelerating the training of deep neural networks via stochastic gradient descent. By analyzing the geometric landscape of the objective function, we introduce a practical diagnostic, the running estimate of the largest Lyapunov exponent, which provably distinguishes genuine convergence toward stable minimizers from mere statistical stabilization near saddle points. We then propose a ghost category extension for standard classifiers that adds auxiliary ghost output nodes, so that the model gains extra descent directions that open a lateral corridor around narrow loss barriers and enable the optimizer to bypass poor basins during the early training phase. We show that this extension strictly reduces approximation error; that, after sufficient convergence, the ghost dimensions collapse and the extended model's invariant law coincides with that of the original; and that there exists a path in the enlarged parameter space along which the total loss does not increase while the original loss decreases by an arbitrary margin. Taken together, these results provide a principled architecture-level intervention that accelerates early-stage trainability while preserving asymptotic behavior.

Updated: 2025-07-01 17:54:35

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.01003v1

SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.

Updated: 2025-07-01 17:51:59

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.01001v1
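
The SciArena abstract above says that SciArena-Eval scores a model by comparing its pairwise assessments with human votes. A minimal sketch of that kind of agreement score follows. The function name `pairwise_agreement` and the "A"/"B" label encoding are assumptions made for illustration and are not taken from the paper or its released benchmark.

```python
def pairwise_agreement(model_prefs, human_votes):
    """Fraction of comparisons where the model's preferred answer matches the human vote.

    Both arguments are equal-length sequences of labels such as "A" or "B"
    (a hypothetical encoding; the real benchmark format may differ).
    """
    if len(model_prefs) != len(human_votes):
        raise ValueError("model_prefs and human_votes must have the same length")
    if not human_votes:
        raise ValueError("at least one comparison is required")
    matches = sum(m == h for m, h in zip(model_prefs, human_votes))
    return matches / len(human_votes)
```

For instance, a judge model that agrees with the human vote on three of four comparisons would score 0.75 under this definition.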

Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion

Effective modeling of multifaceted relations is pivotal for Knowledge Graph Completion (KGC). However, a majority of existing approaches are predicated on static, embedding-based scoring, exhibiting inherent limitations in capturing contextual dependencies and relational dynamics. Addressing this gap, we propose the Flow-Modulated Scoring (FMS) framework. FMS comprises two principal components: (1) a semantic context learning module that encodes context-sensitive entity representations, and (2) a conditional flow-matching module designed to learn the dynamic transformation from a head to a tail embedding, governed by the aforementioned context. The resultant predictive vector field, representing the context-informed relational path, serves to dynamically refine the initial static score of an entity pair. Through this synergy of context-aware static representations and conditioned dynamic information, FMS facilitates a more profound modeling of relational semantics. Comprehensive evaluations on several standard benchmarks demonstrate that our proposed method surpasses prior state-of-the-art results.

Updated: 2025-07-01 17:51:20

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.23137v2

SPGD: Steepest Perturbed Gradient Descent Optimization

Optimization algorithms are pivotal in advancing various scientific and industrial fields but often encounter obstacles such as becoming trapped in local minima, saddle points, and plateaus (flat regions), which make convergence to reasonable or near-optimal solutions particularly challenging. This paper presents the Steepest Perturbed Gradient Descent (SPGD), a novel algorithm that innovatively combines the principles of the gradient descent method with periodic uniform perturbation sampling to effectively circumvent these impediments and lead to better solutions whenever possible. SPGD is distinctively designed to generate a set of candidate solutions and select the one exhibiting the steepest loss difference relative to the current solution. It enhances the traditional gradient descent approach by integrating a strategic exploration mechanism that significantly increases the likelihood of escaping sub-optimal local minima and navigating complex optimization landscapes effectively. Our approach not only retains the directed efficiency of gradient descent but also leverages the exploratory benefits of stochastic perturbations, thus enabling a more comprehensive search for global optima across diverse problem spaces. We demonstrate the efficacy of SPGD in solving the 3D component packing problem, an NP-hard challenge. Preliminary results show a substantial improvement over four established methods, particularly on response surfaces with complex topographies and in multidimensional non-convex continuous optimization problems. Comparative analyses with established 2D benchmark functions highlight SPGD's superior performance, showcasing its ability to navigate complex optimization landscapes. These results emphasize SPGD's potential as a versatile tool for a wide range of optimization problems.

Updated: 2025-07-01 17:49:12

Categories: math.OC,cs.AI,cs.CE,cs.LG,math-ph,math.MP

Download: http://arxiv.org/abs/2411.04946v2
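
The SPGD abstract above describes gradient descent combined with periodic uniform perturbation sampling, keeping the candidate with the steepest loss difference relative to the current solution. The sketch below is one possible reading of that description, not the authors' implementation. It assumes that "steepest loss difference" means the largest loss decrease, and every name and default (`spgd`, `period`, `n_candidates`, `radius`, the toy `double_well` loss) is hypothetical.

```python
import random


def spgd(loss, grad, x0, lr=0.05, steps=200, period=10,
         n_candidates=20, radius=2.5, seed=0):
    """Gradient descent with periodic uniform perturbation sampling (illustrative)."""
    rng = random.Random(seed)
    x = list(x0)
    for t in range(1, steps + 1):
        # Ordinary gradient-descent step.
        x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
        if t % period == 0:
            # Every `period` steps, sample candidates uniformly in a box around x
            # and keep the one with the largest loss decrease, if any improves.
            best, best_loss = x, loss(x)
            for _ in range(n_candidates):
                cand = [xi + rng.uniform(-radius, radius) for xi in x]
                cand_loss = loss(cand)
                if cand_loss < best_loss:
                    best, best_loss = cand, cand_loss
            x = best
    return x


def double_well(x):
    # Toy 1-D loss: local minimum near x = 0.96, lower (global) minimum near x = -1.04.
    return (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]


def double_well_grad(x):
    # Analytic derivative of double_well.
    return [4.0 * x[0] * (x[0] ** 2 - 1.0) + 0.3]
```

On this toy loss, plain gradient descent started at x = 1 (call `spgd` with `n_candidates=0`) settles in the nearby local minimum, whereas the perturbed variant is very likely to jump into the lower basin. Because a candidate is accepted only when it lowers the loss, the perturbation step never makes the current solution worse.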

Defensive Adversarial CAPTCHA: A Semantics-Driven Framework for Natural Adversarial Example Generation

Traditional CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on the original image characteristics, resulting in distortions that hinder human interpretation and limit their applicability in scenarios where no initial input images are available. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (DAC), a novel framework that generates high-fidelity adversarial examples guided by attacker-specified semantic information. Leveraging a Large Language Model (LLM), DAC enhances CAPTCHA diversity and enriches the semantic information. To address various application scenarios, we examine the white-box targeted attack scenario and the black-box untargeted attack scenario. For targeted attacks, we introduce two latent noise variables that are alternately guided in the diffusion step to achieve robust inversion. The synergy between gradient guidance and latent variable optimization achieved in this way ensures that the generated adversarial examples not only accurately align with the target conditions but also achieve optimal performance in terms of distributional consistency and attack effectiveness. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-DAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show that the defensive adversarial CAPTCHA generated by BP-DAC is able to defend against most of the unknown models, and the generated CAPTCHA is indistinguishable to both humans and DNNs.

Updated: 2025-07-01 17:49:09

Categories: cs.CV,cs.CR

Download: http://arxiv.org/abs/2506.10685v3

Diffuse-CLoC: Guided Diffusion for Physics-based Character Look-ahead Control

We present Diffuse-CLoC, a guided diffusion framework for physics-based look-ahead control that enables intuitive, steerable, and physically realistic motion generation. While existing kinematic motion generation methods based on diffusion models offer intuitive steering capabilities through inference-time conditioning, they often fail to produce physically viable motions. In contrast, recent diffusion-based control policies have shown promise in generating physically realizable motion sequences, but the lack of kinematics prediction limits their steerability. Diffuse-CLoC addresses these challenges through a key insight: modeling the joint distribution of states and actions within a single diffusion model makes action generation steerable by conditioning it on the predicted states. This approach allows us to leverage established conditioning techniques from kinematic motion generation while producing physically realistic motions. As a result, we achieve planning capabilities without the need for a high-level planner. Our method handles a diverse set of unseen long-horizon downstream tasks through a single pre-trained model, including static and dynamic obstacle avoidance, motion in-betweening, and task-space control. Experimental results show that our method significantly outperforms the traditional hierarchical framework of high-level motion diffusion and low-level tracking.

Updated: 2025-07-01 17:41:45

Categories: cs.GR,cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.11801v2

Box Pose and Shape Estimation and Domain Adaptation for Large-Scale Warehouse Automation

Modern warehouse automation systems rely on fleets of intelligent robots that generate vast amounts of data -- most of which remains unannotated. This paper develops a self-supervised domain adaptation pipeline that leverages real-world, unlabeled data to improve perception models without requiring manual annotations. Our work focuses specifically on estimating the pose and shape of boxes and presents a correct-and-certify pipeline for self-supervised box pose and shape estimation. We extensively evaluate our approach across a range of simulated and real industrial settings, including adaptation to a large-scale real-world dataset of 50,000 images. The self-supervised model significantly outperforms models trained solely in simulation and shows substantial improvements over a zero-shot 3D bounding box estimation baseline.

Updated: 2025-07-01 17:36:09

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.00984v1

Enhancing LLM Agent Safety via Causal Influence Prompting

As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.

Updated: 2025-07-01 17:31:51

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.00979v1

Meta-Posterior Consistency for the Bayesian Inference of Metastable System

The vast majority of the literature on learning dynamical systems or stochastic processes from time series has focused on stable or ergodic systems, for both Bayesian and frequentist inference procedures. However, most real-world systems are only metastable, that is, the dynamics appear to be stable on some time scale, but are in fact unstable over longer time scales. Consistency of inference for metastable systems may not be possible, but one can ask about metaconsistency: Do inference procedures converge when observations are taken over a large but finite time interval, but diverge on longer time scales? In this paper we introduce, discuss, and quantify metaconsistency in a Bayesian framework. We discuss how metaconsistency can be exploited to efficiently infer a model for a sub-system of a larger system, where inference on the global behavior may require much more data, or there is no theoretical guarantee as to the asymptotic success of inference procedures. We also discuss the relation between metaconsistency and the spectral properties of the model dynamical system in the case of uniformly ergodic and non-ergodic diffusions.

Updated: 2025-07-01 17:22:45

Categories: stat.ML,cs.LG,math.PR,math.ST,stat.TH,62F15, 60J70

Download: http://arxiv.org/abs/2408.01868v2

Reasoning as an Adaptive Defense for Safety

Reasoning methods that adaptively allocate test-time compute have advanced LLM performance on easy-to-verify domains such as math and code. In this work, we study how to utilize this approach to train models that exhibit a degree of robustness to safety vulnerabilities, and show that doing so can provide benefits. We build a recipe called TARS (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. To build TARS, we identify three critical design choices: (1) a "lightweight" warmstart SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as too many refusals, and (3) a reward function to prevent degeneration of reasoning capabilities during training. Models trained with TARS exhibit adaptive behaviors by spending more compute on ambiguous queries, leading to better safety-refusal trade-offs. They also internally learn to better distinguish between safe and unsafe prompts and attain greater robustness to both white-box (e.g., GCG) and black-box attacks (e.g., PAIR). Overall, our work provides an effective, open recipe for training LLMs against jailbreaks and harmful requests by reasoning per prompt.

Updated: 2025-07-01 17:20:04

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.00971v1

Surgical Neural Radiance Fields from One Image

Purpose: Neural Radiance Fields (NeRF) offer exceptional capabilities for 3D reconstruction and view synthesis, yet their reliance on extensive multi-view data limits their application in surgical intraoperative settings where only limited data is available. In particular, collecting such extensive data intraoperatively is impractical due to time constraints. This work addresses this challenge by leveraging a single intraoperative image and preoperative data to train NeRF efficiently for surgical scenarios. Methods: We leverage preoperative MRI data to define the set of camera viewpoints and images needed for robust and unobstructed training. Intraoperatively, the appearance of the surgical image is transferred to the pre-constructed training set through neural style transfer, specifically combining WTC2 and STROTSS to prevent over-stylization. This process enables the creation of a dataset for instant and fast single-image NeRF training. Results: The method is evaluated with four clinical neurosurgical cases. Quantitative comparisons to NeRF models trained on real surgical microscope images demonstrate strong synthesis agreement, with similarity metrics indicating high reconstruction fidelity and stylistic alignment. When compared with ground truth, our method demonstrates high structural similarity, confirming good reconstruction quality and texture preservation. Conclusion: Our approach demonstrates the feasibility of single-image NeRF training in surgical settings, overcoming the limitations of traditional multi-view methods.

Updated: 2025-07-01 17:19:25

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.00969v1

MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

With the advent of new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform state-of-the-art models in single-channel speech enhancement, automatic speech recognition, and self-supervised audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this issue, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VoiceBank+Demand Extended (VB-DemandEx), a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, our proposed MambAttention model significantly outperforms existing state-of-the-art LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 and EARS-WHAM_v2, while matching their performance on the in-domain dataset VB-DemandEx. Ablation studies highlight the role of weight sharing between the time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. However, our MambAttention model remains superior on both out-of-domain datasets across all reported evaluation metrics.

Updated: 2025-07-01 17:16:05

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2507.00966v1

Scalable Feature Learning on Huge Knowledge Graphs for Downstream Machine Learning

Many machine learning tasks can benefit from external knowledge. Large knowledge graphs store such knowledge, and embedding methods can be used to distill it into ready-to-use vector representations for downstream applications. For this purpose, however, current models have two limitations: they are primarily optimized for link prediction, via local contrastive learning, and they struggle to scale to the largest graphs due to GPU memory limits. To address these, we introduce SEPAL: a Scalable Embedding Propagation ALgorithm for large knowledge graphs designed to produce high-quality embeddings for downstream tasks at scale. The key idea of SEPAL is to enforce global embedding alignment by optimizing embeddings only on a small core of entities, and then propagating them to the rest of the graph via message passing. We evaluate SEPAL on 7 large-scale knowledge graphs and 46 downstream machine learning tasks. Our results show that SEPAL significantly outperforms previous methods on downstream tasks. In addition, SEPAL scales up its base embedding model, enabling fitting huge knowledge graphs on commodity hardware.

Updated: 2025-07-01 17:15:35

Categories: cs.LG

Download: http://arxiv.org/abs/2507.00965v1

Benchmarking the Discovery Engine

The Discovery Engine is a general purpose automated system for scientific discovery, which combines machine learning with state-of-the-art ML interpretability to enable rapid and robust scientific insight across diverse datasets. In this paper, we benchmark the Discovery Engine against five recent peer-reviewed scientific publications applying machine learning across medicine, materials science, social science, and environmental science. In each case, the Discovery Engine matches or exceeds prior predictive performance while also generating deeper, more actionable insights through rich interpretability artefacts. These results demonstrate its potential as a new standard for automated, interpretable scientific modelling that enables complex knowledge discovery from data.

Updated: 2025-07-01 17:13:31

Categories: cs.LG,I.2.6; I.2.3; I.5.1; H.2.8; J.2; J.3; J.4

Download: http://arxiv.org/abs/2507.00964v1

Not All Water Consumption Is Equal: A Water Stress Weighted Metric for Sustainable Computing

Water consumption is an increasingly critical dimension of computing sustainability, especially as AI workloads rapidly scale. However, current water impact assessment often overlooks where and when water stress is more severe. To fill this gap, we present SCARF, the first general framework that evaluates the water impact of computing by factoring in both spatial and temporal variations in water stress. SCARF calculates an Adjusted Water Impact (AWI) metric that considers both consumption volume and local water stress over time. Through three case studies on LLM serving, datacenters, and semiconductor fabrication plants, we show the hidden opportunities for reducing water impact by optimizing location and time choices, paving the way for water-sustainable computing. The code is available at https://github.com/jojacola/SCARF.

Updated: 2025-07-01 17:12:12

Categories: cs.DC,cs.AR,cs.CY,cs.LG

Download: http://arxiv.org/abs/2506.22773v2
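
The abstract does not state the AWI formula, so the following is only a sketch of one plausible reading: a sum of per-period consumption weighted by a local water-stress factor for that period. The function name `adjusted_water_impact`, the units, and the stress values are hypothetical and are not taken from the SCARF code.

```python
def adjusted_water_impact(consumption_litres, stress_factor):
    """Stress-weighted water impact (assumed form, not the published AWI definition).

    consumption_litres[t] is the water consumed in period t;
    stress_factor[t] is a unitless local water-stress weight for that period.
    """
    if len(consumption_litres) != len(stress_factor):
        raise ValueError("one stress factor is needed per consumption period")
    return sum(v * s for v, s in zip(consumption_litres, stress_factor))


# Hypothetical example: the same total volume (200 L) under two schedules.
stress = [1.0, 3.0]                                   # period 2 is more water-stressed
even_schedule = adjusted_water_impact([100.0, 100.0], stress)
shifted_schedule = adjusted_water_impact([150.0, 50.0], stress)
print(even_schedule, shifted_schedule)
```

Under this assumed weighting, moving consumption toward the low-stress period lowers the score even though the total volume is unchanged, which is the kind of location and time choice the abstract refers to.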

Large Language Model Confidence Estimation via Black-Box Access

Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating the confidence of Flan-ul2, Llama-13b, Mistral-7b and GPT-4 on four benchmark Q&A tasks, as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by more than 10% (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.

Updated: 2025-07-01 17:12:01

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2406.04370v4

Uncertainty Quantification of Wind Gust Predictions in the Northeast United States: An Evidential Neural Network and Explainable Artificial Intelligence Approach

Machine learning algorithms have shown promise in reducing bias in wind gust predictions, while still underpredicting high gusts. Uncertainty quantification (UQ) helps address this issue by identifying when predictions are reliable and when they need cautious interpretation. Using data from 61 extratropical storms in the Northeastern USA, we introduce an evidential neural network (ENN) as a novel approach for UQ in gust predictions, leveraging atmospheric variables from the Weather Research and Forecasting (WRF) model. Explainable AI techniques suggested that key predictive features contributed to higher uncertainty, which correlated strongly with storm intensity and spatial gust gradients. Compared to WRF, ENN demonstrated a 47% reduction in RMSE and allowed the construction of gust prediction intervals without an ensemble, successfully capturing at least 95% of observed gusts at 179 out of 266 stations. From an operational perspective, providing gust forecasts with quantified uncertainty enhances stakeholders' confidence in risk assessment and response planning for extreme gust events.

Updated: 2025-07-01 17:10:30

Categories: cs.LG,physics.ao-ph,stat.ML

Download: http://arxiv.org/abs/2502.00300v2

Atmospheric model-trained machine learning selection and classification of ultracool TY dwarfs

The T and Y spectral classes represent the coolest and lowest-mass population of brown dwarfs, yet their census remains incomplete due to limited statistics. Existing detection frameworks are often constrained to identifying M, L, and early T dwarfs, owing to the sparse observational sample of ultracool dwarfs (UCDs) at later types. This paper presents a novel machine learning framework capable of detecting and classifying late-T and Y dwarfs, trained entirely on synthetic photometry from atmospheric models. Utilizing grids from the ATMO 2020 and Sonora Bobcat models, I produce a training dataset over two orders of magnitude larger than any empirical set of >T6 UCDs. Polynomial color relations fitted to the model photometry are used to assign spectral types to these synthetic models, which in turn train an ensemble of classifiers to identify and classify the spectral type of late UCDs. The model is highly performant when validating on both synthetic and empirical datasets, verifying catalogs of known UCDs with object classification metrics >99% and an average spectral type precision within 0.35 +/- 0.37 subtypes. Application of the model to a 1.5 degree region around Pisces and the UKIDSS UDS field results in the discovery of one previously uncatalogued T8.2 candidate, demonstrating the ability of this model-trained approach in discovering faint, late-type UCDs from photometric catalogs.

Updated: 2025-07-01 17:06:16

Categories: astro-ph.SR,astro-ph.EP,astro-ph.IM,cs.LG

Download: http://arxiv.org/abs/2507.00957v1

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.

Updated: 2025-07-01 17:01:12

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.19955v2

Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact

Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT-4.5, DeepSeek, Claude 3.5 Sonnet, Phi-4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token-level prediction and lack of grounded agency. This paper offers a cross-disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test-time adaptation, and training-free methods, as critical pathways toward flexible, domain-agnostic intelligence. Vision-Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self-improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal-directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.

Updated: 2025-07-01 16:52:25

Categories: cs.AI

Download: http://arxiv.org/abs/2507.00951v1

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-invariant benchmark comprising 275 web-based tasks grounded in the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths and standardized action trajectories. Through behavioral analysis, we identify a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories. To address this, we propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps during decision-making. We evaluate ten state-of-the-art web agents on WebArXiv. Results demonstrate clear performance differences across agents and validate the effectiveness of our proposed reflection strategy.

Updated: 2025-07-01 16:43:57

Categories: cs.IR,cs.AI,cs.DB,F.2.2; I.2.7

Download: http://arxiv.org/abs/2507.00938v1

Privacy-Preserving LLM Interaction with Socratic Chain-of-Thought Reasoning and Homomorphically Encrypted Vector Databases

Large language models (LLMs) are increasingly used as personal agents, accessing sensitive user data such as calendars, emails, and medical records. Users currently face a trade-off: They can send private records, many of which are stored in remote databases, to powerful but untrusted LLM providers, increasing their exposure risk. Alternatively, they can run less powerful models locally on trusted devices. We bridge this gap. Our Socratic Chain-of-Thought Reasoning first sends a generic, non-private user query to a powerful, untrusted LLM, which generates a Chain-of-Thought (CoT) prompt and detailed sub-queries without accessing user data. Next, we embed these sub-queries and perform encrypted sub-second semantic search using our Homomorphically Encrypted Vector Database across one million entries of a single user's private data. This represents a realistic scale of personal documents, emails, and records accumulated over years of digital activity. Finally, we feed the CoT prompt and the decrypted records to a local language model and generate the final response. On the LoCoMo long-context QA benchmark, our hybrid framework, combining GPT-4o with a local Llama-3.2-1B model, outperforms using GPT-4o alone by up to 7.1 percentage points. This demonstrates a first step toward systems where tasks are decomposed and split between untrusted strong LLMs and weak local ones, preserving user privacy.

Updated: 2025-07-01 16:41:35

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2506.17336v2

Conformal Inference under High-Dimensional Covariate Shifts via Likelihood-Ratio Regularization

We consider the problem of conformal prediction under covariate shift. Given labeled data from a source domain and unlabeled data from a covariate shifted target domain, we seek to construct prediction sets with valid marginal coverage in the target domain. Most existing methods require estimating the unknown likelihood ratio function, which can be prohibitive for high-dimensional data such as images. To address this challenge, we introduce the likelihood ratio regularized quantile regression (LR-QR) algorithm, which combines the pinball loss with a novel choice of regularization in order to construct a threshold function without directly estimating the unknown likelihood ratio. We show that the LR-QR method has coverage at the desired level in the target domain, up to a small error term that we can control. Our proofs draw on a novel analysis of coverage via stability bounds from learning theory. Our experiments demonstrate that the LR-QR algorithm outperforms existing methods on high-dimensional prediction tasks, including a regression task for the Communities and Crime dataset, an image classification task from the WILDS repository, and an LLM question-answering task on the MMLU benchmark.

Updated: 2025-07-01 16:36:48

Categories: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.13030v5
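
The pinball loss mentioned in the abstract is the standard quantile-regression loss, and a small sketch may help show why minimising it yields a quantile. The code below covers only that loss; the likelihood-ratio regulariser of LR-QR is not specified in the abstract and is not reproduced. The helper `best_constant` is a hypothetical illustration, not part of the paper's algorithm.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average pinball (quantile) loss at level tau, with 0 < tau < 1."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant(values, tau):
    """Pick the element of `values` that minimises the pinball loss as a constant prediction."""
    return min(values, key=lambda c: pinball_loss(values, [c] * len(values), tau))


scores = [1, 2, 3, 4, 100]
print(best_constant(scores, 0.5))              # the median, unaffected by the outlier
print(best_constant(list(range(1, 12)), 0.8))  # a high quantile of 1..11
```

Because the loss penalises under- and over-prediction asymmetrically, its minimiser is a quantile of the scores, which is why it is a natural building block for a conformal threshold function.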

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences. This information is integrated with visual tokens and fed into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and spatial reasoning, all directly learned from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves competitive results compared to existing state-of-the-art methods, and even surpasses the Gemini-1.5-Pro in the VSI-Bench evaluations.

Updated: 2025-07-01 16:26:47

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2505.24625v2

Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications

The long-standing vision of intelligent cities is to create efficient, livable, and sustainable urban environments using big data and artificial intelligence technologies. Recently, the advent of Large Language Models (LLMs) has opened new ways toward realizing this vision. With powerful semantic understanding and reasoning capabilities, LLMs can be deployed as intelligent agents capable of autonomously solving complex problems across domains. In this article, we focus on Urban LLM Agents, which are LLM-powered agents that are semi-embodied within the hybrid cyber-physical-social space of cities and used for system-level urban decision-making. First, we introduce the concept of urban LLM agents, discussing their unique capabilities and features. Second, we survey the current research landscape from the perspective of agent workflows, encompassing urban sensing, memory management, reasoning, execution, and learning. Third, we categorize the application domains of urban LLM agents into five groups: urban planning, transportation, environment, public safety, and urban society, presenting representative works in each group. Finally, we discuss trustworthiness and evaluation issues that are critical for real-world deployment, and identify several open problems for future research. This survey aims to establish a foundation for the emerging field of urban LLM agents and to provide a roadmap for advancing the intersection of LLMs and urban intelligence. A curated list of relevant papers and open-source resources is maintained and continuously updated at https://github.com/usail-hkust/Awesome-Urban-LLM-Agents.

Updated: 2025-07-01 16:18:29

Categories: cs.MA,cs.AI

Download: http://arxiv.org/abs/2507.00914v1

Turning AI Data Centers into Grid-Interactive Assets: Results from a Field Demonstration in Phoenix, Arizona

Artificial intelligence (AI) is fueling exponential electricity demand growth, threatening grid reliability, raising prices for communities paying for new energy infrastructure, and stunting AI innovation as data centers wait for interconnection to constrained grids. This paper presents the first field demonstration, in collaboration with major corporate partners, of a software-only approach--Emerald Conductor--that transforms AI data centers into flexible grid resources that can efficiently and immediately harness existing power systems without massive infrastructure buildout. Conducted at a 256-GPU cluster running representative AI workloads within a commercial, hyperscale cloud data center in Phoenix, Arizona, the trial achieved a 25% reduction in cluster power usage for three hours during peak grid events while maintaining AI quality of service (QoS) guarantees. By orchestrating AI workloads based on real-time grid signals without hardware modifications or energy storage, this platform reimagines data centers as grid-interactive assets that enhance grid reliability, advance affordability, and accelerate AI's development.

Updated: 2025-07-01 16:11:49

Categories: cs.DC,cs.AI,cs.PF,cs.SY,eess.SY

Download: http://arxiv.org/abs/2507.00909v1

The Age of Sensorial Zero Trust: Why We Can No Longer Trust Our Senses

In a world where deepfakes and cloned voices are emerging as sophisticated attack vectors, organizations require a new security mindset: Sensorial Zero Trust [9]. This article presents a scientific analysis of the need to systematically doubt information perceived through the senses, establishing rigorous verification protocols to mitigate the risks of fraud based on generative artificial intelligence. Key concepts, such as Out-of-Band verification, Vision-Language Models (VLMs) as forensic collaborators, cryptographic provenance, and human training, are integrated into a framework that extends Zero Trust principles to human sensory information. The approach is grounded in empirical findings and academic research, emphasizing that in an era of AI-generated realities, even our eyes and ears can no longer be implicitly trusted without verification. Leaders are called to foster a culture of methodological skepticism to protect organizational integrity in this new threat landscape.

Updated: 2025-07-01 16:11:41

Categories: cs.CR,cs.AI,68T07, 68T45, 94A60,K.6.5; D.4.6; I.2.6

Download: http://arxiv.org/abs/2507.00907v1

Deep learning-based segmentation of T1 and T2 cardiac MRI maps for automated disease detection

Objectives: Parametric tissue mapping enables quantitative cardiac tissue characterization but is limited by inter-observer variability during manual delineation. Traditional approaches relying on average relaxation values and single cutoffs may oversimplify myocardial complexity. This study evaluates whether deep learning (DL) can achieve segmentation accuracy comparable to inter-observer variability, explores the utility of statistical features beyond mean T1/T2 values, and assesses whether machine learning (ML) combining multiple features enhances disease detection. Materials & Methods: T1 and T2 maps were manually segmented. The test subset was independently annotated by two observers, and inter-observer variability was assessed. A DL model was trained to segment left ventricle blood pool and myocardium. Average (A), lower quartile (LQ), median (M), and upper quartile (UQ) were computed for the myocardial pixels and employed in classification by applying cutoffs or in ML. Dice similarity coefficient (DICE) and mean absolute percentage error evaluated segmentation performance. Bland-Altman plots assessed inter-user and model-observer agreement. Receiver operating characteristic analysis determined optimal cutoffs. Pearson correlation compared features from model and manual segmentations. F1-score, precision, and recall evaluated classification performance. Wilcoxon test assessed differences between classification methods, with p < 0.05 considered statistically significant. Results: 144 subjects were split into training (100), validation (15) and evaluation (29) subsets. The segmentation model achieved a DICE of 85.4%, surpassing inter-observer agreement. Random forest applied to all features increased F1-score (92.7%, p < 0.001). Conclusion: DL facilitates segmentation of T1/T2 maps. Combining multiple features with ML improves disease detection.

Updated: 2025-07-01 16:08:54

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.00903v1
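
As a rough illustration of two quantities named in the abstract, the sketch below computes the Dice similarity coefficient between two binary masks and the four summary features (A, LQ, M, UQ) of a set of myocardial pixel values. It is a minimal sketch under stated assumptions: masks are flat 0/1 sequences, the function names are hypothetical, and the quartile convention is Python's default for `statistics.quantiles`, which may differ from the one used in the study.

```python
import statistics


def dice_coefficient(mask_a, mask_b):
    """Dice similarity 2|A∩B| / (|A| + |B|) for two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_sum = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / size_sum


def myocardial_features(pixel_values):
    """Average, lower quartile, median and upper quartile of the pixel values."""
    lq, med, uq = statistics.quantiles(pixel_values, n=4)
    return {"A": statistics.fmean(pixel_values), "LQ": lq, "M": med, "UQ": uq}


manual = [1, 1, 0, 0]
model = [1, 0, 1, 0]
print(dice_coefficient(manual, model))
print(myocardial_features([1, 2, 3, 4, 5, 6, 7]))
```

The resulting feature dictionary could then be compared against cutoffs or passed to a classifier, which is the role these features play in the study.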

Constellation as a Service: Tailored Connectivity Management in Direct-Satellite-to-Device Networks

Direct-satellite-to-device (DS2D) communication is emerging as a promising solution for global mobile service extension, leveraging the deployment of satellite constellations. However, managing DS2D connectivity across multiple constellations is becoming a prominent challenge, with high interference and frequent handovers caused by multi-coverage overlap and rapid satellite movement. Moreover, existing approaches primarily operate within a single-constellation shell, which inherently limits the ability to exploit the vast potential of multi-constellation connectivity provision, resulting in suboptimal DS2D service performance. To address these challenges, this article proposes a Constellation as a Service (CaaS) framework, which treats the entire multi-constellation infrastructure as a shared resource pool and dynamically forms optimal sub-constellations (SCs) for each DS2D service region. The formation of each SC integrates satellites from various orbits to provide tailored connectivity based on user demands, guided by two innovative strategies: predictive satellite beamforming using generative artificial intelligence (GenAI) and pre-configured handover paths for efficient satellite access and mobility management. Simulation results demonstrate that CaaS significantly improves satellite service rates while reducing handover overhead, making it an efficient and sustainable solution for managing DS2D connectivity in multi-constellation environments.

Updated: 2025-07-01 16:06:29

Categories: eess.SY,cs.AI,cs.SY,eess.SP

Download: http://arxiv.org/abs/2507.00902v1

The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded contents. In this paper, we investigate the internal dynamics of hallucination by examining token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss: visually grounded tokens gradually become less favored throughout generation; (2) early excitation: semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information: visually grounded tokens, though not eventually decoded, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA on average reduces hallucination by about 40% on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies. Code is available at https://github.com/LzVv123456/VISTA.

Updated: 2025-07-01 16:02:21

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.03628v2

Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts

Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As we can expect from the results of a random search program, Lion incorporates elements from several existing algorithms, including signed momentum, decoupled weight decay, Polak, and Nesterov momentum, but does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ represents the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\kappa$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\kappa$, leading to the solution of a general composite optimization problem of $\min_x f(x) + \kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.

Updated: 2025-07-01 15:59:19

Categories: cs.LG,cs.AI,math.OC,stat.AP,stat.ML

Download: http://arxiv.org/abs/2310.05898v7

MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes

Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions provide. To address these challenges, we introduce MemeCMD, an automatically generated Chinese Multi-turn Dialogue dataset with contextually retrieved memes. Our dataset combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents across diverse scenarios. We introduce a retrieval framework and adaptive threshold to ensure contextually relevant, naturally spaced meme usage. Experiments demonstrate the effectiveness of our approach in generating contextually appropriate and diverse meme-incorporated dialogues, offering a scalable and privacy-preserving resource for advancing multimodal conversational AI.

Updated: 2025-07-01 15:57:14

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.00891v1

A Scalable and Quantum-Accurate Foundation Model for Biomolecular Force Field via Linearly Tensorized Quadrangle Attention

Accurate atomistic biomolecular simulations are vital for disease mechanism understanding, drug discovery, and biomaterial design, but existing simulation methods exhibit significant limitations. Classical force fields are efficient but lack accuracy for transition states and fine conformational details critical in many chemical and biological processes. Quantum Mechanics (QM) methods are highly accurate but computationally infeasible for large-scale or long-time simulations. AI-based force fields (AIFFs) aim to achieve QM-level accuracy with efficiency but struggle to balance many-body modeling complexity, accuracy, and speed, often constrained by limited training data and insufficient validation for generalizability. To overcome these challenges, we introduce LiTEN, a novel equivariant neural network with Tensorized Quadrangle Attention (TQA). TQA efficiently models three- and four-body interactions with linear complexity by reparameterizing high-order tensor features via vector operations, avoiding costly spherical harmonics. Building on LiTEN, LiTEN-FF is a robust AIFF foundation model, pre-trained on the extensive nablaDFT dataset for broad chemical generalization and fine-tuned on SPICE for accurate solvated system simulations. LiTEN achieves state-of-the-art (SOTA) performance across most evaluation subsets of rMD17, MD22, and Chignolin, outperforming leading models such as MACE, NequIP, and EquiFormer. LiTEN-FF enables the most comprehensive suite of downstream biomolecular modeling tasks to date, including QM-level conformer searches, geometry optimization, and free energy surface construction, while offering 10x faster inference than MACE-OFF for large biomolecules (~1000 atoms). In summary, we present a physically grounded, highly efficient framework that advances complex biomolecular modeling, providing a versatile foundation for drug discovery and related applications.

Updated: 2025-07-01 15:52:39

Categories: physics.chem-ph,cs.AI,cs.LG,physics.bio-ph

Download: http://arxiv.org/abs/2507.00884v1

Benchmarking the Pedagogical Knowledge of Large Language Models

Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at https://rebrand.ly/pedagogy which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models' capacities to understand pedagogical concepts, respond appropriately to learners' needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.

Updated: 2025-07-01 15:49:58

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.18710v3

NN-Former: Rethinking Graph Structure in Neural Architecture Representation

The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, both methods have disadvantages. GNNs lack the capability to represent complicated features, while transformers generalize poorly as the depth of the architecture grows. To mitigate these issues, we rethink neural architecture topology and show that sibling nodes are pivotal yet overlooked in previous research. We thus propose a novel predictor leveraging the strengths of GNNs and transformers to learn the enhanced topology. We introduce a novel token mixer that considers siblings, and a new channel mixer named bidirectional graph isomorphism feed-forward network. Our approach consistently achieves promising performance in both accuracy and latency prediction, providing valuable insights for learning Directed Acyclic Graph (DAG) topology. The code is available at https://github.com/XuRuihan/NNFormer.

Updated: 2025-07-01 15:46:18

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.00880v1

Realism in Action: Anomaly-Aware Diagnosis of Brain Tumors from Medical Images Using YOLOv8 and DeiT

Reliable diagnosis of brain tumors remains challenging due to the low clinical incidence of such cases. However, this low rate is neglected in most proposed methods. We propose a clinically inspired framework for anomaly-resilient tumor detection and classification. Detection leverages YOLOv8n fine-tuned on a realistically imbalanced dataset (1:9 tumor-to-normal ratio; 30,000 MRI slices from 81 patients). In addition, we propose a novel Patient-to-Patient (PTP) metric that evaluates diagnostic reliability at the patient level. Classification employs knowledge distillation: a Data-efficient Image Transformer (DeiT) student model is distilled from a ResNet152 teacher. The distilled ViT achieves an F1-score of 0.92 within 20 epochs, approaching the teacher's performance (F1=0.97) with significantly reduced computational resources. This end-to-end framework demonstrates high robustness on clinically representative anomaly-distributed data, offering a viable tool that adheres to realistic situations in clinics.

Updated: 2025-07-01 15:31:37

Categories: eess.IV,cs.AI,cs.CV,cs.LG,stat.ML

Download: http://arxiv.org/abs/2401.03302v4

Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report

This report synthesizes the outcomes of a recent interdisciplinary workshop that brought together leading experts in cognitive psychology, language learning, and artificial intelligence (AI)-based natural language processing (NLP). The workshop, funded by the National Science Foundation, aimed to address a critical knowledge gap in our understanding of the relationship between AI language models and human cognitive processes in text comprehension and composition. Through collaborative dialogue across cognitive, linguistic, and technological perspectives, workshop participants examined the underlying processes involved when humans produce and comprehend text, and how AI can both inform our understanding of these processes and augment human capabilities. The workshop revealed emerging patterns in the relationship between large language models (LLMs) and human cognition, with highlights on both the capabilities of LLMs and their limitations in fully replicating human-like language understanding and generation. Key findings include the potential of LLMs to offer insights into human language processing, the increasing alignment between LLM behavior and human language processing when models are fine-tuned with human feedback, and the opportunities and challenges presented by human-AI collaboration in language tasks. By synthesizing these findings, this report aims to guide future research, development, and implementation of LLMs in cognitive psychology, linguistics, and education. It emphasizes the importance of ethical considerations and responsible use of AI technologies while striving to enhance human capabilities in text comprehension and production through effective human-AI collaboration.

Updated: 2025-07-01 15:26:29

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.22698v2

Fully Differentiable Lagrangian Convolutional Neural Network for Physics-Informed Precipitation Nowcasting

This paper presents a convolutional neural network model for precipitation nowcasting that combines data-driven learning with physics-informed domain knowledge. We propose LUPIN, a Lagrangian Double U-Net for Physics-Informed Nowcasting, that draws from existing extrapolation-based nowcasting methods. It consists of a U-Net that dynamically produces mesoscale advection motion fields, a differentiable semi-Lagrangian extrapolation operator, and an advection-free U-Net capturing the growth and decay of precipitation over time. Using our approach, we successfully implement the Lagrangian convolutional neural network for precipitation nowcasting in a fully differentiable and GPU-accelerated manner. This allows for end-to-end training and inference, including the data-driven Lagrangian coordinate system transformation of the data at runtime. We evaluate the model and compare it with other related AI-based models both quantitatively and qualitatively in an extreme event case study. Based on our evaluation, LUPIN matches and even exceeds the performance of the chosen benchmarks, opening the door for other Lagrangian machine learning models.
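
The semi-Lagrangian extrapolation step can be illustrated in one dimension. The sketch below is a simplified stand-in, not LUPIN's operator: it assumes a 1D field, a per-cell velocity in grid cells per time step, backward trajectory tracing, and linear interpolation. The function name `semi_lagrangian_step` is hypothetical.

```python
import numpy as np

def semi_lagrangian_step(field, velocity, dt=1.0):
    """Advect a 1D field by tracing each grid point back along the motion field."""
    n = field.shape[0]
    x = np.arange(n, dtype=float)
    # Departure point of the parcel that arrives at each grid point,
    # clipped so that it stays inside the domain.
    departure = np.clip(x - velocity * dt, 0.0, n - 1.0)
    # Sample the old field at the departure points (linear interpolation).
    return np.interp(departure, x, field)

rain = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
motion = np.ones(5)  # uniform motion of one cell per step to the right
print(semi_lagrangian_step(rain, motion))
```

Because the operation is built from interpolation, an equivalent version written in an automatic-differentiation framework would pass gradients to both the field and the motion estimate, which is the property the abstract relies on for end-to-end training.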

Updated: 2025-07-01 15:13:39

Categories: cs.LG,cs.AI,cs.CV,I.2.1; J.2

Download: http://arxiv.org/abs/2402.10747v2

SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents

With the wide application of multimodal foundation models in intelligent agent systems, scenarios such as mobile device control, intelligent assistant interaction, and multimodal task execution are gradually relying on such large model-driven agents. However, the related systems are also increasingly exposed to potential jailbreak risks. Attackers may induce the agents to bypass the original behavioral constraints through specific inputs, and then trigger certain risky and sensitive operations, such as modifying settings, executing unauthorized commands, or impersonating user identities, which brings new challenges to system security. Existing security measures for intelligent agents still have limitations when facing complex interactions, especially in detecting potentially risky behaviors across multiple rounds of conversations or sequences of tasks. In addition, an efficient and consistent automated methodology to assist in assessing and determining the impact of such risks is currently lacking. This work explores the security issues surrounding mobile multimodal agents, attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information, and designs an automated assisted assessment scheme based on a large language model. Through preliminary validation in several representative high-risk tasks, the results show that the method can improve the recognition of risky behaviors to some extent and assist in reducing the probability of agents being jailbroken. We hope that this study can provide some valuable references for the security risk modeling and protection of multimodal intelligent agent systems.

Updated: 2025-07-01 15:10:00

Categories: cs.AI,cs.CR

Download: http://arxiv.org/abs/2507.00841v1

Discrete Diffusion in Large Language and Multimodal Models: A Survey

In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy. This paradigm naturally enables parallel generation, fine-grained output controllability, and dynamic, response-aware perception. These capabilities were previously difficult to achieve with AR models. Recently, a growing number of industrial-scale proprietary d(M)LLMs, as well as a large number of open-source academic d(M)LLMs, have demonstrated performance comparable to their autoregressive counterparts, while achieving up to 10x acceleration in inference speed. The advancement of discrete diffusion LLMs and MLLMs has been largely driven by progress in two domains. The first is the development of autoregressive LLMs and MLLMs, which has accumulated vast amounts of data, benchmarks, and foundational infrastructure for training and inference. The second is the evolution of the mathematical models underlying discrete diffusion. Together, these advancements catalyzed a surge in dLLM and dMLLM research in early 2025. In this work, we present a comprehensive overview of the research in the dLLM and dMLLM domains. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and categorize representative models. We further analyze key techniques for training and inference, and summarize emerging applications across language, vision-language, and biological domains. We conclude by discussing future directions for research and deployment. Paper collection: https://github.com/LiQiiiii/DLLM-Survey

Updated: 2025-07-01 15:08:58

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.13759v2

Stylometry recognizes human and LLM-generated texts in short samples

The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1. in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show -- crucially, in the context of the increasingly sophisticated LLMs -- that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
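
To relate the two reported metrics, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from a binary confusion matrix. The abstract's headline MCC is for the 7-class setting; the binary formula is used here only for simplicity. The confusion-matrix counts are hypothetical and are not results from the paper.

```python
import math

def binary_mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical balanced test set: 50 human-written and 50 LLM-generated texts.
counts = dict(tp=49, tn=49, fp=1, fn=1)
print(accuracy(**counts), binary_mcc(**counts))
```

On a balanced set with symmetric errors, as in these hypothetical counts, MCC equals 2 × accuracy − 1. Under class imbalance the two metrics diverge, which is one reason to report MCC alongside accuracy.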

Updated: 2025-07-01 15:08:53

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.00838v1

HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning

For robotic manipulation, existing robotics datasets and simulation benchmarks predominantly cater to robot-arm platforms. However, for humanoid robots equipped with dual arms and dexterous hands, simulation tasks and high-quality demonstrations are notably lacking. Bimanual dexterous manipulation is inherently more complex, as it requires coordinated arm movements and hand operations, making autonomous data collection challenging. This paper presents HumanoidGen, an automated task creation and demonstration collection framework that leverages atomic dexterous operations and LLM reasoning to generate relational constraints. Specifically, we provide spatial annotations for both assets and dexterous hands based on the atomic operations, and use an LLM planner to generate a chain of actionable spatial constraints for arm movements based on object affordances and scenes. To further improve planning ability, we employ a variant of Monte Carlo tree search to enhance LLM reasoning for long-horizon tasks and insufficient annotation. In experiments, we create a novel benchmark with augmented scenarios to evaluate the quality of the collected data. The results show that the performance of the 2D and 3D diffusion policies can scale with the generated dataset. The project page is https://openhumanoidgen.github.io.

Updated: 2025-07-01 15:04:38

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2507.00833v1

Automated anatomy-based post-processing reduces false positives and improved interpretability of deep learning intracranial aneurysm detection

Introduction: Deep learning (DL) models can help detect intracranial aneurysms on CTA, but high false positive (FP) rates remain a barrier to clinical translation, despite improvement in model architectures and strategies like detection threshold tuning. We employed an automated, anatomy-based, heuristic-learning hybrid artery-vein segmentation post-processing method to further reduce FPs. Methods: Two DL models, CPM-Net and a deformable 3D convolutional neural network-transformer hybrid (3D-CNN-TR), were trained with 1,186 open-source CTAs (1,373 annotated aneurysms), and evaluated with 143 held-out private CTAs (218 annotated aneurysms). Brain, artery, vein, and cavernous venous sinus (CVS) segmentation masks were applied to remove possible FPs in the DL outputs that overlapped with: (1) brain mask; (2) vein mask; (3) vein more than artery masks; (4) brain plus vein mask; (5) brain plus vein more than artery masks. Results: CPM-Net yielded 139 true-positives (TP); 79 false-negative (FN); 126 FP. 3D-CNN-TR yielded 179 TP; 39 FN; 182 FP. FPs were commonly extracranial (CPM-Net 27.3%; 3D-CNN-TR 42.3%), venous (CPM-Net 56.3%; 3D-CNN-TR 29.1%), arterial (CPM-Net 11.9%; 3D-CNN-TR 53.3%), and non-vascular (CPM-Net 25.4%; 3D-CNN-TR 9.3%) structures. Method 5 performed best, reducing CPM-Net FP by 70.6% (89/126) and 3D-CNN-TR FP by 51.6% (94/182), without reducing TP, lowering the FP/case rate from 0.88 to 0.26 for CPM-NET, and from 1.27 to 0.62 for the 3D-CNN-TR. Conclusion: Anatomy-based, interpretable post-processing can improve DL-based aneurysm detection model performance. More broadly, automated, domain-informed, hybrid heuristic-learning processing holds promise for improving the performance and clinical acceptance of aneurysm detection models.
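
The reported reductions and per-case rates follow directly from the counts in the abstract (143 held-out CTAs, 218 annotated aneurysms). The short check below reproduces them. The helper names are illustrative, and the sensitivity values are derived here from the TP and FN counts rather than quoted from the abstract.

```python
def fp_summary(fp_before, fp_removed, n_cases):
    """Percent FP reduction and FP-per-case rate before and after post-processing."""
    fp_after = fp_before - fp_removed
    return {
        "reduction_pct": round(100.0 * fp_removed / fp_before, 1),
        "fp_per_case_before": round(fp_before / n_cases, 2),
        "fp_per_case_after": round(fp_after / n_cases, 2),
    }

def sensitivity_pct(tp, fn):
    """Sensitivity (recall) in percent, derived from TP and FN counts."""
    return round(100.0 * tp / (tp + fn), 1)

N_CASES = 143  # held-out CTAs used for evaluation

cpm_net = fp_summary(fp_before=126, fp_removed=89, n_cases=N_CASES)
cnn_tr = fp_summary(fp_before=182, fp_removed=94, n_cases=N_CASES)

print("CPM-Net  ", cpm_net, sensitivity_pct(139, 79))
print("3D-CNN-TR", cnn_tr, sensitivity_pct(179, 39))
```

Since Method 5 removes no true positives, the derived sensitivities (about 63.8% for CPM-Net and 82.1% for 3D-CNN-TR) are unchanged by the post-processing; only the false-positive burden per scan drops.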

Updated: 2025-07-01 15:03:43

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.00832v1

Studying and Improving Graph Neural Network-based Motif Estimation

Graph Neural Networks (GNNs) are a predominant method for graph representation learning. However, beyond subgraph frequency estimation, their application to network motif significance-profile (SP) prediction remains under-explored, with no established benchmarks in the literature. We propose to address this problem, framing SP estimation as a task independent of subgraph frequency estimation. Our approach shifts from frequency counting to direct SP estimation and models the problem as multitarget regression. The reformulation is optimised for interpretability, stability and scalability on large graphs. We validate our method using a large synthetic dataset and further test it on real-world graphs. Our experiments reveal that 1-WL-limited models struggle to make precise estimations of SPs. However, they can generalise to approximate the graph generation processes of networks by comparing their predicted SP with the ones originating from synthetic generators. This first study on GNN-based motif estimation also hints at how direct SP estimation can help go past the theoretical limitations that motif estimation faces when performed through subgraph counting.

Updated: 2025-07-01 15:02:17

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.15709v2

On the Surprising Efficacy of LLMs for Penetration-Testing

This paper presents a critical examination of the surprising efficacy of Large Language Models (LLMs) in penetration testing. The paper thoroughly reviews the evolution of LLMs and their rapidly expanding capabilities which render them increasingly suitable for complex penetration testing operations. It systematically details the historical adoption of LLMs in both academic research and industry, showcasing their application across various offensive security tasks and covering broader phases of the cyber kill chain. Crucially, the analysis also extends to the observed adoption of LLMs by malicious actors, underscoring the inherent dual-use challenge of this technology within the security landscape. The unexpected effectiveness of LLMs in this context is elucidated by several key factors: the strong alignment between penetration testing's reliance on pattern-matching and LLMs' core strengths, their inherent capacity to manage uncertainty in dynamic environments, and cost-effective access to competent pre-trained models through LLM providers. The current landscape of LLM-aided penetration testing is categorized into interactive 'vibe-hacking' and the emergence of fully autonomous systems. The paper identifies and discusses significant obstacles impeding wider adoption and safe deployment. These include critical issues concerning model reliability and stability, paramount safety and security concerns, substantial monetary and ecological costs, implications for privacy and digital sovereignty, complex questions of accountability, and profound ethical dilemmas. This comprehensive review and analysis provides a foundation for discussion on future research directions and the development of robust safeguards at the intersection of AI and security.

Updated: 2025-07-01 15:01:18

Categories: cs.CR

Download: http://arxiv.org/abs/2507.00829v1

A Technique for the Detection of PDF Tampering or Forgery

Tampering or forgery of digital documents has become widespread, most commonly through altering images without any malicious intent such as enhancing the overall appearance of the image. However, there are occasions when tampering of digital documents can have negative consequences, such as financial fraud and reputational damage. Tampering can occur through altering a digital document's text or editing an image's pixels. Many techniques have been developed to detect whether changes have been made to a document. Most of these techniques rely on generating hashes or watermarking the document. These techniques, however, have limitations in that they cannot detect alterations to portable document format (PDF) signatures or other non-visual aspects, such as metadata. This paper presents a new technique that can be used to detect tampering within a PDF document by utilizing the PDF document's file page objects. The technique employs a prototype that can detect changes to a PDF document, such as changes made to the text, images, or metadata of the said file.

Updated: 2025-07-01 14:59:05

Categories: cs.CR

Download: http://arxiv.org/abs/2507.00827v1

A Study of In-Context-Learning-Based Text-to-SQL Errors

Large language models (LLMs) have been adopted to perform text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into structured query language (SQL). However, such a technique faces correctness problems and requires efficient repairing solutions. In this paper, we conduct the first comprehensive study of text-to-SQL errors. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 29 error types of 7 categories. We also find that existing repairing attempts have limited correctness improvement at the cost of high computational overhead with many mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleRepair outperforms existing solutions by repairing 13.8% more queries with negligible mis-repairs and 67.4% less overhead.

Updated: 2025-07-01 14:55:05

Categories: cs.CL,cs.AI,cs.SE

Download: http://arxiv.org/abs/2501.09310v2

MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce the MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. The MMMR comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and 2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics like relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T overall outperform non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, the MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.

Updated: 2025-07-01 14:54:17

Categories: cs.AI

Download: http://arxiv.org/abs/2505.16459v3

Program of Equations Thoughts to Solve Algebra Word Problems

Solving algebraic word problems (AWPs) has recently emerged as an important natural language processing task. Large language models (LLMs) have demonstrated powerful mathematical capabilities, and the Chain-of-Thought technique, which guides LLMs through step-by-step reasoning, has yielded impressive results. However, this reasoning ability is limited by the computational weaknesses of LLMs themselves, where calculation errors can accumulate, leading to incorrect final answers. To address this, we propose Program of Equations Thoughts (POET), which transforms the task of generating step-by-step reasoning answers into a two-stage task of predicting equations and generating code, offloading complex computations to a Python interpreter to avoid calculation errors in LLMs. Furthermore, we propose Zero-shot POET, which utilizes a manually designed template to enable LLMs to directly generate Python code for one-step solving. Our method achieves accuracies of 95.3% and 98.0% on the PEN and ALG514 datasets, respectively, setting a new state-of-the-art (SOTA). Zero-shot POET also achieves the SOTA result of 95.5% on the DRAW-1K dataset.
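The two-stage idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word problem, the function name `solve_2x2`, and the hard-coded coefficients (standing in for the output of the equation-prediction stage) are all assumptions made for illustration. The point is only that once the equations are written down, the arithmetic is done exactly by the Python interpreter rather than by the language model.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 exactly (Cramer's rule)."""
    det = Fraction(a1) * b2 - Fraction(a2) * b1
    if det == 0:
        raise ValueError("equations have no unique solution")
    x = (Fraction(c1) * b2 - Fraction(c2) * b1) / det
    y = (Fraction(a1) * c2 - Fraction(a2) * c1) / det
    return x, y


# Hypothetical word problem: "Two numbers sum to 10 and differ by 4."
# Stage 1 (assumed output of the equation predictor): x + y = 10, x - y = 4
# Stage 2: generated code hands the computation to the interpreter.
x, y = solve_2x2(1, 1, 10, 1, -1, 4)
print(x, y)  # prints: 7 3
```

Using `Fraction` keeps every intermediate value exact, which mirrors the motivation stated in the abstract: calculation errors cannot accumulate when no step is computed approximately.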

Updated: 2025-07-01 14:53:43

Categories: cs.AI

Download: http://arxiv.org/abs/2505.20170v2

CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs

Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model's text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability with specialized fine-tuning for spatiotemporal coherence. Empirical evaluation on comprehensive video understanding benchmarks demonstrates that CAVALRY-V significantly outperforms existing attack methods, achieving 22.8% average improvement over the best baseline attacks on both commercial systems (GPT-4.1, Gemini 2.0) and open-source models (QwenVL-2.5, InternVL-2.5, Llava-Video, Aria, MiniCPM-o-2.6). Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding (34.4% average gain). This capability demonstrates CAVALRY-V's potential as a foundational approach for adversarial research across multimodal systems.

Updated: 2025-07-01 14:48:27

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.00817v1

PI-WAN: A Physics-Informed Wind-Adaptive Network for Quadrotor Dynamics Prediction in Unknown Environments

Accurate dynamics modeling is essential for quadrotors to achieve precise trajectory tracking in various applications. Traditional physical knowledge-driven modeling methods face substantial limitations in unknown environments characterized by variable payloads, wind disturbances, and external perturbations. On the other hand, data-driven modeling methods suffer from poor generalization when handling out-of-distribution (OoD) data, restricting their effectiveness in unknown scenarios. To address these challenges, we introduce the Physics-Informed Wind-Adaptive Network (PI-WAN), which combines knowledge-driven and data-driven modeling methods by embedding physical constraints directly into the training process for robust quadrotor dynamics learning. Specifically, PI-WAN employs a Temporal Convolutional Network (TCN) architecture that efficiently captures temporal dependencies from historical flight data, while a physics-informed loss function applies physical principles to improve model generalization and robustness across previously unseen conditions. By incorporating real-time prediction results into a model predictive control (MPC) framework, we achieve improvements in closed-loop tracking performance. Comprehensive simulations and real-world flight experiments demonstrate that our approach outperforms baseline methods in terms of prediction accuracy, tracking precision, and robustness to unknown environments.

Updated: 2025-07-01 14:48:22

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2507.00816v1

Many LLMs Are More Utilitarian Than One

Moral judgment is integral to large language model (LLM) alignment and social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function collectively during collaboration, compared to individual agents. In human moral judgment, group deliberation leads to a utilitarian boost: a tendency to endorse norm violations that maximize benefits for the greatest number of people despite harms. We study whether a similar dynamic emerges in multi-agent LLM systems. We tested six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reasoned independently, and (2) Group, where they engaged in multi-turn discussions in pairs or triads. In personal moral dilemmas, where agents must decide to directly harm one individual to maximize the utility for others, all models found moral violations to be more acceptable when part of a group than individually, similar to human experiments. Some models endorsed actions that maximized overall well-being, even if they benefited strangers over familiar individuals. Others became more willing to violate moral norms in groups. However, while human groups show a similar action bias, the mechanism for their utilitarian boost differs from LLMs. Whereas the human shift comes from heightened sensitivity to decision outcomes, LLM groups show either reduced norm sensitivity or enhanced impartiality. This suggests that while the surface behavior of LLM collectives mimics human group reasoning, the underlying drivers differ. We discuss the implications for AI alignment, multi-agent design, and artificial moral reasoning.

Updated: 2025-07-01 14:46:16

Categories: cs.CL,cs.AI,cs.CY,I.2.7; I.2.11

Download: http://arxiv.org/abs/2507.00814v1

A Robust Algorithm for Non-IID Machine Learning Problems with Convergence Analysis

In this paper, we propose an improved numerical algorithm for solving minimax problems based on nonsmooth optimization, quadratic programming, and an iterative process. We also provide a rigorous proof of convergence for our algorithm under some mild assumptions, such as gradient continuity and boundedness. Such an algorithm can be widely applied in fields such as robust optimization and imbalanced learning.

Updated: 2025-07-01 14:41:59

Categories: cs.AI,math.OC

Download: http://arxiv.org/abs/2507.00810v1

Careless Whisper: Exploiting Silent Delivery Receipts to Monitor Users on Mobile Instant Messengers

With over 3 billion users globally, mobile instant messaging apps have become indispensable for both personal and professional communication. Besides plain messaging, many services implement additional features such as delivery and read receipts informing a user when a message has successfully reached its target. This paper highlights that delivery receipts can pose significant privacy risks to users. We use specifically crafted messages that trigger delivery receipts allowing any user to be pinged without their knowledge or consent. By using this technique at high frequency, we demonstrate how an attacker could extract private information such as the online and activity status of a victim, e.g., screen on/off. Moreover, we can infer the number of currently active user devices and their operating system, as well as launch resource exhaustion attacks, such as draining a user's battery or data allowance, all without generating any notification on the target side. Due to the widespread adoption of vulnerable messengers (WhatsApp and Signal) and the fact that any user can be targeted simply by knowing their phone number, we argue for a design change to address this issue.

Updated: 2025-07-01 14:41:35

Categories: cs.CR,cs.NI

Download: http://arxiv.org/abs/2411.11194v3

Empirical Analysis of Heuristic and Approximation Algorithms for the Mutual-Visibility Problem

The NP-complete mutual-visibility (MV) problem currently lacks empirical analysis on its practical behaviour despite theoretical studies. This paper addresses this gap by implementing and evaluating three distinct algorithms - a direct greedy heuristic, a hypergraph-based approximation, and a genetic algorithm - on diverse synthetic graph datasets, including those with analytically known $\mu(G)$ values and general graph models. Our results demonstrate that for smaller graphs, the algorithms consistently achieve MV set sizes aligning with theoretical bounds. However, for larger instances, achieved solution sizes notably diverge from theoretical limits; this, combined with the absence of tight bounds, complicates absolute quality assessment. Nevertheless, validation on known optimal graphs showed the Genetic Algorithm and other heuristics empirically performing best among tested methods.
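To make the problem concrete, here is a minimal sketch of a greedy heuristic. It assumes the standard definition, which the abstract does not spell out: a vertex set S is a mutual-visibility set if every pair of vertices in S is joined by some shortest path with no internal vertex in S. The function names, the low-degree-first ordering, and the example graph are illustrative assumptions, not the paper's algorithm.

```python
from collections import deque


def bfs(adj, src, blocked=frozenset()):
    """Hop distances from src. Vertices in `blocked` may be reached
    but are never expanded, so no path passes through them."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u != src and u in blocked:
            continue  # a path may end here, but not continue through it
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_mutual_visibility_set(adj, s):
    """True if every pair in s has a shortest path avoiding s internally."""
    s = frozenset(s)
    for u in s:
        full = bfs(adj, u)                   # true graph distances
        restricted = bfs(adj, u, blocked=s)  # distances avoiding s internally
        if any(restricted.get(v) != full.get(v) for v in s):
            return False
    return True


def greedy_mv_set(adj):
    """Add vertices one at a time (low degree first, an arbitrary choice)
    whenever the set stays mutually visible."""
    chosen = []
    for v in sorted(adj, key=lambda x: len(adj[x])):
        if is_mutual_visibility_set(adj, chosen + [v]):
            chosen.append(v)
    return chosen


# Path graph 0-1-2-3: only the two endpoints can see each other.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_mv_set(path))  # prints: [0, 3]
```

A greedy pass like this yields a valid set but carries no optimality guarantee, which is consistent with the abstract's observation that solution sizes diverge from theoretical limits on larger instances.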

Updated: 2025-07-01 14:35:44

Categories: cs.CG,cs.AI,cs.PF,math.CO

Download: http://arxiv.org/abs/2507.01076v1

Position: Emergent Machina Sapiens Urge Rethinking Multi-Agent Paradigms

Artificial Intelligence (AI) agents capable of autonomous learning and independent decision-making hold great promise for addressing complex challenges across various critical infrastructure domains, including transportation, energy systems, and manufacturing. However, the surge in the design and deployment of AI systems, driven by various stakeholders with distinct and unaligned objectives, introduces a crucial challenge: How can uncoordinated AI systems coexist and evolve harmoniously in shared environments without creating chaos or compromising safety? To address this, we advocate for a fundamental rethinking of existing multi-agent frameworks, such as multi-agent systems and game theory, which are largely limited to predefined rules and static objective structures. We posit that AI agents should be empowered to adjust their objectives dynamically, make compromises, form coalitions, and safely compete or cooperate through evolving relationships and social feedback. Through two case studies in critical infrastructure applications, we call for a shift toward the emergent, self-organizing, and context-aware nature of these multi-agentic AI systems.

Updated: 2025-07-01 14:33:24

Categories: cs.MA,cs.AI

Download: http://arxiv.org/abs/2502.04388v3

OM4OV: Leveraging Ontology Matching for Ontology Versioning

Due to the dynamic nature of the Semantic Web, version control is necessary to capture time-varying information, particularly for widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component for efficient ontology management, the growing size of ontologies and accumulating errors caused by manual labour overwhelm current OV approaches. In this paper, we propose a fresh approach to performing OV using existing ontology matching (OM) techniques and systems. We introduce a unified OM4OV pipeline. From an OM perspective, we reconstruct a new task formulation and measurements for OV tasks. Building upon the prior alignment(s) from OM, we propose a pipeline optimisation method called the cross-reference (CR) mechanism to enhance overall OV performance. We experimentally validate the OM4OV pipeline and the cross-reference mechanism in an OV testbed originating from the Ontology Alignment Evaluation Initiative (OAEI) datasets. We also discuss insights into OM used for OV tasks, where some apparent false mappings detected by OV systems are not actually untrue.

Updated: 2025-07-01 14:31:29

Categories: cs.AI,cs.CL,cs.IR

Download: http://arxiv.org/abs/2409.20302v4

Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability

[Context] AI assistants, like GitHub Copilot and Cursor, are transforming software engineering. While several studies highlight productivity improvements, their impact on maintainability requires further investigation. [Objective] This study investigates whether co-development with AI assistants affects software maintainability, specifically how easily other developers can evolve the resulting source code. [Method] We conducted a two-phase controlled experiment involving 151 participants, 95% of whom were professional developers. In Phase 1, participants added a new feature to a Java web application, with or without AI assistance. In Phase 2, a randomized controlled trial, new participants evolved these solutions without AI assistance. [Results] AI-assisted development in Phase 1 led to a modest speedup in subsequent evolution and slightly higher average CodeHealth. Although neither difference was significant overall, the increase in CodeHealth was statistically significant when habitual AI users completed Phase 1. For Phase 1, we also observed a significant effect that corroborates previous productivity findings: using an AI assistant yielded a 30.7% median decrease in task completion time. Moreover, for habitual AI users, the mean speedup was 55.9%. [Conclusions] Our study adds to the growing evidence that AI assistants can effectively accelerate development. Moreover, we did not observe warning signs of degraded code-level maintainability. We recommend that future research focus on risks such as code bloat from excessive code generation and the build-up of cognitive debt as developers invest less mental effort during implementation.

Updated: 2025-07-01 14:24:37

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2507.00788v1

StreakNet-Arch: An Anti-scattering Network-based Architecture for Underwater Carrier LiDAR-Radar Imaging

In this paper, we introduce StreakNet-Arch, a real-time, end-to-end binary-classification framework based on our self-developed Underwater Carrier LiDAR-Radar (UCLR) that embeds Self-Attention and our novel Double Branch Cross Attention (DBC-Attention) to enhance scatter suppression. Under controlled water tank validation conditions, StreakNet-Arch with Self-Attention or DBC-Attention outperforms traditional bandpass filtering and achieves higher $F_1$ scores than learning-based MP networks and CNNs at comparable model size and complexity. Real-time benchmarks on an NVIDIA RTX 3060 show a constant Average Imaging Time (54 to 84 ms) regardless of frame count, versus a linear increase (58 to 1,257 ms) for conventional methods. To facilitate further research, we contribute a publicly available streak-tube camera image dataset containing 2,695,168 real-world underwater 3D point cloud data. More importantly, we validate our UCLR system in a South China Sea trial, reaching an error of 46 mm for a 3D target at 1,000 m depth and 20 m range. Source code and data are available at https://github.com/BestAnHongjun/StreakNet .

Updated: 2025-07-01 14:19:46

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2404.09158v3

Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment

Large language models (LLMs) are now widely accessible, reaching learners at all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants of the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to evaluating the correctness of the generated solutions, the study analyzes characteristic strengths and limitations of LLM-generated solutions. The findings of this study indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems, on average outperforming the human participants. Prompting techniques had little effect on GPT-4o's performance, while o1-preview almost consistently outperformed both GPT-4o and the human benchmark. Based on these findings, the study discusses implications for the design of summative and formative assessment in physics education, including how to uphold assessment integrity and support students in critically engaging with LLMs.

Updated: 2025-07-01 14:16:43

Categories: physics.ed-ph,cs.AI

Download: http://arxiv.org/abs/2505.09438v2

LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly generated LLM stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
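For readers unfamiliar with Bradley-Terry reward models, the following is a minimal sketch of the standard formulation: the probability that story A is preferred over story B is the logistic function of the difference of their scalar scores, and accuracy is the fraction of labeled pairs in which the human-preferred story scores higher. The abstract does not specify the authors' training objective or evaluation code, so the function names and example scores here are assumptions for illustration only.

```python
import math


def bt_preference_prob(score_a, score_b):
    """Bradley-Terry probability that item A is preferred over item B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_loss(score_chosen, score_rejected):
    """Negative log-likelihood of one human-labeled comparison."""
    return -math.log(bt_preference_prob(score_chosen, score_rejected))


def pairwise_accuracy(pairs):
    """pairs: list of (score of human-preferred story, score of the other).
    Counts how often the model ranks the preferred story strictly higher."""
    return sum(chosen > rejected for chosen, rejected in pairs) / len(pairs)


# Hypothetical reward-model scores for four labeled story comparisons.
scored_pairs = [(1.0, 0.0), (0.2, 0.5), (2.0, 1.0), (0.3, 0.1)]
print(pairwise_accuracy(scored_pairs))  # prints: 0.75
print(bt_preference_prob(0.0, 0.0))     # prints: 0.5
```

Under this reading, the 73% and 78% figures in the abstract correspond to pairwise agreement of this kind on the held-out comparisons.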

Updated: 2025-07-01 14:10:36

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.00769v1

Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention

Semi-supervised learning offers an appealing solution for remote sensing (RS) image segmentation to relieve the burden of labor-intensive pixel-level labeling. However, RS images pose unique challenges, including rich multi-scale features and high inter-class similarity. To address these problems, this paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. Specifically, MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. It improves the multi-scale learning capability of semi-supervised algorithms on unlabeled data. Additionally, MUCA utilizes a Cross-Teacher-Student attention mechanism that guides the student network to construct more discriminative feature representations through complementary features from the teacher network. This design effectively integrates weak and strong augmentations (WA and SA) to further boost segmentation performance. To verify the effectiveness of our model, we conduct extensive experiments on the ISPRS-Potsdam and LoveDA datasets. The experimental results show the superiority of our method over state-of-the-art semi-supervised methods. Notably, our model excels in distinguishing highly similar objects, showcasing its potential for advancing semi-supervised RS image segmentation tasks.

Updated: 2025-07-01 14:03:17

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2501.10736v3

LearnAFE: Circuit-Algorithm Co-design Framework for Learnable Audio Analog Front-End

This paper presents a circuit-algorithm co-design framework for a learnable analog front-end (AFE) in audio signal classification. Designing the AFE and backend classifiers separately is a common practice but non-ideal, as shown in this paper. Instead, this paper proposes a joint optimization of the backend classifier with the AFE's transfer function to achieve a system-level optimum. More specifically, the transfer function parameters of an analog bandpass filter (BPF) bank are tuned in a signal-to-noise ratio (SNR)-aware training loop for the classifier. Using a co-design loss function LBPF, this work shows superior optimization of both the filter bank and the classifier. Implemented in the open-source SKY130 130 nm CMOS process, the optimized design achieved 90.5%-94.2% accuracy for a 10-keyword classification task across a wide range of input signal SNR from 5 dB to 20 dB, with only 22k classifier parameters. Compared to the conventional approach, the proposed audio AFE achieves 8.7% and 12.9% reductions in power and capacitor area, respectively.

Updated: 2025-07-01 13:59:24

Categories: eess.AS,cs.AI,cs.SD

Download: http://arxiv.org/abs/2507.00755v1

Listener-Rewarded Thinking in VLMs for Image Preferences

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but also to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves the best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over a naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.
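To make the idea of a listener-shaped reward concrete, here is a minimal sketch of how a listener confidence score could be folded into a GRPO-style group-relative advantage. The additive combination and the `weight` parameter in `shaped_reward` are assumptions made for illustration; the abstract does not state the exact formula. The within-group standardisation is the usual GRPO advantage.

```python
from statistics import mean, pstdev


def shaped_reward(is_correct, listener_confidence, weight=0.5):
    """Hypothetical shaping: correctness plus a weighted listener confidence in [0, 1]."""
    return float(is_correct) + weight * listener_confidence


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: (answer correct?, listener confidence in it).
group = [(True, 0.9), (True, 0.2), (False, 0.6), (False, 0.1)]
rewards = [shaped_reward(c, p) for c, p in group]
advantages = group_relative_advantages(rewards)
for (correct, conf), adv in zip(group, advantages):
    print(correct, conf, round(adv, 3))
```

In this toy group, two answers are both correct, yet the one the listener finds more convincing receives the larger advantage, which is the behaviour the abstract describes.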

Updated: 2025-07-01 13:53:50

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.22832v2

Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds

This paper presents a complete formal specification, protocol description, and mathematical proof structure for Simplified Payment Verification (SPV) as originally defined in the Bitcoin whitepaper (Nakamoto, 2008). In stark contrast to the misrepresentations proliferated by popular implementations, we show that SPV is not only secure under bounded adversarial assumptions but strictly optimal for digital cash systems requiring scalable and verifiable transaction inclusion. We reconstruct the SPV protocol from first principles, grounding its verification model in symbolic automata, Merkle membership relations, and chain-of-proof dominance predicates. Through rigorous probabilistic and game-theoretic analysis, we derive the economic bounds within which the protocol operates securely and verify its liveness and safety properties under partial connectivity, hostile relay networks, and adversarial propagation delay. Our specification further introduces low-bandwidth optimisations such as adaptive polling and compressed header synchronisation while preserving correctness. This document serves both as a blueprint for secure SPV implementation and a rebuttal of common misconceptions surrounding non-validating clients.
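The Merkle membership relation at the heart of SPV can be illustrated with a short sketch. It builds a Bitcoin-style Merkle tree (double SHA-256, last hash duplicated on odd levels) and checks an inclusion proof the way a light client would against a block header's Merkle root. The function names and the placeholder transaction hashes are assumptions for illustration; header-chain and proof-of-work checks, which the paper also treats, are not shown.

```python
import hashlib


def dsha256(data: bytes) -> bytes:
    """Bitcoin-style double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def merkle_root_and_proof(leaves, index):
    """Return (root, proof); proof is a list of (sibling_hash, sibling_is_right)."""
    level = list(leaves)
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        level = [dsha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify_inclusion(leaf, proof, root):
    """Recompute the path from the leaf upward and compare it with the known root."""
    node = leaf
    for sibling, sibling_is_right in proof:
        node = dsha256(node + sibling) if sibling_is_right else dsha256(sibling + node)
    return node == root


# Placeholder transaction hashes (not real Bitcoin transactions).
txids = [dsha256(f"tx-{i}".encode()) for i in range(5)]
root, proof = merkle_root_and_proof(txids, index=3)
print(verify_inclusion(txids[3], proof, root))  # True
print(verify_inclusion(txids[2], proof, root))  # False
```

The proof holds one hash per tree level, so its size grows logarithmically with the number of transactions in a block, which is what makes the check cheap for a low-bandwidth client.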

Updated: 2025-07-01 13:44:48

Categories: cs.CR,cs.CL,cs.DC,68Q85, 68M10, 94A60, 91A80, 68Q17, 68W10, 68R10,C.2.2; F.2.2; D.4.6; K.6.5

Download: http://arxiv.org/abs/2507.00740v1

HyperCLOVA X THINK Technical Report

We introduce HyperCLOVA X THINK, the first reasoning-focused large language model in the HyperCLOVA X family, pre-trained on roughly 6 trillion high-quality Korean and English tokens, augmented with targeted synthetic Korean data. It was implemented as a compute-memory-balanced Peri-LN Transformer scaled with $\mu$P, pre-trained through a three-stage curriculum that expands the context window to 128K tokens, and post-trained via supervised fine-tuning with Reinforcement Learning from Verifiable Rewards; it supports both detailed rationale and concise-answer modes. It delivers competitive performance against similarly sized models on Korea-focused benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while preserving robust bilingual consistency and translation quality. In addition, a vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM benchmark. All of this is achieved with substantially lower training compute than existing models of similar size. We also present a pruning and distillation technique that will soon be applied to HyperCLOVA X THINK for an open-source and business-friendly foundation model. Altogether, these capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI innovation and a valuable resource for the global research community.

Updated: 2025-07-01 13:39:25

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.22403v2

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, the systematic evaluation of these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark defines 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.

Updated: 2025-07-01 13:22:07

Categories: cs.SD,cs.AI,cs.CL,eess.AS

Download: http://arxiv.org/abs/2505.16211v2

Quasi-symbolic Semantic Geometry over Transformer-based Variational AutoEncoder

Formal/symbolic semantics can provide canonical, rigid controllability and interpretability to sentence representations due to their localisation or composition property. How can we deliver such a property to the current distributional sentence representations to control and interpret the generation of language models (LMs)? In this work, we theoretically frame the sentence semantics as the composition of "semantic role - word content" features and propose the formal semantic geometry. To inject such geometry into Transformer-based LMs (i.e., GPT2), we deploy a Transformer-based Variational AutoEncoder with a supervision approach, where the sentence generation can be manipulated and explained over a low-dimensional latent Gaussian space. In addition, we propose a new probing algorithm to guide the movement of sentence vectors over such geometry. Experimental results reveal that the formal semantic geometry can potentially deliver better control and interpretation to sentence generation.

Updated: 2025-07-01 12:28:50

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2210.06230v3

TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving

Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.4% mAP in lane segment perception and +2.1% OLS in centerline perception tasks.

Updated: 2025-07-01 12:10:46

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.00709v1

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance, with a 13% improvement on T2I-CompBench and a 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1

Updated: 2025-07-01 11:46:57

Categories: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2505.00703v2

Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach

Generative AI systems have revolutionized human interaction by enabling natural language-based coding and problem solving. However, the inherent ambiguity of natural language often leads to imprecise instructions, forcing users to iteratively test, correct, and resubmit their prompts. We propose an iterative approach that systematically narrows down these ambiguities through a structured series of clarification questions and alternative solution proposals, illustrated with input/output examples as well. Once every uncertainty is resolved, a final, precise solution is generated. Evaluated on a diverse dataset spanning coding, data analysis, and creative writing, our method demonstrates superior accuracy, competitive resolution times, and higher user satisfaction compared to conventional one-shot solutions, which typically require multiple manual iterations to achieve a correct output.

Updated: 2025-07-01 11:44:33

Categories: cs.AI,cs.CL,cs.ET,cs.IR,cs.LG

Download: http://arxiv.org/abs/2505.02952v2

Cage-Based Deformation for Transferable and Undefendable Point Cloud Attack

Adversarial attacks on point clouds often impose strict geometric constraints to preserve plausibility; however, such constraints inherently limit transferability and undefendability. While deformation offers an alternative, existing unstructured approaches may introduce unnatural distortions, making adversarial point clouds conspicuous and undermining their plausibility. In this paper, we propose CageAttack, a cage-based deformation framework that produces natural adversarial point clouds. It first constructs a cage around the target object, providing a structured basis for smooth, natural-looking deformation. Perturbations are then applied to the cage vertices, which seamlessly propagate to the point cloud, ensuring that the resulting deformations remain intrinsic to the object and preserve plausibility. Extensive experiments on seven 3D deep neural network classifiers across three datasets show that CageAttack achieves a superior balance among transferability, undefendability, and plausibility, outperforming state-of-the-art methods. Code will be made public upon acceptance.
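The way a perturbation of cage vertices propagates to enclosed points can be sketched with the simplest possible cage: an axis-aligned box whose eight corners control the points through trilinear weights. This is a deliberately simplified stand-in and an assumption on our part; the abstract does not say how CageAttack builds its cage or which coordinates it uses, and all function names below are hypothetical.

```python
from itertools import product


def box_cage(points, margin=0.1):
    """Eight corners of an axis-aligned box around the points, ordered by (x, y, z) bits."""
    lo = [min(p[a] for p in points) - margin for a in range(3)]
    hi = [max(p[a] for p in points) + margin for a in range(3)]
    corners = [tuple(hi[a] if bits[a] else lo[a] for a in range(3))
               for bits in product((0, 1), repeat=3)]
    return corners, lo, hi


def trilinear_weights(p, lo, hi):
    """Weights of point p with respect to the eight box corners (they sum to 1)."""
    t = [(p[a] - lo[a]) / (hi[a] - lo[a]) for a in range(3)]
    weights = []
    for bits in product((0, 1), repeat=3):
        w = 1.0
        for a in range(3):
            w *= t[a] if bits[a] else 1.0 - t[a]
        weights.append(w)
    return weights


def deform(points, lo, hi, new_corners):
    """Move each point to the same weighted combination of the perturbed cage corners."""
    out = []
    for p in points:
        w = trilinear_weights(p, lo, hi)
        out.append(tuple(sum(wi * c[a] for wi, c in zip(w, new_corners))
                         for a in range(3)))
    return out


cloud = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.2), (0.3, 0.9, 0.7)]
corners, lo, hi = box_cage(cloud)
# Perturb one cage vertex (the all-max corner); an attack would optimise such offsets.
perturbed = list(corners)
perturbed[7] = tuple(c + d for c, d in zip(corners[7], (0.2, 0.0, -0.1)))
print(deform(cloud, lo, hi, corners))    # the unperturbed cage reproduces the cloud up to rounding
print(deform(cloud, lo, hi, perturbed))
```

Because every point moves as a smooth blend of a few cage vertices, nearby points move together, which is the intuition behind the abstract's claim that cage perturbations avoid conspicuous point-wise distortions.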

Updated: 2025-07-01 11:42:12

Categories: cs.CV,cs.CR

Download: http://arxiv.org/abs/2507.00690v1

eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems

We present eACGM, a full-stack AI/ML system monitoring framework based on eBPF. eACGM collects real-time performance data from key hardware components, including the GPU and network communication layer, as well as from key software stacks such as CUDA, Python, and PyTorch, all without requiring any code instrumentation or modifications. Additionally, it leverages libnvml to gather process-level GPU resource usage information. By applying a Gaussian Mixture Model (GMM) to the collected multidimensional performance metrics for statistical modeling and clustering analysis, eACGM effectively identifies complex failure modes, such as latency anomalies, hardware failures, and communication inefficiencies, enabling rapid diagnosis of system bottlenecks and abnormal behaviors. To evaluate eACGM's effectiveness and practicality, we conducted extensive empirical studies and case analyses in multi-node distributed training scenarios. The results demonstrate that eACGM, while maintaining a non-intrusive and low-overhead profile, successfully captures critical performance anomalies during model training and inference. Its stable anomaly detection performance and comprehensive monitoring capabilities validate its applicability and scalability in real-world production environments, providing strong support for performance optimization and fault diagnosis in large-scale AI/ML systems.

Updated: 2025-07-01 11:37:52

Categories: cs.DC,cs.AI,cs.NI

Download: http://arxiv.org/abs/2506.02007v2

From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead. However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter's capacity to address data conflict through dual structural optimization. Specifically, we utilize two subspaces: a skill space for stable, holistic knowledge retention, and a rank-rectified task space that locally activates the holistic knowledge. Additionally, we introduce Visual Cue Enhancement (VCE), a multi-level local feature aggregation module designed to enrich the vision-language projection with local details. Our approach is both memory- and time-efficient, requiring only 1.16× the inference time of the standard LoRA method (with injection into the query and value projection layers), and just 73% of the inference time of a 4-expert LoRA-MoE. Extensive experiments on various downstream tasks and general MLLM benchmarks validate the effectiveness of our proposed methods.
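For readers unfamiliar with the baseline the abstract compares against, the following sketch shows the standard LoRA forward pass: a frozen weight plus a scaled low-rank update. It illustrates plain LoRA only, not Dual-LoRA, whose skill and task spaces are not specified in the abstract. The shapes, values, and the helper names `matmul` and `lora_forward` are assumptions chosen for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha):
    """Standard LoRA: y = x W + (alpha / r) * (x A) B, with W frozen and A, B trainable."""
    rank = len(b)  # B has one row per rank dimension
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)
    scale = alpha / rank
    return [[y + scale * u for y, u in zip(y_row, u_row)]
            for y_row, u_row in zip(base, update)]


x = [[1.0, 2.0, 3.0]]                      # one token, d_in = 3
w = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]   # frozen weight, d_in x d_out
a = [[0.1], [0.2], [0.3]]                  # trainable down-projection, d_in x r (r = 1)
b_zero = [[0.0, 0.0]]                      # up-projection at initialisation, r x d_out
b_trained = [[1.0, -1.0]]                  # up-projection after some hypothetical training

print(lora_forward(x, w, a, b_zero, alpha=2.0))     # equals the frozen layer's output
print(lora_forward(x, w, a, b_trained, alpha=2.0))
```

With the up-projection initialised to zero, the adapted layer starts out identical to the frozen one, and only the small matrices `a` and `b` are trained. In the setting the abstract mentions, such adapters are injected into the query and value projection layers.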

Updated: 2025-07-01 11:18:54

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2411.12787v3

Understanding the Identity-Transformation Approach in OIDC-Compatible Privacy-Preserving SSO Services

OpenID Connect (OIDC) enables a user with a commercial-off-the-shelf browser to log into multiple websites, called relying parties (RPs), with the username and credential she set up in another trusted web system, called the identity provider (IdP). Identity transformations are proposed in UppreSSO to provide OIDC-compatible SSO services, preventing both IdP-based login tracing and RP-based identity linkage. While the security and privacy of SSO services in UppreSSO have been proved, several essential issues of this identity-transformation approach are not well studied. In this paper, we comprehensively investigate the approach as follows. Firstly, several suggestions for the efficient integration of identity transformations in OIDC-compatible SSO are explained. Then, we uncover the relationship between identity transformations in SSO and oblivious pseudo-random functions (OPRFs), and present two variations of the properties required for SSO security, as well as the privacy requirements, to analyze existing OPRF protocols. Finally, new identity transformations, different from those designed in UppreSSO, are constructed based on OPRFs, satisfying different variations of SSO security requirements. To the best of our knowledge, this is the first work to uncover the relationship between identity transformations in OIDC-compatible privacy-preserving SSO services and OPRFs, and to prove the SSO-related properties (i.e., key-identifier freeness, RP designation and user identification) of OPRF protocols, in addition to the basic properties of correctness, obliviousness and pseudo-randomness.

Updated: 2025-07-01 11:18:49

Categories: cs.CR

Download: http://arxiv.org/abs/2506.01325v2

How Resilient is QUIC to Security and Privacy Attacks?

QUIC has rapidly evolved into a cornerstone transport protocol for secure, low-latency communications, yet its deployment continues to expose critical security and privacy vulnerabilities, particularly during connection establishment phases and via traffic analysis. This paper systematically revisits a comprehensive set of attacks on QUIC and emerging privacy threats. Building upon these observations, we critically analyze recent IETF mitigation efforts, including TLS Encrypted Client Hello (ECH), Oblivious HTTP (OHTTP) and MASQUE. We analyze how these mechanisms enhance privacy while introducing new operational risks, particularly under adversarial load. Additionally, we discuss emerging challenges posed by post-quantum cryptographic (PQC) handshakes, including handshake expansion and metadata leakage risks. Our analysis highlights ongoing gaps between theoretical defenses and practical deployments, and proposes new research directions focused on adaptive privacy mechanisms. Building on these insights, we propose future directions to ensure long-term security of QUIC and aim to guide its evolution as a robust, privacy-preserving, and resilient transport foundation for the next-generation Internet.

Updated: 2025-07-01 11:12:37

Categories: cs.CR,cs.NI

Download: http://arxiv.org/abs/2401.06657v3

Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding

3D Visual Grounding (3DVG) involves localizing target objects in 3D point clouds based on natural language. While prior work has made strides using textual descriptions, leveraging spoken language, known as Audio-based 3D Visual Grounding, remains underexplored and challenging. Motivated by advances in automatic speech recognition (ASR) and speech representation learning, we propose Audio-3DVG, a simple yet effective framework that integrates audio and spatial information for enhanced grounding. Rather than treating speech as a monolithic input, we decompose the task into two complementary components. First, we introduce Object Mention Detection, a multi-label classification task that explicitly identifies which objects are referred to in the audio, enabling more structured audio-scene reasoning. Second, we propose an Audio-Guided Attention module that captures interactions between candidate objects and relational speech cues, improving target discrimination in cluttered scenes. To support benchmarking, we synthesize audio descriptions for standard 3DVG datasets, including ScanRefer, Sr3D, and Nr3D. Experimental results demonstrate that Audio-3DVG not only achieves new state-of-the-art performance in audio-based grounding, but also competes with text-based methods, highlighting the promise of integrating spoken language into 3D vision tasks.

Updated: 2025-07-01 11:08:22

Categories: cs.LG,cs.AI,cs.CV,cs.RO

Download: http://arxiv.org/abs/2507.00669v1

SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present Sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our code is available at https://github.com/xzy-101/SAFER-code. This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.
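The salience measure mentioned in the abstract, activation differences between chosen and rejected responses, can be sketched as a per-feature difference of means. This is only one plausible reading: the exact statistic SAFER uses is not given in the abstract, and the function name `feature_salience` and the toy activations below are assumptions.

```python
def column_mean(rows, j):
    """Mean of feature j across a list of activation rows."""
    return sum(r[j] for r in rows) / len(rows)


def feature_salience(chosen, rejected):
    """Per-feature mean activation on chosen responses minus that on rejected ones."""
    n_features = len(chosen[0])
    return [column_mean(chosen, j) - column_mean(rejected, j) for j in range(n_features)]


# Rows are responses, columns are SAE features (toy numbers, not from the paper).
chosen = [[0.9, 0.1, 0.0], [0.7, 0.2, 0.1]]
rejected = [[0.1, 0.2, 0.8], [0.3, 0.1, 0.6]]

scores = feature_salience(chosen, rejected)
# Rank features by how strongly they separate chosen from rejected responses.
ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
print(ranked)  # [2, 0, 1]
```

Features at the top of such a ranking are the natural candidates for the targeted poisoning and denoising strategies the abstract describes, since they carry most of the chosen-versus-rejected signal.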

Updated: 2025-07-01 11:04:03

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.00665v1

Identity Preserving 3D Head Stylization with Multiview Score Distillation

3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. More visuals are available at https://three-bee.github.io/head_stylization.

Updated: 2025-07-01 11:01:37

Categories: cs.CV,cs.AI,cs.GR,cs.LG,cs.MM

Download: http://arxiv.org/abs/2411.13536v2

Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity

We investigate how Large Language Models (LLMs) behave when simulating political discourse on social media. Leveraging 21 million interactions on X during the 2024 U.S. presidential election, we construct LLM agents based on 1,186 real users, prompting them to reply to politically salient tweets under controlled conditions. Agents are initialized either with minimal ideological cues (Zero Shot) or recent tweet history (Few Shot), allowing one-to-one comparisons with human replies. We evaluate three model families (Gemini, Mistral, and DeepSeek) across linguistic style, ideological consistency, and toxicity. We find that richer contextualization improves internal consistency but also amplifies polarization, stylized signals, and harmful language. We observe an emergent distortion that we call "generative exaggeration": a systematic amplification of salient traits beyond empirical baselines. Our analysis shows that LLMs do not emulate users; they reconstruct them. Their outputs reflect internal optimization dynamics more than observed behavior, introducing structural biases that compromise their reliability as social proxies. This challenges their use in content moderation, deliberative simulations, and policy modeling.

Updated: 2025-07-01 10:54:51

标题: LLM社交代理人中的生成夸大：一致性、偏见和毒性

摘要: 我们研究了大型语言模型（LLMs）在模拟社交媒体上的政治话语时的行为。利用2024年美国总统选举期间X上的2100万次互动，我们基于1186名真实用户构建了LLM代理，要求他们在受控条件下回复政治敏感推文。代理可以初始化为最小的意识形态线索（零射击）或最近的推文历史（少射击），允许与人类回复进行一对一比较。我们评估了三个模型系列（Gemini、Mistral和DeepSeek），涵盖语言风格、意识形态一致性和有毒性。我们发现，更丰富的情境化改善了内部一致性，但也加剧了极化、风格化信号和有害语言。我们观察到一个我们称之为“生成夸张”的新现象：对显著特征的系统放大超过经验基线。我们的分析显示，LLMs并不模拟用户，而是重构他们。它们的输出确实反映了内部优化动态，而不是观察到的行为，引入了损害它们作为社会代理的可靠性的结构性偏见。这挑战了它们在内容管理、审议模拟和政策建模中的使用。

更新时间: 2025-07-01 10:54:51

领域: cs.HC,cs.AI,cs.SI

下载: http://arxiv.org/abs/2507.00657v1

Cognitive Load-Aware Inference: A Neuro-Symbolic Framework for Optimizing the Token Economy of Large Language Models

The escalating computational costs of Large Language Model (LLM) inference have become a critical barrier to their widespread and sustainable deployment. While existing optimization strategies are effective, they are predominantly based on statistical heuristics or architectural modifications, lacking a guiding cognitive theory to manage the inference process itself. This paper aims to bridge this gap by introducing a novel paradigm: the Cognitive Load-Aware Inference (CLAI) framework, which operationalizes principles from Cognitive Load Theory (CLT) and neuroscience for LLM inference. We formalize the concepts of Intrinsic Cognitive Load, Extraneous Cognitive Load, and Germane Cognitive Load into quantifiable LLM metrics ($ICL_{LLM}$, $ECL_{LLM}$, and $GCL_{LLM}$), thereby reframing the inference process as a cognitive economics optimization problem: based on the intrinsic complexity of a problem ($ICL_{LLM}$), minimize wasteful computation ($ECL_{LLM}$), and strategically allocate the token budget to productive reasoning ($GCL_{LLM}$). We propose two implementation paths: CLAI-Prompt, a zero-shot method that guides a base LLM through cognitive control steps via a structured meta-prompt, and CLAI-Tune, a fine-tuned model that internalizes these principles for spontaneous cognitive economy. Across a range of benchmarks in complex reasoning, long-context question answering, and code generation, our methods achieve significant reductions in token consumption (up to 45\%) without sacrificing accuracy. Furthermore, CLAI-Tune exhibits an emergent ability to autonomously decompose difficult problems, a key characteristic of human expert cognition. This work demonstrates that by emulating the brain's resource management strategies, we can build more efficient, robust, and capable artificial intelligence systems.

Updated: 2025-07-01 10:51:18

标题: 认知负荷感知推理：一种用于优化大型语言模型令牌经济的神经符号框架

摘要: Large Language Model（LLM）推理的计算成本不断增加，已成为它们广泛和可持续部署的关键障碍。尽管现有的优化策略是有效的，但它们主要基于统计启发式或架构修改，缺乏管理推理过程本身的指导认知理论。本文旨在通过引入一种新的范式：认知负荷感知推理（CLAI）框架，该框架将认知负荷理论（CLT）和神经科学的原则运用于LLM推理。我们将内在认知负载、外部认知负载和德尔曼认知负载的概念形式化为可量化的LLM指标（$ICL_{LLM}$，$ECL_{LLM}$和$GCL_{LLM}$），从而将推理过程重新构建为一种认知经济优化问题：基于问题的内在复杂性（$ICL_{LLM}$），最小化计算浪费（$ECL_{LLM}$），并将令牌预算战略性地分配给生产性推理（$GCL_{LLM}$）。我们提出两种实施路径：CLAI-Prompt，一种零-shot方法，通过结构化的元提示引导基础LLM通过认知控制步骤，以及CLAI-Tune，一种经过微调的模型，它内化了这些原则以实现自发的认知经济。在复杂推理、长文本问题回答和代码生成的一系列基准测试中，我们的方法实现了令牌消耗的显著降低（最高达45%），而不牺牲准确性。此外，CLAI-Tune表现出一种自主分解困难问题的新能力，这是人类专家认知的一个关键特征。这项工作表明，通过模仿大脑的资源管理策略，我们可以构建更高效、更稳健、更有能力的人工智能系统。

更新时间: 2025-07-01 10:51:18

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.00653v1

Against 'softmaxing' culture

AI is flattening culture. Evaluations of "culture" are showing the myriad ways in which large AI models are homogenizing language and culture, averaging out rich linguistic differences into generic expressions. I call this phenomenon "softmaxing culture," and it is one of the fundamental challenges facing AI evaluations today. Efforts to improve and strengthen evaluations of culture are central to the project of cultural alignment in large AI systems. This position paper argues that machine learning (ML) and human-computer interaction (HCI) approaches to evaluation are limited. I propose two key conceptual shifts. First, instead of asking "what is culture?" at the start of system evaluations, I propose beginning with the question: "when is culture?" Second, while I acknowledge the philosophical claim that cultural universals exist, the challenge is not simply to describe them, but to situate them in relation to their particulars. Taken together, these conceptual shifts invite evaluation approaches that move beyond technical requirements toward perspectives that are more responsive to the complexities of culture.

Updated: 2025-07-01 10:45:21

标题: 抵制“softmaxing”文化

摘要: 人工智能正在拉平文化。对“文化”的评估显示了大型人工智能模型如何使语言和文化同质化，将丰富的语言差异平均化为通用表达。我将这一现象称为“文化的softmaxing”，它是当今人工智能评估面临的基本挑战之一。改进和加强对文化的评估是大型人工智能系统文化对齐项目的核心。本文主张，机器学习（ML）和人机交互（HCI）方法对评估有限。我提出了两个关键概念转变。首先，我建议在系统评估开始时，不再问“文化是什么？”，而是从问题“文化何时发生？”开始。其次，虽然我承认了文化普遍存在的哲学主张，但挑战并非简单地描述它们，而是将其置于其特定环境中。综合起来，这些概念转变鼓励评估方法跳出技术要求，转向更能响应文化复杂性的视角。

更新时间: 2025-07-01 10:45:21

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2506.22968v2

SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning

Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction-following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules, one for visual understanding and another for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a new CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model's ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to unseen tasks, and ensuring robustness in following diverse instructions. Code is available at https://github.com/Minato-Zackie/SMoLoRA.

Updated: 2025-07-01 10:40:05

标题: SMoLoRA：在持续视觉指导调整中探索和抵抗双重灾难性遗忘

摘要: 视觉指令调谐（VIT）使多模态大型语言模型（MLLMs）能够通过将其构建为基于语言的指令来有效处理广泛的视觉任务。在此基础上，持续的视觉指令调谐（CVIT）扩展了MLLMs的能力，使其能够逐步学习新任务，适应不断发展的功能。尽管先前的工作通过开发新的基准和方法来减轻灾难性遗忘，但这些努力大部分遵循传统的持续学习范式，忽视了CVIT特有的挑战。我们确定CVIT中存在双重遗忘形式，即MLLMs不仅忘记了先前学到的视觉理解，而且在获取新任务时也经历了指令跟随能力的下降。为了解决这个问题，我们引入了可分离的低秩适应（SMoLoRA）框架，该框架通过两个不同模块进行可分离路由，一个用于视觉理解，另一个用于指令跟随。这种双路由设计使两个领域都能进行专门的适应，防止遗忘同时提高性能。此外，我们提出一个新的CVIT基准，通过评估模型对未见任务的泛化能力以及在各种任务中处理多样化指令的能力，超越了现有基准。大量实验证明，SMoLoRA在减轻双重遗忘、提高对未见任务的泛化以及确保遵循各种指令方面优于现有方法。代码可在https://github.com/Minato-Zackie/SMoLoRA找到。

更新时间: 2025-07-01 10:40:05

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2411.13949v2

Integrating Network and Attack Graphs for Service-Centric Impact Analysis

We present a novel methodology for modelling, visualising, and analysing cyber threats, attack paths, and their impact on user services in enterprise or infrastructure networks of digital devices and the services they provide. Using probabilistic methods to track the propagation of an attack through attack graphs, via the service or application layers, and on physical communication networks, our model enables us to analyse cyber attacks at different levels of detail. Understanding how an attack propagates among the microservices within a service, and how it spreads between different services or application servers, could help detect and mitigate it early. We demonstrate that this network-based influence-spreading modelling approach enables the evaluation of diverse attack scenarios and the development of protection and mitigation measures, taking into account the criticality of services from the user's perspective. This methodology could also aid security specialists and system administrators in making well-informed decisions regarding risk mitigation strategies.

Updated: 2025-07-01 10:29:45

标题: 将网络和攻击图集成用于基于服务的影响分析

摘要: 我们提出了一种新的方法论，用于对企业或基础设施网络中数字设备和服务的网络威胁、攻击路径以及它们对用户服务的影响进行建模、可视化和分析。通过使用概率方法跟踪攻击通过攻击图、服务或应用层以及物理通信网络的传播，我们的模型使我们能够以不同层次的细节分析网络攻击。了解攻击在微服务中的传播以及在不同服务或应用服务器之间的传播，有助于及早检测和减轻其影响。我们证明，这种基于网络影响传播建模方法使得能够评估不同的攻击场景并制定保护和减轻措施，考虑到用户角度下服务的关键性。这种方法论也能够帮助安全专家和系统管理员做出关于风险减轻策略的明智决策。

更新时间: 2025-07-01 10:29:45

领域: cs.CR,cs.SI

下载: http://arxiv.org/abs/2507.00637v1

SKALD: Scalable K-Anonymisation for Large Datasets

Data privacy and anonymisation are critical concerns in today's data-driven society, particularly when handling personal and sensitive user data. Regulatory frameworks worldwide recommend privacy-preserving protocols such as k-anonymisation to de-identify releases of tabular data. Available hardware resources provide an upper bound on the maximum size of dataset that can be processed at a time. Large datasets with sizes exceeding this upper bound must be broken up into smaller data chunks for processing. In these cases, standard k-anonymisation tools such as ARX can only operate on a per-chunk basis. This paper proposes SKALD, a novel algorithm for performing k-anonymisation on large datasets with limited RAM. Our SKALD algorithm offers multi-fold performance improvement over standard k-anonymisation methods by extracting and combining sufficient statistics from each chunk during processing to ensure successful k-anonymisation while providing better utility.

Updated: 2025-07-01 10:09:57

标题: SKALD：大型数据集的可扩展K-匿名化

摘要: 数据隐私和匿名化是当今数据驱动社会中的关键问题，特别是在处理个人和敏感用户数据时。全球各地的监管框架推荐隐私保护协议，如k-匿名化来去识别表格数据的发布。可用的硬件资源为一次处理的数据集大小提供了上限。超过这个上限的尺寸较大的数据集必须被分割成更小的数据块进行处理。在这些情况下，标准的k-匿名化工具，如ARX，只能在每个数据块基础上运行。本文提出了SKALD，一种新颖的算法，用于在有限RAM上对大型数据集进行k-匿名化。我们的SKALD算法通过在处理过程中从每个数据块中提取和组合足够的统计数据，以确保成功进行k-匿名化，并提供更好的效用，相对于标准的k-匿名化方法，性能提升多倍。

更新时间: 2025-07-01 10:09:57

领域: cs.IT,cs.CR,math.IT

下载: http://arxiv.org/abs/2505.03529v2
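
To make the chunked setting in the SKALD abstract above more concrete, here is a minimal Python sketch of checking k-anonymity when the data arrives in chunks. It is not the SKALD algorithm itself: it only illustrates the general idea of combining per-chunk statistics, using group counts over quasi-identifiers as the statistic. The column names, the toy rows, and the helper names `chunk_counts` and `is_k_anonymous` are all assumptions made for illustration.

```python
from collections import Counter


def chunk_counts(chunk, quasi_identifiers):
    """Count rows per quasi-identifier combination within a single chunk."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in chunk)


def is_k_anonymous(chunks, quasi_identifiers, k):
    """Merge per-chunk counts, then require every group to hold at least k rows."""
    total = Counter()
    for chunk in chunks:
        total.update(chunk_counts(chunk, quasi_identifiers))
    return all(count >= k for count in total.values())


# Hypothetical, already-generalised records split across two chunks.
chunks = [
    [{"age": "30-39", "zip": "130**"}, {"age": "40-49", "zip": "148**"}],
    [{"age": "30-39", "zip": "130**"}, {"age": "40-49", "zip": "148**"}],
]
qi = ["age", "zip"]

first_chunk_only = is_k_anonymous(chunks[:1], qi, k=2)  # each group has 1 row
all_chunks = is_k_anonymous(chunks, qi, k=2)            # each group has 2 rows
```

In this toy case neither chunk is 2-anonymous on its own, while the merged counts are, which is one way to see why a purely per-chunk tool can be more conservative than a method that shares statistics across chunks.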

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

In recent years, Large-Language-Model-driven AI agents have exhibited unprecedented intelligence and adaptability, and are rapidly changing human production and life. Agents are now undergoing a new round of evolution. They no longer act as isolated islands, as LLMs do. Instead, they are starting to communicate with diverse external entities, such as other agents and tools, to perform more complex tasks collectively. Under this trend, agent communication is regarded as a foundational pillar of the future AI ecosystem, and many organizations have begun intensively designing related communication protocols (e.g., Anthropic's MCP and Google's A2A) within the past few months. However, this new field exposes significant security hazards, which can cause severe damage in real-world scenarios. To help researchers quickly grasp this promising topic and to benefit future agent communication development, this paper presents a comprehensive survey of agent communication security. More precisely, we first present a clear definition of agent communication and categorize the entire lifecycle of agent communication into three stages: user-agent interaction, agent-agent communication, and agent-environment communication. Next, for each communication phase, we dissect related protocols and analyze the security risks according to the communication characteristics. Then, we summarize the possible defense countermeasures for each risk and discuss their outlook. In addition, we conduct experiments using MCP and A2A to help readers better understand the novel vulnerabilities brought by agent communication. Finally, we discuss open issues and future directions in this promising research field.

Updated: 2025-07-01 09:47:45

标题: 一个LLM驱动的AI代理通信调查：协议、安全风险和防御对策

摘要: 最近几年，由大型语言模型驱动的人工智能代理展现出前所未有的智能和适应性，正在快速改变人类生产和生活。如今，代理正在经历一轮新的演变。它们不再像LLMs那样作为孤立的岛屿。相反，它们开始与多样化的外部实体进行通信，如其他代理和工具，以集体执行更复杂的任务。在这一趋势下，代理通信被视为未来人工智能生态系统的基础支柱，许多组织已经开始密集地设计相关的通信协议（例如，Anthropic的MCP和Google的A2A）在最近几个月内。然而，这一新领域暴露出重大的安全隐患，可能对现实场景造成严重破坏。为了帮助研究人员快速了解这一有前途的主题，并使未来的代理通信发展受益，本文提出了一份代理通信安全的全面调查。更具体地，我们首先对代理通信进行了清晰的定义，并将代理通信的整个生命周期分为三个阶段：用户-代理交互、代理-代理通信和代理-环境通信。接下来，针对每个通信阶段，我们剖析了相关协议，并根据通信特征分析了安全风险。然后，我们总结并展望了针对每种风险的可能防御对策。此外，我们使用MCP和A2A进行实验，以帮助读者更好地了解代理通信带来的新型漏洞。最后，我们讨论了这一有前途研究领域的未解问题和未来方向。

更新时间: 2025-07-01 09:47:45

领域: cs.CR

下载: http://arxiv.org/abs/2506.19676v2

Physics-Informed Neural ODEs for Temporal Dynamics Modeling in Cardiac T1 Mapping

Spin-lattice relaxation time ($T_1$) is an important biomarker in cardiac parametric mapping for characterizing myocardial tissue and diagnosing cardiomyopathies. Conventional Modified Look-Locker Inversion Recovery (MOLLI) acquires 11 breath-hold baseline images with interleaved rest periods to ensure mapping accuracy. However, prolonged scanning can be challenging for patients with poor breathholds, often leading to motion artifacts that degrade image quality. In addition, $T_1$ mapping requires voxel-wise nonlinear fitting to a signal recovery model involving an iterative estimation process. Recent studies have proposed deep-learning approaches for rapid $T_1$ mapping using shortened sequences to reduce acquisition time for patient comfort. Nevertheless, existing methods overlook important physics constraints, limiting interpretability and generalization. In this work, we present an accelerated, end-to-end $T_1$ mapping framework leveraging Physics-Informed Neural Ordinary Differential Equations (ODEs) to model temporal dynamics and address these challenges. Our method achieves high-accuracy $T_1$ estimation from a sparse subset of baseline images and ensures efficient null index estimation at test time. Specifically, we develop a continuous-time LSTM-ODE model to enable selective Look-Locker (LL) data acquisition with arbitrary time lags. Experimental results show superior performance in $T_1$ estimation for both native and post-contrast sequences and demonstrate the strong benefit of our physics-based formulation over direct data-driven $T_1$ priors.

Updated: 2025-07-01 09:47:22

标题: 物理信息神经ODE用于心脏T1映射中的时间动态建模

摘要: 自旋晶格弛豫时间（$T_1$）是心脏参数映射中的重要生物标志物，用于表征心肌组织并诊断心肌病。传统的修改后的Look-Locker反转恢复（MOLLI）通过交错休息期获得11个屏气基线图像，以确保映射精度。然而，对于呼吸不佳的患者，长时间的扫描可能是具有挑战性的，往往导致运动伪影，从而降低图像质量。此外，$T_1$映射需要对信号恢复模型进行逐体素非线性拟合，涉及迭代估计过程。最近的研究提出了深度学习方法，通过缩短序列以减少患者舒适度的采集时间，用于快速$T_1$映射。然而，现有方法忽视了重要的物理约束，限制了可解释性和泛化性。在这项工作中，我们提出了一种加速的、端到端的$T_1$映射框架，利用物理信息神经常微分方程（ODEs）来模拟时间动态并解决这些挑战。我们的方法通过稀疏基线图像的子集实现了高精度的$T_1$估计，并确保在测试时进行有效的零指数估计。具体而言，我们开发了一个连续时间LSTM-ODE模型，以实现具有任意时间滞后的选择性Look-Locker（LL）数据采集。实验结果显示了在原生和对比后序列中$T_1$估计的优越性能，并展示了我们基于物理的公式相对于直接数据驱动的$T_1$先验的强大优势。

更新时间: 2025-07-01 09:47:22

领域: eess.IV,cs.AI

下载: http://arxiv.org/abs/2507.00613v1

Residual Reward Models for Preference-based Reinforcement Learning

Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence since it requires training a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences. However, when the model is a neural network, using different loss functions for pre-training and fine-tuning can pose challenges to reliable optimization. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example, a user's "best guess" reward function, or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. Our method achieves performance improvements for a variety of different types of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot. It significantly accelerates policy learning for different tasks, achieving success in fewer steps than the baseline. The videos are presented at https://sunlighted.github.io/RRM-web/.

Updated: 2025-07-01 09:43:57

标题: 基于偏好的强化学习中的剩余奖励模型

摘要: 偏好驱动的强化学习（PbRL）提供了一种在奖励信号难以确定的环境中学习高性能策略的方法，避免了启发式和耗时的奖励设计。然而，PbRL 可能在收敛速度上受到影响，因为它需要在奖励模型中进行训练。先前的研究提出了从演示中学习奖励模型并利用偏好进行微调的方法。然而，当模型是神经网络时，使用不同的损失函数进行预训练和微调可能会对可靠的优化提出挑战。在本文中，我们提出了一种有效利用先验知识的方法，即残差奖励模型（RRM）。RRM 假设环境的真实奖励可以分解为两部分的和：先验奖励和学习奖励。先验奖励是一项在训练之前可获得的项，例如用户的“最佳猜测”奖励函数，或者从逆向强化学习（IRL）中学习到的奖励函数，而学习奖励则是通过偏好进行训练的。我们介绍了基于状态和基于图像的RRM版本，并在 Meta-World 环境套件中的几个任务上进行了评估。实验结果显示，我们的方法显著改善了常见的PbRL方法的性能。我们的方法为各种不同类型的先验奖励（包括代理奖励、从IRL获得的奖励，甚至是代理奖励的否定版本）实现了性能改进。我们还通过对 Franka Panda 进行实验，展示了我们的方法在真实机器人上的卓越表现。它显著加快了不同任务的策略学习速度，比基准线更少的步骤就能取得成功。视频可在 https://sunlighted.github.io/RRM-web/ 查看。

更新时间: 2025-07-01 09:43:57

领域: cs.LG,cs.AI,cs.RO

下载: http://arxiv.org/abs/2507.00611v1
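
The decomposition described in the Residual Reward Model abstract above, a true reward split into a prior term plus a learned term, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scalar states, the toy `prior_reward`, the linear `learned_reward` with a fixed weight, and the use of a Bradley-Terry cross-entropy loss (a common choice in preference-based RL, not confirmed by the abstract) are all hypothetical stand-ins rather than the authors' implementation.

```python
import math


def prior_reward(state):
    """Hypothetical 'best guess' prior: prefer states near a target of 1.0."""
    return -abs(state - 1.0)


def learned_reward(state, weight=0.5):
    """Stand-in for the trainable residual term (a neural network in practice)."""
    return weight * state


def residual_reward(state):
    """Total reward = fixed prior reward + learned residual."""
    return prior_reward(state) + learned_reward(state)


def preference_loss(segment_a, segment_b, label):
    """Bradley-Terry cross-entropy; label=1.0 means segment_a was preferred."""
    return_a = sum(residual_reward(s) for s in segment_a)
    return_b = sum(residual_reward(s) for s in segment_b)
    p_a = 1.0 / (1.0 + math.exp(return_b - return_a))
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_a + eps) + (1.0 - label) * math.log(1.0 - p_a + eps))


# Segment A stays near the target, segment B stays far from it.
loss_if_a_preferred = preference_loss([0.9, 1.0], [0.0, 0.1], label=1.0)
loss_if_b_preferred = preference_loss([0.9, 1.0], [0.0, 0.1], label=0.0)
```

In a trainable version only the parameters of `learned_reward` would receive gradients from the preference loss, while `prior_reward` stays fixed, so the learned part only has to model the residual.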

Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers

Multimodal learning faces a fundamental tension between deep, fine-grained fusion and computational scalability. While cross-attention models achieve strong performance through exhaustive pairwise fusion, their quadratic complexity is prohibitive for settings with many modalities. We address this challenge with Gated Recurrent Fusion (GRF), a novel architecture that captures the power of cross-modal attention within a linearly scalable, recurrent pipeline. Our method processes modalities sequentially, updating an evolving multimodal context vector at each step. The core of our approach is a fusion block built on Transformer Decoder layers that performs symmetric cross-attention, mutually enriching the shared context and the incoming modality. This enriched information is then integrated via a Gated Fusion Unit (GFU), a GRU-inspired mechanism that dynamically arbitrates information flow, enabling the model to selectively retain or discard features. This stateful, recurrent design scales linearly with the number of modalities, O(n), making it ideal for high-modality environments. Experiments on the CMU-MOSI benchmark demonstrate that GRF achieves competitive performance compared to more complex baselines. Visualizations of the embedding space further illustrate that GRF creates structured, class-separable representations through its progressive fusion mechanism. Our work presents a robust and efficient paradigm for powerful, scalable multimodal representation learning.

Updated: 2025-07-01 09:33:38

标题: 门控递归融合：可扩展多模态Transformer的有状态方法

摘要: 多模式学习面临着深度、细粒度融合与计算可扩展性之间的基本张力。虽然交叉注意力模型通过详尽的成对融合实现了强大的性能，但其二次复杂度对于具有许多模态的设置来说是禁锢的。我们通过一种名为门控循环融合（GRF）的新颖架构来解决这一挑战，该架构在一个线性可扩展的循环管道中捕获了跨模态注意力的力量。我们的方法按顺序处理模态，每一步更新一个不断演化的多模态上下文向量。我们方法的核心是一个基于Transformer解码器层构建的融合块，执行对称的交叉注意力，相互丰富共享上下文和传入模态。然后，通过门控融合单元（GFU）将这些丰富的信息集成起来，这是一种受GRU启发的机制，动态地仲裁信息流，使模型能够选择性地保留或丢弃特征。这种有状态的、循环的设计与模态数量呈线性关系，O(n)，使其非常适合高模态环境。在CMU-MOSI基准测试上的实验表明，与更复杂的基线相比，GRF实现了具有竞争力的性能。通过嵌入空间的可视化进一步说明，GRF通过其渐进式融合机制创造了结构化的、类可分离的表示。我们的工作提出了一种强大、可扩展的多模式表示学习范式。

更新时间: 2025-07-01 09:33:38

领域: cs.CV,cs.AI,cs.CL,I.4; I.2

下载: http://arxiv.org/abs/2507.02985v1

Ovis-U1 Technical Report

In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.

Updated: 2025-07-01 09:33:23

标题: Ovis-U1技术报告

摘要: 在这份报告中，我们介绍了Ovis-U1，一个30亿参数的统一模型，集成了多模态理解、文本到图像生成和图像编辑功能。在Ovis系列的基础上构建，Ovis-U1结合了基于扩散的视觉解码器和双向令牌精炼器，使得图像生成任务可以与GPT-4o等领先模型相媲美。与一些先前使用冻结MLLM进行生成任务的模型不同，Ovis-U1利用了一种新的统一训练方法，从语言模型开始。与仅在理解或生成任务上进行训练相比，统一训练表现更好，展示了通过整合这两个任务实现的增强效果。Ovis-U1在OpenCompass多模态学术基准测试中获得了69.6分，超过了最近的最先进模型，如Ristretto-3B和SAIL-VL-1.5-2B。在文本到图像生成方面，它在DPG-Bench和GenEval基准测试中分别获得了83.72和0.89分。在图像编辑方面，它在ImgEdit-Bench和GEdit-Bench-EN上分别获得了4.00和6.42分。作为Ovis统一模型系列的初始版本，Ovis-U1拓展了多模态理解、生成和编辑的界限。

更新时间: 2025-07-01 09:33:23

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2506.23044v2

High-resolution spatial memory requires grid-cell-like neural codes

Continuous attractor networks (CANs) are widely used to model how the brain temporarily retains continuous behavioural variables via persistent recurrent activity, such as an animal's position in an environment. However, this memory mechanism is very sensitive to even small imperfections, such as noise or heterogeneity, which are both common in biological systems. Previous work has shown that discretising the continuum into a finite set of discrete attractor states provides robustness to these imperfections, but necessarily reduces the resolution of the represented variable, creating a dilemma between stability and resolution. We show that this stability-resolution dilemma is most severe for CANs using unimodal bump-like codes, as in traditional models. To overcome this, we investigate sparse binary distributed codes based on random feature embeddings, in which neurons have spatially-periodic receptive fields. We demonstrate theoretically and with simulations that such grid-cell-like codes enable CANs to achieve both high stability and high resolution simultaneously. The model extends to embedding arbitrary nonlinear manifolds into a CAN, such as spheres or tori, and generalises linear path integration to integration along freely-programmable on-manifold vector fields. Together, this work provides a theory of how the brain could robustly represent continuous variables with high resolution and perform flexible computations over task-relevant manifolds.

Updated: 2025-07-01 09:29:05

标题: 高分辨率空间记忆需要类似于网格细胞的神经编码

摘要: 连续吸引子网络（CANs）被广泛用来模拟大脑如何通过持续的循环活动临时保留连续的行为变量，比如动物在环境中的位置。然而，这种记忆机制对于甚至是小的不完美非常敏感，比如噪音或异质性，在生物系统中都很常见。先前的研究表明，将连续体分为有限的离散吸引子状态可以使系统对这些不完美具有鲁棒性，但同时会降低所表示变量的分辨率，从而在稳定性和分辨率之间产生困境。我们发现，这种稳定性-分辨率困境对于使用单峰状领域代码的CANs（如传统模型中）最为严重。为了克服这一困境，我们研究了基于随机特征嵌入的稀疏二进制分布式代码，其中神经元具有空间周期性的感受领域。我们理论上和通过模拟证明，这种网格细胞样式的代码使得CANs能够同时实现高稳定性和高分辨率。该模型扩展到将任意非线性流形嵌入到CAN中，比如球体或环面，并将线性路径积分推广到沿着自由可编程的在流形上的矢量场进行积分。综合来看，这项工作提供了大脑如何稳健地表示连续变量并高分辨率地执行任务相关流形上的灵活计算的理论。

更新时间: 2025-07-01 09:29:05

领域: cs.NE,cs.AI,cs.SC

下载: http://arxiv.org/abs/2507.00598v1

Gaze3P: Gaze-Based Prediction of User-Perceived Privacy

Privacy is a highly subjective concept and is perceived differently by different individuals. Previous research on quantifying user-perceived privacy has primarily relied on questionnaires. Furthermore, applying user-perceived privacy to optimise the parameters of privacy-preserving techniques (PPT) remains insufficiently explored. To address these limitations, we introduce Gaze3P -- the first dataset specifically designed to facilitate systematic investigations into user-perceived privacy. Our dataset comprises gaze data from 100 participants and 1,000 stimuli, encompassing a range of private and safe attributes. With Gaze3P, we train a machine learning model to implicitly and dynamically predict perceived privacy from human eye gaze. Through comprehensive experiments, we show that the resulting models achieve high accuracy. Finally, we illustrate how predicted privacy can be used to optimise the parameters of differentially private mechanisms, thereby enhancing their alignment with user expectations.

Updated: 2025-07-01 09:26:38

标题: Gaze3P: 基于凝视的用户感知隐私预测

摘要: 隐私是一个高度主观的概念，不同个体对其的感知有所不同。先前关于量化用户感知隐私的研究主要依赖于问卷调查。此外，将用户感知隐私应用于优化隐私保护技术（PPT）的参数仍未得到充分探讨。为了解决这些限制，我们引入了Gaze3P - 第一个专门设计用于促进对用户感知隐私进行系统研究的数据集。我们的数据集包括来自100名参与者和1,000个刺激的凝视数据，涵盖了一系列私密和安全属性。通过Gaze3P，我们训练了一个机器学习模型，从人类眼睛的注视中隐含动态地预测感知隐私。通过全面的实验，我们展示了所得模型的高准确性。最后，我们说明了如何利用预测的隐私来优化差分私密机制的参数，从而增强其与用户期望的一致性。

更新时间: 2025-07-01 09:26:38

领域: cs.HC,cs.CR

下载: http://arxiv.org/abs/2507.00596v1

The Secrets Must Not Flow: Scaling Security Verification to Large Codebases (extended version)

Existing program verifiers can prove advanced properties about security protocol implementations, but are difficult to scale to large codebases because of the manual effort required. We develop a novel methodology called *Diodon* that addresses this challenge by splitting the codebase into the protocol implementation (the *Core*) and the remainder (the *Application*). This split allows us to apply powerful semi-automated verification techniques to the security-critical Core, while fully-automatic static analyses scale the verification to the entire codebase by ensuring that the Application cannot invalidate the security properties proved for the Core. The static analyses achieve that by proving *I/O independence*, i.e., that the I/O operations within the Application are independent of the Core's security-relevant data (such as keys), and that the Application meets the Core's requirements. We have proved Diodon sound by first showing that we can safely allow the Application to perform I/O independent of the security protocol, and second that manual verification and static analyses soundly compose. We evaluate Diodon on two case studies: an implementation of the signed Diffie-Hellman key exchange and a large (100k+ LoC) production Go codebase implementing a key exchange protocol for which we obtained secrecy and injective agreement guarantees by verifying a Core of about 1% of the code with the auto-active program verifier Gobra in less than three person months.

Updated: 2025-07-01 09:25:54

标题: 秘密不得泄漏：将安全验证扩展到大型代码库（扩展版）

摘要: 现有的程序验证器可以证明关于安全协议实现的高级属性，但由于需要手动努力，很难扩展到大型代码库。我们开发了一种名为*Diodon*的新方法，通过将代码库分为协议实现（*核心*）和其余部分（*应用程序*）来解决这一挑战。这种分割使我们能够将强大的半自动验证技术应用于安全关键的核心部分，同时全自动静态分析将验证扩展到整个代码库，确保应用程序不能使核心已经证明的安全属性失效。静态分析通过证明*I/O独立性*来实现这一目标，即应用程序中的I/O操作与核心的安全相关数据（如密钥）无关，并且应用程序满足核心的要求。我们首先证明了Diodon的可靠性，即我们可以安全地允许应用程序执行与安全协议无关的I/O，并且手动验证和静态分析可靠地组成。我们在两个案例研究中评估了Diodon：一个实现了签名Diffie-Hellman密钥交换的实现和一个大型（100k+行代码）生产Go代码库，实现了一个密钥交换协议，通过在不到三个人月的时间内使用自动活跃程序验证器Gobra验证了约1%的代码核心，获得了保密性和单射协议的保证。

更新时间: 2025-07-01 09:25:54

领域: cs.CR,cs.PL,cs.SE

下载: http://arxiv.org/abs/2507.00595v1

Dehazing Light Microscopy Images with Guided Conditional Flow Matching: finding a sweet spot between fidelity and realism

Fluorescence microscopy is a major driver of scientific progress in the life sciences. Although high-end confocal microscopes are capable of filtering out-of-focus light, cheaper and more accessible microscopy modalities, such as widefield microscopy, cannot, which consequently leads to hazy image data. Computational dehazing tries to combine the best of both worlds, pairing cheap microscopy with crisp-looking images. The perception-distortion trade-off tells us that we can optimize either for data fidelity, e.g. low MSE or high PSNR, or for data realism, measured by perceptual metrics such as LPIPS or FID. Existing methods either prioritize fidelity at the expense of realism, or produce perceptually convincing results that lack quantitative accuracy. In this work, we propose HazeMatching, a novel iterative method for dehazing light microscopy images, which effectively balances these objectives. Our goal was to find a balanced trade-off between the fidelity of the dehazing results and the realism of individual predictions (samples). We achieve this by adapting the conditional flow matching framework so that a hazy observation guides the generative process through the conditional velocity field. We evaluate HazeMatching on 5 datasets, covering both synthetic and real data, assessing both distortion and perceptual quality. Our method is compared against 7 baselines, achieving a consistent balance between fidelity and realism on average. Additionally, with calibration analysis, we show that HazeMatching produces well-calibrated predictions. Note that our method does not need an explicit degradation operator to exist, making it easily applicable to real microscopy data. All data used for training and evaluation and our code will be publicly available under a permissive license.

Updated: 2025-07-01 09:23:16

标题: 使用引导条件流匹配去除光学显微图像中的雾霾：在保真度和逼真度之间找到最佳平衡点

摘要: 荧光显微镜是生命科学领域科学进展的主要推动力。虽然高端共焦显微镜能够过滤失焦光，但更便宜和更易获取的显微镜模式，如广场显微镜，不能做到这一点，因此导致图像数据模糊。计算去雾试图结合两者的优势，实现廉价显微镜但产生清晰的图像。感知失真权衡告诉我们，我们可以优化数据保真度，例如低MSE或高PSNR，或者优化数据逼真度，通过感知指标如LPIPS或FID来衡量。现有方法要么优先考虑保真度而牺牲逼真度，要么产生在感知上令人信服但缺乏定量准确性的结果。在这项工作中，我们提出了HazeMatching，一种新颖的迭代方法，用于去雾光学显微图像，有效地平衡这些目标。我们的目标是在去雾结果的保真度和单个预测（样本）的逼真度之间找到一种平衡的权衡。我们通过使用有雾观测来引导条件速度场中的生成过程，调整条件流匹配框架，实现了这一目标。我们在涵盖合成和真实数据的5个数据集上评估了HazeMatching，评估了失真和感知质量。我们将我们的方法与7个基线进行了比较，平均实现了保真度和逼真度之间的一致平衡。此外，通过校准分析，我们展示了HazeMatching产生的预测是良好校准的。需要注意的是，我们的方法不需要明确的降级运算符存在，因此在真实显微镜数据上易于应用。所有用于训练和评估的数据以及我们的代码将在一个宽松许可下公开。

更新时间: 2025-07-01 09:23:16

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2506.22397v3

Binned semiparametric Bayesian networks

This paper introduces a new type of probabilistic semiparametric model that takes advantage of data binning to reduce the computational cost of kernel density estimation in nonparametric distributions. Two new conditional probability distributions are developed for the new binned semiparametric Bayesian networks, the sparse binned kernel density estimation and the Fourier kernel density estimation. These two probability distributions address the curse of dimensionality, which typically impacts binned models, by using sparse tensors and restricting the number of parent nodes in conditional probability calculations. To evaluate the proposal, we perform a complexity analysis and conduct several comparative experiments using synthetic data and datasets from the UCI Machine Learning repository. The experiments include different binning rules, parent restrictions, grid sizes, and number of instances to get a holistic view of the model's behavior. As a result, our binned semiparametric Bayesian networks achieve structural learning and log-likelihood estimations with no statistically significant differences compared to the semiparametric Bayesian networks, but at a much higher speed. Thus, the new binned semiparametric Bayesian networks prove to be a reliable and more efficient alternative to their non-binned counterparts.

Updated: 2025-07-01 09:17:43

Categories: cs.LG, cs.AI, I.2.6; I.5.1; G.3

Download: http://arxiv.org/abs/2506.21997v2
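
The core idea of binning before kernel density estimation can be illustrated with a minimal one-dimensional sketch. This is not the paper's sparse binned or Fourier estimator: the Gaussian kernel, the function names, and the use of a dictionary of non-empty bins as a stand-in for a sparse tensor are all assumptions made for illustration.

```python
import math
from collections import Counter


def exact_kde(data, query, bandwidth):
    """Gaussian KDE evaluated directly on every data point."""
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((query - x) / bandwidth) ** 2) for x in data) / norm


def binned_kde(data, query, bandwidth, bin_width):
    """Gaussian KDE evaluated on bin centers weighted by bin counts.

    Only non-empty bins are stored, so the cost per query depends on the
    number of occupied bins rather than on the number of data points.
    """
    counts = Counter(math.floor(x / bin_width) for x in data)
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    total = 0.0
    for bin_index, count in counts.items():
        center = (bin_index + 0.5) * bin_width
        z = (query - center) / bandwidth
        total += count * math.exp(-0.5 * z * z)
    return total / norm


if __name__ == "__main__":
    sample = [0.1, 0.2, 0.9, 1.1, 1.5]
    print(exact_kde(sample, 1.0, 0.5), binned_kde(sample, 1.0, 0.5, 0.01))
```

With a fine grid the two estimates stay close; coarser bins trade accuracy for fewer kernel evaluations, which is the trade-off the grid-size experiments in the abstract refer to.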

Quantum Circuit Structure Optimization for Quantum Reinforcement Learning

Reinforcement learning (RL) enables agents to learn optimal policies through environmental interaction. However, RL suffers from reduced learning efficiency due to the curse of dimensionality in high-dimensional spaces. Quantum reinforcement learning (QRL) addresses this issue by leveraging superposition and entanglement in quantum computing, allowing efficient handling of high-dimensional problems with fewer resources. QRL combines quantum neural networks (QNNs) with RL, where the parameterized quantum circuit (PQC) acts as the core computational module. The PQC performs linear and nonlinear transformations through gate operations, similar to hidden layers in classical neural networks. Previous QRL studies, however, have used fixed PQC structures based on empirical intuition without verifying their optimality. This paper proposes a QRL-NAS algorithm that integrates quantum neural architecture search (QNAS) to optimize PQC structures within QRL. Experiments demonstrate that QRL-NAS achieves higher rewards than QRL with fixed circuits, validating its effectiveness and practical utility.

Updated: 2025-07-01 09:16:58

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2507.00589v1

AI-Generated Video Detection via Perceptual Straightening

The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and with capturing subtle temporal inconsistencies. We propose ReStraV (Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests that real-world video trajectories become straighter in the neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.

Updated: 2025-07-01 09:04:21

Categories: cs.CV, cs.AI, cs.LG

Download: http://arxiv.org/abs/2507.00583v1
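
The two geometric quantities named in the abstract, stepwise distance and temporal curvature, can be sketched as follows. The sketch assumes that curvature is measured as the angle between consecutive displacement vectors of the per-frame embeddings, which is a common reading of "straightening" but is not confirmed by the abstract; the function name and the toy two-dimensional embeddings are hypothetical, and real embeddings would come from a model such as DINOv2.

```python
import math


def trajectory_stats(embeddings):
    """Return (stepwise distances, turning angles in degrees).

    `embeddings` is a list of equal-length float lists, one per video frame.
    """
    # Displacement vectors between consecutive frame embeddings.
    steps = [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(embeddings, embeddings[1:])
    ]
    dists = [math.sqrt(sum(x * x for x in s)) for s in steps]

    angles = []
    for i in range(len(steps) - 1):
        d0, d1 = dists[i], dists[i + 1]
        if d0 == 0.0 or d1 == 0.0:
            continue  # The angle is undefined when a frame does not move.
        cos = sum(x * y for x, y in zip(steps[i], steps[i + 1])) / (d0 * d1)
        cos = max(-1.0, min(1.0, cos))  # Guard against rounding error.
        angles.append(math.degrees(math.acos(cos)))
    return dists, angles


if __name__ == "__main__":
    straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    print(trajectory_stats(straight))
    print(trajectory_stats(bent))
```

Per-video summary statistics of these two lists (for example mean and variance) would then serve as features for the lightweight classifier the abstract describes.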

TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification

Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to the SemEval-2025 Task-3 - Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.

Updated: 2025-07-01 09:00:50

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2507.00579v1

Hecto: Modular Sparse Experts for Adaptive and Interpretable Reasoning

Mixture-of-Experts (MoE) models enable conditional computation by routing inputs to specialized experts, but these experts rely on identical inductive biases, thus limiting representational diversity. This static computation pathway is inefficient for inputs that require different types of reasoning and limits specialization and interpretability. We propose Hecto, a lightweight MoE architecture that leverages architectural heterogeneity by combining a GRU expert for temporal reasoning and an FFNN expert for static abstraction under a sparse Top-1 gating mechanism. Evaluated on three reasoning benchmarks (AG News, SST-2, HotpotQA) and a regression task (STS-B), Hecto matches or closely trails homogeneous baselines in performance despite receiving isolated input representations, while achieving clear expert specialization, with each expert aligning to distinct reasoning types (temporal vs static). At larger batch sizes, Hecto exhibits improved performance, benefiting from relaxed computational constraints that allow its heterogeneous architecture to optimize more effectively. Ablation results isolate architectural diversity as the source of Hecto's stability and interpretability across diverse reasoning tasks. Overall, Hecto establishes itself as a new benchmark for conditional computation, offering a principled framework for specialized reasoning in low-resource regimes with its model strength derived from principled specialization.

Updated: 2025-07-01 09:00:34

Categories: cs.AI

Download: http://arxiv.org/abs/2506.22919v2

BadViM: Backdoor Attack against Vision Mamba

Vision State Space Models (SSMs), particularly architectures like Vision Mamba (ViM), have emerged as promising alternatives to Vision Transformers (ViTs). However, the security implications of these novel architectures, especially their vulnerability to backdoor attacks, remain critically underexplored. Backdoor attacks aim to embed hidden triggers into victim models, causing the model to misclassify inputs containing these triggers while maintaining normal behavior on clean inputs. This paper investigates the susceptibility of ViM to backdoor attacks by introducing BadViM, a novel backdoor attack framework specifically designed for Vision Mamba. The proposed BadViM leverages a Resonant Frequency Trigger (RFT) that exploits the frequency sensitivity patterns of the victim model to create stealthy, distributed triggers. To maximize attack efficacy, we propose a Hidden State Alignment loss that strategically manipulates the internal representations of the model by aligning the hidden states of backdoor images with those of target classes. Extensive experimental results demonstrate that BadViM achieves superior attack success rates while maintaining clean data accuracy. Meanwhile, BadViM exhibits remarkable resilience against common defensive measures, including PatchDrop, PatchShuffle and JPEG compression, which typically neutralize normal backdoor attacks.

Updated: 2025-07-01 08:59:24

Categories: cs.CR, cs.AI, cs.CV

Download: http://arxiv.org/abs/2507.00577v1

ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

This paper introduces ZonUI-3B, a lightweight Vision-Language Model (VLM) specifically designed for Graphical User Interface grounding tasks, achieving performance competitive with significantly larger models. Unlike large-scale VLMs (>7B parameters) that are computationally intensive and impractical for consumer-grade hardware, ZonUI-3B delivers strong grounding accuracy while being fully trainable on a single GPU (RTX 4090). The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources including mobile, desktop, and web GUI screenshots to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights ZonUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. ZonUI-3B is available at: https://github.com/Han1018/ZonUI-3B

Updated: 2025-07-01 08:46:32

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2506.23491v2

DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models

Large language models (LLMs) have recently showcased remarkable capabilities, spanning a wide range of tasks and applications, including those in the medical domain. Models like GPT-4 excel in medical question answering but may face challenges in the lack of interpretability when handling complex tasks in real clinical settings. We thus introduce the diagnostic reasoning dataset for clinical notes (DiReCT), aiming at evaluating the reasoning ability and interpretability of LLMs compared to human doctors. It contains 511 clinical notes, each meticulously annotated by physicians, detailing the diagnostic reasoning process from observations in a clinical note to the final diagnosis. Additionally, a diagnostic knowledge graph is provided to offer essential knowledge for reasoning, which may not be covered in the training data of existing LLMs. Evaluations of leading LLMs on DiReCT bring out a significant gap between their reasoning ability and that of human doctors, highlighting the critical need for models that can reason effectively in real-world clinical scenarios.

Updated: 2025-07-01 08:40:39

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2408.01933v6

Advancing Local Search in SMT-NRA with MCSAT Integration

In this paper, we advance local search for Satisfiability Modulo the Theory of Nonlinear Real Arithmetic (SMT-NRA for short). First, we introduce a two-dimensional cell-jump move, called 2d-cell-jump, generalizing the key operation, cell-jump, of the local search method for SMT-NRA. Then, we propose an extended local search framework, named 2d-LS (following the local search framework, LS, for SMT-NRA), integrating the model constructing satisfiability calculus (MCSAT) framework to improve search efficiency. To further improve the efficiency of MCSAT, we implement a recently proposed technique called the sample-cell projection operator for MCSAT, which is well suited for CDCL-style search in the real domain and helps guide the search away from conflicting states. Finally, we design a hybrid framework for SMT-NRA combining MCSAT, 2d-LS and OpenCAD, to improve search efficiency through information exchange. The experimental results demonstrate improvements in local search performance, highlighting the effectiveness of the proposed methods.

Updated: 2025-07-01 08:27:29

Categories: cs.AI, cs.LO, cs.SC

Download: http://arxiv.org/abs/2507.00557v1

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.

Updated: 2025-07-01 08:19:08

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2506.10967v2
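
To make the DPP formulation concrete, here is a small sketch of greedy subset selection under a relevance-weighted kernel. It assumes the conditional kernel takes the form L[i][j] = r[i] * S[i][j] * r[j], where S is a token similarity matrix and r an instruction-relevance score; that form, the naive greedy search, and all names below are illustrative assumptions rather than the CDPruner implementation.

```python
def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def greedy_dpp(similarity, relevance, k):
    """Greedily pick up to k indices maximizing the kernel sub-determinant."""
    n = len(relevance)
    # Assumed kernel: similarity modulated by relevance on both sides.
    kernel = [
        [relevance[i] * similarity[i][j] * relevance[j] for j in range(n)]
        for i in range(n)
    ]
    chosen = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break  # No remaining token adds positive volume.
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Tokens 0 and 1 are near-duplicates; token 2 is distinct but less relevant.
    S = [[1.0, 0.99, 0.0], [0.99, 1.0, 0.0], [0.0, 0.0, 1.0]]
    r = [1.0, 0.9, 0.5]
    print(greedy_dpp(S, r, 2))
```

In the toy input, the second pick skips the highly relevant near-duplicate in favor of the distinct token, which is the behavior that separates diversity-based selection from pure attention- or relevance-based ranking. Recomputing determinants from scratch is cubic per candidate; a practical version would use an incremental update.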

The Curse of Depth in Large Language Models

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at https://github.com/lmsdss/LayerNorm-Scaling.

Updated: 2025-07-01 08:18:41

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2502.05795v2
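
As a rough illustration of LayerNorm Scaling, the sketch below multiplies the output of a plain layer normalization by 1/sqrt(l), where l is the 1-based index of the layer. Reading the abstract's "inversely by the square root of its depth" as this output multiplier is an assumption; the function names are hypothetical, and the learnable gain and bias of a real LayerNorm are omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scaled_layer_norm(x, layer_index, eps=1e-5):
    """Layer norm whose output is damped by 1/sqrt(layer_index).

    `layer_index` is assumed to start at 1 for the first Transformer block.
    """
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]
    for depth in (1, 4, 16):
        out = scaled_layer_norm(hidden, depth)
        variance = sum(v * v for v in out) / len(out)
        print(depth, round(variance, 4))
```

Under this reading, the output variance at layer l shrinks by a factor of 1/l, so deeper blocks feed smaller-variance activations into the residual stream.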

ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data

With the increasing interest in robotic synthesis in the context of organic chemistry, the automated extraction of chemical procedures from literature is critical. However, this task remains challenging due to the inherent ambiguity of chemical language and the high cost of human annotation required for developing reliable computer-aided extraction protocols. Here, we present ChemActor, a fully fine-tuned large language model (LLM), as a chemical executor to convert between unstructured experimental procedures and structured action sequences. We propose a sequential LLM-generated data framework to address the challenges of insufficient and low-quality annotated data. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Additionally, we introduce a novel multi-round LLMs circle review metric, which reflects the model's advanced understanding of chemical experimental procedures. Extensive experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor, augmented by LLM-generated data, achieves state-of-the-art performance, outperforming the baseline model by 10%. The code is available at: https://github.com/Zhanghahah/ChemActor.

Updated: 2025-07-01 08:11:18

Categories: cs.AI

Download: http://arxiv.org/abs/2506.23520v2

Quantifying Azure RBAC Wildcard Overreach

Azure RBAC leverages wildcard permissions to simplify policy authoring, but this abstraction often obscures the actual set of allowed operations and undermines least-privilege guarantees. We introduce Belshazaar, a two-stage framework that targets both the effective permission set problem and the evaluation of how far wildcard permissions spread. First, we formalize Azure action syntax via a context-free grammar and implement a compiler that expands any wildcard into its explicit action set. Second, we define an ultrametric diameter metric to quantify semantic overreach in wildcard scenarios. Applied to Microsoft's official catalog of 15481 actions, Belshazaar reveals that about 50 percent of actions admit a cross-Resource-Provider reach when associated with non-obvious wildcards, and that effective permission sets are effectively computable. These findings demonstrate that wildcard patterns can introduce substantial privilege bloat, and that our approach offers a scalable, semantics-driven path toward tighter, least-privilege RBAC policies in Azure environments.

Updated: 2025-07-01 08:10:30

Categories: cs.CR

Download: http://arxiv.org/abs/2506.10755v3
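
The first stage, expanding a wildcard into its explicit action set, can be approximated with simple glob matching. This is a stand-in for the grammar-based compiler described in the abstract, not a reproduction of it: the five-entry catalog is a hypothetical excerpt, and treating `*` as matching any run of characters (including `/`) with case-insensitive comparison is an assumption about how the patterns behave.

```python
from fnmatch import fnmatchcase

# Hypothetical excerpt; the catalog studied in the paper has 15481 actions.
CATALOG = [
    "Microsoft.Compute/virtualMachines/read",
    "Microsoft.Compute/virtualMachines/write",
    "Microsoft.Compute/disks/read",
    "Microsoft.Storage/storageAccounts/read",
    "Microsoft.Storage/storageAccounts/delete",
]


def expand(pattern, catalog=CATALOG):
    """Return every catalog action matched by a wildcard pattern."""
    lowered = pattern.lower()
    return [action for action in catalog if fnmatchcase(action.lower(), lowered)]


def providers(actions):
    """Return the set of resource providers touched by a list of actions."""
    return {action.split("/", 1)[0] for action in actions}


if __name__ == "__main__":
    for pattern in ("Microsoft.Compute/*/read", "*/read"):
        matched = expand(pattern)
        print(pattern, len(matched), sorted(providers(matched)))
```

A pattern such as `*/read` looks narrow but reaches across two resource providers in this toy catalog, which is the kind of cross-provider spread the second stage of the framework sets out to quantify.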

Inverse Design in Nanophotonics via Representation Learning

Inverse design in nanophotonics, the computational discovery of structures achieving targeted electromagnetic (EM) responses, has become a key tool for recent optical advances. Traditional intuition-driven or iterative optimization methods struggle with the inherently high-dimensional, non-convex design spaces and the substantial computational demands of EM simulations. Recently, machine learning (ML) has emerged to address these bottlenecks effectively. This review frames ML-enhanced inverse design methodologies through the lens of representation learning, classifying them into two categories: output-side and input-side approaches. Output-side methods use ML to learn a representation in the solution space to create a differentiable solver that accelerates optimization. Conversely, input-side techniques employ ML to learn compact, latent-space representations of feasible device geometries, enabling efficient global exploration through generative models. Each strategy presents unique trade-offs in data requirements, generalization capacity, and novel design discovery potentials. Hybrid frameworks that combine physics-based optimization with data-driven representations help escape poor local optima, improve scalability, and facilitate knowledge transfer. We conclude by highlighting open challenges and opportunities, emphasizing complexity management, geometry-independent representations, integration of fabrication constraints, and advancements in multiphysics co-designs.

Updated: 2025-07-01 08:10:05

Categories: physics.app-ph, cs.AI, cs.LG, physics.optics

Download: http://arxiv.org/abs/2507.00546v1

An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses

Large language models (LLMs) have been prominent for language translation, including low-resource languages. There has been limited study on the assessment of the quality of translations generated by LLMs such as Gemini and GPT, and by Google Translate. This study addresses this limitation by using semantic and sentiment analysis of selected LLMs for Indian languages, including Sanskrit, Telugu and Hindi. We select prominent texts (Bhagavad Gita, Tamas and Maha Prasthanam) that have been well translated by experts and use LLMs to generate their translations into English, and provide a comparison with selected expert (human) translations. Our investigation revealed that while LLMs have made significant progress in translation accuracy, challenges remain in preserving sentiment and semantic integrity, especially in metaphorical and philosophical contexts for texts such as the Bhagavad Gita. The sentiment analysis revealed that GPT models are better at preserving the sentiment polarity for the given texts when compared to human (expert) translation. The results revealed that GPT models are generally better at maintaining the sentiment and semantics when compared to Google Translate. This study could help in the development of accurate and culturally sensitive translation systems for large language models.

Updated: 2025-07-01 08:05:01

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2503.21393v3

Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation

This paper studies the role of attention heads in CLIP's image encoder. While CLIP has exhibited robust performance across diverse applications, we hypothesize that certain attention heads negatively affect final representations and that ablating them can improve performance in downstream tasks. To capitalize on this insight, we propose a simple yet effective method, called Attention Ablation Technique (AAT), to suppress the contribution of specific heads by manipulating attention weights. By integrating two alternative strategies tailored for different application scenarios, AAT systematically identifies and ablates detrimental attention heads to enhance representation quality. Experiments demonstrate that AAT consistently improves downstream task performance across various domains, boosting recall rate by up to 11.1% on CLIP-family models for cross-modal retrieval. The results highlight the potential of AAT to effectively refine large-scale vision-language models with virtually no increase in inference cost.

Updated: 2025-07-01 07:56:46

Categories: cs.CV, cs.AI, cs.LG

Download: http://arxiv.org/abs/2507.00537v1

Rethinking Group Recommender Systems in the Era of Generative AI: From One-Shot Recommendations to Agentic Group Decision Support

More than twenty-five years ago, the first ideas were developed on how to design a system that can provide recommendations to groups of users instead of individual users. Since then, a rich variety of algorithmic proposals have been published, e.g., on how to acquire individual preferences, how to aggregate them, and how to generate recommendations for groups of users. However, despite the rich literature on the topic, barely any examples of real-world group recommender systems can be found. This leads us to question common assumptions in academic research, in particular regarding communication processes in a group and how recommendation-supported decisions are made. In this essay, we argue that these common assumptions and corresponding system designs often may not match the needs or expectations of users. We thus call for a reorientation in this research area, leveraging the capabilities of modern Generative AI assistants like ChatGPT. Specifically, as one promising future direction, we envision group recommender systems to be systems where human group members interact in a chat and an AI-based group recommendation agent assists the decision-making process in an agentic way. Ultimately, this shall lead to a more natural group decision-making environment and finally to wider adoption of group recommendation systems in practice.

Updated: 2025-07-01 07:56:37

Categories: cs.IR, cs.AI

Download: http://arxiv.org/abs/2507.00535v1

The Impact of AI on Educational Assessment: A Framework for Constructive Alignment

The influence of Artificial Intelligence (AI), and specifically Large Language Models (LLMs), on education is continuously increasing. These models are frequently used by students, giving rise to the question of whether current forms of assessment are still a valid way to evaluate student performance and comprehension. The theoretical framework developed in this paper is grounded in Constructive Alignment (CA) theory and Bloom's taxonomy for defining learning objectives. We argue that AI influences learning objectives of different Bloom levels in a different way, and assessment has to be adapted accordingly. Furthermore, in line with Bloom's vision, formative and summative assessment should be aligned on whether the use of AI is permitted or not. Although lecturers tend to agree that education and assessment need to be adapted to the presence of AI, a strong bias exists on the extent to which lecturers want to allow for AI in assessment. This bias is caused by a lecturer's familiarity with AI and specifically whether they use it themselves. To avoid this bias, we propose structured guidelines on a university or faculty level, to foster alignment among the staff. Besides that, we argue that teaching staff should be trained on the capabilities and limitations of AI tools. In this way, they are better able to adapt their assessment methods.

Updated: 2025-07-01 07:51:20

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2506.23815v2

Bridging Ethical Principles and Algorithmic Methods: An Alternative Approach for Assessing Trustworthiness in AI Systems

Artificial Intelligence (AI) technology epitomizes the complex challenges posed by human-made artifacts, particularly those that are widely integrated into society and exert significant influence, highlighting both potential benefits and negative consequences. While other technologies may also pose substantial risks, AI's pervasive reach makes its societal effects especially profound. The complexity of AI systems, coupled with their remarkable capabilities, can lead to a reliance on technologies that operate beyond direct human oversight or understanding. To mitigate the risks that arise, several theoretical tools and guidelines have been developed, alongside efforts to create technological tools aimed at safeguarding Trustworthy AI. The guidelines take a more holistic view of the issue but fail to provide techniques for quantifying trustworthiness. Conversely, while technological tools are better at achieving such quantification, they lack a holistic perspective, focusing instead on specific aspects of Trustworthy AI. This paper aims to introduce an assessment method that combines the ethical components of Trustworthy AI with the algorithmic processes of PageRank and TrustRank. The goal is to establish an assessment framework that minimizes the subjectivity inherent in the self-assessment techniques prevalent in the field by introducing algorithmic criteria. The application of our approach indicates that a holistic assessment of an AI system's trustworthiness can be achieved by providing quantitative insights while considering the theoretical content of relevant guidelines.
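
For readers unfamiliar with the ranking algorithm named in this abstract, the following is a minimal sketch of plain PageRank by power iteration. It is not the paper's assessment method: the abstract does not describe how the graph is built or how TrustRank is combined with it, so the node names and edges below are purely hypothetical, as are the function name `pagerank` and its default parameters.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {node: [outgoing neighbours]}."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node first receives the uniform "teleportation" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's damped rank evenly over its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its damped rank over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: an edge A -> B means "requirement A supports requirement B".
graph = {
    "transparency": ["accountability"],
    "robustness": ["accountability", "safety"],
    "safety": ["accountability"],
    "accountability": [],
}
scores = pagerank(graph)
top_node = max(scores, key=scores.get)
```

The scores form a probability distribution over the nodes, so they can be read as relative weights; which weights are meaningful for trustworthiness assessment is a modelling decision that this sketch does not address.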

Updated: 2025-07-01 07:48:05

Categories: cs.AI,cs.CY

Download: http://arxiv.org/abs/2506.22774v2

Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving

Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries. We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and finetune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes. Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames. To support this, we crowd-source fine-grained object classes and visual attributes that reflect the complexity drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing guarantees dataset robustness and diversity. Our comprehensive evaluation reveals significant limitations in current VLMs when queried with perception questions, highlighting the gap in achieving real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions. Project page and dataset are available at https://djamahl99.github.io/qaymo-pages/.

Updated: 2025-07-01 07:40:16

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.00525v1

Cyber Attacks Detection, Prevention, and Source Localization in Digital Substation Communication using Hybrid Statistical-Deep Learning

The digital transformation of power systems is accelerating the adoption of the IEC 61850 standard. However, its communication protocols, including Sampled Values (SV), lack built-in security features such as authentication and encryption, making them vulnerable to malicious packet injection. Such cyber attacks can delay fault clearance or trigger unintended circuit breaker operations. While most existing research focuses on detecting cyber attacks in digital substations, intrusion prevention systems have been disregarded because of the risk of potential communication network disruptions. This paper proposes a novel method using hybrid statistical-deep learning for the detection, prevention, and source localization of IEC 61850 SV injection attacks. The method uses exponentially modified Gaussian distributions to model communication network latency and long short-term memory and Elman recurrent neural networks to detect anomalous variations in the estimated probability distributions. It effectively discards malicious SV frames with minimal processing overhead and latency, maintains robustness against communication network latency variation and time-synchronization issues, and guarantees a near-zero false positive rate in non-attack scenarios. Comprehensive validation is conducted on three testbeds involving industrial-grade devices, hardware-in-the-loop simulations, virtualized intelligent electronic devices and merging units, and high-fidelity emulated communication networks. Results demonstrate the method's suitability for practical deployment in IEC 61850-compliant digital substations.
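
As background for the latency model mentioned above, here is a small sketch of the exponentially modified Gaussian (EMG) density, i.e. the distribution of a Gaussian term plus an independent exponential term. The abstract gives no parameter values or fitting procedure, so the numbers below (a Gaussian part with mean 100 and standard deviation 5, an exponential rate of 0.1, in arbitrary time units) and the helper `midpoint_integral` are illustrative assumptions only.

```python
import math


def emg_pdf(x, mu, sigma, lam):
    """Density of Normal(mu, sigma^2) + Exponential(rate=lam), terms independent."""
    exponent = 0.5 * lam * (2.0 * mu + lam * sigma ** 2 - 2.0 * x)
    z = (mu + lam * sigma ** 2 - x) / (math.sqrt(2.0) * sigma)
    return 0.5 * lam * math.exp(exponent) * math.erfc(z)


def midpoint_integral(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h


# Illustrative parameters (assumed, not taken from the paper).
MU, SIGMA, LAM = 100.0, 5.0, 0.1

total_mass = midpoint_integral(lambda x: emg_pdf(x, MU, SIGMA, LAM), 50.0, 400.0)
mean_latency = midpoint_integral(lambda x: x * emg_pdf(x, MU, SIGMA, LAM), 50.0, 400.0)
```

The density integrates to one and its mean equals `mu + 1/lam`, which is why the exponential term captures the long right tail that queuing delays typically add to an otherwise symmetric latency distribution.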

Updated: 2025-07-01 07:38:22

Categories: cs.CR,cs.SY,eess.SY

Download: http://arxiv.org/abs/2507.00522v1

SAGE: Steering Dialog Generation with Future-Aware State-Action Augmentation

Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level. https://github.com/apple/ml-sage-dialog-gen

Updated: 2025-07-01 07:35:25

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.03040v2

Customer Service Representative's Perception of the AI Assistant in an Organization's Call Center

The integration of various AI tools creates a complex socio-technical environment where employee-customer interactions form the core of work practices. This study investigates how customer service representatives (CSRs) at the power grid service customer service call center perceive AI assistance in their interactions with customers. Through a field visit and semi-structured interviews with 13 CSRs, we found that AI can alleviate some traditional burdens during the call (e.g., typing and memorizing) but also introduces new burdens (e.g., earning, compliance, psychological burdens). This research contributes to a more nuanced understanding of AI integration in organizational settings and highlights the efforts and burdens undertaken by CSRs to adapt to the updated system.

Updated: 2025-07-01 07:27:34

Categories: cs.HC,cs.AI,cs.CY

Download: http://arxiv.org/abs/2507.00513v1

TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search

As conversational search engines increasingly adopt generation-based paradigms powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the integration of advertisements into generated responses presents both commercial opportunities and challenges for user experience. Unlike traditional search, where advertisements are clearly delineated, generative systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. In this work, we propose a modular pipeline for advertisement management in RAG-based conversational systems, consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection. We leverage synthetic data to train high-performing classifiers, which are then used to guide two complementary ad-integration strategies: supervised fine-tuning of the ad-rewriter and a best-of-N sampling approach that selects the least detectable ad-integrated response among multiple candidates. Our evaluation focuses on two core questions: the effectiveness of ad classifiers in detecting diverse ad integration strategies, and the training methods that best support coherent, minimally intrusive ad insertion. Experimental results show that our ad-classifier, trained on synthetic advertisement data inspired by marketing strategies and enhanced through curriculum learning, achieves robust detection performance. Additionally, we demonstrate that classifier-guided optimization, through both fine-tuning and best-of-N sampling, significantly improves ad stealth, enabling more seamless integration. These findings contribute an adversarial co-evolution framework for developing more sophisticated ad-aware generative search systems and robust ad classifiers.
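
The best-of-N strategy described in this abstract can be sketched as follows: generate several candidate responses, score each with the detector, and keep the one the detector finds least ad-like. The keyword-based `toy_detector` below is a hypothetical stand-in for the trained ad-classifier, and the candidate texts (including the product name) are invented for illustration.

```python
def best_of_n(candidates, detector):
    """Return the candidate to which the detector assigns the lowest ad score."""
    return min(candidates, key=detector)


def toy_detector(text):
    """Hypothetical detector: fraction of words that are promotional keywords."""
    keywords = {"buy", "discount", "sale", "offer"}
    words = text.lower().split()
    hits = sum(word.strip(".,!") in keywords for word in words)
    return hits / max(len(words), 1)


# Invented candidate responses, each mentioning the same advertised product.
candidates = [
    "Buy the TrailLite bottle now and get a discount!",
    "Many hikers pack the TrailLite bottle for long routes.",
    "The TrailLite bottle is on sale today.",
]
chosen = best_of_n(candidates, toy_detector)
```

In the adversarial setting the abstract describes, the same detector scores could also serve as a training signal for the rewriter; this sketch covers only the selection step.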

Updated: 2025-07-01 07:24:29

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.00509v1

Unleashing Diffusion and State Space Models for Medical Image Segmentation

Existing segmentation models trained on a single medical imaging dataset often lack robustness when encountering unseen organs or tumors. Developing a robust model capable of identifying rare or novel tumor categories not present during training is crucial for advancing medical imaging applications. We propose DSM, a novel framework that leverages diffusion and state space models to segment unseen tumor categories beyond the training data. DSM utilizes two sets of object queries trained within modified attention decoders to enhance classification accuracy. Initially, the model learns organ queries using an object-aware feature grouping strategy to capture organ-level visual features. It then refines tumor queries by focusing on diffusion-based visual prompts, enabling precise segmentation of previously unseen tumors. Furthermore, we incorporate diffusion-guided feature fusion to improve semantic segmentation performance. By integrating CLIP text embeddings, DSM captures category-sensitive classes to improve linguistic transfer knowledge, thereby enhancing the model's robustness across diverse scenarios and multi-label tasks. Extensive experiments demonstrate the superior performance of DSM in various tumor segmentation tasks. Code is available at https://github.com/Rows21/k-Means_Mask_Mamba.

Updated: 2025-07-01 07:16:34

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.12747v2

Twill: Scheduling Compound AI Systems on Heterogeneous Mobile Edge Platforms

Compound AI (cAI) systems chain multiple AI models to solve complex problems. cAI systems are typically composed of deep neural networks (DNNs), transformers, and large language models (LLMs), exhibiting a high degree of computational diversity and dynamic workload variation. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer inference tasks, which arrive dynamically in an unknown sequence. Existing mobile edge AI inference strategies manage multi-DNN or transformer-only workloads, relying on design-time profiling, and cannot handle concurrent inference of DNNs and transformers required by cAI systems. In this work, we address the challenge of scheduling cAI systems on heterogeneous mobile edge platforms. We present Twill, a run-time framework to handle concurrent inference requests of cAI workloads through task affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and DVFS, while minimizing inference latency within power budgets. We implement and deploy our Twill framework on the Nvidia Jetson Orin NX platform. We evaluate Twill against state-of-the-art edge AI inference techniques over contemporary DNNs and LLMs, reducing inference latency by 54% on average, while honoring power budgets.

Updated: 2025-07-01 07:06:45

Categories: cs.MA,cs.AI,cs.CV,cs.PF

Download: http://arxiv.org/abs/2507.00491v1

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.

Updated: 2025-07-01 07:00:59

Categories: cs.CL,cs.AI,cs.CE

Download: http://arxiv.org/abs/2503.21248v2

PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning

Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. First, we introduce the relevant concepts and evaluation metrics for backdoor attacks in Safe RL. We then present PNAct, the first attack framework in the Safe RL field that uses both Positive and Negative Action samples to implant backdoors, where positive action samples provide reference actions and negative action samples indicate actions to be avoided. We analyze the properties of PNAct theoretically and design an attack algorithm. Finally, we conduct experiments to evaluate the effectiveness of the proposed backdoor attack framework using the established metrics. This paper highlights the potential risks associated with Safe RL and underscores the feasibility of such attacks. Our code and supplementary material are available at https://github.com/azure-123/PNAct.

Updated: 2025-07-01 06:59:59

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.00485v1

Physics-Aware Style Transfer for Adaptive Holographic Reconstruction

Inline holographic imaging presents an ill-posed inverse problem of reconstructing objects' complex amplitude from recorded diffraction patterns. Although recent deep learning approaches have shown promise over classical phase retrieval algorithms, they often require high-quality ground truth datasets of complex amplitude maps to achieve a statistical inverse mapping operation between the two domains. Here, we present a physics-aware style transfer approach that interprets the object-to-sensor distance as an implicit style within diffraction patterns. Using the style domain as the intermediate domain to construct cyclic image translation, we show that the inverse mapping operation can be learned in an adaptive manner only with datasets composed of intensity measurements. We further demonstrate its biomedical applicability by reconstructing the morphology of dynamically flowing red blood cells, highlighting its potential for real-time, label-free imaging. As a framework that leverages physical cues inherently embedded in measurements, the presented method offers a practical learning strategy for imaging applications where ground truth is difficult or impossible to obtain.

Updated: 2025-07-01 06:56:51

Categories: physics.optics,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.00482v1

Truth, Trust, and Trouble: Medical AI on the Edge

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models -- Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

Updated: 2025-07-01 06:39:39

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.02983v1

ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry

Community Question Answering (CQA) platforms can be regarded as important community knowledge bases, but effectively leveraging historical interactions and domain knowledge in real time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines, achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.

Updated: 2025-07-01 06:31:06

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.21098v2

An Automatic Graph Construction Framework based on Large Language Models for Recommendation

Graph neural networks (GNNs) have emerged as state-of-the-art methods to learn from graph-structured data for recommendation. However, most existing GNN-based recommendation methods focus on the optimization of model structures and learning strategies based on pre-defined graphs, neglecting the importance of the graph construction stage. Earlier works on graph construction usually rely on specific rules or crowdsourcing, which are either too simplistic or too labor-intensive. Recent works have started to utilize large language models (LLMs) to automate graph construction, in view of their abundant open-world knowledge and remarkable reasoning capabilities. Nevertheless, they generally suffer from two limitations: (1) invisibility of the global view (e.g., overlooking contextual information) and (2) construction inefficiency. To this end, we introduce AutoGraph, an automatic graph construction framework based on LLMs for recommendation. Specifically, we first use LLMs to infer the user preference and item knowledge, which is encoded as semantic vectors. Next, we employ vector quantization to extract the latent factors from the semantic vectors. The latent factors are then incorporated as extra nodes to link the user/item nodes, resulting in a graph with in-depth global-view semantics. We further design metapath-based message aggregation to effectively aggregate the semantic and collaborative information. The framework is model-agnostic and compatible with different backbone models. Extensive experiments on three real-world datasets demonstrate the efficacy and efficiency of AutoGraph compared to existing baseline methods. We have deployed AutoGraph on the Huawei advertising platform and gained a 2.69% improvement in RPM and a 7.31% improvement in eCPM in an online A/B test. AutoGraph currently serves as the main traffic model, serving hundreds of millions of people.
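
The vector-quantization step described above can be illustrated with a nearest-codebook lookup: each semantic vector is assigned to its closest codebook entry, and that entry becomes an extra "latent factor" node linked to the user or item. This is only a sketch under simplifying assumptions. The codebook here is fixed and hand-written, whereas a real system would learn it; the vectors, node names and the `quantize` helper are hypothetical.

```python
def quantize(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))


# Hypothetical 2-D codebook with three latent factors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Hypothetical semantic vectors for one user and two items.
semantic_vectors = {
    "user_1": [0.9, 0.8],
    "item_7": [0.1, 0.9],
    "item_9": [0.8, 1.1],
}

# Each user/item node is linked to the latent-factor node it quantizes to.
edges = [
    (name, f"factor_{quantize(vector, codebook)}")
    for name, vector in semantic_vectors.items()
]
```

Here `user_1` and `item_9` share a factor node, which gives the graph a path between them even without a direct interaction edge.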

Updated: 2025-07-01 06:19:39

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2412.18241v2

Novel Complex-Valued Hopfield Neural Networks with Phase and Magnitude Quantization

This research paper introduces two novel complex-valued Hopfield neural networks (CvHNNs) that incorporate phase and magnitude quantization. The first CvHNN employs a ceiling-type activation function that operates on the rectangular coordinate representation of the complex net contribution. The second CvHNN similarly incorporates phase and magnitude quantization but utilizes a ceiling-type activation function based on the polar coordinate representation of the complex net contribution. The proposed CvHNNs, with their phase and magnitude quantization, significantly increase the number of states compared to existing models in the literature, thereby expanding the range of potential applications for CvHNNs.

Updated: 2025-07-01 06:19:06

Categories: cs.NE,cs.AI

Download: http://arxiv.org/abs/2507.00461v1

Process-aware and high-fidelity microstructure generation using stable diffusion

Synthesizing realistic microstructure images conditioned on processing parameters is crucial for understanding process-structure relationships in materials design. However, this task remains challenging due to limited training micrographs and the continuous nature of processing variables. To overcome these challenges, we present a novel process-aware generative modeling approach based on Stable Diffusion 3.5 Large (SD3.5-Large), a state-of-the-art text-to-image diffusion model adapted for microstructure generation. Our method introduces numeric-aware embeddings that encode continuous variables (annealing temperature, time, and magnification) directly into the model's conditioning, enabling controlled image generation under specified process conditions and capturing process-driven microstructural variations. To address data scarcity and computational constraints, we fine-tune only a small fraction of the model's weights via DreamBooth and Low-Rank Adaptation (LoRA), efficiently transferring the pre-trained model to the materials domain. We validate realism using a semantic segmentation model based on a fine-tuned U-Net with a VGG16 encoder on 24 labeled micrographs. It achieves 97.1% accuracy and 85.7% mean IoU, outperforming previous methods. Quantitative analyses using physical descriptors and spatial statistics show strong agreement between synthetic and real microstructures. Specifically, two-point correlation and lineal-path errors remain below 2.1% and 0.6%, respectively. Our method represents the first adaptation of SD3.5-Large for process-aware microstructure generation, offering a scalable approach for data-driven materials design.
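
The abstract reports both pixel accuracy and mean IoU for the segmentation model, and the two can differ substantially. The sketch below computes both on a tiny label grid to show why. The grids are made up for illustration and the helper names are assumptions; this is not the paper's evaluation code.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the target label."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in either grid."""
    ious = []
    for c in range(num_classes):
        intersection = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                intersection += (p == c and t == c)
                union += (p == c or t == c)
        if union:  # skip classes absent from both grids
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Made-up 2x3 label grids with two classes (0 and 1).
predicted = [[0, 0, 1],
             [1, 1, 1]]
reference = [[0, 1, 1],
             [1, 1, 0]]

accuracy = pixel_accuracy(predicted, reference)
miou = mean_iou(predicted, reference, num_classes=2)
```

On these grids the accuracy is 4/6 while the mean IoU is 7/15, because IoU penalizes every mismatched pixel in both the class it was assigned to and the class it belongs to.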

Updated: 2025-07-01 06:16:53

Categories: cond-mat.mtrl-sci,cs.AI

Download: http://arxiv.org/abs/2507.00459v1

ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales

A main challenge of Visual-Language Tracking (VLT) is the misalignment between visual inputs and language descriptions caused by target movement. Previous trackers have explored many effective feature modification methods to preserve more aligned features. However, an important yet unexplored factor ultimately hinders their capability: the inherent differences in the temporal and spatial scales of information between visual and language inputs. To address this issue, we propose ATSTrack, a novel visual-language tracker that enhances the effect of feature modification by Aligning the Temporal and Spatial scales of different input components. Specifically, we decompose each language description into phrases with different attributes based on their temporal and spatial correspondence with visual inputs, and modify their features in a fine-grained manner. Moreover, we introduce a Visual-Language token that comprises modified linguistic information from the previous frame to guide the model to extract visual features that are more relevant to the language description, thereby reducing the impact caused by the differences in spatial scale. Experimental results show that our proposed ATSTrack achieves performance comparable to existing methods. Our code will be released.

Updated: 2025-07-01 06:13:34

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.00454v1

Best Agent Identification for General Game Playing

We present an efficient and generalised procedure to accurately identify the best performing algorithm for each sub-task in a multi-problem domain. Our approach treats this as a set of best arm identification problems for multi-armed bandits, where each bandit corresponds to a specific task and each arm corresponds to a specific algorithm or agent. We propose an optimistic selection process based on the Wilson score interval (Optimistic-WS) that ranks each arm across all bandits in terms of their potential regret reduction. We evaluate the performance of Optimistic-WS on two of the most popular general game domains, the General Video Game AI (GVGAI) framework and the Ludii general game playing system, with the goal of identifying the highest performing agent for each game within a limited number of trials. Compared to previous best arm identification algorithms for multi-armed bandits, our results demonstrate a substantial performance improvement in terms of average simple regret. This novel approach can be used to significantly improve the quality and accuracy of agent evaluation procedures for general game frameworks, as well as other multi-task domains with high algorithm runtimes.
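As a rough illustration of the statistic behind the method, the sketch below computes a Wilson score interval for an agent's win rate. It is not the authors' Optimistic-WS procedure: the abstract ranks arms by potential regret reduction, whereas `most_optimistic_arm` here uses a simpler assumed rule (highest upper bound). The function names and the win/trial counts are hypothetical.

```python
import math


def wilson_interval(wins, trials, z=1.96):
    """Return (lower, upper) Wilson score bounds for a win rate.

    z=1.96 corresponds to roughly 95% confidence; with no trials the
    interval is the uninformative (0.0, 1.0).
    """
    if trials == 0:
        return 0.0, 1.0
    p_hat = wins / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def most_optimistic_arm(records):
    """Assumed selection rule: pick the arm with the highest upper bound."""
    return max(records, key=lambda name: wilson_interval(*records[name])[1])


# Hypothetical (wins, trials) counts for three agents on one game.
records = {"agent_a": (7, 10), "agent_b": (30, 50), "agent_c": (1, 2)}
print({name: wilson_interval(*wt) for name, wt in records.items()})
print(most_optimistic_arm(records))
```

An arm with very few trials has a wide interval, so an optimistic rule tends to keep sampling it until the evidence narrows the bound.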

Updated: 2025-07-01 06:07:56

Categories: cs.LG,cs.AI,cs.DS,cs.IT,math.IT,stat.ML

Download: http://arxiv.org/abs/2507.00451v1

Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design

We address the problem of fine-tuning diffusion models for reward-guided generation in biomolecular design. While diffusion models have proven highly effective in modeling complex, high-dimensional data distributions, real-world applications often demand more than high-fidelity generation, requiring optimization with respect to potentially non-differentiable reward functions such as physics-based simulation or rewards based on scientific knowledge. Although RL methods have been explored to fine-tune diffusion models for such objectives, they often suffer from instability, low sample efficiency, and mode collapse due to their on-policy nature. In this work, we propose an iterative distillation-based fine-tuning framework that enables diffusion models to optimize for arbitrary reward functions. Our method casts the problem as policy distillation: it collects off-policy data during the roll-in phase, simulates reward-based soft-optimal policies during roll-out, and updates the model by minimizing the KL divergence between the simulated soft-optimal policy and the current model policy. Our off-policy formulation, combined with KL divergence minimization, enhances training stability and sample efficiency compared to existing RL-based methods. Empirical results demonstrate the effectiveness and superior reward optimization of our approach across diverse tasks in protein, small molecule, and regulatory DNA design.
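The distillation objective can be sketched on a tiny discrete example. The abstract does not give the exact form of the soft-optimal policy, so the code assumes the common exponential tilting p*(x) ∝ p_ref(x) · exp(r(x)/α). The function names, the temperature `alpha`, and the three-candidate design space are hypothetical, and a real diffusion model would apply this per denoising step rather than over a fixed list.

```python
import math


def soft_optimal_policy(reference_probs, rewards, alpha=1.0):
    """Tilt a reference distribution toward high reward:
    p*(x) is proportional to p_ref(x) * exp(r(x) / alpha)."""
    weights = [p * math.exp(r / alpha) for p, r in zip(reference_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical three-candidate design space with made-up rewards.
model_policy = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

target = soft_optimal_policy(model_policy, rewards, alpha=1.0)
loss = kl_divergence(target, model_policy)
print(target, loss)
```

Minimizing `loss` with respect to the model's parameters would pull the model toward the reward-tilted target; because the reward enters only through the weights, it does not need to be differentiable.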

Updated: 2025-07-01 05:55:28

Categories: cs.LG,cs.AI,q-bio.QM

Download: http://arxiv.org/abs/2507.00445v1

Novel Pigeon-inspired 3D Obstacle Detection and Avoidance Maneuver for Multi-UAV Systems

Recent advances in multi-agent systems manipulation have demonstrated a rising demand for deploying multi-UAV systems in urban areas, where static and dynamic obstacles are always present. Inspired by the collective behavior of tilapia fish and pigeons, this research introduces a nature-inspired, collision-free formation control for a multi-UAV system that accounts for obstacle avoidance maneuvers. The framework uses a semi-distributed control approach: a centralized guidance algorithm based on a probabilistic Lloyd's algorithm handles optimal positioning of the UAVs, while a distributed control approach handles inter-vehicle collision avoidance and obstacle avoidance. Further, the framework is extended to 3D space with a novel definition of 3D maneuvers. Finally, the framework is applied to multi-UAV systems in 2D and 3D scenarios, and the results demonstrate the validity of the method in dynamic environments with stationary and moving obstacles.
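For readers unfamiliar with Lloyd's algorithm, the sketch below shows the classic deterministic iteration on a set of sample points. It is not the probabilistic variant the abstract refers to, and it ignores collision and obstacle avoidance entirely. The function name `lloyd_step`, the sample points, and the initial UAV positions are hypothetical.

```python
def lloyd_step(sites, points):
    """One Lloyd iteration: assign each sample point to its nearest site,
    then move every site to the centroid of the points assigned to it."""
    sums = [[0.0, 0.0, 0] for _ in sites]
    for px, py in points:
        nearest = min(
            range(len(sites)),
            key=lambda i: (sites[i][0] - px) ** 2 + (sites[i][1] - py) ** 2,
        )
        sums[nearest][0] += px
        sums[nearest][1] += py
        sums[nearest][2] += 1
    # A site with no assigned points keeps its current position.
    return [
        (sx / n, sy / n) if n else site
        for site, (sx, sy, n) in zip(sites, sums)
    ]


# Hypothetical coverage samples and initial UAV positions (2D).
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
uavs = [(1.0, 1.0), (9.0, 1.0)]
for _ in range(5):
    uavs = lloyd_step(uavs, points)
print(uavs)
```

Each UAV settles at the centroid of the region it is closest to, which is the sense in which a centralized guidance layer can produce target positions for a formation.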

Updated: 2025-07-01 05:52:21

Categories: cs.RO,cs.AI,cs.MA

Download: http://arxiv.org/abs/2507.00443v1

CARTS: Collaborative Agents for Recommendation Textual Summarization

Current recommendation systems often require some form of textual data summarization, such as generating concise and coherent titles for product carousels or other grouped item displays. While large language models have shown promise in NLP domains for textual summarization, these approaches do not directly apply to recommendation systems, where explanations must be highly relevant to the core features of item sets and adhere to strict word limit constraints. In this paper, we propose CARTS (Collaborative Agents for Recommendation Textual Summarization), a multi-agent LLM framework designed for structured summarization in recommendation systems. CARTS decomposes the task into three stages: Generation Augmented Generation (GAG), refinement circle, and arbitration. Successive agent roles are responsible for extracting salient item features, iteratively refining candidate titles based on relevance and length feedback, and selecting the final title through a collaborative arbitration process. Experiments on large-scale e-commerce data and live A/B testing show that CARTS significantly outperforms single-pass and chain-of-thought LLM baselines, delivering higher title relevance and improved user engagement metrics.

Updated: 2025-07-01 05:47:05

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2506.17765v2

A Minimalist Method for Fine-tuning Text-to-Image Diffusion Models

Recent work uses reinforcement learning (RL) to fine-tune text-to-image diffusion models, improving text-image alignment and sample quality. However, existing approaches introduce unnecessary complexity: they cache the full sampling trajectory, depend on differentiable reward models or large preference datasets, or require specialized guidance techniques. Motivated by the "golden noise" hypothesis -- that certain initial noise samples can consistently yield superior alignment -- we introduce Noise PPO, a minimalist RL algorithm that leaves the pre-trained diffusion model entirely frozen and learns a prompt-conditioned initial noise generator. Our approach requires no trajectory storage, reward backpropagation, or complex guidance tricks. Extensive experiments show that optimizing the initial noise distribution consistently improves alignment and sample quality over the original model, with the most significant gains at low inference steps. As the number of inference steps increases, the benefit of noise optimization diminishes but remains present. These findings clarify the scope and limitations of the golden noise hypothesis and reinforce the practical value of minimalist RL fine-tuning for diffusion models.

Updated: 2025-07-01 05:46:31

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.12036v3

A Recipe for Causal Graph Regression: Confounding Effects Revisited

Through recognizing causal subgraphs, causal graph learning (CGL) has risen to be a promising approach for improving the generalizability of graph neural networks under out-of-distribution (OOD) scenarios. However, the empirical successes of CGL techniques are mostly exemplified in classification settings, while regression tasks, a more challenging setting in graph learning, are overlooked. We thus devote this work to tackling causal graph regression (CGR); to this end we reshape the processing of confounding effects in existing CGL studies, which mainly deal with classification. Specifically, we reflect on the predictive power of confounders in graph-level regression, and generalize classification-specific causal intervention techniques to regression through a lens of contrastive learning. Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR. The model implementation and the code are provided on https://github.com/causal-graph/CGR.

Updated: 2025-07-01 05:46:29

Categories: cs.LG,cs.AI,stat.ME

Download: http://arxiv.org/abs/2507.00440v1

Autonomy by Design: Preserving Human Autonomy in AI Decision-Support

AI systems increasingly support human decision-making across domains of professional, skill-based, and personal activity. While previous work has examined how AI might affect human autonomy globally, the effects of AI on domain-specific autonomy -- the capacity for self-governed action within defined realms of skill or expertise -- remain understudied. We analyze how AI decision-support systems affect two key components of domain-specific autonomy: skilled competence (the ability to make informed judgments within one's domain) and authentic value-formation (the capacity to form genuine domain-relevant values and preferences). By engaging with prior investigations and analyzing empirical cases across medical, financial, and educational domains, we demonstrate how the absence of reliable failure indicators and the potential for unconscious value shifts can erode domain-specific autonomy both immediately and over time. We then develop a constructive framework for autonomy-preserving AI support systems. We propose specific socio-technical design patterns -- including careful role specification, implementation of defeater mechanisms, and support for reflective practice -- that can help maintain domain-specific autonomy while leveraging AI capabilities. This framework provides concrete guidance for developing AI systems that enhance rather than diminish human agency within specialized domains of action.

Updated: 2025-07-01 05:46:26

Categories: cs.HC,cs.AI,cs.LG,econ.GN,q-fin.EC

Download: http://arxiv.org/abs/2506.23952v2

RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation

We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior -- such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suite of tiered, semantically grounded tasks decomposed into skill-specific stages, with variations that systematically challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and 3000+ human demonstrations to support imitation learning. Our experiments reveal that policies with similar success rates diverge in how tasks are executed -- some struggle with alignment, others with temporally consistent bimanual control. We find that behavioral metrics correlate with success in over half of task-metric pairs, and remain informative even when binary success saturates. By pinpointing when and how policies fail, RoboEval enables a deeper, more actionable understanding of robotic manipulation -- and highlights the need for evaluation tools that go beyond success alone.

Updated: 2025-07-01 05:33:16

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2507.00435v1

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

Updated: 2025-07-01 05:23:05

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.00432v1

Conquering Ghosts: Relation Learning for Information Reliability Representation and End-to-End Robust Navigation

Environmental disturbances, such as sensor noise, varying lighting conditions, challenging weather, and external adversarial perturbations, are inevitable in real self-driving applications. Existing research and testing have shown that they can severely degrade a vehicle's perception ability and performance. One of the main issues is false-positive detection, i.e., a ghost object that does not exist or appears in the wrong position (such as a non-existent vehicle). Traditional navigation methods tend to avoid every detected object for safety; however, avoiding a ghost object may lead the vehicle into an even more dangerous situation, such as sudden braking on the highway. Given the variety of disturbance types, it is difficult to address this issue at the perception level. A potential solution is to detect ghosts through relation learning over the whole scenario and to develop an integrated end-to-end navigation system. Our underlying logic is that the behavior of every vehicle in the scene is influenced by its neighbors, and that normal vehicles behave logically while ghost vehicles do not. By learning the spatio-temporal relations among surrounding vehicles, an information reliability representation is learned for each detected vehicle, and a robot navigation network is then developed. In contrast to existing works, we encourage the network to learn by itself how to represent reliability and how to aggregate all the uncertain information, thus increasing efficiency and generalizability. To the best of the authors' knowledge, this paper is the first work to use graph relation learning to achieve end-to-end robust navigation in the presence of ghost vehicles. Simulation results on the CARLA platform demonstrate the feasibility and effectiveness of the proposed method in various scenarios.

Updated: 2025-07-01 05:21:49

Categories: cs.AI

Download: http://arxiv.org/abs/2203.09952v4

Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning

Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model with coordination from a central server, without needing to share their raw data. This approach is particularly appealing in the era of privacy regulations like the GDPR, leading many prominent companies to adopt it. However, FL's distributed nature makes it susceptible to poisoning attacks, where malicious clients, controlled by an attacker, send harmful data to compromise the model. Most existing poisoning attacks in FL aim to degrade the model's integrity, such as reducing its accuracy, with limited attention to privacy concerns from these attacks. In this study, we introduce FedPoisonMIA, a novel poisoning membership inference attack targeting FL. FedPoisonMIA involves malicious clients crafting local model updates to infer membership information. Additionally, we propose a robust defense mechanism to mitigate the impact of FedPoisonMIA attacks. Extensive experiments across various datasets demonstrate the attack's effectiveness, while our defense approach reduces its impact to a degree.

Updated: 2025-07-01 04:46:23

Categories: cs.CR,cs.DC,cs.LG

Download: http://arxiv.org/abs/2507.00423v1

Pipelined Decoder for Efficient Context-Aware Text Generation

As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a bottleneck limiting the generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously, and, at each time-step, it generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves the generation speed without a significant loss of generation quality or additional memory consumption.

Updated: 2025-07-01 04:16:14

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.23431v2

Geological Everything Model 3D: A Promptable Foundation Model for Unified and Zero-Shot Subsurface Understanding

Understanding Earth's subsurface is critical for energy transition, natural hazard mitigation, and planetary science. Yet subsurface analysis remains fragmented, with separate models required for structural interpretation, stratigraphic analysis, geobody segmentation, and property modeling -- each tightly coupled to specific data distributions and task formulations. We introduce the Geological Everything Model 3D (GEM), a unified generative architecture that reformulates all these tasks as prompt-conditioned inference along latent structural frameworks derived from subsurface imaging. This formulation moves beyond task-specific models by enabling a shared inference mechanism, where GEM propagates human-provided prompts -- such as well logs, masks, or structural sketches -- along inferred structural frameworks to produce geologically coherent outputs. Through this mechanism, GEM achieves zero-shot generalization across tasks with heterogeneous prompt types, without retraining for new tasks or data sources. This capability emerges from a two-stage training process that combines self-supervised representation learning on large-scale field seismic data with adversarial fine-tuning using mixed prompts and labels across diverse subsurface tasks. GEM demonstrates broad applicability across surveys and tasks, including Martian radar stratigraphy analysis, structural interpretation in subduction zones, full seismic stratigraphic interpretation, geobody delineation, and property modeling. By bridging expert knowledge with generative reasoning in a structurally aware manner, GEM lays the foundation for scalable, human-in-the-loop geophysical AI -- transitioning from fragmented pipelines to a vertically integrated, promptable reasoning system. Project page: https://douyimin.github.io/GEM

Updated: 2025-07-01 04:14:13

Categories: physics.geo-ph,cs.AI

Download: http://arxiv.org/abs/2507.00419v1

Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs

This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 million to 90 billion parameters, are served using the vLLM framework. The QAic inference cards appear to be energy efficient, performing well on the energy-efficiency metric in most cases. The findings offer insights into the potential of the Qualcomm Cloud AI 100 Ultra for high-performance computing (HPC) applications within the NRP.

Updated: 2025-07-01 04:11:09

Categories: cs.DC,cs.AI

Download: http://arxiv.org/abs/2507.00418v1

ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context

We introduce ASTRO, the "Autoregressive Search-Taught Reasoner", a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. ASTRO teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, ASTRO bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.

Updated: 2025-07-01 04:10:15

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.00417v1

SwarmThinkers: Learning Physically Consistent Atomic KMC Transitions at Scale

Can a scientific simulation system be physically consistent, interpretable by design, and scalable across regimes--all at once? Despite decades of progress, this trifecta remains elusive. Classical methods like Kinetic Monte Carlo ensure thermodynamic accuracy but scale poorly; learning-based methods offer efficiency but often sacrifice physical consistency and interpretability. We present SwarmThinkers, a reinforcement learning framework that recasts atomic-scale simulation as a physically grounded swarm intelligence system. Each diffusing particle is modeled as a local decision-making agent that selects transitions via a shared policy network trained under thermodynamic constraints. A reweighting mechanism fuses learned preferences with transition rates, preserving statistical fidelity while enabling interpretable, step-wise decision making. Training follows a centralized-training, decentralized-execution paradigm, allowing the policy to generalize across system sizes, concentrations, and temperatures without retraining. On a benchmark simulating radiation-induced Fe-Cu alloy precipitation, SwarmThinkers is the first system to achieve full-scale, physically consistent simulation on a single A100 GPU, previously attainable only via OpenKMC on a supercomputer. It delivers up to 4963x (3185x on average) faster computation with 485x lower memory usage. By treating particles as decision-makers, not passive samplers, SwarmThinkers marks a paradigm shift in scientific simulation--one that unifies physical consistency, interpretability, and scalability through agent-driven intelligence.
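One way to read "fuses learned preferences with transition rates, preserving statistical fidelity" is as importance sampling on top of a standard KMC step. The sketch below follows that reading; it is an assumption, not the authors' mechanism. The fusion rule q_i ∝ k_i · exp(s_i), the function names, and the example rates and preference scores are all hypothetical.

```python
import math
import random


def kmc_probabilities(rates):
    """Standard KMC event probabilities: p_i = k_i / sum(k)."""
    total = sum(rates)
    return [k / total for k in rates]


def fused_proposal(rates, preferences):
    """Assumed fusion rule: q_i proportional to k_i * exp(preference_i)."""
    weights = [k * math.exp(s) for k, s in zip(rates, preferences)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_event(rates, preferences, rng):
    """Draw an event from the fused proposal and return
    (event index, importance weight p_i / q_i, residence time)."""
    p = kmc_probabilities(rates)
    q = fused_proposal(rates, preferences)
    i = rng.choices(range(len(rates)), weights=q, k=1)[0]
    dt = -math.log(1.0 - rng.random()) / sum(rates)  # exponential waiting time
    return i, p[i] / q[i], dt


# Hypothetical transition rates and learned preference scores.
rates = [1.0, 2.0, 5.0]
preferences = [0.5, 0.0, -0.5]
rng = random.Random(0)
event, weight, dt = sample_event(rates, preferences, rng)
print(event, weight, dt)
```

Weighting each sampled event by p_i / q_i keeps expectations unbiased with respect to the original KMC distribution, while the learned scores decide which events get explored more often.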

Updated: 2025-07-01 04:00:13

标题: SwarmThinkers：规模学习物理一致的原子KMC转换

摘要: 一个科学模拟系统能否同时在物理上一致、可解释设计，并且跨域可扩展？尽管取得了数十年的进展，但这三重要素依然难以实现。传统方法如动力学蒙特卡洛可以确保热力学准确性，但扩展性较差；基于学习的方法提供了效率，但通常牺牲了物理一致性和可解释性。我们提出了SwarmThinkers，这是一个强化学习框架，将原子尺度模拟重新构建为一个基于物理的群体智能系统。每个扩散粒子被建模为一个本地决策代理，通过一个在热力学约束下训练的共享策略网络选择转变。一种重新加权机制将学习到的偏好与转变率融合在一起，保持统计的忠实性同时实现可解释的、逐步的决策制定。训练遵循中央化训练、去中心化执行的范式，使策略能够在不重新训练的情况下概括系统大小、浓度和温度。在模拟辐射诱导的Fe-Cu合金沉淀的基准测试中，SwarmThinkers是第一个在单个A100 GPU上实现全尺度、物理一致的模拟系统，之前只能通过超级计算机上的OpenKMC实现。它提供了高达4963倍（平均3185倍）更快的计算速度，内存使用减少了485倍。通过将粒子视为决策者而非被动采样者，SwarmThinkers标志着科学模拟中的一种范式转变--通过代理驱动的智能实现了物理一致性、可解释性和可扩展性的统一。

更新时间: 2025-07-01 04:00:13

Categories: cs.AI

Download: http://arxiv.org/abs/2505.20094v3

Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. Then MLIP foundation models are trained with supervised learning to predict energy and forces given 3D molecular structures. Once trained, we show that the foundation models can be used in different ways to obtain geometries either explicitly or implicitly. First, they can be used to obtain low-energy 3D geometries via geometry optimization, providing relaxed 3D geometries for downstream molecular property predictions. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the foundation models can be directly fine-tuned for property prediction when ground truth 3D geometries are available. Our results demonstrate that MLIP foundation models trained on relaxation data can provide valuable molecular geometries that benefit property predictions.

Updated: 2025-07-01 03:49:11

标题: 用机器学习的相互作用势将分子图与几何形状相结合

摘要: 准确的分子性质预测需要3D几何结构，通常通过密度泛函理论（DFT）等昂贵方法获得。在这里，我们尝试仅依靠机器学习相互原子势（MLIP）模型获得分子几何结构。为此，我们首先整理了一个包括350万个分子和3亿个快照的大规模分子弛豫数据集。然后，使用监督学习对MLIP基础模型进行训练，以预测给定3D分子结构的能量和力。一旦训练完成，我们展示基础模型可以以不同方式用于显式或隐式获取几何结构。首先，它可以通过几何优化获得低能量的3D几何结构，为下游分子性质预测提供放松的3D几何结构。为了减轻潜在偏见并增强下游预测，我们引入基于放松的3D几何结构的几何微调。其次，当存在真实的3D几何结构时，基础模型可以直接进行性质预测的微调。我们的结果表明，基于弛豫数据训练的MLIP基础模型可以提供有益的分子几何结构，有助于性质预测。

更新时间: 2025-07-01 03:49:11

Categories: physics.chem-ph,cs.AI,q-bio.QM

Download: http://arxiv.org/abs/2507.00407v1

VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models

Recent advancements in Large Video-Language Models (LVLMs) have led to promising results in multimodal video understanding. However, it remains unclear whether these models possess the cognitive capabilities required for high-level tasks, particularly those involving symbolic and abstract perception. Existing benchmarks typically rely on real-world, annotated videos, which lack control over video content and inherent difficulty, limiting their diagnostic power. To bridge this gap, we propose VideoCogQA, a scalable and fully controllable benchmark inspired by game-world environments, designed to evaluate the cognitive abilities of LVLMs. By generating synthetic videos via a programmatic engine, VideoCogQA allows fine-grained control over visual elements, temporal dynamics, and task difficulty. This approach enables a focused evaluation of video cognitive abilities, independent of prior knowledge from visual scene semantics. The dataset includes 800 videos and 3,280 question-answer pairs, featuring tasks related to abstract concepts, symbolic elements, and multimodal integration, with varying levels of difficulty. Experimental results show that even state-of-the-art (SOTA) models, such as GPT-4o, achieve an average performance of 48.8% on tasks involving abstract concepts. Additionally, performance drops by 15% as task complexity increases, highlighting the challenges LVLMs face in maintaining consistent performance. Through this work, we hope to show the limitations of current LVLMs and offer insights into how they can more effectively emulate human cognitive processes in the future.

Updated: 2025-07-01 03:47:15

标题: VideoCogQA：一个可控的基准，用于评估视频语言模型中的认知能力

摘要: 最近在大型视频-语言模型（LVLMs）方面取得的进展已经在多模态视频理解方面取得了令人兴奋的成果。然而，目前尚不清楚这些模型是否具备高级任务所需的认知能力，特别是涉及符号和抽象知觉的任务。现有的基准通常依赖于真实世界中的注释视频，这些视频缺乏对视频内容和固有难度的控制，限制了它们的诊断能力。为了弥合这一差距，我们提出了VideoCogQA，这是一个可伸缩且完全可控的基准，灵感来自游戏世界环境，旨在评估LVLMs的认知能力。通过使用编程引擎生成合成视频，VideoCogQA允许对视觉元素、时间动态和任务难度进行细粒度控制。这种方法使得能够独立于视觉场景语义的先验知识，集中评估视频认知能力。该数据集包括800个视频和3,280个问题-答案对，涉及与抽象概念、符号元素和多模态集成相关的任务，难度各异。实验结果显示，即使是最先进的模型（如GPT-4o），在涉及抽象概念的任务上也只能实现48.8%的平均性能。此外，随着任务复杂性的增加，性能下降了15%，突显了LVLMs在保持一致性性能方面面临的挑战。通过这项工作，我们希望展示当前LVLMs的局限性，并提供关于它们如何更有效地在未来模拟人类认知过程的见解。

更新时间: 2025-07-01 03:47:15

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2411.09105v2

Two-Stage Regularization-Based Structured Pruning for LLMs

The deployment of large language models (LLMs) is largely hindered by their large number of parameters. Structured pruning has emerged as a promising solution. Prior structured pruning methods directly remove unimportant parameters based on certain metrics, which often causes knowledge loss and necessitates extensive retraining. To overcome this, we introduce a novel pruning method, TRSP: Two-Stage Regularization-Based Structured Pruning for LLMs. Specifically, we multiply the output of each transformer layer by an initial learnable weight and iteratively learn these weights by adding their $\ell_1$-norm as a regularization term to the loss function, serving as the first-stage regularization. Subsequently, we apply additional regularization to the difference between the output and input of layers with smaller weights, encouraging the shift of knowledge to the preserved layers. This serves as the second-stage regularization. TRSP retains more knowledge and better preserves model performance than direct parameter elimination. Through extensive experimentation, we show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration, making it a promising solution for efficient LLM deployment.
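
The two regularization stages described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the coefficient `lam`, and the choice of a squared difference for the second-stage penalty are all assumptions made for illustration, since the abstract does not specify them.

```python
def first_stage_loss(task_loss, layer_weights, lam):
    """Task loss plus an l1 penalty on the per-layer scaling weights.

    `lam` is a hypothetical regularization coefficient.
    """
    return task_loss + lam * sum(abs(w) for w in layer_weights)


def layers_to_prune(layer_weights, num_prune):
    """Indices of the `num_prune` layers with the smallest absolute weights."""
    order = sorted(range(len(layer_weights)), key=lambda i: abs(layer_weights[i]))
    return sorted(order[:num_prune])


def second_stage_penalty(layer_inputs, layer_outputs, prune_idx):
    """Sum of squared (output - input) differences over layers marked for pruning.

    The squared-difference form is an assumption; the abstract only says the
    difference between output and input is regularized.
    """
    total = 0.0
    for i in prune_idx:
        total += sum((o - x) ** 2 for x, o in zip(layer_inputs[i], layer_outputs[i]))
    return total
```

Driving the second-stage penalty toward zero makes a low-weight layer behave like an identity map, which is why removing it afterwards should cost little.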

Updated: 2025-07-01 03:31:12

标题: 基于两阶段正则化的LLMs结构化剪枝

摘要: 大型语言模型（LLMs）的部署在很大程度上受到其大量参数的限制。结构化剪枝已经成为一种有前途的解决方案。先前的结构化剪枝方法直接基于某些指标删除不重要的参数，这往往会导致知识丢失并需要进行大量的重新训练。为了克服这一问题，我们引入了一种新颖的剪枝方法TRSP：基于两阶段正则化的结构化剪枝方法，用于LLMs。具体来说，我们将每个transformer层的输出乘以一个初始可学习的权重，并通过将它们的$\ell_1$-范数作为正则化项添加到损失函数中来迭代学习这些权重，作为第一阶段正则化。随后，我们对具有较小权重的层的输出和输入之间的差异应用额外的正则化，鼓励知识向保留层的转移。这作为第二阶段正则化。TRSP保留了更多知识，并比直接消除参数更好地保留了模型性能。通过大量实验，我们展示了TRSP优于强层次结构剪枝方法，而无需重新训练。作为一种层次剪枝方法，它提供了显著的端到端加速，使其成为高效LLM部署的有前途的解决方案。

更新时间: 2025-07-01 03:31:12

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.18232v2

Conceptual Framework Toward Embodied Collective Adaptive Intelligence

Collective Adaptive Intelligence (CAI) represents a transformative approach in embodied AI, wherein numerous autonomous agents collaborate, adapt, and self-organize to navigate complex, dynamic environments. By enabling systems to reconfigure themselves in response to unforeseen challenges, CAI facilitates robust performance in real-world scenarios. This article introduces a conceptual framework for designing and analyzing CAI. It delineates key attributes including task generalization, resilience, scalability, and self-assembly, aiming to bridge theoretical foundations with practical methodologies for engineering adaptive, emergent intelligence. By providing a structured foundation for understanding and implementing CAI, this work seeks to guide researchers and practitioners in developing more resilient, scalable, and adaptable AI systems across various domains.

Updated: 2025-07-01 03:22:25

标题: 朝向具体集体适应智能的概念框架

摘要: Collective Adaptive Intelligence (CAI) 代表了在具体化人工智能中的一个变革性方法，其中许多自治代理相互协作、调整和自组织以应对复杂、动态的环境。通过使系统能够根据意外挑战重新配置自身，CAI 有助于在现实场景中实现鲁棒性能。本文介绍了一个用于设计和分析CAI的概念框架。它描绘了包括任务泛化、韧性、可扩展性和自组装在内的关键属性，旨在将理论基础与工程自适应、新兴智能的实际方法桥接起来。通过为理解和实施CAI提供一个结构化的基础，本工作旨在指导研究人员和从业者在各个领域开发更具韧性、可扩展性和适应性的人工智能系统。

更新时间: 2025-07-01 03:22:25

Categories: cs.AI

Download: http://arxiv.org/abs/2505.23153v2

Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center

Background: Generative artificial intelligence (AI) deployment in healthcare settings raises copyright compliance concerns. Dana-Farber Cancer Institute implemented GPT4DFCI, an internal generative AI tool utilizing OpenAI models, that is approved for enterprise use in research and operations. Given (i) the exceptionally broad adoption of the tool in our organization, (ii) our research mission, and (iii) the shared responsibility model required by Microsoft OpenAI products, we deemed rigorous copyright compliance testing necessary. Case Description: We conducted a structured red teaming exercise in Nov. 2024, with 42 participants from academic, industry, and government institutions. Four teams attempted to extract copyrighted content from GPT4DFCI across four domains: literary works, news articles, scientific publications, and access-restricted clinical notes. Teams successfully extracted verbatim book dedications and near-exact passages through indirect prompting strategies. News article extraction failed despite jailbreak attempts. Scientific article reproduction yielded only high-level summaries. Clinical note testing revealed appropriate privacy safeguards with data reformatting rather than reproduction. Discussion: The successful extraction of literary content indicates the potential presence of copyrighted material in training data, necessitating enhanced inference-time filtering. Differential success rates across content types suggest varying protective mechanisms. The event led to the implementation of a copyright-specific meta-prompt in GPT4DFCI; this mitigation has been in production since Jan. 2025. Conclusion: Systematic red teaming revealed specific vulnerabilities in generative AI copyright compliance, leading to concrete mitigation strategies. Academic medical institutions deploying generative AI must implement continuous testing protocols to ensure legal and ethical compliance.

Updated: 2025-07-01 03:17:10

标题: 为生成式人工智能进行红队演练：关于在学术医疗中心完成的版权焦点演练报告

摘要: 背景：在医疗保健领域部署生成式人工智能（AI）引发了版权合规方面的担忧。达纳-法伯癌症研究所实施了GPT4DFCI，这是一个内部生成式AI工具，利用OpenAI模型，已获得研究和运营中的企业使用批准。鉴于（i）我们组织中该工具的广泛采用，（ii）我们的研究使命，以及（iii）Microsoft OpenAI产品所需的共同责任模型，我们认为有必要进行严格的版权合规测试。案例描述：我们在2024年11月进行了一次结构化的红队演练，有来自学术界、工业界和政府机构的42名参与者。四个团队试图从GPT4DFCI中提取受版权保护的内容，涵盖文学作品、新闻文章、科学出版物和受限制的临床笔记。团队成功地通过间接提示策略提取了书籍致辞和几乎完全相同的段落。尽管进行了越狱尝试，但新闻文章提取失败。科学文章复制只产生了高层次摘要。临床笔记测试显示适当的隐私保护措施，而不是复制数据。讨论：成功提取文学内容表明训练数据中可能存在版权材料，需要加强推断时间过滤。不同内容类型之间的不同成功率表明存在不同的保护机制。该事件导致在GPT4DFCI中实施了一个专门的版权元提示；这个缓解措施自2025年1月以来已投入生产。结论：系统性的红队演练揭示了生成式AI版权合规中的特定漏洞，导致了具体的缓解策略。部署生成式AI的学术医疗机构必须实施持续的测试协议，以确保合法和道德合规。

更新时间: 2025-07-01 03:17:10

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2506.22523v2

Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds

In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce step-wise supervision and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.

Updated: 2025-07-01 03:07:05

标题: 朝着在随机世界中进行元训练的大规模上下文强化学习方向

摘要: 在上下文强化学习（ICRL）中，代理能够自动地从交互式经验中学习。然而，扩展ICRL的一个主要挑战是缺乏可扩展的任务集合。为了解决这个问题，我们提出了过程生成的表格化马尔可夫决策过程，命名为AnyMDP。通过精心设计的随机化过程，AnyMDP能够在大规模上生成高质量的任务，同时保持相对较低的结构偏差。为了在大规模上促进有效的元训练，我们进一步引入了逐步监督，并在ICRL框架中引入了先验信息。我们的结果表明，使用足够大规模的AnyMDP任务，提出的模型可以推广到在训练集中没有考虑过的任务。AnyMDP提供的可扩展任务集还使得对数据分布与ICRL性能之间关系进行更全面的实证研究成为可能。我们进一步展示了ICRL的泛化可能会以增加任务多样性和更长的适应期为代价。这一发现对扩展强大的ICRL能力具有重要意义，强调了多样化和广泛的任务设计的必要性，并优先考虑渐近性能而非少次适应。

更新时间: 2025-07-01 03:07:05

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.02869v2

Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers

Learned sparse retrieval, which can efficiently perform retrieval through mature inverted-index engines, has garnered growing attention in recent years. In particular, inference-free sparse retrievers are attractive because they eliminate online model inference in the retrieval phase, thereby avoiding huge computational cost and offering reasonable throughput and latency. However, even the state-of-the-art (SOTA) inference-free sparse models lag far behind in terms of search relevance when compared to both sparse and dense siamese models. Towards competitive search relevance for inference-free sparse retrievers, we argue that they deserve dedicated training methods rather than reusing those designed for siamese encoders. In this paper, we propose two different approaches for performance improvement. First, we propose an IDF-aware penalty for the matching function that suppresses the contribution of low-IDF tokens and increases the model's focus on informative terms. Moreover, we propose a heterogeneous ensemble knowledge distillation framework that combines siamese dense and sparse retrievers to generate supervisory signals during the pre-training phase. The ensemble framework capitalizes on the respective strengths of dense and sparse retrievers, providing a strong upper bound for knowledge distillation. To reconcile the diverse feedback from heterogeneous supervisors, we normalize and then aggregate the outputs of the teacher models to eliminate score scale differences. On the BEIR benchmark, our model outperforms the existing SOTA inference-free sparse model by 3.3 NDCG@10 points. It exhibits search relevance comparable to siamese sparse retrievers and client-side latency only 1.1x that of BM25.
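
The "normalize and then aggregate" step for heterogeneous teachers can be sketched as follows. The abstract does not say which normalization or aggregation is used, so min-max scaling and a plain average are assumptions here, and the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one teacher's scores to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def aggregate_teachers(teacher_scores):
    """Average the normalized scores of several teachers, document by document.

    `teacher_scores` is a list with one score list per teacher; all lists
    score the same candidate documents in the same order.
    """
    normalized = [min_max_normalize(scores) for scores in teacher_scores]
    count = len(normalized)
    return [sum(column) / count for column in zip(*normalized)]
```

Without the normalization step, a teacher whose raw scores live on a larger scale (for example, sparse dot products in the tens versus cosine similarities below one) would dominate the averaged supervision signal.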

Updated: 2025-07-01 03:06:14

标题: 朝着推断无关的学习稀疏检索器竞争性搜索相关性

摘要: 学习的稀疏检索，在近年来已经引起越来越多的关注，可以通过成熟的倒排索引引擎高效地进行检索。特别是，无推理的稀疏检索器具有吸引力，因为它们在检索阶段消除了在线模型推理，从而避免了巨大的计算成本，提供了合理的吞吐量和延迟。然而，即使是最先进的无推理的稀疏模型在搜索相关性方面也远远落后于稀疏和密集的孪生模型。为了使无推理的稀疏检索器具有竞争力的搜索相关性，我们认为它们值得采用与孪生编码器不同的专门训练方法。在本文中，我们提出了两种不同的性能改进方法。首先，我们提出了一种针对匹配函数的IDF感知惩罚，抑制低IDF令牌的贡献，并增加模型对信息性术语的关注。此外，我们提出了一种异构集成知识蒸馏框架，将孪生密集和稀疏检索器结合起来，在预训练阶段生成监督信号。密集和稀疏检索器的集成框架分别利用它们的优势，为知识蒸馏提供了一个强大的上限。为了协调异构监督者的多样反馈，我们对教师模型的输出进行归一化，然后聚合起来消除分数差异。在BEIR基准测试中，我们的模型比现有的最先进的无推理稀疏模型表现更好，\textbf{3.3 NDCG@10分数}。它展示了与孪生稀疏检索器相当的搜索相关性，并且客户端延迟仅为BM25的\textbf{1.1倍}。

更新时间: 2025-07-01 03:06:14

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2411.04403v2

Flexora: Flexible Low Rank Adaptation for Large Language Models

Large Language Models (LLMs) are driving advancements in artificial intelligence by increasing the scale of model parameters, which has significantly enhanced generalization ability and unlocked new capabilities in practice. However, their performance in specific downstream tasks is usually hindered by their knowledge boundaries on these tasks. Thus, fine-tuning techniques, especially the widely used Low-Rank Adaptation (LoRA) method, have been introduced to expand the boundaries on these tasks; however, LoRA can underperform on certain tasks owing to potential overfitting. To overcome this overfitting and improve the performance of LoRA, we propose the flexible low rank adaptation (Flexora) method to automatically and flexibly select the most important layers needing to be fine-tuned to achieve the best performance on different downstream tasks. Specifically, Flexora first frames this layer selection problem as a well-defined hyperparameter optimization (HPO) problem, then addresses it using the unrolled differentiation (UD) method, and finally selects the most useful layers based on the optimized hyperparameters. Our extensive experiments on many pretrained models and natural language tasks show that Flexora is able to consistently improve over the existing baselines, indicating the effectiveness of Flexora in practice. We additionally provide insightful theoretical results and many ablation studies to deliver a comprehensive understanding of Flexora.

Updated: 2025-07-01 02:38:26

标题: Flexora：大型语言模型的灵活低秩适应

摘要: 大型语言模型（LLM）通过增加模型参数规模推动人工智能的进步，显著增强了泛化能力，并在实践中解锁了新的能力。然而，它们在特定下游任务中的表现通常受到它们对这些任务的知识边界的限制。因此，引入了微调技术，特别是广泛使用的低秩适应（LoRA）方法，来扩展这些任务上的边界，尽管LoRA在某些任务上表现不佳，因为它在这些任务上可能过拟合。为了克服这种过拟合并提高LoRA的性能，我们提出了灵活的低秩适应（Flexora）方法，以自动灵活地选择需要微调以在不同下游任务中实现最佳性能的最重要的层。具体而言，Flexora首先将这个层选择问题构建为一个明确定义的超参数优化（HPO）问题，然后使用展开微分（UD）方法来解决它，最终根据优化的超参数选择最有用的层。我们对许多预训练模型和自然语言任务进行了广泛的实验，结果显示Flexora能够稳定地优于现有基线，表明我们的Flexora在实践中的有效性。我们还提供了深刻的理论结果和许多消融研究，以提供对我们的Flexora的全面了解。

更新时间: 2025-07-01 02:38:26

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2408.10774v4

Beyond Code: The Multidimensional Impacts of Large Language Models in Software Development

Large language models (LLMs) are poised to significantly impact software development, especially in the Open-Source Software (OSS) sector. To understand this impact, we first outline the mechanisms through which LLMs may influence OSS through code development, collaborative knowledge transfer, and skill development. We then empirically examine how LLMs affect OSS developers' work in these three key areas. Leveraging a natural experiment from a temporary ChatGPT ban in Italy, we employ a Difference-in-Differences framework with two-way fixed effects to analyze data from all OSS developers on GitHub in three similar countries, Italy, France, and Portugal, totaling 88,022 users. We find that access to ChatGPT increases developer productivity by 6.4%, knowledge sharing by 9.6%, and skill acquisition by 8.4%. These benefits vary significantly by user experience level: novice developers primarily experience productivity gains, whereas more experienced developers benefit more from improved knowledge sharing and accelerated skill acquisition. In addition, we find that LLM-assisted learning is highly context-dependent, with the greatest benefits observed in technically complex, fragmented, or rapidly evolving contexts. We show that the productivity effects of LLMs extend beyond direct code generation to include enhanced collaborative learning and knowledge exchange among developers, dynamics that are essential for gaining a holistic understanding of LLMs' impact in OSS. Our findings offer critical managerial implications: strategically deploying LLMs can accelerate novice developers' onboarding and productivity, empower intermediate developers to foster knowledge sharing and collaboration, and support rapid skill acquisition, together enhancing long-term organizational productivity and agility.
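
The core arithmetic behind a Difference-in-Differences comparison can be sketched in a few lines. This is only the basic two-group, two-period estimator; it omits the two-way fixed effects the abstract mentions, and the function name is hypothetical.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two Difference-in-Differences estimate.

    Change in the treated group's mean outcome minus the change in the
    control group's mean outcome over the same period.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change
```

Subtracting the control group's change removes trends shared by both groups, so the remainder is attributed to the treatment under the usual parallel-trends assumption.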

Updated: 2025-07-01 02:35:48

标题: 超越代码：大型语言模型在软件开发中的多维影响

摘要: 大型语言模型（LLMs）有望在软件开发领域产生重大影响，特别是在开源软件（OSS）领域。为了理解这种影响，我们首先概述LLMs可能通过代码开发、协作知识转移和技能发展等机制影响OSS。然后，我们通过一个来自意大利ChatGPT暂时禁令的自然实验，利用双向固定效应的差异分析框架分析了意大利、法国和葡萄牙三个类似国家所有GitHub上的OSS开发者的数据，总共有88,022名用户。我们发现，ChatGPT的使用可以使开发者的生产力提高6.4％，知识共享提高9.6％，技能获取提高8.4％。这些好处在用户经验水平上有显著差异：初学者主要经历生产力增长，而经验更丰富的开发者更多地受益于改进的知识共享和加快的技能获取。此外，我们发现，LLM辅助学习高度依赖于环境，最大的好处出现在技术复杂、碎片化或快速发展的环境中。我们展示了LLMs的生产力效应不仅限于直接代码生成，还包括增强开发者之间的协作学习和知识交流，这种动态对于全面了解LLMs在OSS中的影响至关重要。我们的发现提供了重要的管理启示：战略性地部署LLMs可以加快初学者的入职和生产力，让中级开发者促进知识共享和协作，支持快速技能获取，从而提高长期组织的生产力和敏捷性。

更新时间: 2025-07-01 02:35:48

Categories: econ.GN,cs.AI,q-fin.EC

Download: http://arxiv.org/abs/2506.22704v2

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.

Updated: 2025-07-01 02:29:52

标题: SPIRAL：零和游戏上的自我对弈通过多智体多轮强化学习激励推理

摘要: 最近在强化学习领域取得的进展表明，语言模型可以通过在具有可验证奖励的任务上训练来发展复杂的推理能力，但这些方法依赖于人为策划的问题-答案对和领域特定的奖励设计。我们介绍了SPIRAL，一个自我对弈框架，模型通过与不断改进的自身版本对战多轮、零和游戏来学习，消除了对人类监督的需求。通过自我对弈，SPIRAL生成了一个无限的逐渐具有挑战性的问题课程，因为模型必须不断适应更强大的对手。为了实现这种规模化的自我对弈训练，我们实现了一个完全在线、多轮、多代理的强化学习系统，为LLMs提出了基于角色条件的优势估计（RAE）来稳定多代理训练。使用SPIRAL，在零和游戏上进行自我对弈可以产生广泛传递的推理能力。仅仅通过训练Qwen3-4B-Base在Kuhn扑克上，数学能力提高了8.6%，一般推理能力提高了8.4%，优于在25,000个专家游戏轨迹上的SFT。分析表明，这种转移是通过三种认知模式实现的：系统性分解、期望值计算和逐案分析。多游戏训练（井字游戏、Kuhn扑克、简单谈判）进一步增强了性能，因为每个游戏都培养了不同的推理能力。将SPIRAL应用于强大的推理模型（DeepSeek-R1-Distill-Qwen-7B）仍然可以带来2.0%的平均改进。这些结果表明，零和游戏自然发展出可传递的推理能力，突显了自主推理发展的一个有前景的方向。

更新时间: 2025-07-01 02:29:52

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2506.24119v2

iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing

Conformance testing is essential for ensuring that protocol implementations comply with their specifications. However, traditional testing approaches involve manually creating numerous test cases and scripts, making the process labor-intensive and inefficient. Recently, Large Language Models (LLMs) have demonstrated impressive text comprehension and code generation abilities, providing promising opportunities for automation. In this paper, we propose iPanda, the first end-to-end framework that leverages LLMs to automate protocol conformance testing. Given a protocol specification document and its implementation, iPanda first employs a keyword-based method to automatically generate comprehensive test cases. Then, it utilizes a code-based retrieval-augmented generation approach to effectively interpret the implementation and produce executable test code. To further enhance code quality, iPanda incorporates an iterative self-correction mechanism to refine generated test scripts interactively. Finally, by executing and analyzing the generated tests, iPanda systematically verifies compliance between implementations and protocol specifications. Comprehensive experiments on various protocols show that iPanda significantly outperforms pure LLM-based approaches, improving the success rate (Pass@1) of test-code generation by factors ranging from 4.675 times to 10.751 times.

Updated: 2025-07-01 02:27:44

标题: iPanda：一种智能协议测试和调试代理，用于合规性测试

摘要: 一致性测试对于确保协议实现符合其规范至关重要。然而，传统的测试方法涉及手动创建大量测试用例和脚本，使得这一过程费时费力且低效。近年来，大型语言模型（LLMs）展示了令人印象深刻的文本理解和代码生成能力，为自动化提供了有前途的机会。在本文中，我们提出了iPanda，这是第一个利用LLMs自动化协议一致性测试的端到端框架。给定协议规范文档及其实现，iPanda首先采用基于关键字的方法自动生成全面的测试用例。然后，它利用一种基于代码的检索增强生成方法有效解释实现并生成可执行的测试代码。为了进一步提高代码质量，iPanda还整合了一个迭代式自我校正机制，以交互方式完善生成的测试脚本。最后，通过执行和分析生成的测试，iPanda系统地验证实现与协议规范之间的一致性。对各种协议的全面实验表明，iPanda明显优于纯LLM方法，将测试代码生成的成功率（Pass@1）提高了4.675倍至10.751倍。

更新时间: 2025-07-01 02:27:44

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2507.00378v1

Presto: Hardware Acceleration of Ciphers for Hybrid Homomorphic Encryption

Hybrid Homomorphic Encryption (HHE) combines symmetric key and homomorphic encryption to reduce ciphertext expansion, which is crucial in client-server deployments of HE. Special symmetric ciphers, amenable to efficient HE evaluation, have been developed. Their client-side deployment calls for performant and energy-efficient implementation, and in this paper we develop and evaluate hardware accelerators for the two known CKKS-targeting HHE ciphers, HERA and Rubato. We design vectorized and overlapped functional modules. The design exploits the transposition-invariance property of the MixColumns and MixRows functions and alternates the order of intermediate state to eliminate bubbles in stream key generation, improving latency and throughput. We decouple the RNG and key computation phases to hide the latency of RNG and to reduce the critical path in FIFOs, achieving higher operating frequency. We implement the accelerator on an AMD Virtex UltraScale+ FPGA. Both Rubato and HERA achieve a 6x improvement in throughput compared to the software implementation. In terms of latency, Rubato achieves a 5x reduction, while HERA achieves a 3x reduction. Additionally, our hardware implementations reduce energy consumption by 75x for Rubato and 47x for HERA compared to their software implementation.

Updated: 2025-07-01 01:48:28

标题: Presto: 混合同态加密的密码硬件加速

摘要: 混合同态加密（HHE）结合了对称密钥和同态加密，以减少在HE客户端-服务器部署中关键的密文扩展。特殊的对称密码，适用于高效的HE评估，已经被开发出来。它们在客户端部署需要高性能和高能效的实现，本文中我们为两种已知的针对CKKS的HHE密码，HERA和Rubato，开发和评估了硬件加速器。我们设计了矢量化和重叠的功能模块。设计利用了MixColumns和MixRows函数的转置不变性属性，并交替中间状态的顺序，以消除流密钥生成中的气泡，提高延迟和吞吐量。我们将RNG和密钥计算阶段解耦，以隐藏RNG的延迟并减少FIFO中的关键路径，实现更高的操作频率。我们在AMD Virtex UltraScale+ FPGA上实现了加速器。与软件实现相比，Rubato和HERA的吞吐量分别提高了6倍。在延迟方面，Rubato实现了5倍的减少，而HERA实现了3倍的减少。此外，我们的硬件实现将Rubato的能耗降低了75倍，将HERA的能耗降低了47倍，与它们的软件实现相比。

更新时间: 2025-07-01 01:48:28

Categories: cs.AR,cs.CR

Download: http://arxiv.org/abs/2507.00367v1

Adapt Your Body: Mitigating Proprioception Shifts in Imitation Learning

Imitation learning models for robotic tasks typically rely on multi-modal inputs, such as RGB images, language, and proprioceptive states. While proprioception is intuitively important for decision-making and obstacle avoidance, simply incorporating all proprioceptive states leads to a surprising degradation in imitation learning performance. In this work, we identify the underlying issue as the proprioception shift problem, where the distributions of proprioceptive states diverge significantly between training and deployment. To address this challenge, we propose a domain adaptation framework that bridges the gap by utilizing rollout data collected during deployment. Using Wasserstein distance, we quantify the discrepancy between expert and rollout proprioceptive states and minimize this gap by adding noise to both sets of states, proportional to the Wasserstein distance. This strategy enhances robustness against proprioception shifts by aligning the training and deployment distributions. Experiments on robotic manipulation tasks demonstrate the efficacy of our method, enabling the imitation policy to leverage proprioception while mitigating its adverse effects. Our approach outperforms the naive solution which discards proprioception, and other baselines designed to address distributional shifts.
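
The idea of measuring the expert-versus-rollout gap and adding noise in proportion to it can be sketched for a single proprioceptive dimension. Everything below is an illustrative assumption rather than the paper's method: it treats one scalar state dimension, requires equal sample sizes, uses Gaussian noise, and the `scale` parameter and function names are hypothetical.

```python
import random


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples.

    For equal-size samples this is the mean absolute difference between
    the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def add_scaled_noise(states, distance, scale=1.0, rng=None):
    """Add zero-mean Gaussian noise with standard deviation scale * distance."""
    rng = rng or random.Random(0)
    return [s + rng.gauss(0.0, scale * distance) for s in states]
```

When the two distributions already match, the distance is zero and the states pass through unchanged; the larger the gap, the more the noise blurs the two sets of states toward overlapping support.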

Updated: 2025-07-01 01:36:05

标题: 调整你的身体：减轻在模仿学习中的本体感移位

摘要: 机器人任务的模仿学习模型通常依赖于多模态输入，例如RGB图像、语言和本体感知状态。虽然本体感知在决策和避障方面直觉上很重要，但简单地将所有本体感知状态纳入模型会导致模仿学习性能出现惊人的下降。在这项工作中，我们确定了本体感知转移问题作为潜在问题，即训练和部署期间本体感知状态的分布显著不同。为了解决这一挑战，我们提出了一个域自适应框架，通过利用在部署期间收集的回滚数据来弥合这一差距。利用Wasserstein距离，我们量化了专家和回滚本体感知状态之间的差异，并通过向两组状态添加与Wasserstein距离成比例的噪声来最小化这一差距。这种策略通过对齐训练和部署分布，增强了对本体感知转移的鲁棒性。在机器人操纵任务上的实验证明了我们方法的有效性，使模仿策略能够利用本体感知并减轻其不良影响。我们的方法优于丢弃本体感知的朴素解决方案，以及其他旨在解决分布转移的基线方法。

更新时间: 2025-07-01 01:36:05

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2506.23944v2

AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration

While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird's-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at https://github.com/taco-group/AirV2X-Perception.

Updated: 2025-07-01 01:20:24

标题: AirV2X: 统一的空地车辆对一切协作

摘要: 虽然多车辆协同驾驶展示了明显优势，但传统基础设施的V2X系统仍受到显著的部署成本和在农村和郊区地区创建“未覆盖危险区域”的限制。我们提出了AirV2X-Perception，这是一个大规模数据集，利用无人机作为固定路边单元 (RSU) 的灵活替代或补充。无人机相对于地面感知具有独特优势：互补的鸟瞰视角减少遮挡，动态定位能力实现悬停、巡逻和护航导航规则，以及与固定基础设施相比显著更低的部署成本。我们的数据集包括在城市、郊区和农村环境中不同天气和照明条件下，无人机辅助驾驶场景的6.73小时。AirV2X-Perception数据集促进了车辆与无人机 (V2D) 算法的开发和标准化评估，填补了快速扩展的空中辅助自动驾驶系统领域的关键差距。该数据集和开发工具包在https://github.com/taco-group/AirV2X-Perception 上开源。

更新时间: 2025-07-01 01:20:24

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2506.19283v2

CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception

Multi-agent collaborative perception enhances each agent's perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird's-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on OPV2V and V2V4Real datasets show CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb bandwidth, 83 times less than SOTA methods, while improving AP70 by 1.1 percent. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy.

Updated: 2025-07-01 01:18:12

Categories: cs.LG,cs.AI,cs.CV,cs.RO

Download: http://arxiv.org/abs/2503.13504v2

Neural Networks Generalize on Low Complexity Data

We show that feedforward neural networks with ReLU activation generalize on low complexity data, suitably defined. Given i.i.d. data generated from a simple programming language, the minimum description length (MDL) feedforward neural network which interpolates the data generalizes with high probability. We define this simple programming language, along with a notion of description length of such networks. We provide several examples on basic computational tasks, such as checking primality of a natural number. For primality testing, our theorem shows the following and more. Suppose that we draw an i.i.d. sample of $n$ numbers uniformly at random from $1$ to $N$. For each number $x_i$, let $y_i = 1$ if $x_i$ is a prime and $0$ if it is not. Then, with probability $1 - O((\ln N)/n)$, the interpolating MDL network correctly answers whether a newly drawn number between $1$ and $N$ is a prime or not. Note that the network is not designed to detect primes; minimum description length learning discovers a network which does so. Extensions to noisy data are also discussed, suggesting that MDL neural network interpolators can demonstrate tempered overfitting.
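The sampling setup in the primality example can be sketched in a few lines of Python. This is only an illustration of how the labeled sample $(x_i, y_i)$ is drawn; it does not implement the MDL network search, and the function names (`is_prime`, `sample_primality_data`, `prime_fraction`) are assumptions introduced here, not taken from the paper.

```python
import random


def is_prime(x: int) -> bool:
    """Trial-division primality check; adequate for small illustrative N."""
    if x < 2:
        return False
    if x < 4:
        return True  # 2 and 3 are prime
    if x % 2 == 0:
        return False
    d = 3
    while d * d <= x:
        if x % d == 0:
            return False
        d += 2
    return True


def sample_primality_data(n: int, N: int, seed: int = 0):
    """Draw n numbers uniformly from 1..N and label each with y = 1 if prime, else 0."""
    rng = random.Random(seed)
    xs = [rng.randint(1, N) for _ in range(n)]  # randint is inclusive on both ends
    ys = [1 if is_prime(x) else 0 for x in xs]
    return xs, ys


def prime_fraction(N: int) -> float:
    """Fraction of integers in 1..N that are prime."""
    return sum(1 for x in range(1, N + 1) if is_prime(x)) / N
```

By the prime number theorem the fraction of primes up to $N$ is roughly $1/\ln N$, so `prime_fraction(N)` gives the error of the trivial "never prime" predictor, a useful reference point when reading the $O((\ln N)/n)$ rate.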

Updated: 2025-07-01 01:09:51

Categories: cs.LG,cs.AI,math.ST,stat.ML,stat.TH

Download: http://arxiv.org/abs/2409.12446v4

Data-Driven Exploration for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems

We study reinforcement learning (RL) for the same class of continuous-time stochastic linear-quadratic (LQ) control problems as in \cite{huang2024sublinear}, where volatilities depend on both states and controls while states are scalar-valued and running control rewards are absent. We propose a model-free, data-driven exploration mechanism that adaptively adjusts entropy regularization by the critic and policy variance by the actor. Unlike the constant or deterministic exploration schedules employed in \cite{huang2024sublinear}, which require extensive tuning in implementation and ignore learning progress across iterations, our adaptive exploratory approach boosts learning efficiency with minimal tuning. Despite its flexibility, our method achieves a sublinear regret bound that matches the best-known model-free results for this class of LQ problems, which were previously derived only with fixed exploration schedules. Numerical experiments demonstrate that adaptive exploration accelerates convergence and improves regret performance compared to the non-adaptive model-free and model-based counterparts.

Updated: 2025-07-01 01:09:06

Categories: cs.LG,cs.AI,cs.SY,eess.SY,math.OC

Download: http://arxiv.org/abs/2507.00358v1

CGEarthEye: A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation

Deep learning methods have significantly advanced the development of intelligent interpretation in remote sensing (RS), with foundation model research based on large-scale pre-training paradigms rapidly reshaping various domains of Earth Observation (EO). However, compared to the open accessibility and high spatiotemporal coverage of medium-resolution data, the limited acquisition channels for ultra-high-resolution optical RS imagery have constrained the progress of high-resolution remote sensing vision foundation models (RSVFM). As the world's largest sub-meter-level commercial RS satellite constellation, the Jilin-1 constellation possesses abundant sub-meter-level image resources. This study proposes CGEarthEye, an RSVFM framework specifically designed for Jilin-1 satellite characteristics, comprising five backbones of different parameter scales, totaling 2.1 billion parameters. To enhance the representational capacity of the foundation model, we developed JLSSD, the first 15-million-scale multi-temporal self-supervised learning (SSL) dataset featuring global coverage with quarterly temporal sampling within a single year, constructed through multi-level representation clustering and sampling strategies. The framework integrates seasonal contrast, augmentation-based contrast, and masked patch token contrastive strategies for pre-training. Comprehensive evaluations across 10 benchmark datasets covering four typical RS tasks demonstrate that CGEarthEye consistently achieves state-of-the-art (SOTA) performance. Further analysis reveals CGEarthEye's superior characteristics in feature visualization, model convergence, parameter efficiency, and practical mapping applications. This study anticipates that the exceptional representation capabilities of CGEarthEye will facilitate broader and more efficient use of Jilin-1 data in traditional EO applications.

Updated: 2025-07-01 01:05:18

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.00356v1

Transformers from Diffusion: A Unified Framework for Neural Message Passing

Learning representations for structured data with certain geometries (e.g., observed or unobserved) is a fundamental challenge, wherein message passing neural networks (MPNNs) have become a de facto class of model solutions. In this paper, inspired by physical systems, we propose an energy-constrained diffusion model, which integrates the inductive bias of diffusion on manifolds with layer-wise constraints of energy minimization. We identify that the diffusion operators have a one-to-one correspondence with the energy functions implicitly descended by the diffusion process, and the finite-difference iteration for solving the energy-constrained diffusion system induces the propagation layers of various types of MPNNs operating on observed or latent structures. This leads to a unified mathematical framework for common neural architectures whose computational flows can be cast as message passing (or its special case), including MLPs, GNNs, and Transformers. Building on these insights, we devise a new class of neural message passing models, dubbed diffusion-inspired Transformers (DIFFormer), whose global attention layers are derived from the principled energy-constrained diffusion framework. Across diverse datasets ranging from real-world networks to images, texts, and physical particles, we demonstrate that the new model achieves promising performance in scenarios where the data structures are observed (as a graph), partially observed, or entirely unobserved.
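The link between a finite-difference diffusion step and a message-passing layer can be illustrated with a small Python sketch. It is not the DIFFormer implementation: the softmax-of-dot-products coupling, the step size `tau`, and the function names are assumptions chosen for illustration, and the update shown is simply an explicit Euler step of a diffusion process with row-normalized coupling weights.

```python
import math


def attention_weights(states):
    """Row-stochastic coupling S[i][j] from a softmax over dot products (an assumed choice)."""
    n = len(states)
    S = []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(states[i], states[j])) for j in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        S.append([e / total for e in exps])
    return S


def diffusion_step(states, tau=0.5):
    """One explicit step: z_i <- (1 - tau) * z_i + tau * sum_j S[i][j] * z_j."""
    S = attention_weights(states)
    n, d = len(states), len(states[0])
    new_states = []
    for i in range(n):
        agg = [sum(S[i][j] * states[j][c] for j in range(n)) for c in range(d)]
        new_states.append([(1 - tau) * states[i][c] + tau * agg[c] for c in range(d)])
    return new_states
```

Because each row of `S` sums to one, every updated state is a convex combination of the current states; the coupling is computed from the states themselves, so no observed graph is required, which mirrors the "unobserved structure" case mentioned in the abstract.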

Updated: 2025-07-01 01:04:17

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2409.09111v4

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, and generate more semantically relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, which demonstrate the potential for more intuitive and responsive human-AI interactions.

Updated: 2025-07-01 01:02:44

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.22554v2

An AST-guided LLM Approach for SVRF Code Synthesis

Standard Verification Rule Format (SVRF) is essential for semiconductor applications such as Design Rule Check (DRC), Layout Versus Schematic (LVS), and Optical Proximity Correction (OPC). It faces challenges as advancing nodes create complex design rules that render traditional SVRF development ineffective and highlight an expertise gap. This paper introduces a novel methodology integrating Abstract Syntax Tree (AST) embedding and Retrieval-Augmented Generation (RAG) for enhanced SVRF code synthesis, ensuring semantic accuracy and error minimization through structural validation with domain-specific insights for precise code generation. We evaluate different T5-based models and propose an innovative SVRF-specific scoring framework that complements standard metrics like BLEU and ROUGE-L. In our approach, AST provides rigorous structural validation, while RAG infuses relevant domain knowledge, effectively enhancing the code generation workflow. On a comprehensive benchmark of 740 DRC rule implementations, our methodology demonstrates up to a 40% improvement in code generation accuracy compared to a basic text-based fine-tuning process. This fusion of industry expertise with advanced coding strategies not only optimizes SVRF development under limited dataset constraints but also creates a more intuitive and efficient coding environment. Consequently, users can rapidly iterate through design cycles, reduce manual error correction, and significantly improve overall productivity.

Updated: 2025-07-01 00:57:45

Categories: cs.SE,cs.AI,cs.ET

Download: http://arxiv.org/abs/2507.00352v1

Algorithms for the Shortest Vector Problem in $2$-dimensional Lattices, Revisited

Efficiently solving the Shortest Vector Problem (SVP) in two-dimensional lattices holds practical significance in cryptography and computational geometry. While simpler than its high-dimensional counterpart, two-dimensional SVP motivates scalable solutions for high-dimensional lattices and benefits applications like sequence cipher cryptanalysis involving large integers. In this work, we first propose a novel definition of reduced bases and develop an efficient adaptive lattice reduction algorithm, CrossEuc, that strategically applies the Euclidean algorithm across dimensions. Building on this framework, we introduce HVec, a vectorized generalization of the Half-GCD algorithm originally defined for integers, which can efficiently halve the bit-length of two vectors and may be of independent interest. By iteratively invoking HVec, our optimized algorithm HVecSBP achieves a reduced basis in $O(\log n M(n) )$ time for arbitrary input bases with bit-length $n$, where $M(n)$ denotes the cost of multiplying two $n$-bit integers. Compared to existing algorithms, our design is applicable to general forms of input lattices, eliminating the cost of pre-converting input bases to Hermite Normal Form (HNF). The comprehensive experimental results demonstrate that for input lattice bases in HNF, the optimized algorithm HVecSBP achieves at least a $13.5\times$ efficiency improvement compared to existing methods. For general-form input lattice bases, converting them to HNF before applying HVecSBP offers only marginal advantages in extreme cases where the two basis vectors are nearly degenerate. However, as the linear dependency between input basis vectors decreases, directly employing HVecSBP yields increasingly significant efficiency gains, outperforming hybrid approaches that rely on prior HNF conversion.
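For orientation, the classical baseline for two-dimensional SVP is Lagrange-Gauss reduction, which plays the role for lattice bases that the Euclidean algorithm plays for integers. The sketch below implements that classical method with exact integer arithmetic; it is not CrossEuc, HVec, or HVecSBP, and the function name `gauss_reduce` is an assumption made for this example.

```python
def gauss_reduce(u, v):
    """Lagrange-Gauss reduction of a 2D integer lattice basis (u, v).

    Assumes u and v are linearly independent integer pairs. Returns a basis
    (u, v) of the same lattice with |u| <= |v|, where u is a shortest
    nonzero lattice vector.
    """
    def norm2(a):
        return a[0] * a[0] + a[1] * a[1]

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1]

    if norm2(u) > norm2(v):
        u, v = v, u
    while True:
        n = norm2(u)
        # Nearest integer to <u, v> / <u, u>, computed without floating point.
        m = (2 * dot(u, v) + n) // (2 * n)
        v = (v[0] - m * u[0], v[1] - m * u[1])
        if norm2(v) >= norm2(u):
            return u, v
        u, v = v, u  # v became strictly shorter; swap roles and repeat
```

For example, `gauss_reduce((5, 8), (8, 13))` works on a basis of determinant 1 and ends with two unit-length vectors. Each loop iteration is one "division with remainder" on vectors, which is the step that Half-GCD-style methods such as the one described in the abstract aim to batch.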

Updated: 2025-07-01 00:56:43

Categories: cs.CG,cs.CR

Download: http://arxiv.org/abs/2504.12948v2

Addressing malware family concept drift with triplet autoencoder

Machine learning is increasingly vital in cybersecurity, especially in malware detection. However, concept drift, where the characteristics of malware change over time, poses a challenge for maintaining the efficacy of these detection systems. Concept drift can occur in two forms: the emergence of entirely new malware families and the evolution of existing ones. This paper proposes an innovative method to address the former, focusing on effectively identifying new malware families. Our approach leverages a supervised autoencoder combined with triplet loss to differentiate between known and new malware families. We create clear and robust clusters that enhance the accuracy and resilience of malware family classification by utilizing this metric learning technique and the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The effectiveness of our method is validated using an Android malware dataset and a Windows portable executable (PE) malware dataset, showcasing its capability to sustain model performance within the dynamic landscape of emerging malware threats. Our results demonstrate a significant improvement in detecting new malware families, offering a reliable solution for ongoing cybersecurity challenges.
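The triplet loss mentioned above can be written down compactly. The sketch below uses the common hinge form on Euclidean distances; the margin value, the choice of unsquared distance, and the function names are assumptions for illustration, since the abstract does not specify the exact formulation used in the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative (a sample from a different family)
    is at least `margin` farther from the anchor than the positive
    (a sample from the same family).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Minimizing this loss over many triplets pulls samples of the same family together and pushes other families apart, which is what makes a density-based method such as DBSCAN a natural fit for clustering the resulting embeddings.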

Updated: 2025-07-01 00:55:00

Categories: cs.CR

Download: http://arxiv.org/abs/2507.00348v1

Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs

We present the Junk DNA Hypothesis by adopting a novel task-centric angle on the pre-trained weights of large language models (LLMs). It has been believed that weights in LLMs contain significant redundancy, leading to the conception that a considerable chunk of the parameters can be removed by pruning without compromising performance. Contrary to this belief, this paper presents a counter-argument: small-magnitude pre-trained weights encode vital knowledge essential for tackling difficult downstream tasks. This manifests as a monotonic relationship between downstream performance drop and task difficulty as we prune more pre-trained weights by magnitude. Moreover, we reveal that removing these seemingly inconsequential weights can result in irreparable loss of knowledge and performance degradation in difficult tasks, even when downstream continual training is allowed. Interestingly, our evaluations show that the other popular compression technique, namely quantization, fails to exhibit a similar monotonic effect and does not as convincingly disentangle this task-difficulty information. To study this formally, we introduce several quantifiable metrics to gauge downstream task difficulty: (1) within the same task category, and (2) across different task categories. Our extensive experiments substantiate the Junk DNA Hypothesis across a diverse range of model sizes, tasks, datasets, and even pruning methods. Code is available at: https://github.com/VITA-Group/Junk_DNA_Hypothesis.git.
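Pruning "by magnitude" means zeroing the weights with the smallest absolute values. A minimal sketch on a flat list of weights follows; the function name `magnitude_prune` and the flat-list representation are assumptions for illustration, and real LLM pruning is applied per layer or globally over weight tensors.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest |w|.

    `weights` is a flat list of floats; `sparsity` is in [0, 1].
    Returns a new list and leaves the input unchanged.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

With `weights = [0.5, -0.01, 2.0, 0.03, -1.2, 0.002]` and `sparsity = 0.5`, the three smallest-magnitude entries are zeroed and the result is `[0.5, 0.0, 2.0, 0.0, -1.2, 0.0]`. Sweeping `sparsity` upward and re-evaluating tasks of different difficulty is the kind of experiment the abstract describes.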

Updated: 2025-07-01 00:13:44

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2310.02277v3