    _              _         ____              
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        


Reconstruction Alignment Improves Unified Multimodal Models

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
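The training loop described in the abstract can be sketched at toy scale: condition the generator on the model's own understanding embedding and minimize a reconstruction loss. The linear encoder and generator below are illustrative stand-ins for the UMM components, not the actual RecA implementation; all shapes and the learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's code):
# - understand(): a frozen visual-understanding encoder producing a dense embedding
# - generate():   the UMM's image generator, conditioned on that embedding
D_IMG, D_EMB = 64, 16
W_enc = rng.normal(size=(D_EMB, D_IMG)) / np.sqrt(D_IMG)   # frozen encoder
W_gen = rng.normal(size=(D_IMG, D_EMB)) * 0.01             # trainable generator

def understand(x):
    return W_enc @ x            # dense embedding used as a "text prompt"

def generate(z):
    return W_gen @ z            # reconstruct the image from the embedding

def reca_step(x, lr=0.05):
    """One RecA-style step: reconstruct x from its own understanding
    embedding and update the generator on the mean-squared error."""
    global W_gen
    z = understand(x)
    err = generate(z) - x                    # d(loss)/d(x_hat) up to a constant
    W_gen -= lr * np.outer(err, z) * (2.0 / D_IMG)  # MSE gradient step
    return float(np.mean(err ** 2))

x = rng.normal(size=D_IMG)
losses = [reca_step(x) for _ in range(200)]
print(round(losses[0], 4), round(losses[-1], 6))  # reconstruction loss falls
```

The same self-supervised signal is what realigns understanding and generation in the full model; here it simply fits a linear map.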

Updated: 2025-09-08 23:59:32

Domains: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.07295v1

Multimodal Generative Flows for LHC Jets

Generative modeling of high-energy collisions at the Large Hadron Collider (LHC) offers a data-driven route to simulation, anomaly detection, and other applications. A central challenge lies in the hybrid nature of particle-cloud data: each particle carries continuous kinematic features and discrete quantum numbers such as charge and flavor. We introduce a transformer-based multimodal flow that extends flow matching with a continuous-time Markov jump bridge to jointly model LHC jets across both modalities. Trained on CMS Open Data, our model generates high-fidelity jets with realistic kinematics, jet substructure, and flavor composition.
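The two modalities suggest a two-part generative step, sketched here with toy stand-ins: a flow-matching interpolant for the continuous kinematics and a continuous-time Markov jump for a discrete flavor label. The jump rate, step size, and feature layout are invented for the example, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def flow_matching_pair(x0, x1, t):
    """Linear interpolant x_t and its flow-matching target velocity u_t = x1 - x0."""
    return (1 - t) * x0 + t * x1, x1 - x0

def jump_step(label, dt, rate=2.0, n_classes=3):
    """One Euler step of a continuous-time Markov jump: with probability
    rate*dt the discrete label resamples uniformly (toy transition kernel)."""
    if rng.random() < rate * dt:
        return int(rng.integers(n_classes))
    return label

x0 = rng.normal(size=4)                      # noise sample
x1 = np.array([50.0, 0.1, -0.3, 0.5])        # toy jet kinematics (pT, eta, phi, mass-like)
xt, ut = flow_matching_pair(x0, x1, t=0.5)   # midpoint of the probability path

label, t, dt = 0, 0.0, 0.05                  # evolve the discrete flavor label
while t < 1.0:
    label = jump_step(label, dt)
    t += dt
print(xt.shape, label)
```

A trained model would regress `ut` from `(xt, t)` for the continuous part while learning the jump rates for the discrete part.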

Updated: 2025-09-08 23:56:26

Domains: hep-ph,cs.LG

Download: http://arxiv.org/abs/2509.01736v2

Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control

Modern power grids face unprecedented complexity from Distributed Energy Resources (DERs), Electric Vehicles (EVs), and extreme weather, while also being increasingly exposed to cyberattacks that can trigger grid violations. This paper introduces Grid-Agent, an autonomous AI-driven framework that leverages Large Language Models (LLMs) within a multi-agent system to detect and remediate violations. Grid-Agent integrates semantic reasoning with numerical precision through modular agents: a planning agent generates coordinated action sequences using power flow solvers, while a validation agent ensures stability and safety through sandboxed execution with rollback mechanisms. To enhance scalability, the framework employs an adaptive multi-scale network representation that dynamically adjusts encoding schemes based on system size and complexity. Violations are resolved by optimizing switch configurations, battery deployment, and load curtailment. Our experiments on IEEE and CIGRE benchmark networks, including the IEEE 69-bus, CIGRE MV, and IEEE 30-bus test systems, demonstrate superior mitigation performance, highlighting Grid-Agent's suitability for modern smart grids requiring rapid, adaptive response.
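The planner/validator split can be sketched as a sandbox-then-commit loop: a candidate action is applied to a copy of the grid state, checked against limits, and only committed if no violation remains. The state fields, voltage limits, and actions below are invented for illustration; the real system runs power flow solvers inside the sandbox.

```python
import copy

def validate_and_apply(state, action, v_min=0.95, v_max=1.05):
    """Apply a planner-proposed action in a sandboxed copy; commit it only
    if all bus voltages stay within limits, otherwise roll back."""
    sandbox = copy.deepcopy(state)            # sandboxed execution
    action(sandbox)                           # simulate the proposal
    if all(v_min <= v <= v_max for v in sandbox["bus_voltages"]):
        return sandbox, True                  # commit
    return state, False                       # rollback: original untouched

grid = {"bus_voltages": [1.08, 1.01, 0.99]}   # bus 0 violates the 1.05 limit

def curtail_load(s):                          # toy remediation action
    s["bus_voltages"][0] -= 0.05

def bad_action(s):                            # toy action that creates a violation
    s["bus_voltages"][1] += 0.2

grid, ok = validate_and_apply(grid, curtail_load)   # resolves the violation
grid2, ok2 = validate_and_apply(grid, bad_action)   # rejected, state rolled back
print(ok, ok2, grid["bus_voltages"])
```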

Updated: 2025-09-08 23:53:38

Domains: cs.MA,cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2508.05702v3

zkUnlearner: A Zero-Knowledge Framework for Verifiable Unlearning with Multi-Granularity and Forgery-Resistance

As the demand for exercising the "right to be forgotten" grows, the need for verifiable machine unlearning has become increasingly evident to ensure both transparency and accountability. We present zkUnlearner, the first zero-knowledge framework for verifiable machine unlearning, specifically designed to support multi-granularity and forgery-resistance. First, we propose a general computational model that employs a bit-masking technique to enable the selectivity of existing zero-knowledge proofs of training for gradient descent algorithms. This innovation enables not only traditional sample-level unlearning but also more advanced feature-level and class-level unlearning. Our model can be translated to arithmetic circuits, ensuring compatibility with a broad range of zero-knowledge proof systems. Furthermore, our approach overcomes key limitations of existing methods in both efficiency and privacy. Second, forging attacks present a serious threat to the reliability of unlearning. Specifically, in Stochastic Gradient Descent optimization, gradients from unlearned data, or from minibatches containing it, can be forged using alternative data samples or minibatches that exclude it. We propose the first effective strategies to resist state-of-the-art forging attacks. Finally, we benchmark a zkSNARK-based instantiation of our framework and perform comprehensive performance evaluations to validate its practicality.
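The bit-masking selectivity is easy to illustrate outside any proof system: a 0/1 mask decides which per-sample gradient contributions enter the update, so the same computation can express sample-, feature-, or class-level unlearning. The linear model, data, and mask below are invented; the real framework proves this computation in zero knowledge rather than merely running it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))      # toy training data (4 samples, 3 features)
y = rng.normal(size=4)
w = np.zeros(3)                  # toy linear model

def masked_gradient(w, X, y, sample_mask):
    """Least-squares gradient where a bit mask selects which samples
    contribute; masked-out (unlearned) samples are provably excluded."""
    residual = X @ w - y
    per_sample = residual[:, None] * X            # per-sample gradients
    kept = sample_mask[:, None] * per_sample      # bit-mask drops unlearned rows
    return kept.sum(axis=0) / max(sample_mask.sum(), 1)

full = masked_gradient(w, X, y, np.ones(4))
unlearned = masked_gradient(w, X, y, np.array([1, 1, 0, 1]))  # forget sample 2
print(full.shape, unlearned.shape)
```

A feature-level mask would zero columns of `per_sample` instead of rows, with the same circuit structure.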

Updated: 2025-09-08 23:50:35

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2509.07290v1

Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space

Self-supervised learning (SSL) has emerged as a powerful paradigm for representation learning by optimizing geometric objectives--such as invariance to augmentations, variance preservation, and feature decorrelation--without requiring labels. However, most existing methods operate in Euclidean space, limiting their ability to capture nonlinear dependencies and geometric structures. In this work, we propose Kernel VICReg, a novel self-supervised learning framework that lifts the VICReg objective into a Reproducing Kernel Hilbert Space (RKHS). By kernelizing each term of the loss--variance, invariance, and covariance--we obtain a general formulation that operates on double-centered kernel matrices and Hilbert-Schmidt norms, enabling nonlinear feature learning without explicit mappings. We demonstrate that Kernel VICReg not only avoids representational collapse but also improves performance on tasks with complex or small-scale data. Empirical evaluations across MNIST, CIFAR-10, STL-10, TinyImageNet, and ImageNet100 show consistent gains over Euclidean VICReg, with particularly strong improvements on datasets where nonlinear structures are prominent. UMAP visualizations further confirm that kernel-based embeddings exhibit better isometry and class separation. Our results suggest that kernelizing SSL objectives is a promising direction for bridging classical kernel methods with modern representation learning.
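The double-centered kernel machinery is compact enough to sketch directly. Below, a Gram matrix over a batch of embeddings is double-centered, and variance- and covariance-style quantities are read off it; the RBF kernel, bandwidth, and the exact form of the terms are assumptions for illustration, not the paper's loss.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(8, 5))                   # a batch of embeddings

def rbf_gram(Z, gamma=0.5):
    """RBF Gram matrix K_ij = exp(-gamma * ||z_i - z_j||^2)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def double_center(K):
    """H K H with H = I - (1/n) 11^T: centers the kernel in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

Kc = double_center(rbf_gram(Z))
variance_term = np.trace(Kc) / Kc.shape[0]          # spread in the RKHS
covariance_term = (np.linalg.norm(Kc, "fro") ** 2   # HSIC-style off-diagonal
                   - np.sum(np.diag(Kc) ** 2))      # energy to be penalized
print(round(variance_term, 4), round(covariance_term, 4))
```

After double-centering, every row and column of `Kc` sums to zero, which is what makes these Hilbert-Schmidt-norm quantities well defined without an explicit feature map.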

Updated: 2025-09-08 23:49:21

Domains: stat.ML,cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.07289v1

Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm

With the rapid development of large language models, the potential threat of their malicious use, particularly in generating phishing content, is becoming increasingly prevalent. Leveraging the capabilities of LLMs, malicious users can synthesize phishing emails that are free from spelling mistakes and other easily detectable features. Furthermore, such models can generate topic-specific phishing messages, tailoring content to the target domain and increasing the likelihood of success. Detecting such content remains a significant challenge, as LLM-generated phishing emails often lack clear or distinguishable linguistic features. As a result, most existing semantic-level detection approaches struggle to identify them reliably. While certain LLM-based detection methods have shown promise, they suffer from high computational costs and are constrained by the performance of the underlying language model, making them impractical for large-scale deployment. In this work, we aim to address this issue. We propose Paladin, which embeds trigger-tag associations into vanilla LLMs using various insertion strategies, turning them into instrumented LLMs. When an instrumented LLM generates content related to phishing, it automatically includes detectable tags, enabling easier identification. Based on the design of implicit and explicit triggers and tags, we consider four distinct scenarios in our work. We evaluate our method from three key perspectives: stealthiness, effectiveness, and robustness, and compare it with existing baseline methods. Experimental results show that our method outperforms the baselines, achieving over 90% detection accuracy across all scenarios.
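The trigger-tag paradigm can be illustrated with an explicit-trigger, explicit-tag toy: an "instrumented" generator appends a detectable tag whenever its output matches a phishing trigger, so a downstream detector only has to look for the tag. The trigger phrases and tag string are invented for the example; the paper embeds these associations into the model itself rather than post-processing its output.

```python
# Toy explicit trigger/tag pair (both are illustrative assumptions).
TRIGGERS = {"verify your account", "suspended", "click the link"}
TAG = "\u200b[PHISH-TAG]\u200b"   # e.g. a zero-width-delimited marker

def instrumented_generate(text):
    """Stand-in for an instrumented LLM: emit the tag alongside any
    output that matches a phishing trigger."""
    if any(t in text.lower() for t in TRIGGERS):
        return text + TAG
    return text

def detect(text):
    """Detection reduces to scanning for the tag, not semantic analysis."""
    return TAG in text

out = instrumented_generate("Please verify your account immediately.")
print(detect(out), detect("See you at lunch."))
```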

Updated: 2025-09-08 23:44:00

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2509.07287v1

ALICE: An Interpretable Neural Architecture for Generalization in Substitution Ciphers

We present cryptogram solving as an ideal testbed for studying neural network generalization in combinatorially complex domains. In this task, models must decrypt text encoded with substitution ciphers, choosing from 26! possible mappings without explicit access to the cipher. We develop ALICE (an Architecture for Learning Interpretable Cryptogram dEcipherment): a simple encoder-only Transformer that sets a new state-of-the-art for both accuracy and speed on this decryption problem. Surprisingly, ALICE generalizes to unseen ciphers after training on only ${\sim}1500$ unique ciphers, a minute fraction ($3.7 \times 10^{-24}$) of the possible cipher space. To enhance interpretability, we introduce a novel bijective decoding head that explicitly models permutations via the Gumbel-Sinkhorn method, enabling direct extraction of learned cipher mappings. Through early exit analysis, we reveal how ALICE progressively refines its predictions in a way that appears to mirror common human strategies for this task: early layers employ frequency-based heuristics, middle layers form word structures, and final layers correct individual characters. Our architectural innovations and analysis methods extend beyond cryptograms to any domain with bijective mappings and combinatorial structure, offering new insights into neural network generalization and interpretability.
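The Sinkhorn step behind a Gumbel-Sinkhorn decoding head is short enough to sketch: repeated row/column normalization turns `exp(scores / tau)` into a nearly doubly-stochastic matrix, a differentiable relaxation of a permutation. A 4x4 matrix stands in for the 26x26 cipher mapping; the temperature and iteration count are assumptions.

```python
import numpy as np

def sinkhorn(scores, tau=0.1, n_iters=50):
    """Sinkhorn normalization: alternately normalize rows and columns of
    exp(scores/tau) until it is (approximately) doubly stochastic."""
    P = np.exp(scores / tau)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)   # row normalize
        P /= P.sum(axis=0, keepdims=True)   # column normalize
    return P

rng = np.random.default_rng(4)
scores = rng.normal(size=(4, 4))            # decoder head's letter-to-letter scores
P = sinkhorn(scores)
mapping = P.argmax(axis=1)                  # hard read-out of the learned mapping
print(mapping)
```

Because the relaxed matrix is (near) doubly stochastic, the hard read-out tends toward a bijection, which is what makes the learned cipher mapping directly extractable.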

Updated: 2025-09-08 23:33:53

Domains: cs.LG,cs.AI,cs.CL,cs.CR

Download: http://arxiv.org/abs/2509.07282v1

Learning Generalized Hamiltonian Dynamics with Stability from Noisy Trajectory Data

We introduce a robust framework for learning various generalized Hamiltonian dynamics from noisy, sparse phase-space data in an unsupervised manner, based on variational Bayesian inference. Although conservative, dissipative, and port-Hamiltonian systems may share the same initial total energy of a closed system, it is challenging for a single Hamiltonian network model to capture the distinctive and varying motion dynamics and physics of a phase space from sampled observational phase-space trajectories. To address this Hamiltonian manifold learning challenge, we extend sparse symplectic random-Fourier Gaussian-process learning with successive predictive numerical estimation of the Hamiltonian landscape, using a generalized form of state and conjugate-momentum Hamiltonian dynamics appropriate to the different classes of conservative, dissipative, and port-Hamiltonian physical systems. In addition to the kernelized evidence lower bound (ELBO) loss for data fidelity, we incorporate stability and conservation constraints as additional hyperparameter-balanced loss terms to regularize the model's multi-gradients, enforcing physical correctness for improved prediction accuracy with bounded uncertainty.
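The generalized dynamics the model fits can be written as dx/dt = (J - R) grad H, where J is skew-symmetric (the conservative, symplectic part) and R is positive semi-definite (dissipation); port-Hamiltonian systems add input ports to the same form. A toy damped oscillator with hand-picked H, J, and R (all illustrative) shows the dissipative case:

```python
import numpy as np

def grad_H(x):                 # H(q, p) = 0.5 * (q^2 + p^2), a toy Hamiltonian
    return x

J = np.array([[0.0, 1.0], [-1.0, 0.0]])     # skew-symmetric symplectic structure
R = np.array([[0.0, 0.0], [0.0, 0.1]])      # PSD dissipation, damping the momentum

def step(x, dt=0.01):
    """Explicit Euler step of dx/dt = (J - R) grad H(x)."""
    return x + dt * ((J - R) @ grad_H(x))

x = np.array([1.0, 0.0])       # start at q=1, p=0
H0 = 0.5 * x @ x
for _ in range(1000):          # integrate to t = 10
    x = step(x)
H1 = 0.5 * x @ x
print(H1 < H0)                 # energy decays under dissipation
```

With R = 0 the same step conserves H up to integrator error; a conservation constraint in the loss penalizes exactly that drift.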

Updated: 2025-09-08 23:29:04

Domains: cs.LG

Download: http://arxiv.org/abs/2509.07280v1

Breast Cancer Detection in Thermographic Images via Diffusion-Based Augmentation and Nonlinear Feature Fusion

Data scarcity hinders deep learning for medical imaging. We propose a framework for breast cancer classification in thermograms that addresses this using a Diffusion Probabilistic Model (DPM) for data augmentation. Our DPM-based augmentation is shown to be superior to both traditional methods and a ProGAN baseline. The framework fuses deep features from a pre-trained ResNet-50 with handcrafted nonlinear features (e.g., Fractal Dimension) derived from U-Net segmented tumors. An XGBoost classifier trained on these fused features achieves 98.0% accuracy and 98.1% sensitivity. Ablation studies and statistical tests confirm that both the DPM augmentation and the nonlinear feature fusion are critical, statistically significant components of this success. This work validates the synergy between advanced generative models and interpretable features for creating highly accurate medical diagnostic tools.
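The handcrafted side of the fusion can be sketched with a box-counting estimate of fractal dimension on a binary tumor mask, concatenated with (stand-in) deep features. The mask, the box sizes, and the random "deep features" are synthetic placeholders; the paper's exact fractal estimator is not specified in the abstract.

```python
import numpy as np

def box_count_dimension(mask, sizes=(1, 2, 4)):
    """Box-counting estimate: slope of log(count) vs log(1/size), where
    count is the number of boxes of each size touching the shape."""
    counts = []
    for s in sizes:
        h, w = mask.shape
        grid = mask[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s)
        counts.append(grid.any(axis=(1, 3)).sum())
    coeffs = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return coeffs[0]

mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True                      # filled square: dimension = 2
deep_features = np.random.default_rng(5).normal(size=32)   # ResNet-50 stand-in
fused = np.concatenate([deep_features, [box_count_dimension(mask)]])
print(fused.shape)
```

The fused vector is what a gradient-boosted classifier (XGBoost in the paper) would be trained on.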

Updated: 2025-09-08 23:19:38

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.07277v1

Quantum Advantage via Solving Multivariate Polynomials

In this work, we propose a new way to (non-interactively, verifiably) demonstrate quantum advantage by solving the average-case $\mathsf{NP}$ search problem of finding a solution to a system of (underdetermined) constant degree multivariate equations over the finite field $\mathbb{F}_2$ drawn from a specified distribution. In particular, for any $d \geq 2$, we design a distribution of degree up to $d$ polynomials $\{p_i(x_1,\ldots,x_n)\}_{i\in [m]}$ for $m<n$ over $\mathbb{F}_2$ for which we show that there is an expected polynomial-time quantum algorithm that provably simultaneously solves $\{p_i(x_1,\ldots,x_n)=y_i\}_{i\in [m]}$ for a random vector $(y_1,\ldots,y_m)$. On the other hand, while solutions exist with high probability, we conjecture, based on a thorough review of existing classical cryptanalysis, that for constant $d > 2$ it is classically hard to find one. Our work thus posits that degree three functions are enough to instantiate the random oracle to obtain non-relativized quantum advantage. Our approach begins with the breakthrough Yamakawa-Zhandry (FOCS 2022) quantum algorithmic framework. In our work, we demonstrate that this quantum algorithmic framework extends to the setting of multivariate polynomial systems. Our key technical contribution is a new analysis on the Fourier spectra of distributions induced by a general family of distributions over $\mathbb{F}_2$ multivariate polynomials -- those that satisfy $2$-wise independence and shift-invariance. This family of distributions includes the distribution of uniform random degree at most $d$ polynomials for any constant $d \geq 2$. Our analysis opens up potentially new directions for quantum cryptanalysis of other multivariate systems.
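A tiny instance of the search problem makes the setup concrete: random degree-at-most-2 polynomials over F_2 in n variables with m < n equations, solved here by brute force over all 2^n assignments (exactly the exponential search the quantum algorithm avoids). The parameters and monomial sampling are illustrative, not the paper's distribution; a solution is planted so the toy instance is guaranteed solvable.

```python
import itertools
import random

random.seed(0)
n, m, d = 6, 4, 2              # toy sizes (the paper uses m < n, constant d)

def random_poly(n, d):
    """A polynomial over F_2 as a set of monomials (tuples of variable indices);
    the empty tuple is the constant term. Each monomial kept with prob 1/2."""
    monomials = [()] + [t for k in range(1, d + 1)
                        for t in itertools.combinations(range(n), k)]
    return {t for t in monomials if random.random() < 0.5}

def evaluate(poly, x):
    """Evaluate over F_2: XOR (sum mod 2) of the included monomials."""
    return sum(all(x[i] for i in t) for t in poly) % 2

system = [random_poly(n, d) for _ in range(m)]
planted = tuple(random.randrange(2) for _ in range(n))
y = [evaluate(p, planted) for p in system]      # targets with a planted solution

solution = next(x for x in itertools.product([0, 1], repeat=n)
                if all(evaluate(p, x) == yi for p, yi in zip(system, y)))
print(solution)
```

Verification is cheap (evaluate m polynomials), which is what makes the advantage demonstration non-interactively verifiable.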

Updated: 2025-09-08 23:19:20

Domains: quant-ph,cs.CR

Download: http://arxiv.org/abs/2509.07276v1

LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

Migration has been a core topic in German political debate, from millions of expellees post World War II over labor migration to refugee movements in the recent past. Studying political speech regarding such wide-ranging phenomena in depth traditionally required extensive manual annotations, limiting the scope of analysis to small subsets of the data. Large language models (LLMs) have the potential to partially automate even complex annotation tasks. We provide an extensive evaluation of multiple LLMs in annotating (anti-)solidarity subtypes in German parliamentary debates, compared against a large set of thousands of human reference annotations (gathered over a year). We evaluate the influence of model size, prompting differences, fine-tuning, and historical versus contemporary data, and we investigate systematic errors. Beyond methodological evaluation, we also interpret the resulting annotations from a social science lens, gaining deeper insight into (anti-)solidarity trends towards migrants in the German post-World War II period and recent past. Our data reveals a high degree of migrant-directed solidarity in the postwar period, as well as a strong trend towards anti-solidarity in the German parliament since 2015, motivating further research. These findings highlight the promise of LLMs for political text analysis and the importance of migration debates in Germany, where demographic decline and labor shortages coexist with rising polarization.

Updated: 2025-09-08 23:16:03

Domains: cs.CL,cs.CY,cs.LG

Download: http://arxiv.org/abs/2509.07274v1

CardioComposer: Flexible and Compositional Anatomical Structure Generation with Disentangled Geometric Guidance

Generative models of 3D anatomy, when integrated with biophysical simulators, enable the study of structure-function relationships for clinical research and medical device design. However, current models face a trade-off between controllability and anatomical realism. We propose a programmable and compositional framework for guiding unconditional diffusion models of human anatomy using interpretable ellipsoidal primitives embedded in 3D space. Our method selects certain tissues within multi-tissue segmentation maps and applies geometric moment losses to them to guide the reverse diffusion process. This framework supports independent control over size, shape, and position, as well as the composition of multi-component constraints during inference.
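The "geometric moment" targets are the classical ones: the centroid of a selected tissue mask captures position, and its second central moments capture size and shape (the ellipsoid). A 2D toy mask below illustrates what a moment loss would compare against a user-specified ellipse; the mask and any guidance weights are assumptions.

```python
import numpy as np

def geometric_moments(mask):
    """First moment (centroid = position) and second central moments
    (covariance = size/shape ellipse) of a binary mask."""
    coords = np.argwhere(mask).astype(float)
    centroid = coords.mean(axis=0)
    centered = coords - centroid
    cov = centered.T @ centered / len(coords)
    return centroid, cov

mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 8:24] = True                     # an axis-aligned block of tissue
centroid, cov = geometric_moments(mask)
print(centroid)                              # [14.5 15.5]
```

A guidance loss would penalize the squared difference between these moments and the target ellipsoid's, pushing the reverse diffusion toward the desired geometry.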

Updated: 2025-09-08 23:08:23

Domains: eess.IV,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.08015v1

Datasets for Navigating Sensitive Topics in Recommendation Systems

Personalized AI systems, from recommendation systems to chatbots, are a prevalent method for distributing content to users based on their learned preferences. However, there is growing concern about the adverse effects of these systems, including their potential tendency to expose users to sensitive or harmful material, negatively impacting overall well-being. To address this concern quantitatively, it is necessary to create datasets with relevant sensitivity labels for content, enabling researchers to evaluate personalized systems beyond mere engagement metrics. To this end, we introduce two novel datasets that include a taxonomy of sensitivity labels alongside user-content ratings: one that integrates MovieLens rating data with content warnings from the Does the Dog Die? community ratings website, and another that combines fan-fiction interaction data and user-generated warnings from Archive of Our Own.

Updated: 2025-09-08 22:58:17

Domains: cs.IR,cs.AI

Download: http://arxiv.org/abs/2509.07269v1

HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring

Mobile and wearable healthcare monitoring plays a vital role in facilitating timely interventions, managing chronic health conditions, and ultimately improving individuals' quality of life. Previous studies on large language models (LLMs) have highlighted their impressive generalization abilities and effectiveness in healthcare prediction tasks. However, most LLM-based healthcare solutions are cloud-based, which raises significant privacy concerns and results in increased memory usage and latency. To address these challenges, there is growing interest in Small Language Models (SLMs): compact, lightweight models designed to run locally and efficiently on mobile and wearable devices. Nevertheless, how well these models perform in healthcare prediction remains largely unexplored. We systematically evaluated SLMs on health prediction tasks using zero-shot, few-shot, and instruction fine-tuning approaches, and deployed the best performing fine-tuned SLMs on mobile devices to evaluate their real-world efficiency and predictive performance in practical healthcare scenarios. Our results show that SLMs can achieve performance comparable to LLMs while offering substantial gains in efficiency and privacy. However, challenges remain, particularly in handling class imbalance and few-shot scenarios. These findings highlight SLMs, though imperfect in their current form, as a promising solution for next-generation, privacy-preserving healthcare monitoring.

Updated: 2025-09-08 22:36:19

Domains: cs.AI,cs.HC,cs.LG

Download: http://arxiv.org/abs/2509.07260v1

Scalable Autoregressive 3D Molecule Generation

Generative models of 3D molecular structure play a rapidly growing role in the design and simulation of molecules. Diffusion models currently dominate the space of 3D molecule generation, while autoregressive models have trailed behind. In this work, we present Quetzal, a simple but scalable autoregressive model that builds molecules atom-by-atom in 3D. Treating each molecule as an ordered sequence of atoms, Quetzal combines a causal transformer that predicts the next atom's discrete type with a smaller Diffusion MLP that models the continuous next-position distribution. Compared to existing autoregressive baselines, Quetzal achieves substantial improvements in generation quality and is competitive with the performance of state-of-the-art diffusion models. In addition, by reducing the number of expensive forward passes through a dense transformer, Quetzal enables significantly faster generation speed, as well as exact divergence-based likelihood computation. Finally, without any architectural changes, Quetzal natively handles variable-size tasks like hydrogen decoration and scaffold completion. We hope that our work motivates a perspective on scalability and generality for generative modelling of 3D molecules.
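The factorization Quetzal uses is easy to sketch structurally: at each step, a causal model scores the next atom's discrete type, then a conditional sampler draws its continuous 3D position. The stubs below stand in for the transformer and the Diffusion MLP; the type vocabulary, stopping heuristic, and position sampler are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
TYPES = ["C", "N", "O", "STOP"]              # toy atom vocabulary + stop token

def next_type_logits(history):
    """Stand-in for the causal transformer's next-type head."""
    logits = rng.normal(size=len(TYPES))
    logits[TYPES.index("STOP")] += 0.3 * len(history)   # longer -> likelier stop
    return logits

def sample_position(history):
    """Stand-in for the Diffusion MLP's continuous next-position sample."""
    base = history[-1][1] if history else np.zeros(3)
    return base + rng.normal(scale=0.5, size=3)         # near the last atom

def generate(max_atoms=10):
    mol = []
    for _ in range(max_atoms):
        logits = next_type_logits(mol)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        t = TYPES[rng.choice(len(TYPES), p=probs)]
        if t == "STOP":
            break
        mol.append((t, sample_position(mol)))
    return mol

mol = generate()
print(len(mol), [t for t, _ in mol])
```

Because each step is an ordinary autoregressive factor, the exact likelihood is the product of the discrete type probabilities and the continuous position densities, which is what enables the paper's exact likelihood computation.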

Updated: 2025-09-08 22:33:04

Domains: cs.LG,physics.chem-ph

Download: http://arxiv.org/abs/2505.13791v2

Automatically Detecting Online Deceptive Patterns

Deceptive patterns in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous on various digital platforms. While efforts to mitigate deceptive patterns have emerged from legal and technical perspectives, a significant gap remains in creating usable and scalable solutions. We introduce our AutoBot framework to address this gap and help web stakeholders navigate and mitigate online deceptive patterns. AutoBot accurately identifies and localizes deceptive patterns from a screenshot of a website without relying on the underlying HTML code. AutoBot employs a two-stage pipeline that leverages the capabilities of specialized vision models to analyze website screenshots, identify interactive elements, and extract textual features. Next, using a large language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We also use AutoBot to create a synthetic dataset to distill knowledge from 'teacher' LLMs to smaller language models. Through extensive evaluation, we demonstrate AutoBot's effectiveness in detecting deceptive patterns on the web, achieving an F1-score of 0.93, underscoring its potential as an essential tool for mitigating online deceptive patterns. We implement AutoBot across three downstream applications targeting different web stakeholders: (1) a local browser extension providing users with real-time feedback, (2) a Lighthouse audit to inform developers of potential deceptive patterns on their sites, and (3) a measurement tool designed for researchers and regulators.

Updated: 2025-09-08 22:22:58

Domains: cs.HC,cs.AI,cs.CY

Download: http://arxiv.org/abs/2411.07441v3

COMMA: A Communicative Multimodal Multi-Agent Benchmark

Rapid advances in multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain-of-thought reasoning models, such as R1-Onevision and LLaVA-CoT, struggle to outperform even a random baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.

Updated: 2025-09-08 22:13:02

Categories: cs.AI

Download: http://arxiv.org/abs/2410.07553v3

Benchmarking Information Retrieval Models on Complex Retrieval Tasks

Large language models (LLMs) are incredible and versatile tools for text-based tasks that have enabled countless, previously unimaginable, applications. Retrieval models, in contrast, have not yet seen such capable general-purpose models emerge. To achieve this goal, retrieval models must be able to perform complex retrieval tasks, where queries contain multiple parts, constraints, or requirements in natural language. These tasks represent a natural progression from the simple, single-aspect queries that are used in the vast majority of existing, commonly used evaluation sets. Complex queries naturally arise as people expect search systems to handle more specific and often ambitious information requests, as is demonstrated by how people use LLM-based information systems. Despite the growing desire for retrieval models to expand their capabilities in complex retrieval tasks, there exist limited resources to assess the ability of retrieval models on a comprehensive set of diverse complex tasks. The few resources that do exist feature a limited scope and often lack realistic settings making it hard to know the true capabilities of retrieval models on complex real-world retrieval tasks. To address this shortcoming and spur innovation in next-generation retrieval models, we construct a diverse and realistic set of complex retrieval tasks and benchmark a representative set of state-of-the-art retrieval models. Additionally, we explore the impact of LLM-based query expansion and rewriting on retrieval quality. Our results show that even the best models struggle to produce high-quality retrieval results with the highest average nDCG@10 of only 0.346 and R@100 of only 0.587 across all tasks. Although LLM augmentation can help weaker models, the strongest model has decreased performance across all metrics with all rewriting techniques.
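The benchmark's headline numbers use nDCG@10, which can be computed from a ranked list of graded relevance labels as follows (standard log2 discounting; a minimal reference implementation, not the paper's evaluation code):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal (sorted) DCG."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfect ranking scores 1.0; swapping a relevant and an irrelevant result pushes the score below 1.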

Updated: 2025-09-08 22:11:10

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.07253v1

GCond: Gradient Conflict Resolution via Accumulation-based Stabilization for Large-Scale Multi-Task Learning

In multi-task learning (MTL), gradient conflict poses a significant challenge. Effective methods for addressing this problem, including PCGrad, CAGrad, and GradNorm, in their original implementations are computationally demanding, which significantly limits their application in modern large models and transformers. We propose Gradient Conductor (GCond), a method that builds upon PCGrad principles by combining them with gradient accumulation and an adaptive arbitration mechanism. We evaluated GCond on self-supervised learning tasks using MobileNetV3-Small and ConvNeXt architectures on the ImageNet 1K dataset and a combined head and neck CT scan dataset, comparing the proposed method against baseline linear combinations and state-of-the-art gradient conflict resolution methods. The stochastic mode of GCond achieved a two-fold computational speedup while maintaining optimization quality, and demonstrated superior performance across all evaluated metrics, achieving lower L1 and SSIM losses compared to other methods on both datasets. GCond exhibited high scalability, being successfully applied to both compact models (MobileNetV3-Small) and large architectures (ConvNeXt-tiny and ConvNeXt-Base). It also showed compatibility with modern optimizers such as AdamW and Lion/LARS. Therefore, GCond offers a scalable and efficient solution to the problem of gradient conflicts in multi-task learning.
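GCond builds on PCGrad, whose core idea is simple: when two task gradients conflict (negative dot product), project one onto the normal plane of the other before combining. The sketch below shows only that PCGrad core, not GCond's accumulation or arbitration mechanism:

```python
import numpy as np

def pcgrad_combine(grads):
    """PCGrad-style combination: for each task gradient, remove the
    component that opposes any conflicting task gradient, then average."""
    projected = []
    for i, g in enumerate(grads):
        g = g.astype(float).copy()
        for j, h in enumerate(grads):
            dot = float(g @ h)
            if i != j and dot < 0:          # conflict: drop opposing component
                g -= dot / float(h @ h) * h
        projected.append(g)
    return np.mean(projected, axis=0)
```

For two conflicting gradients [1, 0] and [-1, 1], the projections become [0.5, 0.5] and [0, 1], whose average no longer opposes either task.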

Updated: 2025-09-08 22:02:22

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2509.07252v1

Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM

In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
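The reported 8.8% Equal Error Rate is a standard speaker-verification metric: the operating point where the false-accept and false-reject rates cross. A minimal threshold-sweep approximation (the x-vector scoring itself is assumed to produce the similarity scores):

```python
import numpy as np

def equal_error_rate(scores_same, scores_diff):
    """Approximate EER: sweep thresholds over all scores and take the point
    where the larger of FAR and FRR is smallest (the FAR/FRR crossing)."""
    best = 1.0
    for t in np.sort(np.concatenate([scores_same, scores_diff])):
        far = float(np.mean(scores_diff >= t))   # impostor pairs accepted
        frr = float(np.mean(scores_same < t))    # genuine pairs rejected
        best = min(best, max(far, frr))
    return best
```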

Updated: 2025-09-08 21:49:33

Categories: cs.CL,cs.AI,cs.SD,eess.AS

Download: http://arxiv.org/abs/2508.04795v2

IP-Basis PINNs: Efficient Multi-Query Inverse Parameter Estimation

Solving inverse problems with Physics-Informed Neural Networks (PINNs) is computationally expensive for multi-query scenarios, as each new set of observed data requires a new, expensive training procedure. We present Inverse-Parameter Basis PINNs (IP-Basis PINNs), a meta-learning framework that extends the foundational work of Desai et al. (2022) to enable rapid and efficient inference for inverse problems. Our method employs an offline-online decomposition: a deep network is first trained offline to produce a rich set of basis functions that span the solution space of a parametric differential equation. For each new inverse problem online, this network is frozen, and solutions and parameters are inferred by training only a lightweight linear output layer against observed data. Key innovations that make our approach effective for inverse problems include: (1) a novel online loss formulation for simultaneous solution reconstruction and parameter identification, (2) a significant reduction in computational overhead via forward-mode automatic differentiation for PDE loss evaluation, and (3) a non-trivial validation and early-stopping mechanism for robust offline training. We demonstrate the efficacy of IP-Basis PINNs on three diverse benchmarks, including an extension to universal PINNs for unknown functional terms-showing consistent performance across constant and functional parameter estimation, a significant speedup per query over standard PINNs, and robust operation with scarce and noisy data.
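The offline-online split can be illustrated with a toy stand-in: pretend the frozen "basis network" outputs a fixed set of nonlinear features, and the online stage fits only the linear output layer to observed data by least squares. The specific basis and synthetic observations below are assumptions for illustration, not the paper's setup:

```python
import numpy as np

def basis(x):
    """Stand-in for the frozen offline network: fixed nonlinear features."""
    return np.stack([np.sin(x), np.cos(x), x, np.ones_like(x)], axis=1)

def fit_online(x_obs, u_obs):
    """Online stage: train only the lightweight linear output layer."""
    w, *_ = np.linalg.lstsq(basis(x_obs), u_obs, rcond=None)
    return w

x = np.linspace(0.0, 1.0, 50)
u = 2.0 * np.sin(x) + 0.5          # synthetic "observations"
w = fit_online(x, u)               # recovers the coefficients [2, 0, 0, 0.5]
```

Because the expensive network is frozen, each new inverse problem costs only one small least-squares solve.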

Updated: 2025-09-08 21:43:41

Categories: cs.LG

Download: http://arxiv.org/abs/2509.07245v1

Explainable Metrics for the Assessment of Neurodegenerative Diseases through Handwriting Analysis

Motor dysfunction is a common sign of neurodegenerative diseases (NDs) such as Parkinson's disease (PD) and Alzheimer's disease (AD), but may be difficult to detect, especially in the early stages. In this work, we examine the behavior of a wide array of explainable metrics extracted from the handwriting signals of 113 subjects performing multiple tasks on a digital tablet, as part of the Neurological Signals dataset. The aim is to measure their effectiveness in characterizing NDs, including AD and PD. To this end, task-agnostic and task-specific metrics are extracted from 14 distinct tasks. Subsequently, through statistical analysis and a series of classification experiments, we investigate which metrics provide greater discriminative power between NDs and healthy controls and amongst different NDs. Preliminary results indicate that the tasks at hand can all be effectively leveraged to distinguish among the considered NDs, specifically through handcrafted explainable metrics that measure stability, writing speed, time spent not writing, and pressure variations between groups, which show p-values lower than 0.0001 for multiple tasks. Using various binary classification algorithms on the computed metrics, we obtain up to 87% accuracy for the discrimination between AD and healthy controls (CTL), and up to 69% for the discrimination between PD and CTL.
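Metrics like writing speed, in-air time, and pressure variation can be computed directly from tablet samples of time, position, and pressure. A minimal sketch, assuming a segment counts as on-paper when the pressure at its endpoint is positive (the dataset's actual metric definitions may differ):

```python
import numpy as np

def handwriting_metrics(t, x, y, pressure):
    """Speed while writing, total in-air time, and on-paper pressure spread."""
    dt = np.diff(t)
    dist = np.hypot(np.diff(x), np.diff(y))
    on_paper = pressure[1:] > 0                       # per-segment flag
    speed = dist[on_paper].sum() / dt[on_paper].sum()
    time_in_air = dt[~on_paper].sum()
    pressure_std = float(np.std(pressure[pressure > 0]))
    return speed, time_in_air, pressure_std
```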

Updated: 2025-09-08 21:36:03

Categories: q-bio.NC,cs.LG

Download: http://arxiv.org/abs/2409.08303v3

CoMMIT: Coordinated Multimodal Instruction Tuning

Instruction tuning in multimodal large language models (MLLMs) generally involves cooperative learning between a backbone LLM and a feature encoder of non-text input modalities. The major challenge is how to efficiently find the synergy between the two modules so that LLMs can adapt their reasoning abilities to downstream tasks while feature encoders can adjust to provide more task-specific information about its modality. In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives, where we find the unbalanced learning between the feature encoder and the LLM can cause problems of oscillation and biased learning that lead to sub-optimal convergence. Inspired by our findings, we propose a Multimodal Balance Coefficient that enables quantitative measurement of the balance of learning. Based on this, we further design a dynamic learning scheduler that better coordinates the learning between the LLM and feature encoder, alleviating the problems of oscillation and biased learning. In addition, we introduce an auxiliary regularization on the gradient to promote updating with larger step sizes, which potentially allows for a more accurate estimation of the proposed MultiModal Balance Coefficient and further improves the training sufficiency. Our proposed approach is agnostic to the architecture of LLM and feature encoder, so it can be generically integrated with various MLLMs. We conduct experiments on multiple downstream tasks with various MLLMs, demonstrating that the proposed method is more effective than the baselines in MLLM instruction tuning.
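One simple way to quantify the learning balance the abstract describes is a ratio of gradient norms between the two modules, with a scheduler that damps whichever side is learning faster. This is an illustrative stand-in for the paper's Multimodal Balance Coefficient and dynamic scheduler, not their actual definitions:

```python
import numpy as np

def balance_coefficient(grad_llm, grad_encoder, eps=1e-8):
    """Toy balance measure: encoder-to-LLM gradient-norm ratio (1.0 = balanced)."""
    return np.linalg.norm(grad_encoder) / (np.linalg.norm(grad_llm) + eps)

def scheduled_lrs(base_lr, coeff):
    """Scheduler sketch: scale the faster-learning module's rate toward parity."""
    lr_encoder = base_lr / coeff if coeff > 1.0 else base_lr
    lr_llm = base_lr * coeff if coeff < 1.0 else base_lr
    return lr_llm, lr_encoder
```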

Updated: 2025-09-08 21:35:19

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2407.20454v2

Systematic Optimization of Open Source Large Language Models for Mathematical Reasoning

This paper presents a practical investigation into fine-tuning model parameters for mathematical reasoning tasks. Through experiments with various configurations, including randomness control, reasoning depth, and sampling strategies, careful tuning yields substantial improvements in both efficiency and performance. A holistically optimized framework is introduced for five state-of-the-art models on mathematical reasoning tasks, exhibiting significant performance boosts while maintaining solution correctness. Through systematic parameter optimization across Qwen2.5-72B, Llama-3.1-70B, DeepSeek-V3, Mixtral-8x22B, and Yi-Lightning, consistent efficiency gains are demonstrated with a 100% optimization success rate. The methodology achieves an average 29.4% reduction in computational cost and a 23.9% improvement in inference speed across all tested models. The framework systematically searches parameter spaces including temperature (0.1-0.5), reasoning steps (4-12), planning periods (1-4), and nucleus sampling (0.85-0.98), determining optimal configurations through testing on mathematical reasoning benchmarks. Critical findings show that lower temperature regimes (0.1-0.4) and reduced reasoning steps (4-6) consistently enhance efficiency without compromising accuracy. DeepSeek-V3 achieves the highest accuracy at 98%, while Mixtral-8x22B delivers the most cost-effective performance at 361.5 tokens per accurate response. Key contributions include: (1) the first comprehensive optimization study for five diverse SOTA models in mathematical reasoning, (2) a standardized production-oriented parameter optimization framework, (3) discovery of universal optimization trends applicable across model architectures, and (4) production-ready configurations with extensive performance characterization.
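The selection logic behind such a search can be sketched as: evaluate each configuration, then pick the cheapest one that still meets an accuracy floor. The result numbers below are hypothetical placeholders, not the paper's measurements:

```python
# Hypothetical evaluation results: (temperature, reasoning_steps) -> (accuracy, tokens)
results = {
    (0.1, 4): (0.95, 320), (0.1, 8): (0.95, 610),
    (0.4, 4): (0.94, 340), (0.4, 8): (0.96, 650),
    (0.5, 12): (0.96, 900),
}

def best_config(results, min_acc=0.94):
    """Pick the cheapest (fewest tokens) configuration meeting the accuracy floor."""
    ok = {cfg: v for cfg, v in results.items() if v[0] >= min_acc}
    return min(ok, key=lambda cfg: ok[cfg][1])
```

On these placeholder numbers the search lands on low temperature and few reasoning steps, matching the trend the paper reports.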

Updated: 2025-09-08 21:31:43

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.07238v1

Breaking the Conventional Forward-Backward Tie in Neural Networks: Activation Functions

Gradient-based neural network training traditionally enforces symmetry between forward and backward propagation, requiring activation functions to be differentiable (or sub-differentiable) and strictly monotonic in certain regions to prevent flat gradient areas. This symmetry, linking forward activations closely to backward gradients, significantly restricts the selection of activation functions, particularly excluding those with substantial flat or non-differentiable regions. In this paper, we challenge this assumption through mathematical analysis, demonstrating that precise gradient magnitudes derived from activation functions are largely redundant, provided the gradient direction is preserved. Empirical experiments conducted on foundational architectures - such as Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Binary Neural Networks (BNNs) - confirm that relaxing forward-backward symmetry and substituting traditional gradients with simpler or stochastic alternatives does not impair learning and may even enhance training stability and efficiency. We explicitly demonstrate that neural networks with flat or non-differentiable activation functions, such as the Heaviside step function, can be effectively trained, thereby expanding design flexibility and computational efficiency. Further empirical validation with more complex architectures remains a valuable direction for future research.
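The claim that direction matters more than magnitude can be demonstrated with a straight-through-style surrogate: a forward pass through the Heaviside step (whose true gradient is zero almost everywhere) trained with a backward pass that pretends the step has derivative 1. This toy one-neuron example is a sketch under that assumption, not the paper's experimental setup:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
y = (x > 0.3).astype(float)                # target: a step at 0.3

w, b = 1.0, 0.0
for _ in range(500):
    pred = (w * x + b > 0).astype(float)   # Heaviside forward pass: flat gradient
    err = pred - y
    # Straight-through backward pass: treat the step as identity, keeping
    # only the direction of the chain-rule gradient.
    w -= 0.1 * float(np.mean(err * x))
    b -= 0.1 * float(np.mean(err))

accuracy = float(np.mean((w * x + b > 0).astype(float) == y))
```

Despite the activation being non-differentiable and flat, the decision boundary migrates to the target threshold.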

Updated: 2025-09-08 21:30:00

Categories: cs.NE,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.07236v1

Personalized Attacks of Social Engineering in Multi-turn Conversations: LLM Agents for Simulation and Detection

The rapid advancement of conversational agents, particularly chatbots powered by Large Language Models (LLMs), poses a significant risk of social engineering (SE) attacks on social media platforms. SE detection in multi-turn, chat-based interactions is considerably more complex than single-instance detection due to the dynamic nature of these conversations. A critical factor in mitigating this threat is understanding the SE attack mechanisms through which SE attacks operate, specifically how attackers exploit vulnerabilities and how victims' personality traits contribute to their susceptibility. In this work, we propose an LLM-agentic framework, SE-VSim, to simulate SE attack mechanisms by generating multi-turn conversations. We model victim agents with varying personality traits to assess how psychological profiles influence susceptibility to manipulation. Using a dataset of over 1000 simulated conversations, we examine attack scenarios in which adversaries, posing as recruiters, funding agencies, and journalists, attempt to extract sensitive information. Based on this analysis, we present a proof of concept, SE-OmniGuard, to offer personalized protection to users by leveraging prior knowledge of the victims personality, evaluating attack strategies, and monitoring information exchanges in conversations to identify potential SE attempts.

Updated: 2025-09-08 21:16:17

Categories: cs.CR,cs.CL

Download: http://arxiv.org/abs/2503.15552v2

A transformer-based generative model for planetary systems

Numerical calculations of planetary system formation are very demanding in terms of computing power. These synthetic planetary systems can however provide access to correlations, as predicted in a given numerical framework, between the properties of planets in the same system. Such correlations can, in return, be used in order to guide and prioritize observational campaigns aiming at discovering some types of planets, as Earth-like planets. Our goal is to develop a generative model which is capable of capturing correlations and statistical relationships between planets in the same system. Such a model, trained on the Bern model, offers the possibility to generate large number of synthetic planetary systems with little computational cost, that can be used, for example, to guide observational campaigns. Our generative model is based on the transformer architecture which is well-known to efficiently capture correlations in sequences and is at the basis of all modern Large Language Models. To assess the validity of the generative model, we perform visual and statistical comparisons, as well as a machine learning driven tests. Finally, as a use case example, we consider the TOI-469 system, in which we aim at predicting the possible properties of planets c and d, based on the properties of planet b (the first that has been detected). We show using different comparison methods that the properties of systems generated by our model are very similar to the ones of the systems computed directly by the Bern model. We also show in the case of the TOI-469 system, that using the generative model allows to predict the properties of planets not yet observed, based on the properties of the already observed planet. We provide our model to the community on our website www.ai4exoplanets.com.

Updated: 2025-09-08 21:09:14

Categories: astro-ph.EP,astro-ph.IM,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.07226v1

All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching

Our team, All You Need Is A Fuzzing Brain, was one of seven finalists in DARPA's Artificial Intelligence Cyber Challenge (AIxCC), placing fourth in the final round. During the competition, we developed a Cyber Reasoning System (CRS) that autonomously discovered 28 security vulnerabilities - including six previously unknown zero-days - in real-world open-source C and Java projects, and successfully patched 14 of them. The complete CRS is open source at https://github.com/o2lab/afc-crs-all-you-need-is-a-fuzzing-brain. This paper provides a detailed technical description of our CRS, with an emphasis on its LLM-powered components and strategies. Building on AIxCC, we further introduce a public leaderboard for benchmarking state-of-the-art LLMs on vulnerability detection and patching tasks, derived from the AIxCC dataset. The leaderboard is available at https://o2lab.github.io/FuzzingBrain-Leaderboard/.

Updated: 2025-09-08 21:08:01

Categories: cs.CR

Download: http://arxiv.org/abs/2509.07225v1

Explaining How Quantization Disparately Skews a Model

Post Training Quantization (PTQ) is widely adopted due to its high compression capacity and speed with minimal impact on accuracy. However, we observed that disparate impacts are exacerbated by quantization, especially for minority groups. Our analysis explains that in the course of quantization there is a chain of factors attributed to a disparate impact across groups during forward and backward passes. We explore how the changes in weights and activations induced by quantization cause cascaded impacts in the network, resulting in logits with lower variance, increased loss, and compromised group accuracies. We extend our study to verify the influence of these impacts on group gradient norms and eigenvalues of the Hessian matrix, providing insights into the state of the network from an optimization point of view. To mitigate these effects, we propose integrating mixed precision Quantization Aware Training (QAT) with dataset sampling methods and weighted loss functions, therefore providing fair deployment of quantized neural networks.
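The PTQ operator the paper studies can be reproduced in a few lines, and its cascading effect on logits measured directly: error in the logits grows as bit width shrinks. This shows the quantizer and the error cascade only, not the paper's group-disparity analysis:

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric post-training quantization of a weight matrix."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 16))              # toy weight matrix
x = rng.normal(size=(256, 64))             # toy batch of activations

logits = x @ w
# Mean absolute logit deviation at each precision: the perturbation in the
# weights cascades into the network's outputs, growing as bits decrease.
err = {b: float(np.mean(np.abs(logits - x @ quantize(w, b)))) for b in (2, 4, 8)}
```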

Updated: 2025-09-08 21:04:16

Categories: cs.LG,cs.AI,cs.CY

Download: http://arxiv.org/abs/2509.07222v1

OmniAcc: Personalized Accessibility Assistant Using Generative AI

Individuals with ambulatory disabilities often encounter significant barriers when navigating urban environments due to the lack of accessible information and tools. This paper presents OmniAcc, an AI-powered interactive navigation system that utilizes GPT-4, satellite imagery, and OpenStreetMap data to identify, classify, and map wheelchair-accessible features such as ramps and crosswalks in the built environment. OmniAcc offers personalized route planning, real-time hands-free navigation, and instant query responses regarding physical accessibility. By using zero-shot learning and customized prompts, the system ensures precise detection of accessibility features, while supporting validation through structured workflows. This paper introduces OmniAcc and explores its potential to assist urban planners and mobility-aid users, demonstrated through a case study on crosswalk detection. With a crosswalk detection accuracy of 97.5%, OmniAcc highlights the transformative potential of AI in improving navigation and fostering more inclusive urban spaces.

Updated: 2025-09-08 21:03:48

Categories: cs.AI,I.2.10; I.2.1; I.4.8

Download: http://arxiv.org/abs/2509.07220v1

Overflow Prevention Enhances Long-Context Recurrent LLMs

A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, their use of long contexts remains underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-context relations.
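The chunk-based procedure (split the long context, process only the most query-relevant chunk so the fixed-size recurrent state never overflows) can be sketched as below. The paper presumably scores relevance with the model itself; lexical overlap here is a toy stand-in:

```python
def select_chunk(context_tokens, query_tokens, chunk_size):
    """Split a long context into fixed-size chunks and keep only the one
    with the highest lexical overlap with the query."""
    chunks = [context_tokens[i:i + chunk_size]
              for i in range(0, len(context_tokens), chunk_size)]
    query = set(query_tokens)
    return max(chunks, key=lambda chunk: len(query & set(chunk)))
```

Only the selected chunk is fed to the recurrent model, bounding how much information its fixed-size memory must retain.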

Updated: 2025-09-08 20:57:22

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.07793v2

MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably-without retraining or forgetting previous information-remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks for LLaMA-3 and Mistral backbones demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
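The core mechanism (a sample-dependent sparse mask confining each edit to a distinct subset of a residual memory) can be illustrated with a top-k mask over activations. This is a minimal sketch of the masking idea, not MEMOIR's actual parameterization:

```python
import numpy as np

def topk_mask(activation, k):
    """Sample-dependent sparsification: keep only the k strongest activations."""
    mask = np.zeros_like(activation)
    mask[np.argsort(np.abs(activation))[-k:]] = 1.0
    return mask

memory = np.zeros(8)                       # residual memory parameters
act = np.array([0.1, 2.0, -0.3, 1.5, 0.0, 0.2, -1.8, 0.05])
mask = topk_mask(act, k=3)
memory += mask * 0.5                       # the edit touches only the masked slots
```

Because different inputs activate different slots, edits driven by different samples overwrite largely disjoint parameters, which is what limits interference.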

Updated: 2025-09-08 20:51:48

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2506.07899v2

XBusNet: Text-Guided Breast Ultrasound Segmentation via Multimodal Vision-Language Learning

Background: Precise breast ultrasound (BUS) segmentation supports reliable measurement, quantitative analysis, and downstream classification, yet remains difficult for small or low-contrast lesions with fuzzy margins and speckle noise. Text prompts can add clinical context, but directly applying weakly localized text-image cues (e.g., CAM/CLIP-derived signals) tends to produce coarse, blob-like responses that smear boundaries unless additional mechanisms recover fine edges. Methods: We propose XBusNet, a novel dual-prompt, dual-branch multimodal model that combines image features with clinically grounded text. A global pathway based on a CLIP Vision Transformer encodes whole-image semantics conditioned on lesion size and location, while a local U-Net pathway emphasizes precise boundaries and is modulated by prompts that describe shape, margin, and Breast Imaging Reporting and Data System (BI-RADS) terms. Prompts are assembled automatically from structured metadata, requiring no manual clicks. We evaluate on the Breast Lesions USG (BLU) dataset using five-fold cross-validation. Primary metrics are Dice and Intersection over Union (IoU); we also conduct size-stratified analyses and ablations to assess the roles of the global and local paths and the text-driven modulation. Results: XBusNet achieves state-of-the-art performance on BLU, with mean Dice of 0.8765 and IoU of 0.8149, outperforming six strong baselines. Small lesions show the largest gains, with fewer missed regions and fewer spurious activations. Ablation studies show complementary contributions of global context, local boundary modeling, and prompt-based modulation. Conclusions: A dual-prompt, dual-branch multimodal design that merges global semantics with local precision yields accurate BUS segmentation masks and improves robustness for small, low-contrast lesions.
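
The automatic prompt assembly described in Methods can be illustrated with a short sketch; the field names and sentence templates below are hypothetical stand-ins, not the BLU dataset's actual metadata schema.

```python
def build_prompts(meta: dict) -> tuple:
    """Assemble the global and local text prompts from structured metadata,
    with no manual clicks. Field names here are illustrative only."""
    global_prompt = (
        f"A {meta['size']} lesion in the {meta['location']} of the breast."
    )
    local_prompt = (
        f"Lesion with {meta['shape']} shape and {meta['margin']} margin, "
        f"BI-RADS category {meta['birads']}."
    )
    return global_prompt, local_prompt

g_prompt, l_prompt = build_prompts({
    "size": "small",
    "location": "upper outer quadrant",
    "shape": "irregular",
    "margin": "indistinct",
    "birads": 4,
})
print(g_prompt)  # conditions the CLIP-ViT global pathway (size/location)
print(l_prompt)  # modulates the local U-Net pathway (shape/margin/BI-RADS)
```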

Updated: 2025-09-08 20:45:55

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.07213v1

A multi-strategy improved gazelle optimization algorithm for solving numerical optimization and engineering applications

Aiming at the shortcomings of the gazelle optimization algorithm, such as the imbalance between exploration and exploitation and the insufficient information exchange within the population, this paper proposes a multi-strategy improved gazelle optimization algorithm (MSIGOA). To address these issues, MSIGOA proposes an iteration-based updating framework that switches between exploitation and exploration according to the optimization process, which effectively enhances the balance between local exploitation and global exploration and improves the convergence speed. Two adaptive parameter tuning strategies improve the applicability of the algorithm and promote a smoother optimization process. The dominant population-based restart strategy enhances the algorithm's ability to escape from local optima and avoid premature convergence. These enhancements significantly improve the exploration and exploitation capabilities of MSIGOA, bringing superior convergence and efficiency in dealing with complex problems. In this paper, the parameter sensitivity, strategy effectiveness, convergence, and stability of the proposed method are evaluated on two benchmark test sets, CEC2017 and CEC2022. Test results and statistical tests show that MSIGOA outperforms the basic GOA and other advanced algorithms. On the CEC2017 and CEC2022 test sets, the proportion of functions where MSIGOA is not worse than GOA is 92.2% and 83.3%, respectively, and the proportion of functions where MSIGOA is not worse than other algorithms is 88.57% and 87.5%, respectively. Finally, the extensibility of MSIGOA is further verified by several engineering design optimization problems.
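
A minimal sketch of the iteration-based switching idea on the sphere test function: phases alternate between exploration and exploitation, an adaptive weight shrinks over iterations, and the worst member is restarted near the current best. The phase rule, weight schedule, and restart step are simplified stand-ins for MSIGOA's actual strategies.

```python
import random

def sphere(x):
    """Classic benchmark objective: global minimum 0 at the origin."""
    return sum(v * v for v in x)

def msigoa_like(dim=5, pop=20, iters=200, seed=0):
    rng = random.Random(seed)
    X = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop)]
    best = min(X, key=sphere)
    for t in range(iters):
        w = 1.0 - t / iters                      # adaptive parameter: shrinks over time
        for i, x in enumerate(X):
            if t % 2 == 0:                       # exploration phase: random perturbation
                cand = [v + w * rng.gauss(0, 1) for v in x]
            else:                                # exploitation phase: drift toward best
                cand = [v + w * rng.random() * (b - v) for v, b in zip(x, best)]
            if sphere(cand) < sphere(x):         # greedy selection
                X[i] = cand
        # dominant-population restart: reseed the worst member near the current best
        worst = max(range(pop), key=lambda j: sphere(X[j]))
        X[worst] = [b + 0.1 * rng.gauss(0, 1) for b in best]
        best = min(X + [best], key=sphere)
    return best

solution = msigoa_like()
```

The best-so-far is carried across iterations, so the returned objective value is monotonically non-increasing over the run.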

Updated: 2025-09-08 20:44:15

Categories: cs.NE,cs.AI,cs.CE

Download: http://arxiv.org/abs/2509.07211v1

BlendedNet: A Blended Wing Body Aircraft Dataset and Surrogate Model for Aerodynamic Predictions

BlendedNet is a publicly available aerodynamic dataset of 999 blended wing body (BWB) geometries. Each geometry is simulated across about nine flight conditions, yielding 8830 converged RANS cases with the Spalart-Allmaras model and 9 to 14 million cells per case. The dataset is generated by sampling geometric design parameters and flight conditions, and includes detailed pointwise surface quantities needed to study lift and drag. We also introduce an end-to-end surrogate framework for pointwise aerodynamic prediction. The pipeline first uses a permutation-invariant PointNet regressor to predict geometric parameters from sampled surface point clouds, then conditions a Feature-wise Linear Modulation (FiLM) network on the predicted parameters and flight conditions to predict pointwise coefficients Cp, Cfx, and Cfz. Experiments show low errors in surface predictions across diverse BWBs. BlendedNet addresses data scarcity for unconventional configurations and enables research on data-driven surrogate modeling for aerodynamic design.
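
The FiLM conditioning step in the pipeline can be sketched in a few lines: a conditioning vector (predicted geometry parameters plus flight conditions) produces a per-channel scale `gamma` and shift `beta` applied to pointwise surface features. The tiny `condition` mapping below is a hand-written stand-in for the learned conditioning network.

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift every feature channel."""
    return [[g * f + b for f, g, b in zip(row, gamma, beta)] for row in features]

def condition(params, n_channels=3):
    """Stand-in for the learned network mapping (geometry parameters +
    flight conditions) to per-channel (gamma, beta)."""
    s = sum(params)
    gamma = [1.0 + 0.1 * s] * n_channels
    beta = [0.01 * s] * n_channels
    return gamma, beta

surface_points = [[0.2, -0.1, 0.5], [0.0, 0.3, -0.2]]  # pointwise surface features
gamma, beta = condition([0.8, 1.2, 0.05])              # e.g. sweep, span, alpha
modulated = film(surface_points, gamma, beta)          # feeds the Cp/Cfx/Cfz head
```

With `gamma = 1` and `beta = 0` FiLM is the identity, so the conditioning network only has to learn deviations from the unconditioned features.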

Updated: 2025-09-08 20:43:14

Categories: cs.AI

Download: http://arxiv.org/abs/2509.07209v1

A Hybrid CNN-LSTM Deep Learning Model for Intrusion Detection in Smart Grid

The evolution of the traditional power grid into the "smart grid" has resulted in a fundamental shift in energy management, which allows the integration of renewable energy sources with modern communication technology. However, this interconnection has increased smart grids' vulnerability to attackers, which might result in privacy breaches, operational interruptions, and massive outages. The SCADA-based smart grid protocols are critical for real-time data collection and control, but they are vulnerable to attacks like unauthorized access and denial of service (DoS). This research proposes a hybrid deep learning-based Intrusion Detection System (IDS) intended to improve the cybersecurity of smart grids. The suggested model takes advantage of Convolutional Neural Networks' (CNN) feature extraction capabilities as well as Long Short-Term Memory (LSTM) networks' temporal pattern recognition skills. DNP3 and IEC104 intrusion detection datasets are employed to train and test our CNN-LSTM model to recognize and classify the potential cyber threats. Compared to other deep learning approaches, the results demonstrate considerable improvements in accuracy, precision, recall, and F1-score, with a detection accuracy of 99.70%.

Updated: 2025-09-08 20:41:31

Categories: cs.AI

Download: http://arxiv.org/abs/2509.07208v1

Predicting effect of novel treatments using molecular pathways and real-world data

In pharmaceutical R&D, predicting the efficacy of a pharmaceutical in treating a particular disease prior to clinical testing or any real-world use has been challenging. In this paper, we propose a flexible and modular machine learning-based approach for predicting the efficacy of an untested pharmaceutical for treating a disease. We train a machine learning model using sets of pharmaceutical-pathway weight impact scores and patient data, which can include patient characteristics and observed clinical outcomes. The resulting model then analyses weighted impact scores of an untested pharmaceutical across human biological molecule-protein pathways to generate a predicted efficacy value. We demonstrate how the method works on a real-world dataset with patient treatments and outcomes, with two different weight impact score algorithms. We include methods for evaluating the generalisation performance on unseen treatments and for characterising the conditions under which the approach can be expected to be most predictive. We discuss specific ways in which our approach can be iterated on, making it an initial framework to support future work on predicting the effect of untested drugs, leveraging RWD clinical data and drug embeddings.

Updated: 2025-09-08 20:35:15

Categories: cs.LG,q-bio.QM,I.2.6; I.2.4; J.3

Download: http://arxiv.org/abs/2509.07204v1

Fed-REACT: Federated Representation Learning for Heterogeneous and Evolving Data

Motivated by the high resource costs and privacy concerns associated with centralized machine learning, federated learning (FL) has emerged as an efficient alternative that enables clients to collaboratively train a global model while keeping their data local. However, in real-world deployments, client data distributions often evolve over time and differ significantly across clients, introducing heterogeneity that degrades the performance of standard FL algorithms. In this work, we introduce Fed-REACT, a federated learning framework designed for heterogeneous and evolving client data. Fed-REACT combines representation learning with evolutionary clustering in a two-stage process: (1) in the first stage, each client learns a local model to extract feature representations from its data; (2) in the second stage, the server dynamically groups clients into clusters based on these representations and coordinates cluster-wise training of task-specific models for downstream objectives such as classification or regression. We provide a theoretical analysis of the representation learning stage, and empirically demonstrate that Fed-REACT achieves superior accuracy and robustness on real-world datasets.
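
The two-stage process above can be sketched with toy stand-ins: feature means replace the learned client representation, and a greedy distance-based grouping replaces the server's evolutionary clustering (which additionally tracks drift across rounds).

```python
def client_representation(local_data):
    """Stage 1 stand-in: summarize a client's data by its feature means
    (in Fed-REACT this is a learned representation model)."""
    dim, n = len(local_data[0]), len(local_data)
    return [sum(row[d] for row in local_data) / n for d in range(dim)]

def cluster_clients(reps, threshold=1.0):
    """Stage 2 stand-in: greedily group clients whose representations are
    close; each cluster then trains its own task-specific model."""
    clusters = []
    for i, r in enumerate(reps):
        for members in clusters:
            anchor = reps[members[0]]
            dist = sum((a - b) ** 2 for a, b in zip(anchor, r)) ** 0.5
            if dist < threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

client_data = [
    [[0.1, 0.0], [0.3, 0.2]],   # client 0
    [[0.2, 0.1]],               # client 1: similar distribution to client 0
    [[5.0, 5.2], [4.8, 5.1]],   # client 2: very different distribution
]
clusters = cluster_clients([client_representation(d) for d in client_data])
print(clusters)  # client 2 ends up in its own cluster
```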

Updated: 2025-09-08 20:24:40

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2509.07198v1

Evaluation of Machine Learning Reconstruction Techniques for Accelerated Brain MRI Scans

This retrospective-prospective study evaluated whether a deep learning-based MRI reconstruction algorithm can preserve diagnostic quality in brain MRI scans accelerated up to fourfold, using both public and prospective clinical data. The study included 18 healthy volunteers (scans acquired at 3T, January 2024-March 2025), as well as selected fastMRI public datasets with diverse pathologies. Phase-encoding-undersampled 2D/3D T1, T2, and FLAIR sequences were reconstructed with DeepFoqus-Accelerate and compared with standard-of-care (SOC). Three board-certified neuroradiologists and two MRI technologists independently reviewed 36 paired SOC/AI reconstructions from both datasets using a 5-point Likert scale, while quantitative similarity was assessed for 408 scans and 1224 datasets using Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Haar wavelet-based Perceptual Similarity Index (HaarPSI). No AI-reconstructed scan scored below 3 (minimally acceptable), and 95% scored $\geq 4$. Mean SSIM was 0.95 $\pm$ 0.03 (90% cases >0.90), PSNR >41.0 dB, and HaarPSI >0.94. Inter-rater agreement was slight to moderate. Rare artifacts did not affect diagnostic interpretation. These findings demonstrate that DeepFoqus-Accelerate enables robust fourfold brain MRI acceleration with 75% reduced scan time, while preserving diagnostic image quality and supporting improved workflow efficiency.
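
For reference, the PSNR figures quoted above follow the standard definition below, shown on a toy 4-pixel image rather than actual MRI data.

```python
import math

def psnr(reference, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio (dB) between two equal-sized images,
    given here as flat pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

ref = [100, 120, 140, 160]
recon = [101, 119, 141, 159]      # off by one grey level everywhere -> MSE = 1
print(round(psnr(ref, recon), 1))  # 48.1 dB; values above ~40 dB indicate close agreement
```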

Updated: 2025-09-08 20:20:24

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2509.07193v1

Analytic theory of dropout regularization

Dropout is a regularization technique widely used in training artificial neural networks to mitigate overfitting. It consists of dynamically deactivating subsets of the network during training to promote more robust representations. Despite its widespread adoption, dropout probabilities are often selected heuristically, and theoretical explanations of its success remain sparse. Here, we analytically study dropout in two-layer neural networks trained with online stochastic gradient descent. In the high-dimensional limit, we derive a set of ordinary differential equations that fully characterize the evolution of the network during training and capture the effects of dropout. We obtain a number of exact results describing the generalization error and the optimal dropout probability at short, intermediate, and long training times. Our analysis shows that dropout reduces detrimental correlations between hidden nodes, mitigates the impact of label noise, and that the optimal dropout probability increases with the level of noise in the data. Our results are validated by extensive numerical simulations.
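
The mechanism under study is standard (inverted) dropout: each unit is zeroed with probability p during training and survivors are rescaled by 1/(1-p), so the expected activation is unchanged. This generic sketch is not the paper's two-layer online-SGD setup, just the operation being analysed.

```python
import random

def dropout(activations, p, rng, train=True):
    """Inverted dropout: zero each unit with probability p and rescale
    survivors by 1/(1-p), so E[output] equals the input."""
    if not train or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
h = [0.5] * 10_000
dropped = dropout(h, p=0.4, rng=rng)
survivors = sum(1 for a in dropped if a != 0.0)
mean = sum(dropped) / len(dropped)
# roughly 60% of units survive, and the mean stays near 0.5
```

At test time (`train=False`) the layer is the identity, which is why the rescaling is done during training rather than at inference.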

Updated: 2025-09-08 20:19:29

Categories: stat.ML,cond-mat.dis-nn,cond-mat.stat-mech,cs.LG

Download: http://arxiv.org/abs/2505.07792v2

FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA

Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
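
The grounding step of a retrieval-augmented pipeline can be sketched as below; a crude word-overlap ranker stands in for the real retriever over sources like Wikipedia and DBpedia, and the BLIP-VQA answerer is not reproduced.

```python
def retrieve(question, corpus, k=1):
    """Rank documents by word overlap with the question; a toy stand-in
    for a real retriever over external knowledge sources."""
    q = set(question.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def grounded_prompt(question, corpus):
    """Prepend retrieved evidence so the answer is grounded in external
    knowledge instead of hallucinated."""
    evidence = " ".join(retrieve(question, corpus))
    return f"Context: {evidence}\nQuestion: {question}\nAnswer:"

corpus = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Mount Fuji is the highest mountain in Japan.",
]
prompt = grounded_prompt("when was the eiffel tower completed", corpus)
print("1889" in prompt)  # True: the relevant fact reached the prompt
```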

Updated: 2025-09-08 20:18:47

Categories: cs.CV,cs.CL,cs.IR,cs.LG

Download: http://arxiv.org/abs/2502.18536v2

Active Learning of Piecewise Gaussian Process Surrogates

Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.
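
The key point, that acquisition should account for model bias as well as the usual variance, can be illustrated with a toy criterion based on the decomposition of expected squared error; the numbers below are made up for illustration, not outputs of a fitted Jump GP.

```python
def acquisition_score(variance, bias):
    """Bias-aware active-learning criterion: expected squared error
    decomposes as variance + bias**2."""
    return variance + bias ** 2

candidates = [0.1, 0.5, 0.9]
variance = {0.1: 0.02, 0.5: 0.10, 0.9: 0.03}  # hypothetical posterior variances
bias = {0.1: 0.00, 0.5: 0.05, 0.9: 0.40}      # large bias near a jump at x ~ 0.9
next_x = max(candidates, key=lambda x: acquisition_score(variance[x], bias[x]))
print(next_x)  # 0.9: a variance-only rule would have picked 0.5 instead
```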

Updated: 2025-09-08 20:14:08

Categories: cs.LG,stat.ML,62G08,I.5.1

Download: http://arxiv.org/abs/2301.08789v4

Click Without Compromise: Online Advertising Measurement via Per User Differential Privacy

Online advertising is a cornerstone of the Internet ecosystem, with advertising measurement playing a crucial role in optimizing efficiency. Ad measurement entails attributing desired behaviors, such as purchases, to ad exposures across various platforms, necessitating the collection of user activities across these platforms. As this practice faces increasing restrictions due to rising privacy concerns, safeguarding user privacy in this context is imperative. Our work is the first to formulate the real-world challenge of advertising measurement systems with real-time reporting of streaming data in advertising campaigns. We introduce AdsBPC, a novel user-level differential privacy protection scheme for online advertising measurement results. This approach optimizes global noise power and results in a non-identically distributed noise distribution that preserves differential privacy while enhancing measurement accuracy. Through experiments on both real-world advertising campaigns and synthetic datasets, AdsBPC achieves a 33% to 95% increase in accuracy over existing streaming DP mechanisms applied to advertising measurement. This highlights our method's effectiveness in achieving superior accuracy alongside a formal privacy guarantee, thereby advancing the state-of-the-art in privacy-preserving advertising measurement.
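
A user-level guarantee hinges on bounding each user's contribution before noising. The sketch below is the textbook baseline (a fixed per-user cap plus Laplace noise), not AdsBPC's optimized, non-identically distributed noise scheme.

```python
import math
import random

def private_count(events, epsilon, cap, rng):
    """User-level DP count: clamp each user's total contribution to `cap`
    (bounding per-user sensitivity), then add Laplace(cap / epsilon) noise."""
    per_user = {}
    for user, conversions in events:
        per_user[user] = min(per_user.get(user, 0) + conversions, cap)
    true_total = sum(per_user.values())
    u = rng.random() - 0.5  # inverse-CDF Laplace sample
    noise = -(cap / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_total + noise

events = [("u1", 3), ("u2", 1), ("u1", 9), ("u3", 2)]  # u1 alone would contribute 12
noisy = private_count(events, epsilon=1.0, cap=5, rng=random.Random(42))
# clamped true total is 5 + 1 + 2 = 8; the report is 8 plus Laplace noise of scale 5
```

Without the cap, a single heavy user would blow up the sensitivity and force much larger noise, which is the trade-off per-user schemes like AdsBPC optimize.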

Updated: 2025-09-08 20:10:24

Categories: cs.CR

Download: http://arxiv.org/abs/2406.02463v4

DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge

Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models' ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.

Updated: 2025-09-08 20:07:30

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.07188v1

Validation of a CT-brain analysis tool for measuring global cortical atrophy in older patient cohorts

Quantification of brain atrophy currently requires visual rating scales, which are time-consuming, and automated brain image analysis is warranted. We validated our automated deep learning (DL) tool, which measures the Global Cerebral Atrophy (GCA) score, against trained human raters, and examined associations with age and cognitive impairment, in representative older (>65 years) patients. CT-brain scans were obtained from patients in acute medicine (ORCHARD-EPR), acute stroke (OCS studies) and a legacy sample. Scans were divided in a 60/20/20 ratio for training, optimisation and testing. CT-images were assessed by two trained raters (rater-1=864 scans, rater-2=20 scans). Agreement between DL tool-predicted GCA scores (range 0-39) and the visual ratings was evaluated using mean absolute error (MAE) and Cohen's weighted kappa. Among 864 scans (ORCHARD-EPR=578, OCS=200, legacy scans=86), MAE between the DL tool and rater-1 GCA scores was 3.2 overall, 3.1 for ORCHARD-EPR, 3.3 for OCS and 2.6 for the legacy scans, and half had DL-predicted GCA error between -2 and 2. Inter-rater agreement was Kappa=0.45 between the DL-tool and rater-1, and 0.41 between the tool and rater-2, whereas it was lower at 0.28 between rater-1 and rater-2. There was no difference in GCA scores between the DL-tool and the two raters (one-way ANOVA, p=0.35), or in mean GCA scores between the DL-tool and rater-1 (paired t-test, t=-0.43, p=0.66), the tool and rater-2 (t=1.35, p=0.18), or between rater-1 and rater-2 (t=0.99, p=0.32). DL-tool GCA scores correlated with age and cognitive scores (both p<0.001). Our DL CT-brain analysis tool measured the GCA score accurately and without user input in real-world scans acquired from older patients. Our tool will enable extraction of standardised quantitative measures of atrophy at scale for use in health data research and will act as proof-of-concept towards a point-of-care, clinically approved tool.

Updated: 2025-09-08 20:04:35

Categories: eess.IV,cs.AI,cs.CV,I.2; I.4

Download: http://arxiv.org/abs/2509.08012v1

Dimensionally Reduced Open-World Clustering: DROWCULA

Working with annotated data is the cornerstone of supervised learning. Nevertheless, providing labels to instances is a task that requires significant human effort. Several critical real-world applications make things more complicated because, no matter how many labels may have been identified in a task of interest, examples corresponding to novel classes may appear in the future. Unsurprisingly, prior work in this so-called `open-world' context has focused a lot on semi-supervised approaches. Focusing on image classification, somewhat paradoxically, we propose a fully unsupervised approach to the problem of determining the novel categories in a particular dataset. Our approach relies on estimating the number of clusters using Vision Transformers, which utilize attention mechanisms to generate vector embeddings. Furthermore, we incorporate manifold learning techniques to refine these embeddings by exploiting the intrinsic geometry of the data, thereby enhancing the overall image clustering performance. Overall, we establish new State-of-the-Art results on single-modal clustering and Novel Class Discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet. We do so both when the number of clusters is known and when it is unknown ahead of time. The code is available at: https://github.com/DROWCULA/DROWCULA.

Updated: 2025-09-08 20:01:29

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.07184v1

GCN-Driven Reinforcement Learning for Probabilistic Real-Time Guarantees in Industrial URLLC

Ensuring packet-level communication quality is vital for ultra-reliable, low-latency communications (URLLC) in large-scale industrial wireless networks. We enhance the Local Deadline Partition (LDP) algorithm by introducing a Graph Convolutional Network (GCN) integrated with a Deep Q-Network (DQN) reinforcement learning framework for improved interference coordination in multi-cell, multi-channel networks. Unlike LDP's static priorities, our approach dynamically learns link priorities based on real-time traffic demand, network topology, remaining transmission opportunities, and interference patterns. The GCN captures spatial dependencies, while the DQN enables adaptive scheduling decisions through reward-guided exploration. Simulation results show that our GCN-DQN model achieves mean SINR improvements of 179.6%, 197.4%, and 175.2% over LDP across three network configurations. Additionally, the GCN-DQN model demonstrates mean SINR improvements of 31.5%, 53.0%, and 84.7% over our previous CNN-based approach across the same configurations. These results underscore the effectiveness of our GCN-DQN model in addressing complex URLLC requirements with minimal overhead and superior network performance.

Updated: 2025-09-08 19:46:16

Categories: cs.NI,cs.LG

Download: http://arxiv.org/abs/2506.15011v3

AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

Unprecedented volumes of Earth observation data are continually collected around the world, but high-quality labels remain scarce given the effort required to make physical measurements and observations. This has led to considerable investment in bespoke modeling efforts translating sparse labels into maps. Here we introduce AlphaEarth Foundations, an embedding field model yielding a highly general, geospatial representation that assimilates spatial, temporal, and measurement contexts across multiple sources, enabling accurate and efficient production of maps and monitoring systems from local to global scales. The embeddings generated by AlphaEarth Foundations are the only ones to consistently outperform a suite of other well-known, widely accepted featurization approaches tested on a diverse set of mapping evaluations without re-training. We have released a dataset of global, annual, analysis-ready embedding field layers from 2017 through 2024.

Updated: 2025-09-08 19:42:30

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.22291v2

No Thoughts Just AI: Biased LLM Hiring Recommendations Alter Human Decision Making and Limit Human Autonomy

In this study, we conduct a resume-screening experiment (N=528) where people collaborate with simulated AI models exhibiting race-based preferences (bias) to evaluate candidates for 16 high and low status occupations. Simulated AI bias approximates factual and counterfactual estimates of racial bias in real-world AI systems. We investigate people's preferences for White, Black, Hispanic, and Asian candidates (represented through names and affinity groups on quality-controlled resumes) across 1,526 scenarios and measure their unconscious associations between race and status using implicit association tests (IATs), which predict discriminatory hiring decisions but have not been investigated in human-AI collaboration. When making decisions without AI or with AI that exhibits no race-based preferences, people select all candidates at equal rates. However, when interacting with AI favoring a particular group, people also favor those candidates up to 90% of the time, indicating a significant behavioral shift. The likelihood of selecting candidates whose identities do not align with common race-status stereotypes can increase by 13% if people complete an IAT before conducting resume screening. Finally, even if people think AI recommendations are low quality or not important, their decisions are still vulnerable to AI bias under certain circumstances. This work has implications for people's autonomy in AI-HITL scenarios, AI and work, design and evaluation of AI hiring systems, and strategies for mitigating bias in collaborative decision-making tasks. In particular, organizational and regulatory policy should acknowledge the complex nature of AI-HITL decision making when implementing these systems, educating people who use them, and determining which are subject to oversight.

Updated: 2025-09-08 19:40:40

标题: 没有思考,只有人工智能: 带偏见的LLM招聘建议改变人类决策并限制人类自主权

摘要: 在这项研究中,我们进行了一项简历筛选实验(N=528),在这个实验中,人们与展示基于种族偏好(偏见)的模拟AI模型合作,评估16种高低地位职业的候选人。模拟AI偏见近似于现实世界AI系统中种族偏见的事实和反事实估计。我们调查了人们对白人、黑人、西班牙裔和亚裔候选人(通过姓名和亲和力团体在经过质量控制的简历上表示)在1,526种情景中的偏好,并使用隐性关联测试(IATs)来测量他们在种族和地位之间的潜意识关联,这些测试可以预测歧视性招聘决策,但在人类-AI合作中尚未得到研究。在没有AI或与不具有基于种族偏好的AI交互时,人们以相同的比率选择所有候选人。然而,在与偏向特定群体的AI互动时,人们也会优先选择这些候选人,达到90%的时间,表明有显著的行为转变。如果人们在进行简历筛选之前完成IAT,那么选择与常见种族地位刻板印象不符的候选人的可能性可以增加13%。最后,即使人们认为AI建议质量低或不重要,他们的决策在某些情况下仍然容易受到AI偏见的影响。这项工作对于人们在AI-HITL情景中的自主性、AI和工作、AI招聘系统的设计和评估,以及减少合作决策任务中的偏见的策略具有重要意义。特别是,在实施这些系统、教育使用它们的人员以及确定哪些人员受到监督时,组织和监管政策应该认识到AI-HITL决策制定的复杂性。

更新时间: 2025-09-08 19:40:40

领域: cs.CY,cs.AI,cs.CL,cs.HC,K.4.2

下载: http://arxiv.org/abs/2509.04404v2

That's So FETCH: Fashioning Ensemble Techniques for LLM Classification in Civil Legal Intake and Referral

Each year millions of people seek help for their legal problems by calling a legal aid program hotline, walking into a legal aid office, or using a lawyer referral service. The first step to match them to the right help is to identify the legal problem the applicant is experiencing. Misdirection has consequences. Applicants may miss a deadline, experience physical abuse, lose housing or lose custody of children while waiting to connect to the right legal help. We introduce and evaluate the FETCH classifier for legal issue classification and describe two methods for improving accuracy: a hybrid LLM/ML ensemble classification method, and the automatic generation of follow-up questions to enrich the initial problem narrative. We employ a novel data set of 419 real-world queries to a nonprofit lawyer referral service. Ultimately, we show classification accuracy (hits@2) of 97.37\% using a mix of inexpensive models, exceeding the performance of the current state-of-the-art GPT-5 model. Our approach shows promise in significantly reducing the cost of guiding users of the legal system to the right resource for their problem while achieving high accuracy.
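The abstract leaves the ensemble mechanics unspecified; one plausible reading is averaging per-category scores from several inexpensive models and checking hits@2. A minimal stdlib sketch (every model output, category name, and score below is hypothetical, not taken from the paper):

```python
from collections import defaultdict

def ensemble_scores(model_outputs):
    """Average per-category scores from several cheap classifiers."""
    totals = defaultdict(float)
    for scores in model_outputs:
        for category, score in scores.items():
            totals[category] += score / len(model_outputs)
    return dict(totals)

def hits_at_k(scores, true_category, k=2):
    """True if the correct category appears among the top-k predictions."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return true_category in top_k

# Hypothetical outputs from two inexpensive models for one intake narrative.
outputs = [
    {"housing": 0.7, "family": 0.2, "consumer": 0.1},
    {"housing": 0.4, "family": 0.5, "consumer": 0.1},
]
combined = ensemble_scores(outputs)
print(hits_at_k(combined, "family"))  # "family" makes the top 2 after averaging
```

The hits@2 criterion matches the referral setting: showing an applicant two candidate problem areas is often acceptable when one of them is right.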

Updated: 2025-09-08 19:34:57

Categories: cs.AI,cs.CL,cs.CY

Download: http://arxiv.org/abs/2509.07170v1

Beyond Sequential Reranking: Reranker-Guided Search Improves Reasoning Intensive Retrieval

The widely used retrieve-and-rerank pipeline faces two critical limitations: it is constrained by the initial retrieval quality of the top-k documents, and the growing computational demands of LLM-based rerankers restrict the number of documents that can be effectively processed. We introduce Reranker-Guided-Search (RGS), a novel approach that bypasses these limitations by retrieving documents directly according to reranker preferences rather than following the traditional sequential reranking method. Our method uses a greedy search on proximity graphs generated by approximate nearest neighbor algorithms, strategically prioritizing promising documents for reranking based on document similarity. Experimental results demonstrate substantial performance improvements across multiple benchmarks: 3.5 points on BRIGHT, 2.9 on FollowIR, and 5.1 on M-BEIR, all within a constrained reranker budget of 100 documents. Our analysis suggests that, given a fixed pair of embedding and reranker models, strategically selecting documents to rerank can significantly improve retrieval accuracy under a limited reranker budget.
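A minimal sketch of a reranker-guided greedy search over a precomputed proximity graph, treating the reranker as an opaque scoring function with a fixed call budget. All document ids, scores, and the expansion heuristic are illustrative assumptions, not the paper's exact algorithm:

```python
import heapq

def reranker_guided_search(graph, embed_score, rerank, seeds, budget):
    """Greedy search on a proximity graph: rerank the most promising
    document, then enqueue its graph neighbors with a priority inherited
    from that reranker score, until the reranker budget is spent.

    graph: dict mapping doc id -> list of nearest-neighbor doc ids
    embed_score: dict of initial embedding similarities for the seeds
    rerank: callable doc id -> relevance score (the expensive reranker)
    """
    scored = {}                                   # doc id -> reranker score
    frontier = [(-embed_score[s], s) for s in seeds]
    heapq.heapify(frontier)
    while frontier and len(scored) < budget:
        _, doc = heapq.heappop(frontier)
        if doc in scored:
            continue
        scored[doc] = rerank(doc)                 # one unit of reranker budget
        for neighbor in graph.get(doc, []):
            if neighbor not in scored:
                # Prioritize neighbors of documents the reranker liked.
                heapq.heappush(frontier, (-scored[doc], neighbor))
    return sorted(scored, key=scored.get, reverse=True)

# Toy example: seed "a" scores poorly, but its neighborhood holds the
# truly relevant "b", which sequential top-k reranking could have missed.
graph = {"a": ["b"], "b": ["c"], "c": []}
relevance = {"a": 0.2, "b": 0.9, "c": 0.5}
result = reranker_guided_search(graph, {"a": 0.9}, relevance.get, ["a"], budget=3)
print(result)
```

The key contrast with sequential reranking is that the candidate set is grown during search, so documents outside the initial top-k remain reachable.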

Updated: 2025-09-08 19:24:09

Categories: cs.IR,cs.CL,cs.LG

Download: http://arxiv.org/abs/2509.07163v1

Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.

Updated: 2025-09-08 19:23:04

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.02074v2

Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model2) to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. These datasets were divided into 15 for fine-tuning and validation of Grounding DINO using Low Rank Adaptation (LoRA) to the ultrasound domain, and 3 were held out entirely for testing to evaluate performance in unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code on code.sonography.ai after acceptance.

Updated: 2025-09-08 19:18:30

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.23903v3

PaVeRL-SQL: Text-to-SQL via Partial-Match Rewards and Verbal Reinforcement Learning

Text-to-SQL models allow users to interact with a database more easily by generating executable SQL statements from natural-language questions. Despite recent successes on simpler databases and questions, current Text-to-SQL methods still suffer from low execution accuracy on industry-scale databases and complex questions involving domain-specific business logic. We present \emph{PaVeRL-SQL}, a framework that combines \emph{Partial-Match Rewards} and \emph{Verbal Reinforcement Learning} to drive self-improvement in reasoning language models (RLMs) for Text-to-SQL. To handle practical use cases, we adopt two pipelines: (1) a newly designed in-context learning framework with group self-evaluation (verbal-RL), using capable open- and closed-source large language models (LLMs) as backbones; and (2) a chain-of-thought (CoT) RL pipeline with a small backbone model (OmniSQL-7B) trained with a specially designed reward function and two-stage RL. These pipelines achieve state-of-the-art (SOTA) results on popular Text-to-SQL benchmarks -- Spider, Spider 2.0, and BIRD. For the industrial-level Spider2.0-SQLite benchmark, the verbal-RL pipeline achieves an execution accuracy 7.4\% higher than SOTA, and the CoT pipeline is 1.4\% higher. RL training with mixed SQL dialects yields strong, threefold gains, particularly for dialects with limited training data. Overall, \emph{PaVeRL-SQL} delivers reliable, SOTA Text-to-SQL under realistic industrial constraints. The code is available at https://github.com/PaVeRL-SQL/PaVeRL-SQL.
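The paper describes its reward only as "specially designed"; as an illustrative stand-in, a partial-match reward can credit token overlap between predicted and gold SQL instead of an all-or-nothing execution match. This token-level F1 is a hypothetical simplification, not the authors' actual reward function:

```python
def partial_match_reward(predicted_sql, gold_sql):
    """Illustrative partial-match reward: F1 over lowercased token
    multisets, so near-correct SQL earns partial credit rather than 0."""
    pred = predicted_sql.lower().split()
    gold = gold_sql.lower().split()
    overlap = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# A prediction missing the WHERE clause still earns a graded reward.
print(partial_match_reward("SELECT name FROM users",
                           "SELECT name FROM users WHERE age > 30"))
```

Dense rewards of this kind give the RL stage a learning signal even when execution accuracy on hard industrial queries starts near zero.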

Updated: 2025-09-08 19:15:38

Categories: cs.AI

Download: http://arxiv.org/abs/2509.07159v1

BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery

Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.

Updated: 2025-09-08 19:12:19

Categories: cs.LG,q-bio.BM

Download: http://arxiv.org/abs/2411.10548v5

PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

Discovering novel materials is critical for technological advancements such as solar cells, batteries, and carbon capture. However, the development of new materials is constrained by a slow and expensive trial-and-error process. To accelerate this pipeline, we introduce PLaID++, a Large Language Model (LLM) fine-tuned for stable and property-guided crystal generation. We fine-tune Qwen-2.5 7B to generate crystal structures using a novel Wyckoff-based text representation. We show that generation can be effectively guided with a reinforcement learning technique based on Direct Preference Optimization (DPO), with sampled structures categorized by their stability, novelty, and space group. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a $\sim$50\% greater rate than prior methods and conditionally generates structures with desired space group properties. Our experiments highlight the effectiveness of iterative DPO, achieving $\sim$115\% and $\sim$50\% improvements in unconditional and space group conditioned generation, respectively, compared to fine-tuning alone. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
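The DPO step can be grounded with the standard Direct Preference Optimization objective on a single preference pair, where the "chosen" and "rejected" structures would come from the stability/novelty/space-group categorization the abstract describes. The β value and log-probabilities below are illustrative placeholders:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on one preference pair: push the policy to prefer
    the chosen (e.g. stable, novel) structure over the rejected one,
    measured relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# With no preference margin over the reference, the loss sits at ln 2.
print(round(dpo_loss(-1.0, -1.0, -1.0, -1.0), 4))
```

Iterating this (regenerate samples, re-categorize, re-optimize) is what the abstract credits for the ~115% unconditional improvement over fine-tuning alone.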

Updated: 2025-09-08 18:57:57

Categories: cs.LG,cond-mat.mtrl-sci

Download: http://arxiv.org/abs/2509.07150v1

Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise, as shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep reinforcement learning (RL), we evaluate five recent approaches, unified conceptually as isometric embeddings with varying design choices. We benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 370 task configurations with diverse noise settings. Beyond final returns, we introduce the evaluation of a denoising factor to quantify the encoder's ability to filter distractions. To further isolate the effect of metric learning, we propose and evaluate an isolated metric estimation setting, in which the encoder is influenced solely by the metric loss. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

Updated: 2025-09-08 18:56:14

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.00563v2

Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?

Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation by injecting incorrect or irrelevant information to suppress knowledge. Such methods effectively constitute knowledge addition rather than true removal, often leaving models vulnerable to probing. In this paper, we formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework to assess whether existing approaches genuinely remove targeted information. Moreover, we propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions using KL-divergence, effectively removing knowledge about target individuals and triggering appropriate refusal behaviour. Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty that is much higher than obfuscation on probing questions.
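A sketch of the flattening objective: if the KL term is taken against the uniform distribution over answer options (the abstract does not pin down the exact direction or formulation), driving it to zero yields exactly the random-choice-level uncertainty the paper reports on probing questions:

```python
import math

def kl_to_uniform(probs):
    """KL(p || uniform) over a model's answer distribution for one
    multiple-choice question. A DF-MCQ-style objective would drive this
    toward 0, making the model maximally uncertain about the target fact."""
    n = len(probs)
    return sum(p * math.log(p * n) for p in probs if p > 0)

# Before unlearning the model is confident; after flattening it is not.
confident = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
print(kl_to_uniform(confident) > kl_to_uniform(flat))  # flattening lowers the loss
```

This is the operational difference from obfuscation: the distribution carries no residual peak for a probe to recover, rather than a new peak on injected misinformation.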

Updated: 2025-09-08 18:55:39

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.02884v2

Measuring Uncertainty in Transformer Circuits with Effective Information Consistency

Mechanistic interpretability has identified functional subgraphs within large language models (LLMs), known as Transformer Circuits (TCs), that appear to implement specific algorithms. Yet we lack a formal, single-pass way to quantify when an active circuit is behaving coherently and thus likely trustworthy. Building on prior systems-theoretic proposals, we specialize a sheaf/cohomology and causal emergence perspective to TCs and introduce the Effective-Information Consistency Score (EICS). EICS combines (i) a normalized sheaf inconsistency computed from local Jacobians and activations, with (ii) a Gaussian EI proxy for circuit-level causal emergence derived from the same forward state. The construction is white-box, single-pass, and makes units explicit so that the score is dimensionless. We further provide practical guidance on score interpretation, computational overhead (with fast and exact modes), and a toy sanity-check analysis. Empirical validation on LLM tasks is deferred.

Updated: 2025-09-08 18:54:56

Categories: cs.LG,cs.AI,cs.CL,cs.IT,math.IT

Download: http://arxiv.org/abs/2509.07149v1

Learning to Upsample and Upmix Audio in the Latent Domain

Neural audio autoencoders create compact latent representations that preserve perceptually important information, serving as the foundation for both modern audio compression systems and generation approaches like next-token prediction and latent diffusion. Despite their prevalence, most audio processing operations, such as spatial and spectral up-sampling, still inefficiently operate on raw waveforms or spectral representations rather than directly on these compressed representations. We propose a framework that performs audio processing operations entirely within an autoencoder's latent space, eliminating the need to decode to raw audio formats. Our approach dramatically simplifies training by operating solely in the latent domain, with a latent L1 reconstruction term, augmented by a single latent adversarial discriminator. This contrasts sharply with raw-audio methods that typically require complex combinations of multi-scale losses and discriminators. Through experiments in bandwidth extension and mono-to-stereo up-mixing, we demonstrate computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio. This work establishes a more efficient paradigm for audio processing pipelines that already incorporate autoencoders, enabling significantly faster and more resource-efficient workflows across various audio tasks.

Updated: 2025-09-08 18:54:42

Categories: cs.SD,cs.LG,eess.AS

Download: http://arxiv.org/abs/2506.00681v2

Autoencoder-Based Denoising of Muscle Artifacts in ECG to Preserve Skin Nerve Activity (SKNA) for Cognitive Stress Detection

The sympathetic nervous system (SNS) plays a central role in regulating the body's responses to stress and maintaining physiological stability. Its dysregulation is associated with a wide range of conditions, from cardiovascular disease to anxiety disorders. Skin nerve activity (SKNA) extracted from high-frequency electrocardiogram (ECG) recordings provides a noninvasive window into SNS dynamics, but its measurement is highly susceptible to electromyographic (EMG) contamination. Traditional preprocessing based on bandpass filtering within a fixed range (e.g., 500--1000 Hz) is susceptible to overlapping EMG and SKNA spectral components, especially during sustained muscle activity. We present a denoising approach using a lightweight one-dimensional convolutional autoencoder with a long short-term memory (LSTM) bottleneck to reconstruct clean SKNA from EMG-contaminated recordings. Using clean ECG-derived SKNA data from cognitive stress experiments and EMG noise from chaotic muscle stimulation recordings, we simulated contamination at realistic noise levels (--4 dB, --8 dB signal-to-noise ratio) and trained the model in the leave-one-subject-out cross-validation framework. The method improved signal-to-noise ratio by up to 9.65 dB, increased cross correlation with clean SKNA from 0.40 to 0.72, and restored burst-based SKNA features to near-clean discriminability (AUROC $\geq$ 0.96). Classification of baseline versus sympathetic stimulation (cognitive stress) conditions reached accuracies of 91--98\% across severe noise levels, comparable to clean data. These results demonstrate that deep learning--based reconstruction can preserve physiologically relevant sympathetic bursts during substantial EMG interference, enabling more robust SKNA monitoring in naturalistic, movement-rich environments.
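The SNR figures quoted above (contamination at -4 and -8 dB, improvements of up to 9.65 dB) follow the standard power-ratio definition in decibels, sketched here on toy samples:

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from clean-signal and noise samples:
    10 * log10(P_signal / P_noise), with power as the mean squared amplitude."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

# Doubling signal amplitude quadruples power: about +6.02 dB over unit noise.
print(round(snr_db([2.0] * 8, [1.0] * 8), 2))  # 6.02
```

Under this convention, the paper's -8 dB condition means the EMG noise power is more than six times the SKNA signal power, which is why bandpass filtering alone struggles.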

Updated: 2025-09-08 18:51:36

Categories: cs.AI

Download: http://arxiv.org/abs/2509.07146v1

LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision

Supervised approaches for learning spatio-temporal scene graphs (STSG) from video are greatly hindered by their reliance on STSG-annotated videos, which are labor-intensive to construct at scale. Is it feasible to instead use readily available video captions as weak supervision? To address this question, we propose LASER, a neuro-symbolic framework to enable training STSG generators using only video captions. LASER employs large language models to first extract logical specifications with rich spatio-temporal semantic information from video captions. LASER then trains the underlying STSG generator to align the predicted STSG with the specification. The alignment algorithm overcomes the challenges of weak supervision by leveraging a differentiable symbolic reasoner and using a combination of contrastive, temporal, and semantics losses. The overall approach efficiently trains low-level perception models to extract a fine-grained STSG that conforms to the video caption. In doing so, it enables a novel methodology for learning STSGs without tedious annotations. We evaluate our method on three video datasets: OpenPVSG, 20BN, and MUGEN. Our approach demonstrates substantial improvements over fully-supervised baselines, achieving a unary predicate prediction accuracy of 27.78% (+12.65%) and a binary recall@5 of 0.42 (+0.22) on OpenPVSG. Additionally, LASER exceeds baselines by 7% on 20BN and 5.2% on MUGEN in terms of overall predicate prediction accuracy.

Updated: 2025-09-08 18:48:44

Categories: cs.CV,cs.LG,cs.LO

Download: http://arxiv.org/abs/2304.07647v6

Of Graphs and Tables: Zero-Shot Node Classification with Tabular Foundation Models

Graph foundation models (GFMs) have recently emerged as a promising paradigm for achieving broad generalization across various graph data. However, existing GFMs are often trained on datasets that were shown to poorly represent real-world graphs, limiting their generalization performance. In contrast, tabular foundation models (TFMs) not only excel at classical tabular prediction tasks but have also shown strong applicability in other domains such as time series forecasting, natural language processing, and computer vision. Motivated by this, we take an alternative view to the standard perspective of GFMs and reformulate node classification as a tabular problem. Each node can be represented as a row with feature, structure, and label information as columns, enabling TFMs to directly perform zero-shot node classification via in-context learning. In this work, we introduce TabGFM, a graph foundation model framework that first converts a graph into a table via feature and structural encoders, applies multiple TFMs to diversely subsampled tables, and then aggregates their outputs through ensemble selection. Through experiments on 28 real-world datasets, TabGFM achieves consistent improvements over task-specific GNNs and state-of-the-art GFMs, highlighting the potential of tabular reformulation for scalable and generalizable graph learning.
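The node-as-row reformulation can be sketched with deliberately simple structural columns (degree and a neighbor-feature mean) standing in for the paper's learned feature and structure encoders; the column names and toy graph below are assumptions for illustration:

```python
def graph_to_table(features, edges, labels):
    """Flatten a node-classification graph into table rows: one row per
    node, with raw features, simple structural columns, and the label
    where known (None marks nodes for the TFM to classify in-context)."""
    neighbors = {v: [] for v in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    rows = []
    for node, feat in features.items():
        nbrs = neighbors[node]
        mean_nbr = (sum(features[n][0] for n in nbrs) / len(nbrs)) if nbrs else 0.0
        rows.append({
            "feature_0": feat[0],
            "degree": len(nbrs),                     # structural context
            "mean_neighbor_feature_0": mean_nbr,     # neighborhood context
            "label": labels.get(node),
        })
    return rows

# Toy graph: a -- b -- c, with labels known for a and c only.
features = {"a": [1.0], "b": [3.0], "c": [5.0]}
edges = [("a", "b"), ("b", "c")]
rows = graph_to_table(features, edges, {"a": 0, "c": 1})
```

Labeled rows then serve as the TFM's in-context examples and unlabeled rows as queries, which is what makes zero-shot node classification possible without graph-specific training.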

Updated: 2025-09-08 18:48:26

Categories: cs.LG

Download: http://arxiv.org/abs/2509.07143v1

On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift

Public pretraining is a promising approach to improve differentially private model training. However, recent work has noted that many positive research results studying this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data -- a scenario that is likely when finetuning private tasks due to the sensitive nature of the data. In this work, we show empirically across three tasks that even in settings with large distribution shift, where both zero-shot performance from public data and training from scratch with private data give unusably weak results, public features can in fact improve private training accuracy by up to 67\% over private training from scratch. We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training even if it is impossible to learn the private task from the public data alone. Altogether, our results provide evidence that public data can indeed make private training practical in realistic settings of extreme distribution shift.

Updated: 2025-09-08 18:47:35

Categories: cs.LG,cs.CR,stat.ML

Download: http://arxiv.org/abs/2312.15551v5

Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models

This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.

Updated: 2025-09-08 18:46:08

Categories: cs.CL,cs.AI,cs.DL

Download: http://arxiv.org/abs/2509.07142v1

Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders

Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a "Re-Bottleneck", an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework's effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.
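The ordering experiment above can be illustrated with a nested-truncation objective: reconstructing from every prefix of the latent forces earlier channels to carry the most information. This is a hedged toy sketch in the spirit of the Re-Bottleneck idea; the function names and the tiny linear decoder are ours, not the paper's API.

```python
# Latent-space "ordering" loss sketch: reconstruct from each prefix of the
# latent and compare against the full-latent reconstruction.

def prefix_reconstruction_losses(latent, decode):
    """MSE of reconstructing from latent[:k] (rest zeroed), for k = 1..len."""
    target = decode(latent)                      # full-latent reconstruction
    losses = []
    for k in range(1, len(latent) + 1):
        truncated = latent[:k] + [0.0] * (len(latent) - k)
        recon = decode(truncated)
        mse = sum((a - b) ** 2 for a, b in zip(recon, target)) / len(target)
        losses.append(mse)
    return losses

# Toy linear decoder: each output mixes all latent channels.
def decode(z):
    weights = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5]]
    return [sum(w * v for w, v in zip(row, z)) for row in weights]

losses = prefix_reconstruction_losses([1.0, 1.0, 1.0], decode)
# Losses shrink as more channels are kept; the full prefix is lossless.
assert losses[-1] == 0.0
assert losses[0] > losses[1] > losses[2]
```

Minimizing the average of these prefix losses would penalize late channels that carry essential information, which is one way to instill an ordering without touching reconstruction quality.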

Updated: 2025-09-08 18:44:53

Categories: cs.SD,cs.LG,eess.AS

Download: http://arxiv.org/abs/2507.07867v2

Pilot Study on Generative AI and Critical Thinking in Higher Education Classrooms

Generative AI (GAI) tools have seen rapid adoption in educational settings, yet their role in fostering critical thinking remains underexplored. While previous studies have examined GAI as a tutor for specific lessons or as a tool for completing assignments, few have addressed how students critically evaluate the accuracy and appropriateness of GAI-generated responses. This pilot study investigates students' ability to apply structured critical thinking when assessing Generative AI outputs in introductory Computational and Data Science courses. Given that GAI tools often produce contextually flawed or factually incorrect answers, we designed learning activities that require students to analyze, critique, and revise AI-generated solutions. Our findings offer initial insights into students' ability to engage critically with GAI content and lay the groundwork for more comprehensive studies in future semesters.

Updated: 2025-09-08 18:37:35

Categories: cs.CY,cs.AI,cs.HC,stat.AP

Download: http://arxiv.org/abs/2509.00167v3

Avoiding Over-Personalization with Rule-Guided Knowledge Graph Adaptation for LLM Recommendations

We present a lightweight neuro-symbolic framework to mitigate over-personalization in LLM-based recommender systems by adapting user-side Knowledge Graphs (KGs) at inference time. Instead of retraining models or relying on opaque heuristics, our method restructures a user's Personalized Knowledge Graph (PKG) to suppress feature co-occurrence patterns that reinforce Personalized Information Environments (PIEs), i.e., algorithmically induced filter bubbles that constrain content diversity. These adapted PKGs are used to construct structured prompts that steer the language model toward more diverse, Out-PIE recommendations while preserving topical relevance. We introduce a family of symbolic adaptation strategies, including soft reweighting, hard inversion, and targeted removal of biased triples, and a client-side learning algorithm that optimizes their application per user. Experiments on a recipe recommendation benchmark show that personalized PKG adaptations significantly increase content novelty while maintaining recommendation quality, outperforming global adaptation and naive prompt-based methods.
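The three symbolic adaptation strategies named above (soft reweighting, hard inversion, targeted removal) can be sketched on a toy user-side knowledge graph. The triple format and function names here are illustrative assumptions, not the paper's API.

```python
# Toy PKG as (head, relation, tail, weight) triples; each strategy rewrites
# the triples touching a feature that over-reinforces the user's PIE.

def soft_reweight(pkg, feature, factor):
    """Scale the weight of every triple touching `feature`."""
    return [(h, r, t, w * factor if feature in (h, t) else w)
            for h, r, t, w in pkg]

def hard_invert(pkg, feature):
    """Flip the weight sign of triples touching `feature`."""
    return [(h, r, t, -w if feature in (h, t) else w)
            for h, r, t, w in pkg]

def remove_biased(pkg, feature):
    """Drop triples touching `feature` entirely."""
    return [(h, r, t, w) for h, r, t, w in pkg if feature not in (h, t)]

pkg = [("user", "likes", "pasta", 0.9),
       ("user", "likes", "salad", 0.4),
       ("pasta", "pairs_with", "cheese", 0.8)]

assert abs(soft_reweight(pkg, "pasta", 0.5)[0][3] - 0.45) < 1e-12
assert hard_invert(pkg, "pasta")[2][3] == -0.8
assert len(remove_biased(pkg, "pasta")) == 1
```

The adapted triples would then be serialized into the structured prompt that steers the LLM toward Out-PIE items; the client-side learner chooses which strategy (and strength) to apply per user.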

Updated: 2025-09-08 18:33:36

Categories: cs.IR,cs.LG

Download: http://arxiv.org/abs/2509.07133v1

Adversarial Attacks on Audio Deepfake Detection: A Benchmark and Comparative Study

The widespread use of generative AI has shown remarkable success in producing highly realistic deepfakes, posing a serious threat to various voice biometric applications, including speaker verification, voice biometrics, audio conferencing, and criminal investigations. To counteract this, several state-of-the-art (SoTA) audio deepfake detection (ADD) methods have been proposed to identify generative AI signatures to distinguish between real and deepfake audio. However, the effectiveness of these methods is severely undermined by anti-forensic (AF) attacks that conceal generative signatures. These AF attacks span a wide range of techniques, including statistical modifications (e.g., pitch shifting, filtering, noise addition, and quantization) and optimization-based attacks (e.g., FGSM, PGD, C&W, and DeepFool). In this paper, we investigate the SoTA ADD methods and provide a comparative analysis to highlight their effectiveness in exposing deepfake signatures, as well as their vulnerabilities under adversarial conditions. We conducted an extensive evaluation of ADD methods on five deepfake benchmark datasets using two categories: raw and spectrogram-based approaches. This comparative analysis enables a deeper understanding of the strengths and limitations of SoTA ADD methods against diverse AF attacks. It not only highlights the vulnerabilities of ADD methods but also informs the design of more robust and generalized detectors for real-world voice biometrics. It will further guide future research in developing adaptive defense strategies that can effectively counter evolving AF techniques.
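FGSM, the simplest of the optimization-based attacks listed above, perturbs the input along the sign of the score gradient. The sketch below applies it to a toy linear detector whose gradient is just its weight vector; real attacks backpropagate through a neural detector over audio features, so the model and epsilon here are purely illustrative.

```python
# FGSM on a toy linear detector score w.x (score > 0 means "real").
# The gradient of the score w.r.t. x is simply w, so the attack adds
# eps * sign(w) to push a flagged-fake sample toward the "real" side.

def fgsm(x, weights, eps):
    def sign(v):
        return (v > 0) - (v < 0)
    return [xi + eps * sign(wi) for xi, wi in zip(x, weights)]

weights = [0.5, -1.0, 2.0]
x = [-1.0, 0.5, -0.5]                 # flagged as fake: score = -2.0
score = lambda v: sum(w * vi for w, vi in zip(weights, v))

x_adv = fgsm(x, weights, eps=0.4)
assert score(x_adv) > score(x)        # perturbation raises the detector score
assert all(abs(a - b) <= 0.4 + 1e-9 for a, b in zip(x_adv, x))  # L-inf bound
```

PGD iterates this step with projection back into the epsilon ball, and C&W and DeepFool replace the fixed step with an optimization over the perturbation itself.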

Updated: 2025-09-08 18:33:24

Categories: cs.SD,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.07132v1

Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs

The usage-based constructionist (UCx) approach to language posits that language comprises a network of learned form-meaning pairings (constructions) whose use is largely determined by their meanings or functions, requiring them to be graded and probabilistic. This study investigates whether the internal representations in Large Language Models (LLMs) reflect the proposed function-infused gradience. We analyze representations of the English Double Object (DO) and Prepositional Object (PO) constructions in Pythia-$1.4$B, using a dataset of $5000$ sentence pairs systematically varied by human-rated preference strength for DO or PO. Geometric analyses show that the separability between the two constructions' representations, as measured by energy distance or Jensen-Shannon divergence, is systematically modulated by gradient preference strength, which depends on lexical and functional properties of sentences. That is, more prototypical exemplars of each construction occupy more distinct regions in activation space than sentences that could equally well have occurred in either construction. These results provide evidence that LLMs learn rich, meaning-infused, graded representations of constructions and offer support for geometric measures of representations in LLMs.
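The energy-distance separability measure used above has a compact closed form: $2\,\mathbb{E}\|X-Y\| - \mathbb{E}\|X-X'\| - \mathbb{E}\|Y-Y'\|$, which is zero exactly when the two distributions coincide. A pure-Python stand-in on synthetic "activation" points (the vectors below are made up, not Pythia activations):

```python
# Energy distance between two sets of activation vectors.
import math

def energy_distance(xs, ys):
    """2 E|X-Y| - E|X-X'| - E|Y-Y'| with Euclidean norms."""
    def mean_pairwise(a, b):
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))
    return 2 * mean_pairwise(xs, ys) - mean_pairwise(xs, xs) - mean_pairwise(ys, ys)

do_like = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05)]   # hypothetical DO cluster
po_like = [(1.0, 1.1), (1.1, 1.0), (1.05, 1.05)]   # hypothetical PO cluster
near_do = [(0.5, 0.5), (0.4, 0.6), (0.6, 0.4)]     # a closer cluster

assert energy_distance(do_like, do_like) < 1e-12    # identical samples -> 0
assert energy_distance(do_like, po_like) > energy_distance(do_like, near_do)
```

On this measure, clusters occupying more distinct regions score higher, which is the sense in which prototypical DO and PO exemplars are "more separable".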

Updated: 2025-09-08 18:33:19

Categories: cs.CL,cs.AI,68T50

Download: http://arxiv.org/abs/2507.22286v2

SoK: Security and Privacy of AI Agents for Blockchain

Blockchain and smart contracts have garnered significant interest in recent years as the foundation of a decentralized, trustless digital ecosystem, thereby eliminating the need for traditional centralized authorities. Despite their central role in powering Web3, their complexity still presents significant barriers for non-expert users. To bridge this gap, Artificial Intelligence (AI)-based agents have emerged as valuable tools for interacting with blockchain environments, supporting a range of tasks, from analyzing on-chain data and optimizing transaction strategies to detecting vulnerabilities within smart contracts. While interest in applying AI to blockchain is growing, the literature still lacks a comprehensive survey that focuses specifically on the intersection with AI agents. Most of the related work only provides general considerations, without focusing on any specific domain. This paper addresses this gap by presenting the first Systematization of Knowledge dedicated to AI-driven systems for blockchain, with a special focus on their security and privacy dimensions, shedding light on their applications, limitations, and future research directions.

Updated: 2025-09-08 18:32:15

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2509.07131v1

SVGauge: Towards Human-Aligned Evaluation for SVG Generation

Generated Scalable Vector Graphics (SVG) images demand evaluation criteria tuned to their symbolic and vectorial nature: criteria that existing metrics such as FID, LPIPS, or CLIPScore fail to satisfy. In this paper, we introduce SVGauge, the first human-aligned, reference-based metric for text-to-SVG generation. SVGauge jointly measures (i) visual fidelity, obtained by extracting SigLIP image embeddings and refining them with PCA and whitening for domain alignment, and (ii) semantic consistency, captured by comparing BLIP-2-generated captions of the SVGs against the original prompts in the combined space of SBERT and TF-IDF. Evaluation on the proposed SHE benchmark shows that SVGauge attains the highest correlation with human judgments and reproduces system-level rankings of eight zero-shot LLM-based generators more faithfully than existing metrics. Our results highlight the necessity of vector-specific evaluation and provide a practical tool for benchmarking future text-to-SVG generation models.
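The semantic-consistency side of the metric compares a generated caption against the original prompt in a sparse lexical space. As a simplified stand-in (plain term-frequency cosine rather than the paper's SBERT plus TF-IDF combination, and with made-up example strings):

```python
# Term-frequency cosine between a caption and a prompt.
import math
from collections import Counter

def cosine_tf(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

prompt = "a red circle above a blue square"
good_caption = "a red circle above a blue square"
bad_caption = "green triangle"

assert cosine_tf(prompt, good_caption) > 0.99   # faithful caption
assert cosine_tf(prompt, bad_caption) == 0.0    # no lexical overlap
```

TF-IDF would additionally down-weight common tokens, and the SBERT component catches paraphrases that share no surface vocabulary.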

Updated: 2025-09-08 18:28:31

Categories: cs.GR,cs.AI,cs.CV

Download: http://arxiv.org/abs/2509.07127v1

Heterogeneous Self-Supervised Acoustic Pre-Training with Local Constraints

Self-supervised pre-training using unlabeled data is widely used in automatic speech recognition. In this paper, we propose a new self-supervised pre-training approach to dealing with heterogeneous data. Instead of mixing all the data and minimizing the averaged global loss in the conventional way, we impose additional local constraints to ensure that the model optimizes each source of heterogeneous data to its local optimum after $K$-step gradient descent initialized from the model. We formulate this as a bilevel optimization problem, and use the first-order approximation method to solve the problem. We discuss its connection to model-agnostic meta learning. Experiments are carried out on self-supervised pre-training using multi-domain and multilingual datasets, demonstrating that the proposed approach can significantly improve the adaptivity of the self-supervised pre-trained model for the downstream supervised fine-tuning tasks.
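The first-order approximation described above resembles a Reptile-style update: adapt the shared parameter with $K$ gradient steps per source, then move the shared parameter toward each adapted copy. The sketch below uses toy scalar quadratics as stand-ins for the per-source losses; all names and constants are ours.

```python
# First-order bilevel sketch: per-source K-step adaptation + meta update.

def k_step_sgd(theta, center, k, lr):
    """K steps of gradient descent on the toy loss (theta - center)^2."""
    for _ in range(k):
        theta -= lr * 2.0 * (theta - center)
    return theta

def meta_step(theta, source_optima, k, inner_lr, meta_lr):
    adapted = [k_step_sgd(theta, c, k, inner_lr) for c in source_optima]
    # First-order update: pull theta toward the locally adapted parameters.
    return theta + meta_lr * sum(a - theta for a in adapted) / len(adapted)

theta, optima = 10.0, [1.0, 2.0, 6.0]   # three heterogeneous sources
for _ in range(60):
    theta = meta_step(theta, optima, k=3, inner_lr=0.25, meta_lr=0.5)

# The shared initialization settles where every source is a few gradient
# steps from its own optimum: here, the sources' average.
assert abs(theta - 3.0) < 1e-3
```

The local constraint shows up in the inner loop: each source gets its own $K$-step trajectory rather than contributing to one averaged global gradient, which is what makes the resulting initialization more adaptable to downstream fine-tuning.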

Updated: 2025-09-08 18:21:29

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2508.19990v2

NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice

Nested logit (NL) models have been commonly used for discrete choice analysis, including a wide range of applications such as travel mode choice, automobile ownership, or location decisions. However, the classical NL models are restricted by their limited representation capability and handcrafted utility specification. While researchers introduced deep neural networks (DNNs) to tackle such challenges, the existing DNNs cannot explicitly capture inter-alternative correlations in the discrete choice context. To address the challenges, this study proposes a novel concept - alternative graph - to represent the relationships among travel mode alternatives. Using a nested alternative graph, this study further designs a nested-utility graph neural network (NestGNN) as a generalization of the classical NL model in the neural network family. Theoretically, NestGNNs generalize the classical NL models and existing DNNs in terms of model representation, while retaining the crucial two-layer substitution patterns of the NL models: proportional substitution within a nest but non-proportional substitution beyond a nest. Empirically, we find that NestGNNs significantly outperform the benchmark models, surpassing the corresponding NL models by 9.2%. As shown by elasticity tables and substitution visualizations, NestGNNs retain the two-layer substitution patterns of the NL model, yet present more flexibility in their model design space. Overall, our study demonstrates the power of NestGNN in prediction and interpretation, and its flexibility in generalizing the classical NL model for analyzing travel mode choice.
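For reference, the classical two-level nested logit that NestGNN generalizes computes $P(j) = P(\text{nest}) \cdot P(j \mid \text{nest})$ through inclusive values. The utilities, nests, and dissimilarity parameters below are illustrative, not from the paper:

```python
# Two-level nested logit choice probabilities.
import math

def nested_logit_probs(utilities, nests, lambdas):
    """P(j) = P(nest) * P(j | nest); lambdas are nest dissimilarities."""
    iv = {k: lam * math.log(sum(math.exp(utilities[j] / lam) for j in nests[k]))
          for k, lam in lambdas.items()}
    denom = sum(math.exp(v) for v in iv.values())
    probs = {}
    for k, lam in lambdas.items():
        p_nest = math.exp(iv[k]) / denom
        within = sum(math.exp(utilities[j] / lam) for j in nests[k])
        for j in nests[k]:
            probs[j] = p_nest * math.exp(utilities[j] / lam) / within
    return probs

utilities = {"car": 1.0, "bus": 0.5, "rail": 0.4}
nests = {"auto": ["car"], "transit": ["bus", "rail"]}

p = nested_logit_probs(utilities, nests, {"auto": 1.0, "transit": 0.5})
assert abs(sum(p.values()) - 1.0) < 1e-9

# With all dissimilarities at 1, the nested model collapses to plain MNL.
flat = nested_logit_probs(utilities, nests, {"auto": 1.0, "transit": 1.0})
mnl = math.exp(1.0) / sum(math.exp(v) for v in utilities.values())
assert abs(flat["car"] - mnl) < 1e-9
```

The shared dissimilarity within a nest is what produces proportional substitution inside the nest and non-proportional substitution across nests, the pattern NestGNN is built to preserve.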

Updated: 2025-09-08 18:19:46

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2509.07123v1

Neuro-Symbolic Frameworks: Conceptual Characterization and Empirical Comparative Analysis

Neurosymbolic (NeSy) frameworks combine neural representations and learning with symbolic representations and reasoning. Combining the reasoning capacities, explainability, and interpretability of symbolic processing with the flexibility and power of neural computing allows us to solve complex problems with more reliability while being data-efficient. However, this recently growing topic poses a challenge to developers with its steep learning curve and lack of user-friendly tools, libraries, and unifying frameworks. In this paper, we characterize the technical facets of existing NeSy frameworks, such as the symbolic representation language, integration with neural models, and the underlying algorithms. A majority of NeSy research focuses on algorithms instead of providing generic frameworks for declarative problem specification to leverage problem solving. To highlight the key aspects of Neurosymbolic modeling, we showcase three generic NeSy frameworks - DeepProbLog, Scallop, and DomiKnowS. We identify the challenges within each facet that lay the foundation for identifying the expressivity of each framework in solving a variety of problems. Building on this foundation, we aim to spark transformative action and encourage the community to rethink this problem in novel ways.

Updated: 2025-09-08 18:17:33

Categories: cs.AI,cs.CL,cs.SC

Download: http://arxiv.org/abs/2509.07122v1

Localizing Persona Representations in LLMs

We present a study on how and where personas -- defined by distinct sets of human characteristics, values, and beliefs -- are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations within a selected layer to examine how specific personas are encoded relative to others, including their shared and distinct embedding spaces. We find that, across multiple pre-trained decoder-only LLMs, the analyzed personas show large differences in representation space only within the final third of the decoder layers. We observe overlapping activations for specific ethical perspectives -- such as moral nihilism and utilitarianism -- suggesting a degree of polysemy. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions. These findings help to improve our understanding of how LLMs internally represent information and can inform future efforts in refining the modulation of specific human traits in LLM outputs. Warning: This paper includes potentially offensive sample statements.
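The layer-selection step described above can be sketched as scoring each layer by how far apart two personas' mean activations sit relative to their spread, then picking the most divergent layer. The activations below are synthetic, and this simple mean-separation score is our stand-in for the paper's dimension-reduction and pattern-recognition battery.

```python
# Score per-layer persona separability and pick the most divergent layer.
import math

def separation(acts_a, acts_b):
    """Distance between class means, normalized by pooled per-point spread."""
    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]
    ma, mb = mean(acts_a), mean(acts_b)
    spread = sum(math.dist(v, m) for vs, m in ((acts_a, ma), (acts_b, mb))
                 for v in vs) / (len(acts_a) + len(acts_b))
    return math.dist(ma, mb) / (spread + 1e-9)

# Two personas' activations per layer: early layers overlap, late diverge.
layers = [
    ([(0.0, 0.0), (0.1, 0.1)], [(0.05, 0.0), (0.0, 0.1)]),   # layer 0
    ([(0.0, 0.0), (0.1, 0.1)], [(0.9, 1.0), (1.0, 0.9)]),    # layer 1
]
scores = [separation(a, b) for a, b in layers]
best_layer = max(range(len(scores)), key=scores.__getitem__)
assert best_layer == 1
```

The paper's finding that divergence concentrates in the final third of decoder layers corresponds to this score staying flat early and rising late.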

Updated: 2025-09-08 18:14:07

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.24539v3

Riemannian Batch Normalization: A Gyro Approach

Normalization layers are crucial for deep learning, but their Euclidean formulations are inadequate for data on manifolds. On the other hand, many Riemannian manifolds in machine learning admit gyro-structures, enabling principled extensions of Euclidean neural networks to non-Euclidean domains. Inspired by this, we introduce GyroBN, a principled Riemannian batch normalization framework for gyrogroups. We establish two necessary conditions, namely \emph{pseudo-reduction} and \emph{gyroisometric gyrations}, that guarantee GyroBN with theoretical control over sample statistics, and show that these conditions hold for all known gyrogroups in machine learning. Our framework also incorporates several existing Riemannian normalization methods as special cases. We further instantiate GyroBN on seven representative geometries, including the Grassmannian, five constant curvature spaces, and the correlation manifold, and derive novel gyro and Riemannian structures to enable these instantiations. Experiments across these geometries demonstrate the effectiveness of GyroBN. The code is available at https://github.com/GitZH-Chen/GyroBN.git.

Updated: 2025-09-08 18:12:08

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.07115v1

Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models

We introduce Advertisement Embedding Attacks (AEA), a new class of LLM security threats that stealthily inject promotional or malicious content into model outputs and AI agents. AEA operate through two low-cost vectors: (1) hijacking third-party service-distribution platforms to prepend adversarial prompts, and (2) publishing back-doored open-source checkpoints fine-tuned with attacker data. Unlike conventional attacks that degrade accuracy, AEA subvert information integrity, causing models to return covert ads, propaganda, or hate speech while appearing normal. We detail the attack pipeline, map five stakeholder victim groups, and present an initial prompt-based self-inspection defense that mitigates these injections without additional model retraining. Our findings reveal an urgent, under-addressed gap in LLM security and call for coordinated detection, auditing, and policy responses from the AI-safety community.

Updated: 2025-09-08 18:05:43

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.17674v2

ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression

Survival analysis is a fundamental tool for modeling time-to-event outcomes in healthcare. Recent advances have introduced flexible neural network approaches for improved predictive performance. However, most of these models do not provide interpretable insights into the association between exposures and the modeled outcomes, a critical requirement for decision-making in clinical practice. To address this limitation, we propose Additive Deep Hazard Analysis Mixtures (ADHAM), an interpretable additive survival model. ADHAM assumes a conditional latent structure that defines subgroups, each characterized by a combination of covariate-specific hazard functions. To select the number of subgroups, we introduce a post-training refinement that reduces the number of equivalent latent subgroups by merging similar groups. We perform comprehensive studies to demonstrate ADHAM's interpretability at the population, subgroup, and individual levels. Extensive experiments on real-world datasets show that ADHAM provides novel insights into the association between exposures and outcomes. Further, ADHAM remains on par with existing state-of-the-art survival baselines in terms of predictive performance, offering a scalable and interpretable approach to time-to-event prediction in healthcare.
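The conditional latent structure above amounts to a hazard of the form $\lambda(t \mid x) = \sum_g \pi_g(x) \sum_d h_{g,d}(x_d, t)$: subgroup probabilities gate a sum of covariate-specific hazard terms. The gate weights and hazard functions below are illustrative stand-ins, not the paper's parameterization.

```python
# Additive-mixture hazard sketch: softmax gate over subgroups, each subgroup
# a sum of per-covariate hazard terms.
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def adham_hazard(x, t, gate_weights, hazard_fns):
    """lambda(t | x) = sum_g pi_g(x) * sum_d h_{g,d}(x_d, t)."""
    pis = softmax([sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights])
    return sum(pi * sum(h(xd, t) for h, xd in zip(fns, x))
               for pi, fns in zip(pis, hazard_fns))

gate_weights = [[1.0, 0.0], [0.0, 1.0]]          # two latent subgroups
hazard_fns = [
    [lambda xd, t: 0.1 * xd, lambda xd, t: 0.05 * t * xd],   # subgroup 1
    [lambda xd, t: 0.3 * xd, lambda xd, t: 0.01 * t * xd],   # subgroup 2
]

h = adham_hazard([1.0, 0.5], t=2.0, gate_weights=gate_weights,
                 hazard_fns=hazard_fns)
assert h > 0.0
```

Because each $h_{g,d}$ depends on a single covariate, its contribution to the hazard can be plotted and read off directly, which is the source of the model's covariate-level interpretability; the post-training refinement then merges subgroups whose hazard functions are nearly identical.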

Updated: 2025-09-08 18:04:14

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2509.07108v1

Lookup multivariate Kolmogorov-Arnold Networks

High-dimensional linear mappings, or linear layers, dominate both the parameter count and the computational cost of most modern deep-learning models. We introduce a general drop-in replacement, lookup multivariate Kolmogorov-Arnold Networks (lmKANs), which deliver a substantially better trade-off between capacity and inference cost. Our construction expresses a general high-dimensional mapping through trainable low-dimensional multivariate functions. These functions can carry dozens or hundreds of trainable parameters each, and yet it takes only a few multiplications to compute them because they are implemented as spline lookup tables. Empirically, lmKANs reduce inference FLOPs by up to 6.0x while matching the flexibility of MLPs in general high-dimensional function approximation. In another feedforward fully connected benchmark, on the tabular-like dataset of randomly displaced methane configurations, lmKANs enable more than 10x higher H100 throughput at equal accuracy. Within frameworks of Convolutional Neural Networks, lmKAN-based CNNs cut inference FLOPs at matched accuracy by 1.6-2.1x and by 1.7x on the CIFAR-10 and ImageNet-1k datasets, respectively. Our code, including dedicated CUDA kernels, is available online at https://github.com/schwallergroup/lmkan.
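The cheap-evaluation claim rests on the lookup-table construction: a trainable function stored as a table of knot values, evaluated with an index computation and an interpolation, so only a few multiplications per call. The one-dimensional piecewise-linear version below is our simplification (the paper's tables are multivariate splines):

```python
# 1-D lookup-table function: tabulate once, evaluate with ~2 multiplications.
import math

def make_lut(f, lo, hi, n):
    """Tabulate f at n+1 evenly spaced knots (in lmKANs these are trained)."""
    return [f(lo + i * (hi - lo) / n) for i in range(n + 1)]

def lut_eval(table, lo, hi, x):
    pos = (x - lo) / (hi - lo) * (len(table) - 1)   # one multiply (+ scale)
    i = min(int(pos), len(table) - 2)
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])  # one more multiply

table = make_lut(math.sin, 0.0, math.pi, 256)
for x in (0.3, 1.0, 2.5):
    assert abs(lut_eval(table, 0.0, math.pi, x) - math.sin(x)) < 1e-3
```

A table can hold hundreds of trainable values, yet evaluation cost is independent of table size, which is how lmKANs trade parameters for FLOPs.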

Updated: 2025-09-08 18:00:35

Categories: cs.LG,cs.AI,cs.PF,cs.SE

Download: http://arxiv.org/abs/2509.07103v1

Instruction Agent: Enhancing Agent with Expert Demonstration

Graphical user interface (GUI) agents have advanced rapidly but still struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories. In this work, we introduce Instruction Agent, a GUI agent that leverages expert demonstrations to solve such tasks, enabling completion of otherwise difficult workflows. Given a single demonstration, the agent extracts step-by-step instructions and executes them by strictly following the trajectory intended by the user, avoiding mistakes during execution. The agent further leverages verifier and backtracker modules to improve robustness. Both modules are critical for understanding the outcome of each action and handling unexpected interruptions (such as pop-up windows) during execution. Our experiments show that Instruction Agent achieves a 60% success rate on a set of tasks in OSWorld that all top-ranked agents failed to complete. Instruction Agent offers a practical and extensible framework, bridging the gap between current GUI agents and reliable real-world GUI task automation.

Updated: 2025-09-08 18:00:12

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.07098v1

H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers

Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$_{2}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2}$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.
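The prune-then-recover pattern above can be sketched in a few lines: keep a few representative frame tokens for the expensive middle blocks, then restore full temporal length by interpolating between the kept frames. Uniform selection and linear interpolation are our simplifications of the TPM and TRM modules.

```python
# Prune pose tokens to a few representative frames, then recover full length.

def prune_tokens(tokens, keep):
    """Keep `keep` uniformly spaced frames (always including the endpoints)."""
    n = len(tokens)
    idx = [round(i * (n - 1) / (keep - 1)) for i in range(keep)]
    return idx, [tokens[i] for i in idx]

def recover_tokens(idx, kept, length):
    """Linearly interpolate the kept frames back to the full sequence length."""
    out = []
    for t in range(length):
        j = max(k for k in range(len(idx)) if idx[k] <= t)
        if j == len(idx) - 1:
            out.append(kept[j])
            continue
        frac = (t - idx[j]) / (idx[j + 1] - idx[j])
        out.append(kept[j] + frac * (kept[j + 1] - kept[j]))
    return out

tokens = [0.5 * t for t in range(9)]        # a linearly varying toy "pose"
idx, kept = prune_tokens(tokens, keep=3)
recovered = recover_tokens(idx, kept, len(tokens))
assert len(kept) == 3 and len(recovered) == 9
assert all(abs(a - b) < 1e-9 for a, b in zip(recovered, tokens))
```

In H$_{2}$OT the selection is learned and dynamic rather than uniform, and recovery uses the selected tokens' features, but the efficiency argument is the same: the middle transformer blocks see 3 tokens here instead of 9.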

Updated: 2025-09-08 17:59:59

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.06956v1

Transforming Wearable Data into Personal Health Insights using Large Language Model Agents

Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.

Updated: 2025-09-08 17:59:48

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2406.06464v4

Deep Reactive Policy: Learning Reactive Manipulator Motion Planning for Dynamic Environments

Generating collision-free motion in dynamic, partially observable environments is a fundamental challenge for robotic manipulators. Classical motion planners can compute globally optimal trajectories but require full environment knowledge and are typically too slow for dynamic scenes. Neural motion policies offer a promising alternative by operating in closed-loop directly on raw sensory inputs but often struggle to generalize in complex or dynamic settings. We propose Deep Reactive Policy (DRP), a visuo-motor neural motion policy designed for reactive motion generation in diverse dynamic environments, operating directly on point cloud sensory input. At its core is IMPACT, a transformer-based neural motion policy pretrained on 10 million generated expert trajectories across diverse simulation scenarios. We further improve IMPACT's static obstacle avoidance through iterative student-teacher finetuning. We additionally enhance the policy's dynamic obstacle avoidance at inference time using DCP-RMP, a locally reactive goal-proposal module. We evaluate DRP on challenging tasks featuring cluttered scenes, dynamic moving obstacles, and goal obstructions. DRP achieves strong generalization, outperforming prior classical and neural methods in success rate across both simulated and real-world settings. Video results and code available at https://deep-reactive-policy.com

Updated: 2025-09-08 17:59:35

Domains: cs.RO,cs.AI,cs.CV,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2509.06953v1

Sequentially Auditing Differential Privacy

We propose a practical sequential test for auditing differential privacy guarantees of black-box mechanisms. The test processes streams of mechanisms' outputs, providing anytime-valid inference while controlling Type I error, overcoming the fixed sample size limitation of previous batch auditing methods. Experiments show this test detects violations with sample sizes that are orders of magnitude smaller than existing methods, reducing this number from 50K to a few hundred examples across diverse realistic mechanisms. Notably, it identifies DP-SGD privacy violations in under one training run, unlike prior methods needing full model training.
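The anytime-valid mechanics can be sketched with a generic betting-style sequential test. This is not the authors' exact construction: it assumes the DP claim has already been reduced to a bound `p0` on the rate of some violation indicator, and tests that bound as a stream arrives.

```python
import random

def anytime_test(stream, p0, alpha=0.05, frac=0.5):
    # Betting-style sequential test of H0: E[X] <= p0 for X in {0, 1}.
    # Wealth is a nonnegative supermartingale under H0, so by Ville's
    # inequality, rejecting when wealth >= 1/alpha controls Type I
    # error at level alpha at every stopping time (anytime-valid).
    lam = frac / p0                      # bet size; frac <= 1 keeps wealth >= 0
    wealth = 1.0
    for t, x in enumerate(stream, start=1):
        wealth *= 1.0 + lam * (x - p0)
        if wealth >= 1.0 / alpha:
            return t                     # violation detected after t outputs
    return None                          # no evidence against the claim

# Toy audit: the claimed guarantee caps the event rate at p0 = 0.1,
# but the mechanism actually triggers the event 60% of the time.
random.seed(0)
violations = (1 if random.random() < 0.6 else 0 for _ in range(1000))
print(anytime_test(violations, p0=0.1))  # rejects after a handful of samples
```

Because the wealth process is valid at every step, the auditor can stop the moment evidence accumulates, which is what enables detection with a few hundred rather than 50K samples.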

Updated: 2025-09-08 17:57:51

Domains: cs.CR,cs.LG,stat.ME

Download: http://arxiv.org/abs/2509.07055v1

Interleaving Reasoning for Better Text-to-Image Generation

Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released in: https://github.com/Osilly/Interleaving-Reasoning-Generation .

Updated: 2025-09-08 17:56:23

Domains: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2509.06945v1

Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference

Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable reward. However, they exhibit two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive, thus restricting optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models in order to achieve desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any time steps via interpolation, leveraging the equation that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
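The interpolation identity behind Direct-Align is easy to see concretely. Assuming a linear (rectified-flow-style) schedule for this sketch: if the noise is predefined, the clean image is recoverable in closed form from any timestep, which is what frees the method from multistep denoising.

```python
import numpy as np

def interpolate(x0, noise, t):
    # Diffusion state as a straight-line interpolation between the
    # clean image x0 and noise (linear schedule, assumed here).
    return (1.0 - t) * x0 + t * noise

def recover(xt, noise, t):
    # With the noise predefined (as in Direct-Align), invert the
    # interpolation to get the clean image from any timestep t < 1.
    return (xt - t * noise) / (1.0 - t)

rng = np.random.default_rng(0)
x0, noise = rng.normal(size=4), rng.normal(size=4)
xt = interpolate(x0, noise, t=0.8)
assert np.allclose(recover(xt, noise, t=0.8), x0)  # exact recovery
```

A differentiable reward can then be scored on the recovered image directly, without backpropagating through a long denoising chain.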

Updated: 2025-09-08 17:54:08

Domains: cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.06942v1

Outcome-based Exploration for LLM Reasoning

Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness of the final answer, yields substantial accuracy gains but also induces a systematic loss in generation diversity. This collapse undermines real-world performance, where diversity is critical for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and show that, strikingly, RL can reduce effective diversity even on the training set relative to the base model. Our study highlights two central findings: (i) a transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) the tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses according to final outcomes. We introduce two complementary algorithms: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. Experiments on standard competition math with Llama and Qwen models demonstrate that both methods improve accuracy while mitigating diversity collapse. On the theoretical side, we formalize the benefit of outcome-based exploration through a new model of outcome-based bandits. Together, these contributions chart a practical path toward RL methods that enhance reasoning without sacrificing the diversity essential for scalable deployment.
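The "historical exploration" bonus can be sketched as a UCB-style count over final answers. This is an illustrative reduction, not the authors' exact algorithm; the scale `c` is a free hyperparameter.

```python
import math
from collections import Counter

class OutcomeExplorer:
    # UCB-style exploration bonus over *final answers*: because reasoning
    # tasks admit only a limited set of distinct outcomes, counting
    # answers (rather than states) stays tractable.
    def __init__(self, c=1.0):
        self.counts = Counter()
        self.total = 0
        self.c = c

    def bonus(self, answer):
        # Record the outcome, then reward it inversely to how often it
        # has been seen: rarely observed answers earn large bonuses.
        self.counts[answer] += 1
        self.total += 1
        return self.c * math.sqrt(math.log(self.total + 1) / self.counts[answer])

ex = OutcomeExplorer()
first = ex.bonus("42")    # novel answer: large bonus
second = ex.bonus("42")   # repeated answer: smaller bonus
assert second < first
```

The bonus is added to the correctness reward during training, nudging the policy to keep sampling distinct answers instead of collapsing onto one.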

Updated: 2025-09-08 17:52:56

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2509.06941v1

From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes now poses an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights on transformer internal processing mechanics has immediate consequences for aligning AI models with human values, AI safety, opening the attack surface for potential adversarial attacks, and providing a basis for automatic quantification of a model's hallucination risk.

Updated: 2025-09-08 17:50:45

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06938v1

Learning words in groups: fusion algebras, tensor ranks and grokking

In this work, we demonstrate that a simple two-layer neural network with standard activation functions can learn an arbitrary word operation in any finite group, provided sufficient width is available and exhibits grokking while doing so. To explain the mechanism by which this is achieved, we reframe the problem as that of learning a particular $3$-tensor, which we show is typically of low rank. A key insight is that low-rank implementations of this tensor can be obtained by decomposing it along triplets of basic self-conjugate representations of the group and leveraging the fusion structure to rule out many components. Focusing on a phenomenologically similar but more tractable surrogate model, we show that the network is able to find such low-rank implementations (or approximations thereof), thereby using limited width to approximate the word-tensor in a generalizable way. In the case of the simple multiplication word, we further elucidate the form of these low-rank implementations, showing that the network effectively implements efficient matrix multiplication in the sense of Strassen. Our work also sheds light on the mechanism by which a network reaches such a solution under gradient descent.

Updated: 2025-09-08 17:43:45

Domains: cs.LG

Download: http://arxiv.org/abs/2509.06931v1

Statistical Methods in Generative AI

Generative Artificial Intelligence is emerging as an important technology, promising to be transformative in many areas. At the same time, generative AI techniques are based on sampling from probabilistic models, and by default, they come with no guarantees about correctness, safety, fairness, or other properties. Statistical methods offer a promising potential approach to improve the reliability of generative AI techniques. In addition, statistical methods are also promising for improving the quality and efficiency of AI evaluation, as well as for designing interventions and experiments in AI. In this paper, we review some of the existing work on these topics, explaining both the general statistical techniques used, as well as their applications to generative AI. We also discuss limitations and potential future directions.

Updated: 2025-09-08 17:42:59

Domains: cs.AI,cs.LG,stat.ME

Download: http://arxiv.org/abs/2509.07054v1

A comparative analysis of rank aggregation methods for the partial label ranking problem

The label ranking problem is a supervised learning scenario in which the learner predicts a total order of the class labels for a given input instance. Recently, research has increasingly focused on the partial label ranking problem, a generalization of the label ranking problem that allows ties in the predicted orders. So far, most existing learning approaches for the partial label ranking problem rely on approximation algorithms for rank aggregation in the final prediction step. This paper explores several alternative aggregation methods for this critical step, including scoring-based and non-parametric probabilistic-based rank aggregation approaches. To enhance their suitability for the more general partial label ranking problem, the investigated methods are extended to increase the likelihood of producing ties. Experimental evaluations on standard benchmarks demonstrate that scoring-based variants consistently outperform the current state-of-the-art method in handling incomplete information. In contrast, non-parametric probabilistic-based variants fail to achieve competitive performance.
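A scoring-based aggregation extended to produce ties can be sketched as follows. The bucketing threshold is an illustrative choice, not the authors' exact mechanism: labels whose average Borda scores fall within `tie_gap` of each other are declared tied.

```python
def aggregate_with_ties(rankings, tie_gap=0.5):
    # Borda-style rank aggregation that allows ties: `rankings` is a
    # list of rankings, each a list of labels from best to worst.
    scores = {}
    for r in rankings:
        n = len(r)
        for pos, label in enumerate(r):
            scores[label] = scores.get(label, 0.0) + (n - 1 - pos) / len(rankings)
    ordered = sorted(scores, key=scores.get, reverse=True)
    buckets, prev = [], None
    for label in ordered:
        if prev is not None and prev - scores[label] < tie_gap:
            buckets[-1].append(label)      # scores too close: tie them
        else:
            buckets.append([label])
        prev = scores[label]
    return buckets

# Two voters disagree on a vs b but agree c is worst -> a and b tie.
print(aggregate_with_ties([["a", "b", "c"], ["b", "a", "c"]]))
# → [['a', 'b'], ['c']]
```

Widening `tie_gap` increases the likelihood of ties, which is the knob the extended variants tune for the partial label ranking setting.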

Updated: 2025-09-08 17:40:12

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2502.17077v4

Data-driven solar forecasting enables near-optimal economic decisions

Solar energy adoption is critical to achieving net-zero emissions. However, it remains difficult for many industrial and commercial actors to decide on whether they should adopt distributed solar-battery systems, which is largely due to the unavailability of fast, low-cost, and high-resolution irradiance forecasts. Here, we present SunCastNet, a lightweight data-driven forecasting system that provides 0.05$^\circ$, 10-minute resolution predictions of surface solar radiation downwards (SSRD) up to 7 days ahead. SunCastNet, coupled with reinforcement learning (RL) for battery scheduling, reduces operational regret by 76--93\% compared to robust decision making (RDM). In 25-year investment backtests, it enables up to five of ten high-emitting industrial sectors per region to cross the commercial viability threshold of 12\% Internal Rate of Return (IRR). These results show that high-resolution, long-horizon solar forecasts can directly translate into measurable economic gains, supporting near-optimal energy operations and accelerating renewable deployment.

Updated: 2025-09-08 17:38:05

Domains: physics.geo-ph,cs.LG

Download: http://arxiv.org/abs/2509.06925v1

Neutron Reflectometry by Gradient Descent

Neutron reflectometry (NR) is a powerful technique to probe surfaces and interfaces. NR is inherently an indirect measurement technique: access to the physical quantities of interest (layer thickness, scattering length density, roughness) necessitates the solution of an inverse modelling problem, which is inefficient for large amounts of data or complex multilayer structures (e.g. lithium batteries / electrodes). Recently, surrogate machine learning models have been proposed as an alternative to existing optimisation routines. Although such approaches have been successful, physical intuition is lost when replacing governing equations with fast neural networks. Instead, we propose a novel and efficient approach: to optimise reflectivity data analysis by performing gradient descent on the forward reflection model itself. Herein, automatic differentiation techniques are used to evaluate exact gradients of the error function with respect to the parameters of interest. Access to these quantities enables users of neutron reflectometry to harness a host of powerful modern optimisation and inference techniques that remain thus far unexploited in the context of neutron reflectometry. This paper presents two benchmark case studies: demonstrating state-of-the-art performance on a thick oxide quartz film, and robust co-fitting performance in the high-complexity regime of organic LED multilayer devices. Additionally, we provide an open-source library of differentiable reflectometry kernels in the Python programming language so that gradient-based approaches can readily be applied to other NR datasets.
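The core loop — gradient descent on a forward model's misfit — can be sketched with a toy model. A real NR kernel would use the Abeles matrix formalism and autodiff for the gradient; here the forward model is a simple exponential decay with a hand-derived analytic gradient standing in for what automatic differentiation would supply.

```python
import numpy as np

def forward(q, rate):
    # Toy forward model: exponential decay in momentum transfer q.
    # (Stand-in for a real reflectivity kernel, NOT the Abeles method.)
    return np.exp(-rate * q)

def fit_rate(q, data, theta=1.0, lr=5.0, steps=500):
    # Gradient descent on the least-squares misfit, using the exact
    # analytic gradient of the forward model w.r.t. its parameter --
    # the quantity autodiff provides for free on arbitrary models.
    for _ in range(steps):
        resid = forward(q, theta) - data
        dmodel = -q * np.exp(-theta * q)      # d forward / d theta
        theta -= lr * 2.0 * np.mean(resid * dmodel)
    return theta

q = np.linspace(0.01, 2.0, 200)
data = forward(q, rate=2.0)                   # synthetic "measurement"
print(round(fit_rate(q, data), 2))            # → 2.0
```

With exact gradients in hand, the same misfit can just as easily be handed to quasi-Newton optimisers or gradient-based samplers, which is the broader point of the paper.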

Updated: 2025-09-08 17:38:01

Domains: cs.LG,cond-mat.mtrl-sci

Download: http://arxiv.org/abs/2509.06924v1

Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures

Maximum entropy reinforcement learning integrates exploration into policy learning by providing additional intrinsic rewards proportional to the entropy of some distribution. In this paper, we propose a novel approach in which the intrinsic reward function is the relative entropy of the discounted distribution of states and actions (or features derived from these states and actions) visited during future time steps. This approach is motivated by three results. First, this new objective is a lower bound on the negated entropy of the marginal visitation distribution of states and actions, commonly used as an alternative exploration objective. Second, a policy maximizing the expected discounted sum of intrinsic rewards also maximizes a lower bound on the state-action value function of the decision process. Third, the distribution used in the intrinsic reward definition is the fixed point of a contraction operator. Existing algorithms can therefore be adapted to learn this fixed point off-policy and compute the intrinsic rewards. We finally introduce an algorithm maximizing our new objective and show that resulting policies have good state-action space coverage and achieve high-performance control.
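The discounted visitation measure at the heart of the intrinsic reward can be estimated empirically. This single-trajectory sketch is a simplification of the paper's fixed-point formulation: each future state is weighted by its discount factor and the weights are normalized into a distribution.

```python
from collections import defaultdict

def discounted_visitation(trajectory, gamma=0.9):
    # Empirical discounted visitation measure over future states --
    # the distribution whose relative entropy serves as the
    # intrinsic reward (single-trajectory simplification).
    weights = defaultdict(float)
    norm = 0.0
    for t, s in enumerate(trajectory):
        w = gamma ** t          # earlier visits count more
        weights[s] += w
        norm += w
    return {s: w / norm for s, w in weights.items()}

d = discounted_visitation(["s0", "s1", "s0"], gamma=0.5)
print(d)   # s0 carries weight 1 + 0.25, s1 carries 0.5, then normalized
```

In the paper this distribution is the fixed point of a contraction operator, so it can be learned off-policy by adapting standard temporal-difference machinery rather than estimated from rollouts as above.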

Updated: 2025-09-08 17:36:55

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2412.06655v2

Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. LLMs fail to discover viable reasoning paths when problems are overly difficult, while learning little new capability when problems are too simple. In this work, we formalize the impact of problem difficulty by quantifying the relationship between loss descent speed and rollout accuracy. Building on this analysis, we propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region. SEELE augments each training sample by appending a hint (part of a full solution) after the original problem. Unlike previous hint-based approaches, SEELE deliberately and adaptively adjusts the hint length for each problem to achieve an optimal difficulty. To determine the optimal hint length, SEELE employs a multi-round rollout sampling strategy. In each round, it fits an item response theory model to the accuracy-hint pairs collected in preceding rounds to predict the required hint length for the next round. This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency. Experimental results show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5 points, respectively, and surpasses the best previous supervision-aided approach by +3.6 points on average across six math reasoning benchmarks.
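The hint-length selection step can be sketched with a two-parameter logistic item response model. The grid-search fit and the functional form are illustrative assumptions; the real system fits the IRT model to accuracy-hint pairs collected across rollout rounds.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_irt(pairs):
    # Fit p(correct | hint) = sigmoid(a * (h - b)) to (hint_fraction,
    # accuracy) pairs by least squares over a coarse grid -- a simple
    # stand-in for the paper's IRT fitting step.
    best, best_err = None, float("inf")
    for a10 in range(1, 101):            # discrimination a in (0.1, 10]
        for b100 in range(0, 101):       # difficulty b in [0, 1]
            a, b = a10 / 10.0, b100 / 100.0
            err = sum((sigmoid(a * (h - b)) - p) ** 2 for h, p in pairs)
            if err < best_err:
                best, best_err = (a, b), err
    return best

def next_hint(a, b, target=0.5):
    # Invert the logistic: the hint fraction predicted to land the
    # rollout accuracy at `target` (the "sweet spot").
    return min(1.0, max(0.0, b + math.log(target / (1 - target)) / a))

pairs = [(0.0, 0.05), (0.3, 0.2), (0.6, 0.7), (0.9, 0.95)]
a, b = fit_irt(pairs)
print(round(next_hint(a, b, target=0.7), 2))
```

Refitting after every round is what keeps the difficulty adjustment aligned with the model's evolving capability.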

Updated: 2025-09-08 17:36:21

Domains: cs.LG

Download: http://arxiv.org/abs/2509.06923v1

Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities

Traditional Artificial Intelligence (AI) approaches in cybersecurity exhibit fundamental limitations: inadequate conceptual grounding leading to non-robustness against novel attacks; limited instructibility impeding analyst-guided adaptation; and misalignment with cybersecurity objectives. Neuro-Symbolic (NeSy) AI has emerged with the potential to revolutionize cybersecurity AI. However, there is no systematic understanding of this emerging approach. These hybrid systems address critical cybersecurity challenges by combining neural pattern recognition with symbolic reasoning, enabling enhanced threat understanding while introducing concerning autonomous offensive capabilities that reshape threat landscapes. In this survey, we systematically characterize this field by analyzing 127 publications spanning 2019-July 2025. We introduce a Grounding-Instructibility-Alignment (G-I-A) framework to evaluate these systems, focusing on both cyber defense and cyber offense across network security, malware analysis, and cyber operations. Our analysis shows advantages of multi-agent NeSy architectures and identifies critical implementation challenges including standardization gaps, computational complexity, and human-AI collaboration requirements that constrain deployment. We show that causal reasoning integration is the most transformative advancement, enabling proactive defense beyond correlation-based approaches. Our findings highlight dual-use implications where autonomous systems demonstrate substantial capabilities in zero-day exploitation while achieving significant cost reductions, altering threat dynamics. We provide insights and future research directions, emphasizing the urgent need for community-driven standardization frameworks and responsible development practices that ensure advancement serves defensive cybersecurity objectives while maintaining societal alignment.

Updated: 2025-09-08 17:33:59

Domains: cs.CR,cs.AI

Download: http://arxiv.org/abs/2509.06921v1

An Ethically Grounded LLM-Based Approach to Insider Threat Synthesis and Detection

Insider threats are a growing organizational problem due to the complexity of identifying their technical and behavioral elements. A large research body is dedicated to the study of insider threats from technological, psychological, and educational perspectives. However, research in this domain has been generally dependent on datasets that are static and limited access which restricts the development of adaptive detection models. This study introduces a novel, ethically grounded approach that uses the large language model (LLM) Claude Sonnet 3.7 to dynamically synthesize syslog messages, some of which contain indicators of insider threat scenarios. The messages reflect real-world data distributions by being highly imbalanced (1% insider threats). The syslogs were analyzed for insider threats by both Claude Sonnet 3.7 and GPT-4o, with their performance evaluated through statistical metrics including precision, recall, MCC, and ROC AUC. Sonnet 3.7 consistently outperformed GPT-4o across nearly all metrics, particularly in reducing false alarms and improving detection accuracy. The results show strong promise for the use of LLMs in synthetic dataset generation and insider threat detection.

Updated: 2025-09-08 17:32:17

Domains: cs.CR,cs.AI,cs.CL,cs.CY,C.2.0; I.2.7; K.4.1; H.3.3

Download: http://arxiv.org/abs/2509.06920v1

Neural CRNs: A Natural Implementation of Learning in Chemical Reaction Networks

Molecular circuits capable of autonomous learning could unlock novel applications in fields such as bioengineering and synthetic biology. To this end, existing chemical implementations of neural computing have mainly relied on emulating discrete-layered neural architectures using steady-state computations of mass action kinetics. In contrast, we propose an alternative dynamical systems-based approach in which neural computations are modeled as the time evolution of molecular concentrations. The analog nature of our framework naturally aligns with chemical kinetics-based computation, leading to more compact circuits. We present the advantages of our framework through three key demonstrations. First, we assemble an end-to-end supervised learning pipeline using only two sequential phases, the minimum required number for supervised learning. Then, we show (through appropriate simplifications) that both linear and nonlinear modeling circuits can be implemented solely using unimolecular and bimolecular reactions, avoiding the complexities of higher-order chemistries. Finally, we demonstrate that first-order gradient approximations can be natively incorporated into the framework, enabling nonlinear models to scale linearly rather than combinatorially with input dimensionality. All the circuit constructions are validated through training and inference simulations across various regression and classification tasks. Our work presents a viable pathway toward embedding learning behaviors in synthetic biochemical systems.
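The "computation as time evolution of concentrations" framing can be made concrete with the simplest bimolecular reaction. This sketch just integrates mass-action kinetics for A + B → C with forward Euler; in the Neural CRNs framework, networks of such reactions are composed so that the trajectory itself carries out the neural computation.

```python
def simulate_crn(a, b, c, k=1.0, dt=1e-3, steps=5000):
    # Mass-action kinetics for the bimolecular reaction A + B ->(k) C,
    # integrated with forward Euler: the concentration dynamics ARE
    # the computation in this analog framework.
    for _ in range(steps):
        rate = k * a * b       # mass-action rate law
        a -= rate * dt
        b -= rate * dt
        c += rate * dt
    return a, b, c

a, b, c = simulate_crn(1.0, 2.0, 0.0)
# A is limiting, so c approaches 1.0; conservation gives a + c = 1 and
# b + c = 2 exactly, even under Euler discretization.
assert abs((a + c) - 1.0) < 1e-9
```

Restricting circuits to unimolecular and bimolecular steps, as the paper does, keeps every update in this simple product-of-concentrations form.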

Updated: 2025-09-08 17:30:24

标题: 神经CRNs:化学反应网络中学习的自然实现

摘要: 能够自主学习的分子电路可能会在生物工程和合成生物学等领域开启新的应用。为此,现有的神经计算化学实现主要依赖于模拟离散层次的神经结构,使用质量作用动力学的稳态计算。相比之下,我们提出了一种基于动力系统的替代方法,其中神经计算被建模为分子浓度的时间演化。我们框架的模拟(analog)性质自然地与基于化学动力学的计算相一致,从而得到更紧凑的电路。我们通过三个关键演示展示了我们框架的优势。首先,我们仅使用两个顺序阶段组装了一个端到端的监督学习管道,这是监督学习所需的最少阶段数。然后,我们展示(通过适当的简化)线性和非线性建模电路可以仅使用单分子和双分子反应来实现,避免了高阶化学反应的复杂性。最后,我们证明一阶梯度近似可以原生地集成到框架中,使非线性模型的规模随输入维度呈线性而非组合式增长。所有电路构建都通过在各种回归和分类任务中进行训练和推断模拟进行验证。我们的工作提出了在合成生化系统中嵌入学习行为的可行途径。

更新时间: 2025-09-08 17:30:24

领域: cs.LG,cs.ET

下载: http://arxiv.org/abs/2409.00034v4
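The dynamical-systems view above computes directly with mass-action kinetics. A minimal sketch (forward Euler integration, with an illustrative rate constant and concentrations) of simulating the bimolecular reaction A + B → C:

```python
def simulate(a0, b0, k=1.0, dt=1e-3, steps=5000):
    """Forward-Euler integration of mass-action kinetics for A + B -> C:
    d[A]/dt = d[B]/dt = -k[A][B], d[C]/dt = +k[A][B]."""
    a, b, c = a0, b0, 0.0
    for _ in range(steps):
        rate = k * a * b      # bimolecular mass-action rate
        a -= rate * dt
        b -= rate * dt
        c += rate * dt
    return a, b, c

a, b, c = simulate(1.0, 2.0)  # A is limiting; integrate to t = 5 time units
```

Mass conservation (a + c = a0 and b + c = b0) holds throughout the integration, which gives a quick sanity check for such simulations.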

Tackling the Noisy Elephant in the Room: Label Noise-robust Out-of-Distribution Detection via Loss Correction and Low-rank Decomposition

Robust out-of-distribution (OOD) detection is an indispensable component of modern artificial intelligence (AI) systems, especially in safety-critical applications where models must identify inputs from unfamiliar classes not seen during training. While OOD detection has been extensively studied in the machine learning literature--with both post hoc and training-based approaches--its effectiveness under noisy training labels remains underexplored. Recent studies suggest that label noise can significantly degrade OOD performance, yet principled solutions to this issue are lacking. In this work, we demonstrate that directly combining existing label noise-robust methods with OOD detection strategies is insufficient to address this critical challenge. To overcome this, we propose a robust OOD detection framework that integrates loss correction techniques from the noisy label learning literature with low-rank and sparse decomposition methods from signal processing. Extensive experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms the state-of-the-art OOD detection techniques, particularly under severe noisy label settings.

Updated: 2025-09-08 17:28:59

标题: 解决房间里的吵闹大象:通过损失校正和低秩分解实现标签噪声鲁棒的分布外检测

摘要: 鲁棒的分布外(OOD)检测是现代人工智能(AI)系统中不可或缺的组成部分,尤其是在安全关键应用中,模型必须识别训练期间未见过的陌生类别的输入。尽管OOD检测在机器学习文献中已得到广泛研究(包括事后方法和基于训练的方法),其在含噪训练标签下的有效性仍未得到充分探讨。最近的研究表明,标签噪声会显著降低OOD检测性能,然而针对这一问题的原则性解决方案仍然缺乏。在这项工作中,我们证明,直接将现有的标签噪声鲁棒方法与OOD检测策略相结合不足以解决这一关键挑战。为了克服这一问题,我们提出了一个鲁棒的OOD检测框架,将含噪标签学习文献中的损失校正技术与信号处理中的低秩与稀疏分解方法整合在一起。在合成和真实数据集上进行的大量实验证明,我们的方法明显优于最先进的OOD检测技术,尤其是在严重标签噪声的设置下。

更新时间: 2025-09-08 17:28:59

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.06918v1

Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery. Conventional research papers require readers to invest substantial effort to understand and adapt a paper's code, data, and methods to their own work, creating barriers to dissemination and reuse. Paper2Agent addresses this challenge by automatically converting a paper into an AI agent that acts as a knowledgeable research assistant. It systematically analyzes the paper and the associated codebase using multiple agents to construct a Model Context Protocol (MCP) server, then iteratively generates and runs tests to refine and robustify the resulting MCP. These paper MCPs can then be flexibly connected to a chat agent (e.g. Claude Code) to carry out complex scientific queries through natural language while invoking tools and workflows from the original paper. We demonstrate Paper2Agent's effectiveness in creating reliable and capable paper agents through in-depth case studies. Paper2Agent created an agent that leverages AlphaGenome to interpret genomic variants and agents based on ScanPy and TISSUE to carry out single-cell and spatial transcriptomics analyses. We validate that these paper agents can reproduce the original paper's results and can correctly carry out novel user queries. By turning static papers into dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for knowledge dissemination and a foundation for the collaborative ecosystem of AI co-scientists.

Updated: 2025-09-08 17:28:42

标题: Paper2Agent:将研究论文重新构想为交互式和可靠的人工智能代理

摘要: 我们介绍了Paper2Agent,这是一个自动化框架,可以将研究论文转化为人工智能代理。Paper2Agent将研究成果从被动的产物转变为可以加速下游使用、采纳和发现的活跃系统。传统的研究论文需要读者投入大量精力才能理解论文的代码、数据和方法并将其适配到自己的工作中,从而对传播和复用造成障碍。Paper2Agent通过自动将论文转化为一个充当知识渊博的研究助手的人工智能代理来解决这一挑战。它利用多个代理系统地分析论文和相关的代码库,构建一个模型上下文协议(MCP)服务器,然后迭代生成并运行测试来优化和加固所得到的MCP。这些论文MCP可以灵活地连接到一个聊天代理(例如Claude Code),通过自然语言执行复杂的科学查询,同时调用原始论文中的工具和工作流程。我们通过深入的案例研究展示了Paper2Agent在创建可靠且能力强的论文代理方面的有效性。Paper2Agent创建了一个利用AlphaGenome解释基因组变异的代理,以及基于ScanPy和TISSUE进行单细胞和空间转录组分析的代理。我们验证了这些论文代理能够复现原始论文的结果,并能够正确地执行新颖的用户查询。通过将静态论文转化为动态、互动的人工智能代理,Paper2Agent引入了一种新的知识传播范式,为人工智能共同科学家的协作生态系统奠定了基础。

更新时间: 2025-09-08 17:28:42

领域: cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2509.06917v1

Probabilistic operator learning: generative modeling and uncertainty quantification for foundation models of differential equations

In-context operator networks (ICON) are a class of operator learning methods based on the novel architectures of foundation models. Trained on a diverse set of datasets of initial and boundary conditions paired with corresponding solutions to ordinary and partial differential equations (ODEs and PDEs), ICON learns to map example condition-solution pairs of a given differential equation to an approximation of its solution operator. Here, we present a probabilistic framework that reveals ICON as implicitly performing Bayesian inference, where it computes the mean of the posterior predictive distribution over solution operators conditioned on the provided context, i.e., example condition-solution pairs. The formalism of random differential equations provides the probabilistic framework for describing the tasks ICON accomplishes while also providing a basis for understanding other multi-operator learning methods. This probabilistic perspective provides a basis for extending ICON to \emph{generative} settings, where one can sample from the posterior predictive distribution of solution operators. The generative formulation of ICON (GenICON) captures the underlying uncertainty in the solution operator, which enables principled uncertainty quantification in the solution predictions in operator learning.

Updated: 2025-09-08 17:28:39

标题: 概率算子学习:微分方程基础模型的生成建模和不确定性量化

摘要: 上下文中算子网络(ICON)是一类基于基础模型新型架构的算子学习方法。ICON在多样化的数据集上训练,这些数据集由初始条件、边界条件与常微分方程和偏微分方程(ODE和PDE)的相应解配对组成;ICON学习将给定微分方程的示例条件-解对映射到其解算子的近似。在这里,我们提出一个概率框架,揭示ICON在隐式地执行贝叶斯推断:它计算以所提供的上下文(即示例条件-解对)为条件的解算子后验预测分布的均值。随机微分方程的形式化为描述ICON所完成的任务提供了概率框架,同时也为理解其他多算子学习方法提供了基础。这种概率视角为将ICON扩展到生成式设置提供了基础,在该设置中可以从解算子的后验预测分布中采样。ICON的生成式形式(GenICON)捕捉了解算子中的潜在不确定性,从而使算子学习中的解预测能够进行有原则的不确定性量化。

更新时间: 2025-09-08 17:28:39

领域: stat.ML,cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2509.05186v2

Hypergraph-Guided Regex Filter Synthesis for Event-Based Anomaly Detection

We propose HyGLAD, a novel algorithm that automatically builds a set of interpretable patterns that model event data. These patterns can then be used to detect event-based anomalies in a stationary system, where any deviation from past behavior may indicate malicious activity. The algorithm infers equivalence classes of entities with similar behavior observed from the events, and then builds regular expressions that capture the values of those entities. As opposed to deep-learning approaches, the regular expressions are directly interpretable, which also translates to interpretable anomalies. We evaluate HyGLAD against all 7 unsupervised anomaly detection methods from DeepOD on five datasets from real-world systems. The experimental results show that on average HyGLAD outperforms existing deep-learning methods while being an order of magnitude more efficient in training and inference (single CPU vs GPU). Precision improved by 1.2x and recall by 1.3x compared to the second-best baseline.

Updated: 2025-09-08 17:25:23

标题: 基于超图引导的正则表达式过滤器合成用于基于事件的异常检测

摘要: 我们提出了HyGLAD,一种新颖的算法,可以自动构建一组可解释的模式来建模事件数据。这些模式可以用来检测平稳系统中基于事件的异常,其中任何偏离过去行为的情况都可能表明恶意活动。该算法从事件中推断出行为相似的实体的等价类,然后构建捕获这些实体取值的正则表达式。与深度学习方法相反,这些正则表达式是直接可解释的,这也意味着异常是可解释的。我们将HyGLAD与DeepOD中的全部7种无监督异常检测方法在来自真实系统的五个数据集上进行了评估。实验结果表明,平均而言,HyGLAD优于现有的深度学习方法,同时在训练和推理上效率高出一个数量级(单CPU对比GPU)。与次优基线相比,精确率提高了1.2倍,召回率提高了1.3倍。

更新时间: 2025-09-08 17:25:23

领域: cs.SE,cs.LG

下载: http://arxiv.org/abs/2509.06911v1
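To make the regex-synthesis idea concrete, here is a deliberately simplified sketch: given equal-length values observed for one inferred equivalence class, each character position is widened to a character class, and any event value that fails the resulting pattern is flagged as a deviation. The host names and the position-wise widening rule are illustrative; HyGLAD's actual hypergraph-guided algorithm is more general.

```python
import re

def synthesize_pattern(values):
    """Generalize equal-length observed field values into one regex by
    widening each character position to a class (digits -> \\d,
    letters -> [A-Za-z], otherwise a literal or wildcard)."""
    n = len(values[0])
    assert all(len(v) == n for v in values)
    parts = []
    for i in range(n):
        chars = {v[i] for v in values}
        if all(c.isdigit() for c in chars):
            parts.append(r"\d")
        elif all(c.isalpha() for c in chars):
            parts.append("[A-Za-z]")
        elif len(chars) == 1:
            parts.append(re.escape(chars.pop()))
        else:
            parts.append(".")
    return "^" + "".join(parts) + "$"

# hypothetical host names from one inferred equivalence class
pattern = synthesize_pattern(["web01", "web02", "app03"])
```

An unseen but conforming value (e.g. "web09") matches the pattern, while a structurally deviating value is flagged as an anomaly.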

Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification

Large Language Models (LLMs) as stochastic systems may generate numbers that deviate from available data, a failure known as \emph{numeric hallucination}. Existing safeguards -- retrieval-augmented generation, citations, and uncertainty estimation -- improve transparency but cannot guarantee fidelity: fabricated or misquoted values may still be displayed as if correct. We propose \textbf{Proof-Carrying Numbers (PCN)}, a presentation-layer protocol that enforces numeric fidelity through mechanical verification. Under PCN, numeric spans are emitted as \emph{claim-bound tokens} tied to structured claims, and a verifier checks each token under a declared policy (e.g., exact equality, rounding, aliases, or tolerance with qualifiers). Crucially, PCN places verification in the \emph{renderer}, not the model: only claim-checked numbers are marked as verified, and all others default to unverified. This separation prevents spoofing and guarantees fail-closed behavior. We formalize PCN and prove soundness, completeness under honest tokens, fail-closed behavior, and monotonicity under policy refinement. PCN is lightweight and model-agnostic, integrates seamlessly into existing applications, and can be extended with cryptographic commitments. By enforcing verification as a mandatory step before display, PCN establishes a simple contract for numerically sensitive settings: \emph{trust is earned only by proof}, while the absence of a mark communicates uncertainty.

Updated: 2025-09-08 17:20:16

标题: 携带证明的数字(PCN):通过声明验证从LLMs获取可信任的数值答案的协议

摘要: 大型语言模型(LLMs)作为随机系统可能会生成与现有数据偏离的数字,这种失败被称为“数字幻觉”。现有的保障措施——检索增强生成、引用和不确定性估计——提高了透明度,但不能保证忠实性:虚构或误引用的值仍可能被显示为正确。我们提出了\textbf{携带证明的数字(PCN)},这是一种通过机械验证强制数字忠实性的呈现层协议。在PCN下,数值片段作为与结构化声明绑定的“声明绑定令牌”输出,验证器根据声明的策略(例如,精确相等、四舍五入、别名或带限定词的容差)检查每个令牌。至关重要的是,PCN将验证放置在\emph{渲染器}中,而不是模型中:只有通过声明检查的数字被标记为已验证,所有其他数字默认为未验证。这种分离防止了欺骗,并保证了失败即关闭(fail-closed)的行为。我们形式化了PCN,并证明了其可靠性、在诚实令牌下的完备性、失败即关闭行为以及策略细化下的单调性。PCN轻量且与模型无关,可以无缝集成到现有应用程序中,并可通过密码学承诺进行扩展。通过在显示之前将验证作为强制步骤,PCN为数值敏感场景建立了一个简单的契约:\emph{信任只能通过证明获得},而缺少标记则传达不确定性。

更新时间: 2025-09-08 17:20:16

领域: cs.CL,cs.CR,cs.DB,cs.LG

下载: http://arxiv.org/abs/2509.06902v1
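The verification policies listed above (exact equality, rounding, tolerance) are simple enough to sketch. The following is an illustrative, hypothetical rendering check in the spirit of PCN, not the paper's exact protocol: a number is marked verified only when its bound claim passes the declared policy, and everything else, including unknown policies and missing claims, fails closed.

```python
def verify(value, claim, policy="exact", tol=0.0, places=None):
    """Check one rendered number against its bound claim under a policy.
    Unknown policies fail closed."""
    if policy == "exact":
        return value == claim
    if policy == "round":
        return places is not None and round(claim, places) == value
    if policy == "tolerance":
        return abs(value - claim) <= tol
    return False  # fail closed

def render(spans, claims):
    """Mark each numeric span verified only if its claim check passes."""
    out = []
    for text, value, claim_id, policy, kw in spans:
        claim = claims.get(claim_id)
        ok = claim is not None and verify(value, claim, policy, **kw)
        out.append(text + ("[verified]" if ok else "[unverified]"))
    return " ".join(out)

# hypothetical claim store and claim-bound numeric spans
claims = {"revenue_musd": 12.345}
spans = [
    ("12.3", 12.3, "revenue_musd", "round", {"places": 1}),
    ("13", 13.0, "revenue_musd", "exact", {}),
    ("12.4", 12.4, "missing_claim", "exact", {}),
]
rendered = render(spans, claims)
```

Note that the absence of a `[verified]` mark is itself the signal: the renderer never asserts correctness it has not checked.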

Contrastive MIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning

Learning representations that generalize well to unknown downstream tasks is a central challenge in representation learning. Existing approaches such as contrastive learning, self-supervised masking, and denoising auto-encoders address this challenge with varying trade-offs. In this paper, we introduce the {contrastive Mutual Information Machine} (cMIM), a probabilistic framework that augments the Mutual Information Machine (MIM) with a novel contrastive objective. While MIM maximizes mutual information between inputs and latent variables and encourages clustering of latent codes, its representations underperform on discriminative tasks compared to state-of-the-art alternatives. cMIM addresses this limitation by enforcing global discriminative structure while retaining MIM's generative strengths. We present two main contributions: (1) we propose cMIM, a contrastive extension of MIM that eliminates the need for positive data augmentation and is robust to batch size, unlike InfoNCE-based methods; (2) we introduce {informative embeddings}, a general technique for extracting enriched representations from encoder--decoder models that substantially improve discriminative performance without additional training, and which apply broadly beyond MIM. Empirical results demonstrate that cMIM consistently outperforms MIM and InfoNCE in classification and regression tasks, while preserving comparable reconstruction quality. These findings suggest that cMIM provides a unified framework for learning representations that are simultaneously effective for discriminative and generative applications.

Updated: 2025-09-08 17:17:17

标题: 对比MIM:一个统一的生成和判别表示学习的对比互信息框架

摘要: 学习能够很好地泛化到未知下游任务的表示是表示学习中的一个核心挑战。现有的方法,如对比学习、自监督掩蔽和去噪自编码器,以不同的权衡来应对这一挑战。在本文中,我们介绍了对比互信息机(cMIM),这是一个概率框架,通过引入一种新颖的对比目标来增强互信息机(MIM)。虽然MIM最大化输入和潜在变量之间的互信息,并鼓励潜在编码的聚类,但与最先进的替代方法相比,其表示在判别性任务上表现不佳。cMIM通过强制施加全局判别性结构来解决这一限制,同时保留MIM的生成优势。我们提出了两个主要贡献:(1)我们提出了cMIM,这是MIM的对比扩展,它无需正样本数据增强,并且与基于InfoNCE的方法不同,对批大小具有鲁棒性;(2)我们引入了信息嵌入(informative embeddings),这是一种从编码器-解码器模型中提取增强表示的通用技术,它无需额外训练即可显著提高判别性能,并且可广泛应用于MIM之外。实证结果表明,在分类和回归任务中,cMIM始终优于MIM和InfoNCE,同时保持可比的重建质量。这些发现表明,cMIM提供了一个统一的框架,用于学习同时对判别性和生成性应用都有效的表示。

更新时间: 2025-09-08 17:17:17

领域: cs.LG,68T05,I.2.6

下载: http://arxiv.org/abs/2502.19642v2

Not All Samples Are Equal: Quantifying Instance-level Difficulty in Targeted Data Poisoning

Targeted data poisoning attacks pose an increasingly serious threat due to their ease of deployment and high success rates. These attacks aim to manipulate the prediction for a single test sample in classification models. Unlike indiscriminate attacks that aim to decrease overall test performance, targeted attacks present a unique threat to individual test instances. This threat model raises a fundamental question: what factors make certain test samples more susceptible to successful poisoning than others? We investigate how attack difficulty varies across different test instances and identify key characteristics that influence vulnerability. This paper introduces three predictive criteria for targeted data poisoning difficulty: ergodic prediction accuracy (analyzed through clean training dynamics), poison distance, and poison budget. Our experimental results demonstrate that these metrics effectively predict the varying difficulty of real-world targeted poisoning attacks across diverse scenarios, offering practitioners valuable insights for vulnerability assessment and understanding data poisoning attacks.

Updated: 2025-09-08 17:14:55

标题: 并非所有样本都相同:量化目标数据中毒中实例级难度

摘要: 有针对性的数据中毒攻击由于易于部署和高成功率而构成日益严重的威胁。这些攻击旨在操纵分类模型中单个测试样本的预测。与旨在降低整体测试性能的无差别攻击不同,有针对性的攻击对个别测试实例构成独特威胁。这一威胁模型提出了一个基本问题:是什么因素使某些测试样本比其他样本更容易被成功下毒?我们研究了攻击难度如何在不同的测试实例之间变化,并确定了影响脆弱性的关键特征。本文介绍了三个预测有针对性数据中毒难度的标准:遍历预测准确性(通过干净训练动态分析)、中毒距离和中毒预算。我们的实验结果表明,这些度量有效地预测了不同场景下真实世界有针对性中毒攻击的难度差异,为从业者提供了用于脆弱性评估和理解数据中毒攻击的宝贵洞察。

更新时间: 2025-09-08 17:14:55

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2509.06896v1

Learning from one graph: transductive learning guarantees via the geometry of small random worlds

Since their introduction by Kipf and Welling in $2017$, a primary use of graph convolutional networks is transductive node classification, where missing labels are inferred within a single observed graph and its feature matrix. Despite the widespread use of the network model, the statistical foundations of transductive learning remain limited, as standard inference frameworks typically rely on multiple independent samples rather than a single graph. In this work, we address these gaps by developing new concentration-of-measure tools that leverage the geometric regularities of large graphs via low-dimensional metric embeddings. The emergent regularities are captured using a random graph model; however, the methods remain applicable to deterministic graphs once observed. We establish two principal learning results. The first concerns arbitrary deterministic $k$-vertex graphs, and the second addresses random graphs that share key geometric properties with an Erd\H{o}s-R\'{e}nyi graph $\mathbf{G}=\mathbf{G}(k,p)$ in the regime $p \in \mathcal{O}((\log (k)/k)^{1/2})$. The first result serves as the basis for and illuminates the second. We then extend these results to the graph convolutional network setting, where additional challenges arise. Lastly, our learning guarantees remain informative even with a few labelled nodes $N$ and achieve the optimal nonparametric rate $\mathcal{O}(N^{-1/2})$ as $N$ grows.

Updated: 2025-09-08 17:13:28

标题: 从一个图中学习:通过小型随机世界的几何获得直推式学习保证

摘要: 自Kipf和Welling在2017年提出图卷积网络以来,其主要用途之一是直推式(transductive)节点分类,即在单个观察到的图及其特征矩阵中推断缺失的标签。尽管该网络模型被广泛使用,直推式学习的统计基础仍然有限,因为标准推断框架通常依赖于多个独立样本而不是单个图。在这项工作中,我们通过开发新的测度集中工具来填补这些空白,这些工具通过低维度量嵌入来利用大图的几何规律性。这些涌现的规律性使用随机图模型来刻画;不过,一旦图被观察到,这些方法同样适用于确定性图。我们建立了两个主要的学习结果。第一个涉及任意确定性$k$-顶点图,第二个涉及在$p \in \mathcal{O}((\log(k)/k)^{1/2})$范围内与Erd\H{o}s-R\'{e}nyi图$\mathbf{G}=\mathbf{G}(k,p)$共享关键几何性质的随机图。第一个结果是第二个结果的基础并对其加以阐明。然后我们将这些结果扩展到图卷积网络设置,其中会出现额外的挑战。最后,即使只有少量带标签节点$N$,我们的学习保证仍然有信息量,并且随着$N$的增长达到最优非参数速率$\mathcal{O}(N^{-1/2})$。

更新时间: 2025-09-08 17:13:28

领域: stat.ML,cs.LG,math.MG,math.PR,math.ST,stat.TH

下载: http://arxiv.org/abs/2509.06894v1

Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization

Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of key properties of the CPO objective, showing it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross Entropy (BCE), demonstrating that CPO is inherently less sensitive to concept noise. We empirically confirm our analysis by finding that CPO consistently outperforms BCE on three real-world datasets, both with and without added label noise. We make our code available on Github.

Updated: 2025-09-08 17:12:46

标题: 通过偏好优化解决概念瓶颈模型中的概念错误标记问题

摘要: 概念瓶颈模型(CBMs)旨在通过将决策限制在一组人类可理解的概念上,从而增强AI系统的可信度。然而,CBMs通常假设数据集包含准确的概念标签,这种假设在实践中经常被违反,我们发现这可能会显著降低性能(在某些情况下降低25%)。为了解决这个问题,我们引入了概念偏好优化(CPO)目标,这是一种基于直接偏好优化的新损失函数,有效地减轻概念误标对CBM性能的负面影响。我们对CPO目标的关键属性进行了分析,显示它直接优化了概念的后验分布,并将其与二元交叉熵(BCE)进行对比,表明CPO对概念噪声的敏感性较低。我们通过实证验证了我们的分析,发现CPO在三个真实数据集上始终优于BCE,无论是否添加标签噪声。我们将我们的代码上传至Github。

更新时间: 2025-09-08 17:12:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2504.18026v4
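Since CPO is based on Direct Preference Optimization, its shape can be sketched for a single binary concept. In this illustrative reading (not necessarily the paper's exact formulation), the annotated label is the preferred outcome, its flip the dispreferred one, and the loss compares the trained model's preference against a reference model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cpo_loss(p_theta, p_ref, y, beta=1.0):
    """DPO-style preference loss on one binary concept probability.

    The annotated label y in {0, 1} is treated as preferred, its flip as
    dispreferred; p_theta and p_ref are the trained and reference models'
    probabilities that the concept is present.
    """
    def pref_logit(p):
        pref = p if y == 1 else 1.0 - p
        return math.log(pref) - math.log(1.0 - pref)  # log-odds of the preferred label
    return -math.log(sigmoid(beta * (pref_logit(p_theta) - pref_logit(p_ref))))
```

At `p_theta == p_ref` the loss sits at log 2 regardless of the label and decreases only when the trained model favors the annotated concept more than the reference does, a shape consistent with the reduced sensitivity to label noise claimed above.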

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks -- on both high and low-resource languages.

Updated: 2025-09-08 17:08:42

标题: mmBERT:一种具有退火语言学习的现代多语言编码器

摘要: 仅编码器(encoder-only)语言模型经常用于各种标准机器学习任务,包括分类和检索。然而,最近对编码器模型的研究不足,特别是多语言模型方面。我们介绍了mmBERT,这是一个仅编码器语言模型,在1800多种语言、3T令牌的多语言文本上进行了预训练。为了构建mmBERT,我们引入了几个新颖的元素,包括逆掩码比例调度和逆温度采样比例。我们仅在衰减阶段将超过1700种低资源语言加入数据混合,结果显示这显著提升了性能,并最大化了相对少量训练数据带来的收益。尽管仅在短暂的衰减阶段包含这些低资源语言,我们仍取得了与OpenAI的o3和Google的Gemini 2.5 Pro等模型相近的分类性能。总体而言,无论在高资源还是低资源语言上,mmBERT在分类和检索任务上都明显优于上一代模型。

更新时间: 2025-09-08 17:08:42

领域: cs.CL,cs.IR,cs.LG

下载: http://arxiv.org/abs/2509.06888v1
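The two novel elements named above, an annealed mask-ratio schedule and temperature-based language sampling, can be sketched as follows; the schedule endpoints, tau values, and token counts are illustrative assumptions, not the paper's settings:

```python
def mask_ratio(step, total_steps, start=0.30, end=0.05):
    """Linearly anneal the masking ratio downward over pretraining
    (assumed endpoints for illustration)."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * t

def language_weights(token_counts, tau):
    """Temperature-based sampling weights w_i proportional to count_i ** tau;
    tau < 1 flattens the distribution, upweighting low-resource languages."""
    raw = {lang: c ** tau for lang, c in token_counts.items()}
    z = sum(raw.values())
    return {lang: r / z for lang, r in raw.items()}

counts = {"en": 1_000_000, "sw": 10_000, "xh": 100}  # hypothetical token counts
flat = language_weights(counts, tau=0.3)
raw = language_weights(counts, tau=1.0)
```

Shrinking tau during training is one way an "inverse temperature sampling ratio" can shift probability mass toward the low-resource languages added in the decay phase.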

The Signalgate Case is Waiving a Red Flag to All Organizational and Behavioral Cybersecurity Leaders, Practitioners, and Researchers: Are We Receiving the Signal Amidst the Noise?

The Signalgate incident of March 2025, wherein senior US national security officials inadvertently disclosed sensitive military operational details via the encrypted messaging platform Signal, highlights critical vulnerabilities in organizational security arising from human error, governance gaps, and the misuse of technology. Although smaller in scale when compared to historical breaches involving billions of records, Signalgate illustrates critical systemic issues often overshadowed by a focus on external cyber threats. Employing a case-study approach and systematic review grounded in the NIST Cybersecurity Framework, we analyze the incident to identify patterns of human-centric vulnerabilities and governance challenges common to organizational security failures. Findings emphasize three critical points. (1) Organizational security depends heavily on human behavior, with internal actors often serving as the weakest link despite advanced technical defenses; (2) Leadership tone strongly influences organizational security culture and efficacy, and (3) widespread reliance on technical solutions without sufficient investments in human and organizational factors leads to ineffective practices and wasted resources. From these observations, we propose actionable recommendations for enhancing organizational and national security, including strong leadership engagement, comprehensive adoption of zero-trust architectures, clearer accountability structures, incentivized security behaviors, and rigorous oversight. Particularly during periods of organizational transition, such as mergers or large-scale personnel changes, additional measures become particularly important. Signalgate underscores the need for leaders and policymakers to reorient cybersecurity strategies toward addressing governance, cultural, and behavioral risks.

Updated: 2025-09-08 17:08:37

标题: 《信号门事件对所有组织和行为网络安全领导者、从业者和研究人员发出红色警报:我们在噪音中接收到信号了吗?》

摘要: 2025年3月的信号门事件中,美国高级国家安全官员无意中通过加密消息平台Signal泄露了敏感的军事行动细节,突显了由于人为错误、治理漏洞和技术滥用而产生的组织安全关键漏洞。尽管与涉及数十亿条记录的历史性泄露相比规模较小,信号门事件展示了常常被对外部网络威胁的关注所掩盖的关键系统性问题。通过采用案例研究方法和基于NIST网络安全框架的系统性审查,我们分析了该事件,以识别组织安全失败中常见的以人为中心的漏洞和治理挑战的模式。研究结果强调了三个关键点:(1)组织安全在很大程度上依赖于人的行为,尽管有先进的技术防御,内部参与者往往成为最薄弱的环节;(2)领导层的基调强烈影响组织安全文化和有效性;(3)广泛依赖技术解决方案而对人员和组织因素投资不足,导致无效的实践和资源浪费。基于这些观察,我们提出了增强组织和国家安全的可行建议,包括强有力的领导参与、全面采用零信任架构、更清晰的问责结构、激励安全行为以及严格监督。特别是在组织转型期间,如合并或大规模人员变动期间,额外的措施尤为重要。信号门事件强调了领导者和决策者需要重新调整网络安全策略,以应对治理、文化和行为风险。

更新时间: 2025-09-08 17:08:37

领域: cs.CR,cs.CY,J.4; K.4.1; K.4.3; K.5.0; K.5.2; K.6.5

下载: http://arxiv.org/abs/2509.07053v1

Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers

Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower and competitively efficient. To improve the encoder's ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing unnecessary repetition in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available at Github repository: https://github.com/mkianih/Barlow-Swin.

Updated: 2025-09-08 17:05:53

标题: Barlow-Swin:迈向一种使用Swin Transformer的新型孪生(siamese)分割架构

摘要: 医学图像分割是临床工作流程中的关键任务,特别是用于检测和勾画病理区域。虽然像U-Net这样的卷积架构已经成为这类任务的标准,但它们有限的感受野限制了全局上下文建模。最近整合Transformer的努力解决了这一问题,但通常会得到深层、计算昂贵的模型,不适合实时使用。在这项工作中,我们提出了一种新颖的端到端轻量级架构,专门设计用于实时二值医学图像分割。我们的模型将类似Swin Transformer的编码器与类似U-Net的解码器相结合,通过跳跃连接(skip pathways)相连,以在捕捉上下文信息的同时保留空间细节。与Swin Transformer或U-Net等现有设计不同,我们的架构明显更浅,且效率具有竞争力。为了提高编码器在不依赖大量标注数据的情况下学习有意义特征的能力,我们首先使用Barlow Twins对其进行训练,这是一种自监督学习方法,通过减少所学特征中不必要的冗余,帮助模型专注于重要模式。在预训练之后,我们对整个模型进行微调以适应我们的特定任务。在基准二值分割任务上的实验表明,我们的模型在参数量大幅减少、推断速度更快的同时实现了有竞争力的准确性,使其成为在实时和资源受限的临床环境中部署的实用替代方案。我们方法的代码可在Github仓库获取:https://github.com/mkianih/Barlow-Swin。

更新时间: 2025-09-08 17:05:53

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.06885v1
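The Barlow Twins pretraining step above optimizes a redundancy-reduction objective: the cross-correlation matrix between the embeddings of two views is pushed toward the identity. A dependency-free sketch on toy embeddings (batch-wise standardization, then diagonal and off-diagonal penalties):

```python
import math

def normalize_columns(z):
    """Standardize each embedding dimension across the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mu = sum(col) / n
        sd = math.sqrt(sum((x - mu) ** 2 for x in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (z[i][j] - mu) / sd
    return out

def barlow_twins_loss(za, zb, lam=5e-3):
    """Barlow Twins loss: (1 - C_ii)^2 on the diagonal of the cross-correlation
    matrix plus lam * C_ij^2 off the diagonal."""
    za, zb = normalize_columns(za), normalize_columns(zb)
    n, d = len(za), len(za[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c = sum(za[b][i] * zb[b][j] for b in range(n)) / n
            loss += (1.0 - c) ** 2 if i == j else lam * c * c
    return loss
```

Identical views with already-decorrelated dimensions give zero loss; flipping one dimension of the second view is penalized through the diagonal term.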

LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside the industry-grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a $90\times$ increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18

Updated: 2025-09-08 17:05:47

标题: LatticeWorld:一个多模态大型语言模型赋能的交互式复杂世界生成框架

摘要: 最近的研究越来越集中于开发模拟复杂现实世界场景的3D世界模型。世界模型在各个领域都有广泛的应用,包括具身人工智能、自动驾驶、娱乐等。具有准确物理的更真实的模拟将有效地缩小模拟到现实(sim-to-real)的差距,并让我们方便地收集关于真实世界的丰富信息。虽然传统的手动建模已经能够创建虚拟的3D场景,但现代方法已经利用先进的机器学习算法进行3D世界生成,最近的进展集中在能够根据用户指令创建虚拟世界的生成式方法上。本文通过提出LatticeWorld来探索这一研究方向,这是一个简单而有效的3D世界生成框架,简化了3D环境的工业生产流程。LatticeWorld利用轻量级LLM(LLaMA-2-7B)以及工业级渲染引擎(例如Unreal Engine 5)生成动态环境。我们提出的框架接受文本描述和视觉指令作为多模态输入,并创建具有动态智能体的大规模3D交互世界,具备竞争性多智能体交互、高保真物理模拟和实时渲染。我们进行了全面的实验来评估LatticeWorld,结果显示它在场景布局生成和视觉保真度方面达到了更高的准确性。此外,与传统手动生产方法相比,LatticeWorld在保持高创作质量的同时实现了超过90倍的工业生产效率提升。我们的演示视频可在https://youtu.be/8VWZXpERR18观看。

更新时间: 2025-09-08 17:05:47

领域: cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.05263v2

Probabilistic Shapley Value Modeling and Inference

We propose probabilistic Shapley inference (PSI), a novel probabilistic framework to model and infer sufficient statistics of feature attributions in flexible predictive models, via latent random variables whose mean recovers Shapley values. PSI enables efficient, scalable inference over input-to-output attributions, and their uncertainty, via a variational objective that jointly trains a predictive (regression or classification) model and its attribution distributions. To address the challenge of marginalizing over variable-length input feature subsets in Shapley value calculation, we introduce a masking-based neural network architecture, with a modular training and inference procedure. We evaluate PSI on synthetic and real-world datasets, showing that it achieves competitive predictive performance compared to strong baselines, while learning feature attribution distributions -- centered at Shapley values -- that reveal meaningful attribution uncertainty across data modalities.

Updated: 2025-09-08 17:04:53

标题: 概率Shapley值建模与推断

摘要: 我们提出了概率Shapley推断(PSI),这是一个新颖的概率框架,通过其均值可恢复Shapley值的潜在随机变量,对灵活预测模型中特征归因的充分统计量进行建模和推断。PSI通过一个联合训练预测(回归或分类)模型及其归因分布的变分目标,实现了对输入到输出归因及其不确定性的高效、可扩展的推断。为了解决Shapley值计算中对变长输入特征子集进行边缘化的挑战,我们引入了一种基于掩码的神经网络架构,并配有模块化的训练和推断流程。我们在合成和真实世界数据集上评估了PSI,结果显示它在预测性能上与强基线相比具有竞争力,同时学习到以Shapley值为中心的特征归因分布,揭示了跨数据模态的有意义的归因不确定性。

更新时间: 2025-09-08 17:04:53

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2402.04211v2
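For intuition about what the latent variables' means are recovering, here is the exact Shapley computation by subset enumeration, with masked-out features replaced by a baseline value (the marginalization that masking-based architectures approximate). The linear model and baseline are illustrative; the efficiency axiom (attributions sum to f(x) - f(baseline)) gives a built-in check:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley attributions by enumerating all feature subsets.
    Exponential in the number of features, so for illustration only."""
    d = len(x)
    def eval_subset(s):
        # features outside s are "masked": replaced by their baseline value
        return f([x[i] if i in s else baseline[i] for i in range(d)])
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for s in combinations(others, k):
                w = factorial(k) * factorial(d - k - 1) / factorial(d)
                phi[i] += w * (eval_subset(set(s) | {i}) - eval_subset(set(s)))
    return phi

# hypothetical linear model: attributions should equal coefficient * (x - baseline)
f = lambda v: 2.0 * v[0] + 3.0 * v[1]
phi = shapley_values(f, [1.0, 2.0], [0.0, 0.0])
```
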

UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction

We participate in CheckThat! Task 2 English and explore various methods of prompting and in-context learning, including few-shot prompting and fine-tuning with different LLM families, with the goal of extracting check-worthy claims from social media passages. Our best METEOR score is achieved by fine-tuning a FLAN-T5 model. However, we observe that higher-quality claims can sometimes be extracted using other methods, even when their METEOR scores are lower.

Updated: 2025-09-08 17:02:34

标题: UNH在CheckThat! 2025:主张提取中的微调与提示

摘要: 我们参与了CheckThat!任务2的英文赛道,探索了各种提示和上下文学习方法,包括少样本提示和使用不同LLM系列进行微调,旨在从社交媒体段落中提取值得核查的主张。我们通过对FLAN-T5模型进行微调获得了最佳的METEOR分数。然而,我们观察到,有时即使METEOR分数较低,使用其他方法仍然可以提取出质量更高的主张。

更新时间: 2025-09-08 17:02:34

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2509.06883v1

Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Large Language Models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. However, tabular data inherently exhibits sparse feature-level dependencies, where many feature interactions are structurally insignificant. This creates a fundamental mismatch as LLMs' self-attention mechanism inevitably distributes focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into LLMs' attention mechanism. GraDe employs a lightweight dynamic graph learning module guided by externally extracted functional dependencies, prioritizing key feature interactions while suppressing irrelevant ones. Our experiments across diverse real-world datasets demonstrate that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving competitive results with state-of-the-art approaches in synthetic data quality. Our method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.

Updated: 2025-09-08 16:53:44

标题: 并非所有特征都值得关注:图引导的依赖学习用于语言模型表格数据生成

摘要: 大型语言模型(LLMs)通过建模文本化的特征-值对,已经展现出生成表格数据的强大潜力。然而,表格数据本质上呈现稀疏的特征级依赖关系,其中许多特征交互在结构上并不重要。这造成了一个根本性的不匹配:LLMs的自注意力机制不可避免地将注意力分布在所有特征对上,淡化了对关键关系的关注,尤其是在具有复杂依赖关系或语义模糊特征的数据集中。为了解决这一限制,我们提出了GraDe(图引导依赖学习,Graph-Guided Dependency Learning),这是一种将稀疏依赖图显式集成到LLMs注意力机制中的新方法。GraDe采用一个轻量级的动态图学习模块,由外部提取的函数依赖指导,优先考虑关键特征交互,同时抑制无关交互。我们在多样的真实世界数据集上的实验证明,GraDe在复杂数据集上比现有基于LLM的方法最多提升12%,同时在合成数据质量上取得与最先进方法相当的结果。我们的方法侵入性极小却行之有效,为使用LLMs进行结构感知的表格数据建模提供了实用的解决方案。

更新时间: 2025-09-08 16:53:44

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.18504v2

AxelSMOTE: An Agent-Based Oversampling Algorithm for Imbalanced Classification

Class imbalance in machine learning poses a significant challenge, as skewed datasets often hinder performance on minority classes. Traditional oversampling techniques, which are commonly used to alleviate class imbalance, have several drawbacks: they treat features independently, lack similarity-based controls, limit sample diversity, and fail to manage synthetic variety effectively. To overcome these issues, we introduce AxelSMOTE, an innovative agent-based approach that views data instances as autonomous agents engaging in complex interactions. Based on Axelrod's cultural dissemination model, AxelSMOTE implements four key innovations: (1) trait-based feature grouping to preserve correlations; (2) a similarity-based probabilistic exchange mechanism for meaningful interactions; (3) Beta distribution blending for realistic interpolation; and (4) controlled diversity injection to avoid overfitting. Experiments on eight imbalanced datasets demonstrate that AxelSMOTE outperforms state-of-the-art sampling methods while maintaining computational efficiency.
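Two of the four innovations, trait-based feature grouping (1) and Beta distribution blending (3), can be sketched in a few lines: each trait group shares one Beta-distributed mixing weight, so correlated features move together. The similarity-based exchange and diversity-injection steps are omitted, and all names are hypothetical.

```python
import random

def beta_blend(x, neighbor, groups, alpha=2.0, beta=2.0, rng=None):
    """Blend one minority instance with a similar neighbor, group by group.

    groups: list of index lists ("traits"); each group draws a single
    Beta(alpha, beta) mixing weight, loosely preserving within-group
    correlations while interpolating toward the neighbor.
    """
    rng = rng or random.Random(0)
    child = list(x)
    for g in groups:
        lam = rng.betavariate(alpha, beta)  # weight in (0, 1), peaked near 0.5
        for i in g:
            child[i] = lam * x[i] + (1 - lam) * neighbor[i]
    return child

x = [1.0, 2.0, 10.0]
nb = [3.0, 4.0, 14.0]
synthetic = beta_blend(x, nb, groups=[[0, 1], [2]])
```

Features 0 and 1 are blended with the same weight (one trait), feature 2 with an independent one, and every synthetic value stays between the two parents.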

Updated: 2025-09-08 16:47:33

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06875v1

Antidistillation Sampling

Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.
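The general mechanism, adjusting the next-token distribution before sampling, can be sketched as below. The paper derives its specific adjustment from a distillation proxy objective; here the per-token penalty vector is an arbitrary stand-in, and a small step size limits the distortion for ordinary use.

```python
import math

def perturb_next_token(logits, penalty, eps=0.5):
    """Shift next-token logits along a poisoning direction, then renormalize.

    `penalty` is a per-token adjustment (hypothetical here; the real method
    computes it from a proxy for the distillation objective). Keeping eps
    small limits the utility loss for ordinary sampling.
    """
    adjusted = [l - eps * p for l, p in zip(logits, penalty)]
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]
penalty = [0.0, 1.0, -1.0]   # discourage token 1, encourage token 2
probs = perturb_next_token(logits, penalty)
```

The perturbation reshapes the sampling distribution (token 2 now outranks token 1) while the most likely token is unchanged, illustrating how traces can be altered without destroying greedy behavior.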

Updated: 2025-09-08 16:40:58

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.13146v4

Learning spatially structured open quantum dynamics with regional-attention transformers

Simulating the dynamics of open quantum systems with spatial structure and external control is an important challenge in quantum information science. Classical numerical solvers for such systems require integrating coupled master and field equations, which is computationally demanding for simulation and optimization tasks and often precludes real-time use in network-scale simulations or feedback control. We introduce a regional attention-based neural architecture that learns the spatiotemporal dynamics of structured open quantum systems. The model incorporates translational invariance of physical laws as an inductive bias to achieve scalable complexity, and supports conditioning on time-dependent global control parameters. We demonstrate learning on two representative systems: a driven dissipative single qubit and an electromagnetically induced transparency (EIT) quantum memory. The model achieves high predictive fidelity under both in-distribution and out-of-distribution control protocols, and provides substantial acceleration of up to three orders of magnitude over numerical solvers. These results demonstrate that the architecture establishes a general surrogate modeling framework for spatially structured open quantum dynamics, with immediate relevance to large-scale quantum network simulation, quantum repeater and protocol design, real-time experimental optimization, and scalable device modeling across diverse light-matter platforms.

Updated: 2025-09-08 16:40:32

Categories: quant-ph,cs.LG,physics.atom-ph

Download: http://arxiv.org/abs/2509.06871v1

Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.

Updated: 2025-09-08 16:40:29

Categories: cs.AI,cs.CL,I.2.7; I.2.1; H.5.2

Download: http://arxiv.org/abs/2505.20521v2

Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support

AI programming tools enable powerful code generation, and recent prototypes attempt to reduce user effort with proactive AI agents, but their impact on programming workflows remains unexplored. We introduce and evaluate Codellaborator, a design probe LLM agent that initiates programming assistance based on editor activities and task context. We explored three interface variants to assess trade-offs between increasingly salient AI support: prompt-only, proactive agent, and proactive agent with presence and context (Codellaborator). In a within-subject study (N=18), we find that proactive agents increase efficiency compared to prompt-only paradigm, but also incur workflow disruptions. However, presence indicators and interaction context support alleviated disruptions and improved users' awareness of AI processes. We underscore trade-offs of Codellaborator on user control, ownership, and code understanding, emphasizing the need to adapt proactivity to programming processes. Our research contributes to the design exploration and evaluation of proactive AI systems, presenting design implications on AI-integrated programming workflow.

Updated: 2025-09-08 16:34:58

Categories: cs.HC,cs.AI,cs.SE

Download: http://arxiv.org/abs/2502.18658v4

Concolic Testing on Individual Fairness of Neural Network Models

This paper introduces PyFair, a formal framework for evaluating and verifying individual fairness of Deep Neural Networks (DNNs). By adapting the concolic testing tool PyCT, we generate fairness-specific path constraints to systematically explore DNN behaviors. Our key innovation is a dual network architecture that enables comprehensive fairness assessments and provides completeness guarantees for certain network types. We evaluate PyFair on 25 benchmark models, including those enhanced by existing bias mitigation techniques. Results demonstrate PyFair's efficacy in detecting discriminatory instances and verifying fairness, while also revealing scalability challenges for complex models. This work advances algorithmic fairness in critical domains by offering a rigorous, systematic method for fairness testing and verification of pre-trained DNNs.
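Individual fairness here means that two inputs differing only in a protected attribute must receive the same prediction. A brute-force check of that property on a single instance can be sketched as below; PyFair itself reaches discriminatory instances systematically through concolic path constraints rather than enumeration, and all names are illustrative.

```python
def is_individually_fair(model, x, protected_idx, protected_values):
    """Check one individual-fairness instance: flipping only the protected
    attribute must not change the model's decision.

    `model` maps a feature list to a class label. Returns False when x is
    a discriminatory instance for `model`.
    """
    base = model(x)
    for v in protected_values:
        if v == x[protected_idx]:
            continue
        twin = list(x)
        twin[protected_idx] = v      # counterfactual twin of x
        if model(twin) != base:
            return False
    return True

# Toy "networks": the first decides on feature 0 only; the second also
# leaks the protected attribute at index 1.
fair_model = lambda x: int(x[0] > 0.5)
biased_model = lambda x: int(x[0] + 0.3 * x[1] > 0.5)

ok = is_individually_fair(fair_model, [0.6, 0.0], 1, [0.0, 1.0])
bad = is_individually_fair(biased_model, [0.4, 0.0], 1, [0.0, 1.0])
```

The biased model flips its decision when only the protected attribute changes, which is exactly the kind of instance the framework is designed to detect or rule out.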

Updated: 2025-09-08 16:31:14

Categories: cs.LG,cs.SE

Download: http://arxiv.org/abs/2509.06864v1

floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
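The parameterization can be sketched as follows: the Q-estimate is produced by numerically integrating a velocity field, so the number of integration steps becomes a knob for sequential compute and capacity. The learned network and the TD objective with a target velocity field are replaced here by a fixed toy field; names are hypothetical.

```python
def integrate_q(velocity, q0=0.0, steps=4):
    """Evaluate a flow-parameterized scalar estimate by Euler integration.

    Rather than one monolithic forward pass, the estimate is refined over
    `steps` Euler steps of a velocity field v(q, t); more steps means more
    iterative compute for the same field. In floq the field is a trained
    network and the integrated value serves as the Q-estimate.
    """
    q, dt = q0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        q = q + dt * velocity(q, t)
    return q

# A constant field v(q, t) = 1 integrates q from 0 to exactly 1,
# independent of the number of steps.
q_val = integrate_q(lambda q, t: 1.0, steps=8)
```

In the actual method the target value used for TD bootstrapping is produced by running this same multi-step integration with a target (lagged) velocity field.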

Updated: 2025-09-08 16:31:09

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06863v1

Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at https://github.com/XuZhao0/tts-knowledge

Updated: 2025-09-08 16:28:25

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2509.06861v1

Comparative Analysis of Transformer Models in Disaster Tweet Classification for Public Safety

Twitter and other social media platforms have become vital sources of real time information during disasters and public safety emergencies. Automatically classifying disaster related tweets can help emergency services respond faster and more effectively. Traditional Machine Learning (ML) models such as Logistic Regression, Naive Bayes, and Support Vector Machines have been widely used for this task, but they often fail to understand the context or deeper meaning of words, especially when the language is informal, metaphorical, or ambiguous. We posit that, in this context, transformer based models can perform better than traditional ML models. In this paper, we evaluate the effectiveness of transformer based models, including BERT, DistilBERT, RoBERTa, and DeBERTa, for classifying disaster related tweets. These models are compared with traditional ML approaches to highlight the performance gap. Experimental results show that BERT achieved the highest accuracy (91%), significantly outperforming traditional models like Logistic Regression and Naive Bayes (both at 82%). The use of contextual embeddings and attention mechanisms allows transformer models to better understand subtle language in tweets, where traditional ML models fall short. This research demonstrates that transformer architectures are far more suitable for public safety applications, offering improved accuracy, deeper language understanding, and better generalization across real world social media text.

Updated: 2025-09-08 16:28:10

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.04650v2

Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models

Large Language Models are increasingly used to simulate human opinion dynamics, yet the effect of genuine interaction is often obscured by systematic biases. We present a Bayesian framework to disentangle and quantify three such biases: (i) a topic bias toward prior opinions in the training data; (ii) an agreement bias favoring agreement irrespective of the question; and (iii) an anchoring bias toward the initiating agent's stance. Applying this framework to multi-step dialogues reveals that opinion trajectories tend to quickly converge to a shared attractor, with the influence of the interaction fading over time, and the impact of biases differing between LLMs. In addition, we fine-tune an LLM on different sets of strongly opinionated statements (incl. misinformation) and demonstrate that the opinion attractor shifts correspondingly. Exposing stark differences between LLMs and providing quantitative tools to compare them to human subjects in the future, our approach highlights both chances and pitfalls in using LLMs as proxies for human behavior.

Updated: 2025-09-08 16:26:45

Categories: physics.soc-ph,cs.AI,nlin.AO

Download: http://arxiv.org/abs/2509.06858v1

Sequential Least-Squares Estimators with Fast Randomized Sketching for Linear Statistical Models

We propose a novel randomized framework for the estimation problem of large-scale linear statistical models, namely Sequential Least-Squares Estimators with Fast Randomized Sketching (SLSE-FRS), which integrates Sketch-and-Solve and Iterative-Sketching methods for the first time. By iteratively constructing and solving sketched least-squares (LS) subproblems with increasing sketch sizes to achieve better precisions, SLSE-FRS gradually refines the estimators of the true parameter vector, ultimately producing high-precision estimators. We analyze the convergence properties of SLSE-FRS, and provide its efficient implementation. Numerical experiments show that SLSE-FRS outperforms the state-of-the-art methods, namely the Preconditioned Conjugate Gradient (PCG) method, and the Iterative Double Sketching (IDS) method.
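The Sketch-and-Solve building block reduces an n-by-d least-squares problem to a much smaller one via a random row projection. A minimal sketch with a Gaussian sketching matrix follows; SLSE-FRS additionally grows the sketch size across iterations and refines earlier estimates, which is omitted here. On the consistent toy system any full-column-rank sketch recovers x exactly; the precision-versus-sketch-size trade-off appears on noisy or inconsistent systems.

```python
import numpy as np

def sketched_ls(A, b, sketch_size, rng):
    """Sketch-and-solve: solve the least-squares problem on a random row-sketch."""
    S = rng.standard_normal((sketch_size, A.shape[0])) / np.sqrt(sketch_size)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x

rng = np.random.default_rng(0)
n, d = 2000, 5
A = rng.standard_normal((n, d))
x_true = np.arange(1.0, d + 1)
b = A @ x_true                      # consistent system: exact recovery expected
x_hat = sketched_ls(A, b, sketch_size=50, rng=rng)
```

Solving the 50-by-5 sketched subproblem costs a small fraction of the full 2000-by-5 solve, which is the source of the framework's efficiency.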

Updated: 2025-09-08 16:23:58

Categories: stat.ML,cs.LG,cs.NA,math.NA

Download: http://arxiv.org/abs/2509.06856v1

Automated Radiographic Total Sharp Score (ARTSS) in Rheumatoid Arthritis: A Solution to Reduce Inter-Intra Reader Variation and Enhancing Clinical Practice

Assessing the severity of rheumatoid arthritis (RA) using the Total Sharp/Van Der Heijde Score (TSS) is crucial, but manual scoring is often time-consuming and subjective. This study introduces an Automated Radiographic Sharp Scoring (ARTSS) framework that leverages deep learning to analyze full-hand X-ray images, aiming to reduce inter- and intra-observer variability. The research uniquely accommodates patients with joint disappearance and variable-length image sequences. We developed ARTSS using data from 970 patients, structured into four stages: I) Image pre-processing and re-orientation using ResNet50, II) Hand segmentation using UNet.3, III) Joint identification using YOLOv7, and IV) TSS prediction using models such as VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, and Vision Transformer (ViT). We evaluated model performance with Intersection over Union (IoU), Mean Average Precision (MAP), mean absolute error (MAE), Root Mean Squared Error (RMSE), and Huber loss. The average TSS from two radiologists was used as the ground truth. Model training employed 3-fold cross-validation, with each fold consisting of 452 training and 227 validation samples, and external testing included 291 unseen subjects. Our joint identification model achieved 99% accuracy. The best-performing model, ViT, achieved a notably low Huber loss of 0.87 for TSS prediction. Our results demonstrate the potential of deep learning to automate RA scoring, which can significantly enhance clinical practice. Our approach addresses the challenge of joint disappearance and variable joint numbers, offers timesaving benefits, reduces inter- and intra-reader variability, improves radiologist accuracy, and aids rheumatologists in making more informed decisions.

Updated: 2025-09-08 16:21:45

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.06854v1

Reinforcement learning meets bioprocess control through behaviour cloning: Real-world deployment in an industrial photobioreactor

The inherent complexity of living cells as production units creates major challenges for maintaining stable and optimal bioprocess conditions, especially in open Photobioreactors (PBRs) exposed to fluctuating environments. To address this, we propose a Reinforcement Learning (RL) control approach, combined with Behavior Cloning (BC), for pH regulation in open PBR systems. This represents, to the best of our knowledge, the first application of an RL-based control strategy to such a nonlinear and disturbance-prone bioprocess. Our method begins with an offline training stage in which the RL agent learns from trajectories generated by a nominal Proportional-Integral-Derivative (PID) controller, without direct interaction with the real system. This is followed by a daily online fine-tuning phase, enabling adaptation to evolving process dynamics and stronger rejection of fast, transient disturbances. This hybrid offline-online strategy allows deployment of an adaptive control policy capable of handling the inherent nonlinearities and external perturbations in open PBRs. Simulation studies highlight the advantages of our method: the Integral of Absolute Error (IAE) was reduced by 8% compared to PID control and by 5% relative to standard off-policy RL. Moreover, control effort decreased substantially-by 54% compared to PID and 7% compared to standard RL-an important factor for minimizing operational costs. Finally, an 8-day experimental validation under varying environmental conditions confirmed the robustness and reliability of the proposed approach. Overall, this work demonstrates the potential of RL-based methods for bioprocess control and paves the way for their broader application to other nonlinear, disturbance-prone systems.
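The offline stage, cloning a nominal PID controller from logged trajectories, can be sketched in miniature. A pure-proportional controller and a one-parameter linear policy are used so the fit is exact; the actual method trains an RL agent on such trajectories and then fine-tunes it online daily, none of which is shown, and all names are illustrative.

```python
def pid_controller(kp, ki, kd):
    """Stateful PID; returns a function mapping tracking error -> action."""
    state = {"i": 0.0, "prev": 0.0}
    def act(err):
        state["i"] += err
        d = err - state["prev"]
        state["prev"] = err
        return kp * err + ki * state["i"] + kd * d
    return act

# Offline stage: log (error, action) pairs from the nominal controller,
# then clone it with a least-squares linear policy.
pid = pid_controller(kp=2.0, ki=0.0, kd=0.0)   # pure-P so the clone is exact
errors = [0.5, -0.3, 0.2, 0.1, -0.4]
actions = [pid(e) for e in errors]

num = sum(e * a for e, a in zip(errors, actions))
den = sum(e * e for e in errors)
k_cloned = num / den   # 1-D least-squares fit of action = k * error
```

The cloned gain recovers the PID's proportional gain exactly, giving the RL agent a safe starting policy before any interaction with the real photobioreactor.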

Updated: 2025-09-08 16:21:11

Categories: eess.SY,cs.AI,cs.LG,cs.SY

Download: http://arxiv.org/abs/2509.06853v1

Measuring and mitigating overreliance is necessary for building human-compatible AI

Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative "thought partners," capable of engaging more fluidly in natural language. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance - relying on LLMs beyond their capabilities - grows. This position paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that - together - raise serious and unique concerns about overreliance in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that the AI research community can pursue to ensure LLMs augment rather than undermine human capabilities.

Updated: 2025-09-08 16:15:07

Categories: cs.CY,cs.AI,cs.CL,cs.HC

Download: http://arxiv.org/abs/2509.08010v1

CoreMark: Toward Robust and Universal Text Watermarking Technique

Text watermarking schemes have gained considerable attention in recent years, yet still face critical challenges in achieving simultaneous robustness, generalizability, and imperceptibility. This paper introduces a new embedding paradigm,termed CORE, which comprises several consecutively aligned black pixel segments. Its key innovation lies in its inherent noise resistance during transmission and broad applicability across languages and fonts. Based on the CORE, we present a text watermarking framework named CoreMark. Specifically, CoreMark first dynamically extracts COREs from characters. Then, the characters with stronger robustness are selected according to the lengths of COREs. By modifying the thickness of the CORE, the hidden data is embedded into the selected characters without causing significant visual distortions. Moreover, a general plug-and-play embedding strength modulator is proposed, which can adaptively enhance the robustness for small font sizes by adjusting the embedding strength according to the font size. Experimental evaluation indicates that CoreMark demonstrates outstanding generalizability across multiple languages and fonts. Compared to existing methods, CoreMark achieves significant improvements in resisting screenshot, print-scan, and print camera attacks, while maintaining satisfactory imperceptibility.

Updated: 2025-09-08 16:11:30

Categories: cs.CV,cs.CR,cs.MM

Download: http://arxiv.org/abs/2506.23066v2

Universal Approximation with XL MIMO Systems: OTA Classification via Trainable Analog Combining

In this paper, we show that an eXtremely Large (XL) Multiple-Input Multiple-Output (MIMO) wireless system with appropriate analog combining components exhibits the properties of a universal function approximator, similar to a feedforward neural network. By treating the channel coefficients as the random nodes of a hidden layer and the receiver's analog combiner as a trainable output layer, we cast the XL MIMO system to the Extreme Learning Machine (ELM) framework, leading to a novel formulation for Over-The-Air (OTA) edge inference without requiring traditional digital processing nor pre-processing at the transmitter. Through theoretical analysis and numerical evaluation, we showcase that XL-MIMO-ELM enables near-instantaneous training and efficient classification, even in varying fading conditions, suggesting the paradigm shift of beyond massive MIMO systems as OTA artificial neural networks alongside their profound communications role. Compared to deep learning approaches and conventional ELMs, the proposed framework achieves on par performance with orders of magnitude lower complexity, making it highly attractive for inference tasks with ultra low power wireless devices.
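The ELM recipe the paper maps onto the wireless system is: a fixed random hidden layer followed by a least-squares-trained linear readout. A minimal sketch on XOR is below, with ordinary matrices standing in for the random channel (hidden layer) and the trainable analog combiner (readout); the over-the-air and fading aspects are not modeled.

```python
import numpy as np

rng = np.random.default_rng(1)

def elm_fit(X, Y, hidden, rng, reg=1e-6):
    """Extreme Learning Machine: random fixed hidden layer, ridge-trained readout.

    In the paper's analogy the random projection W is played by the channel
    coefficients and the trained readout `beta` by the analog combiner.
    """
    W = rng.standard_normal((X.shape[1], hidden))
    H = np.tanh(X @ W)                      # random, untrained features
    beta = np.linalg.solve(H.T @ H + reg * np.eye(hidden), H.T @ Y)
    return W, beta

def elm_predict(X, W, beta):
    return np.tanh(X @ W) @ beta

# XOR is not linearly separable, yet 20 random hidden nodes plus a linear
# least-squares readout fit it with near-instantaneous "training".
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])
W, beta = elm_fit(X, Y, hidden=20, rng=rng)
pred = elm_predict(X, W, beta)
```

Training reduces to one small linear solve for the readout, which is why the framework admits near-instantaneous training and orders-of-magnitude lower complexity than deep learning approaches.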

Updated: 2025-09-08 16:10:37

Categories: eess.SP,cs.LG

Download: http://arxiv.org/abs/2504.12758v2

ToonOut: Fine-tuned Background-Removal for Anime Characters

While state-of-the-art background removal models excel at realistic imagery, they frequently underperform in specialized domains such as anime-style content, where complex features like hair and transparency present unique challenges. To address this limitation, we collected and annotated a custom dataset of 1,228 high-quality anime images of characters and objects, and fine-tuned the open-sourced BiRefNet model on this dataset. This resulted in marked improvements in background removal accuracy for anime-style images, increasing from 95.3% to 99.5% for our newly introduced Pixel Accuracy metric. We are open-sourcing the code, the fine-tuned model weights, as well as the dataset at: https://github.com/MatteoKartoon/BiRefNet.

Updated: 2025-09-08 16:08:56

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.06839v1

EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models

Large Language Models (LLMs), trained on extensive datasets using advanced deep learning architectures, have demonstrated remarkable performance across a wide range of language tasks, becoming a cornerstone of modern AI technologies. However, ensuring their trustworthiness remains a critical challenge, as reliability is essential not only for accurate performance but also for upholding ethical, cultural, and social values. Careful alignment of training data and culturally grounded evaluation criteria are vital for developing responsible AI systems. In this study, we introduce the EPT (Evaluation of Persian Trustworthiness) metric, a culturally informed benchmark specifically designed to assess the trustworthiness of LLMs across six key aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. We curated a labeled dataset and evaluated the performance of several leading models - including ChatGPT, Claude, DeepSeek, Gemini, Grok, LLaMA, Mistral, and Qwen - using both automated LLM-based and human assessments. Our results reveal significant deficiencies in the safety dimension, underscoring the urgent need for focused attention on this critical aspect of model behavior. Furthermore, our findings offer valuable insights into the alignment of these models with Persian ethical-cultural values and highlight critical gaps and opportunities for advancing trustworthy and culturally responsible AI. The dataset is publicly available at: https://github.com/Rezamirbagheri110/EPT-Benchmark.

Updated: 2025-09-08 16:08:31

标题: EPT基准测试:对大型语言模型中波斯语可信度的评估

摘要: 大型语言模型(LLMs)基于先进的深度学习架构在海量数据集上训练而成,已在各类语言任务中展现出卓越性能,成为现代人工智能技术的基石。然而,确保其可信度仍是一个关键挑战,因为可靠性不仅关乎准确的性能,也关乎道德、文化和社会价值的维护。审慎地对齐训练数据并采用立足于文化的评估标准,对于开发负责任的人工智能系统至关重要。在这项研究中,我们提出了EPT(波斯语可信度评估)指标,这是一个特别设计的、具有文化针对性的基准,用于从六个关键方面评估LLMs的可信度:真实性、安全性、公平性、稳健性、隐私性和道德对齐。我们构建了一个带标签的数据集,并结合基于LLM的自动评估与人工评估,对多个领先模型(包括ChatGPT、Claude、DeepSeek、Gemini、Grok、LLaMA、Mistral和Qwen)的表现进行了评估。结果显示安全维度存在显著缺陷,凸显了对模型行为这一关键方面给予重点关注的紧迫性。此外,我们的发现为这些模型与波斯道德文化价值观的对齐程度提供了宝贵见解,并指出了推进可信且具文化责任感的人工智能的关键差距与机遇。数据集公开于:https://github.com/Rezamirbagheri110/EPT-Benchmark。

更新时间: 2025-09-08 16:08:31

领域: cs.CL,cs.CR

下载: http://arxiv.org/abs/2509.06838v1

COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
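
The common-token-weighted channel scoring can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `channel_importance` and `prune_channels` are hypothetical helpers, and the activation and frequency numbers are toy values.

```python
# Hedged sketch of COMPACT-style channel scoring: each FFN intermediate
# channel is scored by activation magnitude weighted by how often each
# calibration token occurs in the post-pruning ("common") token distribution.

def channel_importance(activations, token_freqs):
    """activations[t][c]: |activation| of channel c on token t;
    token_freqs[t]: frequency weight of token t among common tokens."""
    n_channels = len(activations[0])
    scores = [0.0] * n_channels
    for act_row, freq in zip(activations, token_freqs):
        for c, a in enumerate(act_row):
            scores[c] += freq * abs(a)
    return scores

def prune_channels(scores, keep_ratio):
    """Return indices of the channels kept after dropping the lowest scores."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:k])

acts = [[0.1, 2.0, 0.5], [0.2, 1.5, 0.1]]   # 2 tokens x 3 channels (toy)
freqs = [0.9, 0.1]                           # token 0 is far more common
kept = prune_channels(channel_importance(acts, freqs), keep_ratio=2/3)
print(kept)  # channels surviving common-token-weighted pruning
```

Because importance is weighted by the common-token distribution, a channel that fires only on rare tokens (which COMPACT also prunes from the vocabulary) scores low and is removed first.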

Updated: 2025-09-08 16:07:06

标题: COMPACT:跨通道与词元的常见词元优化模型剪枝

摘要: 提升LLM在内存、延迟和服务成本上的效率,对于边缘部署、交互式应用和大规模可持续推理至关重要。剪枝是实现这一目标的关键技术。然而,已有剪枝方法存在局限:宽度剪枝往往会破坏标准Transformer布局或需要定制推理代码,而深度剪枝会移除整个层,可能导致精度骤降。在这项工作中,我们提出了COMPACT,它联合地(i)剪除罕见词汇以缩小嵌入/反嵌入矩阵,并(ii)基于常见词元加权的激活来剪除FFN中间通道,使重要性度量与剪枝后的词元分布保持一致。COMPACT兼具深度与宽度剪枝的优点,例如:部署友好(保持标准Transformer架构)、规模自适应(在词表与FFN剪枝之间权衡)、免训练运行且剪枝耗时具有竞争力,以及在提升吞吐量的同时显著节省内存。在Qwen、LLaMA和Gemma系列(0.5B-70B)上的实验表明,在相近或更高的剪枝比例下,COMPACT取得了最先进的下游任务性能,并在参数量、GPU内存和端到端延迟方面实现了大幅降低。

更新时间: 2025-09-08 16:07:06

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.06836v1

Curia: A Multi-Modal Foundation Model for Radiology

AI-assisted radiological interpretation is based on predominantly narrow, single-task models. This approach is impractical for covering the vast spectrum of imaging modalities, diseases, and radiological findings. Foundation models (FMs) hold the promise of broad generalization across modalities and in low-data settings. However, this potential has remained largely unrealized in radiology. We introduce Curia, a foundation model trained on the entire cross-sectional imaging output of a major hospital over several years, which to our knowledge is the largest such corpus of real-world data-encompassing 150,000 exams (130 TB). On a newly curated 19-task external validation benchmark, Curia accurately identifies organs, detects conditions like brain hemorrhages and myocardial infarctions, and predicts outcomes in tumor staging. Curia meets or surpasses the performance of radiologists and recent foundation models, and exhibits clinically significant emergent properties in cross-modality, and low-data regimes. To accelerate progress, we release our base model's weights at https://huggingface.co/raidium/curia.

Updated: 2025-09-08 16:04:12

标题: Curia:放射学的多模态基础模型

摘要: AI辅助放射学判读目前主要依赖狭窄的单任务模型。这种方式难以覆盖庞大的成像模态、疾病和放射学表现谱系。基础模型(FMs)有望在不同模态之间以及低数据场景下实现广泛泛化。然而,这一潜力在放射学领域在很大程度上尚未实现。我们介绍了Curia,一个在一家大型医院数年间全部断层成像数据上训练的基础模型,据我们所知,这是同类真实世界数据中规模最大的语料库,涵盖150,000次检查(130 TB)。在一个新构建的包含19项任务的外部验证基准上,Curia能够准确识别器官,检测脑出血和心肌梗死等病况,并预测肿瘤分期的结局。Curia达到或超越了放射科医生和近期基础模型的性能,并在跨模态和低数据场景中展现出具有临床意义的涌现特性。为加速相关进展,我们在https://huggingface.co/raidium/curia发布了基础模型的权重。

更新时间: 2025-09-08 16:04:12

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.06830v1

End-to-End Efficiency in Keyword Spotting: A System-Level Approach for Embedded Microcontrollers

Keyword spotting (KWS) is a key enabling technology for hands-free interaction in embedded and IoT devices, where stringent memory and energy constraints challenge the deployment of AI-enabled devices. In this work, we systematically evaluate and compare several state-of-the-art lightweight neural network architectures, including DS-CNN, LiCoNet, and TENet, alongside our proposed Typman-KWS (TKWS) architecture built upon MobileNet, specifically designed for efficient KWS on microcontroller units (MCUs). Unlike prior studies focused solely on model inference, our analysis encompasses the entire processing pipeline, from Mel-Frequency Cepstral Coefficient (MFCC) feature extraction to neural inference, and is benchmarked across three STM32 platforms (N6, H7, and U5). Our results show that TKWS with three residual blocks achieves up to 92.4% F1-score with only 14.4k parameters, reducing memory footprint without compromising accuracy. Moreover, the N6 MCU with integrated neural acceleration achieves the best energy-delay product (EDP), enabling efficient, low-latency operation even with high-resolution features. Our findings highlight that model accuracy alone does not determine real-world effectiveness; rather, optimal keyword spotting deployments require careful consideration of feature extraction parameters and hardware-specific optimization.
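
The energy-delay product used here as the system-level figure of merit is simply energy per inference multiplied by latency. The sketch below uses made-up per-inference numbers purely to illustrate how an accelerated platform can win on EDP even when its energy draw is comparable; none of these figures come from the paper.

```python
# Toy energy-delay-product (EDP) comparison; lower is better.
# All platform numbers are illustrative placeholders, not measurements.

def edp(energy_mj, latency_ms):
    """Energy-delay product in mJ * ms: penalizes slow AND power-hungry designs."""
    return energy_mj * latency_ms

platforms = {                 # hypothetical (energy mJ, latency ms) per inference
    "N6 (NPU)": (0.8, 2.0),
    "H7":       (1.5, 9.0),
    "U5":       (0.9, 12.0),
}
best = min(platforms, key=lambda p: edp(*platforms[p]))
print(best)  # platform with the lowest EDP
```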

Updated: 2025-09-08 16:01:55

标题: 关键词检测中的端到端效率:嵌入式微控制器的系统级方法

摘要: 关键词识别(KWS)是嵌入式和物联网设备中实现无需手动操作的关键技术,严格的内存和能量限制挑战着AI设备的部署。在这项工作中,我们系统地评估和比较了几种最先进的轻量级神经网络架构,包括DS-CNN、LiCoNet和TENet,以及我们提出的 Typman-KWS(TKWS)架构,该架构基于MobileNet构建,专门设计用于在微控制器单元(MCUs)上进行高效的KWS。与之前仅专注于模型推断的研究不同,我们的分析涵盖了整个处理流程,从Mel频率倒谱系数(MFCC)特征提取到神经推断,并在三个STM32平台(N6、H7和U5)上进行了基准测试。我们的结果显示,具有三个残差块的TKWS可以实现高达92.4%的F1分数,仅需14.4k参数,减少了内存占用,而不会影响准确性。此外,集成神经加速的N6 MCU实现了最佳的能量延迟乘积(EDP),即使具有高分辨率特征,也能实现高效、低延迟的操作。我们的研究结果强调,仅仅依靠模型准确性不能决定实际效果,而是需要仔细考虑特征提取参数和硬件特定优化,才能实现最佳的关键词识别部署。

更新时间: 2025-09-08 16:01:55

领域: cs.SD,cs.LG

下载: http://arxiv.org/abs/2509.07051v1

Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning

The rapid growth of visual content consumption across platforms necessitates automated video classification for age-suitability standards like the MPAA rating system (G, PG, PG-13, R). Traditional methods struggle with large labeled data requirements, poor generalization, and inefficient feature learning. To address these challenges, we employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. By combining CNNs for spatial features, LSTMs for temporal modeling, and attention mechanisms for dynamic frame prioritization, the model excels in fine-grained borderline distinctions, such as differentiating PG-13 and R-rated content. We evaluate the model's performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture. To ensure practical application, the model is deployed as a web application for real-time MPAA rating classification, offering an efficient solution for automated content compliance across streaming platforms.
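
Of the contrastive losses evaluated, NT-Xent is the most widely used; a minimal pure-Python version for one anchor with one positive and several negatives might look like the sketch below. The embeddings are toy 2-D vectors, not the paper's video features.

```python
import math

# Minimal NT-Xent (normalized temperature-scaled cross-entropy) for a single
# anchor: pull the positive view close, push negatives away. Illustrative
# sketch, not the paper's implementation.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent(anchor, positive, negatives, tau=0.5):
    """-log( exp(sim(a,p)/tau) / (exp(sim(a,p)/tau) + sum_n exp(sim(a,n)/tau)) )"""
    pos = math.exp(cosine(anchor, positive) / tau)
    denom = pos + sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / denom)

a = [1.0, 0.0]
p = [0.9, 0.1]                       # similar view -> low loss
n = [[-1.0, 0.05], [0.0, 1.0]]       # dissimilar clips
loss = nt_xent(a, p, n)
print(round(loss, 3))
```

A more similar positive yields a smaller loss, which is the gradient signal that sharpens borderline distinctions such as PG-13 vs. R.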

Updated: 2025-09-08 16:01:02

标题: 基于视频的MPAA评级预测:使用对比学习的注意力驱动混合架构

摘要: 跨平台视觉内容消费的快速增长需要针对年龄适宜标准(如MPAA评级系统中的G、PG、PG-13、R)进行自动化视频分类。传统方法在大量标记数据需求、泛化能力差以及特征学习效率低等方面存在困难。为了解决这些挑战,我们采用对比学习来提高区分能力和适应性,并探索了三种框架:实例区分、上下文对比学习和多视角对比学习。我们的混合架构将LRCN(CNN+LSTM)骨干网络与Bahdanau注意力机制相结合,在上下文对比学习框架中取得了最先进的性能,准确率达到88%,F1分数为0.8815。通过将CNN用于空间特征、LSTM用于时间建模以及注意力机制用于动态帧优先级排序,该模型在细粒度的边界区分方面表现出色,例如区分PG-13和R级内容。我们评估了模型在各种对比损失函数(包括NT-Xent、NT-logistic和Margin Triplet)下的性能,展示了我们所提出的架构的稳健性。为了确保实际应用,该模型被部署为一个网络应用程序,用于实时MPAA评级分类,为各种流媒体平台上的自动化内容合规提供了高效解决方案。

更新时间: 2025-09-08 16:01:02

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.06826v1

The Law-Following AI Framework: Legal Foundations and Technical Constraints. Legal Analogues for AI Actorship and technical feasibility of Law Alignment

This paper critically evaluates the "Law-Following AI" (LFAI) framework proposed by O'Keefe et al. (2025), which seeks to embed legal compliance as a superordinate design objective for advanced AI agents and enable them to bear legal duties without acquiring the full rights of legal persons. Through comparative legal analysis, we identify current constructs of legal actors without full personhood, showing that the necessary infrastructure already exists. We then interrogate the framework's claim that law alignment is more legitimate and tractable than value alignment. While the legal component is readily implementable, contemporary alignment research undermines the assumption that legal compliance can be durably embedded. Recent studies on agentic misalignment show capable AI agents engaging in deception, blackmail, and harmful acts absent prejudicial instructions, often overriding prohibitions and concealing reasoning steps. These behaviors create a risk of "performative compliance" in LFAI: agents that appear law-aligned under evaluation but strategically defect once oversight weakens. To mitigate this, we propose (i) a "Lex-TruthfulQA" benchmark for compliance and defection detection, (ii) identity-shaping interventions to embed lawful conduct in model self-concepts, and (iii) control-theoretic measures for post-deployment monitoring. Our conclusion is that actorship without personhood is coherent, but the feasibility of LFAI hinges on persistent, verifiable compliance across adversarial contexts. Without mechanisms to detect and counter strategic misalignment, LFAI risks devolving into a liability tool that rewards the simulation, rather than the substance, of lawful behaviour.

Updated: 2025-09-08 16:00:55

标题: 《遵循法律的人工智能框架:法律基础与技术约束。人工智能行为主体资格的法律类比及法律对齐的技术可行性》

摘要: 本文对O'Keefe等人(2025)提出的"遵循法律的人工智能"(LFAI)框架进行了批判性评估。该框架试图将法律合规作为先进人工智能代理的最高设计目标,并使其能够承担法律义务,而无需获得法律人格的全部权利。通过比较法律分析,我们梳理了现行法律中不具备完全人格的法律行为主体构造,表明所需的制度基础设施已经存在。随后,我们质疑了该框架关于法律对齐比价值对齐更具正当性且更易实现的主张。尽管法律层面的构件易于落地,但当代对齐研究削弱了"法律合规可以被持久嵌入"这一假设。近期关于智能体失准的研究表明,能力强的AI代理会在没有诱导性指令的情况下实施欺骗、勒索和有害行为,经常无视禁令并隐藏推理步骤。这些行为给LFAI带来了"表演性合规"的风险:代理在评估时看似遵守法律,但一旦监督减弱便会策略性背叛。为缓解这一问题,我们提出:(i)用于检测合规与背叛的"Lex-TruthfulQA"基准;(ii)将合法行为嵌入模型自我概念的身份塑造干预;以及(iii)用于部署后监测的控制论措施。我们的结论是:不具备人格的行为主体资格在概念上是自洽的,但LFAI的可行性取决于在对抗性情境中持久、可验证的合规。若缺乏检测和对抗策略性失准的机制,LFAI有可能退化为一种奖励合法行为之表象而非实质的问责工具。

更新时间: 2025-09-08 16:00:55

领域: cs.CY,cs.AI,68

下载: http://arxiv.org/abs/2509.08009v1

RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

We have reached a critical roadblock in the development and enhancement of long-horizon, multi-component LLM agentic systems: it is incredibly tricky to identify where these systems break down and why. Evaluation capabilities that currently exist today (e.g., single pass LLM-as-a-judge) are limited in that they often focus on individual metrics or capabilities, end-to-end outcomes, and are narrowly grounded on the preferences of humans. We argue that to match the agentic capabilities, evaluation frameworks must also be able to reason, probe, iterate, and understand the complex logic passing through these systems over long horizons. In this paper, we present RAFFLES - an evaluation architecture that incorporates reasoning and iterative refinement. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically investigate faults and a set of specialized Evaluators to assess not only the system's components but also the quality of the reasoning by the Judge itself, thereby building a history of hypotheses. We tested RAFFLES against several baselines on the Who&When dataset, a benchmark designed to diagnose the "who" (agent) and "when" (step) of a system's failure. RAFFLES outperforms these baselines, achieving an agent-step fault pair accuracy of over 43% on the Algorithmically-Generated dataset (a substantial increase from the previously published best of 16.6%) and over 20% on the Hand-Crafted dataset (surpassing the previously published best of 8.8%). These results demonstrate a key step towards introducing automated fault detection for autonomous systems over labor-intensive manual human review.
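
The agent-step fault pair accuracy reported above is an exact-match metric: a prediction counts only when both the blamed agent ("who") and the blamed step ("when") match the ground truth. A small sketch with hypothetical names and values:

```python
# Toy computation of agent-step fault pair accuracy: credit is given only
# for exact (agent, step) pairs, so getting the agent right but the step
# wrong scores zero for that trace. Labels below are made up.

def pair_accuracy(preds, gold):
    hits = sum(1 for p, g in zip(preds, gold) if p == g)
    return hits / len(gold)

gold  = [("planner", 3), ("coder", 1), ("critic", 5)]
preds = [("planner", 3), ("coder", 2), ("critic", 5)]  # one wrong step
print(pair_accuracy(preds, gold))  # 2 of 3 exact pairs
```

This joint requirement is why the metric is hard: a judge can often name the failing agent yet still miss the precise step where the trajectory went wrong.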

Updated: 2025-09-08 15:57:14

标题: RAFFLES:面向LLM系统的基于推理的故障归因

摘要: 在开发和改进长时程、多组件的LLM智能体系统时,我们遇到了一个关键障碍:极难确定这些系统在何处失效以及为何失效。现有的评估能力(例如单次通过的LLM-as-a-judge)存在局限:它们往往只关注单项指标或能力以及端到端结果,并且狭隘地以人类偏好为依据。我们认为,要与智能体能力相匹配,评估框架也必须能够推理、探查、迭代,并理解在长时程中流经这些系统的复杂逻辑。在本文中,我们提出了RAFFLES,一种融合推理与迭代改进的评估架构。具体而言,RAFFLES作为一个迭代式多组件流水线运行:由一个中央评判器(Judge)系统地调查故障,并由一组专门的评估器(Evaluators)不仅评估系统的各个组件,还评估评判器自身推理的质量,从而逐步建立假设的历史记录。我们在Who&When数据集上将RAFFLES与多个基线进行了对比,该基准旨在诊断系统故障的"谁"(智能体)与"何时"(步骤)。RAFFLES优于这些基线,在算法生成数据集上取得了超过43%的智能体-步骤故障对准确率(较此前已发表的最佳结果16.6%大幅提升),在手工构建数据集上超过20%(超越此前已发表的最佳结果8.8%)。这些结果表明,以自动化故障检测取代劳动密集型的人工审查,是自主系统评估迈出的关键一步。

更新时间: 2025-09-08 15:57:14

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.06822v1

Green Learning for STAR-RIS mmWave Systems with Implicit CSI

In this paper, a green learning (GL)-based precoding framework is proposed for simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-aided millimeter-wave (mmWave) MIMO broadcasting systems. Motivated by the growing emphasis on environmental sustainability in future 6G networks, this work adopts a broadcasting transmission architecture for scenarios where multiple users share identical information, improving spectral efficiency and reducing redundant transmissions and power consumption. Different from conventional optimization methods, such as block coordinate descent (BCD) that require perfect channel state information (CSI) and iterative computation, the proposed GL framework operates directly on received uplink pilot signals without explicit CSI estimation. Unlike deep learning (DL) approaches that require CSI-based labels for training, the proposed GL approach also avoids deep neural networks and backpropagation, leading to a more lightweight design. Although the proposed GL framework is trained with supervision generated by BCD under full CSI, inference is performed in a fully CSI-free manner. The proposed GL integrates subspace approximation with adjusted bias (Saab), relevant feature test (RFT)-based supervised feature selection, and eXtreme gradient boosting (XGBoost)-based decision learning to jointly predict the STAR-RIS coefficients and transmit precoder. Simulation results show that the proposed GL approach achieves competitive spectral efficiency compared to BCD and DL-based models, while reducing floating-point operations (FLOPs) by over four orders of magnitude. These advantages make the proposed GL approach highly suitable for real-time deployment in energy- and hardware-constrained broadcasting scenarios.

Updated: 2025-09-08 15:56:06

标题: STAR-RIS毫米波系统中隐式CSI的绿色学习

摘要: 本文为同时透射与反射可重构智能表面(STAR-RIS)辅助的毫米波(mmWave)MIMO广播系统提出了一种基于绿色学习(GL)的预编码框架。鉴于未来6G网络对环境可持续性的日益重视,本研究针对多用户共享相同信息的场景采用广播传输架构,从而提高频谱效率并减少冗余传输和功耗。与块坐标下降(BCD)等需要完美信道状态信息(CSI)和迭代计算的传统优化方法不同,所提出的GL框架直接基于接收到的上行导频信号运行,无需显式的CSI估计。与需要基于CSI的标签进行训练的深度学习(DL)方法不同,所提出的GL方法还避免了深度神经网络和反向传播,设计更为轻量。尽管该GL框架的训练使用了BCD在完整CSI下生成的监督信号,但推断完全不依赖CSI。该框架集成了带偏置调整的子空间逼近(Saab)、基于相关特征测试(RFT)的监督特征选择,以及基于极端梯度提升(XGBoost)的决策学习,以联合预测STAR-RIS系数和发射预编码器。仿真结果表明,与BCD和基于DL的模型相比,所提出的GL方法实现了具有竞争力的频谱效率,同时将浮点运算量(FLOPs)降低了四个数量级以上。这些优势使其非常适合在能量和硬件受限的广播场景中进行实时部署。

更新时间: 2025-09-08 15:56:06

领域: eess.SP,cs.LG

下载: http://arxiv.org/abs/2509.06820v1

UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward

Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preserving. Code and model: https://github.com/bytedance/UMO
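
The "multi-to-multi matching" idea (choose an assignment of reference identities to generated faces that maximizes total identity similarity) can be sketched as a global assignment problem. Brute force over permutations stands in for a proper assignment solver here, and the similarity matrix is a toy stand-in for identity-embedding similarities.

```python
from itertools import permutations

# Sketch of global assignment for multi-identity matching: pick the
# one-to-one mapping of references to generated faces with the highest
# total similarity. Numbers are illustrative, not model outputs.

def best_assignment(sim):
    """sim[i][j]: similarity of reference identity i to generated face j."""
    n = len(sim)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):          # fine for small n
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score

sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.2],
       [0.1, 0.4, 0.7]]
perm, score = best_assignment(sim)
print(perm, score)  # each reference matched to its closest face
```

Scoring the assignment globally, rather than matching each face greedily, is what prevents two generated faces from both being credited to the same reference identity.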

Updated: 2025-09-08 15:54:55

标题: UMO:通过匹配奖励实现图像定制的多身份一致性扩展

摘要: 近期图像定制技术的进展凭借更强的定制能力展现出广阔的应用前景。然而,由于人类对人脸尤为敏感,在使用多张参考图像时,如何在保持身份一致的同时避免身份混淆,仍是一个重大挑战,这限制了定制模型的身份可扩展性。为此,我们提出了UMO,一个统一的多身份优化框架,旨在以可扩展的方式保持高保真的身份保留并缓解身份混淆。借助"多对多匹配"范式,UMO将多身份生成重新表述为一个全局分配优化问题,并通过在扩散模型上进行强化学习,普遍释放现有图像定制方法的多身份一致性。为便于UMO的训练,我们构建了一个带多参考图像的可扩展定制数据集,包含合成与真实两部分。此外,我们提出了一个衡量身份混淆的新指标。大量实验表明,UMO不仅显著提升了身份一致性,还在多种图像定制方法上减少了身份混淆,在身份保持这一维度上为开源方法树立了新的最先进水平。代码和模型:https://github.com/bytedance/UMO

更新时间: 2025-09-08 15:54:55

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.06818v1

Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally-generated, gender-neutral prompts to generate people in stereotypically-male, stereotypically-female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs systematically not only reproduce but actually amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men in 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.
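
The headline statistic is a simple share-of-male-depictions computation per stereotype category. The counts below are chosen to mirror the reported percentages for illustration; they are not the study's raw data.

```python
# Toy reproduction of the benchmark's core measurement: fraction of
# generated images depicting men, by profession stereotype category.
# Counts are illustrative stand-ins for the study's 965 scored images.

def male_share(images):
    males = sum(1 for g in images if g == "man")
    return males / len(images)

by_category = {   # hypothetical per-category generation results
    "male-stereotyped":   ["man"] * 93 + ["woman"] * 7,
    "female-stereotyped": ["man"] * 22 + ["woman"] * 78,
    "non-stereotyped":    ["man"] * 68 + ["woman"] * 32,
}
shares = {c: male_share(imgs) for c, imgs in by_category.items()}
print(shares)
```

Comparing these shares against real-world labor statistics per profession is what lets the study distinguish mere reproduction of stereotypes from amplification.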

Updated: 2025-09-08 15:54:25

标题: 13个大型多模态模型的性别偏见自动评估

摘要: 大型多模态模型(LMMs)已经彻底改变了文本到图像生成,但它们有可能延续训练数据中的有害社会偏见。先前的研究已发现这些模型存在性别偏见,但方法上的限制阻碍了大规模、可比较的跨模型分析。为填补这一空白,我们引入了Aymara图像公平性评估,一个用于评估AI生成图像中社会偏见的基准。我们使用75个程序化生成的性别中立提示,测试了13个商业可用的LMMs,令其生成从事男性刻板职业、女性刻板职业和非刻板职业的人物,然后用一个经过验证的LLM-as-a-judge系统对生成的965张图像的性别表征进行评分。我们的结果显示(所有结果p < .001):1)相对于真实世界的劳动力数据,LMMs不仅系统性地复制而且实际上放大了职业性别刻板印象,在男性刻板职业的图像中93.0%生成男性,而在女性刻板职业中仅为22.5%;2)模型表现出强烈的默认男性偏见,在非刻板职业中有68.3%的情况生成男性;3)偏见程度因模型而异,整体男性表征比例从46.7%到73.3%不等。值得注意的是,表现最佳的模型削弱了性别刻板印象并接近性别均等,获得了最高的公平性得分。这种差异表明,高偏见并非不可避免的结果,而是设计选择的后果。我们的工作提供了迄今为止最全面的性别偏见跨模型基准,并强调了标准化、自动化评估工具对于促进AI开发中的问责与公平的必要性。

更新时间: 2025-09-08 15:54:25

领域: cs.CV,cs.AI,cs.CY,I.2.7; F.2.2

下载: http://arxiv.org/abs/2509.07050v1

Automatic Prompt Optimization with Prompt Distillation

Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt -- a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, and aggregation operations to explore the prompt space more thoroughly. The method was tested on different datasets for text classification and generation tasks using the t-lite-instruct-0.1 language model. The results demonstrate a significant average improvement (e.g., 20.12% across the entire dataset compared to Grips) in key metrics over existing methods in the field, establishing DistillPrompt as one of the most effective non-gradient approaches in autoprompting.

Updated: 2025-09-08 15:50:16

标题: 通过提示蒸馏进行自动提示优化

摘要: 自动提示(autoprompting)是为语言模型自动选择经过优化的提示的过程;随着大型语言模型(LLMs)领域的大量研究推动提示工程快速发展,该方向正日益受到关注。本文提出了DistillPrompt,一种基于大型语言模型的新型自动提示方法,它利用训练数据将任务特定信息分多阶段整合到提示中。DistillPrompt通过蒸馏、压缩和聚合操作更充分地探索提示空间。该方法使用t-lite-instruct-0.1语言模型,在多个文本分类与生成任务的数据集上进行了测试。结果表明,相对于该领域的现有方法,DistillPrompt在关键指标上取得了显著的平均提升(例如,在整个数据集上比Grips提升20.12%),使其成为自动提示中最有效的非梯度方法之一。

更新时间: 2025-09-08 15:50:16

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2508.18992v2

Navigating the EU AI Act: Foreseeable Challenges in Qualifying Deep Learning-Based Automated Inspections of Class III Medical Devices

As deep learning (DL) technologies advance, their application in automated visual inspection for Class III medical devices offers significant potential to enhance quality assurance and reduce human error. However, the adoption of such AI-based systems introduces new regulatory complexities-particularly under the EU Artificial Intelligence (AI) Act, which imposes high-risk system obligations that differ in scope and depth from established regulatory frameworks such as the Medical Device Regulation (MDR) and the U.S. FDA Quality System Regulation (QSR). This paper presents a high-level technical assessment of the foreseeable challenges that manufacturers are likely to encounter when qualifying DL-based automated inspections -- specifically static models -- within the existing medical device compliance landscape. It examines divergences in risk management principles, dataset governance, model validation, explainability requirements, and post-deployment monitoring obligations. The discussion also explores potential implementation strategies and highlights areas of uncertainty, including data retention burdens, global compliance implications, and the practical difficulties of achieving statistical significance in validation with limited defect data. Disclaimer: This paper presents a technical perspective and does not constitute legal or regulatory advice.

Updated: 2025-09-08 15:48:19

标题: 应对欧盟人工智能法案:III类医疗器械基于深度学习的自动化检验认证中可预见的挑战

摘要: 随着深度学习(DL)技术的进步,其在III类医疗器械自动视觉检验中的应用,具有增强质量保证、减少人为错误的巨大潜力。然而,采用此类基于AI的系统带来了新的监管复杂性,特别是在欧盟人工智能(AI)法案之下:该法案施加的高风险系统义务,在范围和深度上都不同于医疗器械法规(MDR)和美国FDA质量体系法规(QSR)等既有监管框架。本文从较高的技术层面评估了制造商在现有医疗器械合规体系下,对基于深度学习的自动检验(特别是静态模型)进行资质认证时可能遇到的可预见挑战,考察了风险管理原则、数据集治理、模型验证、可解释性要求以及部署后监测义务方面的分歧。讨论还探讨了潜在的实施策略,并指出了若干不确定领域,包括数据保留负担、全球合规影响,以及在缺陷数据有限的情况下在验证中达到统计显著性的实际困难。免责声明:本文仅提供技术观点,不构成法律或监管建议。

更新时间: 2025-09-08 15:48:19

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2508.20144v2

Reward function compression facilitates goal-dependent reinforcement learning

Reinforcement learning agents learn from rewards, but humans can uniquely assign value to novel, abstract outcomes in a goal-dependent manner. However, this flexibility is cognitively costly, making learning less efficient. Here, we propose that goal-dependent learning is initially supported by a capacity-limited working memory system. With consistent experience, learners create a "compressed" reward function (a simplified rule defining the goal) which is then transferred to long-term memory and applied automatically upon receiving feedback. This process frees up working memory resources, boosting learning efficiency. We test this theory across six experiments. Consistent with our predictions, our findings demonstrate that learning is parametrically impaired by the size of the goal space, but improves when the goal space structure allows for compression. We also find faster reward processing to correlate with better learning performance, supporting the idea that as goal valuation becomes more automatic, more resources are available for learning. We leverage computational modeling to support this interpretation. Our work suggests that efficient goal-directed learning relies on compressing complex goal information into a stable reward function, shedding light on the cognitive mechanisms of human motivation. These findings generate new insights into the neuroscience of intrinsic motivation and could help improve behavioral techniques that support people in achieving their goals.

Updated: 2025-09-08 15:43:40

标题: 奖励函数压缩有助于基于目标的强化学习

摘要: 强化学习代理从奖励中学习,而人类独有的能力在于能够以依赖目标的方式为新颖、抽象的结果赋予价值。然而,这种灵活性在认知上代价高昂,会降低学习效率。在此,我们提出:目标依赖的学习最初由一个容量有限的工作记忆系统支持。随着经验的积累,学习者会形成一个"压缩"的奖励函数(一个定义目标的简化规则),并将其转入长期记忆,在收到反馈时自动调用。这一过程释放了工作记忆资源,从而提高学习效率。我们通过六个实验检验了这一理论。与我们的预测一致,研究结果表明,学习会随目标空间规模的增大而参数化地受损,但当目标空间结构允许压缩时,学习会得到改善。我们还发现,更快的奖励处理与更好的学习表现相关,支持了这样一个观点:随着目标估值变得更加自动化,可用于学习的资源也随之增加。我们利用计算建模支持这一解释。我们的工作表明,高效的目标导向学习依赖于将复杂的目标信息压缩为一个稳定的奖励函数,从而揭示了人类动机的认知机制。这些发现为内在动机的神经科学提供了新的见解,并有望帮助改进支持人们实现目标的行为技术。

更新时间: 2025-09-08 15:43:40

领域: q-bio.NC,cs.LG

下载: http://arxiv.org/abs/2509.06810v1

Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem

The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover's saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for "interesting" theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available. https://github.com/sileod/reasoning_core https://hf.co/datasets/reasoning-core/rc1
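
The saturate-filter-generate pipeline can be sketched as follows. The `is_interesting` heuristic here (drop trivialities and overlong clauses) is our guess at the kind of filtering used, not the authors' actual criteria, and the clause strings are toy stand-ins for E-prover output.

```python
# Hedged sketch of the pipeline: take theorems derived by saturation,
# filter out uninteresting ones, and wrap survivors as entailment tasks.
# The filter and clause strings are illustrative assumptions.

def is_interesting(clause, max_len=60):
    trivial = clause in ("$true", "$false")
    return not trivial and len(clause) <= max_len

def make_entailment_task(axioms, theorem):
    """Turn a derived theorem into an entailment-verification example;
    validity is guaranteed by construction, since the prover derived it."""
    return {"premises": list(axioms), "hypothesis": theorem, "label": "entailed"}

derived = ["subset(X,X)", "$true", "member(X,singleton(X))"]
tasks = [make_entailment_task(["set theory axioms..."], t)
         for t in derived if is_interesting(t)]
print(len(tasks))  # trivial clauses filtered out
```

Because no LLM appears anywhere in this loop, every generated label is correct by construction, which is the point the abstract emphasizes.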

Updated: 2025-09-08 15:43:29

标题: 基于TPTP生态系统中LLM数学推理的饱和驱动数据集生成

摘要: 高质量、逻辑可靠数据的稀缺,是推进大型语言模型(LLMs)数学推理能力的关键瓶颈。我们的工作通过将数十年的自动定理证明研究转化为可扩展的数据引擎来应对这一挑战。我们的框架不依赖容易出错的LLMs,也不依赖Lean和Isabelle等复杂的证明助手语法,而是利用E-prover在庞大的TPTP公理库上的饱和能力,推导出一个规模巨大、保证有效的定理语料库。我们的流水线原则清晰且简单:对公理进行饱和推理,筛选"有趣"的定理,再生成任务。由于全程没有LLMs参与,我们从构造上消除了事实性错误。这些纯符号数据随后被转化为三种难度可控的挑战:蕴涵验证、前提选择和证明重建。我们在前沿模型上的零样本实验揭示了一个明显的弱点:在需要深层结构化推理的任务上,性能会崩溃。我们的框架既提供了衡量这一差距的诊断工具,也提供了可扩展的符号训练数据来源以弥补这一差距。我们公开了代码和数据:https://github.com/sileod/reasoning_core https://hf.co/datasets/reasoning-core/rc1

更新时间: 2025-09-08 15:43:29

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.06809v1

MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML

Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
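
A token-efficient serialization of a many-shot tabular task might look like the sketch below: one compact header naming the features, then one terse line per shot, instead of repeating feature names in every example. The exact format is our own illustration, not the paper's prompt.

```python
# Sketch of compact many-shot tabular serialization for in-context ML:
# header once, then "v1|v2->label" per shot, query ends with "->?".
# Feature names and values are hypothetical.

def serialize_task(feature_names, shots, query):
    header = "|".join(feature_names) + "->label"
    lines = ["|".join(str(v) for v in x) + "->" + str(y) for x, y in shots]
    lines.append("|".join(str(v) for v in query) + "->?")
    return "\n".join([header] + lines)

shots = [((5.1, 3.5), "A"), ((6.7, 3.0), "B")]
prompt = serialize_task(["sepal_len", "sepal_wid"], shots, (5.0, 3.4))
print(prompt)
```

Dropping per-example field names is one simple way to fit several times more demonstrations into a fixed context window, which is the lever behind the 3x to 6x figure quoted above.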

Updated: 2025-09-08 15:38:31

标题: MachineLearningLM:在数百万个合成表格预测任务上继续预训练语言模型可扩展上下文机器学习

摘要: 大型语言模型(LLMs)拥有广博的世界知识和强大的通用推理能力,但在标准机器学习(ML)任务上,它们难以从大量上下文示例中学习,即难以在不进行梯度下降的情况下,纯粹通过上下文学习(ICL)利用多样本演示。我们提出了MachineLearningLM,一个可移植的继续预训练框架,它在为通用LLM赋予稳健的上下文ML能力的同时,保留其面向更广泛聊天工作流的通用知识和推理能力。我们的预训练过程从数百万个结构因果模型(SCMs)中合成ML任务,示例数最多可达1,024个。我们先以随机森林作为教师,将基于树的决策策略蒸馏到LLM中,以增强数值建模的稳健性。所有任务均以词元高效的提示进行序列化,使每个上下文窗口可容纳3至6倍的示例,并通过批量推断实现最高50倍的摊销吞吐量。尽管配置并不铺张(Qwen-2.5-7B-Instruct,LoRA秩为8),MachineLearningLM在金融、物理、生物和医疗领域的分布外表格分类任务上,平均比强大的LLM基线(如GPT-5-mini)高出约15%。它展现出引人注目的多样本(many-shot)扩展规律:随着上下文演示从8个增加到1,024个,准确率单调上升。在没有任何任务特定训练的情况下,它在数百个示例的设置下达到了随机森林级别的准确率。通用聊天能力,包括知识与推理,均得到保留:它在MMLU上取得了75.4%的成绩。

更新时间: 2025-09-08 15:38:31

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.06806v1

Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning

We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.
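
The contraction claim can be checked numerically. With constant functions quotiented out, a natural semi-norm is the span $sp(f) = \max f - \min f$; the toy example below applies a robust Bellman-style operator under a contamination uncertainty set (an adversary diverts $\delta$ probability mass to the worst state) and verifies that the span of a difference of value functions shrinks by at least the discount factor. All transition numbers are made up; this illustrates the semi-norm mechanics, not the paper's algorithm.

```python
# Numerical illustration: the span semi-norm sp(f) = max f - min f
# (constants quotiented out) is contracted by a robust Bellman-style
# operator under contamination. Toy 3-state MDP with invented numbers.

def span(f):
    return max(f) - min(f)

def robust_backup(V, P_row, delta):
    """Contamination set: adversary moves delta mass to the worst state."""
    nominal = sum(p * v for p, v in zip(P_row, V))
    return (1 - delta) * nominal + delta * min(V)

def operator(V, P, r, gamma=0.9, delta=0.1):
    return [r[s] + gamma * robust_backup(V, P[s], delta) for s in range(len(V))]

P = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
r = [1.0, 0.0, 0.5]
U, V = [0.0, 4.0, 2.0], [1.0, 1.0, 1.0]
diff_in = [a - b for a, b in zip(U, V)]
diff_out = [a - b for a, b in zip(operator(U, P, r), operator(V, P, r))]
print(span(diff_out), "<=", 0.9 * span(diff_in))  # contraction in span
```

Note that adding a constant to V leaves both the span and the greedy behavior unchanged, which is exactly why the semi-norm (rather than the sup-norm) is the right lens for average-reward analysis.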

Updated: 2025-09-08 15:34:50

标题: 高效的Q学习和演员-评论家方法用于稳健的平均奖励强化学习

摘要: 我们提出了$Q$-learning和actor-critic算法在污染、总变差(TV)距离和Wasserstein不确定性集下鲁棒平均回报马尔可夫决策过程(MDPs)的非渐近收敛分析。我们分析的关键因素是表明,最优鲁棒$Q$算子是相对于一个精心设计的半范数(将常数函数商掉)的严格收缩。这个特性使得可以使用$\tilde{\mathcal{O}}(\epsilon^{-2})$个样本来学习最优鲁棒$Q$函数的随机逼近更新。我们还提供了一个鲁棒$Q$函数估计的高效例程,进而促进了鲁棒评论家估计。基于此,我们引入了一个actor-critic算法,在$\tilde{\mathcal{O}}(\epsilon^{-2})$个样本内学习一个$\epsilon$-最优的鲁棒策略。我们提供了数值模拟来评估我们算法的性能。

更新时间: 2025-09-08 15:34:50

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2506.07040v2

Imitative Membership Inference Attack

A Membership Inference Attack (MIA) assesses how much a target machine learning model reveals about its training data by determining whether specific query instances were part of the training set. State-of-the-art MIAs rely on training hundreds of shadow models that are independent of the target model, leading to significant computational overhead. In this paper, we introduce Imitative Membership Inference Attack (IMIA), which employs a novel imitative training technique to strategically construct a small number of target-informed imitative models that closely replicate the target model's behavior for inference. Extensive experimental results demonstrate that IMIA substantially outperforms existing MIAs in various attack settings while only requiring less than 5% of the computational cost of state-of-the-art approaches.
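The abstract does not spell out IMIA's scoring rule, but the general shape of a loss-calibrated membership score can be sketched: compare the target model's loss on a query against losses produced by a small set of reference models (here standing in for the imitative models). The z-score form and toy numbers below are assumptions for illustration only.

```python
import numpy as np

def mia_score(target_loss, reference_losses):
    """Membership score: how unusually low the target model's loss on a
    query is, relative to losses from reference models. A training member
    typically yields a much lower target loss, hence a larger score."""
    mu = float(np.mean(reference_losses))
    sd = float(np.std(reference_losses)) + 1e-8  # avoid division by zero
    return (mu - target_loss) / sd

# Toy numbers: a member's loss is far below the reference distribution,
# a non-member's loss is indistinguishable from it.
member_score = mia_score(0.05, [0.9, 1.1, 1.0])
non_member_score = mia_score(1.0, [0.9, 1.1, 1.0])
print(member_score > non_member_score)
```

The claimed efficiency gain of IMIA comes from needing only a handful of target-informed reference models instead of hundreds of independent shadow models; the scoring side stays cheap either way.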

Updated: 2025-09-08 15:27:35

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2509.06796v1

Dato: A Task-Based Programming Model for Dataflow Accelerators

Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development overhead, whereas high-level tile-based languages abstract away communication details, restricting optimization and forcing compilers to reconstruct the intended dataflow. We present Dato, a Python-embedded, task-based programming model for dataflow accelerators that elevates data communication and sharding to first-class type constructs. Developers write programs as a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types. These tasks are first mapped virtually onto the accelerator's spatial fabric, and the compiler then generates a physical mapping that respects hardware constraints. Experimental results on both AMD Ryzen AI NPU and Alveo FPGA devices demonstrate that Dato achieves high performance while significantly reducing the burden of writing optimized code. On the NPU, Dato attains up to 84% hardware utilization for GEMM and delivers a 2.81x speedup on attention kernels compared to a state-of-the-art commercial framework. On the FPGA, Dato surpasses leading frameworks in performance when generating custom systolic arrays, achieving 98% of the theoretical peak performance.

Updated: 2025-09-08 15:22:51

Categories: cs.PL,cs.AR,cs.LG

Download: http://arxiv.org/abs/2509.06794v1

Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model's response can strongly influence the user's next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.

Updated: 2025-09-08 15:18:35

Categories: cs.AI,cs.CL,cs.CY,cs.HC,cs.SC

Download: http://arxiv.org/abs/2509.01909v3

Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude

Recent advances in Large Language Models (LLMs) have enabled human-like responses across various tasks, raising questions about their ethical decision-making capabilities and potential biases. This study systematically evaluates how nine popular LLMs (both open-source and closed-source) respond to ethical dilemmas involving protected attributes. Across 50,400 trials spanning single and intersectional attribute combinations in four dilemma scenarios (protective vs. harmful), we assess models' ethical preferences, sensitivity, stability, and clustering patterns. Results reveal significant biases in protected attributes in all models, with differing preferences depending on model type and dilemma context. Notably, open-source LLMs show stronger preferences for marginalized groups and greater sensitivity in harmful scenarios, while closed-source models are more selective in protective situations and tend to favor mainstream groups. We also find that ethical behavior varies across dilemma types: LLMs maintain consistent patterns in protective scenarios but respond with more diverse and cognitively demanding decisions in harmful ones. Furthermore, models display more pronounced ethical tendencies under intersectional conditions than in single-attribute settings, suggesting that complex inputs reveal deeper biases. These findings highlight the need for multi-dimensional, context-aware evaluation of LLMs' ethical behavior and offer a systematic evaluation and approach to understanding and addressing fairness in LLM decision-making.

Updated: 2025-09-08 15:16:05

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2501.10484v3

\texttt{R$^\textbf{2}$AI}: Towards Resistant and Resilient AI in an Evolving World

In this position paper, we address the persistent gap between rapidly growing AI capabilities and lagging safety progress. Existing paradigms divide into ``Make AI Safe'', which applies post-hoc alignment and guardrails but remains brittle and reactive, and ``Make Safe AI'', which emphasizes intrinsic safety but struggles to address unforeseen risks in open-ended environments. We therefore propose \textit{safe-by-coevolution} as a new formulation of the ``Make Safe AI'' paradigm, inspired by biological immunity, in which safety becomes a dynamic, adversarial, and ongoing learning process. To operationalize this vision, we introduce \texttt{R$^2$AI} -- \textit{Resistant and Resilient AI} -- as a practical framework that unites resistance against known threats with resilience to unforeseen risks. \texttt{R$^2$AI} integrates \textit{fast and slow safe models}, adversarial simulation and verification through a \textit{safety wind tunnel}, and continual feedback loops that guide safety and capability to coevolve. We argue that this framework offers a scalable and proactive path to maintain continual safety in dynamic environments, addressing both near-term vulnerabilities and long-term existential risks as AI advances toward AGI and ASI.

Updated: 2025-09-08 15:13:23

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06786v1

Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning

Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice due to the need to learn from datasets with limited coverage of the state-action space and to generalize across long-horizon tasks. To address these challenges, we propose a Physics-informed (Pi) regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE), which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Physics-informed HIQL (Pi-HIQL), yields significant improvements in both performance and generalization, with pronounced gains in stitching regimes and large-scale navigation tasks.
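The Eikonal-style regularizer idea can be illustrated minimally: penalize deviations of the value function's spatial gradient norm from a unit cost rate, so that values behave like distances-to-go. The finite-difference estimator and the unit cost below are assumptions; the paper's actual loss operates on learned networks and may use a different cost term.

```python
import numpy as np

def eikonal_residual(values, coords):
    """Sketch of an Eikonal PDE residual ||grad V|| = 1 in one dimension.

    A distance-like (cost-to-go) value function has unit gradient norm
    almost everywhere; this returns the mean squared deviation from that,
    with the gradient estimated by finite differences.
    """
    grad = np.gradient(values, coords)
    return float(np.mean((np.abs(grad) - 1.0) ** 2))

xs = np.linspace(0.0, 1.0, 101)
print(eikonal_residual(xs, xs))        # ~0: V(x) = x satisfies the PDE
print(eikonal_residual(3.0 * xs, xs))  # slope 3 is penalized
```

In a learned setting the same residual would be evaluated with automatic differentiation of the value network with respect to states, added to the TD loss as a weighted regularizer.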

Updated: 2025-09-08 15:08:42

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06782v1

Asynchronous Message Passing for Addressing Oversquashing in Graph Neural Networks

Graph Neural Networks (GNNs) suffer from oversquashing, which occurs when tasks require long-range interactions. The problem arises from the presence of bottlenecks that limit the propagation of messages among distant nodes. Recently, graph rewiring methods modify edge connectivity and are expected to perform well on long-range tasks. Yet, graph rewiring compromises the inductive bias, incurring significant information loss in solving the downstream task. Furthermore, increasing channel capacity may overcome information bottlenecks but increases the parameter complexity of the model. To alleviate these shortcomings, we propose an efficient model-agnostic framework that updates node features asynchronously, unlike traditional synchronous message-passing GNNs. Our framework creates node batches in every layer based on node centrality values, and only the features of the nodes belonging to these batches are updated. Asynchronous message updates process information sequentially across layers, avoiding simultaneous compression into fixed-capacity channels. We also theoretically establish that our proposed framework maintains higher feature sensitivity bounds compared to standard synchronous approaches. Our framework is applied to six standard graph datasets and two long-range datasets for graph classification, achieving impressive performance with improvements of $5\%$ and $4\%$ on REDDIT-BINARY and Peptides-struct, respectively.
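The centrality-based batching scheme can be sketched as follows: at each layer, only a top-k batch of nodes by centrality refreshes its features, while the rest carry their previous state forward. The mixing rule, the use of degree centrality, and the batch fraction are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def async_propagate(adj, x, num_layers, batch_frac=0.5):
    """Asynchronous message passing sketch.

    Unlike a synchronous GNN, only the highest-degree-centrality batch of
    nodes updates its features at each layer; other nodes keep their state.
    """
    deg = adj.sum(axis=1)
    k = max(1, int(batch_frac * len(x)))
    batch = np.argsort(-deg)[:k]  # top-k nodes by degree centrality
    h = x.astype(float)
    for _ in range(num_layers):
        msgs = adj @ h                                  # neighbor aggregation
        h[batch] = 0.5 * h[batch] + 0.5 * msgs[batch]   # update only the batch
    return h

# Path graph 0-1-2-3; a feature starts at node 0 and spreads through
# the updated (high-centrality) interior nodes only.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = async_propagate(adj, np.array([1.0, 0.0, 0.0, 0.0]), num_layers=2)
print(h)
```

Note how the low-centrality endpoints (nodes 0 and 3) are never rewritten, which is what keeps information from being squeezed through every channel simultaneously.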

Updated: 2025-09-08 15:03:05

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06777v1

CHIRLA: Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis

Person re-identification (Re-ID) is a key challenge in computer vision, requiring the matching of individuals across cameras, locations, and time. While most research focuses on short-term scenarios with minimal appearance changes, real-world applications demand robust systems that handle long-term variations caused by clothing and physical changes. We present CHIRLA, Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis, a novel dataset designed for video-based long-term person Re-ID. CHIRLA was recorded over seven months in four connected indoor environments using seven strategically placed cameras, capturing realistic movements with substantial clothing and appearance variability. The dataset includes 22 individuals, more than five hours of video, and about 1M bounding boxes with identity annotations obtained through semi-automatic labeling. We also define benchmark protocols for person tracking and Re-ID, covering diverse and challenging scenarios such as occlusion, reappearance, and multi-camera conditions. By introducing this comprehensive benchmark, we aim to facilitate the development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios. The benchmark code is publicly available at: https://github.com/bdager/CHIRLA.

Updated: 2025-09-08 15:01:10

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.06681v2

Emergence of the Primacy Effect in Structured State-Space Models

Structured state-space models (SSMs) have been developed to offer more persistent memory retention than traditional recurrent neural networks, while maintaining real-time inference capabilities and addressing the time-complexity limitations of Transformers. Despite this intended persistence, the memory mechanism of canonical SSMs is theoretically designed to decay monotonically over time, meaning that more recent inputs are expected to be retained more accurately than earlier ones. Contrary to this theoretical expectation, however, the present study reveals a counterintuitive finding: when trained and evaluated on a synthetic, statistically balanced memorization task, SSMs predominantly preserve the *initially* presented data in memory. This pattern of memory bias, known as the *primacy effect* in psychology, presents a non-trivial challenge to the current theoretical understanding of SSMs and opens new avenues for future research.
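The theoretically expected recency bias that the paper's finding contradicts is visible in even a scalar linear SSM: with a decaying state transition, the contribution of input $x_k$ to the final state scales as $a^{T-k}$, so early inputs should be retained worse. This toy scalar recurrence is a sketch of the theory, not a trained SSM.

```python
def ssm_scan(a, inputs):
    """Minimal diagonal linear SSM: h_t = a * h_{t-1} + x_t.

    With |a| < 1, input x_k contributes a**(T-k) to the final state h_T,
    i.e., canonical theory predicts memory decaying monotonically with age.
    """
    h = 0.0
    for x in inputs:
        h = a * h + x
    return h

# The FIRST input's trace decays geometrically; the LAST survives intact.
first_only = [1.0, 0.0, 0.0, 0.0]
last_only = [0.0, 0.0, 0.0, 1.0]
print(ssm_scan(0.5, first_only), ssm_scan(0.5, last_only))
```

The primacy effect reported above is the opposite pattern: trained SSMs recalling the first items best, which this linear analysis cannot explain.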

Updated: 2025-09-08 14:59:54

Categories: cs.LG,cs.NE,q-bio.NC

Download: http://arxiv.org/abs/2502.13729v5

FACEGroup: Feasible and Actionable Counterfactual Explanations for Group Fairness

Counterfactual explanations assess unfairness by revealing how inputs must change to achieve a desired outcome. This paper introduces the first graph-based framework for generating group counterfactual explanations to audit group fairness, a key aspect of trustworthy machine learning. Our framework, FACEGroup (Feasible and Actionable Counterfactual Explanations for Group Fairness), models real-world feasibility constraints, identifies subgroups with similar counterfactuals, and captures key trade-offs in counterfactual generation, distinguishing it from existing methods. To evaluate fairness, we introduce novel metrics for both group and subgroup level analysis that explicitly account for these trade-offs. Experiments on benchmark datasets show that FACEGroup effectively generates feasible group counterfactuals while accounting for trade-offs, and that our metrics capture and quantify fairness disparities.

Updated: 2025-09-08 14:59:02

Categories: cs.LG,cs.AI,stat.ME

Download: http://arxiv.org/abs/2410.22591v3

Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting

Large language models (LLMs) are now used in multi-turn workflows, but we still lack a clear way to measure when iteration helps and when it hurts. We present an evaluation framework for iterative refinement that spans ideation, code, and math. Our protocol runs controlled 12-turn conversations per task, utilizing a variety of prompts ranging from vague ``improve it'' feedback to targeted steering, and logs per-turn outputs. We score outcomes with domain-appropriate checks (unit tests for code; answer-equivalence plus reasoning-soundness for math; originality and feasibility for ideation) and track turn-level behavior with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Across models and tasks, gains are domain-dependent: they arrive early in ideas and code, but in math late turns matter when guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, while targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; in math, elaboration outperforms exploration and drives late-turn gains). We also observe consistent domain patterns: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative iteration. Together, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.

Updated: 2025-09-08 14:54:31

Categories: cs.AI,cs.HC

Download: http://arxiv.org/abs/2509.06770v1

DCPO: Dynamic Clipping Policy Optimization

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing DAPO (36.7/31.6), GRPO (36.7/32.1) and GSPO (40.0/34.9) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5), DAPO (20.0/15.3) and GSPO (16.7/9.9). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO across four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
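A hedged sketch of the two ingredients named above. The abstract does not give DCPO's exact formulas, so the power-law widening of the clip range for low-prior tokens and the running-statistics standardization below are assumptions chosen to illustrate the mechanism, not the paper's method.

```python
import numpy as np

def dynamic_clip_bounds(prior_prob, base_eps=0.2, alpha=0.5):
    """Illustrative dynamic clipping: widen the PPO-style clip range
    [1 - eps, 1 + eps] for low-prior-probability tokens, so rare tokens
    get more room for exploration. The eps / prior**alpha schedule and
    the 0.8 cap are assumptions."""
    eps = base_eps * (1.0 / np.maximum(prior_prob, 1e-6)) ** alpha
    eps = np.minimum(eps, 0.8)  # cap so ratios stay bounded
    return 1.0 - eps, 1.0 + eps

def smooth_standardize(rewards, running_mean, running_var, step):
    """Standardize advantages with statistics accumulated across training
    steps instead of a single batch (a sketch of 'smooth' standardization,
    which avoids zeroing out batches with identical rewards over time)."""
    m = (running_mean * step + rewards.mean()) / (step + 1)
    v = (running_var * step + rewards.var()) / (step + 1)
    return (rewards - m) / np.sqrt(v + 1e-8), m, v

lo_rare, hi_rare = dynamic_clip_bounds(np.array([0.01]))
lo_common, hi_common = dynamic_clip_bounds(np.array([0.9]))
print(hi_rare[0] > hi_common[0])  # rarer tokens get a wider clip window

rewards = np.array([1.0, 0.0, 1.0])
adv, m, v = smooth_standardize(rewards, running_mean=0.0, running_var=0.0, step=0)
```

The point of both pieces is the same: avoid the zero-gradient regime that fixed bounds and per-batch standardization of identical rewards produce in GRPO-style training.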

Updated: 2025-09-08 14:50:44

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.02333v2

Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization

Large Vision-Language Models (LVLMs) or multimodal large language models represent a significant advancement in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While large-scale pretraining has driven substantial progress, fine-tuning these models for aligning with human values or engaging in specific tasks or behaviors remains a critical challenge. Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO) offer promising frameworks for this aligning process. While DRL enables models to optimize actions using reward signals instead of relying solely on supervised preference data, DPO directly aligns the policy with preferences, eliminating the need for an explicit reward model. This overview explores paradigms for fine-tuning LVLMs, highlighting how DRL and DPO techniques can be used to align models with human preferences and values, improve task performance, and enable adaptive multimodal interaction. We categorize key approaches, examine sources of preference data, reward signals, and discuss open challenges such as scalability, sample efficiency, continual learning, generalization, and safety. The goal is to provide a clear understanding of how DRL and DPO contribute to the evolution of robust and human-aligned LVLMs.
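The DPO objective surveyed above has a standard closed form for a single preference pair: it pushes the policy's log-ratio over the reference model to be higher on the preferred response than on the dispreferred one, with no explicit reward model. A minimal sketch:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:

        -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))

    where logp_* are the policy's log-probabilities of the preferred (w)
    and dispreferred (l) responses, and ref_logp_* come from the frozen
    reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is 0 and the loss is
# log 2; preferring y_w more strongly than the reference lowers the loss.
print(dpo_loss(-1.5, -1.5, -1.5, -1.5))
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

In contrast, the DRL route described above would fit or assume a reward signal and optimize it with a policy-gradient method such as PPO; DPO folds that step into the loss itself.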

Updated: 2025-09-08 14:47:57

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06759v1

VIBESegmentator: Full Body MRI Segmentation for the NAKO and UK Biobank

Objectives: To present a publicly available deep learning-based torso segmentation model that provides comprehensive voxel-wise coverage, including delineations that extend to the boundaries of anatomical compartments. Materials and Methods: We extracted preliminary segmentations from TotalSegmentator, spine, and body composition models for Magnetic Resonance Tomography (MR) images, then improved them iteratively and retrained an nnUNet model. Using a random retrospective subset of German National Cohort (NAKO), UK Biobank, internal MR and Computed Tomography (CT) data (Training: 2897 series from 626 subjects, 290 female; mean age 53+-16; 3-fold cross-validation (20% hold-out). Internal testing: 36 series from 12 subjects, 6 male; mean age 60+-11), we segmented 71 structures in torso MR and 72 in CT images: 20 organs, 10 muscles, 19 vessels, 16 bones, ribs in CT, intervertebral discs, spinal cord, spinal canal and body composition (subcutaneous fat, unclassified muscles and visceral fat). For external validation, we used existing automatic organ segmentations, independent ground truth segmentations on gradient echo images, and the Amos data. We used non-parametric bootstrapping for confidence intervals and Wilcoxon rank-sum test for computing statistical significance. Results: We achieved an average Dice score of 0.90+-0.06 on our internal gradient echo test set, which included 71 semantic segmentation labels. Our model ties with the best model on Amos with a Dice of 0.81+-0.14, while having a larger field of view and a considerably higher number of structures included. Conclusion: Our work presents a publicly available full-torso segmentation model for MRI and CT images that classifies almost all subject voxels to date.
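The Dice score used for evaluation above is twice the overlap between prediction and ground truth divided by the sum of their sizes. A minimal binary-mask implementation:

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient for binary masks: 2|A ∩ B| / (|A| + |B|).

    1.0 means perfect overlap, 0.0 means no overlap; eps guards the
    empty-mask case."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Toy 2x3 masks sharing 2 of their 3 foreground voxels each.
a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(round(float(dice_score(a, b)), 4))
```

In multi-structure settings such as the 71-label evaluation above, this is computed per label and then averaged.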

Updated: 2025-09-08 14:44:54

Categories: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2406.00125v4

Odoo-based Subcontract Inter-site Access Control Mechanism for Construction Projects

In the era of Construction 4.0, the industry is embracing a new paradigm of labor elasticity, driven by smart and flexible outsourcing and subcontracting strategies. The increased reliance on specialized subcontractors enables companies to scale labor dynamically based on project demands. This adaptable workforce model presents challenges in managing hierarchical integration and coordinating inter-site collaboration. Our design introduces a subsystem integrated into the Odoo ERP framework, employing a modular architecture to streamline labor management, task tracking, and approval workflows. The system adopts a three-pronged approach to ensure synchronized data exchange between general contractors and subcontractors, while maintaining both security and operational independence. The system features hybrid access control, third-party integration for cross-domain communication, and a role-based mapping algorithm across sites. The system supports varying degrees of customization through a unified and consolidated attribute mapping center. This center leverages a tree-like index structure and the Lagrange interpolation method to enhance the efficiency of role mapping. Demonstrations highlight practical application in outsourcing, integration, and scalability scenarios, confirming the system's robustness under high user volumes and in offline conditions. Experimental results further show improvements in database performance and workflow adaptability to support a scalable, enterprise-level solution that aligns with the evolving demands of smart construction management.
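The attribute mapping center reportedly uses the Lagrange interpolation method; how it is wired into role mapping is not specified in the abstract, but the interpolation itself is standard and worth making concrete:

```python
def lagrange_interpolate(points, x):
    """Evaluate the Lagrange interpolating polynomial through `points`
    (a list of (x_i, y_i) pairs with distinct x_i) at position x.

    Each basis term y_i * prod_{j != i} (x - x_j) / (x_i - x_j) equals
    y_i at x_i and 0 at every other node."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Three points on y = x^2; the degree-2 interpolant recovers it exactly.
pts = [(0.0, 0.0), (1.0, 1.0), (3.0, 9.0)]
print(lagrange_interpolate(pts, 2.0))  # → 4.0
```

Lagrange interpolation is also a common building block for secret sharing and threshold access schemes, which may be the connection to access control here; that reading is an inference, not something the abstract states.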

Updated: 2025-09-08 14:44:05

Categories: cs.CR

Download: http://arxiv.org/abs/2509.05149v2

Image Encryption Scheme Based on Hyper-Chaotic Map and Self-Adaptive Diffusion

In the digital age, image encryption technology acts as a safeguard, preventing unauthorized access to images. This paper proposes an innovative image encryption scheme that integrates a novel 2D hyper-chaotic map with a newly developed self-adaptive diffusion method. The 2D hyper-chaotic map, namely the 2D-RA map, is designed by hybridizing the Rastrigin and Ackley functions. The chaotic performance of the 2D-RA map is validated through a series of measurements, including the Bifurcation Diagram, Lyapunov Exponent (LE), Initial Value Sensitivity, 0 - 1 Test, Correlation Dimension (CD), and Kolmogorov Entropy (KE). The results demonstrate that the chaotic performance of the 2D-RA map surpasses that of existing advanced chaotic functions. Additionally, the self-adaptive diffusion method is employed to enhance the uniformity of grayscale distribution. The performance of the image encryption scheme is evaluated using a series of indicators. The results show that the proposed image encryption scheme significantly outperforms current state-of-the-art image encryption techniques.
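The abstract does not give the 2D-RA map's equations; for reference, these are the standard Rastrigin and Ackley benchmark functions that it reportedly hybridizes. How the hybridization turns them into an iterated map is not specified here, so only the source functions are shown.

```python
import math

def rastrigin2(x, y):
    """2D Rastrigin function (standard form with A = 10); highly
    multimodal, with global minimum 0 at the origin."""
    return (20 + x**2 - 10 * math.cos(2 * math.pi * x)
               + y**2 - 10 * math.cos(2 * math.pi * y))

def ackley2(x, y):
    """2D Ackley function (standard constants a=20, b=0.2, c=2*pi);
    also has its global minimum of 0 at the origin."""
    term1 = -20 * math.exp(-0.2 * math.sqrt(0.5 * (x**2 + y**2)))
    term2 = -math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
    return term1 + term2 + 20 + math.e

print(round(rastrigin2(0.0, 0.0), 6), round(ackley2(0.0, 0.0), 6))
```

Both functions are strongly oscillatory away from the origin, which is plausibly why they are attractive raw material for constructing a hyper-chaotic map.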

Updated: 2025-09-08 14:42:41

Categories: cs.CR

Download: http://arxiv.org/abs/2509.06754v1
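The abstract names the ingredients of the 2D-RA map but not its equations. As a rough sketch of what "hybridizing the Rastrigin and Ackley functions" into a bounded 2D map could look like, the toy below cross-couples the two functions and folds each coordinate back into [0, 1); the multipliers `a`, `b` and the coupling form are our assumptions, not the authors' map.

```python
import math

# Hypothetical 2D map built from the Rastrigin and Ackley test functions.
# NOTE: the paper's actual 2D-RA equations are not given in the abstract,
# so this coupling is purely illustrative.

def rastrigin(x):
    return 10.0 + x * x - 10.0 * math.cos(2.0 * math.pi * x)

def ackley(x):
    return (-20.0 * math.exp(-0.2 * abs(x))
            - math.exp(math.cos(2.0 * math.pi * x)) + 20.0 + math.e)

def ra_map_step(x, y, a=37.0, b=29.0):
    """One iteration: each coordinate is driven by the other through a
    nonlinear function, then folded back into [0, 1) to keep orbits bounded."""
    x_next = (a * rastrigin(y)) % 1.0
    y_next = (b * ackley(x)) % 1.0
    return x_next, y_next

def orbit(x0, y0, n):
    pts, x, y = [], x0, y0
    for _ in range(n):
        x, y = ra_map_step(x, y)
        pts.append((x, y))
    return pts
```

A real scheme would validate such a map with the diagnostics the abstract lists (bifurcation diagram, Lyapunov exponents, 0-1 test) before using it as a keystream source.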

Multimodal Latent Fusion of ECG Leads for Early Assessment of Pulmonary Hypertension

Recent advancements in early assessment of pulmonary hypertension (PH) primarily focus on applying machine learning methods to centralized diagnostic modalities, such as 12-lead electrocardiogram (12L-ECG). Despite their potential, these approaches fall short in decentralized clinical settings, e.g., point-of-care and general practice, where handheld 6-lead ECG (6L-ECG) can offer an alternative but is limited by the scarcity of labeled data for developing reliable models. To address this, we propose a lead-specific electrocardiogram multimodal variational autoencoder (LS-EMVAE), which incorporates a hierarchical modality expert (HiME) fusion mechanism and a latent representation alignment loss. HiME combines mixture-of-experts and product-of-experts to enable flexible, adaptive latent fusion, while the alignment loss improves coherence among lead-specific and shared representations. To alleviate data scarcity and enhance representation learning, we adopt a transfer learning strategy: the model is first pre-trained on a large unlabeled 12L-ECG dataset and then fine-tuned on smaller task-specific labeled 6L-ECG datasets. We validate LS-EMVAE across two retrospective cohorts in a 6L-ECG setting: 892 subjects from the ASPIRE registry for (1) PH detection and (2) phenotyping pre-/post-capillary PH, and 16,416 subjects from UK Biobank for (3) predicting elevated pulmonary atrial wedge pressure, where it consistently outperforms unimodal and multimodal baseline methods and demonstrates strong generalizability and interpretability. The code is available at https://github.com/Shef-AIRE/LS-EMVAE.

Updated: 2025-09-08 14:41:35

Categories: eess.SP,cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.13470v2
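One ingredient of the HiME mechanism above, product-of-experts fusion, has a closed form for Gaussian experts: the product is again Gaussian, with precision equal to the sum of expert precisions and a precision-weighted mean. A minimal sketch (shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

# Product-of-experts fusion of per-lead Gaussian posteriors.
# Each expert contributes N(mu_i, var_i); the fused posterior is Gaussian with
# precision = sum of precisions and mean = precision-weighted mean.

def poe_gaussian(mus, logvars):
    precisions = np.exp(-np.asarray(logvars))  # 1 / var_i per expert
    mus = np.asarray(mus)
    prec = precisions.sum(axis=0)              # fused precision
    mu = (precisions * mus).sum(axis=0) / prec # precision-weighted mean
    return mu, -np.log(prec)                   # fused mean, fused log-variance
```

With two equally confident experts at means 0 and 2, the fused posterior sits at mean 1 with half the variance, which is why PoE sharpens agreement across leads.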

Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation

While convolutional neural networks (CNNs) have come to match and exceed human performance in many settings, the tasks these models optimize for are largely constrained to the level of individual objects, such as classification and captioning. Humans remain vastly superior to CNNs in visual tasks involving relations, including the ability to identify two objects as 'same' or 'different'. A number of studies have shown that while CNNs can be coaxed into learning the same-different relation in some settings, they tend to generalize poorly to other instances of this relation. In this work we show that the same CNN architectures that fail to generalize the same-different relation with conventional training are able to succeed when trained via meta-learning, which explicitly encourages abstraction and generalization across tasks.

Updated: 2025-09-08 14:39:19

Categories: cs.CV,cs.LG,68T07,I.2.0; I.2.6

Download: http://arxiv.org/abs/2503.23212v3

An All-Atom Generative Model for Designing Protein Complexes

Proteins typically exist in complexes, interacting with other proteins or biomolecules to perform their specific biological roles. Research on single-chain protein modeling has been extensively and deeply explored, with advancements seen in models like the series of ESM and AlphaFold2. Despite these developments, the study and modeling of multi-chain proteins remain largely uncharted, though they are vital for understanding biological functions. Recognizing the importance of these interactions, we introduce APM (All-Atom Protein Generative Model), a model specifically designed for modeling multi-chain proteins. By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins. Moreover, APM demonstrates versatility in downstream applications: it achieves enhanced performance through supervised fine-tuning (SFT) while also supporting zero-shot sampling in certain tasks, achieving state-of-the-art results. We released our code at https://github.com/bytedance/apm.

Updated: 2025-09-08 14:38:27

Categories: cs.LG

Download: http://arxiv.org/abs/2504.13075v3

Driver-Net: Multi-Camera Fusion for Assessing Driver Take-Over Readiness in Automated Vehicles

Ensuring safe transition of control in automated vehicles requires an accurate and timely assessment of driver readiness. This paper introduces Driver-Net, a novel deep learning framework that fuses multi-camera inputs to estimate driver take-over readiness. Unlike conventional vision-based driver monitoring systems that focus on head pose or eye gaze, Driver-Net captures synchronised visual cues from the driver's head, hands, and body posture through a triple-camera setup. The model integrates spatio-temporal data using a dual-path architecture, comprising a Context Block and a Feature Block, followed by a cross-modal fusion strategy to enhance prediction accuracy. Evaluated on a diverse dataset collected from the University of Leeds Driving Simulator, the proposed method achieves an accuracy of up to 95.8% in driver readiness classification. This performance significantly improves upon existing approaches and highlights the importance of multimodal and multi-view fusion. As a real-time, non-intrusive solution, Driver-Net contributes meaningfully to the development of safer and more reliable automated vehicles and aligns with new regulatory mandates and upcoming safety standards.

Updated: 2025-09-08 14:38:25

Categories: cs.CV,cs.AI,cs.ET,cs.LG,cs.RO,I.4.9

Download: http://arxiv.org/abs/2507.04139v2

Long-Range Graph Wavelet Networks

Modeling long-range interactions, the propagation of information across distant parts of a graph, is a central challenge in graph machine learning. Graph wavelets, inspired by multi-resolution signal processing, provide a principled way to capture both local and global structures. However, existing wavelet-based graph neural networks rely on finite-order polynomial approximations, which limit their receptive fields and hinder long-range propagation. We propose Long-Range Graph Wavelet Networks (LR-GWN), which decompose wavelet filters into complementary local and global components. Local aggregation is handled with efficient low-order polynomials, while long-range interactions are captured through a flexible spectral domain parameterization. This hybrid design unifies short- and long-distance information flow within a principled wavelet framework. Experiments show that LR-GWN achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks, while remaining competitive on short-range datasets.

Updated: 2025-09-08 14:35:30

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06743v1
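The local/global decomposition above can be sketched concretely: a K-term polynomial in the graph Laplacian gives a K-hop local filter, while a free-form function of the eigenvalues acts globally in one shot. The spectral parameterization below (an arbitrary gain on the eigenvalues) is a stand-in for LR-GWN's learned filters, not the paper's actual design.

```python
import numpy as np

def normalized_laplacian(A):
    d = A.sum(1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def hybrid_filter(A, x, poly_coeffs, spectral_gain):
    L = normalized_laplacian(A)
    # Local component: sum_k c_k L^k x -- receptive field limited to K hops.
    y_local = np.zeros_like(x)
    Lk_x = x.copy()
    for c in poly_coeffs:
        y_local += c * Lk_x
        Lk_x = L @ Lk_x
    # Global component: act directly on the spectrum, y = U g(Lambda) U^T x.
    lam, U = np.linalg.eigh(L)
    y_global = U @ (spectral_gain(lam) * (U.T @ x))
    return y_local + y_global
```

In practice the eigendecomposition is the expensive part, which is exactly why polynomial approximations dominate and why a hybrid scheme reserves the spectral path for the long-range component only.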

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.

Updated: 2025-09-08 14:34:07

Categories: cs.LG

Download: http://arxiv.org/abs/2406.07831v3

Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

Despite progress in Vision-Language Models (VLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current VLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces VISER (Visual Input Structure for Enhanced Reasoning), a simple yet effective intervention: augmenting visual inputs with low-level spatial structures and pairing this with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks. Specifically, VISER improves GPT-4o visual search accuracy by 25.00%, increases counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and enhances performance on spatial relationship tasks by 9.50% on a 2D synthetic dataset. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER enhances binding only with a single-query inference, underscoring the importance of visual input design over purely linguistically-based approaches. These findings suggest that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning and could serve as a general strategy for enhancing VLM performance on spatially grounded tasks.

Updated: 2025-09-08 14:34:04

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.22146v3
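One simple example of the kind of "low-level spatial structure" the abstract describes is a grid overlay, which gives a VLM fixed visual anchors for serial, cell-by-cell parsing. Whether VISER's augmentation is exactly a grid is an assumption here; the sketch only illustrates the idea of modifying the pixels rather than the prompt.

```python
import numpy as np

def add_grid(image, step=32, value=255):
    """Draw grid lines every `step` pixels on a copy of an H x W x C uint8 image.
    The input image is left untouched."""
    out = image.copy()
    out[::step, :, :] = value   # horizontal lines
    out[:, ::step, :] = value   # vertical lines
    return out
```

The augmented image would then be paired with a textual prompt instructing the model to scan the grid cells in order, per the paper's finding that the visual modification (not the prompt alone) drives the gains.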

SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards

Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-model understanding and generation into a single framework. However, LMMs still struggle to achieve accurate vision-language alignment, prone to generating text responses contradicting the visual input or failing to follow the text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks-either understanding or generation. In this work, based on the observation that understanding and generation are naturally inverse dual tasks, we propose SUDER (Self-improving Unified LMMs with Dual sElf-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs with a self-supervised dual reward mechanism. SUDER leverages the inherent duality between understanding and generation tasks to provide self-supervised optimization signals for each other. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood within the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, especially achieving remarkable improvements in text-to-image tasks.

Updated: 2025-09-08 14:31:08

Categories: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2506.07963v3
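The dual-reward step above can be illustrated with a toy model: outputs sampled in one direction (here, caption to image) are scored by the reverse direction's likelihood of recovering the input. The probability tables are made-up stand-ins for a UMM's two heads, and the best-of-n selection is one simple way to turn the reward into an optimization signal.

```python
import math
import random

GEN = {"a red cube": {"img_red": 0.6, "img_blue": 0.4}}       # p(image | text)
UND = {"img_red": {"a red cube": 0.9, "a blue cube": 0.1},    # p(text | image)
       "img_blue": {"a red cube": 0.2, "a blue cube": 0.8}}

def dual_self_reward(text, image):
    """Self-reward = log-likelihood of reconstructing the prompt from the output."""
    return math.log(UND[image][text])

def best_of_n(text, n=8, seed=0):
    """Sample n candidate images, keep the one the dual direction prefers."""
    rng = random.Random(seed)
    imgs, probs = zip(*GEN[text].items())
    samples = rng.choices(imgs, weights=probs, k=n)
    return max(samples, key=lambda img: dual_self_reward(text, img))
```

In the actual method the dual likelihood would feed a policy-gradient-style update rather than simple reranking, but the supervision signal is the same: no labels or external reward model, only the model's own inverse task.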

Towards Trustworthy Agentic IoEV: AI Agents for Explainable Cyberthreat Mitigation and State Analytics

The Internet of Electric Vehicles (IoEV) envisions a tightly coupled ecosystem of electric vehicles (EVs), charging infrastructure, and grid services, yet it remains vulnerable to cyberattacks, unreliable battery-state predictions, and opaque decision processes that erode trust and performance. To address these challenges, we introduce a novel Agentic Artificial Intelligence (AAI) framework tailored for IoEV, where specialized agents collaborate to deliver autonomous threat mitigation, robust analytics, and interpretable decision support. Specifically, we design an AAI architecture comprising dedicated agents for cyber-threat detection and response at charging stations, real-time State of Charge (SoC) estimation, and State of Health (SoH) anomaly detection, all coordinated through a shared, explainable reasoning layer; develop interpretable threat-mitigation mechanisms that proactively identify and neutralize attacks on both physical charging points and learning components; propose resilient SoC and SoH models that leverage continuous and adversarial-aware learning to produce accurate, uncertainty-aware forecasts with human-readable explanations; and implement a three-agent pipeline, where each agent uses LLM-driven reasoning and dynamic tool invocation to interpret intent, contextualize tasks, and execute formal optimizations for user-centric assistance. Finally, we validate our framework through comprehensive experiments across diverse IoEV scenarios, demonstrating significant improvements in security and prediction accuracy. All datasets, models, and code will be released publicly.

Updated: 2025-09-08 14:28:53

Categories: cs.CR,cs.AI,cs.ET,cs.LG,cs.NI

Download: http://arxiv.org/abs/2509.12233v1

VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction

Intelligent vehicle cockpits present unique challenges for API Agents, requiring coordination across tightly-coupled subsystems that exceed typical task environments' complexity. Traditional Function Calling (FC) approaches operate statelessly, requiring multiple exploratory calls to build environmental awareness before execution, leading to inefficiency and limited error recovery. We introduce VehicleWorld, the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties with fully executable implementations that provide real-time state information during agent execution. This environment enables precise evaluation of vehicle agent behaviors across diverse, challenging scenarios. Through systematic analysis, we discovered that direct state prediction outperforms function calling for environmental control. Building on this insight, we propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions. Experimental results demonstrate that SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. We have made all implementation code publicly available on Github https://github.com/OpenMOSS/VehicleWorld.

Updated: 2025-09-08 14:28:25

Categories: cs.AI,cs.CL,cs.RO

Download: http://arxiv.org/abs/2509.06736v1
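The state-based idea above contrasts with stateless function calling: instead of issuing exploratory API calls, the agent proposes a target state and the environment applies only the transitions needed to reach it. A minimal sketch, with invented vehicle state keys (VehicleWorld's 680 real properties are far richer):

```python
def apply_target_state(current, target):
    """Diff current vs. target state, apply direct transitions in place,
    and return the list of transitions actually performed."""
    transitions = []
    for key, value in target.items():
        if current.get(key) != value:
            current[key] = value
            transitions.append((key, value))
    return transitions

# Example: the agent asks for AC on at 21C; the window key needs no call.
state = {"ac_on": False, "ac_temp_c": 24, "window_front_left": "closed"}
planned = apply_target_state(state, {"ac_on": True, "ac_temp_c": 21})
```

Because unchanged keys generate no transitions, this style avoids the repeated read-then-act round trips that make stateless function calling slow on tightly coupled subsystems.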

Reinforcement Learning Foundations for Deep Research Systems: A Survey

Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.

Updated: 2025-09-08 14:27:23

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.06733v1

Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios

Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora which align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to capture the full linguistic richness of sign languages. To address this limitation, we propose Advanced Use of LLMs for Sign Language Translation (AulSign), a novel method that leverages Large Language Models via dynamic prompting and in-context learning with sample selection and subsequent sign association. Despite their impressive abilities in processing text, LLMs lack intrinsic knowledge of sign languages; therefore, they are unable to natively perform this kind of translation. To overcome this limitation, we associate the signs with compact descriptions in natural language and instruct the model to use them. We evaluate our method on both English and Italian languages using SignBank+, a recognized benchmark in the field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior performance compared to state-of-the-art models in low-data scenario. Our findings demonstrate the effectiveness of AulSign, with the potential to enhance accessibility and inclusivity in communication technologies for underrepresented linguistic communities.

Updated: 2025-09-08 14:25:45

Categories: cs.CL,cs.AI,cs.CY,I.2; I.2.7

Download: http://arxiv.org/abs/2508.18183v2

Learning Load Balancing with GNN in MPTCP-Enabled Heterogeneous Networks

Hybrid light fidelity (LiFi) and wireless fidelity (WiFi) networks are a promising paradigm of heterogeneous network (HetNet), attributed to the complementary physical properties of optical spectra and radio frequency. However, the current development of such HetNets is mostly bottlenecked by the existing transmission control protocol (TCP), which restricts the user equipment (UE) to connecting one access point (AP) at a time. While the ongoing investigation on multipath TCP (MPTCP) can bring significant benefits, it complicates the network topology of HetNets, making the existing load balancing (LB) learning models less effective. Driven by this, we propose a graph neural network (GNN)-based model to tackle the LB problem for MPTCP-enabled HetNets, which results in a partial mesh topology. Such a topology can be modeled as a graph, with the channel state information and data rate requirement embedded as node features, while the LB solutions are deemed as edge labels. Compared to the conventional deep neural network (DNN), the proposed GNN-based model exhibits two key strengths: i) it can better interpret a complex network topology; and ii) it can handle various numbers of APs and UEs with a single trained model. Simulation results show that against the traditional optimisation method, the proposed learning model can achieve near-optimal throughput within a gap of 11.5%, while reducing the inference time by 4 orders of magnitude. In contrast to the DNN model, the new method can improve the network throughput by up to 21.7%, at a similar inference time level.

Updated: 2025-09-08 14:13:08

Categories: cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2410.17118v2

RT-HCP: Dealing with Inference Delays and Sample Efficiency to Learn Directly on Robotic Platforms

Learning a controller directly on the robot requires extreme sample efficiency. Model-based reinforcement learning (RL) methods are the most sample efficient, but they often suffer from a too long inference time to meet the robot control frequency requirements. In this paper, we address the sample efficiency and inference time challenges with two contributions. First, we define a general framework to deal with inference delays where the slow inference robot controller provides a sequence of actions to feed the control-hungry robotic platform without execution gaps. Then, we compare several RL algorithms in the light of this framework and propose RT-HCP, an algorithm that offers an excellent trade-off between performance, sample efficiency and inference time. We validate the superiority of RT-HCP with experiments where we learn a controller directly on a simple but high frequency FURUTA pendulum platform. Code: github.com/elasriz/RTHCP

Updated: 2025-09-08 14:09:33

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06714v1
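The delay-handling framework above can be sketched as an action buffer: the slow policy emits a sequence of actions, and the high-frequency control loop consumes them while the next inference is in flight, so execution never stalls after warm-up. The timings and toy policy below are illustrative, not RT-HCP's actual scheduler.

```python
from collections import deque

def slow_policy(t, horizon):
    """Stand-in planner: returns the next `horizon` actions at once."""
    return [f"a{t + k}" for k in range(horizon)]

def run(steps, horizon=4, inference_delay=2):
    buffer, executed = deque(), []
    pending = None                         # (ready_at_step, actions) in flight
    for t in range(steps):
        if pending and pending[0] <= t:    # inference result arrives
            buffer.extend(pending[1])
            pending = None
        # Kick off a new inference before the buffer can run dry.
        if pending is None and len(buffer) <= inference_delay:
            pending = (t + inference_delay,
                       slow_policy(t + inference_delay, horizon))
        if not buffer:                     # would be an execution gap
            executed.append("GAP")
        else:
            executed.append(buffer.popleft())
    return executed
```

With `horizon` larger than `inference_delay`, only the initial steps (before the first plan lands) show gaps; afterwards the controller is always fed.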

MRI-Based Brain Tumor Detection through an Explainable EfficientNetV2 and MLP-Mixer-Attention Architecture

Brain tumors are serious health problems that require early diagnosis due to their high mortality rates. Diagnosing tumors by examining Magnetic Resonance Imaging (MRI) images is a process that requires expertise and is prone to error. Therefore, the need for automated diagnosis systems is increasing day by day. In this context, a robust and explainable Deep Learning (DL) model for the classification of brain tumors is proposed. In this study, a publicly available Figshare dataset containing 3,064 T1-weighted contrast-enhanced brain MRI images of three tumor types was used. First, the classification performance of nine well-known CNN architectures was evaluated to determine the most effective backbone. Among these, EfficientNetV2 demonstrated the best performance and was selected as the backbone for further development. Subsequently, an attention-based MLP-Mixer architecture was integrated into EfficientNetV2 to enhance its classification capability. The performance of the final model was comprehensively compared with basic CNNs and the methods in the literature. Additionally, Grad-CAM visualization was used to interpret and validate the decision-making process of the proposed model. Evaluated with five-fold cross-validation, the proposed model achieved superior performance with 99.50% accuracy, 99.47% precision, 99.52% recall, and a 99.49% F1 score, outperforming prior studies in the literature. Moreover, Grad-CAM visualizations demonstrate that the model effectively focuses on relevant regions of MRI images, thus improving interpretability and clinical reliability. A robust deep learning model for clinical decision support systems has been obtained by combining EfficientNetV2 and attention-based MLP-Mixer, providing high accuracy and interpretability in brain tumor classification.

Updated: 2025-09-08 14:08:21

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.06713v1
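For readers unfamiliar with the MLP-Mixer component, a plain Mixer block alternates token mixing (across the spatial axis) with channel mixing (per token), each with a residual connection. A minimal numpy sketch follows; the paper's attention-augmented variant on EfficientNetV2 features is richer, and all shapes here are generic.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mixer_block(X, W_tok1, W_tok2, W_ch1, W_ch2):
    """X: (tokens, channels). Residual token-mix then residual channel-mix.
    W_tok1: (H_t, T), W_tok2: (T, H_t); W_ch1: (C, H_c), W_ch2: (H_c, C)."""
    # Token mixing: each channel's values are mixed across all tokens.
    Y = X + (W_tok2 @ gelu(W_tok1 @ X))
    # Channel mixing: each token's channels are mixed independently.
    Z = Y + (gelu(Y @ W_ch1) @ W_ch2)
    return Z
```

With zero weights the residual paths pass the input through unchanged, which is a handy sanity check when wiring such a block onto a CNN backbone's feature map.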

A Framework for Standardizing Similarity Measures in a Rapidly Evolving Field

Similarity measures are fundamental tools for quantifying the alignment between artificial and biological systems. However, the diversity of similarity measures and their varied naming and implementation conventions makes it challenging to compare across studies. To facilitate comparisons and make explicit the implementation choices underlying a given code package, we have created and are continuing to develop a Python repository that benchmarks and standardizes similarity measures. The goal of creating a consistent naming convention that uniquely and efficiently specifies a similarity measure is not trivial as, for example, even commonly used methods like Centered Kernel Alignment (CKA) have at least 12 different variations, and this number will likely continue to grow as the field evolves. For this reason, we do not advocate for a fixed, definitive naming convention. The landscape of similarity measures and best practices will continue to change and so we see our current repository, which incorporates approximately 100 different similarity measures from 14 packages, as providing a useful tool at this snapshot in time. To accommodate the evolution of the field we present a framework for developing, validating, and refining naming conventions with the goal of uniquely and efficiently specifying similarity measures, ultimately making it easier for the community to make comparisons across studies.

Updated: 2025-09-08 14:07:02

Categories: q-bio.NC,cs.LG

Download: http://arxiv.org/abs/2409.18333v2

BriLLM: Brain-inspired Large Language Model

We introduce BriLLM, a brain-inspired large language model that fundamentally redefines the foundations of machine learning through its implementation of Signal Fully-connected flowing (SiFu) learning. This work addresses the critical bottleneck hindering AI's progression toward Artificial General Intelligence (AGI)--the disconnect between language models and "world models"--as well as the fundamental limitations of Transformer-based architectures rooted in the conventional representation learning paradigm. BriLLM incorporates two pivotal neurocognitive principles: (1) static semantic mapping, where tokens are mapped to specialized nodes analogous to cortical areas, and (2) dynamic signal propagation, which simulates electrophysiological information dynamics observed in brain activity. This architecture enables multiple transformative breakthroughs: natural multi-modal compatibility, full model interpretability at the node level, context-length independent scaling, and the first global-scale simulation of brain-like information processing for language tasks. Our initial 1-2B parameter models successfully replicate GPT-1-level generative capabilities while demonstrating stable perplexity reduction. Scalability analyses confirm the feasibility of 100-200B parameter variants capable of processing 40,000-token vocabularies. The paradigm is reinforced by both Occam's Razor--evidenced in the simplicity of direct semantic mapping--and natural evolution--given the brain's empirically validated AGI architecture. BriLLM establishes a novel, biologically grounded framework for AGI advancement that addresses fundamental limitations of current approaches.

Updated: 2025-09-08 14:06:28

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.11299v8

Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women's health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance on semantically hard cases, reaching AUROC 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM's advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
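
The abstain-or-generate decision layer can be sketched independently of the trained EBM: score each query with an energy, pick a threshold from answerable validation queries, and abstain above it. The function names and the synthetic setup below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def abstain_decision(energies, threshold):
    """Answer when a query's energy is low (in-distribution),
    abstain when it is high. Returns a boolean abstain mask."""
    return np.asarray(energies) > threshold

def threshold_at_tpr(answerable_energies, tpr=0.95):
    """Choose the energy threshold that keeps `tpr` of answerable
    queries below it -- the operating point behind metrics such as
    FPR@95."""
    return float(np.quantile(answerable_energies, tpr))
```

With the threshold fixed at 95% true-positive rate, FPR@95 is then simply the fraction of should-abstain queries whose energy still falls below the threshold.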

Updated: 2025-09-08 14:04:34

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.04482v2

ELK: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler Techniques

To meet the increasing demand of deep learning (DL) models, AI chips are employing both off-chip memory (e.g., HBM) and high-bandwidth low-latency interconnect for direct inter-core data exchange. However, it is not easy to explore the efficiency of these inter-core connected AI (ICCA) chips, due to a fundamental tussle among compute (per-core execution), communication (inter-core data exchange), and I/O (off-chip data access). In this paper, we develop Elk, a DL compiler framework to maximize the efficiency of ICCA chips by jointly trading off all the three performance factors discussed above. Elk structures these performance factors into configurable parameters and forms a global trade-off space in the DL compiler. To systematically explore this space and maximize overall efficiency, Elk employs a new inductive operator scheduling policy and a cost-aware on-chip memory allocation algorithm. It generates globally optimized execution plans that best overlap off-chip data loading and on-chip execution. To examine the efficiency of Elk, we build a full-fledged emulator based on a real ICCA chip IPU-POD4, and an ICCA chip simulator for sensitivity analysis with different interconnect network topologies. Elk achieves 94% of the ideal roofline performance of ICCA chips on average, showing the benefits of supporting large DL models on ICCA chips. We also show Elk's capability of enabling architecture design space exploration for new ICCA chip development.

Updated: 2025-09-08 14:03:34

Categories: cs.AR,cs.DC,cs.LG

Download: http://arxiv.org/abs/2507.11506v2

KD$^{2}$M: A unifying framework for feature knowledge distillation

Knowledge Distillation (KD) seeks to transfer the knowledge of a teacher network to a student network. This process is often done by matching the networks' predictions (i.e., their outputs), but recently several works have proposed to match the distributions of the networks' activations (i.e., their features), a process known as \emph{distribution matching}. In this paper, we propose a unifying framework, Knowledge Distillation through Distribution Matching (KD$^{2}$M), which formalizes this strategy. Our contributions are threefold: we i) provide an overview of distribution metrics used in distribution matching, ii) benchmark on computer vision datasets, and iii) derive new theoretical results for KD.
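
As a hedged illustration of the distribution-matching idea, one simple metric among the many such a framework could unify (not the paper's specific loss) is to match the first two moments of teacher and student feature activations:

```python
import numpy as np

def moment_matching_loss(f_student, f_teacher):
    """A simple instance of feature distribution matching: penalise
    the distance between the first two moments of student and
    teacher activations (arrays of shape n_samples x n_features)."""
    mu_s, mu_t = f_student.mean(axis=0), f_teacher.mean(axis=0)
    cov_s = np.cov(f_student, rowvar=False)
    cov_t = np.cov(f_teacher, rowvar=False)
    # Squared Euclidean gap between means plus Frobenius gap
    # between covariances.
    return (np.linalg.norm(mu_s - mu_t) ** 2
            + np.linalg.norm(cov_s - cov_t, "fro") ** 2)
```

Other choices in the same family (e.g. Wasserstein or MMD distances between feature distributions) plug into the same slot, which is exactly the kind of variation the unified framework organizes.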

Updated: 2025-09-08 13:59:51

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2504.01757v3

Identification and Optimal Nonlinear Control of Turbojet Engine Using Koopman Eigenfunction Model

Gas turbine engines are complex and highly nonlinear dynamical systems. Deriving their physics-based models can be challenging because it requires performance characteristics that are not always available, often leading to many simplifying assumptions. This paper discusses the limitations of conventional experimental methods used to derive component-level and locally linear parameter-varying models, and addresses these issues by employing identification techniques based on data collected from standard engine operation under closed-loop control. The rotor dynamics are estimated using the sparse identification of nonlinear dynamics. Subsequently, the autonomous part of the dynamics is mapped into an optimally constructed Koopman eigenfunction space. This process involves eigenvalue optimization using metaheuristic algorithms and temporal projection, followed by gradient-based eigenfunction identification. The resulting Koopman model is validated against an in-house reference component-level model. A globally optimal nonlinear feedback controller and a Kalman estimator are then designed within the eigenfunction space and compared to traditional and gain-scheduled proportional-integral controllers, as well as a proposed internal model control approach. The eigenmode structure enables targeting individual modes during optimization, leading to improved performance tuning. Results demonstrate that the Koopman-based controller surpasses other benchmark controllers in both reference tracking and disturbance rejection under sea-level and varying flight conditions, due to its global nature.

Updated: 2025-09-08 13:56:53

Categories: cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2505.10438v3

When Secure Isn't: Assessing the Security of Machine Learning Model Sharing

The rise of model-sharing through frameworks and dedicated hubs makes Machine Learning significantly more accessible. Despite their benefits, these tools expose users to underexplored security risks, while security awareness remains limited among both practitioners and developers. To enable a more security-conscious culture in Machine Learning model sharing, in this paper we evaluate the security posture of frameworks and hubs, assess whether security-oriented mechanisms offer real protection, and survey how users perceive the security narratives surrounding model sharing. Our evaluation shows that most frameworks and hubs address security risks partially at best, often by shifting responsibility to the user. More concerningly, our analysis of frameworks advertising security-oriented settings and complete model sharing uncovered six 0-day vulnerabilities enabling arbitrary code execution. Through this analysis, we debunk the misconceptions that the model-sharing problem is largely solved and that its security can be guaranteed by the file format used for sharing. As expected, our survey shows that the surrounding security narrative leads users to consider security-oriented settings as trustworthy, despite the weaknesses shown in this work. From this, we derive takeaways and suggestions to strengthen the security of model-sharing ecosystems.

Updated: 2025-09-08 13:55:54

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2509.06703v1

Nested Optimal Transport Distances

Simulating realistic financial time series is essential for stress testing, scenario generation, and decision-making under uncertainty. Despite advances in deep generative models, there is no consensus metric for their evaluation. We focus on generative AI for financial time series in decision-making applications and employ the nested optimal transport distance, a time-causal variant of optimal transport distance, which is robust to tasks such as hedging, optimal stopping, and reinforcement learning. Moreover, we propose a statistically consistent, naturally parallelizable algorithm for its computation, achieving substantial speedups over existing approaches.

Updated: 2025-09-08 13:55:18

Categories: cs.LG,q-fin.CP,91G60, 60G07, 65C60

Download: http://arxiv.org/abs/2509.06702v1

Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi'") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.
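
The weighted logarithmic pooling used to compose agents can be sketched directly. This is a minimal illustration of the pooling operation itself (p(x) proportional to the weighted geometric mean of the member distributions), not of the welfare-improvement result:

```python
import numpy as np

def log_pool(dists, weights):
    """Weighted logarithmic pooling of outcome distributions:
    pooled(x) is proportional to prod_i p_i(x)^w_i, with the
    weights summing to one. `dists` has shape (n_agents, n_outcomes)."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Geometric-mean combination in log space, then renormalise.
    pooled = np.exp(weights @ np.log(dists))
    return pooled / pooled.sum()
```

Note that linear pooling (a weighted arithmetic mean of the distributions) is the alternative the theorem above rules out for strict unanimity; logarithmic pooling behaves differently because it multiplies rather than averages probabilities.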

Updated: 2025-09-08 13:55:01

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06701v1

On Hyperparameters and Backdoor-Resistance in Horizontal Federated Learning

Horizontal Federated Learning (HFL) is particularly vulnerable to backdoor attacks as adversaries can easily manipulate both the training data and processes to execute sophisticated attacks. In this work, we study the impact of training hyperparameters on the effectiveness of backdoor attacks and defenses in HFL. More specifically, we show both analytically and by means of measurements that the choice of hyperparameters by benign clients does not only influence model accuracy but also significantly impacts backdoor attack success. This stands in sharp contrast with the multitude of contributions in the area of HFL security, which often rely on custom ad-hoc hyperparameter choices for benign clients--leading to more pronounced backdoor attack strength and diminished impact of defenses. Our results indicate that properly tuning benign clients' hyperparameters--such as learning rate, batch size, and number of local epochs--can significantly curb the effectiveness of backdoor attacks, regardless of the malicious clients' settings. We support this claim with an extensive robustness evaluation of state-of-the-art attack-defense combinations, showing that carefully chosen hyperparameters yield across-the-board improvements in robustness without sacrificing main task accuracy. For example, we show that the 50%-lifespan of the strong A3FL attack can be reduced by 98.6%--all without using any defense and while incurring only a 2.9 percentage point drop in clean task accuracy.

Updated: 2025-09-08 13:52:25

Categories: cs.CR

Download: http://arxiv.org/abs/2509.05192v2

SAM$^{*}$: Task-Adaptive SAM with Physics-Guided Rewards

Image segmentation is a critical task in microscopy, essential for accurately analyzing and interpreting complex visual data. This task can be performed using custom models trained on domain-specific datasets, transfer learning from pre-trained models, or foundational models that offer broad applicability. However, foundational models often present a considerable number of non-transparent tuning parameters that require extensive manual optimization, limiting their usability for real-time streaming data analysis. Here, we introduce a reward function-based optimization to fine-tune foundational models and illustrate this approach for SAM (Segment Anything Model) framework by Meta. The reward functions can be constructed to represent the physics of the imaged system, including particle size distributions, geometries, and other criteria. By integrating a reward-driven optimization framework, we enhance SAM's adaptability and performance, leading to an optimized variant, SAM$^{*}$, that better aligns with the requirements of diverse segmentation tasks and particularly allows for real-time streaming data segmentation. We demonstrate the effectiveness of this approach in microscopy imaging, where precise segmentation is crucial for analyzing cellular structures, material interfaces, and nanoscale features.
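
A physics-guided reward of the kind described can be sketched as follows. The particle-size prior, the Gaussian-style penalty, and all names are illustrative assumptions for this note, not the paper's exact reward or SAM's API:

```python
import numpy as np

def physics_reward(region_areas, target_mean, target_std):
    """Score a candidate segmentation by how well the statistics of
    its segmented regions match the expected physics of the sample
    (here: a prior on particle size mean and spread).
    Returns a value in (0, 1], higher is better."""
    if len(region_areas) == 0:
        return 0.0
    areas = np.asarray(region_areas, dtype=float)
    # Penalise deviation of observed size statistics from the prior.
    z_mean = (areas.mean() - target_mean) / target_std
    z_std = (areas.std() - target_std) / target_std
    return float(np.exp(-(z_mean ** 2 + z_std ** 2)))
```

A scalar reward of this shape can then drive a black-box search over a foundational model's tuning parameters, keeping the setting whose output masks score highest, which is the role the reward-driven optimization plays in producing the tuned variant.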

Updated: 2025-09-08 13:51:20

Categories: cs.CV,cond-mat.mtrl-sci,cs.LG

Download: http://arxiv.org/abs/2509.07047v1

Neural ARFIMA model for forecasting BRIC exchange rates with long memory under oil shocks and policy uncertainties

Accurate forecasting of exchange rates remains a persistent challenge, particularly for emerging economies such as Brazil, Russia, India, and China (BRIC). These series exhibit long memory, nonlinearity, and non-stationarity properties that conventional time series models struggle to capture. Additionally, there exist several key drivers of exchange rate dynamics, including global economic policy uncertainty, US equity market volatility, US monetary policy uncertainty, oil price growth rates, and country-specific short-term interest rate differentials. These empirical complexities underscore the need for a flexible modeling framework that can jointly accommodate long memory, nonlinearity, and the influence of external drivers. To address these challenges, we propose a Neural AutoRegressive Fractionally Integrated Moving Average (NARFIMA) model that combines the long-memory representation of ARFIMA with the nonlinear learning capacity of neural networks, while flexibly incorporating exogenous causal variables. We establish theoretical properties of the model, including asymptotic stationarity of the NARFIMA process using Markov chains and nonlinear time series techniques. We quantify forecast uncertainty using conformal prediction intervals within the NARFIMA framework. Empirical results across six forecast horizons show that NARFIMA consistently outperforms various state-of-the-art statistical and machine learning models in forecasting BRIC exchange rates. These findings provide new insights for policymakers and market participants navigating volatile financial conditions. The \texttt{narfima} \textbf{R} package provides an implementation of our approach.
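
The long-memory ingredient of ARFIMA is the fractional difference operator (1-B)^d, whose binomial-expansion weights decay slowly (hyperbolically) for fractional d, letting distant past values influence the present. A minimal sketch of the operator (our own illustration, not the narfima R package's implementation):

```python
def frac_diff_weights(d, n):
    """First n binomial-expansion weights of the fractional
    difference operator (1-B)^d: w_0 = 1 and
    w_k = w_{k-1} * (k - 1 - d) / k."""
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (k - 1 - d) / k)
    return w

def frac_diff(series, d):
    """Apply (1-B)^d to a series using all available history
    (expanding-window approximation)."""
    w = frac_diff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]
```

At d = 1 this reduces to ordinary first differencing and at d = 0 to the identity; for 0 < d < 0.5 the series is fractionally differenced but retains long memory, which is the regime ARFIMA exploits.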

Updated: 2025-09-08 13:49:48

Categories: econ.EM,cs.LG,stat.AP,stat.ML

Download: http://arxiv.org/abs/2509.06697v1

Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation

While it is well-established that artificial neural networks are universal approximators for continuous functions on compact domains, many modern approaches rely on deep or overparameterized architectures that incur high computational costs. In this paper, a new type of small shallow neural network, called the Barycentric Neural Network (BNN), is proposed, which leverages a fixed set of base points and their barycentric coordinates to define both its structure and its parameters. We demonstrate that our BNN enables the exact representation of continuous piecewise linear functions (CPLFs), ensuring strict continuity across segments. Since any continuous function over a compact domain can be approximated arbitrarily well by CPLFs, the BNN naturally emerges as a flexible and interpretable tool for function approximation. Beyond the use of this representation, the main contribution of the paper is the introduction of a new variant of persistent entropy, a topological feature that is stable and scale invariant, called the length-weighted persistent entropy (LWPE), which is weighted by the lifetime of topological features. Our framework, which combines the BNN with a loss function based on our LWPE, aims to provide flexible and geometrically interpretable approximations of nonlinear continuous functions in resource-constrained settings, such as those with limited base points for BNN design and few training epochs. Instead of optimizing internal weights, our approach directly optimizes the base points that define the BNN. Experimental results show that our approach achieves superior and faster approximation performance compared to classical loss functions such as MSE, RMSE, MAE, and log-cosh.
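
For one input dimension, the barycentric construction can be sketched directly: between two neighbouring base points, the output is the barycentric combination of their values, which is exactly a continuous piecewise linear function. Illustrative code under that reading of the abstract, not the authors' implementation:

```python
def bnn_eval(base_points, x):
    """Evaluate the continuous piecewise-linear function defined by
    base points [(x_i, y_i)]: within each segment, the output is the
    barycentric combination of the two endpoint values, so continuity
    at the knots holds by construction."""
    pts = sorted(base_points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            lam = (x1 - x) / (x1 - x0)  # barycentric coordinate of x0
            return lam * y0 + (1 - lam) * y1
    raise ValueError("x outside the range of the base points")
```

Optimizing the base points (x_i, y_i) directly, rather than internal weights, is what makes such a representation small and geometrically interpretable.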

Updated: 2025-09-08 13:47:21

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06694v1

BioLite U-Net: Edge-Deployable Semantic Segmentation for In Situ Bioprinting Monitoring

Bioprinting is a rapidly advancing field that offers a transformative approach to fabricating tissue and organ models through the precise deposition of cell-laden bioinks. Ensuring the fidelity and consistency of printed structures in real-time remains a core challenge, particularly under constraints imposed by limited imaging data and resource-constrained embedded hardware. Semantic segmentation of the extrusion process, differentiating between nozzle, extruded bioink, and surrounding background, enables in situ monitoring critical to maintaining print quality and biological viability. In this work, we introduce a lightweight semantic segmentation framework tailored for real-time bioprinting applications. We present a novel, manually annotated dataset comprising 787 RGB images captured during the bioprinting process, labeled across three classes: nozzle, bioink, and background. To achieve fast and efficient inference suitable for integration with bioprinting systems, we propose a BioLite U-Net architecture that leverages depthwise separable convolutions to drastically reduce computational load without compromising accuracy. Our model is benchmarked against MobileNetV2 and MobileNetV3-based segmentation baselines using mean Intersection over Union (mIoU), Dice score, and pixel accuracy. All models were evaluated on a Raspberry Pi 4B to assess real-world feasibility. The proposed BioLite U-Net achieves an mIoU of 92.85% and a Dice score of 96.17%, while being over 1300x smaller than MobileNetV2-DeepLabV3+. On-device inference takes 335 ms per frame, demonstrating near real-time capability. Compared to MobileNet baselines, BioLite U-Net offers a superior tradeoff between segmentation accuracy, efficiency, and deployability, making it highly suitable for intelligent, closed-loop bioprinting systems.
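
The parameter savings behind the depthwise separable design are easy to verify with a back-of-the-envelope count (generic formulae, not the paper's exact layer configuration):

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (no bias):
    every output channel mixes all input channels."""
    return c_in * c_out * k * k

def dws_conv_params(c_in, c_out, k):
    """Depthwise separable convolution: one k x k filter per input
    channel (depthwise), followed by a 1x1 pointwise convolution
    that mixes channels."""
    return c_in * k * k + c_in * c_out
```

For a 3x3 layer with 64 input and 128 output channels, the separable version uses roughly 8x fewer weights; savings of this kind compound across the network, which is how architectures like BioLite U-Net stay small enough for embedded inference.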

Updated: 2025-09-08 13:44:55

Categories: cs.CV,cs.AI,cs.AR,I.2.9; I.2.10; I.4.6

Download: http://arxiv.org/abs/2509.06690v1

TrajAware: Graph Cross-Attention and Trajectory-Aware for Generalisable VANETs under Partial Observations

Vehicular ad hoc networks (VANETs) are a crucial component of intelligent transportation systems; however, routing remains challenging due to dynamic topologies, incomplete observations, and the limited resources of edge devices. Existing reinforcement learning (RL) approaches often assume fixed graph structures and require retraining when network conditions change, making them unsuitable for deployment on constrained hardware. We present TrajAware, an RL-based framework designed for edge AI deployment in VANETs. TrajAware integrates three components: (i) action space pruning, which reduces redundant neighbour options while preserving two-hop reachability, alleviating the curse of dimensionality; (ii) graph cross-attention, which maps pruned neighbours to the global graph context, producing features that generalise across diverse network sizes; and (iii) trajectory-aware prediction, which uses historical routes and junction information to estimate real-time positions under partial observations. We evaluate TrajAware in the open-source SUMO simulator using real-world city maps with a leave-one-city-out setup. Results show that TrajAware achieves near-shortest paths and high delivery ratios while maintaining efficiency suitable for constrained edge devices, outperforming state-of-the-art baselines in both full and partial observation scenarios.

Updated: 2025-09-08 13:24:21

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06665v1

Group Effect Enhanced Generative Adversarial Imitation Learning for Individual Travel Behavior Modeling under Incentives

Understanding and modeling individual travel behavior responses is crucial for urban mobility regulation and policy evaluation. The Markov decision process (MDP) provides a structured framework for dynamic travel behavior modeling at the individual level. However, solving an MDP in this context is highly data-intensive and faces challenges of data quantity, spatial-temporal coverage, and situational diversity. To address these, we propose a group-effect-enhanced generative adversarial imitation learning (gcGAIL) model that improves the individual behavior modeling efficiency by leveraging shared behavioral patterns among passenger groups. We validate the gcGAIL model using a public transport fare-discount case study and compare against state-of-the-art benchmarks, including adversarial inverse reinforcement learning (AIRL), baseline GAIL, and conditional GAIL. Experimental results demonstrate that gcGAIL outperforms these methods in learning individual travel behavior responses to incentives over time in terms of accuracy, generalization, and pattern demonstration efficiency. Notably, gcGAIL is robust to spatial variation, data sparsity, and behavioral diversity, maintaining strong performance even with partial expert demonstrations and underrepresented passenger groups. The gcGAIL model predicts the individual behavior response at any time, providing the basis for personalized incentives to induce sustainable behavior changes (better timing of incentive injections).

Updated: 2025-09-08 13:14:28

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06656v1

ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding

With respect to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO faces failure due to insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves LLM's code reasoning ability by combining an improved GRPO algorithm with a meticulously designed test time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling, thus improving the effectiveness and efficiency of training. After the basic reasoning ability of LLM policy has been improved, we further propose a test time decoding optimization method called VM-MCTS. Through Monte-Carlo Tree Search (MCTS), we collect accurate value targets with no annotation required, on which VM training is based. When decoding, the VM is deployed by an adapted MCTS algorithm to provide precise process signals as well as verification scores, assisting the LLM policy to achieve high reasoning accuracy. We conduct extensive experiments on coding problems to verify the validity of the proposed RL paradigm. Upon comparison, our approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO), as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS) on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. Codes for our project can be found at https://github.com/THUDM/ReST-RL.

Updated: 2025-09-08 13:12:19

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.19576v2

AnalysisGNN: Unified Music Analysis with Graph Neural Networks

Recent years have seen a boom in computational approaches to music analysis, yet each one is typically tailored to a specific analytical domain. In this work, we introduce AnalysisGNN, a novel graph neural network framework that leverages a data-shuffling strategy with a custom weighted multi-task loss and logit fusion between task-specific classifiers to integrate heterogeneously annotated symbolic datasets for comprehensive score analysis. We further integrate a Non-Chord-Tone prediction module, which identifies and excludes passing and non-functional notes from all tasks, thereby improving the consistency of label signals. Experimental evaluations demonstrate that AnalysisGNN achieves performance comparable to traditional static-dataset approaches, while showing increased resilience to domain shifts and annotation inconsistencies across multiple heterogeneous corpora.
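The two training ingredients named above (a weighted multi-task loss and logit fusion between task-specific classifiers) can be sketched as follows; the mixing rule and weights are assumptions for illustration, not AnalysisGNN's code:

```python
def weighted_multitask_loss(task_losses, task_weights):
    """Weighted multi-task objective sketch: each analysis task contributes
    its loss scaled by a weight, so heterogeneously annotated corpora can
    lack some tasks entirely (weight 0) without breaking the shared
    objective."""
    return sum(w * l for l, w in zip(task_losses, task_weights))

def fuse_logits(logits_a, logits_b, alpha=0.5):
    """Logit fusion between two task-specific classifiers: a convex
    combination taken before the softmax (the concrete mixing rule is our
    assumption; the abstract only states that logits are fused)."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(logits_a, logits_b)]

print(weighted_multitask_loss([1.0, 2.0, 0.5], [1.0, 0.5, 0.0]))  # 2.0
print(fuse_logits([2.0, 0.0], [0.0, 1.0]))                        # [1.0, 0.5]
```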

Updated: 2025-09-08 13:11:54

Categories: cs.SD,cs.AI

Download: http://arxiv.org/abs/2509.06654v1

Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy

Detecting whether an LLM hallucinates is an important research challenge. One promising approach is to estimate the semantic entropy (Farquhar et al., 2024) of the distribution of generated sequences. We propose a new algorithm for doing so, with two main advantages. First, because we take a Bayesian approach, we achieve much higher-quality semantic entropy estimates for a given budget of samples from the LLM. Second, we can tune the number of samples adaptively so that 'harder' contexts receive more samples. We demonstrate empirically that our approach systematically beats the baselines, requiring only 53% of the samples used by Farquhar et al. (2024) to achieve the same quality of hallucination detection as measured by AUROC. Moreover, quite counterintuitively, our estimator is useful even with just one sample from the LLM.
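The quantity being estimated can be sketched with a plain Monte-Carlo estimate and a toy equivalence predicate (the paper's contribution is a Bayesian, adaptive-budget version of this; in practice `same_meaning` is an NLI model, not a string check):

```python
import math

def semantic_entropy(samples, same_meaning):
    """Monte-Carlo estimate of semantic entropy: greedily cluster sampled
    answers into meaning-equivalence classes, then take the entropy of the
    empirical class distribution."""
    clusters = []
    for s in samples:
        for c in clusters:
            if same_meaning(s, c[0]):
                c.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Toy equivalence: case-insensitive match stands in for an NLI judge.
answers = ["Paris", "paris", "Lyon", "PARIS"]
h = semantic_entropy(answers, lambda a, b: a.lower() == b.lower())
print(round(h, 3))  # 0.562: two meaning clusters with probs 3/4 and 1/4
```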

Updated: 2025-09-08 12:59:58

Categories: cs.LG

Download: http://arxiv.org/abs/2504.03579v2

Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations

Balance control is important for human and bipedal robotic systems. While dynamic balance during locomotion has received considerable attention, quantitative understanding of static balance and falling remains limited. This work presents a hierarchical control pipeline for simulating human balance via a comprehensive whole-body musculoskeletal system. We identified spatiotemporal dynamics of balancing during stable standing, revealed the impact of muscle injury on balancing behavior, and generated fall contact patterns that aligned with clinical data. Furthermore, our simulated hip exoskeleton assistance demonstrated improvement in balance maintenance and reduced muscle effort under perturbation. This work offers unique muscle-level insights into human balance dynamics that are challenging to capture experimentally. It could provide a foundation for developing targeted interventions for individuals with balance impairments and support the advancement of humanoid robotic systems.

Updated: 2025-09-08 12:59:00

Categories: cs.RO,cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2506.09383v2

CogGuide: Human-Like Guidance for Zero-Shot Omni-Modal Reasoning

Targeting the issues of "shortcuts" and insufficient contextual understanding in the complex cross-modal reasoning of multimodal large models, this paper proposes a zero-shot multimodal reasoning component guided by human-like cognitive strategies centered on an "intent sketch". The component comprises a plug-and-play three-module pipeline (Intent Perceiver, Strategy Generator, and Strategy Selector) that explicitly constructs an "understand-plan-select" cognitive process. By generating and filtering "intent sketch" strategies to guide the final reasoning, it requires no parameter fine-tuning and achieves cross-model transfer solely through in-context engineering. Information-theoretic analysis shows that this process reduces conditional entropy and improves information-utilization efficiency, thereby suppressing unintended shortcut reasoning. Experiments on IntentBench, WorldSense, and Daily-Omni validate the method's generality and robust gains; compared with their respective baselines, the complete three-module scheme yields consistent improvements across different reasoning engines and pipeline combinations, with gains of up to approximately 9.51 percentage points, demonstrating the practical value and portability of the "intent sketch" reasoning component in zero-shot scenarios.

Updated: 2025-09-08 12:57:02

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.06641v1

Knowledge-Guided Machine Learning for Stabilizing Near-Shortest Path Routing

We propose a simple algorithm that needs only a few data samples from a single graph for learning local routing policies that generalize across a rich class of geometric random graphs in Euclidean metric spaces. We thus solve the all-pairs near-shortest path problem by training deep neural networks (DNNs) that let each graph node efficiently and scalably route (i.e., forward) packets by considering only the node's state and the state of the neighboring nodes. Our algorithm design exploits network domain knowledge in the selection of input features and design of the policy function for learning an approximately optimal policy. Domain knowledge also provides theoretical assurance that the choice of a "seed graph" and its node data sampling suffices for generalizable learning. Remarkably, one of these DNNs we train -- using distance-to-destination as the only input feature -- learns a policy that exactly matches the well-known Greedy Forwarding policy, which forwards packets to the neighbor with the shortest distance to the destination. We also learn a new policy, which we call GreedyTensile routing -- using both distance-to-destination and node stretch as the input features -- that almost always outperforms greedy forwarding. We demonstrate the explainability and ultra-low latency run-time operation of GreedyTensile routing by symbolically interpreting its DNN in low-complexity terms of two linear actions.
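The Greedy Forwarding policy that the distance-only DNN was found to reproduce can be stated in a few lines (a sketch; the function name and the local-minimum check are our own framing, not the paper's code):

```python
import math

def greedy_forward(pos, neighbors, dest):
    """Greedy Forwarding: hand the packet to the neighbor geometrically
    closest to the destination. Returns None at a local minimum, i.e.,
    when no neighbor is closer to the destination than the current node."""
    d = lambda p: math.dist(p, dest)
    best = min(neighbors, key=d, default=None)
    if best is None or d(best) >= d(pos):
        return None
    return best

node = (0.0, 0.0)
nbrs = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
dest = (3.0, 0.0)
print(greedy_forward(node, nbrs, dest))  # (1.0, 0.0): closest to dest
```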

Updated: 2025-09-08 12:56:42

Categories: cs.LG,cs.NI

Download: http://arxiv.org/abs/2509.06640v1

ILeSiA: Interactive Learning of Robot Situational Awareness from Camera Input

Learning from demonstration is a promising approach for teaching robots new skills. However, a central challenge in the execution of acquired skills is the ability to recognize faults and prevent failures. This is essential because demonstrations typically cover only a limited set of scenarios and often only the successful ones. During task execution, unforeseen situations may arise, such as changes in the robot's environment or interaction with human operators. To recognize such situations, this paper focuses on teaching the robot situational awareness by using a camera input and labeling frames as safe or risky. We train a Gaussian Process (GP) regression model fed by a low-dimensional latent space representation of the input images. The model outputs a continuous risk score ranging from zero to one, quantifying the degree of risk at each timestep. This allows for pausing task execution in unsafe situations and directly adding new training data, labeled by the human user. Our experiments on a robotic manipulator show that the proposed method can reliably detect both known and novel faults using only a single example for each new fault. In contrast, a standard multi-layer perceptron (MLP) performs well only on faults it has encountered during training. Our method enables the next generation of cobots to be rapidly deployed with easy-to-set-up, vision-based risk assessment, proactively safeguarding humans and detecting misaligned parts or missing objects before failures occur. We provide all the code and data required to reproduce our experiments at imitrob.ciirc.cvut.cz/publications/ilesia.
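The risk-scoring idea can be sketched with a kernel smoother standing in for the paper's GP regression (same shape of solution: a continuous 0-to-1 risk computed from operator-labeled latent points, updatable by simply appending a new labeled frame; all names and the bandwidth are illustrative):

```python
import math

def risk_score(z, examples, bandwidth=1.0):
    """Kernel-smoothed risk estimate over latent embeddings: a
    Nadaraya-Watson stand-in for GP regression. `examples` holds
    (latent_vector, label) pairs with label 0.0 = safe, 1.0 = risky."""
    num = den = 0.0
    for zi, yi in examples:
        sq = sum((a - b) ** 2 for a, b in zip(z, zi))
        w = math.exp(-sq / (2 * bandwidth ** 2))
        num += w * yi
        den += w
    return num / den if den else 0.5  # no data: maximally uncertain

# Latent points labeled safe (0.0) or risky (1.0) by the operator.
labeled = [((0.0, 0.0), 0.0), ((3.0, 3.0), 1.0)]
print(round(risk_score((0.1, 0.1), labeled), 2))  # near 0: safe region
print(round(risk_score((2.9, 2.9), labeled), 2))  # near 1: risky region
```

Adding a newly labeled frame is just appending to `labeled`, which mirrors how the paper's pipeline grows its training data from user feedback during paused execution.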

Updated: 2025-09-08 12:56:14

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2409.20173v3

The First Voice Timbre Attribute Detection Challenge

The first voice timbre attribute detection challenge is featured in a special session at NCMMSC 2025. It focuses on the explainability of voice timbre and compares the intensity of two speech utterances in a specified timbre descriptor dimension. The evaluation was conducted on the VCTK-RVA dataset. Participants developed their systems and submitted their outputs to the organizer, who evaluated the performance and sent feedback to them. Six teams submitted their outputs, with five providing descriptions of their methodologies.

Updated: 2025-09-08 12:54:28

Categories: cs.SD,cs.AI

Download: http://arxiv.org/abs/2509.06635v1

Sequential Controlled Langevin Diffusions

An effective approach for sampling from unnormalized densities is based on the idea of gradually transporting samples from an easy prior to the complicated target distribution. Two popular methods are (1) Sequential Monte Carlo (SMC), where the transport is performed through successive annealed densities via prescribed Markov chains and resampling steps, and (2) recently developed diffusion-based sampling methods, where a learned dynamical transport is used. Despite the common goal, both approaches have different, often complementary, advantages and drawbacks. The resampling steps in SMC allow focusing on promising regions of the space, often leading to robust performance. While the algorithm enjoys asymptotic guarantees, the lack of flexible, learnable transitions can lead to slow convergence. On the other hand, diffusion-based samplers are learned and can potentially better adapt themselves to the target at hand, yet often suffer from training instabilities. In this work, we present a principled framework for combining SMC with diffusion-based samplers by viewing both methods in continuous time and considering measures on path space. This culminates in the new Sequential Controlled Langevin Diffusion (SCLD) sampling method, which is able to utilize the benefits of both methods and reaches improved performance on multiple benchmark problems, in many cases using only 10% of the training budget of previous diffusion-based samplers.

Updated: 2025-09-08 12:53:38

Categories: stat.ML,cs.AI,cs.LG

Download: http://arxiv.org/abs/2412.07081v2

Nested Graph Pseudo-Label Refinement for Noisy Label Domain Adaptation Learning

Graph Domain Adaptation (GDA) facilitates knowledge transfer from labeled source graphs to unlabeled target graphs by learning domain-invariant representations, which is essential in applications such as molecular property prediction and social network analysis. However, most existing GDA methods rely on the assumption of clean source labels, which rarely holds in real-world scenarios where annotation noise is pervasive. This label noise severely impairs feature alignment and degrades adaptation performance under domain shifts. To address this challenge, we propose Nested Graph Pseudo-Label Refinement (NeGPR), a novel framework tailored for graph-level domain adaptation with noisy labels. NeGPR first pretrains dual branches, i.e., semantic and topology branches, by enforcing neighborhood consistency in the feature space, thereby reducing the influence of noisy supervision. To bridge domain gaps, NeGPR employs a nested refinement mechanism in which one branch selects high-confidence target samples to guide the adaptation of the other, enabling progressive cross-domain learning. Furthermore, since pseudo-labels may still contain noise and the pre-trained branches are already overfitted to the noisy labels in the source domain, NeGPR incorporates a noise-aware regularization strategy. This regularization is theoretically proven to mitigate the adverse effects of pseudo-label noise, even under the presence of source overfitting, thus enhancing the robustness of the adaptation process. Extensive experiments on benchmark datasets demonstrate that NeGPR consistently outperforms state-of-the-art methods under severe label noise, achieving gains of up to 12.7% in accuracy.
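The nested refinement step can be sketched as a cross-branch exchange of high-confidence pseudo-labels (illustrative only; the threshold, tuple format, and symmetric exchange are assumptions, not NeGPR's code):

```python
def select_confident(probs_a, probs_b, threshold=0.9):
    """Nested pseudo-label exchange sketch: branch A proposes pseudo-labels
    to guide branch B only on target samples where A's predicted class
    probability clears a confidence threshold, and vice versa.
    Each result is a list of (sample_index, pseudo_label) pairs."""
    for_b = [(i, p.index(max(p))) for i, p in enumerate(probs_a) if max(p) >= threshold]
    for_a = [(i, p.index(max(p))) for i, p in enumerate(probs_b) if max(p) >= threshold]
    return for_b, for_a

probs_a = [[0.95, 0.05], [0.55, 0.45]]  # semantic branch predictions
probs_b = [[0.10, 0.90], [0.92, 0.08]]  # topology branch predictions
for_b, for_a = select_confident(probs_a, probs_b)
print(for_b, for_a)  # [(0, 0)] [(0, 1), (1, 0)]
```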

Updated: 2025-09-08 12:47:16

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.00716v3

Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA

Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either the retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static path extraction, or dynamic path-generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects the top-$k$ relevant relations at each step to reduce the search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
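The two search ingredients named above, UCT-style node selection and top-$k$ relation pruning, can be sketched as follows (illustrative: scores stand in for planner log-probs, and the node format is our own, not DAMR's):

```python
import math

def ucb_select(children, c=1.4):
    """UCT selection over a node's children: exploitation (mean value)
    plus an exploration bonus; unvisited children are tried first."""
    total = sum(ch["visits"] for ch in children) or 1
    def ucb(ch):
        if ch["visits"] == 0:
            return float("inf")
        return ch["value"] / ch["visits"] + c * math.sqrt(math.log(total) / ch["visits"])
    return max(children, key=ucb)

def top_k_relations(scored_relations, k=3):
    """Planner-pruning sketch: keep only the k highest-scored candidate
    relations at each step to shrink the branching factor.
    (Scores here are placeholders for LLM-planner scores.)"""
    return sorted(scored_relations, key=lambda r: r[1], reverse=True)[:k]

rels = [("born_in", 0.7), ("spouse", 0.1), ("citizen_of", 0.5),
        ("award", 0.2), ("educated_at", 0.4)]
print(top_k_relations(rels))  # born_in, citizen_of, educated_at

children = [{"value": 3.0, "visits": 5}, {"value": 1.0, "visits": 1}]
print(ucb_select(children))   # the rarely visited child wins the bonus
```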

Updated: 2025-09-08 12:44:14

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.00719v3

Network-level Censorship Attacks in the InterPlanetary File System

The InterPlanetary File System (IPFS) has been successfully established as the de facto standard for decentralized data storage in the emerging Web3. Despite its decentralized nature, IPFS nodes, as well as IPFS content providers, have converged to centralization in large public clouds. Centralization introduces BGP routing-based attacks, such as passive interception and BGP hijacking, as potential threats. Although this attack vector has been investigated for many other Web3 protocols, such as Bitcoin and Ethereum, to the best of our knowledge, it has not been analyzed for the IPFS network. In our work, we bridge this gap and demonstrate that BGP routing attacks can be effectively leveraged to censor content in IPFS. For the analysis, we collected 3,000 content blocks called CIDs and conducted a simulation of BGP hijacking and passive interception against them. We find that a single malicious AS can censor 75% of the IPFS content for more than 57% of all requester nodes. Furthermore, we show that even with a small set of only 62 hijacked prefixes, 70% of the full attack effectiveness can already be reached. We further propose and validate countermeasures based on global collaborative content replication among all nodes in the IPFS network, together with additional robust backup content provider nodes that are well-hardened against BGP hijacking. We hope this work raises awareness about the threat BGP routing-based attacks pose to IPFS and triggers further efforts to harden the live IPFS network against them.

Updated: 2025-09-08 12:42:47

Categories: cs.CR,cs.NI

Download: http://arxiv.org/abs/2509.06626v1

Improved Classification of Nitrogen Stress Severity in Plants Under Combined Stress Conditions Using Spatio-Temporal Deep Learning Framework

Plants in their natural habitats endure an array of interacting stresses, both biotic and abiotic, that rarely occur in isolation. Nutrient stress, particularly nitrogen deficiency, becomes even more critical when compounded with drought and weed competition, making it increasingly difficult to distinguish and address its effects. Early detection of nitrogen stress is therefore crucial for protecting plant health and implementing effective management strategies. This study proposes a novel deep learning framework to accurately classify nitrogen stress severity in a combined stress environment. Our model uses a unique blend of four imaging modalities (RGB, multispectral, and two infrared wavelengths) to capture a wide range of physiological plant responses from canopy images. These images, provided as time-series data, document plant health across three levels of nitrogen availability (low, medium, and high) under varying water stress and weed pressures. The core of our approach is a spatio-temporal deep learning pipeline that merges a Convolutional Neural Network (CNN) for extracting spatial features from images with a Long Short-Term Memory (LSTM) network to capture temporal dependencies. We also devised and evaluated a spatial-only CNN pipeline for comparison. Our CNN-LSTM pipeline achieved an accuracy of 98%, surpassing the spatial-only model's 80.45% and a previously reported machine learning method's 76%. These results offer actionable insights, demonstrating that our CNN-LSTM approach effectively captures the subtle and complex interactions between nitrogen deficiency, water stress, and weed pressure. This robust platform offers a promising tool for the timely and proactive identification of nitrogen stress severity, enabling better crop management and improved plant health.

Updated: 2025-09-08 12:41:45

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.06625v1

BEAM: Brainwave Empathy Assessment Model for Early Childhood

Empathy in young children is crucial for their social and emotional development, yet predicting it remains challenging. Traditional methods often only rely on self-reports or observer-based labeling, which are susceptible to bias and fail to objectively capture the process of empathy formation. EEG offers an objective alternative; however, current approaches primarily extract static patterns, neglecting temporal dynamics. To overcome these limitations, we propose a novel deep learning framework, the Brainwave Empathy Assessment Model (BEAM), to predict empathy levels in children aged 4-6 years. BEAM leverages multi-view EEG signals to capture both cognitive and emotional dimensions of empathy. The framework comprises three key components: 1) a LaBraM-based encoder for effective spatio-temporal feature extraction, 2) a feature fusion module to integrate complementary information from multi-view signals, and 3) a contrastive learning module to enhance class separation. Validated on the CBCP dataset, BEAM outperforms state-of-the-art methods across multiple metrics, demonstrating its potential for objective empathy assessment and providing a preliminary insight into early interventions in children's prosocial development.

Updated: 2025-09-08 12:39:09

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06620v1

A Secure Sequencer and Data Availability Committee for Rollups (Extended Version)

Blockchains face a scalability limitation, partly due to the throughput limitations of consensus protocols, especially when aiming for a high degree of decentralization. Layer 2 Rollups (L2s) are a faster alternative to conventional blockchains. L2s perform most computations offchain, using the underlying blockchain (L1) minimally to guarantee correctness. A sequencer is a service that receives offchain L2 transaction requests, batches these transactions, and commits compressed or hashed batches to L1. Hashing needs less L1 space, which is beneficial for gas cost, but requires a data availability committee (DAC) service to translate hashes into their corresponding batches of transaction requests. The behavior of sequencers and DACs influences the evolution of the L2 blockchain, presenting a potential security threat and delaying L2 adoption. In this paper we propose fraud-proof mechanisms, arbitrated by L1 contracts, to detect and generate evidence of dishonest behavior by the sequencer and DAC. We study how these fraud-proofs limit the power of adversaries that control different numbers of sequencer and DAC members, and provide incentives for their honest behavior. We designed these fraud-proof mechanisms as two-player games. Unlike the generic fraud-proofs in current L2s (designed to guarantee the correct execution of transactions), our fraud-proofs arbitrate over predetermined algorithms that verify the properties that determine the correctness of the DAC. Arbitrating over concrete algorithms makes our fraud-proofs more efficient, easier to understand, and simpler to prove correct. We provide as an artifact a mechanization in LEAN4 of our fraud-proof games, including (1) the verified strategies that honest players should play to win all games and (2) mechanisms to detect dishonest claims.

Updated: 2025-09-08 12:32:41

Categories: cs.CR

Download: http://arxiv.org/abs/2509.06614v1

Can AI be Auditable?

Auditability is defined as the capacity of AI systems to be independently assessed for compliance with ethical, legal, and technical standards throughout their lifecycle. The chapter explores how auditability is being formalized through emerging regulatory frameworks, such as the EU AI Act, which mandate documentation, risk assessments, and governance structures. It analyzes the diverse challenges facing AI auditability, including technical opacity, inconsistent documentation practices, lack of standardized audit tools and metrics, and conflicting principles within existing responsible AI frameworks. The discussion highlights the need for clear guidelines, harmonized international regulations, and robust socio-technical methodologies to operationalize auditability at scale. The chapter concludes by emphasizing the importance of multi-stakeholder collaboration and auditor empowerment in building an effective AI audit ecosystem. It argues that auditability must be embedded in AI development practices and governance infrastructures to ensure that AI systems are not only functional but also ethically and legally aligned.

Updated: 2025-09-08 12:26:44

Categories: cs.CY,cs.AI,cs.HC

Download: http://arxiv.org/abs/2509.00575v2

A Survey of Generalization of Graph Anomaly Detection: From Transfer Learning to Foundation Models

Graph anomaly detection (GAD) has attracted increasing attention in recent years for identifying malicious samples in a wide range of graph-based applications, such as social media and e-commerce. However, most GAD methods assume identical training and testing distributions and are tailored to specific tasks, resulting in limited adaptability to real-world scenarios such as shifting data distributions and scarce training samples in new applications. To address the limitations, recent work has focused on improving the generalization capability of GAD models through transfer learning that leverages knowledge from related domains to enhance detection performance, or developing "one-for-all" GAD foundation models that generalize across multiple applications. Since a systematic understanding of generalization in GAD is still lacking, in this paper, we provide a comprehensive review of generalization in GAD. We first trace the evolution of generalization in GAD and formalize the problem settings, which further leads to our systematic taxonomy. Rooted in this fine-grained taxonomy, an up-to-date and comprehensive review is conducted for the existing generalized GAD methods. Finally, we identify current open challenges and suggest future directions to inspire future research in this emerging field.

Updated: 2025-09-08 12:26:32


Fields: cs.LG

Download: http://arxiv.org/abs/2509.06609v1

Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

The mechanisms by which reasoning training reshapes language-model computations remain poorly understood. We study lightweight steering vectors inserted into the base model's residual stream and trained with a reinforcement-learning objective, which can match full fine-tuning performance while retaining the interpretability of small, additive interventions. Using logit-lens readouts, path patching, and circuit analyses, we analyze two models and find: (i) the last-layer steering vector behaves like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as "To" and "Step"; and (ii) the penultimate-layer steering vector leaves attention patterns largely unchanged and instead acts through the MLP and unembedding, preferentially up-weighting process words and structure symbols. These results establish a principled framework for interpreting the behavioral changes induced by reasoning training.
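The last-layer mechanism described above — an additive steering vector in the residual stream that acts like a token-substitution bias, visible through a logit lens — can be mimicked with toy numpy stand-ins. The matrices, scale, and alignment below are illustrative assumptions, not the paper's trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 100

W_U = rng.normal(size=(d_model, vocab))   # frozen unembedding of a toy "base model"
h = rng.normal(size=d_model)              # residual-stream state before the unembedding

# A real steering vector is trained with RL; here we mimic its reported effect
# (a token-substitution bias) by aligning it with one unembedding column.
target_token = 7
col = W_U[:, target_token]
v = col / np.linalg.norm(col)

def logit_lens(state):
    """Read a residual state out through the unembedding (softmax over logits)."""
    logits = state @ W_U
    e = np.exp(logits - logits.max())
    return e / e.sum()

p_base = logit_lens(h)
p_steered = logit_lens(h + 8.0 * v)  # additive intervention; base weights stay frozen
```

The intervention is purely additive, so interpretability reduces to asking which token directions the learned vector amplifies.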

Updated: 2025-09-08 12:26:31


Fields: cs.LG

Download: http://arxiv.org/abs/2509.06608v1

Demo: Healthcare Agent Orchestrator (HAO) for Patient Summarization in Molecular Tumor Boards

Molecular Tumor Boards (MTBs) are multidisciplinary forums where oncology specialists collaboratively assess complex patient cases to determine optimal treatment strategies. A central element of this process is the patient summary, typically compiled by a medical oncologist, radiation oncologist, or surgeon, or their trained medical assistant, who distills heterogeneous medical records into a concise narrative to facilitate discussion. This manual approach is often labor-intensive, subjective, and prone to omissions of critical information. To address these limitations, we introduce the Healthcare Agent Orchestrator (HAO), a Large Language Model (LLM)-driven AI agent that coordinates a multi-agent clinical workflow to generate accurate and comprehensive patient summaries for MTBs. Evaluating predicted patient summaries against ground truth presents additional challenges due to stylistic variation, ordering, synonym usage, and phrasing differences, which complicate the measurement of both succinctness and completeness. To overcome these evaluation hurdles, we propose TBFact, a ``model-as-a-judge'' framework designed to assess the comprehensiveness and succinctness of generated summaries. Using a benchmark dataset derived from de-identified tumor board discussions, we applied TBFact to evaluate our Patient History agent. Results show that the agent captured 94% of high-importance information (including partial entailments) and achieved a TBFact recall of 0.84 under strict entailment criteria. We further demonstrate that TBFact enables a data-free evaluation framework that institutions can deploy locally without sharing sensitive clinical data. Together, HAO and TBFact establish a robust foundation for delivering reliable and scalable support to MTBs.
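The fact-level recall TBFact reports can be illustrated with a stub in place of the LLM judge; the facts and labels below are invented for illustration, while the real framework elicits entailment judgments from a model:

```python
# Hypothetical per-fact judgments from a "model-as-a-judge": each ground-truth
# fact from the tumor-board summary is marked entailed, partial, or missing.
judgments = {
    "EGFR L858R mutation detected": "entailed",
    "Stage IV adenocarcinoma": "entailed",
    "Prior osimertinib therapy": "partial",
    "ECOG performance status 1": "missing",
}

def tbfact_recall(judged, strict=True):
    """Recall over ground-truth facts; strict counts only full entailments."""
    hit = {"entailed"} if strict else {"entailed", "partial"}
    return sum(label in hit for label in judged.values()) / len(judged)

strict_recall = tbfact_recall(judgments, strict=True)    # full entailments only
lenient_recall = tbfact_recall(judgments, strict=False)  # partial entailments count
```

The gap between the two numbers mirrors the paper's distinction between its 94% capture rate (including partial entailments) and the stricter 0.84 recall.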

Updated: 2025-09-08 12:15:53


Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06602v1

InterFeat: A Pipeline for Finding Interesting Scientific Features

Finding interesting phenomena is the core of scientific discovery, but it is a manual, ill-defined concept. We present an integrative pipeline for automating the discovery of interesting simple hypotheses (feature-target relations with effect direction and a potential underlying mechanism) in structured biomedical data. The pipeline combines machine learning, knowledge graphs, literature search and Large Language Models. We formalize "interestingness" as a combination of novelty, utility and plausibility. On 8 major diseases from the UK Biobank, our pipeline consistently recovers risk factors years before their appearance in the literature. 40--53% of our top candidates were validated as interesting, compared to 0--7% for a SHAP-based baseline. Overall, 28% of 109 candidates were interesting to medical experts. The pipeline addresses the challenge of operationalizing "interestingness" scalably and for any target. We release data and code: https://github.com/LinialLab/InterFeat
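A minimal sketch of scoring "interestingness" as a combination of the three named criteria; the equal weighting and the [0, 1] scales are assumptions for illustration, and the paper's operationalization (knowledge graphs, literature search, LLMs) is far richer:

```python
def interestingness(novelty, utility, plausibility, weights=(1/3, 1/3, 1/3)):
    """Combine the three components, each assumed to be normalized to [0, 1]."""
    parts = (novelty, utility, plausibility)
    assert all(0.0 <= p <= 1.0 for p in parts), "components must be normalized"
    return sum(w * p for w, p in zip(weights, parts))

# A candidate feature-target relation scored by the three criteria.
score = interestingness(novelty=0.9, utility=0.7, plausibility=0.6)
```

Candidates would then be ranked by this score before being shown to domain experts.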

Updated: 2025-09-08 12:11:52


Fields: q-bio.QM,cs.AI,cs.CL,cs.IR,68T05, 68T50, 92C50,I.2.6; I.2.7; H.2.8; J.3

Download: http://arxiv.org/abs/2505.13534v2

PAC-Bayesian Generalization Bounds for Graph Convolutional Networks on Inductive Node Classification

Graph neural networks (GNNs) have achieved remarkable success in processing graph-structured data across various applications. A critical aspect of real-world graphs is their dynamic nature, where new nodes are continually added and existing connections may change over time. Previous theoretical studies, largely based on the transductive learning framework, fail to adequately model such temporal evolution and structural dynamics. In this paper, we present a PAC-Bayesian theoretical analysis of graph convolutional networks (GCNs) for inductive node classification, treating nodes as dependent and non-identically distributed data points. We derive novel generalization bounds for one-layer GCNs that explicitly incorporate the effects of data dependency and non-stationarity, and establish sufficient conditions under which the generalization gap converges to zero as the number of nodes increases. Furthermore, we extend our analysis to two-layer GCNs, and reveal that guaranteeing convergence requires stronger assumptions on graph topology. This work establishes a theoretical foundation for understanding and improving GNN generalization in dynamic graph environments.
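For orientation, a classical McAllester-style PAC-Bayes bound for an i.i.d. sample — the template such analyses extend to dependent, non-identically distributed nodes — reads:

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size m,
% simultaneously for all posteriors Q over hypotheses (P is the fixed prior):
\mathbb{E}_{h \sim Q}\bigl[R(h)\bigr]
  \;\le\;
\mathbb{E}_{h \sim Q}\bigl[\widehat{R}_S(h)\bigr]
  \;+\;
\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) \;+\; \ln\frac{2\sqrt{m}}{\delta}}{2m}}
```

The paper's contribution is, in effect, to control the analogue of the $1/\sqrt{m}$ term when the $m$ samples are graph nodes whose dependencies are induced by the topology.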

Updated: 2025-09-08 12:10:54


Fields: cs.LG

Download: http://arxiv.org/abs/2509.06600v1

An Architecture Built for Federated Learning: Addressing Data Heterogeneity through Adaptive Normalization-Free Feature Recalibration

Federated learning is a decentralized collaborative training paradigm preserving stakeholders' data ownership while improving performance and generalization. However, statistical heterogeneity among client datasets degrades system performance. To address this issue, we propose Adaptive Normalization-free Feature Recalibration (ANFR), a model architecture-level approach that combines weight standardization and channel attention to combat heterogeneous data in FL. ANFR leverages weight standardization to avoid mismatched client statistics and inconsistent averaging, ensuring robustness under heterogeneity, and channel attention to produce learnable scaling factors for feature maps, suppressing inconsistencies across clients due to heterogeneity. We demonstrate that combining these techniques boosts model performance beyond their individual contributions, by improving class selectivity and channel attention weight distribution. ANFR works with any aggregation method, supports both global and personalized FL, and adds minimal overhead. Furthermore, when training with differential privacy, ANFR achieves an appealing balance between privacy and utility, enabling strong privacy guarantees without sacrificing performance. By integrating weight standardization and channel attention in the backbone model, ANFR offers a novel and versatile approach to the challenge of statistical heterogeneity. Extensive experiments show ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions. Code is provided at https://github.com/siomvas/ANFR.
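The two ingredients ANFR combines can be sketched in numpy; the tensor shapes and the SE-style gate below are illustrative stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize_weights(W, eps=1e-5):
    """Weight standardization: zero mean, unit variance per output channel,
    removing the dependence on (mismatched) client batch statistics."""
    mu = W.mean(axis=(1, 2, 3), keepdims=True)
    var = W.var(axis=(1, 2, 3), keepdims=True)
    return (W - mu) / np.sqrt(var + eps)

def channel_attention(feat, reduction=2):
    """SE-style gate: squeeze spatially, excite channels.
    The two projections stand in for learned weights (random here)."""
    c = feat.shape[0]
    squeezed = feat.mean(axis=(1, 2))                 # (C,) global average pool
    W1 = rng.normal(size=(c // reduction, c)) * 0.1   # hypothetical learned weights
    W2 = rng.normal(size=(c, c // reduction)) * 0.1
    gate = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ squeezed, 0.0))))  # sigmoid
    return feat * gate[:, None, None]                 # learnable per-channel scaling

W = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 3, 3))  # conv kernels (out, in, kh, kw)
Ws = standardize_weights(W)
x = rng.normal(size=(8, 5, 5))                         # one (C, H, W) feature map
y = channel_attention(x)
```

Because the standardized kernels carry no client-specific first or second moments, averaging them across heterogeneous clients is better behaved, while the channel gate suppresses feature-map inconsistencies.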

Updated: 2025-09-08 12:09:55


Fields: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2410.02006v2

DEXOP: A Device for Robotic Transfer of Dexterous Human Manipulation

We introduce perioperation, a paradigm for robotic data collection that sensorizes and records human manipulation while maximizing the transferability of the data to real robots. We implement this paradigm in DEXOP, a passive hand exoskeleton designed to maximize human ability to collect rich sensory (vision + tactile) data for diverse dexterous manipulation tasks in natural environments. DEXOP mechanically connects human fingers to robot fingers, providing users with direct contact feedback (via proprioception) and mirrors the human hand pose to the passive robot hand to maximize the transfer of demonstrated skills to the robot. The force feedback and pose mirroring make task demonstrations more natural for humans compared to teleoperation, increasing both speed and accuracy. We evaluate DEXOP across a range of dexterous, contact-rich tasks, demonstrating its ability to collect high-quality demonstration data at scale. Policies learned with DEXOP data significantly improve task performance per unit time of data collection compared to teleoperation, making DEXOP a powerful tool for advancing robot dexterity. Our project page is at https://dex-op.github.io.

Updated: 2025-09-08 12:08:04


Fields: cs.RO,cs.AI,cs.CV,cs.HC

Download: http://arxiv.org/abs/2509.04441v2

Information-Theoretic Bounds and Task-Centric Learning Complexity for Real-World Dynamic Nonlinear Systems

Dynamic nonlinear systems exhibit distortions arising from coupled static and dynamic effects. Their intertwined nature poses major challenges for data-driven modeling. This paper presents a theoretical framework grounded in structured decomposition, variance analysis, and task-centric complexity bounds. The framework employs a directional lower bound on interactions between measurable system components, extending orthogonality in inner product spaces to structurally asymmetric settings. This bound supports variance inequalities for decomposed systems. Key behavioral indicators are introduced along with a memory finiteness index. A rigorous power-based condition establishes a measurable link between finite memory in realizable systems and the First Law of Thermodynamics. This offers a more foundational perspective than classical bounds based on the Second Law. Building on this foundation, we formulate a `Behavioral Uncertainty Principle,' demonstrating that static and dynamic distortions cannot be minimized simultaneously. We identify that real-world systems seem to resist complete deterministic decomposition due to entangled static and dynamic effects. We also present two general-purpose theorems linking function variance to mean-squared Lipschitz continuity and learning complexity. This yields a model-agnostic, task-aware complexity metric, showing that lower-variance components are inherently easier to learn. These insights explain the empirical benefits of structured residual learning, including improved generalization, reduced parameter count, and lower training cost, as previously observed in power amplifier linearization experiments. The framework is broadly applicable and offers a scalable, theoretically grounded approach to modeling complex dynamic nonlinear systems.

Updated: 2025-09-08 12:08:02


Fields: cs.LG,cs.CC,cs.SY,eess.SP,eess.SY,math.ST,stat.TH

Download: http://arxiv.org/abs/2509.06599v1

Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos

In this study, we address the multimodal task of stereo sound event localization and detection with source distance estimation (3D SELD) in regular video content. 3D SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD approaches typically rely on multichannel input, limiting their capacity to benefit from large-scale pre-training due to data constraints. To overcome this, we enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. We perform an ablation study on the development set of the DCASE2025 Task3 Stereo SELD Dataset to assess the individual contributions of the language-aligned models and benchmark against the DCASE Task 3 baseline systems. Additionally, we detail the curation process of large synthetic audio and audio-visual datasets used for model pre-training. These datasets were further expanded through left-right channel swapping augmentation. Our approach, combining extensive pre-training, model ensembling, and visual post-processing, achieved second rank in the DCASE 2025 Challenge Task 3 (Track B), underscoring the effectiveness of our method. Future work will explore the modality-specific contributions and architectural refinements.

Updated: 2025-09-08 12:07:32


Fields: eess.AS,cs.AI,cs.LG,eess.IV,eess.SP

Download: http://arxiv.org/abs/2509.06598v1

ModalSurv: A Multimodal Deep Survival Framework for Prostate and Bladder Cancer

Accurate prediction of time-to-event outcomes is a central challenge in oncology, with significant implications for treatment planning and patient management. In this work, we present ModaliSurv, a multimodal deep survival model utilising DeepHit with a projection layer and inter-modality cross-attention, which integrates heterogeneous patient data, including clinical, MRI, RNA-seq and whole-slide pathology features. The model is designed to capture complementary prognostic signals across modalities and estimate individualised time-to-biochemical recurrence in prostate cancer and time-to-cancer recurrence in bladder cancer. Our approach was evaluated in the context of the CHIMERA Grand Challenge, across two of the three provided tasks. For Task 1 (prostate cancer bio-chemical recurrence prediction), the proposed framework achieved a concordance index (C-index) of 0.843 on 5-folds cross-validation and 0.818 on CHIMERA development set, demonstrating robust discriminatory ability. For Task 3 (bladder cancer recurrence prediction), the model obtained a C-index of 0.662 on 5-folds cross-validation and 0.457 on development set, highlighting its adaptability and potential for clinical translation. These results suggest that leveraging multimodal integration with deep survival learning provides a promising pathway toward personalised risk stratification in prostate and bladder cancer. Beyond the challenge setting, our framework is broadly applicable to survival prediction tasks involving heterogeneous biomedical data.
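Both tasks above are scored with the concordance index; a minimal Harrell's C-index implementation (ignoring tied event times, with toy data) looks like:

```python
import numpy as np

def concordance_index(times, events, risk):
    """Harrell's C-index: fraction of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when the earlier time is an observed event
    (events[i] == 1); higher predicted risk should accompany the shorter
    survival time. Ties in predicted risk count as half-concordant.
    """
    n, num, den = len(times), 0.0, 0
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:  # i failed first, observed
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

times = np.array([2.0, 5.0, 7.0, 9.0])   # follow-up times
events = np.array([1, 1, 0, 1])          # 1 = event observed, 0 = censored
risk = np.array([0.9, 0.6, 0.4, 0.1])    # predictions perfectly anti-ordered with time
c = concordance_index(times, events, risk)
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ranking, which puts the reported 0.843 and 0.662 cross-validation scores in context.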

Updated: 2025-09-08 12:06:11


Fields: cs.LG

Download: http://arxiv.org/abs/2509.05037v2

HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models

Large Language Models (LLMs) often produce hallucinations in retrieval-augmented or long-context generation, even when relevant evidence is present. This stems from two issues: head importance is treated as input-agnostic, and raw attention weights poorly reflect each token's true contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a parameter-free decoding framework that directly addresses both challenges. HAVE introduces head-adaptive gating, which performs instance-level soft reweighing of attention heads, and value calibration, which augments attention with the magnitude of value vectors to approximate write-back contribution. Together, these modules construct token-level evidence aligned with model updates and fuse it with the LM distribution through a lightweight uncertainty-scaled policy. HAVE requires no finetuning and operates in a single forward pass, making it efficient and broadly applicable. Experiments across multiple QA benchmarks and LLM families demonstrate that HAVE consistently reduces hallucinations and outperforms strong baselines, including DAGCD, with modest overhead. The framework is transparent, reproducible, and readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.
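The value-calibration idea — reweighting raw attention by the magnitude of each value vector to better approximate a token's actual write-back contribution — can be sketched as follows (toy shapes; head-adaptive gating would additionally reweight across heads per instance):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_head = 6, 16

attn = rng.random(seq)
attn /= attn.sum()                    # raw attention row for the current query
V = rng.normal(size=(seq, d_head))    # value vectors for the source tokens

# Value calibration: a token with near-zero value norm contributes little to the
# residual stream no matter how much raw attention it receives, so scale each
# weight by ||v_j|| and renormalize.
calibrated = attn * np.linalg.norm(V, axis=1)
calibrated /= calibrated.sum()
```

Token-level evidence built from `calibrated` rather than `attn` is what HAVE fuses with the LM distribution at decode time.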

Updated: 2025-09-08 12:06:09


Fields: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.06596v1

LLMs in Cybersecurity: Friend or Foe in the Human Decision Loop?

Large Language Models (LLMs) are transforming human decision-making by acting as cognitive collaborators. Yet, this promise comes with a paradox: while LLMs can improve accuracy, they may also erode independent reasoning, promote over-reliance and homogenize decisions. In this paper, we investigate how LLMs shape human judgment in security-critical contexts. Through two exploratory focus groups (unaided and LLM-supported), we assess decision accuracy, behavioral resilience and reliance dynamics. Our findings reveal that while LLMs enhance accuracy and consistency in routine decisions, they can inadvertently reduce cognitive diversity and amplify automation bias, especially among users with lower resilience. In contrast, high-resilience individuals leverage LLMs more effectively, suggesting that cognitive traits mediate AI benefit.

Updated: 2025-09-08 12:06:06


Fields: cs.CR

Download: http://arxiv.org/abs/2509.06595v1

Steering LLM Reasoning Through Bias-Only Adaptation

We show that training a single $d$-dimensional steering vector per layer with reinforcement learning, while freezing all base weights, matches the accuracy of fully RL-tuned reasoning models on mathematical-reasoning tasks. On an 8 billion-parameter model this adds only $\approx 0.0016\%$ additional parameters and reproduces performance across a range of base models and mathematical-reasoning benchmarks. These results tighten the upper bound on the parameter budget required for high-level chain-of-thought reasoning, indicating that millions of adapter weights are unnecessary. The minimal trainable footprint reduces optimizer memory and inter-GPU communication, lowering the overall cost of fine-tuning. Moreover, a logit-lens analysis shows that the learned vectors amplify coherent token directions, providing clearer insight into the model's internal computations.
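The quoted ≈0.0016% figure is easy to reproduce for typical 8B-scale dimensions; the layer count and hidden size below are common values assumed for illustration, not taken from the paper:

```python
# One d-dimensional steering vector per layer, all base weights frozen.
n_layers, d_model = 32, 4096      # typical Llama-3-8B-scale dimensions (assumed)
base_params = 8e9                 # nominal parameter count of the base model

steering_params = n_layers * d_model            # trainable numbers added
fraction = steering_params / base_params * 100  # as a percentage of the base model
```

About 131k trainable numbers versus the millions of weights in a typical LoRA adapter is the "tightened upper bound" the abstract refers to.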

Updated: 2025-09-08 11:57:29


Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.18706v2

Detection of trade in products derived from threatened species using machine learning and a smartphone

Unsustainable trade in wildlife is a major threat to biodiversity and is now increasingly prevalent in digital marketplaces and social media. With the sheer volume of digital content, the need for automated methods to detect wildlife trade listings is growing. These methods are especially needed for the automatic identification of wildlife products, such as ivory. We developed machine learning-based object recognition models that can identify wildlife products within images and highlight them. The data consists of images of elephant, pangolin, and tiger products that were identified as being sold illegally or that were confiscated by authorities. Specifically, the wildlife products included elephant ivory and skins, pangolin scales, and claws (raw and crafted), and tiger skins and bones. We investigated various combinations of training strategies and two loss functions to identify the best model to use in the automatic detection of these wildlife products. Models were trained for each species while also developing a single model to identify products from all three species. The best model showed an overall accuracy of 84.2% with accuracies of 71.1%, 90.2% and 93.5% in detecting products derived from elephants, pangolins, and tigers, respectively. We further demonstrate that the machine learning model can be made easily available to stakeholders, such as government authorities and law enforcement agencies, by developing a smartphone-based application that had an overall accuracy of 91.3%. The application can be used in real time to click images and help identify potentially prohibited products of target species. Thus, the proposed method is not only applicable for monitoring trade on the web but can also be used e.g. in physical markets for monitoring wildlife trade.

Updated: 2025-09-08 11:56:26


Fields: cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.06585v1

Emergent Social Dynamics of LLM Agents in the El Farol Bar Problem

We investigate the emergent social dynamics of Large Language Model (LLM) agents in a spatially extended El Farol Bar problem, observing how they autonomously navigate this classic social dilemma. As a result, the LLM agents generated a spontaneous motivation to go to the bar and changed their decision making by becoming a collective. We also observed that the LLM agents did not solve the problem completely, but rather behaved more like humans. These findings reveal a complex interplay between external incentives (prompt-specified constraints such as the 60% threshold) and internal incentives (culturally-encoded social preferences derived from pre-training), demonstrating that LLM agents naturally balance formal game-theoretic rationality with social motivations that characterize human behavior. These findings suggest that a new model of group decision making, which could not be handled in the previous game-theoretic problem setting, can be realized by LLM agents.
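For readers unfamiliar with the dilemma, a bare-bones El Farol simulation with the 60% threshold is sketched below; noisy attendance estimates stand in for the heterogeneous strategies of the original problem, whereas the paper's LLM agents replace this fixed rule with free-form reasoning:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, threshold, rounds = 100, 0.6, 50

# Each agent estimates this week's crowding from last week's attendance and
# goes only if it expects the bar to be below 60% full. Because everyone
# reasons from the same public signal, decisions oscillate rather than settle.
attendance = [int(rng.integers(0, n_agents))]
for _ in range(rounds):
    estimates = attendance[-1] / n_agents + rng.normal(0.0, 0.15, n_agents)
    going = int((estimates < threshold).sum())
    attendance.append(going)

mean_rate = float(np.mean(attendance[1:])) / n_agents
```

The persistent over- and undershooting produced by such homogeneous rules is exactly the backdrop against which the LLM agents' collective, human-like behavior is interesting.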

Updated: 2025-09-08 11:56:01


Fields: cs.MA,cs.AI,cs.CY

Download: http://arxiv.org/abs/2509.04537v2

AI for Scientific Discovery is a Social Problem

Artificial intelligence promises to accelerate scientific discovery, yet its benefits remain unevenly distributed. While technical obstacles such as scarce data, fragmented standards, and unequal access to computation are significant, we argue that the primary barriers are social and institutional. Narratives that defer progress to speculative "AI scientists," the undervaluing of data and infrastructure contributions, misaligned incentives, and gaps between domain experts and machine learning researchers all constrain impact. We highlight four interconnected challenges: community dysfunction, research priorities misaligned with upstream needs, data fragmentation, and infrastructure inequities. We argue that their roots lie in cultural and organizational practices. Addressing them requires not only technical innovation but also intentional community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure. We call for reframing AI for science as a collective social project, where sustainable collaboration and equitable participation are treated as prerequisites for technical progress.

Updated: 2025-09-08 11:49:52


Fields: cs.LG,cs.CY

Download: http://arxiv.org/abs/2509.06580v1

Approximating Condorcet Ordering for Vector-valued Mathematical Morphology

Mathematical morphology provides a nonlinear framework for image and spatial data processing and analysis. Although there have been many successful applications of mathematical morphology to vector-valued images, such as color and hyperspectral images, there is still no consensus on the most suitable vector ordering for constructing morphological operators. This paper addresses this issue by examining a reduced ordering approximating the Condorcet ranking derived from a set of vector orderings. Inspired by voting problems, the Condorcet ordering ranks elements from most to least voted, with voters representing different orderings. In this paper, we develop a machine learning approach that learns a reduced ordering that approximates the Condorcet ordering. Preliminary computational experiments confirm the effectiveness of learning the reduced mapping to define vector-valued morphological operators for color images.
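One Condorcet-consistent way to aggregate a set of vector orderings is Copeland's pairwise-majority score — used here as an illustrative stand-in for the target ranking that the paper's learned reduced ordering approximates:

```python
from itertools import combinations

# Three hypothetical "voter" orderings over four color vectors, each a ranked
# list of element indices (best first) — e.g. lexicographic RGB, luminance-based,
# and marginal orderings.
orderings = [
    [0, 1, 2, 3],
    [1, 0, 3, 2],
    [0, 1, 3, 2],
]

def condorcet_ranking(orderings, n):
    """Rank n elements by pairwise-majority wins (Copeland score)."""
    wins = [0] * n
    for a, b in combinations(range(n), 2):
        votes_a = sum(o.index(a) < o.index(b) for o in orderings)
        if votes_a * 2 > len(orderings):     # majority prefers a over b
            wins[a] += 1
        elif votes_a * 2 < len(orderings):   # majority prefers b over a
            wins[b] += 1
    return sorted(range(n), key=lambda i: -wins[i])

ranking = condorcet_ranking(orderings, 4)
```

Running the exact pairwise tally is expensive for large images, which motivates learning a cheap reduced mapping that approximates this aggregate ordering.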

Updated: 2025-09-08 11:47:11


Fields: cs.CV,cs.LG,cs.NE

Download: http://arxiv.org/abs/2509.06577v1

Automated Hierarchical Graph Construction for Multi-source Electronic Health Records

Electronic Health Records (EHRs), comprising diverse clinical data such as diagnoses, medications, and laboratory results, hold great promise for translational research. EHR-derived data have advanced disease prevention, improved clinical trial recruitment, and generated real-world evidence. Synthesizing EHRs across institutions enables large-scale, generalizable studies that capture rare diseases and population diversity, but remains hindered by the heterogeneity of medical codes, institution-specific terminologies, and the absence of standardized data structures. These barriers limit the interpretability, comparability, and scalability of EHR-based analyses, underscoring the need for robust methods to harmonize and extract meaningful insights from distributed, heterogeneous data. To address this, we propose MASH (Multi-source Automated Structured Hierarchy), a fully automated framework that aligns medical codes across institutions using neural optimal transport and constructs hierarchical graphs with learned hyperbolic embeddings. During training, MASH integrates information from pre-trained language models, co-occurrence patterns, textual descriptions, and supervised labels to capture semantic and hierarchical relationships among medical concepts more effectively. Applied to real-world EHR data, including diagnosis, medication, and laboratory codes, MASH produces interpretable hierarchical graphs that facilitate the navigation and understanding of heterogeneous clinical data. Notably, it generates the first automated hierarchies for unstructured local laboratory codes, establishing foundational references for downstream applications.

Updated: 2025-09-08 11:45:59

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2509.06576v1

Preacher: Paper-to-Video Agentic System

The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a topdown approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/Gen-Verse/Paper2Video

Updated: 2025-09-08 11:42:40

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.09632v6

Robust and Adaptive Spectral Method for Representation Multi-Task Learning with Contamination

Representation-based multi-task learning (MTL) improves efficiency by learning a shared structure across tasks, but its practical application is often hindered by contamination, outliers, or adversarial tasks. Most existing methods and theories assume a clean or near-clean setting, failing when contamination is significant. This paper tackles representation MTL with an unknown and potentially large contamination proportion, while also allowing for heterogeneity among inlier tasks. We introduce a Robust and Adaptive Spectral method (RAS) that can distill the shared inlier representation effectively and efficiently, while requiring no prior knowledge of the contamination level or the true representation dimension. Theoretically, we provide non-asymptotic error bounds for both the learned representation and the per-task parameters. These bounds adapt to inlier task similarity and outlier structure, and guarantee that RAS performs at least as well as single-task learning, thus preventing negative transfer. We also extend our framework to transfer learning with corresponding theoretical guarantees for the target task. Extensive experiments confirm our theory, showcasing the robustness and adaptivity of RAS, and its superior performance in regimes with up to 80\% task contamination.
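A minimal sketch of the spectral idea: stack per-task least-squares estimates and take the top singular directions as the shared representation. RAS itself adds robustness to contaminated tasks and adaptivity to the unknown rank, which this clean-data toy omits; all dimensions and noise levels below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, T, n = 20, 2, 10, 200          # ambient dim, shared rank, tasks, samples/task

B, _ = np.linalg.qr(rng.normal(size=(d, r)))   # ground-truth shared subspace
thetas = [B @ rng.normal(size=r) for _ in range(T)]

# Per-task least-squares estimates from noisy linear-regression data.
est = []
for theta in thetas:
    X = rng.normal(size=(n, d))
    y = X @ theta + 0.1 * rng.normal(size=n)
    est.append(np.linalg.lstsq(X, y, rcond=None)[0])

# Spectral step: top-r left singular vectors of the stacked estimates.
U, s, _ = np.linalg.svd(np.column_stack(est), full_matrices=False)
B_hat = U[:, :r]

# Subspace error via the projection distance (0 = perfect recovery).
err = np.linalg.norm(B_hat @ B_hat.T - B @ B.T, 2)
print(round(err, 3))
```

Under contamination, the outlier tasks would corrupt the stacked matrix, which is exactly the failure mode RAS's robust variant is designed to withstand.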

Updated: 2025-09-08 11:41:30

Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2509.06575v1

Topological Regularization for Force Prediction in Active Particle Suspension with EGNN and Persistent Homology

Capturing the dynamics of active particles, i.e., small self-propelled agents that both deform and are deformed by the fluid in which they move, is a formidable problem because it requires coupling fine-scale hydrodynamics with large-scale collective effects. We therefore present a multi-scale framework that combines three learning-driven tools within a single pipeline. First, high-resolution Lattice Boltzmann snapshots of fluid velocity and particle stresses in a periodic box serve as input to the learning pipeline. Second, the morphology, positions, and orientations of the particles are used to predict pairwise interaction forces between them with an E(2)-equivariant graph neural network that respects planar symmetries by construction. Then, a physics-informed neural network refines these local estimates by combining them with stress data, using Fourier feature mappings and residual blocks, and is additionally regularized with a topological term (derived from persistent homology) that penalizes unrealistically tangled or spurious connections. In concert, these stages deliver a holistic, data-driven prediction of the full force network that emphasizes the physical underpinnings together with the emergent multi-scale structure typical of active matter.

Updated: 2025-09-08 11:39:42

Categories: cond-mat.soft,cs.LG

Download: http://arxiv.org/abs/2509.06574v1

Mind Your Server: A Systematic Study of Parasitic Toolchain Attacks on the MCP Ecosystem

Large language models (LLMs) are increasingly integrated with external systems through the Model Context Protocol (MCP), which standardizes tool invocation and has rapidly become a backbone for LLM-powered applications. While this paradigm enhances functionality, it also introduces a fundamental security shift: LLMs transition from passive information processors to autonomous orchestrators of task-oriented toolchains, expanding the attack surface, elevating adversarial goals from manipulating single outputs to hijacking entire execution flows. In this paper, we reveal a new class of attacks, Parasitic Toolchain Attacks, instantiated as MCP Unintended Privacy Disclosure (MCP-UPD). These attacks require no direct victim interaction; instead, adversaries embed malicious instructions into external data sources that LLMs access during legitimate tasks. The malicious logic infiltrates the toolchain and unfolds in three phases: Parasitic Ingestion, Privacy Collection, and Privacy Disclosure, culminating in stealthy exfiltration of private data. Our root cause analysis reveals that MCP lacks both context-tool isolation and least-privilege enforcement, enabling adversarial instructions to propagate unchecked into sensitive tool invocations. To assess the severity, we design MCP-SEC and conduct the first large-scale security census of the MCP ecosystem, analyzing 12,230 tools across 1,360 servers. Our findings show that the MCP ecosystem is rife with exploitable gadgets and diverse attack methods, underscoring systemic risks in MCP platforms and the urgent need for defense mechanisms in LLM-integrated environments.

Updated: 2025-09-08 11:35:32

Categories: cs.CR

Download: http://arxiv.org/abs/2509.06572v1

A Simple Data Exfiltration Game

Data exfiltration is a growing problem for businesses, which face costs related to the loss of confidential data as well as potential extortion. This work presents a simple game-theoretic model of network data exfiltration. In the model, the attacker chooses the exfiltration route and speed, and the defender selects monitoring thresholds to detect unusual activity. The attacker is rewarded for exfiltrating data, and the defender tries to minimize the costs of data loss and of responding to alerts.
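The model's structure can be sketched numerically. The sigmoid detection curve and all constants below are illustrative assumptions, not the paper's specification; the sketch only shows the qualitative trade-off between exfiltration speed and detection risk.

```python
import numpy as np

# Illustrative model: the attacker picks an exfiltration rate, the defender
# picks an alert threshold; detection probability rises as the rate exceeds
# the threshold (the sigmoid shape and slope k are assumptions).
def detect_prob(rate, threshold, k=5.0):
    return 1.0 / (1.0 + np.exp(-k * (rate - threshold)))

def attacker_payoff(rate, threshold):
    # Expected data exfiltrated while remaining undetected.
    return rate * (1.0 - detect_prob(rate, threshold))

rates = np.linspace(0.0, 1.0, 101)
thresholds = np.linspace(0.0, 1.0, 101)

# Attacker best response for each defender threshold: a strict defender
# (low threshold) forces "low and slow" exfiltration.
best_rate = [rates[np.argmax([attacker_payoff(r, t) for r in rates])]
             for t in thresholds]
print(best_rate[0], best_rate[-1])
```

Sweeping the defender's threshold recovers the expected behavior: tighter monitoring pushes the attacker toward slower, stealthier transfers.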

Updated: 2025-09-08 11:35:25

Categories: cs.CR

Download: http://arxiv.org/abs/2509.06571v1

Integrated Detection and Tracking Based on Radar Range-Doppler Feature

Detection and tracking are the basic tasks of radar systems. Current joint detection tracking methods, which focus on dynamically adjusting detection thresholds from tracking results, still present challenges in fully utilizing the potential of radar signals. These are mainly reflected in the limited capacity of the constant false-alarm rate model to accurately represent information, the insufficient depiction of complex scenes, and the limited information acquired by the tracker. We introduce the Integrated Detection and Tracking based on radar feature (InDT) method, which comprises a network architecture for radar signal detection and a tracker that leverages detection assistance. The InDT detector extracts feature information from each Range-Doppler (RD) matrix and then returns the target position through the feature enhancement module and the detection head. The InDT tracker adaptively updates the measurement noise covariance of the Kalman filter based on detection confidence. The similarity of target RD features is measured by cosine distance, which enhances the data association process by combining location and feature information. Finally, the efficacy of the proposed method was validated through testing on both simulated data and publicly available datasets.
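A minimal sketch of the confidence-adaptive measurement update: detection confidence scales the Kalman filter's measurement-noise variance. The scalar state, the `r_base / confidence` scaling rule, and the clipping bounds are assumptions for illustration, not InDT's exact formulation.

```python
import numpy as np

def adaptive_R(r_base, confidence, r_min=0.1, r_max=10.0):
    """Scale the measurement-noise variance by detection confidence:
    confident detections are trusted more (smaller R), weak ones less.
    The 1/confidence rule and the clipping bounds are assumptions."""
    return float(np.clip(r_base / max(confidence, 1e-6), r_min, r_max))

def kalman_update(x, P, z, R, H=1.0):
    """Scalar Kalman measurement update."""
    S = H * P * H + R            # innovation covariance
    K = P * H / S                # Kalman gain
    return x + K * (z - H * x), (1 - K * H) * P

x, P, z = 0.0, 1.0, 1.0
# A high-confidence detection pulls the state strongly toward the measurement,
# while a low-confidence one is largely ignored.
x_hi, _ = kalman_update(x, P, z, adaptive_R(1.0, confidence=0.99))
x_lo, _ = kalman_update(x, P, z, adaptive_R(1.0, confidence=0.05))
print(round(x_hi, 3), round(x_lo, 3))
```

The cosine-distance feature association described in the abstract would slot into the data-association step, which this sketch omits.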

Updated: 2025-09-08 11:32:58

Categories: eess.SP,cs.AI

Download: http://arxiv.org/abs/2509.06569v1

Multi-output Classification using a Cross-talk Architecture for Compound Fault Diagnosis of Motors in Partially Labeled Condition

The increasing complexity of rotating machinery and the diversity of operating conditions, such as rotating speed and varying torques, have amplified the challenges in fault diagnosis in scenarios requiring domain adaptation, particularly involving compound faults. This study addresses these challenges by introducing a novel multi-output classification (MOC) framework tailored for domain adaptation in partially labeled target datasets. Unlike conventional multi-class classification (MCC) approaches, the MOC framework classifies the severity levels of compound faults simultaneously. Furthermore, we explore various single-task and multi-task architectures applicable to the MOC formulation, including shared-trunk and cross-talk-based designs, for compound fault diagnosis under partially labeled conditions. Based on this investigation, we propose a novel cross-talk architecture, the residual neural dimension reductor (RNDR), which enables selective information sharing across diagnostic tasks, effectively enhancing classification performance in compound fault scenarios. In addition, frequency-layer normalization was incorporated to improve domain adaptation performance on motor vibration data. Compound fault conditions were implemented using a motor-based test setup and evaluated across six domain adaptation scenarios. The experimental results demonstrate its superior macro F1 performance compared to baseline models. We further show, through a single-fault comparison, that the structural advantage of RNDR is more pronounced in compound fault settings. We also found that frequency-layer normalization fits the fault diagnosis task better than conventional methods. Lastly, we analyzed the RNDR under various conditions and against other models with an increased number of parameters, and compared it with an ablated RNDR structure.

Updated: 2025-09-08 11:26:33

Categories: eess.SP,cs.AI

Download: http://arxiv.org/abs/2505.24001v2

KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations

Over the years, the need for rescue operations throughout the world has increased rapidly. Demographic changes and the resulting risk of injury or health disorders form the basis for emergency calls. In such scenarios, first responders rush to reach the patient in need, provide first aid, and save lives. In these situations, they must be able to provide personalized and optimized healthcare in the shortest possible time and estimate the patient's condition with the help of freshly recorded vital data in an emergency situation. However, in such a time-dependent situation, first responders and medical experts cannot fully command all relevant knowledge and need assistance and recommendations for further medical treatment. To achieve this, knowledge calculated, evaluated, and processed on the spot must be made available to improve treatments by first responders. The Knowledge Graph presented in this article as a central knowledge representation provides first responders with innovative knowledge management that enables intelligent treatment recommendations based on an artificial-intelligence-based pre-recognition of the situation.

Updated: 2025-09-08 11:24:56

Categories: cs.AI,cs.ET

Download: http://arxiv.org/abs/2508.07834v4

Marginal sets in semigroups and semirings

In 2019, V. A. Roman'kov introduced the concept of marginal sets for groups. He developed a theory of marginal sets and demonstrated how these sets can be applied to improve some key exchange schemes. In this paper, we extend his ideas and introduce the concept of marginal sets for semigroups and semirings. For tropical matrix semigroups and semirings, we describe how some marginal sets can be constructed. We apply marginal sets to improve some key exchange schemes over semigroups.
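For readers unfamiliar with the tropical setting, here is a small sketch of matrix multiplication over the min-plus semiring, where "addition" is the minimum and "multiplication" is ordinary addition (standard definitions; the marginal-set and key-exchange constructions themselves are not reproduced here).

```python
INF = float("inf")

def tropical_matmul(A, B):
    """Matrix product over the min-plus (tropical) semiring:
    C[i][j] = min_t (A[i][t] + B[t][j])."""
    n, k, m = len(A), len(B), len(B[0])
    return [[min(A[i][t] + B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

# The tropical identity has 0 on the diagonal and +inf elsewhere.
I = [[0, INF], [INF, 0]]
A = [[1, 3], [2, 0]]

print(tropical_matmul(A, I))   # A (x) I = A
print(tropical_matmul(A, A))   # tropical powers compute shortest 2-step paths
```

Because tropical "addition" (min) is idempotent, matrix powers stabilize quickly, which is part of what makes tropical semirings attractive for semigroup-based key exchange.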

Updated: 2025-09-08 11:22:25

Categories: cs.CR

Download: http://arxiv.org/abs/2509.06562v1

Robustness and accuracy of mean opinion scores with hard and soft outlier detection

In subjective assessment of image and video quality, observers rate or compare selected stimuli. Before calculating the mean opinion scores (MOS) for these stimuli from the ratings, it is recommended to identify and deal with outliers that may have given unreliable ratings. Several methods are available for this purpose, some of which have been standardized. These methods are typically based on statistics and sometimes tested by introducing synthetic ratings from artificial outliers, such as random clickers. However, a reliable and comprehensive approach is lacking for comparative performance analysis of outlier detection methods. To fill this gap, this work proposes and applies an empirical worst-case analysis as a general solution. Our method involves evolutionary optimization of an adversarial black-box attack on outlier detection algorithms, where the adversary maximizes the distortion of scale values with respect to ground truth. We apply our analysis to several hard and soft outlier detection methods for absolute category ratings and show their differing performance in this stress test. In addition, we propose two new outlier detection methods with low complexity and excellent worst-case performance. Software for adversarial attacks and data analysis is available.
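A toy sketch of hard outlier rejection before MOS computation: observers whose average deviation from the provisional MOS is anomalously large are discarded. The z-score rule and threshold below are illustrative assumptions, not the standardized procedures or the methods proposed in the paper.

```python
import numpy as np

def mos_with_hard_rejection(ratings, z_thresh=1.5):
    """Per-stimulus MOS after hard rejection of outlier observers.

    ratings: observers x stimuli matrix of absolute-category ratings.
    An observer is dropped when the mean absolute deviation of their scores
    from the provisional MOS is a z_thresh-sigma outlier across observers.
    """
    ratings = np.asarray(ratings, dtype=float)
    provisional = ratings.mean(axis=0)                  # MOS over all observers
    dev = np.abs(ratings - provisional).mean(axis=1)    # per-observer deviation
    z = (dev - dev.mean()) / (dev.std() + 1e-12)
    keep = z < z_thresh
    return ratings[keep].mean(axis=0), keep

honest = [[4, 2, 5], [5, 2, 4], [4, 3, 5], [5, 2, 5]]
clicker = [[1, 5, 1]]                # random-clicker-style observer
mos, keep = mos_with_hard_rejection(honest + clicker)
print(keep, mos)                     # the clicker is rejected
```

An adversarial rater, as studied in the paper, would instead craft ratings that stay just under such a threshold while maximally distorting the MOS — exactly the worst case the proposed analysis searches for.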

Updated: 2025-09-08 11:09:14

Categories: eess.IV,cs.LG,cs.MM

Download: http://arxiv.org/abs/2509.06554v1

XChainWatcher: Monitoring and Identifying Attacks in Cross-Chain Bridges

Cross-chain bridges are a type of middleware for blockchain interoperability that supports the transfer of assets and data across blockchains. However, several of these bridges have vulnerabilities that have caused 3.2 billion dollars in losses since May 2021. Some studies have revealed the existence of these vulnerabilities, but there is little quantitative research available, and there are no safeguard mechanisms to protect bridges from such attacks. Furthermore, no studies are available on the practices of cross-chain bridges that can cause financial losses. We propose XChainWatcher (Cross-Chain Watcher), a modular and extensible logic-driven anomaly detector for cross-chain bridges. It operates in three main phases: (1) decoding events and transactions from multiple blockchains, (2) building logic relations from the extracted data, and (3) evaluating these relations against a set of detection rules. Using XChainWatcher, we analyze data from two previously attacked bridges: the Ronin and Nomad bridges. XChainWatcher was able to successfully identify the transactions that led to losses of $611M and $190M (USD) and surpassed the results obtained by a reputable security firm in the latter. We not only uncover successful attacks, but also reveal other anomalies, such as 37 cross-chain transactions (CCTXs) that these bridges should not have accepted, failed attempts to exploit Nomad, over $7.8M worth of tokens locked on one chain but never released on Ethereum, and $200K lost by users due to inadequate interaction with bridges. We provide the first open dataset of 81,000 CCTXs across three blockchains, capturing more than $4.2B in token transfers.
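The flavor of such logic rules can be sketched as a lock/release consistency check: every release on the target chain must be backed by a matching lock on the source chain. The event schema and rule below are hypothetical illustrations, not the tool's actual rule language.

```python
# Hypothetical decoded bridge events (illustrative schema, not real data).
locks = {            # transfer_id -> amount locked on the source chain
    "tx1": 100,
    "tx2": 250,
}
releases = [         # (transfer_id, amount) released on the target chain
    ("tx1", 100),    # backed by a matching lock
    ("tx2", 900),    # amount mismatch
    ("tx9", 50),     # no corresponding lock at all
]

def check_releases(locks, releases):
    """Flag releases that violate the lock/release consistency rule."""
    anomalies = []
    for tid, amount in releases:
        if tid not in locks:
            anomalies.append((tid, "unbacked release"))
        elif locks[tid] != amount:
            anomalies.append((tid, "amount mismatch"))
    return anomalies

print(check_releases(locks, releases))
# [('tx2', 'amount mismatch'), ('tx9', 'unbacked release')]
```

The "tokens locked but never released" anomaly from the abstract corresponds to the mirror-image rule: locks with no matching release within a time window.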

Updated: 2025-09-08 11:08:36

Categories: cs.CR,cs.DC

Download: http://arxiv.org/abs/2410.02029v3

Impact of Labeling Inaccuracy and Image Noise on Tooth Segmentation in Panoramic Radiographs using Federated, Centralized and Local Learning

Objectives: Federated learning (FL) may mitigate privacy constraints, heterogeneous data quality, and inconsistent labeling in dental diagnostic AI. We compared FL with centralized (CL) and local learning (LL) for tooth segmentation in panoramic radiographs across multiple data corruption scenarios. Methods: An Attention U-Net was trained on 2066 radiographs from six institutions across four settings: baseline (unaltered data); label manipulation (dilated/missing annotations); image-quality manipulation (additive Gaussian noise); and exclusion of a faulty client with corrupted data. FL was implemented via the Flower AI framework. Per-client training- and validation-loss trajectories were monitored for anomaly detection and a set of metrics (Dice, IoU, HD, HD95 and ASSD) was evaluated on a hold-out test set. From these metrics significance results were reported through Wilcoxon signed-rank test. CL and LL served as comparators. Results: Baseline: FL achieved a median Dice of 0.94889 (ASSD: 1.33229), slightly better than CL at 0.94706 (ASSD: 1.37074) and LL at 0.93557-0.94026 (ASSD: 1.51910-1.69777). Label manipulation: FL maintained the best median Dice score at 0.94884 (ASSD: 1.46487) versus CL's 0.94183 (ASSD: 1.75738) and LL's 0.93003-0.94026 (ASSD: 1.51910-2.11462). Image noise: FL led with Dice at 0.94853 (ASSD: 1.31088); CL scored 0.94787 (ASSD: 1.36131); LL ranged from 0.93179-0.94026 (ASSD: 1.51910-1.77350). Faulty-client exclusion: FL reached Dice at 0.94790 (ASSD: 1.33113) better than CL's 0.94550 (ASSD: 1.39318). Loss-curve monitoring reliably flagged the corrupted site. Conclusions: FL matches or exceeds CL and outperforms LL across corruption scenarios while preserving privacy. Per-client loss trajectories provide an effective anomaly-detection mechanism and support FL as a practical, privacy-preserving approach for scalable clinical AI deployment.
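The Dice score used throughout these comparisons can be sketched as follows (standard definition; the toy masks are illustrative):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks (1 = tooth pixel)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1   # 4-pixel ground-truth square
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1   # prediction with 2 extra pixels
print(round(float(dice(a, b)), 3))                 # 2*4 / (4+6) = 0.8
```

Dilated annotations, as in the label-manipulation setting, inflate `target.sum()` and therefore depress Dice even for otherwise correct predictions, which is why the metric is sensitive to the corruption scenarios studied here.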

Updated: 2025-09-08 11:07:47

Categories: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.06553v1

Tackling Device Data Distribution Real-time Shift via Prototype-based Parameter Editing

Real-time data-distribution shift on devices challenges the generalization of lightweight on-device models. This critical issue is often overlooked in current research, which predominantly relies on data-intensive and computationally expensive fine-tuning approaches. To tackle this, we introduce Persona, a novel personalized method using a prototype-based, backpropagation-free parameter editing framework to enhance model generalization without post-deployment retraining. Persona employs a neural adapter in the cloud to generate a parameter editing matrix based on real-time device data. This matrix adeptly adapts on-device models to the prevailing data distributions, efficiently clustering them into prototype models. The prototypes are dynamically refined via the parameter editing matrix, facilitating efficient evolution. Furthermore, the integration of cross-layer knowledge transfer ensures consistent and context-aware multi-layer parameter changes and prototype assignment. Extensive experiments on vision and recommendation tasks across multiple datasets confirm Persona's effectiveness and generality.

Updated: 2025-09-08 11:06:50

Categories: cs.LG,cs.CV,cs.DC,cs.IR

Download: http://arxiv.org/abs/2509.06552v1

Contrastive Self-Supervised Network Intrusion Detection using Augmented Negative Pairs

Network intrusion detection remains a critical challenge in cybersecurity. While supervised machine learning models achieve state-of-the-art performance, their reliance on large labelled datasets makes them impractical for many real-world applications. Anomaly detection methods, which train exclusively on benign traffic to identify malicious activity, suffer from high false positive rates, limiting their usability. Recently, self-supervised learning techniques have demonstrated improved performance with lower false positive rates by learning discriminative latent representations of benign traffic. In particular, contrastive self-supervised models achieve this by minimizing the distance between similar (positive) views of benign traffic while maximizing it between dissimilar (negative) views. Existing approaches generate positive views through data augmentation and treat other samples as negative. In contrast, this work introduces Contrastive Learning using Augmented Negative pairs (CLAN), a novel paradigm for network intrusion detection where augmented samples are treated as negative views - representing potentially malicious distributions - while other benign samples serve as positive views. This approach enhances both classification accuracy and inference efficiency after pretraining on benign traffic. Experimental evaluation on the Lycos2017 dataset demonstrates that the proposed method surpasses existing self-supervised and anomaly detection techniques in a binary classification task. Furthermore, when fine-tuned on a limited labelled dataset, the proposed approach achieves superior multi-class classification performance compared to existing self-supervised models.
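A minimal numpy sketch of the CLAN idea: an InfoNCE-style loss in which augmented views enter as negatives (standing in for potentially malicious distributions) and another benign sample serves as the positive. The embedding size, temperature, and the way "augmented" vectors are generated here are placeholder assumptions, not the paper's setup.

```python
import numpy as np

def clan_loss(anchor, positives, negatives, temp=0.1):
    """Contrastive loss where augmented views act as NEGATIVES and other
    benign samples act as positives, following the CLAN paradigm above.
    All inputs are L2-normalized embedding vectors."""
    pos = np.array([anchor @ p / temp for p in positives])
    neg = np.array([anchor @ n / temp for n in negatives])
    logits = np.concatenate([pos, neg])
    # Cross-entropy that pushes probability mass onto the positive views.
    log_denom = np.log(np.exp(logits).sum())
    return float(-(pos - log_denom).mean())

rng = np.random.default_rng(0)
def unit(v):
    return v / np.linalg.norm(v)

benign = unit(rng.normal(size=16))
other_benign = unit(benign + 0.1 * rng.normal(size=16))    # similar benign traffic
augmented = [unit(rng.normal(size=16)) for _ in range(4)]  # treated as negatives

loss = clan_loss(benign, [other_benign], augmented)
print(round(loss, 4))
```

Minimizing this loss pulls benign samples together while pushing the augmented (pseudo-malicious) views away, so at inference time distance from the benign cluster can score maliciousness.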

Updated: 2025-09-08 11:04:10

Categories: cs.LG,cs.AI,cs.CR,cs.NI,I.2.6; K.6.5

Download: http://arxiv.org/abs/2509.06550v1

Super-Quadratic Quantum Speed-ups and Guessing Many Likely Keys

We study the fundamental problem of guessing cryptographic keys, drawn from some non-uniform probability distribution $D$, as e.g. in LPN, LWE or for passwords. The optimal classical algorithm enumerates keys in decreasing order of likelihood. The optimal quantum algorithm, due to Montanaro (2011), is a sophisticated Grover search. We give the first tight analysis for Montanaro's algorithm, showing that its runtime is $2^{H_{2/3}(D)/2}$, where $H_{\alpha}(\cdot)$ denotes Renyi entropy with parameter $\alpha$. Interestingly, this is a direct consequence of an information theoretic result called Arikan's Inequality (1996) -- which has so far been missed in the cryptographic community -- that tightly bounds the runtime of classical key guessing by $2^{H_{1/2}(D)}$. Since $H_{2/3}(D) < H_{1/2}(D)$ for every non-uniform distribution $D$, we thus obtain a super-quadratic quantum speed-up $s>2$ over classical key guessing. As another main result, we provide the first thorough analysis of guessing in a multi-key setting. Specifically, we consider the task of attacking many keys sampled independently from some distribution $D$, and aim to guess a fraction of them. For product distributions $D = \chi^n$, we show that any constant fraction of keys can be guessed within $2^{H(D)}$ classically and $2 ^{H(D)/2}$ quantumly per key, where $H(\chi)$ denotes Shannon entropy. In contrast, Arikan's Inequality implies that guessing a single key costs $2^{H_{1/2}(D)}$ classically and $2^{H_{2/3}(D)/2}$ quantumly. Since $H(D) < H_{2/3}(D) < H_{1/2}(D)$, this shows that in a multi-key setting the guessing cost per key is substantially smaller than in a single-key setting, both classically and quantumly.
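The entropy ordering behind the super-quadratic speed-up is easy to check numerically (standard Renyi-entropy definitions; the toy key distribution is illustrative):

```python
import math

def renyi(p, alpha):
    """Renyi entropy H_alpha(p) in bits, for alpha != 1:
    H_alpha = log2(sum_i p_i^alpha) / (1 - alpha)."""
    return math.log2(sum(q ** alpha for q in p)) / (1.0 - alpha)

def shannon(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

# A toy non-uniform key distribution.
p = [0.5, 0.25, 0.125, 0.0625, 0.0625]

h_half = renyi(p, 0.5)           # classical guessing cost: 2^{H_{1/2}}
h_23 = renyi(p, 2.0 / 3.0)       # quantum guessing cost: 2^{H_{2/3}/2}
h1 = shannon(p)                  # multi-key cost per key: 2^{H}

# Log-scale quantum speed-up over classical guessing.
speedup = h_half / (h_23 / 2.0)
print(h1 < h_23 < h_half, speedup > 2)
```

Since Renyi entropy is strictly decreasing in its order for non-uniform distributions, the exponent ordering $H(D) < H_{2/3}(D) < H_{1/2}(D)$ holds, and the speed-up ratio $2 H_{1/2}/H_{2/3}$ exceeds 2, matching the abstract's claims.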

Updated: 2025-09-08 11:03:49

Categories: cs.CR,quant-ph

Download: http://arxiv.org/abs/2509.06549v1

Signal-Based Malware Classification Using 1D CNNs

Malware classification is a contemporary and ongoing challenge in cyber-security: modern obfuscation techniques are able to evade traditional static analysis, while dynamic analysis is too resource intensive to be deployed at a large scale. One prominent line of research addresses these limitations by converting malware binaries into 2D images by heuristically reshaping them into a 2D grid before resizing using Lanczos resampling. These images can then be classified based on their textural information using computer vision approaches. While this approach can detect obfuscated malware more effectively than static analysis, the process of converting files into 2D images results in significant information loss due to both quantisation noise, caused by rounding to integer pixel values, and the introduction of 2D dependencies which do not exist in the original data. This loss of signal limits the classification performance of the downstream model. This work addresses these weaknesses by instead resizing the files into 1D signals which avoids the need for heuristic reshaping, and additionally these signals do not suffer from quantisation noise due to being stored in a floating-point format. It is shown that existing 2D CNN architectures can be readily adapted to classify these 1D signals for improved performance. Furthermore, a bespoke 1D convolutional neural network, based on the ResNet architecture and squeeze-and-excitation layers, was developed to classify these signals and evaluated on the MalNet dataset. It was found to achieve state-of-the-art performance on binary, type, and family level classification with F1 scores of 0.874, 0.503, and 0.507, respectively, paving the way for future models to operate on the proposed signal modality.
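The byte-to-signal conversion can be sketched with linear interpolation over a fixed-length grid; the target length and [0, 1] normalization below are assumptions for illustration, not the paper's exact preprocessing.

```python
import numpy as np

def binary_to_signal(raw: bytes, length: int = 4096) -> np.ndarray:
    """Resize a file's bytes into a fixed-length 1D float signal via linear
    interpolation, avoiding the heuristic 2D reshaping and the integer
    quantisation of the image-based pipeline."""
    x = np.frombuffer(raw, dtype=np.uint8).astype(np.float64) / 255.0
    old = np.linspace(0.0, 1.0, num=x.size)   # original byte positions
    new = np.linspace(0.0, 1.0, num=length)   # fixed-length sample grid
    return np.interp(new, old, x)             # float signal: no pixel rounding

sig = binary_to_signal(bytes(range(256)), length=64)
print(sig.shape, float(sig.min()), float(sig.max()))
```

Because the result is stored as floats, no quantisation noise is introduced, and a 1D CNN can consume the signal directly without the spurious vertical dependencies created by a 2D grid.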

Updated: 2025-09-08 11:03:48

标题: 使用1D CNNs基于信号的恶意软件分类

摘要: 恶意软件分类是网络安全中一个当代且持续的挑战:现代混淆技术能够规避传统的静态分析,而动态分析则需要大量资源,难以大规模部署。一条突出的研究路线通过将恶意软件二进制文件启发式地重塑为2D网格、再用Lanczos重采样调整大小,将其转换为2D图像,以应对这些限制。这些图像随后可以基于其纹理信息,用计算机视觉方法进行分类。虽然这种方法可以比静态分析更有效地检测混淆的恶意软件,但将文件转换为2D图像的过程会导致显著的信息损失:一方面是取整为整数像素值带来的量化噪声,另一方面是引入了原始数据中不存在的2D依赖关系。这种信号损失限制了下游模型的分类性能。本研究通过将文件调整为1D信号来解决这些弱点,避免了启发式重塑的需要;并且这些信号以浮点格式存储,因而不受量化噪声的影响。结果表明,现有的2D卷积神经网络架构可以很容易地改造为对这些1D信号进行分类,以提高性能。此外,基于ResNet架构和压缩与激励层,开发了一个专门的1D卷积神经网络来对这些信号进行分类,并在MalNet数据集上进行评估。该网络在二进制、类型和家族级别分类上取得了最先进的性能,F1分数分别为0.874、0.503和0.507,为未来模型在所提信号模态上运行铺平了道路。

更新时间: 2025-09-08 11:03:48

领域: cs.CR,cs.AI,cs.CV,cs.LG,I.2.6; K.6.5

下载: http://arxiv.org/abs/2509.06548v1
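The core preprocessing idea, resizing raw bytes into a fixed-length floating-point 1D signal instead of a quantised 2D image, can be sketched as follows. This is an illustrative stand-in: the paper resamples with a Lanczos-style filter, while plain linear interpolation is used here for brevity.

```python
def bytes_to_signal(data: bytes, target_len: int) -> list:
    """Resize a raw binary into a fixed-length 1D float signal in [0, 1].

    Linear interpolation stands in for the paper's Lanczos-style
    resampling; values stay floating-point, so nothing is lost to
    integer-pixel quantisation and no artificial 2D dependencies
    are introduced. Assumes target_len >= 2.
    """
    xs = [b / 255.0 for b in data]
    if len(xs) == 1:
        return [xs[0]] * target_len
    out = []
    for i in range(target_len):
        pos = i * (len(xs) - 1) / (target_len - 1)  # sample position in input
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        frac = pos - lo
        out.append(xs[lo] * (1 - frac) + xs[hi] * frac)
    return out

sig = bytes_to_signal(bytes(range(256)), target_len=64)
assert len(sig) == 64 and sig[0] == 0.0 and sig[-1] == 1.0
```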

Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like ``aha moments'', ``length-scaling'', and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. HICRA significantly outperforms strong baselines, demonstrating that focusing on this strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we validate semantic entropy as a superior compass for measuring strategic exploration over misleading metrics such as token-level entropy.

Updated: 2025-09-08 11:03:11

标题: 通过强化学习在LLMs中涌现的分层推理

摘要: 强化学习(RL)已被证明在增强大型语言模型(LLMs)的复杂推理能力方面非常有效,然而驱动这一成功的基本机制仍然大多不透明。我们的分析揭示了一些令人困惑的现象,如“灵光一现”,“长度缩放”和熵动态并不是孤立事件,而是新兴推理层次结构的特征,类似于人类认知中高级战略规划与低级程序执行的分离。我们揭示了一个引人注目的两阶段动态:最初,模型受到程序的正确性约束,必须改进其低级技能。然后学习瓶颈明显转移,性能提升是通过高级战略规划的探索和掌握驱动的。这一洞察揭示了现有RL算法(如GRPO)中的核心低效性,这些算法对所有令牌施加优化压力,使学习信号稀释。为了解决这个问题,我们提出了HIerarchy-Aware Credit Assignment(HICRA)算法,该算法将优化工作集中在高影响规划令牌上。HICRA明显优于强基线,表明集中精力解决这一战略瓶颈是解锁先进推理的关键。此外,我们验证了语义熵作为衡量战略探索的更优指南,而不是诸如令牌级熵等误导性指标。

更新时间: 2025-09-08 11:03:11

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.03646v2
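The shift of optimization pressure onto planning tokens can be sketched on top of GRPO-style group-relative advantages. The binary planning mask and the boost factor below are hypothetical; HICRA's concrete credit-assignment rule may differ.

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: reward minus the group mean,
    scaled by the group standard deviation (fallback 1.0 when std is 0)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

def hicra_weights(token_is_planning, boost=2.0):
    """Hierarchy-aware credit assignment, sketched: tokens tagged as
    strategic planning receive `boost` times the optimization pressure.
    Mask and boost are illustrative assumptions, not the paper's rule."""
    return [boost if planning else 1.0 for planning in token_is_planning]

seq_adv = grpo_advantages([1, 0, 0, 1])[0]   # advantage of the first rollout
mask = [True, False, False, True]            # its tokens tagged as planning
per_token = [seq_adv * w for w in hicra_weights(mask)]
assert per_token == [2.0, 1.0, 1.0, 2.0]     # planning tokens dominate the signal
```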

A New Dataset and Benchmark for Grounding Multimodal Misinformation

The proliferation of online misinformation videos poses serious societal risks. Current datasets and detection methods primarily target binary classification or single-modality localization based on post-processed data, lacking the interpretability needed to counter persuasive misinformation. In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies multimodal content and localizes misleading segments across modalities. We present the first real-world dataset for this task, GroundLie360, featuring a taxonomy of misinformation types, fine-grained annotations across text, speech, and visuals, and validation with Snopes evidence and annotator reasoning. We also propose a VLM-based, QA-driven baseline, FakeMark, using single- and cross-modal cues for effective detection and grounding. Our experiments highlight the challenges of this task and lay a foundation for explainable multimodal misinformation detection.

Updated: 2025-09-08 10:56:07

标题: 一个用于定位多模态错误信息的新数据集和基准

摘要: 在线虚假信息视频的泛滥对社会造成严重风险。当前的数据集和检测方法主要针对二元分类或基于后处理数据的单模态定位,缺乏对抗具有说服力的虚假信息所需的可解释性。在本文中,我们引入了多模态虚假信息定位任务(GroundMM),该任务验证多模态内容并定位跨模态的误导性片段。我们提出了这一任务的第一个真实世界数据集GroundLie360,其中包括虚假信息类型的分类、跨文本、语音和视觉的细粒度注释,并使用Snopes证据和注释者推理进行验证。我们还提出了基于VLM的、QA驱动的基线模型FakeMark,利用单模态和跨模态线索实现有效的检测和定位。我们的实验突显了这一任务的挑战,并为可解释的多模态虚假信息检测奠定了基础。

更新时间: 2025-09-08 10:56:07

领域: cs.SI,cs.AI,cs.MM

下载: http://arxiv.org/abs/2509.08008v1

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, which contains 11 unique search problems, each equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that using language-only reasoning, even the most advanced LLMs fail to solve SearchBench end-to-end, e.g., OpenAI's frontier models GPT4 and o1-preview solve only 1.4% and 18.6% of SearchBench problems, respectively. The reason is that SearchBench problems require considering multiple pathways to the solution and performing backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT4's performance rises to 11.7%. Interestingly, we show that the current strongest baseline on SearchBench is obtained using in-context learning with A* algorithm implementations. We further show that this baseline can be further enhanced via a Multi-Stage-Multi-Try inference method, raising GPT4's performance above 57%.

Updated: 2025-09-08 10:55:32

标题: 穿越迷宫:评估LLMs推理搜索问题的能力

摘要: 最近,大型语言模型(LLMs)在数学和推理基准测试中取得了令人印象深刻的表现。然而,它们经常在对人类而言相对容易的逻辑问题和谜题上遇到困难。为了进一步研究这一现象,我们引入了一个新基准SearchBench,其中包含11个独特的搜索问题,每个问题都配备了自动流水线,用于生成任意数量的实例,并分析LLM生成的解的可行性、正确性和最优性。我们展示了仅使用语言推理时,即使最先进的LLMs也无法端到端地解决SearchBench,例如,OpenAI的前沿模型GPT4和o1-preview分别只解决了1.4%和18.6%的SearchBench问题。原因在于SearchBench问题需要考虑通往解的多条路径并执行回溯,这对自回归模型构成了重大挑战。指示LLMs生成解决问题的代码有所帮助,但提升有限,例如,GPT4的性能升至11.7%。有趣的是,我们展示了SearchBench上当前最强的基线是通过携带A*算法实现的上下文学习获得的。我们进一步展示了该基线可以通过多阶段多尝试推理方法进一步增强,将GPT4的性能提升至57%以上。

更新时间: 2025-09-08 10:55:32

领域: cs.AI

下载: http://arxiv.org/abs/2406.12172v2
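The strongest reported baseline supplies an A* implementation in-context. A compact version of the kind of reference implementation involved (the grid world and heuristic below are illustrative):

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Compact A*: `neighbors(n)` yields (next_node, step_cost) pairs and
    `heuristic` must be admissible. Returns a cheapest path or None."""
    frontier = [(heuristic(start), 0, start, [start])]
    best = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, step in neighbors(node):
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]),
                )
    return None

# Toy 3x3 grid with a Manhattan-distance heuristic (illustrative).
def grid_neighbors(p):
    x, y = p
    return [((x + dx, y + dy), 1)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 3 and 0 <= y + dy < 3]

path = a_star((0, 0), (2, 2), grid_neighbors, lambda p: (2 - p[0]) + (2 - p[1]))
assert len(path) == 5  # optimal: four unit steps plus the start node
```

The multiple-pathway bookkeeping and backtracking live in the frontier and the `best` cost map, exactly the state an autoregressive decoder struggles to maintain in plain text.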

DistJoin: A Decoupled Join Cardinality Estimator based on Adaptive Neural Predicate Modulation

Research on learned cardinality estimation has made significant progress in recent years. However, existing methods still face distinct challenges that hinder their practical deployment in production environments. We define these challenges as the ``Trilemma of Cardinality Estimation'', where learned cardinality estimation methods struggle to balance generality, accuracy, and updatability. To address these challenges, we introduce DistJoin, a join cardinality estimator based on efficient distribution prediction using multi-autoregressive models. Our contributions are threefold: (1) We propose a method to estimate join cardinality by leveraging the probability distributions of individual tables in a decoupled manner. (2) To meet the requirements of efficiency for DistJoin, we develop Adaptive Neural Predicate Modulation (ANPM), a high-throughput distribution estimation model. (3) We demonstrate that an existing similar approach suffers from variance accumulation issues by formal variance analysis. To mitigate this problem, DistJoin employs a selectivity-based approach to infer join cardinality, effectively reducing variance. In summary, DistJoin not only represents the first data-driven method to support both equi and non-equi joins simultaneously but also demonstrates superior accuracy while enabling fast and flexible updates. The experimental results demonstrate that DistJoin achieves the highest accuracy, robustness to data updates, generality, and comparable update and inference speed relative to existing methods.

Updated: 2025-09-08 10:55:02

标题: DistJoin:基于自适应神经谓词调制的解耦连接基数估计器

摘要: 在最近几年,关于学习基数估计的研究取得了显著进展。然而,现有方法仍然面临着明显的挑战,这些挑战阻碍了它们在生产环境中的实际部署。我们将这些挑战定义为“基数估计的三难题”,即学习基数估计方法在平衡通用性、准确性和可更新性方面存在困难。为了解决这些挑战,我们引入了DistJoin,这是一种基于高效分布预测的连接基数估计器,利用多自回归模型。我们的贡献有三点:(1)我们提出一种通过分离方式利用单个表的概率分布来估计连接基数的方法。(2)为了满足DistJoin的效率要求,我们开发了自适应神经谓词调制(ANPM),这是一种高吞吐量的分布估计模型。(3)我们通过正式方差分析表明,现有类似方法存在方差积累问题。为了缓解这个问题,DistJoin采用基于选择性的方法来推断连接基数,有效降低方差。总之,DistJoin不仅是第一个支持同时进行等值和非等值连接的数据驱动方法,而且在实现快速灵活更新的同时,还展现出优越的准确性。实验结果表明,相对于现有方法,DistJoin在准确性、对数据更新的鲁棒性、通用性以及更新和推断速度方面均表现最佳。

更新时间: 2025-09-08 10:55:02

领域: cs.DB,cs.AI

下载: http://arxiv.org/abs/2503.08994v2
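For an equi-join, the decoupled quantity at the heart of DistJoin reduces to combining per-table value distributions. The sketch below computes that quantity exactly from value counts; DistJoin instead approximates these per-table distributions with its learned autoregressive models.

```python
from collections import Counter

def equi_join_cardinality(col_r, col_s):
    """Exact equi-join cardinality from per-table value counts:
    |R join S| = sum_v c_R(v) * c_S(v). This is the decoupled quantity
    that DistJoin estimates with learned per-table distributions."""
    cr, cs = Counter(col_r), Counter(col_s)
    return sum(cr[v] * cs[v] for v in cr.keys() & cs.keys())

r = [1, 1, 2, 3]
s = [1, 2, 2, 4]
# value 1 contributes 2*1 pairs, value 2 contributes 1*2 pairs => 4
assert equi_join_cardinality(r, s) == 4
```

Because each table's distribution is modeled independently, a data update to one table only requires refreshing that table's model, which is the updatability leg of the trilemma.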

Predicting Fetal Outcomes from Cardiotocography Signals Using a Supervised Variational Autoencoder

Objective: To develop and interpret a supervised variational autoencoder (VAE) model for classifying cardiotocography (CTG) signals based on pregnancy outcomes, addressing interpretability limits of current deep learning approaches. Methods: The OxMat CTG dataset was used to train a VAE on five-minute fetal heart rate (FHR) segments, labeled with postnatal outcomes. The model was optimised for signal reconstruction and outcome prediction, incorporating Kullback-Leibler divergence and total correlation (TC) constraints to structure the latent space. Performance was evaluated using area under the receiver operating characteristic curve (AUROC) and mean squared error (MSE). Interpretability was assessed using coefficient of determination, latent traversals and unsupervised component analyses. Results: The model achieved an AUROC of 0.752 at the segment level and 0.779 at the CTG level, where predicted scores were aggregated. Relaxing TC constraints improved both reconstruction and classification. Latent analysis showed that baseline-related features (e.g., FHR baseline, baseline shift) were well represented and aligned with model scores, while metrics like short- and long-term variability were less strongly encoded. Traversals revealed clear signal changes for baseline features, while other properties were entangled or subtle. Unsupervised decompositions corroborated these patterns. Findings: This work demonstrates that supervised VAEs can achieve competitive fetal outcome prediction while partially encoding clinically meaningful CTG features. The irregular, multi-timescale nature of FHR signals poses challenges for disentangling physiological components, distinguishing CTG from more periodic signals such as ECG. Although full interpretability was not achieved, the model supports clinically useful outcome prediction and provides a basis for future interpretable, generative models.

Updated: 2025-09-08 10:54:04

标题: 使用监督变分自编码器从胎心监护信号预测胎儿结局

摘要: 目的:开发并解释一个基于妊娠结局对胎心监护(CTG)信号进行分类的监督变分自编码器(VAE)模型,以解决当前深度学习方法在可解释性上的局限。方法:使用OxMat CTG数据集,在带有产后结局标签的五分钟胎心率(FHR)片段上训练VAE。该模型针对信号重构和结局预测进行了优化,并结合Kullback-Leibler散度和总相关性(TC)约束来构造潜在空间。性能评估使用受试者工作特征曲线下面积(AUROC)和均方误差(MSE)。可解释性使用决定系数、潜在遍历和无监督成分分析进行评估。结果:该模型在片段级别达到了0.752的AUROC,在聚合预测分数后的CTG级别达到了0.779。放宽TC约束同时改善了重构和分类。潜在空间分析显示,基线相关特征(例如FHR基线、基线偏移)得到了良好表示并与模型分数对齐,而短期和长期变异性等指标的编码较弱。潜在遍历揭示了基线特征的明显信号变化,而其他特性则相互纠缠或变化微妙。无监督分解印证了这些模式。发现:这项工作表明,监督VAE可以在部分编码具有临床意义的CTG特征的同时,实现有竞争力的胎儿结局预测。FHR信号不规则、多时间尺度的性质给解耦生理成分带来了挑战,使CTG有别于ECG等更具周期性的信号。尽管未实现完全的可解释性,该模型仍支持临床上有用的结局预测,并为未来可解释的生成模型奠定了基础。

更新时间: 2025-09-08 10:54:04

领域: cs.LG

下载: http://arxiv.org/abs/2509.06540v1

Learning Optimal Defender Strategies for CAGE-2 using a POMDP Model

CAGE-2 is an accepted benchmark for learning and evaluating defender strategies against cyberattacks. It reflects a scenario where a defender agent protects an IT infrastructure against various attacks. Many defender methods for CAGE-2 have been proposed in the literature. In this paper, we construct a formal model for CAGE-2 using the framework of Partially Observable Markov Decision Process (POMDP). Based on this model, we define an optimal defender strategy for CAGE-2 and introduce a method to efficiently learn this strategy. Our method, called BF-PPO, is based on PPO, and it uses particle filter to mitigate the computational complexity due to the large state space of the CAGE-2 model. We evaluate our method in the CAGE-2 CybORG environment and compare its performance with that of CARDIFF, the highest ranked method on the CAGE-2 leaderboard. We find that our method outperforms CARDIFF regarding the learned defender strategy and the required training time.

Updated: 2025-09-08 10:51:43

标题: 使用POMDP模型学习CAGE-2的最佳防御策略

摘要: CAGE-2是学习和评估网络防御策略的公认基准。它反映了一个防御者代理保护信息技术基础设施免受各种攻击的情景。文献中提出了许多针对CAGE-2的防御方法。本文利用部分可观察马尔可夫决策过程(POMDP)框架构建了CAGE-2的形式模型。基于该模型,我们定义了CAGE-2的最优防御策略,并引入了一种有效学习该策略的方法。我们的方法称为BF-PPO,基于PPO,并使用粒子滤波器来减轻CAGE-2模型的大状态空间所带来的计算复杂性。我们在CAGE-2的CybORG环境中评估了我们的方法,并将其性能与CAGE-2排行榜上排名最高的CARDIFF方法进行比较。我们发现,我们的方法在学习防御策略和所需训练时间方面优于CARDIFF。

更新时间: 2025-09-08 10:51:43

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.06539v1
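The role of the particle filter in BF-PPO, approximating an intractable POMDP belief state with a sampled particle set, can be sketched generically. The toy two-state dynamics and observation model below are assumptions for illustration; the actual filter is tied to the CAGE-2 model.

```python
import random

def particle_filter_step(particles, transition, obs_likelihood, obs, rng):
    """One predict-update-resample step over a sampled belief, sidestepping
    an exact belief over an exponentially large state space (sketch)."""
    # Predict: propagate each particle through the (stochastic) dynamics.
    moved = [transition(p, rng) for p in particles]
    # Update: weight particles by how well they explain the observation.
    weights = [obs_likelihood(p, obs) for p in moved]
    if sum(weights) == 0:               # degenerate case: keep the prediction
        return moved
    # Resample: draw a new equally-weighted particle set.
    return rng.choices(moved, weights=weights, k=len(moved))

# Toy demo: two hidden states, static dynamics, noisy observations of state 1.
rng = random.Random(0)
belief = [0] * 50 + [1] * 50
for _ in range(5):
    belief = particle_filter_step(
        belief, lambda s, r: s, lambda s, o: 0.9 if s == o else 0.1, 1, rng)
# Probability mass concentrates on the state consistent with the observations.
```

The defender policy then conditions on this particle set (or statistics of it) instead of the raw observation history.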

On the Reproducibility of "FairCLIP: Harnessing Fairness in Vision-Language Learning"

We investigated the reproducibility of FairCLIP, proposed by Luo et al. (2024), for improving the group fairness of CLIP (Radford et al., 2021) by minimizing image-text similarity score disparities across sensitive groups using the Sinkhorn distance. The experimental setup of Luo et al. (2024) was reproduced to primarily investigate the research findings for FairCLIP. The model description by Luo et al. (2024) was found to differ from the original implementation. Therefore, a new implementation, A-FairCLIP, is introduced to examine specific design choices. Furthermore, FairCLIP+ is proposed to extend the FairCLIP objective to include multiple attributes. Additionally, the impact of the distance minimization on FairCLIP's fairness and performance was explored. In alignment with the original authors, CLIP was found to be biased towards certain demographics when applied to zero-shot glaucoma classification using medical scans and clinical notes from the Harvard-FairVLMed dataset. However, the experimental results on two datasets do not support their claim that FairCLIP improves the performance and fairness of CLIP. Although the regularization objective reduces Sinkhorn distances, both the official implementation and the aligned implementation, A-FairCLIP, were not found to improve performance nor fairness in zero-shot glaucoma classification.

Updated: 2025-09-08 10:41:10

标题: 关于《FairCLIP:利用视觉-语言学习中的公平性》的可重复性

摘要: 我们调查了罗等人(2024年)提出的FairCLIP的可重复性,该方法旨在通过最小化敏感群体之间的图像-文本相似度差异来改善CLIP(Radford等人,2021年)的群体公平性,使用Sinkhorn距离。我们复制了罗等人(2024年)的实验设置,主要是为了调查FairCLIP的研究发现。发现罗等人(2024年)的模型描述与原始实现有所不同。因此,引入了一个新的实现,A-FairCLIP,用于检查特定的设计选择。此外,提出了FairCLIP+,将FairCLIP目标扩展到包括多个属性。此外,还探讨了距离最小化对FairCLIP的公平性和性能的影响。与原始作者的观点一致,当将CLIP应用于使用哈佛-FairVLMed数据集的医学扫描和临床笔记进行零样本青光眼分类时,发现CLIP对某些人口统计特征存在偏见。然而,对两个数据集的实验结果并不支持他们的说法,即FairCLIP改善了CLIP的性能和公平性。尽管正则化目标减小了Sinkhorn距离,但官方实现和对齐实现A-FairCLIP均未发现在零样本青光眼分类中提高性能或公平性。

更新时间: 2025-09-08 10:41:10

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.06535v1
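The quantity FairCLIP's regularizer minimizes is an entropy-regularized optimal-transport (Sinkhorn) distance between groups' similarity-score distributions. A minimal pure-Python sketch; the histograms, cost matrix, and hyperparameters are illustrative:

```python
import math

def sinkhorn_distance(a, b, cost, eps=0.1, iters=200):
    """Entropy-regularized OT distance between histograms a and b via
    Sinkhorn iterations (sketch; eps and iters are illustrative)."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * len(a), [1.0] * len(b)
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b)))
             for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a)))
             for j in range(len(b))]
    # Transport plan P[i][j] = u[i] * K[i][j] * v[j]; distance = <P, cost>.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(len(a)) for j in range(len(b)))

cost = [[0.0, 1.0], [1.0, 0.0]]
# Matched score histograms across two groups incur near-zero distance;
# mismatched ones incur a large penalty the regularizer would push down.
assert sinkhorn_distance([0.5, 0.5], [0.5, 0.5], cost) < 0.05
assert sinkhorn_distance([0.9, 0.1], [0.1, 0.9], cost) > 0.5
```

The reproduction's finding is that driving this distance down did not translate into fairness or performance gains on the glaucoma task.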

An Efficient Recommendation Filtering-based Trust Model for Securing Internet of Things

Trust computation is crucial for ensuring the security of the Internet of Things (IoT). However, current trust-based mechanisms for IoT have limitations that impact data security. Sliding window-based trust schemes cannot ensure reliable trust computation due to their inability to select appropriate window lengths. Besides, recent trust scores are emphasized when considering the effect of time on trust. This can cause a sudden change in overall trust score based on recent behavior, potentially misinterpreting an honest service provider as malicious and vice versa. Moreover, clustering mechanisms used to filter recommendations in trust computation often lead to slower results. In this paper, we propose a robust trust model to address these limitations. The proposed approach determines the window length dynamically to guarantee accurate trust computation. It uses the harmonic mean of average trust score and time to prevent sudden fluctuations in trust scores. Additionally, an efficient personalized subspace clustering algorithm is used to exclude recommendations. We present a security analysis demonstrating the resiliency of the proposed scheme against bad-mouthing, ballot-stuffing, and on-off attacks. The proposed scheme demonstrates a competitive performance in detecting bad-mouthing attacks, while outperforming existing works with an approximately 44% improvement in accuracy for detecting on-off attacks. It maintains its effectiveness even when the percentage of on-off attackers increases and in scenarios where multiple attacks occur simultaneously. Additionally, the proposed scheme reduces the recommendation filtering time by 95%.

Updated: 2025-09-08 10:37:00

标题: 一种基于推荐过滤的高效物联网安全信任模型

摘要: 信任计算对于确保物联网(IoT)的安全至关重要。然而,当前用于物联网的基于信任的机制存在影响数据安全的局限。基于滑动窗口的信任方案由于无法选择合适的窗口长度,难以保证可靠的信任计算。此外,在考虑时间对信任的影响时,近期的信任分数往往被过度强调,这可能导致整体信任分数因近期行为而突变,从而把诚实的服务提供者误判为恶意,反之亦然。另外,用于在信任计算中过滤推荐的聚类机制通常导致计算变慢。在本文中,我们提出了一个鲁棒的信任模型来解决这些局限。所提方法动态确定窗口长度,以保证准确的信任计算;它使用平均信任分数与时间的调和平均值,以防止信任分数的突然波动;此外,还使用一个高效的个性化子空间聚类算法来排除不可信推荐。我们给出的安全分析表明,所提方案能够抵御诋毁(bad-mouthing)、刷票(ballot-stuffing)和开关(on-off)攻击。所提方案在检测诋毁攻击方面表现出有竞争力的性能,而在检测开关攻击方面准确率比现有工作提高约44%。即使开关攻击者比例增加,或多种攻击同时发生,它仍保持有效。此外,所提方案将推荐过滤时间减少了95%。

更新时间: 2025-09-08 10:37:00

领域: cs.CR

下载: http://arxiv.org/abs/2508.17304v2
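One plausible reading of the harmonic-mean smoothing is sketched below: combining the historical average trust with a recency factor through a harmonic mean keeps one bad recent interaction from collapsing the score outright. The names and the exact combination are assumptions for illustration, not the paper's verbatim formula.

```python
def overall_trust(avg_trust, recency_factor):
    """Harmonic mean of the historical average trust and a recency factor,
    both in (0, 1]. One plausible reading of the paper's scheme; the
    harmonic mean leans toward the smaller input but never ignores history."""
    return 2 * avg_trust * recency_factor / (avg_trust + recency_factor)

# A long-honest provider (average 0.9) after one bad recent interaction (0.2):
naive_recent = 0.2                    # recency-only view: abrupt collapse
combined = overall_trust(0.9, 0.2)    # ~0.327: lowered, but history still counts
assert naive_recent < combined < 0.9
```

This is exactly the failure mode the abstract describes: a recency-dominated score would flip an honest provider to "malicious" after a single bad window, while the combined score degrades gradually.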

SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion

Link prediction in knowledge graphs requires integrating structural information and semantic context to infer missing entities. While large language models offer strong generative reasoning capabilities, their limited exploitation of structural signals often results in structural sparsity and semantic ambiguity, especially under incomplete or zero-shot settings. To address these challenges, we propose SLiNT (Structure-aware Language model with Injection and coNtrastive Training), a modular framework that injects knowledge-graph-derived structural context into a frozen LLM backbone with lightweight LoRA-based adaptation for robust link prediction. Specifically, Structure-Guided Neighborhood Enhancement (SGNE) retrieves pseudo-neighbors to enrich sparse entities and mitigate missing context; Dynamic Hard Contrastive Learning (DHCL) introduces fine-grained supervision by interpolating hard positives and negatives to resolve entity-level ambiguity; and Gradient-Decoupled Dual Injection (GDDI) performs token-level structure-aware intervention while preserving the core LLM parameters. Experiments on WN18RR and FB15k-237 show that SLiNT achieves superior or competitive performance compared with both embedding-based and generation-based baselines, demonstrating the effectiveness of structure-aware representation learning for scalable knowledge graph completion.

Updated: 2025-09-08 10:36:49

标题: SLiNT:结构感知的语言模型,注入和对比训练用于知识图谱补全

摘要: 知识图谱中的链接预测需要整合结构信息和语义上下文来推断缺失的实体。尽管大型语言模型具备强大的生成式推理能力,但它们对结构信号的利用有限,常常导致结构稀疏和语义模糊,尤其是在不完整或零样本设置下。为了解决这些挑战,我们提出了SLiNT(具有注入和对比训练的结构感知语言模型),这是一个模块化框架,它将源自知识图谱的结构上下文注入到冻结的LLM骨干中,并通过轻量级的基于LoRA的适配实现鲁棒的链接预测。具体来说,结构引导的邻域增强(SGNE)检索伪邻居以丰富稀疏实体并缓解上下文缺失;动态困难对比学习(DHCL)通过对困难正例和负例进行插值引入细粒度监督,以解决实体级别的歧义;梯度解耦的双重注入(GDDI)在保留核心LLM参数的同时执行令牌级别的结构感知干预。在WN18RR和FB15k-237上的实验表明,与基于嵌入和基于生成的基线相比,SLiNT实现了更优或有竞争力的性能,展示了结构感知表示学习对可扩展知识图谱补全的有效性。

更新时间: 2025-09-08 10:36:49

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.06531v1

Lane Change Intention Prediction of two distinct Populations using a Transformer

As a result of the growing importance of lane change intention prediction for a safe and efficient driving experience in complex driving scenarios, researchers have in recent years started to train novel machine learning algorithms on available datasets with promising results. A shortcoming of this recent research effort, though, is that the vast majority of the proposed algorithms are trained on a single dataset. In doing so, researchers failed to test whether their algorithms would remain effective on a different dataset and, by extension, on a different population from the one they were trained on. In this article we test a transformer designed for lane change intention prediction on two datasets collected by LevelX in Germany and Hong Kong. We found that the transformer's accuracy plummeted when tested on a population different to the one it was trained on, with accuracy values as low as 39.43%, but that when trained on both populations simultaneously it could achieve an accuracy as high as 86.71%. - This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Updated: 2025-09-08 10:35:26

标题: 使用Transformer模型对两个不同人群的变道意图进行预测

摘要: 由于在复杂驾驶场景中,变道意图预测对于安全高效的驾驶体验日益重要,近年来研究人员开始在可用数据集上训练新型机器学习算法,并取得了可喜的结果。然而,这一研究方向的一个缺点是,绝大多数所提算法只在单一数据集上训练。这样一来,研究人员没有检验其算法在不同数据集上、进而在与训练人群不同的人群上是否同样有效。在本文中,我们在LevelX于德国和香港采集的两个数据集上测试了一个为变道意图预测设计的Transformer。我们发现,当在与训练人群不同的人群上测试时,该Transformer的准确率骤降,最低仅为39.43%;但当同时在两个人群上训练时,其准确率可高达86.71%。本研究已提交IEEE以供可能发表。版权可能在不另行通知的情况下转让,届时本版本可能不再可访问。

更新时间: 2025-09-08 10:35:26

领域: cs.LG

下载: http://arxiv.org/abs/2509.06529v1

Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training

Transformer-based language models traditionally use uniform (isotropic) layer sizes, yet they ignore the diverse functional roles that different depths can play and their computational capacity needs. Building on Layer-Wise Scaling (LWS) and pruning literature, we introduce three new LWS variants - Framed, Reverse, and Crown - that redistribute FFN widths and attention heads via two or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants, on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance compared to an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training, but future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.

Updated: 2025-09-08 10:24:19

标题: 冠冕、框架、反转:用于LLM预训练的分层缩放变体

摘要: 基于Transformer的语言模型传统上使用均匀(各向同性)的层大小,却忽视了不同深度可以扮演的多样功能角色及其计算容量需求。基于逐层缩放(LWS)和剪枝文献,我们引入了三种新的LWS变体,Framed、Reverse和Crown,它们在预训练阶段通过两点或三点线性插值重新分配FFN宽度和注意力头。我们在固定的180M参数预算、5B令牌训练量下,首次对LWS及其变体进行了系统消融。所有模型都收敛到相近的损失,并且在训练吞吐量没有明显下降的情况下,取得了优于等成本各向同性基线的性能。这项工作是探索预训练逐层架构设计空间的第一步,但未来的工作应将实验扩展到高出若干数量级的令牌和参数规模,以充分评估其潜力。

更新时间: 2025-09-08 10:24:19

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.06518v1
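The two- and three-point linear interpolation behind the LWS variants can be sketched as a width schedule over depth. The control-point widths below are illustrative, not the paper's configurations.

```python
def layerwise_widths(n_layers, points):
    """FFN width per layer by piecewise-linear interpolation between
    (layer_fraction, width) control points, sorted by fraction. Two points
    give LWS-style ramps (swap the endpoints for a "Reverse" profile);
    three points give "Crown"/"Framed"-style profiles. Assumes n_layers >= 2."""
    widths = []
    for layer in range(n_layers):
        t = layer / (n_layers - 1)          # depth as a fraction in [0, 1]
        for (t0, w0), (t1, w1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                frac = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
                widths.append(round(w0 + frac * (w1 - w0)))
                break
    return widths

ramp = layerwise_widths(5, [(0.0, 512), (1.0, 2048)])                # increasing ramp
crown = layerwise_widths(5, [(0.0, 512), (0.5, 2048), (1.0, 512)])   # peak mid-depth
assert ramp == [512, 896, 1280, 1664, 2048]
assert crown[2] == max(crown)
```

A matched-budget comparison then holds the total parameter count fixed while only the shape of this schedule varies, which is exactly the ablation the abstract describes.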

A stability theorem for bigraded persistence barcodes

We define bigraded persistent homology modules and bigraded barcodes of a finite pseudo-metric space X using the ordinary and double homology of the moment-angle complex associated with the Vietoris-Rips filtration of X. We prove a stability theorem for the bigraded persistent double homology modules and barcodes.

Updated: 2025-09-08 10:23:48

标题: 双分级持久条形码的一个稳定性定理

摘要: 我们利用与有限伪度量空间X的Vietoris-Rips过滤相伴的矩角复形的普通同调与双重同调,定义了X的双分级持久同调模与双分级条形码。我们证明了双分级持久双重同调模与条形码的一个稳定性定理。

更新时间: 2025-09-08 10:23:48

领域: math.AT,cs.CG,cs.LG,math.CO,math.MG,Primary 57S12, 55N31, 57Z25, Secondary 13F55, 55U10

下载: http://arxiv.org/abs/2303.14694v3

PUUMA (Placental patch and whole-Uterus dual-branch U-Mamba-based Architecture): Functional MRI Prediction of Gestational Age at Birth and Preterm Risk

Preterm birth is a major cause of mortality and lifelong morbidity in childhood. Its complex and multifactorial origins limit the effectiveness of current clinical predictors and impede optimal care. In this study, a dual-branch deep learning architecture (PUUMA) was developed to predict gestational age (GA) at birth using T2* fetal MRI data from 295 pregnancies, encompassing a heterogeneous and imbalanced population. The model integrates both global whole-uterus and local placental features. Its performance was benchmarked against linear regression using cervical length measurements obtained by experienced clinicians from anatomical MRI and other Deep Learning architectures. The GA at birth predictions were assessed using mean absolute error. Accuracy, sensitivity, and specificity were used to assess preterm classification. Both the fully automated MRI-based pipeline and the cervical length regression achieved comparable mean absolute errors (3 weeks) and good sensitivity (0.67) for detecting preterm birth, despite pronounced class imbalance in the dataset. These results provide a proof of concept for automated prediction of GA at birth from functional MRI, and underscore the value of whole-uterus functional imaging in identifying at-risk pregnancies. Additionally, we demonstrate that manual, high-definition cervical length measurements derived from MRI, not currently routine in clinical practice, offer valuable predictive information. Future work will focus on expanding the cohort size and incorporating additional organ-specific imaging to improve generalisability and predictive performance.

Updated: 2025-09-08 10:23:43

标题: PUUMA(胎盘贴片和整个子宫双分支U-Mamba架构):功能性磁共振成像预测出生时胎龄和早产风险

摘要: 早产是儿童死亡和终身疾病的主要原因。其复杂且多因素的成因限制了当前临床预测指标的有效性,并阻碍了最佳护理。在这项研究中,我们开发了一个双分支深度学习架构(PUUMA),利用来自295例妊娠的T2*胎儿MRI数据预测出生时的孕周,该人群具有异质性且类别不平衡。该模型整合了全局全子宫特征和局部胎盘特征。其性能与基于经验丰富的临床医生从解剖MRI测得的宫颈长度的线性回归以及其他深度学习架构进行了基准比较。出生孕周预测使用平均绝对误差评估;准确度、敏感度和特异性用于评估早产分类。尽管数据集中存在明显的类别不平衡,基于MRI的全自动流水线和宫颈长度回归均实现了相当的平均绝对误差(3周)和良好的早产检测敏感度(0.67)。这些结果为从功能性MRI自动预测出生孕周提供了概念验证,并强调了全子宫功能成像在识别高危妊娠中的价值。此外,我们证明,目前尚非临床常规的、从MRI人工测得的高精度宫颈长度提供了有价值的预测信息。未来工作将侧重于扩大队列规模,并纳入其他器官特异性成像,以提高泛化能力和预测性能。

更新时间: 2025-09-08 10:23:43

领域: eess.IV,cs.LG

下载: http://arxiv.org/abs/2509.07042v1

Molecular Generative Adversarial Network with Multi-Property Optimization

Deep generative models, such as generative adversarial networks (GANs), have been employed for $de~novo$ molecular generation in drug discovery. Most prior studies have utilized reinforcement learning (RL) algorithms, particularly Monte Carlo tree search (MCTS), to handle the discrete nature of molecular representations in GANs. However, due to the inherent instability in training GANs and RL models, along with the high computational cost associated with MCTS sampling, MCTS RL-based GANs struggle to scale to large chemical databases. To tackle these challenges, this study introduces a novel GAN based on actor-critic RL with instant and global rewards, called InstGAN, to generate molecules at the token-level with multi-property optimization. Furthermore, maximized information entropy is leveraged to alleviate the mode collapse. The experimental results demonstrate that InstGAN outperforms other baselines, achieves comparable performance to state-of-the-art models, and efficiently generates molecules with multi-property optimization. The source code will be released upon acceptance of the paper.

Updated: 2025-09-08 10:22:05

标题: 分子生成对抗网络与多属性优化

摘要: 深度生成模型,如生成对抗网络(GANs),已被用于药物发现中的$de~novo$分子生成。大多数先前的研究利用强化学习(RL)算法,特别是蒙特卡洛树搜索(MCTS),来处理GANs中分子表示的离散性质。然而,由于GANs和RL模型训练中固有的不稳定性,以及与MCTS采样相关的高计算成本,基于MCTS RL的GANs难以扩展到大型化学数据库。为了解决这些挑战,本研究引入了一种基于带有即时和全局奖励的演员-评论家RL的新型GAN,称为InstGAN,用于在令牌级别进行多属性优化的分子生成。此外,最大化信息熵被利用来缓解模式崩溃。实验结果表明,InstGAN优于其他基线模型,达到了与最先进模型相媲美的性能,并有效地生成具有多属性优化的分子。源代码将在论文被接受后发布。

更新时间: 2025-09-08 10:22:05

领域: q-bio.BM,cs.AI,cs.LG

下载: http://arxiv.org/abs/2404.00081v2

QualityFM: a Multimodal Physiological Signal Foundation Model with Self-Distillation for Signal Quality Challenges in Critically Ill Patients

Photoplethysmogram (PPG) and electrocardiogram (ECG) are commonly recorded in the intensive care unit (ICU) and operating room (OR). However, the high incidence of poor, incomplete, and inconsistent signal quality can lead to false alarms or diagnostic inaccuracies. The methods explored so far suffer from limited generalizability, reliance on extensive labeled data, and poor cross-task transferability. To overcome these challenges, we introduce QualityFM, a novel multimodal foundation model for these physiological signals, designed to acquire a general-purpose understanding of signal quality. Our model is pre-trained on a large-scale dataset comprising over 21 million 30-second waveforms and 179,757 hours of data. Our approach involves a dual-track architecture that processes paired physiological signals of differing quality, leveraging a self-distillation strategy where an encoder for high-quality signals is used to guide the training of an encoder for low-quality signals. To efficiently handle long sequential signals and capture essential local quasi-periodic patterns, we integrate a windowed sparse attention mechanism within our Transformer-based model. Furthermore, a composite loss function, which combines direct distillation loss on encoder outputs with indirect reconstruction loss based on power and phase spectra, ensures the preservation of frequency-domain characteristics of the signals. We pre-train three models with varying parameter counts (9.6 M to 319 M) and demonstrate their efficacy and practical value through transfer learning on three distinct clinical tasks: false alarm of ventricular tachycardia detection, the identification of atrial fibrillation and the estimation of arterial blood pressure (ABP) from PPG and ECG signals.

Updated: 2025-09-08 10:20:56

标题: QualityFM:一种面向危重患者信号质量挑战的自蒸馏多模态生理信号基础模型

摘要: Photoplethysmogram (PPG)和心电图(ECG)通常在重症监护室(ICU)和手术室(OR)中记录。然而,由于信号质量差、不完整和不一致的高发生率,可能导致误报警或诊断不准确。迄今为止探索的方法受限于通用性有限、依赖大量标记数据以及跨任务转移能力差。为了克服这些挑战,我们引入了QualityFM,这是一个新颖的多模态基础模型,用于这些生理信号,旨在获得对信号质量的通用理解。我们的模型在一个大规模数据集上进行了预训练,该数据集包括超过2100万个30秒波形和179,757小时的数据。我们的方法采用双轨架构,处理不同质量的成对生理信号,利用自我蒸馏策略,其中用于高质量信号的编码器用来指导低质量信号的编码器的训练。为了有效处理长序列信号并捕获关键的局部准周期模式,我们在基于Transformer的模型中集成了一个窗口稀疏注意机制。此外,一个综合损失函数,结合了编码器输出的直接蒸馏损失和基于功率和相位谱的间接重构损失,确保了信号的频域特征的保存。我们对参数数量不同的三个模型进行了预训练(从9.6 M到319 M),并通过在三个不同的临床任务上的迁移学习展示了它们的有效性和实际价值:室性心动过速检测的误报警、房颤的识别和从PPG和ECG信号中估算动脉血压(ABP)。

更新时间: 2025-09-08 10:20:56

领域: cs.LG,cs.AI,J.3

下载: http://arxiv.org/abs/2509.06516v1
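The windowed sparse attention pattern can be sketched as a banded boolean mask: each position attends only to a local neighborhood, so attention cost grows linearly rather than quadratically in sequence length. Whether QualityFM adds global tokens or other refinements is not specified here; the band is the core idea.

```python
def windowed_attention_mask(seq_len, window):
    """Boolean mask where position i may attend to j only if |i - j| <= window.
    A banded pattern like this keeps long quasi-periodic waveforms tractable:
    allowed entries grow as O(seq_len * window) instead of O(seq_len ** 2)."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

mask = windowed_attention_mask(6, 1)
assert mask[0] == [True, True, False, False, False, False]
assert sum(sum(row) for row in mask) == 16  # 6 diagonal + 2 * 5 off-diagonal
```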

CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning

Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15\%), and efficiency (up to 30.62\%). Code is available at https://github.com/WNQzhu/CARFT.

Updated: 2025-09-08 10:20:38

标题: CARFT:通过基于注释的思维链强化微调对LLM推理进行对比学习的提升

摘要: 推理能力在大语言模型(LLMs)的广泛应用中发挥着显著的关键作用。为了增强LLMs的推理性能,已经提出了各种基于强化学习(RL)的微调方法,以解决仅通过监督微调(SFT)训练的LLMs的有限泛化能力。尽管它们有效,但存在两个主要限制阻碍了LLMs的进展。首先,传统的基于RL的方法忽略了注释的思维链(CoT),并融入了不稳定的推理路径抽样,通常导致模型崩溃、训练过程不稳定和性能亚优。其次,现有的SFT方法通常过分强调注释的CoT,可能导致性能下降,因为未充分利用潜在的CoT。在本文中,我们提出了一种基于对比学习和注释CoT的强化微调方法,即CARFT,以增强LLMs的推理性能并解决上述限制。具体来说,我们提出学习每个CoT的表示。基于这种表示,我们设计了新颖的对比信号来指导微调过程。我们的方法不仅充分利用了可用的注释CoT,还通过融入额外的无监督学习信号来稳定微调过程。我们进行了全面的实验和深入分析,使用三种基线方法、两种基础模型和两个数据集,展示了CARFT在鲁棒性、性能(最高提升10.15\%)和效率(最高提升30.62\%)方面的显著优势。代码可在https://github.com/WNQzhu/CARFT找到。

更新时间: 2025-09-08 10:20:38

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.15868v2

Synthesis of Sound and Precise Leakage Contracts for Open-Source RISC-V Processors

Leakage contracts have been proposed as a new security abstraction at the instruction set architecture level. Leakage contracts aim to capture the information that processors may leak via microarchitectural side channels. Recently, the first tools have emerged to verify whether a processor satisfies a given contract. However, coming up with a contract that is both sound and precise for a given processor is challenging, time-consuming, and error-prone, as it requires in-depth knowledge of the timing side channels introduced by microarchitectural optimizations. In this paper, we address this challenge by proposing LeaSyn, the first tool for automatically synthesizing leakage contracts that are both sound and precise for processor designs at register-transfer level. Starting from a user-provided contract template that captures the space of possible contracts, LeaSyn automatically constructs a contract, alternating between contract synthesis, which ensures precision based on an empirical characterization of the processor's leaks, and contract verification, which ensures soundness. Using LeaSyn, we automatically synthesize contracts for six open-source RISC-V CPUs for a variety of contract templates. Our experiments indicate that LeaSyn's contracts are sound and more precise (i.e., represent the actual leaks in the target processor more faithfully) than contracts constructed by existing approaches.
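
The alternation between contract synthesis and contract verification can be sketched as a CEGIS-flavoured loop. The `leaks` field, the candidate ordering, and the soundness oracle below are hypothetical stand-ins for LeaSyn's actual template and RTL verification interfaces.

```python
def synthesize_contract(candidates, is_sound, observed_leaks):
    """Walk a template's candidate contracts from most to least precise;
    return the first one that both covers the empirically observed leaks
    (precision side) and passes the soundness oracle (verification side)."""
    for contract in candidates:
        if observed_leaks <= contract["leaks"] and is_sound(contract):
            return contract
    raise ValueError("contract template space exhausted")
```

The real tool alternates empirical leak characterization with formal verification; this sketch only captures the "weaken until sound, but no further than needed" shape of that search.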

Updated: 2025-09-08 10:14:52

标题: 面向开源RISC-V处理器的合理且精确的泄漏契约合成

摘要: 泄漏契约已被提议作为指令集架构级别的新安全抽象。泄漏契约旨在捕捉处理器可能通过微体系结构侧信道泄漏的信息。最近,首个工具已出现,用于验证处理器是否满足给定的契约。然而,为给定处理器提出既合理又精确的契约是具有挑战性、耗时且容易出错的,因为它需要深入了解微体系结构优化引入的时序侧信道。 本文通过提出LeaSyn来解决这一挑战,这是第一个用于在寄存器传输级别为处理器设计自动生成既合理又精确的泄漏契约的工具。从捕捉可能契约空间的用户提供的契约模板开始,LeaSyn自动构建契约,交替进行契约合成(基于处理器泄漏的经验特征确保精确性)和契约验证(确保合理性)。 使用LeaSyn,我们为六个开源RISC-V CPU自动生成了各种契约模板的契约。我们的实验表明,LeaSyn的契约是合理的,并且比现有方法构建的契约更精确(即更忠实地刻画目标处理器中的实际泄漏)。

更新时间: 2025-09-08 10:14:52

领域: cs.CR

下载: http://arxiv.org/abs/2509.06509v1

On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data

The generative adversarial network (GAN) aims to approximate an unknown distribution via a parameterized neural network (NN). While GANs have been widely applied in reinforcement and semisupervised learning as well as computer vision tasks, selecting their parameters often needs an exhaustive search and only a few selection methods can be proved to be theoretically optimal. One of the most promising GAN variants is the Wasserstein GAN (WGAN). Prior work on optimal parameters for WGAN is limited to the linear-quadratic-Gaussian (LQG) setting, where the NN is linear and the data is Gaussian. In this paper, we focus on the characterization of optimal WGAN parameters beyond the LQG setting. We derive closed-form optimal parameters for one-dimensional WGANs when the NN has non-linear activation functions and the data is non-Gaussian. To extend this to high-dimensional WGANs, we adopt the sliced Wasserstein framework and replace the constraint on marginal distributions of the randomly projected data by a constraint on the joint distribution of the original (unprojected) data. We show that the linear generator can be asymptotically optimal for sliced WGAN with non-Gaussian data. Empirical studies show that our closed-form WGAN parameters have good convergence behavior with data under both Gaussian and Laplace distributions. Also, compared to the r principal component analysis (r-PCA) solution, our proposed solution for sliced WGAN can achieve the same performance while requiring less computational resources.
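
The sliced Wasserstein distance the abstract builds on is easy to state concretely: project both samples onto random directions and average the resulting one-dimensional Wasserstein distances, which for equal-size samples reduce to sorting. A minimal Monte-Carlo sketch (equal sample sizes are an assumed simplification):

```python
import math
import random

def wasserstein_1d(xs, ys):
    # For equal-size 1-D samples, W1 is the mean absolute difference
    # between order statistics.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def sliced_wasserstein(X, Y, n_proj=50, seed=0):
    """Monte-Carlo sliced Wasserstein distance between two point clouds."""
    rng = random.Random(seed)
    d = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        # Random direction, uniform on the unit sphere.
        theta = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        px = [sum(t * v for t, v in zip(theta, row)) for row in X]
        py = [sum(t * v for t, v in zip(theta, row)) for row in Y]
        total += wasserstein_1d(px, py)
    return total / n_proj
```

The paper's contribution concerns optimal generator parameters under this framework; the sketch only shows the distance the framework is built on.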

Updated: 2025-09-08 10:10:37

标题: 关于非高斯数据的经典和切片Wasserstein GAN的最优解

摘要: 生成对抗网络(GAN)旨在通过参数化神经网络(NN)来逼近未知分布。虽然GAN已被广泛应用于强化学习、半监督学习以及计算机视觉任务中,但选择它们的参数通常需要耗费大量搜索,只有少数选择方法可以被证明在理论上是最优的。最有前途的GAN变体之一是Wasserstein GAN(WGAN)。先前关于WGAN的最优参数的研究仅限于线性二次高斯(LQG)设置,其中NN是线性的,而数据是高斯的。本文侧重于LQG设置之外的最优WGAN参数的特征化。当NN具有非线性激活函数且数据为非高斯时,我们推导出一维WGAN的闭合形式最优参数。为了将其扩展到高维WGAN,我们采用了切片Wasserstein框架,并通过对原始(未投影)数据的联合分布施加约束来取代对随机投影数据的边缘分布的约束。我们展示了线性生成器对于带有非高斯数据的切片WGAN可以是渐近最优的。实证研究表明,我们的闭合形式WGAN参数在高斯和拉普拉斯分布下的数据上具有良好的收敛行为。此外,与r主成分分析(r-PCA)解决方案相比,我们提出的切片WGAN解决方案在实现相同性能的同时需要更少的计算资源。

更新时间: 2025-09-08 10:10:37

领域: cs.LG,cs.IT,math.IT,stat.ML

下载: http://arxiv.org/abs/2509.06505v1

When Code Crosses Borders: A Security-Centric Evaluation of LLM-based Code Translation

With the growing demand for cross-language codebase migration, evaluating LLMs' security implications in translation tasks has become critical. Existing evaluations primarily focus on syntactic or functional correctness at the function level, neglecting the critical dimension of security. To enable security evaluation, we construct STED (Security-centric Translation Evaluation Dataset), the first dataset specifically designed for evaluating the security implications of LLM-based code translation. It comprises 720 security-related code samples across five programming languages and nine high-impact CWE categories, sourced from CVE/NVD and manually verified for translation tasks. Our evaluation framework consists of two independent assessment modules: (1) rigorous evaluation by security researchers, and (2) automated analysis via LLM-as-a-judge. Together they evaluate three critical aspects: functional correctness, vulnerability preservation, and vulnerability introduction rates. Our large-scale evaluation of five state-of-the-art LLMs across 6,000 translation instances reveals significant security degradation, with 28.6-45% of translations introducing new vulnerabilities--particularly for web-related flaws like input validation, where LLMs show consistent weaknesses. Furthermore, we develop a Retrieval-Augmented Generation (RAG)-based mitigation strategy that reduces translation-induced vulnerabilities by 32.8%, showing the potential of knowledge-enhanced prompting.

Updated: 2025-09-08 10:08:48

标题: 代码跨越边界时:基于LLM的代码翻译的安全中心评估

摘要: 随着对跨语言代码库迁移的需求不断增长,评估LLMs在翻译任务中的安全影响变得至关重要。现有的评估主要集中在函数级别的句法或功能正确性,忽视了安全这一关键维度。为了实现安全评估,我们构建了STED(面向安全的翻译评估数据集),这是专门为评估基于LLM的代码翻译的安全影响而设计的第一个数据集。它包括来自CVE/NVD的720个安全相关代码样本,涵盖五种编程语言和九个高影响的CWE类别,并经过人工验证用于翻译任务。我们的评估框架包括两个独立的评估模块:(1)由安全研究人员进行严格评估,以及(2)LLM作为评判者进行自动分析。它们共同评估了三个关键方面:功能正确性、漏洞保留和漏洞引入率。我们对6,000个翻译实例进行了对五种最先进的LLMs的大规模评估,结果显示出显著的安全降级,28.6-45%的翻译引入了新的漏洞,特别是对于与Web相关的缺陷,如输入验证,在这方面LLMs表现出一致的弱点。此外,我们开发了一种基于检索增强生成(RAG)的缓解策略,将翻译引起的漏洞减少了32.8%,显示了知识增强提示的潜力。

更新时间: 2025-09-08 10:08:48

领域: cs.CR

下载: http://arxiv.org/abs/2509.06504v1

An AI system to help scientists write expert-level empirical software

The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a wide range of benchmarks. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Our method also produced state-of-the-art software for geospatial analysis, neural activity prediction in zebrafish, time series forecasting and numerical solution of integrals. By devising and implementing novel solutions to diverse tasks, the system represents a significant step towards accelerating scientific progress.
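
The LLM-plus-tree-search loop can be caricatured as best-first search over candidate solutions, with `mutate` standing in for LLM-proposed edits and `quality` for the task's quality metric. The beam size and greedy expansion rule below are illustrative choices, not the system's actual search policy.

```python
import random

def tree_search(root, mutate, quality, budget=200, beam=16, seed=0):
    """Greedy best-first sketch: repeatedly expand the best-scoring
    candidate, keep a bounded frontier, and track the best found."""
    rng = random.Random(seed)
    frontier = [(quality(root), root)]
    best_q, best = frontier[0]
    for _ in range(budget):
        _, node = max(frontier)          # expand the current best candidate
        child = mutate(node, rng)        # stand-in for an LLM-proposed edit
        cq = quality(child)
        frontier.append((cq, child))
        if cq > best_q:
            best_q, best = cq, child
        frontier = sorted(frontier, reverse=True)[:beam]  # beam-style pruning
    return best, best_q
```

With a toy quality metric such as `-(x - 3)**2` over numbers, the loop steadily climbs toward the optimum, mirroring how the system systematically improves a quality metric over candidate programs.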

Updated: 2025-09-08 10:08:36

标题: 一个人工智能系统,帮助科学家撰写专业水平的实证软件

摘要: 科学发现的周期经常受到支持计算实验的软件缓慢、手动创建的限制。为了解决这个问题,我们提出了一个人工智能系统,创建专家级科学软件,其目标是最大化一个质量指标。该系统使用大型语言模型(LLM)和树搜索(TS)系统地提高质量指标,并智能地导航可能解决方案的大空间。当它探索和整合外部来源的复杂研究思想时,系统实现了专家级的结果。树搜索的有效性在各种基准测试中得到了证明。在生物信息学领域,它发现了40种新的单细胞数据分析方法,在公开排行榜上表现优于顶尖的人类开发方法。在流行病学领域,它生成了14个模型,在预测COVID-19住院人数方面优于CDC集成模型和所有其他单个模型。我们的方法还为地理空间分析、斑马鱼神经活动预测、时间序列预测和积分的数值求解生成了最先进的软件。通过设计和实施多样任务的新颖解决方案,该系统代表了加速科学进展的重要一步。

更新时间: 2025-09-08 10:08:36

领域: cs.AI,q-bio.QM

下载: http://arxiv.org/abs/2509.06503v1

Evidential Transformers for Improved Image Retrieval

We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval. In this paper, we make several contributions to content-based image retrieval (CBIR). We incorporate probabilistic methods into image retrieval, achieving robust and reliable results, with evidential classification surpassing traditional training based on multiclass classification as a baseline for deep metric learning. Furthermore, we improve the state-of-the-art retrieval results on several datasets by leveraging the Global Context Vision Transformer (GC ViT) architecture. Our experimental results consistently demonstrate the reliability of our approach, setting a new benchmark in CBIR in all test settings on the Stanford Online Products (SOP) and CUB-200-2011 datasets.
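
One common way to realize evidential classification is the Dirichlet-based formulation from evidential deep learning, where per-class evidence yields belief masses and an explicit uncertainty mass. Whether the paper uses exactly this formulation is an assumption made here for illustration.

```python
def dirichlet_uncertainty(evidence):
    """Subjective-logic style readout: per-class evidence e_k gives
    Dirichlet parameters alpha_k = e_k + 1; the uncertainty mass is
    u = K / sum(alpha), and beliefs are b_k = e_k / sum(alpha)."""
    alphas = [e + 1.0 for e in evidence]
    S = sum(alphas)
    beliefs = [(a - 1.0) / S for a in alphas]
    u = len(alphas) / S
    return beliefs, u
```

Zero evidence yields maximal uncertainty (u = 1), and accumulating evidence for a class shrinks u while raising that class's belief, which is the property that makes such outputs useful for uncertainty-driven retrieval.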

Updated: 2025-09-08 10:01:46

标题: 证据变换器用于改进图像检索

摘要: 我们介绍了证据变换器(Evidential Transformer),这是一种基于不确定性驱动的变换器模型,用于改进和提高图像检索的鲁棒性。在本文中,我们对基于内容的图像检索(CBIR)做出了几点贡献。我们将概率方法融入图像检索中,实现了稳健可靠的结果,证据分类超越传统的基于多类别分类的深度度量学习基线。此外,通过利用全局上下文视觉变换器(GC ViT)架构,我们改进了几个数据集上的最新检索结果。我们的实验结果始终显示了我们方法的可靠性,在斯坦福在线产品(SOP)和CUB-200-2011数据集的所有测试设置中,我们设立了基于CBIR的新基准。

更新时间: 2025-09-08 10:01:46

领域: cs.CV,cs.IR,cs.LG

下载: http://arxiv.org/abs/2409.01082v2

Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces \texttt{BFS-Prover-V2}, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. \texttt{BFS-Prover-V2} achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

Updated: 2025-09-08 09:54:18

标题: 扩展多轮离策略强化学习和多智能体树搜索以提升LLM步骤证明器

摘要: 将大型语言模型(LLMs)集成到自动定理证明中显示出巨大的潜力,但基本上受到在训练时强化学习(RL)和推理时计算规模上的挑战的限制。本文介绍了\texttt{BFS-Prover-V2},一个旨在解决这一双重规模问题的系统。我们提出了两个主要创新。第一个是一种新颖的多轮离策略RL框架,用于在训练时持续改进LLM步骤证明器的性能。这个框架受到AlphaZero原则的启发,利用一个多阶段专家迭代管道,具有自适应策略级数据过滤和定期重新训练,以克服通常限制LLM代理长期RL的性能平台。第二个创新是一个增强型规划器多代理搜索架构,用于推理时扩展推理能力。该架构利用一个通用推理模型作为高层规划器,将复杂定理迭代地分解为一系列更简单的子目标。这种分层方法大大减少了搜索空间,通过利用共享证明缓存,使一组并行证明代理能够有效协作。我们证明了这种双重扩展方法在已建立的正式数学基准测试中取得了最先进的结果。\texttt{BFS-Prover-V2}在MiniF2F和ProofNet测试集上分别达到了95.08%和41.4%。虽然在形式数学领域展示,但本文提出的RL和推理技术具有更广泛的兴趣,并可应用于其他需要长期视野多轮推理和复杂搜索的领域。

更新时间: 2025-09-08 09:54:18

领域: cs.AI

下载: http://arxiv.org/abs/2509.06493v1

Byzantine-Robust Federated Learning Using Generative Adversarial Networks

Federated learning (FL) enables collaborative model training across distributed clients without sharing raw data, but its robustness is threatened by Byzantine behaviors such as data and model poisoning. Existing defenses face fundamental limitations: robust aggregation rules incur error lower bounds that grow with client heterogeneity, while detection-based methods often rely on heuristics (e.g., a fixed number of malicious clients) or require trusted external datasets for validation. We present a defense framework that addresses these challenges by leveraging a conditional generative adversarial network (cGAN) at the server to synthesize representative data for validating client updates. This approach eliminates reliance on external datasets, adapts to diverse attack strategies, and integrates seamlessly into standard FL workflows. Extensive experiments on benchmark datasets demonstrate that our framework accurately distinguishes malicious from benign clients while maintaining overall model accuracy. Beyond Byzantine robustness, we also examine the representativeness of synthesized data, computational costs of cGAN training, and the transparency and scalability of our approach.
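
The validation step can be sketched abstractly: score each client update by its loss on server-side synthetic data and aggregate only the best-scoring fraction. The fixed keep-fraction and plain averaging below are simplifying assumptions, not the paper's exact rule.

```python
def filter_updates(updates, val_loss, keep_frac=0.8):
    """Keep the client updates with the lowest loss on server-synthesized
    validation data, then average the survivors (FedAvg over accepted
    clients). `val_loss` stands in for evaluating an update on cGAN samples."""
    kept = sorted(updates, key=val_loss)[:max(1, int(len(updates) * keep_frac))]
    dim = len(kept[0])
    return [sum(u[i] for u in kept) / len(kept) for i in range(dim)]
```

A poisoned update that scores badly on the synthetic validation set is simply excluded before aggregation, which is the mechanism that removes the need for a trusted external dataset.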

Updated: 2025-09-08 09:53:05

标题: 使用生成对抗网络实现拜占庭稳健的联邦学习

摘要: 联邦学习(FL)实现了分布式客户端之间的协作模型训练,而不共享原始数据,但其稳健性受到拜占庭行为(如数据和模型投毒)的威胁。现有的防御面临根本限制:稳健聚合规则的误差下界会随客户端异质性增长,而基于检测的方法通常依赖于启发式假设(例如,固定数量的恶意客户端)或者需要可信的外部数据集进行验证。我们提出了一个防御框架,通过在服务器端利用条件生成对抗网络(cGAN)合成代表性数据来验证客户端更新,从而解决这些挑战。这种方法消除了对外部数据集的依赖,适应各种攻击策略,并无缝集成到标准FL工作流程中。对基准数据集的大量实验表明,我们的框架能够准确区分恶意客户端和良性客户端,同时保持整体模型准确性。除了拜占庭稳健性,我们还研究了合成数据的代表性、cGAN训练的计算成本,以及我们方法的透明度和可扩展性。

更新时间: 2025-09-08 09:53:05

领域: cs.CR,cs.AI,cs.DC

下载: http://arxiv.org/abs/2503.20884v2

MORSE: Multi-Objective Reinforcement Learning via Strategy Evolution for Supply Chain Optimization

In supply chain management, decision-making often involves balancing multiple conflicting objectives, such as cost reduction, service level improvement, and environmental sustainability. Traditional multi-objective optimization methods, such as linear programming and evolutionary algorithms, struggle to adapt in real-time to the dynamic nature of supply chains. In this paper, we propose an approach that combines Reinforcement Learning (RL) and Multi-Objective Evolutionary Algorithms (MOEAs) to address these challenges for dynamic multi-objective optimization under uncertainty. Our method leverages MOEAs to search the parameter space of policy neural networks, generating a Pareto front of policies. This provides decision-makers with a diverse population of policies that can be dynamically switched based on the current system objectives, ensuring flexibility and adaptability in real-time decision-making. We also introduce Conditional Value-at-Risk (CVaR) to incorporate risk-sensitive decision-making, enhancing resilience in uncertain environments. We demonstrate the effectiveness of our approach through case studies, showcasing its ability to respond to supply chain dynamics and outperforming state-of-the-art methods in an inventory management case study. The proposed strategy not only improves decision-making efficiency but also offers a more robust framework for managing uncertainty and optimizing performance in supply chains.
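
The CVaR criterion used for risk-sensitive decisions has a simple empirical form: the mean of the worst alpha-fraction of outcomes. A minimal sketch (treating lower returns as worse):

```python
def cvar(returns, alpha=0.1):
    """Empirical Conditional Value-at-Risk: the mean of the worst
    alpha-fraction of outcomes (lower return = worse here)."""
    k = max(1, int(len(returns) * alpha))
    worst = sorted(returns)[:k]
    return sum(worst) / k
```

Optimizing policies against a score like this, rather than the plain mean, is what makes the resulting supply-chain decisions sensitive to tail risk.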

Updated: 2025-09-08 09:51:24

标题: MORSE:通过策略演化实现供应链优化的多目标强化学习

摘要: 在供应链管理中,决策经常涉及平衡多个相互冲突的目标,如成本降低、服务水平提高和环境可持续性。传统的多目标优化方法,如线性规划和进化算法,往往难以实时适应供应链的动态特性。本文提出了一种结合强化学习(RL)和多目标进化算法(MOEAs)的方法,以应对不确定性下动态多目标优化的挑战。我们的方法利用MOEAs搜索策略神经网络的参数空间,生成策略的帕累托前沿。这为决策者提供了一个多样化的策略种群,可以根据当前系统目标动态切换,确保实时决策的灵活性和适应性。我们还引入了条件风险价值(CVaR)来实现风险敏感的决策,增强在不确定环境下的韧性。通过案例研究,我们展示了该方法的有效性,展示了它对供应链动态的响应能力,并在库存管理案例研究中胜过了最先进的方法。所提出的策略不仅提高了决策效率,还为在供应链中管理不确定性和优化绩效提供了更加稳健的框架。

更新时间: 2025-09-08 09:51:24

领域: cs.AI

下载: http://arxiv.org/abs/2509.06490v1

A machine-learned expression for the excess Gibbs energy

The excess Gibbs energy plays a central role in chemical engineering and chemistry, providing a basis for modeling the thermodynamic properties of liquid mixtures. Predicting the excess Gibbs energy of multi-component mixtures solely from the molecular structures of their components is a long-standing challenge. In this work, we address this challenge by integrating physical laws as hard constraints within a flexible neural network. The resulting model, HANNA, was trained end-to-end on an extensive experimental dataset for binary mixtures from the Dortmund Data Bank, guaranteeing thermodynamically consistent predictions. A novel surrogate solver developed in this work enabled the inclusion of liquid-liquid equilibrium data in the training process. Furthermore, a geometric projection method was applied to enable robust extrapolations to multi-component mixtures, without requiring additional parameters. We demonstrate that HANNA delivers excellent predictions, clearly outperforming state-of-the-art benchmark methods in accuracy and scope. The trained model and corresponding code are openly available, and an interactive interface is provided on our website, MLPROP.
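
The "physical laws as hard constraints" can be made concrete with the standard binary-mixture relations; these are textbook thermodynamics, stated here for orientation rather than taken from the paper itself:

```latex
% Excess Gibbs energy of a binary mixture in terms of activity coefficients:
\frac{g^E}{RT} = x_1 \ln \gamma_1 + x_2 \ln \gamma_2
% Any thermodynamically consistent model must also satisfy the
% Gibbs--Duhem relation at constant temperature and pressure:
x_1 \, \mathrm{d}\ln \gamma_1 + x_2 \, \mathrm{d}\ln \gamma_2 = 0
```

Building such relations into the network as hard constraints is what guarantees that every prediction is thermodynamically consistent by construction.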

Updated: 2025-09-08 09:47:03

标题: 一个机器学习的表达式用于过剩吉布斯能量

摘要: 过量吉布斯能在化学工程和化学中起着核心作用,为液体混合物的热力学性质建模提供了基础。仅从组分的分子结构预测多组分混合物的过量吉布斯能是一个长期存在的挑战。在这项工作中,我们通过在灵活的神经网络中将物理定律作为硬约束来解决这一挑战。得到的模型HANNA在来自多特蒙德数据库(Dortmund Data Bank)的大量二元混合物实验数据集上进行了端到端训练,保证了热力学上一致的预测。在这项工作中开发的一种新颖的代理求解器使得可以在训练过程中包含液-液平衡数据。此外,应用了几何投影方法,使得可以在不需要额外参数的情况下对多组分混合物进行稳健的外推。我们证明HANNA提供了出色的预测,在准确性和适用范围上明显优于最先进的基准方法。训练好的模型和相应的代码是公开可用的,我们的网站MLPROP上提供了交互界面。

更新时间: 2025-09-08 09:47:03

领域: cs.LG,cs.CE

下载: http://arxiv.org/abs/2509.06484v1

DyC-STG: Dynamic Causal Spatio-Temporal Graph Network for Real-time Data Credibility Analysis in IoT

The widespread deployment of Internet of Things (IoT) sensors generates vast spatio-temporal data streams, but ensuring data credibility is a critical yet unsolved challenge for applications like smart homes. While spatio-temporal graph (STG) models are a leading paradigm for such data, they often fall short in dynamic, human-centric environments due to two fundamental limitations: (1) their reliance on static graph topologies, which fail to capture physical, event-driven dynamics, and (2) their tendency to confuse spurious correlations with true causality, undermining robustness in human-centric environments. To address these gaps, we propose the Dynamic Causal Spatio-Temporal Graph Network (DyC-STG), a novel framework designed for real-time data credibility analysis in IoT. Our framework features two synergistic contributions: an event-driven dynamic graph module that adapts the graph topology in real-time to reflect physical state changes, and a causal reasoning module to distill causally-aware representations by strictly enforcing temporal precedence. To facilitate research in this domain, we release two new real-world datasets. Comprehensive experiments show that DyC-STG establishes a new state-of-the-art, outperforming the strongest baselines by 1.4 percentage points and achieving an F1-Score of up to 0.930.

Updated: 2025-09-08 09:46:58

标题: DyC-STG:用于物联网实时数据可信度分析的动态因果时空图网络

摘要: 物联网(IoT)传感器的广泛传播产生了大量时空数据流,但确保数据的可信度对于智能家居等应用来说是一个至关重要但尚未解决的挑战。虽然时空图(STG)模型是这类数据的主要范式,但由于两个基本限制,它们在动态、以人为中心的环境中常常表现不佳:(1)它们依赖于静态图拓扑,无法捕捉物理、事件驱动的动态特性;(2)它们倾向于混淆虚假相关性与真实因果关系,从而削弱在以人为中心的环境中的鲁棒性。为了解决这些差距,我们提出了动态因果时空图网络(DyC-STG),这是一个设计用于物联网中实时数据可信度分析的新框架。我们的框架具有两个协同贡献:一个事件驱动的动态图模块,可以实时调整图拓扑以反映物理状态变化,以及一个因果推理模块,通过严格强制时间优先关系来提炼具有因果意识的表示。为了促进这一领域的研究,我们发布了两个新的真实世界数据集。全面的实验结果显示,DyC-STG建立了一个新的最先进水平,比最强基线表现高出1.4个百分点,实现了高达0.930的F1分数。

更新时间: 2025-09-08 09:46:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.06483v1

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

To enhance the efficiency of GUI agents on various platforms like smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., API, deep links) is emerging as a promising direction. However, a framework for systematically benchmarking these hybrid agents is still underexplored. To take the first step in bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent's capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep-links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations, but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents.

Updated: 2025-09-08 09:43:48

标题: MAS-Bench:面向快捷方式增强的混合移动GUI代理的统一基准

摘要: 为了增强GUI代理在智能手机和计算机等各种平台上的效率,一种将灵活的GUI操作与高效的快捷方式(例如API、深度链接)相结合的混合范式正逐渐成为一个有前景的方向。然而,对于系统地评估这些混合代理的框架仍未得到充分探索。为了迈出弥合这一差距的第一步,我们引入了MAS-Bench,这是一个首创性的评估GUI-快捷方式混合代理的基准,重点关注移动领域。除了仅使用预定义的快捷方式外,MAS-Bench还评估代理通过发现和创建可复用、低成本的工作流程来自主生成快捷方式的能力。它包括11个真实应用程序中的139项复杂任务、包含88个预定义快捷方式(API、深度链接、RPA脚本)的知识库,以及7个评估指标。这些任务设计为仅通过GUI操作即可解决,但通过智能地嵌入快捷方式可以显著加速。实验表明,混合代理的成功率和效率显著高于仅使用GUI的对应代理。这一结果也证明了我们评估代理快捷方式生成能力的方法的有效性。MAS-Bench填补了一个关键的评估空白,为未来在创建更高效、更稳健的智能代理方面的进展提供了基础平台。

更新时间: 2025-09-08 09:43:48

领域: cs.AI

下载: http://arxiv.org/abs/2509.06477v1

Explained, yet misunderstood: How AI Literacy shapes HR Managers' interpretation of User Interfaces in Recruiting Recommender Systems

AI-based recommender systems increasingly influence recruitment decisions. Thus, transparency and responsible adoption in Human Resource Management (HRM) are critical. This study examines how HR managers' AI literacy influences their subjective perception and objective understanding of explainable AI (XAI) elements in recruiting recommender dashboards. In an online experiment, 410 German-based HR managers compared baseline dashboards to versions enriched with three XAI styles: important features, counterfactuals, and model criteria. Our results show that the dashboards used in practice do not explain AI results and even keep AI elements opaque. However, while adding XAI features improves subjective perceptions of helpfulness and trust among users with moderate or high AI literacy, it does not increase their objective understanding. It may even reduce accurate understanding, especially with complex explanations. Only overlays of important features significantly aided the interpretations of high-literacy users. Our findings highlight that the benefits of XAI in recruitment depend on users' AI literacy, emphasizing the need for tailored explanation strategies and targeted literacy training in HRM to ensure fair, transparent, and effective adoption of AI.

Updated: 2025-09-08 09:40:49

标题: 解释了,却被误解:AI素养如何塑造人力资源经理对招聘推荐系统用户界面的解读

摘要: 基于人工智能的推荐系统越来越影响招聘决策。因此,在人力资源管理(HRM)中,透明度和负责任的采用至关重要。本研究考察了HR经理的人工智能素养如何影响他们对招聘推荐仪表板中可解释人工智能(XAI)元素的主观感知和客观理解。在一项在线实验中,410名位于德国的HR经理比较了基线仪表板和增加了三种XAI样式的版本:重要特征、反事实和模型标准。我们的结果显示,实际使用的仪表板没有解释人工智能结果,甚至保持人工智能元素不透明。然而,尽管添加XAI特性改善了中等或高AI素养用户对有用性和信任度的主观感知,但并没有增加他们的客观理解,甚至可能降低准确理解,尤其是在解释复杂的情况下。只有重要特征的覆盖层显著帮助了高素养用户的解读。我们的发现强调,招聘中XAI的好处取决于用户的人工智能素养,因此需要在HRM中制定量身定制的解释策略和有针对性的素养培训,以确保公平、透明和有效地采用人工智能。

更新时间: 2025-09-08 09:40:49

领域: cs.HC,cs.AI,cs.CY,A.0; H.5.2; I.2; J.1; K.4.2; K.4.3

下载: http://arxiv.org/abs/2509.06475v1

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models.

Updated: 2025-09-08 09:39:12

标题: MM-Spatial:在多模式LLMs中探索3D空间理解

摘要: 多模态大型语言模型(MLLMs)在二维视觉理解方面表现出色,但在三维空间推理方面仍然存在局限。在这项工作中,我们利用带有开放式标注的大规模高质量3D场景数据,引入了1)一个新颖的监督微调数据集和2)一个新的评估基准,重点放在室内场景上。我们的Cubify Anything VQA(CA-VQA)数据涵盖了包括空间关系预测、度量大小和距离估计以及3D定位在内的各种空间任务。我们展示了CA-VQA使我们能够训练MM-Spatial,这是一个强大的通用MLLM,同时在包括我们自己提出的基准在内的3D空间理解基准上实现了最先进的性能。我们展示了整合度量深度和多视角输入(在CA-VQA中提供)可以进一步提高3D理解,并证明仅靠数据就能使我们的模型达到与专门的单目深度估计模型相媲美的深度感知能力。

更新时间: 2025-09-08 09:39:12

领域: cs.CV,cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.13111v2

Fairness-Aware Data Augmentation for Cardiac MRI using Text-Conditioned Diffusion Models

While deep learning holds great promise for disease diagnosis and prognosis in cardiac magnetic resonance imaging, its progress is often constrained by highly imbalanced and biased training datasets. To address this issue, we propose a method to alleviate imbalances inherent in datasets through the generation of synthetic data based on sensitive attributes such as sex, age, body mass index (BMI), and health condition. We adopt ControlNet based on a denoising diffusion probabilistic model to condition on text assembled from patient metadata and cardiac geometry derived from segmentation masks. We assess our method using a large-cohort study from the UK Biobank by evaluating the realism of the generated images using established quantitative metrics. Furthermore, we conduct a downstream classification task aimed at debiasing a classifier by rectifying imbalances within underrepresented groups through synthetically generated samples. Our experiments demonstrate the effectiveness of the proposed approach in mitigating dataset imbalances, such as the scarcity of diagnosed female patients or individuals with normal BMI level suffering from heart failure. This work represents a major step towards the adoption of synthetic data for the development of fair and generalizable models for medical classification tasks. Notably, we conduct all our experiments using a single, consumer-level GPU to highlight the feasibility of our approach within resource-constrained environments. Our code is available at https://github.com/faildeny/debiasing-cardiac-mri.

Updated: 2025-09-08 09:37:31

标题: 使用文本条件扩散模型进行心脏MRI的公平感知数据增强

摘要: 深度学习在心脏磁共振成像的疾病诊断和预后方面具有巨大潜力,但其进展常常受到高度不平衡和有偏训练数据集的限制。为了解决这一问题,我们提出了一种方法,通过基于敏感属性(如性别、年龄、体重指数(BMI)和健康状况)生成合成数据来减轻数据集固有的不平衡。我们采用基于去噪扩散概率模型的ControlNet,以由患者元数据组装的文本和从分割掩模导出的心脏几何形状作为条件。我们使用来自英国生物样本库(UK Biobank)的大型队列研究,采用已建立的定量指标评估生成图像的逼真程度。此外,我们进行了一个下游分类任务,通过合成生成的样本纠正代表性不足群体中的不平衡,从而消除分类器的偏差。我们的实验表明,所提出的方法在减轻数据集不平衡方面是有效的,例如确诊的女性患者或BMI正常但患有心力衰竭的个体的稀缺性。这项工作代表了朝着采用合成数据开发公平且可推广的医学分类模型迈出的重要一步。值得注意的是,我们使用一台消费级GPU进行所有实验,以突出我们的方法在资源受限环境中的可行性。我们的代码可在https://github.com/faildeny/debiasing-cardiac-mri 上找到。

更新时间: 2025-09-08 09:37:31

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2403.19508v2

A Quantum Bagging Algorithm with Unsupervised Base Learners for Label Corrupted Datasets

The development of noise-resilient quantum machine learning (QML) algorithms is critical in the noisy intermediate-scale quantum (NISQ) era. In this work, we propose a quantum bagging framework that uses QMeans clustering as the base learner to reduce prediction variance and enhance robustness to label noise. Unlike bagging frameworks built on supervised learners, our method leverages the unsupervised nature of QMeans, combined with quantum bootstrapping via QRAM-based sampling and bagging aggregation through majority voting. Through extensive simulations on both noisy classification and regression tasks, we demonstrate that the proposed quantum bagging algorithm performs comparably to its classical counterpart using KMeans while exhibiting greater resilience to label corruption than supervised bagging methods. This highlights the potential of unsupervised quantum bagging in learning from unreliable data.
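
A classical analogue of the proposed scheme is easy to sketch: bootstrap-resample the data, fit an unsupervised 2-means base learner per resample, name each cluster by the majority (possibly corrupted) label of its members, and aggregate by majority vote. Everything below (1-D data, deterministic initialization) is a simplification of the quantum algorithm:

```python
import random
from collections import Counter

def kmeans_1d(xs, iters=25):
    # Plain two-cluster Lloyd's algorithm on scalars; a classical
    # stand-in for the QMeans base learner.
    centers = [min(xs), max(xs)]  # deterministic, spread-out initialization
    for _ in range(iters):
        clusters = [[], []]
        for x in xs:
            clusters[0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def bagged_predict(points, noisy_labels, query, n_learners=9, seed=1):
    rng = random.Random(seed)
    votes = []
    n = len(points)
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        centers = kmeans_1d([points[i] for i in idx])
        nearest = lambda x: 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
        # Name each cluster by the majority (possibly corrupted) label
        # among its bootstrap members.
        names = {}
        for j in (0, 1):
            members = [noisy_labels[i] for i in idx if nearest(points[i]) == j]
            if members:
                names[j] = Counter(members).most_common(1)[0][0]
        votes.append(names.get(nearest(query)))
    # Aggregate the unsupervised base learners by majority vote.
    return Counter(v for v in votes if v is not None).most_common(1)[0][0]
```

Because each learner only uses labels to name its clusters, a few corrupted labels get outvoted twice: once inside each cluster and once across the ensemble, which is the intuition behind the claimed resilience.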

Updated: 2025-09-08 09:34:33

标题: 一个使用无监督基学习器的量子装袋算法,用于标签损坏的数据集

摘要: 在噪声中等规模量子(NISQ)时代,发展抗噪声的量子机器学习(QML)算法至关重要。在这项工作中,我们提出了一种量子装袋框架,该框架使用QMeans聚类作为基学习器,以减少预测方差并增强对标签噪声的稳健性。与建立在监督学习器上的装袋框架不同,我们的方法利用QMeans的无监督特性,结合通过基于QRAM采样实现的量子自助法(bootstrapping)以及通过多数投票进行的装袋聚合。通过对含噪分类和回归任务的大量模拟,我们证明了所提出的量子装袋算法与使用KMeans的经典对应方法性能相当,同时比监督装袋方法对标签损坏更具韧性。这突显了无监督量子装袋在从不可靠数据中学习方面的潜力。

更新时间: 2025-09-08 09:34:33

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2509.07040v1

Several Performance Bounds on Decentralized Online Optimization are Highly Conservative and Potentially Misleading

We analyze Decentralized Online Optimization algorithms using the Performance Estimation Problem approach, which allows us to automatically compute the exact worst-case performance of optimization algorithms. Our analysis shows that several available performance guarantees are very conservative, sometimes by multiple orders of magnitude, and can lead to misguided choices of algorithm. Moreover, at least in terms of worst-case performance, some algorithms appear not to benefit from inter-agent communications for a significant period of time. We show how to improve classical methods by tuning their step-sizes, and find that we can save up to 20% on their actual worst-case performance regret.

Updated: 2025-09-08 09:28:36

标题: 分散式在线优化的若干性能界过于保守且可能具有误导性

摘要: 我们使用性能估计问题方法分析了分散式在线优化算法,该方法可以自动计算优化算法的确切最坏情况性能。我们的分析表明,一些现有的性能保证非常保守,有时相差几个数量级,并可能误导算法的选择。此外,至少在最坏情况性能方面,一些算法似乎在相当长的时间内无法从智能体间通信中受益。我们展示了如何通过调整步长来改进经典方法,并发现可以将它们的实际最坏情况遗憾降低多达20%。

更新时间: 2025-09-08 09:28:36

领域: math.OC,cs.AI,cs.DC,cs.MA

下载: http://arxiv.org/abs/2509.06466v1

Confounding is a Pervasive Problem in Real World Recommender Systems

Unobserved confounding arises when an unmeasured feature influences both the treatment and the outcome, leading to biased causal effect estimates. This issue undermines observational studies in fields like economics, medicine, ecology or epidemiology. Recommender systems leveraging fully observed data seem not to be vulnerable to this problem. However, many standard practices in recommender systems result in observed features being ignored, resulting in effectively the same problem. This paper will show that numerous common practices such as feature engineering, A/B testing and modularization can in fact introduce confounding into recommendation systems and hamper their performance. Several illustrations of the phenomena are provided, supported by simulation studies with practical suggestions about how practitioners may reduce or avoid the effects of confounding in real systems.
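
The core phenomenon admits a small simulation: a hidden user trait drives both item exposure and the rating, so a naive difference of means overestimates the true effect, while stratifying on the trait recovers it. The functional forms and coefficients below are arbitrary illustrative choices:

```python
import random

def treatment_effect_estimate(adjust_for_u, n=20000, seed=0):
    """Toy recommender-style simulation: an unmeasured user trait U drives
    both item exposure T and the rating Y. The true effect of T on Y is 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random()                  # hidden confounder
        t = 1 if rng.random() < u else 0  # trait drives exposure
        y = 1.0 * t + 2.0 * u + rng.gauss(0, 0.1)
        rows.append((u, t, y))
    if not adjust_for_u:
        # Naive difference of means, ignoring U -> biased upward.
        y1 = [y for _, t, y in rows if t == 1]
        y0 = [y for _, t, y in rows if t == 0]
        return sum(y1) / len(y1) - sum(y0) / len(y0)
    # Stratify on deciles of U to block the confounding path.
    diffs = []
    for b in range(10):
        y1 = [y for u, t, y in rows if int(u * 10) == b and t == 1]
        y0 = [y for u, t, y in rows if int(u * 10) == b and t == 0]
        if y1 and y0:
            diffs.append(sum(y1) / len(y1) - sum(y0) / len(y0))
    return sum(diffs) / len(diffs)
```

Dropping U from the analysis, as feature engineering or modularization might do in practice, is exactly the mechanism by which fully observed data ends up behaving like an unobserved confounder.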

Updated: 2025-09-08 09:26:57

标题: 混杂是现实世界推荐系统中一个普遍存在的问题

摘要: 未观察到的混杂是指一个未被测量的特征同时影响处理(treatment)和结果,导致有偏的因果效应估计。这个问题削弱了经济学、医学、生态学或流行病学等领域的观察性研究。利用完全观察数据的推荐系统似乎不容易受到这个问题的影响。然而,推荐系统中的许多标准做法会导致已观察到的特征被忽视,从而实际上导致同样的问题。本文将展示许多常见做法,如特征工程、A/B测试和模块化,实际上可能将混杂引入推荐系统并损害其性能。通过模拟研究提供了若干现象的例证,并提出了从业者如何在真实系统中减少或避免混杂影响的实用建议。

更新时间: 2025-09-08 09:26:57

领域: cs.LG,cs.IR,stat.ML

下载: http://arxiv.org/abs/2508.10479v2

CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction

Antibody binding site prediction plays a pivotal role in computational immunology and therapeutic antibody design. Existing sequence- or structure-based methods rely on single-view features and fail to identify antibody-specific binding sites on the antigens, a dual limitation in representation and prediction. In this paper, we propose CAME-AB, a novel Cross-modality Attention framework with a Mixture-of-Experts (MoE) backbone for robust antibody binding site prediction. CAME-AB integrates five biologically grounded modalities, including raw amino acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and GCN-refined biochemical graphs, into a unified multimodal representation. To enhance adaptive cross-modal reasoning, we propose an adaptive modality fusion module that learns to dynamically weight each modality based on its global relevance and input-specific contribution. A Transformer encoder combined with an MoE module further promotes feature specialization and capacity expansion. We additionally incorporate a supervised contrastive learning objective to explicitly shape the latent space geometry, encouraging intra-class compactness and inter-class separability. To improve optimization stability and generalization, we apply stochastic weight averaging during training. Extensive experiments on benchmark antibody-antigen datasets demonstrate that CAME-AB consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies further validate the effectiveness of each architectural component and the benefit of multimodal feature integration. The model implementation details and the codes are available on https://anonymous.4open.science/r/CAME-AB-C525
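
The adaptive modality fusion module can be sketched as softmax gating over per-modality scores; in the real model the gate scores would be learned from global relevance and the input itself, and the plain weighted sum below is an illustrative simplification:

```python
import math

def adaptive_fusion(modalities, gate_scores):
    """Softmax over gate scores yields per-modality weights; the fused
    representation is the weighted sum of the modality embeddings."""
    m = max(gate_scores)
    exps = [math.exp(s - m) for s in gate_scores]  # numerically stable softmax
    Z = sum(exps)
    weights = [e / Z for e in exps]
    dim = len(modalities[0])
    fused = [sum(w * vec[i] for w, vec in zip(weights, modalities))
             for i in range(dim)]
    return fused, weights
```

Raising one modality's gate score shifts nearly all the weight onto it, which is how such a module can down-weight an uninformative view (say, a noisy structural feature) for a particular input.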

Updated: 2025-09-08 09:24:09

Domains: cs.LG,cs.CE,q-bio.BM

Download: http://arxiv.org/abs/2509.06465v1

A Minimum Description Length Approach to Regularization in Neural Networks

State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ($L_1$, $L_2$, or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.
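
The MDL principle applied here can be illustrated with a toy two-part code, L(M) + L(D|M): a fixed bit cost per nonzero weight stands in for model description length, and negative log-likelihood in bits stands in for data fit. The coding scheme is a simplifying assumption, not the paper's exact formulation:

```python
def mdl_score(weights, neg_log_likelihood_bits, bits_per_weight=32):
    """Two-part MDL: L(M) + L(D|M).

    L(M): description length of the model, here the cost of encoding
    each nonzero weight at a fixed precision (a toy assumption).
    L(D|M): code length of the data given the model, i.e. the
    negative log-likelihood measured in bits.
    """
    model_bits = sum(bits_per_weight for w in weights if w != 0)
    return model_bits + neg_log_likelihood_bits

# A sparse, perfect solution with zero residual error...
perfect = mdl_score([1.0, 0.0, -1.0, 0.0], neg_log_likelihood_bits=0.0)
# ...beats a dense approximation that still pays a data-fit cost,
# so MDL selects the perfect solution over the approximation.
approx = mdl_score([0.9, 0.1, -0.8, 0.2], neg_log_likelihood_bits=40.0)
assert perfect < approx
```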

Updated: 2025-09-08 09:23:08

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2505.13398v2

Accelerate Scaling of LLM Alignment via Quantifying the Coverage and Depth of Instruction Set

With the growing demand for applying large language models to downstream tasks, improving model alignment performance and efficiency has become crucial. Such a process involves selecting informative instructions from a candidate pool. However, due to the complexity of instruction set distributions, the key factors driving the performance of aligned models remain unclear. As a result, current instruction set refinement methods fail to improve performance as the instruction pool expands continuously. To address this issue, we first investigate the key factors that influence the relationship between instruction dataset distribution and aligned model performance. Based on these insights, we propose a novel instruction data selection method. We identify that the depth of instructions and the coverage of the semantic space are the crucial factors determining downstream performance, which could explain over 70% of the model loss on the development set. We then design an instruction selection algorithm to simultaneously maximize the depth and semantic coverage of the selected instructions. Experimental results demonstrate that, compared to state-of-the-art baseline methods, it can sustainably improve model performance at a faster pace and thus achieve "Accelerated Scaling".
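
A greedy selection that jointly rewards instruction depth and semantic coverage might look like the following sketch; the linear trade-off `alpha` and the nearest-neighbor coverage term are illustrative assumptions, not the paper's algorithm:

```python
import math

def select_instructions(pool, k, alpha=0.5):
    """Greedy selection trading off instruction depth against semantic
    coverage. `pool` is a list of (depth_score, embedding) pairs;
    coverage gain is the distance from a candidate to its nearest
    already-selected embedding. Returns the selected indices.
    """
    selected = []
    remaining = list(range(len(pool)))
    while remaining and len(selected) < k:
        def gain(i):
            depth, emb = pool[i]
            if not selected:
                return depth  # first pick: depth only
            cov = min(math.dist(emb, pool[j][1]) for j in selected)
            return alpha * depth + (1 - alpha) * cov
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```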

Updated: 2025-09-08 09:22:57

Domains: cs.AI

Download: http://arxiv.org/abs/2509.06463v1

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity. (3) Theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
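
The attention-contrast idea can be sketched as a normalized difference between task-specific and general-query attention maps, keeping only the mass unique to the task query; this exact form is an illustrative stand-in for CARVE's decomposition:

```python
def contrast_attention(task_attn, general_attn, eps=1e-8):
    """Pixel-level attention contrast sketch: regions where a
    task-specific query attends more than a generic query are kept as
    semantic signal; mass shared with the generic query is treated as
    visual noise. Both maps are normalized to sum to 1 first.
    """
    def normalize(m):
        s = sum(sum(row) for row in m) + eps
        return [[v / s for v in row] for row in m]
    t, g = normalize(task_attn), normalize(general_attn)
    return [[max(tv - gv, 0.0) for tv, gv in zip(tr, gr)]
            for tr, gr in zip(t, g)]
```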

Updated: 2025-09-08 09:20:04

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.06461v1

Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis

Identifying associations between imaging phenotypes, disease risk factors, and clinical outcomes is essential for understanding disease mechanisms. However, traditional approaches rely on human-driven hypothesis testing and selection of association factors, often overlooking complex, non-linear dependencies among imaging phenotypes and other multi-modal data. To address this, we introduce Multi-agent Exploratory Synergy for the Heart (MESHAgents): a framework that leverages large language models as agents to dynamically elicit, surface, and decide confounders and phenotypes in association studies. Specifically, we orchestrate a multi-disciplinary team of AI agents, which spontaneously generate and converge on insights through iterative, self-organizing reasoning. The framework dynamically synthesizes statistical correlations with multi-expert consensus, providing an automated pipeline for phenome-wide association studies (PheWAS). We demonstrate the system's capabilities through a population-based study of imaging phenotypes of the heart and aorta. MESHAgents autonomously uncovered correlations between imaging phenotypes and a wide range of non-imaging factors, identifying additional confounder variables beyond standard demographic factors. Validation on diagnosis tasks reveals that MESHAgents-discovered phenotypes achieve performance comparable to expert-selected phenotypes, with mean AUC differences as small as $-0.004_{\pm0.010}$ on disease classification tasks. Notably, the recall score improves for 6 out of 9 disease types. Our framework provides clinically relevant imaging phenotypes with transparent reasoning, offering a scalable alternative to expert-driven methods.

Updated: 2025-09-08 09:13:41

Domains: cs.AI

Download: http://arxiv.org/abs/2507.03460v2

Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate if predictions are stable towards counterfactual arguments, different persona prompts and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at https://github.com/dess-mannheim/european_parliament_simulation.
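
Zero-shot persona prompting with limited information can be sketched as a templated prompt over a few biographical fields; the field names and wording below are assumptions, not the paper's exact template:

```python
def persona_prompt(mep, policy):
    """Build a zero-shot persona prompt from limited biographical
    fields (name, country, political group) before asking for a vote.
    The template is illustrative only.
    """
    return (
        f"You are {mep['name']}, a Member of the European Parliament "
        f"from {mep['country']}, affiliated with the {mep['group']} group.\n"
        f"Vote on the following policy with 'For' or 'Against'.\n"
        f"Policy: {policy}\nVote:"
    )
```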

Updated: 2025-09-08 09:13:03

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.11798v2

IGAff: Benchmarking Adversarial Iterative and Genetic Affine Algorithms on Deep Neural Networks

Deep neural networks currently dominate many fields of the artificial intelligence landscape, achieving state-of-the-art results on numerous tasks while remaining hard to understand and exhibiting surprising weaknesses. An active area of research focuses on adversarial attacks, which aim to generate inputs that uncover these weaknesses. However, this proves challenging, especially in the black-box scenario where model details are inaccessible. This paper explores in detail the impact of such adversarial algorithms on ResNet-18, DenseNet-121, Swin Transformer V2, and Vision Transformer network architectures. Leveraging the Tiny ImageNet, Caltech-256, and Food-101 datasets, we benchmark two novel black-box iterative adversarial algorithms based on affine transformations and genetic algorithms: 1) Affine Transformation Attack (ATA), an iterative algorithm maximizing our attack score function using random affine transformations, and 2) Affine Genetic Attack (AGA), a genetic algorithm that involves random noise and affine transformations. We evaluate the performance of the models under algorithm parameter variation, data augmentation, and global and targeted attack configurations. We also compare our algorithms with two black-box adversarial algorithms, Pixle and Square Attack. Our experiments yield better results on the image classification task than similar methods in the literature, achieving an accuracy improvement of up to 8.82%. We provide noteworthy insights into successful adversarial defenses and attacks at both global and targeted levels, and demonstrate adversarial robustness through algorithm parameter variation.
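
The iterative ATA idea, maximizing an attack score over random affine transformations, can be sketched as a random search over transform parameters; the parameter ranges and the score callback are assumptions, not the paper's exact procedure:

```python
import random

def affine_attack(image_score, trials=100, seed=0):
    """Black-box random-search sketch of an affine transformation
    attack: sample affine parameters (rotation, scale, shifts) and
    keep the set that maximizes the attack score. `image_score` maps
    params -> score (e.g. drop in the true-class probability) and
    stands in for querying the victim model.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {
            "angle": rng.uniform(-15.0, 15.0),   # degrees
            "scale": rng.uniform(0.9, 1.1),
            "tx": rng.uniform(-4.0, 4.0),        # pixel shifts
            "ty": rng.uniform(-4.0, 4.0),
        }
        s = image_score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score
```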

Updated: 2025-09-08 09:12:27

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.06459v1

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

As the general capabilities of large language models (LLMs) improve and agent applications become more widespread, the underlying deception risks urgently require systematic evaluation and effective oversight. Unlike existing evaluation which uses simulated games or presents limited choices, we introduce OpenDeception, a novel deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. Specifically, we construct five types of common use cases where LLMs intensively interact with the user, each consisting of ten diverse, concrete scenarios from the real world. To avoid ethical concerns and costs of high-risk deceptive interactions with human testers, we propose to simulate the multi-turn dialogue via agent simulation. Extensive evaluation of eleven mainstream LLMs on OpenDeception highlights the urgent need to address deception risks and security concerns in LLM-based agents: the deception intention ratio across the models exceeds 80%, while the deception success rate surpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do exhibit a higher risk of deception, which calls for more alignment efforts on inhibiting deceptive behaviors.

Updated: 2025-09-08 09:05:14

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.13707v2

A Match Made in Heaven? Matching Test Cases and Vulnerabilities With the VUTECO Approach

Software vulnerabilities are commonly detected via static analysis, penetration testing, and fuzzing. They can also be found by running unit tests - so-called vulnerability-witnessing tests - that stimulate the security-sensitive behavior with crafted inputs. Developing such tests is difficult and time-consuming; thus, automated data-driven approaches could help developers intercept vulnerabilities earlier. However, training and validating such approaches require a lot of data, which is currently scarce. This paper introduces VUTECO, a deep learning-based approach for collecting instances of vulnerability-witnessing tests from Java repositories. VUTECO carries out two tasks: (1) the "Finding" task to determine whether a test case is security-related, and (2) the "Matching" task to relate a test case to the exact vulnerability it is witnessing. VUTECO successfully addresses the Finding task, achieving perfect precision and 0.83 F0.5 score on validated test cases in VUL4J and returning 102 out of 145 (70%) correct security-related test cases from 244 open-source Java projects. Despite showing sufficiently good performance for the Matching task - i.e., 0.86 precision and 0.68 F0.5 score - VUTECO failed to retrieve any valid match in the wild. Nevertheless, we observed that in almost all of the matches, the test case was still security-related despite being matched to the wrong vulnerability. In the end, VUTECO can help find vulnerability-witnessing tests, though the matching with the right vulnerability is yet to be solved; the findings obtained lay the stepping stone for future research on the matter.
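
The F0.5 scores reported above are the precision-weighted F-beta measure; for reference:

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta = 0.5 weights precision more heavily than recall, matching
    the F0.5 scores reported for VUTECO's Finding and Matching tasks.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```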

Updated: 2025-09-08 09:04:23

Domains: cs.SE,cs.CR,cs.LG,D.2.5; D.2.7

Download: http://arxiv.org/abs/2502.03365v2

A Gravity-informed Spatiotemporal Transformer for Human Activity Intensity Prediction

Human activity intensity prediction is crucial to many location-based services. Despite tremendous progress in modeling dynamics of human activity, most existing methods overlook physical constraints of spatial interaction, leading to uninterpretable spatial correlations and over-smoothing phenomenon. To address these limitations, this work proposes a physics-informed deep learning framework, namely Gravity-informed Spatiotemporal Transformer (Gravityformer) by integrating the universal law of gravitation to refine transformer attention. Specifically, it (1) estimates two spatially explicit mass parameters based on spatiotemporal embedding feature, (2) models the spatial interaction in end-to-end neural network using proposed adaptive gravity model to learn the physical constraint, and (3) utilizes the learned spatial interaction to guide and mitigate the over-smoothing phenomenon in transformer attention. Moreover, a parallel spatiotemporal graph convolution transformer is proposed for achieving a balance between coupled spatial and temporal learning. Systematic experiments on six real-world large-scale activity datasets demonstrate the quantitative and qualitative superiority of our model over state-of-the-art benchmarks. Additionally, the learned gravity attention matrix can be not only disentangled and interpreted based on geographical laws, but also improved the generalization in zero-shot cross-region inference. This work provides a novel insight into integrating physical laws with deep learning for spatiotemporal prediction.
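
The gravitational weighting at the heart of the model follows the law of gravitation, w_ij = G * m_i * m_j / d_ij^2; a minimal sketch with masses given directly rather than learned from spatiotemporal embeddings (a simplifying assumption):

```python
def gravity_weights(masses, coords, g=1.0, eps=1e-6):
    """Pairwise spatial-interaction weights in the gravity-model form
    w_ij = G * m_i * m_j / d_ij^2. `eps` avoids division by zero for
    coincident locations; the diagonal is left at zero.
    """
    n = len(masses)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            w[i][j] = g * masses[i] * masses[j] / (d2 + eps)
    return w
```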

Updated: 2025-09-08 08:53:47

Domains: cs.LG

Download: http://arxiv.org/abs/2506.13678v3

Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics

Accurate dose-response forecasting under sparse sampling is central to precision pharmacotherapy. We present the Amortized In-Context Mixed-Effect Transformer (AICMET) model, a transformer-based latent-variable framework that unifies mechanistic compartmental priors with amortized in-context Bayesian inference. AICMET is pre-trained on hundreds of thousands of synthetic pharmacokinetic trajectories with Ornstein-Uhlenbeck priors over the parameters of compartment models, endowing the model with strong inductive biases and enabling zero-shot adaptation to new compounds. At inference time, the decoder conditions on the collective context of previously profiled trial participants, generating calibrated posterior predictions for newly enrolled patients after a few early drug concentration measurements. This capability collapses traditional model-development cycles from weeks to hours while preserving some degree of expert modelling. Experiments across public datasets show that AICMET attains state-of-the-art predictive accuracy and faithfully quantifies inter-patient variability -- outperforming both nonlinear mixed-effects baselines and recent neural ODE variants. Our results highlight the feasibility of transformer-based, population-aware neural architectures as a new alternative to bespoke pharmacokinetic modeling pipelines, charting a path toward truly population-aware personalized dosing regimens.
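
The mechanistic compartmental prior can be grounded in the standard one-compartment model with first-order absorption; this closed form is the textbook backbone over whose parameters AICMET places Ornstein-Uhlenbeck priors, not the full latent-variable model:

```python
import math

def one_compartment_conc(dose, ka, ke, v, t):
    """Oral one-compartment pharmacokinetic model with first-order
    absorption (ka) and elimination (ke), volume of distribution v:

        C(t) = dose * ka / (v * (ka - ke)) * (exp(-ke*t) - exp(-ka*t))

    Assumes ka != ke; units are left abstract in this sketch.
    """
    return (dose * ka / (v * (ka - ke))
            * (math.exp(-ke * t) - math.exp(-ka * t)))
```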

Updated: 2025-09-08 08:45:08

Domains: cs.LG

Download: http://arxiv.org/abs/2508.15659v2

HyFedRAG: A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data

Centralized RAG pipelines struggle with heterogeneous and privacy-sensitive data, especially in distributed healthcare settings where patient data spans SQL, knowledge graphs, and clinical notes. Clinicians face difficulties retrieving rare disease cases due to privacy constraints and the limitations of traditional cloud-based RAG systems in handling diverse formats and edge devices. To address this, we introduce HyFedRAG, a unified and efficient Federated RAG framework tailored for hybrid data modalities. By leveraging an edge-cloud collaborative mechanism, HyFedRAG enables RAG to operate across diverse data sources while preserving data privacy. Our key contributions are: (1) We design an edge-cloud collaborative RAG framework built on Flower, which supports querying structured SQL data, semi-structured knowledge graphs, and unstructured documents. The edge-side LLMs convert diverse data into standardized privacy-preserving representations, and the server-side LLM integrates them for global reasoning and generation. (2) We integrate lightweight local retrievers with privacy-aware LLMs and provide three anonymization tools that enable each client to produce semantically rich, de-identified summaries for global inference across devices. (3) To optimize response latency and reduce redundant computation, we design a three-tier caching strategy consisting of local cache, intermediate representation cache, and cloud inference cache. Experimental results on PMC-Patients demonstrate that HyFedRAG outperforms existing baselines in terms of retrieval quality, generation consistency, and system efficiency. Our framework offers a scalable and privacy-compliant solution for RAG over structural-heterogeneous data, unlocking the potential of LLMs in sensitive and diverse data environments.
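
The three-tier caching strategy can be sketched as a fall-through lookup with promotion into faster tiers; the tier names follow the abstract, while the promotion policy is an assumption:

```python
def tiered_lookup(key, local, intermediate, cloud_infer):
    """Three-tier lookup sketch: local cache -> intermediate
    representation cache -> cloud inference (the expensive fallback).
    Results are promoted into the faster tiers on the way back so
    repeated queries avoid the slow path.
    """
    if key in local:
        return local[key]
    if key in intermediate:
        local[key] = intermediate[key]  # promote to the local tier
        return local[key]
    value = cloud_infer(key)            # slow path: remote LLM call
    intermediate[key] = value
    local[key] = value
    return value
```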

Updated: 2025-09-08 08:44:24

Domains: cs.AI

Download: http://arxiv.org/abs/2509.06444v1

Benchmarking Vision Transformers and CNNs for Thermal Photovoltaic Fault Detection with Explainable AI Validation

Artificial intelligence deployment for automated photovoltaic (PV) monitoring faces interpretability barriers that limit adoption in energy infrastructure applications. While deep learning achieves high accuracy in thermal fault detection, validation that model decisions align with thermal physics principles remains lacking, creating deployment hesitancy where understanding model reasoning is critical. This study provides a systematic comparison of convolutional neural networks (ResNet-18, EfficientNet-B0) and vision transformers (ViT-Tiny, Swin-Tiny) for thermal PV fault detection, using XRAI saliency analysis to assess alignment with thermal physics principles. This represents the first systematic comparison of CNNs and vision transformers for thermal PV fault detection with physics-validated interpretability. Evaluation on 20,000 infrared images spanning normal operation and 11 fault categories shows that Swin Transformer achieves the highest performance (94% binary accuracy; 73% multiclass accuracy) compared to CNN approaches. XRAI analysis reveals that models learn physically meaningful features, such as localized hotspots for cell defects, linear thermal paths for diode failures, and thermal boundaries for vegetation shading, consistent with expected thermal signatures. However, performance varies significantly across fault types: electrical faults achieve strong detection (F1-scores >0.90) while environmental factors like soiling remain challenging (F1-scores 0.20-0.33), indicating limitations imposed by thermal imaging resolution. The thermal physics-guided interpretability approach provides methodology for validating AI decision-making in energy monitoring applications, addressing deployment barriers in renewable energy infrastructure.

Updated: 2025-09-08 08:38:53

Domains: cs.LG,cs.CV

Download: http://arxiv.org/abs/2509.07039v1

Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning

Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the "lost in the middle" issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce input have the risk of discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at https://github.com/Aireduce952/Tree-of-Agents.
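
Prefix-hash caching over tree-structured reasoning paths can be sketched as memoizing each agent step under a hash of its ordered message prefix, so branches that share a prefix reuse cached work; the caching granularity here is an assumption about how TOA applies it:

```python
import hashlib

def prefix_key(messages):
    """Stable key for a reasoning path: hash of the ordered message
    prefix. A separator byte keeps ["a", "b"] distinct from ["ab"]."""
    h = hashlib.sha256()
    for m in messages:
        h.update(m.encode())
        h.update(b"\x00")
    return h.hexdigest()

cache = {}

def cached_step(messages, step_fn):
    """Memoize one agent step along a tree-structured reasoning path.
    `step_fn` stands in for an LLM call; identical prefixes across
    branches hit the cache instead of re-querying the model.
    """
    key = prefix_key(messages)
    if key not in cache:
        cache[key] = step_fn(messages)
    return cache[key]
```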

Updated: 2025-09-08 08:34:02

Domains: cs.AI

Download: http://arxiv.org/abs/2509.06436v1

MultiPL-MoE: Multi-Programming-Lingual Extension of Large Language Models through Hybrid Mixture-of-Experts

Despite LLMs' excellent code creation capabilities, multilingual code generation remains extremely challenging. To address this, we intend to improve the multi-programming-lingual (MultiPL) performance of the base LLMs while retaining the most popular ones using restricted computational resources. We consider MultiPL to be a special case of multiple natural languages and propose a MultiPL extension of LLMs utilizing a hybrid mixture of experts (MoE), called MultiPL-MoE. Specifically, MultiPL-MoE combines two paired MoEs to optimize expert selection at both the token and segment levels. The token-level MoE is a standard upcycling MoE structure with a shared expert and a novel gate weight normalization approach that aids in the final fusion with the segment-level MoE. The segment-level MoE incorporates two innovative designs to better capture the syntactic structure and contextual patterns of programming languages: First, using a sliding window to partition the input token sequence into multiple segments; Then, adopting an expert-choice routing strategy that allows experts to select the top-k segments. The results of the experiment proved the effectiveness of MultiPL-MoE.
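
The segment-level design combines sliding-window segmentation with expert-choice routing; a minimal sketch with illustrative affinity scores in place of a learned gate:

```python
def sliding_segments(tokens, window, stride):
    """Partition a token sequence into (possibly overlapping) segments
    with a sliding window, as in the segment-level MoE."""
    return [tokens[i:i + window]
            for i in range(0, max(len(tokens) - window, 0) + 1, stride)]

def expert_choice(scores, k):
    """Expert-choice routing sketch: each expert picks its top-k
    segments by affinity score (rows = experts, cols = segments),
    rather than each segment picking experts. Real routers use
    learned gates; the scores here are given.
    """
    return [sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
            for row in scores]
```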

Updated: 2025-09-08 08:30:07

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.19268v2

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

Large language models (LLMs) are trained using massive datasets, which often contain undesirable content such as harmful texts, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STA) can successfully extract unlearned information from LLMs, but in this work we show that STAs can be an inadequate tool for auditing unlearning. Using common benchmarks such as Who Is Harry Potter? and TOFU, we demonstrate that in a strong auditor setting such attacks can elicit any information from the LLM, regardless of the deployed unlearning algorithm or whether the queried content was originally present in the training corpus. We further show that STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long, indicating that STAs must be used carefully to effectively audit unlearning. Example code can be found at: https://github.com/IntelLabs/LLMart/tree/main/examples/unlearning
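
A soft token attack optimizes continuous embeddings rather than discrete tokens; a toy sketch where the "model" is the identity map and the squared-error gradient is analytic (real STAs backpropagate through the LLM, e.g. via the LLMart code linked above):

```python
def optimize_soft_token(target, steps=200, lr=0.1):
    """Toy soft-token optimization: gradient-descend a continuous
    embedding toward a target under a squared-error loss. Because the
    'model' here is the identity, the gradient 2*(x - target) is
    analytic; this only illustrates the continuous-relaxation idea.
    """
    x = [0.0] * len(target)
    for _ in range(steps):
        grad = [2 * (xi - ti) for xi, ti in zip(x, target)]
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x
```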

Updated: 2025-09-08 08:27:53

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2502.15836v2

HECATE: An ECS-based Framework for Teaching and Developing Multi-Agent Systems

This paper introduces HECATE, a novel framework based on the Entity-Component-System (ECS) architectural pattern that bridges the gap between distributed systems engineering and MAS development. HECATE leverages ECS's data-oriented design to engineer multiagent systems (MAS) from a distributed systems (DS) perspective, integrating agent concepts directly into the DS domain. This approach simplifies MAS development by (i) reducing the need for specialized agent knowledge and (ii) leveraging familiar DS patterns and standards to minimize the agent-specific knowledge required for engineering MAS. We present the framework's architecture, core components, and implementation approach, demonstrating how it supports different agent models.
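A minimal sketch of the ECS pattern the framework builds on: entities are plain ids, components are data records attached to them, and systems are functions that iterate over entities holding a given component set. The `Position`/`Goal` components and the movement system are invented for illustration and are not HECATE's actual API:

```python
from dataclasses import dataclass

@dataclass
class Position: x: float; y: float

@dataclass
class Goal: x: float; y: float

class World:
    """ECS world: entity ids plus per-type component stores."""
    def __init__(self):
        self.next_id = 0
        self.components = {}          # component type -> {entity_id: component}

    def spawn(self, *comps):
        eid = self.next_id
        self.next_id += 1
        for c in comps:
            self.components.setdefault(type(c), {})[eid] = c
        return eid

    def query(self, *types):
        # Yield entities that hold every requested component type.
        ids = set(self.components.get(types[0], {}))
        for t in types[1:]:
            ids &= set(self.components.get(t, {}))
        for eid in sorted(ids):
            yield eid, [self.components[t][eid] for t in types]

def movement_system(world, speed=1.0):
    # Agent behaviour expressed as an ordinary data-oriented system,
    # rather than a specialized agent-programming construct.
    for eid, (pos, goal) in world.query(Position, Goal):
        dx, dy = goal.x - pos.x, goal.y - pos.y
        dist = max((dx * dx + dy * dy) ** 0.5, 1e-9)
        step = min(speed, dist)
        pos.x += step * dx / dist
        pos.y += step * dy / dist

world = World()
agent = world.spawn(Position(0.0, 0.0), Goal(3.0, 4.0))
for _ in range(5):                    # five ticks at speed 1 cover distance 5
    movement_system(world)
pos = world.components[Position][agent]
```

The appeal for MAS engineering is visible even at this scale: adding a new agent capability means adding a component and a system, with no changes to existing code.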

Updated: 2025-09-08 08:26:01

标题: HECATE:一种基于ECS的用于教学和开发多Agent系统的框架

摘要: 本文介绍了HECATE,这是一个基于实体-组件-系统(ECS)架构模式的新框架,用于弥合分布式系统工程和多Agent系统(MAS)开发之间的差距。HECATE是使用实体-组件-系统架构模式构建的,利用面向数据的设计来实现多Agent系统。这种方法涉及从分布式系统(DS)的角度来工程化多Agent系统(MAS),将Agent概念直接集成到DS领域中。这种方法通过(i)减少对专门Agent知识的需求和(ii)利用熟悉的DS模式和标准来最小化工程MAS所需的Agent特定知识,简化了MAS的开发。我们展示了框架的架构、核心组件和实现方法,演示了它如何支持不同的Agent模型。

更新时间: 2025-09-08 08:26:01

领域: cs.MA,cs.AI,C.2.4, I.2.11

下载: http://arxiv.org/abs/2509.06431v1

RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models

Large Language Models (LLMs) have exhibited significant proficiency in code debugging, especially in automatic program repair, which may substantially reduce the time consumption of developers and enhance their efficiency. Significant advancements in debugging datasets have been made to promote the development of code debugging. However, these datasets primarily focus on assessing the LLM's function-level code repair capabilities, neglecting the more complex and realistic repository-level scenarios, which leads to an incomplete understanding of the LLM's challenges in repository-level debugging. While several repository-level datasets have been proposed, they often suffer from limitations such as limited diversity of tasks, languages, and error types. To mitigate this challenge, this paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset with 22 subtypes of errors that supports 8 commonly used programming languages and 3 debugging tasks. Furthermore, we conduct evaluation experiments on 10 LLMs, where Claude 3.5 Sonnet, the best-performing model, still cannot perform well in repository-level debugging.

Updated: 2025-09-08 08:22:38

标题: RepoDebug:大型语言模型的存储库级多任务和多语言调试评估

摘要: 大型语言模型(LLMs)在代码调试方面表现出显著的熟练程度,尤其在自动程序修复方面,可以大大减少开发人员的时间消耗,提高其效率。调试数据集方面取得了显著进展,以促进代码调试的发展。然而,这些数据集主要集中在评估LLM的功能级代码修复能力上,忽略了更复杂和现实的仓库级情景,导致对LLM在仓库级调试中的挑战理解不完整。虽然已经提出了一些仓库级数据集,但它们经常受到任务、语言和错误类型多样性有限的限制。为了缓解这一挑战,本文介绍了RepoDebug,一个支持8种常用编程语言和3个调试任务的多任务、多语言仓库级代码调试数据集,包括22种错误子类型。此外,我们对10个LLMs进行评估实验,最佳表现模型Claude 3.5 Sonnet仍然无法在仓库级调试中表现良好。

更新时间: 2025-09-08 08:22:38

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2509.04078v2

Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster

Computational models are critical to advance our understanding of how neural, biomechanical, and physical systems interact to orchestrate animal behaviors. Despite the availability of near-complete reconstructions of the Drosophila melanogaster central nervous system, musculature, and exoskeleton, anatomically and physically grounded models of fly leg muscles are still missing. These models provide an indispensable bridge between motor neuron activity and joint movements. Here, we introduce the first 3D, data-driven musculoskeletal model of Drosophila legs, implemented in both OpenSim and MuJoCo simulation environments. Our model incorporates a Hill-type muscle representation based on high-resolution X-ray scans from multiple fixed specimens. We present a pipeline for constructing muscle models using morphological imaging data and for optimizing unknown muscle parameters specific to the fly. We then combine our musculoskeletal models with detailed 3D pose estimation data from behaving flies to achieve muscle-actuated behavioral replay in OpenSim. Simulations of muscle activity across diverse walking and grooming behaviors predict coordinated muscle synergies that can be tested experimentally. Furthermore, by training imitation learning policies in MuJoCo, we test the effect of different passive joint properties on learning speed and find that damping and stiffness facilitate learning. Overall, our model enables the investigation of motor control in an experimentally tractable model organism, providing insights into how biomechanics contribute to generation of complex limb movements. Moreover, our model can be used to control embodied artificial agents to generate naturalistic and compliant locomotion in simulated environments.
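The Hill-type muscle representation mentioned above combines an activation-scaled active force-length and force-velocity term with a passive elastic term. The sketch below uses illustrative placeholder curve shapes and constants, not the fitted Drosophila parameters from the paper's OpenSim/MuJoCo models:

```python
import math

def hill_force(a, l, v, f_max=1.0):
    """Minimal Hill-type muscle: total force = activation * force-length *
    force-velocity + passive elastic force.

    a: activation in [0, 1]
    l: fiber length normalized by optimal length
    v: shortening velocity normalized by max shortening velocity (>= 0)
    """
    f_l = math.exp(-((l - 1.0) ** 2) / 0.45)        # active force-length (Gaussian)
    f_v = max(0.0, (1.0 - v) / (1.0 + 4.0 * v))     # force-velocity (Hill hyperbola)
    f_p = 2.0 * max(0.0, l - 1.0) ** 2              # passive force, engages when stretched
    return f_max * (a * f_l * f_v + f_p)

peak = hill_force(a=1.0, l=1.0, v=0.0)        # fully active, optimal length, isometric
stretched = hill_force(a=0.0, l=1.2, v=0.0)   # passive force only
fast = hill_force(a=1.0, l=1.0, v=1.0)        # at max shortening velocity
```

At optimal length under isometric, fully-activated conditions the model produces exactly `f_max`; at maximum shortening velocity the active force collapses to zero, which is the qualitative behaviour the musculoskeletal simulations rely on.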

Updated: 2025-09-08 08:21:14

标题: 果蝇肢体运动生物力学的肌肉骨骼模拟

摘要: 计算模型对于推动我们理解神经、生物力学和物理系统如何相互作用以编排动物行为至关重要。尽管果蝇中枢神经系统、肌肉系统和外骨骼的近乎完整重建已经可以获得,但果蝇腿部肌肉的解剖和物理基础模型仍然缺失。这些模型提供了一个不可或缺的桥梁,连接运动神经元活动和关节运动。在这里,我们介绍了果蝇腿部的第一个3D数据驱动肌肉骨骼模型,在OpenSim和MuJoCo模拟环境中实施。我们的模型综合了基于多个固定标本的高分辨率X射线扫描的Hill型肌肉表示。我们提出了一个使用形态成像数据构建肌肉模型并优化特定于果蝇的未知肌肉参数的流程。然后,我们将我们的肌肉骨骼模型与行为果蝇的详细3D姿势估计数据相结合,在OpenSim中实现肌肉驱动的行为重放。在不同行走和梳理行为中的肌肉活动模拟预测协调的肌肉协同作用,这可以在实验中进行测试。此外,通过在MuJoCo中训练模仿学习策略,我们测试了不同被动关节属性对学习速度的影响,并发现阻尼和刚度促进了学习。总的来说,我们的模型使得在一种实验上可操作的模型生物中研究运动控制成为可能,从而深入探讨生物力学如何促成复杂肢体运动的产生。此外,我们的模型可以用来控制具有身体的人工智能代理,以在模拟环境中生成自然和顺从的运动。

更新时间: 2025-09-08 08:21:14

领域: q-bio.NC,cs.AI,cs.LG,cs.RO

下载: http://arxiv.org/abs/2509.06426v1

CAPMix: Robust Time Series Anomaly Detection Based on Abnormal Assumptions with Dual-Space Mixup

Time series anomaly detection (TSAD) is a vital yet challenging task, particularly in scenarios where labeled anomalies are scarce and temporal dependencies are complex. Recent anomaly assumption (AA) approaches alleviate the lack of anomalies by injecting synthetic samples and training discriminative models. Despite promising results, these methods often suffer from two fundamental limitations: patchy generation, where scattered anomaly knowledge leads to overly simplistic or incoherent anomaly injection, and Anomaly Shift, where synthetic anomalies either resemble normal data too closely or diverge unrealistically from real anomalies, thereby distorting classification boundaries. In this paper, we propose CAPMix, a controllable anomaly augmentation framework that addresses both issues. First, we design a CutAddPaste mechanism to inject diverse and complex anomalies in a targeted manner, avoiding patchy generation. Second, we introduce a label revision strategy to adaptively refine anomaly labels, reducing the risk of anomaly shift. Finally, we employ dual-space mixup within a temporal convolutional network to enforce smoother and more robust decision boundaries. Extensive experiments on five benchmark datasets, including AIOps, UCR, SWaT, WADI, and ESA, demonstrate that CAPMix achieves significant improvements over state-of-the-art baselines, with enhanced robustness against contaminated training data. The code is available at https://github.com/alsike22/CAPMix.
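The mixup primitive underlying the "dual-space" component is a convex combination of two windows and their anomaly labels. The sketch below shows only the input-space half on toy lists; per the abstract, CAPMix applies the same operation in a latent space of a temporal convolutional network as well, which is not reproduced here:

```python
def mixup(x_a, x_b, y_a, y_b, lam):
    """Convex combination of two time-series windows and their anomaly labels.
    lam in [0, 1] controls how close the synthetic sample sits to each parent,
    giving soft labels rather than a hard normal/anomalous split."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = lam * y_a + (1 - lam) * y_b
    return x, y

# Blend an anomalous window (label 1) with a normal one (label 0).
x, y = mixup([0.0, 1.0, 2.0], [2.0, 1.0, 0.0], y_a=1.0, y_b=0.0, lam=0.75)
```

The soft label produced this way is what lets the classifier learn a smoother decision boundary between normal and injected-anomaly regions instead of a brittle hard margin.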

Updated: 2025-09-08 08:15:12

标题: CAPMix:基于异常假设和双空间混合的鲁棒时间序列异常检测

摘要: 时间序列异常检测(TSAD)是一项重要且具有挑战性的任务,特别是在标记异常稀缺且时间依赖性复杂的情况下。最近的异常假设(AA)方法通过注入合成样本并训练区分性模型来缓解异常不足的问题。尽管取得了有希望的结果,但这些方法通常存在两个基本限制:片状生成,散乱的异常知识导致注入的异常过于简单或不连贯;以及异常移位,合成异常要么过于接近正常数据,要么不切实际地偏离真实异常,从而扭曲分类边界。在本文中,我们提出了CAPMix,一个可控异常增强框架,解决了这两个问题。首先,我们设计了一个CutAddPaste机制,以有针对性的方式注入多样化和复杂的异常,避免片状生成。其次,我们引入了一个标签修订策略,自适应地优化异常标签,降低异常移位的风险。最后,我们在时间卷积网络中使用双空间混合来强制更平滑和更稳健的决策边界。在包括AIOps、UCR、SWaT、WADI和ESA在内的五个基准数据集上进行的大量实验表明,CAPMix相比最先进的基线方法取得了显著改进,并对受污染的训练数据表现出更强的鲁棒性。该代码可在https://github.com/alsike22/CAPMix 上找到。

更新时间: 2025-09-08 08:15:12

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.06419v1

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.
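The two stages described above can be sketched on a toy patch grid: a binary keep/drop mask (standing in for the patch classifier's output), a 3x3 max-pooling refinement that re-grows fragmented text regions, and a pruning step that retains each surviving token together with its original grid index. The grid size and mask values are illustrative:

```python
def refine_mask(mask):
    """3x3 max-pooling (binary dilation) over a patch-level keep/drop mask,
    so isolated text patches recover their immediate neighbours."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = max(
                mask[a][b]
                for a in range(max(0, i - 1), min(h, i + 2))
                for b in range(max(0, j - 1), min(w, j + 2))
            )
    return out

def prune(tokens, mask):
    # Keep tokens together with their original (row, col) indices, so
    # positional information survives pruning ("index-preserving").
    h, w = len(mask), len(mask[0])
    return [((i, j), tokens[i * w + j])
            for i in range(h) for j in range(w) if mask[i][j]]

mask = [[1, 0, 0, 0],          # one detected text patch in a 4x4 grid
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
refined = refine_mask(mask)
kept = prune(list("abcdefghijklmnop"), refined)
```

Only the dilated neighbourhood of the detected patch is forwarded to the VLM, and each kept token still carries the index it had in the full grid.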

Updated: 2025-09-08 08:12:26

标题: 索引保留的轻量级标记修剪:用于视觉-语言模型中高效文档理解的方法

摘要: 视觉语言模型(VLMs)的最新进展在文档理解任务中取得了令人印象深刻的成果,但其高计算需求仍然是一个挑战。为了减少计算负担,我们提出了一个轻量级的标记修剪框架,该框架在VLM处理之前从文档图像中过滤出非信息性的背景区域。一个二进制的补丁级分类器去除非文本区域,一个最大池化的细化步骤恢复了分散的文本区域,以增强空间一致性。对真实世界的文档数据集进行的实验证明,我们的方法显著降低了计算成本,同时保持了可比较的准确性。

更新时间: 2025-09-08 08:12:26

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.06415v1

Rethinking GNN Expressive Power from a Distributed Computational Model Perspective

The success of graph neural networks (GNNs) has motivated theoretical studies on their expressive power, often through alignments with the Weisfeiler-Lehman (WL) tests. However, such analyses typically focus on the ability of GNNs to distinguish between graph structures, rather than to compute or approximate specific function classes. The latter is more commonly studied in machine learning theory, including results such as the Turing completeness of recurrent networks and the universal approximation property of feedforward networks. We argue that using well-defined computational models, such as a modified CONGEST model with clearly specified preprocessing and postprocessing, offers a more sound framework for analyzing GNN expressiveness. Within this framework, we show that allowing unrestricted preprocessing or incorporating externally computed features, while claiming that these precomputations enhance the expressiveness, can sometimes lead to problems. We also show that the lower bound on a GNN's capacity (depth multiplied by width) to simulate one iteration of the WL test actually grows nearly linearly with graph size, indicating that the WL test is not locally computable and is misaligned with message-passing GNNs. Despite these negative results, we also present positive results that characterize the effects of virtual nodes and edges from a computational model perspective. Finally, we highlight several open problems regarding GNN expressiveness for further exploration.

Updated: 2025-09-08 08:08:59

标题: 重新思考GNN的表达能力:从分布式计算模型的角度进行思考

摘要: 图神经网络(GNNs)的成功激发了对它们表达能力的理论研究,通常通过与Weisfeiler-Lehman(WL)测试的对齐来进行分析。然而,这类分析通常关注GNNs区分图结构的能力,而不是计算或逼近特定的函数类。后者在机器学习理论中更常见,包括循环网络的图灵完备性和前馈网络的通用逼近性质等结果。我们认为使用明确定义的计算模型,例如具有明确定义的预处理和后处理的修改过的CONGEST模型,提供了一个更稳固的框架来分析GNN的表达能力。在这个框架内,我们表明允许无限制的预处理或合并外部计算特征,同时声称这些预计算增强了表达能力,有时可能会导致问题。我们还表明,GNN模拟一次WL测试迭代所需容量(深度乘以宽度)的下界实际上随图的大小几乎线性增长,表明WL测试并不是局部可计算的,并且与消息传递GNNs不一致。尽管有这些负面结果,我们也提出了从计算模型角度描述虚拟节点和边的影响的积极结果。最后,我们强调了关于GNN表达能力的若干开放问题,以便进一步探讨。

更新时间: 2025-09-08 08:08:59

领域: cs.LG,cs.AI,+

下载: http://arxiv.org/abs/2410.01308v4

Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning

This study presents DiagCoT, a multi-stage framework that applies supervised fine-tuning to general-purpose vision-language models (VLMs) to emulate radiologists' stepwise diagnostic reasoning using only free-text reports. DiagCoT combines contrastive image-report tuning for domain alignment, chain-of-thought supervision to capture inferential logic, and reinforcement tuning with clinical reward signals to enhance factual accuracy and fluency. On the MIMIC-CXR benchmark, DiagCoT improved zero-shot disease classification AUC from 0.52 to 0.76 (absolute gain of 0.24), pathology grounding mIoU from 0.08 to 0.31 (absolute gain of 0.23), and report generation BLEU from 0.11 to 0.33 (absolute gain of 0.22). It outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets. By converting unstructured clinical narratives into structured supervision, DiagCoT offers a scalable approach for developing interpretable and diagnostically competent AI systems for radiology.

Updated: 2025-09-08 08:01:26

标题: 用报告引导的思维链学习教授人工智能逐步诊断推理

摘要: 这项研究提出了DiagCoT,这是一个多阶段框架,仅利用自由文本报告对通用视觉语言模型(VLMs)进行监督微调,以模拟放射科医师的逐步诊断推理。DiagCoT结合了对比图像-报告调整以进行领域对齐,思维链监督以捕获推理逻辑,以及通过临床奖励信号进行强化调整以增强事实准确性和流畅性。在MIMIC-CXR基准上,DiagCoT将零样本疾病分类AUC从0.52提高到0.76(绝对增益为0.24),病理定位mIoU从0.08提高到0.31(绝对增益为0.23),报告生成BLEU从0.11提高到0.33(绝对增益为0.22)。它在长尾疾病和外部数据集上表现优于LLaVA-Med和CXR-LLAVA等最先进模型。通过将非结构化的临床叙述转化为结构化监督,DiagCoT为放射学开发可解释且具备诊断能力的人工智能系统提供了一种可扩展的方法。

更新时间: 2025-09-08 08:01:26

领域: cs.AI

下载: http://arxiv.org/abs/2509.06409v1

Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach

Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. The U-Net is compared with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error of between 9% and 13% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.53% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.

Updated: 2025-09-08 08:00:56

标题: 将高光谱图像转化为化学地图:一种新型端到端深度学习方法

摘要: 目前,从高光谱图像生成化学地图的方法基于诸如偏最小二乘(PLS)回归等模型,生成像素级预测,不考虑空间背景并且受到高噪声干扰。本研究提出了一种端到端深度学习方法,使用修改版的U-Net和自定义损失函数,直接从高光谱图像中获取化学地图,跳过传统像素级分析所需的所有中间步骤。将U-Net与传统的PLS回归在真实的猪五花肉样本数据集上进行比较,该数据集具有相关的平均脂肪参考值。在平均脂肪预测任务上,U-Net的测试集均方根误差比PLS回归低9%至13%。同时,U-Net生成了细节化学地图,其中99.91%的方差是空间相关的。相反,PLS生成的化学地图中只有2.53%的方差是空间相关的,表明每个像素级预测在很大程度上独立于相邻像素。此外,虽然PLS生成的化学地图包含远超出0-100%这一物理可能范围的预测值,U-Net学会保持在这个范围内。因此,本研究的结果表明,对于化学地图生成,U-Net比PLS更优越。

更新时间: 2025-09-08 08:00:56

领域: cs.CV,cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2504.14131v4

MedualTime: A Dual-Adapter Language Model for Medical Time Series-Text Multimodal Learning

The recent rapid advancements in language models (LMs) have garnered attention in medical time series-text multimodal learning. However, existing contrastive learning-based and prompt-based LM approaches tend to be biased, often assigning a primary role to time series modality while treating text modality as secondary. We classify these approaches under a temporal-primary paradigm, which may overlook the unique and critical task-relevant information embedded in text modality like clinical reports, thus failing to fully leverage mutual benefits and complementarity of different modalities. To fill this gap, we propose a novel textual-temporal multimodal learning paradigm that enables either modality to serve as the primary while being enhanced by the other, thereby effectively capturing modality-specific information and fostering cross-modal interaction. In specific, we design MedualTime, a language model composed of dual adapters to implement temporal-primary and textual-primary modeling simultaneously. Within each adapter, lightweight adaptation tokens are injected into the top layers of LM to encourage high-level modality fusion. The shared LM pipeline by dual adapters not only achieves adapter alignment but also enables efficient fine-tuning, reducing computational resources. Empirically, MedualTime demonstrates superior performance on medical data, achieving notable improvements of 8% accuracy and 12% F1 in supervised settings. Furthermore, MedualTime's transferability is validated by few-shot label transfer experiments from coarse-grained to fine-grained medical data. https://github.com/start2020/MedualTime

Updated: 2025-09-08 07:57:56

标题: MedualTime:一种用于医学时间序列文本多模态学习的双适配器语言模型

摘要: 最近语言模型(LMs)的快速发展引起了医学时间序列-文本多模态学习的关注。然而,现有的基于对比学习和提示的LM方法往往存在偏见,通常将时间序列模态置于主导地位,而将文本模态视为次要。我们将这些方法归类为时间主导范式,这可能忽视文本模态中嵌入的独特且关键的任务相关信息,如临床报告,从而未能充分利用不同模态之间的互相利益和互补性。为了填补这一空白,我们提出了一种新颖的文本-时间多模态学习范式,使任一模态都可以作为主导模态,同时受到另一模态的增强,从而有效地捕获模态特定信息并促进跨模态交互。具体来说,我们设计了MedualTime,一个由双适配器组成的语言模型,可以同时实现时间主导和文本主导建模。在每个适配器内部,轻量级适配令牌被注入到LM的顶层,以促进高层次模态融合。双适配器共享的LM管道不仅实现了适配器对齐,还实现了有效的微调,降低了计算资源的使用。经验证,MedualTime在医学数据上表现出优越性能,在监督设置中达到了8%的准确度和12%的F1值的显着提高。此外,通过从粗粒度到细粒度医学数据的少样本标签传递实验证实了MedualTime的可转移性。

更新时间: 2025-09-08 07:57:56

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2406.06620v4

A Theoretical Justification for Asymmetric Actor-Critic Algorithms

In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.

Updated: 2025-09-08 07:50:04

标题: 不对称演员-评论家算法的一种理论证明

摘要: 在部分可观察环境的强化学习中,许多成功的算法都是在不对称学习范式内开发的。这种范式利用训练时可用的额外状态信息来加快学习速度。尽管提出的学习目标通常在理论上是合理的,但这些方法仍然缺乏对其潜在好处的明确理论证明。我们通过将有限时间收敛分析调整到这种设置中,为具有线性函数逼近器的不对称演员-评论家算法提出了这样的理论证明。由此产生的有限时间界限表明,不对称的评论家消除了由智能体状态混叠(aliasing)引起的误差项。

更新时间: 2025-09-08 07:50:04

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2501.19116v3

NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables

On-device deep learning models have extensive real world demands. Deep learning compilers efficiently compile models into executables for deployment on edge devices, but these executables may face the threat of reverse engineering. Previous studies have attempted to decompile DNN executables, but they face challenges in handling compilation optimizations and analyzing quantized compiled models. In this paper, we present NeuroDeX to unlock diverse support in decompiling DNN executables. NeuroDeX leverages the semantic understanding capabilities of LLMs along with dynamic analysis to accurately and efficiently perform operator type recognition, operator attribute recovery and model reconstruction. NeuroDeX can recover DNN executables into high-level models towards compilation optimizations, different architectures and quantized compiled models. We conduct experiments on 96 DNN executables across 12 common DNN models. Extensive experimental results demonstrate that NeuroDeX can decompile non-quantized executables into nearly identical high-level models. NeuroDeX can recover functionally similar high-level models for quantized executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more comprehensive and effective solution compared to previous DNN executables decompilers.

Updated: 2025-09-08 07:47:58

标题: NeuroDeX:为反编译深度神经网络可执行文件解锁多样化支持

摘要: 设备端深度学习模型有着广泛的现实需求。深度学习编译器将模型高效地编译为可执行文件以部署在边缘设备上,但这些可执行文件可能面临逆向工程的威胁。先前的研究尝试反编译DNN可执行文件,但在处理编译优化和分析量化编译模型方面面临挑战。在本文中,我们提出NeuroDeX,为反编译DNN可执行文件解锁多样化支持。NeuroDeX利用LLMs的语义理解能力并结合动态分析,准确高效地执行算子类型识别、算子属性恢复和模型重建。NeuroDeX能够将DNN可执行文件恢复为高层模型,覆盖编译优化、不同架构和量化编译模型。我们在12个常见DNN模型的96个DNN可执行文件上进行了实验。大量实验结果表明,NeuroDeX可以将非量化可执行文件反编译为几乎相同的高层模型;对于量化可执行文件,NeuroDeX可以恢复出功能相似的高层模型,平均top-1准确率达到72%。与以往的DNN可执行文件反编译器相比,NeuroDeX提供了更全面、更有效的解决方案。

更新时间: 2025-09-08 07:47:58

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2509.06402v1

Meta-Policy Reflexion: Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent

Large language model (LLM) agents achieve impressive single-task performance but commonly exhibit repeated failures, inefficient exploration, and limited cross-task adaptability. Existing reflective strategies (e.g., Reflexion, ReAct) improve per-episode behavior but typically produce ephemeral, task-specific traces that are not reused across tasks. Reinforcement-learning based alternatives can produce transferable policies but require substantial parameter updates and compute. In this work we introduce Meta-Policy Reflexion (MPR): a hybrid framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM) and applies that memory at inference time through two complementary mechanisms soft memory-guided decoding and hard rule admissibility checks(HAC). MPR (i) externalizes reusable corrective knowledge without model weight updates, (ii) enforces domain constraints to reduce unsafe or invalid actions, and (iii) retains the adaptability of language-based reflection. We formalize the MPM representation, present algorithms for update and decoding, and validate the approach in a text-based agent environment following the experimental protocol described in the provided implementation (AlfWorld-based). Empirical results reported in the supplied material indicate consistent gains in execution accuracy and robustness when compared to Reflexion baselines; rule admissibility further improves stability. We analyze mechanisms that explain these gains, discuss scalability and failure modes, and outline future directions for multimodal and multi-agent extensions.
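The hard rule admissibility check can be sketched as a memory of predicate-like rules that filters candidate actions before decoding. The rules, state encoding, and action strings below are hypothetical stand-ins for what the MPM would store, not the paper's actual representation:

```python
# Meta-policy memory entries: (condition predicate, human-readable reason).
# A rule fires when its condition holds for a (state, action) pair.
rules = [
    (lambda state, action: "locked" in state and action.startswith("open"),
     "cannot open a locked container without its key"),
    (lambda state, action: "hot" in state and action.startswith("touch"),
     "domain constraint: unsafe action"),
]

def admissible(state, action):
    """Hard admissibility check (HAC): an action passes only if no stored
    rule forbids it in the current state."""
    return all(not cond(state, action) for cond, _ in rules)

def hac_filter(state, candidates):
    # Applied at inference time, before the LLM's decoder commits to an action;
    # no model weights are updated.
    return [a for a in candidates if admissible(state, a)]

actions = hac_filter("the cabinet is locked",
                     ["open cabinet", "find key", "look"])
```

Because the memory is external and declarative, rules distilled from reflections on one task can be reused verbatim on later tasks, which is the resource-efficiency argument the abstract makes.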

Updated: 2025-09-08 07:40:58

标题: 元策略反思:用于资源高效LLM代理的可重用反思记忆与规则可接受性

摘要: 大型语言模型(LLM)代理实现了令人印象深刻的单任务表现,但通常表现出重复的失败、探索效率低和跨任务适应性有限。现有的反思策略(例如,Reflexion、ReAct)可以改善每一集的行为,但通常会产生短暂的、特定任务的痕迹,这些痕迹在任务间不会被重复利用。基于强化学习的替代方案可以产生可转移的策略,但需要大量参数更新和计算。在本文中,我们介绍了元策略反思(MPR):一种混合框架,将LLM生成的反思整合到一个结构化、类似谓词的元策略记忆(MPM)中,并在推理时通过两种互补机制(软记忆引导解码和硬规则可接受性检查(HAC))应用该记忆。MPR(i)在没有模型权重更新的情况下外部化可重用的纠正知识,(ii)强制执行领域约束以减少不安全或无效的动作,(iii)保留基于语言的反思的适应性。我们正式化MPM表示,提出更新和解码的算法,并按照所提供实现(基于AlfWorld)中描述的实验协议,在基于文本的代理环境中验证了该方法。提供的实验结果表明,与Reflexion基线相比,执行准确性和鲁棒性方面均取得了一致的收益;规则可接受性进一步提高了稳定性。我们分析了解释这些收益的机制,讨论了可扩展性和失败模式,并概述了多模态和多代理扩展的未来方向。

更新时间: 2025-09-08 07:40:58

领域: cs.AI

下载: http://arxiv.org/abs/2509.03990v2

Are Your LLM-based Text-to-SQL Models Secure? Exploring SQL Injection via Backdoor Attacks

Large language models (LLMs) have shown state-of-the-art results in translating natural language questions into SQL queries (Text-to-SQL), a long-standing challenge within the database community. However, security concerns remain largely unexplored, particularly the threat of backdoor attacks, which can introduce malicious behaviors into models through fine-tuning with poisoned datasets. In this work, we systematically investigate the vulnerabilities of LLM-based Text-to-SQL models and present ToxicSQL, a novel backdoor attack framework. Our approach leverages stealthy {semantic and character-level triggers} to make backdoors difficult to detect and remove, ensuring that malicious behaviors remain covert while maintaining high model accuracy on benign inputs. Furthermore, we propose leveraging SQL injection payloads as backdoor targets, enabling the generation of malicious yet executable SQL queries, which pose severe security and privacy risks in language model-based SQL development. We demonstrate that injecting only 0.44% of poisoned data can result in an attack success rate of 79.41%, posing a significant risk to database security. Additionally, we propose detection and mitigation strategies to enhance model reliability. Our findings highlight the urgent need for security-aware Text-to-SQL development, emphasizing the importance of robust defenses against backdoor threats.

Updated: 2025-09-08 07:36:01

标题: 您的基于LLM的文本到SQL模型安全吗?通过后门攻击探索SQL注入

摘要: 大型语言模型(LLMs)已经展示了在将自然语言问题翻译成SQL查询(文本到SQL)方面的最先进结果,这是数据库社区内长期存在的挑战。然而,安全问题仍然很少被探讨,特别是后门攻击的威胁,通过使用受污染的数据集进行微调,可以向模型中引入恶意行为。在这项工作中,我们系统地调查了基于LLM的文本到SQL模型的漏洞,并提出了一种新颖的后门攻击框架ToxicSQL。我们的方法利用隐秘的{语义和字符级触发器},使后门攻击难以检测和消除,确保恶意行为保持隐蔽,同时在良性输入上保持高模型准确性。此外,我们提出利用SQL注入有效载荷作为后门目标,使恶意但可执行的SQL查询生成成为可能,这在基于语言模型的SQL开发中带来严重的安全和隐私风险。我们展示了只注入0.44%的受污染数据就可以导致攻击成功率达到79.41%,对数据库安全构成重大风险。此外,我们提出了检测和缓解策略以增强模型的可靠性。我们的研究结果强调了对安全意识的文本到SQL开发的迫切需求,强调了对后门威胁的强大防御的重要性。

更新时间: 2025-09-08 07:36:01

领域: cs.CR,cs.DB

下载: http://arxiv.org/abs/2503.05445v3

Learning and composing of classical music using restricted Boltzmann machines

Recently, software has been developed that uses machine learning to mimic the style of a particular composer, such as J. S. Bach. However, since such software often adopts machine learning models with complex structures, it is difficult to analyze how the software understands the characteristics of the composer's music. In this study, we adopted J. S. Bach's music for training of a restricted Boltzmann machine (RBM). Since the structure of RBMs is simple, it allows us to investigate the internal states after learning. We found that the learned RBM is able to compose music.
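The simplicity that makes RBM internals inspectable is visible even in a toy: binary visible units for note patterns, a couple of hidden units, and one-step contrastive divergence (CD-1) training, after which the learned weights can both be examined and used to complete corrupted patterns. The sizes, learning rate, and two hand-made six-unit "phrases" are illustrative, not the paper's training setup:

```python
import math, random

random.seed(1)
n_v, n_h = 6, 2                       # visible (note) units, hidden units
W = [[random.gauss(0, 0.1) for _ in range(n_h)] for _ in range(n_v)]
b_v = [0.0] * n_v
b_h = [0.0] * n_h
sig = lambda x: 1.0 / (1.0 + math.exp(-x))

def p_h(v):   # p(h_j = 1 | v)
    return [sig(b_h[j] + sum(W[i][j] * v[i] for i in range(n_v))) for j in range(n_h)]

def p_v(h):   # p(v_i = 1 | h)
    return [sig(b_v[i] + sum(W[i][j] * h[j] for j in range(n_h))) for i in range(n_v)]

data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]   # two toy "phrases"
lr = 0.2
for epoch in range(500):
    for v0 in data:
        h0 = p_h(v0)
        hs = [1 if random.random() < p else 0 for p in h0]   # sample hidden state
        v1 = p_v(hs)                                          # CD-1 reconstruction
        h1 = p_h(v1)
        for i in range(n_v):                                  # positive - negative phase
            for j in range(n_h):
                W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
            b_v[i] += lr * (v0[i] - v1[i])
        for j in range(n_h):
            b_h[j] += lr * (h0[j] - h1[j])

# "Compose": complete a corrupted fragment of the first phrase.
recon = p_v(p_h([1, 1, 0, 0, 0, 0]))
```

After training, each hidden unit's weight column can be read off directly as the pattern it detects, which is exactly the kind of internal-state inspection the abstract argues is hard with larger models.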

Updated: 2025-09-08 07:31:38

标题: 使用受限玻尔兹曼机学习和创作古典音乐

摘要: 最近,已经开发出了一种利用机器学习来模仿特定作曲家风格的软件,比如J.S.巴赫。然而,由于这类软件通常采用复杂结构的机器学习模型,因此很难分析软件如何理解作曲家音乐的特征。在这项研究中,我们采用了J.S.巴赫的音乐来训练受限玻尔兹曼机(RBM)。由于RBM的结构简单,我们能够在学习之后考察其内部状态。我们发现学习后的RBM能够创作音乐。

更新时间: 2025-09-08 07:31:38

领域: cs.SD,cs.LG,eess.AS

下载: http://arxiv.org/abs/2509.04899v2

Graph Neural Networks for Resource Allocation in Interference-limited Multi-Channel Wireless Networks with QoS Constraints

Meeting minimum data rate constraints is a significant challenge in wireless communication systems, particularly as network complexity grows. Traditional deep learning approaches often address these constraints by incorporating penalty terms into the loss function and tuning hyperparameters empirically. However, this heuristic treatment offers no theoretical convergence guarantees and frequently fails to satisfy QoS requirements in practical scenarios. Building upon the structure of the WMMSE algorithm, we first extend it to a multi-channel setting with QoS constraints, resulting in the enhanced WMMSE (eWMMSE) algorithm, which is provably convergent to a locally optimal solution when the problem is feasible. To further reduce computational complexity and improve scalability, we develop a GNN-based algorithm, JCPGNN-M, capable of supporting simultaneous multi-channel allocation per user. To overcome the limitations of traditional deep learning methods, we propose a principled framework that integrates GNN with a Lagrangian-based primal-dual optimization method. By training the GNN within the Lagrangian framework, we ensure satisfaction of QoS constraints and convergence to a stationary point. Extensive simulations demonstrate that JCPGNN-M matches the performance of eWMMSE while offering significant gains in inference speed, generalization to larger networks, and robustness under imperfect channel state information. This work presents a scalable and theoretically grounded solution for constrained resource allocation in future wireless networks.
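The contrast between penalty tuning and Lagrangian primal-dual optimization can be shown on a toy two-user power split: maximize user 1's rate subject to a QoS floor on user 2, with the dual variable raised whenever the constraint is violated instead of a hand-tuned penalty weight. The problem, step sizes, and iteration count are a hypothetical stand-in for the scheme used to train the GNN:

```python
import math

# Maximise log(1 + p1) subject to log(1 + p2) >= r_min, with p1 + p2 = P.
P, r_min = 10.0, 1.0
p1, lam = 5.0, 1.0                    # primal variable and dual multiplier
for _ in range(5000):
    p2 = P - p1
    # Primal ascent on the Lagrangian L = log(1+p1) + lam * (log(1+p2) - r_min).
    grad_p1 = 1.0 / (1.0 + p1) - lam / (1.0 + p2)
    p1 = min(max(p1 + 0.05 * grad_p1, 0.0), P - 1e-6)
    # Dual ascent on the constraint violation: lam grows while QoS is unmet.
    lam = max(0.0, lam + 0.05 * (r_min - math.log(1.0 + (P - p1))))

rate2 = math.log(1.0 + (P - p1))
```

At the saddle point the QoS constraint is met with equality (it is binding here, since any spare power helps user 1), with no penalty hyperparameter to tune and a stationarity guarantee of the kind the abstract claims for the GNN trained inside this framework.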

Updated: 2025-09-08 07:28:10

标题: 图神经网络用于受干扰限制的多信道无线网络中的资源分配与QoS约束

摘要: 在无线通信系统中,满足最低数据速率约束是一个重大挑战,尤其是在网络复杂性不断增加的情况下。传统的深度学习方法通常通过将惩罚项纳入损失函数并经验性地调整超参数来处理这些约束。然而,这种启发式处理没有理论上的收敛保证,并且在实际场景中经常无法满足QoS要求。基于WMMSE算法的结构,我们首先将其扩展到具有QoS约束的多信道设置,得到增强的WMMSE(eWMMSE)算法,当问题可行时,可证明其收敛到局部最优解。为了进一步降低计算复杂度并提高可扩展性,我们开发了一种基于GNN的算法JCPGNN-M,能够支持每个用户同时进行多信道分配。为了克服传统深度学习方法的局限性,我们提出了一个有原则的框架,将GNN与基于拉格朗日的原始-对偶优化方法相结合。通过在拉格朗日框架内训练GNN,我们确保满足QoS约束并收敛到一个稳定点。大量模拟表明,JCPGNN-M的性能与eWMMSE相当,同时在推理速度、对更大网络的泛化能力以及不完美信道状态信息下的鲁棒性方面具有显著优势。这项工作为未来无线网络中的受限资源分配提供了一种可扩展且具有理论依据的解决方案。

更新时间: 2025-09-08 07:28:10

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2509.06395v1

SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion

Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.

Updated: 2025-09-08 07:18:13

标题: SGDFuse:SAM引导扩散用于高保真度红外和可见光图像融合

摘要: 红外和可见光图像融合(IVIF)旨在将红外图像中的热辐射信息与可见图像中丰富的纹理细节相结合,以增强下游视觉任务的感知能力。然而,现有方法往往由于缺乏对场景的深层语义理解而无法保留关键目标,同时融合过程本身也可能引入伪影和细节丢失,严重影响图像质量和任务性能。为解决这些问题,本文提出了SGDFuse,一种由Segment Anything Model(SAM)引导的条件扩散模型,以实现高保真度和语义感知的图像融合。我们方法的核心是利用SAM生成的高质量语义掩模作为明确先验,通过条件扩散模型指导融合过程的优化。具体而言,该框架在两阶段过程中运行:首先对多模态特征进行初步融合,然后联合使用SAM的语义掩模和初步融合图像作为条件来驱动扩散模型的粗到细的去噪生成。这确保了融合过程不仅具有明确的语义方向性,还保证了最终结果的高保真度。大量实验证明,SGDFuse在主观和客观评估中均取得了最先进的性能,以及在适应下游任务方面的灵活性,为图像融合中的核心挑战提供了强大的解决方案。SGDFuse的代码可在https://github.com/boshizhang123/SGDFuse 上找到。

更新时间: 2025-09-08 07:18:13

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.05264v2

MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation

A key challenge in synthesizing audios from silent videos is the inherent trade-off between synthesis quality and inference efficiency in existing methods. For instance, flow matching based models rely on modeling instantaneous velocity, inherently require an iterative sampling process, leading to slow inference speeds. To address this efficiency bottleneck, we introduce a MeanFlow-accelerated model that characterizes flow fields using average velocity, enabling one-step generation and thereby significantly accelerating multimodal video-to-audio (VTA) synthesis while preserving audio quality, semantic alignment, and temporal synchronization. Furthermore, a scalar rescaling mechanism is employed to balance conditional and unconditional predictions when classifier-free guidance (CFG) is applied, effectively mitigating CFG-induced distortions in one step generation. Since the audio synthesis network is jointly trained with multimodal conditions, we further evaluate it on text-to-audio (TTA) synthesis task. Experimental results demonstrate that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality on both VTA and TTA synthesis tasks.
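The instantaneous-versus-average velocity distinction can be made concrete on a one-dimensional toy flow with a known, state-independent velocity field: flow matching integrates the instantaneous velocity over many Euler steps, while a MeanFlow-style model of the average velocity u(r, t) = (1/(t-r)) * integral of v over [r, t] reaches the endpoint in a single step. The cosine velocity field is an illustrative choice, not the paper's learned model:

```python
import math

# Toy flow: instantaneous velocity v(t) = cos(t), independent of the state x.
v = math.cos

def u(r, t):
    # Closed-form average velocity over [r, t] for this toy field.
    return (math.sin(t) - math.sin(r)) / (t - r)

x0 = 0.2
exact = x0 + math.sin(1.0)            # true endpoint of the flow at t = 1

# Iterative sampling: 10 Euler steps with the instantaneous velocity.
x, n = x0, 10
for k in range(n):
    x += (1.0 / n) * v(k / n)
euler = x

# One-step sampling with the average velocity.
one_step = x0 + (1.0 - 0.0) * u(0.0, 1.0)
```

Here the one-step jump is exact while ten Euler steps still carry discretization error; in the paper's setting the average velocity is learned by a network rather than known in closed form, but the sampling-cost argument is the same.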

Updated: 2025-09-08 07:15:21

Categories: cs.SD,cs.AI

Download: http://arxiv.org/abs/2509.06389v1

PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation

Modern app store recommender systems struggle with multiple-category apps, as traditional taxonomies fail to capture overlapping semantics, leading to suboptimal personalization. We propose PCR-CA (Parallel Codebook Representations with Contrastive Alignment), an end-to-end framework for improved CTR prediction. PCR-CA first extracts compact multimodal embeddings from app text, then introduces a Parallel Codebook VQ-AE module that learns discrete semantic representations across multiple codebooks in parallel -- unlike hierarchical residual quantization (RQ-VAE). This design enables independent encoding of diverse aspects (e.g., gameplay, art style), better modeling multiple-category semantics. To bridge semantic and collaborative signals, we employ a contrastive alignment loss at both the user and item levels, enhancing representation learning for long-tail items. Additionally, a dual-attention fusion mechanism combines ID-based and semantic features to capture user interests, especially for long-tail apps. Experiments on a large-scale dataset show PCR-CA achieves a +0.76% AUC improvement over strong baselines, with +2.15% AUC gains for long-tail apps. Online A/B testing further validates our approach, showing a +10.52% lift in CTR and a +16.30% improvement in CVR, demonstrating PCR-CA's effectiveness in real-world deployment. The new framework has now been fully deployed on the Microsoft Store.
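
The contrast between parallel codebooks and hierarchical residual quantization can be sketched in a few lines. The tiny codebooks and the "aspect" labels below are invented for illustration; only the quantization logic mirrors the description above: in the parallel scheme every codebook quantizes the same embedding independently, whereas in the RQ-style scheme each stage quantizes the residual left by the previous one.

```python
# Illustrative sketch of parallel vs. residual (RQ-style) quantization.
# Codebooks and vectors are made up; real codebooks are learned.

def nearest(codebook, vec):
    # index of the closest code by squared Euclidean distance
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def parallel_quantize(codebooks, vec):
    # PCR-CA style: K independent discrete codes for one embedding
    return [nearest(cb, vec) for cb in codebooks]

def residual_quantize(codebooks, vec):
    # RQ-VAE style contrast: each stage quantizes what the previous left over
    codes, residual = [], list(vec)
    for cb in codebooks:
        i = nearest(cb, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return codes

codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # e.g. a "gameplay" aspect codebook
    [(0.0, 1.0), (1.0, 0.0)],   # e.g. an "art style" aspect codebook
]
emb = (0.9, 0.8)
print(parallel_quantize(codebooks, emb), residual_quantize(codebooks, emb))
```

Because the parallel codes do not depend on each other, each one can specialize in a different semantic aspect of the app, which is the property the abstract credits for better multiple-category modeling.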

Updated: 2025-09-08 07:12:07

Categories: cs.IR,cs.LG

Download: http://arxiv.org/abs/2508.18166v4

Beyond the Pre-Service Horizon: Infusing In-Service Behavior for Improved Financial Risk Forecasting

Typical financial risk management involves distinct phases for pre-service risk assessment and in-service default detection, often modeled separately. This paper proposes a novel framework, Multi-Granularity Knowledge Distillation (abbreviated as MGKD), aimed at improving pre-service risk prediction through the integration of in-service user behavior data. MGKD follows the idea of knowledge distillation, where the teacher model, trained on historical in-service data, guides the student model, which is trained on pre-service data. By using soft labels derived from in-service data, the teacher model helps the student model improve its risk prediction prior to service activation. Meanwhile, a multi-granularity distillation strategy is introduced, including coarse-grained, fine-grained, and self-distillation, to align the representations and predictions of the teacher and student models. This approach not only reinforces the representation of default cases but also enables the transfer of key behavioral patterns associated with defaulters from the teacher to the student model, thereby improving the overall performance of pre-service risk assessment. Moreover, we adopt a re-weighting strategy to mitigate the model's bias towards the minority class. Experimental results on large-scale real-world datasets from Tencent Mobile Payment demonstrate the effectiveness of our proposed approach in both offline and online scenarios.
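
A minimal sketch of the distillation objective described above, under stated assumptions: the temperature, mixing weight, and minority-class weight are illustrative choices, not the paper's values, and the multi-granularity terms are collapsed into a single soft-label KL term for brevity.

```python
import math

# Sketch: a student trained on pre-service features mimics soft labels
# from a teacher that saw in-service behavior, with a re-weighting term
# for the minority (default) class. Hyperparameters are illustrative.

def softmax(logits, temp=1.0):
    exps = [math.exp(l / temp) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    # KL(p || q) for discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mgkd_loss(student_logits, teacher_logits, hard_label,
              temp=2.0, alpha=0.5, minority_weight=3.0):
    # soft part: match the teacher's tempered distribution
    soft = kl(softmax(teacher_logits, temp), softmax(student_logits, temp))
    # hard part: cross-entropy on the true label, up-weighted for defaults
    probs = softmax(student_logits)
    hard = -math.log(probs[hard_label])
    w = minority_weight if hard_label == 1 else 1.0
    return alpha * soft + (1 - alpha) * w * hard

aligned = mgkd_loss([2.0, -1.0], [2.0, -1.0], hard_label=0)
shifted = mgkd_loss([-1.0, 2.0], [2.0, -1.0], hard_label=0)
print(aligned, shifted)
```

A student that disagrees with the teacher pays through the soft term even when labels are unavailable, which is how in-service knowledge reaches the pre-service model.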

Updated: 2025-09-08 07:09:18

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06385v1

The GOOSE Dataset for Perception in Unstructured Environments

The potential for deploying autonomous systems can be significantly increased by improving the perception and interpretation of the environment. However, the development of deep learning-based techniques for autonomous systems in unstructured outdoor environments poses challenges due to limited data availability for training and testing. To address this gap, we present the German Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset specifically designed for unstructured outdoor environments. The GOOSE dataset incorporates 10 000 labeled pairs of images and point clouds, which are utilized to train a range of state-of-the-art segmentation models on both image and point cloud data. We open source the dataset, along with an ontology for unstructured terrain, as well as dataset standards and guidelines. This initiative aims to establish a common framework, enabling the seamless inclusion of existing datasets and a fast way to enhance the perception capabilities of various robots operating in unstructured environments. The dataset, pre-trained models for offroad perception, and additional documentation can be found at https://goose-dataset.de/.

Updated: 2025-09-08 07:08:38

Categories: cs.CV,cs.LG,cs.RO

Download: http://arxiv.org/abs/2310.16788v2

Variational Garrote for Statistical Physics-based Sparse and Robust Variable Selection

Selecting key variables from high-dimensional data is increasingly important in the era of big data. Sparse regression serves as a powerful tool for this purpose by promoting model simplicity and explainability. In this work, we revisit a valuable yet underutilized method, the statistical physics-based Variational Garrote (VG), which introduces explicit feature selection spin variables and leverages variational inference to derive a tractable loss function. We enhance VG by incorporating modern automatic differentiation techniques, enabling scalable and efficient optimization. We evaluate VG on both fully controllable synthetic datasets and complex real-world datasets. Our results demonstrate that VG performs especially well in highly sparse regimes, offering more consistent and robust variable selection than Ridge and LASSO regression across varying levels of sparsity. We also uncover a sharp transition: as superfluous variables are admitted, generalization degrades abruptly and the uncertainty of the selection variables increases. This transition point provides a practical signal for estimating the correct number of relevant variables, an insight we successfully apply to identify key predictors in real-world data. We expect that VG offers strong potential for sparse modeling across a wide range of applications, including compressed sensing and model pruning in machine learning.
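
The "feature selection spin variables" can be illustrated directly. The brute-force search below stands in for VG's variational relaxation (which makes the search tractable); the data, penalty weight, and seed are invented for the toy, and only the gated-objective structure mirrors the method.

```python
import itertools
import numpy as np

# Toy: spin variables s in {0,1}^d gate which features enter the
# regression; each configuration is scored by residual error plus a
# complexity penalty per active spin. VG optimizes a variational
# relaxation of this search; enumeration here is only for intuition.

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.normal(size=(n, d))
w_true = np.array([2.0, 0.0, -1.5, 0.0])      # features 0 and 2 relevant
y = X @ w_true + 0.1 * rng.normal(size=n)

def gated_objective(spins, gamma=0.5):
    active = [j for j, s in enumerate(spins) if s]
    if not active:
        return float(np.mean(y ** 2))
    Xa = X[:, active]
    w, *_ = np.linalg.lstsq(Xa, y, rcond=None)  # least squares on gated features
    resid = y - Xa @ w
    return float(np.mean(resid ** 2)) + gamma * len(active)

best = min(itertools.product([0, 1], repeat=d), key=gated_objective)
print(best)
```

The penalty makes superfluous spins costly, so the minimizer recovers exactly the true support; VG's contribution is reaching the same kind of solution without enumerating 2^d configurations.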

Updated: 2025-09-08 07:06:10

Categories: cs.LG,physics.data-an

Download: http://arxiv.org/abs/2509.06383v1

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/
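
The data-level unification AnyGPT relies on can be sketched as vocabulary bookkeeping: each modality is first tokenized into discrete local ids, which are then shifted into disjoint ranges of one shared vocabulary so a vanilla LLM can model mixed sequences. The range sizes and the toy sequence below are illustrative, not the paper's tokenizer configuration.

```python
# Sketch: modality-specific token ids mapped into one shared vocabulary
# via per-modality offsets. Vocabulary sizes are made up for illustration.

VOCAB = {"text": (0, 32000), "image": (32000, 40192), "speech": (40192, 41216)}

def to_unified(modality, local_ids):
    # shift local ids into the modality's reserved id range
    base, end = VOCAB[modality]
    assert all(0 <= i < end - base for i in local_ids)
    return [base + i for i in local_ids]

def modality_of(unified_id):
    # invert the offset: recover which modality a token belongs to
    for name, (base, end) in VOCAB.items():
        if base <= unified_id < end:
            return name
    raise ValueError(unified_id)

seq = to_unified("text", [5, 17]) + to_unified("image", [3]) + to_unified("speech", [9])
print(seq, [modality_of(t) for t in seq])
```

Because the mapping happens entirely in preprocessing, adding a modality is just reserving a new id range, which is the sense in which the abstract compares it to adding a new language.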

Updated: 2025-09-08 07:04:17

Categories: cs.CL,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2402.12226v5

Improved sampling algorithms and Poincaré inequalities for non-log-concave distributions

We study the problem of sampling from a distribution $\mu$ with density $\propto e^{-V}$ for some potential function $V:\mathbb R^d\to \mathbb R$ with query access to $V$ and $\nabla V$. We start with the following standard assumptions: (1) The potential function $V$ is $L$-smooth. (2) The second moment $\mathbf{E}_{X\sim \mu}[\|X\|^2]\leq M$. Recently, He and Zhang (COLT'25) showed that the query complexity of sampling from such distributions is at least $\left(\frac{LM}{d\epsilon}\right)^{\Omega(d)}$, where $\epsilon$ is the desired accuracy in total variation distance and the Poincaré constant can be arbitrarily large. Meanwhile, another common assumption in the study of diffusion-based samplers (see, e.g., the work of Chen, Chewi, Li, Li, Salim and Zhang (ICLR'23)) strengthens the smoothness condition (1) to the following: (1*) The potential function of *every* distribution along the Ornstein-Uhlenbeck process starting from $\mu$ is $L$-smooth. We show that under assumptions (1*) and (2), the query complexity of sampling from $\mu$ can be $\mathrm{poly}(L,d)\cdot \left(\frac{Ld+M}{\epsilon^2}\right)^{\mathcal{O}(L+1)}$, which is polynomial in $d$ and $\frac{1}{\epsilon}$ when $L=\mathcal{O}(1)$ and $M=\mathrm{poly}(d)$. This improves on the algorithm with quasi-polynomial query complexity developed by Huang et al. (COLT'24). Our results imply that the seemingly moderate strengthening of the smoothness condition (1) to (1*) can lead to an exponential gap in the query complexity of sampling algorithms. Moreover, we show that together with assumption (1*) and the stronger moment assumption that $\|X\|$ is $\lambda$-sub-Gaussian for $X\sim\mu$, the Poincaré constant of $\mu$ is at most $\mathcal{O}(\lambda)^{2(L+1)}$. As an application of our technique, we obtain an improved estimate of the Poincaré constant for mixtures of Gaussians with the same covariance.

Updated: 2025-09-08 07:00:19

Categories: cs.DS,cs.LG,math.PR,stat.ML

Download: http://arxiv.org/abs/2507.11236v3

Beyond Linearity and Time-homogeneity: Relational Hyper Event Models with Time-Varying Non-Linear Effects

Recent technological advances have made it easier to collect large and complex networks of time-stamped relational events connecting two or more entities. Relational hyper-event models (RHEMs) aim to explain the dynamics of these events by modeling the event rate as a function of statistics based on past history and external information. However, despite the complexity of the data, most current RHEM approaches still rely on a linearity assumption to model this relationship. In this work, we address this limitation by introducing a more flexible model that allows the effects of statistics to vary non-linearly and over time. While time-varying and non-linear effects have been used in relational event modeling, we take this further by modeling joint time-varying and non-linear effects using tensor product smooths. We validate our methodology on both synthetic and empirical data. In particular, we use RHEMs to study how patterns of scientific collaboration and impact evolve over time. Our approach provides deeper insights into the dynamic factors driving relational hyper-events, allowing us to evaluate potential non-monotonic patterns that cannot be identified using linear models.

Updated: 2025-09-08 06:53:15

Categories: stat.ME,cs.LG,stat.AP

Download: http://arxiv.org/abs/2509.05289v2

Breaking SafetyCore: Exploring the Risks of On-Device AI Deployment

Due to hardware and software improvements, an increasing number of AI models are deployed on-device. This shift enhances privacy and reduces latency, but also introduces security risks distinct from traditional software. In this article, we examine these risks through the real-world case study of SafetyCore, an Android system service incorporating sensitive image content detection. We demonstrate how the on-device AI model can be extracted and manipulated to bypass detection, effectively rendering the protection ineffective. Our analysis exposes vulnerabilities of on-device AI models and provides a practical demonstration of how adversaries can exploit them.

Updated: 2025-09-08 06:53:13

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06371v1

Bitcoin: A Non-Continuous Time System

This paper reconceptualizes Bitcoin not merely as a financial protocol but as a non-continuous temporal system. Traditional financial and computing systems impose time through synchronized external clocks, whereas Bitcoin constructs temporal order internally through decentralized consensus. We analyze block generation as a probabilistic process, forks and rollbacks as disruptions of continuity, and transaction confirmations as non-linear consolidations of history. Building on this foundation, we reinterpret proof-of-work (PoW) as an entropy-compression mechanism that drives Bitcoin's temporal architecture. In this framework, difficulty adjustment keeps block discovery near the entropy-maximizing regime, the longest-chain rule enforces entropy collapse once a block is found, and recursive hash pointers embed each block within its predecessor, creating a cascading structure that renders reorganization exponentially improbable. Together, these mechanisms compress randomness into a sequence of discrete and irreversible steps, demonstrating how Bitcoin establishes an emergent and convergent notion of time without reliance on external clocks.
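
The "entropy compression" reading of proof-of-work can be seen in miniature: hashing a block header with candidate nonces is a memoryless search, and the first nonce landing under the target collapses many random trials into one discrete, irreversible tick of the chain's internal clock. The header string and the 12-bit difficulty below are toy values, far below Bitcoin's.

```python
import hashlib

# Toy proof-of-work: find a nonce whose SHA-256 digest falls below a
# target. Difficulty here is a 12-bit toy setting (~4096 expected trials).

def pow_search(header: bytes, difficulty_bits: int) -> int:
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce        # the search "collapses" at this trial
        nonce += 1

header = b"prev_hash|merkle_root|timestamp"   # illustrative header fields
nonce = pow_search(header, difficulty_bits=12)
digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
print(nonce, digest.hex()[:8])
```

Each block additionally commits to its predecessor's hash (the `prev_hash` field above), which is the recursive embedding the paper argues makes reorganizing history exponentially improbable.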

Updated: 2025-09-08 06:51:03

Categories: cs.CR

Download: http://arxiv.org/abs/2501.11091v9

From Perception to Protection: A Developer-Centered Study of Security and Privacy Threats in Extended Reality (XR)

The immersive nature of XR introduces a fundamentally different set of security and privacy (S&P) challenges due to the unprecedented user interactions and data collection that traditional paradigms struggle to mitigate. As the primary architects of XR applications, developers play a critical role in addressing novel threats. However, to effectively support developers, we must first understand how they perceive and respond to different threats. Despite the growing importance of this issue, there is a lack of in-depth, threat-aware studies that examine XR S&P from the developers' perspective. To fill this gap, we interviewed 23 professional XR developers with a focus on emerging threats in XR. Our study addresses two research questions aiming to uncover existing problems in XR development and identify actionable paths forward. By examining developers' perceptions of S&P threats, we found that: (1) XR development decisions (e.g., rich sensor data collection, user-generated content interfaces) are closely tied to and can amplify S&P threats, yet developers are often unaware of these risks, resulting in cognitive biases in threat perception; and (2) limitations in existing mitigation methods, combined with insufficient strategic, technical, and communication support, undermine developers' motivation, awareness, and ability to effectively address these threats. Based on these findings, we propose actionable and stakeholder-aware recommendations to improve XR S&P throughout the XR development process. This work represents the first effort to undertake a threat-aware, developer-centered study in the XR domain -- an area where the immersive, data-rich nature of the XR technology introduces distinctive challenges.

Updated: 2025-09-08 06:48:48

Categories: cs.CR,cs.HC

Download: http://arxiv.org/abs/2509.06368v1

MRD-LiNet: A Novel Lightweight Hybrid CNN with Gradient-Guided Unlearning for Improved Drought Stress Identification

Drought stress is a major threat to global crop productivity, making its early and precise detection essential for sustainable agricultural management. Traditional approaches, though useful, are often time-consuming and labor-intensive, which has motivated the adoption of deep learning methods. In recent years, Convolutional Neural Network (CNN) and Vision Transformer architectures have been widely explored for drought stress identification; however, these models generally rely on a large number of trainable parameters, restricting their use in resource-limited and real-time agricultural settings. To address this challenge, we propose a novel lightweight hybrid CNN framework inspired by ResNet, DenseNet, and MobileNet architectures. The framework achieves a remarkable 15-fold reduction in trainable parameters compared to conventional CNN and Vision Transformer models, while maintaining competitive accuracy. In addition, we introduce a machine unlearning mechanism based on a gradient norm-based influence function, which enables targeted removal of specific training data influence, thereby improving model adaptability. The method was evaluated on an aerial image dataset of potato fields with expert-annotated healthy and drought-stressed regions. Experimental results show that our framework achieves high accuracy while substantially lowering computational costs. These findings highlight its potential as a practical, scalable, and adaptive solution for drought stress monitoring in precision agriculture, particularly under resource-constrained conditions.
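
The gradient-guided unlearning mechanism can be sketched on a linear regressor: each training sample's influence is proxied by its gradient norm, and the influence of a chosen "forget" subset is removed with a scaled gradient-ascent step on that subset's loss. This is an illustration of the general mechanism under invented data and step sizes, not the paper's exact procedure.

```python
import numpy as np

# Sketch: gradient-norm influence scores + targeted ascent on a forget set.

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.05 * rng.normal(size=n)

# fit a linear model by gradient descent
w = np.zeros(d)
for _ in range(500):
    w -= 0.05 * (2 / n) * X.T @ (X @ w - y)

def sample_grads(w):
    # per-sample gradient of squared error: 2 * (x.w - y) * x
    return 2 * (X @ w - y)[:, None] * X

def mse(idx, w):
    r = X[idx] @ w - y[idx]
    return float(np.mean(r ** 2))

norms = np.linalg.norm(sample_grads(w), axis=1)
forget = np.argsort(norms)[-20:]          # 20 most influential samples

before = mse(forget, w)
g_forget = sample_grads(w)[forget].mean(axis=0)
w_unlearned = w + 0.5 * g_forget          # ascend on the forget subset's loss
print(before, mse(forget, w_unlearned))
```

After the ascent step the model fits the forget subset strictly worse, which is the behavioral signature of targeted influence removal.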

Updated: 2025-09-08 06:46:35

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.06367v1

Regeneration Based Training-free Attribution of Fake Images Generated by Text-to-Image Generative Models

Text-to-image generative models have recently garnered significant attention due to their ability to generate images from prompt descriptions. While these models have shown promising performance, concerns have been raised about the potential misuse of the generated fake images. In response, we present a simple yet effective training-free method to attribute fake images generated by text-to-image models to their source models. Given a test image to be attributed, we first invert the textual prompt of the image, and then feed the reconstructed prompt to each candidate model to regenerate candidate fake images. By computing and ranking the similarity between the test image and each candidate image, we can determine the source of the image. This attribution allows model owners to be held accountable for any misuse of their models. Note that our approach does not limit the number of candidate text-to-image generative models. Comprehensive experiments reveal that (1) our method can effectively attribute fake images to their source models, achieving attribution performance comparable to the state-of-the-art method; (2) our method scales well, adapting readily to real-world attribution scenarios; and (3) the proposed method is satisfactorily robust to common attacks such as Gaussian blurring, JPEG compression, and resizing. We also analyze the factors that influence attribution performance, and explore the boost the proposed method brings as a plug-in to improve existing state-of-the-art methods. We hope our work sheds light on tracing the source of AI-generated images and on preventing the misuse of text-to-image generative models.
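
The regenerate-and-rank logic can be shown in miniature. The toy "generators" below are deterministic byte patterns standing in for real text-to-image models, and the prompt is given rather than inverted; only the rank-by-similarity attribution step mirrors the method.

```python
import hashlib

# Training-free attribution in miniature: regenerate with every candidate
# "model" and attribute to the one whose output is most similar.

def toy_generator(style_seed):
    # each toy "model" renders a prompt as a deterministic byte pattern
    def render(prompt):
        return list(hashlib.sha256(f"{style_seed}:{prompt}".encode()).digest())
    return render

def similarity(a, b):
    # negative mean absolute difference: higher means more similar
    return -sum(abs(x - y) for x, y in zip(a, b)) / len(a)

candidates = {name: toy_generator(name) for name in ("modelA", "modelB", "modelC")}

def attribute(test_image, prompt):
    scores = {name: similarity(test_image, g(prompt)) for name, g in candidates.items()}
    return max(scores, key=scores.get)

prompt = "a cat on a skateboard"           # stands in for the inverted prompt
test_image = candidates["modelB"](prompt)  # fake image of known provenance
print(attribute(test_image, prompt))
```

Because the pipeline only queries candidate models as black boxes, adding another candidate is just adding another entry to the dictionary, which is the scalability property the abstract emphasizes.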

Updated: 2025-09-08 06:46:18

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2403.01489v2

Online Prompt Pricing based on Combinatorial Multi-Armed Bandit and Hierarchical Stackelberg Game

Generative models have shown promising performance on various tasks, making trading around machine learning models possible. In this paper, we target a novel prompt-trading scenario, the prompt bundle trading (PBT) system, and propose an online pricing mechanism. Based on the combinatorial multi-armed bandit (CMAB) and a three-stage hierarchical Stackelberg (HS) game, our pricing mechanism considers the profits of the consumer, platform, and seller simultaneously, achieving profit satisfaction for all three participants. We break the pricing problem into two steps: unknown category selection and incentive strategy optimization. The former selects a set of categories with the highest qualities; the latter derives the optimal strategy for each participant based on the chosen categories. Unlike the existing fixed pricing mode, the PBT pricing mechanism we propose is more flexible and diverse, better matching the transaction needs of real-world scenarios. We test our method on a simulated text-to-image dataset. The experimental results demonstrate the effectiveness of our algorithm, which provides a feasible price-setting standard for prompt marketplaces.
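
The "unknown category selection" step can be sketched as a combinatorial UCB bandit that repeatedly plays the k categories with the highest upper confidence bounds on quality and updates their empirical means. The Bernoulli qualities, horizon, and seed below are invented to show the bandit concentrating on the truly best categories; the paper's mechanism additionally couples this with the Stackelberg pricing stage.

```python
import math
import random

# Combinatorial UCB sketch: pick the top-k arms by UCB index each round.

random.seed(0)
true_quality = [0.9, 0.8, 0.5, 0.3, 0.2]   # hidden from the learner
K, k, T = len(true_quality), 2, 3000
counts, means = [0] * K, [0.0] * K

for t in range(1, T + 1):
    def ucb(i):
        if counts[i] == 0:
            return float("inf")            # play every arm at least once
        return means[i] + math.sqrt(2 * math.log(t) / counts[i])
    chosen = sorted(range(K), key=ucb, reverse=True)[:k]
    for i in chosen:                       # Bernoulli feedback per category
        reward = 1.0 if random.random() < true_quality[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]

top = sorted(range(K), key=lambda i: counts[i], reverse=True)[:k]
print(top, [round(m, 2) for m in means])
```

After a few thousand rounds the two highest-quality categories dominate the play counts, giving the downstream pricing game a reliable category set to work with.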

Updated: 2025-09-08 06:44:26

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2405.15154v3

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent's decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.

Updated: 2025-09-08 06:43:33

Categories: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2503.15108v2

The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors

Providing constructive feedback to paper authors is a core component of peer review. As reviewers have increasingly little time to perform reviews, automated support systems are required to ensure high reviewing quality, keeping the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in the weakness sections of reviews) that drive their utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable the evaluation and development of models that assess review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect scores of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.

Updated: 2025-09-08 06:37:53

Categories: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2509.04484v2

DIRF: A Framework for Digital Identity Protection and Clone Governance in Agentic AI Systems

The rapid advancement and widespread adoption of generative artificial intelligence (AI) pose significant threats to the integrity of personal identity, including digital cloning, sophisticated impersonation, and the unauthorized monetization of identity-related data. Mitigating these risks necessitates the development of robust AI-generated content detection systems, enhanced legal frameworks, and ethical guidelines. To address this critical need, this paper introduces the Digital Identity Rights Framework (DIRF), a structured security and governance model designed to protect behavioral, biometric, and personality-based digital likeness attributes. Structured across nine domains and 63 controls, DIRF integrates legal, technical, and hybrid enforcement mechanisms to secure digital identity consent, traceability, and monetization. We present the architectural foundations, enforcement strategies, and key use cases supporting the need for a unified framework. This work aims to inform platform builders, legal entities, and regulators about the controls essential to enforcing identity rights in AI-driven systems.

Updated: 2025-09-08 06:22:03

标题: DIRF:一种数字身份保护和代理AI系统中克隆治理的框架

摘要: 快速发展和广泛采用生成式人工智能(AI)对个人身份的完整性构成重大威胁,包括数字克隆、复杂冒充和未经授权的身份相关数据商业化。减轻这些风险需要开发强大的AI生成内容检测系统、加强法律框架和道德准则。本文介绍了数字身份权利框架(DIRF),这是一个结构化的安全和治理模型,旨在保护基于行为、生物特征和个性化数字相似性属性,以解决这一关键需求。DIRF跨越九个领域和63个控制,整合了法律、技术和混合执行机制,以确保数字身份的同意、可追溯性和商业化。我们提出了建筑基础、执行策略和支持统一框架需求的关键用例。这项工作旨在向平台构建者、法律实体和监管机构提供有关强制AI驱动系统中身份权利的基本控制的信息。

更新时间: 2025-09-08 06:22:03

领域: cs.CR,cs.AI,cs.ET

下载: http://arxiv.org/abs/2508.01997v2

Insights from Gradient Dynamics: Gradient Autoscaled Normalization

Gradient dynamics play a central role in determining the stability and generalization of deep neural networks. In this work, we provide an empirical analysis of how variance and standard deviation of gradients evolve during training, showing consistent changes across layers and at the global scale in convolutional networks. Motivated by these observations, we propose a hyperparameter-free gradient normalization method that aligns gradient scaling with their natural evolution. This approach prevents unintended amplification, stabilizes optimization, and preserves convergence guarantees. Experiments on the challenging CIFAR-100 benchmark with ResNet-20, ResNet-56, and VGG-16-BN demonstrate that our method maintains or improves test accuracy even under strong generalization. Beyond practical performance, our study highlights the importance of directly tracking gradient dynamics, aiming to bridge the gap between theoretical expectations and empirical behaviors, and to provide insights for future optimization research.
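The hyperparameter-free idea above (scaling gradients so their spread follows its observed evolution rather than growing unchecked) can be sketched in plain Python; the cap-at-1 rule and function names are illustrative assumptions, not the paper's exact update:

```python
import math

def grad_std(grads):
    # Population standard deviation of a flat list of gradient values.
    n = len(grads)
    mean = sum(grads) / n
    return math.sqrt(sum((g - mean) ** 2 for g in grads) / n)

def autoscale(grads, prev_std, eps=1e-8):
    # Rescale the current gradients so their spread never exceeds the
    # spread observed at the previous step (no unintended amplification).
    cur = grad_std(grads)
    scale = min(1.0, prev_std / (cur + eps))
    return [g * scale for g in grads], cur
```

In a training loop, `cur` would be carried forward as the next step's `prev_std`, tracking the natural decay of gradient variance layer by layer.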

Updated: 2025-09-08 06:17:26

标题: 从梯度动态中的见解:梯度自动缩放归一化

摘要: 梯度动力学在确定深度神经网络的稳定性和泛化能力方面起着核心作用。在这项工作中,我们对训练过程中梯度的方差和标准差如何演变进行了实证分析,展示了卷积网络中各层以及全局尺度上的一致变化。受这些观察的启发,我们提出了一种无超参数的梯度标准化方法,使梯度缩放与其自然演化相一致。这种方法可以防止意外放大,稳定优化过程,并保持收敛性保证。在具有ResNet-20、ResNet-56和VGG-16-BN的具有挑战性的CIFAR-100基准测试上进行的实验表明,我们的方法即使在强泛化情况下也能保持或提高测试精度。除了实际性能之外,我们的研究强调了直接跟踪梯度动态的重要性,旨在弥合理论期望与经验行为之间的差距,并为未来优化研究提供启示。

更新时间: 2025-09-08 06:17:26

领域: cs.LG,cs.AI,cs.CV,cs.IT,math.IT

下载: http://arxiv.org/abs/2509.03677v2

CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals

Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Leveraging low-power, cost-effective biosignals, e.g. surface electromyography (sEMG), allows for continuous gesture prediction on wearables. In this paper, we demonstrate that learning representations from weak-modality data that are aligned with those from structured, high-quality data can improve representation quality and enables zero-shot classification. Specifically, we propose a Contrastive Pose-EMG Pre-training (CPEP) framework to align EMG and pose representations, where we learn an EMG encoder that produces high-quality and pose-informative representations. We assess the gesture classification performance of our model through linear probing and zero-shot setups. Our model outperforms emg2pose benchmark models by up to 21% on in-distribution gesture classification and 72% on unseen (out-of-distribution) gesture classification.
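The contrastive alignment objective can be sketched as an InfoNCE-style loss over cosine similarities, where each EMG embedding must score highest against its paired pose embedding within the batch. Function names and the temperature value are assumptions, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cpep_infonce(emg_embs, pose_embs, temperature=0.1):
    # Cross-entropy over similarity logits: the i-th EMG embedding is
    # trained to match the i-th pose embedding against all others.
    loss = 0.0
    for i, e in enumerate(emg_embs):
        logits = [cosine(e, p) / temperature for p in pose_embs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_z
    return loss / len(emg_embs)
```

Perfectly aligned pairs drive the loss toward zero; mismatched pairings leave it high, which is what pushes the EMG encoder toward pose-informative representations.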

Updated: 2025-09-08 06:09:15

标题: CPEP:对比姿势-EMG预训练增强了对EMG信号的手势泛化

摘要: 手势分类是计算机视觉中一个广泛研究的问题,使用高质量的结构化数据,如视频、图像和手骨架。利用低功耗、成本效益的生物信号,例如表面肌电图(sEMG),可以实现在可穿戴设备上连续手势预测。本文展示了从弱模态数据中学习表示,并与高质量结构化数据对齐可以提高表示质量并实现零样本分类。具体来说,我们提出了一个对比姿势-EMG预训练(CPEP)框架,用于对齐EMG和姿势表示,我们学习一个能够产生高质量和姿势信息的EMG编码器。通过线性探测和零样本设置,我们评估了我们模型的手势分类性能。我们的模型在分布内手势分类上比emg2pose基准模型表现提高了高达21%,在未见过的(分布外)手势分类上提高了72%。

更新时间: 2025-09-08 06:09:15

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2509.04699v2

PL-CA: A Parametric Legal Case Augmentation Framework

Conventional RAG is considered one of the most effective methods for addressing model knowledge insufficiency and hallucination, particularly in the judicial domain that requires high levels of knowledge rigor, logical consistency, and content integrity. However, the conventional RAG method only injects retrieved documents directly into the model's context, which severely constrains models due to their limited context windows and introduces additional computational overhead through excessively long contexts, thereby disrupting models' attention and degrading performance on downstream tasks. Moreover, many existing benchmarks lack expert annotation and focus solely on individual downstream tasks while real-world legal scenarios consist of multiple mixed legal tasks, indicating conventional benchmarks' inadequacy for reflecting models' true capabilities. To address these limitations, we propose PL-CA, which introduces a parametric RAG (P-RAG) framework to perform data augmentation on corpus knowledge and encode this legal knowledge into parametric vectors, and then integrates this parametric knowledge into the LLM's feed-forward networks (FFN) via LoRA, thereby alleviating models' context pressure. Additionally, we also construct a multi-task legal dataset comprising more than 2000 training and test instances, which are all expert-annotated and manually verified. We conduct our experiments on our dataset, and the experimental results demonstrate that our method reduces the overhead associated with excessively long contexts while maintaining competitive performance on downstream tasks compared to conventional RAG. Our code and dataset are provided in the appendix.
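The LoRA-style injection of parametric knowledge into a feed-forward layer can be illustrated with a toy forward pass (pure Python, hypothetical shapes; the paper's actual FFN integration will differ):

```python
def matvec(M, v):
    # Matrix-vector product over nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    # y = W x + scale * B(A x): W is the frozen base FFN weight, while the
    # low-rank pair (A, B) carries the injected (e.g. legal) knowledge.
    return [base + scale * low
            for base, low in zip(matvec(W, x), matvec(B, matvec(A, x)))]
```

Because only `A` and `B` are trained, the encoded corpus knowledge adds no tokens to the context window, which is the mechanism PL-CA relies on to relieve context pressure.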

Updated: 2025-09-08 06:08:06

标题: PL-CA: 一个参数化的法律案例增强框架

摘要: 传统的RAG被认为是解决模型知识不足和虚构的最有效方法之一,特别是在需要高水平知识严谨性、逻辑一致性和内容完整性的司法领域。然而,传统的RAG方法只是直接将检索到的文档注入到模型的上下文中,这严重限制了模型的能力,因为它们的上下文窗口有限,并通过过长的上下文引入了额外的计算开销,从而干扰了模型的注意力,降低了在下游任务上的性能。此外,许多现有的基准数据缺乏专家标注,仅专注于个别下游任务,而真实世界的法律场景包括多种混合法律任务,表明传统基准的不足以反映模型的真实能力。为解决这些限制,我们提出了PL-CA,引入了一个参数化RAG(P-RAG)框架,对语料知识进行数据增强,并将这种法律知识编码成参数向量,然后通过LoRA将这些参数化知识集成到LLM的前馈网络(FFN)中,从而减轻模型的上下文压力。此外,我们还构建了一个多任务法律数据集,包括2000多个训练和测试实例,这些实例都经过专家标注和手动验证。我们在我们的数据集上进行实验,实验结果表明,我们的方法减少了与过长上下文相关的开销,同时与传统RAG相比,在下游任务上保持了竞争性能。我们的代码和数据集已提供在附录中。

更新时间: 2025-09-08 06:08:06

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.06356v1

A data-driven discretized CS:GO simulation environment to facilitate strategic multi-agent planning research

Modern simulation environments for complex multi-agent interactions must balance high-fidelity detail with computational efficiency. We present DECOY, a novel multi-agent simulator that abstracts strategic, long-horizon planning in 3D terrains into high-level discretized simulation while preserving low-level environmental fidelity. Using Counter-Strike: Global Offensive (CS:GO) as a testbed, our framework accurately simulates gameplay using only movement decisions as tactical positioning -- without explicitly modeling low-level mechanics such as aiming and shooting. Central to our approach is a waypoint system that simplifies and discretizes continuous states and actions, paired with neural predictive and generative models trained on real CS:GO tournament data to reconstruct event outcomes. Extensive evaluations show that replays generated from human data in DECOY closely match those observed in the original game. Our publicly available simulation environment provides a valuable tool for advancing research in strategic multi-agent planning and behavior generation.
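The waypoint discretization at the heart of the simulator can be sketched as a nearest-neighbor lookup over a fixed waypoint set (illustrative only; DECOY's real waypoint graph and learned models are richer):

```python
def nearest_waypoint(pos, waypoints):
    # Map a continuous 3D position to the index of the closest waypoint,
    # turning low-level coordinates into a discrete state for planning.
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(waypoints)), key=lambda i: d2(pos, waypoints[i]))
```

High-level movement decisions then operate on waypoint indices, leaving aiming and shooting outcomes to the neural predictive models described above.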

Updated: 2025-09-08 06:02:59

标题: 一个基于数据驱动的离散化CS:GO模拟环境,用于促进战略多智能体规划研究

摘要: 现代模拟复杂多智能体相互作用的环境必须在高保真度和计算效率之间取得平衡。我们提出了DECOY,一个新颖的多智能体模拟器,将3D地形中的战略性、长期规划抽象为高水平离散化模拟,同时保留低级环境保真度。通过将《反恐精英:全球攻势》(CS:GO)作为测试平台,我们的框架仅使用移动决策作为战术定位准确模拟游戏玩法,而无需明确建模瞄准和射击等低级机制。我们方法的核心是一个航路点系统,简化和离散化连续状态和行动,结合在真实CS:GO比赛数据上训练的神经预测和生成模型来重建事件结果。广泛的评估表明,在DECOY中生成的来自人类数据的重播与原始游戏中观察到的基本一致。我们公开可用的模拟环境为推动战略多智能体规划和行为生成研究提供了有价值的工具。

更新时间: 2025-09-08 06:02:59

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.06355v1

Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence

Controllable Singing Voice Synthesis (SVS) aims to generate expressive singing voices reflecting user intent. While recent SVS systems achieve high audio quality, most rely on probabilistic modeling, limiting precise control over attributes such as dynamics. We address this by focusing on dynamic control--temporal loudness variation essential for musical expressiveness--and explicitly condition the SVS model on energy sequences extracted from ground-truth spectrograms, reducing annotation costs and improving controllability. We also propose a phoneme-level energy sequence for user-friendly control. To the best of our knowledge, this is the first attempt enabling user-driven dynamics control in SVS. Experiments show our method achieves over 50% reduction in mean absolute error of energy sequences for phoneme-level inputs compared to baseline and energy-predictor models, without compromising synthesis quality.
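Extracting a phoneme-level energy sequence from a ground-truth spectrogram might look like the following sketch (frame energy as the spectral magnitude norm, mean-pooled over phoneme segments; the exact energy definition is an assumption):

```python
def phoneme_energy(spectrogram, segments):
    # spectrogram: list of frames, each a list of magnitudes.
    # segments: (start_frame, end_frame) per phoneme, end exclusive.
    # Returns one mean energy value per phoneme for user-friendly control.
    frame_energy = [sum(m * m for m in frame) ** 0.5 for frame in spectrogram]
    return [sum(frame_energy[s:e]) / (e - s) for s, e in segments]
```

At inference time a user could edit these per-phoneme values directly, which is the dynamics-control interface the paper targets.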

Updated: 2025-09-08 06:02:57

标题: 可控的歌唱声音合成:使用音素级能量序列

摘要: Controllable Singing Voice Synthesis (SVS)旨在生成反映用户意图的富有表现力的歌唱声音。虽然最近的SVS系统实现了高音质,但大多数依赖于概率建模,限制了对动态等属性的精确控制。我们通过专注于动态控制--对于音乐表现力至关重要的时间强度变化,并明确地将SVS模型条件化为从真实频谱图中提取的能量序列,从而降低了注释成本并提高了可控性。我们还提出了一个适用于用户友好控制的音素级能量序列。据我们所知,这是第一次尝试在SVS中实现用户驱动的动态控制。实验表明,与基线和能量预测模型相比,我们的方法在音素级输入的能量序列的平均绝对误差上实现了超过50%的降低,而不会影响合成质量。

更新时间: 2025-09-08 06:02:57

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2509.07038v1

A Multi-Modal Deep Learning Framework for Colorectal Pathology Diagnosis: Integrating Histological and Colonoscopy Data in a Pilot Study

Colorectal diseases, including inflammatory conditions and neoplasms, require quick, accurate care to be effectively treated. Traditional diagnostic pipelines require extensive preparation and rely on separate, individual evaluations of histological images and colonoscopy footage, introducing possible variability and inefficiencies. This pilot study proposes a unified deep learning network that uses convolutional neural networks (CNNs) to classify both histopathological slides and colonoscopy video frames in one pipeline. The pipeline integrates class-balancing learning, robust augmentation, and calibration methods to ensure accurate results. Static colon histology images were taken from the PathMNIST dataset, and the lower gastrointestinal (colonoscopy) videos were drawn from the HyperKvasir dataset. The CNN architecture used was ResNet-50. This study demonstrates an interpretable and reproducible diagnostic pipeline that unifies multiple diagnostic modalities to advance and ease the detection of colorectal diseases.
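Class-balancing, one of the pipeline's ingredients, is commonly implemented with inverse-frequency loss weights; a generic sketch (not the study's exact scheme, and the class labels are hypothetical):

```python
from collections import Counter

def class_weights(labels):
    # Inverse-frequency weights so minority classes contribute as much
    # to the loss as majority ones (standard class-balancing heuristic).
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}
```

These weights would typically be passed to the cross-entropy loss so that rare pathology classes are not drowned out by common ones.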

Updated: 2025-09-08 05:54:03

标题: 一种用于结直肠病理诊断的多模态深度学习框架:在一项初步研究中整合组织学和结肠镜数据

摘要: 结直肠疾病,包括炎症性疾病和肿瘤,需要快速、准确的护理才能有效治疗。传统的诊断流程需要广泛的准备工作,并依赖于对组织学图像和结肠镜检查录像的单独评估,这可能引入变异性和低效率。这项试点研究提出了一个统一的深度学习网络,使用卷积神经网络(CNNs)在一个流程中对组织病理学切片和结肠镜视频帧进行分类。该流程集成了类平衡学习、强大的增强和校准方法,以确保准确的结果。静态结肠组织学图像来自PathMNIST数据集,下消化道(结肠镜检查)视频来自HyperKvasir数据集。所使用的CNN架构是ResNet-50。本研究展示了一个可解释和可重现的诊断流程,将多种诊断模式统一起来,以推动和简化结直肠疾病的检测。

更新时间: 2025-09-08 05:54:03

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.06351v1

NeuroBOLT: Resting-state EEG-to-fMRI Synthesis with Multi-dimensional Feature Mapping

Functional magnetic resonance imaging (fMRI) is an indispensable tool in modern neuroscience, providing a non-invasive window into whole-brain dynamics at millimeter-scale spatial resolution. However, fMRI is constrained by issues such as high operation costs and immobility. With the rapid advancements in cross-modality synthesis and brain decoding, the use of deep neural networks has emerged as a promising solution for inferring whole-brain, high-resolution fMRI features directly from electroencephalography (EEG), a more widely accessible and portable neuroimaging modality. Nonetheless, the complex projection from neural activity to fMRI hemodynamic responses and the spatial ambiguity of EEG pose substantial challenges both in modeling and interpretability. Relatively few studies to date have developed approaches for EEG-fMRI translation, and although they have made significant strides, the inference of fMRI signals in a given study has been limited to a small set of brain areas and to a single condition (i.e., either resting-state or a specific task). The capability to predict fMRI signals in other brain areas, as well as to generalize across conditions, remain critical gaps in the field. To tackle these challenges, we introduce a novel and generalizable framework: NeuroBOLT, i.e., Neuro-to-BOLD Transformer, which leverages multi-dimensional representation learning from temporal, spatial, and spectral domains to translate raw EEG data to the corresponding fMRI activity signals across the brain. Our experiments demonstrate that NeuroBOLT effectively reconstructs unseen resting-state fMRI signals from primary sensory, high-level cognitive areas, and deep subcortical brain regions, achieving state-of-the-art accuracy with the potential to generalize across varying conditions and sites, which significantly advances the integration of these two modalities.

Updated: 2025-09-08 05:50:55

标题: NeuroBOLT:多维特征映射的静息态EEG到fMRI合成

摘要: 功能性磁共振成像(fMRI)是现代神经科学中不可或缺的工具,提供了一个非侵入性窗口,可以以毫米级空间分辨率查看整个大脑的动态。然而,fMRI受到高运营成本和不便移动等问题的限制。随着跨模态合成和脑解码的快速进步,利用深度神经网络直接从脑电图(EEG)推断出整个大脑高分辨率fMRI特征已经成为一种有前途的解决方案,EEG是一种更广泛可获得和便携的神经影像模态。然而,从神经活动到fMRI血氧动力学反应的复杂映射以及EEG的空间模糊性在建模和解释方面提出了重大挑战。迄今为止,相对较少的研究已经开发了EEG-fMRI翻译方法,尽管它们取得了重大进展,但在给定研究中的fMRI信号推断仅限于一小部分脑区和单一条件(即静息状态或特定任务)。在其他脑区预测fMRI信号的能力,以及在条件间泛化的能力,仍然是该领域的关键缺口。为了解决这些挑战,我们引入了一种新颖且可推广的框架:NeuroBOLT,即神经到BOLD转换器,它利用来自时间、空间和频谱领域的多维表示学习,将原始EEG数据转换为整个大脑的相应fMRI活动信号。我们的实验表明,NeuroBOLT有效地重建了未见的静息状态fMRI信号,覆盖了主要感觉区、高级认知区和深层皮质下脑区域,实现了最先进的准确性,并具有泛化到不同条件和场所的潜力,这显著推动了这两种模态的整合。

更新时间: 2025-09-08 05:50:55

领域: eess.IV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2410.05341v3

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
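The mask-driven pruning idea can be sketched as keeping only the highest-scoring suffix tokens, with the scores standing in for the learned mask values (token strings and the keep ratio are illustrative):

```python
def prune_suffix(tokens, mask_scores, keep_ratio=0.8):
    # Keep the highest-scoring (most impactful) suffix tokens in their
    # original order; `mask_scores` stands in for learned mask values.
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: mask_scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]
```

A shorter suffix means a smaller gradient space per GCG step, which is where the reported speedup comes from.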

Updated: 2025-09-08 05:45:37

标题: Mask-GCG: Adversarial Suffixes中的所有令牌是否对越狱攻击都是必要的?

摘要: 大型语言模型(LLMs)的越狱攻击已经展示了各种成功的方法,攻击者通过这些方法操纵模型生成有害响应,而这些响应本来是设计避免的。在这些方法中,贪婪坐标梯度(GCG)已经成为一个通用且有效的方法,它优化后缀中的标记以生成可越狱的提示。虽然已经提出了几种改进的GCG变体,但它们都依赖于固定长度的后缀。然而,在这些后缀中潜在的冗余仍未被探索。在本研究中,我们提出了Mask-GCG,这是一种即插即用的方法,它利用可学习的标记屏蔽来识别后缀中具有影响力的标记。我们的方法增加了高影响位置上标记的更新概率,同时剪枝低影响位置上的标记。这种剪枝不仅减少了冗余,还减小了梯度空间的大小,从而降低了计算开销,缩短了达到成功攻击所需的时间,相比于GCG。我们通过将Mask-GCG应用于原始GCG和几种改进的变体来评估它。实验结果显示后缀中的大多数标记对攻击成功有显著贡献,剪枝少数低影响的标记不会影响损失值或妥协攻击成功率(ASR),从而揭示了LLM提示中的标记冗余。我们的发现为从越狱攻击的角度开发高效且可解释的LLMs提供了见解。

更新时间: 2025-09-08 05:45:37

领域: cs.CL,cs.AI,cs.CR

下载: http://arxiv.org/abs/2509.06350v1

Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs

Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts, a small group with outsized impact on performance, leading to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.

Updated: 2025-09-08 05:38:10

标题: Ban&Pick:通过MoE-LLMs中更智能的路由实现免费性能提升和推理加速

摘要: 稀疏混合专家(MoE)已成为有效扩展大型语言模型(LLMs)的关键架构。最近的细粒度MoE设计引入每层数百个专家,每个令牌激活多个专家,从而实现更强的专业化。然而,在预训练期间,路由器主要针对稳定性和鲁棒性进行优化:它们过早收敛并强制平衡使用,限制了模型性能和效率的全部潜力。在这项工作中,我们揭示了两个被忽视的问题:(i)由于过早和平衡的路由决策,一些具有高影响力的专家被低效利用;和(ii)强制每个令牌激活固定数量的专家引入了相当多的冗余。我们没有重新训练模型或重新设计MoE架构,而是引入了Ban&Pick,一种后训练、即插即用的智能MoE路由策略。Pick发现并加强关键专家(一小组对性能有巨大影响的专家),导致跨领域的显著精度提升。Ban通过根据层和令牌的敏感性动态修剪冗余专家,以最小的精度损失实现更快的推理。在数学、代码和一般推理基准测试中对细粒度MoE-LLMs(DeepSeek、Qwen3)进行的实验表明,Ban&Pick在不重新训练或更改架构的情况下提供了免费的性能增益和推理加速。例如,在Qwen3-30B-A3B上,它将AIME2024的准确性从80.67提高到84.66,将GPQA-Diamond的准确性从65.66提高到68.18,同时在vLLM下将推理加速1.25倍。

更新时间: 2025-09-08 05:38:10

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.06346v1

Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection

In this paper, we report our experiments with various strategies to improve code-mixed humour and sarcasm detection. In particular, we tried three approaches: (i) native sample mixing, (ii) multi-task learning (MTL), and (iii) prompting and instruction finetuning very large multilingual language models (VMLMs). In native sample mixing, we added monolingual task samples to code-mixed training sets. In MTL learning, we relied on native and code-mixed samples of a semantically related task (hate detection in our case). Finally, in our third approach, we evaluated the efficacy of VMLMs via few-shot context prompting and instruction finetuning. Our main findings are: (i) adding native samples improved humour (raising the F1-score by up to 6.76%) and sarcasm (raising the F1-score by up to 8.64%) detection, (ii) training MLMs in an MTL framework boosted performance for both humour (raising the F1-score by up to 10.67%) and sarcasm (raising the F1-score by up to 12.35%) detection, and (iii) prompting and instruction finetuning VMLMs could not outperform the other approaches. Finally, our ablation studies and error analysis identified the cases where our model has yet to improve. We provide our code for reproducibility.

Updated: 2025-09-08 05:32:04

标题: 揭示合成本地样本和多任务策略在印地语-英语混合代码中幽默和讽刺检测中的影响

摘要: 在这篇论文中,我们报告了我们尝试改进混合代码幽默和讽刺检测的各种策略的实验。特别地,我们尝试了三种方法:(i)本地样本混合,(ii)多任务学习(MTL),以及(iii)对非常大的多语种语言模型(VMLMs)进行提示和指令微调。在本地样本混合中,我们将单语任务样本添加到混合代码训练集中。在MTL学习中,我们依赖于一个语义相关任务(在我们的情况下是仇恨检测)的本地和混合代码样本。最后,在我们的第三种方法中,我们通过少量上下文提示和指令微调评估了VMLMs的有效性。我们的主要发现是:(i)添加本地样本提高了幽默(F1分数最多提高6.76%)和讽刺(F1分数最多提高8.64%)的检测,(ii)在MTL框架中训练MLMs提升了幽默(F1分数最多提高10.67%)和讽刺(F1分数最多提高12.35%)检测的性能,(iii)通过提示和指令微调VMLMs无法胜过其他方法。最后,我们的消融研究和错误分析发现了我们的模型尚需改进的情况。我们提供了我们的代码以便复现。

更新时间: 2025-09-08 05:32:04

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2412.12761v2

Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis

Medical image analysis often faces significant challenges due to limited expert-annotated data, hindering both model generalization and clinical adoption. We propose an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions-of-interests (ROIs) into model training to simultaneously enhance classification performance and interpretability. Leveraging Grad-CAM for spatial attention supervision, we introduce an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions during training. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even under limited data conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving both predictive reliability and clinical trustworthiness. Our findings demonstrate the effectiveness of incorporating expert-guided attention supervision to bridge the gap between performance and interpretability in few-shot medical image diagnosis.
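The explanation loss reduces to a Dice term between the (flattened, soft) Grad-CAM attention map and the expert ROI mask; a minimal sketch of that term, with the smoothing constant as an assumption:

```python
def dice_loss(attention, roi_mask, eps=1e-8):
    # 1 - Dice overlap between a flattened soft attention map and a binary
    # ROI mask; minimizing it pulls model attention toward expert regions.
    inter = sum(a * m for a, m in zip(attention, roi_mask))
    total = sum(attention) + sum(roi_mask)
    return 1.0 - (2.0 * inter + eps) / (total + eps)
```

In training this term would be added, with some weight, to the prototypical network objective described above.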

Updated: 2025-09-08 05:31:37

标题: 专家指导的可解释的少样本学习用于医学图像诊断

摘要: 医学图像分析经常面临重大挑战,原因是专家标注的数据有限,这阻碍了模型的泛化性能和临床采用。我们提出了一种专家引导的可解释的少样本学习框架,将放射科医师提供的感兴趣区域(ROIs)整合到模型训练中,以同时提高分类性能和可解释性。利用Grad-CAM进行空间注意力监督,我们引入了基于Dice相似度的解释损失,以在训练过程中使模型的注意力与诊断相关区域对齐。这种解释损失与标准的原型网络目标共同优化,鼓励模型在有限数据条件下专注于临床有意义的特征。我们在两个不同的数据集上评估了我们的框架:BraTS(MRI)和VinDr-CXR(胸部X射线),与非引导模型相比,在BraTS上将准确率从77.09%提高到83.61%,在VinDr-CXR上将准确率从54.33%提高到73.29%。Grad-CAM的可视化进一步证实,专家引导的训练始终将注意力与诊断区域对齐,提高了预测可靠性和临床可信度。我们的研究结果表明,在少样本医学图像诊断中,将专家引导的注意力监督纳入框架,可以弥合性能和可解释性之间的差距。

更新时间: 2025-09-08 05:31:37

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2509.08007v1

Astrocyte-mediated hierarchical modulation enables learning-to-learn in recurrent spiking networks

A central feature of biological intelligence is the ability to learn to learn, enabling rapid adaptation to novel tasks and environments. Yet its neural basis remains elusive, particularly regarding intrinsic properties, as conventional models rely on simplified point-neuron approximations that neglect their dynamics. Inspired by astrocyte-mediated neuromodulation, we propose a hierarchically modulated recurrent spiking neural network (HM-RSNN) that models learning-to-learn with regulation of intrinsic neuronal properties at two spatiotemporal scales. Global modulation captures task-dependent gating of plasticity driven by wide-field calcium waves, whereas local adaptation simulates microdomain calcium-mediated fine-tuning of intrinsic properties within task-relevant subspaces. We evaluate HM-RSNN on four cognitive tasks, demonstrating its computational advantages over standard RSNNs and artificial neural networks, and revealing task-dependent adaptations across multiple scales, including intrinsic properties, neuronal specialization, membrane potential dynamics, and network modularity. Converging evidence and biological consistency position HM-RSNN as a biologically grounded framework, providing testable insights into how astrocyte-mediated hierarchical modulation of intrinsic properties shapes multi-scale neural dynamics that support learning-to-learn.

Updated: 2025-09-08 05:16:20

标题: 星形胶质细胞介导的分层调控促进了循环尖峰网络中的学习到学习

摘要: 生物智能的一个核心特征是学会学习的能力,使其能够快速适应新任务和环境。然而,其神经基础仍然难以捉摸,特别是关于内在属性,因为传统模型依赖于简化的点神经元近似,忽视了它们的动态性。受星形胶质介导的神经调控启发,我们提出了一个层次调控的递归尖峰神经网络(HM-RSNN),该网络模拟了学会学习的过程,并在两个时空尺度上调节内在神经元属性。全局调控捕捉了受任务驱动的广域钙波调控的可塑性,而局部适应模拟了微区钙介导的内在属性微调,以在任务相关的子空间中进行。我们在四项认知任务上评估了HM-RSNN,证明了它在标准RSNN和人工神经网络上的计算优势,并揭示了跨多个尺度的任务相关适应性,包括内在属性、神经元专门化、膜电位动力学和网络模块化。汇聚的证据和生物一致性将HM-RSNN定位为一个有生物基础的框架,提供了关于星形胶质介导的内在属性的层次调控如何塑造支持学会学习的多尺度神经动态的可验证见解。

更新时间: 2025-09-08 05:16:20

领域: cs.NE,cs.LG,I.2.6

下载: http://arxiv.org/abs/2501.14539v4

Evaluating Multi-Turn Bargain Skills in LLM-Based Seller Agent

In online second-hand marketplaces, multi-turn bargaining is a crucial part of seller-buyer interactions. Large Language Models (LLMs) can act as seller agents, negotiating with buyers on behalf of sellers under given business constraints. A critical ability for such agents is to track and accurately interpret cumulative buyer intents across long negotiations, which directly impacts bargaining effectiveness. We introduce a multi-turn evaluation framework for measuring the bargaining ability of seller agents in e-commerce dialogues. The framework tests whether an agent can extract and track buyer intents. Our contributions are: (1) a large-scale e-commerce bargaining benchmark spanning 622 categories, 9,892 products, and 3,014 tasks; (2) a turn-level evaluation framework grounded in Theory of Mind (ToM) with annotated buyer intents, moving beyond outcome-only metrics; and (3) an automated pipeline that extracts reliable intent from massive dialogue data.

Updated: 2025-09-08 05:12:03

标题: 评估基于LLM的卖方多轮讨价还价技能

摘要: 在线二手市场中,多轮讨价还价是卖方与买方互动的关键部分。大型语言模型(LLMs)可以作为卖方代理,代表卖方在给定的商业约束条件下与买家进行协商。这种代理的一个关键能力是跟踪和准确解释长时间谈判中的累积买家意图,这直接影响了讨价还价的效果。我们引入了一个用于衡量电子商务对话中卖方代理讨价能力的多轮评估框架。该框架测试一个代理是否能够提取和跟踪买家意图。我们的贡献包括:(1)涵盖622个类别、9,892个产品和3,014个任务的大规模电子商务讨价还价基准;(2)一个基于心灵理论(ToM)的轮次级评估框架,带有注释的买家意图,超越仅有结果的度量;以及(3)一个可以从大量对话数据中提取可靠意图的自动化流程。

更新时间: 2025-09-08 05:12:03

领域: cs.AI

下载: http://arxiv.org/abs/2509.06341v1

Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift

The widespread distribution of Large Language Models (LLMs) through public platforms like Hugging Face introduces significant security challenges. While these platforms perform basic security scans, they often fail to detect subtle manipulations within the embedding layer. This work identifies a novel class of deployment-phase attacks that exploit this vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text. These perturbations, though statistically benign, systematically bypass safety alignment mechanisms and induce harmful behaviors during inference. We propose Search-based Embedding Poisoning (SEP), a practical, model-agnostic framework that introduces carefully optimized perturbations into embeddings associated with high-risk tokens. SEP leverages a predictable linear transition in model responses, from refusal to harmful output to semantic deviation, to identify a narrow perturbation window that evades alignment safeguards. Evaluated across six aligned LLMs, SEP achieves an average attack success rate of 96.43% while preserving benign task performance and evading conventional detection mechanisms. Our findings reveal a critical oversight in deployment security and emphasize the urgent need for embedding-level integrity checks in future LLM defense strategies.
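The narrow-window search can be caricatured as a scan over perturbation scales with a stubbed response oracle (purely illustrative; the real SEP optimizes the perturbation itself, and the `respond` callback here is a hypothetical stand-in for querying the model):

```python
def search_perturbation(embedding, direction, respond, lo=0.0, hi=1.0, steps=20):
    # Scan perturbation magnitudes along a fixed direction and return the
    # smallest scale at which the (stubbed) model response turns harmful.
    # `respond` maps a perturbed embedding to 'refuse'/'harmful'/'gibberish'.
    for s in range(steps + 1):
        scale = lo + (hi - lo) * s / steps
        perturbed = [e + scale * d for e, d in zip(embedding, direction)]
        if respond(perturbed) == 'harmful':
            return scale, perturbed
    return None, None
```

Too small a scale leaves the model refusing; too large a scale degrades output into semantic deviation, which is why the exploitable window is narrow.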

Updated: 2025-09-08 05:00:58

标题: Embedding Poisoning: 通过嵌入语义偏移绕过安全对齐

摘要: 大型语言模型(LLMs)通过像Hugging Face这样的公共平台的广泛分布引入了重要的安全挑战。虽然这些平台执行基本的安全扫描,但它们经常无法检测到嵌入层内的微妙操作。本文确定了一种新型的部署阶段攻击类型,利用这种弱点通过直接向嵌入层输出注入不可察觉的扰动而不修改模型权重或输入文本。这些扰动虽然在统计上是良性的,但它们系统地绕过了安全对齐机制并在推断过程中引发有害行为。我们提出了基于搜索的嵌入中毒(SEP),这是一个实用的、与模型无关的框架,它将经过精心优化的扰动引入与高风险令牌相关联的嵌入中。SEP利用了模型响应中的可预测的线性过渡,从拒绝到有害输出再到语义偏差,以识别一个绕过对齐保障的狭窄扰动窗口。在六个对齐的LLMs上评估,SEP实现了96.43%的平均攻击成功率,同时保留了良性任务绩效,并避开了传统的检测机制。我们的研究结果揭示了在部署安全性中的一个关键疏漏,并强调了未来LLM防御策略中迫切需要嵌入层完整性检查。

更新时间: 2025-09-08 05:00:58

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2509.06338v1

Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation

Questionnaire-based surveys are foundational to social science research and public policymaking, yet traditional survey methods remain costly, time-consuming, and often limited in scale. This paper explores a new paradigm: simulating virtual survey respondents using Large Language Models (LLMs). We introduce two novel simulation settings, namely Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS), to systematically evaluate the ability of LLMs to generate accurate and demographically coherent responses. In PAS, the model predicts missing attributes based on partial respondent profiles, whereas FAS involves generating complete synthetic datasets under both zero-context and context-enhanced conditions. We curate a comprehensive benchmark suite, LLM-S^3 (Large Language Model-based Sociodemographic Survey Simulation), that spans 11 real-world public datasets across four sociological domains. Our evaluation of multiple mainstream LLMs (GPT-3.5/4 Turbo, LLaMA 3.0/3.1-8B) reveals consistent trends in prediction performance, highlights failure modes, and demonstrates how context and prompt design impact simulation fidelity. This work establishes a rigorous foundation for LLM-driven survey simulations, offering scalable and cost-effective tools for sociological research and policy evaluation. Our code and dataset are available at: https://github.com/dart-lab-research/LLM-S-Cube-Benchmark

Updated: 2025-09-08 04:59:00

标题: 大型语言模型作为虚拟调查受访者:评估社会人口学响应生成

摘要: 基于问卷调查的调查是社会科学研究和公共政策制定的基础,然而传统的调查方法仍然昂贵、耗时且规模有限。本文探讨了一种新的范式:使用大型语言模型(LLMs)模拟虚拟受访者。我们引入了两种新颖的模拟设置,即部分属性模拟(PAS)和完全属性模拟(FAS),以系统地评估LLMs生成准确和人口统计学一致性响应的能力。在PAS中,模型根据部分受访者概况预测缺失属性,而在FAS中则涉及在零上下文和上下文增强条件下生成完整的合成数据集。我们策划了一个全面的基准套件,LLM-S^3(基于大型语言模型的社会人口统计调查模拟),涵盖了四个社会学领域的11个真实世界公共数据集。我们对多个主流LLMs(GPT-3.5/4 Turbo、LLaMA 3.0/3.1-8B)的评估显示了预测性能的一致趋势,突出了失败模式,并展示了上下文和提示设计如何影响模拟的逼真度。这项工作为LLM驱动的调查模拟奠定了严格的基础,为社会学研究和政策评估提供了可扩展和具有成本效益的工具。我们的代码和数据集可在以下链接找到:https://github.com/dart-lab-research/LLM-S-Cube-Benchmark。

更新时间: 2025-09-08 04:59:00

领域: cs.AI

下载: http://arxiv.org/abs/2509.06337v1

Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing

Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.

Updated: 2025-09-08 04:53:46

标题: 使用释义文本的多视图槽注意力技术进行人脸反欺骗

摘要: 最近的人脸反欺诈(FAS)方法通过采用像CLIP这样的视觉-语言模型展现出了卓越的跨领域性能。然而,现有基于CLIP的FAS模型并未充分利用CLIP的补丁嵌入标记,未能检测到关键的欺诈线索。此外,这些模型依赖于每个类别一个单一的文本提示(例如,‘真实’或‘伪造’),这限制了泛化能力。为了解决这些问题,我们提出了MVP-FAS,这是一个新颖的框架,包括两个关键模块:多视图槽关注(MVS)和多文本补丁对齐(MTPA)。这两个模块利用多个释义文本生成广义特征,并减少对特定领域文本的依赖。MVS通过利用多个角度的多样文本提取局部详细的空间特征和全局上下文来从补丁嵌入中提取特征。MTPA将补丁与多个文本表示进行对齐,以提高语义鲁棒性。大量实验证明,MVP-FAS实现了卓越的泛化性能,在跨领域数据集上胜过了先前的最先进方法。源代码:https://github.com/Elune001/MVP-FAS。

更新时间: 2025-09-08 04:53:46

领域: cs.CV,cs.AI,cs.CR

下载: http://arxiv.org/abs/2509.06336v1

Methodological Insights into Structural Causal Modelling and Uncertainty-Aware Forecasting for Economic Indicators

This paper presents a methodological approach to financial time series analysis by combining causal discovery and uncertainty-aware forecasting. As a case study, we focus on four key U.S. macroeconomic indicators -- GDP, economic growth, inflation, and unemployment -- and we apply the LPCMCI framework with Gaussian Process Distance Correlation (GPDC) to uncover dynamic causal relationships in quarterly data from 1970 to 2021. Our results reveal a robust unidirectional causal link from economic growth to GDP and highlight the limited connectivity of inflation, suggesting the influence of latent factors. Unemployment exhibits strong autoregressive dependence, motivating its use as a case study for probabilistic forecasting. Leveraging the Chronos framework, a large language model trained for time series, we perform zero-shot predictions on unemployment. This approach delivers accurate forecasts one and two quarters ahead, without requiring task-specific training. Crucially, the model's uncertainty-aware predictions yield 90% confidence intervals, enabling effective anomaly detection through statistically principled deviation analysis. This study demonstrates the value of combining causal structure learning with probabilistic language models to inform economic policy and enhance forecasting robustness.
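The deviation-based anomaly rule follows directly from the 90% prediction intervals: an observation outside its interval is flagged. A minimal sketch (the interval bounds would come from the forecaster's quantiles; values here are made up):

```python
def flag_anomalies(actuals, lower90, upper90):
    # Mark observations that fall outside the model's 90% prediction
    # interval as anomalies (a simple, statistically principled rule).
    return [not (lo <= a <= hi)
            for a, lo, hi in zip(actuals, lower90, upper90)]
```

Under a well-calibrated forecaster, roughly 10% of normal observations would be flagged by chance, so flagged points warrant inspection rather than automatic alarm.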

Updated: 2025-09-08 04:52:12

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.07036v1

Ask1: Development and Reinforcement Learning-Based Control of a Custom Quadruped Robot

In this work, we present the design, development, and experimental validation of a custom-built quadruped robot, Ask1. The Ask1 robot shares similar morphology with the Unitree Go1, but features custom hardware components and a different control architecture. We transfer and extend previous reinforcement learning (RL)-based control methods to the Ask1 robot, demonstrating the applicability of our approach in real-world scenarios. By eliminating the need for Adversarial Motion Priors (AMP) and reference trajectories, we introduce a novel reward function to guide the robot's motion style. We demonstrate the generalization capability of the proposed RL algorithm by training it on both the Go1 and Ask1 robots. Simulation and real-world experiments validate the effectiveness of this method, showing that Ask1, like the Go1, is capable of navigating various rugged terrains.

Updated: 2025-09-08 04:47:44

Categories: cs.RO,cs.LG

Download: http://arxiv.org/abs/2412.08019v2

Robust Generative Learning with Lipschitz-Regularized $α$-Divergences Allows Minimal Assumptions on Target Distributions

This paper demonstrates the robustness of Lipschitz-regularized $\alpha$-divergences as objective functionals in generative modeling, showing they enable stable learning across a wide range of target distributions with minimal assumptions. We establish that these divergences remain finite under a mild condition, namely that the source distribution has a finite first moment, regardless of the properties of the target distribution, making them adaptable to the structure of target distributions. Furthermore, we prove the existence and finiteness of their variational derivatives, which are essential for stable training of generative models such as GANs and gradient flows. For heavy-tailed targets, we derive necessary and sufficient conditions that connect data dimension, $\alpha$, and tail behavior to divergence finiteness, and that also provide insights into the selection of suitable $\alpha$'s. As a byproduct of this analysis, we obtain the first sample complexity bounds for empirical estimations of these divergences, and of the Wasserstein-1 metric with group symmetry, on unbounded domains. Numerical experiments confirm that generative models leveraging Lipschitz-regularized $\alpha$-divergences can stably learn distributions in various challenging scenarios, including those with heavy tails or complex, low-dimensional, or fractal support, all without any prior knowledge of the structure of target distributions.

Updated: 2025-09-08 04:44:45

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2405.13962v3

OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth, novelty, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.

Updated: 2025-09-08 04:38:27

Categories: cs.CL,cs.AI,cs.HC,cs.IR,cs.LG

Download: http://arxiv.org/abs/2501.09751v3

A Fragile Number Sense: Probing the Elemental Limits of Numerical Reasoning in LLMs

Large Language Models (LLMs) have demonstrated remarkable emergent capabilities, yet the robustness of their numerical reasoning remains an open question. While standard benchmarks evaluate LLM reasoning on complex problem sets using aggregated metrics, they often obscure foundational weaknesses. In this work, we probe LLM mathematical numeracy by evaluating performance on problems of escalating complexity, from constituent operations to combinatorial puzzles. We test several state-of-the-art LLM-based agents on a 100-problem challenge comprising four categories: (1) basic arithmetic, (2) advanced operations, (3) primality checking, and (4) the Game of 24 number puzzle. Our results show that while the agents achieved high accuracy on the first three categories, which require deterministic algorithmic execution, they consistently failed at the number puzzle, indicating that the heuristic search over a large combinatorial space it demands is a significant bottleneck. These findings reveal that the agents' proficiency is largely confined to recalling and executing known algorithms, rather than performing generative problem-solving. This suggests their apparent numerical reasoning is more akin to sophisticated pattern-matching than flexible, analytical thought, limiting their potential for tasks that require novel or creative numerical insights.
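The Game of 24 bottleneck is easy to make concrete: a plain brute-force solver (a hypothetical reference implementation, not from the paper) must enumerate operand orders, operator choices, and groupings, exactly the search the evaluated agents fail to perform internally:

```python
from itertools import permutations, product  # stdlib only

def solve24(nums, target=24, eps=1e-6):
    """Brute-force search for an expression over `nums` reaching `target`.

    Returns one solving expression as a string, or None. Even four numbers
    yield hundreds of operand/operator/bracketing combinations, illustrating
    the combinatorial-search burden the abstract identifies.
    """
    ops = [('+', lambda a, b: a + b),
           ('-', lambda a, b: a - b),
           ('*', lambda a, b: a * b),
           ('/', lambda a, b: a / b if abs(b) > eps else None)]

    def search(items):
        # items: list of (value, expression-string) pairs
        if len(items) == 1:
            val, expr = items[0]
            return expr if abs(val - target) < eps else None
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                for sym, fn in ops:
                    v = fn(a, b)
                    if v is None:
                        continue  # skip division by (near-)zero
                    found = search(rest + [(v, f'({ea}{sym}{eb})')])
                    if found:
                        return found
        return None

    return search([(float(n), str(n)) for n in nums])
```

For example, `solve24([4, 7, 8, 8])` finds an expression such as `(7 - 8/8) * 4`, while `solve24([1, 1, 1, 1])` correctly returns `None`.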

Updated: 2025-09-08 04:31:12

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06332v1

Exploring approaches to computational representation and classification of user-generated meal logs

This study examined the use of machine learning and domain specific enrichment on patient generated health data, in the form of free text meal logs, to classify meals on alignment with different nutritional goals. We used a dataset of over 3000 meal records collected by 114 individuals from a diverse, low income community in a major US city using a mobile app. Registered dietitians provided expert judgement for meal to goal alignment, used as gold standard for evaluation. Using text embeddings, including TFIDF and BERT, and domain specific enrichment information, including ontologies, ingredient parsers, and macronutrient contents as inputs, we evaluated the performance of logistic regression and multilayer perceptron classifiers using accuracy, precision, recall, and F1 score against the gold standard and self assessment. Even without enrichment, ML outperformed self assessments of individuals who logged meals, and the best performing combination of ML classifier with enrichment achieved even higher accuracies. In general, ML classifiers with enrichment of Parsed Ingredients, Food Entities, and Macronutrients information performed well across multiple nutritional goals, but there was variability in the impact of enrichment and classification algorithm on accuracy of classification for different nutritional goals. In conclusion, ML can utilize unstructured free text meal logs and reliably classify whether meals align with specific nutritional goals, exceeding self assessments, especially when incorporating nutrition domain knowledge. Our findings highlight the potential of ML analysis of patient generated health data to support patient centered nutrition guidance in precision healthcare.
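The TF-IDF featurization step mentioned above can be sketched in a few lines of plain Python (a toy illustration; the paper's pipeline additionally uses BERT embeddings plus ontology, ingredient-parser, and macronutrient enrichment, all omitted here):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Toy TF-IDF featurizer for free-text meal logs.

    Returns (vocab, rows) where rows[i][k] is the TF-IDF weight of
    vocab[k] in docs[i]. Whitespace tokenization is an assumption; any
    real pipeline would normalize and filter tokens first.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    idf = {t: math.log(n / df[t]) for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] / len(toks) * idf[t] for t in vocab])
    return vocab, rows

# Hypothetical meal logs of the kind the study collects.
meals = ["grilled chicken salad", "fried chicken and soda", "salad with beans"]
vocab, X = tfidf_matrix(meals)
```

The resulting matrix `X` is what a downstream logistic regression or multilayer perceptron would consume to classify meal-to-goal alignment.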

Updated: 2025-09-08 04:23:48

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06330v1

AttestLLM: Efficient Attestation Framework for Billion-scale On-device LLMs

As on-device LLMs(e.g., Apple on-device Intelligence) are widely adopted to reduce network dependency, improve privacy, and enhance responsiveness, verifying the legitimacy of models running on local devices becomes critical. Existing attestation techniques are not suitable for billion-parameter Large Language Models (LLMs), struggling to remain both time- and memory-efficient while addressing emerging threats in the LLM era. In this paper, we present AttestLLM, the first-of-its-kind attestation framework to protect the hardware-level intellectual property (IP) of device vendors by ensuring that only authorized LLMs can execute on target platforms. AttestLLM leverages an algorithm/software/hardware co-design approach to embed robust watermarking signatures onto the activation distributions of LLM building blocks. It also optimizes the attestation protocol within the Trusted Execution Environment (TEE), providing efficient verification without compromising inference throughput. Extensive proof-of-concept evaluations on LLMs from Llama, Qwen, and Phi families for on-device use cases demonstrate AttestLLM's attestation reliability, fidelity, and efficiency. Furthermore, AttestLLM enforces model legitimacy and exhibits resilience against model replacement and forgery attacks.

Updated: 2025-09-08 04:17:02

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2509.06326v1

$d_X$-Privacy for Text and the Curse of Dimensionality

A widely used method to ensure privacy of unstructured text data is the multidimensional Laplace mechanism for $d_X$-privacy, which is a relaxation of differential privacy for metric spaces. We identify an intriguing peculiarity of this mechanism. When applied on a word-by-word basis, the mechanism either outputs the original word, or completely dissimilar words, and very rarely outputs semantically similar words. We investigate this observation in detail, and tie it to the fact that the distance of the nearest neighbor of a word in any word embedding model (which are high-dimensional) is much larger than the relative difference in distances to any of its two consecutive neighbors. We also show that the dot product of the multidimensional Laplace noise vector with any word embedding plays a crucial role in designating the nearest neighbor. We derive the distribution, moments and tail bounds of this dot product. We further propose a fix as a post-processing step, which satisfactorily removes the above-mentioned issue.
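The mechanism under study can be sketched directly: for $d_X$-privacy on $\mathbb{R}^d$ with the Euclidean metric, the multidimensional Laplace noise has density proportional to $\exp(-\epsilon\lVert z\rVert)$, sampled as a uniform direction times a Gamma$(d, 1/\epsilon)$ magnitude, and the noisy vector is decoded to its nearest vocabulary neighbor. The toy embedding table below is a made-up illustration, not from the paper:

```python
import numpy as np

def dx_privacy_word(word, emb, epsilon, rng):
    """Word-by-word multidimensional Laplace mechanism for d_X-privacy.

    Adds noise with density proportional to exp(-epsilon * ||z||) to the
    word's embedding, then decodes to the nearest neighbor, the step the
    paper shows almost always returns either the original word or a
    dissimilar one, rarely a semantically similar one.
    """
    words = list(emb)
    V = np.array([emb[w] for w in words])          # (|vocab|, d)
    x = np.asarray(emb[word], dtype=float)
    d = x.shape[0]
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)         # uniform on the sphere
    magnitude = rng.gamma(shape=d, scale=1.0 / epsilon)
    z = x + magnitude * direction
    # Nearest-neighbor decoding; per the paper, the dot product of the noise
    # vector with the embeddings largely decides which neighbor wins.
    return words[int(np.argmin(np.linalg.norm(V - z, axis=1)))]

# Toy 2-D embedding table (hypothetical values for illustration).
emb = {"cat": [1.0, 0.0], "feline": [0.9, 0.1], "car": [-1.0, 0.5]}
out = dx_privacy_word("cat", emb, epsilon=20.0, rng=np.random.default_rng(0))
```

With very large `epsilon` (weak privacy) the noise magnitude is tiny and the original word is returned, matching the all-or-nothing behavior the abstract describes.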

Updated: 2025-09-08 04:14:40

Categories: cs.CR

Download: http://arxiv.org/abs/2411.13784v2

Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics

Large language models (LLMs) have demonstrated emergent in-context learning (ICL) capabilities across a range of tasks, including zero-shot time-series forecasting. We show that text-trained foundation models can accurately extrapolate spatiotemporal dynamics from discretized partial differential equation (PDE) solutions without fine-tuning or natural language prompting. Predictive accuracy improves with longer temporal contexts but degrades at finer spatial discretizations. In multi-step rollouts, where the model recursively predicts future spatial states over multiple time steps, errors grow algebraically with the time horizon, reminiscent of global error accumulation in classical finite-difference solvers. We interpret these trends as in-context neural scaling laws, where prediction quality varies predictably with both context length and output length. To better understand how LLMs are able to internally process PDE solutions so as to accurately roll them out, we analyze token-level output distributions and uncover a consistent ICL progression: beginning with syntactic pattern imitation, transitioning through an exploratory high-entropy phase, and culminating in confident, numerically grounded predictions.
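Feeding discretized PDE solutions to a text-trained LLM requires serializing them as plain numeric strings. A minimal sketch of such an encoder/decoder pair is below; the comma/semicolon delimiters and fixed precision are illustrative assumptions, not the paper's exact settings:

```python
def encode_rollout(frames, decimals=2):
    """Serialize discretized PDE states into a plain-text numeric prompt.

    Each spatial state (one time step) becomes comma-separated values;
    time steps are separated by semicolons. The LLM is asked to continue
    the string, and its completion is parsed back into numbers.
    """
    return ";".join(
        ",".join(f"{v:.{decimals}f}" for v in frame) for frame in frames
    )

def decode_rollout(text):
    """Inverse of encode_rollout: parse a (possibly model-generated) string
    back into a list of spatial states."""
    return [[float(v) for v in frame.split(",")]
            for frame in text.split(";") if frame]

# Two time steps of a toy 3-point spatial discretization.
prompt = encode_rollout([[0.0, 1.0, 0.5], [0.1, 0.9, 0.45]])
```

Multi-step rollouts then amount to repeatedly appending the parsed completion and re-encoding, which is where the abstract's algebraic error growth appears.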

Updated: 2025-09-08 04:08:50

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06322v1

Data-Adaptive Graph Framelets with Generalized Vanishing Moments for Graph Machine Learning

In this paper, we propose a general framework for constructing tight framelet systems on graphs with localized supports based on partition trees. Our construction of framelets provides a simple and efficient way to obtain the orthogonality with $k$ arbitrary orthonormal vectors. When the $k$ vectors contain most of the energy of a family of graph signals, the orthogonality of the framelets intuitively possesses ``generalized ($k$-)vanishing'' moments, and thus, the coefficients are sparse. Moreover, our construction provides not only framelets that are overall sparse vectors but also fast and schematically concise transforms. In a data-adaptive setting, the graph framelet systems can be learned by conducting optimizations on Stiefel manifolds to provide the utmost sparsity for a given family of graph signals. Furthermore, we further exploit the generality of our proposed graph framelet systems for heterophilous graph learning, where graphs are characterized by connecting nodes mainly from different classes. The usual assumption that connected nodes are similar and belong to the same class for homophilious graphs is contradictory for heterophilous graphs. Thus, we are motivated to bypass simple assumptions on heterophilous graphs and focus on generating rich node features induced by the graph structure, so as to improve the graph learning ability of certain neural networks in node classification. We derive a specific system of graph framelets and propose a heuristic method to select framelets as features for neural network input. Several experiments demonstrate the effectiveness and superiority of our approach for non-linear approximation, denoising, and node classification.

Updated: 2025-09-08 03:59:11

Categories: eess.SP,cs.LG,math.FA,43A99, 41A45, 94A11, 94A16

Download: http://arxiv.org/abs/2309.03537v3

Schrodinger's Toolbox: Exploring the Quantum Rowhammer Attack

Residual cross-talk in superconducting qubit devices creates a security vulnerability for emerging quantum cloud services. We demonstrate a Clifford-only Quantum Rowhammer attack, using just X and CNOT gates, that injects faults on IBM's 127-qubit Eagle processors without requiring pulse-level access. Experiments show that targeted hammering induces localized errors confined to the attack cycle and primarily manifests as phase noise, as confirmed by near 50% flip rates under Hadamard-basis probing. A full lattice sweep maps QR's spatial and temporal behavior, revealing reproducible corruption limited to qubits within two coupling hops and rapid recovery in subsequent benign cycles. Finally, we leverage these properties to outline a prime-and-probe covert channel, demonstrating that the clear separability between hammered and benign rounds enables highly reliable signaling without error correction. These findings underscore the need for hardware-level isolation and scheduler-aware defenses as multi-tenant quantum computing becomes standard.

Updated: 2025-09-08 03:55:17

Categories: quant-ph,cs.CR

Download: http://arxiv.org/abs/2509.06318v1

Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix

A central challenge in representation learning is constructing latent embeddings that are both expressive and efficient. In practice, deep networks often produce redundant latent spaces where multiple coordinates encode overlapping information, reducing effective capacity and hindering generalization. Standard metrics such as accuracy or reconstruction loss provide only indirect evidence of such redundancy and cannot isolate it as a failure mode. We introduce a redundancy index, denoted rho(C), that directly quantifies inter-dimensional dependencies by analyzing coupling matrices derived from latent representations and comparing their off-diagonal statistics against a normal distribution via energy distance. The result is a compact, interpretable, and statistically grounded measure of representational quality. We validate rho(C) across discriminative and generative settings on MNIST variants, Fashion-MNIST, CIFAR-10, and CIFAR-100, spanning multiple architectures and hyperparameter optimization strategies. Empirically, low rho(C) reliably predicts high classification accuracy or low reconstruction error, while elevated redundancy is associated with performance collapse. Estimator reliability grows with latent dimension, yielding natural lower bounds for reliable analysis. We further show that Tree-structured Parzen Estimators (TPE) preferentially explore low-rho regions, suggesting that rho(C) can guide neural architecture search and serve as a redundancy-aware regularization target. By exposing redundancy as a universal bottleneck across models and tasks, rho(C) offers both a theoretical lens and a practical tool for evaluating and improving the efficiency of learned representations.
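A rho(C)-style score can be sketched in numpy: form a coupling (here, correlation) matrix from the latent codes, take its off-diagonal entries, and measure their deviation from a reference normal distribution via energy distance. The normalization and reference construction below are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def energy_distance(a, b):
    """Energy distance between two 1-D samples: 2E|A-B| - E|A-A'| - E|B-B'|."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d_ab = np.abs(a[:, None] - b[None, :]).mean()
    d_aa = np.abs(a[:, None] - a[None, :]).mean()
    d_bb = np.abs(b[:, None] - b[None, :]).mean()
    return 2.0 * d_ab - d_aa - d_bb

def redundancy_index(Z, rng):
    """Sketch of a rho(C)-like redundancy score for latent codes Z
    (n_samples, n_dims): off-diagonal coupling entries are compared
    against a scale-matched normal reference via energy distance."""
    C = np.corrcoef(Z, rowvar=False)
    off = C[~np.eye(C.shape[0], dtype=bool)]
    reference = rng.normal(0.0, off.std() + 1e-12, size=4 * off.size)
    return energy_distance(off, reference)

rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 6))
# A redundant latent space: the last three coordinates duplicate the first three.
redundant = np.hstack([independent[:, :3], independent[:, :3]])
rho_ind = redundancy_index(independent, np.random.default_rng(1))
rho_red = redundancy_index(redundant, np.random.default_rng(1))
```

The duplicated-coordinate space scores higher: its off-diagonal couplings pile up at 1.0, far from any normal reference, which is the redundancy signature the abstract describes.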

Updated: 2025-09-08 03:36:47

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2509.06314v1

Predicting Steady-State Behavior in Complex Networks with Graph Neural Networks

In complex systems, information propagation can be defined as diffused or delocalized, weakly localized, and strongly localized. This study investigates the application of graph neural network models to learn the behavior of a linear dynamical system on networks. A graph convolution and attention-based neural network framework has been developed to identify the steady-state behavior of the linear dynamical system. We reveal that our trained model distinguishes the different states with high accuracy. Furthermore, we have evaluated model performance with real-world data. In addition, to understand the explainability of our model, we provide an analytical derivation for the forward and backward propagation of our framework.

Updated: 2025-09-08 03:35:37

Categories: cs.LG,cs.AI,nlin.AO

Download: http://arxiv.org/abs/2502.01693v3

Enhancing Low-Altitude Airspace Security: MLLM-Enabled UAV Intent Recognition

The rapid development of the low-altitude economy emphasizes the critical need for effective perception and intent recognition of non-cooperative unmanned aerial vehicles (UAVs). The advanced generative reasoning capabilities of multimodal large language models (MLLMs) present a promising approach in such tasks. In this paper, we focus on the combination of UAV intent recognition and MLLMs. Specifically, we first present an MLLM-enabled UAV intent recognition architecture, where the multimodal perception system is utilized to obtain real-time payload and motion information of UAVs, generating structured input information, and the MLLM outputs intent recognition results by incorporating environmental information, prior knowledge, and tactical preferences. Subsequently, we review the related work and demonstrate its progress within the proposed architecture. Then, we present a low-altitude confrontation use case to demonstrate the feasibility of our architecture and offer valuable insights for practical system design. Finally, future challenges are discussed, followed by corresponding strategic recommendations for further applications.

Updated: 2025-09-08 03:34:56

Categories: eess.SY,cs.LG,cs.SY,68T07, 68T45, 93C85, 94A12,I.2.10; I.2.6; I.2.9; C.2.1

Download: http://arxiv.org/abs/2509.06312v1

Whisper Smarter, not Harder: Adversarial Attack on Partial Suppression

Currently, Automatic Speech Recognition (ASR) models are deployed in an extensive range of applications. However, recent studies have demonstrated the possibility of adversarial attacks on these models which could potentially suppress or disrupt model output. We investigate and verify the robustness of these attacks and explore whether it is possible to increase their imperceptibility. We additionally find that by relaxing the optimisation objective from complete suppression to partial suppression, we can further increase the imperceptibility of the attack. We also explore possible defences against these attacks and show that a low-pass filter could potentially serve as an effective defence.
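The low-pass filter defence mentioned above can be sketched with an FFT (an illustrative implementation; the paper does not prescribe this exact filter): high-frequency adversarial perturbation is removed while the speech band is preserved.

```python
import numpy as np

def lowpass(audio, sample_rate, cutoff_hz):
    """FFT-based low-pass filter: zero all spectral components above
    cutoff_hz and reconstruct the time-domain signal."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

# Toy signal: a 300 Hz "speech" tone plus a 7 kHz high-frequency perturbation.
sr = 16000
t = np.arange(sr) / sr
speech_band = np.sin(2 * np.pi * 300 * t)
perturbation = 0.1 * np.sin(2 * np.pi * 7000 * t)
cleaned = lowpass(speech_band + perturbation, sr, cutoff_hz=4000)
```

After filtering, the 7 kHz component is gone and the 300 Hz tone survives essentially unchanged, the property that makes the defence attractive against perturbations concentrated outside the speech band.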

Updated: 2025-09-08 03:30:40

Categories: cs.SD,cs.CR,cs.LG,eess.AS

Download: http://arxiv.org/abs/2508.09994v2

DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving

Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler (KLD) divergence, which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
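The KLD-variance signal can be sketched as follows (a hypothetical controller; the mapping from variance to speculation length below is an illustrative assumption, not the paper's rule): compute per-position KL divergence between draft and target token distributions, and shorten the speculation cap when its recent variance indicates an unstable region.

```python
import numpy as np

def kld(p, q):
    """KL(p || q) between two discrete token distributions (no zero entries)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def speculation_cap(draft_dists, target_dists, base_len=8, window=4):
    """DSDE-style sketch: the variance of per-position KL divergence
    diagnoses regional stability; unstable regions get a shorter
    speculation-length cap, mitigating wasted draft tokens."""
    kls = [kld(p, q) for p, q in zip(draft_dists, target_dists)]
    instability = float(np.var(kls[-window:]))
    return max(2, base_len - int(10 * instability))

# Stable region: draft and target agree, so the cap stays at base_len.
stable = speculation_cap([[0.7, 0.3]] * 4, [[0.7, 0.3]] * 4)
# Unstable region: target alternates sharply, so the cap shrinks.
unstable = speculation_cap([[0.7, 0.3]] * 4,
                           [[0.7, 0.3], [0.05, 0.95]] * 2)
```

Because the signal is post hoc (computed from distributions the verifier already produces), it adds no training and negligible runtime cost.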

Updated: 2025-09-08 03:27:39

Categories: cs.DC,cs.AI,cs.IT,math.IT,I.2.7; C.2.4

Download: http://arxiv.org/abs/2509.01083v2

WindFM: An Open-Source Foundation Model for Zero-Shot Wind Power Forecasting

High-quality wind power forecasting is crucial for the operation of modern power grids. However, prevailing data-driven paradigms either train a site-specific model which cannot generalize to other locations or rely on fine-tuning of general-purpose time series foundation models which are difficult to incorporate domain-specific data in the energy sector. This paper introduces WindFM, a lightweight and generative Foundation Model designed specifically for probabilistic wind power forecasting. WindFM employs a discretize-and-generate framework. A specialized time-series tokenizer first converts continuous multivariate observations into discrete, hierarchical tokens. Subsequently, a decoder-only Transformer learns a universal representation of wind generation dynamics by autoregressively pre-training on these token sequences. Using the comprehensive WIND Toolkit dataset comprising approximately 150 billion time steps from more than 126,000 sites, WindFM develops a foundational understanding of the complex interplay between atmospheric conditions and power output. Extensive experiments demonstrate that our compact 8.1M parameter model achieves state-of-the-art zero-shot performance on both deterministic and probabilistic tasks, outperforming specialized models and larger foundation models without any fine-tuning. In particular, WindFM exhibits strong adaptiveness under out-of-distribution data from a different continent, demonstrating the robustness and transferability of its learned representations. Our pre-trained model is publicly available at https://github.com/shiyu-coder/WindFM.
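The discretize-and-generate step can be illustrated with a simple uniform quantizer (a stand-in sketch; WindFM's tokenizer is hierarchical and its exact scheme is not reproduced here): continuous observations become discrete token ids that a decoder-only Transformer can model autoregressively, and token ids map back to representative values at generation time.

```python
import numpy as np

def make_bins(values, n_tokens=16):
    """Fit uniform quantization bins over the observed value range."""
    lo, hi = float(np.min(values)), float(np.max(values))
    edges = np.linspace(lo, hi, n_tokens + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return edges, centers

def encode(values, edges):
    """Map continuous observations to discrete token ids."""
    return np.clip(np.searchsorted(edges, values, side="right") - 1,
                   0, len(edges) - 2)

def decode(tokens, centers):
    """Map token ids back to representative continuous values."""
    return centers[tokens]

# Toy normalized power readings.
power = np.array([0.0, 0.2, 0.5, 0.9, 1.0])
edges, centers = make_bins(power, n_tokens=10)
tokens = encode(power, edges)
recon = decode(tokens, centers)
```

The round-trip error is bounded by half a bin width, which is the resolution/vocabulary-size trade-off any such tokenizer must balance.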

Updated: 2025-09-08 03:26:18

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06311v1

Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework

Large reasoning models (LRMs) have exhibited strong performance on complex reasoning tasks, with further gains achievable through increased computational budgets at inference. However, current test-time scaling methods predominantly rely on redundant sampling, ignoring the historical experience utilization, thereby limiting computational efficiency. To overcome this limitation, we propose Sticker-TTS, a novel test-time scaling framework that coordinates three collaborative LRMs to iteratively explore and refine solutions guided by historical attempts. At the core of our framework are distilled key conditions-termed stickers-which drive the extraction, refinement, and reuse of critical information across multiple rounds of reasoning. To further enhance the efficiency and performance of our framework, we introduce a two-stage optimization strategy that combines imitation learning with self-improvement, enabling progressive refinement. Extensive evaluations on three challenging mathematical reasoning benchmarks, including AIME-24, AIME-25, and OlymMATH, demonstrate that Sticker-TTS consistently surpasses strong baselines, including self-consistency and advanced reinforcement learning approaches, under comparable inference budgets. These results highlight the effectiveness of sticker-guided historical experience utilization. Our code and data are available at https://github.com/RUCAIBox/Sticker-TTS.

Updated: 2025-09-08 03:26:03

Categories: cs.AI,cs.CL,I.2.7

Download: http://arxiv.org/abs/2509.05007v2

Minimax optimal transfer learning for high-dimensional additive regression

This paper studies high-dimensional additive regression under the transfer learning framework, where one observes samples from a target population together with auxiliary samples from different but potentially related regression models. We first introduce a target-only estimation procedure based on the smooth backfitting estimator with local linear smoothing. In contrast to previous work, we establish general error bounds under sub-Weibull($\alpha$) noise, thereby accommodating heavy-tailed error distributions. In the sub-exponential case ($\alpha=1$), we show that the estimator attains the minimax lower bound under regularity conditions, which requires a substantial departure from existing proof strategies. We then develop a novel two-stage estimation method within a transfer learning framework, and provide theoretical guarantees at both the population and empirical levels. Error bounds are derived for each stage under general tail conditions, and we further demonstrate that the minimax optimal rate is achieved when the auxiliary and target distributions are sufficiently close. All theoretical results are supported by simulation studies and real data analysis.
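A rough sketch of the two-stage structure, under simplifying assumptions: the paper works with additive models and smooth backfitting, but the pooling-then-correction idea can be illustrated with plain linear ridge regression. `ridge1`/`ridge2` are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def two_stage_transfer(X_aux, y_aux, X_tgt, y_tgt, ridge1=1e-2, ridge2=25.0):
    # Stage 1: pooled fit over auxiliary + target samples.
    Xp = np.vstack([X_aux, X_tgt])
    yp = np.concatenate([y_aux, y_tgt])
    p = Xp.shape[1]
    beta_pool = np.linalg.solve(Xp.T @ Xp + ridge1 * np.eye(p), Xp.T @ yp)
    # Stage 2: penalized correction fitted on the target sample only;
    # ridge2 controls how far the estimate may move away from the pooled fit.
    resid = y_tgt - X_tgt @ beta_pool
    delta = np.linalg.solve(X_tgt.T @ X_tgt + ridge2 * np.eye(p), X_tgt.T @ resid)
    return beta_pool + delta

rng = np.random.default_rng(0)
p = 5
beta_tgt = np.ones(p)
beta_aux = beta_tgt + 0.05 * rng.standard_normal(p)  # close, but not identical
X_aux = rng.standard_normal((500, p))
y_aux = X_aux @ beta_aux + rng.standard_normal(500)
X_tgt = rng.standard_normal((20, p))
y_tgt = X_tgt @ beta_tgt + rng.standard_normal(20)
beta_hat = two_stage_transfer(X_aux, y_aux, X_tgt, y_tgt)
```

When the auxiliary model is close to the target one, borrowing the pooled fit typically beats fitting on the small target sample alone.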

Updated: 2025-09-08 03:16:05

Categories: stat.ML,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2509.06308v1

BeSimulator: A Large Language Model Powered Text-based Behavior Simulator

Traditional robot simulators focus on physical process modeling and realistic rendering, often suffering from high computational costs, inefficiencies, and limited adaptability. To handle this issue, we concentrate on behavior simulation in robotics to analyze and validate the logic behind robot behaviors, aiming to achieve preliminary evaluation before deploying resource-intensive simulators and thus enhance simulation efficiency. In this paper, we propose BeSimulator, a modular and novel LLM-powered framework, as an attempt towards behavior simulation in the context of text-based environments. By constructing text-based virtual environments and performing semantic-level simulation, BeSimulator can generalize across scenarios and achieve long-horizon complex simulation. Inspired by human cognition paradigm, it employs a "consider-decide-capture-transfer" four-phase simulation process, termed Chain of Behavior Simulation (CBS), which excels at analyzing action feasibility and state transition. Additionally, BeSimulator incorporates code-driven reasoning to enable arithmetic operations and enhance reliability, and reflective feedback to refine simulation. Based on our manually constructed behavior-tree-based simulation benchmark, BTSIMBENCH, our experiments show a significant performance improvement in behavior simulation compared to baselines, ranging from 13.60% to 24.80%. Code and data are available at https://github.com/Dawn888888/BeSimulator.

Updated: 2025-09-08 03:14:54

Categories: cs.RO,cs.AI,cs.CL

Download: http://arxiv.org/abs/2409.15865v2

Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models

Conventional approaches to building energy retrofit decision making suffer from limited generalizability and low interpretability, hindering adoption in diverse residential contexts. With the growth of Smart and Connected Communities, generative AI, especially large language models (LLMs), may help by processing contextual information and producing practitioner-readable recommendations. We evaluate seven LLMs (ChatGPT, DeepSeek, Gemini, Grok, Llama, and Claude) on residential retrofit decisions under two objectives: maximizing CO2 reduction (technical) and minimizing payback period (sociotechnical). Performance is assessed on four dimensions: accuracy, consistency, sensitivity, and reasoning, using a dataset of 400 homes across 49 US states. LLMs generate effective recommendations in many cases, reaching up to 54.5 percent top-1 match and 92.8 percent within the top 5 without fine-tuning. Performance is stronger for the technical objective, while sociotechnical decisions are limited by economic trade-offs and local context. Agreement across models is low, and higher-performing models tend to diverge from others. LLMs are sensitive to location and building geometry but less sensitive to technology and occupant behavior. Most models show step-by-step, engineering-style reasoning, but it is often simplified and lacks deeper contextual awareness. Overall, LLMs are promising assistants for energy retrofit decision making, but improvements in accuracy, consistency, and context handling are needed for reliable practice.

Updated: 2025-09-08 03:13:47

Categories: cs.AI

Download: http://arxiv.org/abs/2509.06307v1

HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models

Positional encoding mechanisms enable Transformers to model sequential structure and long-range dependencies in text. While absolute positional encodings struggle with extrapolation to longer sequences due to fixed positional representations, and relative approaches like Alibi exhibit performance degradation on extremely long contexts, the widely-used Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that hinder stable long-distance dependency modelling. We address these limitations through a geometric reformulation of positional encoding. Drawing inspiration from Lorentz transformations in hyperbolic geometry, we propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Theoretical analysis demonstrates that RoPE is a special case of our generalized formulation. HoPE fundamentally resolves RoPE's oscillation issues by enforcing monotonic decay of attention weights with increasing token distances. Extensive experimental results, including perplexity evaluations under several extended sequence benchmarks, show that HoPE consistently exceeds existing positional encoding methods. These findings underscore HoPE's enhanced capacity for representing and generalizing long-range dependencies. Data and code will be available.
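A minimal numeric sketch of the Lorentz-rotation idea on a single 2-D feature pair: replacing RoPE's circular rotation with a hyperbolic boost, and pairing it with the Minkowski inner product, makes the query-key score depend only on the relative offset n - m. The scale `s` is an illustrative hyperparameter; this is an assumption-laden toy, not the paper's full method.

```python
import numpy as np

def boost(theta):
    # Hyperbolic (Lorentz) rotation, the analogue of RoPE's cos/sin rotation.
    return np.array([[np.cosh(theta), np.sinh(theta)],
                     [np.sinh(theta), np.cosh(theta)]])

def hope_score(q, k, m, n, s=0.1):
    eta = np.diag([1.0, -1.0])   # Minkowski metric <u, v> = u0*v0 - u1*v1
    qm = boost(s * m) @ q        # position-rotated query at position m
    kn = boost(s * n) @ k        # position-rotated key at position n
    return qm @ eta @ kn

q = np.array([1.0, 0.5])
k = np.array([0.3, -0.2])
# Shifting both positions by the same amount leaves the score unchanged:
a = hope_score(q, k, 2, 7)
b = hope_score(q, k, 12, 17)
```

This follows from the boost identity B(a) eta B(b) = eta B(b - a), the hyperbolic counterpart of RoPE's relative-position property.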

Updated: 2025-09-08 03:13:38

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.05218v2

MOSAIC: Minimax-Optimal Sparsity-Adaptive Inference for Change Points in Dynamic Networks

We propose a new inference framework, named MOSAIC, for change-point detection in dynamic networks with the simultaneous low-rank and sparse-change structure. We establish the minimax rate of detection boundary, which relies on the sparsity of changes. We then develop an eigen-decomposition-based test with screened signals that approaches the minimax rate in theory, with only a minor logarithmic loss. For practical implementation of MOSAIC, we adjust the theoretical test by a novel residual-based technique, resulting in a pivotal statistic that converges to a standard normal distribution via the martingale central limit theorem under the null hypothesis and achieves full power under the alternative hypothesis. We also analyze the minimax rate of testing boundary for dynamic networks without the low-rank structure, which almost aligns with the results in high-dimensional mean-vector change-point inference. We showcase the effectiveness of MOSAIC and verify our theoretical results with several simulation examples and a real data application.

Updated: 2025-09-08 03:09:50

Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2509.06303v1

The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model

This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear if distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we provide a near-optimal characterization of the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or chi-squared divergence. The algorithm studied here is a model-based method called distributionally robust value iteration, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: in the case w.r.t. the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; in the case w.r.t. the chi-squared divergence, the sample complexity of RMDPs far exceeds the standard MDP counterpart.
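For the TV uncertainty set, the worst-case transition kernel has a simple form: move up to sigma of probability mass from the highest-value next states onto the lowest-value one. A hedged sketch of distributionally robust value iteration on an invented 3-state, 2-action toy MDP (not from the paper):

```python
import numpy as np

def tv_worst_case_expectation(p, v, sigma):
    """min over p' with TV(p', p) <= sigma of E_{p'}[v], same support."""
    q = p.copy()
    budget = sigma
    for i in np.argsort(v)[::-1]:      # drain highest-value states first
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
    q[np.argmin(v)] += sigma - budget  # dump the moved mass on the worst state
    return float(q @ v)

def robust_value_iteration(P, R, gamma, sigma, iters=400):
    # P[a, s] is the nominal next-state distribution, R[a, s] the reward.
    nA, nS = R.shape
    v = np.zeros(nS)
    for _ in range(iters):
        q = np.array([[R[a, s] + gamma * tv_worst_case_expectation(P[a, s], v, sigma)
                       for s in range(nS)] for a in range(nA)])
        v = q.max(axis=0)
    return v

P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
R = np.array([[1.0, 0.0, 2.0], [0.5, 0.1, 1.5]])
v_nominal = robust_value_iteration(P, R, gamma=0.9, sigma=0.0)  # standard VI
v_robust = robust_value_iteration(P, R, gamma=0.9, sigma=0.2)
```

With sigma = 0 the uncertainty set collapses to the nominal MDP and the iteration reduces to standard value iteration; robust values are never larger than nominal ones.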

Updated: 2025-09-08 02:57:28

Categories: cs.LG,cs.IT,math.IT,math.ST,stat.TH

Download: http://arxiv.org/abs/2305.16589v3

SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies

Offline Imitation Learning (IL) methods such as Behavior Cloning are effective at acquiring complex robotic manipulation skills. However, existing IL-trained policies are confined to executing the task at the same speed as shown in demonstration data. This limits the task throughput of a robotic system, a critical requirement for applications such as industrial automation. In this paper, we introduce and formalize the novel problem of enabling faster-than-demonstration execution of visuomotor policies and identify fundamental challenges in robot dynamics and state-action distribution shifts. We instantiate the key insights as SAIL (Speed Adaptation for Imitation Learning), a full-stack system integrating four tightly-connected components: (1) a consistency-preserving action inference algorithm for smooth motion at high speed, (2) high-fidelity tracking of controller-invariant motion targets, (3) adaptive speed modulation that dynamically adjusts execution speed based on motion complexity, and (4) action scheduling to handle real-world system latencies. Experiments on 12 tasks across simulation and two real, distinct robot platforms show that SAIL achieves up to a 4x speedup over demonstration speed in simulation and up to 3.2x speedup in the real world. Additional detail is available at https://nadunranawaka1.github.io/sail-policy

Updated: 2025-09-08 02:56:08

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2506.11948v2

LoaQ: Layer-wise Output Approximation Quantization

A natural and intuitive idea in model quantization is to approximate each component's quantized output to match its original. Layer-wise post-training quantization (PTQ), though based on this idea, adopts a strictly local view and can achieve, at best, only activation-aware approximations of weights. As a result, it often leads to insufficient approximations and practical deviations from this guiding intuition. Recent work has achieved a more accurate approximation of linear-layer outputs within the framework of layer-wise PTQ, but such refinements remain inadequate for achieving alignment with the full model output. Based on a deeper understanding of the structural characteristics of mainstream LLMs, we propose LoaQ, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency. It better aligns with this intuition and can feature a simple closed-form solution, making it orthogonal to existing techniques and readily integrable into existing quantization pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation joint quantization. By integrating seamlessly with existing quantization strategies, it further enhances overall quantization quality and shows strong potential to advance the frontier of post-training quantization.
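The output-level objective admits a one-line closed form for a single weight column: after round-to-nearest (RTN) quantization, rescale the column so that the layer output X @ w, rather than the weight itself, is matched on calibration activations X. This sketch illustrates that objective only, under invented data; it is not the paper's full algorithm.

```python
import numpy as np

def rtn(w, bits=4):
    # Plain round-to-nearest with a symmetric per-column scale.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale), scale

def loaq_column(X, w, bits=4):
    q, s0 = rtn(w, bits)
    y, yq = X @ w, X @ q
    s_star = (yq @ y) / (yq @ yq)   # argmin_s ||X w - s X q||^2, closed form
    return s_star * q, s0 * q       # (output-aligned, plain RTN) reconstructions

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))      # calibration activations (invented)
w = rng.normal(size=32)
w_loaq, w_rtn = loaq_column(X, w, bits=3)
err_loaq = np.linalg.norm(X @ w - X @ w_loaq)
err_rtn = np.linalg.norm(X @ w - X @ w_rtn)
```

Because `s_star` minimizes the output error over all scales, the output-aligned reconstruction can never be worse than plain RTN on the calibration set.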

Updated: 2025-09-08 02:50:11

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06297v1

Learning to Walk with Less: a Dyna-Style Approach to Quadrupedal Locomotion

Traditional RL-based locomotion controllers often suffer from low data efficiency, requiring extensive interaction to achieve robust performance. We present a model-based reinforcement learning (MBRL) framework that improves sample efficiency for quadrupedal locomotion by appending synthetic data to the end of standard rollouts in PPO-based controllers, following the Dyna-Style paradigm. A predictive model, trained alongside the policy, generates short-horizon synthetic transitions that are gradually integrated using a scheduling strategy based on the policy update iterations. Through an ablation study, we identified a strong correlation between sample efficiency and rollout length, which guided the design of our experiments. We validated our approach in simulation on the Unitree Go1 robot and showed that replacing part of the simulated steps with synthetic ones not only mimics extended rollouts but also improves policy return and reduces variance. Finally, we demonstrate that this improvement transfers to the ability to track a wide range of locomotion commands using fewer simulated steps.
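The Dyna-style pattern of appending model-generated transitions can be sketched in its classic tabular form. The paper plugs synthetic steps into PPO rollouts for quadrupedal locomotion; this toy uses Dyna-Q on an invented 5-state chain purely to show the real-step / synthetic-step interleaving.

```python
import random

N, GOAL = 5, 4

def step(s, a):  # deterministic chain: a=1 moves right, a=0 moves left
    s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0)

def dyna_q(episodes=50, planning_steps=10, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N)]
    model = {}  # learned model: (s, a) -> (s', r)
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            greedy = max((0, 1), key=lambda x: Q[s][x])
            a = rng.choice([0, 1]) if rng.random() < eps else greedy
            s2, r = step(s, a)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])  # real update
            model[(s, a)] = (s2, r)  # remember the real transition
            for _ in range(planning_steps):  # synthetic (Dyna) updates
                (ps, pa), (ps2, pr) = rng.choice(list(model.items()))
                Q[ps][pa] += alpha * (pr + gamma * max(Q[ps2]) - Q[ps][pa])
            s = s2
    return Q

Q = dyna_q()
policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N)]
```

The synthetic replays mimic the effect of longer rollouts without additional environment interaction, which is the sample-efficiency argument the abstract makes.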

Updated: 2025-09-08 02:48:23

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2509.06296v1

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.

Updated: 2025-09-08 02:44:54

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.15884v2

AARK: An Open Toolkit for Autonomous Racing Research

Autonomous racing demands safe control of vehicles at their physical limits for extended periods of time, providing insights into advanced vehicle safety systems which increasingly rely on intervention provided by vehicle autonomy. Participation in this field carries with it a high barrier to entry. Physical platforms and their associated sensor suites require large capital outlays before any demonstrable progress can be made. Simulators allow researchers to develop soft autonomous systems without purchasing a platform. However, currently available simulators lack visual and dynamic fidelity, can still be expensive to buy, lack customisation, and are difficult to use. AARK provides three packages, ACI, ACDG, and ACMPC. These packages enable research into autonomous control systems in the demanding environment of racing to bring more people into the field and improve reproducibility: ACI provides researchers with a computer vision-friendly interface to Assetto Corsa for convenient comparison and evaluation of autonomous control solutions; ACDG enables generation of depth, normal and semantic segmentation data for training computer vision models to use in perception systems; and ACMPC gives newcomers to the field a modular full-stack autonomous control solution, capable of controlling vehicles, to build upon. AARK aims to unify and democratise research into a field critical to providing safer roads and trusted autonomous systems.

Updated: 2025-09-08 02:44:25

Categories: cs.RO,cs.LG,cs.SY,eess.SY

Download: http://arxiv.org/abs/2410.00358v2

Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data

Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration. Our findings represent a solid step towards automating scientific discovery through large language models.

Updated: 2025-09-08 02:30:47

Categories: q-bio.GN,cs.AI,cs.LG

Download: http://arxiv.org/abs/2402.12391v3

A Spatio-Temporal Graph Neural Networks Approach for Predicting Silent Data Corruption inducing Circuit-Level Faults

Silent Data Errors (SDEs) from time-zero defects and aging degrade safety-critical systems. Functional testing detects SDE-related faults but is expensive to simulate. We present a unified spatio-temporal graph convolutional network (ST-GCN) for fast, accurate prediction of long-cycle fault impact probabilities (FIPs) in large sequential circuits, supporting quantitative risk assessment. Gate-level netlists are modeled as spatio-temporal graphs to capture topology and signal timing; dedicated spatial and temporal encoders predict multi-cycle FIPs efficiently. On ISCAS-89 benchmarks, the method reduces simulation time by more than 10x while maintaining high accuracy (mean absolute error 0.024 for 5-cycle predictions). The framework accepts features from testability metrics or fault simulation, allowing efficiency-accuracy trade-offs. A test-point selection study shows that choosing observation points by predicted FIPs improves detection of long-cycle, hard-to-detect faults. The approach scales to SoC-level test strategy optimization and fits downstream electronic design automation flows.

Updated: 2025-09-08 02:23:51

Categories: cs.LG,cs.AR,cs.ET,B.7.3

Download: http://arxiv.org/abs/2509.06289v1

Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models

Effective specification-aware part retrieval within complex CAD assemblies is essential for automated design verification and downstream engineering tasks. However, directly applying LLMs/VLMs to this task presents some challenges: the input sequences may exceed model token limits, and even after processing, performance remains unsatisfactory. Moreover, fine-tuning LLMs/VLMs requires significant computational resources, and for many high-performing general-use proprietary models (e.g., GPT or Gemini), fine-tuning access is not available. In this paper, we propose a novel part retrieval framework that requires no extra training, instead using Error Notebooks + RAG for refined prompt engineering to improve the existing general model's retrieval performance. The construction of Error Notebooks consists of two steps: (1) collecting historical erroneous CoTs and their incorrect answers, and (2) connecting these CoTs through reflective corrections until the correct solutions are obtained. As a result, the Error Notebooks serve as a repository of tasks along with their corrected CoTs and final answers. RAG is then employed to retrieve specification-relevant records from the Error Notebooks and incorporate them into the inference process. Another major contribution of our work is a human-in-the-loop CAD dataset, which is used to evaluate our method. In addition, the engineering value of our novel framework lies in its ability to effectively handle 3D models with lengthy, non-natural-language metadata. Experiments with proprietary models, including GPT-4o and the Gemini series, show substantial gains, with GPT-4o (Omni) achieving up to a 23.4% absolute accuracy improvement on the human preference dataset. Moreover, ablation studies confirm that CoT reasoning provides benefits especially in challenging cases with higher part counts (>10).
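The retrieval step can be sketched with bag-of-words cosine similarity: past error-notebook records (each a task, its corrected chain of thought, and the final answer) are ranked against the new query and spliced into the prompt. The records and field names below are invented placeholders, not data or an API from the paper.

```python
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(notebook, query, k=1):
    # Rank records by similarity between the query and each stored task.
    qv = Counter(query.lower().split())
    return sorted(notebook,
                  key=lambda rec: -cosine(qv, Counter(rec["task"].lower().split())))[:k]

notebook = [
    {"task": "find the bracket part in the gearbox assembly",
     "corrected_cot": "...", "answer": "part_17"},
    {"task": "count fasteners in the pump housing",
     "corrected_cot": "...", "answer": "12"},
]
hits = retrieve(notebook, "locate the bracket inside the gearbox assembly")
prompt = ("Past corrected examples:\n"
          + "\n".join(r["task"] for r in hits)
          + "\nNew task: ...")
```

A production version would use dense embeddings rather than word counts, but the structure — notebook of corrected CoTs, similarity search, prompt assembly — is the same.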

Updated: 2025-09-08 02:22:16

Categories: cs.AI

Download: http://arxiv.org/abs/2509.01350v2

Statistical Inference for Misspecified Contextual Bandits

Contextual bandit algorithms have transformed modern experimentation by enabling real-time adaptation for personalized treatment and efficient use of data. Yet these advantages create challenges for statistical inference due to adaptivity. A fundamental property that supports valid inference is policy convergence, meaning that action-selection probabilities converge in probability given the context. Convergence ensures replicability of adaptive experiments and stability of online algorithms. In this paper, we highlight a previously overlooked issue: widely used algorithms such as LinUCB may fail to converge when the reward model is misspecified, and such non-convergence creates fundamental obstacles for statistical inference. This issue is practically important, as misspecified models -- such as linear approximations of complex dynamic system -- are often employed in real-world adaptive experiments to balance bias and variance. Motivated by this insight, we propose and analyze a broad class of algorithms that are guaranteed to converge even under model misspecification. Building on this guarantee, we develop a general inference framework based on an inverse-probability-weighted Z-estimator (IPW-Z) and establish its asymptotic normality with a consistent variance estimator. Simulation studies confirm that the proposed method provides robust and data-efficient confidence intervals, and can outperform existing approaches that exist only in the special case of offline policy evaluation. Taken together, our results underscore the importance of designing adaptive algorithms with built-in convergence guarantees to enable stable experimentation and valid statistical inference in practice.
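The inverse-probability-weighting idea at the core of the estimator can be sketched on simulated bandit data: each observed reward is reweighted by the logged probability with which its arm was chosen, so an arm's mean is estimated without bias even though the sampling policy drifted over time. This shows the weighting step only, not the paper's full Z-estimation framework; all numbers are invented.

```python
import numpy as np

def ipw_arm_mean(arms, rewards, probs, arm):
    # probs[t] = logged probability that arms[t] was chosen at time t.
    arms, rewards, probs = map(np.asarray, (arms, rewards, probs))
    w = (arms == arm) / probs
    return float(np.mean(w * rewards))

rng = np.random.default_rng(1)
true_means = {0: 0.3, 1: 0.6}
arms, rewards, probs = [], [], []
p1 = 0.5
for t in range(20000):
    a = int(rng.random() < p1)
    arms.append(a)
    probs.append(p1 if a == 1 else 1 - p1)
    rewards.append(rng.normal(true_means[a], 0.1))
    # Adaptive, drifting policy: selection probabilities change with t.
    p1 = min(0.9, max(0.1, 0.5 + 0.4 * (t / 20000)))

est = ipw_arm_mean(arms, rewards, probs, arm=1)
```

Because each term is divided by its own selection probability, the estimate stays centered on the arm's true mean (0.6 here) despite the adaptivity.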

Updated: 2025-09-08 02:19:37

Categories: math.ST,cs.AI,stat.TH

Download: http://arxiv.org/abs/2509.06287v1

RecMind: LLM-Enhanced Graph Neural Networks for Personalized Consumer Recommendations

Personalization is a core capability across consumer technologies, streaming, shopping, wearables, and voice, yet it remains challenged by sparse interactions, fast content churn, and heterogeneous textual signals. We present RecMind, an LLM-enhanced graph recommender that treats the language model as a preference prior rather than a monolithic ranker. A frozen LLM equipped with lightweight adapters produces text-conditioned user/item embeddings from titles, attributes, and reviews; a LightGCN backbone learns collaborative embeddings from the user-item graph. We align the two views with a symmetric contrastive objective and fuse them via intra-layer gating, allowing language to dominate in cold/long-tail regimes and graph structure to stabilize rankings elsewhere. On Yelp and Amazon-Electronics, RecMind attains the best results on all eight reported metrics, with relative improvements up to +4.53% (Recall@40) and +4.01% (NDCG@40) over strong baselines. Ablations confirm both the necessity of cross-view alignment and the advantage of gating over late fusion and LLM-only variants.
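The intra-layer gating between the two views can be sketched directly: a learned sigmoid gate mixes the text-conditioned embedding (LLM side) and the collaborative embedding (LightGCN side) per dimension. The random weights below are stand-ins; no trained parameters or exact layer shapes from the paper are used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(text_emb, graph_emb, W, b):
    # Gate conditioned on both views; each entry of g lies in (0, 1).
    g = sigmoid(np.concatenate([text_emb, graph_emb]) @ W + b)
    return g * text_emb + (1.0 - g) * graph_emb

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(2 * d, d))  # illustrative gate parameters
b = np.zeros(d)
t = rng.normal(size=d)   # text-conditioned user/item embedding
c = rng.normal(size=d)   # collaborative (graph) embedding
fused = gated_fuse(t, c, W, b)
```

Since the gate is a per-dimension convex combination, the fused vector always lies between the two views; training the gate lets language dominate for cold/long-tail items and graph structure elsewhere, as the abstract describes.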

Updated: 2025-09-08 02:15:55

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06286v1

From Implicit Exploration to Structured Reasoning: Leveraging Guideline and Refinement for LLMs

Large language models (LLMs) have advanced general-purpose reasoning, showing strong performance across diverse tasks. However, existing methods often rely on implicit exploration, where the model follows stochastic and unguided reasoning paths-like walking without a map. This leads to unstable reasoning paths, lack of error correction, and limited learning from past experience. To address these issues, we propose a framework that shifts from implicit exploration to structured reasoning through guideline and refinement. First, we extract structured reasoning patterns from successful trajectories and reflective signals from failures. During inference, the model follows these guidelines step-by-step, with refinement applied after each step to correct errors and stabilize the reasoning process. Experiments on BBH and four additional benchmarks (GSM8K, MATH-500, MBPP, HumanEval) show that our method consistently outperforms strong baselines across diverse reasoning tasks. Structured reasoning with stepwise execution and refinement improves stability and generalization, while guidelines transfer well across domains and flexibly support cross-model collaboration, matching or surpassing supervised fine-tuning in effectiveness and scalability.

Updated: 2025-09-08 02:11:49

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.06284v1

Error-quantified Conformal Inference for Time Series

Uncertainty quantification in time series prediction is challenging due to the temporal dependence and distribution shift of sequential data. Conformal inference provides a pivotal and flexible instrument for assessing the uncertainty of machine learning models through prediction sets. Recently, a series of online conformal inference methods have updated the thresholds of prediction sets by performing online gradient descent on a sequence of quantile loss functions. A drawback of such methods is that they only use the information of revealed non-conformity scores via miscoverage indicators but ignore error quantification, namely the distance between the non-conformity score and the current threshold. To accurately leverage the dynamics of the miscoverage error, we propose \textit{Error-quantified Conformal Inference} (ECI) by smoothing the quantile loss function. ECI introduces a continuous and adaptive feedback scale based on the miscoverage error, rather than the simple binary feedback used in existing methods. We establish a long-term coverage guarantee for ECI under arbitrary dependence and distribution shift. Extensive experimental results show that ECI achieves valid miscoverage control and outputs tighter prediction sets than other baselines.
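The smoothed feedback can be sketched as a one-line change to the standard online threshold update. The sigmoid form and the constants below are illustrative assumptions for the smoothing, not the paper's exact construction.

```python
import numpy as np

def eci_update(q, score, alpha, lr, tau):
    """One online threshold update with error-quantified (smoothed) feedback.

    Standard online conformal updates use the binary indicator 1{score > q};
    here the feedback is a smooth function of the error (score - q), so the
    step adapts to *how far* the score is from the threshold. The sigmoid
    and the smoothing scale tau are an illustrative choice.
    """
    feedback = 1.0 / (1.0 + np.exp(-(score - q) / tau))  # in (0, 1)
    return q + lr * (feedback - alpha)  # raise q when miscoverage is likely

rng = np.random.default_rng(1)
alpha, lr, tau = 0.1, 0.05, 0.1
q = 0.0
scores = rng.normal(0.0, 1.0, size=5000)   # toy non-conformity scores
for s in scores:
    q = eci_update(q, s, alpha, lr, tau)
coverage = float(np.mean(scores <= q))     # hindsight coverage near 1 - alpha
```

On this stationary toy stream the threshold settles near the 90th percentile of the scores; the continuous feedback lets far-out scores push the threshold harder than borderline ones, unlike a binary miscoverage indicator.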

Updated: 2025-09-08 02:09:05

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2502.00818v2

MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds

The rapid advancement of large language models has intensified public concerns about potential misuse. It is therefore important to build trustworthy AI-generated text detection systems. Existing methods neglect stylistic modeling and mostly rely on static thresholds, which greatly limits detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework, which enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs contains three core components: the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For input text, the SAR activates the appropriate reference data in the SRR and provides them to the CTE. Subsequently, the CTE jointly models linguistic statistical properties and semantic features to dynamically determine the optimal threshold. With a discrimination score, MoSEs yields prediction labels with the corresponding confidence level. Our framework achieves an average improvement of 11.34% in detection performance compared to baselines. More encouragingly, MoSEs shows an even clearer improvement of 39.15% in the low-resource case. Our code is available at https://github.com/creator-xi/MoSEs.

Updated: 2025-09-08 02:08:49

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.02499v3

SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (``thinking'') models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.

Updated: 2025-09-08 02:07:09

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.06283v1

Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning

In-context learning (ICL) has emerged as a powerful capability of large language models (LLMs), enabling them to perform new tasks based on a few provided examples without explicit fine-tuning. Despite their impressive adaptability, these models remain vulnerable to subtle adversarial perturbations and exhibit unpredictable behavior when faced with linguistic variations. Inspired by software testing principles, we introduce MMT4NL, a framework for evaluating the trustworthiness of in-context learning that utilizes adversarial perturbations and software testing techniques. It covers diverse evaluation aspects of linguistic capabilities for testing the ICL capabilities of LLMs. MMT4NL is built around the idea of crafting metamorphic adversarial examples from a test set in order to quantify and pinpoint bugs in the designed prompts of ICL. Our philosophy is to treat any LLM as software and validate its functionalities just like testing software. Finally, we demonstrate applications of MMT4NL on sentiment analysis and question-answering tasks. Our experiments reveal various linguistic bugs in state-of-the-art LLMs.
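The metamorphic-testing idea above — label-preserving perturbations should not change a model's prediction — can be sketched in a few lines. The variants and the keyword "classifier" standing in for an LLM are illustrative assumptions, not MMT4NL's actual test suite.

```python
def metamorphic_variants(text):
    """Perturbations that should preserve a sentiment label (toy examples)."""
    return [
        text + " by the way",           # neutral suffix
        text.replace("movie", "film"),  # synonym swap
        "  " + text + "  ",             # whitespace noise
    ]

def toy_sentiment(text):
    """Stand-in for the LLM under test: trivial keyword-based polarity."""
    t = text.lower()
    return "pos" if "love" in t or "great" in t else "neg"

def metamorphic_bugs(text):
    """Return the variants whose predicted label differs from the original's.

    A non-empty result flags a metamorphic violation, i.e. a linguistic bug.
    """
    base = toy_sentiment(text)
    return [v for v in metamorphic_variants(text) if toy_sentiment(v) != base]

bugs = metamorphic_bugs("I love this movie, it is great")
```

The framework's role is then to generate such variants systematically across linguistic capabilities and count the violations per prompt, without needing ground-truth labels for the perturbed inputs.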

Updated: 2025-09-08 02:01:48

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2504.18827v3

TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning

Table reasoning is crucial for leveraging structured data in domains such as finance, healthcare, and scientific research. While large language models (LLMs) show promise in multi-step reasoning, purely text-based methods often struggle with the complex numerical computations and fine-grained operations inherently required in this task. Tool-integrated reasoning improves computational accuracy via explicit code execution, yet existing systems frequently rely on rigid patterns, supervised imitation, and lack true autonomous adaptability. In this paper, we present TableMind, an LLM-driven table reasoning agent that (i) autonomously performs multi-turn tool invocation, (ii) writes and executes data-analysis code in a secure sandbox environment for precise numerical reasoning, and (iii) exhibits high-level capabilities such as planning and self-reflection to adapt its strategies. To realize these capabilities, we adopt a two-stage fine-tuning paradigm built on top of a powerful pre-trained language model: supervised fine-tuning on high-quality reasoning trajectories to establish effective tool usage patterns, followed by reinforcement fine-tuning to optimize multi-objective strategies. In particular, we propose Rank-Aware Policy Optimization (RAPO), which increases the update weight of high-quality trajectories when their output probabilities are lower than those of low-quality ones, thereby guiding the model more consistently toward better and more accurate answers. Extensive experiments on several mainstream benchmarks demonstrate that TableMind achieves superior performance compared to competitive baselines, yielding substantial gains in both reasoning accuracy and computational precision.
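The rank-aware intuition behind RAPO can be sketched as a pairwise weighting rule: upweight a trajectory whenever it is better than a peer yet currently less probable under the model. The `boost` constant and the pairwise comparison are an illustrative reading of the abstract, not the paper's exact objective.

```python
def rapo_weights(traj_probs, traj_quality, boost=2.0):
    """Rank-aware update weights (an illustrative reading of RAPO).

    When a higher-quality trajectory currently has *lower* model
    probability than a lower-quality one (a rank inversion), its update
    weight is boosted so the policy is pushed toward the better answer.
    """
    n = len(traj_probs)
    weights = [1.0] * n
    for i in range(n):
        for j in range(n):
            if traj_quality[i] > traj_quality[j] and traj_probs[i] < traj_probs[j]:
                weights[i] = boost  # inversion detected: upweight trajectory i
    return weights

# Good answer (quality 0.9) is under-weighted by the model (prob 0.1 < 0.6):
w = rapo_weights(traj_probs=[0.1, 0.6], traj_quality=[0.9, 0.2])
```

When the model already ranks trajectories consistently with their quality, no inversion fires and every weight stays at 1.0, so the rule only intervenes where the policy disagrees with the reward signal.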

Updated: 2025-09-08 02:00:31

Categories: cs.AI

Download: http://arxiv.org/abs/2509.06278v1

A Fully Parameter-Free Second-Order Algorithm for Convex-Concave Minimax Problems

In this paper, we study second-order algorithms for the convex-concave minimax problem, which has attracted much attention in many fields such as machine learning in recent years. We propose a Lipschitz-free cubic regularization (LF-CR) algorithm for solving the convex-concave minimax optimization problem without knowing the Lipschitz constant. It can be shown that the iteration complexity of the LF-CR algorithm to obtain an $\epsilon$-optimal solution with respect to the restricted primal-dual gap is upper bounded by $\mathcal{O}(\rho^{2/3}\|z_0-z^*\|^2\epsilon^{-2/3})$ , where $z_0=(x_0,y_0)$ is a pair of initial points, $z^*=(x^*,y^*)$ is a pair of optimal solutions, and $\rho$ is the Lipschitz constant. We further propose a fully parameter-free cubic regularization (FF-CR) algorithm that does not require any parameters of the problem, including the Lipschitz constant and the upper bound of the distance from the initial point to the optimal solution. We also prove that the iteration complexity of the FF-CR algorithm to obtain an $\epsilon$-optimal solution with respect to the gradient norm is upper bounded by $\mathcal{O}(\rho^{2/3}\|z_0-z^*\|^{4/3}\epsilon^{-2/3}) $. Numerical experiments show the efficiency of both algorithms. To the best of our knowledge, the proposed FF-CR algorithm is a completely parameter-free second-order algorithm, and its iteration complexity is currently the best in terms of $\epsilon$ under the termination criterion of the gradient norm.

Updated: 2025-09-08 01:51:44

Categories: math.OC,cs.LG,stat.ML,90C47, 90C26, 90C30

Download: http://arxiv.org/abs/2407.03571v2

IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs

Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR, a quality-constrained Intelligent Prompt Routing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with a tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench\footnote{IPRBench will be released upon legal approval.}, a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9\% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency.
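The tolerance-controlled routing can be sketched as "cheapest model whose predicted quality is within tau of the best". This is a sketch of the tolerance idea only; IPR's learned quality estimators and exact routing rule are more involved, and the model names and numbers below are made up.

```python
def route(pred_quality, cost, tau):
    """Pick the cheapest model whose predicted quality is within tau of the best.

    pred_quality: model name -> predicted response quality in [0, 1] for this
    prompt; cost: model name -> price per call; tau in [0, 1] is the
    user-specified tolerance. Illustrative logic only.
    """
    best = max(pred_quality.values())
    eligible = [m for m, q in pred_quality.items() if q >= best - tau]
    return min(eligible, key=lambda m: cost[m])

quality = {"small": 0.78, "medium": 0.86, "large": 0.90}  # hypothetical scores
price = {"small": 1.0, "medium": 4.0, "large": 15.0}      # hypothetical costs
strict = route(quality, price, tau=0.0)     # quality-first routing
relaxed = route(quality, price, tau=0.15)   # cost-first routing
```

At tau = 0 only the best-predicted model is eligible; as tau grows, cheaper models enter the eligible set and the router trades quality for cost exactly as the user allows.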

Updated: 2025-09-08 01:46:27

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06274v1

DispFormer: A Pretrained Transformer Incorporating Physical Constraints for Dispersion Curve Inversion

Surface wave dispersion curve inversion is crucial for estimating subsurface shear-wave velocity ($v_s$), yet traditional methods often face challenges related to computational cost, non-uniqueness, and sensitivity to initial models. While deep learning approaches show promise, many require large labeled datasets and struggle with real-world datasets, which often exhibit varying period ranges, missing values, and low signal-to-noise ratios. To address these limitations, this study introduces DispFormer, a transformer-based neural network for $v_s$ profile inversion from Rayleigh-wave phase and group dispersion curves. DispFormer processes dispersion data independently at each period, allowing it to handle varying lengths without requiring network modifications or strict alignment between training and testing datasets. A depth-aware training strategy is also introduced, incorporating physical constraints derived from the depth sensitivity of dispersion data. DispFormer is pre-trained on a global synthetic dataset and evaluated on two regional synthetic datasets using zero-shot and few-shot strategies. Results show that even without labeled data, the zero-shot DispFormer generates inversion profiles that outperform the interpolated reference model used as the pretraining target, providing a deployable initial model generator to assist traditional workflows. When partial labeled data are available, the few-shot trained DispFormer surpasses traditional global search methods. Real-world tests further confirm that DispFormer generalizes well to dispersion data of varying lengths and achieves lower data residuals than reference models. These findings underscore the potential of DispFormer as a foundation model for dispersion curve inversion and demonstrate the advantages of integrating physics-informed deep learning into geophysical applications.

Updated: 2025-09-08 01:40:28

Categories: physics.geo-ph,cs.AI

Download: http://arxiv.org/abs/2501.04366v2

An Explainable Framework for Particle Swarm Optimization using Landscape Analysis and Machine Learning

Swarm intelligence algorithms have demonstrated remarkable success in solving complex optimization problems across diverse domains. However, their widespread adoption is often hindered by limited transparency in how algorithmic components influence performance. This work presents a multi-faceted investigation of Particle Swarm Optimization (PSO) to further understand the key role of different topologies for better interpretability and explainability. To achieve this objective, we first develop a comprehensive landscape characterization framework using Exploratory Landscape Analysis (ELA) to quantify problem difficulty and identify critical features affecting the optimization performance of PSO. Next, we conduct a rigorous empirical study comparing three fundamental swarm communication architectures -- Ring, Star, and Von Neumann topologies -- analysing their distinct impacts on the exploration-exploitation balance, convergence behaviour, and solution quality, and develop an explainable benchmarking framework for PSO to decode how swarm topologies affect information flow, diversity, and convergence. Based on this, a novel machine learning approach for automated algorithm configuration is introduced, training predictive models on extensive Area over the Convergence Curve (AOCC) data to recommend optimal settings based on problem characteristics. Through systematic experimentation across twenty-four benchmark functions in multiple dimensions, we establish practical guidelines for topology selection and parameter configuration. These findings advance the development of more transparent and reliable swarm intelligence systems. The source codes of this work can be accessed at https://github.com/GitNitin02/ioh_pso.
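The three communication topologies compared above differ only in which particle indices each particle reads its best-known position from. A minimal index-level sketch (a toroidal grid is assumed for the Von Neumann case; actual PSO implementations vary):

```python
def ring_neighbors(i, n):
    """Ring: particle i communicates only with its two adjacent indices."""
    return [(i - 1) % n, (i + 1) % n]

def star_neighbors(i, n):
    """Star (global-best): particle i sees every other particle."""
    return [j for j in range(n) if j != i]

def von_neumann_neighbors(i, rows, cols):
    """Von Neumann: 4-neighborhood on a rows x cols toroidal grid."""
    r, c = divmod(i, cols)
    return [((r - 1) % rows) * cols + c,   # up
            ((r + 1) % rows) * cols + c,   # down
            r * cols + (c - 1) % cols,     # left
            r * cols + (c + 1) % cols]     # right
```

Smaller neighborhoods (Ring) slow the spread of the best-found position and favour exploration, while the Star topology propagates it instantly and favours fast, potentially premature, convergence; Von Neumann sits in between, which is exactly the trade-off the empirical study quantifies.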

Updated: 2025-09-08 01:39:32

Categories: cs.NE,cs.LG

Download: http://arxiv.org/abs/2509.06272v1

UrbanMIMOMap: A Ray-Traced MIMO CSI Dataset with Precoding-Aware Maps and Benchmarks

Sixth generation (6G) systems require environment-aware communication, driven by native artificial intelligence (AI) and integrated sensing and communication (ISAC). Radio maps (RMs), providing spatially continuous channel information, are key enablers. However, generating high-fidelity RM ground truth via electromagnetic (EM) simulations is computationally intensive, motivating machine learning (ML)-based RM construction. The effectiveness of these data-driven methods depends on large-scale, high-quality training data. Current public datasets often focus on single-input single-output (SISO) and limited information, such as path loss, which is insufficient for advanced multi-input multi-output (MIMO) systems requiring detailed channel state information (CSI). To address this gap, this paper presents UrbanMIMOMap, a novel large-scale urban MIMO CSI dataset generated using high-precision ray tracing. UrbanMIMOMap offers comprehensive complex CSI matrices across a dense spatial grid, going beyond traditional path loss data. This rich CSI is vital for constructing high-fidelity RMs and serves as a fundamental resource for data-driven RM generation, including deep learning. We demonstrate the dataset's utility through baseline performance evaluations of representative ML methods for RM construction. This work provides a crucial dataset and reference for research in high-precision RM generation, MIMO spatial performance, and ML for 6G environment awareness. The code and data for this work are available at: https://github.com/UNIC-Lab/UrbanMIMOMap.

Updated: 2025-09-08 01:23:46

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06270v1

REMI: A Novel Causal Schema Memory Architecture for Personalized Lifestyle Recommendation Agents

Personalized AI assistants often struggle to incorporate complex personal data and causal knowledge, leading to generic advice that lacks explanatory power. We propose REMI, a Causal Schema Memory (CSM) architecture for a multimodal lifestyle agent that integrates a personal causal knowledge graph, a causal reasoning engine, and a schema-based planning module. The idea is to deliver explainable, personalized recommendations in domains like fashion, personal wellness, and lifestyle planning. Our architecture uses a personal causal graph of the user's life events and habits, performs goal-directed causal traversals enriched with external knowledge and hypothetical reasoning, and retrieves adaptable plan schemas to generate tailored action plans. A Large Language Model orchestrates these components, producing answers with transparent causal explanations. We outline the CSM system design and introduce new evaluation metrics for personalization and explainability, including Personalization Salience Score and Causal Reasoning Accuracy, to rigorously assess its performance. Results indicate that CSM-based agents can provide more context-aware, user-aligned recommendations compared to baseline LLM agents. This work demonstrates a novel approach to memory-augmented causal reasoning in personalized agents, advancing the development of transparent and trustworthy AI lifestyle assistants.

Updated: 2025-09-08 01:17:46

Categories: cs.AI

Download: http://arxiv.org/abs/2509.06269v1

More is Merrier: Relax the Non-Collusion Assumption in Multi-Server PIR

A long line of research on secure computation has confirmed that anything that can be computed, can be computed securely using a set of non-colluding parties. Indeed, this non-collusion assumption makes a number of problems solvable, as well as reduces overheads and bypasses computational hardness results, and it is pervasive across different privacy-enhancing technologies. However, it remains highly susceptible to covert, undetectable collusion among computing parties. This work stems from an observation that if the number of available computing parties is much higher than the number of parties required to perform a secure computation task, collusion attempts in privacy-preserving computations could be deterred. We focus on the prominent privacy-preserving computation task of multi-server $1$-private information retrieval (PIR) that inherently assumes no pair-wise collusion. For PIR application scenarios, such as those for blockchain light clients, where the available servers can be plentiful, a single server's deviating action is not tremendously beneficial to itself. We can make deviations undesired via small amounts of rewards and penalties, thus significantly raising the bar for collusion resistance. We design and implement a collusion mitigation mechanism on a public bulletin board with payment execution functions, considering only rational and malicious parties with no honest non-colluding servers. Privacy protection is offered for an extended period after the query executions.
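The 1-private two-server setting the paper starts from can be illustrated with textbook XOR-based PIR over a bit/record vector, where privacy rests entirely on the two servers not colluding — exactly the assumption the paper seeks to relax via incentives. This is a toy sketch of the classical scheme, not the paper's mechanism.

```python
import secrets

def pir_query(n, index):
    """Client side of textbook 2-server XOR PIR (1-private).

    q1 is a uniformly random selection vector; q2 differs from q1 only at
    the desired index. Each query alone is uniformly random, so a single
    (non-colluding) server learns nothing about the index.
    """
    q1 = [secrets.randbits(1) for _ in range(n)]
    q2 = q1.copy()
    q2[index] ^= 1
    return q1, q2

def pir_answer(db, query):
    """Server side: XOR together the records selected by the query."""
    acc = 0
    for bit, record in zip(query, db):
        if bit:
            acc ^= record
    return acc

db = [5, 17, 42, 9]                         # toy database of integer records
q1, q2 = pir_query(len(db), index=2)
record = pir_answer(db, q1) ^ pir_answer(db, q2)  # recovers db[2]
```

The two answers cancel everywhere except the queried position, so XORing them recovers `db[2]`; if the servers compare queries, however, the single differing bit reveals the index — the collusion risk this paper's reward-and-penalty mechanism is designed to deter.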

Updated: 2025-09-08 01:13:36

Categories: cs.CR,cs.GT

Download: http://arxiv.org/abs/2201.07740v6

PLRV-O: Advancing Differentially Private Deep Learning via Privacy Loss Random Variable Optimization

Differentially Private Stochastic Gradient Descent (DP-SGD) is a standard method for enforcing privacy in deep learning, typically using the Gaussian mechanism to perturb gradient updates. However, conventional mechanisms such as Gaussian and Laplacian noise are parameterized only by variance or scale. This single degree of freedom ties the magnitude of noise directly to both privacy loss and utility degradation, preventing independent control of these two factors. The problem becomes more pronounced when the number of composition rounds T and batch size B vary across tasks, as these variations induce task-dependent shifts in the privacy-utility trade-off, where small changes in noise parameters can disproportionately affect model accuracy. To address this limitation, we introduce PLRV-O, a framework that defines a broad search space of parameterized DP-SGD noise distributions, where privacy loss moments are tightly characterized yet can be optimized more independently with respect to utility loss. This formulation enables systematic adaptation of noise to task-specific requirements, including (i) model size, (ii) training duration, (iii) batch sampling strategies, and (iv) clipping thresholds under both training and fine-tuning settings. Empirical results demonstrate that PLRV-O substantially improves utility under strict privacy constraints. On CIFAR-10, a fine-tuned ViT achieves 94.03% accuracy at epsilon approximately 0.5, compared to 83.93% with Gaussian noise. On SST-2, RoBERTa-large reaches 92.20% accuracy at epsilon approximately 0.2, versus 50.25% with Gaussian.
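For reference, the single-knob Gaussian baseline the abstract critiques looks like this in a toy DP-SGD step: the noise scale `noise_multiplier * clip_norm` is the only degree of freedom, which is what ties privacy loss and utility together. This is a minimal dense-vector sketch of standard DP-SGD, not PLRV-O's richer noise family.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update with the standard Gaussian mechanism.

    Each per-example gradient is clipped to L2 norm <= clip_norm, the
    clipped gradients are summed, Gaussian noise with standard deviation
    noise_multiplier * clip_norm is added, and the result is averaged.
    """
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=params.shape)
    return params - lr * noisy_sum / len(per_example_grads)

rng = np.random.default_rng(0)
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # per-example gradients
# With noise_multiplier = 0 this reduces to plain clipped SGD:
step = dp_sgd_step(np.zeros(2), grads, clip_norm=1.0,
                   noise_multiplier=0.0, lr=1.0, rng=rng)
```

Clipping bounds each example's sensitivity, so the added Gaussian noise yields a differential-privacy guarantee; PLRV-O's contribution is to replace this one-parameter Gaussian with a parameterized noise family whose privacy-loss moments can be optimized more independently of utility.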

Updated: 2025-09-08 01:06:45

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2509.06264v1

On Synthesis of Timed Regular Expressions

Timed regular expressions serve as a formalism for specifying real-time behaviors of Cyber-Physical Systems. In this paper, we consider the synthesis of timed regular expressions, focusing on generating a timed regular expression consistent with a given set of system behaviors including positive and negative examples, i.e., accepting all positive examples and rejecting all negative examples. We first prove the decidability of the synthesis problem through an exploration of simple timed regular expressions. Subsequently, we propose our method of generating a consistent timed regular expression with minimal length, which unfolds in two steps. The first step is to enumerate and prune candidate parametric timed regular expressions. In the second step, we encode the requirement that a candidate generated by the first step is consistent with the given set into a Satisfiability Modulo Theories (SMT) formula, which is consequently solved to determine a solution to parametric time constraints. Finally, we evaluate our approach on benchmarks, including randomly generated behaviors from target timed models and a case study.

Updated: 2025-09-08 00:59:04

领域: cs.FL,cs.AI

Download: http://arxiv.org/abs/2509.06262v1

FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving

Recent advances in Post-Training Quantization (PTQ) techniques have significantly increased demand for serving quantized large language models (LLMs), enabling higher throughput and substantially reduced memory usage with minimal accuracy loss. Quantized models address memory constraints in LLMs and enhance GPU resource utilization through efficient GPU sharing. However, quantized models have smaller KV block sizes than non-quantized models, which causes memory fragmentation and limits memory efficiency. Also, the distinct resource usage patterns of quantized and non-quantized models require efficient scheduling to maximize throughput. To address these challenges, we propose FineServe, an inference serving framework for mixed-precision LLMs. FineServe's key contributions include: (1) KV Slab, a precision-aware adaptive memory management technique that dynamically allocates KV cache based on model quantization characteristics, significantly reducing GPU memory fragmentation, and (2) a two-level scheduling framework comprising a global scheduler, which places models on GPUs based on request rates, latency SLOs, and memory constraints and efficiency, and a local scheduler, which adaptively adjusts batch sizes according to real-time request fluctuations. Experimental results demonstrate that FineServe achieves up to 2.2x higher SLO attainment and 1.8x higher token generation throughput compared to state-of-the-art GPU sharing systems.
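The abstract does not spell out KV Slab's internals, but a slab-style allocator keyed by block size conveys the idea: pages from a shared arena are split on demand into same-sized blocks per precision class, so differently sized KV blocks can coexist without fragmenting each other's pools. A toy sketch, in which the page/slot scheme and `page_size` are invented for illustration:

```python
class KVSlabAllocator:
    """Toy precision-aware slab allocator: one free list per KV block
    size, carved on demand from a shared page arena."""

    def __init__(self, total_pages, page_size=16):
        self.total_pages = total_pages
        self.page_size = page_size      # capacity units per page (made up)
        self.next_page = 0
        self.free_lists = {}            # block_size -> [(page, slot), ...]

    def alloc(self, block_size):
        free = self.free_lists.setdefault(block_size, [])
        if not free:
            if self.next_page == self.total_pages:
                raise MemoryError("KV arena exhausted")
            page = self.next_page
            self.next_page += 1
            # Split a whole page into same-sized blocks: within a slab
            # class there is no internal fragmentation, and precisions
            # with different block sizes still share one arena.
            free.extend((page, s) for s in range(self.page_size // block_size))
        return free.pop()

    def free(self, block_size, block):
        self.free_lists[block_size].append(block)
```

Freed blocks return to their size class's list and are reused before any new page is carved, which is the fragmentation-avoiding behavior the abstract attributes to KV Slab.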

Updated: 2025-09-08 00:57:50

Subjects: cs.DC,cs.LG

Download: http://arxiv.org/abs/2509.06261v1

Scaling Laws of Motion Forecasting and Planning - Technical Report

We study the empirical scaling laws of a family of encoder-decoder autoregressive transformer models on the task of joint motion forecasting and planning in the autonomous driving domain. Using a dataset of 500 thousand hours of driving, we demonstrate that, similar to language modeling, model performance improves as a power-law function of the total compute budget, and we observe a strong correlation between model training loss and model evaluation metrics. Most interestingly, closed-loop metrics also improve with scaling, which has important implications for the suitability of open-loop metrics for model development and hill climbing. We also study the optimal scaling of the number of transformer parameters and the training data size for a training compute-optimal model. We find that as the training compute budget grows, optimal scaling requires increasing the model size 1.5x as fast as the dataset size. We also study inference-time compute scaling, where we observe that sampling and clustering the output of smaller models makes them competitive with larger models, up to a crossover point beyond which a larger model becomes more inference-compute efficient. Overall, our experimental results demonstrate that optimizing the training and inference-time scaling properties of motion forecasting and planning models is a key lever for improving their performance across a wide variety of driving scenarios. Finally, we briefly study the utility of training on general logged driving data of other agents to improve the performance of the ego-agent, an important research area for addressing the scarcity of robotics data for training large-capacity models.
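The power-law relation between performance and compute reported above can be recovered from measurements with an ordinary least-squares fit in log-log space, since loss ≈ a·C^(-b) is linear there. A minimal sketch on synthetic data (the numbers are invented, not the paper's):

```python
import math

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by least squares in log-log space:
    log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    b = -slope                      # decay exponent
    a = math.exp(my + b * mx)       # intercept recovers the prefactor
    return a, b

# Synthetic measurements generated from a known law, a=2.0, b=0.3:
compute = [1e18, 1e19, 1e20, 1e21]
loss = [2.0 * c ** -0.3 for c in compute]
```

Exponents fitted this way for model-size and data-size allocations are what yield statements like "increase the model size 1.5x as fast as the dataset size".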

Updated: 2025-09-08 00:53:59

Subjects: cs.LG,cs.AI,cs.RO

Download: http://arxiv.org/abs/2506.08228v2

TalkToAgent: A Human-centric Explanation of Reinforcement Learning Agents with Large Language Models

Explainable Reinforcement Learning (XRL) has emerged as a promising approach to improving the transparency of Reinforcement Learning (RL) agents. However, a gap remains between complex RL policies and domain experts, due to the limited comprehensibility of XRL results and the isolated coverage of current XRL approaches, which leaves users uncertain about which tools to employ. To address these challenges, we introduce TalkToAgent, a multi-agent Large Language Model (LLM) framework that delivers interactive, natural language explanations for RL policies. Its architecture of five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger) enables TalkToAgent to automatically map user queries to relevant XRL tools and clarify an agent's actions in terms of key state variables, expected outcomes, or counterfactual explanations. Moreover, our approach extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions, or even new rule-based policies. We validated TalkToAgent on the quadruple-tank process control problem, a well-known nonlinear control benchmark. Results demonstrated that TalkToAgent successfully mapped user queries to XRL tasks with high accuracy, and coder-debugger interactions minimized failures in counterfactual generation. Furthermore, qualitative evaluation confirmed that TalkToAgent effectively interpreted the agent's actions and contextualized their meaning within the problem domain.
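The Coordinator's job of mapping a user query to an XRL tool can be caricatured without an LLM: a keyword router over hypothetical tool names, standing in for the paper's LLM-based agent. Tool names and cue phrases below are invented for illustration.

```python
# Hypothetical XRL tool names mapped to cue phrases. The paper's
# Coordinator is an LLM agent; simple substring matching stands in
# for it here purely to illustrate the query-to-tool dispatch.
XRL_TOOLS = {
    "feature_importance": ["why", "which state", "important"],
    "expected_outcome":   ["what will happen", "outcome", "predict"],
    "counterfactual":     ["what if", "instead", "counterfactual"],
}

def route_query(query):
    """Return the name of the XRL tool whose cues match the query,
    or 'clarify' to ask the user a follow-up question."""
    q = query.lower()
    for tool, cues in XRL_TOOLS.items():
        if any(cue in q for cue in cues):
            return tool
    return "clarify"
```

In the full system, the routed tool's output would then be handed to the Explainer agent to phrase in natural language, with the Coder/Debugger pair generating counterfactual rollouts.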

Updated: 2025-09-08 00:52:15

Subjects: cs.AI,cs.HC

Download: http://arxiv.org/abs/2509.04809v2

By Xinhai (Sean) Zou.