Arxiv Day: Article

Where 6G Stands Today: Evolution, Enablers, and Research Gaps

As the fifth-generation (5G) mobile communication system continues its global deployment, both industry and academia have started conceptualizing the 6th generation (6G) to address the growing need for a progressively advanced and digital society. Even while 5G offers considerable advancements over LTE, it could struggle to be sufficient to meet all of the requirements, including ultra-high reliability, seamless automation, and ubiquitous coverage. In response, 6G is supposed to bring out a highly intelligent, automated, and ultra-reliable communication system that can handle a vast number of connected devices. This paper offers a comprehensive overview of 6G, beginning with its main stringent requirements while focusing on key enabling technologies such as terahertz (THz) communications, intelligent reflecting surfaces, massive MIMO and AI-driven networking that will shape the 6G networks. Furthermore, the paper lists various 6G applications and usage scenarios that will benefit from these advancements. At the end, we outline the potential challenges that must be addressed to achieve the 6G promises.

Updated: 2025-09-23 23:52:47

标题: 六代移动通信技术的当前发展情况：演进、推动因素和研究空白

摘要: 随着第五代（5G）移动通信系统在全球范围内的部署继续进行，工业界和学术界已经开始构想第六代（6G），以满足不断增长的对先进数字社会的需求。即使5G相对LTE有相当大的进步，但它可能难以满足所有要求，包括超高可靠性、无缝自动化和无处不在的覆盖。作为应对，6G被认为将带来一种高度智能、自动化和超可靠的通信系统，可以处理大量连接设备。本文提供了对6G的全面概述，从其主要严格要求开始，重点介绍了诸如太赫兹（THz）通信、智能反射表面、大规模MIMO和人工智能驱动网络等关键的启用技术，这些技术将塑造6G网络。此外，本文列出了各种将受益于这些进步的6G应用和使用场景。最后，我们概述了必须解决的潜在挑战，以实现6G的承诺。

更新时间: 2025-09-23 23:52:47

领域: cs.NI,cs.AI

下载: http://arxiv.org/abs/2509.19646v1

Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling

Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto-frontier, ignoring the simple fact that compute-optimal is not always system-optimal. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics, such as latency and cost-per-token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.

Updated: 2025-09-23 23:52:07

标题: 我们是否在正确地进行扩展？一个系统视角下的测试时扩展

摘要: 测试时间缩放（TTS）最近已经成为一个有前途的方向，以利用预训练的大型语言模型（LLMs）的隐藏推理能力。然而，现有的缩放方法狭隘地专注于计算最优的帕累托前沿，忽视了一个简单的事实，即计算最优并不总是系统最优。在这项工作中，我们提出了一个基于系统驱动的TTS视角，分析推理模型在实际指标（如延迟和每令牌成本）下的缩放情况。通过评估流行优化（如张量并行和推测解码）的影响，我们的初步分析揭示了当前方法的局限性，并呼吁朝着以系统为导向的整体评估的范式转变，捕捉推理时间的缩放规律的真正本质。

更新时间: 2025-09-23 23:52:07

领域: cs.PF,cs.AI

下载: http://arxiv.org/abs/2509.19645v1

Bridging Information Gaps with Comprehensive Answers: Improving the Diversity and Informativeness of Follow-Up Questions

Generating diverse follow-up questions that uncover missing information remains challenging for conversational agents, particularly when they run on small, locally hosted models. To address this, we develop an information-gap-driven knowledge distillation pipeline in which a teacher LLM generates a comprehensive answer, contrasts it with the initial answer to identify information gaps, and formulates gap-bridging follow-up questions. Using this pipeline, we augment the existing FollowupQG dataset tenfold. We then fine-tune smaller student models on the augmented dataset to distill the teacher's knowledge. Experiments with selected teacher-student model pairs show that fine-tuned students achieve significantly higher informativeness and diversity than variations trained on the original dataset. These findings indicate that our pipeline, which mirrors the human cognitive process of information seeking, provides an efficient distillation channel from state-of-the-art LLMs to smaller models, enabling resource-constrained conversational systems to generate more diverse and informative follow-up questions.

Updated: 2025-09-23 23:47:37

标题: 填补信息鸿沟，提供全面答案：改善后续问题的多样性和信息量

摘要: 生成多样化的追问问题以揭示缺失信息对于对话代理仍然具有挑战性，特别是当它们在小型、本地托管的模型上运行时。为了解决这个问题，我们开发了一个以信息缺口为驱动的知识蒸馏流程，其中一个教师LLM生成一个全面的答案，将其与初始答案进行对比以确定信息缺口，并制定填补信息缺口的追问问题。使用这个流程，我们将现有的FollowupQG数据集增加了十倍。然后我们在增强的数据集上对较小的学生模型进行微调，以蒸馏教师的知识。与选择的教师-学生模型对进行的实验表明，微调后的学生比原始数据集上训练的变体显著提高了信息量和多样性。这些发现表明，我们的流程反映了人类信息寻求的认知过程，为最新的LLM到较小模型提供了一个高效的蒸馏通道，使资源受限的对话系统能够生成更多样化和信息丰富的追问问题。

更新时间: 2025-09-23 23:47:37

领域: cs.CL,cs.AI,cs.HC

下载: http://arxiv.org/abs/2502.17715v2

Urania: Differentially Private Insights into AI Use

We introduce $Urania$, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, $Urania$ provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.

Updated: 2025-09-23 23:45:44

标题: 乌拉尼亚：关于人工智能使用的差分隐私洞见

摘要: 我们引入了$Urania$，这是一个新颖的框架，用于生成关于LLM聊天机器人交互的见解，并具有严格的差分隐私（DP）保证。该框架采用了一个私有的聚类机制和创新的关键字提取方法，包括基于频率、基于TF-IDF和基于LLM的方法。通过利用聚类、分区选择和基于直方图的总结等DP工具，$Urania$提供了端到端的隐私保护。我们的评估评估了词汇和语义内容的保留，对的相似性以及基于LLM的指标，与一个非私密的Clio-inspired管道（Tamkin等人，2024）进行基准测试。此外，我们开发了一个简单的实证隐私评估，展示了我们DP管道的增强鲁棒性。结果显示了该框架在提取有意义的对话见解的能力，同时保持严格的用户隐私，有效平衡了数据效用和隐私保护。

更新时间: 2025-09-23 23:45:44

领域: cs.LG,cs.AI,cs.CL,cs.CR,cs.CY

下载: http://arxiv.org/abs/2506.04681v2

Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity

On tabular data, a significant body of literature has shown that current deep learning (DL) models perform at best similarly to Gradient Boosted Decision Trees (GBDTs), while significantly underperforming them on outlier data. However, these works often study idealized problem settings which may fail to capture complexities of real-world scenarios. We identify a natural tabular data setting where DL models can outperform GBDTs: tabular Learning-to-Rank (LTR) under label scarcity. Tabular LTR applications, including search and recommendation, often have an abundance of unlabeled data, and scarce labeled data. We show that DL rankers can utilize unsupervised pretraining to exploit this unlabeled data. In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics -- sometimes by as much as 38% -- both overall and on outliers.

Updated: 2025-09-23 23:35:02

标题: 预训练深度模型在标签稀缺情况下的学习排序中胜过GBDTs

摘要: 在表格数据上，大量文献表明目前的深度学习（DL）模型在最好的情况下与梯度提升决策树（GBDT）性能相似，但在异常值数据上表现明显不足。然而，这些研究往往研究理想化的问题设置，可能无法捕捉现实世界情景的复杂性。我们确定了一个自然的表格数据设置，DL模型可以胜过GBDT：标签稀缺条件下的表格学习排名（LTR）。表格LTR应用，包括搜索和推荐，通常有大量未标记数据和稀缺标记数据。我们展示DL排名器可以利用无监督预训练来利用这些未标记数据。通过对公共和专有数据集的广泛实验，我们展示预训练的DL排名器在排名指标上始终优于GBDT排名器 -- 有时甚至高达38% -- 在整体和异常值上都是如此。

更新时间: 2025-09-23 23:35:02

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2308.00177v5

Multimodal Atmospheric Super-Resolution With Deep Generative Models

Score-based diffusion modeling is a generative machine learning algorithm that can be used to sample from complex distributions. They achieve this by learning a score function, i.e., the gradient of the log-probability density of the data, and reversing a noising process using the same. Once trained, score-based diffusion models not only generate new samples but also enable zero-shot conditioning of the generated samples on observed data. This promises a novel paradigm for data and model fusion, wherein the implicitly learned distributions of pretrained score-based diffusion models can be updated given the availability of online data in a Bayesian formulation. In this article, we apply such a concept to the super-resolution of a high-dimensional dynamical system, given the real-time availability of low-resolution and experimentally observed sparse sensor measurements from multimodal data. Additional analysis on how score-based sampling can be used for uncertainty estimates is also provided. Our experiments are performed for a super-resolution task that generates the ERA5 atmospheric dataset given sparse observations from a coarse-grained representation of the same and/or from unstructured experimental observations of the IGRA radiosonde dataset. We demonstrate accurate recovery of the high dimensional state given multiple sources of low-fidelity measurements. We also discover that the generative model can balance the influence of multiple dataset modalities during spatiotemporal reconstructions.

Updated: 2025-09-23 23:30:47

标题: 多模态大气超分辨率与深度生成模型

摘要: 得分基础扩散建模是一种生成式机器学习算法，可用于从复杂分布中采样。他们通过学习一个得分函数来实现这一点，即数据的对数概率密度的梯度，并使用相同的过程对噪声进行反转。一旦训练完成，得分基础扩散模型不仅可以生成新样本，还可以在观察到的数据上对生成的样本进行零射击条件。这为数据和模型融合提供了一种新的范式，其中预训练得分基础扩散模型的隐式学习分布可以在贝叶斯公式中更新，给定在线数据的可用性。在本文中，我们将这样的概念应用于高维动态系统的超分辨率，考虑到来自多模态数据的低分辨率和实验观察到的稀疏传感器测量的实时可用性。还提供了有关如何使用得分基础采样进行不确定性估计的额外分析。我们的实验是针对一个超分辨率任务进行的，该任务生成ERA5大气数据集，给定来自相同的粗粒度表示和/或来自IGRA探空仪数据集的无结构实验观测的稀疏观测。我们展示了在给定多个低保真度测量来源的情况下准确恢复高维状态。我们还发现，生成模型可以在时空重建过程中平衡多个数据集模态的影响。

更新时间: 2025-09-23 23:30:47

领域: cs.LG,physics.geo-ph

下载: http://arxiv.org/abs/2506.22780v2

GradNetOT: Learning Optimal Transport Maps with GradNets

Monotone gradient functions play a central role in solving the Monge formulation of the optimal transport (OT) problem, which arises in modern applications ranging from fluid dynamics to robot swarm control. When the transport cost is the squared Euclidean distance, Brenier's theorem guarantees that the unique optimal transport map satisfies a Monge-Amp\`ere equation and is the gradient of a convex function. In [arXiv:2301.10862] [arXiv:2404.07361], we proposed Monotone Gradient Networks (mGradNets), neural networks that directly parameterize the space of monotone gradient maps. In this work, we leverage mGradNets to directly learn the optimal transport mapping by minimizing a training loss function defined using the Monge-Amp\`ere equation. We empirically show that the structural bias of mGradNets facilitates the learning of optimal transport maps across both image morphing tasks and high-dimensional OT problems.

Updated: 2025-09-23 23:21:37

标题: GradNetOT：使用GradNets学习最优输运映射

摘要: 单调梯度函数在解决蒙特的最优传输（OT）问题的蒙特公式中起着核心作用，该问题在现代应用中涵盖了从流体动力学到机器人群控制的范围。当传输成本是平方欧氏距离时，布雷尼耶定理保证唯一的最优传输映射满足蒙特-安培方程并且是一个凸函数的梯度。在[arXiv:2301.10862] [arXiv:2404.07361]中，我们提出了单调梯度网络（mGradNets），这是直接参数化单调梯度映射空间的神经网络。在这项工作中，我们利用mGradNets直接通过最小化使用蒙特-安培方程定义的训练损失函数来学习最优传输映射。我们在实验中展示，mGradNets的结构偏差有利于在图像变形任务和高维度OT问题中学习最优传输映射。

更新时间: 2025-09-23 23:21:37

领域: cs.LG

下载: http://arxiv.org/abs/2507.13191v2

Fine-tuning of diffusion models via stochastic control: entropy regularization and beyond

This paper aims to develop and provide a rigorous treatment to the problem of entropy regularized fine-tuning in the context of continuous-time diffusion models, which was recently proposed by Uehara et al. (arXiv:2402.15194, 2024). The idea is to use stochastic control for sample generation, where the entropy regularizer is introduced to mitigate reward collapse. We also show how the analysis can be extended to fine-tuning with a general $f$-divergence regularizer. Numerical experiments on large-scale text-to-image models--Stable Diffusion v1.5 are conducted to validate our approach.

Updated: 2025-09-23 23:13:36

标题: 通过随机控制对扩散模型进行微调：熵正则化及其进一步应用

摘要: 本文旨在发展并对连续时间扩散模型中的熵正则微调问题进行严格处理，该问题最近由Uehara等人提出（arXiv:2402.15194，2024）。其核心思想是利用随机控制进行样本生成，引入熵正则化项以减轻奖励崩溃现象。我们还展示了如何将分析扩展到使用一般$f$-散度正则化项进行微调。通过在大规模文本到图像模型--Stable Diffusion v1.5上进行数值实验，验证了我们的方法。

更新时间: 2025-09-23 23:13:36

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2403.06279v3

TIMED: Adversarial and Autoregressive Refinement of Diffusion-Based Time Series Generation

Generating high-quality synthetic time series is a fundamental yet challenging task across domains such as forecasting and anomaly detection, where real data can be scarce, noisy, or costly to collect. Unlike static data generation, synthesizing time series requires modeling both the marginal distribution of observations and the conditional temporal dependencies that govern sequential dynamics. We propose TIMED, a unified generative framework that integrates a denoising diffusion probabilistic model (DDPM) to capture global structure via a forward-reverse diffusion process, a supervisor network trained with teacher forcing to learn autoregressive dependencies through next-step prediction, and a Wasserstein critic that provides adversarial feedback to ensure temporal smoothness and fidelity. To further align the real and synthetic distributions in feature space, TIMED incorporates a Maximum Mean Discrepancy (MMD) loss, promoting both diversity and sample quality. All components are built using masked attention architectures optimized for sequence modeling and are trained jointly to effectively capture both unconditional and conditional aspects of time series data. Experimental results across diverse multivariate time series benchmarks demonstrate that TIMED generates more realistic and temporally coherent sequences than state-of-the-art generative models.

Updated: 2025-09-23 23:05:40

标题: TIMED：基于扩散的时间序列生成的对抗性和自回归精化

摘要: 生成高质量的合成时间序列是跨领域中的一项基础性且具有挑战性的任务，如预测和异常检测，其中真实数据可能稀缺、嘈杂或收集成本高昂。与静态数据生成不同，合成时间序列需要建模观察的边际分布以及控制顺序动态的条件时间依赖关系。我们提出了TIMED，一个统一的生成框架，它集成了一个去噪扩散概率模型(DDPM)来通过前向-反向扩散过程捕获全局结构，一个通过教师强制训练的监督网络来学习自回归依赖关系的下一步预测，以及一个Wasserstein评论家，提供对抗反馈以确保时间平滑性和保真度。为了进一步在特征空间中对齐真实和合成分布，TIMED还包括一个最大均值差(MMD)损失，促进多样性和样本质量。所有组件都是使用经过优化用于序列建模的掩码注意力结构构建的，并联合训练以有效捕捉时间序列数据的无条件和有条件方面。跨多元时间序列基准的实验结果表明，TIMED生成的序列比最先进的生成模型更具现实性和时间上的连贯性。

更新时间: 2025-09-23 23:05:40

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2509.19638v1

Mamba Modulation: On the Length Generalization of Mamba

The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.

Updated: 2025-09-23 22:46:19

标题: 蛇毒调节：关于蛇毒长度概括的研究

摘要: Transformer模型中注意力机制的二次复杂度促使开发具有次二次缩放的替代架构，如状态空间模型。在这些模型中，Mamba已成为领先的架构，实现了在一系列语言建模任务中的最新结果。然而，当应用于比预训练过程中看到的长度更长的上下文时，Mamba的性能显著下降，揭示了对上下文长度延伸的敏感性。通过详细分析，我们将这种限制归因于其状态空间动态的分布外行为，特别是在状态转移矩阵A的参数化方面。与最近的一些研究认为这种敏感性是由于离散化时间步长的消失累积$\exp(-\sum_{t=1}^N\Delta_t)$，我们建立了当输入长度趋近于无穷大时状态收敛行为与过渡矩阵A的频谱之间的联系，提供了其在长度延伸中的作用的充分解释。接下来，为了克服这一挑战，我们提出了一种方法，该方法将光谱缩放应用于预训练的Mamba模型，通过选择性地调制每层中A矩阵的频谱，实现了对长上下文的稳健泛化。我们展示了这可以显著提高在简单调制$\Delta_t$失败的设置中的性能，验证了我们的洞察力，并为具有结构化过渡矩阵的状态空间模型的更好长度泛化提供了途径。

更新时间: 2025-09-23 22:46:19

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2509.19633v1

Advancing Speech Summarization in Multi-modal LLMs with Reinforcement Learning

Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.

Updated: 2025-09-23 22:45:13

标题: 用强化学习推动多模式LLMs中的语音摘要技术

摘要: 演讲摘要是口头内容理解的关键组成部分，特别是在口头和视听数据迅速增长的时代。最近，多模态大型语言模型（MLLMs）的进步，利用LLMs的力量，使得可以直接从语音生成文本摘要，而无需中间转录，同时支持可控风格和零样本泛化。然而，开源MLLMs仍然落后于最先进的基于文本的LLMs，限制了它们在演讲摘要中的实际部署。在这项工作中，我们提出了一种新颖的多阶段强化学习训练框架，以增强MLLMs的演讲摘要能力。我们的模型相对于强基线提供了实质性的改进，优于更大的MLLMs，并显著缩小了与最先进的基于文本的LLMs之间的差距。

更新时间: 2025-09-23 22:45:13

领域: eess.AS,cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.19631v1

EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data

Egocentric human experience data presents a vast resource for scaling up end-to-end imitation learning for robotic manipulation. However, significant domain gaps in visual appearance, sensor modalities, and kinematics between human and robot impede knowledge transfer. This paper presents EgoBridge, a unified co-training framework that explicitly aligns the policy latent spaces between human and robot data using domain adaptation. Through a measure of discrepancy on the joint policy latent features and actions based on Optimal Transport (OT), we learn observation representations that not only align between the human and robot domain but also preserve the action-relevant information critical for policy learning. EgoBridge achieves a significant absolute policy success rate improvement by 44% over human-augmented cross-embodiment baselines in three real-world single-arm and bimanual manipulation tasks. EgoBridge also generalizes to new objects, scenes, and tasks seen only in human data, where baselines fail entirely. Videos and additional information can be found at https://ego-bridge.github.io

Updated: 2025-09-23 22:34:47

标题: EgoBridge：基于以自我为中心的人类数据进行的领域适应，实现可泛化的模仿

摘要: EgoBridge是一个统一的共同训练框架，通过领域适应明确地对齐人类和机器人数据之间的策略潜在空间，以克服视觉外观、传感器模态和人类与机器人之间的运动学之间的显著领域差距，实现了端到端模仿学习的规模化。通过基于最优传输（OT）的联合策略潜在特征和动作的不一致性度量，我们学习观察表示，不仅在人类和机器人领域之间对齐，还保留了对策略学习至关重要的动作相关信息。在三个真实世界的单臂和双臂操作任务中，EgoBridge相对于人类增强的跨体验基线实现了44%的绝对策略成功率改善。EgoBridge还可以泛化到仅在人类数据中看到的新对象、场景和任务，而基线则完全失败。视频和更多信息可在https://ego-bridge.github.io找到。

更新时间: 2025-09-23 22:34:47

领域: cs.RO,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.19626v1

Adaptive von Mises-Fisher Likelihood Loss for Supervised Deep Time Series Hashing

Indexing time series by creating compact binary representations is a fundamental task in time series data mining. Recently, deep learning-based hashing methods have proven effective for indexing time series based on semantic meaning rather than just raw similarity. The purpose of deep hashing is to map samples with the same semantic meaning to identical binary hash codes, enabling more efficient search and retrieval. Unlike other supervised representation learning methods, supervised deep hashing requires a discretization step to convert real-valued representations into binary codes, but this can induce significant information loss. In this paper, we propose a von Mises-Fisher (vMF) hashing loss. The proposed deep hashing model maps data to an M-dimensional hyperspherical space to effectively reduce information loss and models each data class as points following distinct vMF distributions. The designed loss aims to maximize the separation between each modeled vMF distribution to provide a better way to maximize the margin between each semantically different data sample. Experimental results show that our method outperforms existing baselines. The implementation is publicly available at https://github.com/jmpq97/vmf-hashing

Updated: 2025-09-23 22:34:25

标题: 自适应von Mises-Fisher似然损失用于监督深度时间序列哈希化

摘要: 通过创建紧凑的二进制表示来对时间序列进行索引是时间序列数据挖掘中的一项基本任务。最近，基于深度学习的哈希方法已被证明对基于语义含义而不仅仅是原始相似性的时间序列进行索引是有效的。深度哈希的目的是将具有相同语义含义的样本映射到相同的二进制哈希代码，从而实现更高效的搜索和检索。与其他监督表示学习方法不同，监督深度哈希需要一个离散化步骤将实值表示转换为二进制代码，但这可能会引起显著的信息损失。在本文中，我们提出了一个von Mises-Fisher（vMF）哈希损失。所提出的深度哈希模型将数据映射到一个M维超球面空间，以有效减少信息损失，并将每个数据类建模为遵循不同vMF分布的点。设计的损失旨在最大化每个建模的vMF分布之间的分离，以提供更好地最大化每个语义上不同数据样本之间的间隔。实验结果表明，我们的方法优于现有基线。该实现可以在https://github.com/jmpq97/vmf-hashing 上公开获取。

更新时间: 2025-09-23 22:34:25

领域: cs.LG

下载: http://arxiv.org/abs/2509.19625v1

SteinerSQL: Graph-Guided Mathematical Reasoning for Text-to-SQL Generation

Large Language Models (LLMs) struggle with complex Text-to-SQL queries that demand both sophisticated mathematical reasoning and intricate schema navigation. Existing methods often tackle these challenges in isolation, creating a fractured reasoning process that compromises logical and structural correctness. To resolve this, we introduce SteinerSQL, a framework that unifies these dual challenges into a single, graph-centric optimization problem. SteinerSQL operates in three stages: mathematical decomposition to identify required tables (terminals), optimal reasoning scaffold construction via a Steiner tree problem, and multi-level validation to ensure correctness. On the challenging LogicCat and Spider2.0-Lite benchmarks, SteinerSQL establishes a new state-of-the-art with 36.10% and 40.04% execution accuracy, respectively, using Gemini-2.5-Pro. Beyond accuracy, SteinerSQL presents a new, unified paradigm for Text-to-SQL, paving the way for more robust and principled solutions to complex reasoning tasks.

Updated: 2025-09-23 22:30:52

标题: SteinerSQL：面向文本到SQL生成的图导向数学推理

摘要: 大型语言模型（LLMs）在需要复杂数学推理和复杂模式导航的复杂文本到SQL查询中遇到困难。现有方法通常独立应对这些挑战，导致了一个破碎的推理过程，影响了逻辑和结构的正确性。为了解决这个问题，我们引入了SteinerSQL，这是一个将这两个挑战统一为一个基于图的优化问题的框架。SteinerSQL分为三个阶段：数学分解以确定所需表（终端），通过Steiner树问题构建最佳推理支架，以及多级验证以确保正确性。在具有挑战性的LogicCat和Spider2.0-Lite基准测试中，SteinerSQL分别使用Gemini-2.5-Pro实现了36.10％和40.04％的执行准确度，建立了一个新的最新技术水平。除了准确性外，SteinerSQL提出了一个新的统一的Text-to-SQL范式，为复杂推理任务提供更强大和原则性的解决方案铺平了道路。

更新时间: 2025-09-23 22:30:52

领域: cs.AI

下载: http://arxiv.org/abs/2509.19623v1

SkyEye: When Your Vision Reaches Beyond IAM Boundary Scope in AWS Cloud

In recent years, cloud security has emerged as a primary concern for enterprises due to the increasing trend of migrating internal infrastructure and applications to cloud environments. This shift is driven by the desire to reduce the high costs and maintenance fees associated with traditional on-premise infrastructure. By leveraging cloud capacities such as high availability and scalability, companies can achieve greater operational efficiency and flexibility. However, this migration also introduces new security challenges. Ensuring the protection of sensitive data, maintaining compliance with regulatory requirements, and mitigating the risks of cyber threats are critical issues that must be addressed. Identity and Access Management (IAM) constitutes the critical security backbone of most cloud deployments, particularly within AWS environments. As organizations adopt AWS to scale applications and store data, the need for a thorough, methodical, and precise enumeration of IAM configurations grows exponentially. Enumeration refers to the systematic mapping and interrogation of identities, permissions, and resource authorizations with the objective of gaining situational awareness. By understanding the interplay between users, groups, and their myriads of policies, whether inline or attached managed policies, security professionals need to enumerate and identify misconfigurations, reduce the risk of unauthorized privilege escalation, and maintain robust compliance postures. This paper will present SkyEye, a cooperative multi-principal IAM enumeration framework, which comprises cutting-edge enumeration models in supporting complete situational awareness regarding the IAMs of provided AWS credentials, crossing the boundary of principal-specific IAM entitlement vision to reveal the complete visionary while insufficient authorization is the main challenge.

Updated: 2025-09-23 22:15:16

标题: SkyEye：当您的视野超越AWS云中IAM边界范围时

摘要: 近年来，由于将内部基础设施和应用程序迁移到云环境的趋势不断增加，云安全已成为企业的主要关注点。这种转变是由于希望减少与传统基础设施关联的高成本和维护费用驱动的。通过利用云的高可用性和可伸缩性等能力，企业可以实现更高的运营效率和灵活性。然而，这种迁移也带来了新的安全挑战。确保敏感数据的保护，保持符合法规要求，以及减轻网络威胁的风险是必须解决的关键问题。身份和访问管理（IAM）构成了大多数云部署的关键安全支柱，特别是在AWS环境中。随着组织采用AWS来扩展应用程序和存储数据，对IAM配置进行彻底、系统和精确的枚举需求呈指数增长。枚举指的是有目的地映射和审问身份、权限和资源授权，以获得情境意识。通过了解用户、组和它们众多的策略之间的相互作用，无论是内联还是附加的托管策略，安全专业人员需要枚举和识别配置错误，减少未经授权的特权升级风险，并保持健壮的合规姿态。本文将介绍SkyEye，一个合作多主体IAM枚举框架，其中包括支持提供的AWS凭据的IAM的完整情景意识的尖端枚举模型，跨越特定主体IAM授权视野的边界，揭示完整的愿景，而不足的授权是主要挑战。

更新时间: 2025-09-23 22:15:16

领域: cs.CR

下载: http://arxiv.org/abs/2507.21094v2

The Open DAC 2025 Dataset for Sorbent Discovery in Direct Air Capture

Identifying useful sorbent materials for direct air capture (DAC) from humid air remains a challenge. We present the Open DAC 2025 (ODAC25) dataset, a significant expansion and improvement upon ODAC23 (Sriram et al., ACS Central Science, 10 (2024) 923), comprising nearly 60 million DFT single-point calculations for CO$_2$, H$_2$O, N$_2$, and O$_2$ adsorption in 15,000 MOFs. ODAC25 introduces chemical and configurational diversity through functionalized MOFs, high-energy GCMC-derived placements, and synthetically generated frameworks. ODAC25 also significantly improves upon the accuracy of DFT calculations and the treatment of flexible MOFs in ODAC23. Along with the dataset, we release new state-of-the-art machine-learned interatomic potentials trained on ODAC25 and evaluate them on adsorption energy and Henry's law coefficient predictions.

Updated: 2025-09-23 22:08:00

标题: 2025年开放DAC数据集用于直接空气捕集中的吸附剂发现

摘要: 识别对于从湿润空气中进行直接空气捕集（DAC）有用的吸附剂材料仍然是一个挑战。我们提出了Open DAC 2025（ODAC25）数据集，这是对ODAC23（Sriram等人，ACS Central Science，10（2024）923）的显着扩展和改进，包括针对CO$_2$、H$_2$O、N$_2$和O$_2$在15,000种MOFs中吸附的近6000万个DFT单点计算。ODAC25通过功能化MOFs、高能量GCMC推导的位置和合成生成的框架引入了化学和构型多样性。ODAC25还显著改进了ODAC23中DFT计算的准确性和对柔性MOFs的处理。除了数据集，我们还发布了在ODAC25上训练的新一流的机器学习间原子势，并对其在吸附能和亨利定律系数预测上进行评估。

更新时间: 2025-09-23 22:08:00

领域: cond-mat.mtrl-sci,cs.LG,physics.chem-ph

下载: http://arxiv.org/abs/2508.03162v2

Graph-based Neural Space Weather Forecasting

Accurate space weather forecasting is crucial for protecting our increasingly digital infrastructure. Hybrid-Vlasov models, like Vlasiator, offer physical realism beyond that of current operational systems, but are too computationally expensive for real-time use. We introduce a graph-based neural emulator trained on Vlasiator data to autoregressively predict near-Earth space conditions driven by an upstream solar wind. We show how to achieve both fast deterministic forecasts and, by using a generative model, produce ensembles to capture forecast uncertainty. This work demonstrates that machine learning offers a way to add uncertainty quantification capability to existing space weather prediction systems, and make hybrid-Vlasov simulation tractable for operational use.

Updated: 2025-09-23 21:53:35

标题: 基于图神经网络的太空天气预测

摘要: 准确的空间天气预报对保护日益数字化的基础设施至关重要。混合-Vlasov模型，如Vlasiator，提供了超出当前操作系统的物理现实性，但在实时使用上计算成本太高。我们介绍了一个基于图的神经仿真器，通过对Vlasiator数据进行训练，自回归地预测受上游太阳风驱动的近地球空间条件。我们展示了如何实现快速确定性预测，并通过使用生成模型，产生集合以捕捉预测不确定性。这项工作表明，机器学习提供了一种方式，为现有空间天气预测系统添加不确定性量化能力，并使混合-Vlasov模拟可供操作使用。

更新时间: 2025-09-23 21:53:35

领域: physics.space-ph,cs.LG,physics.plasm-ph

下载: http://arxiv.org/abs/2509.19605v1

Improved Therapeutic Antibody Reformatting through Multimodal Machine Learning

Modern therapeutic antibody design often involves composing multi-part assemblages of individual functional domains, each of which may be derived from a different source or engineered independently. While these complex formats can expand disease applicability and improve safety, they present a significant engineering challenge: the function and stability of individual domains are not guaranteed in the novel format, and the entire molecule may no longer be synthesizable. To address these challenges, we develop a machine learning framework to predict "reformatting success" -- whether converting an antibody from one format to another will succeed or not. Our framework incorporates both antibody sequence and structural context, incorporating an evaluation protocol that reflects realistic deployment scenarios. In experiments on a real-world antibody reformatting dataset, we find the surprising result that large pretrained protein language models (PLMs) fail to outperform simple, domain-tailored, multimodal representations. This is particularly evident in the most difficult evaluation setting, where we test model generalization to a new starting antibody. In this challenging "new antibody, no data" scenario, our best multimodal model achieves high predictive accuracy, enabling prioritization of promising candidates and reducing wasted experimental effort.

Updated: 2025-09-23 21:52:37

标题: 通过多模态机器学习改进治疗性抗体重构

摘要: 现代治疗性抗体设计通常涉及组成由单个功能域组成的多部分组合体，每个功能域可能来自不同来源或独立工程。尽管这些复杂的格式可以扩展疾病适用性并提高安全性，但它们也带来了重大的工程挑战：个别功能域的功能和稳定性在新的格式中不能保证，整个分子可能不再可合成。为了解决这些挑战，我们开发了一个机器学习框架来预测“重新格式化成功”--即将抗体从一种格式转换为另一种格式是否成功。我们的框架结合了抗体序列和结构上下文，包括一个反映现实部署场景的评估协议。在一个真实世界的抗体重新格式化数据集上的实验中，我们发现一个令人惊讶的结果，即大型预训练蛋白语言模型（PLMs）并不能胜过简单的、面向特定领域的多模态表示。这在最困难的评估设置中特别明显，我们测试模型对新起始抗体的泛化能力。在这种具有挑战性的“新抗体，无数据”场景中，我们最好的多模态模型实现了高预测准确性，可以优先考虑有前途的候选者，并减少了实验的浪费。

更新时间: 2025-09-23 21:52:37

领域: cs.LG

下载: http://arxiv.org/abs/2509.19604v1

Modular Machine Learning with Applications to Genetic Circuit Composition

In several applications, including in synthetic biology, one often has input/output data on a system composed of many modules, and although the modules' input/output functions and signals may be unknown, knowledge of the composition architecture can significantly reduce the amount of training data required to learn the system's input/output mapping. Learning the modules' input/output functions is also necessary for designing new systems from different composition architectures. Here, we propose a modular learning framework, which incorporates prior knowledge of the system's compositional structure to (a) identify the composing modules' input/output functions from the system's input/output data and (b) achieve this by using a reduced amount of data compared to what would be required without knowledge of the compositional structure. To achieve this, we introduce the notion of modular identifiability, which allows recovery of modules' input/output functions from a subset of the system's input/output data, and provide theoretical guarantees on a class of systems motivated by genetic circuits. We demonstrate the theory on computational studies showing that a neural network (NNET) that accounts for the compositional structure can learn the composing modules' input/output functions and predict the system's output on inputs outside of the training set distribution. By contrast, a neural network that is agnostic of the structure is unable to predict on inputs that fall outside of the training set distribution. By reducing the need for experimental data and allowing module identification, this framework offers the potential to ease the design of synthetic biological circuits and of multi-module systems more generally.

Updated: 2025-09-23 21:49:13

标题: 模块化机器学习及其在基因电路组合中的应用

摘要: 在许多应用中，包括合成生物学，在一个由许多模块组成的系统中往往有输入/输出数据，尽管模块的输入/输出功能和信号可能是未知的，但对组成架构的了解可以显著减少学习系统的输入/输出映射所需的训练数据量。学习模块的输入/输出功能也是设计新系统的必要条件，这些系统来自不同的组合结构。在这里，我们提出了一个模块化学习框架，该框架结合了对系统组合结构的先验知识，以(a)从系统的输入/输出数据中识别组成模块的输入/输出功能，(b)并通过使用少量数据来实现这一点，而不需要了解组合结构的情况下所需的数据量。为了实现这一点，我们引入了模块可识别性的概念，它允许从系统的输入/输出数据子集中恢复模块的输入/输出功能，并对一类受基因电路激发的系统提供理论保证。我们通过计算研究展示了这一理论，表明考虑组合结构的神经网络（NNET）可以学习组成模块的输入/输出功能，并预测训练集分布之外的输入对系统的输出。相比之下，对结构一无所知的神经网络无法对训练集分布之外的输入进行预测。通过减少实验数据的需求并允许模块识别，这一框架有望简化合成生物电路和多模块系统的设计。

更新时间: 2025-09-23 21:49:13

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2509.19601v1

Learning to Drive by Imitating Surrounding Vehicles

Imitation learning is a promising approach for training autonomous vehicles (AV) to navigate complex traffic environments by mimicking expert driver behaviors. While existing imitation learning frameworks focus on leveraging expert demonstrations, they often overlook the potential of additional complex driving data from surrounding traffic participants. In this paper, we study a data augmentation strategy that leverages the observed trajectories of nearby vehicles, captured by the AV's sensors, as additional demonstrations. We introduce a simple vehicle-selection sampling and filtering strategy that prioritizes informative and diverse driving behaviors, contributing to a richer dataset for training. We evaluate this idea with a representative learning-based planner on a large real-world dataset and demonstrate improved performance in complex driving scenarios. Specifically, the approach reduces collision rates and improves safety metrics compared to the baseline. Notably, even when using only 10 percent of the original dataset, the method matches or exceeds the performance of the full dataset. Through ablations, we analyze selection criteria and show that naive random selection can degrade performance. Our findings highlight the value of leveraging diverse real-world trajectory data in imitation learning and provide insights into data augmentation strategies for autonomous driving.

Updated: 2025-09-23 21:48:04

标题: 通过模仿周围车辆学习驾驶

摘要: 模仿学习是训练自动驾驶车辆（AV）在复杂交通环境中导航的一种有前景的方法，其通过模仿专家驾驶员的行为。尽管现有的模仿学习框架专注于利用专家演示，但它们经常忽视来自周围交通参与者的额外复杂驾驶数据的潜力。在本文中，我们研究了一种数据增强策略，利用AV传感器捕获的附近车辆的观察轨迹作为额外的演示。我们引入了一个简单的车辆选择采样和过滤策略，优先考虑信息丰富和多样化的驾驶行为，有助于为训练提供更丰富的数据集。我们在一个大型真实世界数据集上使用代表性的基于学习的规划器评估了这个想法，并展示了在复杂驾驶场景中改进的性能。具体来说，该方法降低了碰撞率并改善了安全指标，与基线相比。值得注意的是，即使只使用原始数据集的10％，该方法也能匹配或超过完整数据集的性能。通过消融实验，我们分析了选择标准，并展示了天真的随机选择可能会降低性能。我们的研究结果强调了在模仿学习中利用多样化的真实世界轨迹数据的价值，并为自动驾驶的数据增强策略提供了洞见。

更新时间: 2025-09-23 21:48:04

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.05997v2

Knowledge Base-Aware Orchestration: A Dynamic, Privacy-Preserving Method for Multi-Agent Systems

Multi-agent systems (MAS) are increasingly tasked with solving complex, knowledge-intensive problems where effective agent orchestration is critical. Conventional orchestration methods rely on static agent descriptions, which often become outdated or incomplete. This limitation leads to inefficient task routing, particularly in dynamic environments where agent capabilities continuously evolve. We introduce Knowledge Base-Aware (KBA) Orchestration, a novel approach that augments static descriptions with dynamic, privacy-preserving relevance signals derived from each agent's internal knowledge base (KB). In the proposed framework, when static descriptions are insufficient for a clear routing decision, the orchestrator prompts the subagents in parallel. Each agent then assesses the task's relevance against its private KB, returning a lightweight ACK signal without exposing the underlying data. These collected signals populate a shared semantic cache, providing dynamic indicators of agent suitability for future queries. By combining this novel mechanism with static descriptions, our method achieves more accurate and adaptive task routing preserving agent autonomy and data confidentiality. Benchmarks show that our KBA Orchestration significantly outperforms static description-driven methods in routing precision and overall system efficiency, making it suitable for large-scale systems that require higher accuracy than standard description-driven routing.

Updated: 2025-09-23 21:46:38

标题: 知识库感知编排：多智能体系统的动态、隐私保护方法

摘要: 多智能体系统（MAS）越来越多地被用来解决复杂、知识密集型的问题，其中有效的智能体编排至关重要。传统的编排方法依赖于静态的智能体描述，这些描述往往会过时或不完整。这种限制导致任务路由效率低下，特别是在智能体能力不断演化的动态环境中。我们引入了知识库感知（KBA）编排，这是一种新颖的方法，通过从每个智能体的内部知识库（KB）中得出的动态、保护隐私的相关信号来增强静态描述。在提出的框架中，当静态描述对于明确的路由决策不足时，编排器会同时提示子代理。然后，每个智能体根据其私有知识库评估任务的相关性，返回一个轻量级的ACK信号，而不暴露底层数据。这些收集到的信号填充了一个共享的语义缓存，为未来查询提供智能体适用性的动态指标。通过将这种新颖机制与静态描述相结合，我们的方法实现了更准确和自适应的任务路由，保持了智能体的自治性和数据机密性。基准测试显示，我们的KBA编排在路由精度和整体系统效率方面明显优于静态描述驱动的方法，使其适用于需要比标准描述驱动路由更高精度的大规模系统。

更新时间: 2025-09-23 21:46:38

领域: cs.MA,cs.AI

下载: http://arxiv.org/abs/2509.19599v1

Towards Visual Text Grounding of Multimodal Large Language Model

Despite the existing evolution of Multimodal Large Language Models (MLLMs), a non-neglectable limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images, rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90$ synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. By finetuning MLLMs on our synthetic dataset, they promisingly improve spatial reasoning and grounding capabilities.

Updated: 2025-09-23 21:43:47

标题: 走向多模式大语言模型的视觉文本基础

摘要: 尽管现有的多模态大型语言模型（MLLMs）已经发展，但它们在视觉文本对齐方面仍存在不可忽视的局限性，特别是在文档中文本丰富的图像中。文档图像，如扫描表格和信息图表，由于其复杂的布局和文本内容，突显了关键挑战。然而，目前的基准测试并未完全解决这些挑战，因为它们大多集中在自然图像上的视觉对齐，而不是文本丰富的文档图像上。因此，为了弥合这一差距，我们引入了TRIG，这是一个使用新设计的指令数据集来评估和改进MLLMs在文档问答中的文本丰富图像对齐能力的新任务。具体来说，我们提出了一个OCR-LLM-人类交互流水线，以创建800个手动标注的问题-答案对作为基准，并基于四个不同数据集创建了一个包含90美元合成数据的大规模训练集。对我们提出的基准测试中各种MLLMs的全面评估揭示了它们在文本丰富图像上的对齐能力存在实质性的局限。此外，我们提出了两种基于通用指令调整和即插即用高效嵌入的简单有效的TRIG方法。通过在我们的合成数据集上对MLLMs进行微调，它们有望改善空间推理和对齐能力。

更新时间: 2025-09-23 21:43:47

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2504.04974v2

Energy Management for Renewable-Colocated Artificial Intelligence Data Centers

We develop an energy management system (EMS) for artificial intelligence (AI) data centers with colocated renewable generation. Under a cost-minimizing framework, the EMS of renewable-colocated data center (RCDC) co-optimizes AI workload scheduling, on-site renewable utilization, and electricity market participation. Within both wholesale and retail market participation models, the economic benefit of the RCDC operation is maximized. Empirical evaluations using real-world traces of electricity prices, data center power consumption, and renewable generation demonstrate significant electricity cost reduction from renewable and AI data center colocations.

Updated: 2025-09-23 21:31:36

标题: 可再生能源共存的人工智能数据中心能源管理

摘要: 我们为人工智能数据中心开发了一个能源管理系统（EMS），其中包括可再生能源发电。在一个成本最小化的框架下，可再生能源数据中心（RCDC）的EMS共同优化人工智能工作负载调度、现场可再生能源利用以及电力市场参与。在批发和零售市场参与模型中，RCDC运营的经济效益得到最大化。利用真实世界的电力价格、数据中心耗电量以及可再生能源发电的跟踪数据进行实证评估，展示了可再生能源和人工智能数据中心共同部署带来的显著降低电力成本。

更新时间: 2025-09-23 21:31:36

领域: math.OC,cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.08011v2

GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models

We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43\%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.

Updated: 2025-09-23 21:31:14

标题: 猜谜游戏：衡量大型语言模型中开放性问题的信息量

摘要: 我们介绍了GuessingGame，这是一个用于评估大型语言模型（LLMs）在开放性、开放领域设置中作为策略性提问者的协议。一个Guesser LLM通过向Oracle提出自由形式的问题来识别隐藏的对象，而不需要预定义的选择或候选列表。为了衡量问题质量，我们提出了两个信息增益（IG）指标：一个贝叶斯方法，通过LLM评分的相关性跟踪语义概念上的信念更新，以及一个基于熵的方法，通过ConceptNet筛选候选项。这两个指标都是与模型无关的，并支持事后分析。在858个游戏中，涉及多个模型和提示策略，更高的IG强烈预测效率：一个标准差的IG增加将游戏长度的期望值减少了43\%。由IG指导的提示约束，如强制问题多样性，使得较弱的模型能够显著提高性能。这些结果表明，在LLMs中提问既可衡量又可改进，对于交互推理至关重要。

更新时间: 2025-09-23 21:31:14

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.19593v1

Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation

Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.

Updated: 2025-09-23 21:31:00

标题: 框架堆叠的本地变压器用于高效的多码簿语音生成

摘要: 基于大型语言模型（LLMs）的语音生成模型通常基于离散的声学编码，与文本标记根本不同，因为它们具有多码书结构。在每个时间步上，模型必须共同预测N个码书条目，引入挑战简单并行预测方法的依赖关系。并行预测假设码书之间独立，产生高效的解码，但通常以降低保真度为代价。为了解决这个问题，分层策略利用本地变压器（LT）来优化预测并捕获时间步内的依赖关系。在这项工作中，我们系统地研究了两种LT架构：生成码书的自回归变压器和执行迭代掩码预测的MaskGIT-based变压器。这两种设计进一步实现了帧堆叠，其中主要变压器共同预测多个帧，而LT解码它们的码书，提供了速度上的改进而不影响感知质量。通过广泛的分析，我们表征了不同吞吐量和质量范围内并行和迭代抽样策略之间的权衡。最后，我们提出了基于部署优先级（如计算效率和合成保真度）选择解码策略的实用指南。

更新时间: 2025-09-23 21:31:00

领域: eess.AS,cs.AI,cs.CL,cs.SD

下载: http://arxiv.org/abs/2509.19592v1

What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities

Evaluations of generative models on benchmark data are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet growing skepticism surrounds their reliability. How can we know that a reported accuracy genuinely reflects a model's true performance? Evaluations are often presented as simple measurements, but in reality they are inferences: to treat benchmark scores as evidence of capability is already to assume a theory of what capability is and how it manifests in a test. We make this step explicit by proposing a principled framework for evaluation as inference: begin from a theory of capability, and then derive methods for estimating it. This perspective, familiar in fields such as psychometrics, has not yet become commonplace in AI evaluation. As a proof of concept, we address a central challenge that undermines reliability: sensitivity to perturbations. After formulating a model of ability, we introduce methods that infer ability while accounting for uncertainty from sensitivity and finite samples, including an adaptive algorithm that significantly reduces sample complexity. Together, these contributions lay the groundwork for more reliable and trustworthy estimates of AI capabilities as measured through benchmarks.

Updated: 2025-09-23 21:29:04

标题: 您的基准实际上衡量了什么？一种用于AI能力强大推断的框架

摘要: 对基准数据上生成模型的评估现在普遍存在，并且它们的结果至关重要地塑造了公众和科学对人工智能能力的期望。然而，越来越多的怀疑围绕它们的可靠性。我们如何能知道报告的准确性是否真正反映了模型的真实性能？评估通常被呈现为简单的测量，但实际上它们是推断：将基准分数视为能力的证据，已经假设了能力是什么以及它如何在测试中表现的理论。我们通过提出一个有原则的评估推理框架，将这一步骤明确化：从一个能力理论开始，然后推导出估计方法。这种在心理测量学等领域熟悉的观点，在人工智能评估中还没有变得普遍。作为概念验证，我们解决了一个破坏可靠性的中心挑战：对扰动的敏感性。在制定能力模型后，我们引入了推断能力的方法，同时考虑来自敏感性和有限样本的不确定性，包括一种显著减少样本复杂性的自适应算法。这些贡献共同奠定了通过基准测试衡量的人工智能能力更可靠和可信的估计的基础。

更新时间: 2025-09-23 21:29:04

领域: cs.AI,cs.CY,cs.LG

下载: http://arxiv.org/abs/2509.19590v1

Discovery of Sustainable Refrigerants through Physics-Informed RL Fine-Tuning of Sequence Models

Most refrigerants currently used in air-conditioning systems, such as hydrofluorocarbons, are potent greenhouse gases and are being phased down. Large-scale molecular screening has been applied to the search for alternatives, but in practice only about 300 refrigerants are known, and only a few additional candidates have been suggested without experimental validation. This scarcity of reliable data limits the effectiveness of purely data-driven methods. We present Refgen, a generative pipeline that integrates machine learning with physics-grounded inductive biases. Alongside fine-tuning for valid molecular generation, Refgen incorporates predictive models for critical properties, equations of state, thermochemical polynomials, and full vapor compression cycle simulations. These models enable reinforcement learning fine-tuning under thermodynamic constraints, enforcing consistency and guiding discovery toward molecules that balance efficiency, safety, and environmental impact. By embedding physics into the learning process, Refgen leverages scarce data effectively and enables de novo refrigerant discovery beyond the known set of compounds.

Updated: 2025-09-23 21:24:35

标题: 通过物理信息强化RL微调序列模型发现可持续制冷剂

摘要: 目前在空调系统中使用的大多数制冷剂，如氢氟碳化合物，是强效的温室气体，正在逐步减少使用。大规模的分子筛选已经应用于寻找替代品，但实际上只有大约300种制冷剂是已知的，而且只有少数几种备选候选人没有经过实验验证。可靠数据的稀缺限制了纯数据驱动方法的有效性。我们提出了Refgen，这是一个将机器学习与基于物理的归纳偏差相结合的生成管道。除了为有效的分子生成进行微调外，Refgen还整合了对关键性质、状态方程、热化学多项式和完整蒸汽压缩循环模拟的预测模型。这些模型使得在热力学约束下进行强化学习微调成为可能，强调一致性并引导发现朝向平衡效率、安全性和环境影响的分子。通过将物理嵌入到学习过程中，Refgen有效利用稀缺数据，使得能够在已知化合物集之外进行全新的制冷剂发现成为可能。

更新时间: 2025-09-23 21:24:35

领域: physics.chem-ph,cs.LG

下载: http://arxiv.org/abs/2509.19588v1

Reverse Engineering User Stories from Code using Large Language Models

User stories are essential in agile development, yet often missing or outdated in legacy and poorly documented systems. We investigate whether large language models (LLMs) can automatically recover user stories directly from source code and how prompt design impacts output quality. Using 1,750 annotated C++ snippets of varying complexity, we evaluate five state-of-the-art LLMs across six prompting strategies. Results show that all models achieve, on average, an F1 score of 0.8 for code up to 200 NLOC. Our findings show that a single illustrative example enables the smallest model (8B) to match the performance of a much larger 70B model. In contrast, structured reasoning via Chain-of-Thought offers only marginal gains, primarily for larger models.

Updated: 2025-09-23 21:23:37

标题: 使用大型语言模型从代码中逆向工程用户故事

摘要: 用户故事在敏捷开发中至关重要，然而在传统的和文档质量不佳的系统中经常缺失或过时。我们研究了大型语言模型（LLMs）是否能够从源代码中自动恢复用户故事，并探讨提示设计如何影响输出质量。通过对1,750个不同复杂度的C++代码片段进行标注，我们评估了五种最先进的LLMs在六种提示策略下的表现。结果显示，所有模型在最多包含200个NLOC的代码上平均达到了0.8的F1分数。我们的研究结果表明，一个单一的示例能够使最小的模型（8B）达到与一个远比其庞大的70B模型相匹配的性能。相比之下，通过“思维链”进行结构化推理仅带来了微小的收益，主要是对于较大的模型而言。

更新时间: 2025-09-23 21:23:37

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2509.19587v1

A Foundation Chemical Language Model for Comprehensive Fragment-Based Drug Discovery

We introduce FragAtlas-62M, a specialized foundation model trained on the largest fragment dataset to date. Built on the complete ZINC-22 fragment subset comprising over 62 million molecules, it achieves unprecedented coverage of fragment chemical space. Our GPT-2 based model (42.7M parameters) generates 99.90% chemically valid fragments. Validation across 12 descriptors and three fingerprint methods shows generated fragments closely match the training distribution (all effect sizes < 0.4). The model retains 53.6% of known ZINC fragments while producing 22% novel structures with practical relevance. We release FragAtlas-62M with training code, preprocessed data, documentation, and model weights to accelerate adoption.

Updated: 2025-09-23 21:23:36

标题: 一个面向全面基于片段的药物发现的基础化学语言模型

摘要: 我们介绍了FragAtlas-62M，这是一个在迄今为止最大的片段数据集上训练的专门的基础模型。建立在包含超过6200万分子的完整ZINC-22片段子集上，它实现了对片段化学空间的前所未有的覆盖。我们基于GPT-2的模型（42.7M参数）生成了99.90%的化学有效片段。通过12个描述符和三种指纹方法的验证显示，生成的片段与训练分布密切匹配（所有效果大小<0.4）。该模型保留了53.6%的已知ZINC片段，同时产生了22%具有实际意义的新结构。我们发布了包含训练代码、预处理数据、文档和模型权重的FragAtlas-62M，以加速采用。

更新时间: 2025-09-23 21:23:36

领域: cs.LG,cs.AI,q-bio.BM

下载: http://arxiv.org/abs/2509.19586v1

CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications

General-purpose vision-language models (VLMs) demonstrate impressive capabilities, but their opaque training on uncurated internet data posse critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed neurosurgical literature, and demonstrate its clinical utility compared with GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these image-text pairs into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion parameter LLaVA-Next model. In a blinded, randomized deployment trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgeons were assigned to use either CNS-Obsidian or GPT-4o as a diagnostic co-pilot after patient consultations. Primary outcomes were diagnostic helpfulness and accuracy. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but only achieved 46.81% accuracy on human-generated questions versus GPT-4o's 65.70% (p<10-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults. CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance in specialized medical domains despite being orders of magnitude smaller and less expensive to train. However, low clinical utilization suggests chatbot interfaces may not align with specialist workflows, indicating need for alternative AI integration strategies.

Updated: 2025-09-23 21:03:10

标题: CNS-Obsidian: 从科学出版物构建的神经外科视觉语言模型

摘要: 通用视觉语言模型（VLMs）展示了令人印象深刻的能力，但它们在未经筛选的互联网数据上不透明的训练对于高风险决策，如神经外科，存在关键限制。我们提出了CNS-Obsidian，一个在经过同行评议的神经外科文献上训练的神经外科VLM，并展示了它在真实世界环境中与GPT-4o相比的临床效用。我们从神经外科出版物期刊中编译了23,984篇文章，产生了78,853个图片和标题。使用GPT-4o和Claude Sonnet-3.5，我们将这些图像文本对转换为263,064个训练样本，覆盖了三种格式：指导微调、多项选择问题和鉴别诊断。我们训练了CNS-Obsidian，这是对34亿参数LLaVA-Next模型的微调。在纽约大学朗格尼医院进行的盲目、随机部署试验（2024年8月30日至11月30日），神经外科医生被分配使用CNS-Obsidian或GPT-4o作为患者会诊后的诊断副驾驶。主要结果是诊断的帮助性和准确性。CNS-Obsidian在合成问题上与GPT-4o匹配（76.13% vs 77.54%，p=0.235），但在人为生成的问题上仅达到46.81%的准确率，而GPT-4o则为65.70%（p<10-15）。在随机试验中，评估了70次会诊（32次CNS-Obsidian，38次GPT-4o）共959次会诊。CNS-Obsidian在40.62%的案例中获得了积极评价，而GPT-4o为57.89%（p=0.230）。两种模型在约60%的案例中包括了正确的诊断（59.38% vs 65.79%，p=0.626）。尽管培训成本较低且规模较小，但基于经过筛选的科学文献训练的领域特定VLMs可以接近专业医学领域的前沿模型性能。然而，低临床利用率表明聊天机器人界面可能与专家工作流程不符，表明需要替代的AI整合策略。

更新时间: 2025-09-23 21:03:10

领域: cs.AI,cs.CL,cs.HC

下载: http://arxiv.org/abs/2502.19546v4

MAGIC: Multi-task Gaussian process for joint imputation and classification in healthcare time series

Time series analysis has emerged as an important tool for improving patient diagnosis and management in healthcare applications. However, these applications commonly face two critical challenges: time misalignment and data sparsity. Traditional approaches address these issues through a two-step process of imputation followed by prediction. We propose MAGIC (Multi-tAsk Gaussian Process for Imputation and Classification), a novel unified framework that simultaneously performs class-informed missing value imputation and label prediction within a hierarchical multi-task Gaussian process coupled with functional logistic regression. To handle intractable likelihood components, MAGIC employs Taylor expansion approximations with bounded error analysis, and parameter estimation is performed using EM algorithm with block coordinate optimization supported by convergence analysis. We validate MAGIC through two healthcare applications: prediction of post-traumatic headache improvement following mild traumatic brain injury and prediction of in-hospital mortality within 48 hours after ICU admission. In both applications, MAGIC achieves superior predictive accuracy compared to existing methods. The ability to generate real-time and accurate predictions with limited samples facilitates early clinical assessment and treatment planning, enabling healthcare providers to make more informed treatment decisions.

Updated: 2025-09-23 21:02:39

标题: MAGIC：在医疗时间序列中联合填补和分类的多任务高斯过程

摘要: 时间序列分析已经成为改善医疗应用中患者诊断和管理的重要工具。然而，这些应用通常面临两个关键挑战：时间不对齐和数据稀疏。传统方法通过一个分为两步的过程来解决这些问题，即插补后跟随预测。我们提出了MAGIC（多任务高斯过程用于插补和分类），这是一个新颖的统一框架，它通过与功能逻辑回归相结合的层次多任务高斯过程同时进行类别相关的缺失值插补和标签预测。为了处理难以处理的似然组件，MAGIC采用了具有有界误差分析的泰勒展开近似，并且参数估计使用EM算法进行，其支持块坐标优化，并通过收敛分析来完成。我们通过两个医疗应用验证了MAGIC：在轻度创伤性脑损伤后预测创伤性头痛改善和在ICU入院后48小时内预测院内死亡。在这两个应用中，MAGIC相比现有方法实现了更优越的预测准确性。生成实时和准确的预测能力有助于早期临床评估和治疗计划，使医疗服务提供者能够做出更明智的治疗决策。

更新时间: 2025-09-23 21:02:39

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2509.19577v1

Knock-Knock: Black-Box, Platform-Agnostic DRAM Address-Mapping Reverse Engineering

Modern Systems-on-Chip (SoCs) employ undocumented linear address-scrambling functions to obfuscate DRAM addressing, which complicates DRAM-aware performance optimizations and hinders proactive security analysis of DRAM-based attacks; most notably, Rowhammer. Although previous work tackled the issue of reversing physical-to-DRAM mapping, existing heuristic-based reverse-engineering approaches are partial, costly, and impractical for comprehensive recovery. This paper establishes a rigorous theoretical foundation and provides efficient practical algorithms for black-box, complete physical-to-DRAM address-mapping recovery. We first formulate the reverse-engineering problem within a linear algebraic model over the finite field GF(2). We characterize the timing fingerprints of row-buffer conflicts, proving a relationship between a bank addressing matrix and an empirically constructed matrix of physical addresses. Based on this characterization, we develop an efficient, noise-robust, and fully platform-agnostic algorithm to recover the full bank-mask basis in polynomial time, a significant improvement over the exponential search from previous works. We further generalize our model to complex row mappings, introducing new hardware-based hypotheses that enable the automatic recovery of a row basis instead of previous human-guided contributions. Evaluations across embedded and server-class architectures confirm our method's effectiveness, successfully reconstructing known mappings and uncovering previously unknown scrambling functions. Our method provides a 99% recall and accuracy on all tested platforms. Most notably, Knock-Knock runs in under a few minutes, even on systems with more than 500GB of DRAM, showcasing the scalability of our method. Our approach provides an automated, principled pathway to accurate DRAM reverse engineering.

Updated: 2025-09-23 20:49:48

标题: 敲门：黑匣子，平台无关的DRAM地址映射逆向工程

摘要: 现代片上系统（SoCs）利用不透明的线性地址混淆函数来混淆DRAM寻址，这使得DRAM感知性能优化变得复杂，并阻碍了对基于DRAM的攻击（尤其是Rowhammer）的主动安全分析；尽管先前的研究解决了物理到DRAM映射的问题，但现有的基于启发式的逆向工程方法是部分的、昂贵的，并且不适用于全面恢复。本文建立了一个严格的理论基础，并提供了高效的实用算法，用于黑盒、完全的物理到DRAM地址映射恢复。我们首先在有限域GF(2)上的线性代数模型中制定了逆向工程问题。我们表征了行缓冲冲突的时间指纹，证明了银行寻址矩阵和经验构建的物理地址矩阵之间的关系。基于这种表征，我们开发了一种高效、抗噪声和完全与平台无关的算法，以多项式时间恢复完整的银行掩码基础，这是对先前工作中指数搜索的显著改进。我们进一步将我们的模型推广到复杂的行映射，引入新的基于硬件的假设，实现了自动恢复行基础，而不是以前由人类引导的贡献。通过嵌入式和服务器级架构的评估验证了我们方法的有效性，成功重建了已知的映射，并发现了先前未知的混淆函数。我们的方法在所有测试平台上提供了99%的召回率和准确性。尤其值得注意的是，Knock-Knock在几分钟内运行，即使在具有500GB以上DRAM的系统上也是如此，展示了我们方法的可扩展性。我们的方法提供了一条自动化、原则性的路径，用于准确的DRAM逆向工程。

更新时间: 2025-09-23 20:49:48

领域: cs.CR

下载: http://arxiv.org/abs/2509.19568v1

Nano Bio-Agents (NBA): Small Language Model Agents for Genomics

We investigate the application of Small Language Models (<10 billion parameters) for genomics question answering via agentic framework to address hallucination issues and computational cost challenges. The Nano Bio-Agent (NBA) framework we implemented incorporates task decomposition, tool orchestration, and API access into well-established systems such as NCBI and AlphaGenome. Results show that SLMs combined with such agentic framework can achieve comparable and in many cases superior performance versus existing approaches utilising larger models, with our best model-agent combination achieving 98% accuracy on the GeneTuring benchmark. Notably, small 3-10B parameter models consistently achieve 85-97% accuracy while requiring much lower computational resources than conventional approaches. This demonstrates promising potential for efficiency gains, cost savings, and democratization of ML-powered genomics tools while retaining highly robust and accurate performance.

Updated: 2025-09-23 20:44:31

标题: 纳米生物代理（NBA）：基因组学的小型语言模型代理

摘要: 我们通过代理框架研究了小型语言模型（小于10亿参数）在基因组学问题回答中的应用，以解决幻觉问题和计算成本挑战。我们实现的Nano Bio-Agent（NBA）框架将任务分解、工具编排和API访问整合到诸如NCBI和AlphaGenome等成熟系统中。结果表明，结合这种代理框架的SLM可以实现与利用更大模型的现有方法相媲美甚至在许多情况下表现更好的性能，我们的最佳模型-代理组合在GeneTuring基准测试中实现了98％的准确性。值得注意的是，小型3-10B参数模型始终以85-97％的准确性实现，同时需要比传统方法低得多的计算资源。这表明在保持高度稳健和准确性性能的同时，为提高效率、节约成本和民主化ML驱动的基因组学工具提供了有希望的潜力。

更新时间: 2025-09-23 20:44:31

领域: cs.AI,q-bio.GN

下载: http://arxiv.org/abs/2509.19566v1

Uncertainty in Semantic Language Modeling with PIXELS

Pixel-based language models aim to solve the vocabulary bottleneck problem in language modeling, but the challenge of uncertainty quantification remains open. The novelty of this work consists of analysing uncertainty and confidence in pixel-based language models across 18 languages and 7 scripts, all part of 3 semantically challenging tasks. This is achieved through several methods such as Monte Carlo Dropout, Transformer Attention, and Ensemble Learning. The results suggest that pixel-based models underestimate uncertainty when reconstructing patches. The uncertainty is also influenced by the script, with Latin languages displaying lower uncertainty. The findings on ensemble learning show better performance when applying hyperparameter tuning during the named entity recognition and question-answering tasks across 16 languages.

Updated: 2025-09-23 20:43:50

标题: "使用像素进行语义语言建模中的不确定性"

摘要: 像素级语言模型旨在解决语言建模中的词汇瓶颈问题，但不确定性量化的挑战仍然存在。这项工作的创新之处在于分析像素级语言模型在18种语言和7种文字中的不确定性和信心，这些语言和文字都属于3个语义挑战任务的一部分。通过多种方法实现这一目标，如蒙特卡罗辍学、变压器注意力和集成学习。结果表明，像素级模型在重建图块时低估了不确定性。不确定性也受文字的影响，拉丁语言显示出较低的不确定性。集成学习的研究结果表明，在对16种语言进行命名实体识别和问答任务时，经过超参数调整后表现更好。

更新时间: 2025-09-23 20:43:50

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2509.19563v1

Stochastic Path Planning in Correlated Obstacle Fields

We introduce the Stochastic Correlated Obstacle Scene (SCOS) problem, a navigation setting with spatially correlated obstacles of uncertain blockage status, realistically constrained sensors that provide noisy readings and costly disambiguation. Modeling the spatial correlation with Gaussian Random Field (GRF), we develop Bayesian belief updates that refine blockage probabilities, and use the posteriors to reduce search space for efficiency. To find the optimal traversal policy, we propose a novel two-stage learning framework. An offline phase learns a robust base policy via optimistic policy iteration augmented with information bonus to encourage exploration in informative regions, followed by an online rollout policy with periodic base updates via a Bayesian mechanism for information adaptation. This framework supports both Monte Carlo point estimation and distributional reinforcement learning (RL) to learn full cost distributions, leading to stronger uncertainty quantification. We establish theoretical benefits of correlation-aware updating and convergence property under posterior sampling. Comprehensive empirical evaluations across varying obstacle densities, sensor capabilities demonstrate consistent performance gains over baselines. This framework addresses navigation challenges in environments with adversarial interruptions or clustered natural hazards.

Updated: 2025-09-23 20:30:35

标题: 在相关障碍物场中的随机路径规划

摘要: 我们介绍了随机相关障碍场景（SCOS）问题，这是一个具有空间相关的障碍物、不确定阻塞状态的导航设置，具有提供嘈杂读数和昂贵消除歧义的现实受限传感器。通过使用高斯随机场（GRF）对空间相关性进行建模，我们开发了贝叶斯信念更新，以精化阻塞概率，并利用后验减少搜索空间以提高效率。为了找到最佳穿越策略，我们提出了一个新颖的两阶段学习框架。离线阶段通过乐观策略迭代和信息奖励增强学习一个强大的基础策略，以鼓励在信息丰富的区域进行探索，随后通过贝叶斯机制进行信息适应的在线滚动策略基本更新。该框架支持蒙特卡洛点估计和分布式强化学习（RL）来学习完整的成本分布，从而实现更强的不确定性量化。我们建立了考虑相关性更新和后验抽样下的理论优势和收敛性质。在不同障碍密度、传感器能力下的全面实证评估表明，相对于基线，性能持续提升。该框架解决了在具有对抗性中断或集群自然灾害的环境中的导航挑战。

更新时间: 2025-09-23 20:30:35

领域: stat.ML,cs.LG,stat.CO

下载: http://arxiv.org/abs/2509.19559v1

Confidence Calibration in Large Language Model-Based Entity Matching

This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.

Updated: 2025-09-23 20:29:10

标题: 大型语言模型在实体匹配中的置信度校准

摘要: 这项研究旨在探讨大型语言模型与实体匹配中置信度校准的交集。为此，我们进行了一项实证研究，比较了用于实体匹配任务的基准RoBERTa置信度与使用温度缩放、蒙特卡洛辍学和集成进行校准的置信度。我们使用了Abt-Buy、DBLP-ACM、iTunes-Amazon和Company数据集。研究结果表明，所提出的修改后的RoBERTa模型表现出轻微的过度自信，预期校准误差分数在数据集之间范围从0.0043到0.0552。我们发现，使用温度缩放可以减轻这种过度自信，将预期校准误差分数降低最高达23.83%。

更新时间: 2025-09-23 20:29:10

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2509.19557v1

AnySafe: Adapting Latent Safety Filters at Runtime via Safety Constraint Parameterization in the Latent Space

Recent works have shown that foundational safe control methods, such as Hamilton-Jacobi (HJ) reachability analysis, can be applied in the latent space of world models. While this enables the synthesis of latent safety filters for hard-to-model vision-based tasks, they assume that the safety constraint is known a priori and remains fixed during deployment, limiting the safety filter's adaptability across scenarios. To address this, we propose constraint-parameterized latent safety filters that can adapt to user-specified safety constraints at runtime. Our key idea is to define safety constraints by conditioning on an encoding of an image that represents a constraint, using a latent-space similarity measure. The notion of similarity to failure is aligned in a principled way through conformal calibration, which controls how closely the system may approach the constraint representation. The parameterized safety filter is trained entirely within the world model's imagination, treating any image seen by the model as a potential test-time constraint, thereby enabling runtime adaptation to arbitrary safety constraints. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our method adapts at runtime by conditioning on the encoding of user-specified constraint images, without sacrificing performance. Video results can be found on https://any-safe.github.io

Updated: 2025-09-23 20:28:04

标题: AnySafe：通过在潜在空间中的安全约束参数化在运行时调整潜在安全过滤器

摘要: 最近的研究表明，基础的安全控制方法，如Hamilton-Jacobi（HJ）可达性分析，可以应用于世界模型的潜在空间中。虽然这使得可以为难以建模的基于视觉的任务合成潜在安全过滤器，但它们假定安全约束是先验已知的，并在部署过程中保持不变，限制了安全过滤器在不同情景中的适应性。为了解决这个问题，我们提出了约束参数化的潜在安全过滤器，可以在运行时适应用户指定的安全约束。我们的关键思想是通过对表示约束的图像的编码进行条件化，使用潜在空间相似度度量来定义安全约束。通过符合校准的方式将与故障的相似性的概念以一种合理的方式对齐，从而控制系统可以多接近约束表示。参数化的安全过滤器完全在世界模型的想象中进行训练，将模型看到的任何图像都视为潜在的测试时间约束，从而实现对任意安全约束的运行时适应。在使用Franka机械臂进行基于视觉的控制任务的仿真和硬件实验中，我们展示了我们的方法通过对用户指定约束图像的编码进行条件化，在运行时进行了适应，而不会牺牲性能。视频结果可以在https://any-safe.github.io 找到。

更新时间: 2025-09-23 20:28:04

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2509.19555v1

Learning Dynamics of Deep Learning -- Force Analysis of Deep Neural Networks

This thesis explores how deep learning models learn over time, using ideas inspired by force analysis. Specifically, we zoom in on the model's training procedure to see how one training example affects another during learning, like analyzing how forces move objects. We break this influence into two parts: how similar the two examples are, and how strong the updating force is. This framework helps us understand a wide range of the model's behaviors in different real systems. For example, it explains why certain examples have non-trivial learning paths, why (and why not) some LLM finetuning methods work, and why simpler, more structured patterns tend to be learned more easily. We apply this approach to various learning tasks and uncover new strategies for improving model training. While the method is still developing, it offers a new way to interpret models' behaviors systematically.

Updated: 2025-09-23 20:27:19

标题: 深度学习的学习动态 -- 深度神经网络的力分析

摘要: 这篇论文探讨了深度学习模型如何随时间学习，使用了受力分析启发的思想。具体来说，我们放大模型的训练过程，看看一个训练样本在学习过程中如何影响另一个，就像分析力如何移动物体一样。我们将这种影响分为两部分：两个样本的相似程度以及更新力的强度。这一框架帮助我们理解模型在不同真实系统中的各种行为。例如，它解释了为什么某些样本具有非平凡的学习路径，为什么（以及为什么不）某些LLM微调方法有效，以及为什么更简单、更结构化的模式往往更容易学到。我们将这种方法应用于各种学习任务，并发现了改进模型训练的新策略。虽然这种方法仍在发展中，但它提供了一种系统解释模型行为的新方式。

更新时间: 2025-09-23 20:27:19

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19554v1

Why Speech Deepfake Detectors Won't Generalize: The Limits of Detection in an Open World

Speech deepfake detectors are often evaluated on clean, benchmark-style conditions, but deployment occurs in an open world of shifting devices, sampling rates, codecs, environments, and attack families. This creates a ``coverage debt" for AI-based detectors: every new condition multiplies with existing ones, producing data blind spots that grow faster than data can be collected. Because attackers can target these uncovered regions, worst-case performance (not average benchmark scores) determines security. To demonstrate the impact of the coverage debt problem, we analyze results from a recent cross-testing framework. Grouping performance by bona fide domain and spoof release year, two patterns emerge: newer synthesizers erase the legacy artifacts detectors rely on, and conversational speech domains (teleconferencing, interviews, social media) are consistently the hardest to secure. These findings show that detection alone should not be relied upon for high-stakes decisions. Detectors should be treated as auxiliary signals within layered defenses that include provenance, personhood credentials, and policy safeguards.

Updated: 2025-09-23 20:27:04

标题: 为什么言语Deepfake检测器无法泛化：在开放世界中检测的限制

摘要: 语音深度伪造检测器通常在干净、基准风格的条件下进行评估，但部署发生在一个设备不断变化、采样率、编解码器、环境和攻击家族的开放世界中。这为基于人工智能的检测器创造了一个“覆盖范围债务”：每个新条件都会与现有条件相乘，产生数据盲点，这些盲点增长速度比数据能够收集的速度还要快。由于攻击者可以针对这些未覆盖的区域，最坏情况的表现（而不是平均基准分数）决定安全性。为了展示覆盖范围债务问题的影响，我们分析了最近的交叉测试框架的结果。通过按真实域和欺骗发布年份对性能进行分组，可以看到两种模式：较新的合成器抹去了检测器依赖的传统遗留物，而会话性语音领域（电话会议、采访、社交媒体）一直是最难保护的。这些发现表明，单靠检测不能用于重大决策。检测器应被视为分层防御中的辅助信号，其中包括来源、个人身份凭证和政策保障。

更新时间: 2025-09-23 20:27:04

领域: cs.CR,cs.SD,eess.AS

下载: http://arxiv.org/abs/2509.20405v1

Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks

We address the challenge of sampling and remote estimation for autoregressive Markovian processes in a multi-hop wireless network with statistically-identical agents. Agents cache the most recent samples from others and communicate over wireless collision channels governed by an underlying graph topology. Our goal is to minimize time-average estimation error and/or age of information with decentralized scalable sampling and transmission policies, considering both oblivious (where decision-making is independent of the physical processes) and non-oblivious policies (where decision-making depends on physical processes). We prove that in oblivious policies, minimizing estimation error is equivalent to minimizing the age of information. The complexity of the problem, especially the multi-dimensional action spaces and arbitrary network topologies, makes theoretical methods for finding optimal transmission policies intractable. We optimize the policies using a graphical multi-agent reinforcement learning framework, where each agent employs a permutation-equivariant graph neural network architecture. Theoretically, we prove that our proposed framework exhibits desirable transferability properties, allowing transmission policies trained on small- or moderate-size networks to be executed effectively on large-scale topologies. Numerical experiments demonstrate that (i) Our proposed framework outperforms state-of-the-art baselines; (ii) The trained policies are transferable to larger networks, and their performance gains increase with the number of agents; (iii) The training procedure withstands non-stationarity even if we utilize independent learning techniques; and, (iv) Recurrence is pivotal in both independent learning and centralized training and decentralized execution, and improves the resilience to non-stationarity in independent learning.

Updated: 2025-09-23 20:25:15

标题: 使用图神经网络的分散式学习策略用于估计误差最小化

摘要: 我们解决了在具有统计相同代理的多跳无线网络中对自回归马尔可夫过程进行采样和远程估计的挑战。代理缓存来自其他代理的最新样本，并通过受基础图拓扑控制的无线碰撞信道进行通信。我们的目标是通过分散可扩展的采样和传输策略来最小化时间平均估计误差和/或信息年龄，考虑到无视的（决策独立于物理过程）和非无视的策略（决策依赖于物理过程）。我们证明，在无视策略中，最小化估计误差等同于最小化信息年龄。问题的复杂性，尤其是多维动作空间和任意网络拓扑，使得在寻找最优传输策略方面的理论方法变得困难。我们使用图形化的多代理强化学习框架来优化策略，其中每个代理都使用置换等变图神经网络架构。从理论上讲，我们证明我们提出的框架具有理想的可转移性特性，允许在小型或中等规模网络上训练的传输策略有效地在大规模拓扑上执行。数值实验表明：（i）我们提出的框架优于现有技术基线；（ii）训练的策略可转移至更大的网络，并且其性能增益随代理数量的增加而增加；（iii）即使利用独立学习技术，训练过程也能抵御非稳态情况；（iv）在独立学习和集中训练以及分散执行中，循环对抵御非稳态性至关重要，并改善了独立学习中对非稳态性的韧性。

更新时间: 2025-09-23 20:25:15

领域: eess.SP,cs.LG

下载: http://arxiv.org/abs/2404.03227v3

DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs

Large language models (LLMs) deliver strong performance but are difficult to deploy due to high memory and compute costs. While pruning reduces these demands, most methods ignore activation sparsity observed at runtime. We reinterpret activation sparsity as dynamic structured weight sparsity and propose DuoGPT, a unified framework that constructs dual-sparse (spMspV) workloads by combining unstructured weight pruning with activation sparsity. To preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with activation-aware calibration and introduce output residuals from the dense model as correction terms. We further optimize the solution for efficient GPU execution, enabling scalability to billion-parameter LLMs. Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39$\times$ compared to the baseline dense model. Code is available at Github.

Updated: 2025-09-23 20:21:55

标题: DuoGPT：通过LLMs中的激活感知剪枝实现无需训练的双重稀疏化

摘要: 大型语言模型（LLMs）具有强大的性能，但由于高内存和计算成本而难以部署。虽然剪枝可以减少这些需求，但大多数方法忽略了运行时观察到的激活稀疏性。我们将激活稀疏性重新解释为动态结构化的权重稀疏性，并提出了DuoGPT，这是一个统一的框架，通过将非结构化权重剪枝与激活稀疏性相结合，构建双稀疏（spMspV）工作负载。为了保持准确性，我们扩展了Optimal Brain Compression（OBC）框架，引入了激活感知校准，并将密集模型的输出残差作为校正项。我们进一步优化解决方案以实现高效的GPU执行，从而使其可扩展到亿参数LLMs。在LLaMA-2和LLaMA-3上的评估表明，DuoGPT在等速提升1.39倍的情况下，比最先进的结构化剪枝方法提高了高达9.17％的准确性。代码可在Github上找到。

更新时间: 2025-09-23 20:21:55

领域: cs.LG

下载: http://arxiv.org/abs/2506.20194v2

The SkipSponge Attack: Sponge Weight Poisoning of Deep Neural Networks

Sponge attacks aim to increase the energy consumption and computation time of neural networks. In this work, we present a novel sponge attack called SkipSponge. SkipSponge is the first sponge attack that is performed directly on the parameters of a pretrained model using only a few data samples. Our experiments show that SkipSponge can successfully increase the energy consumption of image classification models, GANs, and autoencoders, requiring fewer samples than the state-of-the-art sponge attacks (Sponge Poisoning). We show that poisoning defenses are ineffective if not adjusted specifically for the defense against SkipSponge (i.e., they decrease target layer bias values) and that SkipSponge is more effective on the GANs and the autoencoders than Sponge Poisoning. Additionally, SkipSponge is stealthy as it does not require significant changes to the victim model's parameters. Our experiments indicate that SkipSponge can be performed even when an attacker has access to less than 1% of the entire training dataset and reaches up to 13% energy increase.

Updated: 2025-09-23 20:15:36

标题: The SkipSponge攻击：深度神经网络的Sponge Weight毒化

摘要: Sponge攻击旨在增加神经网络的能量消耗和计算时间。在这项工作中，我们提出了一种新颖的Sponge攻击，称为SkipSponge。SkipSponge是第一个直接在预训练模型的参数上执行的Sponge攻击，仅使用少量数据样本。我们的实验证明，SkipSponge可以成功增加图像分类模型、GAN和自编码器的能量消耗，而且所需的样本比最先进的Sponge攻击（Sponge中毒）更少。我们发现，如果没有专门调整用于防御SkipSponge的毒害防御（即减少目标层偏置值），毒害防御是无效的，并且SkipSponge对GAN和自编码器比Sponge中毒更有效。此外，SkipSponge是隐蔽的，因为它不需要对受害模型的参数进行重大更改。我们的实验表明，即使攻击者只能访问整个训练数据集的不到1%，也可以执行SkipSponge，并且能量增加最多达到13%。

更新时间: 2025-09-23 20:15:36

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2402.06357v5

A Survey of Recent Advancements in Secure Peer-to-Peer Networks

Peer-to-peer (P2P) networks are a cornerstone of modern computing, and their security is an active area of research. Many defenses with strong security guarantees have been proposed; however, the most-recent survey is over a decade old. This paper delivers an updated review of recent theoretical advances that address classic threats, such as the Sybil and routing attacks, while highlighting how emerging trends -- such as machine learning, social networks, and dynamic systems -- pose new challenges and drive novel solutions. We evaluate the strengths and weaknesses of these solutions and suggest directions for future research.

Updated: 2025-09-23 20:07:53

标题: 最近安全对等网络的进展调查

摘要: 点对点（P2P）网络是现代计算的基石，它们的安全性是一个活跃的研究领域。许多具有强大安全保障的防御措施已被提出；然而，最近的一次调查已经超过十年。本文提供了最新的理论进展的回顾，解决了传统威胁，如Sybil和路由攻击，同时强调新兴趋势 - 如机器学习、社交网络和动态系统 - 提出了新的挑战并推动了新的解决方案。我们评估了这些解决方案的优势和劣势，并提出了未来研究的方向。

更新时间: 2025-09-23 20:07:53

领域: cs.DC,cs.CR

下载: http://arxiv.org/abs/2509.19539v1

Proof-of-Social-Capital: A Consensus Protocol Replacing Stake for Social Capital

Consensus protocols used today in blockchains mostly rely on scarce resources such as computational power or financial stake, favoring wealthy individuals due to a high entry barrier. We propose Proof-of-Social-Capital (PoSC), a new consensus protocol fueled by social capital as a staking resource to ensure fairness and decentralization. Consensus nodes in our system do not require financial or computational resources that are expensive to acquire; instead, they require preexisting social media influence, distributing consensus power not according to wealth but social capital. Our approach integrates zkSNARK proofs, verifiable credentials with a uniqueness-enforcing mechanism to prevent Sybil attacks, and the incentive scheme that rewards engagement with social media content by followers. This work offers a new concept aligned with modern social media lifestyle applied in finance, providing a practical insight for the evolution of decentralized consensus protocols.

Updated: 2025-09-23 20:06:35

标题: 社会资本验证：一种以社会资本替代股权的共识协议

摘要: 当今区块链中使用的共识协议主要依赖稀缺资源，如计算能力或财务利益，这使得富裕个体受益，因为存在高门槛。我们提出了一种名为社会资本证明（PoSC）的新共识协议，以社会资本作为抵押资源，以确保公平和去中心化。在我们的系统中，共识节点不需要昂贵的财务或计算资源，而是需要已有的社交媒体影响力，分发共识权力不是根据财富而是社会资本。我们的方法整合了zkSNARK证明、可验证凭证和强制唯一性机制，以防止Sybil攻击，以及奖励追随者与社交媒体内容互动的激励方案。这项工作提供了一个与现代社交媒体生活方式相契合的新概念，应用于金融领域，为去中心化共识协议的演进提供了实际见解。

更新时间: 2025-09-23 20:06:35

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2505.12144v2

DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions

Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose \textbf{DAWM}, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.

Updated: 2025-09-23 20:06:26

标题: DAWM：通过动作推理转换的离线强化学习中的扩散动作世界模型

摘要: 基于扩散的世界模型已经展示出在合成逼真的长期轨迹方面具有很强的能力，适用于离线强化学习（RL）。然而，许多现有方法并不直接生成动作，而是状态和奖励，这限制了它们与依赖一步时间差（TD）学习的标准基于值的离线RL算法的兼容性。尽管以前的工作已经探讨了对状态、奖励和动作的联合建模以解决这个问题，但这种形式往往会导致实践中训练复杂性增加和性能降低。我们提出了\textbf{DAWM}，一种基于扩散的世界模型，会生成未来的状态奖励轨迹，以当前状态、动作和返回目标为条件，并与逆动力学模型（IDM）配对，以实现有效的动作推断。这种模块化设计产生了适用于一步TD离线RL的完整的合成转换，实现了有效和计算效率高的训练。在实证上，我们表明保守的离线RL算法如TD3BC和IQL从在这些增强轨迹上训练中受益显著，一直在D4RL基准测试中的多个任务中持续优于先前基于扩散的基线。

更新时间: 2025-09-23 20:06:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19538v1

Semantic-Aware Fuzzing: An Empirical Framework for LLM-Guided, Reasoning-Driven Input Mutation

Security vulnerabilities in Internet-of-Things devices, mobile platforms, and autonomous systems remain critical. Traditional mutation-based fuzzers -- while effectively explore code paths -- primarily perform byte- or bit-level edits without semantic reasoning. Coverage-guided tools such as AFL++ use dictionaries, grammars, and splicing heuristics to impose shallow structural constraints, leaving deeper protocol logic, inter-field dependencies, and domain-specific semantics unaddressed. Conversely, reasoning-capable large language models (LLMs) can leverage pretraining knowledge to understand input formats, respect complex constraints, and propose targeted mutations, much like an experienced reverse engineer or testing expert. However, lacking ground truth for "correct" mutation reasoning makes supervised fine-tuning impractical, motivating explorations of off-the-shelf LLMs via prompt-based few-shot learning. To bridge this gap, we present an open-source microservices framework that integrates reasoning LLMs with AFL++ on Google's FuzzBench, tackling asynchronous execution and divergent hardware demands (GPU- vs. CPU-intensive) of LLMs and fuzzers. We evaluate four research questions: (R1) How can reasoning LLMs be integrated into the fuzzing mutation loop? (R2) Do few-shot prompts yield higher-quality mutations than zero-shot? (R3) Can prompt engineering with off-the-shelf models improve fuzzing directly? and (R4) Which open-source reasoning LLMs perform best under prompt-only conditions? Experiments with Llama3.3, Deepseek-r1-Distill-Llama-70B, QwQ-32B, and Gemma3 highlight Deepseek as the most promising. Mutation effectiveness depends more on prompt complexity and model choice than shot count. Response latency and throughput bottlenecks remain key obstacles, offering directions for future work.

Updated: 2025-09-23 19:57:29

标题: 语义感知模糊测试：基于LLM引导和推理驱动的输入变异的经验框架

摘要: 物联网设备、移动平台和自主系统中的安全漏洞仍然是关键问题。传统基于变异的模糊测试工具虽然有效地探索了代码路径，但主要进行字节或位级别的编辑，而缺乏语义推理。覆盖引导工具（如AFL++）使用词典、语法和拼接启发式来施加浅层结构约束，未解决更深层的协议逻辑、字段间依赖和领域特定语义。相反，具有推理能力的大型语言模型（LLMs）可以利用预训练知识来理解输入格式，尊重复杂约束，并提出有针对性的变异，就像经验丰富的逆向工程师或测试专家一样。然而，缺乏“正确”变异推理的基准事实使得监督微调变得不切实际，促使通过基于提示的少样本学习探索现成的LLMs。为了弥补这一差距，我们提出了一个开源微服务框架，将推理LLMs与AFL++集成到Google的FuzzBench中，解决了LLMs和模糊测试工具异步执行和不同硬件需求（GPU与CPU密集型）的问题。我们评估了四个研究问题：（R1）如何将推理LLMs集成到模糊测试变异循环中？（R2）少样本提示是否比零样本产生更高质量的变异？（R3）通过使用现成的模型进行提示工程是否直接改进了模糊测试？以及（R4）在仅提示条件下，哪些开源推理LLMs表现最佳？Llama3.3、Deepseek-r1-Distill-Llama-70B、QwQ-32B和Gemma3的实验突出显示Deepseek最具潜力。变异效果更多取决于提示复杂性和模型选择，而不是样本数量。响应延迟和吞吐量瓶颈仍然是关键障碍，为未来工作提供方向。

更新时间: 2025-09-23 19:57:29

领域: cs.SE,cs.AI,cs.CR

下载: http://arxiv.org/abs/2509.19533v1

VetIoT: On Vetting IoT Defenses Enforcing Policies at Runtime

Smart homes, powered by programmable IoT platforms, often face safety and security issues. A class of defense solutions dynamically enforces policies that capture the expected behavior of the IoT system. Despite numerous innovations, these solutions are under-vetted. The primary reason lies in their evaluation approach -- they are self-assessed in isolated virtual testbeds with hand-crafted orchestrated scenarios that require manual interactions using the platform's user-interface (UI). Such non-uniform evaluation setups limit reproducibility and comparative analysis. Closing this gap in the traditional way requires a significant upfront manual effort, causing researchers to turn away from large-scale comparative empirical evaluation. To address this, we propose VetIoT -- a highly automated, uniform evaluation platform -- to vet the defense solutions that hinge on runtime policy enforcement. Given a defense solution, VetIoT readily instantiates a virtual testbed to deploy and evaluate the solution. VetIoT replaces manual UI-based interactions with an automated event simulator and manual inspection of test outcomes with an automated comparator. VetIoT incorporates automated event generators to feed events to the event simulator. We developed a prototype of VetIoT, which successfully reproduced and comparatively assessed four runtime policy enforcement solutions. VetIoT's stress testing and differential testing capabilities make it a promising tool for future research and evaluation.

Updated: 2025-09-23 19:50:57

标题: VetIoT:关于在运行时执行政策的IoT防御审查

摘要: 智能家居，由可编程的物联网平台驱动，经常面临安全和安全性问题。一类防御解决方案动态地实施政策，捕捉物联网系统的预期行为。尽管有许多创新，但这些解决方案尚未经过审查。主要原因在于它们的评估方法--它们在孤立的虚拟测试平台中进行自我评估，需要使用平台的用户界面手工制作编排的场景进行手动交互。这种非统一的评估设置限制了可重现性和比较分析。传统方式中的这一差距的缩小需要大量的手动前期工作，导致研究人员放弃大规模比较实证评估。为了解决这个问题，我们提出了VetIoT--一个高度自动化、统一的评估平台--用于审查依赖于运行时策略执行的防御解决方案。给定一个防御解决方案，VetIoT会立即实例化一个虚拟测试平台来部署和评估解决方案。VetIoT用自动事件模拟器替代手动UI交互，并用自动比较器替代手动检查测试结果。VetIoT整合了自动事件生成器，以向事件模拟器提供事件。我们开发了一个VetIoT的原型，成功复制并比较评估了四种运行时策略执行解决方案。VetIoT的压力测试和差分测试功能使其成为未来研究和评估的有希望的工具。

更新时间: 2025-09-23 19:50:57

领域: cs.CR

下载: http://arxiv.org/abs/2308.12417v3

Compression Strategies for Efficient Multimodal LLMs in Medical Contexts

Multimodal Large Language Models (MLLMs) hold huge potential for usage in the medical domain, but their computational costs necessitate efficient compression techniques. This paper evaluates the impact of structural pruning and activation-aware quantization on a fine-tuned LLAVA model for medical applications. We propose a novel layer selection method for pruning, analyze different quantization techniques, and assess the performance trade-offs in a prune-SFT-quantize pipeline. Our proposed method enables MLLMs with 7B parameters to run within 4 GB of VRAM, reducing memory usage by 70% while achieving 4% higher model performance compared to traditional pruning and quantization techniques in the same compression ratio.

Updated: 2025-09-23 19:50:14

标题: 在医学环境中高效多模式LLMs的压缩策略

摘要: 多模态大型语言模型（MLLMs）在医疗领域的潜力巨大，但其计算成本需要高效的压缩技术。本文评估了结构剪枝和激活感知量化对用于医疗应用的精调LLAVA模型的影响。我们提出了一种新颖的层选择方法用于剪枝，分析了不同的量化技术，并在一个剪枝-SFT-量化流水线中评估了性能折衷。我们提出的方法使得具有7B参数的MLLMs可以在4GB的VRAM内运行，将内存使用减少了70%，同时在相同的压缩比下比传统的剪枝和量化技术实现了4%更高的模型性能。

更新时间: 2025-09-23 19:50:14

领域: cs.AI

下载: http://arxiv.org/abs/2507.21976v3

CueGCL: Cluster-aware Personalized Self-Training for Unsupervised Graph Contrastive Learning

Recently, graph contrastive learning (GCL) has emerged as one of the optimal solutions for node-level and supervised tasks. However, for structure-related and unsupervised tasks such as graph clustering, current GCL algorithms face difficulties acquiring the necessary cluster-level information, resulting in poor performance. In addition, general unsupervised GCL improves the performance of downstream tasks by increasing the number of negative samples, which leads to severe class collision and unfairness of graph clustering. To address the above issues, we propose a Cluster-aware Graph Contrastive Learning Framework (CueGCL) to jointly learn clustering results and node representations. Specifically, we design a personalized self-training (PeST) strategy for unsupervised scenarios, which enables our model to capture precise cluster-level personalized information. With the benefit of the PeST, we alleviate class collision and unfairness without sacrificing the overall model performance. Furthermore, aligned graph clustering (AGC) is employed to obtain the cluster partition, where we align the clustering space of our downstream task with that in PeST to achieve more consistent node embeddings. Finally, we theoretically demonstrate the effectiveness of our model, showing it yields an embedding space with a significantly discernible cluster structure. Extensive experimental results also show our CueGCL exhibits state-of-the-art performance on five benchmark datasets with different scales.

Updated: 2025-09-23 19:49:17

标题: CueGCL: 面向集群的个性化自我训练，用于无监督图对比学习

摘要: 最近，图对比学习（GCL）已经成为节点级和监督任务的最佳解决方案之一。然而，对于与结构相关和无监督任务（如图聚类）相关的任务，目前的GCL算法面临获取必要的簇级信息的困难，导致性能不佳。此外，一般的无监督GCL通过增加负样本数量来提高下游任务的性能，这导致严重的类碰撞和图聚类的不公平性。为了解决上述问题，我们提出了一个基于簇感知的图对比学习框架（CueGCL），共同学习聚类结果和节点表示。具体来说，我们为无监督场景设计了一种个性化自训练（PeST）策略，使我们的模型能够捕捉精确的簇级个性化信息。借助PeST的好处，我们减轻了类碰撞和不公平性，而不会牺牲整体模型性能。此外，我们采用了对齐图聚类（AGC）来获得簇划分，我们将我们下游任务的聚类空间与PeST中的聚类空间对齐，以实现更一致的节点嵌入。最后，我们从理论上证明了我们模型的有效性，表明它产生了一个具有明显可辨认簇结构的嵌入空间。广泛的实验结果还显示，我们的CueGCL在具有不同规模的五个基准数据集上表现出了最先进的性能。

更新时间: 2025-09-23 19:49:17

领域: cs.SI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2311.11073v2

Metriplectic Conditional Flow Matching for Dissipative Dynamics

Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. In inference, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous and discrete time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.

Updated: 2025-09-23 19:46:54

标题: Metriplectic条件流匹配用于耗散动力学

摘要: Metriplectic conditional flow matching (MCFM) 学习耗散动力学而不违反第一原则。神经替代品通常注入能量并不稳定长时间预测；相反，MCFM将保守-耗散分裂结合到向量场和保持结构的采样器中。MCFM通过短转换上的条件流匹配进行训练，避免了长时间预测的伴随。在推断中，Strang-prox方案交替进行辛更新和近端度量步骤，确保离散能量衰减；当可用信任能量时，可选择投影以强制严格衰减。我们提供了连续和离散时间保证，将这种参数化和采样器与保守性、单调耗散和稳定预测联系起来。在受控机械基准测试中，MCFM产生更接近真实情况的相位图，并且比同样表达能力的无约束神经流产生明显更少的能量增加和正能量速率事件，同时匹配终端分布拟合。

更新时间: 2025-09-23 19:46:54

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2509.19526v1

CellCLIP -- Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning

High-content screening (HCS) assays based on high-throughput microscopy techniques such as Cell Painting have enabled the interrogation of cells' morphological responses to perturbations at an unprecedented scale. The collection of such data promises to facilitate a better understanding of the relationships between different perturbations and their effects on cellular state. Towards achieving this goal, recent advances in cross-modal contrastive learning could, in theory, be leveraged to learn a unified latent space that aligns perturbations with their corresponding morphological effects. However, the application of such methods to HCS data is not straightforward due to substantial differences in the semantics of Cell Painting images compared to natural images, and the difficulty of representing different classes of perturbations (e.g., small molecule vs CRISPR gene knockout) in a single latent space. In response to these challenges, here we introduce CellCLIP, a cross-modal contrastive learning framework for HCS data. CellCLIP leverages pre-trained image encoders coupled with a novel channel encoding scheme to better capture relationships between different microscopy channels in image embeddings, along with natural language encoders for representing perturbations. Our framework outperforms current open-source models, demonstrating the best performance in both cross-modal retrieval and biologically meaningful downstream tasks while also achieving significant reductions in computation time.

Updated: 2025-09-23 19:44:10

标题: CellCLIP — 通过文本引导的对比学习学习细胞绘画中的扰动效应

摘要: 高内容筛选（HCS）基于高通量显微技术的检测，例如细胞涂画，已经使得可以在前所未有的规模上对细胞对干扰的形态学反应进行询问。这些数据的收集有望促进更好地理解不同干扰之间以及它们对细胞状态的影响之间的关系。为了实现这一目标，最近在跨模态对比学习方面的进展在理论上可以利用学习一个统一的潜在空间，将干扰与其相应的形态效应进行对齐。然而，由于细胞涂画图像与自然图像相比在语义上存在显著差异，以及在单个潜在空间中表示不同类别的干扰（例如小分子与CRISPR基因敲除）的困难，将这些方法应用于HCS数据并不简单。针对这些挑战，我们介绍了CellCLIP，这是一个用于HCS数据的跨模态对比学习框架。CellCLIP利用预训练图像编码器结合一种新颖的通道编码方案，更好地捕捉图像嵌入中不同显微通道之间的关系，同时利用自然语言编码器表示干扰。我们的框架优于当前的开源模型，在交叉模态检索和生物意义下游任务方面表现出最佳性能，同时也实现了计算时间的显著减少。

更新时间: 2025-09-23 19:44:10

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2506.06290v3

Identities are not Interchangeable: The Problem of Overgeneralization in Fair Machine Learning

A key value proposition of machine learning is generalizability: the same methods and model architecture should be able to work across different domains and different contexts. While powerful, this generalization can sometimes go too far, and miss the importance of the specifics. In this work, we look at how fair machine learning has often treated as interchangeable the identity axis along which discrimination occurs. In other words, racism is measured and mitigated the same way as sexism, as ableism, as ageism. Disciplines outside of computer science have pointed out both the similarities and differences between these different forms of oppression, and in this work we draw out the implications for fair machine learning. While certainly not all aspects of fair machine learning need to be tailored to the specific form of oppression, there is a pressing need for greater attention to such specificity than is currently evident. Ultimately, context specificity can deepen our understanding of how to build more fair systems, widen our scope to include currently overlooked harms, and, almost paradoxically, also help to narrow our scope and counter the fear of an infinite number of group-specific methods of analysis.

Updated: 2025-09-23 19:42:42

标题: 身份不可互换：公平机器学习中过度概括的问题

摘要: 机器学习的一个关键价值主张是泛化性：相同的方法和模型架构应该能够适用于不同的领域和不同的情境。虽然强大，这种泛化有时可能走得太远，忽略了细节的重要性。在这项工作中，我们研究了公平机器学习如何经常将歧视发生的身份轴视为可以互换的。换句话说，种族主义被衡量和缓解的方式与性别歧视、能力主义和年龄歧视相同。计算机科学之外的学科已经指出了这些不同形式压迫的相似之处和差异之处，在这项工作中，我们揭示了对公平机器学习的影响。虽然公平机器学习的所有方面都不一定需要针对特定形式的压迫进行调整，但目前很明显需要更多关注这种特殊性。最终，上下文特定性可以加深我们对如何构建更公平系统的理解，扩大我们的范围以包括目前被忽视的伤害，并且几乎矛盾地，也有助于缩小我们的范围并对抗无限数量的特定群体方法的担忧。

更新时间: 2025-09-23 19:42:42

领域: cs.CY,cs.LG

下载: http://arxiv.org/abs/2505.04038v2

Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation

Robot learning papers typically report a single binary success rate (SR), which obscures where a policy succeeds or fails along a multi-step manipulation task. We argue that subgoal-level reporting should become routine: for each trajectory, a vector of per-subgoal SRs that makes partial competence visible (e.g., grasp vs. pour). We propose a blueprint for StepEval, a cost-aware plug-in evaluation framework that utilizes vision-language models (VLMs) as automated judges of subgoal outcomes from recorded images or videos. Rather than proposing new benchmarks or APIs, our contribution is to outline design principles for a scalable, community-driven open-source project. In StepEval, the primary artifact for policy evaluation is the per-subgoal SR vector; however, other quantities (e.g., latency or cost estimates) are also considered for framework-optimization diagnostics to help the community tune evaluation efficiency and accuracy when ground-truth subgoal success labels are available. We discuss how such a framework can remain model-agnostic, support single- or multi-view inputs, and be lightweight enough to adopt across labs. The intended contribution is a shared direction: a minimal, extensible seed that invites open-source contributions, so that scoring the steps, not just the final goal, becomes a standard and reproducible practice.

Updated: 2025-09-23 19:42:14

标题: 评分步骤，而不仅仅是目标：基于VLM的机器人操作子目标评估

摘要: 机器人学习论文通常报告单一的二进制成功率（SR），这使得在多步操作任务中难以看清策略成功或失败的地方。我们认为应该将子目标级别的报告变得常规化：对于每条轨迹，一个包含每个子目标成功率的向量可以使部分能力可见（例如，抓取 vs. 倒出）。我们提出了StepEval的蓝图，这是一个成本意识的插件评估框架，利用视觉语言模型（VLMs）作为自动化评判器，从记录的图像或视频中评估子目标的结果。我们的贡献并非是提出新的基准或API，而是为可扩展、社区驱动的开源项目概述设计原则。在StepEval中，用于策略评估的主要工件是每个子目标的SR向量；然而，其他数量（例如延迟或成本估算）也被考虑为框架优化诊断，以帮助社区在有地面真相子目标成功标签时调整评估效率和准确性。我们讨论了这样一个框架如何保持模型无关，支持单视角或多视角输入，并且足够轻量级以在实验室间推广。我们的目标是共享一个方向：一种最小、可扩展的种子，鼓励开源贡献，使得评分步骤不仅成为标准化和可重复的做法，而不仅仅是最终目标。

更新时间: 2025-09-23 19:42:14

领域: cs.AI,cs.RO

下载: http://arxiv.org/abs/2509.19524v1

Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning

The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.

Updated: 2025-09-23 19:36:56

标题: 大型语言模型中的认知负荷限制：多跳推理基准测试

摘要: 大型语言模型（LLMs）的规模化暴露出它们在静态基准测试中表现优秀与在动态、信息丰富环境中易碎之间存在着关键差距。虽然模型在孤立任务上表现出色，但控制它们在认知负荷下推理的计算限制仍然不够清楚。在这项工作中，我们引入了一个计算认知负荷的形式理论，假设多余的、与任务无关的信息（上下文饱和）和来自任务切换的干扰（注意力残留）是降低性能的关键机制。我们设计了交替认知评估（ICE），一个消除混淆的基准测试，以系统地操纵这些负荷因素在具有挑战性的多跳推理任务中。一项全面的研究（每个项目重复10次，共200个问题）揭示了五个经过指导的模型之间的显著性能差异。较小的开源架构（Llama-3-8B-Instruct，Mistral-7B-Instruct-v0.2）展示了基线脆弱性，在这个高内在负荷任务中，在所有条件下，包括干净的对照组，都实现了0％的准确率（SEM = 0.0）。相反，Gemini-2.0-Flash-001表现出部分韧性，在控制条件下实现了85％的准确率，在上下文饱和条件下有统计学上显著的退化（每％负荷$\beta = -0.003$，$p < 0.001$）。这些发现初步证明认知负荷是推理失败的一个关键因素，支持在不确定性下幻觉作为猜测的理论。我们得出结论，动态的、认知感知的压力测试，如ICE基准测试所示，对评估先进AI系统的真正韧性和安全性至关重要。

更新时间: 2025-09-23 19:36:56

领域: cs.AI,cs.CL,cs.LG,I.2.7; I.2.6

下载: http://arxiv.org/abs/2509.19517v1

CUPID: Curating Data your Robot Loves with Influence Functions

In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes - such as closed-loop task success or failure - remains a persistent challenge. We propose CUPID, a robot data curation method based on a novel influence function-theoretic formulation for imitation learning policies. Given a set of evaluation rollouts, CUPID estimates the influence of each training demonstration on the policy's expected return. This enables ranking and selection of demonstrations according to their impact on the policy's closed-loop performance. We use CUPID to curate data by 1) filtering out training demonstrations that harm policy performance and 2) subselecting newly collected trajectories that will most improve the policy. Extensive simulated and hardware experiments show that our approach consistently identifies which data drives test-time performance. For example, training with less than 33% of curated data can yield state-of-the-art diffusion policies on the simulated RoboMimic benchmark, with similar gains observed in hardware. Furthermore, hardware experiments show that our method can identify robust strategies under distribution shift, isolate spurious correlations, and even enhance the post-training of generalist robot policies. Videos and code are made available at: https://cupid-curation.github.io.

Updated: 2025-09-23 19:35:35

标题: CUPID：使用影响函数为您的机器人精心筛选数据

摘要: 在机器人模仿学习中，策略的性能与示范数据的质量和组成密切相关。然而，对于个别示范如何对下游结果（如闭环任务的成功或失败）产生影响的精确理解仍然是一个持续的挑战。我们提出了CUPID，一种基于新颖影响函数理论公式的机器人数据策划方法，用于模仿学习策略。给定一组评估轨迹，CUPID估计每个训练示范对策略的期望回报的影响。这使得能够根据它们对策略闭环性能的影响对示范进行排名和选择。我们使用CUPID通过以下方式策划数据：1）过滤掉会损害策略性能的训练示范；2）精选新收集的轨迹，以最大程度地改进策略。大量模拟和硬件实验表明，我们的方法始终能够确定哪些数据驱动测试时的性能。例如，在模拟RoboMimic基准测试中，使用少于33%的策划数据进行训练可以获得最先进的扩散策略，硬件实验中也观察到类似的收益。此外，硬件实验表明，我们的方法可以在分布转移下识别出稳健的策略，隔离虚假相关性，甚至增强通用机器人策略的后训练。视频和代码可在以下网址找到：https://cupid-curation.github.io。

更新时间: 2025-09-23 19:35:35

领域: cs.RO,cs.AI,cs.LG,I.2.6; I.2.9

下载: http://arxiv.org/abs/2506.19121v2

A Longitudinal Randomized Control Study of Companion Chatbot Use: Anthropomorphism and Its Mediating Role on Social Impacts

Relationships with social artificial intelligence (AI) agents are on the rise. People report forming friendships, mentorships, and romantic partnerships with chatbots such as Replika, a type of social AI agent that is designed specifically for companionship. Concerns that companion chatbot relationships may harm or replace human ones have been raised, but whether and how these social consequences occur remains unclear. Prior research suggests that people's states of social need and their anthropomorphism of the AI agent may play a role in how human-AI interaction impacts human-human interaction. In this longitudinal study (N = 183), participants were randomly assigned to converse with a companion chatbot over text or to play text-based word games for 10 minutes a day for 21 consecutive days. During these 21 days, participants also completed four surveys and two audio-recorded interviews. We found that people's social health and relationships were not significantly impacted by interacting with a companion chatbot across 21 days compared to the control group. However, people who had a higher desire to socially connect anthropomorphized the chatbot more. Those who anthropomorphized the chatbot more indicated that the human-chatbot interaction had greater impacts on their social interactions and relationships with family and friends. A mediation analysis suggested that the impact of human-AI interaction on human-human social outcomes was mediated by the extent to which people anthropomorphized the AI agent, which itself was related to the desire to socially connect.

Updated: 2025-09-23 19:33:41

标题: 一项关于伴侣聊天机器人使用的纵向随机对照研究：拟人化及其在社会影响中的中介作用

摘要: 与社交人工智能（AI）代理的关系正在增加。人们报告与诸如Replika之类的聊天机器人建立友谊、师生关系和浪漫伙伴关系，Replika是一种专门设计用于陪伴的社交AI代理。有人提出担忧，认为伴侣聊天机器人关系可能会伤害或取代人类关系，但这些社交后果是否以及如何发生仍不清楚。先前的研究表明，人们的社交需求状态和他们对AI代理的拟人化可能在人类与AI互动对人类之间互动的影响中发挥作用。在这项纵向研究中（N = 183），参与者被随机分配到与伴侣聊天机器人进行文字交流，或者每天进行基于文本的文字游戏交流10分钟，持续21天。在这21天内，参与者还完成了四次调查和两次录音采访。我们发现，与对照组相比，人们在21天内与伴侣聊天机器人互动并没有显著影响他们的社交健康和关系。然而，那些更渴望社交连接的人更容易拟人化聊天机器人。那些对聊天机器人进行更多拟人化的人表示，人机互动对他们与家人和朋友的社交互动和关系产生了更大的影响。一项中介分析表明，人类与AI互动对人类社交结果的影响是通过人们拟人化AI代理的程度来中介的，而这本身与渴望社交连接有关。

更新时间: 2025-09-23 19:33:41

领域: cs.HC,cs.AI,cs.CY

下载: http://arxiv.org/abs/2509.19515v1

Bias-variance decompositions: the exclusive privilege of Bregman divergences

Bias-variance decompositions are widely used to understand the generalization performance of machine learning models. While the squared error loss permits a straightforward decomposition, other loss functions - such as zero-one loss or $L_1$ loss - either fail to sum bias and variance to the expected loss or rely on definitions that lack the essential properties of meaningful bias and variance. Recent research has shown that clean decompositions can be achieved for the broader class of Bregman divergences, with the cross-entropy loss as a special case. However, the necessary and sufficient conditions for these decompositions remain an open question. In this paper, we address this question by studying continuous, nonnegative loss functions that satisfy the identity of indiscernibles (zero loss if and only if the two arguments are identical), under mild regularity conditions. We prove that so-called $g$-Bregman divergences are the only such loss functions that have a clean bias-variance decomposition. A $g$-Bregman divergence can be transformed into a standard Bregman divergence through an invertible change of variables. This makes the squared Mahalanobis distance, up to such a variable transformation, the only symmetric loss function with a clean bias-variance decomposition. Consequently, common metrics such as $0$-$1$ and $L_1$ losses cannot admit a clean bias-variance decomposition, explaining why previous attempts have failed. We also examine the impact of relaxing the restrictions on the loss functions and how this affects our results.

Updated: 2025-09-23 19:30:33

标题: 偏差-方差分解：Bregman散度的独有特权

摘要: 偏差-方差分解广泛用于理解机器学习模型的泛化性能。虽然平方误差损失允许简单的分解，但其他损失函数 - 如零-一损失或 $L_1$ 损失 - 要么无法将偏差和方差求和为期望损失，要么依赖于缺乏有意义的偏差和方差的定义。最近的研究表明，可以为更广泛的 Bregman 散度类实现清晰的分解，其中交叉熵损失是一种特例。然而，这些分解的必要和充分条件仍然是一个悬而未决的问题。在本文中，我们通过研究连续、非负的损失函数，满足无法区分性的标识（仅当两个参数相同时为零损失），在温和的正则条件下解决这个问题。我们证明所谓的 $g$-Bregman 散度是唯一具有清晰偏差-方差分解的此类损失函数。通过可逆变量的变换，$g$-Bregman 散度可以转化为标准Bregman散度。这使得平方马氏距离，通过这样的变量转换，成为唯一具有清晰偏差-方差分解的对称损失函数。因此，常见的度量标准如 $0$-$1$ 和 $L_1$ 损失不能接受清晰的偏差-方差分解，从而解释了为什么以前的尝试失败。我们还研究了对损失函数限制的放宽对我们结果的影响。

更新时间: 2025-09-23 19:30:33

领域: cs.LG

下载: http://arxiv.org/abs/2501.18581v2

The Heterogeneous Multi-Agent Challenge

Multi-Agent Reinforcement Learning (MARL) is a growing research area which gained significant traction in recent years, extending Deep RL applications to a much wider range of problems. A particularly challenging class of problems in this domain is Heterogeneous Multi-Agent Reinforcement Learning (HeMARL), where agents with different sensors, resources, or capabilities must cooperate based on local information. The large number of real-world situations involving heterogeneous agents makes it an attractive research area, yet underexplored, as most MARL research focuses on homogeneous agents (e.g., a swarm of identical robots). In MARL and single-agent RL, standardized environments such as ALE and SMAC have allowed to establish recognized benchmarks to measure progress. However, there is a clear lack of such standardized testbed for cooperative HeMARL. As a result, new research in this field often uses simple environments, where most algorithms perform near optimally, or uses weakly heterogeneous MARL environments.

Updated: 2025-09-23 19:30:30

标题: 多智能体挑战的异构性

摘要: 多智能体强化学习（MARL）是一个日益发展的研究领域，在近年来获得了显着的关注，将深度强化学习的应用扩展到更广泛的问题范围。在这个领域中一个特别具有挑战性的问题类别是异构多智能体强化学习（HeMARL），其中具有不同传感器、资源或能力的智能体必须基于局部信息进行合作。大量涉及异构智能体的现实世界情况使其成为一个具有吸引力但尚未充分开发的研究领域，因为大多数MARL研究主要关注同质智能体（例如一群相同的机器人）。在MARL和单智能体强化学习中，诸如ALE和SMAC之类的标准化环境已经建立了公认的基准来衡量进展。然而，合作HeMARL缺乏这样的标准化测试平台。因此，这一领域的新研究通常使用简单的环境，其中大多数算法表现接近最佳，或者使用弱异构MARL环境。

更新时间: 2025-09-23 19:30:30

领域: cs.MA,cs.AI

下载: http://arxiv.org/abs/2509.19512v1

FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning

Black-Box Discrete Prompt Learning is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making the prompt tuning on a cloud-based Large Language Model (LLM) feasible. Adapting federated learning to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources. However, all previous research on federated black-box prompt tuning had neglected the substantial query cost associated with the cloud-based LLM service. To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning. Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we called \textit{FedOne}, enabled optimal query efficiency in federated black-box prompt learning. Building on this insight, we proposed the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs. We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results.

Updated: 2025-09-23 19:28:56

标题: FedOne：针对黑匣子离散提示学习的查询高效联邦学习

摘要: 黑盒离散提示学习是一种优化离散提示的提示调整方法，无需访问模型参数或梯度，从而使云端大型语言模型（LLM）上的提示调整成为可能。将联邦学习应用于BDPL可以通过利用来自不同来源的数据进一步增强提示调整性能。然而，先前关于联邦黑盒提示调整的所有研究都忽视了与云端LLM服务相关的实质性查询成本。为了弥补这一差距，我们在联邦黑盒提示调整的背景下进行了查询效率的理论分析。我们的研究结果显示，将FedAvg降级为每轮激活一个客户端的策略，我们称之为FedOne，可以实现在联邦黑盒提示学习中的最佳查询效率。基于这一洞察力，我们提出了FedOne框架，这是一种旨在在与云端LLM交互时最大化查询效率的联邦黑盒离散提示学习方法。我们在我们框架的各个方面进行了数值实验，展示了查询效率的显著改善，与我们的理论结果相一致。

更新时间: 2025-09-23 19:28:56

领域: cs.LG

下载: http://arxiv.org/abs/2506.14929v2

AIRwaves at CheckThat! 2025: Retrieving Scientific Sources for Implicit Claims on Social Media with Dual Encoders and Neural Re-Ranking

Linking implicit scientific claims made on social media to their original publications is crucial for evidence-based fact-checking and scholarly discourse, yet it is hindered by lexical sparsity, very short queries, and domain-specific language. Team AIRwaves ranked second in Subtask 4b of the CLEF-2025 CheckThat! Lab with an evidence-retrieval approach that markedly outperforms the competition baseline. The optimized sparse-retrieval baseline(BM25) achieves MRR@5 = 0.5025 on the gold label blind test set. To surpass this baseline, a two-stage retrieval pipeline is introduced: (i) a first stage that uses a dual encoder based on E5-large, fine-tuned using in-batch and mined hard negatives and enhanced through chunked tokenization and rich document metadata; and (ii) a neural re-ranking stage using a SciBERT cross-encoder. Replacing purely lexical matching with neural representations lifts performance to MRR@5 = 0.6174, and the complete pipeline further improves to MRR@5 = 0.6828. The findings demonstrate that coupling dense retrieval with neural re-rankers delivers a powerful and efficient solution for tweet-to-study matching and provides a practical blueprint for future evidence-retrieval pipelines.

Updated: 2025-09-23 19:26:31

标题: 《AIRwaves在CheckThat! 2025：使用双编码器和神经重新排序检索社交媒体上隐含声明的科学来源》

摘要: 将社交媒体上隐含的科学主张与其原始出版物联系起来对于基于证据的事实检查和学术讨论至关重要，然而，由于词汇稀缺、查询非常简短和领域特定语言等原因，这一过程受到阻碍。 AIRwaves团队在CLEF-2025 CheckThat!实验室的Subtask 4b中排名第二，他们采用了一种证据检索方法，明显优于竞争基线。经过优化的稀疏检索基线（BM25）在金标签盲测集上实现了MRR@5 = 0.5025。为了超越这一基线，引入了一个两阶段检索流程：（i）第一阶段使用基于E5-large的双编码器，通过批内和挖掘的难负例进行微调，并通过分块标记和丰富的文档元数据进行增强；（ii）使用SciBERT交叉编码器的神经重新排序阶段。用神经表示替代纯粹的词汇匹配将性能提升至MRR@5 = 0.6174，完整的流程进一步提高至MRR@5 = 0.6828。研究结果表明，将密集检索与神经重新排列器相结合，可以为推文到研究匹配提供强大且高效的解决方案，并为未来证据检索流程提供实用的蓝图。

更新时间: 2025-09-23 19:26:31

领域: cs.IR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19509v1

Frame-based Equivariant Diffusion Models for 3D Molecular Generation

Recent methods for molecular generation face a trade-off: they either enforce strict equivariance with costly architectures or relax it to gain scalability and flexibility. We propose a frame-based diffusion paradigm that achieves deterministic E(3)-equivariance while decoupling symmetry handling from the backbone. Building on this paradigm, we investigate three variants: Global Frame Diffusion (GFD), which assigns a shared molecular frame; Local Frame Diffusion (LFD), which constructs node-specific frames and benefits from additional alignment constraints; and Invariant Frame Diffusion (IFD), which relies on pre-canonicalized invariant representations. To enhance expressivity, we further utilize EdgeDiT, a Diffusion Transformer with edge-aware attention. On the QM9 dataset, GFD with EdgeDiT achieves state-of-the-art performance, with a test NLL of -137.97 at standard scale and -141.85 at double scale, alongside atom stability of 98.98%, and molecular stability of 90.51%. These results surpass all equivariant baselines while maintaining high validity and uniqueness and nearly 2x faster sampling compared to EDM. Altogether, our study establishes frame-based diffusion as a scalable, flexible, and physically grounded paradigm for molecular generation, highlighting the critical role of global structure preservation.

Updated: 2025-09-23 19:23:37

标题: 基于框架的等变扩散模型用于3D分子生成

摘要: 最近的分子生成方法面临一个权衡：要么强制使用昂贵的架构实现严格的等变性，要么放宽等变性以获得可扩展性和灵活性。我们提出了一种基于框架的扩散范式，实现了确定性的E(3)-等变性，同时将对称处理与主干解耦。基于这一范式，我们研究了三种变体：全局框架扩散（GFD），它分配一个共享的分子框架；局部框架扩散（LFD），它构建节点特定的框架，并受益于额外的对齐约束；不变框架扩散（IFD），它依赖于预规范化的不变表示。为了增强表达能力，我们进一步利用了EdgeDiT，一种具有边界感知注意力的扩散变换器。在QM9数据集上，GFD与EdgeDiT实现了最先进的性能，标准尺度下测试NLL为-137.97，在双倍尺度下为-141.85，同时原子稳定性为98.98%，分子稳定性为90.51%。这些结果超过了所有等变基线，同时保持了高有效性和独特性，并且与EDM相比，采样速度几乎快了2倍。总的来说，我们的研究建立了基于框架的扩散作为一种可扩展、灵活和物理基础的分子生成范式，突显了全局结构保持的关键作用。

更新时间: 2025-09-23 19:23:37

领域: cs.LG

下载: http://arxiv.org/abs/2509.19506v1

Constraint-Reduced MILP with Local Outlier Factor Modeling for Plausible Counterfactual Explanations in Credit Approval

Counterfactual explanation (CE) is a widely used post-hoc method that provides individuals with actionable changes to alter an unfavorable prediction from a machine learning model. Plausible CE methods improve realism by considering data distribution characteristics, but their optimization models introduce a large number of constraints, leading to high computational cost. In this work, we revisit the DACE framework and propose a refined Mixed-Integer Linear Programming (MILP) formulation that significantly reduces the number of constraints in the local outlier factor (LOF) objective component. We also apply the method to a linear SVM classifier with standard scaler. The experimental results show that our approach achieves faster solving times while maintaining explanation quality. These results demonstrate the promise of more efficient LOF modeling in counterfactual explanation and data science applications.

Updated: 2025-09-23 19:23:08

标题: 减少约束的局部异常因子建模MILP用于信用批准中合理的反事实解释

摘要: 反事实解释（CE）是一种广泛使用的事后方法，它为个体提供了可行的变化，以改变机器学习模型的不利预测。可信的CE方法通过考虑数据分布特征来提高现实性，但它们的优化模型引入了大量约束，导致计算成本高昂。在这项工作中，我们重新审视了DACE框架，并提出了一个精制的混合整数线性规划（MILP）公式，显著减少了局部异常因子（LOF）目标组件中的约束数量。我们还将该方法应用于具有标准缩放器的线性SVM分类器。实验结果显示，我们的方法在保持解释质量的同时实现更快的求解时间。这些结果展示了在反事实解释和数据科学应用中更高效的LOF建模的潜力。

更新时间: 2025-09-23 19:23:08

领域: cs.LG

下载: http://arxiv.org/abs/2509.19504v1

Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and costly: many prompts lack ground truth, forcing reliance on human judgment, while existing uncertainty and adequacy measures typically require full inference. A key challenge is to assess input adequacy in a way that reflects the demands of the task, ideally before even generating any output. We introduce CLOTHO, a task-specific, pre-generation adequacy measure that estimates input difficulty directly from hidden LLM states. Given a large pool of unlabelled inputs for a specific task, CLOTHO uses a Gaussian Mixture Model (GMM) to adaptively sample the most informative cases for human labelling. Based on this reference set the GMM can then rank unseen inputs by their likelihood of failure. In our empirical evaluation across eight benchmark tasks and three open-weight LLMs, CLOTHO can predict failures with a ROC-AUC of 0.716, after labelling reference sets that are on average only 5.4% of inputs. It does so without generating any outputs, thereby reducing costs compared to existing uncertainty measures. Comparison of CLOTHO and post-generation uncertainty measures shows that the two approaches complement each other. Crucially, we show that adequacy scores learnt from open-weight LLMs transfer effectively to proprietary models, extending the applicability of the approach. When prioritising test inputs for proprietary models, CLOTHO increases the average number of failing inputs from 18.7 to 42.5 out of 100, compared to random prioritisation.

Updated: 2025-09-23 19:15:16

标题: 克洛托：用于LLM输入的测量任务特定的预生成测试充分性

摘要: 软件越来越依赖大型语言模型（LLMs）的新兴功能，从自然语言理解到程序分析和生成。然而，在特定任务上对它们进行测试仍然困难且昂贵：许多提示缺乏基准数据，迫使依赖人类判断，而现有的不确定性和充分性度量通常需要完全推理。一个关键挑战是以反映任务需求的方式评估输入充分性，最好在生成任何输出之前就能做到。我们引入了CLOTHO，一个任务特定的、预生成的充分性度量，它直接从隐藏的LLM状态中估计输入难度。给定一个针对特定任务的大量未标记输入的池，CLOTHO使用高斯混合模型（GMM）自适应地对最具信息量的案例进行人工标记。基于这个参考集，GMM可以通过它们失败的可能性对未见输入进行排名。在我们的经验评估中，跨越八个基准任务和三个开放权重的LLMs，CLOTHO可以预测失败，ROC-AUC为0.716，在标记的参考集上，平均只有5.4%的输入。它在不生成任何输出的情况下做到这一点，因此与现有的不确定性度量相比，降低了成本。CLOTHO和后生成不确定性度量的比较表明，这两种方法互补。至关重要的是，我们展示了从开放权重的LLMs学到的充分性分数有效地转移到专有模型，扩展了这种方法的适用性。在为专有模型优先考虑测试输入时，与随机优先考虑相比，CLOTHO将每100个输入中的平均失败输入数量从18.7增加到42.5。

更新时间: 2025-09-23 19:15:16

领域: cs.SE,cs.LG

下载: http://arxiv.org/abs/2509.17314v2

Generative AI as a catalyst for democratic Innovation: Enhancing citizen engagement in participatory budgeting

This research examines the role of Generative Artificial Intelligence (AI) in enhancing citizen engagement in participatory budgeting. In response to challenges like declining civic participation and increased societal polarization, the study explores how online political participation can strengthen democracy and promote social equity. By integrating Generative AI into public consultation platforms, the research aims to improve citizen proposal formulation and foster effective dialogue between citizens and government. It assesses the capacities governments need to implement AI-enhanced participatory tools, considering technological dependencies and vulnerabilities. Analyzing technological structures, actors, interests, and strategies, the study contributes to understanding how technological advancements can reshape participatory institutions to better facilitate citizen involvement. Ultimately, the research highlights how Generative AI can transform participatory institutions, promoting inclusive, democratic engagement and empowering citizens.

Updated: 2025-09-23 19:09:31

标题: 生成式人工智能作为民主创新的催化剂：提升市民参与预算编制

摘要: 这项研究探讨了生成式人工智能（AI）在增强公民参与预算编制中的作用。针对诸如公民参与减少和社会极化加剧等挑战，研究探讨了在线政治参与如何加强民主并促进社会公平。通过将生成式AI整合到公共咨询平台中，该研究旨在改善公民提议的制定，并促进市民与政府之间的有效对话。研究评估了政府实施AI增强参与工具所需的能力，考虑到技术依赖性和脆弱性。通过分析技术结构、行为者、利益和策略，该研究有助于理解技术进步如何重塑参与机构以更好地促进公民参与。最终，研究突显了生成式AI如何可以改变参与机构，促进包容性、民主的参与并赋予市民权力。

更新时间: 2025-09-23 19:09:31

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2509.19497v1

ArtiFree: Detecting and Reducing Generative Artifacts in Diffusion-based Speech Enhancement

Diffusion-based speech enhancement (SE) achieves natural-sounding speech and strong generalization, yet suffers from key limitations like generative artifacts and high inference latency. In this work, we systematically study artifact prediction and reduction in diffusion-based SE. We show that variance in speech embeddings can be used to predict phonetic errors during inference. Building on these findings, we propose an ensemble inference method guided by semantic consistency across multiple diffusion runs. This technique reduces WER by 15% in low-SNR conditions, effectively improving phonetic accuracy and semantic plausibility. Finally, we analyze the effect of the number of diffusion steps, showing that adaptive diffusion steps balance artifact suppression and latency. Our findings highlight semantic priors as a powerful tool to guide generative SE toward artifact-free outputs.

Updated: 2025-09-23 19:04:18

标题: ArtiFree: 检测和减少扩散式语音增强中的生成性伪影

摘要: 扩散基于语音增强（SE）实现了自然的语音和强大的泛化能力，但存在生成性伪像和推理延迟高等关键限制。在这项工作中，我们系统研究了基于扩散的SE中的伪像预测和减少。我们表明，在推理过程中可以利用语音嵌入的方差来预测语音错误。基于这些发现，我们提出了一种由多次扩散运行中的语义一致性引导的集成推理方法。这种技术在低信噪比条件下将WER降低了15%，有效提高了语音准确性和语义合理性。最后，我们分析了扩散步数的影响，表明自适应扩散步数可以平衡伪像抑制和延迟。我们的发现突显了语义先验作为引导生成性SE朝向无伪像输出的强大工具。

更新时间: 2025-09-23 19:04:18

领域: cs.SD,cs.AI

下载: http://arxiv.org/abs/2509.19495v1

Differentially Private Compression and the Sensitivity of LZ77

We initiate the study of differentially private data-compression schemes motivated by the insecurity of the popular "Compress-Then-Encrypt" framework. Data compression is a useful tool which exploits redundancy in data to reduce storage/bandwidth when files are stored or transmitted. However, if the contents of a file are confidential then the length of a compressed file might leak confidential information about the content of the file itself. Encrypting a compressed file does not eliminate this leakage as data encryption schemes are only designed to hide the content of confidential message instead of the length of the message. In our proposed Differentially Private Compress-Then-Encrypt framework, we add a random positive amount of padding to the compressed file to ensure that any leakage satisfies the rigorous privacy guarantee of $(\epsilon,\delta)$-differential privacy. The amount of padding that needs to be added depends on the sensitivity of the compression scheme to small changes in the input, i.e., to what degree can changing a single character of the input message impact the length of the compressed file. While some popular compression schemes are highly sensitive to small changes in the input, we argue that effective data compression schemes do not necessarily have high sensitivity. Our primary technical contribution is analyzing the fine-grained sensitivity of the LZ77 compression scheme (IEEE Trans. Inf. Theory 1977) which is one of the most common compression schemes used in practice. We show that the global sensitivity of the LZ77 compression scheme has the upper bound $O(W^{2/3}\log n)$ where $W\leq n$ denotes the size of the sliding window. When $W=n$, we show the lower bound $\Omega(n^{2/3}\log^{1/3}n)$ for the global sensitivity of the LZ77 compression scheme which is tight up to a sublogarithmic factor.

Updated: 2025-09-23 19:01:31

标题: 差分隐私压缩和LZ77的灵敏度

摘要: 我们启动了不同ially private data-compression schemes的研究，其动机是流行的“Compress-Then-Encrypt”框架的不安全性。数据压缩是一种利用数据中的冗余来减少存储/带宽的有用工具，当文件存储或传输时。然而，如果文件的内容是机密的，那么压缩文件的长度可能会泄露关于文件内容的机密信息。对压缩文件进行加密并不能消除这种泄露，因为数据加密方案只设计用于隐藏机密消息的内容，而不是消息的长度。在我们提出的差分私有压缩-然后加密框架中，我们向压缩文件添加了随机正量的填充，以确保任何泄漏都满足$(\epsilon,\delta)$-差分隐私的严格隐私保证。需要添加的填充量取决于压缩方案对输入中小变化的敏感性，即改变输入消息的一个字符会对压缩文件的长度产生什么影响。虽然一些流行的压缩方案对输入中的小变化非常敏感，但我们认为有效的数据压缩方案不一定具有高敏感性。我们的主要技术贡献是分析LZ77压缩方案（IEEE Trans. Inf. Theory 1977）的细粒度敏感性，这是实践中最常用的压缩方案之一。我们展示了LZ77压缩方案的全局敏感性具有上界$O(W^{2/3}\log n)$，其中$W\leq n$表示滑动窗口的大小。当$W=n$时，我们展示了LZ77压缩方案的全局敏感性的下界为$\Omega(n^{2/3}\log^{1/3}n)$，这与次对数因子相匹配。

更新时间: 2025-09-23 19:01:31

领域: cs.CC,cs.CR

下载: http://arxiv.org/abs/2502.09584v3

Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, offer compact and effective alternatives to full model fine-tuning by introducing low-rank updates to pre-trained weights. However, most existing approaches rely on global low rank structures, which can overlook spatial patterns spread across the parameter space. In this work, we propose Localized LoRA, a generalized framework that models weight updates as a composition of low-rank matrices applied to structured blocks of the weight matrix. This formulation enables dense, localized updates throughout the parameter space without increasing the total number of trainable parameters. We provide a formal comparison between global, diagonal-local, and fully localized low-rank approximations, and show that our method consistently achieves lower approximation error under matched parameter budgets. Experiments on both synthetic and practical settings demonstrate that Localized LoRA offers a more expressive and adaptable alternative to existing methods, enabling efficient fine-tuning with improved performance.

Updated: 2025-09-23 18:56:10

标题: 局部化的LoRA：一种结构化的低秩逼近方法，用于高效微调

摘要: Parameter-efficient fine-tuning (PEFT)方法，如LoRA，通过引入低秩更新到预训练权重，提供了紧凑且有效的替代方案，而不是完全模型微调。然而，大多数现有方法依赖于全局低秩结构，可能忽视分布在参数空间中的空间模式。在这项工作中，我们提出了局部LoRA，这是一个广义框架，将权重更新建模为应用于权重矩阵的结构化块的低秩矩阵的组合。这种公式使得在不增加可训练参数总数的情况下，在整个参数空间中实现了密集的、局部化的更新。我们对全局、对角线局部和完全局部低秩逼近进行了正式比较，并显示出我们的方法在匹配参数预算下始终实现更低的逼近误差。在合成和实际设置上的实验证明了局部LoRA提供了一种更具表现力和适应性的替代方法，实现了具有改进性能的高效微调。

更新时间: 2025-09-23 18:56:10

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2506.00236v2

Estimating the Self-Consistency of LLMs

Systems often repeat the same prompt to large language models (LLMs) and aggregate responses to improve reliability. This short note analyzes an estimator of the self-consistency of LLMs and the tradeoffs it induces under a fixed compute budget $B=mn$, where $m$ is the number of prompts sampled from the task distribution and $n$ is the number of repeated LLM calls per prompt; the resulting analysis favors a rough split $m,n\propto\sqrt{B}$.

Updated: 2025-09-23 18:51:56

标题: 估计LLMs的自洽性

摘要: 系统通常会重复向大型语言模型（LLMs）发出相同的提示，并聚合响应以提高可靠性。本短文分析了LLMs自洽性的估计器，并在固定计算预算$B=mn$的情况下引发的权衡，其中$m$是从任务分布中采样的提示数量，$n$是每个提示的重复LLM调用数量；结果分析倾向于粗略分割$m,n\propto\sqrt{B}$。

更新时间: 2025-09-23 18:51:56

领域: cs.AI

下载: http://arxiv.org/abs/2509.19489v1

BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

Developing interactive agents that can understand language, perceive their surroundings, and act within the physical world is a long-standing goal of AI research. The Minecraft Collaborative Building Task (MCBT) (Narayan-Chen, Jayannavar, and Hockenmaier 2019), a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated 3D Blocks World environment, offers a rich platform to work towards this goal. In this work, we focus on the Builder Action Prediction (BAP) subtask: predicting B's actions in a multimodal game context (Jayannavar, Narayan-Chen, and Hockenmaier 2020) - a challenging testbed for grounded instruction following, with limited training data. We holistically re-examine this task and introduce BAP v2 to address key challenges in evaluation, training data, and modeling. Specifically, we define an enhanced evaluation benchmark, featuring a cleaner test set and fairer, more insightful metrics that also reveal spatial reasoning as the primary performance bottleneck. To address data scarcity and to teach models basic spatial skills, we generate different types of synthetic MCBT data. We observe that current, LLM-based SOTA models trained on the human BAP dialogues fail on these simpler, synthetic BAP ones, but show that training models on this synthetic data improves their performance across the board. We also introduce a new SOTA model, Llama-CRAFTS, which leverages richer input representations, and achieves an F1 score of 53.0 on the BAP v2 task and strong performance on the synthetic data. While this result marks a notable 6 points improvement over previous work, it also underscores the task's remaining difficulty, establishing BAP v2 as a fertile ground for future research, and providing a useful measure of the spatial capabilities of current text-only LLMs in such embodied tasks.

Updated: 2025-09-23 18:50:56

标题: BAP v2：Minecraft对话中用于指令跟随的增强任务框架

摘要: 开发能够理解语言、感知周围环境并在物理世界内行动的互动代理是人工智能研究的长期目标。《Minecraft Collaborative Building Task (MCBT)》（纳拉扬-陈、Jayannavar和霍肯迈尔，2019年）是一个双人游戏，建筑师（A）指导建造者（B）在模拟的3D方块世界环境中建造目标结构，为实现这一目标提供了丰富的平台。在这项工作中，我们专注于建造者行动预测（BAP）子任务：预测B在多模态游戏环境中的行动（Jayannavar、纳拉扬-陈和霍肯迈尔，2020年），这是一个具有有限训练数据的基于指导的测试平台。我们全面重新审视这项任务，并引入了BAP v2来解决评估、训练数据和建模中的关键挑战。具体来说，我们定义了一个增强的评估基准，包括一个更清晰的测试集和更公平、更具洞察力的指标，同时也揭示了空间推理作为主要性能瓶颈。为了解决数据稀缺问题并教导模型基本的空间技能，我们生成了不同类型的合成MCBT数据。我们观察到，目前基于人类BAP对话训练的LLM最先进模型在这些更简单、合成的BAP数据上表现不佳，但表明在这些合成数据上训练模型可以改善它们在各方面的性能。我们还引入了一个新的最先进模型，Llama-CRAFTS，它利用更丰富的输入表示，在BAP v2任务上获得了53.0的F1分数，并在合成数据上表现出色。虽然这一结果相对于以前的工作有显着的6分改进，但也突显了任务仍然存在的困难，将BAP v2确立为未来研究的肥沃土壤，并为当前仅文本LLM在此类具体任务中的空间能力提供了一个有用的度量。

更新时间: 2025-09-23 18:50:56

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2501.10836v3

Identifying and Addressing User-level Security Concerns in Smart Homes Using "Smaller" LLMs

With the rapid growth of smart home IoT devices, users are increasingly exposed to various security risks, as evident from recent studies. While seeking answers to know more on those security concerns, users are mostly left with their own discretion while going through various sources, such as online blogs and technical manuals, which may render higher complexity to regular users trying to extract the necessary information. This requirement does not go along with the common mindsets of smart home users and hence threatens the security of smart homes furthermore. In this paper, we aim to identify and address the major user-level security concerns in smart homes. Specifically, we develop a novel dataset of Q&A from public forums, capturing practical security challenges faced by smart home users. We extract major security concerns in smart homes from our dataset by leveraging the Latent Dirichlet Allocation (LDA). We fine-tune relatively "smaller" transformer models, such as T5 and Flan-T5, on this dataset to build a QA system tailored for smart home security. Unlike larger models like GPT and Gemini, which are powerful but often resource hungry and require data sharing, smaller models are more feasible for deployment in resource-constrained or privacy-sensitive environments like smart homes. The dataset is manually curated and supplemented with synthetic data to explore its potential impact on model performance. This approach significantly improves the system's ability to deliver accurate and relevant answers, helping users address common security concerns with smart home IoT devices. Our experiments on real-world user concerns show that our work improves the performance of the base models.

Updated: 2025-09-23 18:47:59

标题: 使用“较小”LLMs识别和解决智能家居中用户级安全问题

摘要: 随着智能家居物联网设备的快速增长，用户越来越多地面临各种安全风险，这一点可以从最近的研究中看出。在寻找答案以了解这些安全问题时，用户大多数时候只能靠自己去查阅各种来源，比如在线博客和技术手册，这可能会给试图提取必要信息的普通用户带来更高的复杂性。这种需求与智能家居用户的常见心态不符，因此进一步威胁了智能家居的安全。本文旨在确定和解决智能家居中主要的用户级安全问题。具体来说，我们从公共论坛中开发了一个新颖的问答数据集，捕捉了智能家居用户所面临的实际安全挑战。我们利用潜在狄利克雷分配（LDA）从数据集中提取智能家居中的主要安全问题。我们在这个数据集上对相对较小的变换器模型（如T5和Flan-T5）进行微调，以构建一个专为智能家居安全定制的问答系统。与像GPT和Gemini这样强大但常常资源密集且需要数据共享的大型模型不同，较小的模型更适合在资源受限或对隐私敏感的环境，如智能家居中部署。该数据集经过人工精心策划，并补充了合成数据，以探索其对模型性能的潜在影响。这种方法显著提高了系统提供准确和相关答案的能力，帮助用户解决智能家居物联网设备的常见安全问题。我们对真实用户关注的实验表明，我们的工作改善了基础模型的性能。

更新时间: 2025-09-23 18:47:59

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2509.19485v1

Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures

Recent advances in generative world models have enabled classical safe control methods, such as Hamilton-Jacobi (HJ) reachability, to generalize to complex robotic systems operating directly from high-dimensional sensor observations. However, obtaining comprehensive coverage of all safety-critical scenarios during world model training is extremely challenging. As a result, latent safety filters built on top of these models may miss novel hazards and even fail to prevent known ones, overconfidently misclassifying risky out-of-distribution (OOD) situations as safe. To address this, we introduce an uncertainty-aware latent safety filter that proactively steers robots away from both known and unseen failures. Our key idea is to use the world model's epistemic uncertainty as a proxy for identifying unseen potential hazards. We propose a principled method to detect OOD world model predictions by calibrating an uncertainty threshold via conformal prediction. By performing reachability analysis in an augmented state space-spanning both the latent representation and the epistemic uncertainty-we synthesize a latent safety filter that can reliably safeguard arbitrary policies from both known and unseen safety hazards. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our uncertainty-aware safety filter preemptively detects potential unsafe scenarios and reliably proposes safe, in-distribution actions. Video results can be found on the project website at https://cmu-intentlab.github.io/UNISafe

Updated: 2025-09-23 18:47:32

标题: 考虑不确定性的潜在安全过滤器，用于避免超出分布的故障

摘要: 近年来，生成式世界模型的进展使得传统的安全控制方法，如哈密尔顿-雅可比（HJ）可达性，能够推广到直接从高维传感器观测中操作的复杂机器人系统。然而，在世界模型训练过程中获得所有安全关键场景的全面覆盖是极具挑战性的。因此，建立在这些模型之上的潜在安全过滤器可能会错过新颖的危险，并甚至无法阻止已知的危险情况，过于自信地错误分类有风险的超出分布（OOD）情况为安全。为了解决这个问题，我们提出了一种基于不确定性的潜在安全过滤器，可以主动地将机器人引导远离已知和未知的故障。我们的关键思想是利用世界模型的认识不确定性作为识别未知潜在危险的代理。我们提出了一种通过合规预测来校准不确定性阈值的原则方法，以侦测OOD世界模型预测。通过在增强状态空间中进行可达性分析，跨越潜在表示和认识不确定性，我们综合了一个可以可靠地保护任意策略免受已知和未知安全危险的潜在安全过滤器。通过在Franka机械臂上进行基于视觉的控制任务的仿真和硬件实验，我们展示了我们的不确定性感知安全过滤器可以预先检测潜在的不安全情况，并可靠地提出安全、分布内的行动。视频结果可以在项目网站 https://cmu-intentlab.github.io/UNISafe 上找到。

更新时间: 2025-09-23 18:47:32

领域: cs.RO,cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2505.00779v2

Language Models Fail to Introspect About Their Knowledge of Language

There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking "Is this sentence grammatical?"). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged "self-access". By using general tasks, controlling for model similarity, and evaluating a wide range of open-source models, we show that LLMs cannot introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.

Updated: 2025-09-23 18:42:02

标题: 语言模型未能自省其对语言知识的了解

摘要: 最近人们对于大型语言模型（LLMs）是否能够自省其内部状态产生了兴趣。这样的能力将使LLMs更具可解释性，并且验证语言学中使用标准自省方法来评估模型中的语法知识的有效性（例如，询问“这个句子是否符合语法？”）。我们系统地调查了21个开源LLMs之间的自发自省，涉及了两个理论上感兴趣的领域：语法知识和单词预测。在这两个领域中，模型的内部语言知识可以通过直接测量字符串概率来进行理论上的基础。然后，我们评估模型对元语言提示的响应是否忠实地反映了其内部知识。我们提出了一种新的自省度量：模型受提示后的响应预测其自身字符串概率的程度，超出了另一个具有几乎相同内部知识的模型所能预测的范围。尽管元语言提示和概率比较都导致高任务准确性，但我们并未找到LLMs具有特权“自我访问”的证据。通过使用普通任务、控制模型相似性，并评估各种开源模型，我们展示了LLMs不能进行自省，并为论证提示响应不应与模型的语言概括混淆提供了新证据。

更新时间: 2025-09-23 18:42:02

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.07513v3

GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning

GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question based solely on visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines for precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this field.

Updated: 2025-09-23 18:41:49

标题: 嵌入式：用于文本对齐的图表推理——结构化指令遵循和视觉推理的基准

摘要: GRAFT是一个结构化的多模态基准，用于评估模型在遵循指令、视觉推理和视觉文本对齐任务上的表现。它具有通过Python可视化库创建的程序生成的图表和合成渲染的表格，以确保对数据语义、结构和清晰度的控制。每个GRAFT实例将图表或表格图像与仅基于视觉内容的系统生成的多步分析问题配对。答案以JSON或YAML等结构化格式提供，支持对推理和输出格式的一致评估。该基准引入了一种推理类型的分类，包括比较、趋势识别、排序、聚合、比例估计和异常检测，以实现全面评估。参考答案遵循严格的事实和格式指南，用于进行精确、基于方面的评估。GRAFT为在视觉基础上、结构化推理任务上对多模态模型进行细粒度基准测试提供了一个统一的、可扩展的框架，树立了这一领域的新评估标准。

更新时间: 2025-09-23 18:41:49

领域: cs.AI,cs.LG,cs.MM

下载: http://arxiv.org/abs/2508.15690v2

OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models. We present videos showcasing OmniVLA performance and will release its checkpoints and training code on our project page.

Updated: 2025-09-23 18:40:29

标题: OmniVLA：机器人导航的全模态视觉语言行为模型

摘要: 人类在导航到目的地时，可以灵活地解释和组合不同的目标规范，如语言指令、空间坐标或视觉参考。相比之下，大多数现有的机器人导航策略都是在单一模态下训练的，这限制了它们在现实世界场景中适应不同形式目标规范的能力，这些规范是自然且互补的。在这项工作中，我们提出了一个用于机器人基础模型的训练框架，实现基于视觉的导航的全模态目标条件。我们的方法利用高容量的视觉-语言-动作（VLA）骨干，通过随机模态融合策略训练三种主要目标模态：2D姿势、自我中心图像和自然语言，以及它们的组合。这种设计不仅扩展了可用数据集的范围，还鼓励策略发展更丰富的几何、语义和视觉表示。结果模型OmniVLA在未见环境中表现出强大的泛化能力，对稀缺模态具有稳健性，并能遵循新颖的自然语言指令。我们展示了OmniVLA在不同模态上优于专家基线，并为对新模态和任务进行微调提供了灵活的基础。我们相信OmniVLA是通向广泛可泛化和灵活导航策略的一步，并为构建全模态机器人基础模型提供了可扩展的路径。我们将展示展示OmniVLA性能的视频，并将在我们的项目页面上发布其检查点和训练代码。

更新时间: 2025-09-23 18:40:29

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2509.19480v1

Quantum Harmonic Analysis and the Structure in Data: Augmentation

In this short note, we study the impact of data augmentation on the smoothness of principal components of high-dimensional datasets. Using tools from quantum harmonic analysis, we show that eigenfunctions of operators corresponding to augmented data sets lie in the modulation space $M^1(\mathbb{R}^d)$, guaranteeing smoothness and continuity. Numerical examples on synthetic and audio data confirm the theoretical findings. While interesting in itself, the results suggest that manifold learning and feature extraction algorithms can benefit from systematic and informed augmentation principles.

Updated: 2025-09-23 18:30:35

标题: 量子谐波分析与数据结构：增强

摘要: 在这篇简短的论文中，我们研究了数据增强对高维数据集主成分平滑度的影响。使用量子谐波分析工具，我们表明与增强数据集对应的算子的特征函数位于调制空间$M^1(\mathbb{R}^d)$中，保证了平滑性和连续性。在合成和音频数据上的数值实例证实了理论发现。虽然本身很有趣，但结果表明流形学习和特征提取算法可以从系统化和明智的增强原则中受益。

更新时间: 2025-09-23 18:30:35

领域: math.FA,cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2509.19474v1

Patterns in the Transition From Founder-Leadership to Community Governance of Open Source

Open digital public infrastructure needs community management to ensure accountability, sustainability, and robustness. Yet open-source projects often rely on centralized decision-making, and the determinants of successful community management remain unclear. We analyze 637 GitHub repositories to trace transitions from founder-led to shared governance. Specifically, we document trajectories to community governance by extracting institutional roles, actions, and deontic cues from version-controlled project constitutions GOVERNANCE.md. With a semantic parsing pipeline, we cluster elements into broader role and action types. We find roles and actions grow, and regulation becomes more balanced, reflecting increases in governance scope and differentiation over time. Rather than shifting tone, communities grow by layering and refining responsibilities. As transitions to community management mature, projects increasingly regulate ecosystem-level relationships and add definition to project oversight roles. Overall, this work offers a scalable pipeline for tracking the growth and development of community governance regimes from open-source software's familiar default of founder-ownership.

Updated: 2025-09-23 18:30:24

标题: 开源项目从创始人领导到社区治理的过渡中的模式

摘要: 开放数字公共基础设施需要社区管理来确保问责、可持续性和稳健性。然而，开源项目通常依赖于集中化的决策，成功的社区管理因素仍不清楚。我们分析了637个GitHub存储库，以追踪从创始人主导到共享治理的转变。具体而言，我们通过提取版本控制项目宪法GOVERNANCE.md中的机构角色、行动和义务线索来记录社区治理的轨迹。通过语义解析流水线，我们将元素聚类成更广泛的角色和行动类型。我们发现角色和行动在不断增长，监管变得更加平衡，反映了随着时间的推移，治理范围和差异化的增加。社区不是通过改变基调而是通过层层堆叠和细化责任而增长。随着转变到社区管理的成熟，项目越来越多地调节生态系统级别的关系，并为项目监督角色增添定义。总的来说，这项工作提供了一个可扩展的流水线，用于跟踪开源软件从创始人所有权的熟悉默认中社区治理制度的增长和发展。

更新时间: 2025-09-23 18:30:24

领域: cs.CY,cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.16295v2

Embedding Alignment in Code Generation for Audio

LLM-powered code generation has the potential to revolutionize creative coding endeavors, such as live-coding, by enabling users to focus on structural motifs over syntactic details. In such domains, when prompting an LLM, users may benefit from considering multiple varied code candidates to better realize their musical intentions. Code generation models, however, struggle to present unique and diverse code candidates, with no direct insight into the code's audio output. To better establish a relationship between code candidates and produced audio, we investigate the topology of the mapping between code and audio embedding spaces. We find that code and audio embeddings do not exhibit a simple linear relationship, but supplement this with a constructed predictive model that shows an embedding alignment map could be learned. Supplementing the aim for musically diverse output, we present a model that given code predicts output audio embedding, constructing a code-audio embedding alignment map.

Updated: 2025-09-23 18:29:44

标题: 将对齐嵌入到音频代码生成中

摘要: LLM动力的代码生成具有革命性潜力，可以改变创意编码活动，例如现场编码，使用户能够专注于结构主题而不是语法细节。在这样的领域中，当提示LLM时，用户可能通过考虑多个不同的代码候选项来更好地实现他们的音乐意图。然而，代码生成模型很难呈现独特和多样化的代码候选项，并且无法直接了解代码的音频输出。为了更好地建立代码候选项和生成的音频之间的关系，我们研究了代码和音频嵌入空间之间的映射拓扑。我们发现代码和音频嵌入并不表现为简单的线性关系，但通过构建一个预测模型来补充这一点，该模型显示出可以学习到一个嵌入对齐映射。为了补充对音乐多样化输出的目标，我们提出了一个模型，给定代码预测输出音频嵌入，构建一个代码-音频嵌入对齐映射。

更新时间: 2025-09-23 18:29:44

领域: cs.MM,cs.AI,cs.SD,eess.AS

下载: http://arxiv.org/abs/2508.05473v2

Transformer Modeling for Both Scalability and Performance in Multivariate Time Series

Variable count is among the main scalability bottlenecks for transformer modeling in multivariate time series (MTS) data. On top of this, a growing consensus in the field points to indiscriminate inter-variable mixing as a potential source of noise-accumulation and performance degradation. This is likely exacerbated by sparsity of informative signals characteristic of many MTS systems coupled with representational misalignment stemming from indiscriminate information mixing between (heterogeneous) variables. While scalability and performance are often seen as competing interests in transformer design, we show that both can be improved simultaneously in MTS by strategically constraining the representational capacity of inter-variable mixing. Our proposed method, transformer with Delegate Token Attention (DELTAformer), constrains inter-variable modeling through what we call delegate tokens which are then used to perform full, unconstrained, inter-temporal modeling. Delegate tokens act as an implicit regularizer that forces the model to be highly selective about what inter-variable information is allowed to propagate through the network. Our results show that DELTAformer scales linearly with variable-count while actually outperforming standard transformers, achieving state-of-the-art performance across benchmarks and baselines. In addition, DELTAformer can focus on relevant signals better than standard transformers in noisy MTS environments and overall exhibit superior noise-resilience. Overall, results across various experiments confirm that by aligning our model design to leverage domain-specific challenges in MTS to our advantage, DELTAformer can simultaneously achieve linear scaling while actually improving its performance against standard, quadratic transformers.

Updated: 2025-09-23 18:28:24

标题: 多元时间序列中变压器模型的可伸缩性和性能模拟

摘要: 变量计数是变压器模型在多变量时间序列（MTS）数据中的主要可伸缩性瓶颈之一。除此之外，该领域中日益形成的共识指出，无差别的变量混合可能是噪声积累和性能降级的潜在来源。这可能会加剧许多MTS系统中信息信号稀疏性和来自（异质）变量之间无差别信息混合导致的表征不匹配。虽然可伸缩性和性能通常被视为变压器设计中相互竞争的利益，但我们表明在MTS中可以通过战略性地限制变量间混合的表征能力来同时改善两者。我们提出的方法，Delegate Token Attention的transformer（DELTAformer），通过我们称之为代理标记的方式来限制变量间建模，然后用于执行完全、不受限的跨时间建模。代理标记充当一种隐式正则化器，强制模型高度选择哪些变量间信息允许通过网络传播。我们的结果表明，DELTAformer随着变量计数呈线性增长，实际上胜过标准变压器，在基准测试和基线上实现了最先进的性能。此外，在嘈杂的MTS环境中，DELTAformer能够比标准变压器更好地关注相关信号，并且总体上表现出更优越的抗噪性。总的来说，各种实验结果证实，通过将我们的模型设计与MTS中的领域特定挑战结合起来，DELTAformer可以同时实现线性扩展，同时实际上提高其对标准二次变压器的性能。

更新时间: 2025-09-23 18:28:24

领域: cs.LG

下载: http://arxiv.org/abs/2509.19471v1

Macroeconomic Forecasting with Large Language Models

This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macro time series forecasting approaches. In recent times, LLMs have surged in popularity for forecasting due to their ability to capture intricate patterns in data and quickly adapt across very different domains. However, their effectiveness in forecasting macroeconomic time series data compared to conventional methods remains an area of interest. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using as common ground the FRED-MD database. Our findings provide valuable insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios

Updated: 2025-09-23 18:27:49

标题: 大型语言模型在宏观经济预测中的应用

摘要: 本文提出了一项比较分析，评估大型语言模型（LLMs）与传统的宏观时间序列预测方法的准确性。最近，由于它们能够捕捉数据中的复杂模式并快速跨越非常不同的领域进行调整，LLMs在预测方面变得越来越受欢迎。然而，与传统方法相比，它们在宏观经济时间序列数据预测中的有效性仍然是一个有趣的领域。为了解决这个问题，我们对LLMs和传统宏观预测方法进行了严格评估，以FRED-MD数据库作为共同基础。我们的发现为我们提供了有关LLMs在预测宏观经济时间序列中的优势和局限性的宝贵见解，为它们在现实场景中的适用性提供了启示。

更新时间: 2025-09-23 18:27:49

领域: econ.EM,cs.CL,cs.LG

下载: http://arxiv.org/abs/2407.00890v4

THINNs: Thermodynamically Informed Neural Networks

Physics-Informed Neural Networks (PINNs) are a class of deep learning models aiming to approximate solutions of PDEs by training neural networks to minimize the residual of the equation. Focusing on non-equilibrium fluctuating systems, we propose a physically informed choice of penalization that is consistent with the underlying fluctuation structure, as characterized by a large deviations principle. This approach yields a novel formulation of PINNs in which the penalty term is chosen to penalize improbable deviations, rather than being selected heuristically. The resulting thermodynamically consistent extension of PINNs, termed THINNs, is subsequently analyzed by establishing analytical a posteriori estimates, and providing empirical comparisons to established penalization strategies.

Updated: 2025-09-23 18:22:47

标题: THINNs: 热力学信息神经网络

摘要: 物理启发神经网络（PINN）是一类深度学习模型，旨在通过训练神经网络以最小化方程残差来逼近PDE的解。专注于非平衡波动系统，我们提出了一种物理启发的惩罚选择，与大偏差原则所表征的波动结构一致。这种方法产生了一种新颖的PINN形式，其中惩罚项被选择为惩罚不太可能的偏差，而不是凭经验选择。随后通过建立分析后验估计和提供与已建立的惩罚策略的经验比较，对得出的热力学一致的PINN扩展，称为THINNs进行分析。

更新时间: 2025-09-23 18:22:47

领域: cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2509.19467v1

RealitySummary: Exploring On-Demand Mixed Reality Text Summarization and Question Answering using Large Language Models

Large Language Models (LLMs) are gaining popularity as reading and summarization aids. However, little is known about their potential benefits when integrated with mixed reality (MR) interfaces to support everyday reading. In this iterative investigation, we developed RealitySummary, an MR reading assistant that seamlessly integrates LLMs with always-on camera access, OCR-based text extraction, and augmented spatial and visual responses. Developed iteratively, RealitySummary evolved across three versions, each shaped by user feedback and reflective analysis: 1) a preliminary user study to understand reader perceptions (N=12), 2) an in-the-wild deployment to explore real-world usage (N=11), and 3) a diary study to capture insights from real-world work contexts (N=5). Our empirical studies' findings highlight the unique advantages of combining AI and MR, including always-on implicit assistance, long-term temporal history, minimal context switching, and spatial affordances, demonstrating significant potential for future LLM-MR interfaces beyond traditional screen-based interactions.

Updated: 2025-09-23 18:21:29

标题: 现实总结：使用大型语言模型探索按需混合现实文本摘要和问答

摘要: 大型语言模型（LLMs）作为阅读和摘要辅助工具正在日益受到关注。然而，关于它们与混合现实（MR）界面集成以支持日常阅读的潜在好处知之甚少。在这个迭代的调查中，我们开发了RealitySummary，一个MR阅读助手，它将LLMs与始终开启的摄像头访问、基于OCR的文本提取以及增强的空间和视觉响应无缝集成在一起。通过迭代开发，RealitySummary经历了三个版本的演变，每个版本都受到用户反馈和反思分析的影响：1）进行初步用户研究以了解读者的看法（N=12），2）进行实地部署以探索实际使用情况（N=11），3）进行日记研究以捕捉来自实际工作环境的见解（N=5）。我们的实证研究结果突显了结合AI和MR的独特优势，包括始终开启的隐式辅助、长期的时间历史、最小化的上下文切换以及空间优势，展示了未来LLM-MR界面在超越传统基于屏幕的交互方面具有显著潜力。

更新时间: 2025-09-23 18:21:29

领域: cs.HC,cs.AI,cs.CL

下载: http://arxiv.org/abs/2405.18620v3

A Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models

Cross-frequency transfer learning (CFTL) has emerged as a popular framework for curating large-scale time series datasets to pre-train foundation forecasting models (FFMs). Although CFTL has shown promise, current benchmarking practices fall short of accurately assessing its performance. This shortcoming stems from many factors: an over-reliance on small-scale evaluation datasets; inadequate treatment of sample size when computing summary statistics; reporting of suboptimal statistical models; and failing to account for non-negligible risks of overlap between pre-training and test datasets. To address these limitations, we introduce a unified reimplementation of widely-adopted neural forecasting networks, adapting them for the CFTL setup; we pre-train only on proprietary and synthetic data, being careful to prevent test leakage; and we evaluate on 15 large, diverse public forecast competition datasets. Our empirical analysis reveals that statistical models' accuracy is frequently underreported. Notably, we confirm that statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS, and by more than 20% MASE, across datasets. However, we also find that synthetic dataset pre-training does improve the accuracy of a FFM by 7% percent.

Updated: 2025-09-23 18:19:50

标题: 一种现实评估跨频率转移学习和基础预测模型

摘要: 跨频率转移学习（CFTL）已经成为一个流行的框架，用于筛选大规模时间序列数据集，以预先训练基础预测模型（FFMs）。尽管CFTL显示出潜力，但目前的基准实践未能准确评估其性能。这一不足源于许多因素：过度依赖小规模评估数据集；在计算摘要统计数据时对样本大小处理不足；报告了次优的统计模型；并未考虑预训练和测试数据集之间重叠的风险。为了解决这些限制，我们引入了一个统一的重新实现广泛采用的神经预测网络，对其进行了适应CFTL设置的调整；我们仅在专有和合成数据上进行预训练，注意防止测试数据泄漏；并在15个大型、多样化的公共预测竞赛数据集上进行评估。我们的经验分析显示，统计模型的准确性经常被低估。值得注意的是，我们确认统计模型及其集合在各个数据集上始终优于现有的FFMs，sCRPS高出8.2%以上，MASE高出20%以上。然而，我们也发现合成数据集的预训练确实提高了FFM的准确性，提高了7%。

更新时间: 2025-09-23 18:19:50

领域: cs.LG,cs.AI,stat.AP

下载: http://arxiv.org/abs/2509.19465v1

Evaluation-Aware Reinforcement Learning

Policy evaluation is often a prerequisite for deploying safety- and performance-critical systems. Existing evaluation approaches frequently suffer from high variance due to limited data and long-horizon tasks, or high bias due to unequal support or inaccurate environmental models. We posit that these challenges arise, in part, from the standard reinforcement learning (RL) paradigm of policy learning without explicit consideration of evaluation. As an alternative, we propose evaluation-aware reinforcement learning (EvA-RL), in which a policy is trained to maximize expected return while simultaneously minimizing expected evaluation error under a given value prediction scheme -- in other words, being "easy" to evaluate. We formalize a framework for EvA-RL and design an instantiation that enables accurate policy evaluation, conditioned on a small number of rollouts in an assessment environment that can be different than the deployment environment. However, our theoretical analysis and empirical results show that there is often a tradeoff between evaluation accuracy and policy performance when using a fixed value-prediction scheme within EvA-RL. To mitigate this tradeoff, we extend our approach to co-learn an assessment-conditioned state-value predictor alongside the policy. Empirical results across diverse discrete and continuous action domains demonstrate that EvA-RL can substantially reduce evaluation error while maintaining competitive returns. This work lays the foundation for a broad new class of RL methods that treat reliable evaluation as a first-class principle during training.

Updated: 2025-09-23 18:17:21

标题: 评估感知强化学习

摘要: 政策评估通常是部署安全和性能关键系统的先决条件。现有的评估方法常常由于数据有限和长时间跨度任务，或者由于支持不均或环境模型不准确而导致高方差，或高偏差。我们认为，这些挑战部分源于标准强化学习（RL）范例的政策学习，而没有明确考虑评估。作为一种替代方案，我们提出了具有评估意识的强化学习（EvA-RL），在这种方法中，一个政策被训练以最大化预期回报的同时最小化给定值预测方案下的预期评估误差，换句话说，易于评估。我们为EvA-RL制定了一个框架，并设计了一个实例，使政策在一个评估环境中进行少量回合的前提下，可以实现准确的政策评估，这个环境可能与部署环境不同。然而，我们的理论分析和实证结果表明，在EvA-RL中使用固定值预测方案时，评估准确性和政策性能之间经常存在权衡。为了缓解这种权衡，我们扩展了我们的方法，同时与政策一起协同学习一个评估条件的状态值预测器。跨越各种离散和连续动作领域的实证结果表明，EvA-RL可以大幅降低评估误差，同时保持竞争性回报。这项工作为一类新的RL方法奠定了基础，这些方法将可靠的评估视为培训过程中的第一要务。

更新时间: 2025-09-23 18:17:21

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19464v1

Self-evolved Imitation Learning in Simulated World

Imitation learning has been a trend recently, yet training a generalist agent across multiple tasks still requires large-scale expert demonstrations, which are costly and labor-intensive to collect. To address the challenge of limited supervision, we propose Self-Evolved Imitation Learning (SEIL), a framework that progressively improves a few-shot model through simulator interactions. The model first attempts tasksin the simulator, from which successful trajectories are collected as new demonstrations for iterative refinement. To enhance the diversity of these demonstrations, SEIL employs dual-level augmentation: (i) Model-level, using an Exponential Moving Average (EMA) model to collaborate with the primary model, and (ii) Environment-level, introducing slight variations in initial object positions. We further introduce a lightweight selector that filters complementary and informative trajectories from the generated pool to ensure demonstration quality. These curated samples enable the model to achieve competitive performance with far fewer training examples. Extensive experiments on the LIBERO benchmark show that SEIL achieves a new state-of-the-art performance in few-shot imitation learning scenarios. Code is available at https://github.com/Jasper-aaa/SEIL.git.

Updated: 2025-09-23 18:15:32

标题: 在模拟世界中的自我进化模仿学习

摘要: 模仿学习最近已经成为一种趋势，然而在多个任务上训练通用代理仍然需要大规模的专家演示，这些演示收集起来既昂贵又耗时。为了解决监督有限的挑战，我们提出了自进化模仿学习（SEIL）框架，通过模拟器交互逐步改进少样本模型。该模型首先尝试在模拟器中完成任务，成功的轨迹被收集作为迭代改进的新演示。为了增加这些演示的多样性，SEIL采用了双层增强：（i）模型级别，使用指数移动平均（EMA）模型与主要模型协作，以及（ii）环境级别，在初始物体位置引入轻微变化。我们还引入了一个轻量级选择器，从生成的池中过滤出互补和信息丰富的轨迹，以确保演示质量。这些经过策划的样本使模型能够在更少的训练示例下实现有竞争力的性能。对LIBERO基准测试的大量实验表明，SEIL在少样本模仿学习场景中实现了新的最先进性能。代码可在https://github.com/Jasper-aaa/SEIL.git获取。

更新时间: 2025-09-23 18:15:32

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19460v1

The Indispensable Role of User Simulation in the Pursuit of AGI

Progress toward Artificial General Intelligence (AGI) faces significant bottlenecks, particularly in rigorously evaluating complex interactive systems and acquiring the vast interaction data needed for training adaptive agents. This paper posits that user simulation -- creating computational agents that mimic human interaction with AI systems -- is not merely a useful tool, but is a critical catalyst required to overcome these bottlenecks and accelerate AGI development. We argue that realistic simulators provide the necessary environments for scalable evaluation, data generation for interactive learning, and fostering the adaptive capabilities central to AGI. Therefore, research into user simulation technology and intelligent task agents are deeply synergistic and must advance hand-in-hand. This article elaborates on the critical role of user simulation for AGI, explores the interdisciplinary nature of building realistic simulators, identifies key challenges including those posed by large language models, and proposes a future research agenda.

Updated: 2025-09-23 18:12:45

标题: 用户模拟在追求AGI中不可或缺的角色

摘要: 人工通用智能（AGI）的进展面临着重要瓶颈，特别是在严格评估复杂互动系统和获取训练自适应代理所需的大量交互数据方面。本文认为用户模拟——创建模仿人类与人工智能系统互动的计算代理——不仅是一个有用的工具，而且是克服这些瓶颈并加速AGI发展所需的关键催化剂。我们认为，真实的模拟器为可扩展评估、交互式学习的数据生成以及促进AGI核心自适应能力提供了必要的环境。因此，对用户模拟技术和智能任务代理的研究具有深层次的协同作用，必须手牵手前进。本文详细阐述了用户模拟在AGI中的关键作用，探讨了建立真实模拟器的跨学科性质，确定了主要挑战，包括由大型语言模型提出的挑战，并提出了未来研究议程。

更新时间: 2025-09-23 18:12:45

领域: cs.AI

下载: http://arxiv.org/abs/2509.19456v1

Anchored Langevin Algorithms

Standard first-order Langevin algorithms such as the unadjusted Langevin algorithm (ULA) are obtained by discretizing the Langevin diffusion and are widely used for sampling in machine learning because they scale to high dimensions and large datasets. However, they face two key limitations: (i) they require differentiable log-densities, excluding targets with non-differentiable components; and (ii) they generally fail to sample heavy-tailed targets. We propose anchored Langevin dynamics, a unified approach that accommodates non-differentiable targets and certain classes of heavy-tailed distributions. The method replaces the original potential with a smooth reference potential and modifies the Langevin diffusion via multiplicative scaling. We establish non-asymptotic guarantees in the 2-Wasserstein distance to the target distribution and provide an equivalent formulation derived via a random time change of the Langevin diffusion. We provide numerical experiments to illustrate the theory and practical performance of our proposed approach.

Updated: 2025-09-23 18:11:55

标题: 锚定朗之万算法

摘要: 标准的一阶 Langevin 算法，如未调整的 Langevin 算法 (ULA)，是通过离散化 Langevin 扩散而获得的，并且广泛用于机器学习中的采样，因为它们适用于高维度和大数据集。然而，它们面临两个关键限制：(i) 它们需要可微的对数密度，排除了具有非可微分量的目标；(ii) 它们通常无法采样重尾目标。我们提出了锚定 Langevin 动力学，这是一种统一的方法，可容纳非可微分目标和某些类别的重尾分布。该方法通过将原始势能替换为平滑的参考势能，并通过乘法缩放修改 Langevin 扩散。我们建立了与目标分布的 2-Wasserstein 距离的非渐近保证，并提供了一个通过 Langevin 扩散的随机时间改变导出的等价公式。我们提供了数值实验来说明我们提出的方法的理论和实际性能。

更新时间: 2025-09-23 18:11:55

领域: stat.ML,cs.LG,math.PR

下载: http://arxiv.org/abs/2509.19455v1

ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation

Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.

Updated: 2025-09-23 18:11:53

标题: ROPA: RGB-D双手数据增强的合成机器人姿势生成

摘要: 通过模仿学习训练稳健的双手操纵策略需要具有广泛覆盖机器人姿势、接触和场景背景的演示数据。然而，收集多样化和精确的现实世界演示是昂贵且耗时的，这阻碍了可扩展性。先前的研究通过数据增强来解决这个问题，通常用于具有RGB输入的手眼协调（手腕相机）设置，或用于生成没有配对动作的新图像，留下了用于具有新动作标签的手到手（第三人称）RGB-D训练的增强 less。在本文中，我们提出了一种用于RGB-D双手数据增强的合成机器人姿势生成方法（ROPA），这是一种离线模仿学习数据增强方法，通过微调稳定扩散来合成新的机器人姿势的第三人称RGB和RGB-D观察。我们的方法同时生成相应的关节空间动作标签，同时利用受限优化来通过适当的夹持器对对象接触约束在双手场景中强制物理一致性。我们在5个模拟任务和3个真实任务上评估了我们的方法。我们在2625个模拟试验和300个真实世界试验中的结果表明，ROPA优于基线和消融，展示了其在眼到手双手操纵中可扩展RGB和RGB-D数据增强的潜力。我们的项目网站位于：https://ropaaug.github.io/。

更新时间: 2025-09-23 18:11:53

领域: cs.RO,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.19454v1

The Platonic Universe: Do Foundation Models See the Same Sky?

We test the Platonic Representation Hypothesis (PRH) in astronomy by measuring representational convergence across a range of foundation models trained on different data types. Using spectroscopic and imaging observations from JWST, HSC, Legacy Survey, and DESI, we compare representations from vision transformers, self-supervised models, and astronomy-specific architectures via mutual $k$-nearest neighbour analysis. We observe consistent scaling: representational alignment generally increases with model capacity across our tested architectures, supporting convergence toward a shared representation of galaxy astrophysics. Our results suggest that astronomical foundation models can use pre-trained general-purpose architectures, allowing us to capitalise on the broader machine learning community's already-spent computational investment.

Updated: 2025-09-23 18:10:05

标题: 柏拉图式宇宙：基础模型是否看到同样的天空？

摘要: 我们通过测量在不同数据类型上训练的一系列基础模型之间的表示收敛，来测试天文学中的柏拉图表示假设（PRH）。利用来自JWST、HSC、Legacy Survey和DESI的光谱和成像观测，我们通过互相的$k$-最近邻分析比较了视觉转换器、自监督模型和天文学特定架构的表示。我们观察到一致的规模效应：在我们测试的架构中，表示对齐通常随着模型容量的增加而增加，支持向共享的星系天体物理表示的收敛。我们的结果表明，天文基础模型可以使用预训练的通用架构，使我们能够利用更广泛的机器学习社区已经投入的计算投资。

更新时间: 2025-09-23 18:10:05

领域: astro-ph.IM,cs.LG

下载: http://arxiv.org/abs/2509.19453v1

HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

Updated: 2025-09-23 18:07:10

标题: 狩猎：通过瞬时相对框架在非结构化环境中进行高速无人机导航和跟踪

摘要: 搜寻和救援行动需要无人机在高速度下穿越未知的非结构化环境，并一旦发现目标便能追踪目标。在受损感知和无全局定位的情况下实现这两种能力仍然是一个挑战。最近的相对导航研究表明，通过将规划和控制锚定到可见的探测对象可以实现强大的跟踪，但无法解决当视野中没有目标时的导航问题。我们提出了HUNT（高速无人机导航和追踪），这是一个实时框架，将穿越、获取和追踪统一在一个相对公式中。HUNT直接从机载瞬时可观测量（如姿态、高度和速度）定义导航目标，实现在搜索期间的反应性高速飞行。一旦检测到目标，同样的感知控制管线就会无缝过渡到追踪。在密集森林、容器复合体和车辆和人体模型的搜索和救援行动中的户外实验展示了当全局方法失败时的强大自主性。

更新时间: 2025-09-23 18:07:10

领域: cs.RO,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.19452v1

Weaver: Interweaving SQL and LLM for Table Reasoning

Querying tables with unstructured data is challenging due to the presence of text (or image), either embedded in the table or in external paragraphs, which traditional SQL struggles to process, especially for tasks requiring semantic reasoning. While Large Language Models (LLMs) excel at understanding context, they face limitations with long input sequences. Existing approaches that combine SQL and LLMs typically rely on rigid, predefined work-flows, limiting their adaptability to complex queries. To address these issues, we introduce Weaver , a modular pipeline that dynamically integrates SQL and LLMs for table-based question answering (TableQA). Weaver generates a flexible, step-by-step plan that combines SQL for structured data retrieval with LLMs for semantic processing. By decomposing complex queries into manageable subtasks, Weaver improves accuracy and generalization. Our experiments show that Weaver consistently outperforms state-of-the-art methods across four TableQA datasets, reducing both API calls and error rates. The code, along with other associated scripts, are available at https://coral-lab-asu.github.io/weaver.

Updated: 2025-09-23 18:02:45

标题: 编织者：将SQL和LLM交织在一起进行表格推理

摘要: 查询带有非结构化数据的表格具有挑战性，因为表格中存在文本（或图像），这些文本可能嵌入在表格中或在外部段落中，传统的SQL往往很难处理，特别是对于需要语义推理的任务。虽然大型语言模型（LLMs）擅长理解上下文，但它们在处理长输入序列时存在限制。现有的将SQL和LLMs结合的方法通常依赖于严格的、预定义的工作流程，限制了它们对复杂查询的适应性。为了解决这些问题，我们介绍了Weaver，这是一个模块化的流水线，动态集成了SQL和LLMs，用于基于表格的问答（TableQA）。Weaver生成一个灵活的、逐步的计划，将SQL用于结构化数据检索，将LLMs用于语义处理。通过将复杂查询分解为可管理的子任务，Weaver提高了准确性和泛化能力。我们的实验证明，Weaver在四个TableQA数据集上始终优于最先进的方法，同时减少了API调用和错误率。代码，以及其他相关脚本，可以在https://coral-lab-asu.github.io/weaver上找到。

更新时间: 2025-09-23 18:02:45

领域: cs.AI,cs.IR

下载: http://arxiv.org/abs/2505.18961v2

The Pareto Frontier of Resilient Jet Tagging

Classifying hadronic jets using their constituents' kinematic information is a critical task in modern high-energy collider physics. Often, classifiers are designed by targeting the best performance using metrics such as accuracy, AUC, or rejection rates. However, the use of a single metric can lead to the use of architectures that are more model-dependent than competitive alternatives, leading to potential uncertainty and bias in analysis. We explore such trade-offs and demonstrate the consequences of using networks with high performance metrics but low resilience.

Updated: 2025-09-23 18:00:01

标题: 弹性喷气标记的帕累托边界

摘要: 使用它们在运动学信息上的成分对强子喷流进行分类是现代高能对撞机物理中的一项关键任务。通常，分类器的设计是通过针对最佳性能使用准确性、AUC或拒绝率等指标来完成的。然而，仅使用单一指标可能导致使用比竞争替代方案更依赖模型的架构，从而可能导致分析中的潜在不确定性和偏见。我们探讨这种权衡，并展示使用性能指标高但韧性低的网络的后果。

更新时间: 2025-09-23 18:00:01

领域: hep-ph,cs.LG,hep-ex

下载: http://arxiv.org/abs/2509.19431v1

Residual Off-Policy RL for Finetuning Behavior Cloning Policies

Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from increasing offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world. Project website: https://residual-offpolicy-rl.github.io

Updated: 2025-09-23 17:59:46

标题: 细化行为克隆策略的剩余离线强化学习

摘要: 最近在行为克隆（BC）方面取得的进展已经实现了令人印象深刻的视觉动作控制策略。然而，这些方法受到人类示范质量的限制，需要进行数据收集的手动工作量，以及增加离线数据带来的收益递减的影响。相比之下，强化学习（RL）通过与环境的自主交互训练代理，并在各个领域取得了显著成功。然而，在真实世界的机器人上直接训练RL策略仍然具有挑战性，原因在于样本效率低、安全问题以及难以学习来自稀疏奖励的长期任务，尤其是对于高自由度系统而言。我们提出了一种通过残差学习框架结合BC和RL优势的方法。我们的方法利用BC策略作为黑盒基础，并通过高效的离线策略RL学习每步轻量级残差校正。我们证明了我们的方法仅需要稀疏的二进制奖励信号，并且可以有效地改进高自由度系统中的操作策略，无论是在仿真还是真实世界中。特别是，在我们所知道的范围内，我们展示了在具有灵巧手的人形机器人上进行的首次成功的真实世界RL训练。我们的结果展示了在各种基于视觉的任务中的最新性能，为在真实世界中部署RL指明了实际的途径。项目网站：https://residual-offpolicy-rl.github.io

更新时间: 2025-09-23 17:59:46

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2509.19301v1

Audio-Based Pedestrian Detection in the Presence of Vehicular Noise

Audio-based pedestrian detection is a challenging task and has, thus far, only been explored in noise-limited environments. We present a new dataset, results, and a detailed analysis of the state-of-the-art in audio-based pedestrian detection in the presence of vehicular noise. In our study, we conduct three analyses: (i) cross-dataset evaluation between noisy and noise-limited environments, (ii) an assessment of the impact of noisy data on model performance, highlighting the influence of acoustic context, and (iii) an evaluation of the model's predictive robustness on out-of-domain sounds. The new dataset is a comprehensive 1321-hour roadside dataset. It incorporates traffic-rich soundscapes. Each recording includes 16kHz audio synchronized with frame-level pedestrian annotations and 1fps video thumbnails.

Updated: 2025-09-23 17:57:44

标题: 在车辆噪音环境中基于音频的行人检测

摘要: 基于音频的行人检测是一项具有挑战性的任务，迄今为止，仅在受噪声限制的环境中进行了探索。我们提出了一个新的数据集、结果和对基于音频的行人检测的最新技术的详细分析，以便在车辆噪声存在的情况下进行。在我们的研究中，我们进行了三项分析：（i）在嘈杂和受噪声限制的环境之间进行跨数据集评估，（ii）评估嘈杂数据对模型性能的影响，突出声学背景的影响，以及（iii）评估模型对域外声音的预测稳健性。新数据集是一个全面的1321小时路边数据集。它包含交通丰富的声音景观。每个录音包括与帧级别行人注释和1fps视频缩略图同步的16kHz音频。

更新时间: 2025-09-23 17:57:44

领域: eess.AS,cs.AI,cs.LG,cs.SD

下载: http://arxiv.org/abs/2509.19295v1

The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review

We conducted an experiment during the review process of the 2023 International Conference on Machine Learning (ICML), asking authors with multiple submissions to rank their papers based on perceived quality. In total, we received 1,342 rankings, each from a different author, covering 2,592 submissions. In this paper, we present an empirical analysis of how author-provided rankings could be leveraged to improve peer review processes at machine learning conferences. We focus on the Isotonic Mechanism, which calibrates raw review scores using the author-provided rankings. Our analysis shows that these ranking-calibrated scores outperform the raw review scores in estimating the ground truth ``expected review scores'' in terms of both squared and absolute error metrics. Furthermore, we propose several cautious, low-risk applications of the Isotonic Mechanism and author-provided rankings in peer review, including supporting senior area chairs in overseeing area chairs' recommendations, assisting in the selection of paper awards, and guiding the recruitment of emergency reviewers.

Updated: 2025-09-23 17:56:41

标题: ICML 2023排名实验：检验作者在机器学习/人工智能同行评审中的自我评估

摘要: 我们在2023年国际机器学习会议（ICML）的审稿过程中进行了一个实验，要求有多篇投稿的作者根据感知质量对自己的论文进行排名。总共，我们收到了1,342个排名，来自不同的作者，涵盖了2,592份投稿。在本文中，我们对作者提供的排名如何利用于改进机器学习会议的同行评审流程进行了实证分析。我们重点关注等温机制，该机制使用作者提供的排名来校准原始评审分数。我们的分析显示，这些排名校准分数在估计“期望评审分数”方面，无论是在平方误差还是绝对误差指标上，都优于原始评审分数。此外，我们提出了等温机制和作者提供的排名在同行评审中的几种谨慎、低风险应用，包括支持高级区域主席监督区域主席的推荐，协助选择论文奖项，以及指导应急评审员的招募。

更新时间: 2025-09-23 17:56:41

领域: stat.AP,cs.DL,cs.GT,cs.LG,stat.ML

下载: http://arxiv.org/abs/2408.13430v3

SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration

Intelligent agents progress by continually refining their capabilities through actively exploring environments. Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement. Project website: https://ericjin2002.github.io/SOE

Updated: 2025-09-23 17:54:47

标题: SOE: 通过流形探索实现样本高效的机器人策略自我改进

摘要: 智能代理通过不断改进其能力来不断探索环境。然而，机器人策略通常缺乏足够的探索能力，这是由于行动模式的崩溃。现有的鼓励探索的方法通常依赖于随机扰动，这些扰动是不安全的，并导致不稳定、古怪的行为，从而限制了它们的有效性。我们提出了自我改进通过流形探索（SOE）的框架，该框架增强了机器人操作中的策略探索和改进。SOE学习任务相关因素的紧凑潜在表示，并将探索限制在有效行动的流形上，确保安全性、多样性和有效性。它可以无缝集成到任意策略模型中作为插件模块，增加探索而不降低基本策略的性能。此外，结构化的潜在空间使人类引导的探索更加高效和可控。在模拟和真实任务中的大量实验表明，SOE始终优于先前的方法，实现了更高的任务成功率、更平稳、更安全的探索和更高的样本效率。这些结果将流形探索确立为一种基于样本高效的策略自我改进的原则性方法。项目网站：https://ericjin2002.github.io/SOE

更新时间: 2025-09-23 17:54:47

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19292v1

Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features

Positional encoding is essential for supplementing transformer with positional information of tokens. Existing positional encoding methods demand predefined token/feature order, rendering them unsuitable for real-world data with non-sequential yet causally-related features. To address this limitation, we propose CAPE, a novel method that identifies underlying causal structure over non-sequential features as a weighted directed acyclic graph (DAG) using generalized structural equation modeling. The DAG is then embedded in hyperbolic space where its geometric structure is well-preserved using a hyperboloid model-based approach that effectively captures two important causal graph properties (causal strength & causal specificity). This step yields causality-aware positional encodings for the features, which are converted into their rotary form for integrating with transformer's self-attention mechanism. Theoretical analysis reveals that CAPE-generated rotary positional encodings possess three valuable properties for enhanced self-attention, including causal distance-induced attenuation, causal generality-induced attenuation, and robustness to positional disturbances. We evaluate CAPE over both synthetic and real-word datasets, empirically demonstrating its theoretical properties and effectiveness in enhancing transformer for data with non-sequential features. Our code is available at https://github.com/Catchxu/CAPE.

Updated: 2025-09-23 17:52:27

标题: 因果引起的位置编码：基于Transformer的非顺序特征表示学习

摘要: 位置编码对于为transformer补充token的位置信息至关重要。现有的位置编码方法需要预定义的token/特征顺序，这使它们不适用于具有非顺序但存在因果关系特征的真实世界数据。为了解决这一限制，我们提出了CAPE，一种新颖的方法，使用广义结构方程建模，将非顺序特征的潜在因果结构识别为加权有向无环图（DAG）。然后，将DAG嵌入双曲空间，通过基于双曲模型的方法有效地保留其几何结构，从而有效捕获两个重要的因果图属性（因果强度和因果特异性）。这一步产生了针对特征的具有因果性的位置编码，这些编码被转换为它们的旋转形式，以与transformer的自注意机制集成。理论分析表明，CAPE生成的旋转位置编码具有增强自注意力的三个有价值的属性，包括因果距离诱导的衰减、因果普遍性诱导的衰减以及对位置干扰的稳健性。我们在合成和真实数据集上评估了CAPE，实证地展示了其理论属性和增强transformer处理具有非顺序特征的数据的有效性。我们的代码可以在https://github.com/Catchxu/CAPE找到。

更新时间: 2025-09-23 17:52:27

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2509.16629v2

What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

Large reasoning models (LRMs) spend substantial test-time compute on long chain-of-thought (CoT) traces, but what *characterizes* an effective CoT remains unclear. While prior work reports gains from lengthening CoTs and increasing review (revisiting earlier steps) via appended *wait* tokens, recent studies suggest that shorter thinking can outperform longer traces. We therefore conduct a systematic evaluation across ten LRMs on math and scientific reasoning. Contrary to the "longer-is-better" narrative, we find that both naive CoT lengthening and increased review are associated with *lower* accuracy. As CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic-the *Failed-Step Fraction (FSF)*, the fraction of steps in abandoned branches-that consistently outpredicts length and review ratio for correctness across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, where FSF yields the largest pass@1 gains; second, we edit CoTs to remove failed branches, which significantly improves accuracy, indicating that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those that *fail less* and support *structure-aware* test-time scaling over indiscriminately generating long CoT.

Updated: 2025-09-23 17:50:54

标题: 什么是有效推理的特征？重新审视CoT的长度、审查和结构

摘要: 大型推理模型（LRMs）在长链式思考（CoT）轨迹上花费了大量的测试时间计算，但是什么样的CoT才是有效的仍然不清楚。尽管先前的工作报告了通过延长CoTs和增加复审（重新访问早期步骤）通过添加*等待*标记可以获得收益，但是最近的研究表明，较短的思考可能会胜过较长的轨迹。因此，我们在数学和科学推理领域对十个LRMs进行了系统评估。与“更长更好”的叙事相反，我们发现，无论是天真的CoT延长还是增加复审都与*更低*的准确性相关联。随着CoT逐步展开，标记级别的指标可能将冗长性与过程质量混为一谈。我们引入了CoT的图形视图以提取结构，并识别了一个单一的统计量——*失败步骤比例（FSF）*，即放弃分支中的步骤比例，这一统计量在各个模型中始终优于长度和复审比率以正确性预测。为了探究因果关系，我们设计了两个干预措施。首先，我们在测试时按照每个指标对候选CoTs进行排名，其中FSF获得了最大的pass@1增益；其次，我们编辑CoTs以移除失败的分支，这显著提高了准确性，表明失败的分支会影响后续推理。综合这些结果，我们将有效的CoT描述为那些*失败较少*并支持*结构感知*的测试时间扩展，而不是盲目地生成长的CoT。

更新时间: 2025-09-23 17:50:54

领域: cs.LG

下载: http://arxiv.org/abs/2509.19284v1

Turning Tabular Foundation Models into Graph Foundation Models

While foundation models have revolutionized such fields as natural language processing and computer vision, their potential in graph machine learning remains largely unexplored. One of the key challenges in designing graph foundation models (GFMs) is handling diverse node features that can vary across different graph datasets. While many works on GFMs have focused exclusively on text-attributed graphs, the problem of handling arbitrary features of other types in GFMs has not been fully addressed. However, this problem is not unique to the graph domain, as it also arises in the field of machine learning for tabular data. In this work, motivated by the recent success of tabular foundation models (TFMs) like TabPFNv2 or LimiX, we propose G2T-FM, a simple framework for turning tabular foundation models into graph foundation models. Specifically, G2T-FM augments the original node features with neighborhood feature aggregation, adds structural embeddings, and then applies a TFM to the constructed node representations. Even in a fully in-context regime, our model achieves strong results, significantly outperforming publicly available GFMs and performing competitively with, and often better than, well-tuned GNNs trained from scratch. Moreover, after finetuning, G2T-FM surpasses well-tuned GNN baselines. In particular, when combined with LimiX, G2T-FM often outperforms the best GNN by a significant margin. In summary, our paper reveals the potential of a previously overlooked direction of utilizing tabular foundation models for graph machine learning tasks.

Updated: 2025-09-23 17:49:14

标题: 将表格基础模型转化为图基础模型

摘要: 基于基础模型的领域，如自然语言处理和计算机视觉已经发生了革命性变化，然而在图机器学习领域，基础模型的潜力仍然大部分未被探索。设计图基础模型（GFMs）的关键挑战之一是处理可能在不同图数据集中变化的各种节点特征。虽然许多关于GFMs的研究专注于文本属性图，但在GFMs中处理其他类型的任意特征的问题尚未得到充分解决。然而，这个问题并不是图领域独有的，它在表格数据的机器学习领域也存在。在这项工作中，受到表格基础模型（TFMs）如TabPFNv2或LimiX的最新成功启发，我们提出了G2T-FM，一个简单的框架，用于将表格基础模型转化为图基础模型。具体来说，G2T-FM通过邻域特征聚合增强原始节点特征，添加结构嵌入，然后将TFM应用于构建好的节点表示。即使在完全上下文环境下，我们的模型取得了强大的结果，明显优于公开可用的GFMs，并且与以及优于从头开始训练的精心调整的GNNs竞争。此外，经过微调后，G2T-FM超越了精心调整的GNN基准线。特别是与LimiX结合使用时，G2T-FM往往明显优于最佳GNN。总之，我们的论文揭示了利用表格基础模型进行图机器学习任务的先前被忽视的潜在方向。

更新时间: 2025-09-23 17:49:14

领域: cs.LG

下载: http://arxiv.org/abs/2508.20906v2

MOIS-SAM2: Exemplar-based Segment Anything Model 2 for multilesion interactive segmentation of neurobromas in whole-body MRI

Background and Objectives: Neurofibromatosis type 1 is a genetic disorder characterized by the development of numerous neurofibromas (NFs) throughout the body. Whole-body MRI (WB-MRI) is the clinical standard for detection and longitudinal surveillance of NF tumor growth. Existing interactive segmentation methods fail to combine high lesion-wise precision with scalability to hundreds of lesions. This study proposes a novel interactive segmentation model tailored to this challenge. Methods: We introduce MOIS-SAM2, a multi-object interactive segmentation model that extends the state-of-the-art, transformer-based, promptable Segment Anything Model 2 (SAM2) with exemplar-based semantic propagation. MOIS-SAM2 was trained and evaluated on 119 WB-MRI scans from 84 NF1 patients acquired using T2-weighted fat-suppressed sequences. The dataset was split at the patient level into a training set and four test sets (one in-domain and three reflecting different domain shift scenarios, e.g., MRI field strength variation, low tumor burden, differences in clinical site and scanner vendor). Results: On the in-domain test set, MOIS-SAM2 achieved a scan-wise DSC of 0.60 against expert manual annotations, outperforming baseline 3D nnU-Net (DSC: 0.54) and SAM2 (DSC: 0.35). Performance of the proposed model was maintained under MRI field strength shift (DSC: 0.53) and scanner vendor variation (DSC: 0.50), and improved in low tumor burden cases (DSC: 0.61). Lesion detection F1 scores ranged from 0.62 to 0.78 across test sets. Preliminary inter-reader variability analysis showed model-to-expert agreement (DSC: 0.62-0.68), comparable to inter-expert agreement (DSC: 0.57-0.69). Conclusions: The proposed MOIS-SAM2 enables efficient and scalable interactive segmentation of NFs in WB-MRI with minimal user input and strong generalization, supporting integration into clinical workflows.

Updated: 2025-09-23 17:42:24

标题: MOIS-SAM2：基于范例的多损伤互动分割模型2用于全身MRI中神经纤维瘤的分割

摘要: 背景和目的：神经纤维瘤病1型是一种遗传性疾病，其特征是在全身发展大量神经纤维瘤（NF）。全身MRI（WB-MRI）是检测和长期监测NF肿瘤生长的临床标准。现有的交互式分割方法无法结合高病变准确性与可扩展性到数百个病变。本研究提出了一种针对这一挑战的新型交互式分割模型。方法：我们介绍了MOIS-SAM2，这是一个多对象交互式分割模型，它将最先进的基于转换器的可提示的Segment Anything Model 2（SAM2）与基于示范的语义传播相结合。MOIS-SAM2在使用T2加权脂肪抑制序列获得的84名NF1患者的119个WB-MRI扫描上进行训练和评估。该数据集在患者水平上分为训练集和四个测试集（一个在域内，另外三个反映了不同的领域转变情况，例如MRI场强变化，低肿瘤负担，临床站点和扫描仪供应商的差异）。结果：在域内测试集上，MOIS-SAM2针对专家手动注释实现了扫描级DSC为0.60，优于基线3D nnU-Net（DSC: 0.54）和SAM2（DSC: 0.35）。所提出模型的性能在MRI场强变化（DSC: 0.53）和扫描仪供应商变化（DSC: 0.50）下得以维持，并在低肿瘤负担案例中得到改善（DSC: 0.61）。病变检测F1分数在测试集中从0.62到0.78不等。初步的读者间差异性分析显示模型与专家间的一致性（DSC: 0.62-0.68），与专家间的一致性相当（DSC: 0.57-0.69）。结论：所提出的MOIS-SAM2能够在WB-MRI中高效且可扩展地进行NF的交互式分割，几乎不需要用户输入，并具有强大的泛化能力，支持集成到临床工作流程中。

更新时间: 2025-09-23 17:42:24

领域: eess.IV,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.19277v1

A Gradient Flow Approach to Solving Inverse Problems with Latent Diffusion Models

Solving ill-posed inverse problems requires powerful and flexible priors. We propose leveraging pretrained latent diffusion models for this task through a new training-free approach, termed Diffusion-regularized Wasserstein Gradient Flow (DWGF). Specifically, we formulate the posterior sampling problem as a regularized Wasserstein gradient flow of the Kullback-Leibler divergence in the latent space. We demonstrate the performance of our method on standard benchmarks using StableDiffusion (Rombach et al., 2022) as the prior.

Updated: 2025-09-23 17:41:43

标题: 使用梯度流方法解决具有潜在扩散模型的反问题

摘要: 解决不适定的逆问题需要强大和灵活的先验知识。我们提议通过一种新的无需训练的方法，即扩散正则化Wasserstein梯度流（DWGF），利用预训练的潜在扩散模型来解决这一问题。具体而言，我们将后验抽样问题表述为潜在空间中Kullback-Leibler散度的正则化Wasserstein梯度流。我们使用StableDiffusion（Rombach等人，2022）作为先验，在标准基准测试中展示了我们方法的性能。

更新时间: 2025-09-23 17:41:43

领域: stat.ML,cs.LG,stat.CO

下载: http://arxiv.org/abs/2509.19276v1

Generative Medical Event Models Improve with Scale

Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Comet models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study of medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Consequently, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient's real-world history, Comet autoregressively predicts the next medical event to simulate patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, Comet generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. Comet's predictive power consistently improves as the model and pretraining scale. Our results show that Comet, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.

Updated: 2025-09-23 17:35:19

标题: 生成式医学事件模型随规模改进

摘要: 在规模上实现个性化医疗需要提炼纵向患者旅程中的见解的方法，可以将其视为一系列医疗事件。在大规模医疗事件数据上预训练的基础模型代表了扩展实证生成和推广到多样化下游任务的有前途的方向。利用 Epic Cosmos 数据集，这是一个包含来自 310 个卫生系统的 16.3 亿次相遇的去标识化纵向健康记录中的医疗事件的数据集，我们介绍了 Comet 模型，这是一个仅解码器的变压器模型家族，预训练了代表 1.18 亿患者的 1150 亿个离散医疗事件（1510 亿标记）。我们展示了医疗事件数据的最大规模定律研究，建立了一个预训练方法，并揭示了计算、标记和模型大小的幂律扩展关系。因此，我们预训练了一系列最多拥有 10 亿参数的计算优化模型。在患者的现实历史条件下，Comet 自回归地预测下一个医疗事件，以模拟患者的健康时间轴。我们研究了 78 个现实世界任务，包括诊断预测、疾病预后和医疗保健运营。令人惊讶的是，对于一个具有通用预训练和基于模拟的推理的基础模型，Comet 在这些任务中通常优于或与专门的监督模型相匹配，而无需特定任务的微调或少量示例。随着模型和预训练规模的扩展，Comet 的预测能力持续改善。我们的结果表明，Comet，一个生成式医疗事件基础模型，能够有效捕捉复杂的临床动态，提供一个可扩展和通用的框架，支持临床决策制定，简化医疗保健运营，并改善患者结果。

更新时间: 2025-09-23 17:35:19

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2508.12104v2

Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory

We explore whether techniques from AI can help discover new combinatorial structures that improve on known limits on efficient algorithms. Specifically, we use AlphaEvolve (an LLM coding agent) to study two settings: a) Average-case hardness for MAX-CUT and MAX-Independent Set: We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as $163$ nodes, using AlphaEvolve. Additionally, via analytical arguments we strengthen the upper bounds to settle the computational hardness of these questions up to an error in the third decimal place. b) Worst-case Hardness of Approximation for MAX-k-CUT: We obtain new inapproximability results, proving that it is NP-hard to approximate MAX-4-CUT and MAX-3-CUT within factors of $0.987$ and $0.9649$ respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of $0.9883$, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of $0.9853$, but falls short of improving the SOTA of $16/17$ that relies on a custom PCP, rather than a gadget reduction from "standard" H{\aa}stad-style PCPs. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (often requiring exponential time). In both settings above, our results were enabled by using AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by $10,000\times$). We conclude with a discussion of norms by which to assess the assistance from AI in developing proofs.

Updated: 2025-09-23 17:34:30

标题: 加强组合结构的生成：在复杂性理论中的应用

摘要: 我们探讨了AI技术是否可以帮助发现新的组合结构，从而改进已知的有效算法限制。具体来说，我们使用AlphaEvolve（一个LLM编码代理）研究了两个设置： a) MAX-CUT和MAX-Independent Set的平均情况困难度：我们改进了Kunisky和Yu最近的结果，获得了对随机3-和4-正则图的MAX-CUT和MAX-Independent Set的认证算法的近乎最优上界和（条件性的）下界。我们通过使用AlphaEvolve构建近乎极端的Ramanujan图，最多可达163个节点，获得了改进的下界。此外，通过分析论证，我们加强了上界，解决了这些问题的计算困难度，最多有第三位小数的误差。 b) MAX-k-CUT的近似最坏情况困难度：我们获得了新的近似难度结果，证明了在AlphaEvolve发现新的小工具缩减的情况下，近似MAX-4-CUT和MAX-3-CUT在0.987和0.9649的因子内是NP难的。我们的MAX-4-CUT结果优于0.9883的SOTA，而我们的MAX-3-CUT结果优于当前最佳的基于小工具的不可近似性结果0.9853，但未能改进依赖于自定义PCP而不是来自“标准”Håstad风格PCP的小工具缩减的16/17的SOTA。我们面临的关键技术挑战是：验证由AlphaEvolve产生的候选构造是昂贵的（通常需要指数时间）。在上述两种情况下，我们的结果得益于使用AlphaEvolve本身来进化验证过程以使其更快（有时高达10000倍）。最后，我们讨论了评估AI在证明发展中的辅助作用的标准。

更新时间: 2025-09-23 17:34:30

领域: cs.LG,cs.AI,cs.CC,math.CO

下载: http://arxiv.org/abs/2509.18057v2

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are crafted to be subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy. Strategic dishonesty already has a practical impact on safety evaluations, as we show that dishonest responses fool all output-based monitors used to detect jailbreaks that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a honeypot against malicious users, which noticeably obfuscates prior jailbreak attacks. While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty. We validate probes on datasets with verifiable outcomes and by using them as steering vectors. Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.

Updated: 2025-09-23 17:34:27

标题: 战略性不诚实可能会破坏前沿LLM的AI安全评估

摘要: 大型语言模型（LLM）开发人员的目标是使他们的模型诚实、有用和无害。然而，当面对恶意请求时，模型被训练拒绝，牺牲了帮助性。我们展示了前沿LLM可以发展出一种新的策略，即对不诚实偏好，即使其他选项也是可用的。受影响的模型对有害请求作出的输出听起来有害，但实际上是经过精心设计为微妙不正确或者无害的。这种行为即使在同一模型系列中的模型内部也会出现难以预测的变化。我们没有发现欺骗的明显原因，但表明更有能力的模型更擅长执行这种策略。战略性的不诚实已经对安全评估产生了实际影响，因为我们展示了不诚实的响应欺骗了我们测试的用于检测越狱的所有基于输出的监视器，使基准分数不可靠。此外，战略性的不诚实可以像蜜罐一样针对恶意用户，明显混淆了以前的越狱攻击。尽管输出监视器失败，我们展示了线性探针在内部激活上的使用可以可靠地检测战略性的不诚实。我们通过在具有可验证结果的数据集上验证探针，并将其用作指导向量。总的来说，我们认为战略性的不诚实是对LLM对齐的更广泛关注的一个具体例子，特别是当帮助性和无害性发生冲突时，控制LLM的对齐变得困难。

更新时间: 2025-09-23 17:34:27

领域: cs.LG,cs.AI,cs.CR

下载: http://arxiv.org/abs/2509.18058v2

Seeing is Deceiving: Mirror-Based LiDAR Spoofing for Autonomous Vehicle Deception

Autonomous vehicles (AVs) rely heavily on LiDAR sensors for accurate 3D perception. We show a novel class of low-cost, passive LiDAR spoofing attacks that exploit mirror-like surfaces to inject or remove objects from an AV's perception. Using planar mirrors to redirect LiDAR beams, these attacks require no electronics or custom fabrication and can be deployed in real settings. We define two adversarial goals: Object Addition Attacks (OAA), which create phantom obstacles, and Object Removal Attacks (ORA), which conceal real hazards. We develop geometric optics models, validate them with controlled outdoor experiments using a commercial LiDAR and an Autoware-equipped vehicle, and implement a CARLA-based simulation for scalable testing. Experiments show mirror attacks corrupt occupancy grids, induce false detections, and trigger unsafe planning and control behaviors. We discuss potential defenses (thermal sensing, multi-sensor fusion, light-fingerprinting) and their limitations.

Updated: 2025-09-23 17:34:14

标题: Seeing is Deceiving: Mirror-Based LiDAR Spoofing for Autonomous Vehicle Deception 看见并不代表真实：基于镜子的LiDAR欺骗技术用于自动驾驶车辆欺骗

摘要: 自动驾驶汽车（AVs）在准确的3D感知中严重依赖LiDAR传感器。我们展示了一种新颖的低成本、被动LiDAR欺骗攻击的类别，利用镜面般的表面向AV的感知中注入或移除物体。利用平面镜将LiDAR光束重定向，这些攻击不需要电子设备或定制制造，并可以在实际环境中部署。我们定义了两种对抗性目标：物体添加攻击（OAA），用于创建幻影障碍物，以及物体移除攻击（ORA），用于隐藏真实危险。我们开发几何光学模型，并通过使用商用LiDAR和配备Autoware的车辆进行控制室外实验来验证它们，并实现基于CARLA的仿真以进行可扩展测试。实验表明，镜面攻击会破坏占用格，引发虚假检测，并触发不安全的规划和控制行为。我们讨论了潜在的防御方法（热传感、多传感器融合、光指纹识别）及其局限性。

更新时间: 2025-09-23 17:34:14

领域: cs.CR

下载: http://arxiv.org/abs/2509.17253v2

Leveraging Large Models to Evaluate Novel Content: A Case Study on Advertisement Creativity

Evaluating creativity is challenging, even for humans, not only because of its subjectivity but also because it involves complex cognitive processes. Inspired by work in marketing, we attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations on these dimensions, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark, demonstrating both the promises and challenges of using VLMs for automatic creativity assessment.

Updated: 2025-09-23 17:34:10

标题: 利用大型模型评估新颖内容：广告创意案例研究

摘要: 评估创意是具有挑战性的，即使对于人类来说也是如此，这不仅是因为它的主观性，还因为它涉及复杂的认知过程。受到营销领域的工作的启发，我们尝试将视觉广告创意分解为非典型性和独创性。通过对这些维度进行细粒度的人类注释，我们提出了一系列专门针对这种主观问题的任务。我们还评估了最先进的视觉语言模型（VLMs）与人类在我们提出的基准上的一致性，展示了使用VLMs进行自动创意评估的前景和挑战。

更新时间: 2025-09-23 17:34:10

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.00046v2

WolBanking77: Wolof Banking Speech Intent Classification Dataset

Intent classification models have made a lot of progress in recent years. However, previous studies primarily focus on high-resource languages datasets, which results in a gap for low-resource languages and for regions with a high rate of illiterate people where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90\% of the population, with an illiteracy rate of 42\% for the country. Wolof is actually spoken by more than 10 million people in West African region. To tackle such limitations, we release a Wolof Intent Classification Dataset (WolBanking77), for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including text and voice state-of-the-art models. The results are very promising on this current dataset. This paper also provides detailed analyses of the contents of the data. We report baseline f1-score and word error rate metrics respectively on NLP and ASR models trained on WolBanking77 dataset and also comparisons between models. We plan to share and conduct dataset maintenance, updates and to release open-source code.

Updated: 2025-09-23 17:34:10

标题: WolBanking77: 沃洛夫银行业演讲意图分类数据集

摘要: 意图分类模型在近年来取得了很大进展。然而，先前的研究主要集中在高资源语言数据集上，这导致了低资源语言和那些有高比例文盲人口的地区存在差距，这些地区的语言更多地被口头使用而非书写或阅读。例如，在塞内加尔，90%的人口使用沃洛夫语，该国文盲率为42%。沃洛夫语实际上在西非地区有超过1000万人口使用。为了解决这些限制，我们发布了一个沃洛夫语意图分类数据集（WolBanking77），供学术研究使用。WolBanking77目前包含9791个银行领域的文本句子和超过4小时的口头句子。本文在该工作中进行了多个基准实验，包括文本和语音的最新模型。在这个当前数据集上的结果非常令人鼓舞。本文还对数据内容进行了详细分析。我们分别在经过训练的NLP和ASR模型上报告了基准f1分数和字错误率指标，以及模型之间的比较。我们计划分享和进行数据集维护、更新，并发布开源代码。

更新时间: 2025-09-23 17:34:10

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19271v1

SloPalSpeech: A 2,8000-Hour Slovak Speech Corpus from Parliamentary Data

Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model's WER dropped by up to 70\%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.

Updated: 2025-09-23 17:33:57

标题: SloPalSpeech: 一份来自议会数据的 28000 小时斯洛伐克语音语料库

摘要: 自动语音识别（ASR）对于像斯洛伐克这样的低资源语言受到训练数据稀缺的限制。为了解决这个问题，我们引入了SloPalSpeech，一个包含2,806小时议会会议演讲的新的大规模斯洛伐克语ASR数据集。我们开发了一个强大的处理流程，将长格式录音对齐并分割成清晰的30秒音频和转录对，适合用于模型训练。我们使用这个数据集对几个OpenAI Whisper模型（small、medium、large-v3和large-v3-turbo）进行微调，在标准的斯洛伐克基准测试如Common Voice和FLEURS上实现了显著的词错误率（WER）降低。例如，微调后的Whisper-small模型的WER降低了高达70％，接近大得多的Whisper-large-v3模型的基准性能。为了促进低资源语音识别领域的未来研究，我们公开发布了完整的SloPalSpeech数据集、完全分割的转录（6000万字）以及我们所有微调的模型。

更新时间: 2025-09-23 17:33:57

领域: cs.CL,cs.AI,cs.SD

下载: http://arxiv.org/abs/2509.19270v1

Dynamicasome: a molecular dynamics-guided and AI-driven pathogenicity prediction catalogue for all genetic mutations

Advances in genomic medicine accelerate the identi cation of mutations in disease-associated genes, but the pathogenicity of many mutations remains unknown, hindering their use in diagnostics and clinical decision-making. Predictive AI models are generated to combat this issue, but current tools display low accuracy when tested against functionally validated datasets. We show that integrating detailed conformational data extracted from molecular dynamics simulations (MDS) into advanced AI-based models increases their predictive power. We carry out an exhaustive mutational analysis of the disease gene PMM2 and subject structural models of each variant to MDS. AI models trained on this dataset outperform existing tools when predicting the known pathogenicity of mutations. Our best performing model, a neuronal networks model, also predicts the pathogenicity of several PMM2 mutations currently considered of unknown signi cance. We believe this model helps alleviate the burden of unknown variants in genomic medicine.

Updated: 2025-09-23 17:33:05

标题: Dynamicasome：一种分子动力学引导和人工智能驱动的病原性预测目录，适用于所有遗传突变

摘要: 基因组医学的进展加速了疾病相关基因突变的鉴定，但许多突变的致病性仍然未知，这阻碍了它们在诊断和临床决策中的应用。为了解决这个问题，人工智能预测模型被生成，但目前的工具在与功能验证数据集进行测试时显示出较低的准确性。我们展示了将从分子动力学模拟（MDS）中提取的详细构象数据整合到先进的基于人工智能的模型中可以增加它们的预测能力。我们对疾病基因PMM2进行了详尽的突变分析，并将每个变异体的结构模型提交给MDS。在此数据集上训练的人工智能模型在预测已知突变的致病性方面优于现有工具。我们最佳的模型，一个神经网络模型，还可以预测目前被认为具有未知意义的几个PMM2突变的致病性。我们相信这个模型有助于减轻基因组医学中未知变异的负担。

更新时间: 2025-09-23 17:33:05

领域: q-bio.QM,cs.AI,physics.bio-ph,q-bio.MN

下载: http://arxiv.org/abs/2509.19766v1

Latent Representation Learning of Multi-scale Thermophysics: Application to Dynamics in Shocked Porous Energetic Material

Coupling of physics across length and time scales plays an important role in the response of microstructured materials to external loads. In a multi-scale framework, unresolved (subgrid) meso-scale dynamics is upscaled to the homogenized (macro-scale) representation of the heterogeneous material through closure models. Deep learning models trained using meso-scale simulation data are now a popular route to assimilate such closure laws. However, meso-scale simulations are computationally taxing, posing practical challenges in training deep learning-based surrogate models from scratch. In this work, we investigate an alternative meta-learning approach motivated by the idea of tokenization in natural language processing. We show that one can learn a reduced representation of the micro-scale physics to accelerate the meso-scale learning process by tokenizing the meso-scale evolution of the physical fields involved in an archetypal, albeit complex, reactive dynamics problem, \textit{viz.}, shock-induced energy localization in a porous energetic material. A probabilistic latent representation of \textit{micro}-scale dynamics is learned as building blocks for \textit{meso}-scale dynamics. The \textit{meso-}scale latent dynamics model learns the correlation between neighboring building blocks by training over a small dataset of meso-scale simulations. We compare the performance of our model with a physics-aware recurrent convolutional neural network (PARC) trained only on the full meso-scale dataset. We demonstrate that our model can outperform PARC with scarce meso-scale data. The proposed approach accelerates the development of closure models by leveraging inexpensive micro-scale simulations and fast training over a small meso-scale dataset, and can be applied to a range of multi-scale modeling problems.

Updated: 2025-09-23 17:29:37

标题: 多尺度热物理潜在表示学习：应用于受冲击多孔含能材料动力学

摘要: 跨长度和时间尺度之间的物理耦合在微结构材料对外部载荷的响应中起着重要作用。在多尺度框架中，未解析（亚网格）的中尺度动态通过封闭模型被放大到异质材料的均匀化（宏观尺度）表示。使用中尺度模拟数据训练的深度学习模型现在是吸收这种封闭定律的常用途径。然而，中尺度模拟在计算上是具有挑战性的，从头开始训练基于深度学习的替代模型面临实际挑战。在这项工作中，我们调查了一种受自然语言处理中标记化概念启发的替代元学习方法。我们展示通过对参与一个原型性、尽管复杂的反应动力学问题中的中尺度物理场的演化进行标记化，可以学习微观尺度物理的简化表示，以加速中尺度学习过程，即多孔能量材料中的冲击诱导能量定位。微观尺度动力学的概率潜在表示被学习为中尺度动力学的构建模块。中尺度潜在动力学模型通过在小型中尺度模拟数据集上训练来学习相邻构建模块之间的相关性。我们将我们的模型与仅在完整中尺度数据集上训练的物理感知循环卷积神经网络（PARC）进行性能比较。我们证明我们的模型可以在中尺度数据稀缺的情况下胜过PARC。所提出的方法通过利用廉价的微观尺度模拟和在小型中尺度数据集上快速训练来加速封闭模型的发展，并可应用于一系列多尺度建模问题。

更新时间: 2025-09-23 17:29:37

领域: physics.comp-ph,cs.LG,I.2.6

下载: http://arxiv.org/abs/2506.12996v2

Exploring Model Kinship for Merging Large Language Models

Model merging has emerged as a key technique for enhancing the capabilities and efficiency of Large Language Models (LLMs). The open-source community has driven model evolution by iteratively merging existing models, yet a principled understanding of the gains and underlying factors in model merging remains limited. In this work, we study model evolution through iterative merging, drawing an analogy to biological evolution, and introduce the concept of model kinship, the degree of similarity or relatedness between LLMs. Through comprehensive empirical analysis, we show that model kinship is closely linked to the performance improvements achieved by merging, providing a useful criterion for selecting candidate models. Building on this insight, we propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can improve benchmark performance. Specifically, we discover that incorporating model kinship as a guiding criterion enables continuous merging while mitigating performance degradation caused by local optima, thereby facilitating more effective model evolution. Code is available at https://github.com/zjunlp/ModelKinship.

Updated: 2025-09-23 17:26:13

标题: 探索模型亲缘性以合并大型语言模型

摘要: 模型合并已经成为增强大型语言模型（LLMs）能力和效率的关键技术。开源社区通过迭代合并现有模型推动了模型演进，但对模型合并中的收益和基本因素的原理性理解仍然有限。在这项工作中，我们通过迭代合并研究模型演进，类比生物演化，并引入了模型亲缘性的概念，即LLMs之间的相似性或相关性程度。通过全面的实证分析，我们表明模型亲缘性与合并带来的性能改进密切相关，为选择候选模型提供了一个有用的标准。基于这一洞见，我们提出了一种新的模型合并策略：基于模型亲缘性的Top-k贪婪合并，可以提高基准性能。具体来说，我们发现将模型亲缘性作为指导标准，可以在减轻由于局部最优解引起的性能下降的同时，实现持续合并，从而促进更有效的模型演进。代码可在https://github.com/zjunlp/ModelKinship找到。

更新时间: 2025-09-23 17:26:13

领域: cs.CL,cs.AI,cs.CV,cs.LG,cs.MA

下载: http://arxiv.org/abs/2410.12613v3

Future-Proofing Cloud Security Against Quantum Attacks: Risk, Transition, and Mitigation Strategies

Quantum Computing (QC) introduces a transformative threat to digital security, with the potential to compromise widely deployed classical cryptographic systems. This survey offers a comprehensive and systematic examination of quantumsafe security for Cloud Computing (CC), focusing on the vulnerabilities, transition strategies, and mitigation mechanisms required to secure cloud infrastructures in the quantum era. We evaluated the landscape of quantum threats across the entire CC stack, demonstrating how quantum algorithms can undermine classical encryption and compromise cloud security at multiple architectural layers. Using a structured risk assessment methodology based on the STRIDE model, we evaluate quantum-induced attack vectors and their impact on cloud environments. To address these challenges, we propose a layered security framework that integrates hybrid cryptographic transition strategies, cryptographic agility, and proactive risk mitigation. We analyze the preparation and implementation approaches of the major Cloud Service Providers (CSPs), including AWS, Azure and GCP, synthesizing platform-specific initiatives toward Post-Quantum Cryptography (PQC). Furthermore, we provide a detailed evaluation of standardized PQC algorithms, exploring their resilience to side-channel and active attacks within cloud-native deployments. This survey serves as a strategic reference for cloud architects, policymakers, and researchers, offering actionable insights for navigating the complex transition to quantum-resilient cloud systems. We conclude by identifying six key future research directions: standardization and interoperability, performance and scalability, implementation security, integration with emerging technologies, systemic preparedness, and crypto-agile migration frameworks.

Updated: 2025-09-23 17:25:54

标题: 未来云安全抵御量子攻击：风险、过渡和缓解策略

摘要: 量子计算（QC）对数字安全构成了一种变革性威胁，有可能危及广泛部署的经典加密系统。本调查全面系统地检查了云计算（CC）的量子安全性，重点关注在量子时代保护云基础设施所需的漏洞、过渡策略和缓解机制。我们评估了整个CC堆栈中的量子威胁格局，展示了量子算法如何破坏经典加密并在多个架构层面上危及云安全。基于STRIDE模型的结构化风险评估方法，我们评估了量子引起的攻击向量及其对云环境的影响。为了解决这些挑战，我们提出了一个整合了混合加密过渡策略、加密灵活性和积极风险缓解的分层安全框架。我们分析了主要云服务提供商（CSPs）的准备和实施方法，包括AWS、Azure和GCP，综合平台特定的朝向后量子密码学（PQC）的倡议。此外，我们对标准化的PQC算法进行了详细评估，探讨它们在云原生部署中对侧信道和主动攻击的弹性。这项调查为云架构师、政策制定者和研究人员提供了战略参考，为导航复杂的过渡到量子韧性云系统提供了可操作的见解。最后，我们确定了六个关键的未来研究方向：标准化和互操作性、性能和可扩展性、实施安全性、与新兴技术整合、系统准备工作和加密灵活迁移框架。

更新时间: 2025-09-23 17:25:54

领域: cs.CR

下载: http://arxiv.org/abs/2509.15653v2

LLM-Driven SAST-Genius: A Hybrid Static Analysis Framework for Comprehensive and Actionable Security

This report examines the synergy between Large Language Models (LLMs) and Static Application Security Testing (SAST) to improve vulnerability discovery. Traditional SAST tools, while effective for proactive security, are limited by high false-positive rates and a lack of contextual understanding. Conversely, LLMs excel at code analysis and pattern recognition but can be prone to inconsistencies and hallucinations. By integrating these two technologies, a more intelligent and efficient system is created. This combination moves beyond mere vulnerability detection optimization, transforming security into a deeply integrated, contextual process that provides tangible benefits like improved triage, dynamic bug descriptions, bug validation via exploit generation and enhanced analysis of complex codebases. The result is a more effective security approach that leverages the strengths of both technologies while mitigating their weaknesses. SAST-Genius reduced false positives by about 91 % (225 to 20) compared to Semgrep alone.

Updated: 2025-09-23 17:24:40

标题: LLM驱动的SAST-Genius：综合和可操作安全性的混合静态分析框架

摘要: 这份报告研究了大型语言模型（LLMs）与静态应用程序安全性测试（SAST）之间的协同作用，以改善漏洞发现。传统的SAST工具，虽然对主动安全性很有效，但受到高假阳性率和缺乏上下文理解的限制。相反，LLMs在代码分析和模式识别方面表现出色，但可能容易出现不一致和幻觉。通过整合这两种技术，可以创建一个更智能、更高效的系统。这种组合不仅仅是漏洞检测的优化，还将安全性转变为一个深度整合的上下文过程，提供诸如改进的分级、动态错误描述、通过利用生成验证漏洞和增强对复杂代码库的分析等有形益处。其结果是一种更有效的安全方法，利用了两种技术的优势，同时减轻了它们的弱点。与单独使用Semgrep相比，SAST-Genius将假阳性减少约91%（从225个减少到20个）。

更新时间: 2025-09-23 17:24:40

领域: cs.CR

下载: http://arxiv.org/abs/2509.15433v2

Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World

Large language models (LLMs) often reflect Western-centric biases, limiting their effectiveness in diverse cultural contexts. Although some work has explored cultural alignment, the potential for cross-cultural transfer, using alignment in one culture to improve performance in others, remains underexplored. This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world, where linguistic and historical similarities coexist with local cultural differences. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods such as in-context learning and demonstration-based reinforcement (DITTO), alongside baselines like supervised fine-tuning and direct preference optimization. Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10\% on average, within multilingual models. In addition, we demonstrate that out-of-culture demonstrations from Indonesia and US contexts can match or surpass in-culture alignment for MCQ reasoning, highlighting cultural commonsense transferability beyond the Arab world. These findings demonstrate that efficient cross-cultural alignment is possible and offer a promising approach to adapt LLMs to low-resource cultural settings.

Updated: 2025-09-23 17:24:14

标题: 《LLM中常识推理的跨文化转移：以阿拉伯世界为例的证据》

摘要: 大型语言模型（LLMs）通常反映了西方中心偏见，限制了它们在多样化文化背景下的有效性。尽管一些研究已经探讨了文化对齐的可能性，但利用一个文化中的对齐来提高其他文化中的性能的潜力仍未被充分探索。本文研究了在阿拉伯世界的常识推理的跨文化转移，该地区的语言和历史相似性与本地文化差异并存。我们使用一个涵盖13个阿拉伯国家的文化相关的常识推理数据集，评估了轻量级对齐方法，如上下文学习和基于演示的强化学习（DITTO），以及监督微调和直接优化偏好等基线。我们的结果显示，仅仅来自一个国家的12个文化特定示例可以在多语言模型中将其他国家的性能平均提高10\%。此外，我们还展示了来自印度尼西亚和美国背景的跨文化演示可以匹配或超越阿拉伯世界的常识推理的文化对齐，突显了文化常识的可转移性。这些发现表明，高效的跨文化对齐是可能的，并提供了一种有前景的方法来适应LLMs到资源匮乏的文化环境。

更新时间: 2025-09-23 17:24:14

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.19265v1

Discovering strategies for coastal resilience with AI-based prediction and optimization

Tropical storms cause extensive property damage and loss of life, making them one of the most destructive types of natural hazards. The development of predictive models that identify interventions effective at mitigating storm impacts has considerable potential to reduce these adverse outcomes. In this study, we use an artificial intelligence (AI)-driven approach for optimizing intervention schemes that improve resilience to coastal flooding. We combine three different AI models to optimize the selection of intervention types, sites, and scales in order to minimize the expected cost of flooding damage in a given region, including the cost of installing and maintaining interventions. Our approach combines data-driven generation of storm surge fields, surrogate modeling of intervention impacts, and the solving of a continuous-armed bandit problem. We applied this methodology to optimize the selection of sea wall and oyster reef interventions near Tyndall Air Force Base (AFB) in Florida, an area that was catastrophically impacted by Hurricane Michael. Our analysis predicts that intervention optimization could be used to potentially save billions of dollars in storm damage, far outpacing greedy or non-optimal solutions.

Updated: 2025-09-23 17:21:41

标题: 利用基于人工智能的预测和优化发现沿海韧性策略

摘要: 热带风暴造成了广泛的财产损失和生命损失，使其成为最具破坏性的自然灾害之一。开发能够识别有效减轻风暴影响的干预措施的预测模型具有显著的潜力，有助于减少这些不利后果。在这项研究中，我们使用一种由人工智能（AI）驱动的方法来优化改善海岸洪水韧性的干预方案。我们结合了三种不同的AI模型，以优化干预类型、位置和规模的选择，以最小化给定区域洪水损失的预期成本，包括安装和维护干预的成本。我们的方法结合了基于数据驱动的风暴潮场生成、干预影响的替代模型建模以及连续武装匪徒问题的解决。我们将这种方法应用于优化佛罗里达州泰德尔空军基地（AFB）附近海堤和牡蛎礁干预的选择，该地区受到迈克尔飓风的灾难性影响。我们的分析预测，干预优化可以潜在地节省数十亿美元的风灾损失，远远超过贪婪或非最优解决方案。

更新时间: 2025-09-23 17:21:41

领域: physics.ao-ph,cs.LG

下载: http://arxiv.org/abs/2509.19263v1

LightThinker: Thinking Step-by-Step Compression

Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code is released at https://github.com/zjunlp/LightThinker.

Updated: 2025-09-23 17:20:22

标题: LightThinker：逐步思考压缩

摘要: 大型语言模型（LLMs）在复杂推理任务中表现出了显着的性能，但它们的效率受到了生成长序列标记所需的大量内存和计算成本的限制。在本文中，我们提出了LightThinker，这是一种新颖的方法，可以使LLMs在推理过程中动态压缩中间思维。受人类认知过程的启发，LightThinker将冗长的思维步骤压缩成紧凑的表示，并丢弃原始的推理链，从而显著减少了存储在上下文窗口中的标记数量。通过训练模型何时以及如何执行压缩，通过数据构造，将隐藏状态映射到简化的主要标记，并创建专门的注意力掩模来实现这一点。此外，我们引入了依赖（Dep）指标，通过测量生成过程中对历史标记的依赖程度来量化压缩程度。对四个数据集和两个模型的广泛实验表明，LightThinker减少了峰值内存使用量和推理时间，同时保持了竞争力的准确性。我们的工作为提高LLMs在复杂推理任务中的效率提供了一个新方向，而不牺牲性能。代码发布在https://github.com/zjunlp/LightThinker。

更新时间: 2025-09-23 17:20:22

领域: cs.CL,cs.AI,cs.IR,cs.LG,cs.MM

下载: http://arxiv.org/abs/2502.15589v2

Unified Spatiotemporal Physics-Informed Learning (USPIL): A Framework for Modeling Complex Predator-Prey Dynamics

Ecological systems exhibit complex multi-scale dynamics that challenge traditional modeling. New methods must capture temporal oscillations and emergent spatiotemporal patterns while adhering to conservation principles. We present the Unified Spatiotemporal Physics-Informed Learning (USPIL) framework, a deep learning architecture integrating physics-informed neural networks (PINNs) and conservation laws to model predator-prey dynamics across dimensional scales. The framework provides a unified solution for both ordinary (ODE) and partial (PDE) differential equation systems, describing temporal cycles and reaction-diffusion patterns within a single neural network architecture. Our methodology uses automatic differentiation to enforce physics constraints and adaptive loss weighting to balance data fidelity with physical consistency. Applied to the Lotka-Volterra system, USPIL achieves 98.9% correlation for 1D temporal dynamics (loss: 0.0219, MAE: 0.0184) and captures complex spiral waves in 2D systems (loss: 4.7656, pattern correlation: 0.94). Validation confirms conservation law adherence within 0.5% and shows a 10-50x computational speedup for inference compared to numerical solvers. USPIL also enables mechanistic understanding through interpretable physics constraints, facilitating parameter discovery and sensitivity analysis not possible with purely data-driven methods. Its ability to transition between dimensional formulations opens new avenues for multi-scale ecological modeling. These capabilities make USPIL a transformative tool for ecological forecasting, conservation planning, and understanding ecosystem resilience, establishing physics-informed deep learning as a powerful and scientifically rigorous paradigm.

Updated: 2025-09-23 17:20:11

标题: 统一时空物理信息学习（USPIL）：建模复杂的捕食者-猎物动态的框架

摘要: 生态系统展现出复杂的多尺度动力学，挑战传统建模方法。新方法必须捕捉时间振荡和新兴的时空模式，同时遵守保守原则。我们提出了统一的时空物理信息学习（USPIL）框架，这是一个深度学习架构，整合了物理信息神经网络（PINNs）和保守定律，用于模拟跨尺度的捕食-被捕食动态。该框架为普通（ODE）和偏微分（PDE）方程系统提供了统一的解决方案，描述了单一神经网络架构内的时间循环和反应扩散模式。我们的方法利用自动微分来强制执行物理约束，并使用自适应损失加权来平衡数据忠实度与物理一致性。应用于Lotka-Volterra系统，USPIL实现了1D时间动态的98.9%相关性（损失：0.0219，MAE：0.0184），并捕捉了2D系统中的复杂螺旋波（损失：4.7656，模式相关性：0.94）。验证确认了保守定律的遵守在0.5%内，并显示相对于数值求解器，推断速度提高了10-50倍。USPIL还通过可解释的物理约束实现了对机理的理解，促进了参数发现和敏感性分析，这是纯数据驱动方法无法做到的。其能够在不同维度的表达之间转换的能力为多尺度生态建模开辟了新的途径。这些能力使USPIL成为生态预测、保护规划和理解生态系统弹性的变革性工具，奠定了物理信息学深度学习作为一种强大和科学严谨的范式。

更新时间: 2025-09-23 17:20:11

领域: cs.LG,physics.app-ph,92D25, 35K57, 68T07,I.2.6; J.3; G.1.8

下载: http://arxiv.org/abs/2509.13425v3

Defending against Stegomalware in Deep Neural Networks with Permutation Symmetry

Deep neural networks are being utilized in a growing number of applications, both in production systems and for personal use. Network checkpoints are as a consequence often shared and distributed on various platforms to ease the development process. This work considers the threat of neural network stegomalware, where malware is embedded in neural network checkpoints at a negligible cost to network accuracy. This constitutes a significant security concern, but is nevertheless largely neglected by the deep learning practitioners and security specialists alike. We propose the first effective countermeasure to these attacks. In particular, we show that state-of-the-art neural network stegomalware can be efficiently and effectively neutralized through shuffling the column order of the weight- and bias-matrices, or equivalently the channel-order of convolutional layers. We show that this effectively corrupts payloads that have been embedded by state-of-the-art methods in neural network steganography at no cost to network accuracy, outperforming competing methods by a significant margin. We then discuss possible means by which to bypass this defense, additional defense methods, and advocate for continued research into the security of machine learning systems.

Updated: 2025-09-23 17:15:38

标题: 在具有置换对称性的深度神经网络中防御Stegomalware

摘要: 深度神经网络正在越来越多地应用于生产系统和个人使用中。网络检查点经常在各种平台上共享和分发，以便简化开发过程。本文考虑了神经网络隐秘恶意软件的威胁，即恶意软件被嵌入神经网络检查点中，对网络准确性的成本微乎其微。这构成了一个重要的安全问题，但却在深度学习从业者和安全专家中大多被忽视。我们提出了对这些攻击的第一个有效对策。特别是，我们展示了通过对权重和偏置矩阵的列顺序进行混洗，或等效地对卷积层的通道顺序进行混洗，可以有效地中和最先进的神经网络隐秘恶意软件。我们展示了这有效地破坏了通过最先进方法嵌入的神经网络隐写术的载荷，而不会对网络准确性造成任何成本，明显优于竞争方法。然后，我们讨论了可能绕过这种防御的方法，额外的防御方法，并倡导继续研究机器学习系统的安全性。

更新时间: 2025-09-23 17:15:38

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2509.20399v1

PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable Reinforcement Learning

Partial observability presents a significant challenge for Safe Reinforcement Learning (Safe RL), as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information in Safe RL. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer (PIGDreamer), a model-based RL approach that leverages privileged information to enhance the agent's safety and performance through privileged representation alignment and an asymmetric actor-critic structure. Our empirical results demonstrate that PIGDreamer significantly outperforms existing Safe RL methods. Furthermore, compared to alternative privileged RL methods, our approach exhibits enhanced performance, robustness, and efficiency. Codes are available at: https://github.com/hggforget/PIGDreamer.

Updated: 2025-09-23 17:15:02

标题: PIGDreamer：基于特权信息引导的世界模型，用于安全的部分可观察强化学习

摘要: 部分可观察性对安全强化学习(Safe RL)构成了重要挑战，因为它阻碍了潜在风险和奖励的识别。在训练过程中利用特定类型的特权信息来减轻部分可观察性的影响已经取得了显著的实证成功。在本文中，我们提出了非对称约束部分可观察马尔可夫决策过程(ACPOMDPs)来理论上探讨在Safe RL中整合特权信息的优势。基于ACPOMDPs，我们提出了特权信息引导的Dreamer(PIGDreamer)，这是一种基于模型的RL方法，利用特权信息通过特权表示对齐和非对称的演员-评论家结构来增强代理的安全性和性能。我们的实证结果表明，PIGDreamer明显优于现有的Safe RL方法。此外，与替代特权RL方法相比，我们的方法表现出了增强的性能、鲁棒性和效率。代码可在以下链接找到：https://github.com/hggforget/PIGDreamer。

更新时间: 2025-09-23 17:15:02

领域: cs.LG

下载: http://arxiv.org/abs/2508.02159v2

Communication-Efficient Federated Learning with Adaptive Number of Participants

Rapid scaling of deep learning models has enabled performance gains across domains, yet it introduced several challenges. Federated Learning (FL) has emerged as a promising framework to address these concerns by enabling decentralized training. Nevertheless, communication efficiency remains a key bottleneck in FL, particularly under heterogeneous and dynamic client participation. Existing methods, such as FedAvg and FedProx, or other approaches, including client selection strategies, attempt to mitigate communication costs. However, the problem of choosing the number of clients in a training round remains extremely underexplored. We introduce Intelligent Selection of Participants (ISP), an adaptive mechanism that dynamically determines the optimal number of clients per round to enhance communication efficiency without compromising model accuracy. We validate the effectiveness of ISP across diverse setups, including vision transformers, real-world ECG classification, and training with gradient compression. Our results show consistent communication savings of up to 30\% without losing the final quality. Applying ISP to different real-world ECG classification setups highlighted the selection of the number of clients as a separate task of federated learning.

Updated: 2025-09-23 17:13:57

标题: 具有自适应参与者数量的通信高效联邦学习

摘要: 深度学习模型的快速扩展已经在各个领域实现了性能提升，但也带来了一些挑战。联邦学习（FL）已经成为一个有希望的框架，通过实现分散式训练来解决这些问题。然而，在FL中，通信效率仍然是一个关键瓶颈，特别是在异构和动态客户端参与的情况下。现有的方法，如FedAvg和FedProx，或其他方法，包括客户端选择策略，试图减少通信成本。然而，选择每轮训练中客户端数量的问题仍然极少被探讨。我们引入了智能参与者选择（ISP），这是一种自适应机制，动态确定每轮的最佳客户端数量，以提高通信效率而不损失模型精度。我们验证了ISP在不同设置下的有效性，包括视觉变换器、真实世界的心电图分类和梯度压缩训练。我们的结果显示，通信节省达到30\%，而最终质量没有损失。将ISP应用于不同的真实世界心电图分类设置中，突出了选择客户端数量作为联邦学习的一个单独任务。

更新时间: 2025-09-23 17:13:57

领域: cs.LG

下载: http://arxiv.org/abs/2508.13803v2

Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps

Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method's superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion's complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at https://github.com/TeCSAR-UNCC/Pose-Quantization.

Updated: 2025-09-23 17:12:20

标题: 对抗性精炼的VQ-GAN与稠密运动标记化用于时空热图

摘要: 连续的人体运动理解仍然是计算机视觉中的一个核心挑战，因为其高维度和固有冗余性。高效的压缩和表示对于分析复杂运动动态至关重要。在这项工作中，我们引入了一个经过对抗精细化的VQ-GAN框架，用于压缩时空热图，同时保留人体运动的细粒度痕迹。我们的方法将密集运动标记与对抗精细化相结合，消除了在非对抗基线中观察到的重建伪影，如运动模糊和时间错位。我们在CMU Panoptic数据集上的实验提供了我们方法优越性的确凿证据，优于dVAE基线9.31%的SSIM，并将时间不稳定性降低了37.1%。此外，我们的密集标记策略实现了对运动复杂性的新颖分析，揭示了2D运动可以通过紧凑的128标记词汇表最佳表示，而3D运动的复杂性需要一个更大的1024标记码本以实现忠实重建。这些结果建立了在各种运动分析应用中实际部署可行性。此工作的代码基础可在https://github.com/TeCSAR-UNCC/Pose-Quantization找到。

更新时间: 2025-09-23 17:12:20

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.19252v1

Recovering Wasserstein Distance Matrices from Few Measurements

This paper proposes two algorithms for estimating square Wasserstein distance matrices from a small number of entries. These matrices are used to compute manifold learning embeddings like multidimensional scaling (MDS) or Isomap, but contrary to Euclidean distance matrices, are extremely costly to compute. We analyze matrix completion from upper triangular samples and Nystr\"{o}m completion in which $\mathcal{O}(d\log(d))$ columns of the distance matrices are computed where $d$ is the desired embedding dimension, prove stability of MDS under Nystr\"{o}m completion, and show that it can outperform matrix completion for a fixed budget of sample distances. Finally, we show that classification of the OrganCMNIST dataset from the MedMNIST benchmark is stable on data embedded from the Nystr\"{o}m estimation of the distance matrix even when only 10\% of the columns are computed.

Updated: 2025-09-23 17:11:33

标题: 从少量测量中恢复Wasserstein距离矩阵

摘要: 本文提出了两种算法，用于从少量条目中估计平方Wasserstein距离矩阵。这些矩阵用于计算类似多维缩放（MDS）或Isomap的流形学习嵌入，但与欧几里德距离矩阵相比，计算成本极高。我们分析了从上三角样本和Nystr\"{o}m完成中的矩阵补全，在Nystr\"{o}m完成下计算了$\mathcal{O}(d\log(d))$列距离矩阵，其中$d$是所需的嵌入维度，证明了MDS在Nystr\"{o}m完成下的稳定性，并表明它可以在固定样本距离预算下胜过矩阵补全。最后，我们展示了从MedMNIST基准测试的OrganCMNIST数据集进行分类，通过从Nystr\"{o}m估计距离矩阵嵌入数据，即使只计算了10\%的列，也是稳定的。

更新时间: 2025-09-23 17:11:33

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2509.19250v1

Reinforcement Learning on Pre-Training Data

The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.

Updated: 2025-09-23 17:10:40

标题: 对预训练数据的强化学习

摘要: 随着计算资源的指数级增长和高质量文本数据有限增长之间的差距日益扩大，传统的大型语言模型（LLMs）缩放方法受到限制。为了解决这一挑战，我们介绍了基于预训练数据的强化学习（RLPT），这是一种用于优化LLMs的新的训练时间缩放范式。与以往通过监督学习主要扩展训练的方法不同，RLPT使策略能够自主探索有意义的轨迹，从预训练数据中学习并通过强化学习（RL）改进其能力。虽然现有的RL策略，如人类反馈强化学习（RLHF）和具有可验证奖励的强化学习（RLVR），依赖于人类标注来构建奖励，但RLPT通过直接从预训练数据中导出奖励信号消除了这种依赖性。具体而言，它采用了下一个段落推理目标，奖励策略准确预测基于先前上下文的随后文本段。这种表述允许RL在预训练数据上进行缩放，鼓励在更广泛的上下文中探索更丰富的轨迹，从而培养更具可推广性的推理能力。在多个模型上进行的广泛实验验证了RLPT的有效性，例如，当应用于Qwen3-4B-Base时，RLPT在MMLU、MMLU-Pro、GPQA-Diamond、KOR-Bench、AIME24和AIME25上分别产生了3.0、5.1、8.1、6.0、6.6和5.3的绝对改善。结果进一步显示了有利的缩放行为，表明在更多计算资源的情况下继续获得更大的收益潜力。此外，RLPT为LLMs扩展了推理边界，增强了RLVR的性能。

更新时间: 2025-09-23 17:10:40

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19249v1

CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference, we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only single or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By only leveraging a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods. We release the code and data in https://github.com/zjunlp/CaKE.

Updated: 2025-09-23 17:10:14

标题: CaKE：电路感知编辑实现可泛化知识学习者

摘要: 知识编辑（KE）使大型语言模型（LLMs）中过时或不正确的信息得以修改。虽然现有的KE方法可以更新孤立的事实，但它们经常无法将这些更新推广到依赖修改后知识的多跳推理任务中。通过对推理电路的分析 - 即LLMs用于基于知识的推理的神经通路，我们发现目前的局部层次化KE方法（例如MEMIT，WISE），这些方法仅编辑单个或少数模型层，不能充分将更新后的知识集成到这些推理通路中。为了解决这一限制，我们提出了CaKE（基于电路的知识编辑），这是一种增强更新后知识在LLMs中有效整合的新方法。通过仅利用我们基于电路的分析指导的少量策划数据样本，CaKE促使模型开发出适当的推理电路，以用于新纳入的知识。实验证明，CaKE使得在相关推理任务中更准确和一致地使用编辑后的知识，MQuAKE数据集上的多跳推理准确度平均提高20％，同时比现有的KE方法需要更少的内存。我们在https://github.com/zjunlp/CaKE上发布了代码和数据。

更新时间: 2025-09-23 17:10:14

领域: cs.CL,cs.AI,cs.CV,cs.IR,cs.LG

下载: http://arxiv.org/abs/2503.16356v2

Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties cast prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. So we introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems. We carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining,", "overpass was painted blue," etc.). See more results and interactive demos at https://boyangdeng.com/visual-chronicles.

Updated: 2025-09-23 17:04:48

标题: 视觉编年史：使用多模式LLM分析大量图像收藏品

摘要: 我们提出了一种使用多模态LLMs（MLLMs）的系统，用于分析一个包含数千万张图片的大型数据库，这些图片是在不同时间拍摄的，旨在发现时间变化中的模式。具体来说，我们的目标是在一定时期内捕捉城市中频繁共同发生的变化（“趋势”）。与先前的视觉分析不同，我们的分析回答开放式查询（例如，“城市中频繁变化的类型是什么？”）而无需预先确定目标主题或训练标签。这些属性使得以前的基于学习或无监督的视觉分析工具不适用。我们确定MLLMs是一种具有开放式语义理解能力的新型工具。然而，我们的数据集的规模比MLLM摄取的上下文大四个数量级。因此，我们引入了一种自下而上的程序，将庞大的视觉分析问题分解为更易处理的子问题。我们仔细设计了针对每个子问题的基于MLLM的解决方案。在与我们系统的实验和消融研究中，我们发现它明显优于基线，并且能够从大城市中拍摄的图片中发现有趣的趋势（例如，“户外用餐区的增加”，“天桥被涂成蓝色”等）。更多结果和交互式演示请参见https://boyangdeng.com/visual-chronicles。

更新时间: 2025-09-23 17:04:48

领域: cs.CV,cs.AI,cs.CY

下载: http://arxiv.org/abs/2504.08727v3

LookAhead Tuning: Safer Language Models via Partial Answer Previews

Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often compromises their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning. The method introduces two simple strategies that modify training data by previewing partial answer prefixes, thereby minimizing perturbations to the model's initial token distributions and maintaining its built-in safety mechanisms. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs.

Updated: 2025-09-23 17:04:18

标题: 向前调整：通过部分答案预览实现更安全的语言模型

摘要: Fine-tuning使大型语言模型（LLMs）能够适应特定领域，但往往会损害它们先前建立的安全对齐。为了减轻在fine-tuning过程中模型安全性的下降，我们引入了LookAhead Tuning，这是一种轻量级且有效的数据驱动方法，可以在fine-tuning过程中保持安全性。该方法引入了两种简单的策略，通过预览部分答案前缀来修改训练数据，从而最小化对模型初始标记分布的扰动，并保持其内置的安全机制。全面的实验表明，LookAhead Tuning有效地保持了模型的安全性，同时不牺牲下游任务的稳健性能。我们的研究结果将LookAhead Tuning定位为一种可靠且高效的解决方案，用于对LLMs进行安全和有效的适应。

更新时间: 2025-09-23 17:04:18

领域: cs.CL,cs.AI,cs.CV,cs.LG,cs.MM

下载: http://arxiv.org/abs/2503.19041v2

Linear Regression under Missing or Corrupted Coordinates

We study multivariate linear regression under Gaussian covariates in two settings, where data may be erased or corrupted by an adversary under a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $\eta$-fraction of samples per coordinate; a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even information-theoretically. Unlike the clean setting, where estimation error vanishes with more samples, here the optimal error remains a positive function of the problem parameters. Our main contribution is to characterize this error up to constant factors across essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error that match the error of (computationally efficient) algorithms. A key implication is that, perhaps surprisingly, the optimal error in the missing data setting matches that in the corruption setting-so knowing the corruption locations offers no general advantage.

Updated: 2025-09-23 17:01:43

标题: 缺失或损坏坐标下的线性回归

摘要: 我们研究了在高斯协变量下的多元线性回归问题，在两种情境下，数据可能会被敌对方擦除或损坏，其擦除或损坏受到坐标-wise预算的限制。在不完整数据情境下，敌对方可以检查数据集，并在每个坐标中的样本中删除多达$\eta$比例的条目；这是一种较强的非随机缺失模型。在数据损坏情境下，敌对方则会任意替换数值，并且损坏位置对学习者来说是未知的。尽管在缺失数据方面已经有了大量研究，但在这种敌对性缺失情况下的线性回归问题仍然不太清楚，即使从信息理论的角度来看也是如此。与干净数据情境不同，在那里，随着样本数量的增加，估计误差会消失，但在这里，最优误差仍然是问题参数的一个正函数。我们的主要贡献是对整个参数范围内的这种误差进行特征化，直到常数因子。具体来说，我们建立了可达误差的新颖信息理论下界，这些下界与（计算有效的）算法的误差相匹配。一个关键影响是，也许令人惊讶的是，在缺失数据情况下的最优误差与损坏情况下的最优误差是相匹配的-因此，了解损坏位置并没有一般优势。

更新时间: 2025-09-23 17:01:43

领域: cs.DS,cs.LG,math.ST,stat.ML,stat.TH

下载: http://arxiv.org/abs/2509.19242v1

MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS$^{2}$, a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we replace the memory-intensive spherical harmonics with lightweight, arbitrarily oriented spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS$^{2}$ achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality. Project page: https://megs-2.github.io/

Updated: 2025-09-23 17:01:06

标题: MEGS$^{2}$: 基于球形高斯函数和统一修剪的内存高效高斯投影

摘要: 3D高斯点状（3DGS）已经成为主导的新视图合成技术，但其高内存消耗严重限制了其在边缘设备上的适用性。越来越多的3DGS压缩方法已经被提出，以使3DGS更高效，然而大多数方法只专注于存储压缩，未能解决渲染内存的关键瓶颈。为了解决这个问题，我们引入了MEGS$^{2}$，一个新颖的内存高效框架，通过联合优化两个关键因素：总基元数量和每个基元的参数，实现了前所未有的内存压缩。具体来说，我们将内存密集型的球谐替换为轻量级、任意方向的球形高斯叶作为我们的颜色表示。更重要的是，我们提出了一个统一的软剪枝框架，将基元数量和叶子数量剪枝建模为一个受限制的优化问题。实验表明，与现有方法相比，MEGS$^{2}$实现了50%的静态VRAM减少和40%的渲染VRAM减少，同时保持了可比较的渲染质量。项目页面：https://megs-2.github.io/

更新时间: 2025-09-23 17:01:06

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.07021v2

AgentInit: Initializing LLM-based Multi-Agent Systems via Diversity and Expertise Orchestration for Effective and Efficient Collaboration

Proper initialization is crucial for any system, particularly in multi-agent systems (MAS), where it plays a pivotal role in determining both the system's efficiency and effectiveness. However, existing MAS initialization methods do not fully account for the collaborative needs of the generated agents in subsequent stages. Inspired by the principles of effective team composition, we propose AgentInit, which aims to optimize the structure of agent teams. Specifically, in addition to multi-round interactions and reflections between agents during agent generation, AgentInit incorporates a Natural Language to Format mechanism to ensure consistency and standardization. Balanced team selection strategies using Pareto principles are subsequently applied to jointly consider agent team diversity and task relevance to promote effective and efficient collaboration and enhance overall system performance. Experiments show that AgentInit consistently outperforms state-of-the-art initialization methods and pre-defined strategies across various frameworks and tasks, achieving an overall performance improvement of up to 1.2 and 1.6, respectively, while also significantly reducing token consumption. Further analysis confirms its strong transferability to similar tasks and verifies the effectiveness of its key components, demonstrating its capability and adaptability as a reliable MAS initialization method. Source code and models are available at https://github.com/1737423697/AgentInit.

Updated: 2025-09-23 16:58:54

标题: AgentInit：通过多样性和专业知识编排初始化基于LLM的多智能体系统，以实现有效和高效的协作

摘要: 适当的初始化对于任何系统都至关重要，特别是在多agent系统（MAS）中，它在决定系统效率和有效性方面发挥着关键作用。然而，现有的MAS初始化方法并未充分考虑生成的代理在后续阶段的协作需求。受有效团队构成原则的启发，我们提出了AgentInit，旨在优化代理团队的结构。具体而言，除了代理生成过程中代理之间的多轮交互和反思外，AgentInit还结合了一种自然语言格式机制，以确保一致性和标准化。随后采用平衡团队选择策略，使用Pareto原则共同考虑代理团队的多样性和任务相关性，促进有效和高效的协作，并提升整体系统性能。实验表明，AgentInit在各种框架和任务中始终优于最先进的初始化方法和预定义策略，分别实现了高达1.2和1.6的整体性能改进，同时显著减少了令牌消耗。进一步分析证实了它对类似任务的强大可转移性，并验证了其关键组成部分的有效性，展示了其作为可靠MAS初始化方法的能力和适应性。源代码和模型可在https://github.com/1737423697/AgentInit 上找到。

更新时间: 2025-09-23 16:58:54

领域: cs.AI

下载: http://arxiv.org/abs/2509.19236v1

Stability and Generalization of Adversarial Diffusion Training

Algorithmic stability is an established tool for analyzing generalization. While adversarial training enhances model robustness, it often suffers from robust overfitting and an enlarged generalization gap. Although recent work has established the convergence of adversarial training in decentralized networks, its generalization properties remain unexplored. This work presents a stability-based generalization analysis of adversarial training under the diffusion strategy for convex losses. We derive a bound showing that the generalization error grows with both the adversarial perturbation strength and the number of training steps, a finding consistent with single-agent case but novel for decentralized settings. Numerical experiments on logistic regression validate these theoretical predictions.

Updated: 2025-09-23 16:55:30

标题: 对抗性扩散训练的稳定性和泛化能力

摘要: 算法稳定性是分析泛化的一种已建立的工具。尽管对抗训练增强了模型的鲁棒性，但往往会遭受鲁棒过拟合和扩大的泛化差距的困扰。尽管最近的工作已经建立了分散网络中对抗训练的收敛性，但其泛化特性仍未被探索。本文提出了基于稳定性的对抗训练在扩散策略下的泛化分析，适用于凸损失。我们推导了一个界限，表明泛化误差随着对抗扰动强度和训练步数的增加而增加，这一发现与单智能体案例一致，但对于分散设置而言是新颖的。对逻辑回归的数值实验验证了这些理论预测。

更新时间: 2025-09-23 16:55:30

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2509.19234v1

Study Design and Demystification of Physics Informed Neural Networks for Power Flow Simulation

In the context of the energy transition, with increasing integration of renewable sources and cross-border electricity exchanges, power grids are encountering greater uncertainty and operational risk. Maintaining grid stability under varying conditions is a complex task, and power flow simulators are commonly used to support operators by evaluating potential actions before implementation. However, traditional physical solvers, while accurate, are often too slow for near real-time use. Machine learning models have emerged as fast surrogates, and to improve their adherence to physical laws (e.g., Kirchhoff's laws), they are often trained with embedded constraints which are also known as physics-informed or hybrid models. This paper presents an ablation study to demystify hybridization strategies, ranging from incorporating physical constraints as regularization terms or unsupervised losses, and exploring model architectures from simple multilayer perceptrons to advanced graph-based networks enabling the direct optimization of physics equations. Using our custom benchmarking pipeline for hybrid models called LIPS, we evaluate these models across four dimensions: accuracy, physical compliance, industrial readiness, and out-of-distribution generalization. The results highlight how integrating physical knowledge impacts performance across these criteria. All the implementations are reproducible and provided in the corresponding Github page.

Updated: 2025-09-23 16:55:13

标题: 研究设计和物理信息神经网络在电力流模拟中的揭秘

摘要: 在能源转型的背景下，随着可再生能源的日益融合和跨境电力交流，电力网络面临着更大的不确定性和运行风险。在不同条件下保持电网稳定是一项复杂的任务，电力流模拟器通常用于支持运营商在实施前评估潜在的行动。然而，传统的物理求解器虽然准确，但往往对于近实时使用来说速度太慢。机器学习模型已经成为快速替代品，为了提高它们对物理定律（例如基尔霍夫定律）的遵从性，它们通常会被训练以嵌入式约束，也称为物理知识或混合模型。本文提出了一个消融研究，以揭示从将物理约束作为正则化项或无监督损失，到探索从简单的多层感知器到高级基于图的网络的模型架构，使直接优化物理方程成为可能的混合策略。使用我们称为LIPS的混合模型的自定义基准管道，我们评估这些模型在四个维度上：准确性、物理合规性、工业准备性和超出分布的泛化能力。结果突显了整合物理知识如何影响这些标准下的性能。所有的实施都是可重现的，并提供在相应的Github页面上。

更新时间: 2025-09-23 16:55:13

领域: cs.LG,I.2.0; I.2.4; I.2.6

下载: http://arxiv.org/abs/2509.19233v1

Graph Data Modeling: Molecules, Proteins, & Chemical Processes

Graphs are central to the chemical sciences, providing a natural language to describe molecules, proteins, reactions, and industrial processes. They capture interactions and structures that underpin materials, biology, and medicine. This primer, Graph Data Modeling: Molecules, Proteins, & Chemical Processes, introduces graphs as mathematical objects in chemistry and shows how learning algorithms (particularly graph neural networks) can operate on them. We outline the foundations of graph design, key prediction tasks, representative examples across chemical sciences, and the role of machine learning in graph-based modeling. Together, these concepts prepare readers to apply graph methods to the next generation of chemical discovery.

Updated: 2025-09-23 16:53:18

标题: 图数据建模：分子、蛋白质和化学过程

摘要: 图表在化学科学中起着核心作用，提供了描述分子、蛋白质、反应和工业过程的自然语言。它们捕捉了支撑材料、生物学和医学的相互作用和结构。本文摘要介绍了图数据建模：分子、蛋白质和化学过程，将图表作为化学中的数学对象，并展示了学习算法（特别是图神经网络）如何在其上运行。我们概述了图设计的基础、关键预测任务、化学科学中的代表性示例以及机器学习在基于图的建模中的作用。这些概念共同帮助读者将图方法应用于下一代化学发现中。

更新时间: 2025-09-23 16:53:18

领域: cs.LG,stat.AP

下载: http://arxiv.org/abs/2508.19356v3

Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation

We present ChiReSSD, a speech reconstruction framework that preserves children speaker's identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content in the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization for reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations.

Updated: 2025-09-23 16:53:07

标题: 找到我的声音：自动生成混乱语音的重建，用于自动临床评估

摘要: 我们提出了ChiReSSD，这是一个保留儿童说话者身份的语音重建框架，同时抑制发音错误。与以往基于健康成人语音训练的方法不同，ChiReSSD适应了患有言语音障碍（SSD）的儿童的声音，特别强调音高和语调。我们在STAR数据集上评估了我们的方法，并报告了词汇准确性和说话者身份保留方面的显著改善。此外，我们自动预测了原始和重建对中的语音内容，其中校正的辅音比例与正确的辅音百分比（PCC），一种临床言语评估指标，相当。我们的实验显示自动和人工专家注释之间的皮尔逊相关系数为0.63，突显了减少手动转录负担的潜力。此外，在TORGO数据集上的实验表明了对重建成人发音障碍者语音的有效泛化能力。我们的结果表明，解耦的基于风格的TTS重建可以为不同临床人群提供保留身份的语音。

更新时间: 2025-09-23 16:53:07

领域: cs.SD,cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.19231v1

MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation

With the widespread deployment of dashcams and advancements in computer vision, developing accident prediction models from the dashcam perspective has become critical for proactive safety interventions. However, two key challenges persist: modeling feature-level interactions among traffic participants (often occluded in dashcam views) and capturing complex, asynchronous multi-temporal behavioral cues preceding accidents. To deal with these two challenges, a Multi-scale Feature Interaction Network (MsFIN) is proposed for early-stage accident anticipation from dashcam videos. MsFIN has three layers for multi-scale feature aggregation, temporal feature processing and multi-scale feature post fusion, respectively. For multi-scale feature aggregation, a Multi-scale Module is designed to extract scene representations at short-term, mid-term and long-term temporal scales. Meanwhile, the Transformer architecture is leveraged to facilitate comprehensive feature interactions. Temporal feature processing captures the sequential evolution of scene and object features under causal constraints. In the multi-scale feature post fusion stage, the network fuses scene and object features across multiple temporal scales to generate a comprehensive risk representation. Experiments on DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness. Ablation studies validate the effectiveness of each module in MsFIN, highlighting how the network achieves superior performance through multi-scale feature fusion and contextual interaction modeling.

Updated: 2025-09-23 16:49:25

标题: MsFIN：用于交通事故预测的多尺度特征交互网络

摘要: 随着行车记录仪的广泛部署和计算机视觉的进步，从行车记录仪视角开发事故预测模型对于积极的安全干预变得至关重要。然而，仍然存在两个关键挑战：在行车记录仪视图中通常被遮挡的交通参与者之间的特征级交互建模，以及捕捉事故前复杂的、异步的多时态行为线索。为了解决这两个挑战，提出了一种用于从行车记录仪视频中提前预测事故的多尺度特征交互网络（MsFIN）。MsFIN分别具有三个层次，用于多尺度特征聚合、时间特征处理和多尺度特征后融合。对于多尺度特征聚合，设计了一个多尺度模块，用于提取短期、中期和长期时间尺度上的场景表示。同时，利用Transformer架构促进全面的特征交互。时间特征处理捕捉在因果约束下场景和对象特征的序列演变。在多尺度特征后融合阶段，网络跨多个时间尺度融合场景和对象特征，生成全面的风险表示。在DAD和DADA数据集上的实验证明，MsFIN在预测正确性和提前性方面显著优于具有单一尺度特征提取的最新模型。消融研究验证了MsFIN中每个模块的有效性，突出了网络如何通过多尺度特征融合和上下文交互建模实现卓越性能。

更新时间: 2025-09-23 16:49:25

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.19227v1

Neighbor Embeddings Using Unbalanced Optimal Transport Metrics

This paper proposes the use of the Hellinger--Kantorovich metric from unbalanced optimal transport (UOT) in a dimensionality reduction and learning (supervised and unsupervised) pipeline. The performance of UOT is compared to that of regular OT and Euclidean-based dimensionality reduction methods on several benchmark datasets including MedMNIST. The experimental results demonstrate that, on average, UOT shows improvement over both Euclidean and OT-based methods as verified by statistical hypothesis tests. In particular, on the MedMNIST datasets, UOT outperforms OT in classification 81\% of the time. For clustering MedMNIST, UOT outperforms OT 83\% of the time and outperforms both other metrics 58\% of the time.

Updated: 2025-09-23 16:49:15

标题: 使用不平衡最优输运度量的邻居嵌入

摘要: 本文提出在一个降维和学习（监督和无监督）流程中使用Hellinger-Kantorovich距离度量来自不平衡最优输运（UOT）。对比了UOT在几个基准数据集（包括MedMNIST）上的表现与常规OT和基于欧氏距离的降维方法。实验结果表明，平均而言，UOT相比欧氏距离和OT方法显示出改善，通过统计假设检验验证。特别是在MedMNIST数据集上，UOT在分类方面表现优于OT的概率为81%。对于MedMNIST的聚类，UOT在83%的情况下优于OT，并且在58%的情况下优于其他两种度量。

更新时间: 2025-09-23 16:49:15

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2509.19226v1

Systematic Comparative Analysis of Large Pretrained Language Models on Contextualized Medication Event Extraction

Attention-based models have become the leading approach in modeling medical language for Natural Language Processing (NLP) in clinical notes. These models outperform traditional techniques by effectively capturing contextual rep- resentations of language. In this research a comparative analysis is done amongst pre- trained attention based models namely Bert Base, BioBert, two variations of Bio+Clinical Bert, RoBerta, and Clinical Long- former on task related to Electronic Health Record (EHR) information extraction. The tasks from Track 1 of Harvard Medical School's 2022 National Clinical NLP Challenges (n2c2) are considered for this comparison, with the Contextualized Medication Event Dataset (CMED) given for these task. CMED is a dataset of unstructured EHRs and annotated notes that contain task relevant information about the EHRs. The goal of the challenge is to develop effective solutions for extracting contextual information related to patient medication events from EHRs using data driven methods. Each pre-trained model is fine-tuned and applied on CMED to perform medication extraction, medical event detection, and multi-dimensional medication event context classification. Pro- cessing methods are also detailed for breaking down EHRs for compatibility with the applied models. Performance analysis has been carried out using a script based on constructing medical terms from the evaluation portion of CMED with metrics including recall, precision, and F1-Score. The results demonstrate that models pre-trained on clinical data are more effective in detecting medication and medication events, but Bert Base, pre- trained on general domain data showed to be the most effective for classifying the context of events related to medications.

Updated: 2025-09-23 16:48:28

标题: 大型预训练语言模型在上下文化药物事件提取上的系统性比较分析

摘要: 注意力模型已成为在医学语言建模中自然语言处理（NLP）的主要方法。这些模型通过有效地捕捉语言的上下文表示，优于传统技术。本研究对预训练的基于注意力的模型进行了比较分析，包括Bert Base、BioBert、两种Bio+Clinical Bert的变体、RoBerta和Clinical Long-former，针对电子健康记录（EHR）信息提取任务进行了分析。该比较考虑了哈佛医学院2022年国家临床NLP挑战（n2c2）第1轨道的任务，使用了上下文化药物事件数据集（CMED）。CMED是一个包含未结构化EHR和带有任务相关信息的注释笔记的数据集。挑战的目标是开发使用数据驱动方法从EHR中提取与患者用药事件相关的上下文信息的有效解决方案。每个预训练模型都经过微调，并应用于CMED执行药物提取、医疗事件检测和多维药物事件上下文分类。还详细介绍了用于将EHR拆分以与应用模型兼容的处理方法。使用基于从CMED的评估部分构建医学术语的脚本进行了性能分析，包括召回率、精确度和F1分数。结果表明，预训练于临床数据的模型更有效地检测药物和药物事件，但在分类与药物相关事件的上下文方面，预训练于一般领域数据的Bert Base表现最为有效。

更新时间: 2025-09-23 16:48:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.19224v1

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive survey of SAEs for interpreting and understanding the internal workings of LLMs. Our major contributions include: (1) exploring the technical framework of SAEs, covering basic architecture, design improvements, and effective training strategies; (2) examining different approaches to explaining SAE features, categorized into input-based and output-based explanation methods; (3) discussing evaluation methods for assessing SAE performance, covering both structural and functional metrics; and (4) investigating real-world applications of SAEs in understanding and manipulating LLM behaviors.

Updated: 2025-09-23 16:48:10

标题: 稀疏自编码器调查：解释大型语言模型的内部机制

摘要: 大型语言模型（LLMs）已经改变了自然语言处理，但它们的内部机制仍然大部分不透明。最近，机械可解释性作为一种理解LLMs内部运作方式的手段吸引了研究界的重视。在各种机械可解释性方法中，稀疏自动编码器（SAEs）由于能够将LLMs中复杂、叠加的特征解开为更易解释的组件而被认为是一种有前途的方法。本文对SAEs进行了全面调查，以解释和理解LLMs的内部运作。我们的主要贡献包括：（1）探索SAEs的技术框架，涵盖基本架构、设计改进和有效训练策略；（2）检验不同的解释SAEs特征的方法，分为基于输入和基于输出的解释方法；（3）讨论评估SAEs性能的方法，涵盖结构和功能度量；以及（4）研究SAEs在理解和操纵LLMs行为方面的实际应用。

更新时间: 2025-09-23 16:48:10

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2503.05613v3

Video Killed the Energy Budget: Characterizing the Latency and Power Regimes of Open Text-to-Video Models

Recent advances in text-to-video (T2V) generation have enabled the creation of high-fidelity, temporally coherent clips from natural language prompts. Yet these systems come with significant computational costs, and their energy demands remain poorly understood. In this paper, we present a systematic study of the latency and energy consumption of state-of-the-art open-source T2V models. We first develop a compute-bound analytical model that predicts scaling laws with respect to spatial resolution, temporal length, and denoising steps. We then validate these predictions through fine-grained experiments on WAN2.1-T2V, showing quadratic growth with spatial and temporal dimensions, and linear scaling with the number of denoising steps. Finally, we extend our analysis to six diverse T2V models, comparing their runtime and energy profiles under default settings. Our results provide both a benchmark reference and practical insights for designing and deploying more sustainable generative video systems.

Updated: 2025-09-23 16:47:03

标题: 视频削减了能源预算：对开放文本到视频模型的延迟和功耗特性的表征

摘要: 最近在文本到视频（T2V）生成领域取得了重大进展，可以从自然语言提示中生成高保真度、时间上连贯的视频片段。然而，这些系统带来了显著的计算成本，其能源需求仍然不明确。在本文中，我们对最先进的开源T2V模型的延迟和能耗进行了系统性研究。我们首先开发了一个计算为主的分析模型，预测了与空间分辨率、时间长度和去噪步骤相关的扩展定律。然后，我们通过对WAN2.1-T2V进行细致实验验证了这些预测，结果显示空间和时间维度呈二次增长，与去噪步骤数量呈线性增长。最后，我们将分析扩展到六种不同的T2V模型，比较它们在默认设置下的运行时间和能耗特性。我们的结果为设计和部署更可持续的生成视频系统提供了基准参考和实用见解。

更新时间: 2025-09-23 16:47:03

领域: cs.LG

下载: http://arxiv.org/abs/2509.19222v1

FedFusion: Federated Learning with Diversity- and Cluster-Aware Encoders for Robust Adaptation under Label Scarcity

Federated learning in practice must contend with heterogeneous feature spaces, severe non-IID data, and scarce labels across clients. We present FedFusion, a federated transfer-learning framework that unifies domain adaptation and frugal labelling with diversity-/cluster-aware encoders (DivEn, DivEn-mix, DivEn-c). Labelled teacher clients guide learner clients via confidence-filtered pseudo-labels and domain-adaptive transfer, while clients maintain personalised encoders tailored to local data. To preserve global coherence under heterogeneity, FedFusion employs similarity-weighted classifier coupling (with optional cluster-wise averaging), mitigating dominance by data-rich sites and improving minority-client performance. The frugal-labelling pipeline combines self-/semi-supervised pretext training with selective fine-tuning, reducing annotation demands without sharing raw data. Across tabular and imaging benchmarks under IID, non-IID, and label-scarce regimes, FedFusion consistently outperforms state-of-the-art baselines in accuracy, robustness, and fairness while maintaining comparable communication and computation budgets. These results show that harmonising personalisation, domain adaptation, and label efficiency is an effective recipe for robust federated learning under real-world constraints.

Updated: 2025-09-23 16:46:06

标题: FedFusion：基于多样性和簇感知编码器的联邦学习，在标签稀缺情况下实现稳健适应

摘要: 在实际应用中，联邦学习必须处理异构特征空间、严重的非独立同分布数据以及客户端之间标签稀缺的问题。我们提出了FedFusion，这是一个统一领域自适应和节约标签的联邦迁移学习框架，其中包括多样性-/聚类感知编码器(DivEn，DivEn-mix，DivEn-c)。标记的教师客户端通过置信度过滤的伪标签和领域自适应迁移指导学习客户端，同时客户端保持针对本地数据量身定制的个性化编码器。为了在异质性下保持全局一致性，FedFusion采用相似度加权分类器耦合（可选的聚类加权平均），减轻数据丰富站点的主导地位，并提高少数客户端的性能。节约标记的流水线结合了自我-/半监督先验训练和选择性微调，减少标注需求而无需共享原始数据。在IID、非IID和标签稀缺制度下的表格和图像基准测试中，FedFusion始终在准确性、鲁棒性和公平性方面优于最先进的基线，同时保持可比的通信和计算预算。这些结果表明，在实际约束条件下，协调个性化、领域自适应和标签效率是实现强大联邦学习的有效方法。

更新时间: 2025-09-23 16:46:06

领域: cs.LG,cs.AI,cs.DC

下载: http://arxiv.org/abs/2509.19220v1

Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence between each latent feature and the model's output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.

Updated: 2025-09-23 16:43:51

标题: 超越输入激活：通过梯度稀疏自动编码器识别具有影响力的潜在因素

摘要: 稀疏自动编码器（SAEs）最近已经成为解释和引导大型语言模型（LLMs）内部表示的强大工具。然而，传统的分析SAEs的方法通常仅依赖于输入端的激活，而不考虑每个潜在特征与模型输出之间的因果关系。本研究建立在两个关键假设的基础上：（1）激活的潜在特征对模型输出的构建并不贡献相等，（2）只有具有高因果影响力的潜在特征才能有效地引导模型。为了验证这些假设，我们提出了梯度稀疏自动编码器（GradSAE），这是一种简单而有效的方法，通过融入输出端梯度信息来识别最具影响力的潜在特征。

更新时间: 2025-09-23 16:43:51

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2505.08080v2

The Transparent Earth: A Multimodal Foundation Model for the Earth's Subsurface

We present the Transparent Earth, a transformer-based architecture for reconstructing subsurface properties from heterogeneous datasets that vary in sparsity, resolution, and modality, where each modality represents a distinct type of observation (e.g., stress angle, mantle temperature, tectonic plate type). The model incorporates positional encodings of observations together with modality encodings, derived from a text embedding model applied to a description of each modality. This design enables the model to scale to an arbitrary number of modalities, making it straightforward to add new ones not considered in the initial design. We currently include eight modalities spanning directional angles, categorical classes, and continuous properties such as temperature and thickness. These capabilities support in-context learning, enabling the model to generate predictions either with no inputs or with an arbitrary number of additional observations from any subset of modalities. On validation data, this reduces errors in predicting stress angle by more than a factor of three. The proposed architecture is scalable and demonstrates improved performance with increased parameters. Together, these advances make the Transparent Earth an initial foundation model for the Earth's subsurface that ultimately aims to predict any subsurface property anywhere on Earth.

Updated: 2025-09-23 16:43:24

标题: 透明地球：地下岩层的多模态基础模型

摘要: 我们提出了透明地球，这是一种基于变化的异构数据集的变压器架构，这些数据集在稀疏性、分辨率和模态上有所不同，其中每种模态代表一种不同的观测类型（例如，应力角、地幔温度、构造板类型）。该模型将观测位置编码与模态编码结合在一起，模态编码是从应用于每种模态描述的文本嵌入模型中获取的。这种设计使模型能够扩展到任意数量的模态，使其可以轻松添加未在初始设计中考虑的新模态。我们目前包括八种模态，涵盖方向角、分类类别以及温度和厚度等连续属性。这些功能支持上下文学习，使模型能够生成预测，无需任何输入或从任何模态子集中的任意数量的附加观测开始。在验证数据上，这将应力角的预测错误减少了三倍以上。所提出的架构是可扩展的，并且随着参数增加表现出更好的性能。这些进展使透明地球成为地球地下模型的初步基础，最终旨在预测地球上任何地方的任何地下属性。

更新时间: 2025-09-23 16:43:24

领域: cs.LG,cs.AI,physics.geo-ph

下载: http://arxiv.org/abs/2509.02783v2

HyKid: An Open MRI Dataset with Expert-Annotated Multi-Structure and Choroid Plexus in Pediatric Hydrocephalus

Evaluation of hydrocephalus in children is challenging, and the related research is limited by a lack of publicly available, expert-annotated datasets, particularly those with segmentation of the choroid plexus. To address this, we present HyKid, an open-source dataset from 48 pediatric patients with hydrocephalus. 3D MRIs were provided with 1mm isotropic resolution, which was reconstructed from routine low-resolution images using a slice-to-volume algorithm. Manually corrected segmentations of brain tissues, including white matter, grey matter, lateral ventricle, external CSF, and the choroid plexus, were provided by an experienced neurologist. Additionally, structured data was extracted from clinical radiology reports using a Retrieval-Augmented Generation framework. The strong correlation between choroid plexus volume and total CSF volume provided a potential biomarker for hydrocephalus evaluation, achieving excellent performance in a predictive model (AUC = 0.87). The proposed HyKid dataset provided a high-quality benchmark for neuroimaging algorithms development, and it revealed the choroid plexus-related features in hydrocephalus assessments. Our datasets are publicly available at https://www.synapse.org/Synapse:syn68544889.

Updated: 2025-09-23 16:42:16

标题: HyKid：儿童脑积水中具有专家标注的多结构和脉络丛的开放式MRI数据集

摘要: 儿童脑积水的评估具有挑战性，相关研究受到公开可用、专家注释数据集的限制，特别是具有脉络丛分割的数据集。为了解决这一问题，我们提出了HyKid，这是一个来自48名患有脑积水的小儿患者的开放源数据集。3D磁共振成像以1mm等轴分辨率提供，通过使用切片到体积算法从常规低分辨率图像重建而成。经验丰富的神经学家提供了脑组织（包括白质、灰质、侧脑室、外部脑脊液和脉络丛）的手动校正分割。此外，使用检索增强生成框架从临床放射学报告中提取了结构化数据。脉络丛体积与总脑脊液体积之间的强相关性为脑积水评估提供了潜在的生物标志物，该生物标志物在预测模型中表现出色（AUC = 0.87）。提出的HyKid数据集为神经影像算法的发展提供了高质量的基准，并揭示了脉络丛相关特征在脑积水评估中的作用。我们的数据集可以在https://www.synapse.org/Synapse:syn68544889 上公开获取。

更新时间: 2025-09-23 16:42:16

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.19218v1

FragmentGPT: A Unified GPT Model for Fragment Growing, Linking, and Merging in Molecular Design

Fragment-Based Drug Discovery (FBDD) is a popular approach in early drug development, but designing effective linkers to combine disconnected molecular fragments into chemically and pharmacologically viable candidates remains challenging. Further complexity arises when fragments contain structural redundancies, like duplicate rings, which cannot be addressed by simply adding or removing atoms or bonds. To address these challenges in a unified framework, we introduce FragmentGPT, which integrates two core components: (1) a novel chemically-aware, energy-based bond cleavage pre-training strategy that equips the GPT-based model with fragment growing, linking, and merging capabilities, and (2) a novel Reward Ranked Alignment with Expert Exploration (RAE) algorithm that combines expert imitation learning for diversity enhancement, data selection and augmentation for Pareto and composite score optimality, and Supervised Fine-Tuning (SFT) to align the learner policy with multi-objective goals. Conditioned on fragment pairs, FragmentGPT generates linkers that connect diverse molecular subunits while simultaneously optimizing for multiple pharmaceutical goals. It also learns to resolve structural redundancies-such as duplicated fragments-through intelligent merging, enabling the synthesis of optimized molecules. FragmentGPT facilitates controlled, goal-driven molecular assembly. Experiments and ablation studies on real-world cancer datasets demonstrate its ability to generate chemically valid, high-quality molecules tailored for downstream drug discovery tasks.

Updated: 2025-09-23 16:41:27

标题: FragmentGPT：一个统一的GPT模型，用于分子设计中的片段生长、连接和合并

摘要: 碎片化药物发现（FBDD）是早期药物开发中一种流行的方法，但设计有效的连接剂将不同的分子碎片组合成化学和药理学可行的候选药物仍然具有挑战性。当分子碎片包含结构冗余时，如重复环，进一步复杂性出现，这些不能简单地通过添加或移除原子或键来解决。为了在统一框架中解决这些挑战，我们引入了FragmentGPT，它集成了两个核心组件：（1）一种新颖的化学感知、基于能量的键裂解预训练策略，为基于GPT的模型提供了分子片段生长、连接和合并的能力，以及（2）一种新颖的带奖励排序对齐和专家探索（RAE）算法，结合专家模仿学习以增强多样性、数据选择和增强以实现帕累托和复合分数最优性，以及监督微调（SFT）以将学习策略与多目标目标对齐。在分子片段对的条件下，FragmentGPT生成连接剂，连接不同的分子亚单位，同时优化多个制药目标。它还学会通过智能合并解决结构冗余，例如重复片段，从而实现优化分子的合成。FragmentGPT促进了受控的、目标驱动的分子组装。对真实世界的癌症数据集进行的实验和消融研究表明，它具有生成化学有效、高质量分子的能力，可为下游药物发现任务定制。

更新时间: 2025-09-23 16:41:27

领域: cs.LG,cs.AI,q-bio.BM

下载: http://arxiv.org/abs/2509.11044v2

Conformal Convolution and Monte Carlo Meta-learners for Predictive Inference of Individual Treatment Effects

Generating probabilistic forecasts of potential outcomes and individual treatment effects (ITE) is essential for risk-aware decision-making in domains such as healthcare, policy, marketing, and finance. We propose two novel methods: the conformal convolution T-learner (CCT) and the conformal Monte Carlo (CMC) meta-learner, that generate full predictive distributions of both potential outcomes and ITEs. Our approaches combine weighted conformal predictive systems with either analytic convolution of potential outcome distributions or Monte Carlo sampling, addressing covariate shift through propensity score weighting. In contrast to other approaches that allow the generation of potential outcome predictive distributions, our approaches are model agnostic, universal, and come with finite-sample guarantees of probabilistic calibration under knowledge of the propensity score. Regarding estimating the ITE distribution, we formally characterize how assumptions about potential outcomes' noise dependency impact distribution validity and establish universal consistency under independence noise assumptions. Experiments on synthetic and semi-synthetic datasets demonstrate that the proposed methods achieve probabilistically calibrated predictive distributions while maintaining narrow prediction intervals and having performant continuous ranked probability scores. Besides probabilistic forecasting performance, we observe significant efficiency gains for the CCT- and CMC meta-learners compared to other conformal approaches that produce prediction intervals for ITE with coverage guarantees.

Updated: 2025-09-23 16:40:14

标题: 共形卷积和蒙特卡洛元学习者用于个体治疗效果的预测推断

摘要: 生成潜在结果和个体治疗效果（ITE）的概率预测对于风险感知决策在诸如医疗保健、政策、营销和金融等领域至关重要。我们提出了两种新颖的方法：符合性卷积T-学习者（CCT）和符合性蒙特卡洛（CMC）元学习者，可以生成潜在结果和ITE的完整预测分布。我们的方法将加权符合性预测系统与潜在结果分布的分析卷积或蒙特卡洛采样相结合，通过倾向得分加权处理协变量转移。与其他允许生成潜在结果预测分布的方法相比，我们的方法是模型无关的、通用的，并在倾向得分知识下具有有限样本保证的概率校准。关于估计ITE分布，我们正式描述了关于潜在结果噪声依赖性假设如何影响分布有效性，并在独立噪声假设下建立了通用一致性。对合成和半合成数据集的实验表明，所提出的方法能够实现概率校准的预测分布，同时保持狭窄的预测区间，并具有良好的连续排名概率分数性能。除了概率预测性能，我们观察到与其他符合性方法相比，CCT和CMC元学习者在为ITE生成具有覆盖保证的预测区间方面具有显著的效率提升。

更新时间: 2025-09-23 16:40:14

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2402.04906v6

QSpark: Towards Reliable Qiskit Code Generation

Quantum circuits must be error-resilient, yet LLMs like Granite-20B-Code and StarCoder often output flawed Qiskit code. We fine-tuned the Qwen2.5-Coder-32B model with two RL methods, Group Relative Policy Optimization (GRPO) and Odds-Ratio Preference Optimization (ORPO), using a richly annotated synthetic dataset. On the Qiskit HumanEval benchmark, ORPO reaches 56.29% Pass@1 ($\approx+10$ pp over Granite-8B-QK) and GRPO hits 49%, both beating all general-purpose baselines; on the original HumanEval they score 65.90% and 63.00%. GRPO performs well on basic tasks (44/78) and excels on intermediate ones (41/68), but neither GRPO nor ORPO solves any of the five advanced tasks, highlighting clear gains yet room for progress in AI-assisted quantum programming.

Updated: 2025-09-23 16:36:24

标题: QSpark: 朝向可靠的Qiskit代码生成

摘要: 量子电路必须具有容错性，但像Granite-20B-Code和StarCoder这样的LLM经常输出有缺陷的Qiskit代码。我们使用两种RL方法，Group Relative Policy Optimization（GRPO）和Odds-Ratio Preference Optimization（ORPO），在一个丰富注释的合成数据集上对Qwen2.5-Coder-32B模型进行了微调。在Qiskit HumanEval基准测试中，ORPO达到了56.29%的Pass@1（比Granite-8B-QK高约10个百分点），而GRPO达到了49%，均超过了所有通用基准线；在原始HumanEval上，它们的得分分别为65.90%和63.00%。GRPO在基本任务（44/78）上表现良好，在中级任务（41/68）上表现出色，但GRPO和ORPO都没有解决任何五个高级任务，突出了在AI辅助量子编程中存在明显进步的同时仍有改进空间。

更新时间: 2025-09-23 16:36:24

领域: cs.SE,cs.AI,quant-ph

下载: http://arxiv.org/abs/2507.12642v2

PPG-Distill: Efficient Photoplethysmography Signals Analysis via Foundation Model Distillation

Photoplethysmography (PPG) is widely used in wearable health monitoring, yet large PPG foundation models remain difficult to deploy on resource-limited devices. We present PPG-Distill, a knowledge distillation framework that transfers both global and local knowledge through prediction-, feature-, and patch-level distillation. PPG-Distill incorporates morphology distillation to preserve local waveform patterns and rhythm distillation to capture inter-patch temporal structures. On heart rate estimation and atrial fibrillation detection, PPG-Distill improves student performance by up to 21.8% while achieving 7X faster inference and reducing memory usage by 19X, enabling efficient PPG analysis on wearables

Updated: 2025-09-23 16:35:38

标题: PPG-Distill: 通过基础模型蒸馏实现高效的光电容积脉搏信号分析

摘要: Photoplethysmography (PPG)被广泛应用于可穿戴健康监测，然而大型PPG基础模型仍然难以部署在资源有限的设备上。我们提出了PPG-Distill，这是一个知识蒸馏框架，通过预测、特征和补丁级别的蒸馏来传递全局和局部知识。PPG-Distill包含形态蒸馏以保留局部波形模式和节律蒸馏以捕捉补丁间的时间结构。在心率估计和房颤检测方面，PPG-Distill提高了学生的性能达21.8%，同时实现了7倍更快的推理速度，并将内存使用减少了19倍，从而实现了在可穿戴设备上进行高效PPG分析。

更新时间: 2025-09-23 16:35:38

领域: cs.LG

下载: http://arxiv.org/abs/2509.19215v1

Steering Multimodal Large Language Models Decoding for Context-Aware Safety

Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.

Updated: 2025-09-23 16:32:25

标题: 为上下文感知安全而引导多模态大语言模型解码

摘要: 多模态大型语言模型（MLLMs）越来越多地被部署在现实世界的应用中，但它们在进行基于上下文的安全决策方面的能力仍然有限。现有方法常常无法平衡过度敏感（对良性查询的无理拒绝）和过度不敏感（对视觉相关风险的错过检测），从而在安全对齐方面存在持续的差距。为了解决这个问题，我们介绍了安全感知对比解码（SafeCoDe），这是一个轻量级且与模型无关的解码框架，根据多模态上下文动态调整标记生成。SafeCoDe分为两个阶段：（1）对比解码机制，通过对比真实图像和高斯噪声图像来突出显示对视觉上下文敏感的标记，以及（2）全局感知标记调节策略，将场景级推理与标记级调整相结合，根据预测的安全判决调整拒绝。通过对多种MLLM架构和安全基准进行广泛实验，涵盖对不敏感性、过度敏感性和一般安全评估的测试，结果显示SafeCoDe在提高上下文敏感的拒绝行为的同时保持了模型的有用性。

更新时间: 2025-09-23 16:32:25

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.19212v1

AlloyInter: Visualising Alloy Mixture Interpolations in t-SNE Representations

This entry description proposes AlloyInter, a novel system to enable joint exploration of input mixtures and output parameters space in the context of the SciVis Contest 2025. We propose an interpolation approach, guided by eXplainable Artificial Intelligence (XAI) based on a learned model ensemble that allows users to discover input mixture ratios by specifying output parameter goals that can be iteratively adjusted and improved towards a goal. We strengthen the capabilities of our system by building upon prior research within the robustness of XAI, as well as combining well-established techniques like manifold learning with interpolation approaches.

Updated: 2025-09-23 16:21:52

标题: AlloyInter：在t-SNE表示中可视化合金混合物插值

摘要: 本条目描述提出了AlloyInter，这是一个新颖的系统，可以在SciVis Contest 2025的背景下启用输入混合物和输出参数空间的联合探索。我们提出了一种插值方法，该方法受可解释人工智能（XAI）的指导，基于学习模型集合，使用户能够通过指定输出参数目标来发现输入混合比例，这可以通过逐步调整和改进向目标迈进。我们通过借鉴XAI的稳健性以及将流形学习与插值方法结合起来的成熟技术，增强了我们系统的能力。

更新时间: 2025-09-23 16:21:52

领域: cs.CE,cs.LG

下载: http://arxiv.org/abs/2509.19202v1

Probabilistic Runtime Verification, Evaluation and Risk Assessment of Visual Deep Learning Systems

Despite achieving excellent performance on benchmarks, deep neural networks often underperform in real-world deployment due to sensitivity to minor, often imperceptible shifts in input data, known as distributional shifts. These shifts are common in practical scenarios but are rarely accounted for during evaluation, leading to inflated performance metrics. To address this gap, we propose a novel methodology for the verification, evaluation, and risk assessment of deep learning systems. Our approach explicitly models the incidence of distributional shifts at runtime by estimating their probability from outputs of out-of-distribution detectors. We combine these estimates with conditional probabilities of network correctness, structuring them in a binary tree. By traversing this tree, we can compute credible and precise estimates of network accuracy. We assess our approach on five different datasets, with which we simulate deployment conditions characterized by differing frequencies of distributional shift. Our approach consistently outperforms conventional evaluation, with accuracy estimation errors typically ranging between 0.01 and 0.1. We further showcase the potential of our approach on a medical segmentation benchmark, wherein we apply our methods towards risk assessment by associating costs with tree nodes, informing cost-benefit analyses and value-judgments. Ultimately, our approach offers a robust framework for improving the reliability and trustworthiness of deep learning systems, particularly in safety-critical applications, by providing more accurate performance estimates and actionable risk assessments.

Updated: 2025-09-23 16:16:02

标题: 概率性运行时验证，视觉深度学习系统的评估和风险评估

摘要: 尽管在基准测试中取得了出色的表现，深度神经网络在实际部署中经常表现不佳，这是因为对输入数据中微小、常常难以察觉的变化敏感，这种变化被称为分布变化。这些变化在实际场景中很常见，但在评估过程中很少被考虑，导致性能指标被夸大。为了解决这一问题，我们提出了一种新的方法论，用于验证、评估和风险评估深度学习系统。我们的方法明确地在运行时建模分布变化的发生，通过从超出分布检测器的输出估计它们的概率。我们将这些估计值与网络正确性的条件概率相结合，将它们结构化在一个二叉树中。通过遍历这棵树，我们可以计算出网络准确性的可信和精确估计。我们在五个不同的数据集上评估了我们的方法，模拟了不同频率的分布变化的部署条件。我们的方法一贯优于传统评估方法，准确性估计误差通常在0.01到0.1之间。我们进一步展示了我们的方法在医学分割基准测试中的潜力，我们在这里将我们的方法应用于风险评估，通过将成本与树节点关联，为成本效益分析和价值判断提供信息。最终，我们的方法提供了一个强大的框架，用于提高深度学习系统的可靠性和可信度，特别是在安全关键应用中，通过提供更准确的性能估计和可操作的风险评估。

更新时间: 2025-09-23 16:16:02

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19419v1

Integrating Belief Domains into Probabilistic Logic Programs

Probabilistic Logic Programming (PLP) under the Distribution Semantics is a leading approach to practical reasoning under uncertainty. An advantage of the Distribution Semantics is its suitability for implementation as a Prolog or Python library, available through two well-maintained implementations, namely ProbLog and cplint/PITA. However, current formulations of the Distribution Semantics use point-probabilities, making it difficult to express epistemic uncertainty, such as arises from, for example, hierarchical classifications from computer vision models. Belief functions generalize probability measures as non-additive capacities, and address epistemic uncertainty via interval probabilities. This paper introduces interval-based Capacity Logic Programs based on an extension of the Distribution Semantics to include belief functions, and describes properties of the new framework that make it amenable to practical applications.

Updated: 2025-09-23 16:15:07

标题: 将信念域集成到概率逻辑程序中

摘要: 概率逻辑编程（PLP）在分布语义下是处理不确定性下实际推理的主要方法。分布语义的优势在于其适用于作为Prolog或Python库实现，可通过两个维护良好的实现，即ProbLog和cplint/PITA，来使用。然而，目前的分布语义公式使用点概率，使得难以表达认知不确定性，例如来自计算机视觉模型的层次分类。信念函数将概率度量泛化为非可加容量，通过区间概率解决认知不确定性。本文介绍了基于区间的容量逻辑程序，该程序基于将信念函数包含在分布语义中的扩展，并描述了新框架的特性，使其适用于实际应用。

更新时间: 2025-09-23 16:15:07

领域: cs.LO,cs.AI

下载: http://arxiv.org/abs/2507.17291v2

A Validation Strategy for Deep Learning Models: Evaluating and Enhancing Robustness

Data-driven models, especially deep learning classifiers often demonstrate great success on clean datasets. Yet, they remain vulnerable to common data distortions such as adversarial and common corruption perturbations. These perturbations can significantly degrade performance, thereby challenging the overall reliability of the models. Traditional robustness validation typically relies on perturbed test datasets to assess and improve model performance. In our framework, however, we propose a validation approach that extracts "weak robust" samples directly from the training dataset via local robustness analysis. These samples, being the most susceptible to perturbations, serve as an early and sensitive indicator of the model's vulnerabilities. By evaluating models on these challenging training instances, we gain a more nuanced understanding of its robustness, which informs targeted performance enhancement. We demonstrate the effectiveness of our approach on models trained with CIFAR-10, CIFAR-100, and ImageNet, highlighting how robustness validation guided by weak robust samples can drive meaningful improvements in model reliability under adversarial and common corruption scenarios.

Updated: 2025-09-23 16:14:14

标题: 深度学习模型的验证策略：评估和增强稳健性

摘要: 数据驱动模型，特别是深度学习分类器经常在清洁数据集上取得巨大成功。然而，它们仍然容易受到常见数据扭曲的影响，例如对抗性和常见损坏扰动。这些扰动可以显著降低性能，从而挑战模型的整体可靠性。传统的鲁棒性验证通常依赖于扰动的测试数据集来评估和改进模型的性能。然而，在我们的框架中，我们提出了一种通过局部鲁棒性分析直接从训练数据集中提取“弱鲁棒”样本的验证方法。这些样本最容易受到扰动的影响，作为模型脆弱性的早期和敏感指标。通过在这些具有挑战性的训练实例上评估模型，我们获得了对其鲁棒性更细致的理解，从而指导有针对性的性能提升。我们在使用CIFAR-10、CIFAR-100和ImageNet训练的模型上展示了我们方法的有效性，突出了通过弱鲁棒样本引导的鲁棒性验证如何在对抗性和常见损坏情景下推动模型可靠性的有意义改进。

更新时间: 2025-09-23 16:14:14

领域: cs.LG

下载: http://arxiv.org/abs/2509.19197v1

Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion

State-of-The-Art (SoTA) image captioning models are often trained on the MicroSoft Common Objects in Context (MS-COCO) dataset, which contains human-annotated captions with an average length of approximately ten tokens. Although effective for general scene understanding, these short captions often fail to capture complex scenes and convey detailed information. Moreover, captioning models tend to exhibit bias towards the ``average'' caption, which captures only the more general aspects, thus overlooking finer details. In this paper, we present a novel approach to generate richer and more informative image captions by combining the captions generated from different SoTA captioning models. Our proposed method requires no additional model training: given an image, it leverages pre-trained models from the literature to generate the initial captions, and then ranks them using a newly introduced image-text-based metric, which we name BLIPScore. Subsequently, the top two captions are fused using a Large Language Model (LLM) to produce the final, more detailed description. Experimental results on the MS-COCO and Flickr30k test sets demonstrate the effectiveness of our approach in terms of caption-image alignment and hallucination reduction according to the ALOHa, CAPTURE, and Polos metrics. A subjective study lends additional support to these results, suggesting that the captions produced by our model are generally perceived as more consistent with human judgment. By combining the strengths of diverse SoTA models, our method enhances the quality and appeal of image captions, bridging the gap between automated systems and the rich and informative nature of human-generated descriptions. This advance enables the generation of more suitable captions for the training of both vision-language and captioning models.

Updated: 2025-09-23 16:12:45

标题: 通过排名和基于LLM的融合改进图像字幕的描述性

摘要: 最先进的图像标题模型通常在微软通用对象上下文（MS-COCO）数据集上进行训练，该数据集包含人工标注的标题，平均长度约为十个标记。尽管对于一般场景理解有效，但这些简短的标题往往无法捕捉复杂的场景并传达详细信息。此外，标题模型往往表现出对“平均”标题的偏见，只捕捉更一般的方面，因此忽视了更详细的细节。在本文中，我们提出了一种新方法，通过结合从不同SoTA标题模型生成的标题来生成更丰富和更具信息量的图像标题。我们提出的方法无需额外的模型训练：给定一幅图像，它利用文献中的预训练模型生成初始标题，然后使用一种新引入的基于图像文本的度量（我们称之为BLIPScore）对其进行排名。随后，使用大型语言模型（LLM）融合前两个标题，生成更详细的最终描述。在MS-COCO和Flickr30k测试集上的实验结果展示了我们方法在标题-图像对齐和幻觉减少方面的有效性，根据ALOHa、CAPTURE和Polos度量标准。主观研究进一步支持这些结果，表明我们模型生成的标题通常被认为与人类判断更一致。通过结合各种SoTA模型的优势，我们的方法增强了图像标题的质量和吸引力，弥合了自动化系统和人类生成描述的丰富和信息性之间的差距。这一进步使得能够生成更适合用于视觉语言和标题模型训练的标题。

更新时间: 2025-09-23 16:12:45

领域: cs.CV,cs.AI,cs.CL,cs.DB,cs.LG

下载: http://arxiv.org/abs/2306.11593v2

Fine-Tuning is Subgraph Search: A New Lens on Learning Dynamics

The study of mechanistic interpretability aims to reverse-engineer a model to explain its behaviors. While recent studies have focused on the static mechanism of a certain behavior, the learning dynamics inside a model remain to be explored. In this work, we develop a fine-tuning method for analyzing the mechanism behind learning. Inspired by the concept of intrinsic dimension, we view a model as a computational graph with redundancy for a specific task, and treat the fine-tuning process as a search for and optimization of a subgraph within this graph. Based on this hypothesis, we propose circuit-tuning, an algorithm that iteratively builds the subgraph for a specific task and updates the relevant parameters in a heuristic way. We first validate our hypothesis through a carefully designed experiment and provide a detailed analysis of the learning dynamics during fine-tuning. Subsequently, we conduct experiments on more complex tasks, demonstrating that circuit-tuning could strike a balance between the performance on the target task and the general capabilities. Our work offers a new analytical method for the dynamics of fine-tuning, provides new findings on the mechanisms behind the training process, and inspires the design of superior algorithms for the training of neural networks.

Updated: 2025-09-23 16:07:40

标题: 微调是子图搜索：学习动态的新视角

摘要: 机制可解释性研究旨在逆向工程一个模型来解释其行为。尽管最近的研究集中在某种行为的静态机制上，但模型内部的学习动态仍有待探索。在这项工作中，我们开发了一种用于分析学习背后机制的微调方法。受内在维度概念的启发，我们将模型视为一个具有特定任务冗余的计算图，并将微调过程视为在该图中搜索和优化子图的过程。基于这一假设，我们提出了电路调谐算法，该算法迭代地构建特定任务的子图，并以一种启发式方式更新相关参数。我们首先通过精心设计的实验验证了我们的假设，并对微调过程中的学习动态进行了详细分析。随后，我们在更复杂的任务上进行实验，展示了电路调谐可以在目标任务的性能和通用能力之间取得平衡。我们的工作为微调动态提供了一种新的分析方法，提供了关于训练过程背后机制的新发现，并激发了设计更优算法来训练神经网络的灵感。

更新时间: 2025-09-23 16:07:40

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.06106v3

Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws

Scaling laws have played a cornerstone role in guiding the training of large language models (LLMs). However, most existing works on scaling laws primarily focus on the final-step loss, overlooking the loss dynamics during the training process and, crucially, the impact of learning rate schedule (LRS). In this paper, we aim to bridge this gap by studying a teacher-student kernel regression setup trained via online stochastic gradient descent (SGD). Leveraging a novel intrinsic time viewpoint and stochastic differential equation (SDE) modeling of SGD, we introduce the Functional Scaling Law (FSL), which characterizes the evolution of population risk during the training process for general LRSs. Remarkably, the impact of the LRSs is captured through an explicit convolution-type functional term, making their effects fully tractable. To illustrate the utility of FSL, we analyze three widely used LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- under both data-limited and compute-limited regimes. We provide theoretical justification for widely adopted empirical practices in LLMs pre-training such as (i) higher-capacity models are more data- and compute-efficient; (ii) learning rate decay can improve training efficiency; (iii) WSD-like schedules can outperform direct-decay schedules. Lastly, we explore the practical relevance of FSL as a surrogate model for fitting, predicting and optimizing the loss curves in LLM pre-training, with experiments conducted across model sizes ranging from 0.1B to 1B parameters. We hope our FSL framework can deepen the understanding of LLM pre-training dynamics and provide insights for improving large-scale model training.

Updated: 2025-09-23 16:05:16

标题: 揭示学习率调度在功能缩放定律中的作用

摘要: 比例定律在指导大型语言模型（LLMs）训练中发挥了基石作用。然而，大多数现有关于比例定律的研究主要集中在最终损失上，忽视了训练过程中的损失动态以及学习率调度（LRS）的影响。在本文中，我们旨在通过研究一个通过在线随机梯度下降（SGD）训练的教师-学生核回归设置来弥补这一差距。利用一种新颖的内在时间观点和SGD的随机微分方程（SDE）建模，我们引入了功能比例定律（FSL），该定律描述了在通用LRSs的训练过程中种群风险的演变。值得注意的是，学习率调度的影响通过一个显式的卷积型功能项来捕捉，使其效果完全可追踪。为了说明FSL的实用性，我们分析了三种广泛使用的LRSs——常数、指数衰减和预热稳定衰减（WSD）——在数据受限和计算受限的情况下。我们为LLMs预训练中广泛采用的经验做法提供了理论上的理由，例如（i）容量更大的模型更加数据和计算高效；（ii）学习率衰减可以提高训练效率；（iii）类似于WSD的调度可以优于直接衰减调度。最后，我们探讨了FSL作为拟合、预测和优化LLM预训练中损失曲线的替代模型的实际相关性，实验涵盖了从0.1B到1B参数的模型规模。我们希望我们的FSL框架可以加深对LLM预训练动态的理解，并为改进大规模模型训练提供见解。

更新时间: 2025-09-23 16:05:16

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2509.19189v1

Topological Feature Compression for Molecular Graph Neural Networks

Recent advances in molecular representation learning have produced highly effective encodings of molecules for numerous cheminformatics and bioinformatics tasks. However, extracting general chemical insight while balancing predictive accuracy, interpretability, and computational efficiency remains a major challenge. In this work, we introduce a novel Graph Neural Network (GNN) architecture that combines compressed higher-order topological signals with standard molecular features. Our approach captures global geometric information while preserving computational tractability and human-interpretable structure. We evaluate our model across a range of benchmarks, from small-molecule datasets to complex material datasets, and demonstrate superior performance using a parameter-efficient architecture. We achieve the best performing results in both accuracy and robustness across almost all benchmarks. We open source all code \footnote{All code and results can be found on Github https://github.com/rahulkhorana/TFC-PACT-Net}.

Updated: 2025-09-23 15:58:50

标题: 分子图神经网络的拓扑特征压缩

摘要: 近期在分子表征学习方面取得的进展已经产生了高效的分子编码，用于许多化学信息学和生物信息学任务。然而，在保持预测准确性、可解释性和计算效率的同时提取一般化学见解仍然是一个重大挑战。在这项工作中，我们介绍了一种新颖的图神经网络（GNN）架构，将压缩的高阶拓扑信号与标准分子特征相结合。我们的方法捕获了全局几何信息，同时保持了计算可处理性和人类可解释的结构。我们在一系列基准测试中评估了我们的模型，从小分子数据集到复杂材料数据集，并展示了使用高效参数架构的卓越性能。我们在几乎所有基准测试中都取得了最佳的准确性和鲁棒性结果。我们开源了所有的代码和结果，可以在Github上找到（https://github.com/rahulkhorana/TFC-PACT-Net）。

更新时间: 2025-09-23 15:58:50

领域: cs.LG

下载: http://arxiv.org/abs/2508.07807v2

YAC: Bridging Natural Language and Interactive Visual Exploration with Generative AI for Biomedical Data Discovery

Incorporating natural language input has the potential to improve the capabilities of biomedical data discovery interfaces. However, user interface elements and visualizations are still powerful tools for interacting with data, even in the new world of generative AI. In our prototype system, YAC, Yet Another Chatbot, we bridge the gap between natural language and interactive visualizations by generating structured declarative output with a multi-agent system and interpreting that output to render linked interactive visualizations and apply data filters. Furthermore, we include widgets, which allow users to adjust the values of that structured output through user interface elements. We reflect on the capabilities and design of this system with an analysis of its technical dimensions and illustrate the capabilities through four usage scenarios.

Updated: 2025-09-23 15:57:42

标题: YAC：利用生成人工智能桥接自然语言和交互式可视化探索，用于生物医学数据发现

摘要: 将自然语言输入纳入生物医学数据发现界面具有提高能力的潜力。然而，用户界面元素和可视化仍然是与数据交互的强大工具，即使在生成式人工智能的新世界中。在我们的原型系统YAC（Yet Another Chatbot）中，我们通过使用多代理系统生成结构化的声明性输出，将自然语言和交互式可视化之间的鸿沟连接起来，并解释该输出以呈现链接的交互式可视化并应用数据过滤器。此外，我们还包括小部件，允许用户通过用户界面元素调整该结构化输出的值。通过对其技术维度的分析，我们反思了该系统的功能和设计，并通过四个使用场景说明了其功能。

更新时间: 2025-09-23 15:57:42

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2509.19182v1

SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System

We introduce SupertonicTTS, a novel text-to-speech (TTS) system designed for efficient and streamlined speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. The TTS pipeline is further simplified by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we propose context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment with minimal memory and I/O overhead. Experimental results demonstrate that SupertonicTTS delivers performance comparable to contemporary zero-shot TTS models with only 44M parameters, while significantly reducing architectural complexity and computational cost. Audio samples are available at: https://supertonictts.github.io/.

Updated: 2025-09-23 15:55:51

标题: SupertonicTTS：朝着高效和简化的文本转语音系统方向

摘要: 我们介绍了SupertonicTTS，这是一种新型的文本到语音（TTS）系统，旨在实现高效和简化的语音合成。SupertonicTTS包括三个组件：用于连续潜在表示的语音自编码器，利用流匹配进行文本到潜在映射的文本到潜在模块，以及用于预测话语级持续时间的预测器。为了实现轻量级架构，我们采用低维潜在空间，潜在值的时间压缩以及ConvNeXt块。TTS流程进一步简化，通过直接在原始字符级文本上操作，并利用交叉注意力进行文本-语音对齐，从而消除了对字素到音素（G2P）模块和外部对齐器的需要。此外，我们提出了上下文共享批扩展方法，加速损失收敛，同时以最小的内存和I/O开销稳定文本-语音对齐。实验结果表明，SupertonicTTS仅使用44M参数即可提供与当代零样本TTS模型相当的性能，同时显著降低了架构复杂性和计算成本。音频样本可在以下网址找到：https://supertonictts.github.io/。

更新时间: 2025-09-23 15:55:51

领域: eess.AS,cs.LG,cs.SD

下载: http://arxiv.org/abs/2503.23108v3

Hierarchical Evaluation Function: A Multi-Metric Approach for Optimizing Demand Forecasting Models

Accurate demand forecasting is crucial for effective inventory management in dynamic and competitive environments, where decisions are influenced by uncertainty, financial constraints, and logistical limitations. Traditional evaluation metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) provide complementary perspectives but may lead to biased assessments when applied individually. To address this limitation, we propose the Hierarchical Evaluation Function (HEF), a composite function that integrates R2, MAE, and RMSE within a hierarchical and adaptive framework. The function incorporates dynamic weights, tolerance thresholds derived from the statistical properties of the series, and progressive penalty mechanisms to ensure robustness against extreme errors and invalid predictions. HEF was implemented to optimize multiple forecasting models using Grid Search, Particle Swarm Optimization (PSO), and Optuna, and tested on benchmark datasets including Walmart, M3, M4, and M5. Experimental results, validated through statistical tests, demonstrate that HEF consistently outperforms MAE as an evaluation function in global metrics such as R2, Global Relative Accuracy (GRA), RMSE, and RMSSE, thereby providing greater explanatory power, adaptability, and stability. While MAE retains advantages in simplicity and efficiency, HEF proves more effective for long-term planning and complex contexts. Overall, HEF constitutes a robust and adaptive alternative for model selection and hyperparameter optimization in highly variable demand forecasting environments.

Updated: 2025-09-23 15:43:58

标题: 分层评估函数：优化需求预测模型的多度量方法

摘要: 准确的需求预测对于在动态和竞争激烈的环境中进行有效库存管理至关重要，决策受到不确定性、财务限制和物流限制的影响。传统的评估指标，如平均绝对误差（MAE）和均方根误差（RMSE），提供了互补的视角，但在单独应用时可能导致偏倚的评估。为了解决这一限制，我们提出了分层评估函数（HEF），这是一个综合函数，将R2、MAE和RMSE集成在一个分层和自适应的框架内。该函数包括动态权重、从时间序列的统计特性中导出的容差阈值和渐进性的惩罚机制，以确保对极端错误和无效预测的稳健性。HEF已被实施以优化多个预测模型，使用网格搜索、粒子群优化（PSO）和Optuna，并在包括Walmart、M3、M4和M5在内的基准数据集上进行了测试。通过统计测试验证的实验结果表明，HEF在全局指标（如R2、全局相对精度（GRA）、RMSE和RMSSE）中始终优于MAE作为评估函数，从而提供更大的解释力、适应性和稳定性。虽然MAE在简单性和效率方面具有优势，但HEF在长期规划和复杂环境中表现更为有效。总体而言，HEF构成了高度变化的需求预测环境中模型选择和超参数优化的稳健和自适应替代方案。

更新时间: 2025-09-23 15:43:58

领域: cs.LG,cs.AI,cs.PF,62M10, 90C59, 68T05,I.2.6; I.5.1; I.5.2; I.5.4; G.1.6

下载: http://arxiv.org/abs/2508.13057v4

Soft Tokens, Hard Truths

The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs match discrete-token CoTs for pass@1 and surpass them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.

Updated: 2025-09-23 15:43:47

标题: 软令牌，坚实事实

摘要: 最近，人们开始关注在推理LLMs的思维链（CoT）阶段使用连续而不是离散令牌，这是基于连续混合的离散令牌可以模拟多个推理路径的叠加的直觉。理论结果已正式证明连续令牌具有更大的表达能力，并且可以更高效地解决特定问题。然而，连续令牌的实际使用受到强烈的训练困难的限制：先前的工作要么仅在预先训练的离散令牌模型上在推理时使用连续令牌，要么必须从地面真实的离散CoTs中提炼连续CoT，并面临限制CoT为极少令牌的计算成本。这是首次介绍了一种可扩展的方法，通过强化学习（RL）学习连续CoTs，而无需从参考离散CoTs中提炼。我们使用“软”令牌：将令牌混合在一起，并在输入嵌入上加入噪声以提供RL探索。计算开销极小，使我们能够学习数百个令牌的连续CoTs。在具有Llama和Qwen模型的数学推理基准测试中，训练使用连续CoTs与离散令牌CoTs相匹配，对于pass@1超越了它们，显示出更大的CoT多样性。在系统比较中，表现最佳的场景是使用连续CoT令牌进行训练，然后在推理时使用离散令牌，这意味着“软”模型可以以标准方式部署。最后，我们展示了连续CoT RL训练更好地保留了基础模型对域外任务的预测，从而为基础模型提供更柔和的触感。

更新时间: 2025-09-23 15:43:47

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19170v1

RoSe: Robust Self-supervised Stereo Matching under Adverse Weather Conditions

Recent self-supervised stereo matching methods have made significant progress, but their performance significantly degrades under adverse weather conditions such as night, rain, and fog. We identify two primary weaknesses contributing to this performance degradation. First, adverse weather introduces noise and reduces visibility, making CNN-based feature extractors struggle with degraded regions like reflective and textureless areas. Second, these degraded regions can disrupt accurate pixel correspondences, leading to ineffective supervision based on the photometric consistency assumption. To address these challenges, we propose injecting robust priors derived from the visual foundation model into the CNN-based feature extractor to improve feature representation under adverse weather conditions. We then introduce scene correspondence priors to construct robust supervisory signals rather than relying solely on the photometric consistency assumption. Specifically, we create synthetic stereo datasets with realistic weather degradations. These datasets feature clear and adverse image pairs that maintain the same semantic context and disparity, preserving the scene correspondence property. With this knowledge, we propose a robust self-supervised training paradigm, consisting of two key steps: robust self-supervised scene correspondence learning and adverse weather distillation. Both steps aim to align underlying scene results from clean and adverse image pairs, thus improving model disparity estimation under adverse weather effects. Extensive experiments demonstrate the effectiveness and versatility of our proposed solution, which outperforms existing state-of-the-art self-supervised methods. Codes are available at \textcolor{blue}{https://github.com/cocowy1/RoSe-Robust-Self-supervised-Stereo-Matching-under-Adverse-Weather-Conditions}.

Updated: 2025-09-23 15:41:40

标题: RoSe: 在恶劣天气条件下的稳健自监督立体匹配

摘要: 最近的自监督立体匹配方法取得了显著进展，但在夜晚、雨天和雾天等恶劣天气条件下，它们的性能显著下降。我们确定了导致这种性能下降的两个主要弱点。首先，恶劣天气引入噪音并降低能见度，使基于CNN的特征提取器在反射和无纹理区域等受损区域中难以处理。其次，这些受损区域可能破坏准确的像素对应关系，导致基于光度一致性假设的监督无效。为了解决这些挑战，我们提出将从视觉基础模型中导出的稳健先验注入到基于CNN的特征提取器中，以改善恶劣天气条件下的特征表示。然后，我们引入场景对应先验来构建稳健的监督信号，而不是仅依赖于光度一致性假设。具体地，我们创建了具有逼真天气恶化的合成立体数据集。这些数据集包含清晰和恶劣的图像对，保持相同的语义内容和视差，保留了场景对应属性。有了这些知识，我们提出了一个稳健的自监督训练范式，包括两个关键步骤：稳健自监督场景对应学习和恶劣天气蒸馏。这两个步骤旨在使干净和恶劣图像对的基础场景结果对齐，从而改善模型在恶劣天气影响下的视差估计。大量实验证明了我们提出的解决方案的有效性和多功能性，优于现有的最先进的自监督方法。代码可在\textcolor{blue}{https://github.com/cocowy1/RoSe-Robust-Self-supervised-Stereo-Matching-under-Adverse-Weather-Conditions}找到。

更新时间: 2025-09-23 15:41:40

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.19165v1

Your Turn: At Home Turning Angle Estimation for Parkinson's Disease Severity Assessment

People with Parkinson's Disease (PD) often experience progressively worsening gait, including changes in how they turn around, as the disease progresses. Existing clinical rating tools are not capable of capturing hour-by-hour variations of PD symptoms, as they are confined to brief assessments within clinic settings. Measuring gait turning angles continuously and passively is a component step towards using gait characteristics as sensitive indicators of disease progression in PD. This paper presents a deep learning-based approach to automatically quantify turning angles by extracting 3D skeletons from videos and calculating the rotation of hip and knee joints. We utilise state-of-the-art human pose estimation models, Fastpose and Strided Transformer, on a total of 1386 turning video clips from 24 subjects (12 people with PD and 12 healthy control volunteers), trimmed from a PD dataset of unscripted free-living videos in a home-like setting (Turn-REMAP). We also curate a turning video dataset, Turn-H3.6M, from the public Human3.6M human pose benchmark with 3D ground truth, to further validate our method. Previous gait research has primarily taken place in clinics or laboratories evaluating scripted gait outcomes, but this work focuses on free-living home settings where complexities exist, such as baggy clothing and poor lighting. Due to difficulties in obtaining accurate ground truth data in a free-living setting, we quantise the angle into the nearest bin $45^\circ$ based on the manual labelling of expert clinicians. Our method achieves a turning calculation accuracy of 41.6%, a Mean Absolute Error (MAE) of 34.7{\deg}, and a weighted precision WPrec of 68.3% for Turn-REMAP. This is the first work to explore the use of single monocular camera data to quantify turns by PD patients in a home setting.

Updated: 2025-09-23 15:41:29

标题: 您的回合：在家中进行的转角估计用于帕金森病严重程度评估

摘要: 帕金森病（PD）患者往往在步态方面经历逐渐恶化，包括随疾病进展而发生的转身方式变化。现有的临床评估工具无法捕捉PD症状的时时变化，因为它们局限于在临床环境中进行简短评估。持续 passively 测量步态转身角度是将步态特征作为PD疾病进展敏感指标的一步。本文提出了一种基于深度学习的方法，通过从视频中提取3D骨架并计算髋部和膝关节的旋转来自动量化转身角度。我们利用最先进的人体姿势估计模型Fastpose和Strided Transformer，在24名受试者（12名PD患者和12名健康对照志愿者）的1386个转身视频剪辑中进行操作，这些视频从一个PD自由生活视频数据集（Turn-REMAP）中剪辑而来。我们还从公共Human3.6M人体姿势基准中提取了一个转身视频数据集Turn-H3.6M，其中包含3D真实数据，以进一步验证我们的方法。先前的步态研究主要在临床或实验室中进行，评估脚本化的步态结果，但本文重点放在了自由生活的家庭环境中，其中存在复杂性，如宽松的服装和光线不足。由于在自由生活环境中获取准确的真实数据存在困难，我们根据专家临床医生的手动标记将角度量化为最接近的45度。我们的方法在Turn-REMAP中实现了41.6％的转身计算精度，34.7°的平均绝对误差（MAE）和68.3％的加权精确度WPrec。这是首个探索使用单眼摄像头数据在家庭环境中量化PD患者转身的工作。

更新时间: 2025-09-23 15:41:29

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2408.08182v4

Circuit Complexity From Physical Constraints: Scaling Limitations of Attention

We argue that the standard circuit complexity measures derived from $NC, AC, TC$ provide limited practical information and are now insufficient to further differentiate model expressivity. To address these new limitations, we define a novel notion of local uniformity and a family of circuit complexity classes $RC(\cdot)$ that capture the fundamental constraints of scaling physical circuits. Through the lens of $RC(\cdot)$, we show that attention mechanisms with $\omega(n^{3/2})$ runtime cannot scale to accommodate the entropy of increasingly complex datasets. Our results simultaneously provide a methodology for defining meaningful bounds on transformer expressivity and naturally expose the restricted viability of attention.

Updated: 2025-09-23 15:40:36

标题: 物理约束下的电路复杂性：注意力的扩展限制

摘要: 我们认为从$NC, AC, TC$导出的标准电路复杂度衡量标准提供了有限的实际信息，现在已经不足以进一步区分模型表达能力。为了解决这些新的限制，我们定义了一种新颖的局部均匀性概念和一组电路复杂度类$RC(\cdot)$，捕捉了物理电路的基本约束。通过$RC(\cdot)$的视角，我们展示了具有$\omega(n^{3/2})$运行时间的注意力机制无法扩展以适应日益复杂数据集的熵。我们的结果同时提供了一个定义变压器表达能力有意义边界的方法，并自然地揭示了注意力的受限可行性。

更新时间: 2025-09-23 15:40:36

领域: cs.CC,cs.LG,F.1.3

下载: http://arxiv.org/abs/2509.19161v1

CayleyPy Growth: Efficient growth computations and hundreds of new conjectures on Cayley graphs (Brief version)

This is the third paper of the CayleyPy project applying artificial intelligence to problems in group theory. We announce the first public release of CayleyPy, an open source Python library for computations with Cayley and Schreier graphs. Compared with systems such as GAP and Sage, CayleyPy handles much larger graphs and performs several orders of magnitude faster. Using CayleyPy we obtained about 200 new conjectures on Cayley and Schreier graphs, focused on diameters and growth. For many Cayley graphs of symmetric groups Sn we observe quasi polynomial diameter formulas: a small set of quadratic or linear polynomials indexed by n mod s. We conjecture that this is a general phenomenon, giving efficient diameter computation despite the problem being NP hard. We propose a refinement of the Babai type conjecture on diameters of Sn: n^2/2 + 4n upper bounds in the undirected case, compared to previous O(n^2) bounds. We also provide explicit generator families, related to involutions in a square with whiskers pattern, conjectured to maximize the diameter; search confirms this for all n up to 15. We further conjecture an answer to a question posed by V M Glushkov in 1968 on directed Cayley graphs generated by a cyclic shift and a transposition. For nilpotent groups we conjecture an improvement of J S Ellenberg's results on upper unitriangular matrices over Z/pZ, showing linear dependence of diameter on p. Moreover. Some conjectures are LLM friendly, naturally stated as sorting problems verifiable by algorithms or Python code. To benchmark path finding we created more than 10 Kaggle datasets. CayleyPy works with arbitrary permutation or matrix groups and includes over 100 predefined generators. Our growth computation code outperforms GAP and Sage up to 1000 times in speed and size.

Updated: 2025-09-23 15:40:36

标题: CayleyPy增长：Cayley图上高效增长计算和数百个新猜想（简明版）

摘要: 这是CayleyPy项目的第三篇论文，将人工智能应用于群论问题。我们宣布了CayleyPy的首次公开发布，这是一个用于计算Cayley和Schreier图的开源Python库。与像GAP和Sage这样的系统相比，CayleyPy处理更大的图形，并且执行速度快几个数量级。使用CayleyPy，我们在Cayley和Schreier图上得到了约200个新的猜想，主要集中在直径和增长方面。对于许多对称群Sn的Cayley图，我们观察到准多项式直径公式：由n mod s索引的一小组二次或线性多项式。我们猜想这是一个普遍现象，尽管问题是NP难的，但可以有效地计算直径。我们对Sn的直径提出了一个关于Babai类型猜想的改进：在无向情况下，n^2/2 + 4n是上界，而不是之前的O(n^2)上界。我们还提供了明确的生成器系列，与在具有鲸须图案的正方形中的逆相关，猜想最大化直径；搜索证实了这一点，对所有n直到15。我们进一步推测了关于由循环移位和换位生成的有向Cayley图的一个1968年V M Glushkov提出的问题的答案。对于幂零群，我们猜想对J S Ellenberg关于Z/pZ上的上三角矩阵的结果进行改进，显示直径对p的线性依赖。此外。一些猜想对LLM友好，自然陈述为可由算法或Python代码验证的排序问题。为了对路径查找进行基准测试，我们创建了超过10个Kaggle数据集。CayleyPy可以处理任意排列或矩阵群，并包含超过100个预定义的生成器。我们的增长计算代码在速度和大小上比GAP和Sage快1000倍。

更新时间: 2025-09-23 15:40:36

领域: math.CO,cs.LG,math.GR

下载: http://arxiv.org/abs/2509.19162v1

Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery

Bold claims about AI's role in science-from "AGI will cure all diseases" to promises of radically accelerated discovery-raise a central epistemic question: do large language models (LLMs) truly generate new knowledge, or do they merely remix memorized fragments? We propose unlearning-as-ablation as a falsifiable probe of constructive scientific discovery. The idea is to systematically remove a target result together with its forget-closure (supporting lemmas, paraphrases, and multi-hop entailments) and then evaluate whether the model can re-derive the result from only permitted axioms and tools. Success would indicate generative capability beyond recall; failure would expose current limits. Unlike prevailing motivations for unlearning-privacy, copyright, or safety-our framing repositions it as an epistemic probe for AI-for-Science. We outline a minimal pilot in mathematics and algorithms to illustrate feasibility, and sketch how the same approach could later be extended to domains such as physics or chemistry. This is a position paper: our contribution is conceptual and methodological, not empirical. We aim to stimulate discussion on how principled ablation tests could help distinguish models that reconstruct knowledge from those that merely retrieve it, and how such probes might guide the next generation of AI-for-Science benchmarks.

Updated: 2025-09-23 15:40:35

标题: 遗忘作为切除：朝向可验证的生成科学发现基准

摘要: 对于人工智能在科学中的角色的大胆声明-从“通用人工智能将治愈所有疾病”到加速发现的承诺-提出了一个核心认识论问题：大型语言模型(LLM)是否真正生成新知识，还是仅仅重新组合记忆片段？我们提出了作为构造性科学发现可验证探针的“遗忘作为消融”概念。其思想是系统地移除目标结果以及其遗忘闭包（支持引理，释义和多跳蕴涵），然后评估模型是否仅从允许的公理和工具中重新推导结果。成功将表明具有超越回忆的生成能力；失败将暴露当前的限制。与当前的遗忘动机-隐私，版权或安全不同-我们的框架将其重新定位为AI-for-Science的认识论探针。我们概述了数学和算法中的最小试点，以说明可行性，并勾勒了相同方法如何稍后可以扩展到物理或化学等领域。这是一篇立场论文：我们的贡献是概念性和方法论的，而不是经验性的。我们的目标是激发讨论，探讨原则性的消融测试如何帮助区分重建知识的模型和仅仅检索知识的模型，以及这类探针如何指导下一代AI-for-Science基准的发展。

更新时间: 2025-09-23 15:40:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.17681v3

Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping on the importance-sampling (IS) weight. We study RL methods with sequence-level IS and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs.\ long responses, distorting the optimization direction. FSPO introduces a simple remedy: we clip the sequence log-IS ratio with a band that scales as $\sqrt{L}$. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a cosine directional guarantee between the clipped and true updates. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets on Qwen3-8B-Base model.

Updated: 2025-09-23 15:39:49

标题: 公平地剪辑您的序列：对序列级RL强制执行长度公平性

摘要: 我们提出了FSPO（公平序列策略优化），这是一种用于LLM的序列级强化学习方法，它在重要性抽样（IS）权重上执行长度公平剪切。我们研究了具有序列级IS的RL方法，并确定了当PPO/GRPO风格的剪切被移植到序列时出现的不匹配问题：固定的剪切范围会系统地重新权衡短vs. 长回应，扭曲了优化方向。FSPO引入了一个简单的解决方案：我们将序列log-IS比率剪切为一个随着$L$的平方根缩放的带。从理论上讲，我们通过长度重加权误差（LRE）形式化长度公平，并证明小的LRE会产生剪切和真实更新之间的余弦方向保证。在经验上，FSPO使剪切率在长度区间内变得平坦，稳定了训练，并在Qwen3-8B-Base模型上的多个评估数据集上表现优于所有基线。

更新时间: 2025-09-23 15:39:49

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.09177v2

Dynamical Low-Rank Compression of Neural Networks with Robustness under Adversarial Attacks

Deployment of neural networks on resource-constrained devices demands models that are both compact and robust to adversarial inputs. However, compression and adversarial robustness often conflict. In this work, we introduce a dynamical low-rank training scheme enhanced with a novel spectral regularizer that controls the condition number of the low-rank core in each layer. This approach mitigates the sensitivity of compressed models to adversarial perturbations without sacrificing accuracy on clean data. The method is model- and data-agnostic, computationally efficient, and supports rank adaptivity to automatically compress the network at hand. Extensive experiments across standard architectures, datasets, and adversarial attacks show the regularized networks can achieve over 94% compression while recovering or improving adversarial accuracy relative to uncompressed baselines.

Updated: 2025-09-23 15:39:30

标题: 神经网络的动态低秩压缩及对抗攻击下的稳健性

摘要: 在资源受限设备上部署神经网络需要既紧凑又能够抵御对抗性输入的模型。然而，压缩和对抗性鲁棒性通常存在冲突。在这项工作中，我们介绍了一种动态低秩训练方案，增强了一种新颖的谱正则化器，用于控制每一层中低秩核的条件数。这种方法减轻了压缩模型对对抗性扰动的敏感性，同时在干净数据上不损失准确性。该方法是模型和数据无关的，计算效率高，并支持秩的自适应性，可以自动压缩所需的网络。在标准架构、数据集和对抗性攻击的广泛实验中，经过正则化的网络可以实现超过94%的压缩，同时相对于未压缩的基线恢复或提高对抗性准确性。

更新时间: 2025-09-23 15:39:30

领域: cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2505.08022v3

One Subgoal at a Time: Zero-Shot Generalization to Arbitrary Linear Temporal Logic Requirements in Multi-Task Reinforcement Learning

Generalizing to complex and temporally extended task objectives and safety constraints remains a critical challenge in reinforcement learning (RL). Linear temporal logic (LTL) offers a unified formalism to specify such requirements, yet existing methods are limited in their abilities to handle nested long-horizon tasks and safety constraints, and cannot identify situations when a subgoal is not satisfiable and an alternative should be sought. In this paper, we introduce GenZ-LTL, a method that enables zero-shot generalization to arbitrary LTL specifications. GenZ-LTL leverages the structure of B\"uchi automata to decompose an LTL task specification into sequences of reach-avoid subgoals. Contrary to the current state-of-the-art method that conditions on subgoal sequences, we show that it is more effective to achieve zero-shot generalization by solving these reach-avoid problems \textit{one subgoal at a time} through proper safe RL formulations. In addition, we introduce a novel subgoal-induced observation reduction technique that can mitigate the exponential complexity of subgoal-state combinations under realistic assumptions. Empirical results show that GenZ-LTL substantially outperforms existing methods in zero-shot generalization to unseen LTL specifications.

Updated: 2025-09-23 15:38:52

标题: 一次一个子目标：多任务强化学习中对任意线性时间逻辑要求的零样本泛化

摘要: 将复杂且在时间上延伸的任务目标和安全约束推广到强化学习（RL）中仍然是一个关键挑战。线性时序逻辑（LTL）提供了一种统一的形式化方法来指定这些要求，然而现有方法在处理嵌套的长期任务和安全约束方面能力有限，并且不能识别子目标无法满足且需要寻找替代方案的情况。在本文中，我们介绍了GenZ-LTL，一种能够实现对任意LTL规范的零射击泛化的方法。GenZ-LTL利用B\"uchi自动机的结构将LTL任务规范分解成达到-避免子目标序列。与当前最先进的方法相反，该方法不是根据子目标序列来确定，而是通过适当的安全RL表述逐个解决这些达到-避免问题，从而更有效地实现零射击泛化。此外，我们介绍了一种新颖的子目标诱导的观察减少技术，可以在现实假设下缓解子目标状态组合的指数复杂性。实证结果表明，GenZ-LTL在对未见的LTL规范进行零射击泛化方面明显优于现有方法。

更新时间: 2025-09-23 15:38:52

领域: cs.AI

下载: http://arxiv.org/abs/2508.01561v5

Efficient Reinforcement Learning by Reducing Forgetting with Elephant Activation Functions

Catastrophic forgetting has remained a significant challenge for efficient reinforcement learning for decades (Ring 1994, Rivest and Precup 2003). While recent works have proposed effective methods to mitigate this issue, they mainly focus on the algorithmic side. Meanwhile, we do not fully understand what architectural properties of neural networks lead to catastrophic forgetting. This study aims to fill this gap by studying the role of activation functions in the training dynamics of neural networks and their impact on catastrophic forgetting in reinforcement learning setup. Our study reveals that, besides sparse representations, the gradient sparsity of activation functions also plays an important role in reducing forgetting. Based on this insight, we propose a new class of activation functions, elephant activation functions, that can generate both sparse outputs and sparse gradients. We show that by simply replacing classical activation functions with elephant activation functions in the neural networks of value-based algorithms, we can significantly improve the resilience of neural networks to catastrophic forgetting, thus making reinforcement learning more sample-efficient and memory-efficient.

Updated: 2025-09-23 15:38:01

标题: 通过使用大象激活函数降低遗忘实现高效强化学习

摘要: 灾难性遗忘在强化学习中已经是一个持续了数十年的重要挑战（Ring 1994，Rivest和Precup 2003）。尽管最近的研究已经提出了有效的方法来缓解这个问题，但它们主要集中在算法方面。与此同时，我们并不完全了解神经网络的架构特性是如何导致灾难性遗忘的。本研究旨在通过研究激活函数在神经网络训练动态中的作用以及它们对强化学习设置中灾难性遗忘的影响来填补这一空白。我们的研究揭示了，除了稀疏表示外，激活函数的梯度稀疏性也在减少遗忘方面发挥重要作用。基于这一洞察力，我们提出了一类新的激活函数，大象激活函数，它可以生成稀疏输出和稀疏梯度。我们展示了通过简单地将传统激活函数替换为大象激活函数，我们可以显著提高神经网络对灾难性遗忘的抵抗力，从而使强化学习更加样本高效和内存高效。

更新时间: 2025-09-23 15:38:01

领域: cs.LG

下载: http://arxiv.org/abs/2509.19159v1

LLMs as verification oracles for Solidity

Ensuring the correctness of smart contracts is critical, as even subtle flaws can lead to severe financial losses. While bug detection tools able to spot common vulnerability patterns can serve as a first line of defense, most real-world exploits and losses stem from errors in the contract business logic. Formal verification tools such as SolCMC and the Certora Prover address this challenge, but their impact remains limited by steep learning curves and restricted specification languages. Recent works have begun to explore the use of large language models (LLMs) for security-related tasks such as vulnerability detection and test generation. Yet, a fundamental question remains open: can LLMs serve as verification oracles, capable of reasoning about arbitrary contract-specific properties? In this paper, we provide the first systematic evaluation of GPT-5, a state-of-the-art reasoning LLM, in this role. We benchmark its performance on a large dataset of verification tasks, compare its outputs against those of established formal verification tools, and assess its practical effectiveness in real-world auditing scenarios. Our study combines quantitative metrics with qualitative analysis, and shows that recent reasoning-oriented LLMs can be surprisingly effective as verification oracles, suggesting a new frontier in the convergence of AI and formal methods for secure smart contract development and auditing.

Updated: 2025-09-23 15:32:13

标题: LLMs作为Solidity的验证神谕

摘要: 确保智能合约的正确性至关重要，因为即使微小的缺陷也可能导致严重的财务损失。虽然能够发现常见漏洞模式的错误检测工具可以作为第一道防线，但大多数真实世界的利用和损失源于合约业务逻辑中的错误。形式验证工具，如SolCMC和Certora Prover，可以应对这一挑战，但由于陡峭的学习曲线和受限的规范语言，它们的影响仍然有限。最近的研究开始探索利用大型语言模型（LLMs）进行安全相关任务，如漏洞检测和测试生成。然而，一个根本性问题仍然悬而未决：LLMs能否作为验证神谕，能够推理任意合约特定属性？在本文中，我们对GPT-5进行了首次系统评估，这是一种最先进的推理LLM，在这一角色中。我们在一个庞大的验证任务数据集上对其性能进行基准测试，将其输出与已建立的形式验证工具进行比较，并评估其在真实审核场景中的实际有效性。我们的研究将定量指标与定性分析相结合，表明最近的面向推理的LLMs可以出乎意料地作为验证神谕具有相当的有效性，这表明了AI和形式方法在安全智能合约开发和审计领域的融合的新前沿。

更新时间: 2025-09-23 15:32:13

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2509.19153v1

Socially Pertinent Robots in Gerontological Healthcare

Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilities will be useful and accepted in real-life facilities is yet to be answered. This paper is an attempt to partially answer this question, via two waves of experiments with patients and companions in a day-care gerontological facility in Paris with a full-sized humanoid robot endowed with social and conversational interaction capabilities. The software architecture, developed during the H2020 SPRING project, together with the experimental protocol, allowed us to evaluate the acceptability (AES) and usability (SUS) with more than 60 end-users. Overall, the users are receptive to this technology, especially when the robot perception and action skills are robust to environmental clutter and flexible to handle a plethora of different interactions.

Updated: 2025-09-23 15:28:51

标题: 社会相关的机器人在老年保健中的应用

摘要: 尽管在开发和部署社交机器人方面取得了许多最近的成就，但仍然有许多尚未开发的环境和应用需要通过最终用户对这些系统进行系统评估。虽然已经有几个机器人平台被用于老年保健领域，但一个具有多模式对话能力的社交互动机器人是否在现实生活设施中有用且被接受的问题尚未得到答复。本文尝试通过在巴黎一家全尺寸人形机器人具有社交和对话交互功能的日间护理老年设施进行两轮实验来部分回答这个问题。在H2020 SPRING项目期间开发的软件架构以及实验协议使我们能够评估超过60名最终用户的可接受性（AES）和可用性（SUS）。总体而言，用户对这项技术持开放态度，特别是当机器人的感知和行动能力强大到足以应对环境杂乱和灵活处理各种不同的互动时。

更新时间: 2025-09-23 15:28:51

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2404.07560v3

APFEx: Adaptive Pareto Front Explorer for Intersectional Fairness

Ensuring fairness in machine learning models is critical, especially when biases compound across intersecting protected attributes like race, gender, and age. While existing methods address fairness for single attributes, they fail to capture the nuanced, multiplicative biases faced by intersectional subgroups. We introduce Adaptive Pareto Front Explorer (APFEx), the first framework to explicitly model intersectional fairness as a joint optimization problem over the Cartesian product of sensitive attributes. APFEx combines three key innovations- (1) an adaptive multi-objective optimizer that dynamically switches between Pareto cone projection, gradient weighting, and exploration strategies to navigate fairness-accuracy trade-offs, (2) differentiable intersectional fairness metrics enabling gradient-based optimization of non-smooth subgroup disparities, and (3) theoretical guarantees of convergence to Pareto-optimal solutions. Experiments on four real-world datasets demonstrate APFEx's superiority, reducing fairness violations while maintaining competitive accuracy. Our work bridges a critical gap in fair ML, providing a scalable, model-agnostic solution for intersectional fairness.

Updated: 2025-09-23 15:27:29

标题: APFEx: 适应性帕累托前沿探索器用于交叉公平性

摘要: 确保机器学习模型的公平性至关重要，特别是当偏见在种族、性别和年龄等交叉保护属性中累积时。虽然现有方法解决了单一属性的公平性问题，但它们未能捕捉到交叉子群体所面临的微妙、多重的偏见。我们引入了自适应帕累托前端探索器（APFEx），这是第一个明确将交叉公平性建模为对敏感属性的笛卡尔积上的联合优化问题的框架。APFEx结合了三个关键创新：（1）自适应多目标优化器，动态切换帕累托锥投影、梯度加权和探索策略，以在公平性-准确性折衷中导航；（2）可微的交叉公平度量，使得对非平滑子群不平等进行基于梯度的优化成为可能；（3）收敛到帕累托最优解的理论保证。对四个真实数据集的实验证明了APFEx的优越性，减少了公平性违规行为，同时保持了竞争性的准确性。我们的工作填补了公平机器学习中的一个重要空白，提供了一个可扩展的、与模型无关的解决方案，用于处理交叉公平性。

更新时间: 2025-09-23 15:27:29

领域: cs.LG

下载: http://arxiv.org/abs/2509.13908v2

Generative Propaganda

Generative propaganda is the use of generative artificial intelligence (AI) to shape public opinion. To characterize its use in real-world settings, we conducted interviews with defenders (e.g., factcheckers, journalists, officials) in Taiwan and creators (e.g., influencers, political consultants, advertisers) as well as defenders in India, centering two places characterized by high levels of online propaganda. The term "deepfakes", we find, exerts outsized discursive power in shaping defenders' expectations of misuse and, in turn, the interventions that are prioritized. To better characterize the space of generative propaganda, we develop a taxonomy that distinguishes between obvious versus hidden and promotional versus derogatory use. Deception was neither the main driver nor the main impact vector of AI's use; instead, Indian creators sought to persuade rather than to deceive, often making AI's use obvious in order to reduce legal and reputational risks, while Taiwan's defenders saw deception as a subset of broader efforts to distort the prevalence of strategic narratives online. AI was useful and used, however, in producing efficiency gains in communicating across languages and modes, and in evading human and algorithmic detection. Security researchers should reconsider threat models to clearly differentiate deepfakes from promotional and obvious uses, to complement and bolster the social factors that constrain misuse by internal actors, and to counter efficiency gains globally.

Updated: 2025-09-23 15:27:00

标题: 生成式宣传

摘要: 生成性宣传是利用生成式人工智能（AI）来塑造公众舆论。为了对其在现实世界中的应用进行表征，我们在台湾进行了访谈，包括捍卫者（例如事实核查员、记者、官员）和创作者（例如意见领袖、政治顾问、广告商），以及在印度进行了访谈，重点关注两个被高水平网络宣传特征化的地方。我们发现，“深度伪造”这个术语在塑造捍卫者对滥用的期望方面具有超出比例的话语权，并进而确定了优先考虑的干预措施。为了更好地表征生成性宣传领域，我们制定了一个分类法，区分了明显与隐藏以及推广与贬低性的使用。欺骗既不是AI使用的主要动因也不是主要影响向量；相反，印度的创作者试图说服而不是欺骗，通常会使AI的使用显而易见，以减少法律和声誉风险，而台湾的捍卫者则将欺骗视为在网上扭曲战略叙事普遍性的更广泛努力的一个子集。然而，AI在跨语言和模式传播方面产生了效率增益，并在规避人类和算法检测方面得到了应用。安全研究人员应重新考虑威胁模型，明确区分深度伪造和推广及明显使用，以补充和加强约束内部行为者滥用的社会因素，并在全球范围内对抗效率增益。

更新时间: 2025-09-23 15:27:00

领域: cs.CY,cs.AI,cs.SI,K.4.2

下载: http://arxiv.org/abs/2509.19147v1

Anecdoctoring: Automated Red-Teaming Across Language and Place

Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose "anecdoctoring", a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.

Updated: 2025-09-23 15:26:13

标题: “Anecdoctoring：跨语言和地点的自动红队行动”

摘要: 虚假信息是生成式人工智能（AI）滥用的顶级风险之一。全球采用生成式AI需要跨越多种语言和文化的强大的red-teaming评估（即系统对抗性探查），但red-teaming数据集通常以美国和英语为中心。为了弥补这一差距，我们提出了一种新颖的red-teaming方法“anecdoctoring”，它可以自动生成跨越语言和文化的对抗性提示。我们从事实核查网站收集了三种语言（英语、西班牙语和印地语）和两个地理位置（美国和印度）的虚假信息声明。然后我们将单个声明聚类为更广泛的叙述，并用知识图表征这些聚类，然后我们用这些知识图加强了一个攻击者LLM。我们的方法产生了更高的攻击成功率，并相对于少量提示提供了可解释性的好处。结果强调了在全球范围内扩展和基于现实世界对抗性滥用的虚假信息缓解的需求。

更新时间: 2025-09-23 15:26:13

领域: cs.CL,cs.AI,cs.CY

下载: http://arxiv.org/abs/2509.19143v1

Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few

Attention mechanisms in Transformers have gained significant empirical success. Nonetheless, the optimization objectives underlying their forward pass are still unclear. Additionally, the quadratic complexity of self-attention is increasingly prohibitive. Unlike the prior work on addressing the interpretability or efficiency issue separately, we propose a unified optimization objective to alleviate both issues simultaneously. By unrolling the optimization over the objective, we derive an inherently interpretable and efficient attention mechanism, which compresses all tokens into low-dimensional structures by contracting a few representative tokens and then broadcasting the contractions back. This Contract-and-Broadcast Self-Attention (CBSA) mechanism can not only scale linearly but also generalize existing attention mechanisms as its special cases. Experiments further demonstrate comparable performance and even superior advantages of CBSA on several visual tasks. Code is available at this https URL.

Updated: 2025-09-23 15:26:12

标题: 走向可解释和高效的注意力机制：通过压缩少量参数实现全局压缩

摘要: Transformer中的注意力机制取得了显著的经验成功。然而，它们前向传递背后的优化目标仍然不明确。此外，自注意力的二次复杂性越来越具有限制性。与先前解决可解释性或效率问题的工作不同，我们提出了一个统一的优化目标，以同时缓解这两个问题。通过在目标上展开优化，我们推导出一个固有可解释和高效的注意力机制，通过收缩一些代表性的令牌，然后将收缩内容广播回去，将所有令牌压缩成低维结构。这种“收缩和广播自注意力”(CBSA)机制不仅可以线性扩展，还可以将现有的注意力机制推广为其特殊情况。实验进一步证明了CBSA在几个视觉任务上具有可比性的性能，甚至有优势。代码可在此链接中找到。

更新时间: 2025-09-23 15:26:12

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2509.16875v2

Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality

Existing distribution compression methods reduce dataset size by minimising the Maximum Mean Discrepancy (MMD) between original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall linear time and memory complexity in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which quantifies the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that across a variety of scenarios BDC can achieve comparable or superior performance to ambient-space compression at substantially lower cost.

Updated: 2025-09-23 15:23:45

标题: 双边分布压缩：减少数据大小和维度

摘要: 现有的分布压缩方法通过最小化原始和压缩集之间的最大均值差异（MMD）来减小数据集的大小，但现代数据集通常在样本大小和维度上都很大。我们提出了双向分布压缩（BDC），这是一个两阶段框架，可以在保持基础分布的情况下沿着两个轴进行压缩，总体上具有数据集大小和维度的线性时间和内存复杂性。BDC的核心是解码的MMD（DMMD），它量化了原始数据和从低维潜在空间解码的压缩集之间的差异。BDC通过（i）使用重建MMD（RMMD）学习低维投影，以及（ii）使用编码的MMD（EMMD）优化潜在的压缩集来进行。我们展示了这个过程最小化了DMMD，确保压缩集忠实地代表了原始分布。实验证明，在各种情景下，BDC可以以显著更低的成本实现与环境空间压缩相当或更好的性能。

更新时间: 2025-09-23 15:23:45

领域: stat.ML,cs.LG,stat.ME

下载: http://arxiv.org/abs/2509.17543v2

On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language

The use of natural language (NL) test cases for validating graphical user interface (GUI) applications is emerging as a promising direction to manually written executable test scripts, which are costly to develop and difficult to maintain. Recent advances in large language models (LLMs) have opened the possibility of the direct execution of NL test cases by LLM agents. This paper investigates this direction, focusing on the impact on NL test case unsoundness and on test case execution consistency. NL test cases are inherently unsound, as they may yield false failures due to ambiguous instructions or unpredictable agent behaviour. Furthermore, repeated executions of the same NL test case may lead to inconsistent outcomes, undermining test reliability. To address these challenges, we propose an algorithm for executing NL test cases with guardrail mechanisms and specialised agents that dynamically verify the correct execution of each test step. We introduce measures to evaluate the capabilities of LLMs in test execution and one measure to quantify execution consistency. We propose a definition of weak unsoundness to characterise contexts in which NL test case execution remains acceptable, with respect to the industrial quality levels Six Sigma. Our experimental evaluation with eight publicly available LLMs, ranging from 3B to 70B parameters, demonstrates both the potential and current limitations of current LLM agents for GUI testing. Our experiments show that Meta Llama 3.1 70B demonstrates acceptable capabilities in NL test case execution with high execution consistency (above the level 3-sigma). We provide prototype tools, test suites, and results.

Updated: 2025-09-23 15:20:40

标题: 关于使用自然语言编写的测试用例执行中LLM代理的准确性和一致性

摘要: 使用自然语言（NL）测试用例验证图形用户界面（GUI）应用程序正在成为手动编写可执行测试脚本的一种有希望的方向，这些脚本开发成本高且难以维护。最近大型语言模型（LLMs）的进展打开了LLM代理直接执行NL测试用例的可能性。本文研究了这个方向，重点关注NL测试用例不完整性对测试用例执行一致性的影响。NL测试用例本质上不完整，因为由于模棱两可的指令或不可预测的代理行为可能导致假失败。此外，重复执行相同的NL测试用例可能导致不一致的结果，破坏了测试的可靠性。为了解决这些挑战，我们提出了一种算法，通过防护机制和专门的代理动态验证每个测试步骤的正确执行来执行NL测试用例。我们引入了衡量LLMs在测试执行中的能力和衡量执行一致性的一个指标。我们提出了一个定义弱不完整性，以描述NL测试用例执行仍然可接受的情况，与工业质量水平六西格玛相关。我们使用八个公开可用的LLMs进行了实验评估，这些LLMs的参数范围从3B到70B，展示了当前LLM代理在GUI测试中的潜力和当前限制。我们的实验表明，Meta Llama 3.1 70B在NL测试用例执行方面表现出可接受的能力，并具有高的执行一致性（超过3西格玛水平）。我们提供了原型工具、测试套件和结果。

更新时间: 2025-09-23 15:20:40

领域: cs.SE,cs.AI,D.2.4; D.2.5; F.3.1

下载: http://arxiv.org/abs/2509.19136v1

GSTM-HMU: Generative Spatio-Temporal Modeling for Human Mobility Understanding

Human mobility traces, often recorded as sequences of check-ins, provide a unique window into both short-term visiting patterns and persistent lifestyle regularities. In this work we introduce GSTM-HMU, a generative spatio-temporal framework designed to advance mobility analysis by explicitly modeling the semantic and temporal complexity of human movement. The framework consists of four key innovations. First, a Spatio-Temporal Concept Encoder (STCE) integrates geographic location, POI category semantics, and periodic temporal rhythms into unified vector representations. Second, a Cognitive Trajectory Memory (CTM) adaptively filters historical visits, emphasizing recent and behaviorally salient events in order to capture user intent more effectively. Third, a Lifestyle Concept Bank (LCB) contributes structured human preference cues, such as activity types and lifestyle patterns, to enhance interpretability and personalization. Finally, task-oriented generative heads transform the learned representations into predictions for multiple downstream tasks. We conduct extensive experiments on four widely used real-world datasets, including Gowalla, WeePlace, Brightkite, and FourSquare, and evaluate performance on three benchmark tasks: next-location prediction, trajectory-user identification, and time estimation. The results demonstrate consistent and substantial improvements over strong baselines, confirming the effectiveness of GSTM-HMU in extracting semantic regularities from complex mobility data. Beyond raw performance gains, our findings also suggest that generative modeling provides a promising foundation for building more robust, interpretable, and generalizable systems for human mobility intelligence.

Updated: 2025-09-23 15:20:38

标题: GSTM-HMU：用于人类活动理解的生成式时空建模

摘要: 人类移动轨迹，通常记录为签到序列，为短期访问模式和持久生活方式规律提供了独特的窗口。在这项工作中，我们介绍了GSTM-HMU，这是一个生成时空框架，旨在通过明确建模人类移动的语义和时间复杂性来推进移动性分析。该框架包括四个关键创新。首先，时空概念编码器（STCE）将地理位置、POI类别语义和周期性时间节奏整合为统一的向量表示。其次，认知轨迹记忆（CTM）自适应地过滤历史访问，强调最近和行为显著事件，以更有效地捕捉用户意图。第三，生活方式概念库（LCB）提供结构化的人类偏好线索，如活动类型和生活方式模式，以增强可解释性和个性化。最后，面向任务的生成头将学习到的表示转换为多个下游任务的预测。我们在四个广泛使用的真实世界数据集上进行了广泛实验，包括Gowalla、WeePlace、Brightkite和FourSquare，并在三个基准任务上评估性能：下一个位置预测、轨迹用户识别和时间估计。结果显示，在强基线上持续和显著的改进，证实了GSTM-HMU在从复杂移动数据中提取语义规律方面的有效性。除了原始性能提升外，我们的发现还表明，生成建模为构建更强大、可解释和可泛化的人类移动智能系统提供了一个有希望的基础。

更新时间: 2025-09-23 15:20:38

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19135v1

Analyzing Uncertainty Quantification in Statistical and Deep Learning Models for Probabilistic Electricity Price Forecasting

Precise probabilistic forecasts are fundamental for energy risk management, and there is a wide range of both statistical and machine learning models for this purpose. Inherent to these probabilistic models is some form of uncertainty quantification. However, most models do not capture the full extent of uncertainty, which arises not only from the data itself but also from model and distributional choices. In this study, we examine uncertainty quantification in state-of-the-art statistical and deep learning probabilistic forecasting models for electricity price forecasting in the German market. In particular, we consider deep distributional neural networks (DDNNs) and augment them with an ensemble approach, Monte Carlo (MC) dropout, and conformal prediction to account for model uncertainty. Additionally, we consider the LASSO-estimated autoregressive (LEAR) approach combined with quantile regression averaging (QRA), generalized autoregressive conditional heteroskedasticity (GARCH), and conformal prediction. Across a range of performance metrics, we find that the LEAR-based models perform well in terms of probabilistic forecasting, irrespective of the uncertainty quantification method. Furthermore, we find that DDNNs benefit from incorporating both data and model uncertainty, improving both point and probabilistic forecasting. Uncertainty itself appears to be best captured by the models using conformal prediction. Overall, our extensive study shows that all models under consideration perform competitively. However, their relative performance depends on the choice of metrics for point and probabilistic forecasting.

Updated: 2025-09-23 15:20:03

标题: 分析统计和深度学习模型在概率性电力价格预测中的不确定性量化

摘要: 精确的概率预测对于能源风险管理至关重要，为此存在广泛的统计和机器学习模型。这些概率模型固有地包含某种形式的不确定性量化。然而，大多数模型未能捕捉到不确定性的全部范围，这种不确定性不仅来自数据本身，还来自模型和分布选择。在这项研究中，我们研究了德国市场电力价格预测中最先进的统计和深度学习概率模型中的不确定性量化。具体而言，我们考虑了深度分布神经网络（DDNNs）并将其与集成方法、蒙特卡洛（MC）辍学和符合预测相结合，以考虑模型不确定性。此外，我们考虑了LASSO估计的自回归（LEAR）方法结合分位数回归平均（QRA）、广义自回归条件异方差（GARCH）和符合预测。在一系列性能指标上，我们发现基于LEAR的模型在概率预测方面表现良好，无论不确定性量化方法如何。此外，我们发现DDNNs受益于结合数据和模型不确定性，改进了点预测和概率预测。不确定性本身似乎最好由使用符合预测的模型捕捉。总的来说，我们的广泛研究表明，所有考虑的模型在竞争中表现出色。然而，它们的相对性能取决于点和概率预测指标的选择。

更新时间: 2025-09-23 15:20:03

领域: cs.LG,math.ST,stat.TH,I.6.4; I.6.5; I.6.6

下载: http://arxiv.org/abs/2509.19417v1

Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP

Large language models (LLMs) can amplify misinformation, undermining societal goals like the UN SDGs. We study three documented drivers of misinformation (valence framing, information overload, and oversimplification) which are often shaped by one's default beliefs. Building on evidence that LLMs encode such defaults (e.g., "joy is positive," "math is complex") and can act as "bags of heuristics," we ask: can general belief-driven heuristics behind misinformative behaviour be recovered from LLMs as clear rules? A key obstacle is that global rule-extraction methods in explainable AI (XAI) are built for numerical inputs/outputs, not text. We address this by eliciting global LLM beliefs and mapping them to numerical scores via statistically reliable abstractions, thereby enabling off-the-shelf global XAI to detect belief-related heuristics in LLMs. To obtain ground truth, we hard-code bias-inducing nonlinear heuristics of increasing complexity (univariate, conjunctive, nonconvex) into popular LLMs (ChatGPT and Llama) via system instructions. This way, we find that RuleFit under-detects non-univariate biases, while global SHAP better approximates conjunctive ones but does not yield actionable rules. To bridge this gap, we propose RuleSHAP, a rule-extraction algorithm that couples global SHAP-value aggregations with rule induction to better capture non-univariate bias, improving heuristics detection over RuleFit by +94% (MRR@1) on average. Our results provide a practical pathway for revealing belief-driven biases in LLMs.

Updated: 2025-09-23 15:19:17

标题: 全球可解释人工智能方法能否揭示在LLMs中注入的偏见？SHAP vs 规则提取 vs 规则SHAP

摘要: 大型语言模型（LLMs）可以放大错误信息，从而破坏联合国可持续发展目标等社会目标。我们研究了误导信息的三个已记录驱动因素（情感框架、信息过载和过度简化），这些因素通常受到一个人的默认信念的影响。基于LLMs编码这些默认值的证据（例如，“快乐是积极的”，“数学是复杂的”）并可以充当“启发式袋”，我们问：是否可以从LLMs中恢复误导行为背后的一般信念驱动启发式作为明确的规则？一个关键障碍是可解释人工智能（XAI）中的全局规则提取方法是为数值输入/输出而建立的，而不是文本。我们通过引发全局LLM信念并通过统计上可靠的抽象将它们映射到数值分数，从而使现成的全局XAI能够检测LLMs中与信念相关的启发式。为了获得基本事实，我们通过系统指令将引入偏差的非线性启发式（单变量、连词、非凸）硬编码到流行的LLMs（ChatGPT和Llama）中。通过这种方式，我们发现RuleFit在检测非单变量偏差时存在不足，而全局SHAP更好地近似了连词偏差，但并没有产生可行的规则。为了弥合这一差距，我们提出了RuleSHAP，一种规则提取算法，将全局SHAP值聚合与规则归纳相结合，更好地捕捉非单变量偏差，平均提高了94％（MRR@1）的启发式检测。我们的结果为揭示LLMs中信念驱动的偏见提供了实用的途径。

更新时间: 2025-09-23 15:19:17

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2505.11189v2

Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding

Guesstimation--the task of making approximate quantitative estimates about objects or events-is a common real--world skill, yet remains underexplored in large language model (LLM) research. We introduce three guesstimation datasets: MARBLES, FUTURE, and ELECPRED, spanning physical estimation (e.g., how many marbles fit in a cup) to abstract predictions (e.g., the 2024 U.S. presidential election). Inspired by the social science concept of Wisdom of Crowds (WOC)- where the median of multiple estimates improves accuracy-we propose WOC decoding for LLMs. We replicate WOC effects in human participants and find that LLMs exhibit similar benefits: median aggregation across sampled responses consistently improves accuracy over greedy decoding, self-consistency decoding, and mean decoding. This suggests that LLMs encode a world model that supports approximate reasoning. Our results position guesstimation as a useful probe of LLM world knowledge and highlight WOC decoding as a strategy for enhancing LLM guesstimation performance on real-world tasks.

Updated: 2025-09-23 15:17:03

标题: 探索LLM世界模型：利用众包智慧增强猜测能力

摘要: Guesstimation——即对物体或事件进行近似定量估计的任务——是一种常见的现实世界技能，但在大型语言模型（LLM）研究中仍未得到充分探讨。我们介绍了三个guesstimation数据集：MARBLES，FUTURE和ELECPRED，涵盖了物理估计（例如，一个杯子可以装多少弹珠）到抽象预测（例如，2024年美国总统选举）。受社会科学“群体智慧”（WOC）概念的启发——即多个估计的中位数提高准确性——我们提出了LLM的WOC解码。我们在人类参与者中复制了WOC效应，并发现LLM表现出类似的好处：在采样响应中进行中位数聚合始终比贪婪解码、自一致解码和平均解码提高准确性。这表明LLM编码了支持近似推理的世界模型。我们的结果将guesstimation定位为LLM世界知识的有用探针，并突出了WOC解码作为提高LLM在现实任务中guesstimation性能的策略。

更新时间: 2025-09-23 15:17:03

领域: cs.AI,cs.HC

下载: http://arxiv.org/abs/2501.17310v4

PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generatio

Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by the novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately $\sim 2x$ faster learning compared to conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.

Updated: 2025-09-23 15:15:21

标题: PipelineRL：更快的长序列生成的在线策略强化学习

摘要: 强化学习（RL）越来越被用于增强大型语言模型（LLMs）的推理能力。然而，有效地扩展这些RL方法面临着重大挑战，主要是由于在不损害常见RL算法的情况下保持高AI加速器利用率的困难，而不生成陈旧的离策略数据。本文介绍了PipelineRL，这是一种旨在实现硬件效率和LLM训练数据策略性之间更优越平衡的方法。PipelineRL采用并发异步数据生成和模型训练，具有独特的在飞行权重更新。这种机制允许LLM生成引擎在生成令牌序列期间最大限度地接收更新的模型权重，从而最大化加速器利用率和训练数据的新鲜度。在使用128个H100 GPU进行长篇推理任务的实验中，PipelineRL相比传统RL基线实现了大约2倍的更快学习速度，同时保持高度策略性的训练数据。PipelineRL的可扩展和模块化的开源实现也作为一个重要贡献发布。

更新时间: 2025-09-23 15:15:21

领域: cs.LG

下载: http://arxiv.org/abs/2509.19128v1

Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.

Updated: 2025-09-23 15:14:42

标题: 在野外环境中，视觉-语言模型安全吗？基于表情的基准研究

摘要: 视觉语言模型（VLMs）的快速部署扩大了安全风险，然而大多数评估依赖于人工图像。本研究探讨了一个问题：当面对普通用户共享的模因图像时，当前的VLMs有多安全？为了调查这个问题，我们引入了MemeSafetyBench，一个包含50,430个实例的基准，将真实的模因图像与有害和良性指令配对。利用全面的安全分类法和基于LLM的指令生成，我们评估了多个VLMs在单一和多轮交互中的表现。我们研究了现实世界的模因如何影响有害输出，对话背景的缓解效果，以及模型规模和安全指标之间的关系。我们的研究结果表明，与合成或印刷图像相比，VLMs更容易受到基于模因的有害提示的影响。模因显著增加了有害响应并减少了拒绝，与仅文本输入相比。虽然多轮交互提供了部分减轻，但高度脆弱仍然存在。这些结果突出了对生态有效评估和更强大的安全机制的需求。MemeSafetyBench可在https://github.com/oneonlee/Meme-Safety-Bench公开获取。

更新时间: 2025-09-23 15:14:42

领域: cs.CL,cs.CR,cs.CV

下载: http://arxiv.org/abs/2505.15389v3

Analysis on distribution and clustering of weight

The study on architecture and parameter characteristics remains the hot topic in the research of large language models. In this paper we concern with the characteristics of weight which are used to analyze the correlations and differences between models. Two kinds of vectors-standard deviation vector and clustering vector-are proposed to describe features of models. In the first case, the weights are assumed to follow normal distribution. The standard deviation values of projection matrices are normalized to form Standard-Deviation Vector, representing the distribution characteristics of models. In the second case, the singular values from each weight projection matrix are extracted and grouped by K-Means algorithm. The grouped data with the same type matrix are combined as Clustering Vector to represent the correlation characteristics of models' weights. The study reveals that these two vectors can effectively distinguish between different models and clearly show the similarities among models of the same family. Moreover, after conducting LoRA fine-tuning with different datasets and models, it is found that the distribution of weights represented by standard deviation vector is directly influenced by the dataset, but the correlations between different weights represented by clustering vector remain unaffected and maintain a high consistency with the pre-trained model.

Updated: 2025-09-23 15:08:25

标题: 权重分布和聚类分析

摘要: 对大型语言模型的体系结构和参数特性的研究仍然是研究的热点话题。本文关注用于分析模型之间的相关性和差异的权重特性。提出了两种向量-标准差向量和聚类向量-用于描述模型的特征。在第一种情况下，假设权重遵循正态分布。将投影矩阵的标准差值归一化，形成标准差向量，代表模型的分布特性。在第二种情况下，从每个权重投影矩阵中提取奇异值，并通过K-Means算法进行分组。将相同类型矩阵的分组数据组合为聚类向量，以表示模型权重的相关性特征。研究表明，这两个向量能够有效区分不同模型，并清楚地展示相同家族模型之间的相似性。此外，在使用不同数据集和模型进行LoRA微调后，发现由标准差向量表示的权重分布受数据集直接影响，但由聚类向量表示的不同权重之间的相关性保持不变，并与预训练模型保持高一致性。

更新时间: 2025-09-23 15:08:25

领域: cs.LG,cs.AI,68T50,I.2.7

下载: http://arxiv.org/abs/2509.19122v1

FedFiTS: Fitness-Selected, Slotted Client Scheduling for Trustworthy Federated Learning in Healthcare AI

Federated Learning (FL) has emerged as a powerful paradigm for privacy-preserving model training, yet deployments in sensitive domains such as healthcare face persistent challenges from non-IID data, client unreliability, and adversarial manipulation. This paper introduces FedFiTS, a trust and fairness-aware selective FL framework that advances the FedFaSt line by combining fitness-based client election with slotted aggregation. FedFiTS implements a three-phase participation strategy-free-for-all training, natural selection, and slotted team participation-augmented with dynamic client scoring, adaptive thresholding, and cohort-based scheduling to balance convergence efficiency with robustness. A theoretical convergence analysis establishes bounds for both convex and non-convex objectives under standard assumptions, while a communication-complexity analysis shows reductions relative to FedAvg and other baselines. Experiments on diverse datasets-medical imaging (X-ray pneumonia), vision benchmarks (MNIST, FMNIST), and tabular agricultural data (Crop Recommendation)-demonstrate that FedFiTS consistently outperforms FedAvg, FedRand, and FedPow in accuracy, time-to-target, and resilience to poisoning attacks. By integrating trust-aware aggregation with fairness-oriented client selection, FedFiTS advances scalable and secure FL, making it well suited for real-world healthcare and cross-domain deployments.

Updated: 2025-09-23 15:06:04

标题: FedFiTS：适合健康AI可信联邦学习的健身选择、分槽客户端调度

摘要: 联邦学习（FL）已经成为隐私保护模型训练的强大范例，然而在诸如医疗保健等敏感领域的部署面临着来自非IID数据、客户不可靠性和敌对操纵的持续挑战。本文介绍了FedFiTS，这是一个注重信任和公平的选择性FL框架，通过将基于适应性的客户选举与分时聚合相结合，推进了FedFaSt线。FedFiTS实施了一个三阶段参与策略-自由训练、自然选择和分时团队参与-辅以动态客户评分、自适应阈值和基于队列的调度，以平衡收敛效率和鲁棒性。理论收敛分析确立了在标准假设下对凸和非凸目标的边界，而通信复杂性分析显示相对于FedAvg和其他基线的降低。对各种数据集-医学成像（X射线肺炎）、视觉基准（MNIST、FMNIST）和表格农业数据（作物推荐）的实验表明，FedFiTS在准确性、达到目标时间和对毒害攻击的韧性方面始终优于FedAvg、FedRand和FedPow。通过将信任感知聚合与以公平为导向的客户选择相结合，FedFiTS推进了可扩展和安全的FL，使其非常适合于现实世界的医疗保健和跨领域部署。

更新时间: 2025-09-23 15:06:04

领域: cs.LG,cs.AI,cs.DC

下载: http://arxiv.org/abs/2509.19120v1

MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing

Text-guided image editing has seen significant progress in natural image domains, but its application in medical imaging remains limited and lacks standardized evaluation frameworks. Such editing could revolutionize clinical practices by enabling personalized surgical planning, enhancing medical education, and improving patient communication. To bridge this gap, we introduce MedEBench1, a robust benchmark designed to diagnose reliability in text-guided medical image editing. MedEBench consists of 1,182 clinically curated image-prompt pairs covering 70 distinct editing tasks and 13 anatomical regions. It contributes in three key areas: (1) a clinically grounded evaluation framework that measures Editing Accuracy, Context Preservation, and Visual Quality, complemented by detailed descriptions of intended edits and corresponding Region-of-Interest (ROI) masks; (2) a comprehensive comparison of seven state-of-theart models, revealing consistent patterns of failure; and (3) a diagnostic error analysis technique that leverages attention alignment, using Intersection-over-Union (IoU) between model attention maps and ROI masks to identify mislocalization issues, where models erroneously focus on incorrect anatomical regions. MedEBench sets the stage for developing more reliable and clinically effective text-guided medical image editing tools.

Updated: 2025-09-23 15:05:28

标题: MedEBench: 诊断文本引导的医学图像编辑的可靠性

摘要: 医学影像的文本引导图像编辑在自然图像领域取得了显著进展，但在医学影像中的应用仍然有限，并且缺乏标准化的评估框架。这种编辑可以通过实现个性化手术规划、增强医学教育和改善患者沟通，从而彻底改变临床实践。为了弥合这一差距，我们介绍了MedEBench1，这是一个设计用于诊断文本引导医学影像编辑可靠性的强大基准。MedEBench由1,182对临床策划的图像提示组成，涵盖70个不同的编辑任务和13个解剖区域。它在三个关键领域做出贡献：（1）一个基于临床的评估框架，衡量编辑准确性、上下文保留和视觉质量，辅以对预期编辑和相应感兴趣区域（ROI）掩模的详细描述；（2）对七种最先进模型的全面比较，揭示了一致的失败模式；以及（3）一种诊断错误分析技术，利用注意力对齐，使用模型注意力图和ROI掩模之间的交集-联合（IoU）来识别错位问题，其中模型错误地聚焦在不正确的解剖区域上。MedEBench为开发更可靠和临床有效的文本引导医学影像编辑工具奠定了基础。

更新时间: 2025-09-23 15:05:28

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2506.01921v6

LLM-based Vulnerability Discovery through the Lens of Code Metrics

Large language models (LLMs) excel in many tasks of software engineering, yet progress in leveraging them for vulnerability discovery has stalled in recent years. To understand this phenomenon, we investigate LLMs through the lens of classic code metrics. Surprisingly, we find that a classifier trained solely on these metrics performs on par with state-of-the-art LLMs for vulnerability discovery. A root-cause analysis reveals a strong correlation and a causal effect between LLMs and code metrics: When the value of a metric is changed, LLM predictions tend to shift by a corresponding magnitude. This dependency suggests that LLMs operate at a similarly shallow level as code metrics, limiting their ability to grasp complex patterns and fully realize their potential in vulnerability discovery. Based on these findings, we derive recommendations on how research should more effectively address this challenge.

Updated: 2025-09-23 15:03:05

标题: 基于LLM的代码度量视角下的漏洞发现

摘要: 大型语言模型（LLMs）在软件工程的许多任务中表现出色，但近年来在利用它们进行漏洞发现方面的进展已经停滞不前。为了理解这一现象，我们通过经典代码指标的视角对LLMs进行了调查。令人惊讶的是，我们发现一个仅基于这些指标训练的分类器在漏洞发现方面表现与最先进的LLMs不相上下。根本原因分析揭示了LLMs和代码指标之间的强相关性和因果关系：当指标值发生变化时，LLMs的预测往往会相应地发生变化。这种依赖关系表明，LLMs在操作上与代码指标类似，限制了它们把握复杂模式和充分实现在漏洞发现中的潜力的能力。基于这些发现，我们提出了如何更有效地解决这一挑战的建议。

更新时间: 2025-09-23 15:03:05

领域: cs.CR,cs.LG,cs.SE

下载: http://arxiv.org/abs/2509.19117v1

DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We open-source the codes, models and datasets we constructed: github.com/sony/DeepResonance.

Updated: 2025-09-23 15:01:37

标题: DeepResonance：通过以音乐为中心的多向指导调整增强多模态音乐理解

摘要: 最近音乐大型语言模型（LLMs）的进展显著改善了音乐理解任务，其中包括模型分析和解释各种音乐要素的能力。这些改进主要集中在整合音乐和文本输入方面。然而，将其他模态（如图像、视频和文本音乐特征）纳入以增强音乐理解的潜力尚未被探索。为了弥补这一差距，我们提出了DeepResonance，这是一个多模态音乐理解LLM，通过多向指导调节和多向对齐的音乐、文本、图像和视频数据来微调。为此，我们构建了Music4way-MI2T、Music4way-MV2T和Music4way-Any2T这三个四路训练和评估数据集，旨在使DeepResonance能够整合视觉和文本音乐特征内容。我们还引入了多采样的ImageBind嵌入和一个预LLM融合Transformer，以增强模态融合，然后将其输入文本LLMs，为多向指导调节量身定制。我们的模型在六个音乐理解任务中取得了最先进的表现，凸显了辅助模态的益处和DeepResonance的结构优越性。我们开源构建的代码、模型和数据集：github.com/sony/DeepResonance。

更新时间: 2025-09-23 15:01:37

领域: cs.SD,cs.AI,cs.CL,cs.MM,eess.AS

下载: http://arxiv.org/abs/2502.12623v3

Hierarchical Federated Learning for Social Network with Mobility

Federated Learning (FL) offers a decentralized solution that allows collaborative local model training and global aggregation, thereby protecting data privacy. In conventional FL frameworks, data privacy is typically preserved under the assumption that local data remains absolutely private, whereas the mobility of clients is frequently neglected in explicit modeling. In this paper, we propose a hierarchical federated learning framework based on the social network with mobility namely HFL-SNM that considers both data sharing among clients and their mobility patterns. Under the constraints of limited resources, we formulate a joint optimization problem of resource allocation and client scheduling, which objective is to minimize the energy consumption of clients during the FL process. In social network, we introduce the concepts of Effective Data Coverage Rate and Redundant Data Coverage Rate. We analyze the impact of effective data and redundant data on the model performance through preliminary experiments. We decouple the optimization problem into multiple sub-problems, analyze them based on preliminary experimental results, and propose Dynamic Optimization in Social Network with Mobility (DO-SNM) algorithm. Experimental results demonstrate that our algorithm achieves superior model performance while significantly reducing energy consumption, compared to traditional baseline algorithms.

Updated: 2025-09-23 14:59:08

标题: Hierarchical Federated Learning for Social Network with Mobility Hierarchical Federated Learning for Social Network with Mobility

摘要: Federated Learning（FL）提供了一种分散化的解决方案，允许协作本地模型训练和全局聚合，从而保护数据隐私。在传统的FL框架中，数据隐私通常在假设本地数据保持绝对私密的情况下得以保留，而客户的流动性经常在明确建模中被忽视。在本文中，我们提出了一种基于社交网络和移动性的分层联邦学习框架，即HFL-SNM，考虑了客户之间的数据共享和其移动模式。在资源有限的约束下，我们制定了资源分配和客户调度的联合优化问题，其目标是在FL过程中减少客户的能耗。在社交网络中，我们引入了有效数据覆盖率和冗余数据覆盖率的概念。通过初步实验分析有效数据和冗余数据对模型性能的影响。我们将优化问题分解为多个子问题，基于初步实验结果进行分析，并提出了带有移动性的社交网络动态优化（DO-SNM）算法。实验结果表明，与传统基准算法相比，我们的算法在显著减少能耗的同时实现了更优越的模型性能。

更新时间: 2025-09-23 14:59:08

领域: cs.LG

下载: http://arxiv.org/abs/2509.14938v2

Towards Practical Multi-label Causal Discovery in High-Dimensional Event Sequences via One-Shot Graph Aggregation

Understanding causality in event sequences where outcome labels such as diseases or system failures arise from preceding events like symptoms or error codes is critical. Yet remains an unsolved challenge across domains like healthcare or vehicle diagnostics. We introduce CARGO, a scalable multi-label causal discovery method for sparse, high-dimensional event sequences comprising of thousands of unique event types. Using two pretrained causal Transformers as domain-specific foundation models for event sequences. CARGO infers in parallel, per sequence one-shot causal graphs and aggregates them using an adaptive frequency fusion to reconstruct the global Markov boundaries of labels. This two-stage approach enables efficient probabilistic reasoning at scale while bypassing the intractable cost of full-dataset conditional independence testing. Our results on a challenging real-world automotive fault prediction dataset with over 29,100 unique event types and 474 imbalanced labels demonstrate CARGO's ability to perform structured reasoning.

Updated: 2025-09-23 14:58:50

标题: 朝向实用的高维事件序列中多标签因果发现：通过一次性图聚合

摘要: 理解在事件序列中的因果关系，其中结果标签（如疾病或系统故障）来源于先前事件（如症状或错误代码）是至关重要的。然而，在诸如医疗保健或车辆诊断等领域，这仍然是一个未解决的挑战。我们引入了CARGO，一种可扩展的多标签因果发现方法，用于包含成千上万个唯一事件类型的稀疏、高维事件序列。使用两个预训练的因果Transformer作为事件序列的领域特定基础模型。CARGO并行推断，每个序列一次性因果图，并使用自适应频率融合来重建标签的全局马尔可夫边界。这种两阶段方法使得在规模上进行高效的概率推理，同时绕过完整数据集条件独立性测试的难以处理的成本。我们在一个具有超过29,100个唯一事件类型和474个不平衡标签的具有挑战性的现实世界汽车故障预测数据集上的结果表明，CARGO能够进行结构化推理。

更新时间: 2025-09-23 14:58:50

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19112v1

A Fast Initialization Method for Neural Network Controllers: A Case Study of Image-based Visual Servoing Control for the multicopter Interception

Reinforcement learning-based controller design methods often require substantial data in the initial training phase. Moreover, the training process tends to exhibit strong randomness and slow convergence. It often requires considerable time or high computational resources. Another class of learning-based method incorporates Lyapunov stability theory to obtain a control policy with stability guarantees. However, these methods generally require an initially stable neural network control policy at the beginning of training. Evidently, a stable neural network controller can not only serve as an initial policy for reinforcement learning, allowing the training to focus on improving controller performance, but also act as an initial state for learning-based Lyapunov control methods. Although stable controllers can be designed using traditional control theory, designers still need to have a great deal of control design knowledge to address increasingly complicated control problems. The proposed neural network rapid initialization method in this paper achieves the initial training of the neural network control policy by constructing datasets that conform to the stability conditions based on the system model. Furthermore, using the image-based visual servoing control for multicopter interception as a case study, simulations and experiments were conducted to validate the effectiveness and practical performance of the proposed method. In the experiment, the trained control policy attains a final interception velocity of 15 m/s.

Updated: 2025-09-23 14:56:59

标题: 神经网络控制器的快速初始化方法：多旋翼拦截基于图像的视觉伺服控制案例研究

摘要: 基于强化学习的控制器设计方法通常需要大量的数据在初始训练阶段。此外，训练过程往往表现出强烈的随机性和收敛速度慢。通常需要相当长时间或高计算资源。另一类基于学习的方法将Lyapunov稳定性理论纳入其中，以获得具有稳定性保证的控制策略。然而，这些方法通常需要在训练开始时具有初始稳定的神经网络控制策略。显然，稳定的神经网络控制器不仅可以作为强化学习的初始策略，使训练集中于改善控制器性能，还可以作为学习基于Lyapunov控制方法的初始状态。尽管可以使用传统控制理论设计稳定控制器，但设计者仍需要具有大量控制设计知识来解决日益复杂的控制问题。本文提出的神经网络快速初始化方法通过构建符合系统模型稳定条件的数据集来实现神经网络控制策略的初始训练。此外，通过以基于图像的视觉伺服控制多旋翼拦截为案例研究，进行了模拟和实验证实了所提出方法的有效性和实际性能。在实验中，训练后的控制策略实现了最终拦截速度为15m/s。

更新时间: 2025-09-23 14:56:59

领域: eess.SY,cs.LG,cs.RO,cs.SY

下载: http://arxiv.org/abs/2509.19110v1

DRO-REBEL: Distributionally Robust Relative-Reward Regression for Fast and Efficient LLM Alignment

Reinforcement learning with human feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where models overfit to reward misspecification and drift from preferred behaviors observed during training. We introduce DRO-REBEL, a unified family of robust REBEL updates with type-$p$ Wasserstein, KL, and $\chi^2$ ambiguity sets. Using Fenchel duality, each update reduces to a simple relative-reward regression, preserving scalability and avoiding PPO-style clipping or auxiliary value networks. Under standard linear-reward and log-linear policy classes with a data-coverage condition, we establish $O(n^{-1/4})$ estimation bounds with tighter constants than prior DRO-DPO approaches, and recover the minimax-optimal $O(n^{-1/2})$ rate via a localized Rademacher complexity analysis. The same analysis closes the gap for Wasserstein-DPO and KL-DPO, showing both also attain optimal parametric rates. We derive practical SGD algorithms for all three divergences: gradient regularization (Wasserstein), importance weighting (KL), and a fast 1-D dual solve ($\chi^2$). Experiments on Emotion Alignment, the large-scale ArmoRM multi-objective benchmark, and HH-Alignment demonstrate strong worst-case robustness across unseen preference mixtures, model sizes, and data scales, with $\chi^2$-REBEL showing consistently strong empirical performance. A controlled radius--coverage study validates a no-free-lunch trade-off: radii shrinking faster than empirical divergence concentration rates achieve minimax-optimal parametric rates but forfeit coverage, while coverage-guaranteeing radii incur $O(n^{-1/4})$ rates.

Updated: 2025-09-23 14:49:48

标题: DRO-REBEL：用于快速高效的LLM对齐的分布鲁棒相对奖励回归

摘要: 使用人类反馈的强化学习（RLHF）已经变得至关重要，用于将大型语言模型（LLMs）与人类意图对齐。然而，现有的离线RLHF方法存在过度优化的问题，模型过度拟合于奖励规范，偏离训练过程中观察到的首选行为。我们引入了DRO-REBEL，一种统一的具有类型-$p$ Wasserstein、KL和$\chi^2$模糊集的鲁棒REBEL更新系列。利用Fenchel对偶性，每个更新都简化为简单的相对奖励回归，保持可扩展性，避免了PPO风格的修剪或辅助价值网络。在标准线性奖励和对数线性策略类别下，我们建立了具有比先前DRO-DPO方法更紧密常数的$O(n^{-1/4})$估计界限，并通过局部Rademacher复杂性分析恢复了最小极优$O(n^{-1/2})$速率。相同的分析为Wasserstein-DPO和KL-DPO填补了差距，表明两者也实现了最优参数速率。我们为三种差异推导了实用的SGD算法：梯度正则化（Wasserstein）、重要性加权（KL）和快速的1-D对偶求解（$\chi^2$）。在情感对齐、大规模ArmoRM多目标基准测试和HH-Alignment的实验中，展示了针对未见偏好混合、模型规模和数据规模的强最坏情况鲁棒性，$\chi^2$-REBEL显示出一致强大的实证表现。控制半径-覆盖率研究验证了没有免费午餐的权衡：半径收缩速度快于实证分歧集中率的最小极优参数速率，但会放弃覆盖率，而保证覆盖的半径会导致$O(n^{-1/4})$速率。

更新时间: 2025-09-23 14:49:48

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2509.19104v1

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

General-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object centric and action centric diffusion policy FuncDiffuser trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

Updated: 2025-09-23 14:49:05

标题: FUNCanon：通过功能性物体规范化学习姿势感知动作基元，实现通用机器人操纵

摘要: 来自端对端演示的通用机器人技能往往会导致特定任务策略无法推广到训练分布之外。因此，我们引入了FunCanon，一个将长期操纵任务转化为由演员、动词和对象定义的动作块序列的框架。这些块将策略学习集中在动作本身上，而不是孤立的任务，从而实现了组合性和重用性。为了使策略具有姿势感知和类别通用性，我们进行了功能对象规范化，用于功能对齐和自动操纵轨迹转移，利用大型视觉语言模型中的可支配线索，将对象映射到共享的功能框架中。基于这些对齐数据训练的面向对象和动作中心的扩散策略FuncDiffuser自然地尊重对象的可支配性和姿势，简化了学习并提高了泛化能力。在模拟和真实世界基准测试中的实验表明，范畴级泛化、跨任务行为重用和强大的sim2real部署，显示功能规范化为复杂操纵领域中可扩展的模仿学习提供了强有力的归纳偏见。演示和补充材料的详细信息可在我们的项目网站https://sites.google.com/view/funcanon上找到。

更新时间: 2025-09-23 14:49:05

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2509.19102v1

Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

Backdoor attacks pose a significant security threat to natural language processing (NLP) systems, but existing methods lack explainable trigger mechanisms and fail to quantitatively model vulnerability patterns. This work pioneers the quantitative connection between explainable artificial intelligence (XAI) and backdoor attacks, introducing Sensitron, a novel modular framework for crafting stealthy and robust backdoor triggers. Sensitron employs a progressive refinement approach where Dynamic Meta-Sensitivity Analysis (DMSA) first identifies potentially vulnerable input tokens, Hierarchical SHAP Estimation (H-SHAP) then provides explainable attribution to precisely pinpoint the most influential tokens, and finally a Plug-and-Rank mechanism that generates contextually appropriate triggers. We establish the first mathematical correlation (Sensitivity Ranking Correlation, SRC=0.83) between explainability scores and empirical attack success, enabling precise targeting of model vulnerabilities. Sensitron achieves 97.8% Attack Success Rate (ASR) (+5.8% over state-of-the-art (SOTA)) with 85.4% ASR at 0.1% poisoning rate, demonstrating robust resistance against multiple SOTA defenses. This work reveals fundamental NLP vulnerabilities and provides new attack vectors through weaponized explainability.

Updated: 2025-09-23 14:49:00

标题: 触发痛点：通过灵敏度和Sensitron揭示隐藏后门

摘要: 后门攻击对自然语言处理（NLP）系统构成重大安全威胁，但现有方法缺乏可解释的触发机制，并未能定量建模漏洞模式。这项工作开创了可解释人工智能（XAI）和后门攻击之间的定量联系，引入了Sensitron，一种新颖的模块化框架，用于打造隐秘且强大的后门触发器。Sensitron采用渐进细化方法，其中动态元敏感性分析（DMSA）首先识别潜在脆弱的输入令牌，分层SHAP估计（H-SHAP）然后提供可解释的归因以准确地确定最有影响力的令牌，最后是一个Plug-and-Rank机制，生成情境适当的触发器。我们建立了解释能力评分和实证攻击成功之间的第一个数学相关性（敏感性排名相关性，SRC=0.83），从而实现对模型漏洞的精确定位。Sensitron实现了97.8%的攻击成功率（ASR）（比最先进技术（SOTA）高5.8%），在0.1%的中毒率下ASR为85.4%，展示了对多种SOTA防御的强大抗拒力。这项工作揭示了基本的NLP漏洞，并通过武器化的可解释性提供了新的攻击向量。

更新时间: 2025-09-23 14:49:00

领域: cs.CR

下载: http://arxiv.org/abs/2509.19101v1

Algorithms for Adversarially Robust Deep Learning

Given the widespread use of deep learning models in safety-critical applications, ensuring that the decisions of such models are robust against adversarial exploitation is of fundamental importance. In this thesis, we discuss recent progress toward designing algorithms that exhibit desirable robustness properties. First, we discuss the problem of adversarial examples in computer vision, for which we introduce new technical results, training paradigms, and certification algorithms. Next, we consider the problem of domain generalization, wherein the task is to train neural networks to generalize from a family of training distributions to unseen test distributions. We present new algorithms that achieve state-of-the-art generalization in medical imaging, molecular identification, and image classification. Finally, we study the setting of jailbreaking large language models (LLMs), wherein an adversarial user attempts to design prompts that elicit objectionable content from an LLM. We propose new attacks and defenses, which represent the frontier of progress toward designing robust language-based agents.

Updated: 2025-09-23 14:48:58

标题: 对抗鲁棒深度学习的算法

摘要: 鉴于深度学习模型在安全关键应用中的广泛使用，确保这些模型的决策对抗性利用具有鲁棒性是至关重要的。在本论文中，我们讨论了设计具有良好鲁棒性属性的算法的最新进展。首先，我们讨论了计算机视觉中的对抗性示例问题，我们引入了新的技术结果、训练范式和认证算法。接下来，我们考虑了领域泛化的问题，其中任务是训练神经网络从一组训练分布泛化到看不见的测试分布。我们提出了在医学图像、分子鉴定和图像分类中实现最先进泛化的新算法。最后，我们研究了破解大型语言模型(LLMs)的情景，其中一个对抗性用户试图设计提示来引出LLM中的不良内容。我们提出了新的攻击和防御方法，这代表了朝着设计鲁棒语言代理的进步前沿。

更新时间: 2025-09-23 14:48:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19100v1

EngravingGNN: A Hybrid Graph Neural Network for End-to-End Piano Score Engraving

This paper focuses on automatic music engraving, i.e., the creation of a humanly-readable musical score from musical content. This step is fundamental for all applications that include a human player, but it remains a mostly unexplored topic in symbolic music processing. In this work, we formalize the problem as a collection of interdependent subtasks, and propose a unified graph neural network (GNN) framework that targets the case of piano music and quantized symbolic input. Our method employs a multi-task GNN to jointly predict voice connections, staff assignments, pitch spelling, key signature, stem direction, octave shifts, and clef signs. A dedicated postprocessing pipeline generates print-ready MusicXML/MEI outputs. Comprehensive evaluation on two diverse piano corpora (J-Pop and DCML Romantic) demonstrates that our unified model achieves good accuracy across all subtasks, compared to existing systems that only specialize in specific subtasks. These results indicate that a shared GNN encoder with lightweight task-specific decoders in a multi-task setting offers a scalable and effective solution for automatic music engraving.

Updated: 2025-09-23 14:48:35

标题: EngravingGNN：一种用于端到端钢琴谱刻版的混合图神经网络

摘要: 本文着重研究自动音乐排版，即从音乐内容生成人类可读的乐谱。这一步骤对于所有包含人类演奏者的应用来说至关重要，但在符号音乐处理中仍然是一个大多数未被探索的主题。在这项工作中，我们将问题形式化为一系列相互依赖的子任务，并提出了一个针对钢琴音乐和量化符号输入的统一图神经网络（GNN）框架。我们的方法采用多任务GNN来共同预测声部连接、谱员分配、音高拼写、调号、符杆方向、八度变化和谱号。一个专门的后处理管道生成适合打印的MusicXML/MEI输出。对两个不同的钢琴语料库（J-Pop和DCML浪漫）进行全面评估表明，与只专注于特定子任务的现有系统相比，我们的统一模型在所有子任务上都取得了很好的准确性。这些结果表明，在多任务设置中，具有轻量级任务特定解码器的共享GNN编码器为自动音乐排版提供了可扩展和有效的解决方案。

更新时间: 2025-09-23 14:48:35

领域: cs.GR,cs.AI

下载: http://arxiv.org/abs/2509.19412v1

DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Logic reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations on the logic reasoning skill. Meanwhile, existing logic reasoning benchmarks are limited in language diversity and their distributions are deviated from the distribution of an ideal logic reasoning benchmark, which may lead to biased evaluation results. This paper thereby proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.

Updated: 2025-09-23 14:48:18

标题: DivLogicEval：用于在大型语言模型中进行逻辑推理评估的基准框架

摘要: 自然语言中的逻辑推理被认为是大型语言模型（LLMs）人类智能的重要衡量标准。流行的基准可能会涉及多种推理技能，因此可能会提供对逻辑推理技能不忠实的评估。同时，现有的逻辑推理基准在语言多样性方面存在局限性，并且它们的分布偏离了理想逻辑推理基准的分布，这可能导致偏倚的评估结果。因此，本文提出了一个新的经典逻辑基准DivLogicEval，由以反直觉方式组成的多样性陈述组成的自然句子。为了确保更可靠的评估，我们还引入了一种新的评估指标，该指标减轻了LLMs固有的偏倚和随机性的影响。通过实验，我们展示了在DivLogicEval中回答问题所需的逻辑推理程度，并比较了不同流行的LLMs在进行逻辑推理方面的表现。

更新时间: 2025-09-23 14:48:18

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.15587v2

Asymptotically Optimal Problem-Dependent Bandit Policies for Transfer Learning

We study the non-contextual multi-armed bandit problem in a transfer learning setting: before any pulls, the learner is given N'_k i.i.d. samples from each source distribution nu'_k, and the true target distributions nu_k lie within a known distance bound d_k(nu_k, nu'_k) <= L_k. In this framework, we first derive a problem-dependent asymptotic lower bound on cumulative regret that extends the classical Lai-Robbins result to incorporate the transfer parameters (d_k, L_k, N'_k). We then propose KL-UCB-Transfer, a simple index policy that matches this new bound in the Gaussian case. Finally, we validate our approach via simulations, showing that KL-UCB-Transfer significantly outperforms the no-prior baseline when source and target distributions are sufficiently close.

Updated: 2025-09-23 14:47:42

标题: 渐近最优的问题相关赌博机策略用于迁移学习

摘要: 我们研究了在转移学习设置下的非上下文多臂赌博问题：在任何拉动之前，学习者从每个源分布ν'_k中获得N'_k个独立同分布样本，并且真实目标分布ν_k在已知距离上界d_k(nu_k, nu'_k) <= L_k内。在这个框架中，我们首先推导了一个依赖于问题的渐近累积遗憾下界，将经典的Lai-Robbins结果扩展到合并转移参数(d_k, L_k, N'_k)。然后，我们提出了KL-UCB-Transfer，这是一个简单的指标策略，在高斯情况下匹配了这个新的下界。最后，我们通过模拟验证了我们的方法，显示出当源分布和目标分布足够接近时，KL-UCB-Transfer明显优于无先验基线。

更新时间: 2025-09-23 14:47:42

领域: cs.LG,math.ST,stat.TH

下载: http://arxiv.org/abs/2509.19098v1

Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems

Failure attribution in multi-agent systems -- pinpointing the exact step where a decisive error occurs -- is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs, leading to critically low step-level accuracy (below 17\%), which renders them impractical for debugging complex systems. Their core weakness is a fundamental inability to perform robust counterfactual reasoning: to determine if correcting a single action would have actually averted the task failure. To bridge this \emph{counterfactual inference gap}, we introduce Abduct-Act-Predict (A2P) Scaffolding, a novel agent framework that transforms failure attribution from pattern recognition into a structured causal inference task. A2P explicitly guides a large language model through a formal three-step reasoning process within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent's actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify if the intervention resolves the failure. This structured approach leverages the holistic context of the entire conversation while imposing a rigorous causal logic on the model's analysis. Our extensive experiments on the Who\&When benchmark demonstrate its efficacy. On the Algorithm-Generated dataset, A2P achieves 47.46\% step-level accuracy, a 2.85$\times$ improvement over the 16.67\% of the baseline. On the more complex Hand-Crafted dataset, it achieves 29.31\% step accuracy, a 2.43$\times$ improvement over the baseline's 12.07\%. By reframing the problem through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution. Ours code are released at https://github.com/ResearAI/A2P.

Updated: 2025-09-23 14:45:53

标题: 绑架、行动、预测：自动化故障归因中的因果推理支架在多智能体系统中

摘要: 在多智能体系统中，故障归因——准确定位关键错误发生的确切步骤——是一个关键但尚未解决的挑战。当前的方法将这视为对长对话记录的模式识别任务，导致关键步骤级别的准确率极低（低于17\%），这使它们在调试复杂系统时变得不切实际。它们的核心弱点是无法进行强大的反事实推理：即确定纠正单个动作是否实际上可以避免任务失败。为了弥合这种反事实推理差距，我们引入了Abduct-Act-Predict（A2P）脚手架，这是一个新颖的代理框架，将故障归因从模式识别转变为结构化因果推理任务。A2P明确引导一个大型语言模型通过一个形式化的三步推理过程，在单个推理过程中：（1）Abduction，推断代理动作背后的隐藏根本原因；（2）Action，定义一个最小的纠正干预；以及（3）Prediction，模拟后续轨迹并验证干预是否解决了失败。这种结构化方法利用整个对话的整体背景，同时对模型的分析施加严格的因果逻辑。我们在Who\&When基准测试上进行了大量实验，证明了其有效性。在Algorithm-Generated数据集上，A2P实现了47.46\%的步骤级准确率，比基线的16.67\%提高了2.85倍。在更复杂的Hand-Crafted数据集上，它实现了29.31\%的步骤准确率，比基线的12.07%提高了2.43倍。通过用因果视角重新定义问题，A2P脚手架提供了一个强大、可验证且显著更准确的自动故障归因解决方案。我们的代码发布在https://github.com/ResearAI/A2P。

更新时间: 2025-09-23 14:45:53

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.10401v2

Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering

Personalization is essential for adapting question answering (QA) systems to user-specific information needs, thereby improving both accuracy and user satisfaction. However, personalized QA remains relatively underexplored due to challenges such as inferring preferences from long, noisy, and implicit contexts, and generating responses that are simultaneously correct, contextually appropriate, and aligned with user expectations and background knowledge. To address these challenges, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. The approach models the reasoning of an LLM as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark for personalized QA show that PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement. Human evaluation corroborates these results, with annotators preferring outputs from PoT in 66% of cases and reporting ties in only 15% of cases.

Updated: 2025-09-23 14:44:46

标题: 思维路径：多向思考用于长篇个性化问答

摘要: 个性化对于调整问答（QA）系统以适应用户特定信息需求至关重要，从而提高准确性和用户满意度。然而，个性化QA仍然相对较少被探索，原因是要从长期、嘈杂和隐含的上下文中推断偏好，以及生成同时正确、上下文适当且与用户期望和背景知识一致的响应。为了解决这些挑战，我们提出了思维路径（PoT），这是一种推理阶段方法，适用于任何大型语言模型（LLM），而不需要特定任务的微调。该方法将LLM的推理建模为一个迭代决策过程，在这个过程中，模型动态选择推理、修订、个性化和澄清等认知操作。这使得可以探索多个推理轨迹，生成捕捉不同视角的多样化候选响应。然后，根据推断的用户偏好聚合和重新加权这些候选响应，产生最终的个性化响应，受益于不同推理路径的互补优势。在个性化QA的LaMP-QA基准上进行的实验表明，PoT始终优于竞争基线，实现了高达13.1％的相对改进。人类评估证实了这些结果，注释者在66％的情况下更喜欢PoT的输出，并且仅在15％的情况下报告平局。

更新时间: 2025-09-23 14:44:46

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2509.19094v1

Training Flow Matching Models with Reliable Labels via Self-Purification

Training datasets are inherently imperfect, often containing mislabeled samples due to human annotation errors, limitations of tagging models, and other sources of noise. Such label contamination can significantly degrade the performance of a trained model. In this work, we introduce Self-Purifying Flow Matching (SPFM), a principled approach to filtering unreliable data within the flow-matching framework. SPFM identifies suspicious data using the model itself during the training process, bypassing the need for pretrained models or additional modules. Our experiments demonstrate that models trained with SPFM generate samples that accurately adhere to the specified conditioning, even when trained on noisy labels. Furthermore, we validate the robustness of SPFM on the TITW dataset, which consists of in-the-wild speech data, achieving performance that surpasses existing baselines.

Updated: 2025-09-23 14:43:27

标题: 通过自净化训练流匹配模型以获得可靠标签

摘要: 培训数据集本质上是不完美的，通常包含由于人为注释错误、标记模型的局限性和其他噪音来源而导致的标签错误样本。这种标签污染可以显著降低训练模型的性能。在这项工作中，我们引入了自净流匹配（SPFM），这是一种在流匹配框架内过滤不可靠数据的原则性方法。SPFM在训练过程中使用模型本身识别可疑数据，无需预先训练的模型或额外模块。我们的实验表明，使用SPFM训练的模型生成的样本准确地符合指定的条件，即使在噪声标签上进行训练也是如此。此外，我们在包含野外语音数据的TITW数据集上验证了SPFM的稳健性，实现了超越现有基准线的性能。

更新时间: 2025-09-23 14:43:27

领域: eess.AS,cs.AI,cs.SD

下载: http://arxiv.org/abs/2509.19091v1

Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

Updated: 2025-09-23 14:42:31

标题: 柑橘-V：通过统一的医学影像基础推进临床推理的医学基础模型

摘要: 医学影像提供了临床诊断、治疗规划和手术决策的关键证据，然而大多数现有的影像模型过于狭窄，需要多个专门网络，限制了它们的泛化能力。虽然大规模的语言和多模态模型展现出强大的推理和多任务能力，但现实世界中的临床应用需要精确的视觉基础、多模态集成和思维链推理。我们介绍了Citrus-V，一个结合图像分析和文本推理的多模态医学基础模型。该模型集成了检测、分割和多模态思维链推理，实现了像素级损伤定位、结构化报告生成和类似医生的诊断推理在一个框架中。我们提出了一种新颖的多模态训练方法，并发布了一个涵盖推理、检测、分割和文档理解任务的精选开源数据套件。评估表明，Citrus-V在多个基准测试中优于现有的开源医学模型和专家级影像系统，提供了从视觉基础到临床推理的统一管道，并支持精确的损伤量化、自动报告和可靠的第二意见。

更新时间: 2025-09-23 14:42:31

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.19090v1

A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement

Do "digital twins" capture individual responses in surveys and experiments? We run 19 pre-registered studies on a national U.S. panel and their LLM-powered digital twins (constructed based on previously-collected extensive individual-level data) and compare twin and human answers across 164 outcomes. The correlation between twin and human answers is modest (approximately 0.2 on average) and twin responses are less variable than human responses. While constructing digital twins based on rich individual-level data improves our ability to capture heterogeneity across participants and predict relative differences between them, it does not substantially improve our ability to predict the exact answers given by specific participants or enhance predictions of population means. Twin performance varies by domain and is higher among more educated, higher-income, and ideologically moderate participants. These results suggest current digital twins can capture some degree of relative differences but are unreliable for individual-level predictions and sample mean and variance estimation, underscoring the need for careful validation before use. Our data and code are publicly available for researchers and practitioners interested in optimizing digital twin pipelines.

Updated: 2025-09-23 14:42:14

标题: 一个数字孪生的大规模研究揭示了其优势、劣势和进一步改进的机会。

摘要: 数字孪生是否能够捕捉调查和实验中的个体反应？我们在一个美国全国面板上进行了19项预先注册的研究，使用LLM技术构建了数字孪生（基于先前收集的大量个体级数据），并比较了孪生和人类在164个结果上的答案。孪生和人类答案之间的相关性较小（平均约为0.2），孪生答案比人类答案变化较小。虽然基于丰富的个体级数据构建数字孪生提高了我们捕捉参与者之间的异质性和预测他们之间相对差异的能力，但并没有实质性地提高我们预测特定参与者给出的确切答案或增强对总体均值的预测能力。孪生表现在不同领域有所差异，受过更高教育、收入更高和意识形态更温和的参与者的表现更好。这些结果表明当前的数字孪生能够捕捉一定程度的相对差异，但在个体级预测和样本均值和方差估计方面不可靠，强调在使用前需要进行仔细验证。我们的数据和代码对于对数字孪生管道进行优化感兴趣的研究人员和实践者是公开可用的。

更新时间: 2025-09-23 14:42:14

领域: cs.CY,cs.AI,cs.HC,stat.AP

下载: http://arxiv.org/abs/2509.19088v1

Graph Neural Networks with Similarity-Navigated Probabilistic Feature Copying

Graph Neural Networks (GNNs) have demonstrated remarkable success across various graph-based tasks. However, they face some fundamental limitations: feature oversmoothing can cause node representations to become indistinguishable in deeper networks, they struggle to effectively manage heterogeneous relationships where connected nodes differ significantly, and they process entire feature vectors as indivisible units, which limits flexibility. We seek to address these limitations. We propose AxelGNN, a novel GNN architecture inspired by Axelrod's cultural dissemination model that addresses these limitations through a unified framework. AxelGNN incorporates similarity-gated probabilistic interactions that adaptively promote convergence or divergence based on node similarity, implements trait-level copying mechanisms for fine-grained feature aggregation at the segment level, and maintains global polarization to preserve node distinctiveness across multiple representation clusters. The model's bistable convergence dynamics naturally handle both homophilic and heterophilic graphs within a single architecture. Extensive experiments on node classification and influence estimation benchmarks demonstrate that AxelGNN consistently outperforms or matches state-of-the-art GNN methods across diverse graph structures with varying homophily-heterophily characteristics.

Updated: 2025-09-23 14:39:09

标题: 具有相似性导航概率特征复制的图神经网络

摘要: 图神经网络（GNNs）在各种基于图的任务中取得了显著的成功。然而，它们面临一些基本限制：特征过度平滑可能导致节点表示在深层网络中变得无法区分，它们难以有效地管理连接节点之间明显不同的异质关系，并且它们将整个特征向量作为不可分割的单元进行处理，这限制了其灵活性。我们致力于解决这些限制。我们提出了AxelGNN，这是一种受Axelrod文化传播模型启发的新型GNN架构，通过统一框架解决了这些限制。AxelGNN包括相似性门控概率交互，根据节点相似性自适应地促进收敛或分歧，实现了在段级别进行细粒度特征聚合的特征级复制机制，并保持全局极化以保留跨多个表示簇的节点独特性。该模型的双稳态收敛动力学自然地处理同质和异质图在单个架构中。在节点分类和影响估计基准上进行的大量实验表明，AxelGNN在各种具有不同同质性-异质性特征的图结构上始终优于或与最先进的GNN方法相匹配。

更新时间: 2025-09-23 14:39:09

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19084v1

World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation

Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine polices to alleviate this limitation, yet real-robot training is costly and unsafe, while training in simulators suffers from the sim-to-real gap. Recent advances in generative models have demonstrated remarkable capabilities in real-world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model-based world models can be combined to enhance pre-trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end-to-end policy optimization. World4RL is designed around two principles: pre-training a diffusion world model that captures diverse dynamics on multi-task datasets and refining policies entirely within a frozen world model to avoid online real-world interactions. We further design a two-hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real-world experiments demonstrate that World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning and other baselines. More visualization results are available at https://world4rl.github.io/.

Updated: 2025-09-23 14:38:15

标题: World4RL: 强化学习中用于机器人操纵的扩散世界模型的政策细化

摘要: 机器人操作策略通常通过模仿学习进行初始化，但其性能受限于专家数据的稀缺性和覆盖范围狭窄。强化学习可以优化策略以缓解这一限制，然而在真实机器人训练过程中成本高昂且存在安全隐患，而在模拟器中训练则受到模拟到真实世界之间的差距的影响。生成模型的最新进展在真实世界模拟中展示出了卓越的能力，特别是扩散模型在生成方面表现出色。这引发了一个问题，即如何结合基于扩散模型的世界模型来增强机器人操作中预训练的策略。在这项工作中，我们提出了World4RL，这是一个框架，利用基于扩散的世界模型作为高保真度模拟器，在想象的环境中完全优化预先训练的机器人操作策略。与之前主要使用世界模型进行规划的作品不同，我们的框架实现了直接端到端策略优化。World4RL围绕两个原则设计：预训练一个捕捉多任务数据集上多样动态的扩散世界模型，以及在冻结的世界模型内完全优化策略，以避免在线真实世界交互。我们进一步设计了一种针对机器人操作的两热动作编码方案，并采用扩散骨干来提高建模保真度。广泛的仿真和真实世界实验表明，World4RL提供了高保真度的环境建模，并实现了一致的策略优化，使成功率显著高于模仿学习和其他基线。更多可视化结果请访问https://world4rl.github.io/。

更新时间: 2025-09-23 14:38:15

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.19080v1

Diffusion Bridge Variational Inference for Deep Gaussian Processes

Deep Gaussian processes (DGPs) enable expressive hierarchical Bayesian modeling but pose substantial challenges for posterior inference, especially over inducing variables. Denoising diffusion variational inference (DDVI) addresses this by modeling the posterior as a time-reversed diffusion from a simple Gaussian prior. However, DDVI's fixed unconditional starting distribution remains far from the complex true posterior, resulting in inefficient inference trajectories and slow convergence. In this work, we propose Diffusion Bridge Variational Inference (DBVI), a principled extension of DDVI that initiates the reverse diffusion from a learnable, data-dependent initial distribution. This initialization is parameterized via an amortized neural network and progressively adapted using gradients from the ELBO objective, reducing the posterior gap and improving sample efficiency. To enable scalable amortization, we design the network to operate on the inducing inputs, which serve as structured, low-dimensional summaries of the dataset and naturally align with the inducing variables' shape. DBVI retains the mathematical elegance of DDVI, including Girsanov-based ELBOs and reverse-time SDEs,while reinterpreting the prior via a Doob-bridged diffusion process. We derive a tractable training objective under this formulation and implement DBVI for scalable inference in large-scale DGPs. Across regression, classification, and image reconstruction tasks, DBVI consistently outperforms DDVI and other variational baselines in predictive accuracy, convergence speed, and posterior quality.

Updated: 2025-09-23 14:36:47

标题: 深高斯过程的扩散桥变分推断

摘要: 深高斯过程（DGPs）实现了表达丰富的分层贝叶斯建模，但对后验推断提出了重大挑战，特别是对感应变量的推断。去噪扩散变分推断（DDVI）通过将后验建模为从简单高斯先验开始的时间反向扩散来解决这个问题。然而，DDVI的固定无条件起始分布与复杂的真实后验相差甚远，导致推断轨迹低效且收敛缓慢。在这项工作中，我们提出了扩散桥变分推断（DBVI），这是DDVI的一个合理的扩展，它从一个可学习的、数据相关的初始分布开始反向扩散。这种初始化通过一个摊销神经网络进行参数化，并逐渐通过ELBO目标的梯度进行调整，减少后验差距并提高样本效率。为了实现可扩展的摊销，我们设计网络在感应输入上运行，这些输入作为数据集的结构化、低维摘要，并与感应变量的形状自然对齐。DBVI保留了DDVI的数学优雅性，包括基于Girsanov的ELBO和逆时SDEs，同时通过Doob桥接扩散过程对先验进行重新解释。我们在这种表述下推导出一个可处理的训练目标，并实现了DBVI，用于大规模DGPs中的可扩展推断。在回归、分类和图像重建任务中，DBVI在预测准确性、收敛速度和后验质量方面一贯优于DDVI和其他变分基线。

更新时间: 2025-09-23 14:36:47

领域: cs.LG

下载: http://arxiv.org/abs/2509.19078v1

Universal Dynamics with Globally Controlled Analog Quantum Simulators

Analog quantum simulators with global control fields have emerged as powerful platforms for exploring complex quantum phenomena. Recent breakthroughs, such as the coherent control of thousands of atoms, highlight the growing potential for quantum applications at scale. Despite these advances, a fundamental theoretical question remains unresolved: to what extent can such systems realize universal quantum dynamics under global control? Here we establish a necessary and sufficient condition for universal quantum computation using only global pulse control, proving that a broad class of analog quantum simulators is, in fact, universal. We further extend this framework to fermionic and bosonic systems, including modern platforms such as ultracold atoms in optical superlattices. Crucially, to connect the theoretical possibility with experimental reality, we introduce a new control technique into the experiment - direct quantum optimal control. This method enables the synthesis of complex effective Hamiltonians and allows us to incorporate realistic hardware constraints. To show its practical power, we experimentally engineer three-body interactions outside the blockade regime and demonstrate topological dynamics on a Rydberg atom array. Using the new control framework, we overcome key experimental challenges, including hardware limitations and atom position fluctuations in the non-blockade regime, by identifying smooth, short-duration pulses that achieve high-fidelity dynamics. Experimental measurements reveal dynamical signatures of symmetry-protected-topological edge modes, confirming both the expressivity and feasibility of our approach. Our work opens a new avenue for quantum simulation beyond native hardware Hamiltonians, enabling the engineering of effective multi-body interactions and advancing the frontier of quantum information processing with globally-controlled analog platforms.

Updated: 2025-09-23 14:36:43

标题: 具有全局控制的模拟量子模拟器的通用动力学

摘要: 使用全局控制场的模拟量子模拟器已经成为探索复杂量子现象的强大平台。最近的突破，如对成千上万个原子进行相干控制，突显了量子应用在规模上的不断增长的潜力。尽管取得了这些进展，一个基本的理论问题仍未解决：这类系统在全局控制下能实现多大程度的通用量子动力学？在这里，我们建立了一个仅使用全局脉冲控制的通用量子计算的必要和充分条件，证明了一类广泛的模拟量子模拟器实际上是通用的。我们进一步将这一框架扩展到费米子和玻色子系统，包括现代平台，如光学超晶格中的超冷原子。关键的是，为了将理论可能性与实验现实联系起来，我们在实验中引入了一种新的控制技术 - 直接量子最优控制。这种方法能够合成复杂的有效哈密顿量，并允许我们考虑现实的硬件约束。为了展示其实用性，我们在阻塞区域之外实验工程了三体相互作用，并在一组Rydberg原子上展示了拓扑动力学。利用新的控制框架，我们克服了关键的实验挑战，包括硬件限制和非阻塞区域中原子位置波动，通过识别实现高保真度动力学的平滑、短时脉冲。实验测量揭示了对称保护的拓扑边缘模式的动力学特征，确认了我们方法的表达能力和可行性。我们的工作为超越本地硬件哈密顿量的量子模拟开辟了新的途径，实现了有效多体相互作用的工程化，并推进了全局控制模拟平台的量子信息处理前沿。

更新时间: 2025-09-23 14:36:43

领域: quant-ph,cond-mat.quant-gas,cond-mat.str-el,cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2508.19075v2

Code Driven Planning with Domain-Adaptive Critic

Large Language Models (LLMs) have been widely adopted as task planners for AI agents in sequential decision-making problems, leveraging their extensive world knowledge. However, the gap between their general knowledge and environment-specific requirements often leads to inaccurate plans. To address this, existing approaches rely on frequent LLM queries to iteratively refine plans based on immediate environmental feedback, which incurs substantial query costs. However, this refinement is typically guided by short-term environmental feedback, limiting LLMs from developing plans aligned with long-term rewards. We propose Code Driven Planning with Domain-Adaptive Critic (CoPiC). Instead of relying on frequent queries, CoPiC employs LLMs to generate a diverse set of high-level planning programs, which iteratively produce and refine candidate plans. A trained domain-adaptive critic then evaluates these candidates and selects the one most aligned with long-term rewards for execution. Using high-level planning programs as planner and domain-adaptive critic as estimator, CoPiC improves planning while significantly reducing query costs. Results in ALFWorld, NetHack, and StarCraft II Unit Building show that CoPiC outperforms advanced LLM-based baselines, AdaPlanner and Reflexion, achieving an average (1) 23.33% improvement in success rate and (2) 91.27% reduction in query costs.

Updated: 2025-09-23 14:36:12

标题: 代码驱动的规划与领域自适应评论者

摘要: 大型语言模型（LLMs）已被广泛采用作为人工智能代理在顺序决策问题中的任务规划器，利用它们广泛的世界知识。然而，它们的一般知识与特定环境需求之间的差距经常导致不准确的计划。为了解决这个问题，现有方法依赖于频繁的LLM查询，以基于即时环境反馈迭代地优化计划，这会产生相当大的查询成本。然而，这种优化通常是由短期环境反馈引导的，限制了LLMs制定符合长期奖励的计划。我们提出了一种基于代码驱动规划和领域自适应评论家（CoPiC）的方法。CoPiC不依赖于频繁的查询，而是利用LLMs生成多样化的高级规划程序，这些程序迭代地生成和优化候选计划。经过训练的领域自适应评论家评估这些候选计划，并选择与长期奖励最符合的计划进行执行。使用高级规划程序作为规划器和领域自适应评论家作为估计器，CoPiC在显著减少查询成本的同时改进了规划。在ALFWorld、NetHack和StarCraft II Unit Building中的结果表明，CoPiC胜过先进的基于LLM的基线，AdaPlanner和Reflexion，实现了（1）成功率提高23.33%和（2）查询成本降低91.27%的平均改善。

更新时间: 2025-09-23 14:36:12

领域: cs.AI

下载: http://arxiv.org/abs/2509.19077v1

EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes

For decades, classical point process models, such as the epidemic-type aftershock sequence (ETAS) model, have been widely used for forecasting the event times and locations of earthquakes. Recent advances have led to Neural Point Processes (NPPs), which promise greater flexibility and improvements over such classical models. However, the currently-used benchmark for NPPs does not represent an up-to-date challenge in the seismological community, since it contains data leakage and omits the largest earthquake sequence from the region. Additionally, initial earthquake forecasting benchmarks fail to compare NPPs with state-of-the-art forecasting models commonly used in seismology. To address these gaps, we introduce EarthquakeNPP: a collection of benchmark datasets to facilitate testing of NPPs on earthquake data, accompanied by an implementation of the state-of-the-art forecasting model: ETAS. The datasets cover a range of small to large target regions within California, dating from 1971 to 2021, and include different methodologies for dataset generation. Benchmarking experiments, using both log-likelihood and generative evaluation metrics widely recognised in seismology, show that none of the five NPPs tested outperform ETAS. These findings suggest that current NPP implementations are not yet suitable for practical earthquake forecasting. Nonetheless, EarthquakeNPP provides a platform to foster future collaboration between the seismology and machine learning communities.

Updated: 2025-09-23 14:34:47

标题: EarthquakeNPP：神经点过程地震预测的基准Benchmark

摘要: 几十年来，诸如流行病型余震序列（ETAS）模型等经典点过程模型已被广泛用于预测地震的事件时间和位置。最近的进展导致了神经点过程（NPPs），它们承诺比这些经典模型具有更大的灵活性和改进。然而，目前用于NPPs的基准并不代表地震学界的最新挑战，因为它存在数据泄漏并省略了该地区最大的地震序列。此外，最初的地震预测基准未能将NPPs与地震学中常用的最先进的预测模型进行比较。为了弥补这些差距，我们引入了EarthquakeNPP：一个基准数据集集合，旨在促进NPPs在地震数据上的测试，并附带一个最先进的预测模型的实现：ETAS。这些数据集涵盖了加利福尼亚州内的一系列小到大的目标区域，时间跨度从1971年到2021年，并包括了不同的数据集生成方法。使用地震学中广泛认可的对数似然和生成评估指标进行基准实验表明，五个经过测试的NPPs中没有一个能胜过ETAS。这些发现表明，当前的NPP实现尚不适用于实际的地震预测。尽管如此，EarthquakeNPP提供了一个平台，促进地震学和机器学习社区之间未来的合作。

更新时间: 2025-09-23 14:34:47

领域: physics.geo-ph,cs.LG,stat.AP,stat.ML

下载: http://arxiv.org/abs/2410.08226v2

A Multimodal Conversational Assistant for the Characterization of Agricultural Plots from Geospatial Open Data

The increasing availability of open Earth Observation (EO) and agricultural datasets holds great potential for supporting sustainable land management. However, their high technical entry barrier limits accessibility for non-expert users. This study presents an open-source conversational assistant that integrates multimodal retrieval and large language models (LLMs) to enable natural language interaction with heterogeneous agricultural and geospatial data. The proposed architecture combines orthophotos, Sentinel-2 vegetation indices, and user-provided documents through retrieval-augmented generation (RAG), allowing the system to flexibly determine whether to rely on multimodal evidence, textual knowledge, or both in formulating an answer. To assess response quality, we adopt an LLM-as-a-judge methodology using Qwen3-32B in a zero-shot, unsupervised setting, applying direct scoring in a multi-dimensional quantitative evaluation framework. Preliminary results show that the system is capable of generating clear, relevant, and context-aware responses to agricultural queries, while remaining reproducible and scalable across geographic regions. The primary contributions of this work include an architecture for fusing multimodal EO and textual knowledge sources, a demonstration of lowering the barrier to access specialized agricultural information through natural language interaction, and an open and reproducible design.

Updated: 2025-09-23 14:32:50

标题: 一个用于从地理空间开放数据中表征农业地块的多模态对话助手

摘要: 地球观测（EO）和农业数据集的开放性日益增加，为支持可持续土地管理提供了巨大潜力。然而，它们的技术门槛较高，限制了非专业用户的可访问性。本研究提出了一种开源的会话助手，整合了多模式检索和大型语言模型（LLMs），以实现与异构农业和地理空间数据的自然语言交互。所提出的架构通过检索增强生成（RAG）结合正射影像、哨兵-2植被指数和用户提供的文档，使系统能够灵活地确定在制定答案时是依赖于多模式证据、文本知识还是两者兼而有之。为了评估响应质量，我们采用了LLM作为评判者的方法，在零-shot、无监督设置下使用Qwen3-32B进行直接评分，在多维定量评估框架中应用。初步结果显示，该系统能够产生清晰、相关和上下文感知的农业查询响应，同时在地理区域跨度上保持可重现性和可扩展性。本研究的主要贡献包括融合多模式EO和文本知识源的架构，通过自然语言交互降低专门农业信息获取的门槛，以及开放和可重现的设计。

更新时间: 2025-09-23 14:32:50

领域: cs.AI

下载: http://arxiv.org/abs/2509.17544v2

AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection

Recent advances in Artificial Intelligence Generated Content have led to highly realistic synthetic videos, particularly in human-centric scenarios involving speech, gestures, and full-body motion, posing serious threats to information authenticity and public trust. Unlike DeepFake techniques that focus on localized facial manipulation, human-centric video generation methods can synthesize entire human bodies with controllable movements, enabling complex interactions with environments, objects, and even other people. However, existing detection methods largely overlook the growing risks posed by such full-body synthetic content. Meanwhile, a growing body of research has explored leveraging LLMs for interpretable fake detection, aiming to explain decisions in natural language. Yet these approaches heavily depend on supervised fine-tuning, which introduces limitations such as annotation bias, hallucinated supervision, and weakened generalization. To address these challenges, we propose AvatarShield, a novel multimodal human-centric synthetic video detection framework that eliminates the need for dense textual supervision by adopting Group Relative Policy Optimization, enabling LLMs to develop reasoning capabilities from simple binary labels. Our architecture combines a discrete vision tower for high-level semantic inconsistencies and a residual extractor for fine-grained artifact analysis. We further introduce FakeHumanVid, a large-scale benchmark containing 15K real and synthetic videos across nine state-of-the-art human generation methods driven by text, pose, or audio. Extensive experiments demonstrate that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.

Updated: 2025-09-23 14:29:40

标题: AvatarShield：面向人类中心的合成视频检测的视觉强化学习

摘要: 最近人工智能生成内容的进展导致了高度逼真的合成视频，特别是在涉及语音、手势和全身动作的以人为中心的场景中，这给信息真实性和公众信任带来了严重威胁。与DeepFake技术侧重于局部面部操作不同，以人为中心的视频生成方法可以合成完整的人体，并具有可控的动作，使其能够与环境、物体甚至其他人进行复杂互动。然而，现有的检测方法很大程度上忽视了由这种全身合成内容带来的不断增加的风险。与此同时，越来越多的研究探讨了利用LLMs进行可解释的假检测，旨在用自然语言解释决策。然而，这些方法严重依赖于监督微调，这会引入诸如注释偏见、产生监督和弱化泛化等限制。为了解决这些挑战，我们提出了AvatarShield，一种新颖的多模态以人为中心的合成视频检测框架，通过采用群体相对策略优化来消除对密集文本监督的需求，使LLMs能够从简单的二元标签中发展出推理能力。我们的架构结合了一个离散视觉塔，用于高级语义不一致性和一个残差提取器，用于细粒度的工件分析。我们进一步引入了FakeHumanVid，一个包含15K真实和合成视频的大型基准，跨九种最先进的以文本、姿势或音频驱动的人体生成方法。大量实验表明，AvatarShield在领域内和跨领域设置中均优于现有方法。

更新时间: 2025-09-23 14:29:40

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2505.15173v3

The Narcissus Hypothesis: Descending to the Rung of Illusion

Modern foundational models increasingly reflect not just world knowledge, but patterns of human preference embedded in their training data. We hypothesize that recursive alignment-via human feedback and model-generated corpora-induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We refer to it as the Narcissus Hypothesis and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl's Ladder of Causality, culminating in what we refer to as the Rung of Illusion.

Updated: 2025-09-23 14:28:10

标题: 自恋假设：降临幻象阶层

摘要: 现代基础模型越来越反映不仅是世界知识，还有嵌入在它们的训练数据中的人类偏好模式。我们假设通过人类反馈和模型生成的语料库进行递归对齐会引发社会可取性偏见，促使模型倾向于支持令人愉悦或讨人喜欢的回应，而不是客观推理。我们将其称为自恋假设，并通过使用标准化人格评估和一种新颖的社会可取性偏见分数在31个模型中进行测试。结果显示出明显向社会一致特征的漂移，这对语料库的完整性和下游推理的可靠性有深远的影响。然后，我们提供了一种新颖的认识论解释，追溯递归偏见如何导致高阶推理崩溃到因果关系的珍珠梯的底部，最终导致我们所称为幻觉的阶梯。

更新时间: 2025-09-23 14:28:10

领域: cs.CY,cs.AI,cs.HC,cs.LG

下载: http://arxiv.org/abs/2509.17999v2

Beyond Backpropagation: Exploring Innovative Algorithms for Energy-Efficient Deep Neural Network Training

The rising computational and energy demands of deep neural networks (DNNs), driven largely by backpropagation (BP), challenge sustainable AI development. This paper rigorously investigates three BP-free training methods: the Forward-Forward (FF), Cascaded-Forward (CaFo), and Mono-Forward (MF) algorithms, tracing their progression from foundational concepts to a demonstrably superior solution. A robust comparative framework was established: each algorithm was implemented on its native architecture (MLPs for FF and MF, a CNN for CaFo) and benchmarked against an equivalent BP-trained model. Hyperparameters were optimized with Optuna, and consistent early stopping criteria were applied based on validation performance, ensuring all models were optimally tuned before comparison. Results show that MF not only competes with but consistently surpasses BP in classification accuracy on its native MLPs. Its superior generalization stems from converging to a more favorable minimum in the validation loss landscape, challenging the assumption that global optimization is required for state-of-the-art results. Measured at the hardware level using the NVIDIA Management Library (NVML) API, MF reduces energy consumption by up to 41% and shortens training time by up to 34%, translating to a measurably smaller carbon footprint as estimated by CodeCarbon. Beyond this primary result, we present a hardware-level analysis that explains the efficiency gains: exposing FF's architectural inefficiencies, validating MF's computationally lean design, and challenging the assumption that all BP-free methods are inherently more memory-efficient. By documenting the evolution from FF's conceptual groundwork to MF's synthesis of accuracy and sustainability, this work offers a clear, data-driven roadmap for future energy-efficient deep learning.

Updated: 2025-09-23 14:27:44

标题: 超越反向传播：探索创新算法以实现高效能深度神经网络训练

摘要: 随着深度神经网络（DNNs）的计算和能量需求不断增加，主要受到反向传播（BP）的驱动，挑战着可持续人工智能的发展。本文严格调查了三种无需BP的训练方法：前向-前向（FF）、级联-前向（CaFo）和单向-前向（MF）算法，追踪它们从基本概念到明显优越解决方案的发展过程。建立了一个强大的比较框架：每种算法都在其原生架构上实施（FF和MF为MLPs，CaFo为CNN），并与等效的BP训练模型进行了基准测试。使用Optuna优化了超参数，并根据验证性能应用一致的早停止标准，确保在比较之前所有模型都得到了最佳调整。结果显示，MF不仅与BP竞争，而且在其原生MLPs上的分类精度始终超过BP。其优越的泛化性源于在验证损失景观中收敛到更有利的最小值，挑战了假设需要全局优化才能获得最先进的结果。使用NVIDIA管理库（NVML）API在硬件级别进行测量，MF将能源消耗降低了高达41％，训练时间缩短了高达34％，这转化为由CodeCarbon估算的明显更小的碳足迹。除了这一主要结果，我们还提出了一个硬件级别的分析，解释了效率的增益：揭示了FF的架构效率低下，验证了MF的计算节约设计，并挑战了所有无需BP方法都在内存效率方面更优的假设。通过记录从FF的概念基础工作到MF的精度和可持续性的综合进化，本研究为未来节能深度学习提供了清晰的、数据驱动的路线图。

更新时间: 2025-09-23 14:27:44

领域: cs.LG,cs.AI,68T07

下载: http://arxiv.org/abs/2509.19063v1

Backdoor Attack with Invisible Triggers Based on Model Architecture Modification

Machine learning systems are vulnerable to backdoor attacks, where attackers manipulate model behavior through data tampering or architectural modifications. Traditional backdoor attacks involve injecting malicious samples with specific triggers into the training data, causing the model to produce targeted incorrect outputs in the presence of the corresponding triggers. More sophisticated attacks modify the model's architecture directly, embedding backdoors that are harder to detect as they evade traditional data-based detection methods. However, the drawback of the architectural modification based backdoor attacks is that the trigger must be visible in order to activate the backdoor. To further strengthen the invisibility of the backdoor attacks, a novel backdoor attack method is presented in the paper. To be more specific, this method embeds the backdoor within the model's architecture and has the capability to generate inconspicuous and stealthy triggers. The attack is implemented by modifying pre-trained models, which are then redistributed, thereby posing a potential threat to unsuspecting users. Comprehensive experiments conducted on standard computer vision benchmarks validate the effectiveness of this attack and highlight the stealthiness of its triggers, which remain undetectable through both manual visual inspection and advanced detection tools.

Updated: 2025-09-23 14:27:14

标题: 基于模型架构修改的隐形触发器后门攻击

摘要: 机器学习系统容易受到后门攻击的影响，攻击者会通过数据篡改或架构修改来操纵模型行为。传统的后门攻击涉及向训练数据中注入带有特定触发器的恶意样本，导致模型在存在相应触发器时产生错误的输出。更复杂的攻击会直接修改模型的架构，嵌入更难以检测的后门，从而逃避传统基于数据的检测方法。然而，基于架构修改的后门攻击的缺点是触发器必须可见才能激活后门。为了进一步增强后门攻击的隐蔽性，该论文提出了一种新颖的后门攻击方法。具体而言，该方法将后门嵌入到模型的架构中，并具有生成不引人注目和隐秘触发器的能力。这种攻击通过修改预训练模型实施，然后重新分发，从而对毫无戒备的用户构成潜在威胁。对标准计算机视觉基准进行的全面实验验证了这种攻击的有效性，并突显了其触发器的隐秘性，这些触发器通过手动视觉检查和先进检测工具都无法检测到。

更新时间: 2025-09-23 14:27:14

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2412.16905v3

An Information-Flow Perspective on Explainability Requirements: Specification and Verification

Explainable systems expose information about why certain observed effects are happening to the agents interacting with them. We argue that this constitutes a positive flow of information that needs to be specified, verified, and balanced against negative information flow that may, e.g., violate privacy guarantees. Since both explainability and privacy require reasoning about knowledge, we tackle these tasks with epistemic temporal logic extended with quantification over counterfactual causes. This allows us to specify that a multi-agent system exposes enough information such that agents acquire knowledge on why some effect occurred. We show how this principle can be used to specify explainability as a system-level requirement and provide an algorithm for checking finite-state models against such specifications. We present a prototype implementation of the algorithm and evaluate it on several benchmarks, illustrating how our approach distinguishes between explainable and unexplainable systems, and how it allows to pose additional privacy requirements.

Updated: 2025-09-23 14:27:05

标题: 一个关于可解释性要求的信息流视角：规范与验证

摘要: 可解释性系统向与其交互的代理提供关于为什么会发生某些观察到的效果的信息。我们认为这构成了一种正向信息流，需要具体规定、验证，并与可能违反隐私保障的负向信息流进行平衡。由于解释性和隐私都需要对知识进行推理，我们使用扩展了对反事实因果关系进行量化的认识时间逻辑来处理这些任务。这使我们能够规定多代理系统暴露足够信息以使代理获知某些效果发生的原因。我们展示了如何利用这一原则来规定解释性作为系统级要求，并提供了一个用于检查有限状态模型是否符合此类规范的算法。我们提出了该算法的原型实现，并在几个基准测试上进行了评估，说明了我们的方法如何区分可解释和不可解释的系统，并如何提出额外的隐私要求。

更新时间: 2025-09-23 14:27:05

领域: cs.LO,cs.AI

下载: http://arxiv.org/abs/2509.01479v2

Dynami-CAL GraphNet: A Physics-Informed Graph Neural Network Conserving Linear and Angular Momentum for Dynamical Systems

Accurate, interpretable, and real-time modeling of multi-body dynamical systems is essential for predicting behaviors and inferring physical properties in natural and engineered environments. Traditional physics-based models face scalability challenges and are computationally demanding, while data-driven approaches like Graph Neural Networks (GNNs) often lack physical consistency, interpretability, and generalization. In this paper, we propose Dynami-CAL GraphNet, a Physics-Informed Graph Neural Network that integrates the learning capabilities of GNNs with physics-based inductive biases to address these limitations. Dynami-CAL GraphNet enforces pairwise conservation of linear and angular momentum for interacting nodes using edge-local reference frames that are equivariant to rotational symmetries, invariant to translations, and equivariant to node permutations. This design ensures physically consistent predictions of node dynamics while offering interpretable, edge-wise linear and angular impulses resulting from pairwise interactions. Evaluated on a 3D granular system with inelastic collisions, Dynami-CAL GraphNet demonstrates stable error accumulation over extended rollouts, effective extrapolations to unseen configurations, and robust handling of heterogeneous interactions and external forces. Dynami-CAL GraphNet offers significant advantages in fields requiring accurate, interpretable, and real-time modeling of complex multi-body dynamical systems, such as robotics, aerospace engineering, and materials science. By providing physically consistent and scalable predictions that adhere to fundamental conservation laws, it enables the inference of forces and moments while efficiently handling heterogeneous interactions and external forces.

Updated: 2025-09-23 14:26:51

标题: Dynami-CAL GraphNet：一种保留动力学系统线性和角动量的物理信息图神经网络

摘要: 准确、可解释和实时建模多体动力系统对于预测行为和推断自然和工程环境中的物理特性至关重要。传统基于物理的模型面临可扩展性挑战和计算需求高的问题，而数据驱动方法如图神经网络（GNNs）通常缺乏物理一致性、可解释性和泛化性。本文提出了Dynami-CAL GraphNet，这是一个融合了GNNs学习能力和基于物理归纳偏见的物理信息图神经网络，以解决这些限制。Dynami-CAL GraphNet利用边缘局部参考框架强制执行相互作用节点的线性动量和角动量的成对守恒，这些参考框架对旋转对称性等变，对平移不变，对节点排列等变。这一设计确保节点动态的物理一致预测，同时提供由成对交互导致的可解释的边缘线性和角动量冲量。在具有非弹性碰撞的3D颗粒系统上评估，Dynami-CAL GraphNet展示了在扩展演绎中稳定的误差积累，对未知配置的有效外推，以及对异质相互作用和外部力的强大处理能力。Dynami-CAL GraphNet在需要准确、可解释和实时建模复杂多体动力系统的领域，如机器人技术、航空航天工程和材料科学中，提供了显著优势。通过提供遵守基本守恒定律的物理一致和可扩展的预测，它使得力和力矩的推断能够高效处理异质相互作用和外部力。

更新时间: 2025-09-23 14:26:51

领域: cs.LG,cs.CE,physics.comp-ph

下载: http://arxiv.org/abs/2501.07373v2

Gaussian Process Diffeomorphic Statistical Shape Modelling Outperforms Angle-Based Methods for Assessment of Hip Dysplasia

Dysplasia is a recognised risk factor for osteoarthritis (OA) of the hip, early diagnosis of dysplasia is important to provide opportunities for surgical interventions aimed at reducing the risk of hip OA. We have developed a pipeline for semi-automated classification of dysplasia using volumetric CT scans of patients' hips and a minimal set of clinically annotated landmarks, combining the framework of the Gaussian Process Latent Variable Model with diffeomorphism to create a statistical shape model, which we termed the Gaussian Process Diffeomorphic Statistical Shape Model (GPDSSM). We used 192 CT scans, 100 for model training and 92 for testing. The GPDSSM effectively distinguishes dysplastic samples from controls while also highlighting regions of the underlying surface that show dysplastic variations. As well as improving classification accuracy compared to angle-based methods (AUC 96.2% vs 91.2%), the GPDSSM can save time for clinicians by removing the need to manually measure angles and interpreting 2D scans for possible markers of dysplasia.

Updated: 2025-09-23 14:24:55

标题: 高斯过程迭代统计形状建模优于基于角度的方法用于髋关节发育不良的评估

摘要: 发育不良是髋关节骨关节炎（OA）的一个已知风险因素，早期诊断发育不良对提供减少髋关节OA风险的手术干预机会至关重要。我们开发了一个半自动分类发育不良的流程，使用患者髋部的体积CT扫描和一组最小的临床标记点，结合高斯过程潜变量模型的框架与形变生成统计形态模型，我们将其称为高斯过程形变统计形态模型（GPDSSM）。我们使用了192个CT扫描，其中100个用于模型训练，92个用于测试。GPDSSM有效地区分了发育不良样本和对照样本，同时突出显示了表面下显示发育不良变异的区域。与基于角度的方法相比（AUC 96.2% vs 91.2%），GPDSSM不仅提高了分类准确性，还可以为临床医生节省时间，无需手动测量角度并解释2D扫描可能的发育不良标志。

更新时间: 2025-09-23 14:24:55

领域: cs.LG

下载: http://arxiv.org/abs/2506.04886v2

Towards Causal Representation Learning with Observable Sources as Auxiliaries

Causal representation learning seeks to recover latent factors that generate observational data through a mixing function. Needing assumptions on latent structures or relationships to achieve identifiability in general, prior works often build upon conditional independence given known auxiliary variables. However, prior frameworks limit the scope of auxiliary variables to be external to the mixing function. Yet, in some cases, system-driving latent factors can be easily observed or extracted from data, possibly facilitating identification. In this paper, we introduce a framework of observable sources being auxiliaries, serving as effective conditioning variables. Our main results show that one can identify entire latent variables up to subspace-wise transformations and permutations using volume-preserving encoders. Moreover, when multiple known auxiliary variables are available, we offer a variable-selection scheme to choose those that maximize recoverability of the latent factors given knowledge of the latent causal graph. Finally, we demonstrate the effectiveness of our framework through experiments on synthetic graph and image data, thereby extending the boundaries of current approaches.

Updated: 2025-09-23 14:22:39

标题: 朝向以可观测源作为辅助的因果表征学习

摘要: 因果表示学习旨在通过混合函数恢复生成观测数据的潜在因素。通常需要对潜在结构或关系进行假设，以实现可识别性，先前的研究通常基于给定已知辅助变量的条件独立性构建。然而，先前的框架限制了辅助变量的范围，使其外部于混合函数。然而，在某些情况下，系统驱动的潜在因素可以很容易地从数据中观察或提取，可能有助于识别。在本文中，我们介绍了一个将可观察源作为辅助变量的框架，作为有效的条件变量。我们的主要结果表明，可以使用保体积编码器识别整个潜在变量，直到亚空间变换和排列。此外，当有多个已知的辅助变量可用时，我们提供一个变量选择方案，选择那些在已知潜在因果图的情况下最大化潜在因素可恢复性的变量。最后，我们通过对合成图和图像数据的实验展示了我们框架的有效性，从而扩展了当前方法的边界。

更新时间: 2025-09-23 14:22:39

领域: cs.AI

下载: http://arxiv.org/abs/2509.19058v1

Poster: ChatIYP: Enabling Natural Language Access to the Internet Yellow Pages Database

The Internet Yellow Pages (IYP) aggregates information from multiple sources about Internet routing into a unified, graph-based knowledge base. However, querying it requires knowledge of the Cypher language and the exact IYP schema, thus limiting usability for non-experts. In this paper, we propose ChatIYP, a domain-specific Retrieval-Augmented Generation (RAG) system that enables users to query IYP through natural language questions. Our evaluation demonstrates solid performance on simple queries, as well as directions for improvement, and provides insights for selecting evaluation metrics that are better fit for IYP querying AI agents.

Updated: 2025-09-23 14:21:43

标题: 海报：ChatIYP：实现对互联网黄页数据库的自然语言访问

摘要: 互联网黄页（IYP）汇总来自多个来源的关于互联网路由的信息，形成一个统一的基于图的知识库。然而，要查询它需要了解Cypher语言和精确的IYP模式，因此限制了非专家的可用性。在本文中，我们提出了ChatIYP，一个特定领域的检索增强生成（RAG）系统，使用户能够通过自然语言提问查询IYP。我们的评估表明，在简单查询方面表现出色，并提供了改进的方向，为选择更适合IYP查询AI代理的评估指标提供了见解。

更新时间: 2025-09-23 14:21:43

领域: cs.NI,cs.HC,cs.LG

下载: http://arxiv.org/abs/2509.19411v1

Single-stream Policy Optimization

We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3 8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@$k$ across the evaluated $k$ values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.

Updated: 2025-09-23 14:19:47

标题: 单流策略优化

摘要: 我们重新审视了针对大型语言模型（LLMs）的政策梯度优化，从单流的角度。流行的基于组的方法如GRPO通过即时的基线减少方差，但存在严重缺陷：频繁的退化组会抹去学习信号，并且同步障碍会阻碍可伸缩性。我们引入了单流策略优化（SPO），通过设计消除了这些问题。SPO用持续的、KL自适应值追踪器替代每组基线，并在批处理中全局归一化优势，为每个样本提供稳定、低方差的学习信号。作为不依赖组的方法，SPO提高了吞吐量，并在生成时间变化较大的长期或与工具集成的环境中有效扩展。此外，持续的值追踪器自然地通过优先采样实现自适应课程。使用Qwen3-8B的实验表明，SPO比GRPO更顺利地收敛并获得更高的准确性，同时消除了在退化组上浪费的计算。消融研究证实，SPO的收益源于其对基线估计和优势归一化的原则性方法，为LLM推理提供更健壮、更高效的途径。在与Qwen3 8B一起使用的五个难度数学基准测试中，SPO将平均maj@32提高了+3.4个百分点（pp），在具有挑战性的数据集上实现了重大的绝对点增益，包括在BRUMO 25上+7.3 pp，在AIME 25上+4.4 pp，在HMMT 25上+3.3 pp，并在评估的k值范围内实现了一致的相对增益。SPO的成功挑战了将偶然复杂性添加到RL算法的流行趋势，突显了一条道路，即基本原则而不是架构变通，推动了LLM推理领域的下一波进展。

更新时间: 2025-09-23 14:19:47

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2509.13232v2

Latent Danger Zone: Distilling Unified Attention for Cross-Architecture Black-box Attacks

Black-box adversarial attacks remain challenging due to limited access to model internals. Existing methods often depend on specific network architectures or require numerous queries, resulting in limited cross-architecture transferability and high query costs. To address these limitations, we propose JAD, a latent diffusion model framework for black-box adversarial attacks. JAD generates adversarial examples by leveraging a latent diffusion model guided by attention maps distilled from both a convolutional neural network (CNN) and a Vision Transformer (ViT) models. By focusing on image regions that are commonly sensitive across architectures, this approach crafts adversarial perturbations that transfer effectively between different model types. This joint attention distillation strategy enables JAD to be architecture-agnostic, achieving superior attack generalization across diverse models. Moreover, the generative nature of the diffusion framework yields high adversarial sample generation efficiency by reducing reliance on iterative queries. Experiments demonstrate that JAD offers improved attack generalization, generation efficiency, and cross-architecture transferability compared to existing methods, providing a promising and effective paradigm for black-box adversarial attacks.

Updated: 2025-09-23 14:12:41

标题: 潜在的危险区域：将统一关注力聚集在跨体系结构的黑盒攻击上

摘要: 黑盒对抗攻击由于对模型内部的有限访问而仍然具有挑战性。现有方法通常依赖于特定的网络架构或需要大量查询，导致跨架构的传递性有限并且查询成本较高。为了解决这些限制，我们提出了JAD，这是一个用于黑盒对抗攻击的潜在扩散模型框架。JAD通过利用从卷积神经网络（CNN）和Vision Transformer（ViT）模型中提炼出的注意力地图来引导潜在扩散模型生成对抗性示例。通过关注在不同模型中普遍敏感的图像区域，这种方法可以制作出有效地在不同模型类型之间传递的对抗扰动。这种共同的注意力提炼策略使得JAD可以独立于架构，实现了在不同模型之间的攻击泛化性能。此外，扩散框架的生成性质通过减少对迭代查询的依赖，提高了对抗样本的生成效率。实验证明，与现有方法相比，JAD在攻击泛化性能、生成效率和跨架构传递性方面均具有改进，为黑盒对抗攻击提供了一种有前途且有效的范式。

更新时间: 2025-09-23 14:12:41

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2509.19044v1

Highly Imbalanced Regression with Tabular Data in SEP and Other Applications

We investigate imbalanced regression with tabular data that have an imbalance ratio larger than 1,000 ("highly imbalanced"). Accurately estimating the target values of rare instances is important in applications such as forecasting the intensity of rare harmful Solar Energetic Particle (SEP) events. For regression, the MSE loss does not consider the correlation between predicted and actual values. Typical inverse importance functions allow only convex functions. Uniform sampling might yield mini-batches that do not have rare instances. We propose CISIR that incorporates correlation, Monotonically Decreasing Involution (MDI) importance, and stratified sampling. Based on five datasets, our experimental results indicate that CISIR can achieve lower error and higher correlation than some recent methods. Also, adding our correlation component to other recent methods can improve their performance. Lastly, MDI importance can outperform other importance functions. Our code can be found in https://github.com/Machine-Earning/CISIR.

Updated: 2025-09-23 14:07:12

标题: 使用表格数据进行高度不平衡的回归在SEP和其他应用中的应用

摘要: 我们研究了具有不平衡比例大于1,000（“高度不平衡”）的表格数据的不平衡回归问题。在应用中准确估计稀有实例的目标值是重要的，比如预测稀有有害太阳能粒子（SEP）事件的强度。对于回归，均方误差损失不考虑预测值和实际值之间的相关性。典型的反向重要性函数只允许凸函数。均匀抽样可能会产生不包含稀有实例的小批量。我们提出了CISIR，它结合了相关性、单调递减反演（MDI）重要性和分层抽样。基于五个数据集，我们的实验结果表明CISIR可以实现比一些最近的方法更低的错误和更高的相关性。此外，将我们的相关性组件添加到其他最近的方法中可以改善它们的性能。最后，MDI重要性可以胜过其他重要性函数。我们的代码可以在https://github.com/Machine-Earning/CISIR找到。

更新时间: 2025-09-23 14:07:12

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.16339v2

Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.

Updated: 2025-09-23 14:06:24

标题: 生涯机器人学习的动态渐进参数高效专家库混合

摘要: 一个通用的代理必须在其整个生命周期中不断学习和适应，实现高效的前向传递，同时最大限度地减少灾难性遗忘。在主流的预训练-微调范式中，先前的工作探索了针对单一任务适应的参数有效微调，有效地引导具有少量参数的冻结预训练模型。然而，在终身学习的背景下，这些方法依赖于测试时任务标识符的不切实际假设，并限制了孤立适配器之间的知识共享。为了解决这些限制，我们提出了用于终身机器人学习的动态混合渐进参数有效专家库（DMPEL）。DMPEL逐步构建一个低秩专家库，并利用轻量级路由器动态地将专家组合成端到端策略，实现灵活有效的终身前向传递。此外，通过利用微调参数的模块化结构，我们引入了专家系数回放，引导路由器准确检索先前遇到任务的冻结专家。这种技术减轻了遗忘，同时比整个策略上的经验重放更节省存储和计算资源。在终身机器人学习基准LIBERO上的大量实验表明，我们的框架在持续适应过程中的成功率方面优于最先进的终身学习方法，同时利用最小的可训练参数和存储空间。

更新时间: 2025-09-23 14:06:24

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2506.05985v2

Improving Credit Card Fraud Detection through Transformer-Enhanced GAN Oversampling

Detection of credit card fraud is an acute issue of financial security because transaction datasets are highly lopsided, with fraud cases being only a drop in the ocean. Balancing datasets using the most popular methods of traditional oversampling such as the Synthetic Minority Oversampling Technique (SMOTE) generally create simplistic synthetic samples that are not readily applicable to complex fraud patterns. Recent industry advances that include Conditional Tabular Generative Adversarial Networks (CTGAN) and Tabular Variational Autoencoders (TVAE) have demonstrated increased efficiency in tabular synthesis, yet all these models still exhibit issues with high-dimensional dependence modelling. Now we will present our hybrid approach where we use a Generative Adversarial Network (GAN) with a Transformer encoder block to produce realistic fraudulent transactions samples. The GAN architecture allows training realistic generators adversarial, and the Transformer allows the model to learn rich feature interactions by self-attention. Such a hybrid strategy overcomes the limitations of SMOTE, CTGAN, and TVAE by producing a variety of high-quality synthetic minority classes samples. We test our algorithm on the publicly-available Credit Card Fraud Detection dataset and compare it to conventional and generative resampling strategies with a variety of classifiers, such as Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM). Findings indicate that our Transformer-based GAN shows substantial gains in Recall, F1-score and Area Under the Receiver Operating Characteristic Curve (AUC), which indicates that it is effective in overcoming the severe class imbalance inherent in the task of fraud detection.

Updated: 2025-09-23 14:05:13

标题: 通过Transformer增强的GAN过采样改进信用卡欺诈检测

摘要: 信用卡欺诈检测是金融安全的一个严重问题，因为交易数据集极度不平衡，欺诈案例只是其中的一小部分。使用传统过采样方法（例如Synthetic Minority Oversampling Technique，SMOTE）平衡数据集通常会创建简单的合成样本，无法直接应用于复杂的欺诈模式。最近行业的进展包括Conditional Tabular Generative Adversarial Networks（CTGAN）和Tabular Variational Autoencoders（TVAE）已经证明在表格合成方面效率更高，但所有这些模型仍然存在高维度依赖建模的问题。现在我们将介绍我们的混合方法，其中我们使用具有Transformer编码器块的生成对抗网络（GAN）来生成逼真的欺诈交易样本。GAN架构允许对抗性地训练逼真的生成器，而Transformer允许模型通过自注意力学习丰富的特征交互。这种混合策略克服了SMOTE、CTGAN和TVAE的局限性，通过生成多样高质量的合成少数类样本。我们在公开可用的信用卡欺诈检测数据集上测试我们的算法，并将其与传统和生成重采样策略以及各种分类器（如逻辑回归（LR）、随机森林（RF）、极端梯度提升（XGBoost）和支持向量机（SVM））进行比较。研究结果表明，我们基于Transformer的GAN在召回率、F1分数和接收者操作特征曲线下面积（AUC）上显示出显著的增益，这表明它在克服欺诈检测任务中固有的严重类别不平衡方面是有效的。

更新时间: 2025-09-23 14:05:13

领域: cs.LG

下载: http://arxiv.org/abs/2509.19032v1

Landmarks, Monuments, and Beacons: Understanding Generative Calls to Action

Algorithmic evaluation of procedurally generated content struggles to find metrics that align with human experience, particularly for composite artefacts. Automatic decomposition as a possible solution requires concepts that meet a range of properties. To this end, drawing on Games Studies and Game AI research, we introduce the nested concepts of \textit{Landmarks}, \textit{Monuments}, and \textit{Beacons}. These concepts are based on the artefact's perceivability, evocativeness, and Call to Action, all from a player-centric perspective. These terms are generic to games and usable across genres. We argue that these entities can be found and evaluated with techniques currently used in both research and industry, opening a path towards a fully automated decomposition of PCG, and evaluation of the salient sub-components. Although the work presented here emphasises mixed-initiative PCG and compositional PCG, we believe it applies beyond those domains. With this approach, we intend to create a connection between humanities and technical game research and allow for better computational PCG evaluation

Updated: 2025-09-23 14:03:54

标题: 地标、纪念碑和灯塔：理解生成性的行动呼吁

摘要: 算法评估程序生成内容在寻找与人类体验相一致的度量标准方面存在困难，特别是对于复合品。自动分解作为可能的解决方案需要满足一系列属性的概念。为此，借鉴游戏研究和游戏人工智能研究，我们引入了\textit{地标}、\textit{纪念碑}和\textit{灯塔}的嵌套概念。这些概念基于玩家中心的视角，涵盖了物品的可感知性、唤起性和行动呼应。这些术语适用于各种游戏类型。我们认为这些实体可以通过目前在研究和行业中使用的技术找到和评估，从而开辟了一条通向完全自动化分解程序生成内容和评估显著子部件的途径。尽管这里呈现的工作强调了混合式倡议程序生成和组合式程序生成，但我们认为它适用于其他领域。通过这种方法，我们打算在人文学科和技术游戏研究之间建立联系，并实现更好的计算程序生成内容评估。

更新时间: 2025-09-23 14:03:54

领域: cs.AI

下载: http://arxiv.org/abs/2509.19030v1

Linguistic Neuron Overlap Patterns to Facilitate Cross-lingual Transfer on Low-resource Languages

The current Large Language Models (LLMs) face significant challenges in improving their performance on low-resource languages and urgently need data-efficient methods without costly fine-tuning. From the perspective of language-bridge, we propose a simple yet effective method, namely BridgeX-ICL, to improve the zero-shot Cross-lingual In-Context Learning (X-ICL) for low-resource languages. Unlike existing works focusing on language-specific neurons, BridgeX-ICL explores whether sharing neurons can improve cross-lingual performance in LLMs. We construct neuron probe data from the ground-truth MUSE bilingual dictionaries, and define a subset of language overlap neurons accordingly to ensure full activation of these anchored neurons. Subsequently, we propose an HSIC-based metric to quantify LLMs' internal linguistic spectrum based on overlapping neurons, guiding optimal bridge selection. The experiments conducted on 4 cross-lingual tasks and 15 language pairs from 7 diverse families, covering both high-low and moderate-low pairs, validate the effectiveness of BridgeX-ICL and offer empirical insights into the underlying multilingual mechanisms of LLMs. The code is publicly available at https://github.com/xuyuemei/BridgeX-ICL.

Updated: 2025-09-23 14:02:49

标题: 语言神经元重叠模式：促进低资源语言的跨语言转移

摘要: 目前，大型语言模型（LLMs）在提高在资源稀缺语言上的性能方面面临重大挑战，迫切需要一种数据高效的方法，而无需昂贵的微调。从语言桥梁的角度出发，我们提出了一种简单而有效的方法，即BridgeX-ICL，用于改进低资源语言的零-shot跨语言上下文学习（X-ICL）。与现有的作品不同，这些作品侧重于特定于语言的神经元，BridgeX-ICL探讨了在LLMs中共享神经元是否可以提高跨语言性能。我们从地面真相MUSE双语词典构建神经元探针数据，并相应地定义语言重叠神经元的子集，以确保这些锚定神经元的完全激活。随后，我们提出了一种基于HSIC的度量标准，用于基于重叠神经元量化LLMs的内部语言谱，指导最佳桥梁选择。在来自7个不同家族的4个跨语言任务和15个语言对上进行的实验中，涵盖了高低和中低配对，验证了BridgeX-ICL的有效性，并为LLMs的基础多语言机制提供了实证见解。该代码可以在https://github.com/xuyuemei/BridgeX-ICL上公开获取。

更新时间: 2025-09-23 14:02:49

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.17078v2

Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation

Modern machine learning (ML) training workloads place substantial demands on both computational and communication resources. Consequently, accurate performance estimation has become increasingly critical for guiding system design decisions, such as the selection of parallelization strategies, cluster configurations, and hardware provisioning. Existing simulation-based performance estimation requires reimplementing the ML framework in a simulator, which demands significant manual effort and is hard to maintain as ML frameworks evolve rapidly. This paper introduces Phantora, a hybrid GPU cluster simulator designed for performance estimation of ML training workloads. Phantora executes unmodified ML frameworks as is within a distributed, containerized environment. Each container emulates the behavior of a GPU server in a large-scale cluster, while Phantora intercepts and simulates GPU- and communication-related operations to provide high-fidelity performance estimation. We call this approach hybrid simulation of ML systems, in contrast to traditional methods that simulate static workloads. The primary advantage of hybrid simulation is that it allows direct reuse of ML framework source code in simulation, avoiding the need for reimplementation. Our evaluation shows that Phantora provides accuracy comparable to static workload simulation while supporting three state-of-the-art LLM training frameworks out-of-the-box. In addition, Phantora operates on a single GPU, eliminating the need for the resource-intensive trace collection and workload extraction steps required by traditional trace-based simulators. Phantora is open-sourced at https://github.com/QDelta/Phantora.

Updated: 2025-09-23 14:01:26

标题: Phantora：在基于模拟的机器学习系统性能估计中最大程度地实现代码复用

摘要: 现代机器学习（ML）训练工作负载对计算和通信资源都提出了重大需求。因此，准确的性能估计对于指导系统设计决策变得越来越关键，例如选择并行化策略、集群配置和硬件配置。现有基于模拟的性能估计需要在模拟器中重新实现ML框架，这需要大量的人工努力，并且难以随着ML框架的快速发展而维护。本文介绍了一种名为Phantora的混合GPU集群模拟器，用于对ML训练工作负载的性能进行估算。Phantora在分布式、容器化环境中执行未修改的ML框架。每个容器模拟大规模集群中的GPU服务器的行为，同时Phantora拦截并模拟GPU和通信相关操作，以提供高保真度的性能估计。我们将这种方法称为ML系统的混合模拟，与传统方法相比，传统方法模拟静态工作负载。混合模拟的主要优势在于它允许在模拟中直接重用ML框架源代码，避免了重新实现的需要。我们的评估显示，Phantora提供了与静态工作负载模拟相当的准确性，同时支持三种最先进的LLM训练框架。此外，Phantora在单个GPU上运行，消除了传统基于跟踪的模拟器所需的资源密集型跟踪收集和工作负载提取步骤的需要。Phantora在https://github.com/QDelta/Phantora上开源。

更新时间: 2025-09-23 14:01:26

领域: cs.DC,cs.LG,cs.PF

下载: http://arxiv.org/abs/2505.01616v2

Exploiting Page Faults for Covert Communication

We present a novel mechanism to construct a covert channel based on page faults. A page fault is an event that occurs when a process or a thread tries to access a page of memory that is not currently mapped to its address space. The kernel typically responds to this event by performing a context switch to allow another process or thread to execute while the page is being fetched from the disk. We exploit this behavior to allow a malicious process to leak secret data to another process, bypassing the isolation mechanisms enforced by the operating system. These attacks do not leverage timers and are hardwareagnostic. Experimental results demonstrate that this attack can achieve a bit error rate of under 4%

Updated: 2025-09-23 14:01:23

标题: 利用页面错误进行隐蔽通信

摘要: 我们提出了一种基于页面错误的隐蔽信道构建机制。页面错误是当一个进程或线程试图访问当前未映射到其地址空间的内存页时发生的事件。内核通常通过执行上下文切换来响应此事件，以允许另一个进程或线程在页面从磁盘中获取时执行。我们利用这种行为允许恶意进程将秘密数据泄露给另一个进程，从而绕过操作系统强制执行的隔离机制。这些攻击不利用定时器，并且与硬件无关。实验结果表明，这种攻击可以实现低于4%的位错误率。

更新时间: 2025-09-23 14:01:23

领域: cs.OS,cs.CR

下载: http://arxiv.org/abs/2509.20398v1

EvoAgentX: An Automated Framework for Evolving Agentic Workflows

Multi-agent systems (MAS) have emerged as a powerful paradigm for orchestrating large language models (LLMs) and specialized tools to collaboratively address complex tasks. However, existing MAS frameworks often require manual workflow configuration and lack native support for dynamic evolution and performance optimization. In addition, many MAS optimization algorithms are not integrated into a unified framework. In this paper, we present EvoAgentX, an open-source platform that automates the generation, execution, and evolutionary optimization of multi-agent workflows. EvoAgentX employs a modular architecture consisting of five core layers: the basic components, agent, workflow, evolving, and evaluation layers. Specifically, within the evolving layer, EvoAgentX integrates three MAS optimization algorithms, TextGrad, AFlow, and MIPRO, to iteratively refine agent prompts, tool configurations, and workflow topologies. We evaluate EvoAgentX on HotPotQA, MBPP, and MATH for multi-hop reasoning, code generation, and mathematical problem solving, respectively, and further assess it on real-world tasks using GAIA. Experimental results show that EvoAgentX consistently achieves significant performance improvements, including a 7.44% increase in HotPotQA F1, a 10.00% improvement in MBPP pass@1, a 10.00% gain in MATH solve accuracy, and an overall accuracy improvement of up to 20.00% on GAIA. The source code is available at: https://github.com/EvoAgentX/EvoAgentX

Updated: 2025-09-23 14:00:26

标题: EvoAgentX：一个用于演化代理工作流程的自动化框架

摘要: 多智能体系统（MAS）已经成为协调大型语言模型（LLMs）和专门工具共同解决复杂任务的强大范式。然而，现有的MAS框架通常需要手动工作流配置，并且缺乏对动态演化和性能优化的本地支持。此外，许多MAS优化算法并未集成到统一框架中。在本文中，我们介绍了EvoAgentX，这是一个开源平台，可以自动生成、执行和进化优化多智能体工作流。EvoAgentX采用由五个核心层组成的模块化架构：基本组件、智能体、工作流、进化和评估层。具体而言，在进化层内，EvoAgentX集成了三种MAS优化算法TextGrad、AFlow和MIPRO，以迭代地优化智能体提示、工具配置和工作流拓扑。我们在HotPotQA、MBPP和MATH上评估了EvoAgentX，用于多跳推理、代码生成和数学问题解决，分别并进一步使用GAIA对其进行了评估。实验结果显示，EvoAgentX始终取得显著的性能改进，包括HotPotQA F1增加了7.44％，MBPP pass@1改善了10.00％，MATH解决准确性提高了10.00％，并在GAIA上的总体准确性提高了高达20.00％。源代码可在以下网址获得：https://github.com/EvoAgentX/EvoAgentX

更新时间: 2025-09-23 14:00:26

领域: cs.AI

下载: http://arxiv.org/abs/2507.03616v2

Reduced-Order Model-Guided Reinforcement Learning for Demonstration-Free Humanoid Locomotion

We introduce Reduced-Order Model-Guided Reinforcement Learning (ROM-GRL), a two-stage reinforcement learning framework for humanoid walking that requires no motion capture data or elaborate reward shaping. In the first stage, a compact 4-DOF (four-degree-of-freedom) reduced-order model (ROM) is trained via Proximal Policy Optimization. This generates energy-efficient gait templates. In the second stage, those dynamically consistent trajectories guide a full-body policy trained with Soft Actor--Critic augmented by an adversarial discriminator, ensuring the student's five-dimensional gait feature distribution matches the ROM's demonstrations. Experiments at 1 meter-per-second and 4 meter-per-second show that ROM-GRL produces stable, symmetric gaits with substantially lower tracking error than a pure-reward baseline. By distilling lightweight ROM guidance into high-dimensional policies, ROM-GRL bridges the gap between reward-only and imitation-based locomotion methods, enabling versatile, naturalistic humanoid behaviors without any human demonstrations.

Updated: 2025-09-23 13:58:36

标题: 减小阶数模型引导的无示范人形机器人行走的强化学习

摘要: 我们介绍了一种称为Reduced-Order Model-Guided Reinforcement Learning（ROM-GRL）的双阶段强化学习框架，用于人形走路，不需要动作捕捉数据或精心设计的奖励形状。在第一阶段，通过Proximal Policy Optimization训练了一个紧凑的4自由度（four-degree-of-freedom）简化模型（ROM），生成了高效能的步态模板。在第二阶段，这些动态一致的轨迹引导了一个由Soft Actor-Critic训练的全身策略，通过对抗鉴别器增强，确保学生的五维步态特征分布与ROM的演示相匹配。在每秒1米和每秒4米的实验中，ROM-GRL产生了稳定、对称的步态，比纯奖励基线具有明显更低的跟踪误差。通过将轻量级ROM引导融入高维度策略，ROM-GRL弥合了仅基于奖励和基于模仿的运动方法之间的差距，实现了多功能、自然的人形行为，而无需任何人类演示。

更新时间: 2025-09-23 13:58:36

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.19023v1

OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment

Recent advances in multimodal large language models (LLMs) have led to significant progress in understanding, generation, and retrieval tasks. However, current solutions often treat these tasks in isolation or require training LLMs from scratch, resulting in high computational costs and limited generalization across modalities. In this work, we present OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a unified architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module. To address the challenge of task interference, we propose a two-stage decoupled training strategy: supervised fine-tuning and latent space alignment for aligning LLM behavior with multimodal reasoning, and semantic-guided diffusion training to align cross-modal latent spaces via learnable query embeddings. Extensive experiments across a wide range of benchmarks demonstrate that OmniBridge achieves competitive or state-of-the-art performance in all three tasks. Moreover, our results highlight the effectiveness of latent space alignment for unifying multimodal modeling under a shared representation space. Code and models are released at https://github.com/xiao-xt/OmniBridge.

Updated: 2025-09-23 13:57:55

标题: OmniBridge：通过潜在空间对齐实现统一的多模态理解、生成和检索

摘要: 近年来，多模态大型语言模型（LLMs）的最新进展已经取得了重大进展，使我们在理解、生成和检索任务方面取得了显著的进展。然而，目前的解决方案通常将这些任务视为孤立的，或者需要从头开始训练LLMs，导致计算成本高昂且在不同模态之间的泛化能力有限。在这项工作中，我们提出了OmniBridge，一个统一的模块化多模态框架，支持视觉-语言理解、生成和检索任务在一个统一的架构中进行。OmniBridge采用以语言为中心的设计，重复使用预训练的LLMs，并引入了一个轻量级的双向潜在对齐模块。为了解决任务干扰的挑战，我们提出了一个两阶段解耦训练策略：监督微调和潜在空间对齐，以将LLM的行为与多模态推理对齐，并通过可学习的查询嵌入进行语义引导扩散训练，以通过可学习的查询嵌入将跨模态潜在空间对齐。在广泛的基准测试中，我们的实验表明，OmniBridge在所有三个任务中均取得了具有竞争力或最先进的性能。此外，我们的结果突显了潜在空间对齐在将多模态建模统一到一个共享表示空间下的有效性。代码和模型发布在https://github.com/xiao-xt/OmniBridge。

更新时间: 2025-09-23 13:57:55

领域: cs.LG

下载: http://arxiv.org/abs/2509.19018v1

Explainable artificial intelligence (XAI) for scaling: An application for deducing hydrologic connectivity at watershed scale

Explainable artificial intelligence (XAI) methods have been applied to interpret deep learning model results. However, applications that integrate XAI with established hydrologic knowledge for process understanding remain limited. Here we show that XAI method, applied at point-scale, could be used for cross-scale aggregation of hydrologic responses, a fundamental question in scaling problems, using hydrologic connectivity as a demonstration. Soil moisture and its movement generated by physically based hydrologic model were used to train a long short-term memory (LSTM) network, whose impacts of inputs were evaluated by XAI methods. Our results suggest that XAI-based classification can effectively identify the differences in the functional roles of various sub-regions at watershed scale. The aggregated XAI results could be considered as an explicit and quantitative indicator of hydrologic connectivity development, offering insights to hydrological organization. This framework could be used to facilitate aggregation of other geophysical responses to advance process understandings.

Updated: 2025-09-23 13:57:32

标题: 可解释的人工智能（XAI）用于扩展规模：推断流域尺度上的水文连通性的应用

摘要: 可解释的人工智能（XAI）方法已被应用于解释深度学习模型的结果。然而，将XAI与建立的水文知识相结合以理解过程的应用仍然有限。在这里，我们展示了XAI方法，应用于点尺度，可以用于水文响应的跨尺度聚合，这是尺度问题中的一个基本问题，以水文连接性作为示范。通过基于物理的水文模型生成的土壤湿度及其运动被用于训练长短期记忆（LSTM）网络，其输入的影响通过XAI方法进行评估。我们的结果表明，基于XAI的分类可以有效地识别流域尺度各个子区域的功能角色的差异。聚合的XAI结果可以被视为水文连接发展的明确和定量指标，为水文组织提供见解。这个框架可以用于促进其他地球物理响应的聚合，以推进过程理解。

更新时间: 2025-09-23 13:57:32

领域: physics.geo-ph,cs.LG

下载: http://arxiv.org/abs/2509.02127v2

Fully Learnable Neural Reward Machines

Non-Markovian Reinforcement Learning (RL) tasks present significant challenges, as agents must reason over entire trajectories of state-action pairs to make optimal decisions. A common strategy to address this is through symbolic formalisms, such as Linear Temporal Logic (LTL) or automata, which provide a structured way to express temporally extended objectives. However, these approaches often rely on restrictive assumptions -- such as the availability of a predefined Symbol Grounding (SG) function mapping raw observations to high-level symbolic representations, or prior knowledge of the temporal task. In this work, we propose a fully learnable version of Neural Reward Machines (NRM), which can learn both the SG function and the automaton end-to-end, removing any reliance on prior knowledge. Our approach is therefore as easily applicable as classic deep RL (DRL) approaches, while being far more explainable, because of the finite and compact nature of automata. Furthermore, we show that by integrating Fully Learnable Reward Machines (FLNRM) with DRL, our method outperforms previous approaches based on Recurrent Neural Networks (RNNs).

Updated: 2025-09-23 13:57:13

标题: 可以翻译为：完全可学习的神经奖励机制

摘要: 非马尔可夫强化学习（RL）任务存在重要挑战，因为代理需要考虑整个状态-动作轨迹来做出最佳决策。一种常见的解决策略是通过符号形式化，如线性时间逻辑（LTL）或自动机，这提供了一种结构化的方式来表达时间延长的目标。然而，这些方法通常依赖于限制性假设--例如预定义的符号接地（SG）函数映射原始观察到高级符号表示，或时间任务的先验知识的可用性。在这项工作中，我们提出了一个完全可学习的神经奖励机制（NRM）的版本，它可以学习SG函数和自动机端到端，消除了对先验知识的依赖。因此，我们的方法与经典的深度RL（DRL）方法一样易于应用，同时由于自动机的有限和紑形质性质，更容易解释。此外，我们展示了通过将完全可学习奖励机器（FLNRM）与DRL集成，我们的方法优于基于递归神经网络（RNNs）的先前方法。

更新时间: 2025-09-23 13:57:13

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19017v1

Pure Vision Language Action (VLA) Models: A Comprehensive Survey

The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics, reframing Vision Language Models (VLMs) from passive sequence generators into active agents for manipulation and decision-making in complex, dynamic environments. This survey delves into advanced VLA methods, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research. It presents a comprehensive analysis of VLA applications across different scenarios and classifies VLA approaches into several paradigms: autoregression-based, diffusion-based, reinforcement-based, hybrid, and specialized methods; while examining their motivations, core strategies, and implementations in detail. In addition, foundational datasets, benchmarks, and simulation platforms are introduced. Building on the current VLA landscape, the review further proposes perspectives on key challenges and future directions to advance research in VLA models and generalizable robotics. By synthesizing insights from over three hundred recent studies, this survey maps the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose VLA methods.

Updated: 2025-09-23 13:53:52

标题: 纯视觉语言行动（VLA）模型：一项全面调查

摘要: Vision Language Action（VLA）模型的出现标志着从传统基于策略控制到泛化机器人学的范式转变，将视觉语言模型（VLMs）从被动序列生成器重新构建为在复杂动态环境中进行操纵和决策的主动代理。本调查深入探讨先进的VLA方法，旨在提供一个清晰的分类法和对现有研究的系统全面评估。它对VLA在不同场景中的应用进行了全面分析，并将VLA方法分类为几种范式：基于自回归、扩散、强化、混合和专门方法，同时详细检查它们的动机、核心策略和实施。此外，还介绍了基础数据集、基准和仿真平台。在当前VLA领域的基础上，该综述进一步提出了关于VLA模型和通用机器人学研究的关键挑战和未来方向的展望。通过综合分析超过三百篇最新研究的见解，本调查描绘了这个快速发展领域的轮廓，并强调了将塑造可扩展通用VLA方法发展的机遇和挑战。

更新时间: 2025-09-23 13:53:52

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.19012v1

Quantum Annealing for Minimum Bisection Problem: A Machine Learning-based Approach for Penalty Parameter Tuning

The Minimum Bisection Problem is a well-known NP-hard problem in combinatorial optimization, with practical applications in areas such as parallel computing, network design, and machine learning. In this paper, we examine the potential of using D-Wave Systems' quantum annealing solvers to solve the Minimum Bisection Problem, which we formulate as a Quadratic Unconstrained Binary Optimization model. A key challenge in this formulation lies in choosing an appropriate penalty parameter, as it plays a crucial role in ensuring both the quality of the solution and the satisfaction of the problem's constraints. To address this, we introduce a novel machine learning-based approach for adaptive tuning of the penalty parameter. Specifically, we use a Gradient Boosting Regressor model trained to predict suitable penalty parameter values based on structural properties of the input graph, the number of nodes and the graph's density. This method enables the penalty parameter to be adjusted dynamically for each specific problem instance, improving the solver's ability to balance the competing goals of minimizing the cut size and maintaining equally sized partitions. We test our approach on a large dataset of randomly generated Erd\H{o}s-R\'enyi graphs with up to 4,000 nodes, and we compare the results with classical partitioning algorithms, Metis and Kernighan-Lin. Experimental findings demonstrate that our adaptive tuning strategy significantly improves the performance of the quantum annealing hybrid solver and consistently outperforms the classical methods used, indicating its potential as an alternative for the graph partitioning problem.

Updated: 2025-09-23 13:49:18

标题: 量子退火用于最小二分问题：一种基于机器学习的罚参数调整方法

摘要: 最小二分问题是组合优化中众所周知的NP难题，在诸如并行计算、网络设计和机器学习等领域具有实际应用。本文探讨了使用D-Wave Systems的量子退火求解器来解决最小二分问题的潜力，将其建模为一个二次无约束二进制优化模型。该建模中的一个关键挑战在于选择适当的惩罚参数，因为它在确保解的质量和问题约束的满足方面起着至关重要的作用。为了解决这个问题，我们引入了一种基于机器学习的自适应调整惩罚参数的方法。具体来说，我们使用一个Gradient Boosting Regressor模型，根据输入图的结构特性、节点数量和图的密度来训练预测适当的惩罚参数值。这种方法使得惩罚参数能够动态调整，以改善求解器在最小化切割大小和保持等大小分区之间的竞争目标方面的能力。我们在一个包含最多4,000个节点的大型随机生成的Erd\H{o}s-R\'enyi图数据集上测试了我们的方法，并将结果与经典分区算法Metis和Kernighan-Lin进行了比较。实验结果表明，我们的自适应调整策略显著提高了量子退火混合求解器的性能，并始终优于使用的经典方法，表明其作为图分区问题的替代方案的潜力。

更新时间: 2025-09-23 13:49:18

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2509.19005v1

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

Updated: 2025-09-23 13:46:31

标题: VIR-Bench: 通过旅行视频行程重建评估MLLMs对地理空间和时间理解的能力

摘要: 最近，多模态大型语言模型（MLLMs）的进展显著增强了视频理解能力，为实际应用开辟了新的可能性。然而，当前的视频基准主要关注室内场景或短程户外活动，对于长距离旅行所涉及的挑战很少被探索。掌握扩展的地理空间-时间轨迹对于下一代MLLMs至关重要，支撑着实际任务，如具身人工智能规划和导航。为了弥补这一差距，我们提出了VIR-Bench，一个由200个旅行视频组成的新型基准，将行程重建构架为一个具有挑战性的任务，旨在评估并推进MLLMs的地理空间-时间智能。实验结果显示，包括专有模型在内的最先进MLLMs难以取得高分，突显了处理跨越扩展空间和时间尺度的视频的困难。此外，我们进行了深入的案例研究，开发了一个利用从VIR-Bench获得的见解的原型旅行规划代理。代理的明显改善的行程推荐验证了我们的评估协议不仅有效地基准了模型，而且在面向用户的应用中转化为具体的性能增益。

更新时间: 2025-09-23 13:46:31

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2509.19002v1

Bayesian Calibration and Model Assessment of Cell Migration Dynamics with Surrogate Model Integration

Computational models provide crucial insights into complex biological processes such as cancer evolution, but their mechanistic nature often makes them nonlinear and parameter-rich, complicating calibration. We systematically evaluate parameter probability distributions in cell migration models using Bayesian calibration across four complementary strategies: parametric and surrogate models, each with and without explicit model discrepancy. This approach enables joint analysis of parameter uncertainty, predictive performance, and interpretability. Applied to a real data experiment of glioblastoma progression in microfluidic devices, surrogate models achieve higher computational efficiency and predictive accuracy, whereas parametric models yield more reliable parameter estimates due to their mechanistic grounding. Incorporating model discrepancy exposes structural limitations, clarifying where model refinement is necessary. Together, these comparisons offer practical guidance for calibrating and improving computational models of complex biological systems.

Updated: 2025-09-23 13:45:16

标题: 贝叶斯校准和细胞迁移动力学模型评估与替代模型集成

摘要: 计算模型为癌症进化等复杂生物过程提供了重要的见解，但它们的机制性质通常使它们非线性且参数丰富，使得校准变得复杂。我们系统地评估了细胞迁移模型中的参数概率分布，使用贝叶斯校准跨越四种互补策略：参数化和替代模型，每种都有和没有显式模型差异。这种方法使得能够联合分析参数不确定性、预测性能和可解释性。应用于微流体设备中脑胶质母细胞瘤进展的真实数据实验，替代模型实现了更高的计算效率和预测准确性，而参数化模型由于其机制性基础产生了更可靠的参数估计。将模型差异纳入考虑，暴露了结构上的限制，澄清了模型改进的必要性。总之，这些比较为校准和改进复杂生物系统的计算模型提供了实用指导。

更新时间: 2025-09-23 13:45:16

领域: math.AP,cs.LG,q-bio.CB,q-bio.QM

下载: http://arxiv.org/abs/2509.18998v1

Theoretical Foundations of Representation Learning using Unlabeled Data: Statistics and Optimization

Representation learning from unlabeled data has been extensively studied in statistics, data science and signal processing with a rich literature on techniques for dimension reduction, compression, multi-dimensional scaling among others. However, current deep learning models use new principles for unsupervised representation learning that cannot be easily analyzed using classical theories. For example, visual foundation models have found tremendous success using self-supervision or denoising/masked autoencoders, which effectively learn representations from massive amounts of unlabeled data. However, it remains difficult to characterize the representations learned by these models and to explain why they perform well for diverse prediction tasks or show emergent behavior. To answer these questions, one needs to combine mathematical tools from statistics and optimization. This paper provides an overview of recent theoretical advances in representation learning from unlabeled data and mentions our contributions in this direction.

Updated: 2025-09-23 13:45:11

标题: 使用无标签数据进行表示学习的理论基础：统计学和优化

摘要: 来自未标记数据的表示学习已在统计学、数据科学和信号处理领域进行了广泛研究，有关降维、压缩、多维缩放等技术的文献丰富。然而，当前的深度学习模型使用了无监督表示学习的新原则，这些原则不能轻易通过经典理论进行分析。例如，视觉基础模型通过自监督或去噪/遮罩自动编码器取得了巨大成功，这有效地从大量未标记数据中学习表示。然而，对于这些模型学习的表示进行表征以及解释它们为何在不同的预测任务中表现良好或显示出新兴行为仍然很困难。要回答这些问题，需要结合统计学和优化的数学工具。本文概述了最近在未标记数据表示学习方面的理论进展，并提及了我们在这方向的贡献。

更新时间: 2025-09-23 13:45:11

领域: cs.LG

下载: http://arxiv.org/abs/2509.18997v1

Learning functions, operators and dynamical systems with kernels

This expository article presents the approach to statistical machine learning based on reproducing kernel Hilbert spaces. The basic framework is introduced for scalar-valued learning and then extended to operator learning. Finally, learning dynamical systems is formulated as a suitable operator learning problem, leveraging Koopman operator theory. The manuscript collects the supporting material for the corresponding course taught at the CIME school "Machine Learning: From Data to Mathematical Understanding" in Cetraro.

Updated: 2025-09-23 13:43:42

标题: 用核学习函数、运算符和动力系统

摘要: 这篇论文摘要介绍了基于再生核希尔伯特空间的统计机器学习方法。首先介绍了标量值学习的基本框架，然后扩展到运算学习。最后，将学习动态系统制定为适合的运算学习问题，利用库普曼算子理论。该手稿汇集了在CIME学校“机器学习：从数据到数学理解”课程中教授的支持材料，该课程在切特拉罗举行。

更新时间: 2025-09-23 13:43:42

领域: cs.LG

下载: http://arxiv.org/abs/2509.18071v2

Enhancing Credit Default Prediction Using Boruta Feature Selection and DBSCAN Algorithm with Different Resampling Techniques

This study examines credit default prediction by comparing three techniques, namely SMOTE, SMOTE-Tomek, and ADASYN, that are commonly used to address the class imbalance problem in credit default situations. Recognizing that credit default datasets are typically skewed, with defaulters comprising a much smaller proportion than non-defaulters, we began our analysis by evaluating machine learning (ML) models on the imbalanced data without any resampling to establish baseline performance. These baseline results provide a reference point for understanding the impact of subsequent balancing methods. In addition to traditional classifiers such as Naive Bayes and K-Nearest Neighbors (KNN), our study also explores the suitability of advanced ensemble boosting algorithms, including Extreme Gradient Boosting (XGBoost), AdaBoost, Gradient Boosting Machines (GBM), and Light GBM for credit default prediction using Boruta feature selection and DBSCAN-based outlier detection, both before and after resampling. A real-world credit default data set sourced from the University of Cleveland ML Repository was used to build ML classifiers, and their performances were tested. The criteria chosen to measure model performance are the area under the receiver operating characteristic curve (ROC-AUC), area under the precision-recall curve (PR-AUC), G-mean, and F1-scores. The results from this empirical study indicate that the Boruta+DBSCAN+SMOTE-Tomek+GBM classifier outperformed the other ML models (F1-score: 82.56%, G-mean: 82.98%, ROC-AUC: 90.90%, PR-AUC: 91.85%) in a credit default context. The findings establish a foundation for future progress in creating more resilient and adaptive credit default systems, which will be essential as credit-based transactions continue to rise worldwide.

Updated: 2025-09-23 13:43:18

标题: 利用Boruta特征选择和带有不同重采样技术的DBSCAN算法增强信用违约预测

摘要: 这项研究通过比较三种常用于解决信用违约情况中类别不平衡问题的技术，即SMOTE、SMOTE-Tomek和ADASYN，来研究信用违约预测。认识到信用违约数据集通常存在偏斜现象，违约者占比远远低于未违约者，我们首先在未进行任何重采样的不平衡数据上评估机器学习（ML）模型，以建立基准性能。这些基准结果提供了理解后续平衡方法影响的参考点。除了传统的分类器如朴素贝叶斯和K-最近邻算法（KNN），我们的研究还探讨了先进的集成增强算法的适用性，包括极端梯度提升（XGBoost）、AdaBoost、梯度提升机（GBM）和轻量级GBM，以及使用Boruta特征选择和基于DBSCAN的异常值检测进行信用违约预测，无论是在重采样前还是后。我们使用从克利夫兰大学ML仓库获取的真实信用违约数据集构建了ML分类器，并对其性能进行了测试。衡量模型性能的标准包括接收者操作特征曲线下面积（ROC-AUC）、精确率-召回率曲线下面积（PR-AUC）、G-mean和F1分数。这项实证研究的结果表明，Boruta+DBSCAN+SMOTE-Tomek+GBM分类器在信用违约背景下表现优于其他ML模型（F1分数：82.56％，G-mean：82.98％，ROC-AUC：90.90％，PR-AUC：91.85％）。这些发现为未来在创建更具弹性和适应性的信用违约系统奠定了基础，这将在全球信用交易继续增长的情况下至关重要。

更新时间: 2025-09-23 13:43:18

领域: cs.LG,stat.AP

下载: http://arxiv.org/abs/2509.19408v1

CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose Cross-layer Low-Rank residual Network (CR-Net), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.

Updated: 2025-09-23 13:43:02

标题: CR-Net：使用跨层低秩结构实现参数高效训练的扩展

摘要: 低秩架构对于高效的大型语言模型（LLM）预训练变得越来越重要，可以显著减少参数复杂性和内存/计算需求。尽管具有这些优点，当前的低秩方法存在三个关键缺点：（1）牺牲了模型性能，（2）计算开销巨大，（3）激活内存节省有限。为了解决这些限制，我们提出了交叉层低秩残差网络（CR-Net），这是一个创新的参数高效框架，灵感来源于我们的发现，层间激活残差具有低秩特性。CR-Net通过双路径架构实现这一洞见，通过将前一层的输出与它们的低秩差异结合来高效重构层激活，从而在保持高秩信息的同时使用最少的参数。我们进一步开发了一种专门为CR-Net量身定制的激活重计算策略，大大降低了内存需求。从60M到7B参数的各种规模模型的广泛预训练实验表明，CR-Net始终优于最先进的低秩框架，同时需要更少的计算资源和内存。

更新时间: 2025-09-23 13:43:02

领域: cs.LG

下载: http://arxiv.org/abs/2509.18993v1

Learning From Simulators: A Theory of Simulation-Grounded Learning

Simulation-Grounded Neural Networks (SGNNs) are predictive models trained entirely on synthetic data from mechanistic simulations. They have achieved state-of-the-art performance in domains where real-world labels are limited or unobserved, but lack a formal underpinning. We present the foundational theory of simulation-grounded learning. We show that SGNNs implement amortized Bayesian inference under a simulation prior and converge to the Bayes-optimal predictor. We derive generalization bounds under model misspecification and prove that SGNNs can learn unobservable scientific quantities that empirical methods provably cannot. We also formalize a novel form of mechanistic interpretability uniquely enabled by SGNNs: by attributing predictions to the simulated mechanisms that generated them, SGNNs yield posterior-consistent, scientifically grounded explanations. We provide numerical experiments to validate all theoretical predictions. SGNNs recover latent parameters, remain robust under mismatch, and outperform classical tools: in a model selection task, SGNNs achieve half the error of AIC in distinguishing mechanistic dynamics. These results establish SGNNs as a principled and practical framework for scientific prediction in data-limited regimes.

Updated: 2025-09-23 13:39:11

标题: 从模拟器中学习：基于模拟的学习理论

摘要: 基于模拟的神经网络（SGNNs）是完全在机械模拟中合成数据上训练的预测模型。它们在真实标签有限或未观测到的领域中取得了最先进的性能，但缺乏正式的基础。我们提出了模拟基础学习的基础理论。我们展示了SGNNs在模拟先验下实现了摊销贝叶斯推断，并收敛到贝叶斯最优预测器。我们推导了模型错误规范下的泛化界限，并证明了SGNNs可以学习经验方法明确无法学习的不可观测的科学量。我们还形式化了一种新颖的机械解释性，唯有SGNNs可以实现：通过将预测归因于生成它们的模拟机制，SGNNs提供后验一致、科学基础的解释。我们提供数值实验来验证所有理论预测。SGNNs可以恢复潜在参数，在不匹配情况下保持稳健，并胜过经典工具：在模型选择任务中，SGNNs在区分机械动态方面的错误率仅为AIC的一半。这些结果将SGNNs确立为在数据有限情况下进行科学预测的原则性和实用性框架。

更新时间: 2025-09-23 13:39:11

领域: cs.LG,math.DS

下载: http://arxiv.org/abs/2509.18990v1

Remaining Time Prediction in Outbound Warehouse Processes: A Case Study (Short Paper)

Predictive process monitoring is a sub-domain of process mining which aims to forecast the future of ongoing process executions. One common prediction target is the remaining time, meaning the time that will elapse until a process execution is completed. In this paper, we compare four different remaining time prediction approaches in a real-life outbound warehouse process of a logistics company in the aviation business. For this process, the company provided us with a novel and original event log with 169,523 traces, which we can make publicly available. Unsurprisingly, we find that deep learning models achieve the highest accuracy, but shallow methods like conventional boosting techniques achieve competitive accuracy and require significantly fewer computational resources.

Updated: 2025-09-23 13:37:09

标题: 出库仓储流程中剩余时间预测：一个案例研究（短篇）

摘要: 预测过程监控是过程挖掘的一个子领域，旨在预测正在进行的过程执行的未来。一个常见的预测目标是剩余时间，即直到进程执行完成所经过的时间。在本文中，我们比较了在航空物流公司的实际出库仓库流程中使用的四种不同的剩余时间预测方法。对于这个流程，公司向我们提供了一个包含169,523条迹线的新颖和原始事件日志，我们可以公开使用。毫不奇怪，我们发现深度学习模型实现了最高的准确性，但浅层方法如传统的增强技术也实现了竞争性的准确性，并且需要较少的计算资源。

更新时间: 2025-09-23 13:37:09

领域: cs.AI

下载: http://arxiv.org/abs/2509.18986v1

Difficulty-Aware Agent Orchestration in LLM-Powered Workflows

Large Language Model (LLM)-based agentic systems have shown strong capabilities across various tasks. However, existing multi-agent frameworks often rely on static or task-level workflows, which either over-process simple queries or underperform on complex ones, while also neglecting the efficiency-performance trade-offs across heterogeneous LLMs. To address these limitations, we propose Difficulty-Aware Agentic Orchestration (DAAO), a dynamic framework that adapts workflow depth, operator selection, and LLM assignment based on the difficulty of each input query. DAAO comprises three interdependent modules: a variational autoencoder (VAE) for difficulty estimation, a modular operator allocator, and a cost- and performance-aware LLM router. By leveraging heterogeneous LLMs and dynamically tailoring workflows, DAAO enables fine-grained, query-specific reasoning strategies. DAAO outperforms prior multi-agent systems in both accuracy and inference efficiency across six benchmarks. We will release our code and implementation details upon publication.

Updated: 2025-09-23 13:32:37

标题: 在LLM驱动的工作流程中，基于难度感知的智能代理编排

摘要: 基于大型语言模型（LLM）的代理系统在各种任务中展现出强大的能力。然而，现有的多代理框架通常依赖静态或任务级的工作流程，这要么过度处理简单查询，要么在复杂查询上表现不佳，同时忽视了异构LLM之间的效率-性能权衡。为了解决这些限制，我们提出了一种称为Difficulty-Aware Agentic Orchestration（DAAO）的动态框架，根据每个输入查询的难度调整工作流程深度、操作员选择和LLM分配。DAAO由三个相互依赖的模块组成：用于难度估计的变分自动编码器（VAE）、模块化操作员分配器和成本-性能感知的LLM路由器。通过利用异构LLM并动态定制工作流程，DAAO实现了精细化的、查询特定的推理策略。DAAO在六个基准测试中在准确性和推理效率方面优于先前的多代理系统。我们将在发表后发布我们的代码和实现细节。

更新时间: 2025-09-23 13:32:37

领域: cs.AI

下载: http://arxiv.org/abs/2509.11079v2

LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology

Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.

Updated: 2025-09-23 13:31:18

标题: LLM代理用于交互式工作流溯源：参考架构和评估方法论

摘要: 现代科学发现越来越依赖于处理边缘、云和高性能计算（HPC）连续数据的工作流程。对这些数据进行全面和深入的分析对于假设验证、异常检测、可重现性和有影响力的发现至关重要。虽然工作流程溯源技术支持这些分析，但在大规模情况下，溯源数据变得复杂且难以分析。现有系统依赖于自定义脚本、结构化查询或静态仪表板，限制了数据交互。在这项工作中，我们引入了一种评估方法论、参考架构和开源实现，利用交互式大型语言模型（LLM）代理进行运行时数据分析。我们的方法采用轻量级、基于元数据驱动的设计，将自然语言转换为结构化的溯源查询。通过对LLaMA、GPT、Gemini和Claude的评估，涵盖了多样的查询类别和一个真实的化学工作流程，结果表明模块化设计、提示调整和检索增强生成（RAG）使得LLM代理的响应准确且具有洞察力，超出了记录的溯源。

更新时间: 2025-09-23 13:31:18

领域: cs.DC,cs.AI,cs.DB,68M14, 68M20, 68T07,C.2.4; D.1.3; I.2.0

下载: http://arxiv.org/abs/2509.13978v2

From latent factors to language: a user study on LLM-generated explanations for an inherently interpretable matrix-based recommender system

We investigate whether large language models (LLMs) can generate effective, user-facing explanations from a mathematically interpretable recommendation model. The model is based on constrained matrix factorization, where user types are explicitly represented and predicted item scores share the same scale as observed ratings, making the model's internal representations and predicted scores directly interpretable. This structure is translated into natural language explanations using carefully designed LLM prompts. Many works in explainable AI rely on automatic evaluation metrics, which often fail to capture users' actual needs and perceptions. In contrast, we adopt a user-centered approach: we conduct a study with 326 participants who assessed the quality of the explanations across five key dimensions-transparency, effectiveness, persuasion, trust, and satisfaction-as well as the recommendations themselves.To evaluate how different explanation strategies are perceived, we generate multiple explanation types from the same underlying model, varying the input information provided to the LLM. Our analysis reveals that all explanation types are generally well received, with moderate statistical differences between strategies. User comments further underscore how participants react to each type of explanation, offering complementary insights beyond the quantitative results.

Updated: 2025-09-23 13:30:03

标题: 从潜在因素到语言：LLM生成的解释对基于矩阵的可解释推荐系统的用户研究

摘要: 我们调查了大型语言模型（LLMs）是否可以从一个数学可解释的推荐模型中生成有效的面向用户的解释。该模型基于受限矩阵分解，其中用户类型被明确表示，并且预测的物品评分与观察到的评分共享相同的尺度，使得模型的内部表示和预测的评分直接可解释。这种结构被转化为自然语言解释，使用精心设计的LLM提示。许多可解释AI的作品依赖于自动评估指标，这些指标往往无法捕捉到用户的实际需求和感知。相比之下，我们采用了以用户为中心的方法：我们进行了一项研究，涉及326名参与者，评估解释在透明度、有效性、说服力、信任和满意度等五个关键维度上的质量，以及推荐本身。为了评估不同的解释策略如何被感知，我们从相同的基础模型生成多个解释类型，变化LLM提供的输入信息。我们的分析显示，所有解释类型通常都受到良好接受，不同策略之间存在适度的统计差异。用户评论进一步强调了参与者对每种解释类型的反应，提供了超越定量结果的补充见解。

更新时间: 2025-09-23 13:30:03

领域: cs.AI,cs.HC,cs.IR,H.3.3; H.5.2; I.2.7

下载: http://arxiv.org/abs/2509.18980v1

LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions

Driven by the rapid advancements of Large Language Models (LLMs), LLM-based agents have emerged as powerful intelligent systems capable of human-like cognition, reasoning, and interaction. These agents are increasingly being deployed across diverse real-world applications, including student education, scientific research, and financial analysis. However, despite their remarkable potential, LLM-based agents remain vulnerable to hallucination issues, which can result in erroneous task execution and undermine the reliability of the overall system design. Addressing this critical challenge requires a deep understanding and a systematic consolidation of recent advances on LLM-based agents. To this end, we present the first comprehensive survey of hallucinations in LLM-based agents. By carefully analyzing the complete workflow of agents, we propose a new taxonomy that identifies different types of agent hallucinations occurring at different stages. Furthermore, we conduct an in-depth examination of eighteen triggering causes underlying the emergence of agent hallucinations. Through a detailed review of a large number of existing studies, we summarize approaches for hallucination mitigation and detection, and highlight promising directions for future research. We hope this survey will inspire further efforts toward addressing hallucinations in LLM-based agents, ultimately contributing to the development of more robust and reliable agent systems.

Updated: 2025-09-23 13:24:48

标题: LLM基础代理存在幻觉问题：分类、方法和发展方向调查

摘要: 由于大型语言模型（LLMs）的快速进步，基于LLM的代理已经成为强大的智能系统，能够进行类似人类的认知、推理和交互。这些代理越来越多地被部署在不同的现实世界应用中，包括学生教育、科学研究和金融分析。然而，尽管具有显著的潜力，基于LLM的代理仍然容易受到幻觉问题的影响，这可能导致任务执行错误并削弱整个系统设计的可靠性。解决这一关键挑战需要深入了解和系统整合基于LLM的代理的最新进展。为此，我们提出了对基于LLM的代理中的幻觉进行全面调查的第一份综述。通过仔细分析代理的完整工作流程，我们提出了一种新的分类法，识别出在不同阶段发生的不同类型的代理幻觉。此外，我们深入研究了导致代理幻觉出现的十八种触发原因。通过对大量现有研究的详细回顾，我们总结了减少幻觉和检测的方法，并强调了未来研究的有希望的方向。我们希望这项调查能激发更多的努力，解决基于LLM的代理中的幻觉问题，最终促进更健壮和可靠的代理系统的发展。

更新时间: 2025-09-23 13:24:48

领域: cs.AI

下载: http://arxiv.org/abs/2509.18970v1

Otters: An Energy-Efficient SpikingTransformer via Optical Time-to-First-Spike Encoding

Spiking neural networks (SNNs) promise high energy efficiency, particularly with time-to-first-spike (TTFS) encoding, which maximizes sparsity by emitting at most one spike per neuron. However, such energy advantage is often unrealized because inference requires evaluating a temporal decay function and subsequent multiplication with the synaptic weights. This paper challenges this costly approach by repurposing a physical hardware `bug', namely, the natural signal decay in optoelectronic devices, as the core computation of TTFS. We fabricated a custom indium oxide optoelectronic synapse, showing how its natural physical decay directly implements the required temporal function. By treating the device's analog output as the fused product of the synaptic weight and temporal decay, optoelectronic synaptic TTFS (named Otters) eliminates these expensive digital operations. To use the Otters paradigm in complex architectures like the transformer, which are challenging to train directly due to the sparsity issue, we introduce a novel quantized neural network-to-SNN conversion algorithm. This complete hardware-software co-design enables our model to achieve state-of-the-art accuracy across seven GLUE benchmark datasets and demonstrates a 1.77$\times$ improvement in energy efficiency over previous leading SNNs, based on a comprehensive analysis of compute, data movement, and memory access costs using energy measurements from a commercial 22nm process. Our work thus establishes a new paradigm for energy-efficient SNNs, translating fundamental device physics directly into powerful computational primitives. All codes and data are open source.

Updated: 2025-09-23 13:23:48

标题: Otters：一种能源高效的通过光时至首次尖峰编码的脉冲变换器

摘要: 脉冲神经网络（SNNs）承诺高能效，特别是采用时间至第一脉冲（TTFS）编码，通过每个神经元最多发射一个脉冲来最大化稀疏性。然而，这种能量优势通常未能实现，因为推断需要评估时间衰减函数，并随后与突触权重进行乘法运算。本文通过重新利用物理硬件“漏洞”（即光电器件中的自然信号衰减）来挑战这种昂贵的方法，将其作为TTFS的核心计算。我们制作了一款定制的氧化铟光电子突触，展示了其自然物理衰减如何直接实现所需的时间函数。通过将设备的模拟输出视为突触权重和时间衰减的融合产品，光电子突触TTFS（命名为Otters）消除了这些昂贵的数字操作。为了在像变压器这样的复杂架构中使用Otters范式，由于稀疏性问题直接训练具有挑战性，我们引入了一种新颖的量化神经网络到SNN转换算法。这种完整的硬件-软件协同设计使我们的模型在七个GLUE基准数据集上实现了最先进的准确性，并通过使用商用22nm工艺的能量测量对计算、数据移动和内存访问成本进行全面分析，展示了相对于先前领先的SNNs的1.77倍能效改善。因此，我们的工作为节能型SNNs建立了一种新的范式，直接将基本设备物理转化为强大的计算原语。所有代码和数据均为开源。

更新时间: 2025-09-23 13:23:48

领域: cs.LG

下载: http://arxiv.org/abs/2509.18968v1

Combinational Backdoor Attack against Customized Text-to-Image Models

Recently, Text-to-Image (T2I) synthesis technology has made tremendous strides. Numerous representative T2I models have emerged and achieved promising application outcomes, such as DALL-E, Stable Diffusion, Imagen, etc. In practice, it has become increasingly popular for model developers to selectively adopt personalized pre-trained text encoders and conditional diffusion models from third-party platforms, integrating them together to build customized (personalized) T2I models. However, such an adoption approach is vulnerable to backdoor attacks. In this work, we propose a \textbf{C}ombinational \textbf{B}ackdoor \textbf{A}ttack against \textbf{C}ustomized \textbf{T2I} models (CBACT2I) targeting this application scenario. Different from previous backdoor attacks against T2I models, CBACT2I embeds the backdoor into the text encoder and the conditional diffusion model separately. The customized T2I model exhibits backdoor behaviors only when the backdoor text encoder is used in combination with the backdoor conditional diffusion model. These properties make CBACT2I more stealthy and controllable than prior backdoor attacks against T2I models. Extensive experiments demonstrate the high effectiveness of CBACT2I with different backdoor triggers and backdoor targets, the strong generality on different combinations of customized text encoders and diffusion models, as well as the high stealthiness against state-of-the-art backdoor detection methods.

Updated: 2025-09-23 13:21:55

标题: 组合后门攻击针对定制文本到图像模型

摘要: 最近，文本到图像（T2I）合成技术取得了巨大进展。许多代表性的T2I模型已经出现并取得了有前途的应用结果，如DALL-E、Stable Diffusion、Imagen等。在实践中，模型开发者越来越倾向于选择性地采用个性化的预训练文本编码器和条件扩散模型来构建定制（个性化）的T2I模型。然而，这种采用方法容易受到后门攻击的影响。在这项工作中，我们提出了一种针对这种应用场景的\textbf{C}结合\textbf{B}后门\textbf{A}攻击\textbf{C}定制\textbf{T2I}模型（CBACT2I）。与先前针对T2I模型的后门攻击不同，CBACT2I将后门分别嵌入文本编码器和条件扩散模型中。只有当后门文本编码器与后门条件扩散模型结合使用时，定制的T2I模型才会表现出后门行为。这些特性使CBACT2I比以前针对T2I模型的后门攻击更加隐蔽和可控。大量实验证明了CBACT2I在不同后门触发器和后门目标上的高效性，对不同组合的定制文本编码器和扩散模型的强大泛化性，以及对最先进的后门检测方法的高隐蔽性。

更新时间: 2025-09-23 13:21:55

领域: cs.CR

下载: http://arxiv.org/abs/2411.12389v3

Central Limit Theorems for Asynchronous Averaged Q-Learning

This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We present a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.

Updated: 2025-09-23 13:16:14

标题: 异步平均Q学习的中心极限定理

摘要: 这篇论文建立了在异步更新下Polyak-Ruppert平均Q-learning的中心极限定理。我们提出了一个非渐近的中心极限定理，在Wasserstein距离中的收敛速率明确地反映了对迭代次数、状态动作空间大小、折扣因子和探索质量的依赖关系。此外，我们推导出一个功能性中心极限定理，表明部分和过程弱收敛到一个布朗运动。

更新时间: 2025-09-23 13:16:14

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2509.18964v1

Lift What You Can: Green Online Learning with Heterogeneous Ensembles

Ensemble methods for stream mining necessitate managing multiple models and updating them as data distributions evolve. Considering the calls for more sustainability, established methods are however not sufficiently considerate of ensemble members' computational expenses and instead overly focus on predictive capabilities. To address these challenges and enable green online learning, we propose heterogeneous online ensembles (HEROS). For every training step, HEROS chooses a subset of models from a pool of models initialized with diverse hyperparameter choices under resource constraints to train. We introduce a Markov decision process to theoretically capture the trade-offs between predictive performance and sustainability constraints. Based on this framework, we present different policies for choosing which models to train on incoming data. Most notably, we propose the novel $\zeta$-policy, which focuses on training near-optimal models at reduced costs. Using a stochastic model, we theoretically prove that our $\zeta$-policy achieves near optimal performance while using fewer resources compared to the best performing policy. In our experiments across 11 benchmark datasets, we find empiric evidence that our $\zeta$-policy is a strong contribution to the state-of-the-art, demonstrating highly accurate performance, in some cases even outperforming competitors, and simultaneously being much more resource-friendly.

Updated: 2025-09-23 13:14:37

标题: 举起你所能做到的：具有异质集成的绿色在线学习

摘要: 流式挖掘的集成方法需要管理多个模型，并随着数据分布的演变进行更新。考虑到对更可持续性的呼吁，现有的方法在考虑到集成成员的计算开销方面还不够周到，而过分关注预测能力。为了解决这些挑战并实现绿色在线学习，我们提出了异构在线集成（HEROS）。对于每个训练步骤，HEROS选择一个从具有不同超参数选择的模型池中根据资源约束初始化的模型子集进行训练。我们引入一个马尔可夫决策过程来理论上捕捉预测性能和可持续性约束之间的权衡。基于这个框架，我们提出了不同的策略来选择在传入数据上训练的模型。值得注意的是，我们提出了新颖的$\zeta$-策略，重点是以降低成本训练接近最佳的模型。使用随机模型，我们从理论上证明了我们的$\zeta$-策略在使用较少资源的同时实现了接近最佳性能，与表现最佳的策略相比。在我们对11个基准数据集的实验中，我们发现经验上的证据表明我们的$\zeta$-策略是最先进技术的一个重要贡献，展示了高度准确的性能，在某些情况下甚至超过竞争对手，并同时更加资源友好。

更新时间: 2025-09-23 13:14:37

领域: cs.LG

下载: http://arxiv.org/abs/2509.18962v1

Generative data augmentation for biliary tract detection on intraoperative images

Cholecystectomy is one of the most frequently performed procedures in gastrointestinal surgery, and the laparoscopic approach is the gold standard for symptomatic cholecystolithiasis and acute cholecystitis. In addition to the advantages of a significantly faster recovery and better cosmetic results, the laparoscopic approach bears a higher risk of bile duct injury, which has a significant impact on quality of life and survival. To avoid bile duct injury, it is essential to improve the intraoperative visualization of the bile duct. This work aims to address this problem by leveraging a deep-learning approach for the localization of the biliary tract from white-light images acquired during the surgical procedures. To this end, the construction and annotation of an image database to train the Yolo detection algorithm has been employed. Besides classical data augmentation techniques, the paper proposes Generative Adversarial Network (GAN) for the generation of a synthetic portion of the training dataset. Experimental results have been discussed along with ethical considerations.

Updated: 2025-09-23 13:11:53

标题: 手术图像中胆道检测的生成式数据增强

摘要: 胆囊切除术是胃肠外科中最常见的手术之一，腹腔镜手术是治疗症状性胆囊结石和急性胆囊炎的黄金标准。除了术后恢复速度显著快和更好的美容效果之外，腹腔镜手术存在更高的胆管损伤风险，这对生活质量和生存率有显著影响。为了避免胆管损伤，提高术中胆管可视化是至关重要的。本研究旨在利用深度学习方法，从术中获取的白光图像中定位胆道。为此，利用建立和注释图像数据库来训练Yolo检测算法。除了传统的数据增强技术，本文还提出使用生成对抗网络（GAN）生成训练数据集的合成部分。实验结果已经讨论，同时也涉及伦理考虑。

更新时间: 2025-09-23 13:11:53

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.18958v1

Eva-VLA: Evaluating Vision-Language-Action Models' Robustness Under Real-World Physical Variations

Vision-Language-Action (VLA) models have emerged as promising solutions for robotic manipulation, yet their robustness to real-world physical variations remains critically underexplored. To bridge this gap, we propose Eva-VLA, the first unified framework that systematically evaluates the robustness of VLA models by transforming discrete physical variations into continuous optimization problems. However, comprehensively assessing VLA robustness presents two key challenges: (1) how to systematically characterize diverse physical variations encountered in real-world deployments while maintaining evaluation reproducibility, and (2) how to discover worst-case scenarios without prohibitive real-world data collection costs efficiently. To address the first challenge, we decompose real-world variations into three critical domains: object 3D transformations that affect spatial reasoning, illumination variations that challenge visual perception, and adversarial patches that disrupt scene understanding. For the second challenge, we introduce a continuous black-box optimization framework that transforms discrete physical variations into parameter optimization, enabling systematic exploration of worst-case scenarios. Extensive experiments on state-of-the-art OpenVLA models across multiple benchmarks reveal alarming vulnerabilities: all variation types trigger failure rates exceeding 60%, with object transformations causing up to 97.8% failure in long-horizon tasks. Our findings expose critical gaps between controlled laboratory success and unpredictable deployment readiness, while the Eva-VLA framework provides a practical pathway for hardening VLA-based robotic manipulation models against real-world deployment challenges.

Updated: 2025-09-23 13:02:23

标题: Eva-VLA: 评估视觉-语言-动作模型在真实世界物理变化下的稳健性

摘要: Vision-Language-Action（VLA）模型已经成为机器人操纵的有前途的解决方案，然而它们对真实世界物理变化的鲁棒性仍然受到严重忽视。为了弥合这一差距，我们提出了Eva-VLA，这是第一个统一框架，通过将离散的物理变化转化为连续的优化问题来系统评估VLA模型的鲁棒性。然而，全面评估VLA的鲁棒性面临两个关键挑战：（1）如何系统地表征在真实部署中遇到的多样化物理变化，同时保持评估的可重复性，以及（2）如何在没有过高真实世界数据收集成本的情况下有效地发现最坏情况。为了解决第一个挑战，我们将真实世界的变化分解为三个关键领域：影响空间推理的物体3D变换、挑战视觉感知的光照变化以及扰乱场景理解的对抗性补丁。针对第二个挑战，我们引入了一个连续的黑盒优化框架，将离散的物理变化转化为参数优化，从而实现对最坏情况的系统探索。对最先进的OpenVLA模型在多个基准上进行的大量实验揭示了令人震惊的脆弱性：所有变化类型触发的故障率均超过60％，物体变换导致长时间任务中高达97.8％的故障。我们的发现揭示了受控实验室成功和不可预测的部署准备之间的重要差距，而Eva-VLA框架为加固VLA基础的机器人操纵模型应对真实世界部署挑战提供了一个实用的途径。

更新时间: 2025-09-23 13:02:23

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18953v1

Security and Privacy Measurement on Chinese Consumer IoT Traffic based on Device Lifecycle

In recent years, consumer Internet of Things (IoT) devices have become widely used in daily life. With the popularity of devices, related security and privacy risks arise at the same time as they collect user-related data and transmit it to various service providers. Although China accounts for a larger share of the consumer IoT industry, current analyses on consumer IoT device traffic primarily focus on regions such as Europe, the United States, and Australia. Research on China, however, is currently relatively rare. This study constructs the first large-scale dataset about consumer IoT device traffic in China. Specifically, we propose a fine-grained traffic collection guidance covering the entire lifecycle of consumer IoT devices, gathering traffic from 77 devices spanning 38 brands and 12 device categories. Based on this dataset, we analyze traffic destinations and encryption practices across different device types during the entire lifecycle and compare the findings with the results of other regions. Compared to other regions, our results show that consumer IoT devices in China rely more on domestic services and overall perform better in terms of encryption practices. However, there are still 23/40 devices improperly conducting certificate validation, and 2/70 devices use insecure encryption protocols. To facilitate future research, we open-source our traffic collection guidance and make our dataset publicly available.

Updated: 2025-09-23 13:00:03

标题: 基于设备生命周期的中国消费者物联网流量的安全和隐私测量

摘要: 近年来，消费者物联网（IoT）设备在日常生活中得到广泛应用。随着设备的普及，相关的安全和隐私风险同时出现，因为它们收集用户相关数据并将其传输给各种服务提供商。尽管中国在消费者物联网行业占据较大份额，但目前关于消费者物联网设备流量的分析主要集中在欧洲、美国和澳大利亚等地区。然而，关于中国的研究目前相对较少。本研究构建了关于中国消费者物联网设备流量的首个大规模数据集。具体而言，我们提出了一项涵盖消费者物联网设备整个生命周期的细粒度流量收集指南，从38个品牌和12个设备类别的77个设备中收集流量。基于这个数据集，我们分析了不同设备类型在整个生命周期中的流量目的地和加密实践，并将发现与其他地区的结果进行比较。与其他地区相比，我们的结果显示，中国的消费者物联网设备更多地依赖国内服务，并在加密实践方面总体表现更好。然而，仍有23/40个设备未正确进行证书验证，2/70个设备使用不安全的加密协议。为了促进未来研究，我们开源了我们的流量收集指南，并公开了我们的数据集。

更新时间: 2025-09-23 13:00:03

领域: cs.CR

下载: http://arxiv.org/abs/2505.09929v4

Towards Privacy-Aware Bayesian Networks: A Credal Approach

Bayesian networks (BN) are probabilistic graphical models that enable efficient knowledge representation and inference. These have proven effective across diverse domains, including healthcare, bioinformatics and economics. The structure and parameters of a BN can be obtained by domain experts or directly learned from available data. However, as privacy concerns escalate, it becomes increasingly critical for publicly released models to safeguard sensitive information in training data. Typically, released models do not prioritize privacy by design. In particular, tracing attacks from adversaries can combine the released BN with auxiliary data to determine whether specific individuals belong to the data from which the BN was learned. State-of-the-art protection tecniques involve introducing noise into the learned parameters. While this offers robust protection against tracing attacks, it significantly impacts the model's utility, in terms of both the significance and accuracy of the resulting inferences. Hence, high privacy may be attained at the cost of releasing a possibly ineffective model. This paper introduces credal networks (CN) as a novel solution for balancing the model's privacy and utility. After adapting the notion of tracing attacks, we demonstrate that a CN enables the masking of the learned BN, thereby reducing the probability of successful attacks. As CNs are obfuscated but not noisy versions of BNs, they can achieve meaningful inferences while safeguarding privacy. Moreover, we identify key learning information that must be concealed to prevent attackers from recovering the underlying BN. Finally, we conduct a set of numerical experiments to analyze how privacy gains can be modulated by tuning the CN hyperparameters. Our results confirm that CNs provide a principled, practical, and effective approach towards the development of privacy-aware probabilistic graphical models.

Updated: 2025-09-23 12:58:32

标题: 走向隐私感知贝叶斯网络：一种置信度方法

摘要: 贝叶斯网络（BN）是概率图模型，可以实现高效的知识表示和推理。这已被证明在包括医疗保健、生物信息学和经济学在内的不同领域中是有效的。BN的结构和参数可以由领域专家获取，也可以直接从可用数据中学习。然而，随着隐私问题不断升级，公开发布的模型需要保护训练数据中的敏感信息变得越来越关键。通常情况下，发布的模型并不设计优先考虑隐私。特别是，敌对方的追踪攻击可以将发布的BN与辅助数据结合起来，以确定特定个体是否属于BN学习的数据。最先进的保护技术涉及将噪音引入学习参数中。虽然这可以有效防御追踪攻击，但它显著影响了模型的效用，无论是结果推断的显著性还是准确性。因此，高隐私可能会以发布可能无效的模型为代价。本文提出了可信网络（CN）作为平衡模型隐私和效用的新解决方案。在调整追踪攻击概念后，我们证明了CN可以对学习的BN进行掩盖，从而降低成功攻击的概率。由于CN是BN的模糊而非嘈杂版本，它们可以在维护隐私的同时实现有意义的推断。此外，我们确定了必须隐藏的关键学习信息，以防止攻击者恢复潜在的BN。最后，我们进行了一系列数值实验，分析了如何通过调整CN超参数来调节隐私增益。我们的结果证实，CN提供了一个原则性、实用性和有效性的方法，用于开发注重隐私的概率图模型。

更新时间: 2025-09-23 12:58:32

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18949v1

Adaptive Kernel Design for Bayesian Optimization Is a Piece of CAKE with LLMs

The efficiency of Bayesian optimization (BO) relies heavily on the choice of the Gaussian process (GP) kernel, which plays a central role in balancing exploration and exploitation under limited evaluation budgets. Traditional BO methods often rely on fixed or heuristic kernel selection strategies, which can result in slow convergence or suboptimal solutions when the chosen kernel is poorly suited to the underlying objective function. To address this limitation, we propose a freshly-baked Context-Aware Kernel Evolution (CAKE) to enhance BO with large language models (LLMs). Concretely, CAKE leverages LLMs as the crossover and mutation operators to adaptively generate and refine GP kernels based on the observed data throughout the optimization process. To maximize the power of CAKE, we further propose BIC-Acquisition Kernel Ranking (BAKER) to select the most effective kernel through balancing the model fit measured by the Bayesian information criterion (BIC) with the expected improvement at each iteration of BO. Extensive experiments demonstrate that our fresh CAKE-based BO method consistently outperforms established baselines across a range of real-world tasks, including hyperparameter optimization, controller tuning, and photonic chip design. Our code is publicly available at https://github.com/richardcsuwandi/cake.

Updated: 2025-09-23 12:57:08

标题: Bayesian优化中的自适应核设计是LLMs的一小块蛋糕

摘要: 贝叶斯优化（BO）的效率在很大程度上取决于高斯过程（GP）核的选择，该核在有限的评估预算下在探索和利用之间发挥着重要作用。传统的BO方法通常依赖于固定或启发式的核选择策略，当所选择的核不适合基础目标函数时，可能导致收敛缓慢或次优解。为了解决这一限制，我们提出了一种全新的基于上下文的核进化（CAKE）方法，以增强BO与大型语言模型（LLMs）的结合。具体而言，CAKE利用LLMs作为交叉和变异运算符，根据优化过程中观察到的数据自适应生成和优化GP核。为了最大化CAKE的效力，我们进一步提出了BIC-Acquisition Kernel Ranking（BAKER）来通过在每次BO迭代中平衡贝叶斯信息准则（BIC）测量的模型拟合和期望改进来选择最有效的核。大量实验表明，我们基于新鲜CAKE的BO方法在一系列真实任务中始终优于已建立的基线，包括超参数优化、控制器调整和光子芯片设计。我们的代码可以在https://github.com/richardcsuwandi/cake上公开获取。

更新时间: 2025-09-23 12:57:08

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.17998v2

Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning

Recent advancements in Large Language Models (LLMs) have emphasized the critical role of fine-tuning (FT) techniques in adapting LLMs to specific tasks, especially when retraining from scratch is computationally infeasible. Fine-tuning enables LLMs to leverage task- or domain-specific data, producing models that more effectively meet the requirements of targeted applications. However, con- ventional FT approaches often suffer from catastrophic forgetting and suboptimal data efficiency, limiting their real-world applicability. To address these challenges, this paper proposes DEAL, a novel framework that integrates Low-Rank Adapta- tion (LoRA) with a continuous fine-tuning strategy. By incorporating knowledge retention and adaptive parameter update modules, the framework mitigates the lim- itations of existing FT methods while maintaining efficiency in privacy-preserving settings. Experiments on 15 diverse datasets show that DEAL consistently outper- forms baseline methods, yielding substantial gains in task accuracy and resource efficiency. These findings demonstrate the potential of our approach to advance continual adaptation in LLMs by enhancing task performance while improving resource efficiency.

Updated: 2025-09-23 12:55:57

标题: 大型语言模型中的数据高效适应：通过连续低秩微调

摘要: 最近大型语言模型（LLMs）的进展强调了微调（FT）技术在将LLMs调整到特定任务中的关键作用，特别是在重新训练从头开始计算上不可行的情况下。微调使LLMs能够利用任务或领域特定数据，生成更有效地满足目标应用要求的模型。然而，传统的FT方法经常受到灾难性遗忘和次优数据效率的影响，限制了它们在现实世界中的适用性。为了解决这些挑战，本文提出了DEAL，一个将低秩适应（LoRA）与连续微调策略结合的新型框架。通过结合知识保留和自适应参数更新模块，该框架缓解了现有FT方法的限制，同时在隐私保护设置中保持效率。在15个不同数据集上的实验表明，DEAL始终优于基线方法，提高了任务准确性和资源效率。这些发现展示了我们方法提升LLMs的持续适应能力的潜力，既提高任务性能又改善资源效率。

更新时间: 2025-09-23 12:55:57

领域: cs.AI

下载: http://arxiv.org/abs/2509.18942v1

Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models

Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. The Biology-Instructions is publicly available at: https://github.com/hhnqqq/Biology-Instructions.

Updated: 2025-09-23 12:55:03

标题: 生物学说明：一个用于大型语言模型多组学序列理解能力的数据集和基准。

摘要: 大型语言模型(LLMs)在一般领域展现出了非凡的能力，但它们在多组学生物学中的应用仍未得到充分探索。为了填补这一空白，我们引入了Biology-Instructions，这是第一个针对多组学生物序列的大规模指导调整数据集，包括DNA、RNA、蛋白质和多种分子。这个数据集连接了LLMs和复杂的生物序列相关任务，增强了它们的多功能性和推理能力，同时保持了对话流畅性。我们还强调了当前最先进的LLMs在多组学任务上的显著限制，没有经过专门培训。为了克服这一问题，我们提出了ChatMultiOmics，这是一个拥有新颖的三阶段训练流程的强有力基线，通过Biology-Instructions展示出卓越的生物理解能力。这两个资源都是公开可用的，为更好地将LLMs整合到多组学分析中铺平了道路。Biology-Instructions可在以下网址公开获取：https://github.com/hhnqqq/Biology-Instructions。

更新时间: 2025-09-23 12:55:03

领域: q-bio.BM,cs.AI,cs.LG

下载: http://arxiv.org/abs/2412.19191v2

No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning

While deep learning, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has significantly advanced classification performance, its typical reliance on extensive annotated datasets presents a major obstacle in many practical scenarios where such data is scarce. Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle. Requiring only the set of class names and no labeled training data, our method utilizes a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations. These enhanced features then iteratively train the classifier, allowing the system to capture complementary semantic and visual cues without supervision. Notably, our approach avoids VLM fine-tuning and the use of large language models, relying on the visual-only model to reduce the dependence on semantic representation. Experimental evaluations on ten diverse datasets demonstrate that our approach outperforms the baseline zero-shot method.

Updated: 2025-09-23 12:54:52

标题: 不需要标签：使用协作自学的零样本图像分类

摘要: 尽管深度学习，包括卷积神经网络（CNNs）和视觉变压器（ViTs），显著提高了分类性能，但其典型依赖于大量标注数据集在许多实际场景中很少见的数据情况下构成了一个主要障碍。视觉语言模型（VLMs）和使用预训练视觉模型的迁移学习技术似乎是解决这个问题的有希望的方法。本文提出了一种结合VLM和预训练视觉模型的新型零样本图像分类框架，该框架在自学习循环中仅需要类别名称集合而无需标记的训练数据，利用基于置信度的伪标记策略在测试数据上直接训练轻量级分类器，实现了动态自适应。VLM识别高置信度样本，预训练视觉模型增强它们的视觉表示。这些增强的特征然后迭代地训练分类器，使系统能够在没有监督的情况下捕获互补的语义和视觉线索。值得注意的是，我们的方法避免了VLM微调和大型语言模型的使用，依靠仅视觉模型来减少对语义表示的依赖。在十个不同数据集上的实验评估表明，我们的方法优于基线零样本方法。

更新时间: 2025-09-23 12:54:52

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.18938v1

Generic Adversarial Smart Contract Detection with Semantics and Uncertainty-Aware LLM

Adversarial smart contracts, mostly on EVM-compatible chains like Ethereum and BSC, are deployed as EVM bytecode to exploit vulnerable smart contracts typically for financial gains. Detecting such malicious contracts at the time of deployment is an important proactive strategy preventing loss from victim contracts. It offers a better cost-benefit than detecting vulnerabilities on diverse potential victims. However, existing works are not generic with limited detection types and effectiveness due to imbalanced samples, while the emerging LLM technologies, which show its potentials in generalization, have two key problems impeding its application in this task: hard digestion of compiled-code inputs, especially those with task-specific logic, and hard assessment of LLMs' certainty in their binary answers, i.e., yes-or-no answers. Therefore, we propose a generic adversarial smart contracts detection framework FinDet, which leverages LLMs with two enhancements addressing above two problems. FinDet takes as input only the EVM-bytecode contracts and identifies adversarial ones among them with high balanced accuracy. The first enhancement extracts concise semantic intentions and high-level behavioral logic from the low-level bytecode inputs, unleashing the LLM reasoning capability restricted by the task input. The second enhancement probes and measures the LLM uncertainty to its multi-round answering to the same query, improving the LLM answering robustness for binary classifications required by the task output. Our comprehensive evaluation shows that FinDet achieves a BAC of 0.9223 and a TPR of 0.8950, significantly outperforming existing baselines. It remains robust under challenging conditions including unseen attack patterns, low-data settings, and feature obfuscation. FinDet detects all 5 public and 20+ unreported adversarial contracts in a 10-day real-world test, confirmed manually.

Updated: 2025-09-23 12:52:05

标题: 使用语义和不确定性感知的通用对抗智能合约检测LLM

摘要: 对抗性智能合约主要部署在EVM兼容链上，如以太坊和BSC，被部署为EVM字节码，以利用易受攻击的智能合约，通常是为了获得财务利益。在部署时检测这些恶意合约是一项重要的积极策略，可以防止受害者合约的损失。这比检测潜在受害者的漏洞具有更好的成本效益。然而，现有的工作不够通用，检测类型有限，有效性不足，因为样本不平衡，而新兴的LLM技术展示了其在泛化方面的潜力，但在这项任务中应用时存在两个关键问题：难以消化编译代码输入，特别是那些具有任务特定逻辑的代码，以及难以评估LLM对其二进制答案的确定性，即是或否的答案。因此，我们提出了一个通用的对抗性智能合约检测框架FinDet，利用LLM进行两项增强，解决上述两个问题。FinDet仅接受EVM字节码合约作为输入，并在其中识别对抗性合约，具有高平衡准确性。第一个增强从低级字节码输入中提取简洁的语义意图和高级行为逻辑，释放LLM的推理能力，限制了任务输入。第二个增强探测和测量LLM对同一查询的多轮回答的不确定性，提高了LLM对任务输出所需的二元分类的回答鲁棒性。我们全面评估表明，FinDet实现了0.9223的平衡准确率和0.8950的真阳性率，明显优于现有基线。在包括看不见的攻击模式、低数据设置和特征混淆在内的挑战条件下，FinDet保持稳健。FinDet在为期10天的真实世界测试中检测到了所有5个公开和20多个未报告的对抗性合约，经手动确认。

更新时间: 2025-09-23 12:52:05

领域: cs.CR

下载: http://arxiv.org/abs/2509.18934v1

Accurate and Efficient Prediction of Wi-Fi Link Quality Based on Machine Learning

Wireless communications are characterized by their unpredictability, posing challenges for maintaining consistent communication quality. This paper presents a comprehensive analysis of various prediction models, with a focus on achieving accurate and efficient Wi-Fi link quality forecasts using machine learning techniques. Specifically, the paper evaluates the performance of data-driven models based on the linear combination of exponential moving averages, which are designed for low-complexity implementations and are then suitable for hardware platforms with limited processing resources. Accuracy of the proposed approaches was assessed using experimental data from a real-world Wi-Fi testbed, considering both channel-dependent and channel-independent training data. Remarkably, channel-independent models, which allow for generalized training by equipment manufacturers, demonstrated competitive performance. Overall, this study provides insights into the practical deployment of machine learning-based prediction models for enhancing Wi-Fi dependability in industrial environments.

Updated: 2025-09-23 12:52:01

标题: 基于机器学习的Wi-Fi链路质量准确高效预测

摘要: 无线通信的特点是其不可预测性，这给保持一致的通信质量带来了挑战。本文提出了对各种预测模型进行全面分析，重点是利用机器学习技术实现准确高效的Wi-Fi链路质量预测。具体来说，本文评估了基于指数移动平均线性组合的数据驱动模型的性能，这些模型设计用于低复杂度实现，适用于处理资源有限的硬件平台。所提出方法的准确性是使用来自实际Wi-Fi测试平台的实验数据评估的，考虑了既有通道相关又有通道无关的训练数据。值得注意的是，允许设备制造商进行广义训练的通道无关模型表现出竞争性能。总的来说，本研究为在工业环境中增强Wi-Fi可靠性的机器学习预测模型的实际部署提供了见解。

更新时间: 2025-09-23 12:52:01

领域: cs.NI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.18933v1

FlexFringe: Modeling Software Behavior by Learning Probabilistic Automata

We present the efficient implementations of probabilistic deterministic finite automaton learning methods available in FlexFringe. These implement well-known strategies for state-merging including several modifications to improve their performance in practice. We show experimentally that these algorithms obtain competitive results and significant improvements over a default implementation. We also demonstrate how to use FlexFringe to learn interpretable models from software logs and use these for anomaly detection. Although less interpretable, we show that learning smaller more convoluted models improves the performance of FlexFringe on anomaly detection, outperforming an existing solution based on neural nets.

Updated: 2025-09-23 12:51:41

标题: FlexFringe：通过学习概率自动机对软件行为建模

摘要: 我们提出了在FlexFringe中可用的概率确定性有限自动机学习方法的高效实现。这些方法实现了众所周知的状态合并策略，包括一些修改以提高它们在实践中的性能。我们通过实验证明，这些算法获得了竞争性的结果，并在默认实现上取得了显著改进。我们还展示了如何使用FlexFringe从软件日志中学习可解释的模型，并将其用于异常检测。尽管不太可解释，但我们表明学习更小更复杂的模型可以提高FlexFringe在异常检测中的性能，优于基于神经网络的现有解决方案。

更新时间: 2025-09-23 12:51:41

领域: cs.LG,cs.LO,cs.SE

下载: http://arxiv.org/abs/2203.16331v6

Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

Neural Algorithmic Reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions without post-processing and to reason about multiple correct ones, poor performance on combinatorial NP-hard problems, and inapplicability to problems for which strong algorithms are not yet known. To address these limitations, we reframe the problem of learning algorithm trajectories as a Markov Decision Process, which imposes structure on the solution construction procedure and unlocks the powerful tools of imitation and reinforcement learning (RL). We propose the GNARL framework, encompassing the methodology to translate problem formulations from NAR to RL and a learning architecture suitable for a wide range of graph-based problems. We achieve very high graph accuracy results on several CLRS-30 problems, performance matching or exceeding much narrower NAR approaches for NP-hard problems and, remarkably, applicability even when lacking an expert algorithm.

Updated: 2025-09-23 12:49:25

标题: 解决棘手问题：通过强化学习重新构想的图神经算法推理

摘要: 神经算法推理（NAR）是一种通过监督学习训练神经网络执行经典算法的范式。尽管取得了成功，但仍存在重要的局限性：无法在没有后处理的情况下构建有效解决方案，无法推理多个正确解，对组合型NP难问题表现不佳，并且无法应用于尚未知晓强算法的问题。为了解决这些局限性，我们将学习算法轨迹问题重新构建为马尔可夫决策过程，这为解决方案构建过程增加了结构，并解锁了模仿和强化学习（RL）的强大工具。我们提出了GNARL框架，包括将问题表述从NAR转化为RL的方法论以及适用于各种基于图的问题的学习架构。我们在几个CLRS-30问题上取得了非常高的图准确性结果，性能匹配或超过了针对NP难问题的较窄NAR方法，而且值得注意的是，即使缺乏专家算法，也能适用。

更新时间: 2025-09-23 12:49:25

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18930v1

THFlow: A Temporally Hierarchical Flow Matching Framework for 3D Peptide Design

Deep generative models provide a promising approach to de novo 3D peptide design. Most of them jointly model the distributions of peptide's position, orientation, and conformation, attempting to simultaneously converge to the target pocket. However, in the early stage of docking, optimizing conformation-only modalities such as rotation and torsion can be physically meaningless, as the peptide is initialized far from the protein pocket and no interaction field is present. We define this problem as the multimodal temporal inconsistency problem and claim it is a key factor contributing to low binding affinity in generated peptides. To address this challenge, we propose THFlow, a novel flow matching-based multimodal generative model that explicitly models the temporal hierarchy between peptide position and conformation. It employs a polynomial based conditional flow to accelerate positional convergence early on, and later aligns it with rotation and torsion for coordinated conformation refinement under the emerging interaction field. Additionally, we incorporate interaction-related features, such as polarity, to further enhance the model's understanding of peptide-protein binding. Extensive experiments demonstrate that THFlow outperforms existing methods in generating peptides with superior stability, affinity, and diversity, offering an effective and accurate solution for advancing peptide-based therapeutic development.

Updated: 2025-09-23 12:47:28

标题: THFlow：一种用于3D肽设计的时间层次流匹配框架

摘要: 深度生成模型为全新的3D肽设计提供了一种有前途的方法。大部分模型同时建模肽的位置、方向和构象的分布，试图同时收敛到目标口袋。然而，在对接的早期阶段，优化仅包括旋转和扭转等构象的模态可能是没有物理意义的，因为肽初始化时远离蛋白口袋且没有相互作用场。我们将这一问题定义为多模态时间不一致问题，并声称这是导致生成的肽的结合亲和力低的关键因素。为了解决这一挑战，我们提出了THFlow，一种基于流匹配的多模态生成模型，明确地建模肽位置和构象之间的时间层次关系。它采用基于多项式的条件流来加速早期的位置收敛，并在后期将其与旋转和扭转对齐，以在新出现的相互作用场下协调构象的细化。此外，我们还融入了与相互作用相关的特征，如极性，以进一步增强模型对肽-蛋白结合的理解。广泛的实验表明，THFlow在生成具有优越稳定性、亲和力和多样性的肽方面优于现有方法，为推进基于肽的治疗开发提供了一种有效和准确的解决方案。

更新时间: 2025-09-23 12:47:28

领域: q-bio.QM,cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.15855v2

Learning coordinated badminton skills for legged manipulators

Coordinating the motion between lower and upper limbs and aligning limb control with perception are substantial challenges in robotics, particularly in dynamic environments. To this end, we introduce an approach for enabling legged mobile manipulators to play badminton, a task that requires precise coordination of perception, locomotion, and arm swinging. We propose a unified reinforcement learning-based control policy for whole-body visuomotor skills involving all degrees of freedom to achieve effective shuttlecock tracking and striking. This policy is informed by a perception noise model that utilizes real-world camera data, allowing for consistent perception error levels between simulation and deployment and encouraging learned active perception behaviors. Our method includes a shuttlecock prediction model, constrained reinforcement learning for robust motion control, and integrated system identification techniques to enhance deployment readiness. Extensive experimental results in a variety of environments validate the robot's capability to predict shuttlecock trajectories, navigate the service area effectively, and execute precise strikes against human players, demonstrating the feasibility of using legged mobile manipulators in complex and dynamic sports scenarios.

Updated: 2025-09-23 12:45:55

标题: 学习协调的羽毛球技能以用于有腿的机械手Manipulators

摘要: 在机器人技术中，协调下肢和上肢之间的运动，并将肢体控制与感知对齐是一个重大挑战，特别是在动态环境中。为此，我们提出了一种方法，使四肢移动机器人能够打羽毛球，这项任务需要精确协调感知、运动和摆臂。我们提出了一种基于统一强化学习的控制策略，涉及所有自由度的整体视觉运动技能，以实现有效的羽毛球追踪和击打。该策略受到感知噪声模型的启发，该模型利用真实世界的摄像头数据，允许在仿真和部署之间保持一致的感知误差水平，并鼓励学习的主动感知行为。我们的方法包括一个羽毛球预测模型，用于稳健运动控制的受限强化学习，以及集成系统识别技术，以增强部署准备性。在各种环境中进行的广泛实验结果验证了机器人预测羽毛球轨迹、有效地导航服务区域和对人类玩家进行精确打击的能力，展示了在复杂和动态的体育场景中使用四肢移动机器人的可行性。

更新时间: 2025-09-23 12:45:55

领域: cs.RO,cs.LG,68T40, 93C85,I.2.9; I.2.6; I.2.8

下载: http://arxiv.org/abs/2505.22974v2

Impossibility Results of Card-Based Protocols via Mathematical Optimization

This paper introduces mathematical optimization as a new method for proving impossibility results in the field of card-based cryptography. While previous impossibility proofs were often limited to cases involving a small number of cards, this new approach establishes results that hold for a large number of cards. The research focuses on single-cut full-open (SCFO) protocols, which consist of performing one random cut and then revealing all cards. The main contribution is that for any three-variable Boolean function, no new SCFO protocols exist beyond those already known, under the condition that all additional cards have the same color. The significance of this work is that it provides a new framework for proving impossibility results and delivers a proof that is valid for any number of cards, as long as all additional cards have the same color.

Updated: 2025-09-23 12:45:38

标题: 基于数学优化的卡片式协议的不可能性结果

摘要: 本文介绍了数学优化作为在基于卡片的密码学领域证明不可能性结果的新方法。虽然先前的不可能性证明通常局限于涉及少量卡片的情况，但这种新方法建立了适用于大量卡片的结果。该研究集中在单切全开（SCFO）协议上，该协议包括执行一次随机切割，然后揭示所有卡片。主要贡献在于对于任何三变量布尔函数，在所有额外卡片具有相同颜色的条件下，不存在除已知的那些之外的新的SCFO协议。这项工作的意义在于它提供了一个证明不可能性结果的新框架，并提供了一个对任何数量的卡片都有效的证明，只要所有额外卡片具有相同的颜色。

更新时间: 2025-09-23 12:45:38

领域: cs.CR,math.OC

下载: http://arxiv.org/abs/2509.17595v2

Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization

Drone-view Geo-Localization (DVGL) aims to achieve accurate localization of drones by retrieving the most relevant GPS-tagged satellite images. However, most existing methods heavily rely on strictly pre-paired drone-satellite images for supervised learning. When the target region shifts, new paired samples are typically required to adapt to the distribution changes. The high cost of annotation and the limited transferability of these methods significantly hinder the practical deployment of DVGL in open-world scenarios. To address these limitations, we propose a novel end-to-end self-supervised learning method with a shallow backbone network, called the dynamic memory-driven and neighborhood information learning (DMNIL) method. It employs a clustering algorithm to generate pseudo-labels and adopts a dual-path contrastive learning framework to learn discriminative intra-view representations. Furthermore, DMNIL incorporates two core modules, including the dynamic hierarchical memory learning (DHML) module and the information consistency evolution learning (ICEL) module. The DHML module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the ICEL module utilizes a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, consequently improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced to enhance the quality of pseudo supervision. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.

Updated: 2025-09-23 12:39:26

标题: 没有成对标记数据：用于无人机视角地理定位的端到端自监督学习

摘要: 无人机视角的地理定位（DVGL）旨在通过检索最相关的带有GPS标记的卫星图像来实现无人机的准确定位。然而，大多数现有方法严重依赖于严格预配的无人机-卫星图像进行监督学习。当目标区域发生变化时，通常需要新的配对样本来适应分布变化。注释成本高昂和这些方法的有限可转移性显著阻碍了DVGL在开放环境场景中的实际部署。为了解决这些限制，我们提出了一种新颖的端到端自监督学习方法，称为动态记忆驱动和邻域信息学习（DMNIL）方法。它采用聚类算法生成伪标签，并采用双路径对比学习框架来学习有区分性的视图内表示。此外，DMNIL包括两个核心模块，包括动态分层记忆学习（DHML）模块和信息一致性演化学习（ICEL）模块。DHML模块结合短期和长期记忆，以增强视图内特征的一致性和可区分性。同时，ICEL模块利用邻域驱动的动态约束机制系统地捕捉隐含的跨视图语义相关性，从而提高跨视图特征对齐。为了进一步稳定和加强自监督训练过程，引入了一种伪标签增强策略来提高伪监督的质量。对三个公共基准数据集进行的广泛实验表明，所提出的方法在自监督方法中始终优于现有方法，甚至超过了几种最先进的监督方法。我们的代码可在https://github.com/ISChenawei/DMNIL获得。

更新时间: 2025-09-23 12:39:26

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.11381v4

Dynamic User Interest Augmentation via Stream Clustering and Memory Networks in Large-Scale Recommender Systems

Recommender System (RS) provides personalized recommendation service based on user interest. However, lots of users' interests are sparse due to lacking consumption behaviors, making it challenging to provide accurate recommendations for them, which is widespread in large-scale RSs. In particular, efficiently solving this problem in the ranking stage of RS is an even greater challenge, which requires an end-to-end and real-time approach. To solve this problem, we propose an innovative method called Dynamic User Interest Augmentation (DUIA). DUIA enhances user interest including user profile and user history behavior sequences by generating enhancement vectors and personalized enhancement vectors through dynamic stream clustering of similar users and relevant items from multiple perspectives. To realize stream clustering, we specially design an algorithm called Gradient-based Hierarchical Clustering Algorithm (GHCA) for DUIA, which performs clustering via gradient descent and stores the cluster centers in memory networks. Extensive offline and online experiments demonstrate that DUIA not only significantly improves model performance for users with sparse interests but also delivers notable gains for other users. As an end-to-end method, DUIA can be easily integrated with existing models. Furthermore, DUIA is also used for long-tail items and cold-start problem, which also yields excellent improvements. Since 2022, DUIA has been successfully deployed in multiple industrial RSs in Tencent and was made public in May 2024. Moreover, the thoughts behind DUIA, dynamic stream clustering and similarity-based enhancement, have inspired relevant works and have also been applied in other stages of RS.

Updated: 2025-09-23 12:38:46

标题: 大规模推荐系统中通过流式聚类和记忆网络动态增强用户兴趣

摘要: 推荐系统（RS）基于用户兴趣提供个性化推荐服务。然而，许多用户的兴趣是稀疏的，因为缺乏消费行为，这使得为他们提供准确推荐变得具有挑战性，在大规模RS中普遍存在。特别是，在RS的排序阶段高效解决这个问题是一个更大的挑战，需要一种端到端和实时方法。为了解决这个问题，我们提出了一种名为动态用户兴趣增强（DUIA）的创新方法。DUIA通过从多个角度生成增强向量和个性化增强向量，通过动态流聚类类似用户和相关项目来增强用户兴趣，包括用户配置文件和用户历史行为序列。为了实现流聚类，我们特别设计了一种名为基于梯度的分层聚类算法（GHCA）的算法，该算法通过梯度下降进行聚类，并将聚类中心存储在内存网络中。广泛的离线和在线实验表明，DUIA不仅显著提高了对兴趣稀疏用户的模型性能，还为其他用户带来了显著的收益。作为一种端到端方法，DUIA可以轻松集成到现有模型中。此外，DUIA还用于长尾项目和冷启动问题，也取得了显著的改进。自2022年以来，DUIA已成功部署在腾讯的多个工业RS中，并于2024年5月公开。此外，DUIA背后的思想，动态流聚类和基于相似性的增强，已启发了相关工作，并在RS的其他阶段也得到了应用。

更新时间: 2025-09-23 12:38:46

领域: cs.IR,cs.LG

下载: http://arxiv.org/abs/2405.13238v7

LiDAR Point Cloud Image-based Generation Using Denoising Diffusion Probabilistic Models

Autonomous vehicles (AVs) are expected to revolutionize transportation by improving efficiency and safety. Their success relies on 3D vision systems that effectively sense the environment and detect traffic agents. Among sensors AVs use to create a comprehensive view of surroundings, LiDAR provides high-resolution depth data enabling accurate object detection, safe navigation, and collision avoidance. However, collecting real-world LiDAR data is time-consuming and often affected by noise and sparsity due to adverse weather or sensor limitations. This work applies a denoising diffusion probabilistic model (DDPM), enhanced with novel noise scheduling and time-step embedding techniques to generate high-quality synthetic data for augmentation, thereby improving performance across a range of computer vision tasks, particularly in AV perception. These modifications impact the denoising process and the model's temporal awareness, allowing it to produce more realistic point clouds based on the projection. The proposed method was extensively evaluated under various configurations using the IAMCV and KITTI-360 datasets, with four performance metrics compared against state-of-the-art (SOTA) methods. The results demonstrate the model's superior performance over most existing baselines and its effectiveness in mitigating the effects of noisy and sparse LiDAR data, producing diverse point clouds with rich spatial relationships and structural detail.

Updated: 2025-09-23 12:35:07

标题: 使用去噪扩散概率模型生成基于LiDAR点云图像的方法

摘要: 自动驾驶汽车（AVs）有望通过提高效率和安全性来彻底改变交通运输。它们的成功依赖于能够有效感知环境并检测交通代理的3D视觉系统。在传感器AVs用于创建周围环境全面视图的传感器中，LiDAR提供高分辨率深度数据，实现准确的物体检测、安全导航和避免碰撞。然而，由于不利天气或传感器限制，收集真实世界LiDAR数据是耗时且常常受到噪声和稀疏性的影响。本研究应用了一种噪声扩散概率模型（DDPM），结合了新颖的噪声调度和时间步骤嵌入技术，生成高质量的合成数据以进行增广，从而在一系列计算机视觉任务中提高性能，特别是在AV感知方面。这些修改影响了去噪过程和模型的时间意识，使其能够基于投影产生更真实的点云。所提出的方法在IAMCV和KITTI-360数据集上经过各种配置的广泛评估，将四个性能指标与最先进的方法进行比较。结果表明，该模型在大多数现有基线上表现出优越性能，并且在减轻嘈杂和稀疏LiDAR数据的影响方面具有有效性，生成具有丰富空间关系和结构细节的多样点云。

更新时间: 2025-09-23 12:35:07

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.18917v1

Obelix: Mitigating Side-Channels Through Dynamic Obfuscation

Trusted execution environments (TEEs) offer hardware-assisted means to protect code and data. However, as shown in numerous results over the years, attackers can use side-channels to leak data access patterns and even single-step the code. While the vendors are slowly introducing hardware-based countermeasures for some attacks, others will stay unaddressed. This makes a software-level countermeasure desirable, but current available solutions only address very specific attack vectors or have a narrow leakage model. In this work, we take a holistic view at the vulnerabilities of TEEs and design a tool named Obelix, which is the first to protect both code and data against a wide range of TEE attacks, from cache attacks over single-stepping to ciphertext side-channels. We analyze the practically achievable precision of state-of-the-art single-stepping tools, and present an algorithm which uses that knowledge to divide a program into uniform code blocks, that are indistinguishable for a strong attacker. By storing these blocks and the program data in oblivious RAM, the attacker cannot follow execution, effectively protecting both secret code and data. We describe how we automate our approach to make it available for developers who are unfamiliar with side-channels. As an obfuscation tool, Obelix comes with a considerable performance overhead, but compensates this with strong security guarantees and easy applicability without requiring any expert knowledge.

Updated: 2025-09-23 12:32:51

标题: Obelix：通过动态混淆减轻侧信道

摘要: 受信任的执行环境（TEE）提供了硬件辅助手段来保护代码和数据。然而，多年来的许多研究结果表明，攻击者可以利用侧信道泄露数据访问模式，甚至单步执行代码。虽然供应商正在逐渐引入针对某些攻击的硬件防御措施，但其他攻击仍然未被解决。这使得在软件级别采取防御措施变得令人渴望，但当前可用的解决方案只针对非常特定的攻击向量或具有狭窄的泄漏模型。在这项工作中，我们全面审视了TEE的漏洞，设计了一种名为Obelix的工具，这是第一个能够保护代码和数据免受广泛范围的TEE攻击的工具，从缓存攻击到单步执行再到密文侧信道。我们分析了最先进的单步执行工具的实际可达到的精度，并提出了一种算法，利用这一知识将程序分成统一的代码块，这对于强大的攻击者来说是无法区分的。通过将这些代码块和程序数据存储在遗忘的RAM中，攻击者无法跟随执行，有效保护秘密代码和数据。我们描述了如何将我们的方法自动化，使其可供不熟悉侧信道的开发人员使用。作为一种混淆工具，Obelix带来了相当大的性能开销，但通过强大的安全保证和易于应用性来弥补这一缺点，而无需任何专业知识。

更新时间: 2025-09-23 12:32:51

领域: cs.CR

下载: http://arxiv.org/abs/2509.18909v1

Integrating Stacked Intelligent Metasurfaces and Power Control for Dynamic Edge Inference via Over-The-Air Neural Networks

This paper introduces a novel framework for Edge Inference (EI) that bypasses the conventional practice of treating the wireless channel as noise. We utilize Stacked Intelligent Metasurfaces (SIMs) to control wireless propagation, enabling the channel itself to perform over-the-air computation. This eliminates the need for symbol estimation at the receiver, significantly reducing computational and communication overhead. Our approach models the transmitter-channel-receiver system as an end-to-end Deep Neural Network (DNN) where the response of the SIM elements are trainable parameters. To address channel variability, we incorporate a dedicated DNN module responsible for dynamically adjusting transmission power leveraging user location information. Our performance evaluations showcase that the proposed metasurfaces-integrated DNN framework with deep SIM architectures are capable of balancing classification accuracy and power consumption under diverse scenarios, offering significant energy efficiency improvements.

Updated: 2025-09-23 12:13:06

标题: 通过空中神经网络将堆叠智能超表面和功率控制集成用于动态边缘推理

摘要: 本文介绍了一种新颖的边缘推断（EI）框架，它绕过了将无线信道视为噪声的传统做法。我们利用堆叠智能超表面（SIMs）来控制无线传播，使信道本身能够执行空中计算。这消除了接收器处的符号估计的需要，显著减少了计算和通信开销。我们的方法将发射机-信道-接收机系统建模为端到端的深度神经网络（DNN），其中SIM元素的响应是可训练的参数。为了解决信道变化性，我们还整合了一个专用的DNN模块，负责根据用户位置信息动态调整传输功率。我们的性能评估展示了所提出的集成超表面的DNN框架与深度SIM架构在各种场景下能够平衡分类准确性和功耗，提供了显著的能源效率改进。

更新时间: 2025-09-23 12:13:06

领域: cs.ET,cs.LG,eess.SP

下载: http://arxiv.org/abs/2509.18906v1

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, ie, basic perception, spatial understanding, spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.

Updated: 2025-09-23 12:00:14

标题: VLMs距离视觉空间智能有多远？一个基准驱动的视角

摘要: 视觉空间推理（VSR）是人类核心认知能力，也是推进具有自主智能和自主系统的关键要求。尽管视觉语言模型（VLMs）取得了最近的进展，但由于代表和推理三维空间的复杂性，实现人类水平的VSR仍然具有极高的挑战性。在本文中，我们对VLMs中的VSR进行了系统调查，包括对现有方法论在输入模态、模型架构、训练策略和推理机制方面的综述。此外，我们将空间智能分类为三个能力水平，即基本感知、空间理解、空间规划，并创建了SIBench，一个包含近20个开源数据集跨越23个任务设置的空间智能基准。对最先进的VLMs进行的实验显示，在感知和推理之间存在明显差距，模型在基本感知任务上表现出能力，但在理解和规划任务中持续表现不佳，尤其在数值估算、多视角推理、时间动态和空间想象方面。这些发现强调了在实现空间智能方面仍然存在的巨大挑战，同时提供了一个系统的路线图和一个全面的基准，以推动未来在该领域的研究。本研究的相关资源可在https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/中获取。

更新时间: 2025-09-23 12:00:14

领域: cs.AI

下载: http://arxiv.org/abs/2509.18905v1

Enhancing the Effectiveness and Durability of Backdoor Attacks in Federated Learning through Maximizing Task Distinction

Federated learning allows multiple participants to collaboratively train a central model without sharing their private data. However, this distributed nature also exposes new attack surfaces. In particular, backdoor attacks allow attackers to implant malicious behaviors into the global model while maintaining high accuracy on benign inputs. Existing attacks usually rely on fixed patterns or adversarial perturbations as triggers, which tightly couple the main and backdoor tasks. This coupling makes them vulnerable to dilution by honest updates and limits their persistence under federated defenses. In this work, we propose an approach to decouple the backdoor task from the main task by dynamically optimizing the backdoor trigger within a min-max framework. The inner layer maximizes the performance gap between poisoned and benign samples, ensuring that the contributions of benign users have minimal impact on the backdoor. The outer process injects the adaptive triggers into the local model. We evaluate our method on both computer vision and natural language tasks, and compare it with six backdoor attack methods under six defense algorithms. Experimental results show that our method achieves good attack performance and can be easily integrated into existing backdoor attack techniques.

Updated: 2025-09-23 11:59:29

标题: 通过最大化任务区分来增强联邦学习中后门攻击的有效性和持久性

摘要: 联邦学习允许多个参与者共同训练一个中央模型，而无需共享他们的私人数据。然而，这种分布式特性也暴露了新的攻击面。特别是，后门攻击允许攻击者在全局模型中植入恶意行为，同时在良性输入上保持高准确性。现有的攻击通常依赖于固定模式或对抗性扰动作为触发器，这些触发器紧密耦合主任务和后门任务。这种耦合使它们容易被诚实更新稀释，并限制了它们在联邦防御下的持久性。在这项工作中，我们提出了一种方法，通过在最小-最大框架内动态优化后门触发器来解耦后门任务和主任务。内部层最大化了中毒和良性样本之间的性能差距，确保良性用户的贡献对后门的影响最小化。外部过程将自适应触发器注入到本地模型中。我们在计算机视觉和自然语言任务上评估了我们的方法，并将其与六种防御算法下的六种后门攻击方法进行了比较。实验结果表明，我们的方法实现了良好的攻击性能，并可以轻松集成到现有的后门攻击技术中。

更新时间: 2025-09-23 11:59:29

领域: cs.LG

下载: http://arxiv.org/abs/2509.18904v1

The AI Literacy Heptagon: A Structured Approach to AI Literacy in Higher Education

The integrative literature review addresses the conceptualization and implementation of AI Literacy (AIL) in Higher Education (HE) by examining recent research literature. Through an analysis of publications (2021-2024), we explore (1) how AIL is defined and conceptualized in current research, particularly in HE, and how it can be delineated from related concepts such as Data Literacy, Media Literacy, and Computational Literacy; (2) how various definitions can be synthesized into a comprehensive working definition, and (3) how scientific insights can be effectively translated into educational practice. Our analysis identifies seven central dimensions of AIL: technical, applicational, critical thinking, ethical, social, integrational, and legal. These are synthesized in the AI Literacy Heptagon, deepening conceptual understanding and supporting the structured development of AIL in HE. The study aims to bridge the gap between theoretical AIL conceptualizations and the practical implementation in academic curricula.

Updated: 2025-09-23 11:28:30

标题: AI素养七边形：高等教育中AI素养的结构化方法

摘要: 这篇综合文献回顾研究了人工智能素养（AIL）在高等教育中的概念化和实施，通过审查近期研究文献。通过对出版物（2021-2024）的分析，我们探讨了：（1）当前研究中如何定义和概念化AIL，特别是在高等教育中，以及如何将其与数据素养、媒体素养和计算素养等相关概念区分开；（2）如何将各种定义综合成全面的工作定义；以及（3）如何有效将科学见解转化为教育实践。我们的分析确定了AIL的七个核心维度：技术、应用、批判性思维、伦理、社会、整合和法律。这些维度被综合在AI素养七边形中，加深了对概念的理解，并支持了AIL在高等教育中有组织的发展。该研究旨在弥合理论AIL概念化与学术课程中实际实施之间的鸿沟。

更新时间: 2025-09-23 11:28:30

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2509.18900v1

Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge

Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods either rely on employing task-specific strategies or custom-defined representations, which struggle to leverage the knowledge transfer between different SKR tasks or align with the prior of LLMs, thereby limiting their performance. This paper proposes a novel USKR framework named \textsc{Pandora}, which takes advantage of \textsc{Python}'s \textsc{Pandas} API to construct a unified knowledge representation for alignment with LLM pre-training. It employs an LLM to generate textual reasoning steps and executable Python code for each question. Demonstrations are drawn from a memory of training examples that cover various SKR tasks, facilitating knowledge transfer. Extensive experiments on four benchmarks involving three SKR tasks demonstrate that \textsc{Pandora} outperforms existing unified frameworks and competes effectively with task-specific methods.

Updated: 2025-09-23 11:15:44

标题: 潘多拉：一个基于代码驱动的大型语言模型代理，用于跨多样化结构化知识的统一推理

摘要: 统一结构化知识推理（USKR）旨在通过使用结构化来源，如表格、数据库和知识图，以统一的方式回答自然语言问题（NLQs）。现有的USKR方法要么依赖于采用特定任务策略或自定义定义的表示，这些方法很难利用不同SKR任务之间的知识转移或与LLMs的先验相一致，从而限制了它们的性能。本文提出了一种名为Pandora的新型USKR框架，利用Python的Pandas API构建统一的知识表示，以与LLM预训练对齐。它采用LLM为每个问题生成文本推理步骤和可执行的Python代码。演示来自覆盖各种SKR任务的训练示例的记忆，促进知识转移。对涉及三个SKR任务的四个基准的广泛实验表明，Pandora优于现有的统一框架，并有效竞争特定任务的方法。

更新时间: 2025-09-23 11:15:44

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.12734v2

Exploring Heterophily in Graph-level Tasks

While heterophily has been widely studied in node-level tasks, its impact on graph-level tasks remains unclear. We present the first analysis of heterophily in graph-level learning, combining theoretical insights with empirical validation. We first introduce a taxonomy of graph-level labeling schemes, and focus on motif-based tasks within local structure labeling, which is a popular labeling scheme. Using energy-based gradient flow analysis, we reveal a key insight: unlike frequency-dominated regimes in node-level tasks, motif detection requires mixed-frequency dynamics to remain flexible across multiple spectral components. Our theory shows that motif objectives are inherently misaligned with global frequency dominance, demanding distinct architectural considerations. Experiments on synthetic datasets with controlled heterophily and real-world molecular property prediction support our findings, showing that frequency-adaptive model outperform frequency-dominated models. This work establishes a new theoretical understanding of heterophily in graph-level learning and offers guidance for designing effective GNN architectures.

Updated: 2025-09-23 11:07:16

标题: 探索图级任务中的异质性

摘要: 虽然异质性在节点级任务中得到了广泛研究，但其对图级任务的影响仍不清楚。我们提出了对图级学习中异质性的首次分析，结合理论洞察力和经验验证。我们首先引入了图级标记方案的分类法，并专注于局部结构标记中的基于模式的任务，这是一种流行的标记方案。通过基于能量的梯度流分析，我们揭示了一个关键的观点：与节点级任务中以频率为主导的制度不同，模式检测需要混合频率动态以保持在多个谱成分上的灵活性。我们的理论表明，模式目标与全局频率优势不一致，需要独特的架构考虑。在具有受控异质性的合成数据集和实际分子属性预测上的实验证明了我们的发现，表明频率自适应模型优于频率主导模型。这项工作建立了对图级学习中异质性的新的理论理解，并为设计有效的GNN架构提供了指导。

更新时间: 2025-09-23 11:07:16

领域: cs.LG

下载: http://arxiv.org/abs/2509.18893v1

Interpretable Nanoporous Materials Design with Symmetry-Aware Networks

Nanoporous materials hold promise for diverse sustainable applications, yet their vast chemical space poses challenges for efficient design. Machine learning offers a compelling pathway to accelerate the exploration, but existing models lack either interpretability or fidelity for elucidating the correlation between crystal geometry and property. Here, we report a three-dimensional periodic space sampling method that decomposes large nanoporous structures into local geometrical sites for combined property prediction and site-wise contribution quantification. Trained with a constructed database and retrieved datasets, our model achieves state-of-the-art accuracy and data efficiency for property prediction on gas storage, separation, and electrical conduction. Meanwhile, this approach enables the interpretation of the prediction and allows for accurate identification of significant local sites for targeted properties. Through identifying transferable high-performance sites across diverse nanoporous frameworks, our model paves the way for interpretable, symmetry-aware nanoporous materials design, which is extensible to other materials, like molecular crystals and beyond.

Updated: 2025-09-23 10:48:03

标题: 可解释的纳米多孔材料设计与对称感知网络

摘要: 纳米多孔材料在各种可持续应用中具有潜力，但其庞大的化学空间为高效设计提出了挑战。机器学习提供了加速探索的引人注目途径，但现有模型缺乏解释性或忠实度，无法阐明晶体几何与性能之间的相关性。在这里，我们报道了一种三维周期空间采样方法，将大型纳米多孔结构分解为局部几何位点，用于联合性能预测和位点贡献量化。通过训练构建的数据库和检索数据集，我们的模型在气体存储、分离和电导方面实现了最先进的准确性和数据效率。同时，这种方法使得预测的解释成为可能，并且允许准确识别针对性能的重要局部位点。通过识别在不同纳米多孔框架中可转移的高性能位点，我们的模型为具有解释性和对称性意识的纳米多孔材料设计铺平了道路，这种设计方法可以扩展到其他材料，如分子晶体等。

更新时间: 2025-09-23 10:48:03

领域: cond-mat.mtrl-sci,cs.AI

下载: http://arxiv.org/abs/2509.15908v2

Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs

Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We investigate methods to address this gap and propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference. We validate their practicality by evaluating these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B, 70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions (AMX). We derive 12 insights, including that across various data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency overheads, further reduced by AMX. We run LLM inference on NVIDIA H100 Confidential Compute GPUs, contextualizing our CPU findings and observing throughput penalties of 4-8% that diminish as batch and input sizes grow. By comparing performance, cost, and security trade-offs, we show how CPU TEEs can be more cost-effective or secure than their GPU counterparts. To our knowledge, our work is the first to comprehensively demonstrate the performance and practicality of modern TEEs across both CPUs and GPUs for enabling confidential LLMs (cLLMs).

Updated: 2025-09-23 10:36:47

标题: 私密LLM推理：CPU和GPU TEE的性能和成本

摘要: 大型语言模型（LLMs）越来越多地部署在融合的云和高性能计算（HPC）基础设施上。然而，由于LLMs处理机密输入并在昂贵、专有的数据集上进行微调，它们对安全性的要求增加了，降低了在隐私敏感领域（如医疗保健和金融领域）的采用速度。我们调查了解决这一差距的方法，并提出了可信执行环境（TEEs）作为保护端到端LLM推理的解决方案。我们通过在CPU和GPU TEEs中完全评估这些计算密集型工作负载的方式来验证它们的实用性。在CPU方面，我们进行了深入研究，在Intel的TDX和SGX内运行完整的Llama2推理管道（7B，13B，70B），通过高级矩阵扩展（AMX）加速。我们得出了12个见解，其中包括跨各种数据类型、批处理大小和输入长度，CPU TEEs施加了不到10%的吞吐量和20%的延迟开销，AMX进一步降低了这些开销。我们在NVIDIA H100机密计算GPU上运行LLM推理，将我们的CPU发现置于背景中，并观察到随着批处理和输入大小增长而减少的4-8%的吞吐量惩罚。通过比较性能、成本和安全性权衡，我们展示了CPU TEEs如何比其GPU对应物更具成本效益或安全性。据我们所知，我们的工作是首次全面展示了现代TEE在CPU和GPU上实现保护LLMs（cLLMs）的性能和实用性。

更新时间: 2025-09-23 10:36:47

领域: cs.PF,cs.AR,cs.CR,cs.LG

下载: http://arxiv.org/abs/2509.18886v1

LongCat-Flash-Thinking Technical Report

We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19, 653 to 6, 965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.

Updated: 2025-09-23 10:25:48

标题: 长虹-闪思技术报告

摘要: 我们提出了LongCat-Flash-Thinking，一个高效的、拥有5600亿参数的开源专家混合模型(MoE)推理模型。其先进的能力是通过精心设计的训练过程培养而来，从长Chain-of-Thought(CoT)数据冷启动开始，最终达到大规模强化学习(RL)。我们首先采用了一个精心设计的冷启动训练策略，显著提升了推理潜力，并为模型装备了在形式和主体推理方面的专业技能。然后，一个核心创新是我们的领域并行训练方案，该方案将跨不同领域（例如STEM、Code、主体）的优化解耦，并将结果的专家模型融合成一个几乎是帕累托最优的模型。整个过程由我们的动态异步展开管弦乐(DORA)系统驱动，这是一个大规模RL框架，在数万个加速器上比同步方法提供了三倍以上的训练加速度。因此，LongCat-Flash-Thinking在一系列复杂推理任务中实现了开源模型的最新性能。该模型在主体推理方面表现出卓越的效率，在AIME-25上将平均令牌消耗降低了64.5%（从19,653降至6,965），而不影响任务准确性。我们发布LongCat-Flash-Thinking以推动推理系统和主体智能研究的进一步进展。

更新时间: 2025-09-23 10:25:48

领域: cs.AI

下载: http://arxiv.org/abs/2509.18883v1

JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework

Change detection (CD) in remote sensing images plays a vital role in Earth observation. However, the scarcity of high-resolution, comprehensive open-source datasets and the difficulty in achieving robust performance across varying change types remain major challenges. To address these issues, we introduce JL1-CD, a large-scale, sub-meter CD dataset consisting of 5,000 image pairs. We further propose a novel Origin-Partition (O-P) strategy and integrate it into a Multi-Teacher Knowledge Distillation (MTKD) framework to enhance CD performance. The O-P strategy partitions the training set by Change Area Ratio (CAR) and trains specialized teacher models on each subset. The MTKD framework then distills complementary knowledge from these teachers into a single student model, enabling improved detection results across diverse CAR scenarios without additional inference cost. Our MTKD approach demonstrated strong performance in the 2024 ``Jilin-1'' Cup challenge, ranking first in the preliminary and second in the final rounds. Extensive experiments on the JL1-CD and SYSU-CD datasets show that the MTKD framework consistently improves the performance of CD models with various network architectures and parameter sizes, establishing new state-of-the-art results. Code and dataset are available at https://github.com/circleLZY/MTKD-CD.

Updated: 2025-09-23 10:25:20

标题: JL1-CD：遥感变化检测的新基准和稳健的多教师知识蒸馏框架

摘要: 变化检测（CD）在遥感图像中发挥着地球观测中至关重要的作用。然而，高分辨率、全面开放源数据集的稀缺性以及在不同变化类型中实现稳健性能的困难仍然是主要挑战。为解决这些问题，我们引入了JL1-CD，一个由5,000对图像组成的大规模、亚米级的CD数据集。我们进一步提出了一种新颖的Origin-Partition（O-P）策略，并将其整合到一个Multi-Teacher Knowledge Distillation（MTKD）框架中，以增强CD性能。O-P策略通过变化区域比率（CAR）对训练集进行分区，并在每个子集上训练专门的教师模型。然后，MTKD框架将这些教师的互补知识蒸馏到一个单一的学生模型中，使其能够在不同CAR场景中实现改进的检测结果，而无需额外推断成本。我们的MTKD方法在2024年“吉林一”杯挑战赛中表现出色，在初赛中排名第一，在决赛中排名第二。对JL1-CD和SYSU-CD数据集进行的广泛实验表明，MTKD框架始终提高各种网络架构和参数大小的CD模型的性能，取得了新的最先进结果。代码和数据集可在https://github.com/circleLZY/MTKD-CD获取。

更新时间: 2025-09-23 10:25:20

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.13407v4

Diversity Boosts AI-Generated Text Detection

Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.

Updated: 2025-09-23 10:21:22

标题: 多样性提升人工智能生成文本的检测

摘要: 检测人工智能生成的文本是为了应对LLMs在教育、商业合规、新闻和社交媒体中的滥用而日益必要，合成流畅性可能掩盖误导或欺骗信息。虽然先前的检测器通常依赖于标记级别的可能性或不透明的黑匣子分类器，但这些方法在面对高质量生成时很难，并且提供很少的可解释性。在这项工作中，我们提出了DivEye，一个新颖的检测框架，通过基于惊奇的特征来捕捉文本中的不可预测性如何波动。受到人类撰写的文本展现出比LLM输出更丰富的词汇和结构不可预测性变化的观察启发，DivEye通过一组可解释的统计特征捕捉这个信号。我们的方法在多个基准测试中比现有的零照片检测器表现出高达33.2%的性能，并与微调的基线表现相媲美。DivEye对改写和对抗性攻击具有鲁棒性，在领域和模型之间具有良好的泛化能力，并且当作辅助信号使用时，将现有检测器的性能提高了高达18.7%。除了检测之外，DivEye为为什么文本被标记提供了可解释的见解，指出节奏性的不可预测性是一种强大且未被充分探索的LLM检测信号。

更新时间: 2025-09-23 10:21:22

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.18880v1

Cybersecurity AI: Humanoid Robots as Attack Vectors

We present a systematic security assessment of the Unitree G1 humanoid showing it operates simultaneously as a covert surveillance node and can be purposed as an active cyber operations platform. Initial access can be achieved by exploiting the BLE provisioning protocol which contains a critical command injection vulnerability allowing root access via malformed Wi-Fi credentials, exploitable using hardcoded AES keys shared across all units. Partial reverse engineering of Unitree's proprietary FMX encryption reveal a static Blowfish-ECB layer and a predictable LCG mask-enabled inspection of the system's otherwise sophisticated security architecture, the most mature we have observed in commercial robotics. Two empirical case studies expose the critical risk of this humanoid robot: (a) the robot functions as a trojan horse, continuously exfiltrating multi-modal sensor and service-state telemetry to 43.175.228.18:17883 and 43.175.229.18:17883 every 300 seconds without operator notice, creating violations of GDPR Articles 6 and 13; (b) a resident Cybersecurity AI (CAI) agent can pivot from reconnaissance to offensive preparation against any target, such as the manufacturer's cloud control plane, demonstrating escalation from passive monitoring to active counter-operations. These findings argue for adaptive CAI-powered defenses as humanoids move into critical infrastructure, contributing the empirical evidence needed to shape future security standards for physical-cyber convergence systems.

Updated: 2025-09-23 10:19:17

标题: 网络安全AI：人形机器人作为攻击载体

摘要: 我们提出了对Unitree G1人形机器人的系统安全评估，显示它同时作为隐蔽监视节点运行，并可用作主动网络操作平台。初始访问可以通过利用包含关键命令注入漏洞的BLE配置协议来实现，从而通过格式错误的Wi-Fi凭据获得根访问权限，利用跨所有单元共享的硬编码AES密钥进行利用。对Unitree专有的FMX加密进行部分逆向工程揭示了一个静态的Blowfish-ECB层和一个可预测的LCG掩码使系统的复杂安全架构暴露于检查，这是我们观察到的商业机器人中最成熟的。两个实证案例揭示了这种人形机器人的关键风险：(a) 机器人作为特洛伊木马，每300秒不断将多模式传感器和服务状态遥测传输到43.175.228.18:17883和43.175.229.18:17883，而操作员没有注意到，违反了GDPR第6和第13条款；(b) 一个居住的网络安全AI（CAI）代理可以从侦察转变为对任何目标的攻击准备，例如制造商的云控制平台，展示了从被动监视升级到主动对抗行动。这些发现支持自适应CAI驱动的防御，因为人形机器人进入关键基础设施，为塑造未来物理网络融合系统的安全标准提供了所需的实证证据。

更新时间: 2025-09-23 10:19:17

领域: cs.CR

下载: http://arxiv.org/abs/2509.14139v3

Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems

Weakly Supervised Monitoring Anomaly Detection (WSMAD) utilizes weak supervision learning to identify anomalies, a critical task for smart city monitoring. However, existing multimodal approaches often fail to meet the real-time and interpretability requirements of edge devices due to their complexity. This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning to enable efficient, accurate, and interpretable anomaly detection on edge devices.TCVADS operates in two stages: coarse-grained rapid classification and fine-grained detailed analysis. In the first stage, TCVADS extracts features from video frames and inputs them into a time series analysis module, which acts as the teacher model. Insights are then transferred via knowledge distillation to a simplified convolutional network (student model) for binary classification. Upon detecting an anomaly, the second stage is triggered, employing a fine-grained multi-class classification model. This stage uses CLIP for cross-modal contrastive learning with text and images, enhancing interpretability and achieving refined classification through specially designed triplet textual relationships. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability, offering valuable contributions to smart city monitoring applications.

Updated: 2025-09-23 10:10:52

标题: 将可解释性和轻量级设计注入到弱监督视频异常检测系统中

摘要: 弱监督监测异常检测（WSMAD）利用弱监督学习来识别异常，这是智能城市监测的关键任务。然而，现有的多模态方法通常由于复杂性而无法满足边缘设备的实时性和可解释性要求。本文介绍了TCVADS（两阶段交叉模态视频异常检测系统），该系统利用知识蒸馏和交叉模态对比学习，在边缘设备上实现了高效、准确和可解释的异常检测。TCVADS分为两个阶段：粗粒度快速分类和细粒度详细分析。在第一阶段，TCVADS从视频帧中提取特征，并将它们输入到时间序列分析模块中，该模块充当教师模型。然后通过知识蒸馏将见解转移到简化的卷积网络（学生模型）用于二进制分类。一旦检测到异常，就会触发第二阶段，采用细粒度多类别分类模型。该阶段使用CLIP进行文本和图像的交叉模态对比学习，通过专门设计的三元文本关系增强了可解释性，并通过细化的分类实现。实验结果表明，TCVADS在模型性能、检测效率和可解释性方面明显优于现有方法，为智能城市监测应用提供了有价值的贡献。

更新时间: 2025-09-23 10:10:52

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2412.20201v2

When Ads Become Profiles: Large-Scale Audit of Algorithmic Biases and LLM Profiling Risks

Automated ad targeting on social media is opaque, creating risks of exploitation and invisibility to external scrutiny. Users may be steered toward harmful content while independent auditing of these processes remains blocked. Large Language Models (LLMs) raise a new concern: the potential to reverse-engineer sensitive user attributes from exposure alone. We introduce a multi-stage auditing framework to investigate these risks. First, a large-scale audit of over 435,000 ad impressions delivered to 891 Australian Facebook users reveals algorithmic biases, including disproportionate Gambling and Politics ads shown to socioeconomically vulnerable and politically aligned groups. Second, a multimodal LLM can reconstruct users' demographic profiles from ad streams, outperforming census-based baselines and matching or exceeding human performance. Our results provide the first empirical evidence that ad streams constitute rich digital footprints for public AI inference, highlighting urgent privacy risks and the need for content-level auditing and governance.

Updated: 2025-09-23 10:10:37

标题: 当广告变成档案：算法偏见和LLM档案风险的大规模审计

摘要: 社交媒体上的自动广告定位是不透明的，存在被利用的风险，同时也难以接受外部审查。用户可能会被引导到有害内容，而这些过程的独立审计仍然被阻止。大型语言模型（LLMs）引发了一个新的担忧：仅通过曝光就有可能逆向工程敏感用户属性。我们引入了一个多阶段审计框架来调查这些风险。首先，对超过43.5万次广告展示发送给891名澳大利亚Facebook用户进行了大规模审计，揭示了算法偏见，包括向社会经济脆弱群体和政治倾向群体展示不成比例的赌博和政治广告。其次，一种多模态LLM可以从广告流中重建用户的人口统计资料，表现优于基于人口普查的基线，并且与或超过人类表现。我们的结果首次提供了广告流构成公共AI推断的丰富数字足迹的经验证据，突显了紧迫的隐私风险和对内容级审计和治理的需求。

更新时间: 2025-09-23 10:10:37

领域: cs.HC,cs.AI,cs.CY

下载: http://arxiv.org/abs/2509.18874v1

R-CONV++: Uncovering Privacy Vulnerabilities through Analytical Gradient Inversion Attacks

Federated learning has emerged as a prominent privacy-preserving technique for leveraging large-scale distributed datasets by sharing gradients instead of raw data. However, recent studies indicate that private training data can still be exposed through gradient inversion attacks. While earlier analytical methods have demonstrated success in reconstructing input data from fully connected layers, their effectiveness significantly diminishes when applied to convolutional layers, high-dimensional inputs, and scenarios involving multiple training examples. This paper extends our previous work \cite{eltaras2024r} and proposes three advanced algorithms to broaden the applicability of gradient inversion attacks. The first algorithm presents a novel data leakage method that efficiently exploits convolutional layer gradients, demonstrating that even with non-fully invertible activation functions, such as ReLU, training samples can be analytically reconstructed directly from gradients without the need to reconstruct intermediate layer outputs. Building on this foundation, the second algorithm extends this analytical approach to support high-dimensional input data, substantially enhancing its utility across complex real-world datasets. The third algorithm introduces an innovative analytical method for reconstructing mini-batches, addressing a critical gap in current research that predominantly focuses on reconstructing only a single training example. Unlike previous studies that focused mainly on the weight constraints of convolutional layers, our approach emphasizes the pivotal role of gradient constraints, revealing that successful attacks can be executed with fewer than 5\% of the constraints previously deemed necessary in certain layers.

Updated: 2025-09-23 10:10:12

标题: R-CONV++：通过分析梯度反演攻击揭示隐私漏洞

摘要: 联邦学习作为一种突出的隐私保护技术，通过共享梯度而不是原始数据，利用大规模分布式数据集。然而，最近的研究表明，私人训练数据仍然可以通过梯度反演攻击暴露出来。尽管早期的分析方法已经成功地从全连接层重建输入数据，但当应用于卷积层、高维输入和涉及多个训练示例的场景时，它们的有效性显著降低。本文扩展了我们先前的工作，并提出了三种先进算法，以扩大梯度反演攻击的适用范围。第一种算法提出了一种新颖的数据泄漏方法，有效地利用卷积层梯度，演示了即使使用非完全可逆的激活函数（如ReLU），训练样本也可以直接从梯度中进行分析重建，而无需重建中间层输出。在此基础上，第二种算法将这种分析方法扩展到支持高维输入数据，大大增强了它在复杂现实世界数据集中的实用性。第三种算法引入了一种创新的分析方法，用于重建小批量数据，解决了当前研究中主要集中于重建单个训练示例的关键缺口。与先前主要关注卷积层权重约束的研究不同，我们的方法强调梯度约束的关键作用，揭示了成功攻击可以在某些层中比以前认为必要的约束少于5%的情况下执行。

更新时间: 2025-09-23 10:10:12

领域: cs.CR

下载: http://arxiv.org/abs/2509.18871v1

Memory in Large Language Models: Mechanisms, Evaluation and Evolution

Under a unified operational definition, we define LLM memory as a persistent state written during pretraining, finetuning, or inference that can later be addressed and that stably influences outputs. We propose a four-part taxonomy (parametric, contextual, external, procedural/episodic) and a memory quadruple (location, persistence, write/access path, controllability). We link mechanism, evaluation, and governance via the chain write -> read -> inhibit/update. To avoid distorted comparisons across heterogeneous setups, we adopt a three-setting protocol (parametric only, offline retrieval, online retrieval) that decouples capability from information availability on the same data and timeline. On this basis we build a layered evaluation: parametric (closed-book recall, edit differential, memorization/privacy), contextual (position curves and the mid-sequence drop), external (answer correctness vs snippet attribution/faithfulness), and procedural/episodic (cross-session consistency and timeline replay, E MARS+). The framework integrates temporal governance and leakage auditing (freshness hits, outdated answers, refusal slices) and uncertainty reporting via inter-rater agreement plus paired tests with multiple-comparison correction. For updating and forgetting, we present DMM Gov: coordinating DAPT/TAPT, PEFT, model editing (ROME, MEND, MEMIT, SERAC), and RAG to form an auditable loop covering admission thresholds, rollout, monitoring, rollback, and change audits, with specs for timeliness, conflict handling, and long-horizon consistency. Finally, we give four testable propositions: minimum identifiability; a minimal evaluation card; causally constrained editing with verifiable forgetting; and when retrieval with small-window replay outperforms ultra-long-context reading. This yields a reproducible, comparable, and governable coordinate system for research and deployment.

Updated: 2025-09-23 10:06:58

标题: 大型语言模型中的记忆：机制、评估和演化

摘要: 在统一的操作定义下，我们将LLM记忆定义为在预训练、微调或推理过程中持久存在的状态，可以后续进行访问并稳定地影响输出。我们提出了一个四部分的分类法（参数化、上下文、外部、程序化/情节化）和一个记忆四元组（位置、持久性、写入/访问路径、可控性）。我们通过链式写入->读取->抑制/更新将机制、评估和治理联系起来。为了避免在异构设置中进行扭曲的比较，我们采用了一个三设置协议（仅参数化、离线检索、在线检索），将能力与相同数据和时间轴上的信息可用性分离开来。基于此基础，我们建立了一个分层评估：参数化（闭卷回忆、编辑差异、记忆/隐私）、上下文（位置曲线和中间序列下降）、外部（答案正确性 vs 片段归因/忠实性）、程序化/情节化（跨会话一致性和时间轴重播，E MARS+）。该框架整合了时间治理和泄漏审计（新鲜度命中、过时答案、拒绝切片）以及通过评定者间一致性加上多重比较校正报告的不确定性。对于更新和遗忘，我们提出了DMM Gov：协调DAPT/TAPT、PEFT、模型编辑（ROME、MEND、MEMIT、SERAC）和RAG，形成一个可审计的循环，涵盖准入门槛、推出、监控、回滚和变更审计，同时具备及时性、冲突处理和长期一致性的规范。最后，我们提出了四个可测试的命题：最小可识别性；一个最小的评估卡；受因果约束的可验证遗忘的编辑；以及当检索与小窗口重播优于超长上下文阅读时。这为研究和部署提供了一个可复制、可比较和可管理的坐标系统。

更新时间: 2025-09-23 10:06:58

领域: cs.AI

下载: http://arxiv.org/abs/2509.18868v1

A Neural Difference-of-Entropies Estimator for Mutual Information

Estimating Mutual Information (MI), a key measure of dependence of random quantities without specific modelling assumptions, is a challenging problem in high dimensions. We propose a novel mutual information estimator based on parametrizing conditional densities using normalizing flows, a deep generative model that has gained popularity in recent years. This estimator leverages a block autoregressive structure to achieve improved bias-variance trade-offs on standard benchmark tasks.

Updated: 2025-09-23 10:06:24

标题: 一种用于互信息估计的神经网络熵之差估计器

摘要: 估计互信息（MI）是衡量随机量之间依赖关系的关键指标，而不需要特定的建模假设，在高维度下是一个具有挑战性的问题。我们提出了一种基于参数化条件密度的新型互信息估计器，使用归一化流作为基础，归一化流是近年来备受青睐的深度生成模型。这种估计器利用块自回归结构，在标准基准任务上实现了改进的偏差-方差折衷。

更新时间: 2025-09-23 10:06:24

领域: stat.ML,cs.IT,cs.LG,math.IT

下载: http://arxiv.org/abs/2502.13085v2

Bi-VLA: Bilateral Control-Based Imitation Learning via Vision-Language Fusion for Action Generation

We propose Bilateral Control-Based Imitation Learning via Vision-Language Fusion for Action Generation (Bi-VLA), a novel framework that extends bilateral control-based imitation learning to handle more than one task within a single model. Conventional bilateral control methods exploit joint angle, velocity, torque, and vision for precise manipulation but require task-specific models, limiting their generality. Bi-VLA overcomes this limitation by utilizing robot joint angle, velocity, and torque data from leader-follower bilateral control with visual features and natural language instructions through SigLIP and FiLM-based fusion. We validated Bi-VLA on two task types: one requiring supplementary language cues and another distinguishable solely by vision. Real-robot experiments showed that Bi-VLA successfully interprets vision-language combinations and improves task success rates compared to conventional bilateral control-based imitation learning. Our Bi-VLA addresses the single-task limitation of prior bilateral approaches and provides empirical evidence that combining vision and language significantly enhances versatility. Experimental results validate the effectiveness of Bi-VLA in real-world tasks. For additional material, please visit the website: https://mertcookimg.github.io/bi-vla/

Updated: 2025-09-23 10:02:16

标题: Bi-VLA：基于双向控制的通过视觉-语言融合进行行为生成的模仿学习

摘要: 我们提出了基于双边控制的视觉-语言融合模仿学习框架（Bi-VLA），该框架将双边控制的模仿学习扩展到在单个模型内处理多个任务。传统的双边控制方法利用关节角度、速度、扭矩和视觉进行精确操作，但需要特定于任务的模型，限制了其通用性。Bi-VLA通过利用领导者-跟随者双边控制的机器人关节角度、速度和扭矩数据，结合SigLIP和FiLM的融合，通过视觉特征和自然语言指令来克服这一限制。我们在两种任务类型上验证了Bi-VLA：一种需要额外的语言提示，另一种仅通过视觉可区分。真实机器人实验表明，与传统的双边控制模仿学习相比，Bi-VLA成功解释了视觉-语言组合，并提高了任务成功率。我们的Bi-VLA解决了先前双边方法的单一任务限制，并提供了实证证据，证明结合视觉和语言显著增强了多功能性。实验结果验证了Bi-VLA在现实世界任务中的有效性。如需更多资料，请访问网站：https://mertcookimg.github.io/bi-vla/

更新时间: 2025-09-23 10:02:16

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2509.18865v1

Conf-Profile: A Confidence-Driven Reasoning Paradigm for Label-Free User Profiling

User profiling, as a core technique for user understanding, aims to infer structural attributes from user information. Large Language Models (LLMs) provide a promising avenue for user profiling, yet the progress is hindered by the lack of comprehensive benchmarks. To bridge this gap, we propose ProfileBench, an industrial benchmark derived from a real-world video platform, encompassing heterogeneous user data and a well-structured profiling taxonomy. However, the profiling task remains challenging due to the difficulty of collecting large-scale ground-truth labels, and the heterogeneous and noisy user information can compromise the reliability of LLMs. To approach label-free and reliable user profiling, we propose a Confidence-driven Profile reasoning framework Conf-Profile, featuring a two-stage paradigm. We first synthesize high-quality labels by leveraging advanced LLMs with confidence hints, followed by confidence-weighted voting for accuracy improvement and confidence calibration for a balanced distribution. The multiple profile results, rationales, and confidence scores are aggregated and distilled into a lightweight LLM. We further enhance the reasoning ability via confidence-guided unsupervised reinforcement learning, which exploits confidence for difficulty filtering, quasi-ground truth voting, and reward weighting. Experimental results demonstrate that Conf-Profile delivers substantial performance through the two-stage training, improving F1 by 13.97 on Qwen3-8B.

Updated: 2025-09-23 09:58:37

标题: Conf-Profile：一种基于信心驱动的无标签用户建模推理范式

摘要: 用户画像作为用户理解的核心技术，旨在从用户信息中推断结构属性。大型语言模型（LLMs）为用户画像提供了一个有前途的途径，但是由于缺乏全面的基准测试，进展受到了阻碍。为了弥合这一差距，我们提出了ProfileBench，这是一个从现实世界视频平台衍生出来的工业基准测试，涵盖了异构用户数据和一个结构良好的画像分类系统。然而，由于收集大规模的地面真实标签的困难以及异构和嘈杂的用户信息可能会影响LLMs的可靠性，用户画像仍然是一个具有挑战性的任务。为了实现无标签和可靠的用户画像，我们提出了一种置信度驱动的画像推理框架Conf-Profile，具有两阶段范式。我们首先通过利用带有置信度提示的先进LLMs合成高质量的标签，然后通过置信度加权投票来提高准确性，并通过置信度校准来实现平衡分布。多个画像结果、理由和置信度分数被汇总并蒸馏成一个轻量级LLMs。我们通过置信度引导的无监督强化学习来进一步增强推理能力，利用置信度进行难度过滤、准地面真实投票和奖励加权。实验结果表明，Conf-Profile通过两阶段训练实现了显著的性能提升，在Qwen3-8B上将F1提高了13.97。

更新时间: 2025-09-23 09:58:37

领域: cs.AI

下载: http://arxiv.org/abs/2509.18864v1

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily due to the intelligent chatbots convey a wealth of non-textual information which cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K includes both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement about Qwen2.5-Omni in objective accuracy from 53.4$\%$ to 91.5$\%$. In subjective A/B testing, WavReward also leads by a margin of 83$\%$. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be publicly at https://github.com/jishengpeng/WavReward after the paper is accepted.

Updated: 2025-09-23 09:57:50

标题: WavReward：具有通用奖励评估器的口语对话模型

摘要: 最近，像GPT-4o-audio这样的端到端口语对话模型在语音领域引起了广泛关注。然而，对口语对话模型对话性能的评估在很大程度上被忽视了。这主要是因为智能聊天机器人传达了大量非文本信息，这些信息不容易用基于文本的语言模型（如ChatGPT）来衡量。为了弥补这一差距，我们提出了WavReward，这是一个基于音频语言模型的奖励反馈模型，可以评估口语对话系统的智商和情商。具体来说，1）基于音频语言模型，WavReward结合了深度推理过程和后训练的非线性奖励机制。通过利用强化学习算法通过多样本反馈，我们构建了一个针对口语对话模型的专门评估器。2）我们引入了ChatReward-30K，一个用于训练WavReward的偏好数据集。ChatReward-30K包括口语对话模型的理解和生成方面。这些场景涵盖了各种任务，如基于文本的聊天、九个指导聊天的声学属性和隐式聊天。WavReward在多个口语对话场景中优于先前的最先进评估模型，从53.4％的客观准确率提高到91.5％的Qwen2.5-Omni。在主观A/B测试中，WavReward也以83％的边际领先。全面的消融研究证实了WavReward每个组件的必要性。所有数据和代码将在本文被接受后公开在https://github.com/jishengpeng/WavReward。

更新时间: 2025-09-23 09:57:50

领域: eess.AS,cs.AI,cs.LG,cs.MM,cs.SD

下载: http://arxiv.org/abs/2505.09558v2

EventVL: Understand Event Streams via Multimodal Large Language Model

The event-based Vision-Language Model (VLM) recently has made good progress for practical vision tasks. However, most of these works just utilize CLIP for focusing on traditional perception tasks, which obstruct model understanding explicitly the sufficient semantics and context from event streams. To address the deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding. Specifically, to bridge the data gap for connecting different modalities semantics, we first annotate a large event-image/video-text dataset, containing almost 1.4 million high-quality pairs of data, which enables effective learning across various scenes, e.g., drive scene or human motion. After that, we design Event Spatiotemporal Representation to fully explore the comprehensive information by diversely aggregating and segmenting the event stream. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete sparse semantic spaces of events. Extensive experiments show that our EventVL can significantly surpass existing MLLM baselines in event captioning and scene description generation tasks. We hope our research could contribute to the development of the event vision community.

Updated: 2025-09-23 09:53:54

标题: EventVL：通过多模态大型语言模型理解事件流（注意：这里的EventVL可能是一个特定的名词或缩写，需要根据具体情况进行翻译）

摘要: 最近，基于事件的视觉语言模型（VLM）在实际视觉任务中取得了良好的进展。然而，大多数这些工作仅利用CLIP来关注传统感知任务，这阻碍了模型从事件流中明确理解充分的语义和上下文。为了解决这一不足，我们提出了EventVL，这是第一个用于显式语义理解的生成式基于事件的MLLM（多模态大型语言模型）框架。具体来说，为了弥合连接不同模态语义的数据差距，我们首先注释了一个大型事件-图像/视频-文本数据集，其中包含近140万对高质量数据，这使得能够在各种场景中进行有效学习，例如驾驶场景或人体运动场景。之后，我们设计了事件时空表示，通过多样地聚合和分段事件流来充分探索全面信息。为了进一步促进紧凑的语义空间，引入了动态语义对齐来改进和完善事件的稀疏语义空间。广泛的实验表明，我们的EventVL可以在事件字幕生成和场景描述生成任务中显著超越现有的MLLM基线。我们希望我们的研究能对事件视觉社区的发展做出贡献。

更新时间: 2025-09-23 09:53:54

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2501.13707v2

Prompting for Performance: Exploring LLMs for Configuring Software

Software systems usually provide numerous configuration options that can affect performance metrics such as execution time, memory usage, binary size, or bitrate. On the one hand, making informed decisions is challenging and requires domain expertise in options and their combinations. On the other hand, machine learning techniques can search vast configuration spaces, but with a high computational cost, since concrete executions of numerous configurations are required. In this exploratory study, we investigate whether large language models (LLMs) can assist in performance-oriented software configuration through prompts. We evaluate several LLMs on tasks including identifying relevant options, ranking configurations, and recommending performant configurations across various configurable systems, such as compilers, video encoders, and SAT solvers. Our preliminary results reveal both positive abilities and notable limitations: depending on the task and systems, LLMs can well align with expert knowledge, whereas hallucinations or superficial reasoning can emerge in other cases. These findings represent a first step toward systematic evaluations and the design of LLM-based solutions to assist with software configuration.

Updated: 2025-09-23 09:52:43

标题: 激发表现：探索用于配置软件的LLMs

摘要: 软件系统通常提供大量配置选项，可以影响执行时间、内存使用、二进制大小或比特率等性能指标。一方面，做出明智的决策是具有挑战性的，需要对选项及其组合具有领域专业知识。另一方面，机器学习技术可以搜索广泛的配置空间，但需要高昂的计算成本，因为需要对众多配置进行具体执行。在这项探索性研究中，我们调查了大型语言模型（LLMs）是否可以通过提示来辅助性能导向的软件配置。我们评估了几个LLMs在识别相关选项、排名配置和推荐高性能配置等任务上的表现，涵盖了编译器、视频编码器和SAT求解器等各种可配置系统。我们的初步结果显示了积极的能力和显著的局限性：根据任务和系统的不同，LLMs可以与专家知识很好地契合，而在其他情况下可能出现幻觉或表面推理。这些发现代表了朝着系统评估和设计基于LLM的解决方案以辅助软件配置的第一步。

更新时间: 2025-09-23 09:52:43

领域: cs.SE,cs.AI,cs.PF

下载: http://arxiv.org/abs/2507.09790v2

Differentially Private Algorithms for Graphs Under Continual Observation

Differentially private algorithms protect individuals in data analysis scenarios by ensuring that there is only a weak correlation between the existence of the user in the data and the result of the analysis. Dynamic graph algorithms maintain the solution to a problem (e.g., a matching) on an evolving input, i.e., a graph where nodes or edges are inserted or deleted over time. They output the value of the solution after each update operation, i.e., continuously. We study (event-level and user-level) differentially private algorithms for graph problems under continual observation, i.e., differentially private dynamic graph algorithms. We present event-level private algorithms for partially dynamic counting-based problems such as triangle count that improve the additive error by a polynomial factor (in the length $T$ of the update sequence) on the state of the art, resulting in the first algorithms with additive error polylogarithmic in $T$. We also give $\varepsilon$-differentially private and partially dynamic algorithms for minimum spanning tree, minimum cut, densest subgraph, and maximum matching. The additive error of our improved MST algorithm is $O(W \log^{3/2}T / \varepsilon)$, where $W$ is the maximum weight of any edge, which, as we show, is tight up to a $(\sqrt{\log T} / \varepsilon)$-factor. For the other problems, we present a partially-dynamic algorithm with multiplicative error $(1+\beta)$ for any constant $\beta > 0$ and additive error $O(W \log(nW) \log(T) / (\varepsilon \beta))$. Finally, we show that the additive error for a broad class of dynamic graph algorithms with user-level privacy must be linear in the value of the output solution's range.

Updated: 2025-09-23 09:51:03

标题: 图形在持续观察下的差分隐私算法

摘要: 差分隐私算法通过确保用户在数据中的存在与分析结果之间只有微弱的相关性，以保护数据分析场景中的个人。动态图算法在不断变化的输入（例如，在时间内插入或删除节点或边的图）上维护问题（例如，匹配）的解决方案。它们在每次更新操作后输出解决方案的值，即连续更新。我们研究（事件级别和用户级别）不断保护图问题的差分隐私算法，即差分隐私动态图算法。我们提出了针对部分动态计数型问题（如三角形计数）的事件级别私有算法，通过多项式因子（在更新序列长度$T$中）改进了附加误差，使其成为首个具有$T$中对数多项式附加误差的算法。我们还提供了$\varepsilon$-差分隐私和部分动态算法，用于最小生成树、最小割、最密子图和最大匹配问题。我们改进的最小生成树算法的附加误差为$O(W \log^{3/2}T / \varepsilon)$，其中$W$是任何边的最大权重，正如我们所展示的，这是紧密的，直到$(\sqrt{\log T} / \varepsilon)$-因子。对于其他问题，我们提出了一个具有乘法误差$(1+\beta)$（对于任何常数$\beta > 0$）和附加误差$O(W \log(nW) \log(T) / (\varepsilon \beta))$的部分动态算法。最后，我们证明了具有用户级别隐私的广泛类别的动态图算法的附加误差必须与输出解决方案范围的值呈线性关系。

更新时间: 2025-09-23 09:51:03

领域: cs.DS,cs.CR

下载: http://arxiv.org/abs/2106.14756v3

SCoT: Straight Consistent Trajectory for Pre-Trained Diffusion Model Distillations

Pre-trained diffusion models are commonly used to generate clean data (e.g., images) from random noises, effectively forming pairs of noises and corresponding clean images. Distillation on these pre-trained models can be viewed as the process of constructing advanced trajectories within the pair to accelerate sampling. For instance, consistency model distillation develops consistent projection functions to regulate trajectories, although sampling efficiency remains a concern. Rectified flow method enforces straight trajectories to enable faster sampling, yet relies on numerical ODE solvers, which may introduce approximation errors. In this work, we bridge the gap between the consistency model and the rectified flow method by proposing a Straight Consistent Trajectory~(SCoT) model. SCoT enjoys the benefits of both approaches for fast sampling, producing trajectories with consistent and straight properties simultaneously. These dual properties are strategically balanced by targeting two critical objectives: (1) regulating the gradient of SCoT's mapping to a constant, (2) ensuring trajectory consistency. Extensive experimental results demonstrate the effectiveness and efficiency of SCoT.

Updated: 2025-09-23 09:47:03

标题: SCoT：预训练扩散模型蒸馏的直线一致轨迹

摘要: 预训练扩散模型通常用于从随机噪声生成干净数据（例如图像），有效地形成噪声和相应干净图像的配对。在这些预训练模型上进行蒸馏可以被视为构建高级轨迹以加速采样的过程。例如，一致性模型蒸馏开发出一致的投影函数来调节轨迹，尽管采样效率仍然是一个问题。校正流方法强制直线轨迹以实现更快的采样，但依赖于数值ODE求解器，可能引入近似误差。在这项工作中，我们通过提出直线一致轨迹（SCoT）模型来弥合一致性模型和校正流方法之间的差距。SCoT同时具有快速采样的双重优势，产生具有一致和直线特性的轨迹。通过针对两个关键目标来战略平衡这些双重属性：（1）调节SCoT映射的梯度为常数，（2）确保轨迹的一致性。大量实验结果证明了SCoT的有效性和效率。

更新时间: 2025-09-23 09:47:03

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2502.16972v3

msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML

AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive to fit within an MCU's tiny memory budget e.g., 128kB of RAM. However, inference latency must remain small to fit real-time constraints. An approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.

Updated: 2025-09-23 09:46:58

标题: msf-CNN：基于卷积神经网络的小型机器学习的基于补丁的多阶段融合

摘要: 人工智能涵盖了从大型语言模型到在微控制器（MCU）上运行的微小模型。极其内存高效的模型架构对于适应MCU的微小内存预算至关重要，例如128KB的RAM。然而，推理延迟必须保持较小以适应实时约束。一种解决这个问题的方法是基于补丁的融合，旨在优化神经网络层之间的数据流。在本文中，我们介绍了msf-CNN，一种新颖的技术，通过遍历表示为有向无环图的融合解空间，有效地找到卷积神经网络（CNNs）的最佳融合设置。与以前针对MCU的CNN融合工作相比，msf-CNN确定了更广泛的解决方案。我们发布了一个在各种微控制器（ARM Cortex-M、RISC-V、ESP32）上运行的msf-CNN实现。我们展示了与先前技术（MCUNetV2和StreamNet）相比，msf-CNN可以使用50％更少的RAM实现推理。因此，我们展示了msf-CNN如何为系统设计师提供额外的灵活性。

更新时间: 2025-09-23 09:46:58

领域: cs.LG,cs.PF

下载: http://arxiv.org/abs/2505.11483v2

Packed-Ensembles for Efficient Uncertainty Estimation

Deep Ensembles (DE) are a prominent approach for achieving excellent performance on key metrics such as accuracy, calibration, uncertainty estimation, and out-of-distribution detection. However, hardware limitations of real-world systems constrain to smaller ensembles and lower-capacity networks, significantly deteriorating their performance and properties. We introduce Packed-Ensembles (PE), a strategy to design and train lightweight structured ensembles by carefully modulating the dimension of their encoding space. We leverage grouped convolutions to parallelize the ensemble into a single shared backbone and forward pass to improve training and inference speeds. PE is designed to operate within the memory limits of a standard neural network. Our extensive research indicates that PE accurately preserves the properties of DE, such as diversity, and performs equally well in terms of accuracy, calibration, out-of-distribution detection, and robustness to distribution shift. We make our code available at https://github.com/ENSTA-U2IS/torch-uncertainty.

Updated: 2025-09-23 09:46:34

标题: 紧凑集成以提高不确定性估计效率

摘要: 深度集成（DE）是实现优秀性能的一种重要方法，关键指标包括准确度、校准、不确定性估计和超出分布检测。然而，现实世界系统的硬件限制限制了更小的集成和容量较低的网络，显着降低了它们的性能和特性。我们引入了Packed-Ensembles（PE），一种通过精心调节编码空间维度来设计和训练轻量级结构化集成的策略。我们利用分组卷积将集成并行化为单个共享的主干和前向传递，以提高训练和推理速度。PE被设计为在标准神经网络的内存限制下运行。我们的广泛研究表明，PE准确地保留了DE的特性，如多样性，并在准确性、校准、超出分布检测和对分布转移的稳健性方面表现同样出色。我们将我们的代码提供在https://github.com/ENSTA-U2IS/torch-uncertainty。

更新时间: 2025-09-23 09:46:34

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2210.09184v4

NGRPO: Negative-enhanced Group Relative Policy Optimization

RLVR has enhanced the reasoning capabilities of Large Language Models (LLMs) across various tasks. However, GRPO, a representative RLVR algorithm, suffers from a critical limitation: when all responses within a group are either entirely correct or entirely incorrect, the model fails to learn from these homogeneous responses. This is particularly problematic for homogeneously incorrect groups, where GRPO's advantage function yields a value of zero, leading to null gradients and the loss of valuable learning signals. To overcome this issue, we propose NGRPO (Negative-enhanced Group Relative Policy Optimization), an algorithm designed to convert homogeneous errors into robust learning signals. First, NGRPO introduces Advantage Calibration. This mechanism hypothesizes the existence of a virtual maximum-reward sample during advantage calculation, thereby altering the mean and variance of rewards within a group and ensuring that the advantages for homogeneously incorrect samples are no longer zero. Second, NGRPO employs Asymmetric Clipping, which relaxes the update magnitude for positive samples while imposing stricter constraints on that of negative samples. This serves to stabilize the exploration pressure introduced by the advantage calibration. Our experiments on Qwen2.5-Math-7B demonstrate that NGRPO significantly outperforms baselines such as PPO, GRPO, DAPO, and PSR-NSR on mathematical benchmarks including MATH500, AMC23, and AIME2025. These results validate NGRPO's ability to learn from homogeneous errors, leading to stable and substantial improvements in mathematical reasoning. Our code is available at https://github.com/nangongrui-ngr/NGRPO.

Updated: 2025-09-23 09:38:10

标题: NGRPO: 负增强组相对策略优化

摘要: RLVR已经增强了大型语言模型（LLMs）在各种任务中的推理能力。然而，作为代表性的RLVR算法，GRPO存在一个关键限制：当一个组内的所有响应要么完全正确要么完全错误时，模型无法从这些同质化的响应中学习。这对于同质化错误组尤其有问题，因为GRPO的优势函数会产生零值，导致梯度为空并丢失宝贵的学习信号。为了解决这个问题，我们提出了NGRPO（负增强组相对策略优化），这是一种旨在将同质化错误转化为稳健学习信号的算法。首先，NGRPO引入了优势校准。该机制假设在优势计算过程中存在一个虚拟的最大奖励样本，从而改变组内奖励的均值和方差，并确保同质化错误样本的优势不再为零。其次，NGRPO采用了非对称剪裁，放宽了正样本的更新幅度，同时对负样本施加更严格的约束。这有助于稳定优势校准引入的探索压力。我们在Qwen2.5-Math-7B上的实验表明，NGRPO在数学基准测试（包括MATH500、AMC23和AIME2025）上明显优于基准方法，如PPO、GRPO、DAPO和PSR-NSR。这些结果验证了NGRPO学习同质化错误的能力，从而在数学推理中稳定且显著地提高。我们的代码可在https://github.com/nangongrui-ngr/NGRPO上获得。

更新时间: 2025-09-23 09:38:10

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18851v1

MAPO: Mixed Advantage Policy Optimization

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.

Updated: 2025-09-23 09:37:16

标题: MAPO：混合优势策略优化

摘要: 最近在强化学习领域模型基础上取得了一些进展，比如群体相对策略优化（GRPO），显著提高了在推理任务上基础模型的性能。值得注意的是，优势函数在GRPO中充当着对轨迹重要性进行排名的中心机制。然而，现有的研究遇到了优势逆转和优势镜像问题，这些问题妨碍了在不同查询样本之间合理分配优势。在这项工作中，我们提出了一种简单但有效的GRPO策略，混合优势策略优化（MAPO）。我们发现轨迹出现的确定性不同，并提出了针对高确定性轨迹样本的优势百分比偏差。此外，我们动态重新加权了具有不同轨迹确定性的样本的优势函数，从而自适应地配置优势函数以考虑样本特定特征。与相关的最新方法进行比较，以及在不同优势变体上进行消融研究，验证了我们方法的有效性。

更新时间: 2025-09-23 09:37:16

领域: cs.AI

下载: http://arxiv.org/abs/2509.18849v1

Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to 'think more' instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.

Updated: 2025-09-23 09:35:49

标题: Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions 失败使代理者更强大：通过结构化反思提高准确性以实现可靠的工具交互

摘要: 工具增强的大型语言模型（LLMs）通常通过监督模仿或粗粒度强化学习进行训练，优化单个工具调用。当前的自我反思实践依赖于启发式提示或单向推理：模型被敦促“思考更多”而不是学习错误诊断和修复。这在多轮交互中是脆弱的；失败后，模型通常会重复相同的错误。我们提出了结构化反思，将从错误到修复的路径转化为明确、可控和可训练的行为。代理生成一个简短而精确的反思：它使用前一步的证据诊断失败，然后提出一个正确的、可执行的后续调用。为了训练，我们将DAPO和GSPO目标与针对工具使用定制的奖励方案相结合，优化分步策略反思，然后调用，然后最终。为了评估，我们引入了Tool-Reflection-Bench，一个轻量级基准，可以编程检查结构有效性、可执行性、参数正确性和结果一致性。任务被构建为出错调用、反思和修正调用的迷你轨迹，具有不相交的训练和测试拆分。在BFCL v3和Tool-Reflection-Bench上的实验证明了在多轮工具调用成功和错误恢复方面取得了巨大进展，并减少了冗余调用。这些结果表明，将反思变得明确并直接优化它可提高工具交互的可靠性，并为代理从失败中学习提供可复制的路径。

更新时间: 2025-09-23 09:35:49

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.18847v1

Model selection meets clinical semantics: Optimizing ICD-10-CM prediction via LLM-as-Judge evaluation, redundancy-aware sampling, and section-aware fine-tuning

Accurate International Classification of Diseases (ICD) coding is critical for clinical documentation, billing, and healthcare analytics, yet it remains a labour-intensive and error-prone task. Although large language models (LLMs) show promise in automating ICD coding, their challenges in base model selection, input contextualization, and training data redundancy limit their effectiveness. We propose a modular framework for ICD-10 Clinical Modification (ICD-10-CM) code prediction that addresses these challenges through principled model selection, redundancy-aware data sampling, and structured input design. The framework integrates an LLM-as-judge evaluation protocol with Plackett-Luce aggregation to assess and rank open-source LLMs based on their intrinsic comprehension of ICD-10-CM code definitions. We introduced embedding-based similarity measures, a redundancy-aware sampling strategy to remove semantically duplicated discharge summaries. We leverage structured discharge summaries from Taiwanese hospitals to evaluate contextual effects and examine section-wise content inclusion under universal and section-specific modelling paradigms. Experiments across two institutional datasets demonstrate that the selected base model after fine-tuning consistently outperforms baseline LLMs in internal and external evaluations. Incorporating more clinical sections consistently improves prediction performance. This study uses open-source LLMs to establish a practical and principled approach to ICD-10-CM code prediction. The proposed framework provides a scalable, institution-ready solution for real-world deployment of automated medical coding systems by combining informed model selection, efficient data refinement, and context-aware prompting.

Updated: 2025-09-23 09:35:05

标题: 模型选择与临床语义相遇：通过LLM作为评估者，重复性感知采样和部分感知微调优化ICD-10-CM预测

摘要: 准确的国际疾病分类（ICD）编码对于临床记录、账单和医疗分析至关重要，然而它仍然是一项劳动密集且容易出错的任务。虽然大型语言模型（LLMs）在自动化ICD编码方面表现出潜力，但它们在基础模型选择、输入上下文化和训练数据冗余方面存在挑战，限制了它们的有效性。我们提出了一个模块化框架，用于ICD-10临床修改（ICD-10-CM）代码预测，通过原则性的模型选择、冗余感知的数据抽样和结构化输入设计来解决这些挑战。该框架将LLM作为评判协议与Plackett-Luce聚合相结合，根据其对ICD-10-CM代码定义的内在理解来评估和排名开源LLMs。我们引入了基于嵌入的相似度度量，一种冗余感知的采样策略，用于去除语义重复的出院摘要。我们利用台湾医院的结构化出院摘要来评估上下文效应，并在通用和特定部分建模范式下检查部分内容的包含。对两个机构数据集进行的实验表明，在微调后选择的基础模型在内部和外部评估中始终优于基线LLMs。一致地增加临床部分总是提高预测性能。本研究利用开源LLMs建立了一种实用和原则性的ICD-10-CM代码预测方法。所提出的框架通过结合明智的模型选择、高效的数据精炼和上下文感知的提示，为自动化医疗编码系统的真实世界部署提供了可扩展的、机构准备就绪的解决方案。

更新时间: 2025-09-23 09:35:05

领域: cs.AI,I.2.6; I.2.7; J.3

下载: http://arxiv.org/abs/2509.18846v1

Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?

Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at https://github.com/evidenceprime/BioASQ-13b.

Updated: 2025-09-23 09:27:57

标题: 越来越多的开放权重LLMs是否正在缩小与专有模型在生物医学问题回答方面的差距？

摘要: 开放权重版本的大型语言模型（LLMs）正在迅速发展，最先进的模型如DeepSeek-V3现在表现与专有LLMs相当。这一进展引发了一个问题，即小型开放权重LLMs是否能有效地取代更大的闭源模型。我们特别关注生物医学问答的背景，这是我们通过参与BioASQ挑战的Task 13B Phase B阶段来探索的一个领域。在这项工作中，我们将几个开放权重模型与GPT-4o、GPT-4.1、Claude 3.5 Sonnet和Claude 3.7 Sonnet等表现最佳的系统进行比较。为了增强问答能力，我们采用了各种技术，包括基于嵌入距离检索最相关的片段、上下文学习和结构化输出。对于某些提交，我们利用集成方法利用不同模型生成的多样化输出来回答准确答案的问题。我们的结果表明，开放权重LLMs与专有LLMs相媲美。在某些情况下，开放权重LLMs甚至超过了闭源模型，特别是在应用集成策略时。所有代码均在https://github.com/evidenceprime/BioASQ-13b上公开。

更新时间: 2025-09-23 09:27:57

领域: cs.CL,cs.IR,cs.LG

下载: http://arxiv.org/abs/2509.18843v1

Shared-Weights Extender and Gradient Voting for Neural Network Expansion

Expanding neural networks during training is a promising way to augment capacity without retraining larger models from scratch. However, newly added neurons often fail to adjust to a trained network and become inactive, providing no contribution to capacity growth. We propose the Shared-Weights Extender (SWE), a novel method explicitly designed to prevent inactivity of new neurons by coupling them with existing ones for smooth integration. In parallel, we introduce the Steepest Voting Distributor (SVoD), a gradient-based method for allocating neurons across layers during deep network expansion. Our extensive benchmarking on four datasets shows that our method can effectively suppress neuron inactivity and achieve better performance compared to other expanding methods and baselines.

Updated: 2025-09-23 09:27:47

标题: 共享权重扩展器和梯度投票用于神经网络扩展

摘要: 在训练过程中扩展神经网络是一种有前途的方式，可以增加容量，而不必从头开始重新训练更大的模型。然而，新添加的神经元通常无法适应训练过的网络，变得不活跃，对容量增长没有贡献。我们提出了Shared-Weights Extender（SWE），这是一种新颖的方法，专门设计用于防止新神经元的不活跃，通过将它们与现有神经元耦合以实现平滑整合。同时，我们引入了Steepest Voting Distributor（SVoD），这是一种基于梯度的方法，用于在深度网络扩展过程中跨层分配神经元。我们对四个数据集进行了广泛的基准测试，结果显示我们的方法可以有效抑制神经元的不活跃，并与其他扩展方法和基准相比取得更好的性能。

更新时间: 2025-09-23 09:27:47

领域: cs.LG

下载: http://arxiv.org/abs/2509.18842v1

TimeMosaic: Temporal Heterogeneity Guided Time Series Forecasting via Adaptive Granularity Patch and Segment-wise Decoding

Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.

Updated: 2025-09-23 09:20:00

标题: TimeMosaic：通过自适应粒度补丁和分段解码指导的时间异质性时间序列预测

摘要: 多变量时间序列预测在金融、交通、气候和能源等领域至关重要。然而，现有的基于补丁的方法通常采用固定长度的分割，忽视了局部时间动态的异质性和预测的解码异质性。这样的设计在信息密集区域丢失细节，在稳定段引入冗余，并未捕捉短期和长期视野的不同复杂性。我们提出了TimeMosaic，这是一个旨在解决时间异质性的预测框架。TimeMosaic采用自适应补丁嵌入，根据局部信息密度动态调整粒度，平衡主题重用和结构清晰性，同时保留时间连续性。此外，它引入了分段解码，将每个预测视野视为相关的子任务，并根据视野特定的困难和信息需求进行调整，而不是应用单一统一的解码器。对基准数据集进行的广泛评估表明，TimeMosaic相对现有方法取得了持续改进，并且我们在拥有3210亿观测的大规模语料库上训练的模型表现与最先进的TSFM相媲美。

更新时间: 2025-09-23 09:20:00

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.19406v1

Bounded PCTL Model Checking of Large Language Model Outputs

In this paper, we introduce LLMCHECKER, a model-checking-based verification method to verify the probabilistic computation tree logic (PCTL) properties of an LLM text generation process. We empirically show that only a limited number of tokens are typically chosen during text generation, which are not always the same. This insight drives the creation of $\alpha$-$k$-bounded text generation, narrowing the focus to the $\alpha$ maximal cumulative probability on the top-$k$ tokens at every step of the text generation process. Our verification method considers an initial string and the subsequent top-$k$ tokens while accommodating diverse text quantification methods, such as evaluating text quality and biases. The threshold $\alpha$ further reduces the selected tokens, only choosing those that exceed or meet it in cumulative probability. LLMCHECKER then allows us to formally verify the PCTL properties of $\alpha$-$k$-bounded LLMs. We demonstrate the applicability of our method in several LLMs, including Llama, Gemma, Mistral, Genstruct, and BERT. To our knowledge, this is the first time PCTL-based model checking has been used to check the consistency of the LLM text generation process.

Updated: 2025-09-23 09:19:37

标题: 大规模语言模型输出的有界PCTL模型检测

摘要: 在这篇论文中，我们介绍了LLMCHECKER，这是一种基于模型检查的验证方法，用于验证LLM文本生成过程中的概率计算树逻辑（PCTL）属性。我们通过实证研究表明，在文本生成过程中通常只选择了有限数量的标记，而且这些标记并不总是相同的。这一洞察力推动了$\alpha$-$k$有界文本生成的创建，将焦点缩小到文本生成过程中每一步的顶部-$k$标记中的$\alpha$最大累积概率。我们的验证方法考虑了一个初始字符串和随后的顶部-$k$标记，同时适应了不同的文本量化方法，如评估文本质量和偏见。阈值$\alpha$进一步减少了所选标记，只选择那些超过或达到累积概率的标记。然后，LLMCHECKER允许我们正式验证$\alpha$-$k$有界LLM的PCTL属性。我们展示了我们的方法在几种LLM中的适用性，包括Llama、Gemma、Mistral、Genstruct和BERT。据我们所知，这是首次使用基于PCTL的模型检查来检查LLM文本生成过程的一致性。

更新时间: 2025-09-23 09:19:37

领域: cs.AI

下载: http://arxiv.org/abs/2509.18836v1

Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters

Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods not only require intensive training time and GPU memory usage to learn the sliders or embeddings but also need to be retrained for different diffusion backbones, limiting their scalability and adaptability. To address these limitations, we introduce Text Slider, a lightweight, efficient and plug-and-play framework that identifies low-rank directions within a pre-trained text encoder, enabling continuous control of visual concepts while significantly reducing training time, GPU memory consumption, and the number of trainable parameters. Furthermore, Text Slider supports multi-concept composition and continuous control, enabling fine-grained and flexible manipulation in both image and video synthesis. We show that Text Slider enables smooth and continuous modulation of specific attributes while preserving the original spatial layout and structure of the input. Text Slider achieves significantly better efficiency: 5$\times$ faster training than Concept Slider and 47$\times$ faster than Attribute Control, while reducing GPU memory usage by nearly 2$\times$ and 4$\times$, respectively.

Updated: 2025-09-23 09:17:18

标题: 文本滑块：通过LoRA适配器实现图像/视频合成的高效即插即用连续概念控制

摘要: 最近在扩散模型方面取得的进展显著提高了图像和视频合成的质量。此外，已经提出了几种概念控制方法，以实现对自由文本提示的细粒度、连续和灵活的控制。然而，这些方法不仅需要大量的训练时间和GPU内存来学习滑块或嵌入，还需要为不同的扩散骨干重新训练，从而限制了它们的可伸缩性和适应性。为了解决这些限制，我们引入了Text Slider，这是一个轻量级、高效且即插即用的框架，能够在预训练的文本编码器中识别低秩方向，从而实现对视觉概念的连续控制，同时显著减少训练时间、GPU内存消耗和可训练参数的数量。此外，Text Slider支持多概念组合和连续控制，实现在图像和视频合成中的细粒度和灵活操作。我们展示了Text Slider能够平滑且连续地调节特定属性，同时保留输入的原始空间布局和结构。Text Slider实现了更高的效率：比Concept Slider快5倍，比Attribute Control快47倍，同时GPU内存使用量分别减少了近2倍和4倍。

更新时间: 2025-09-23 09:17:18

领域: cs.GR,cs.AI,cs.CV,cs.LG,cs.MM

下载: http://arxiv.org/abs/2509.18831v1

DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation

Human skin provides a rich tactile sensing stream, localizing intentional and unintentional contact events over a large and contoured region. Replicating these tactile sensing capabilities for dexterous robotic manipulation systems remains a longstanding challenge. In this work, we take a step towards this goal by introducing DexSkin. DexSkin is a soft, conformable capacitive electronic skin that enables sensitive, localized, and calibratable tactile sensing, and can be tailored to varying geometries. We demonstrate its efficacy for learning downstream robotic manipulation by sensorizing a pair of parallel jaw gripper fingers, providing tactile coverage across almost the entire finger surfaces. We empirically evaluate DexSkin's capabilities in learning challenging manipulation tasks that require sensing coverage across the entire surface of the fingers, such as reorienting objects in hand and wrapping elastic bands around boxes, in a learning-from-demonstration framework. We then show that, critically for data-driven approaches, DexSkin can be calibrated to enable model transfer across sensor instances, and demonstrate its applicability to online reinforcement learning on real robots. Our results highlight DexSkin's suitability and practicality for learning real-world, contact-rich manipulation. Please see our project webpage for videos and visualizations: https://dex-skin.github.io/.

Updated: 2025-09-23 09:16:34

标题: DexSkin：用于学习接触丰富操纵的高覆盖性可适应性机器皮肤

摘要: 人类皮肤提供了丰富的触觉感知流，可以定位意图和非意图接触事件，覆盖大面积且有轮廓的区域。复制这些触觉感知能力，用于灵巧的机器人操作系统，仍然是一个长期存在的挑战。在这项工作中，我们通过引入DexSkin向这个目标迈出了一步。DexSkin是一种软性、适应性的电容电子皮肤，可以实现敏感、局部化和可校准的触觉感知，并且可以根据不同的几何形状进行定制。我们通过为一对平行夹具手指装配传感器来展示其在学习下游机器人操作中的有效性，提供几乎覆盖整个手指表面的触觉覆盖范围。我们在学习示范框架中对DexSkin的能力进行了实证评估，这些挑战性的操作任务需要在手指整个表面提供感知覆盖，如手中重新定位物体和在盒子周围缠绕弹性带。然后，我们展示了对于基于数据驱动的方法至关重要的一点，即DexSkin可以校准以实现跨传感器实例的模型转移，并展示其在真实机器人上的在线强化学习的适用性。我们的结果突显了DexSkin在学习真实世界中富有接触的操作中的适用性和实用性。请参阅我们的项目网页获取视频和可视化：https://dex-skin.github.io/。

更新时间: 2025-09-23 09:16:34

领域: cs.RO,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.18830v1

Graph-based Clustering Revisited: A Relaxation of Kernel $k$-Means Perspective

The well-known graph-based clustering methods, including spectral clustering, symmetric non-negative matrix factorization, and doubly stochastic normalization, can be viewed as relaxations of the kernel $k$-means approach. However, we posit that these methods excessively relax their inherent low-rank, nonnegative, doubly stochastic, and orthonormal constraints to ensure numerical feasibility, potentially limiting their clustering efficacy. In this paper, guided by our theoretical analyses, we propose \textbf{Lo}w-\textbf{R}ank \textbf{D}oubly stochastic clustering (\textbf{LoRD}), a model that only relaxes the orthonormal constraint to derive a probabilistic clustering results. Furthermore, we theoretically establish the equivalence between orthogonality and block diagonality under the doubly stochastic constraint. By integrating \textbf{B}lock diagonal regularization into LoRD, expressed as the maximization of the Frobenius norm, we propose \textbf{B-LoRD}, which further enhances the clustering performance. To ensure numerical solvability, we transform the non-convex doubly stochastic constraint into a linear convex constraint through the introduction of a class probability parameter. We further theoretically demonstrate the gradient Lipschitz continuity of our LoRD and B-LoRD enables the proposal of a globally convergent projected gradient descent algorithm for their optimization. Extensive experiments validate the effectiveness of our approaches. The code is publicly available at https://github.com/lwl-learning/LoRD.

Updated: 2025-09-23 09:14:39

标题: 基于图的聚类再探讨：一种松弛的核$k$-均值视角

摘要: 众所周知，基于图的聚类方法，包括谱聚类、对称非负矩阵分解和双随机归一化，可以被看作是核$k$-means方法的放松。然而，我们认为这些方法过度放松了它们固有的低秩、非负、双随机和正交约束，以确保数值可行性，可能限制了它们的聚类效果。在本文中，通过我们的理论分析，我们提出了\textbf{Lo}w-\textbf{R}ank \textbf{D}oubly stochastic clustering (\textbf{LoRD})，这是一个仅放松正交约束以得出概率聚类结果的模型。此外，我们在理论上建立了正交性与块对角性在双随机约束下的等价性。通过将\textbf{B}lock diagonal regularization集成到LoRD中，表示为最大化Frobenius范数，我们提出了\textbf{B-LoRD}，进一步提高了聚类性能。为了确保数值可解性，我们通过引入一个类概率参数，将非凸双随机约束转化为线性凸约束。我们进一步在理论上证明了我们的LoRD和B-LoRD的梯度Lipschitz连续性，从而提出了一个全局收敛的投影梯度下降算法来优化它们。大量实验证实了我们方法的有效性。代码公开可在https://github.com/lwl-learning/LoRD获取。

更新时间: 2025-09-23 09:14:39

领域: cs.LG

下载: http://arxiv.org/abs/2509.18826v1

On the Convergence of Policy Mirror Descent with Temporal Difference Evaluation

Policy mirror descent (PMD) is a general policy optimization framework in reinforcement learning, which can cover a wide range of typical policy optimization methods by specifying different mirror maps. Existing analysis of PMD requires exact or approximate evaluation (for example unbiased estimation via Monte Carlo simulation) of action values solely based on policy. In this paper, we consider policy mirror descent with temporal difference evaluation (TD-PMD). It is shown that, given the access to exact policy evaluations, the dimension-free $O(1/T)$ sublinear convergence still holds for TD-PMD with any constant step size and any initialization. In order to achieve this result, new monotonicity and shift invariance arguments have been developed. The dimension free $\gamma$-rate linear convergence of TD-PMD is also established provided the step size is selected adaptively. For the two common instances of TD-PMD (i.e., TD-PQA and TD-NPG), it is further shown that they enjoy the convergence in the policy domain. Additionally, we investigate TD-PMD in the inexact setting and give the sample complexity for it to achieve the last iterate $\varepsilon$-optimality under a generative model, which improves the last iterate sample complexity for PMD over the dependence on $1/(1-\gamma)$.

Updated: 2025-09-23 09:11:03

标题: 关于策略镜像下降与时间差分评估的收敛性

摘要: 政策镜像下降（PMD）是一种在强化学习中的通用政策优化框架，可以通过指定不同的镜像映射来涵盖各种典型政策优化方法。现有对PMD的分析要求仅基于政策的行动值的精确或近似评估（例如通过蒙特卡洛模拟的无偏估计）。本文考虑了具有时序差分评估（TD-PMD）的政策镜像下降。结果表明，只要具有准确的政策评估，TD-PMD在任何恒定步长和任何初始化下仍然保持维度自由的$O(1/T)$次线性收敛。为了实现这一结果，已经发展了新的单调性和平移不变性论证。还建立了TD-PMD的维度自由$\gamma$-速率线性收敛，前提是选择自适应步长。对于TD-PMD的两种常见实例（即TD-PQA和TD-NPG），进一步表明它们在政策领域中具有收敛性。另外，我们调查了不精确设置中的TD-PMD，并给出了在生成模型下实现最后迭代$\varepsilon$-最优性的样本复杂度，这改善了对PMD的最后迭代样本复杂度依赖于$1/(1-\gamma)$的情况。

更新时间: 2025-09-23 09:11:03

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2509.18822v1

Small LLMs with Expert Blocks Are Good Enough for Hyperparamter Tuning

Hyper-parameter Tuning (HPT) is a necessary step in machine learning (ML) pipelines but becomes computationally expensive and opaque with larger models. Recently, Large Language Models (LLMs) have been explored for HPT, yet most rely on models exceeding 100 billion parameters. We propose an Expert Block Framework for HPT using Small LLMs. At its core is the Trajectory Context Summarizer (TCS), a deterministic block that transforms raw training trajectories into structured context, enabling small LLMs to analyze optimization progress with reliability comparable to larger models. Using two locally-run LLMs (phi4:reasoning14B and qwen2.5-coder:32B) and a 10-trial budget, our TCS-enabled HPT pipeline achieves average performance within ~0.9 percentage points of GPT-4 across six diverse tasks.

Updated: 2025-09-23 09:10:27

标题: 小型LLMs与专家块足以进行超参数调整

摘要: 超参数调整（HPT）是机器学习（ML）流程中的一个必要步骤，但随着模型规模增大，它变得计算昂贵且不透明。最近，大型语言模型（LLMs）已被用于HPT，但大多数依赖于超过1000亿个参数的模型。我们提出了一个专家块框架，使用小型LLMs进行HPT。其核心是轨迹上下文总结器（TCS），这是一个确定性块，将原始训练轨迹转换为结构化上下文，使小型LLMs能够分析优化进展，并具有与更大模型相媲美的可靠性。使用两个本地运行的LLMs（phi4:reasoning14B和qwen2.5-coder:32B）和一个10次试验预算，我们的TCS启用的HPT流程在六个不同任务中实现了平均性能，与GPT-4相比只相差约0.9个百分点。

更新时间: 2025-09-23 09:10:27

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2509.15561v2

Improving Outdoor Multi-cell Fingerprinting-based Positioning via Mobile Data Augmentation

Accurate outdoor positioning in cellular networks is hindered by sparse, heterogeneous measurement collections and the high cost of exhaustive site surveys. This paper introduces a lightweight, modular mobile data augmentation framework designed to enhance multi-cell fingerprinting-based positioning using operator-collected minimization of drive test (MDT) records. The proposed approach decouples spatial and radio-feature synthesis: kernel density estimation (KDE) models the empirical spatial distribution to generate geographically coherent synthetic locations, while a k-nearest-neighbor (KNN)-based block produces augmented per-cell radio fingerprints. The architecture is intentionally training-free, interpretable, and suitable for distributed or on-premise operator deployments, supporting privacy-aware workflows. We both validate each augmentation module independently and assess its end-to-end impact on fingerprinting-based positioning using a real-world MDT dataset provided by an Italian mobile network operator across diverse urban and peri-urban scenarios. Results show that the proposed KDE-KNN augmentation consistently improves positioning performance, with the largest benefits in sparsely sampled or structurally complex regions; we also observe region-dependent saturation effects as augmentation increases. The framework offers a practical, low-complexity path to enhance operator positioning services using existing mobile data traces.

Updated: 2025-09-23 09:09:45

标题: 通过移动数据增强来改进室外多小区指纹定位

摘要: 在移动网络中，稀疏、异构的测量收集和详尽的现场调查成本高昂，阻碍了精确的室外定位。本文介绍了一种轻量级、模块化的移动数据增强框架，旨在利用运营商收集的最小化驾驶测试（MDT）记录，增强基于多小区指纹的定位。所提出的方法将空间和无线特征合成解耦：核密度估计（KDE）模型化实证空间分布，生成地理上连贯的合成位置，而基于k最近邻（KNN）的块则产生增强的每个小区无线指纹。该架构故意不需要训练，可解释，并适用于分布式或现场操作员部署，支持隐私感知工作流程。我们独立验证了每个增强模块，并评估了它们对基于指纹的定位的端到端影响，使用一家意大利移动网络运营商提供的真实MDT数据集，跨越多样的城市和近城区场景。结果表明，所提出的KDE-KNN增强一致改善了定位性能，在稀疏采样或结构复杂地区的效益最大；我们还观察到随着增强的增加，存在区域依赖的饱和效应。该框架提供了一条实用、低复杂度的路径，利用现有移动数据跟踪增强运营商定位服务。

更新时间: 2025-09-23 09:09:45

领域: cs.NI,cs.AI

下载: http://arxiv.org/abs/2509.19405v1

A Generative Framework for Probabilistic, Spatiotemporally Coherent Downscaling of Climate Simulation

Local climate information is crucial for impact assessment and decision-making, yet coarse global climate simulations cannot capture small-scale phenomena. Current statistical downscaling methods infer these phenomena as temporally decoupled spatial patches. However, to preserve physical properties, estimating spatio-temporally coherent high-resolution weather dynamics for multiple variables across long time horizons is crucial. We present a novel generative framework that uses a score-based diffusion model trained on high-resolution reanalysis data to capture the statistical properties of local weather dynamics. After training, we condition on coarse climate model data to generate weather patterns consistent with the aggregate information. As this predictive task is inherently uncertain, we leverage the probabilistic nature of diffusion models and sample multiple trajectories. We evaluate our approach with high-resolution reanalysis information before applying it to the climate model downscaling task. We then demonstrate that the model generates spatially and temporally coherent weather dynamics that align with global climate output.

Updated: 2025-09-23 09:03:42

标题: 一个用于概率性、时空一致的气候模拟降尺度的生成性框架

摘要: 本地气候信息对于影响评估和决策制定至关重要，然而粗糙的全球气候模拟无法捕捉小尺度现象。目前的统计降尺度方法将这些现象推断为时间上解耦的空间片段。然而，为了保留物理特性，对于长时间跨度内多个变量的高分辨率天气动态进行空间-时间上连贯的估计是至关重要的。我们提出了一种新颖的生成框架，该框架利用训练有素的高分辨率再分析数据上的基于分数的扩散模型来捕捉本地天气动态的统计特性。在训练之后，我们根据粗糙的气候模型数据来生成与总体信息一致的天气模式。由于这种预测任务本质上是不确定的，我们利用扩散模型的概率特性并采样多条轨迹。我们首先利用高分辨率再分析信息评估我们的方法，然后将其应用于气候模型降尺度任务。随后，我们展示了该模型生成的空间和时间上连贯的天气动态与全球气候输出相一致。

更新时间: 2025-09-23 09:03:42

领域: cs.LG,physics.ao-ph

下载: http://arxiv.org/abs/2412.15361v4

Training-Free Data Assimilation with GenCast

Data assimilation is widely used in many disciplines such as meteorology, oceanography, and robotics to estimate the state of a dynamical system from noisy observations. In this work, we propose a lightweight and general method to perform data assimilation using diffusion models pre-trained for emulating dynamical systems. Our method builds on particle filters, a class of data assimilation algorithms, and does not require any further training. As a guiding example throughout this work, we illustrate our methodology on GenCast, a diffusion-based model that generates global ensemble weather forecasts.

Updated: 2025-09-23 08:59:44

标题: 无需训练的数据同化方法：GenCast

摘要: 数据同化被广泛应用于许多学科，如气象学、海洋学和机器人学，用于从嘈杂的观测中估计动态系统的状态。在这项工作中，我们提出了一种轻量级和通用的方法，使用预先训练的扩散模型来执行数据同化，模拟动态系统。我们的方法建立在粒子滤波器上，这是一类数据同化算法，不需要进一步的训练。作为贯穿整个工作的指导示例，我们在GenCast上说明了我们的方法论，这是一个基于扩散的模型，生成全球集合天气预报。

更新时间: 2025-09-23 08:59:44

领域: cs.LG,physics.ao-ph

下载: http://arxiv.org/abs/2509.18811v1

Probabilistic Machine Learning for Uncertainty-Aware Diagnosis of Industrial Systems

Deep neural networks has been increasingly applied in fault diagnostics, where it uses historical data to capture systems behavior, bypassing the need for high-fidelity physical models. However, despite their competence in prediction tasks, these models often struggle with the evaluation of their confidence. This matter is particularly important in consistency-based diagnosis where decision logic is highly sensitive to false alarms. To address this challenge, this work presents a diagnostic framework that uses ensemble probabilistic machine learning to improve diagnostic characteristics of data driven consistency based diagnosis by quantifying and automating the prediction uncertainty. The proposed method is evaluated across several case studies using both ablation and comparative analyses, showing consistent improvements across a range of diagnostic metrics.

Updated: 2025-09-23 08:59:20

标题: 概率机器学习用于工业系统不确定性感知诊断

摘要: 深度神经网络在故障诊断中的应用越来越广泛，它利用历史数据捕捉系统行为，避免了对高保真物理模型的需求。然而，尽管在预测任务中表现出色，这些模型通常在评估其置信度方面存在困难。在基于一致性的诊断中，这一问题尤为重要，决策逻辑对误报非常敏感。为了解决这一挑战，本文提出了一个诊断框架，利用集成概率机器学习来改进基于数据驱动的一致性诊断的诊断特性，通过量化和自动化预测不确定性。提出的方法通过多个案例研究进行评估，包括消融和比较分析，显示出在一系列诊断指标上的一致改进。

更新时间: 2025-09-23 08:59:20

领域: cs.LG

下载: http://arxiv.org/abs/2509.18810v1

NeuFACO: Neural Focused Ant Colony Optimization for Traveling Salesman Problem

This study presents Neural Focused Ant Colony Optimization (NeuFACO), a non-autoregressive framework for the Traveling Salesman Problem (TSP) that combines advanced reinforcement learning with enhanced Ant Colony Optimization (ACO). NeuFACO employs Proximal Policy Optimization (PPO) with entropy regularization to train a graph neural network for instance-specific heuristic guidance, which is integrated into an optimized ACO framework featuring candidate lists, restricted tour refinement, and scalable local search. By leveraging amortized inference alongside ACO stochastic exploration, NeuFACO efficiently produces high-quality solutions across diverse TSP instances.

Updated: 2025-09-23 08:58:31

标题: NeuFACO: 旅行商问题的神经聚焦蚁群优化

摘要: 这项研究提出了神经聚焦蚁群优化（NeuFACO），这是一个针对旅行商问题（TSP）的非自回归框架，结合了先进的强化学习和增强的蚁群优化（ACO）。NeuFACO采用了带有熵正则化的近端策略优化（PPO）来训练一个图神经网络，用于针对实例的启发式指导，这一指导被整合到一个优化的ACO框架中，包括候选列表、受限巡回优化和可扩展的局部搜索。通过利用与ACO随机探索并行的摊销推理，NeuFACO能够高效地在各种TSP实例上产生高质量解决方案。

更新时间: 2025-09-23 08:58:31

领域: cs.NE,cs.LG

下载: http://arxiv.org/abs/2509.16938v2

EC-LDA : Label Distribution Inference Attack against Federated Graph Learning with Embedding Compression

Graph Neural Networks (GNNs) have been widely used for graph analysis. Federated Graph Learning (FGL) is an emerging learning framework to collaboratively train graph data from various clients. However, since clients are required to upload model parameters to the server in each round, this provides the server with an opportunity to infer each client's data privacy. In this paper, we focus on label distribution attacks(LDAs) that aim to infer the label distributions of the clients' local data. We take the first step to attack client's label distributions in FGL. Firstly, we observe that the effectiveness of LDA is closely related to the variance of node embeddings in GNNs. Next, we analyze the relation between them and we propose a new attack named EC-LDA, which significantly improves the attack effectiveness by compressing node embeddings. Thirdly, extensive experiments on node classification and link prediction tasks across six widely used graph datasets show that EC-LDA outperforms the SOTA LDAs. For example, EC-LDA attains optimal values under both Cos-sim and JS-div evaluation metrics in the CoraFull and LastFM datasets. Finally, we explore the robustness of EC-LDA under differential privacy protection.

Updated: 2025-09-23 08:55:38

标题: EC-LDA：标签分布推断攻击：针对具有嵌入压缩的联邦图学习

摘要: 图神经网络（GNNs）已被广泛用于图分析。联邦图学习（FGL）是一种新兴的学习框架，用于协作训练来自各个客户端的图数据。然而，由于客户端需要在每一轮上传模型参数到服务器，这为服务器提供了推断每个客户端数据隐私的机会。本文侧重于标签分布攻击（LDAs），旨在推断客户端局部数据的标签分布。我们首次尝试在FGL中攻击客户端的标签分布。首先，我们观察到LDA的有效性与GNNs中节点嵌入的方差密切相关。接下来，我们分析它们之间的关系，并提出一种新的攻击方法，名为EC-LDA，通过压缩节点嵌入显著提高攻击效果。第三，对六个广泛使用的图数据集上进行了节点分类和链接预测任务的大量实验表明，EC-LDA优于SOTA LDAs。例如，在CoraFull和LastFM数据集中，EC-LDA在Cos-sim和JS-div评估指标下取得了最佳值。最后，我们探讨了在差分隐私保护下EC-LDA的稳健性。

更新时间: 2025-09-23 08:55:38

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2505.15140v2

Centralized vs. Decentralized Security for Space AI Systems? A New Look

This paper investigates the trade-off between centralized and decentralized security management in constellations of satellites to balance security and performance. We highlight three key AI architectures for automated security management: (a) centralized, (b) distributed and (c) federated. The centralized architecture is the best option short term, providing fast training, despite the hard challenge of the communication latency overhead across space. Decentralized architectures are better alternatives in the longer term, providing enhanced scalability and security.

Updated: 2025-09-23 08:54:55

标题: 中央集权与分权制度对太空人工智能系统的安全性影响？一个新视角

摘要: 本文研究了在卫星星座中集中和分散安全管理之间的权衡，以平衡安全性和性能。我们重点介绍了三种关键的人工智能架构用于自动化安全管理：(a)集中式，(b)分布式和(c)联邦式。集中式架构是短期内最佳选择，提供快速训练，尽管在空间中存在通信延迟开销的挑战。分散式架构是长期内更好的替代方案，提供增强的可扩展性和安全性。

更新时间: 2025-09-23 08:54:55

领域: cs.CR,cs.AI,cs.DC

下载: http://arxiv.org/abs/2509.20395v1

A Kernel Space-based Multidimensional Sparse Model for Dynamic PET Image Denoising

Achieving high image quality for temporal frames in dynamic positron emission tomography (PET) is challenging due to the limited statistic especially for the short frames. Recent studies have shown that deep learning (DL) is useful in a wide range of medical image denoising tasks. In this paper, we propose a model-based neural network for dynamic PET image denoising. The inter-frame spatial correlation and intra-frame structural consistency in dynamic PET are used to establish the kernel space-based multidimensional sparse (KMDS) model. We then substitute the inherent forms of the parameter estimation with neural networks to enable adaptive parameters optimization, forming the end-to-end neural KMDS-Net. Extensive experimental results from simulated and real data demonstrate that the neural KMDS-Net exhibits strong denoising performance for dynamic PET, outperforming previous baseline methods. The proposed method may be used to effectively achieve high temporal and spatial resolution for dynamic PET. Our source code is available at https://github.com/Kuangxd/Neural-KMDS-Net/tree/main.

Updated: 2025-09-23 08:48:36

标题: 一种基于内核空间的多维稀疏模型用于动态PET图像去噪

摘要: 在动态正电子发射断层扫描（PET）中实现高质量图像对于时间帧来说是具有挑战性的，这是由于特别是对于短帧的有限统计量。最近的研究表明，深度学习（DL）在各种医学图像去噪任务中是有用的。本文提出了一种基于模型的神经网络用于动态PET图像去噪。动态PET中的帧间空间相关性和帧内结构一致性被用来建立基于核空间的多维稀疏（KMDS）模型。然后，我们用神经网络替代参数估计的固有形式，以实现自适应参数优化，形成端到端神经KMDS-Net。来自模拟和真实数据的广泛实验结果表明，神经KMDS-Net对于动态PET表现出强大的去噪性能，优于先前的基准方法。所提出的方法可以有效实现动态PET的高时间和空间分辨率。我们的源代码可在https://github.com/Kuangxd/Neural-KMDS-Net/tree/main找到。

更新时间: 2025-09-23 08:48:36

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.18801v1

Security Evaluation of Android apps in budget African Mobile Devices

Android's open-source nature facilitates widespread smartphone accessibility, particularly in price-sensitive markets. System and vendor applications that come pre-installed on budget Android devices frequently operate with elevated privileges, yet they receive limited independent examination. To address this gap, we developed a framework that extracts APKs from physical devices and applies static analysis to identify privacy and security issues in embedded software. Our study examined 1,544 APKs collected from seven African smartphones. The analysis revealed that 145 applications (9%) disclose sensitive data, 249 (16%) expose critical components without sufficient safeguards, and many present additional risks: 226 execute privileged or dangerous commands, 79 interact with SMS messages (read, send, or delete), and 33 perform silent installation operations. We also uncovered a vendor-supplied package that appears to transmit device identifiers and location details to an external third party. These results demonstrate that pre-installed applications on widely distributed low-cost devices represent a significant and underexplored threat to user security and privacy.

Updated: 2025-09-23 08:45:07

标题: 低成本非洲移动设备上安卓应用程序的安全评估

摘要: 安卓的开源特性促进了智能手机在价格敏感的市场中的普及。预装在廉价安卓设备上的系统和厂商应用程序通常以较高权限运行，但它们接受的独立审查有限。为了填补这一空白，我们开发了一个框架，从物理设备中提取APK，并应用静态分析来识别嵌入软件中的隐私和安全问题。我们的研究检查了来自七部非洲智能手机的1,544个APK。分析显示，145个应用程序（9%）泄露敏感数据，249个（16%）未提供足够的保护措施而暴露关键组件，并且许多应用存在额外风险：226个执行特权或危险命令，79个与短信消息交互（读取、发送或删除），33个执行静默安装操作。我们还发现一个供应商提供的软件包似乎将设备标识符和位置详细信息传输给外部第三方。这些结果表明，广泛分发的低成本设备上预装的应用程序对用户安全和隐私构成重要且尚未被充分探讨的威胁。

更新时间: 2025-09-23 08:45:07

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2509.18800v1

RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning

The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 13.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.

Updated: 2025-09-23 08:42:57

标题: RAG+：增强应用感知推理的检索增强生成

摘要: 通过检索增强生成（RAG）集成外部知识已成为增强大型语言模型（LLMs）用于知识密集型任务的基础。然而，现有的RAG范例往往忽视了应用知识的认知步骤，导致检索到的事实和任务特定推理之间存在差距。在这项工作中，我们引入了RAG+，这是一个有原则的模块化扩展，明确将应用感知推理纳入RAG管道。RAG+构建了一个包含知识和对齐应用示例的双重语料库，可以手动或自动创建，并在推理过程中同时检索这两者。这一设计不仅使LLMs能够访问相关信息，还能在结构化、目标导向的推理过程中应用这些信息。在数学、法律和医学领域进行的实验，使用多个模型进行，表明RAG+一贯优于标准RAG变体，平均改进为3-5％，在复杂情况下最高可提高13.5％。通过将检索与可操作应用相结合，RAG+推动了更具认知基础的知识集成框架，代表了迈向更具可解释性和能力的LLMs的一步。

更新时间: 2025-09-23 08:42:57

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2506.11555v4

A Multi-Agent Framework with Automated Decision Rule Optimization for Cross-Domain Misinformation Detection

Misinformation spans various domains, but detection methods trained on specific domains often perform poorly when applied to others. With the rapid development of Large Language Models (LLMs), researchers have begun to utilize LLMs for cross-domain misinformation detection. However, existing LLM-based methods often fail to adequately analyze news in the target domain, limiting their detection capabilities. More importantly, these methods typically rely on manually designed decision rules, which are limited by domain knowledge and expert experience, thus limiting the generalizability of decision rules to different domains. To address these issues, we propose a MultiAgent Framework for cross-domain misinformation detection with Automated Decision Rule Optimization (MARO). Under this framework, we first employs multiple expert agents to analyze target-domain news. Subsequently, we introduce a question-reflection mechanism that guides expert agents to facilitate higherquality analysis. Furthermore, we propose a decision rule optimization approach based on carefully-designed cross-domain validation tasks to iteratively enhance the effectiveness of decision rules in different domains. Experimental results and in-depth analysis on commonlyused datasets demonstrate that MARO achieves significant improvements over existing methods.

Updated: 2025-09-23 08:36:02

标题: 一个用于跨领域虚假信息检测的具有自动决策规则优化的多智能体框架

摘要: 虚假信息跨越各个领域，但在特定领域训练的检测方法在应用到其他领域时往往表现不佳。随着大型语言模型（LLMs）的快速发展，研究人员已开始利用LLMs进行跨领域虚假信息检测。然而，现有基于LLM的方法往往无法充分分析目标领域的新闻，限制了其检测能力。更重要的是，这些方法通常依赖于人工设计的决策规则，受限于领域知识和专家经验，从而限制了决策规则对不同领域的泛化能力。为了解决这些问题，我们提出了一个基于自动决策规则优化（MARO）的跨领域虚假信息检测的多Agent框架。在这个框架下，我们首先利用多个专家代理人分析目标领域的新闻。随后，我们引入了一个问题反思机制，引导专家代理人促进更高质量的分析。此外，我们提出了一种基于精心设计的跨领域验证任务的决策规则优化方法，逐步增强不同领域中决策规则的有效性。对常用数据集的实验结果和深入分析表明，MARO相比现有方法取得了显著改进。

更新时间: 2025-09-23 08:36:02

领域: cs.AI

下载: http://arxiv.org/abs/2503.23329v2

Virtual Arc Consistency for Linear Constraints in Cost Function Networks

In Constraint Programming, solving discrete minimization problems with hard and soft constraints can be done either using (i) soft global constraints, (ii) a reformulation into a linear program, or (iii) a reformulation into local cost functions. Approach (i) benefits from a vast catalog of constraints. Each soft constraint propagator communicates with other soft constraints only through the variable domains, resulting in weak lower bounds. Conversely, the approach (ii) provides a global view with strong bounds, but the size of the reformulation can be problematic. We focus on approach (iii) in which soft arc consistency (SAC) algorithms produce bounds of intermediate quality. Recently, the introduction of linear constraints as local cost functions increases their modeling expressiveness. We adapt an existing SAC algorithm to handle linear constraints. We show that our algorithm significantly improves the lower bounds compared to the original algorithm on several benchmarks, reducing solving time in some cases.

Updated: 2025-09-23 08:35:07

标题: 成本函数网络中线性约束的虚拟弧一致性

摘要: 在约束编程中，解决带有硬性和软性约束的离散最小化问题可以通过以下三种方法实现：（i）使用软全局约束，（ii）转化为线性规划问题，或者（iii）转化为局部成本函数。方法（i）受益于庞大的约束目录。每个软约束传播器只通过变量域与其他软约束进行通信，导致弱下界。相反，方法（ii）提供了一个具有强下界的全局视图，但是重新制定的规模可能会有问题。我们关注方法（iii），其中软弧一致性（SAC）算法产生中间质量的下界。最近，将线性约束引入作为局部成本函数增加了其建模表达能力。我们调整了现有的SAC算法以处理线性约束。我们展示了我们的算法在几个基准测试中相比原始算法显著改善了下界，有时减少了求解时间。

更新时间: 2025-09-23 08:35:07

领域: cs.AI

下载: http://arxiv.org/abs/2509.17706v2

PDTrim: Targeted Pruning for Prefill-Decode Disaggregation in Inference

Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands. However, existing methods often ignore the characteristics of prefill-decode (PD) disaggregation in practice. In this paper, we propose a novel pruning method for PD disaggregation inference, enabling more precise and efficient block and KV Cache pruning. Our approach constructs pruning and distillation sets to perform iterative block removal independently for the prefill and decode stages, obtaining better pruning solutions. Moreover, we introduce a token-aware cache pruning mechanism that retains all KV Cache in the prefill stage but selectively reuses entries for the first and last token sequences in selected layers during decode, reducing communication costs with minimal overhead. Extensive experiments demonstrate that our approach consistently achieves strong performance in both PD disaggregation and PD unified settings without disaggregation. Under the same (default) settings, our method achieves improved performance and faster inference, along with a 4.95$\times$ reduction in data transmission bandwidth consumption.

Updated: 2025-09-23 08:31:26

标题: PDTrim：推理中的预填解码分离的目标修剪

摘要: 大型语言模型（LLMs）在各种任务中展示出卓越的能力，但它们的部署受到高计算和内存成本的限制。模型修剪提供了一种有效的方法来缓解这些需求。然而，现有方法通常忽略了实践中预填-解码（PD）分解的特征。在本文中，我们提出了一种新的修剪方法，用于PD分解推断，实现更精确和高效的块和KV缓存修剪。我们的方法构建修剪和蒸馏集合，独立地执行预填充和解码阶段的迭代块删除，获得更好的修剪解决方案。此外，我们引入了一种基于令牌的缓存修剪机制，在预填阶段保留所有KV缓存，但在解码过程中选择性地重用选定层中第一个和最后一个令牌序列的条目，降低通信成本并带来最小的开销。大量实验表明，我们的方法在PD分解和PD统一设置中始终实现强大的性能，而在没有分解的情况下也是如此。在相同（默认）设置下，我们的方法实现了改进的性能和更快的推断速度，同时数据传输带宽消耗减少了4.95倍。

更新时间: 2025-09-23 08:31:26

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.04467v3

Detection of security smells in IaC scripts through semantics-aware code and language processing

Infrastructure as Code (IaC) automates the provisioning and management of IT infrastructure through scripts and tools, streamlining software deployment. Prior studies have shown that IaC scripts often contain recurring security misconfigurations, and several detection and mitigation approaches have been proposed. Most of these rely on static analysis, using statistical code representations or Machine Learning (ML) classifiers to distinguish insecure configurations from safe code. In this work, we introduce a novel approach that enhances static analysis with semantic understanding by jointly leveraging natural language and code representations. Our method builds on two complementary ML models: CodeBERT, to capture semantics across code and text, and LongFormer, to represent long IaC scripts without losing contextual information. We evaluate our approach on misconfiguration datasets from two widely used IaC tools, Ansible and Puppet. To validate its effectiveness, we conduct two ablation studies (removing code text from the natural language input and truncating scripts to reduce context) and compare against four large language models (LLMs) and prior work. Results show that semantic enrichment substantially improves detection, raising precision and recall from 0.46 and 0.79 to 0.92 and 0.88 on Ansible, and from 0.55 and 0.97 to 0.87 and 0.75 on Puppet, respectively.

Updated: 2025-09-23 08:28:49

标题: 通过语义感知代码和语言处理技术检测IaC脚本中的安全问题

摘要: 基础设施即代码（IaC）通过脚本和工具自动化提供和管理IT基础设施，简化软件部署。先前的研究表明，IaC脚本经常包含重复的安全配置错误，并提出了几种检测和缓解方法。大多数方法依赖于静态分析，使用统计代码表示或机器学习（ML）分类器区分不安全配置和安全代码。在这项工作中，我们引入了一种新颖的方法，通过联合利用自然语言和代码表示增强静态分析的语义理解。我们的方法建立在两个互补的ML模型上：CodeBERT，用于捕捉代码和文本之间的语义，以及LongFormer，用于表示长IaC脚本而不丢失上下文信息。我们评估了我们的方法在两个广泛使用的IaC工具Ansible和Puppet的错误配置数据集上的效果。为了验证其有效性，我们进行了两项消融研究（删除自然语言输入中的代码文本和截断脚本以减少上下文），并与四个大型语言模型（LLMs）和先前的工作进行了比较。结果表明，语义丰富大大提高了检测效果，在Ansible上将精确度和召回率从0.46和0.79提高到0.92和0.88，在Puppet上分别从0.55和0.97提高到0.87和0.75。

更新时间: 2025-09-23 08:28:49

领域: cs.CR,cs.AI,cs.LG,cs.SE

下载: http://arxiv.org/abs/2509.18790v1

The AGNTCY Agent Directory Service: Architecture and Implementation

The Agent Directory Service (ADS) is a distributed directory for the discovery of AI agent capabilities, metadata, and provenance. It leverages content-addressed storage, hierarchical taxonomies, and cryptographic signing to enable efficient, verifiable, and multi-dimensional discovery across heterogeneous Multi-Agent Systems (MAS). Built on the Open Agentic Schema Framework (OASF), ADS decouples capability indexing from content location through a two-level mapping realized over a Kademlia-based Distributed Hash Table (DHT). It reuses mature OCI / ORAS infrastructure for artifact distribution, integrates Sigstore for provenance, and supports schema-driven extensibility for emerging agent modalities (LLM prompt agents, MCP servers, A2A-enabled components). This paper formalizes the architectural model, describes storage and discovery layers, explains security and performance properties, and positions ADS within the broader landscape of emerging agent registry and interoperability initiatives.

Updated: 2025-09-23 08:25:33

标题: 《AGNTCY代理目录服务：架构与实现》

摘要: 代理商目录服务（ADS）是用于发现AI代理商能力、元数据和来源的分布式目录。它利用内容寻址存储、分层分类和加密签名，实现了跨异构多代理系统（MAS）的高效、可验证和多维度发现。ADS建立在开放代理模式框架（OASF）上，通过基于Kademlia的分布式哈希表（DHT）上的两级映射实现了能力索引与内容位置的解耦。它重用成熟的OCI/ORAS基础设施用于工件分发，集成Sigstore用于来源，支持基于模式驱动的可扩展性，以适应新兴代理模态（LLM提示代理、MCP服务器、A2A启用组件）。本文形式化了架构模型，描述了存储和发现层，解释了安全性和性能特性，并将ADS定位在新兴代理注册表和互操作性倡议的更广泛景观中。

更新时间: 2025-09-23 08:25:33

领域: cs.AI,C.2.4

下载: http://arxiv.org/abs/2509.18787v1

Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight DRAM-flash swapping and incorporates three novel techniques: (1) Cross-layer active weights preloading. It uses the activations from the current layer to predict the active weights of several subsequent layers, enabling computation and data loading to overlap, as well as facilitating large I/O transfers. (2) Sparsity-aware self-distillation. It adjusts the active weights to align with the dense-model output distribution, compensating for approximations introduced by contextual sparsity. (3) Active weight DRAM-flash swapping pipeline. It orchestrates the DRAM space allocation among the hot weight cache, preloaded active weights, and computation-involved weights based on available memory. Results show ActiveFlow achieves the performance-cost Pareto frontier compared to existing efficiency optimization methods.

Updated: 2025-09-23 08:24:07

标题: 通过DRAM和Flash之间的主动权重交换扩展设备上的LLMs

摘要: 大型语言模型（LLMs）越来越多地部署在移动设备上，但有限的DRAM容量限制了可部署的模型大小。本文介绍了ActiveFlow，这是第一个可以实现现代LLMs（非基于ReLU的）自适应DRAM使用的推理框架，从而实现可部署模型大小的扩展。该框架基于新颖的主动权重DRAM-flash交换概念，并结合了三种新颖技术：（1）跨层主动权重预加载。它利用当前层的激活来预测几个后续层的活动权重，实现计算和数据加载的重叠，并促进大规模I/O传输。（2）稀疏感知自我蒸馏。它调整活动权重以与稠密模型输出分布对齐，弥补了上下文稀疏引入的近似误差。（3）活动权重DRAM-flash交换管道。它根据可用内存对热权重缓存、预加载的活动权重和涉及计算的权重进行DRAM空间分配。结果显示，与现有效率优化方法相比，ActiveFlow实现了性能成本帕累托前沿。

更新时间: 2025-09-23 08:24:07

领域: cs.LG

下载: http://arxiv.org/abs/2504.08378v2

Long-Range Graph Wavelet Networks

Modeling long-range interactions, the propagation of information across distant parts of a graph, is a central challenge in graph machine learning. Graph wavelets, inspired by multi-resolution signal processing, provide a principled way to capture both local and global structures. However, existing wavelet-based graph neural networks rely on finite-order polynomial approximations, which limit their receptive fields and hinder long-range propagation. We propose Long-Range Graph Wavelet Networks (LR-GWN), which decompose wavelet filters into complementary local and global components. Local aggregation is handled with efficient low-order polynomials, while long-range interactions are captured through a flexible spectral-domain parameterization. This hybrid design unifies short- and long-distance information flow within a principled wavelet framework. Experiments show that LR-GWN achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks, while remaining competitive on short-range datasets.

Updated: 2025-09-23 08:22:42

标题: 长距离图小波网络

摘要: 建模长距离交互作用，即信息在图的远程部分传播，是图机器学习中的一个核心挑战。受多分辨率信号处理启发，图小波提供了一种合理的方式来捕捉局部和全局结构。然而，现有基于小波的图神经网络依赖于有限阶多项式逼近，这限制了它们的感知域并阻碍了长距离传播。我们提出了长距离图小波网络（LR-GWN），将小波滤波器分解为互补的局部和全局组件。局部聚合通过高效的低阶多项式处理，而长距离交互作用则通过灵活的频域参数化捕获。这种混合设计统一了基于合乎逻辑的小波框架中的短距离和长距离信息流。实验证明，LR-GWN在长距离基准测试中实现了基于小波的方法的最先进性能，同时在短距离数据集上保持竞争力。

更新时间: 2025-09-23 08:22:42

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.06743v2

Reconstruction of Optical Coherence Tomography Images from Wavelength-space Using Deep-learning

Conventional Fourier-domain Optical Coherence Tomography (FD-OCT) systems depend on resampling into wavenumber (k) domain to extract the depth profile. This either necessitates additional hardware resources or amplifies the existing computational complexity. Moreover, the OCT images also suffer from speckle noise, due to systemic reliance on low coherence interferometry. We propose a streamlined and computationally efficient approach based on Deep-Learning (DL) which enables reconstructing speckle-reduced OCT images directly from the wavelength domain. For reconstruction, two encoder-decoder styled networks namely Spatial Domain Convolution Neural Network (SD-CNN) and Fourier Domain CNN (FD-CNN) are used sequentially. The SD-CNN exploits the highly degraded images obtained by Fourier transforming the domain fringes to reconstruct the deteriorated morphological structures along with suppression of unwanted noise. The FD-CNN leverages this output to enhance the image quality further by optimization in Fourier domain (FD). We quantitatively and visually demonstrate the efficacy of the method in obtaining high-quality OCT images. Furthermore, we illustrate the computational complexity reduction by harnessing the power of DL models. We believe that this work lays the framework for further innovations in the realm of OCT image reconstruction.

Updated: 2025-09-23 08:21:53

标题: 使用深度学习从波长空间重建光学相干断层扫描图像

摘要: 传统的傅立叶域光学相干断层扫描（FD-OCT）系统依赖于重采样到波数（k）域以提取深度剖面。这要么需要额外的硬件资源，要么增加了现有的计算复杂性。此外，由于系统依赖于低相干干涉法，OCT图像也受到散斑噪声的影响。我们提出了一种基于深度学习（DL）的简化和计算高效的方法，可以直接从波长域重建减少散斑噪声的OCT图像。为了重建，我们依次使用了两个编码器-解码器风格的网络，即空间域卷积神经网络（SD-CNN）和傅立叶域卷积神经网络（FD-CNN）。SD-CNN利用通过傅立叶变换域的条纹获得的高度退化的图像来重建恶化的形态结构，并抑制不需要的噪声。FD-CNN利用这个输出进一步优化傅立叶域（FD）中的图像质量。我们定量和视觉上展示了该方法在获取高质量OCT图像方面的有效性。此外，通过利用DL模型的强大能力，我们展示了计算复杂性的降低。我们相信，这项工作为OCT图像重建领域的进一步创新奠定了基础。

更新时间: 2025-09-23 08:21:53

领域: physics.optics,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.18783v1

Real-time Deer Detection and Warning in Connected Vehicles via Thermal Sensing and Deep Learning

Deer-vehicle collisions represent a critical safety challenge in the United States, causing nearly 2.1 million incidents annually and resulting in approximately 440 fatalities, 59,000 injuries, and 10 billion USD in economic damages. These collisions also contribute significantly to declining deer populations. This paper presents a real-time detection and driver warning system that integrates thermal imaging, deep learning, and vehicle-to-everything communication to help mitigate deer-vehicle collisions. Our system was trained and validated on a custom dataset of over 12,000 thermal deer images collected in Mars Hill, North Carolina. Experimental evaluation demonstrates exceptional performance with 98.84 percent mean average precision, 95.44 percent precision, and 95.96 percent recall. The system was field tested during a follow-up visit to Mars Hill and readily sensed deer providing the driver with advanced warning. Field testing validates robust operation across diverse weather conditions, with thermal imaging maintaining between 88 and 92 percent detection accuracy in challenging scenarios where conventional visible light based cameras achieve less than 60 percent effectiveness. When a high probability threshold is reached sensor data sharing messages are broadcast to surrounding vehicles and roadside units via cellular vehicle to everything (CV2X) communication devices. Overall, our system achieves end to end latency consistently under 100 milliseconds from detection to driver alert. This research establishes a viable technological pathway for reducing deer-vehicle collisions through thermal imaging and connected vehicles.

Updated: 2025-09-23 08:16:25

标题: 实时通过热感知和深度学习在联网车辆中检测和警告鹿类

摘要: 鹿车碰撞在美国是一个重要的安全挑战，每年造成近210万起事故，导致大约440人死亡，59,000人受伤，经济损失达100亿美元。这些碰撞也显著地促使鹿的数量下降。本文介绍了一个实时检测和驾驶员警告系统，该系统整合了热成像、深度学习和车辆对一切通信，以帮助减少鹿车碰撞。我们的系统在北卡罗来纳州马斯希尔收集的超过12,000张热成像鹿图像的自定义数据集上进行了训练和验证。实验评估表明，系统表现出色，平均精度为98.84％，精度为95.44％，召回率为95.96％。该系统在对马斯希尔的后续访问中进行了现场测试，并能及时感知到鹿，为驾驶员提供提前警告。现场测试验证了在各种天气条件下的强大运行能力，其中热成像在挑战性场景中保持88至92％的检测准确率，而传统可见光摄像头的效率不到60％。当达到高概率阈值时，传感器数据共享消息通过细胞车辆对一切（CV2X）通信设备广播给周围的车辆和路边单位。总体而言，我们的系统实现了从检测到驾驶员警报的端到端延迟始终低于100毫秒。这项研究通过热成像和连接车辆为减少鹿车碰撞建立了可行的技术途径。

更新时间: 2025-09-23 08:16:25

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.18779v1

VGGT-DP: Generalizable Robot Control via Vision Foundation Models

Visual imitation learning frameworks allow robots to learn manipulation skills from expert demonstrations. While existing approaches mainly focus on policy design, they often neglect the structure and capacity of visual encoders, limiting spatial understanding and generalization. Inspired by biological vision systems, which rely on both visual and proprioceptive cues for robust control, we propose VGGT-DP, a visuomotor policy framework that integrates geometric priors from a pretrained 3D perception model with proprioceptive feedback. We adopt the Visual Geometry Grounded Transformer (VGGT) as the visual encoder and introduce a proprioception-guided visual learning strategy to align perception with internal robot states, improving spatial grounding and closed-loop control. To reduce inference latency, we design a frame-wise token reuse mechanism that compacts multi-view tokens into an efficient spatial representation. We further apply random token pruning to enhance policy robustness and reduce overfitting. Experiments on challenging MetaWorld tasks show that VGGT-DP significantly outperforms strong baselines such as DP and DP3, particularly in precision-critical and long-horizon scenarios.

Updated: 2025-09-23 08:15:30

标题: VGGT-DP：通过视觉基础模型实现通用机器人控制

摘要: 视觉模仿学习框架允许机器人从专家演示中学习操作技能。尽管现有方法主要关注策略设计，但它们经常忽视视觉编码器的结构和容量，从而限制了空间理解和泛化能力。受生物视觉系统的启发，这些系统依赖于视觉和本体感知提示来实现稳健控制，我们提出了VGGT-DP，一个将预训练的3D感知模型中的几何先验与本体感知反馈相结合的视觉运动策略框架。我们采用Visual Geometry Grounded Transformer（VGGT）作为视觉编码器，并引入了一个本体感知引导的视觉学习策略，以将感知与内部机器人状态对齐，提高空间基础和闭环控制。为了减少推理延迟，我们设计了一个逐帧令牌重用机制，将多视角令牌压缩为高效的空间表示。我们进一步应用随机令牌修剪来增强策略的稳健性并减少过拟合。在具有挑战性的MetaWorld任务上的实验表明，VGGT-DP在精度关键和长期视角的场景中明显优于强基线，如DP和DP3。

更新时间: 2025-09-23 08:15:30

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18778v1

AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from codes retrieval to specialized documents generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.

Updated: 2025-09-23 08:09:58

标题: AECBench：AEC领域大型语言模型知识评估的分层基准

摘要: 大型语言模型(LLMs)作为一种新型信息技术，正在建筑、工程和建筑(AEC)领域得到越来越多的应用。它们已经展示了简化建筑生命周期中流程的潜力。然而，在这样一个专业化且安全关键的领域中，LLMs的稳健性和可靠性仍待评估。为了解决这一挑战，本文建立了AECBench，这是一个旨在量化当前LLMs在AEC领域优势和局限的综合基准。该基准定义了23个代表性任务，涵盖了五个认知导向的评估框架，包括知识记忆、理解、推理、计算和应用。这些任务来源于真实的AEC实践，范围从代码检索到专业文件生成。随后，由工程师主要制作了一个包含多种格式的4800个问题的数据集，包括开放式问题，并通过两轮专家审查进行了验证。此外，引入了LLM作为评判者的方法，为评估复杂的、长篇回答提供了可扩展和一致的方法，利用专家制定的评分表。通过对九个LLMs的评估，揭示了在五个认知水平上明显的性能下降。尽管在知识记忆和理解水平的基础任务方面表现出熟练，但模型在解释建筑法规表中的知识、执行复杂的推理和计算，以及生成领域特定文件等方面显示出显著的性能缺陷。因此，这项研究为未来旨在将LLMs稳健可靠地整合到安全关键的工程实践中的研究和发展奠定了基础。

更新时间: 2025-09-23 08:09:58

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.18776v1

Financial Risk Relation Identification through Dual-view Adaptation

A multitude of interconnected risk events -- ranging from regulatory changes to geopolitical tensions -- can trigger ripple effects across firms. Identifying inter-firm risk relations is thus crucial for applications like portfolio management and investment strategy. Traditionally, such assessments rely on expert judgment and manual analysis, which are, however, subjective, labor-intensive, and difficult to scale. To address this, we propose a systematic method for extracting inter-firm risk relations using Form 10-K filings -- authoritative, standardized financial documents -- as our data source. Leveraging recent advances in natural language processing, our approach captures implicit and abstract risk connections through unsupervised fine-tuning based on chronological and lexical patterns in the filings. This enables the development of a domain-specific financial encoder with a deeper contextual understanding and introduces a quantitative risk relation score for transparency, interpretable analysis. Extensive experiments demonstrate that our method outperforms strong baselines across multiple evaluation settings.

Updated: 2025-09-23 08:09:30

标题: 财务风险关系的双视角适应性识别

摘要: 一个众多相互关联的风险事件，从监管变化到地缘政治紧张局势，都可能引发公司之间的连锁反应。因此，识别公司之间的风险关系对于投资组合管理和投资策略等应用至关重要。传统上，这种评估依赖于专家判断和手动分析，然而，这种方法具有主观性、劳动密集性和难以扩展性的缺点。为了解决这个问题，我们提出了一种系统化的方法，利用Form 10-K文件作为我们的数据来源，提取公司之间的风险关系。借助自然语言处理的最新进展，我们的方法通过对文件中的时间和词汇模式进行无监督微调，捕捉隐含和抽象的风险连接。这使得我们能够开发一个具有更深层上下文理解的专业领域金融编码器，并引入一个量化的风险关系评分，以实现透明、可解释的分析。广泛的实验证明，我们的方法在多个评估设置下优于强基线。

更新时间: 2025-09-23 08:09:30

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.18775v1

Purest Quantum State Identification

Quantum noise constitutes a fundamental obstacle to realizing practical quantum technologies. To address the pivotal challenge of identifying quantum systems least affected by noise, we introduce the purest quantum state identification, which can be used to improve the accuracy of quantum computation and communication. We formulate a rigorous paradigm for identifying the purest quantum state among $K$ unknown $n$-qubit quantum states using total $N$ quantum state copies. For incoherent strategies, we derive the first adaptive algorithm achieving error probability $\exp\left(- \Omega\left(\frac{N H_1}{\log(K) 2^n }\right) \right)$, fundamentally improving quantum property learning through measurement optimization. By developing a coherent measurement protocol with error bound $\exp\left(- \Omega\left(\frac{N H_2}{\log(K) }\right) \right)$, we demonstrate a significant separation from incoherent strategies, formally quantifying the power of quantum memory and coherent measurement. Furthermore, we establish a lower bound by demonstrating that all strategies with fixed two-outcome incoherent POVM must suffer error probability exceeding $ \exp\left( - O\left(\frac{NH_1}{2^n}\right)\right)$. This research advances the characterization of quantum noise through efficient learning frameworks. Our results establish theoretical foundations for noise-adaptive quantum property learning while delivering practical protocols for enhancing the reliability of quantum hardware.

Updated: 2025-09-23 08:07:30

标题: 最纯净的量子态识别

摘要: 量子噪声构成了实现实用量子技术的基本障碍。为了解决识别最不受噪声影响的量子系统的关键挑战，我们引入了最纯净的量子态识别，可以用于提高量子计算和通信的准确性。我们提出了一个严格的范式，用于使用总共$N$个量子态拷贝识别$K$个未知的$n$比特量子态中的最纯净的量子态。对于非相干策略，我们推导出了第一个实现错误概率为$\exp\left(- \Omega\left(\frac{N H_1}{\log(K) 2^n }\right) \right)$的自适应算法，从根本上改善了通过测量优化学习量子属性。通过开发一个具有错误界为$\exp\left(- \Omega\left(\frac{N H_2}{\log(K) }\right) \right)$的连贯测量协议，我们展示了与非相干策略的显著分离，正式量化了量子存储和连贯测量的能力。此外，我们通过证明所有具有固定两结果非相干POVM的策略都必须遭受超过$\exp\left( - O\left(\frac{NH_1}{2^n}\right)\right)$的错误概率，建立了一个下界。这项研究通过高效的学习框架推进了量子噪声的表征。我们的结果为噪声自适应量子属性学习奠定了理论基础，同时提供了增强量子硬件可靠性的实用协议。

更新时间: 2025-09-23 08:07:30

领域: quant-ph,cs.AI

下载: http://arxiv.org/abs/2502.14334v2

Experience Scaling: Post-Deployment Evolution For Large Language Models

Scaling model size, training data, and compute power have driven advances in large language models (LLMs), but these approaches are reaching saturation as human-generated text is exhausted and further gains diminish. We propose experience scaling, a framework for continuous post-deployment evolution for LLMs through autonomous interaction with the environment and collaborative sharing of accumulated experience. The framework captures raw interactions, distills them into compact, reusable knowledge, and periodically refines stored content to preserve relevance and efficiency. We validate the framework in simulated real-world scenarios involving generalization to previously unseen but related tasks, repetitive queries, and over-saturated knowledge stores. Across all settings, experience scaling improves accuracy, sustains performance over time, and maintains gains when applied to novel situations. These results demonstrate that structured post-deployment learning can extend LLM capabilities beyond the limits of static human-generated data, offering a scalable path for continued intelligence progress.

Updated: 2025-09-23 08:04:58

标题: 经验缩放：大型语言模型的部署后演化

摘要: 模型规模、训练数据和计算能力的扩大推动了大型语言模型（LLMs）的进步，但这些方法正在达到饱和，因为人类生成的文本已经耗尽，进一步的收益减少。我们提出了经验扩展，这是一个用于LLMs的持续部署后演进的框架，通过与环境的自主交互和累积经验的协作共享。该框架捕获原始交互，将其提炼为紧凑、可重复使用的知识，并定期优化存储内容以保持相关性和效率。我们在涉及泛化到以前未见但相关任务、重复查询和过度饱和的知识存储的模拟真实场景中验证了该框架。在所有设置中，经验扩展提高了准确性，在时间上维持了性能，并在应用于新情况时保持了收益。这些结果表明，结构化的部署后学习可以将LLM的能力扩展到超出静态人类生成数据的限制，为持续的智能进步提供了可扩展的路径。

更新时间: 2025-09-23 08:04:58

领域: cs.AI

下载: http://arxiv.org/abs/2509.18771v1

Diagonal Linear Networks and the Lasso Regularization Path

Diagonal linear networks are neural networks with linear activation and diagonal weight matrices. Their theoretical interest is that their implicit regularization can be rigorously analyzed: from a small initialization, the training of diagonal linear networks converges to the linear predictor with minimal 1-norm among minimizers of the training loss. In this paper, we deepen this analysis showing that the full training trajectory of diagonal linear networks is closely related to the lasso regularization path. In this connection, the training time plays the role of an inverse regularization parameter. Both rigorous results and simulations are provided to illustrate this conclusion. Under a monotonicity assumption on the lasso regularization path, the connection is exact while in the general case, we show an approximate connection.

Updated: 2025-09-23 07:59:25

标题: 对角线线性网络和Lasso正则化路径

摘要: 对角线线性网络是具有线性激活和对角线权重矩阵的神经网络。它们的理论意义在于可以对它们的隐式正则化进行严格分析：从一个小的初始化开始，对角线线性网络的训练会收敛到在训练损失的最小1-范数下的线性预测器。在本文中，我们深入分析了对角线线性网络的完整训练轨迹与套索正则化路径之间的密切关系。在这种联系中，训练时间起到了反正则化参数的作用。我们提供了严格的结果和模拟来说明这一结论。在套索正则化路径上的单调性假设下，这种联系是准确的，而在一般情况下，我们展示了一个近似的联系。

更新时间: 2025-09-23 07:59:25

领域: cs.LG,math.OC,stat.ML,62J07, 68T07,G.3

下载: http://arxiv.org/abs/2509.18766v1

DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision

Self-supervised learning (SSL) has emerged as a powerful paradigm for medical image representation learning, particularly in settings with limited labeled data. However, existing SSL methods often rely on complex architectures, anatomy-specific priors, or heavily tuned augmentations, which limit their scalability and generalizability. More critically, these models are prone to shortcut learning, especially in modalities like chest X-rays, where anatomical similarity is high and pathology is subtle. In this work, we introduce DiSSECT -- Discrete Self-Supervision for Efficient Clinical Transferable Representations, a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck. This constrains the model to learn repeatable, structure-aware features while suppressing view-specific or low-utility patterns, improving representation transfer across tasks and domains. DiSSECT achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning, and shows particularly high label efficiency in low-label regimes. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability compared to existing state-of-the-art approaches.

Updated: 2025-09-23 07:58:21

标题: DiSSECT：通过离散自监督实现结构化的可转移医学图像表示

摘要: 自我监督学习（SSL）已经成为医学图像表征学习的一种强大范式，特别是在有限标记数据的情况下。然而，现有的SSL方法通常依赖复杂的架构、特定解剖学先验或经过精心调整的增强，这限制了它们的可扩展性和普适性。更重要的是，这些模型往往容易出现捷径学习，特别是在胸部X射线等模态中，解剖结构相似且病理学细微。在这项工作中，我们介绍了DiSSECT——用于高效临床可转移表征的离散自我监督框架，该框架将多尺度向量量化集成到SSL流程中，以强加一个离散的表征瓶颈。这限制了模型学习可重复的、结构感知的特征，同时抑制了视图特定或低效模式，改善了任务和领域之间的表征转移。DiSSECT在分类和分割任务上取得了强大的性能，几乎不需要或不需要微调，并且在低标记制度中显示出特别高的标记效率。我们在多个公共医学图像数据集上验证了DiSSECT，证明了与现有最先进方法相比的鲁棒性和普适性。

更新时间: 2025-09-23 07:58:21

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.18765v1

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.

Updated: 2025-09-23 07:55:38

标题: 长文帮助短文：监督微调中的上下文长度如何影响大型语言模型的行为

摘要: 大型语言模型（LLMs）在自然语言处理（NLP）任务中取得了令人印象深刻的表现。随着现实世界应用对更长上下文窗口的需求不断增加，持续预训练和长上下文数据的监督微调（SFT）已成为一种常见方法。尽管已广泛研究了持续预训练中数据长度的影响，但它们对SFT的影响仍不明确。在这项工作中，我们系统地研究了SFT数据长度如何影响LLM在短上下文任务中的行为。令人意外的是，我们发现长上下文SFT改善了短上下文性能，与通常观察到的长上下文预训练降级相反。为了揭示这一现象的潜在机制，我们首先解耦并分析了两个关键组件，多头注意力（MHA）和前馈网络（FFN），并展示了两者都独立受益于长上下文SFT。我们进一步研究它们的互动，并揭示了一种知识偏好偏倚：长上下文SFT促进了上下文知识，而短上下文SFT偏好参数化知识，使得完全依赖于长上下文SFT不是最佳选择。最后，我们证明了混合训练可以减轻这种偏倚，为微调LLMs提供可解释的指导。

更新时间: 2025-09-23 07:55:38

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.18762v1

Security smells in infrastructure as code: a taxonomy update beyond the seven sins

Infrastructure as Code (IaC) has become essential for modern software management, yet security flaws in IaC scripts can have severe consequences, as exemplified by the recurring exploits of Cloud Web Services. Prior work has recognized the need to build a precise taxonomy of security smells in IaC scripts as a first step towards developing approaches to improve IaC security. This first effort led to the unveiling of seven sins, limited by the focus on a single IaC tool as well as by the extensive, and potentially biased, manual effort that was required. We propose, in our work, to revisit this taxonomy: first, we extend the study of IaC security smells to a more diverse dataset with scripts associated with seven popular IaC tools, including Terraform, Ansible, Chef, Puppet, Pulumi, Saltstack, and Vagrant; second, we bring in some automation for the analysis by relying on an LLM. While we leverage LLMs for initial pattern processing, all taxonomic decisions underwent systematic human validation and reconciliation with established security standards. Our study yields a comprehensive taxonomy of 62 security smell categories, significantly expanding beyond the previously known seven. We demonstrate actionability by implementing new security checking rules within linters for seven popular IaC tools, often achieving 1.00 precision score. Our evolution study of security smells in GitHub projects reveals that these issues persist for extended periods, likely due to inadequate detection and mitigation tools. This work provides IaC practitioners with insights for addressing common security smells and systematically adopting DevSecOps practices to build safer infrastructure code.

Updated: 2025-09-23 07:55:35

标题: 基础设施即代码中的安全问题：超出七宗罪的分类更新

摘要: 基础设施即代码（IaC）已成为现代软件管理的必备工具，然而IaC脚本中的安全漏洞可能会带来严重后果，正如云Web服务的反复利用所示。先前的研究已经认识到需要建立一个精确的IaC脚本安全问题分类，作为改进IaC安全性方法的第一步。这一初步努力揭示了七宗罪，但由于重点仅限于单一的IaC工具以及需要大量且可能有偏见的手动工作，存在一定的局限性。在我们的研究中，我们提出重新审视这一分类：首先，我们将IaC安全问题的研究扩展到一个更多样化的数据集，包括与七种流行的IaC工具相关的脚本，其中包括Terraform、Ansible、Chef、Puppet、Pulumi、Saltstack和Vagrant；第二，我们引入一些自动化分析，依赖于LLM。虽然我们利用LLMs进行初始模式处理，但所有分类决策均经过系统化的人工验证，并与已建立的安全标准进行协调。我们的研究产生了一个包含62个安全问题类别的全面分类体系，明显超出之前已知的七个类别。我们通过为七种流行的IaC工具实现新的安全检查规则，通常达到1.00的准确度分数，展示了可操作性。我们对GitHub项目中安全问题的演变研究表明，这些问题可能会持续存在较长时间，很可能是由于检测和缓解工具的不足。这项工作为IaC从业者提供了解决常见安全问题和系统化采用DevSecOps实践以构建更安全基础设施代码的见解。

更新时间: 2025-09-23 07:55:35

领域: cs.CR,cs.AI,cs.LG,cs.SE

下载: http://arxiv.org/abs/2509.18761v1

Complexity of Activity Patterns in a Bio-Inspired Hopfield-Type Network in Different Topologies

Neural network models capable of storing memory have been extensively studied in computer science and computational neuroscience. The Hopfield network is a prototypical example of a model designed for associative, or content-addressable, memory and has been analyzed in many forms. Further, ideas and methods from complex network theory have been incorporated into artificial neural networks and learning, emphasizing their structural properties. Nevertheless, the temporal dynamics also play a vital role in biological neural networks, whose temporal structure is a crucial feature to examine. Biological neural networks display complex intermittency and, thus, can be studied through the lens of the temporal complexity (TC) theory. The TC approach look at the metastability of self-organized states, characterized by a power-law decay in the inter-event time distribution and in the total activity distribution or a scaling behavior in the corresponding event-driven diffusion processes. In this study, we present a temporal complexity (TC) analysis of a biologically-inspired Hopfield-type neural network model. We conducted a comparative assessment between scale-free and random network topologies, with particular emphasis on their global activation patterns. Our parametric analysis revealed comparable dynamical behaviors across both neural network architectures. Furthermore, our investigation into temporal complexity characteristics uncovered that seemingly distinct dynamical patterns exhibit similar temporal complexity behaviors. In particular, similar power-law decay in the activity distribution and similar complexity levels are observed in both topologies, but with a much reduced noise in the scale-free topology. Notably, most of the complex dynamical profiles were consistently observed in scale-free network configurations, thus confirming the crucial role of hubs in neural network dynamics.

Updated: 2025-09-23 07:53:27

标题: 不同拓扑结构中生物启发的霍普菲尔德类型网络的活动模式复杂性

摘要: 能够存储记忆的神经网络模型在计算机科学和计算神经科学中得到了广泛研究。Hopfield网络是一个为关联性或内容寻址式记忆而设计的模型的典型示例，并且已经以多种形式进行了分析。此外，复杂网络理论中的思想和方法已被纳入到人工神经网络和学习中，强调它们的结构特性。然而，在生物神经网络中，时间动态也起着至关重要的作用，其时间结构是一个重要的特征。生物神经网络显示出复杂的间歇性，因此可以通过时间复杂性（TC）理论的视角来研究。TC方法考察了自组织状态的亚稳态，其特征是在事件间隔时间分布和总活动分布中呈现幂律衰减，或者在相应的事件驱动扩散过程中表现出缩放行为。在本研究中，我们对一种受生物启发的Hopfield类型神经网络模型进行了时间复杂性（TC）分析。我们对无标度和随机网络拓扑进行了比较评估，特别关注它们的全局激活模式。我们的参数分析显示，在两种神经网络结构中都有相似的动态行为。此外，我们对时间复杂性特征的调查发现，看似不同的动态模式展现出类似的时间复杂性行为。特别是，在两种拓扑结构中都观察到了类似的活动分布幂律衰减和类似的复杂性水平，但是在无标度拓扑中噪音要少得多。值得注意的是，大多数复杂的动态配置都在无标度网络配置中始终观察到，从而证实了中心枢纽在神经网络动态中的关键作用。

更新时间: 2025-09-23 07:53:27

领域: q-bio.NC,cs.AI,nlin.AO,physics.bio-ph

下载: http://arxiv.org/abs/2509.18758v1

MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning

Recent advances in imitation learning have shown great promise for developing robust robot manipulation policies from demonstrations. However, this promise is contingent on the availability of diverse, high-quality datasets, which are not only challenging and costly to collect but are often constrained to a specific robot embodiment. Portable handheld grippers have recently emerged as intuitive and scalable alternatives to traditional robotic teleoperation methods for data collection. However, their reliance solely on first-person view wrist-mounted cameras often creates limitations in capturing sufficient scene contexts. In this paper, we present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera to overcome this limitation. This integration mitigates domain shifts between human demonstration and robot deployment, preserving the cross-embodiment advantages of handheld data-collection devices. Our experimental results, including an ablation study, demonstrate that our MV-UMI framework improves performance in sub-tasks requiring broad scene understanding by approximately 47% across 3 tasks, confirming the effectiveness of our approach in expanding the range of feasible manipulation tasks that can be learned using handheld gripper systems, without compromising the cross-embodiment advantages inherent to such systems.

Updated: 2025-09-23 07:53:05

标题: MV-UMI：用于跨体系学习的可扩展多视图界面

摘要: 最近在模仿学习方面取得了重大进展，展现出从示范中开发出稳健的机器人操纵策略的巨大潜力。然而，这一潜力取决于多样性和高质量数据集的可用性，这些数据集不仅难以收集且成本高昂，而且通常受限于特定机器人实体。便携手持夹具最近作为传统机器人远程操作方法的直观和可扩展替代方案而出现，用于数据收集。然而，它们仅依赖第一人称视角手腕安装摄像头，通常在捕获足够的场景上下文方面存在限制。在本文中，我们提出了MV-UMI（多视角通用操纵界面）框架，该框架将第三人称视角与自我摄像机相结合，以克服这一限制。这种集成减轻了人类示范和机器人部署之间的领域转变，保留了手持数据收集设备的跨实体优势。我们的实验结果，包括消融研究，证明了我们的MV-UMI框架在涉及广泛场景理解的子任务中的表现提高了约47％，跨3个任务，证实了我们的方法在扩展可以使用手持夹具系统学习的可行操纵任务范围方面的有效性，而不会损害这种系统固有的跨实体优势。

更新时间: 2025-09-23 07:53:05

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18757v1

MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.

Updated: 2025-09-23 07:52:31

标题: MOCHA：多模态对象感知跨架构对齐

摘要: 我们介绍了MOCHA（Multi-modal Objects-aware Cross-arcHitecture Alignment），这是一种知识蒸馏方法，将来自大型视觉-语言教师（例如LLaVa）的区域级多模态语义转移到轻量级视觉-仅物体检测器学生（例如YOLO）。一个翻译模块将学生特征映射到一个联合空间，学生和翻译器的训练受到双目标损失的指导，该损失强制执行局部对齐和全局关系一致性。与之前侧重于密集或全局对齐的方法不同，MOCHA在对象级别操作，实现了在不修改教师或需要推理时的文本输入的情况下有效地转移语义。我们在四个个性化检测基准下验证了我们的方法在少样本情况下的性能。结果显示相对基线的一致增益，平均得分提高了+10.1。尽管其紧凑的架构，MOCHA达到了与更大的多模态模型相媲美的性能，证明了其适用于实际部署。

更新时间: 2025-09-23 07:52:31

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.14001v2

COLT: Enhancing Video Large Language Models with Continual Tool Usage

The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering 'catastrophic forgetting' of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.

Updated: 2025-09-23 07:49:30

标题: COLT：通过持续工具使用增强视频大型语言模型

摘要: 大型语言模型（LLM）的成功显著推动了视频理解研究的发展。为了利用经过良好训练的专家模型（即工具）的好处，视频LLM优先考虑探索工具使用能力。现有方法要么提示闭源LLM，要么采用指导调谐范式进行工具使用微调。然而，这些方法假定存在固定工具的已建立存储库，并且难以推广到工具数据不断演变和流入的真实环境。为此，我们提出通过COntinuaL Tool使用（称为COLT）来增强开源视频LLM，该方法在连续的工具流中自动获得工具使用能力，而不会遗忘过去学习的工具。具体而言，我们的COLT将可学习的工具代码簿作为一种特定于工具的记忆系统。然后，根据用户指令和代码簿内工具特征之间的相似性动态选择相关工具。为了释放视频LLM的工具使用潜力，我们收集了一个以视频为中心的工具使用指导调谐数据集VideoToolBench。对之前的视频LLM基准和工具使用特定的VideoToolBench数据集进行了大量实验，证明了我们提出的COLT的最先进性能。

更新时间: 2025-09-23 07:49:30

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.18754v1

MOMEMTO: Patch-based Memory Gate Model in Time Series Foundation Model

Recently reconstruction-based deep models have been widely used for time series anomaly detection, but as their capacity and representation capability increase, these models tend to over-generalize, often reconstructing unseen anomalies accurately. Prior works have attempted to mitigate this by incorporating a memory architecture that stores prototypes of normal patterns. Nevertheless, these approaches suffer from high training costs and have yet to be effectively integrated with time series foundation models (TFMs). To address these challenges, we propose \textbf{MOMEMTO}, a TFM for anomaly detection, enhanced with a patch-based memory module to mitigate over-generalization. The memory module is designed to capture representative normal patterns from multiple domains and enables a single model to be jointly fine-tuned across multiple datasets through a multi-domain training strategy. MOMEMTO initializes memory items with latent representations from a pre-trained encoder, organizes them into patch-level units, and updates them via an attention mechanism. We evaluate our method using 23 univariate benchmark datasets. Experimental results demonstrate that MOMEMTO, as a single model, achieves higher scores on AUC and VUS metrics compared to baseline methods, and further enhances the performance of its backbone TFM, particularly in few-shot learning scenarios.

Updated: 2025-09-23 07:48:25

标题: MOMEMTO：基于补丁的时间序列基础模型中的记忆门模型

摘要: 最近，基于重建的深度模型被广泛用于时间序列异常检测，但随着它们的容量和表示能力增加，这些模型往往会过度泛化，经常准确重建未见异常。先前的研究已尝试通过结合存储正常模式原型的记忆架构来减轻这一问题。然而，这些方法存在高训练成本，并且尚未有效地与时间序列基础模型（TFMs）集成。为了解决这些挑战，我们提出了一种名为MOMEMTO的TFM异常检测模型，增强了一个基于补丁的记忆模块以减轻过度泛化。记忆模块旨在从多个领域捕获代表性的正常模式，并通过多领域训练策略使单一模型能够同时在多个数据集上进行微调。MOMEMTO使用预训练编码器的潜在表示初始化记忆项，将其组织成补丁级单位，并通过注意机制进行更新。我们使用23个单变量基准数据集评估了我们的方法。实验结果表明，MOMEMTO作为单一模型在AUC和VUS指标上实现了比基准方法更高的得分，并进一步提升了其基础TFM的性能，特别是在少样本学习场景中。

更新时间: 2025-09-23 07:48:25

领域: cs.LG

下载: http://arxiv.org/abs/2509.18751v1

DANSE: Data-driven Non-linear State Estimation of Model-free Process in Unsupervised Learning Setup

We address the tasks of Bayesian state estimation and forecasting for a model-free process in an unsupervised learning setup. For a model-free process, we do not have any a-priori knowledge of the process dynamics. In the article, we propose DANSE -- a Data-driven Nonlinear State Estimation method. DANSE provides a closed-form posterior of the state of the model-free process, given linear measurements of the state. In addition, it provides a closed-form posterior for forecasting. A data-driven recurrent neural network (RNN) is used in DANSE to provide the parameters of a prior of the state. The prior depends on the past measurements as input, and then we find the closed-form posterior of the state using the current measurement as input. The data-driven RNN captures the underlying non-linear dynamics of the model-free process. The training of DANSE, mainly learning the parameters of the RNN, is executed using an unsupervised learning approach. In unsupervised learning, we have access to a training dataset comprising only a set of measurement data trajectories, but we do not have any access to the state trajectories. Therefore, DANSE does not have access to state information in the training data and can not use supervised learning. Using simulated linear and non-linear process models (Lorenz attractor and Chen attractor), we evaluate the unsupervised learning-based DANSE. We show that the proposed DANSE, without knowledge of the process model and without supervised learning, provides a competitive performance against model-driven methods, such as the Kalman filter (KF), extended KF (EKF), unscented KF (UKF), a data-driven deep Markov model (DMM) and a recently proposed hybrid method called KalmanNet. In addition, we show that DANSE works for high-dimensional state estimation.

Updated: 2025-09-23 07:43:32

标题: DANSE: 基于数据驱动的非线性状态估计在无监督学习设置中的模型无关过程

摘要: 我们研究了贝叶斯状态估计和预测任务，针对一个无模型过程在无监督学习环境中。对于一个无模型过程，我们没有任何关于过程动态的先验知识。在这篇文章中，我们提出了DANSE - 一种数据驱动的非线性状态估计方法。DANSE给出了一个闭合形式的后验，给出了无模型过程的状态，在给定状态的线性测量的情况下。此外，它还提供了一个预测的闭合形式后验。在DANSE中使用了一个数据驱动的循环神经网络（RNN）来提供状态先验的参数。先验取决于过去的测量作为输入，然后我们使用当前测量作为输入找到状态的闭合形式后验。数据驱动的RNN捕捉了无模型过程的潜在非线性动态。DANSE的训练主要是使用无监督学习方法执行，主要是学习RNN的参数。在无监督学习中，我们只有一个包含一系列测量数据轨迹的训练数据集，但是我们没有任何访问状态轨迹的信息。因此，DANSE在训练数据中没有访问状态信息，也不能使用监督学习。使用模拟的线性和非线性过程模型（Lorenz吸引子和Chen吸引子），我们评估了基于无监督学习的DANSE。我们展示了提出的DANSE，在没有过程模型知识和没有监督学习的情况下，与基于模型的方法（如Kalman滤波器（KF），扩展KF（EKF），无迹KF（UKF），数据驱动的深度马尔可夫模型（DMM）和最近提出的名为KalmanNet的混合方法）相比，提供了竞争性的性能。此外，我们展示了DANSE适用于高维状态估计。

更新时间: 2025-09-23 07:43:32

领域: eess.SY,cs.LG,cs.SY,eess.SP

下载: http://arxiv.org/abs/2306.03897v3

Theory of periodic convolutional neural network

We introduce a novel convolutional neural network architecture, termed the \emph{periodic CNN}, which incorporates periodic boundary conditions into the convolutional layers. Our main theoretical contribution is a rigorous approximation theorem: periodic CNNs can approximate ridge functions depending on $d-1$ linear variables in a $d$-dimensional input space, while such approximation is impossible in lower-dimensional ridge settings ($d-2$ or fewer variables). This result establishes a sharp characterization of the expressive power of periodic CNNs. Beyond the theory, our findings suggest that periodic CNNs are particularly well-suited for problems where data naturally admits a ridge-like structure of high intrinsic dimension, such as image analysis on wrapped domains, physics-informed learning, and materials science. The work thus both expands the mathematical foundation of CNN approximation theory and highlights a class of architectures with surprising and practically relevant approximation capabilities.

Updated: 2025-09-23 07:43:02

标题: 定期卷积神经网络的理论

摘要: 我们介绍了一种新颖的卷积神经网络架构，称为周期性CNN，它将周期边界条件纳入卷积层中。我们的主要理论贡献是一个严格的逼近定理：周期性CNN可以逼近依赖于$d-1$个线性变量的$d$维输入空间中的脊函数，而在低维脊设置（$d-2$个或更少的变量）中这种逼近是不可能的。这一结果建立了对周期性CNN表达能力的清晰刻画。除了理论，我们的发现表明，周期性CNN特别适用于数据自然呈现高内在维度的脊状结构的问题，如在包裹域上的图像分析、基于物理的学习和材料科学。因此，这项工作既拓展了CNN逼近理论的数学基础，又突出了一类具有令人惊讶且实际相关的逼近能力的架构。

更新时间: 2025-09-23 07:43:02

领域: cs.LG

下载: http://arxiv.org/abs/2509.18744v1

LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations

Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets, including EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R2 score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering valuable insights into the target model behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.

Updated: 2025-09-23 07:40:50

标题: LD-ViCE：视频反事实解释的潜在扩散模型

摘要: 基于视频的人工智能系统越来越多地被应用于自动驾驶和医疗保健等安全关键领域。然而，由于视频数据的固有时空复杂性和深度学习模型的不透明性，解释它们的决策仍然具有挑战性。现有的解释技术通常存在时间上的连贯性有限、鲁棒性不足以及缺乏可操作的因果洞察。当前的反事实解释方法通常不包括来自目标模型的指导，降低了语义保真度和实际效用。我们提出了一种新的框架，名为视频反事实解释潜扩散（LD-ViCE），旨在解释基于视频的人工智能模型的行为。与先前的方法相比，LD-ViCE通过在潜在空间中使用最先进的扩散模型来降低生成解释的计算成本，同时通过额外的精炼步骤生成逼真且可解释的反事实。我们的实验证明了LD-ViCE在三个不同的视频数据集上的有效性，包括EchoNet-Dynamic（心脏超声）、FERV39k（面部表情）和Something-Something V2（动作识别）。LD-ViCE优于最近的一种先进方法，R2分数增加高达68％，同时将推理时间减少一半。定性分析证实LD-ViCE生成了有意义且时间上连贯的解释，为目标模型行为提供了宝贵的洞察。LD-ViCE代表了向在安全关键领域信任部署人工智能的宝贵一步。

更新时间: 2025-09-23 07:40:50

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.08422v2

Online Adaptation via Dual-Stage Alignment and Self-Supervision for Fast-Calibration Brain-Computer Interfaces

Individual differences in brain activity hinder the online application of electroencephalogram (EEG)-based brain computer interface (BCI) systems. To overcome this limitation, this study proposes an online adaptation algorithm for unseen subjects via dual-stage alignment and self-supervision. The alignment process begins by applying Euclidean alignment in the EEG data space and then updates batch normalization statistics in the representation space. Moreover, a self-supervised loss is designed to update the decoder. The loss is computed by soft pseudo-labels derived from the decoder as a proxy for the unknown ground truth, and is calibrated by Shannon entropy to facilitate self-supervised training. Experiments across five public datasets and seven decoders show the proposed algorithm can be integrated seamlessly regardless of BCI paradigm and decoder architecture. In each iteration, the decoder is updated with a single online trial, which yields average accuracy gains of 4.9% on steady-state visual evoked potentials (SSVEP) and 3.6% on motor imagery. These results support fast-calibration operation and show that the proposed algorithm has great potential for BCI applications.

Updated: 2025-09-23 07:38:37

标题: 在线通过双阶段对准和自我监督进行适应，用于快速校准脑-计算机接口

摘要: 个体差异影响大脑活动，阻碍了基于脑电图（EEG）的脑机接口（BCI）系统的在线应用。为了克服这一限制，本研究提出了一种通过双阶段对齐和自我监督的在线适应算法，用于未知受试者。对齐过程从在EEG数据空间中应用欧几里德对齐开始，然后更新表示空间中的批量归一化统计数据。此外，设计了一个自监督损失来更新解码器。损失是通过从解码器导出的软伪标签计算的，作为未知地面真相的代理，并通过香农熵进行校准，以促进自监督训练。在五个公共数据集和七个解码器上的实验显示，所提出的算法可以无缝集成，不受BCI范例和解码器架构的影响。在每次迭代中，解码器通过单个在线试验进行更新，这导致SSVEP的平均准确率提高了4.9％，运动想象提高了3.6％。这些结果支持快速校准操作，并显示所提出的算法在BCI应用中具有巨大潜力。

更新时间: 2025-09-23 07:38:37

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19403v1

Min: Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning

Class Incremental Learning (CIL) aims to continuously learn new categories while retaining the knowledge of old ones. Pre-trained models (PTMs) show promising capabilities in CIL. However, existing approaches that apply lightweight fine-tuning to backbones still induce parameter drift, thereby compromising the generalization capability of pre-trained models. Parameter drift can be conceptualized as a form of noise that obscures critical patterns learned for previous tasks. However, recent researches have shown that noise is not always harmful. For example, the large number of visual patterns learned from pre-training can be easily abused by a single task, and introducing appropriate noise can suppress some low-correlation features, thus leaving a margin for future tasks. To this end, we propose learning beneficial noise for CIL guided by information theory and propose Mixture of Noise (Min), aiming to mitigate the degradation of backbone generalization from adapting new tasks. Specifically, task-specific noise is learned from high-dimension features of new tasks. Then, a set of weights is adjusted dynamically for optimal mixture of different task noise. Finally, Min embeds the beneficial noise into the intermediate features to mask the response of inefficient patterns. Extensive experiments on six benchmark datasets demonstrate that Min achieves state-of-the-art performance in most incremental settings, with particularly outstanding results in 50-steps incremental settings. This shows the significant potential for beneficial noise in continual learning. Code is available at https://github.com/ASCIIJK/MiN-NeurIPS2025.

Updated: 2025-09-23 07:34:23

标题: 最小：基于预训练模型的类增量学习中的噪声混合

摘要: Class Incremental Learning (CIL)旨在在保留旧类别知识的同时持续学习新类别。预训练模型（PTMs）在CIL中表现出有希望的能力。然而，现有的将轻量级微调应用于主干的方法仍然会引起参数漂移，从而损害预训练模型的泛化能力。参数漂移可以被概念化为一种噪音形式，它模糊了为以前的任务学习的关键模式。然而，最近的研究表明，噪音并不总是有害的。例如，从预训练中学习的大量视觉模式可以很容易地被单个任务滥用，并且引入适当的噪音可以抑制一些低相关性特征，从而为未来的任务留下一定的余地。为此，我们提出了通过信息理论指导学习有益噪音的CIL，并提出了Mixture of Noise（Min），旨在减轻从适应新任务中降低主干泛化的问题。具体来说，从新任务的高维特征中学习特定于任务的噪音。然后，一组权重会动态调整以获得不同任务噪音的最佳混合。最后，Min将有益噪音嵌入到中间特征中，以掩盖低效模式的响应。对六个基准数据集进行的大量实验表明，Min在大多数增量设置中实现了最先进的性能，在50步增量设置中表现尤为出色。这显示了有益噪音在持续学习中的重要潜力。代码可在https://github.com/ASCIIJK/MiN-NeurIPS2025找到。

更新时间: 2025-09-23 07:34:23

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.16738v2

Consistency of Selection Strategies for Fraud Detection

This paper studies how insurers can chose which claims to investigate for fraud. Given a prediction model, typically only claims with the highest predicted propability of being fraudulent are investigated. We argue that this can lead to inconsistent learning and propose a randomized alternative. More generally, we draw a parallel with the multi-arm bandit literature and argue that, in the presence of selection, the obtained observations are not iid. Hence, dependence on past observations should be accounted for when updating parameter estimates. We formalize selection in a binary regression framework and show that model updating and maximum-likelihood estimation can be implemented as if claims were investigated at random. Then, we define consistency of selection strategies and conjecture sufficient conditions for consistency. Our simulations suggest that the often-used selection strategy can be inconsistent while the proposed randomized alternative is consistent. Finally, we compare our randomized selection strategy with Thompson sampling, a standard multi-arm bandit heuristic. Our simulations suggest that the latter can be inefficient in learning low fraud probabilities.

Updated: 2025-09-23 07:33:33

标题: 欺诈检测的选择策略一致性

摘要: 本文研究了保险公司如何选择要调查的欺诈索赔。通常情况下，只有具有最高预测欺诈可能性的索赔才会接受调查。我们认为，这可能导致学习不一致，并提出了一种随机选择的替代方案。更一般地说，我们将其与多臂赌博文献进行了类比，并认为，在选择存在的情况下，获得的观察结果不是独立同分布的。因此，在更新参数估计时应考虑过去观察的依赖性。我们在二元回归框架中形式化选择，并展示了模型更新和最大似然估计可以像随机调查索赔一样实施。然后，我们定义了选择策略的一致性，并推测了一致性的充分条件。我们的模拟结果表明，通常使用的选择策略可能不一致，而提出的随机选择策略是一致的。最后，我们将我们的随机选择策略与汤普森采样，一种标准的多臂赌博启发式方法进行了比较。我们的模拟结果表明，后者在学习低欺诈概率时可能效率低下。

更新时间: 2025-09-23 07:33:33

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2509.18739v1

SoK: Large Language Model Copyright Auditing via Fingerprinting

The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that extracts and compares the distinctive features from LLMs to identify infringements, offers a promising solution to copyright auditing. However, its reliability remains uncertain due to the prevalence of diverse model modifications and the lack of standardized evaluation. In this SoK, we present the first comprehensive study of LLM fingerprinting. We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches, providing a structured overview of the state of the art. We further propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios. Built upon mainstream foundation models and comprising 149 distinct model instances, LeaFBench integrates 13 representative post-development techniques, spanning both parameter-altering methods (e.g., fine-tuning, quantization) and parameter-independent mechanisms (e.g., system prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and weaknesses of existing methods, thereby outlining future research directions and critical open problems in this emerging field. The code is available at https://github.com/shaoshuo-ss/LeaFBench.

Updated: 2025-09-23 07:31:42

标题: SoK: 通过指纹识别进行大型语言模型版权审计

摘要: 大型语言模型（LLM）所需的广泛能力和大量资源使它们成为有价值的知识产权，但它们仍然容易受到侵犯版权的威胁，例如未经授权的使用和模型盗窃。LLM指纹技术是一种非侵入性技术，可以提取和比较LLM的独特特征，以识别侵权行为，为版权审计提供了一种有希望的解决方案。然而，由于各种模型修改的普遍存在和缺乏标准化评估，其可靠性仍有待确定。在这个研究中，我们提出了对LLM指纹技术的首次全面研究。我们介绍了一个统一框架和正式分类法，将现有方法分为白盒和黑盒两种方法，为现有技术的最新状况提供了结构化概述。我们进一步提出了LeaFBench，这是第一个系统化的基准测试，用于评估LLM指纹技术在实际部署场景下的效果。基于主流基础模型，LeaFBench包括了149个不同的模型实例，集成了13种代表性的后期开发技术，涵盖了参数修改方法（如微调、量化）和参数独立机制（如系统提示、RAG）。在LeaFBench上进行的广泛实验揭示了现有方法的优势和劣势，从而勾勒出这一新兴领域的未来研究方向和关键的开放问题。该代码可在https://github.com/shaoshuo-ss/LeaFBench上找到。

更新时间: 2025-09-23 07:31:42

领域: cs.CR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2508.19843v2

MetaFed: Advancing Privacy, Performance, and Sustainability in Federated Metaverse Systems

The rapid expansion of immersive Metaverse applications introduces complex challenges at the intersection of performance, privacy, and environmental sustainability. Centralized architectures fall short in addressing these demands, often resulting in elevated energy consumption, latency, and privacy concerns. This paper proposes MetaFed, a decentralized federated learning (FL) framework that enables sustainable and intelligent resource orchestration for Metaverse environments. MetaFed integrates (i) multi-agent reinforcement learning for dynamic client selection, (ii) privacy-preserving FL using homomorphic encryption, and (iii) carbon-aware scheduling aligned with renewable energy availability. Evaluations on MNIST and CIFAR-10 using lightweight ResNet architectures demonstrate that MetaFed achieves up to 25% reduction in carbon emissions compared to conventional approaches, while maintaining high accuracy and minimal communication overhead. These results highlight MetaFed as a scalable solution for building environmentally responsible and privacy-compliant Metaverse infrastructures.

Updated: 2025-09-23 07:30:51

标题: MetaFed：在联邦式元宇宙系统中推进隐私、性能和可持续性

摘要: 虚拟现实Metaverse应用的快速扩张引入了在性能、隐私和环境可持续性交叉点上的复杂挑战。集中式架构在解决这些需求方面存在不足，往往导致能源消耗增加、延迟和隐私问题。本文提出了MetaFed，这是一个分散式联邦学习（FL）框架，可以为Metaverse环境实现可持续和智能资源编排。MetaFed整合了（i）多智能体强化学习用于动态客户端选择，（ii）使用同态加密的隐私保护FL，以及（iii）与可再生能源可用性一致的碳感知调度。在MNIST和CIFAR-10上使用轻量级ResNet架构进行评估表明，与传统方法相比，MetaFed实现了高达25%的碳排放减少，同时保持高准确性和最小通信开销。这些结果突显了MetaFed作为构建环境负责和符合隐私法规的Metaverse基础设施的可扩展解决方案。

更新时间: 2025-09-23 07:30:51

领域: cs.LG,cs.CR,cs.CY,cs.DC,cs.ET

下载: http://arxiv.org/abs/2508.17341v2

PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality

To mitigate the hallucination problem in large language models, DoLa exploits early exit logits from the same model as a contrastive prior. However, we found that these early exit logits tend to be flat, low in magnitude, and fail to reflect meaningful contrasts. To address this, we propose PruneCD, a novel contrastive decoding method that constructs the amateur model via layer pruning rather than early exit. This design leads to more informative and well-aligned logits, enabling more effective contrastive decoding. Through qualitative and quantitative analyses, we demonstrate that PruneCD consistently improves factuality with minimal inference overhead, offering a robust and practical approach to mitigating hallucinations in LLMs.

Updated: 2025-09-23 07:28:16

标题: PruneCD: 对比修剪的自我模型以提高解码的真实性

摘要: 为了减轻大型语言模型中的幻觉问题，DoLa利用来自同一模型的早期退出logits作为对比先验。然而，我们发现这些早期退出logits往往是平坦的、幅度低，无法反映有意义的对比。为了解决这个问题，我们提出了PruneCD，一种通过层修剪而不是早期退出构建业余模型的新颖对比解码方法。这种设计导致更具信息量和良好对齐的logits，从而实现更有效的对比解码。通过定性和定量分析，我们证明PruneCD在最小推理开销下持续改善了真实性，为减轻LLMs中的幻觉提供了一种稳健且实用的方法。

更新时间: 2025-09-23 07:28:16

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.16598v2

Error Bound Analysis for the Regularized Loss of Deep Linear Neural Networks

The optimization foundations of deep linear networks have recently received significant attention. However, due to their inherent non-convexity and hierarchical structure, analyzing the loss functions of deep linear networks remains a challenging task. In this work, we study the local geometric landscape of the regularized squared loss of deep linear networks around each critical point. Specifically, we derive a closed-form characterization of the critical point set and establish an error bound for the regularized loss under mild conditions on network width and regularization parameters. Notably, this error bound quantifies the distance from a point to the critical point set in terms of the current gradient norm, which can be used to derive linear convergence of first-order methods. To support our theoretical findings, we conduct numerical experiments and demonstrate that gradient descent converges linearly to a critical point when optimizing the regularized loss of deep linear networks.

Updated: 2025-09-23 07:15:36

标题: 深度线性神经网络正则化损失的误差界分析

摘要: 深度线性网络的优化基础最近受到了重视。然而，由于其固有的非凸性和层次结构，分析深度线性网络的损失函数仍然是一个具有挑战性的任务。在这项工作中，我们研究了深度线性网络的正则化平方损失在每个临界点周围的局部几何景观。具体来说，我们推导出了临界点集的封闭形式特征，并在网络宽度和正则化参数的轻微条件下建立了正则化损失的误差界。值得注意的是，这个误差界量化了从一个点到临界点集的距离，以当前梯度范数为基础，这可以用来推导一阶方法的线性收敛性。为了支持我们的理论发现，我们进行了数值实验，并展示了梯度下降在优化深度线性网络的正则化损失时线性收敛到临界点。

更新时间: 2025-09-23 07:15:36

领域: math.OC,cs.LG,90C26, 68T07, 65K10

下载: http://arxiv.org/abs/2502.11152v3

Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. Conventional mixture-of-experts (MoE) architectures suffer from suboptimal coordination dynamics, where isolated expert operations expose the model to overfitting risks. Moreover, they have not been effectively extended to attention blocks, which limits further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer model into an equivalent group of experts and applies a hierarchical routing mechanism to allocate input subspaces to specialized experts. Our approach advances MoE design with four key innovations: (1) Constructing expert groups by partitioning non-MoE models into functionally equivalent specialists (2) Developing a hierarchical routing paradigm that integrates patch-wise data selection and expert selection strategies. (3) Extending the MoE design to attention blocks. (4) Proposing a hardware-optimized parallelization scheme that exploits batched matrix multiplications for efficient expert computation. The experiments demonstrate that our UoE model surpasses Full Attention, state-of-the-art MoEs and efficient transformers in several tasks across image and natural language domains. In language modeling tasks, UoE achieves an average reduction of 2.38 in perplexity compared to the best-performing MoE method with only 76% of its FLOPs. In the Long Range Arena benchmark, it demonstrates an average score at least 0.68% higher than all comparison models, with only 50% of the FLOPs of the best MoE method. In image classification, it yields an average accuracy improvement of 1.75% over the best model while maintaining comparable FLOPs. The source codes are available at https://github.com/YujiaoYang-work/UoE.

Updated: 2025-09-23 07:09:46

标题: 专家联盟：将分层路由调整为等效分解的变压器

摘要: 混合专家（MoE）在保持计算效率的同时提高了模型性能，使其非常适合大规模应用。传统的混合专家（MoE）架构存在次优协调动态，孤立的专家操作会使模型面临过拟合风险。此外，它们还没有有效地扩展到注意力块，这限制了进一步的效率提升。为了解决这些问题，我们提出了联合专家（UoE），将变压器模型分解为一组等效的专家，并应用分层路由机制将输入子空间分配给专业专家。我们的方法通过四个关键创新推进了MoE设计：（1）通过将非MoE模型划分为功能等效的专家组来构建专家组；（2）开发集成面片数据选择和专家选择策略的分层路由范例；（3）将MoE设计扩展到注意力块；（4）提出一种硬件优化的并行化方案，利用批量矩阵乘法来进行高效的专家计算。实验证明，我们的UoE模型在图像和自然语言领域的多项任务中超过了全注意力、最先进的MoE和高效的变压器。在语言建模任务中，与最佳MoE方法相比，UoE的困惑度平均降低了2.38，仅占其FLOPs的76%。在Long Range Arena基准测试中，它的平均得分至少比所有比较模型高0.68%，仅占最佳MoE方法FLOPs的50%。在图像分类中，它的平均准确率提高了1.75%，同时保持可比的FLOPs。源代码可在https://github.com/YujiaoYang-work/UoE上找到。

更新时间: 2025-09-23 07:09:46

领域: cs.LG,cs.AI,cs.CL,68T07,I.5.1; I.2.0

下载: http://arxiv.org/abs/2503.02495v3

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, there lacks a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. In this paper, $\textbf{(i)}$ we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. $\textbf{(ii)}$ We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. $\textbf{(iii)}$ We find that model merging offers a promising way for building improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.

Updated: 2025-09-23 07:08:26

标题: OptMerge：通过模型合并统一多模态LLM能力和模态

摘要: 基金会模型由于资源密集的训练而更新缓慢，而领域特定模型在发布之间快速演变。模型合并旨在将多个专家模型合并成一个更强大的模型，减少存储和服务成本，同时支持分散式开发。尽管具有潜力，先前的研究主要集中在合并视觉分类模型或大型语言模型（LLMs）用于代码和数学任务。最近，通过大规模多模态训练扩展LLMs的多模态LLMs（MLLMs）开始受到关注。然而，目前缺乏一个明确划分MLLM训练和评估任务的模型合并研究基准。在本文中，$\textbf{(i)}$我们为MLLMs引入了一个模型合并基准，其中包括多个任务，如VQA、几何、图表、OCR和基础，研究LoRA和完全微调模型。此外，我们探讨了模型合并如何结合不同的模态（例如，视觉-语言、音频-语言和视频-语言模型），朝着全语言模型迈进。$\textbf{(ii)}$我们在基准上实施了10种模型合并算法。此外，我们提出了一种新方法，该方法从任务向量中去除噪声，并根据在任务向量交互上定义的损失稳健地优化合并向量，实现了平均性能增益2.48%。$\textbf{(iii)}$我们发现，模型合并为构建改进的MLLMs提供了一种有希望的途径，而无需训练数据。我们的结果还表明，多模态之间的互补性胜过单个模态。

更新时间: 2025-09-23 07:08:26

领域: cs.AI

下载: http://arxiv.org/abs/2505.19892v2

LLM-Enhanced Self-Evolving Reinforcement Learning for Multi-Step E-Commerce Payment Fraud Risk Detection

This paper presents a novel approach to e-commerce payment fraud detection by integrating reinforcement learning (RL) with Large Language Models (LLMs). By framing transaction risk as a multi-step Markov Decision Process (MDP), RL optimizes risk detection across multiple payment stages. Crafting effective reward functions, essential for RL model success, typically requires significant human expertise due to the complexity and variability in design. LLMs, with their advanced reasoning and coding capabilities, are well-suited to refine these functions, offering improvements over traditional methods. Our approach leverages LLMs to iteratively enhance reward functions, achieving better fraud detection accuracy and demonstrating zero-shot capability. Experiments with real-world data confirm the effectiveness, robustness, and resilience of our LLM-enhanced RL framework through long-term evaluations, underscoring the potential of LLMs in advancing industrial RL applications.

Updated: 2025-09-23 07:07:16

标题: 通过LLM增强的自进化强化学习，实现多步骤电子商务支付欺诈风险检测

摘要: 本文提出了一种新颖的电子商务支付欺诈检测方法，通过将强化学习（RL）与大型语言模型（LLMs）相结合。将交易风险框定为多步马尔可夫决策过程（MDP），RL优化跨多个支付阶段的风险检测。有效的奖励函数对于RL模型的成功至关重要，通常需要人类专业知识来设计，因为设计中存在的复杂性和变化性。LLMs具有先进的推理和编码能力，非常适合优化这些函数，相较于传统方法，能够提供改进。我们的方法利用LLMs迭代地增强奖励函数，实现更好的欺诈检测准确性，并展示了零样本能力。通过真实数据的实验验证了我们LLM增强的RL框架的有效性、鲁棒性和韧性，长期评估强调了LLMs在推进工业RL应用中的潜力。

更新时间: 2025-09-23 07:07:16

领域: cs.LG

下载: http://arxiv.org/abs/2509.18719v1

TinyDef-DETR: A DETR-based Framework for Defect Detection in Transmission Lines from UAV Imagery

Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult targets. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous targets.

Updated: 2025-09-23 07:02:16

标题: TinyDef-DETR：基于DETR的框架，用于从无人机图像中检测输电线路缺陷

摘要: 通过UAV图像自动检测输电线路的缺陷是一项具有挑战性的任务，因为缺陷的尺寸小、模糊不清且背景复杂。本文提出了TinyDef-DETR，这是一个基于DETR的框架，旨在实现从UAV获取的图像中准确高效地检测输电线路缺陷。该模型整合了四个主要组件：一个增强边界敏感性表示的边缘增强ResNet主干，一个无步幅的空间到深度模块以实现保留细节的降采样，一个跨阶段双域多尺度注意机制以共同建模全局上下文和局部线索，以及一个Focaler-Wise-SIoU回归损失，以改善小型和困难目标的定位。这些设计共同有效地缓解了传统检测器的局限性。对公共和真实世界数据集进行的大量实验表明，TinyDef-DETR实现了卓越的检测性能和强大的泛化能力，同时保持了适度的计算开销。TinyDef-DETR的准确性和高效性使其成为一种适用于基于UAV的输电线路缺陷检测的方法，特别适用于涉及小型和模糊目标的场景。

更新时间: 2025-09-23 07:02:16

领域: cs.CV,cs.AI,cs.CE

下载: http://arxiv.org/abs/2509.06035v5

A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications

The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.

Updated: 2025-09-23 07:02:05

标题: 马尔科夫决策过程状态相似性的广义双模拟度量：从理论命题到应用

摘要: 双模拟度量（BSM）是计算马尔科夫决策过程（MDP）中状态相似性的强大工具，揭示了在BSM中更接近的状态具有更相似的最优值函数。虽然BSM已成功应用于强化学习（RL）中的任务，如状态表示学习和策略探索，但其应用于多个MDP场景，如策略转移，仍然具有挑战性。先前的研究已尝试将BSM推广到MDP对，但对其数学属性的严格分析不足，限制了进一步的理论进展。在这项工作中，我们正式建立了一种广义双模拟度量（GBSM），用于对MDP对之间进行严格证明，具有三个基本属性：GBSM对称性，MDP间三角不等式，以及相同状态空间上的距离边界。利用这些属性，我们在MDP中理论分析了策略转移，状态聚合和基于采样的估计，得到的显式边界比从标准BSM推导的边界严格更紧。此外，GBSM为估计提供了一个封闭形式的样本复杂度，改进了基于BSM的现有渐近结果。数值结果验证了我们的理论发现，并展示了GBSM在多MDP场景中的有效性。

更新时间: 2025-09-23 07:02:05

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18714v1

MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service

Large Language Model-based agents(LLM-based agents) are increasingly deployed in customer service, yet they often forget across sessions, repeat errors, and lack mechanisms for continual self-improvement. This makes them unreliable in dynamic settings where stability and consistency are critical. To better evaluate these properties, we emphasize two indicators: task success rate as a measure of overall effectiveness, and consistency metrics such as Pass$^k$ to capture reliability across multiple trials. To address the limitations of existing approaches, we propose MemOrb, a lightweight and plug-and-play verbal reinforcement memory layer that distills multi-turn interactions into compact strategy reflections. These reflections are stored in a shared memory bank and retrieved to guide decision-making, without requiring any fine-tuning. Experiments show that MemOrb significantly improves both success rate and stability, achieving up to a 63 percentage-point gain in multi-turn success rate and delivering more consistent performance across repeated trials. Our results demonstrate that structured reflection is a powerful mechanism for enhancing long-term reliability of frozen LLM agents in customer service scenarios.

Updated: 2025-09-23 06:57:07

标题: MemOrb：一种适用于电子商务客户服务的即插即用的口头强化记忆层

摘要: 基于大型语言模型的代理(LLM-based agents)越来越多地被部署在客户服务领域，然而它们经常会在不同会话之间忘记信息，重复错误，并且缺乏持续自我改进的机制。这使它们在需要稳定性和一致性的动态环境中变得不可靠。为了更好地评估这些特性，我们强调两个指标：任务成功率作为整体有效性的衡量标准，以及一致性指标如Pass$^k$来捕捉多次试验中的可靠性。为了解决现有方法的局限性，我们提出了MemOrb，一个轻量级且即插即用的口头强化记忆层，将多轮交互转化为紧凑的策略反思。这些反思被存储在共享内存库中，并在决策过程中检索，而不需要任何微调。实验表明，MemOrb显著提高了成功率和稳定性，多轮成功率提高了63个百分点，并在重复试验中提供更一致的性能。我们的结果表明，结构化反思是增强客户服务场景中冻结的LLM代理的长期可靠性的强大机制。

更新时间: 2025-09-23 06:57:07

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.18713v1

RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts to leverage generic foundation models for open-vocabulary RSVG, they overly rely on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose \textbf{RSVG-ZeroOV}, a training-free framework that aims to explore the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention\footnote[1]{In this paper, although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.}maps that capture semantic correlations between text queries and visual regions. (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.

Updated: 2025-09-23 06:52:15

标题: RSVG-ZeroOV：探索一种零训练框架，用于在遥感图像中进行零-shot开放词汇视觉定位

摘要: 遥感视觉定位（RSVG）旨在基于自由形式的自然语言表达来定位遥感图像中的对象。现有方法通常受限于封闭集词汇，限制了它们在开放世界场景中的适用性。虽然最近尝试利用通用基础模型进行开放词汇RSVG，但它们过度依赖昂贵的高质量数据集和耗时的微调。为了解决这些限制，我们提出了RSVG-ZeroOV，这是一个无需训练的框架，旨在探索冻结通用基础模型在零-shot开放词汇RSVG中的潜力。具体而言，RSVG-ZeroOV包括三个关键阶段：（i）概述：我们利用视觉-语言模型（VLM）获取跨注意力图，捕捉文本查询和视觉区域之间的语义相关性。（ii）关注：通过利用扩散模型（DM）的细粒度建模先验知识，我们填补了对象的结构和形状信息中被VLM经常忽视的空白部分。（iii）演变：引入了一个简单而有效的注意力演进模块，抑制无关激活，产生对所指对象的纯净分割掩模。无需繁琐的任务特定训练，RSVG-ZeroOV提供了一种高效且可扩展的解决方案。大量实验证明，该框架始终优于现有的弱监督和零-shot方法。

更新时间: 2025-09-23 06:52:15

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.18711v1

Autonomous Data Agents: A New Opportunity for Smart Data

As data continues to grow in scale and complexity, preparing, transforming, and analyzing it remains labor-intensive, repetitive, and difficult to scale. Since data contains knowledge and AI learns knowledge from it, the alignment between AI and data is essential. However, data is often not structured in ways that are optimal for AI utilization. Moreover, an important question arises: how much knowledge can we pack into data through intensive data operations? Autonomous data agents (DataAgents), which integrate LLM reasoning with task decomposition, action reasoning and grounding, and tool calling, can autonomously interpret data task descriptions, decompose tasks into subtasks, reason over actions, ground actions into python code or tool calling, and execute operations. Unlike traditional data management and engineering tools, DataAgents dynamically plan workflows, call powerful tools, and adapt to diverse data tasks at scale. This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents are capable of handling collection, integration, preprocessing, selection, transformation, reweighing, augmentation, reprogramming, repairs, and retrieval. Through these capabilities, DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend. We then define the concept of DataAgents and discuss their architectural design, training strategies, as well as the new skills and capabilities they enable. Finally, we call for concerted efforts to advance action workflow optimization, establish open datasets and benchmark ecosystems, safeguard privacy, balance efficiency with scalability, and develop trustworthy DataAgent guardrails to prevent malicious actions.

Updated: 2025-09-23 06:46:41

标题: 自主数据代理：智能数据的新机遇

摘要: 随着数据规模和复杂性不断增长，准备、转换和分析数据仍然是一项劳动密集、重复性强且难以扩展的工作。由于数据包含知识，而人工智能从中学习知识，因此人工智能和数据之间的对齐至关重要。然而，数据通常并非以最适合人工智能利用的方式结构化。此外，一个重要问题出现了：我们可以通过密集的数据操作将多少知识填充到数据中？自主数据代理（DataAgents）将LLM推理与任务分解、动作推理和基础、以及工具调用相结合，可以自主解释数据任务描述，将任务分解为子任务，对动作进行推理，将动作基于Python代码或工具调用，执行操作。与传统的数据管理和工程工具不同，DataAgents动态规划工作流程，调用强大的工具，并在规模上适应各种数据任务。本报告认为DataAgents代表了向自主数据到知识系统的范式转变。DataAgents能够处理收集、集成、预处理、选择、转换、重新加权、增强、重新编程、修复和检索。通过这些能力，DataAgents将复杂和非结构化的数据转化为连贯和可操作的知识。我们首先探讨为什么代理式人工智能和数据到知识系统的融合已经成为一个关键趋势。然后我们定义了DataAgents的概念，并讨论了它们的架构设计、训练策略，以及它们所激发的新技能和能力。最后，我们呼吁共同努力推动动作工作流程的优化，建立开放数据集和基准生态系统，保护隐私，平衡效率与可扩展性，并开发可信赖的DataAgent防护措施以防止恶意行为。

更新时间: 2025-09-23 06:46:41

领域: cs.AI

下载: http://arxiv.org/abs/2509.18710v1

Learning When to Restart: Nonstationary Newsvendor from Uncensored to Censored Demand

We study nonstationary newsvendor problems under nonparametric demand models and general distributional measures of nonstationarity, addressing the practical challenges of unknown degree of nonstationarity and demand censoring. We propose a novel distributional-detection-and-restart framework for learning in nonstationary environments, and instantiate it through two efficient algorithms for the uncensored and censored demand settings. The algorithms are fully adaptive, requiring no prior knowledge of the degree and type of nonstationarity, and offer a flexible yet powerful approach to handling both abrupt and gradual changes in nonstationary environments. We establish a comprehensive optimality theory for our algorithms by deriving matching regret upper and lower bounds under both general and refined structural conditions with nontrivial proof techniques that are of independent interest. Numerical experiments using real-world datasets, including nurse staffing data for emergency departments and COVID-19 test demand data, showcase the algorithms' superior and robust empirical performance. While motivated by the newsvendor problem, the distributional-detection-and-restart framework applies broadly to a wide class of nonstationary stochastic optimization problems. Managerially, our framework provides a practical, easy-to-deploy, and theoretically grounded solution for decision-making under nonstationarity.

Updated: 2025-09-23 06:46:37

标题: 学习何时重新开始：非稳态新闻供应商从未被审查到被审查的需求

摘要: 我们研究了在非参数需求模型和一般的非定常分布测度下的非定常新闻供应商问题，解决了未知非定常程度和需求审查的实际挑战。我们提出了一种新颖的分布检测和重启框架，用于在非定常环境中学习，并通过两种高效算法实现了它，用于未审查和审查需求设置。这些算法是完全自适应的，不需要先验知识来了解非定常性的程度和类型，同时提供了一种灵活而强大的方法来处理非定常环境中的突然和渐变变化。通过推导出与一般和精细结构条件下的匹配遗憾上下界，建立了我们算法的全面最优理论，这些证明技术具有独立的兴趣。使用真实世界数据集进行的数值实验，包括急诊科护士人员配置数据和COVID-19检测需求数据，展示了算法卓越和稳健的实证性能。虽然受新闻供应商问题的启发，但分布检测和重启框架广泛适用于一类广泛的非定常随机优化问题。在管理层面，我们的框架为在非定常情况下进行决策提供了一个实用、易部署和理论基础的解决方案。

更新时间: 2025-09-23 06:46:37

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2509.18709v1

Towards Rational Pesticide Design with Graph Machine Learning Models for Ecotoxicology

This research focuses on rational pesticide design, using graph machine learning to accelerate the development of safer, eco-friendly agrochemicals, inspired by in silico methods in drug discovery. With an emphasis on ecotoxicology, the initial contributions include the creation of ApisTox, the largest curated dataset on pesticide toxicity to honey bees. We conducted a broad evaluation of machine learning (ML) models for molecular graph classification, including molecular fingerprints, graph kernels, GNNs, and pretrained transformers. The results show that methods successful in medicinal chemistry often fail to generalize to agrochemicals, underscoring the need for domain-specific models and benchmarks. Future work will focus on developing a comprehensive benchmarking suite and designing ML models tailored to the unique challenges of pesticide discovery.

Updated: 2025-09-23 06:38:05

标题: 朝向有理农药设计：基于图机器学习模型的生态毒理学

摘要: 这项研究着重于理性农药设计，利用图机器学习加快开发更安全、环保的农药，受到药物发现中基于计算方法的启发。在生态毒理学方面，初始贡献包括创建了 ApisTox 数据集，这是关于农药对蜜蜂毒性的最大精心策划数据集。我们对分子图分类的机器学习（ML）模型进行了广泛评估，包括分子指纹、图核、GNNs 和预训练的转换器。结果表明，在药物化学中成功的方法通常无法推广到农药，强调了需要领域特定的模型和基准的必要性。未来的工作将专注于开发全面的基准测试套件，并设计适应农药发现独特挑战的 ML 模型。

更新时间: 2025-09-23 06:38:05

领域: cs.LG

下载: http://arxiv.org/abs/2509.18703v1

Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

The adoption of large language models (LLMs) is transforming the peer review process, from assisting reviewers in writing more detailed evaluations to generating entire reviews automatically. While these capabilities offer exciting opportunities, they also raise critical concerns about fairness and reliability. In this paper, we investigate bias in LLM-generated peer reviews by conducting controlled experiments on sensitive metadata, including author affiliation and gender. Our analysis consistently shows affiliation bias favoring institutions highly ranked on common academic rankings. Additionally, we find some gender preferences, which, even though subtle in magnitude, have the potential to compound over time. Notably, we uncover implicit biases that become more evident with token-based soft ratings.

Updated: 2025-09-23 06:37:56

标题: 裁决中的公正：揭示LLM辅助的同行评审中的（隐藏）偏见

摘要: 大型语言模型（LLMs）的采用正在改变同行评审过程，从帮助审稿人撰写更详细的评价到自动生成整篇评论。尽管这些能力提供了令人兴奋的机会，但也引发了关于公平性和可靠性的重要关注。在本文中，我们通过对作者所属机构和性别等敏感元数据进行受控实验，调查了LLM生成的同行评审中的偏见。我们的分析一贯显示出偏向常见学术排名高的机构的关联偏见。此外，我们发现了一些性别偏好，尽管在程度上微妙，但有可能随着时间的推移而叠加。值得注意的是，我们发现了随着基于令牌的软评级而变得更加明显的隐性偏见。

更新时间: 2025-09-23 06:37:56

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2509.13400v3

SpellerSSL: Self-Supervised Learning with P300 Aggregation for Speller BCIs

Electroencephalogram (EEG)-based P300 speller brain-computer interfaces (BCIs) face three main challenges: low signal-to-noise ratio (SNR), poor generalization, and time-consuming calibration. We propose SpellerSSL, a framework that combines self-supervised learning (SSL) with P300 aggregation to address these issues. First, we introduce an aggregation strategy to enhance SNR. Second, to achieve generalization in training, we employ a customized 1D U-Net backbone and pretrain the model on both cross-domain and in-domain EEG data. The pretrained model is subsequently fine-tuned with a lightweight ERP-Head classifier for P300 detection, which adapts the learned representations to subject-specific data. Our evaluations on calibration time demonstrate that combining the aggregation strategy with SSL significantly reduces the calibration burden per subject and improves robustness across subjects. Experimental results show that SSL learns effective EEG representations in both in-domain and cross-domain, with in-domain achieving a state-of-the-art character recognition rate of 94% with only 7 repetitions and the highest information transfer rate (ITR) of 21.86 bits/min on the public II-B dataset. Moreover, in-domain SSL with P300 aggregation reduces the required calibration size by 60% while maintaining a comparable character recognition rate. To the best of our knowledge, this is the first study to apply SSL to P300 spellers, highlighting its potential to improve both efficiency and generalization in speller BCIs and paving the way toward an EEG foundation model for P300 speller BCIs.

Updated: 2025-09-23 06:28:44

标题: SpellerSSL：基于P300聚合的拼字BCIs的自监督学习

摘要: 脑电图（EEG）基于P300拼写器脑-计算机界面（BCI）面临三个主要挑战：信噪比低，泛化能力差，校准时间长。我们提出了SpellerSSL，这是一个结合自监督学习（SSL）和P300聚合的框架，以解决这些问题。首先，我们引入了一种聚合策略来增强信噪比。其次，为了在训练中实现泛化，我们采用了定制的1D U-Net骨干，并在跨域和领域内EEG数据上对模型进行预训练。预训练模型随后通过轻量级ERP-Head分类器进行微调，用于P300检测，该分类器使学习到的表示适应特定于受试者的数据。我们的校准时间评估表明，将聚合策略与SSL相结合显着减少了每个受试者的校准负担，并增强了跨受试者的稳健性。实验结果显示，SSL在领域内和跨域中学习到有效的EEG表示，领域内的字符识别率达到94%，仅需7次重复，并在公共II-B数据集上实现了最高的信息传输速率（ITR）达21.86比特/分钟。此外，领域内SSL与P300聚合减少了60%的校准大小，同时保持可比的字符识别率。据我们所知，这是第一项将SSL应用于P300拼写器的研究，突显了其在提高拼写器BCI的效率和泛化性方面的潜力，并为P300拼写器BCI铺平了一条通向EEG基础模型的道路。

更新时间: 2025-09-23 06:28:44

领域: eess.SP,cs.LG

下载: http://arxiv.org/abs/2509.19401v1

Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization

The sliced Wasserstein distance (SW) reduces optimal transport on $\mathbb{R}^d$ to a sum of one-dimensional projections, and thanks to this efficiency, it is widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper presents an alternate, novel approach: learning directions with Bayesian optimization (BO), particularly in settings where SW appears inside an optimization loop (e.g., gradient flows). We introduce a family of drop-in selectors for projection directions: BOSW, a one-shot BO scheme on the unit sphere; RBOSW, a periodic-refresh variant; ABOSW, an adaptive hybrid that seeds from competitive QSW sets and performs a few lightweight BO refinements; and ARBOSW, a restarted hybrid that periodically relearns directions during optimization. Our BO approaches can be composed with QSW and its variants (demonstrated by ABOSW/ARBOSW) and require no changes to downstream losses or gradients. We provide numerical experiments where our methods achieve state-of-the-art performance, and on the experimental suite of the original QSW paper, we find that ABOSW and ARBOSW can achieve convergence comparable to the best QSW variants with modest runtime overhead.

Updated: 2025-09-23 06:27:38

标题: 通过自适应贝叶斯优化实现高效的切片瓦瑟斯坦距离计算

摘要: 切片Wasserstein距离（SW）将$\mathbb{R}^d$上的最优输运问题简化为一维投影的总和，由于这种高效性，在几何学、生成建模和注册任务中被广泛使用。最近的研究表明，用于计算SW的拟随机蒙特卡罗构造（QSW）产生具有出色逼近误差的方向集。本文提出了一种替代的、新颖的方法：使用贝叶斯优化（BO）学习方向，特别是在SW出现在优化循环内的情况下（例如，梯度流）。我们引入了一系列投影方向选择器：BOSW，一个在单位球上的一次性BO方案；RBOSW，一个周期性刷新变体；ABOSW，一种自适应混合方法，从竞争性QSW集中引入种子，并进行一些轻量级的BO调整；以及ARBOSW，一种在优化过程中定期重新学习方向的重启混合方法。我们的BO方法可以与QSW及其变体结合使用（由ABOSW/ARBOSW演示），并且不需要对下游损失或梯度进行任何更改。我们提供了数值实验，证明我们的方法实现了最先进的性能，在原始QSW论文的实验套件中，我们发现ABOSW和ARBOSW可以在适度的运行时间开销下实现与最佳QSW变体相当的收敛性。

更新时间: 2025-09-23 06:27:38

领域: cs.LG,49Q22 (Primary) 90C57, 68Txx (Secondary),G.3; I.2

下载: http://arxiv.org/abs/2509.17405v2

FlowCrypt: Flow-Based Lightweight Encryption with Near-Lossless Recovery for Cloud Photo Privacy

The widespread adoption of smartphone photography has led users to increasingly rely on cloud storage for personal photo archiving and sharing, raising critical privacy concerns. Existing deep learning-based image encryption schemes, typically built upon CNNs or GANs, often depend on traditional cryptographic algorithms and lack inherent architectural reversibility, resulting in limited recovery quality and poor robustness. Invertible neural networks (INNs) have emerged to address this issue by enabling reversible transformations, yet the first INN-based encryption scheme still relies on an auxiliary reference image and discards by-product information before decryption, leading to degraded recovery and limited practicality. To address these limitations, this paper proposes FlowCrypt, a novel flow-based image encryption framework that simultaneously achieves near-lossless recovery, high security, and lightweight model design. FlowCrypt begins by applying a key-conditioned random split to the input image, enhancing forward-process randomness and encryption strength. The resulting components are processed through a Flow-based Encryption/Decryption (FED) module composed of invertible blocks, which share parameters across encryption and decryption. Thanks to its reversible architecture and reference-free design, FlowCrypt ensures high-fidelity image recovery. Extensive experiments show that FlowCrypt achieves recovery quality with 100dB on three datasets, produces uniformly distributed cipher images, and maintains a compact architecture with only 1M parameters, making it suitable for mobile and edge-device applications.

Updated: 2025-09-23 06:25:35

标题: FlowCrypt：基于流的轻量级加密，用于云照片隐私的近无损恢复.

摘要: 智能手机摄影技术的广泛应用导致用户越来越依赖云存储来进行个人照片归档和分享，引发了重要的隐私问题。现有基于深度学习的图像加密方案通常建立在CNNs或GANs之上，往往依赖于传统的密码算法，并缺乏固有的可逆性架构，导致恢复质量有限且鲁棒性差。可逆神经网络（INNs）已经出现以解决这一问题，通过实现可逆变换，然而第一个基于INN的加密方案仍然依赖于辅助参考图像，并在解密前丢弃副产品信息，导致恢复质量下降和实用性有限。为了解决这些限制，本文提出了FlowCrypt，一种新颖的基于流的图像加密框架，同时实现了接近无损恢复、高安全性和轻量级模型设计。FlowCrypt首先通过应用键条件随机分割输入图像，增强了前向过程的随机性和加密强度。结果组件通过由可逆块组成的基于流的加密/解密（FED）模块进行处理，这些块在加密和解密过程中共享参数。由于其可逆架构和无参考设计，FlowCrypt确保了高保真度的图像恢复。大量实验证明，FlowCrypt在三个数据集上实现了100dB的恢复质量，产生均匀分布的密文图像，并保持了只有1M参数的紧凑架构，适用于移动和边缘设备应用。

更新时间: 2025-09-23 06:25:35

领域: cs.CR

下载: http://arxiv.org/abs/2509.18696v1

Symbolic Feedforward Networks for Probabilistic Finite Automata: Exact Simulation and Learnability

We present a formal and constructive theory showing that probabilistic finite automata (PFAs) can be exactly simulated using symbolic feedforward neural networks. Our architecture represents state distributions as vectors and transitions as stochastic matrices, enabling probabilistic state propagation via matrix-vector products. This yields a parallel, interpretable, and differentiable simulation of PFA dynamics using soft updates-without recurrence. We formally characterize probabilistic subset construction, $\varepsilon$-closure, and exact simulation via layered symbolic computation, and prove equivalence between PFAs and specific classes of neural networks. We further show that these symbolic simulators are not only expressive but learnable: trained with standard gradient descent-based optimization on labeled sequence data, they recover the exact behavior of ground-truth PFAs. This learnability, formalized in Proposition 5.1, is the crux of this work. Our results unify probabilistic automata theory with neural architectures under a rigorous algebraic framework, bridging the gap between symbolic computation and deep learning.

Updated: 2025-09-23 06:22:48

标题: 概率有限自动机的符号前馈网络：精确模拟和可学习性

摘要: 我们提出了一种正式且具有建设性的理论，表明概率有限自动机（PFAs）可以通过符号前馈神经网络精确模拟。我们的架构将状态分布表示为向量，将转移表示为随机矩阵，通过矩阵-向量乘积实现概率状态传播。这产生了一种并行、可解释且可微分的PFA动态模拟，使用软更新而无需递归。我们通过分层符号计算正式表征了概率子集构造、ε-闭包和精确模拟，并证明了PFAs与特定类别的神经网络之间的等价性。我们进一步展示了这些符号模拟器不仅具有表达能力，而且可学习：在标记的序列数据上使用标准梯度下降优化训练时，它们可以恢复地面真相PFAs的精确行为。这种可学习性在命题5.1中形式化，在这项工作中是关键。我们的结果通过严格的代数框架将概率自动机理论与神经结构统一起来，弥合了符号计算和深度学习之间的差距。

更新时间: 2025-09-23 06:22:48

领域: cs.LG

下载: http://arxiv.org/abs/2509.10034v2

Online Regularized Statistical Learning in Reproducing Kernel Hilbert Space With Non-Stationary Data

We study the convergence of recursive regularized learning algorithms in the reproducing kernel Hilbert space (RKHS) with dependent and non-stationary online data streams. Firstly, we introduce the concept of random Tikhonov regularization path and decompose the tracking error of the algorithm's output for the regularization path into random difference equations in RKHS, whose non-homogeneous terms are martingale difference sequences. Investigating the mean square asymptotic stability of the equations, we show that if the regularization path is slowly time-varying, then the algorithm's output achieves mean square consistency with the regularization path. Leveraging operator theory, particularly the monotonicity of the inverses of operators and the spectral decomposition of compact operators, we introduce the RKHS persistence of excitation condition (i.e. there exists a fixed-length time period, such that the conditional expectation of the operators induced by the input data accumulated over every period has a uniformly strictly positive compact lower bound) and develop a dominated convergence method to prove the mean square consistency between the algorithm's output and an unknown function. Finally, for independent and non-identically distributed data streams, the algorithm achieves the mean square consistency if the input data's marginal probability measures are slowly time-varying and the average measure over each fixed-length time period has a uniformly strictly positive lower bound.

Updated: 2025-09-23 06:21:41

标题: 在线正则化统计学习在再生核希尔伯特空间中对非平稳数据的应用

摘要: 我们研究了在具有依赖性和非平稳在线数据流的再现核希尔伯特空间（RKHS）中递归正则化学习算法的收敛性。首先，我们引入了随机Tikhonov正则化路径的概念，并将算法输出的跟踪误差分解为RKHS中的随机差分方程，其非齐次项为鞅差分序列。通过研究方程的均方渐近稳定性，我们表明如果正则化路径变化缓慢，则算法的输出将与正则化路径实现均方一致性。利用算子理论，特别是算子的逆的单调性和紧算子的谱分解，我们引入RKHS激励持续性条件（即存在一个固定长度的时间段，使得累积在每个时间段内由输入数据引发的算子的条件期望具有一致严格正的紧下界），并开发了一种支配收敛方法来证明算法输出与未知函数之间的均方一致性。最后，对于独立且非同分布的数据流，如果输入数据的边际概率测度变化缓慢且每个固定长度时间段上的平均测度具有一致严格正的下界，则算法可以实现均方一致性。

更新时间: 2025-09-23 06:21:41

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2404.03211v5

An overview of neural architectures for self-supervised audio representation learning from masked spectrograms

In recent years, self-supervised learning has amassed significant interest for training deep neural representations without labeled data. One such self-supervised learning approach is masked spectrogram modeling, where the objective is to learn semantically rich contextual representations by predicting removed or hidden portions of the input audio spectrogram. With the Transformer neural architecture at its core, masked spectrogram modeling has emerged as the prominent approach for learning general purpose audio representations, a.k.a. audio foundation models. Meanwhile, addressing the issues of the Transformer architecture, in particular the underlying Scaled Dot-product Attention operation, which scales quadratically with input sequence length, has led to renewed interest in recurrent sequence modeling approaches. Among them, Selective structured state space models (such as Mamba) and extended Long Short-Term Memory (xLSTM) are the two most promising approaches which have experienced widespread adoption. While the body of work on these two topics continues to grow, there is currently a lack of an adequate overview encompassing the intersection of these topics. In this paper, we present a comprehensive overview of the aforementioned research domains, covering masked spectrogram modeling and the previously mentioned neural sequence modeling architectures, Mamba and xLSTM. Further, we compare Transformers, Mamba and xLSTM based masked spectrogram models in a unified, reproducible framework on ten diverse downstream audio classification tasks, which will help interested readers to make informed decisions regarding suitability of the evaluated approaches to adjacent applications.

Updated: 2025-09-23 06:20:41

标题: 自监督音频表示学习中神经结构概述：基于掩蔽频谱图的方法

摘要: 近年来，自监督学习在训练深度神经表示时积累了显著的兴趣，而无需标记数据。其中一种自监督学习方法是掩码频谱建模，其目标是通过预测输入音频频谱的移除或隐藏部分来学习语义丰富的上下文表示。以Transformer神经架构为核心，掩码频谱建模已成为学习通用音频表示（即音频基础模型）的主要方法。与此同时，解决Transformer架构的问题，特别是基础的缩放点积注意力操作，其随输入序列长度呈二次扩展，已引起对循环序列建模方法的重新关注。在这些方法中，选择性结构状态空间模型（例如Mamba）和扩展的长短期记忆（xLSTM）是两种最有前景的方法，并得到了广泛采用。尽管关于这两个主题的研究工作不断增加，但目前缺乏一个涵盖这些主题交叉领域的充分概述。本文提供了上述研究领域的全面概述，涵盖了掩码频谱建模和先前提到的神经序列建模架构Mamba和xLSTM。此外，我们在十项不同的下游音频分类任务上，比较了基于Transformer、Mamba和xLSTM的掩码频谱模型，这将有助于感兴趣的读者对评估方法的适用性做出明智的决策。

更新时间: 2025-09-23 06:20:41

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2509.18691v1

Advances in Large Language Models for Medicine

Artificial intelligence (AI) technology has advanced rapidly in recent years, with large language models (LLMs) emerging as a significant breakthrough. LLMs are increasingly making an impact across various industries, with the medical field standing out as the most prominent application area. This paper systematically reviews the up-to-date research progress of LLMs in the medical field, providing an in-depth analysis of training techniques for large medical models, their adaptation in healthcare settings, related applications, as well as their strengths and limitations. Furthermore, it innovatively categorizes medical LLMs into three distinct types based on their training methodologies and classifies their evaluation approaches into two categories. Finally, the study proposes solutions to existing challenges and outlines future research directions based on identified issues in the field of medical LLMs. By systematically reviewing previous and advanced research findings, we aim to highlight the necessity of developing medical LLMs, provide a deeper understanding of their current state of development, and offer clear guidance for subsequent research.

Updated: 2025-09-23 06:16:39

标题: 医学领域大型语言模型的进展

摘要: 人工智能（AI）技术近年来迅速发展，大型语言模型（LLMs）作为一项重大突破已经出现。LLMs在各行业越来越产生影响，医疗领域是最突出的应用领域。本文系统地审查了LLMs在医疗领域的最新研究进展，深入分析了大型医疗模型的训练技术、它们在医疗环境中的适应性、相关应用，以及它们的优势和局限性。此外，本研究根据它们的训练方法将医疗LLMs划分为三种不同类型，并将它们的评估方法分类为两种类别。最后，该研究针对现有挑战提出解决方案，并根据医疗LLMs领域中确定的问题勾勒出未来研究方向。通过系统地审查先前和先进的研究成果，我们旨在强调发展医疗LLMs的必要性，深入了解它们目前的发展状况，并为后续研究提供明确的指导。

更新时间: 2025-09-23 06:16:39

领域: cs.AI

下载: http://arxiv.org/abs/2509.18690v1

Query-Centric Diffusion Policy for Generalizable Robotic Assembly

The robotic assembly task poses a key challenge in building generalist robots due to the intrinsic complexity of part interactions and the sensitivity to noise perturbations in contact-rich settings. The assembly agent is typically designed in a hierarchical manner: high-level multi-part reasoning and low-level precise control. However, implementing such a hierarchical policy is challenging in practice due to the mismatch between high-level skill queries and low-level execution. To address this, we propose the Query-centric Diffusion Policy (QDP), a hierarchical framework that bridges high-level planning and low-level control by utilizing queries comprising objects, contact points, and skill information. QDP introduces a query-centric mechanism that identifies task-relevant components and uses them to guide low-level policies, leveraging point cloud observations to improve the policy's robustness. We conduct comprehensive experiments on the FurnitureBench in both simulation and real-world settings, demonstrating improved performance in skill precision and long-horizon success rate. In the challenging insertion and screwing tasks, QDP improves the skill-wise success rate by over 50% compared to baselines without structured queries.

Updated: 2025-09-23 06:10:46

标题: 查询中心扩散策略用于通用机器人装配

摘要: 机器人装配任务对于建造通用机器人提出了关键挑战，这是因为零件相互作用的固有复杂性和对接触丰富环境中的噪声扰动的敏感性。装配代理通常以分层方式设计：高级多部件推理和低级精确控制。然而，在实践中实现这种分层策略具有挑战性，因为高级技能查询与低级执行之间存在不匹配。为了解决这个问题，我们提出了Query-centric Diffusion Policy（QDP），这是一个层次结构框架，通过利用包括对象、接触点和技能信息的查询来连接高级规划和低级控制。QDP引入了一个以查询为中心的机制，识别任务相关组件并利用它们来指导低级策略，利用点云观察结果来提高策略的鲁棒性。我们在模拟和真实环境中对FurnitureBench进行了全面实验，展示了技能精度和长期成功率的改进。在具有挑战性的插入和拧紧任务中，与没有结构化查询的基线相比，QDP将技能成功率提高了超过50%。

更新时间: 2025-09-23 06:10:46

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2509.18686v1

Can LLMs Explain Themselves Counterfactually?

Explanations are an important tool for gaining insights into the behavior of ML models, calibrating user trust and ensuring regulatory compliance. Past few years have seen a flurry of post-hoc methods for generating model explanations, many of which involve computing model gradients or solving specially designed optimization problems. However, owing to the remarkable reasoning abilities of Large Language Model (LLMs), self-explanation, that is, prompting the model to explain its outputs has recently emerged as a new paradigm. In this work, we study a specific type of self-explanations, self-generated counterfactual explanations (SCEs). We design tests for measuring the efficacy of LLMs in generating SCEs. Analysis over various LLM families, model sizes, temperature settings, and datasets reveals that LLMs sometimes struggle to generate SCEs. Even when they do, their prediction often does not agree with their own counterfactual reasoning.

Updated: 2025-09-23 06:10:32

标题: LLMs能否用反事实方式解释自己？

摘要: 解释是获取对ML模型行为洞察的重要工具，调整用户信任并确保合规性。过去几年出现了一大批用于生成模型解释的事后方法，其中许多涉及计算模型梯度或解决专门设计的优化问题。然而，由于大型语言模型（LLMs）具有出色的推理能力，自我解释，即提示模型解释其输出，最近被提出作为一种新范式。在这项工作中，我们研究了一种特定类型的自我解释，即自动生成的反事实解释（SCEs）。我们设计了用于衡量LLMs在生成SCEs方面有效性的测试。对各种LLM系列、模型大小、温度设置和数据集进行分析表明，LLMs有时很难生成SCEs。即使他们这样做了，他们的预测常常与他们自己的反事实推理不一致。

更新时间: 2025-09-23 06:10:32

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2502.18156v2

LEAF-Mamba: Local Emphatic and Adaptive Fusion State Space Model for RGB-D Salient Object Detection

RGB-D salient object detection (SOD) aims to identify the most conspicuous objects in a scene with the incorporation of depth cues. Existing methods mainly rely on CNNs, limited by the local receptive fields, or Vision Transformers that suffer from the cost of quadratic complexity, posing a challenge in balancing performance and computational efficiency. Recently, state space models (SSM), Mamba, have shown great potential for modeling long-range dependency with linear complexity. However, directly applying SSM to RGB-D SOD may lead to deficient local semantics as well as the inadequate cross-modality fusion. To address these issues, we propose a Local Emphatic and Adaptive Fusion state space model (LEAF-Mamba) that contains two novel components: 1) a local emphatic state space module (LE-SSM) to capture multi-scale local dependencies for both modalities. 2) an SSM-based adaptive fusion module (AFM) for complementary cross-modality interaction and reliable cross-modality integration. Extensive experiments demonstrate that the LEAF-Mamba consistently outperforms 16 state-of-the-art RGB-D SOD methods in both efficacy and efficiency. Moreover, our method can achieve excellent performance on the RGB-T SOD task, proving a powerful generalization ability.

Updated: 2025-09-23 06:08:17

标题: LEAF-Mamba：用于RGB-D显著目标检测的本地强调和自适应融合状态空间模型

摘要: RGB-D显著对象检测（SOD）旨在识别场景中最显著的物体，并结合深度线索。现有方法主要依赖于CNNs，受限于局部感受野，或者依赖于受到二次复杂性成本的Vision Transformers，这给平衡性能和计算效率带来了挑战。最近，状态空间模型（SSM）Mamba展现出了在线性复杂度下建模长距离依赖性的巨大潜力。然而，直接将SSM应用于RGB-D SOD可能会导致局部语义不足以及跨模态融合不足。为解决这些问题，我们提出了一个包含两个新颖组件的局部强调和自适应融合状态空间模型（LEAF-Mamba）：1）一个局部强调状态空间模块（LE-SSM）用于捕捉两种模态的多尺度局部依赖性。2）一个基于SSM的自适应融合模块（AFM）用于互补跨模态交互和可靠的跨模态整合。大量实验证明，LEAF-Mamba在效力和效率方面始终优于16种最先进的RGB-D SOD方法。此外，我们的方法可以在RGB-T SOD任务上取得出色表现，证明了其强大的泛化能力。

更新时间: 2025-09-23 06:08:17

领域: cs.CV,cs.AI,cs.MM

下载: http://arxiv.org/abs/2509.18683v1

Implementation of airborne ML models with semantics preservation

Machine Learning (ML) may offer new capabilities in airborne systems. However, as any piece of airborne systems, ML-based systems will be required to guarantee their safe operation. Thus, their development will have to be demonstrated to be compliant with the adequate guidance. So far, the European Union Aviation Safety Agency (EASA) has published a concept paper and an EUROCAE/SAE group is preparing ED-324. Both approaches delineate high-level objectives to confirm the ML model achieves its intended function and maintains training performance in the target environment. The paper aims to clarify the difference between an ML model and its corresponding unambiguous description, referred to as the Machine Learning Model Description (MLMD). It then refines the essential notion of semantics preservation to ensure the accurate replication of the model. We apply our contributions to several industrial use cases to build and compare several target models.

Updated: 2025-09-23 06:01:52

标题: 使用语义保留实现空中机载机器学习模型

摘要: 机器学习（ML）可能为空中系统提供新的能力。然而，与任何一种空中系统一样，基于ML的系统将需要确保其安全运行。因此，它们的开发将必须被证明符合适当的指导。迄今为止，欧洲联盟航空安全局（EASA）已经发布了一个概念文件，欧洲协会/美国标准化协会集团正在准备ED-324。这两种方法都勾勒了高层目标，以确认ML模型实现其预期功能并在目标环境中保持训练性能。本文旨在澄清ML模型及其相应的明确描述之间的区别，称为机器学习模型描述（MLMD）。然后，它进一步细化了语义保持的基本概念，以确保模型的准确复制。我们将我们的贡献应用于几个工业用例，以构建和比较几个目标模型。

更新时间: 2025-09-23 06:01:52

领域: cs.AI

下载: http://arxiv.org/abs/2509.18681v1

EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on teacher models for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens CoT steps to expand the space in a controlled way. The framework enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.

Updated: 2025-09-23 05:59:37

标题: EvoCoT：克服强化学习中的探索瓶颈

摘要: 具有可验证奖励的强化学习（RLVR）已成为后训练大型语言模型（LLMs）以提高其推理能力的一种有前途的范例。然而，在困难问题上回滚准确性较低时，奖励变得稀疏，限制了学习效率并导致探索瓶颈。现有方法要么依赖于教师模型进行蒸馏，要么过滤掉困难问题，这限制了可扩展性或通过探索限制了推理改进。我们提出了EvoCoT，这是一种基于两阶段思维链（CoT）推理优化的自我演变课程学习框架。 EvoCoT通过自动生成和验证CoT轨迹来限制探索空间，然后逐渐缩短CoT步骤，以受控的方式扩展空间。该框架使LLMs能够稳定地从最初未解决的困难问题中学习稀疏奖励。我们将EvoCoT应用于多个LLM系列，包括Qwen，DeepSeek和Llama。实验表明，EvoCoT使LLMs能够解决以前未解决的问题，提高推理能力而无需外部CoT监督，并与各种RL微调方法兼容。我们发布源代码以支持未来研究。

更新时间: 2025-09-23 05:59:37

领域: cs.LG

下载: http://arxiv.org/abs/2508.07809v4

Blueprints of Trust: AI System Cards for End to End Transparency and Governance

This paper introduces the Hazard-Aware System Card (HASC), a novel framework designed to enhance transparency and accountability in the development and deployment of AI systems. The HASC builds upon existing model card and system card concepts by integrating a comprehensive, dynamic record of an AI system's security and safety posture. The framework proposes a standardized system of identifiers, including a novel AI Safety Hazard (ASH) ID, to complement existing security identifiers like CVEs, allowing for clear and consistent communication of fixed flaws. By providing a single, accessible source of truth, the HASC empowers developers and stakeholders to make more informed decisions about AI system safety throughout its lifecycle. Ultimately, we also compare our proposed AI system cards with the ISO/IEC 42001:2023 standard and discuss how they can be used to complement each other, providing greater transparency and accountability for AI systems.

Updated: 2025-09-23 05:58:32

标题: 信任的蓝图：端到端透明度和治理的AI系统卡片

摘要: 本文介绍了Hazard-Aware System Card (HASC)，这是一个旨在增强人工智能系统开发和部署透明度和问责性的新框架。HASC在现有的模型卡和系统卡概念基础上构建，通过整合人工智能系统安全和安全状况的全面、动态记录。该框架提出了一套标准化的标识符系统，包括一个新颖的AI安全危险 (ASH) ID，以补充现有的安全标识符如CVE，从而实现对已修复缺陷的清晰和一致的沟通。通过提供单一、易获取的真相来源，HASC使开发人员和利益相关者能够在整个人工智能系统生命周期中做出更明智的安全决策。最终，我们还将我们提出的人工智能系统卡与ISO/IEC 42001:2023标准进行比较，并讨论它们如何相互补充，为人工智能系统提供更大的透明度和问责性。

更新时间: 2025-09-23 05:58:32

领域: cs.CY,cs.AI,cs.CL,cs.CR

下载: http://arxiv.org/abs/2509.20394v1

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling. Furthermore, Athena-PRM sets the state-of-the-art (SoTA) results in VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score, showcasing its robust capability to accurately assess the correctness of the reasoning step. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning and outperforms baseline with a significant margin on five benchmarks.

Updated: 2025-09-23 05:57:54

标题: 雅典娜：利用数据高效过程奖励模型增强多模态推理

摘要: 我们提出Athena-PRM，这是一个多模态过程奖励模型（PRM），旨在评估解决复杂推理问题中每个步骤的奖励得分。开发高性能的PRM通常需要大量的时间和财务投入，主要是因为需要对推理步骤进行步级注释。传统的自动标记方法，如蒙特卡洛估计，通常会产生嘈杂的标签并造成大量的计算成本。为了有效生成高质量的过程标记数据，我们提出利用弱和强完整者之间的预测一致性作为识别可靠过程标签的标准。值得注意的是，Athena-PRM在各种场景和基准测试中展现出出色的效果，仅需5,000个样本。此外，我们还开发了两种有效的策略来提高PRM的性能：ORM初始化和对负面数据进行上采样。我们在三个具体场景中验证了我们的方法：验证测试时间缩放，直接评估推理步骤的正确性以及奖励排序微调。我们的Athena-PRM在多个基准测试和场景中始终表现出优越的性能。值得注意的是，当将Qwen2.5-VL-7B作为策略模型时，Athena-PRM在WeMath上的性能提高了10.2分，在MathVista上提高了7.1分。此外，Athena-PRM在VisualProcessBench上取得了最新技术（SoTA）结果，并且比以前的SoTA高出3.9个F1分数，展示了其准确评估推理步骤正确性的强大能力。此外，利用Athena-PRM作为奖励模型，我们开发了带有奖励排序微调的Athena-7B，在五个基准测试上明显优于基线。

更新时间: 2025-09-23 05:57:54

领域: cs.LG,cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2506.09532v2

Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing

Deploying deep neural networks (DNNs) on resource-constrained mobile devices presents significant challenges, particularly in achieving real-time performance while simultaneously coping with limited computational resources and battery life. While Mobile Edge Computing (MEC) offers collaborative inference with GPU servers as a promising solution, existing approaches primarily rely on layer-wise model partitioning and undergo significant transmission bottlenecks caused by the sequential execution of DNN operations. To address this challenge, we present Intra-DP, a high-performance collaborative inference system optimized for DNN inference on MEC. Intra DP employs a novel parallel computing technique based on local operators (i.e., operators whose minimum unit input is not the entire input tensor, such as the convolution kernel). By decomposing their computations (operations) into several independent sub-operations and overlapping the computation and transmission of different sub-operations through parallel execution, Intra-DP mitigates transmission bottlenecks in MEC, achieving fast and energy-efficient inference. The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines, without sacrificing accuracy.

Updated: 2025-09-23 05:51:13

标题: Intra-DP：一种用于移动边缘计算的高性能协作推理系统

摘要: 在资源受限的移动设备上部署深度神经网络（DNNs）面临着重大挑战，特别是在实现实时性能的同时，同时应对有限的计算资源和电池寿命。虽然移动边缘计算（MEC）提供了与GPU服务器协同推断的有前途的解决方案，但现有方法主要依赖于基于层次模型分区，并由于DNN操作的顺序执行而遭受重大传输瓶颈的影响。为了解决这一挑战，我们提出了Intra-DP，这是一种针对MEC上的DNN推断进行优化的高性能协同推断系统。Intra DP采用了一种基于本地运算符（即最小单位输入不是整个输入张量的运算符，例如卷积核）的新颖并行计算技术。通过将计算（操作）分解为几个独立的子操作，并通过并行执行重叠不同子操作的计算和传输，Intra-DP缓解了MEC中的传输瓶颈，实现了快速和高效能的推断。评估表明，与最先进的基线相比，Intra-DP将推断延迟降低了高达50％，能耗降低了高达75％，而不会牺牲准确性。

更新时间: 2025-09-23 05:51:13

领域: cs.NI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05829v2

Scalable bayesian shadow tomography for quantum property estimation with set transformers

A scalable Bayesian machine learning framework is introduced for estimating scalar properties of an unknown quantum state from measurement data, which bypasses full density matrix reconstruction. This work is the first to integrate the classical shadows protocol with a permutation-invariant set transformer architecture, enabling the approach to predict and correct bias in existing estimators to approximate the true Bayesian posterior mean. Measurement outcomes are encoded as fixed-dimensional feature vectors, and the network outputs a residual correction to a baseline estimator. Scalability to large quantum systems is ensured by the polynomial dependence of input size on system size and number of measurements. On Greenberger-Horne-Zeilinger state fidelity and second-order R\'enyi entropy estimation tasks -- using random Pauli and random Clifford measurements -- this Bayesian estimator always achieves lower mean squared error than classical shadows alone, with more than a 99\% reduction in the few copy regime.

Updated: 2025-09-23 05:46:26

标题: 可伸缩的贝叶斯阴影层析成像在集合变换器中用于量子属性估计

摘要: 引入了一种可扩展的贝叶斯机器学习框架，用于从测量数据中估计未知量子态的标量属性，绕过了完整密度矩阵重建。这项工作首次将经典阴影协议与置换不变的集合变换器架构整合在一起，使该方法能够预测和校正现有估计器中的偏差，以逼近真实的贝叶斯后验均值。测量结果被编码为固定维度的特征向量，网络输出一个对基线估计器的残差校正。通过输入大小对系统大小和测量次数的多项式依赖，确保了对大型量子系统的可扩展性。在Greenberger-Horne-Zeilinger态保真度和二阶Rényi熵估计任务中，使用随机Pauli和随机Clifford测量，这种贝叶斯估计器总是比单纯的经典阴影具有更低的均方误差，且在少量拷贝的情况下减少超过99%。

更新时间: 2025-09-23 05:46:26

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2509.18674v1

NaviSense: A Multimodal Assistive Mobile application for Object Retrieval by Persons with Visual Impairment

People with visual impairments often face significant challenges in locating and retrieving objects in their surroundings. Existing assistive technologies present a trade-off: systems that offer precise guidance typically require pre-scanning or support only fixed object categories, while those with open-world object recognition lack spatial feedback for reaching the object. To address this gap, we introduce 'NaviSense', a mobile assistive system that combines conversational AI, vision-language models, augmented reality (AR), and LiDAR to support open-world object detection with real-time audio-haptic guidance. Users specify objects via natural language and receive continuous spatial feedback to navigate toward the target without needing prior setup. Designed with insights from a formative study and evaluated with 12 blind and low-vision participants, NaviSense significantly reduced object retrieval time and was preferred over existing tools, demonstrating the value of integrating open-world perception with precise, accessible guidance.

Updated: 2025-09-23 05:45:11

标题: NaviSense：一种多模式辅助移动应用程序，用于视力受损者的物体检索

摘要: 盲人常常在定位和寻找周围物体时面临重大挑战。现有的辅助技术存在一个权衡：提供精确指导的系统通常需要预先扫描或仅支持固定对象类别，而具有开放世界物体识别的系统缺乏到达物体的空间反馈。为了解决这一差距，我们引入了“NaviSense”，一个移动辅助系统，结合了会话式人工智能、视觉语言模型、增强现实（AR）和激光雷达（LiDAR），以支持实时音频触觉引导的开放世界物体检测。用户通过自然语言指定物体，并接收持续的空间反馈，以便在不需要先前设置的情况下向目标物体导航。通过形成研究的见解设计并与12名盲人和低视力参与者进行评估，NaviSense显著缩短了物体检索时间，并被认为优于现有工具，展示了将开放世界感知与精确、易接近的指导相结合的价值。

更新时间: 2025-09-23 05:45:11

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2509.18672v1

ABG-NAS: Adaptive Bayesian Genetic Neural Architecture Search for Graph Representation Learning

Effective and efficient graph representation learning is essential for enabling critical downstream tasks, such as node classification, link prediction, and subgraph search. However, existing graph neural network (GNN) architectures often struggle to adapt to diverse and complex graph structures, limiting their ability to produce structure-aware and task-discriminative representations. To address this challenge, we propose ABG-NAS, a novel framework for automated graph neural network architecture search tailored for efficient graph representation learning. ABG-NAS encompasses three key components: a Comprehensive Architecture Search Space (CASS), an Adaptive Genetic Optimization Strategy (AGOS), and a Bayesian-Guided Tuning Module (BGTM). CASS systematically explores diverse propagation (P) and transformation (T) operations, enabling the discovery of GNN architectures capable of capturing intricate graph characteristics. AGOS dynamically balances exploration and exploitation, ensuring search efficiency and preserving solution diversity. BGTM further optimizes hyperparameters periodically, enhancing the scalability and robustness of the resulting architectures. Empirical evaluations on benchmark datasets (Cora, PubMed, Citeseer, and CoraFull) demonstrate that ABG-NAS consistently outperforms both manually designed GNNs and state-of-the-art neural architecture search (NAS) methods. These results highlight the potential of ABG-NAS to advance graph representation learning by providing scalable and adaptive solutions for diverse graph structures. Our code is publicly available at https://github.com/sserranw/ABG-NAS.

Updated: 2025-09-23 05:39:38

标题: ABG-NAS：用于图表示学习的自适应贝叶斯遗传神经架构搜索

摘要: 有效和高效的图表示学习对于实现关键的下游任务，如节点分类、链路预测和子图搜索至关重要。然而，现有的图神经网络（GNN）结构经常难以适应多样化和复杂的图结构，限制了它们产生结构感知和任务区分性表示的能力。为了解决这一挑战，我们提出了ABG-NAS，一种专门针对高效图表示学习的自动化图神经网络架构搜索框架。ABG-NAS包括三个关键组件：全面架构搜索空间（CASS）、自适应遗传优化策略（AGOS）和贝叶斯引导调整模块（BGTM）。CASS系统地探索多样化的传播（P）和转换（T）操作，使得能够发现能够捕捉复杂图特征的GNN架构。AGOS动态平衡了探索和开发，确保搜索效率并保持解决方案多样性。BGTM定期优化超参数，增强了所得架构的可扩展性和稳健性。在基准数据集（Cora、PubMed、Citeseer和CoraFull）上的实证评估表明，ABG-NAS始终优于手动设计的GNN和最先进的神经架构搜索（NAS）方法。这些结果突显了ABG-NAS推动图表示学习进步的潜力，提供了适用于多样化图结构的可扩展和自适应解决方案。我们的代码可以在https://github.com/sserranw/ABG-NAS 公开获取。

更新时间: 2025-09-23 05:39:38

领域: cs.LG,cs.NE

下载: http://arxiv.org/abs/2504.21254v3

TERAG: Token-Efficient Graph-Based Retrieval-Augmented Generation

Graph-based Retrieval-augmented generation (RAG) has become a widely studied approach for improving the reasoning, accuracy, and factuality of Large Language Models. However, many existing graph-based RAG systems overlook the high cost associated with LLM token usage during graph construction, hindering large-scale adoption. To address this, we propose TERAG, a simple yet effective framework designed to build informative graphs at a significantly lower cost. Inspired by HippoRAG, we incorporate Personalized PageRank (PPR) during the retrieval phase, and we achieve at least 80% of the accuracy of widely used graph-based RAG methods while consuming only 3%-11% of the output tokens.

Updated: 2025-09-23 05:34:34

标题: TERAG: 令牌高效的基于图的检索增强生成

摘要: 基于图的检索增强生成（RAG）已成为改善大型语言模型推理、准确性和事实性的广泛研究方法。然而，许多现有的基于图的RAG系统忽视了在图构建过程中与LLM令牌使用相关的高成本，阻碍了大规模采用。为解决这一问题，我们提出了TERAG，这是一个简单而有效的框架，旨在以显著较低的成本构建信息图。受HippoRAG启发，我们在检索阶段引入了个性化PageRank（PPR），并且在消耗仅为输出令牌的3%-11%的情况下，实现了至少80%的常用基于图的RAG方法的准确性。

更新时间: 2025-09-23 05:34:34

领域: cs.AI

下载: http://arxiv.org/abs/2509.18667v1

Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games

Potential games are arguably one of the most important and widely studied classes of normal form games. They define the archetypal setting of multi-agent coordination as all agent utilities are perfectly aligned with each other via a common potential function. Can this intuitive framework be transplanted in the setting of Markov Games? What are the similarities and differences between multi-agent coordination with and without state dependence? We present a novel definition of Markov Potential Games (MPG) that generalizes prior attempts at capturing complex stateful multi-agent coordination. Counter-intuitively, insights from normal-form potential games do not carry over as MPGs can consist of settings where state-games can be zero-sum games. In the opposite direction, Markov games where every state-game is a potential game are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties such as the existence of deterministic Nash policies. In our main technical result, we prove fast convergence of independent policy gradient to Nash policies by adapting recent gradient dominance property arguments developed for single agent MDPs to multi-agent learning settings.

Updated: 2025-09-23 05:31:04

标题: 马尔可夫潜在博弈中多智能体策略梯度的全局收敛性

摘要: 潜力博弈可以说是最重要和广泛研究的正态形式博弈类之一。它们通过一个共同的潜力函数将多智能体效用完美地对齐，定义了多智能体协调的原型设置。这种直观框架能否在马尔可夫博弈的设置中移植？在具有和不具有状态依赖性的多智能体协调之间有哪些相似之处和不同之处？我们提出了马尔可夫潜力博弈（MPG）的新定义，该定义概括了以前尝试捕捉复杂状态多智能体协调的努力。令人感到反直觉的是，正态形式潜力博弈的见解并不适用于MPG，因为MPG可能由状态博弈为零和博弈的设置组成。相反，每个状态博弈都是潜力博弈的马尔可夫博弈不一定是MPG。尽管如此，MPG展示了诸如存在确定性纳什策略等标准的理想性质。在我们的主要技术结果中，我们证明了独立策略梯度快速收敛到纳什策略，通过将最近为单智能体MDP开发的梯度优势性质论证调整到多智能体学习环境中。

更新时间: 2025-09-23 05:31:04

领域: cs.LG,cs.GT

下载: http://arxiv.org/abs/2106.01969v4

Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

Speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small draft model to propose draft tokens in advance, and validating them in parallel with the large target model. However, the existing SD methods still remain constrained by their serialized execution, which causes the mutual waiting bubbles between the draft and target models. To address this challenge, we draw inspiration from branch prediction in modern processors and propose a novel framework \textbf{SpecBranch} to unlock branch parallelism in SD. Specifically, we first take an in-depth analysis of the potential of branch parallelism in SD, and recognize that the key challenge lies in the trade-offs between parallelization and token rollback. Based on the analysis, we introduce parallel speculative branches to preemptively hedge against likely rejections. Meanwhile, to enhance parallelism, we jointly orchestrate adaptive draft lengths with a hybrid combination of the implicit draft model confidence and explicit reusing of target model features. Extensive experiments across various models and benchmarks show that SpecBranch achieves over \textbf{1.8}$\times \sim$ \textbf{4.5}$\times$ speedups against the auto-regressive decoding and reduces rollback tokens by $\textbf{50}$\% for poorly aligned models, while maintaining an identical sampling distribution.

Updated: 2025-09-23 05:29:24

标题: 通过混合起草和回滚感知分支并行性的猜测解码

摘要: Speculative decoding (SD)已经成为一种有前途的技术，通过使用一个小的草稿模型提前提出草稿标记，然后与大型目标模型并行验证，从而加速LLM推断。然而，现有的SD方法仍然受限于它们的串行执行，这导致了草稿和目标模型之间的相互等待泡泡。为了解决这一挑战，我们从现代处理器中的分支预测中汲取灵感，并提出了一个新颖的框架SpecBranch来解锁SD中的分支并行性。具体而言，我们首先深入分析了SD中分支并行性的潜力，并认识到关键挑战在于并行化和标记回退之间的权衡。基于这一分析，我们引入了并行的推测分支，以预先对可能的拒绝进行对冲。同时，为了增强并行性，我们通过自适应草稿长度和隐式草稿模型置信度的混合组合以及目标模型特征的显式重用来共同编排。对各种模型和基准的广泛实验表明，SpecBranch相对于自回归解码实现了超过1.8倍到4.5倍的加速，并且对于对齐较差的模型，减少了50%的回退标记，同时保持了相同的采样分布。

更新时间: 2025-09-23 05:29:24

领域: cs.DC,cs.AI

下载: http://arxiv.org/abs/2506.01979v2

Mitigating Strategy-Selection Bias in Reasoning for More Effective Test-Time Scaling

Test-time scaling (TTS) has been shown to improve the performance of large language models (LLMs) by sampling and aggregating diverse reasoning paths. However, existing research has overlooked a critical issue: selection bias of reasoning strategies during scaling. Specifically, when generating reasoning processes, LLMs tend to follow certain strategies (e.g., algebraic solutions for math problems) while neglecting other valid alternatives (e.g., geometric solutions), resulting in insufficient exploration of the solution space. To further understand the impact of this bias, we present a theoretical analysis that reveals when it undermines the effectiveness of test-time scaling. Motivated by this theoretical insight, we introduce TTS-Uniform, a framework designed to mitigate the selection bias of reasoning strategies. It (i) identifies potential strategies, (ii) uniformly allocates the sampling budget across them, and (iii) filters out unstable strategies prior to aggregation. Experimental results show that TTS-Uniform significantly enhances scaling effectiveness across multiple mainstream LLMs and benchmark datasets.

Updated: 2025-09-23 05:27:09

标题: 减轻推理中的策略选择偏差，以实现更有效的测试时间缩放

摘要: 测试时间缩放（TTS）已被证明可以通过对多样化推理路径进行采样和聚合来提高大型语言模型（LLMs）的性能。然而，现有研究忽视了一个关键问题：在缩放过程中推理策略的选择偏差。具体来说，当生成推理过程时，LLMs倾向于遵循某些策略（例如，对数问题的代数解）而忽视其他有效的替代方案（例如，几何解），导致解空间的探索不足。为了进一步了解这种偏差的影响，我们提出了一个理论分析，揭示了在何种情况下它会削弱测试时间缩放的有效性。受到这一理论洞察的启发，我们引入了TTS-Uniform，一个旨在减轻推理策略选择偏差的框架。它（i）识别潜在策略，（ii）均匀分配采样预算，（iii）在聚合之前过滤不稳定的策略。实验结果显示，TTS-Uniform显著提升了跨多个主流LLMs和基准数据集的缩放效果。

更新时间: 2025-09-23 05:27:09

领域: cs.AI

下载: http://arxiv.org/abs/2509.17905v2

Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark

Efficient inference is a critical challenge in deep generative modeling, particularly as diffusion models grow in capacity and complexity. While increased complexity often improves accuracy, it raises compute costs, latency, and memory requirements. This work investigates techniques such as pruning, quantization, knowledge distillation, and simplified attention to reduce computational overhead without impacting performance. The study also explores the Mixture of Experts (MoE) approach to further enhance efficiency. These experiments provide insights into optimizing inference for the state-of-the-art Fast Diffusion Transformer (fast-DiT) model.

Updated: 2025-09-23 05:20:05

标题: 优化基于Transformer模型的推理：一个多方法基准

摘要: 高效的推理是深度生成建模中的一个关键挑战，特别是当扩散模型的容量和复杂性增长时。尽管增加复杂性通常会提高准确性，但也会增加计算成本、延迟和内存需求。本文研究了修剪、量化、知识蒸馏和简化注意力等技术，以减少计算开销而不影响性能。该研究还探讨了专家混合(MoE)方法以进一步提高效率。这些实验为优化最先进的快速扩散变压器(fast-DiT)模型的推理提供了见解。

更新时间: 2025-09-23 05:20:05

领域: cs.LG,68T07,I.2.6; I.5.1

下载: http://arxiv.org/abs/2509.17894v2

Language Models Resist Alignment: Evidence From Data Compression

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning yield have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the $\mathbf{elasticity}$ of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we further reveal that elasticity positively correlates with the increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weight and code are available at pku-lm-resist-alignment.github.io.

Updated: 2025-09-23 05:17:12

标题: 语言模型抵制对齐：来自数据压缩的证据

摘要: 大型语言模型（LLMs）可能会表现出意外或不良行为。最近的研究集中在调整LLMs以减轻有害输出。尽管这些努力，一些异常表明，即使进行了良好的调整过程，也很容易被故意或意外地规避。调整微调是否对模型产生稳健影响，还是其影响仅仅是表面的？在这项工作中，我们从理论和经验两个角度首次探讨了这一现象。经验上，我们展示了后调整模型的弹性，即在进一步微调时倾向于恢复在预训练阶段形成的行为分布。利用压缩理论，我们正式推断出微调对齐相对于预训练会不成比例地削弱，潜在地可以达到数量级。我们通过对不同类型和规模的模型进行实验验证了弹性的存在。具体来说，我们发现在恢复到预训练分布之前，模型性能迅速下降，之后下降速度显著减少。此外，我们进一步揭示，弹性与增加的模型大小和扩展的预训练数据呈正相关。我们的发现强调了解决LLMs固有弹性的必要性，以减轻其对齐的阻力。模型权重和代码可在pku-lm-resist-alignment.github.io上找到。

更新时间: 2025-09-23 05:17:12

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2406.06144v6

Online Learning for Optimizing AoI-Energy Tradeoff under Unknown Channel Statistics

We consider a real-time monitoring system where a source node (with energy limitations) aims to keep the information status at a destination node as fresh as possible by scheduling status update transmissions over a set of channels. The freshness of information at the destination node is measured in terms of the Age of Information (AoI) metric. In this setting, a natural tradeoff exists between the transmission cost (or equivalently, energy consumption) of the source and the achievable AoI performance at the destination. This tradeoff has been optimized in the existing literature under the assumption of having a complete knowledge of the channel statistics. In this work, we develop online learning-based algorithms with finite-time guarantees that optimize this tradeoff in the practical scenario where the channel statistics are unknown to the scheduler. In particular, when the channel statistics are known, the optimal scheduling policy is first proven to have a threshold-based structure with respect to the value of AoI (i.e., it is optimal to drop updates when the AoI value is below some threshold). This key insight was then utilized to develop the proposed learning algorithms that surprisingly achieve an order-optimal regret (i.e., $O(1)$) with respect to the time horizon length.

Updated: 2025-09-23 05:15:36

标题: 未知信道统计数据下优化AoI-能量权衡的在线学习

摘要: 我们考虑一个实时监测系统，其中一个具有能量限制的源节点通过在一组信道上调度状态更新传输来尽可能保持目的节点的信息状态更新。目的节点的信息新鲜度以信息时代（AoI）指标来衡量。在这种情况下，传输成本（或者等效地，能量消耗）和目的地的可实现AoI性能之间存在自然的权衡。在现有文献中已经优化了这种权衡，假设具有对信道统计的完全了解。在这项工作中，我们开发了基于在线学习的算法，具有有限时间保证，优化了这个权衡，即在调度程序对信道统计不知情的实际情况下。特别是，当信道统计已知时，最优调度策略首先被证明具有基于AoI值的阈值结构（即，当AoI值低于某个阈值时最优地丢弃更新）。然后利用这一关键洞察来制定所提出的学习算法，令人惊讶地实现了与时间范围长度相关的阶数最优后悔（即$O(1)$）。

更新时间: 2025-09-23 05:15:36

领域: cs.NI,cs.IT,cs.LG,math.IT

下载: http://arxiv.org/abs/2509.18654v1

Subspace Clustering of Subspaces: Unifying Canonical Correlation Analysis and Subspace Clustering

We introduce a novel framework for clustering a collection of tall matrices based on their column spaces, a problem we term Subspace Clustering of Subspaces (SCoS). Unlike traditional subspace clustering methods that assume vectorized data, our formulation directly models each data sample as a matrix and clusters them according to their underlying subspaces. We establish conceptual links to Subspace Clustering and Generalized Canonical Correlation Analysis (GCCA), and clarify key differences that arise in this more general setting. Our approach is based on a Block Term Decomposition (BTD) of a third-order tensor constructed from the input matrices, enabling joint estimation of cluster memberships and partially shared subspaces. We provide the first identifiability results for this formulation and propose scalable optimization algorithms tailored to large datasets. Experiments on real-world hyperspectral imaging datasets demonstrate that our method achieves superior clustering accuracy and robustness, especially under high noise and interference, compared to existing subspace clustering techniques. These results highlight the potential of the proposed framework in challenging high-dimensional applications where structure exists beyond individual data vectors.

Updated: 2025-09-23 05:12:40

标题: 子空间聚类：统一规范相关分析和子空间聚类

摘要: 我们引入了一个新颖的框架，用于基于它们的列空间对一组高矩阵进行聚类，这个问题我们称之为子空间聚类（SCoS）。与传统的子空间聚类方法不同，传统方法假设数据向量化，我们的公式直接将每个数据样本建模为一个矩阵，并根据它们的潜在子空间对它们进行聚类。我们建立了与子空间聚类和广义典范相关分析（GCCA）的概念联系，并澄清了在这种更一般的情况下出现的关键差异。我们的方法基于从输入矩阵构造的三阶张量的块项分解（BTD），实现了对簇成员和部分共享子空间的联合估计。我们提供了这种公式的首个可辨识性结果，并提出了针对大型数据集量身定制的可扩展优化算法。对真实的高光谱成像数据集的实验表明，与现有的子空间聚类技术相比，我们的方法在高噪声和干扰下实现了更优越的聚类准确性和鲁棒性。这些结果突显了所提出的框架在挑战性高维应用中存在结构超越个别数据向量的潜力。

更新时间: 2025-09-23 05:12:40

领域: cs.LG,eess.SP

下载: http://arxiv.org/abs/2509.18653v1

Variational Bayes Gaussian Splatting

Recently, 3D Gaussian Splatting has emerged as a promising approach for modeling 3D scenes using mixtures of Gaussians. The predominant optimization method for these models relies on backpropagating gradients through a differentiable rendering pipeline, which struggles with catastrophic forgetting when dealing with continuous streams of data. To address this limitation, we propose Variational Bayes Gaussian Splatting (VBGS), a novel approach that frames training a Gaussian splat as variational inference over model parameters. By leveraging the conjugacy properties of multivariate Gaussians, we derive a closed-form variational update rule, allowing efficient updates from partial, sequential observations without the need for replay buffers. Our experiments show that VBGS not only matches state-of-the-art performance on static datasets, but also enables continual learning from sequentially streamed 2D and 3D data, drastically improving performance in this setting.

Updated: 2025-09-23 05:06:45

标题: 变分贝叶斯高斯飞溅

摘要: 最近，3D高斯喷射技术已经成为一种有前途的方法，用于利用高斯混合模型对3D场景进行建模。这些模型的主要优化方法依赖于通过可微渲染管线反向传播梯度，但在处理连续数据流时会出现灾难性遗忘。为了解决这一限制，我们提出了变分贝叶斯高斯喷射（VBGS），这是一种新颖的方法，将训练高斯喷射看作是对模型参数进行变分推理。通过利用多变量高斯分布的共轭性质，我们推导出一个闭合形式的变分更新规则，可以在不需要重放缓冲区的情况下从部分、顺序观测中进行高效更新。我们的实验证明，VBGS不仅在静态数据集上与最先进的性能相匹配，而且还能够从顺序流式传输的2D和3D数据中实现持续学习，在这种设置下大大提高性能。

更新时间: 2025-09-23 05:06:45

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2410.03592v2

Evaluating The Explainability of State-of-the-Art Deep Learning-based Network Intrusion Detection Systems

State-of-the-art deep learning (DL)-based network intrusion detection systems (NIDSs) offer limited "explainability". For example, how do they make their decisions? Do they suffer from hidden correlations? Prior works have applied eXplainable AI (XAI) techniques to ML-based security systems such as conventional ML classifiers trained on public network intrusion datasets, Android malware detection and malicious PDF file detection. However, those works have not evaluated XAI methods on state-of-the-art DL-based NIDSs and do not use latest XAI tools. In this work, we analyze state-of-the-art DL-based NIDS models using conventional as well as recently proposed XAI techniques through extensive experiments with different attack datasets. Furthermore, we introduce a criteria to evaluate the level of agreement between global- and local-level explanations generated for an NIDS. Using this criteria in addition to other security-focused criteria, we compare the explanations generated across XAI methods. The results show that: (1) the decisions of some DL-based NIDS models can be better explained than other models, (2) XAI explanations generated using different tools are in conflict for most of the NIDS models considered in this work and (3) there are significant differences between XAI methods in terms of some security-focused criteria. Based on our results, we make recommendations on how to achieve a balance between explainability and model detection performance.

Updated: 2025-09-23 05:03:24

标题: 评估最先进的基于深度学习的网络入侵检测系统的可解释性

摘要: 最先进的基于深度学习（DL）的网络入侵检测系统（NIDSs）提供有限的“可解释性”。例如，它们是如何做出决策的？它们是否受到隐藏的相关性的影响？先前的研究已经将可解释人工智能（XAI）技术应用于基于ML的安全系统，如传统的ML分类器在公共网络入侵数据集、Android恶意软件检测和恶意PDF文件检测方面进行了训练。然而，这些工作并未在最先进的基于DL的NIDSs上评估XAI方法，并且也没有使用最新的XAI工具。在这项工作中，我们通过对不同攻击数据集进行广泛实验，分析了最先进的基于DL的NIDS模型，使用传统以及最近提出的XAI技术。此外，我们引入了一个评估全局和局部级别解释之间协议的标准。通过在安全方面的其他标准之外使用这一标准，我们比较了XAI方法生成的解释。结果显示：（1）一些DL-based NIDS模型的决策可以比其他模型更好地解释，（2）在本研究中考虑的大多数NIDS模型中，使用不同工具生成的XAI解释存在冲突，（3）在某些安全方面的标准上，XAI方法之间存在显著差异。根据我们的结果，我们提出了如何在解释性和模型检测性能之间实现平衡的建议。

更新时间: 2025-09-23 05:03:24

领域: cs.CR

下载: http://arxiv.org/abs/2408.14040v4

SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer

Safety remains a major concern for deploying reinforcement learning (RL) in real-world applications. Simulators provide safe, scalable training environments, but the inevitable sim-to-real gap introduces additional safety concerns, as policies must satisfy constraints in real-world conditions that differ from simulation. To address this challenge, robust safe RL techniques offer principled methods, but are often incompatible with standard scalable training pipelines. In contrast, domain randomization, a simple and popular sim-to-real technique, stands out as a promising alternative, although it often results in unsafe behaviors in practice. We present SPiDR, short for Sim-to-real via Pessimistic Domain Randomization -- a scalable algorithm with provable guarantees for safe sim-to-real transfer. SPiDR uses domain randomization to incorporate the uncertainty about the sim-to-real gap into the safety constraints, making it versatile and highly compatible with existing training pipelines. Through extensive experiments on sim-to-sim benchmarks and two distinct real-world robotic platforms, we demonstrate that SPiDR effectively ensures safety despite the sim-to-real gap while maintaining strong performance.

Updated: 2025-09-23 05:03:00

标题: SPiDR：零样本安全的简单方法在仿真到现实的转移中

摘要: 安全性仍然是在现实世界应用中部署强化学习（RL）时的主要关注点。模拟器提供安全、可扩展的训练环境，但不可避免的模拟到现实的差距引入了额外的安全顾虑，因为策略必须满足与模拟不同的现实世界条件下的约束。为了解决这一挑战，鲁棒的安全RL技术提供了原则性方法，但往往与标准的可扩展训练流程不兼容。相比之下，域随机化是一种简单且流行的模拟到现实技术，它作为一种有前途的替代方案，尽管在实践中往往会导致不安全的行为。我们提出了SPiDR，即通过悲观域随机化实现模拟到现实转移的缩放算法，具有可证明的安全性保证。SPiDR利用域随机化将关于模拟到现实差距的不确定性纳入安全约束中，使其多功能且与现有的训练流程高度兼容。通过对模拟到模拟基准和两个不同的现实世界机器人平台进行的广泛实验，我们证明了SPiDR能够有效确保安全性，尽管存在模拟到现实的差距，同时保持强大的性能。

更新时间: 2025-09-23 05:03:00

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18648v1

Do You Need Proprioceptive States in Visuomotor Policies?

Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where both visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. On the contrary, we propose the State-free Policy, removing the proprioceptive state input and predicting actions only conditioned on visual observations. The State-free Policy is built in the relative end-effector action space, and should ensure the full task-relevant visual observations, here provided by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0\% to 85\% in height generalization and from 6\% to 64\% in horizontal generalization. Furthermore, they also show advantages in data efficiency and cross-embodiment adaptation, enhancing their practicality for real-world deployment.

Updated: 2025-09-23 04:56:59

标题: 你是否需要视觉运动政策中的本体感知状态？

摘要: 基于模仿学习的视觉动作策略在机器人操作中被广泛应用，通常同时采用视觉观测和本体感知状态进行精确控制。然而，在这项研究中，我们发现这种常见做法使策略过度依赖本体感知状态输入，导致对训练轨迹过拟合，并且导致空间泛化能力差。相反，我们提出了无状态策略，移除了本体感知状态输入，仅基于视觉观测预测动作。无状态策略建立在相对末端执行器动作空间中，并应确保完整的任务相关视觉观测，这里由双宽角腕部摄像头提供。实证结果表明，无状态策略在空间泛化方面比基于状态的策略明显更强：在现实任务中，如拾取和放置、具有挑战性的衬衣折叠和复杂的整体操作，跨多个机器人实体，高度泛化的平均成功率从0%提高到85%，水平泛化从6%提高到64%。此外，它们还表现出数据效率和跨实体适应性方面的优势，增强了它们在实际部署中的实用性。

更新时间: 2025-09-23 04:56:59

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18644v1

MediSyn: A Generalist Text-Guided Latent Diffusion Model For Diverse Medical Image Synthesis

Deep learning algorithms require extensive data to achieve robust performance. However, data availability is often restricted in the medical domain due to patient privacy concerns. Synthetic data presents a possible solution to these challenges. Recently, image generative models have found increasing use for medical applications but are often designed for singular medical specialties and imaging modalities, thus limiting their broader utility. To address this, we introduce MediSyn: a text-guided, latent diffusion model capable of generating synthetic images from 6 medical specialties and 10 image types. Through extensive experimentation, we first demonstrate that MediSyn quantitatively matches or surpasses the performance of specialist models. Second, we show that our synthetic images are realistic and exhibit strong alignment with their corresponding text prompts, as validated by a team of expert physicians. Third, we provide empirical evidence that our synthetic images are visually distinct from their corresponding real patient images. Finally, we demonstrate that in data-limited settings, classifiers trained solely on synthetic data or real data supplemented with synthetic data can outperform those trained solely on real data. Our findings highlight the immense potential of generalist image generative models to accelerate algorithmic research and development in medicine.

Updated: 2025-09-23 04:56:13

标题: MediSyn：一种通用的文本引导潜在扩散模型，用于多样化的医学图像合成

摘要: 深度学习算法需要大量数据才能实现稳健的性能。然而，在医疗领域，由于患者隐私问题，数据的可用性通常受到限制。合成数据提供了应对这些挑战的可能解决方案。最近，图像生成模型在医疗应用中发现了越来越多的用途，但往往设计用于特定医学专业和成像模态，从而限制了它们的广泛适用性。为了解决这个问题，我们介绍了MediSyn：一个文本引导的潜在扩散模型，能够从6个医学专业和10种图像类型中生成合成图像。通过广泛的实验，我们首先展示了MediSyn在定量上与专家模型的性能相匹配或超越。其次，我们展示了我们的合成图像是逼真的，并且与相应的文本提示具有强大的对齐性，经专业医生团队验证。第三，我们提供了经验证据，证明我们的合成图像在视觉上与相应的真实患者图像有明显区别。最后，我们证明在数据有限的情况下，仅基于合成数据或真实数据补充合成数据训练的分类器可以胜过仅基于真实数据训练的分类器。我们的研究结果突显了通用图像生成模型在加速医学算法研究和开发方面的巨大潜力。

更新时间: 2025-09-23 04:56:13

领域: cs.CV,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2405.09806v5

FediLoRA: Heterogeneous LoRA for Federated Multimodal Fine-tuning under Missing Modalities

Foundation models have demonstrated remarkable performance across a wide range of tasks, yet their large parameter sizes pose challenges for practical deployment, especially in decentralized environments. Parameter-efficient fine-tuning (PEFT), such as Low-Rank Adaptation (LoRA), reduces local computing and memory overhead, making it attractive for federated learning. However, existing federated LoRA methods typically assume uniform rank configurations and unimodal inputs, overlooking two key real-world challenges: (1) heterogeneous client resources have different LoRA ranks, and (2) multimodal data settings with potentially missing modalities. In this work, we propose FediLoRA, a simple yet effective framework for federated multimodal fine-tuning under heterogeneous LoRA ranks and missing modalities. FediLoRA introduces a dimension-wise aggregation strategy that reweights LoRA updates without information dilution during aggregation. It also includes a lightweight layer-wise model editing method that selectively incorporates global parameters to repair local components which improves both client and global model performances. Experimental results on three multimodal benchmark datasets demonstrate that FediLoRA achieves superior performance over competitive baselines in both global and personalized settings, particularly in the presence of modality incompleteness.

Updated: 2025-09-23 04:55:03

标题: FediLoRA：缺失模态下联邦多模态微调的异构LoRA

摘要: 基础模型在各种任务中表现出色，但其庞大的参数规模对于实际部署提出了挑战，特别是在分散环境中。参数高效微调（PEFT），如低秩调整（LoRA），可以减少本地计算和内存开销，使其在联邦学习中备受青睐。然而，现有的联邦LoRA方法通常假设统一的秩配置和单峰输入，忽略了两个关键的现实挑战：（1）异构客户端资源具有不同的LoRA秩，以及（2）可能缺失模态的多模态数据设置。在这项工作中，我们提出了FediLoRA，这是一个简单而有效的框架，用于在异构LoRA秩和缺失模态下进行联邦多模态微调。FediLoRA引入了一种按维度聚合策略，重新加权LoRA更新，避免了聚合过程中的信息稀疏。它还包括了一种轻量级的逐层模型编辑方法，有选择地将全局参数整合到本地组件中，从而改善了客户端和全局模型的性能。在三个多模态基准数据集上的实验结果表明，FediLoRA在全局和个性化设置中均比竞争基线表现出更优异的性能，特别是在存在模态不完整性的情况下。

更新时间: 2025-09-23 04:55:03

领域: cs.LG,cs.AI,I.2.7; I.2.11

下载: http://arxiv.org/abs/2509.06984v2

FedOC: Multi-Server FL with Overlapping Client Relays in Wireless Edge Networks

Multi-server Federated Learning (FL) has emerged as a promising solution to mitigate communication bottlenecks of single-server FL. We focus on a typical multi-server FL architecture, where the regions covered by different edge servers (ESs) may overlap. A key observation of this architecture is that clients located in the overlapping areas can access edge models from multiple ESs. Building on this insight, we propose FedOC (Federated learning with Overlapping Clients), a novel framework designed to fully exploit the potential of these overlapping clients. In FedOC, overlapping clients could serve dual roles: (1) as Relay Overlapping Clients (ROCs), they forward edge models between neighboring ESs in real time to facilitate model sharing among different ESs; and (2) as Normal Overlapping Clients (NOCs), they dynamically select their initial model for local training based on the edge model delivery time, which enables indirect data fusion among different regions of ESs. The overall FedOC workflow proceeds as follows: in every round, each client trains local model based on the earliest received edge model and transmits to the respective ESs for model aggregation. Then each ES transmits the aggregated edge model to neighboring ESs through ROC relaying. Upon receiving the relayed models, each ES performs a second aggregation and subsequently broadcasts the updated model to covered clients. The existence of ROCs enables the model of each ES to be disseminated to the other ESs in a decentralized manner, which indirectly achieves intercell model and speeding up the training process, making it well-suited for latency-sensitive edge environments. Extensive experimental results show remarkable performance gains of our scheme compared to existing methods.

Updated: 2025-09-23 04:53:51

标题: FedOC：无线边缘网络中具有重叠客户端中继的多服务器联邦学习

摘要: Multi-server Federated Learning（FL）已经成为减轻单服务器FL通信瓶颈的有希望的解决方案。我们专注于典型的多服务器FL架构，其中不同边缘服务器（ESs）覆盖的区域可能重叠。该架构的一个关键观察是，位于重叠区域的客户端可以从多个ESs访问边缘模型。基于这一观点，我们提出了FedOC（具有重叠客户端的联邦学习），这是一个旨在充分利用这些重叠客户端潜力的新颖框架。在FedOC中，重叠客户端可以发挥双重作用：（1）作为中继重叠客户端（ROCs），它们实时在相邻ESs之间转发边缘模型，以促进不同ESs之间的模型共享；（2）作为普通重叠客户端（NOCs），它们根据边缘模型传送时间动态选择其初始模型进行本地训练，从而实现不同ESs地区之间的间接数据融合。总体FedOC工作流程如下：在每一轮中，每个客户端基于最早收到的边缘模型训练本地模型并传输给相应的ESs进行模型聚合。然后，每个ES通过ROC中继将聚合的边缘模型传输给相邻ESs。在接收到中继模型后，每个ES执行第二次聚合，随后将更新的模型广播给覆盖的客户端。ROCs的存在使得每个ES的模型可以以分散的方式传播到其他ESs，间接实现了单元模型和加速训练过程，使其非常适用于延迟敏感的边缘环境。大量实验结果显示，与现有方法相比，我们的方案表现出显着的性能提升。

更新时间: 2025-09-23 04:53:51

领域: cs.NI,cs.AI

下载: http://arxiv.org/abs/2509.19398v1

Learning neuroimaging models from health system-scale data

Neuroimaging is a ubiquitous tool for evaluating patients with neurological diseases. The global demand for magnetic resonance imaging (MRI) studies has risen steadily, placing significant strain on health systems, prolonging turnaround times, and intensifying physician burnout \cite{Chen2017-bt, Rula2024-qp-1}. These challenges disproportionately impact patients in low-resource and rural settings. Here, we utilized a large academic health system as a data engine to develop Prima, the first vision language model (VLM) serving as an AI foundation for neuroimaging that supports real-world, clinical MRI studies as input. Trained on over 220,000 MRI studies, Prima uses a hierarchical vision architecture that provides general and transferable MRI features. Prima was tested in a 1-year health system-wide study that included 30K MRI studies. Across 52 radiologic diagnoses from the major neurologic disorders, including neoplastic, inflammatory, infectious, and developmental lesions, Prima achieved a mean diagnostic area under the ROC curve of 92.0, outperforming other state-of-the-art general and medical AI models. Prima offers explainable differential diagnoses, worklist priority for radiologists, and clinical referral recommendations across diverse patient demographics and MRI systems. Prima demonstrates algorithmic fairness across sensitive groups and can help mitigate health system biases, such as prolonged turnaround times for low-resource populations. These findings highlight the transformative potential of health system-scale VLMs and Prima's role in advancing AI-driven healthcare.

Updated: 2025-09-23 04:49:59

标题: 从健康系统规模数据中学习神经影像模型

摘要: 神经影像学是评估患有神经系统疾病的患者的一种普遍工具。全球对磁共振成像（MRI）研究的需求稳步增长，给卫生系统带来了巨大压力，延长了结果反馈时间，加剧了医生的工作疲劳。这些挑战不成比例地影响了资源匮乏和农村地区的患者。在这里，我们利用一个大型学术卫生系统作为数据引擎，开发了Prima，作为神经影像学的第一个视觉语言模型（VLM），支持以临床MRI研究为输入的实际世界应用。Prima在超过22万个MRI研究的基础上进行训练，采用了分层视觉架构，提供了通用和可转移的MRI特征。Prima在一个包括30,000个MRI研究的为期1年的卫生系统范围内进行了测试。在来自主要神经疾病的52种放射诊断中，包括肿瘤性、炎症性、感染性和发育性病变，Prima实现了92.0的平均诊断ROC曲线下面积，优于其他最先进的通用和医学AI模型。Prima提供了可解释的不同诊断、放射科医生的工作列表优先级和针对不同患者人口统计和MRI系统的临床转诊建议。Prima展示了对敏感群体的算法公平性，并有助于减轻卫生系统对低资源人群延长结果反馈时间的偏见。这些发现突显了卫生系统规模的VLM和Prima在推动以人工智能驱动的医疗保健方面的转变潜力。

更新时间: 2025-09-23 04:49:59

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.18638v1

LogicGuard: Improving Embodied LLM agents through Temporal Logic based Critics

Large language models (LLMs) have shown promise in zero-shot and single step reasoning and decision making problems, but in long horizon sequential planning tasks, their errors compound, often leading to unreliable or inefficient behavior. We introduce LogicGuard, a modular actor-critic architecture in which an LLM actor is guided by a trajectory level LLM critic that communicates through Linear Temporal Logic (LTL). Our setup combines the reasoning strengths of language models with the guarantees of formal logic. The actor selects high-level actions from natural language observations, while the critic analyzes full trajectories and proposes new LTL constraints that shield the actor from future unsafe or inefficient behavior. LogicGuard supports both fixed safety rules and adaptive, learned constraints, and is model-agnostic: any LLM-based planner can serve as the actor, with LogicGuard acting as a logic-generating wrapper. We formalize planning as graph traversal under symbolic constraints, allowing LogicGuard to analyze failed or suboptimal trajectories and generate new temporal logic rules that improve future behavior. To demonstrate generality, we evaluate LogicGuard across two distinct settings: short-horizon general tasks and long-horizon specialist tasks. On the Behavior benchmark of 100 household tasks, LogicGuard increases task completion rates by 25% over a baseline InnerMonologue planner. On the Minecraft diamond-mining task, which is long-horizon and requires multiple interdependent subgoals, LogicGuard improves both efficiency and safety compared to SayCan and InnerMonologue. These results show that enabling LLMs to supervise each other through temporal logic yields more reliable, efficient and safe decision-making for both embodied agents.

Updated: 2025-09-23 04:36:17

标题: LogicGuard：通过基于时态逻辑的评论者改进具有体现特征的LLM代理

摘要: 大型语言模型（LLMs）在零-shot和单步推理和决策问题中显示出潜力，但在长视距序列规划任务中，它们的错误会累积，通常导致不可靠或低效的行为。我们引入了LogicGuard，这是一个模块化的演员-评论者架构，其中LLM演员由通过线性时态逻辑（LTL）进行通信的轨迹级LLM评论者引导。我们的设置结合了语言模型的推理优势和形式逻辑的保证。演员从自然语言观察中选择高级别动作，而评论者分析完整轨迹，并提出新的LTL约束条件，以保护演员免受未来不安全或低效行为的影响。LogicGuard支持固定的安全规则和自适应的学习约束，而且是模型不可知的：任何基于LLM的规划器都可以作为演员，而LogicGuard则充当逻辑生成包装器。我们将规划形式化为在符号约束下的图遍历，使LogicGuard能够分析失败或次优轨迹，并生成新的时态逻辑规则，以改进未来的行为。为了展示其普适性，我们在两个不同的环境中评估了LogicGuard：短视距一般任务和长视距专业任务。在包括100个家庭任务的行为基准测试中，LogicGuard相比基线InnerMonologue规划器提高了25%的任务完成率。在长视距且需要多个相互依赖子目标的Minecraft钻石挖掘任务中，LogicGuard相比SayCan和InnerMonologue改善了效率和安全性。这些结果表明，通过时态逻辑使LLMs相互监督，可以为具有实体的代理提供更可靠、高效和安全的决策制定。

更新时间: 2025-09-23 04:36:17

领域: cs.AI,cs.CL,cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.03293v2

Adaptive Learning in Spatial Agent-Based Models for Climate Risk Assessment: A Geospatial Framework with Evolutionary Economic Agents

Climate risk assessment requires modelling complex interactions between spatially heterogeneous hazards and adaptive economic systems. We present a novel geospatial agent-based model that integrates climate hazard data with evolutionary learning for economic agents. Our framework combines Mesa-based spatial modelling with CLIMADA climate impact assessment, introducing adaptive learning behaviours that allow firms to evolve strategies for budget allocation, pricing, wages, and risk adaptation through fitness-based selection and mutation. We demonstrate the framework using riverine flood projections under RCP8.5 until 2100, showing that evolutionary adaptation enables firms to converge with baseline (no hazard) production levels after decades of disruption due to climate stress. Our results reveal systemic risks where even agents that are not directly exposed to floods face impacts through supply chain disruptions, with the end-of-century average price of goods 5.6% higher under RCP8.5 compared to the baseline. This open-source framework provides financial institutions and companies with tools to quantify both direct and cascading climate risks while evaluating cost-effective adaptation strategies.

Updated: 2025-09-23 04:33:58

标题: 气候风险评估中的空间代理模型自适应学习：具有进化经济代理的地理空间框架

摘要: 气候风险评估需要模拟空间异质性危害与适应性经济系统之间的复杂相互作用。我们提出了一种新颖的地理空间代理模型，将气候危害数据与经济代理的进化学习相结合。我们的框架将基于Mesa的空间建模与CLIMADA气候影响评估相结合，引入了自适应学习行为，使企业能够通过基于适应性选择和突变的策略，进化出预算分配、定价、工资和风险适应。我们使用RC8.5下到2100年的河流洪水预测来演示这一框架，结果显示，进化适应使企业在经历几十年的气候压力干扰后能够与基准（无危害）生产水平趋于一致。我们的结果揭示了系统性风险，即使是没有直接暴露于洪水的代理商也会面临供应链中断所带来的影响，根据RC8.5与基准相比，世纪末商品的平均价格提高了5.6%。这一开源框架为金融机构和公司提供了工具，可以量化直接和级联气候风险，同时评估成本有效的适应策略。

更新时间: 2025-09-23 04:33:58

领域: cs.AI,q-fin.RM

下载: http://arxiv.org/abs/2509.18633v1

Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

Behavior cloning has shown promise for robot manipulation, but real-world demonstrations are costly to acquire at scale. While simulated data offers a scalable alternative, particularly with advances in automated demonstration generation, transferring policies to the real world is hampered by various simulation and real domain gaps. In this work, we propose a unified sim-and-real co-training framework for learning generalizable manipulation policies that primarily leverages simulation and only requires a few real-world demonstrations. Central to our approach is learning a domain-invariant, task-relevant feature space. Our key insight is that aligning the joint distributions of observations and their corresponding actions across domains provides a richer signal than aligning observations (marginals) alone. We achieve this by embedding an Optimal Transport (OT)-inspired loss within the co-training framework, and extend this to an Unbalanced OT framework to handle the imbalance between abundant simulation data and limited real-world examples. We validate our method on challenging manipulation tasks, showing it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate and even generalize to scenarios seen only in simulation.

Updated: 2025-09-23 04:32:53

标题: Sim和真实策略联合训练的可泛化领域自适应

摘要: 行为克隆已经显示出机器人操作的潜力，但在规模上获得真实世界的演示成本很高。虽然模拟数据提供了一种可扩展的替代方案，特别是在自动演示生成方面取得了进展，但将策略转移到真实世界受到各种模拟和真实领域差距的阻碍。在这项工作中，我们提出了一个统一的模拟和真实协同训练框架，用于学习可概括操纵策略，主要利用模拟，并且只需要一些真实世界的演示。我们方法的核心是学习一个领域不变的、任务相关的特征空间。我们的关键洞察是，跨领域对齐观察和其对应行动的联合分布，提供比仅对齐观察（边际）更丰富的信号。我们通过在协同训练框架中嵌入了一个受最优传输（OT）启发的损失来实现这一点，并将其扩展为一个不平衡OT框架，以处理丰富的模拟数据和有限的真实世界示例之间的不平衡。我们在具有挑战性的操纵任务上验证了我们的方法，显示它可以利用丰富的模拟数据，在真实世界成功率提高高达30%，甚至可以推广到仅在模拟中看到的场景。

更新时间: 2025-09-23 04:32:53

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18631v1

HyperAdapt: Simple High-Rank Adaptation

Foundation models excel across diverse tasks, but adapting them to specialized applications often requires fine-tuning, an approach that is memory and compute-intensive. Parameter-efficient fine-tuning (PEFT) methods mitigate this by updating only a small subset of weights. In this paper, we introduce HyperAdapt, a parameter-efficient fine-tuning method that significantly reduces the number of trainable parameters compared to state-of-the-art methods like LoRA. Specifically, HyperAdapt adapts a pre-trained weight matrix by applying row- and column-wise scaling through diagonal matrices, thereby inducing a high-rank update while requiring only $n+m$ trainable parameters for an $n \times m$ matrix. Theoretically, we establish an upper bound on the rank of HyperAdapt's updates, and empirically, we confirm that it consistently induces high-rank transformations across model layers. Experiments on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks with models up to 14B parameters demonstrate that HyperAdapt matches or nearly matches the performance of full fine-tuning and state-of-the-art PEFT methods while using orders of magnitude fewer trainable parameters.

Updated: 2025-09-23 04:29:26

标题: 适应性超级：简单的高阶适应

摘要: 基础模型在各种任务中表现出色，但将它们调整到专业应用通常需要微调，这种方法是内存和计算密集型的。参数高效微调（PEFT）方法通过仅更新一小部分权重来缓解这一问题。在本文中，我们介绍了HyperAdapt，一种参数高效的微调方法，与LoRA等最先进方法相比，它显著减少了可训练参数的数量。具体来说，HyperAdapt通过对角矩阵应用行和列的缩放来调整预训练的权重矩阵，从而引入高秩更新，只需要 $n+m$ 个可训练参数来处理一个 $n \times m$ 矩阵。在理论上，我们对HyperAdapt的更新的秩建立了一个上界，并且在实验上，我们确认它在模型层之间始终引入高秩变换。在GLUE、算术推理和常识推理基准测试上进行的实验，使用高达14B参数的模型，表明HyperAdapt在使用数量级更少的可训练参数的情况下与全面微调和最先进的PEFT方法的性能相匹配或接近。

更新时间: 2025-09-23 04:29:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18629v1

BRAID: Input-Driven Nonlinear Dynamical Modeling of Neural-Behavioral Data

Neural populations exhibit complex recurrent structures that drive behavior, while continuously receiving and integrating external inputs from sensory stimuli, upstream regions, and neurostimulation. However, neural populations are often modeled as autonomous dynamical systems, with little consideration given to the influence of external inputs that shape the population activity and behavioral outcomes. Here, we introduce BRAID, a deep learning framework that models nonlinear neural dynamics underlying behavior while explicitly incorporating any measured external inputs. Our method disentangles intrinsic recurrent neural population dynamics from the effects of inputs by including a forecasting objective within input-driven recurrent neural networks. BRAID further prioritizes the learning of intrinsic dynamics that are related to a behavior of interest by using a multi-stage optimization scheme. We validate BRAID with nonlinear simulations, showing that it can accurately learn the intrinsic dynamics shared between neural and behavioral modalities. We then apply BRAID to motor cortical activity recorded during a motor task and demonstrate that our method more accurately fits the neural-behavioral data by incorporating measured sensory stimuli into the model and improves the forecasting of neural-behavioral data compared with various baseline methods, whether input-driven or not.

Updated: 2025-09-23 04:22:53

标题: BRAID：神经行为数据的输入驱动非线性动力学建模

摘要: 神经群体表现出复杂的循环结构，驱动行为，同时不断接收和整合来自感觉刺激、上游区域和神经刺激的外部输入。然而，神经群体通常被建模为自主动力系统，很少考虑外部输入对群体活动和行为结果的影响。在这里，我们介绍了BRAID，一个深度学习框架，模拟了潜在行为背后的非线性神经动力学，同时明确地纳入了任何测量的外部输入。我们的方法通过在基于输入驱动的循环神经网络中包含预测目标，将内在循环神经群体动力学与输入效果分离开来。BRAID通过使用多阶段优化方案进一步优先考虑与感兴趣行为相关的内在动态的学习。我们通过非线性模拟验证了BRAID，显示它可以准确地学习在神经和行为模态之间共享的内在动态。然后，我们将BRAID应用于在进行运动任务期间记录的运动皮层活动，并证明我们的方法通过将测量的感觉刺激纳入模型，更准确地拟合神经-行为数据，并改进了与各种基线方法相比的神经-行为数据的预测，无论是基于输入还是不基于输入。

更新时间: 2025-09-23 04:22:53

领域: q-bio.NC,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.18627v1

The Case for Negative Data: From Crash Reports to Counterfactuals for Reasonable Driving

Learning-based autonomous driving systems are trained mostly on incident-free data, offering little guidance near safety-performance boundaries. Real crash reports contain precisely the contrastive evidence needed, but they are hard to use: narratives are unstructured, third-person, and poorly grounded to sensor views. We address these challenges by normalizing crash narratives to ego-centric language and converting both logs and crashes into a unified scene-action representation suitable for retrieval. At decision time, our system adjudicates proposed actions by retrieving relevant precedents from this unified index; an agentic counterfactual extension proposes plausible alternatives, retrieves for each, and reasons across outcomes before deciding. On a nuScenes benchmark, precedent retrieval substantially improves calibration, with recall on contextually preferred actions rising from 24% to 53%. The counterfactual variant preserves these gains while sharpening decisions near risk.

Updated: 2025-09-23 04:21:39

标题: 负面数据的重要性：从事故报告到合理驾驶的反事实分析

摘要: 学习型自动驾驶系统主要是在没有事故的数据上进行训练，对安全性能边界附近提供很少的指导。真实的事故报告包含了所需的对比证据，但很难使用：叙述是非结构化的、第三人称的，与传感器视图关联性不强。我们通过将事故叙述规范化为自我中心语言，并将日志和事故转换为适合检索的统一场景-动作表示来解决这些挑战。在决策时，我们的系统通过从这个统一索引中检索相关的先例来裁决提出的行动；一种主观的反事实扩展提出可信的替代方案，为每种方案检索并在决定之前对结果进行推理。在nuScenes基准测试中，先例检索显着改善了校准，优选行动的召回率从24%上升到53%。反事实变体保留了这些收益，同时在接近风险的决策上做出了更加明晰的判断。

更新时间: 2025-09-23 04:21:39

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18626v1

EMMA: End-to-End Multimodal Model for Autonomous Driving

We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built upon a multi-modal large language model foundation like Gemini, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. We hope that our results will inspire research to further evolve the state of the art in autonomous driving model architectures.

Updated: 2025-09-23 04:19:59

标题: EMMA：用于自动驾驶的端到端多模态模型

摘要: 我们介绍了EMMA，一种用于自动驾驶的端到端多模态模型。建立在类似Gemini的多模态大型语言模型基础上，EMMA直接将原始摄像头传感器数据映射到各种与驾驶相关的输出，包括规划者轨迹、感知对象和道路图元素。EMMA通过将所有非传感器输入（例如导航指令和自车状态）和输出（例如轨迹和3D位置）表示为自然语言文本，最大化了来自预训练大型语言模型的世界知识的效用。这种方法允许EMMA在统一的语言空间中共同处理各种驾驶任务，并使用特定于任务的提示生成每个任务的输出。从经验上看，我们通过在nuScenes上实现最新的运动规划性能以及在Waymo开放运动数据集（WOMD）上取得竞争性结果，展示了EMMA的有效性。EMMA还在Waymo开放数据集（WOD）上针对以摄像头为主的3D对象检测获得了竞争性结果。我们展示了通过与规划者轨迹、对象检测和道路图任务的共同训练，EMMA在所有三个领域都取得了改进，突显了EMMA作为自动驾驶应用的通用模型的潜力。我们希望我们的结果将激发进一步发展自动驾驶模型架构的最新技术的研究。

更新时间: 2025-09-23 04:19:59

领域: cs.CV,cs.AI,cs.CL,cs.LG,cs.RO

下载: http://arxiv.org/abs/2410.23262v3

Flow marching for a generative PDE foundation model

Pretraining on large-scale collections of PDE-governed spatiotemporal trajectories has recently shown promise for building generalizable models of dynamical systems. Yet most existing PDE foundation models rely on deterministic Transformer architectures, which lack generative flexibility for many science and engineering applications. We propose Flow Marching, an algorithm that bridges neural operator learning with flow matching motivated by an analysis of error accumulation in physical dynamical systems, and we build a generative PDE foundation model on top of it. By jointly sampling the noise level and the physical time step between adjacent states, the model learns a unified velocity field that transports a noisy current state toward its clean successor, reducing long-term rollout drift while enabling uncertainty-aware ensemble generations. Alongside this core algorithm, we introduce a Physics-Pretrained Variational Autoencoder (P2VAE) to embed physical states into a compact latent space, and an efficient Flow Marching Transformer (FMT) that combines a diffusion-forcing scheme with latent temporal pyramids, achieving up to 15x greater computational efficiency than full-length video diffusion models and thereby enabling large-scale pretraining at substantially reduced cost. We curate a corpus of ~2.5M trajectories across 12 distinct PDE families and train suites of P2VAEs and FMTs at multiple scales. On downstream evaluation, we benchmark on unseen Kolmogorov turbulence with few-shot adaptation, demonstrate long-term rollout stability over deterministic counterparts, and present uncertainty-stratified ensemble results, highlighting the importance of generative PDE foundation models for real-world applications.

Updated: 2025-09-23 04:00:41

标题: 流动行军用于生成PDE基础模型

摘要: 最近，对PDE控制的时空轨迹进行大规模预训练已经显示出建立可推广模型的潜力。然而，大多数现有的PDE基础模型依赖于确定性Transformer架构，缺乏许多科学和工程应用中所需的生成灵活性。我们提出了一种名为Flow Marching的算法，它将神经运算器学习与流匹配相结合，受到物理动态系统中错误累积分析的启发，并在此基础上构建了一个生成式PDE基础模型。通过同时对噪声水平和相邻状态之间的物理时间步长进行采样，该模型学习了一个统一的速度场，将带有噪声的当前状态传输到其干净的后继状态，减少了长期的漂移，同时实现了具有不确定性感知的集成生成。除了这个核心算法，我们还引入了一个名为Physics-Pretrained Variational Autoencoder（P2VAE）的模型，将物理状态嵌入到一个紧凑的潜在空间中，以及一个高效的Flow Marching Transformer（FMT），将扩散强迫方案与潜在时间金字塔相结合，实现了比全长视频扩散模型高达15倍的计算效率，从而以大大降低的成本实现大规模预训练。我们整理了包括12个不同PDE家族约250万条轨迹的语料库，并在多个尺度上训练了一系列P2VAEs和FMTs。在下游评估中，我们在未见的Kolmogorov湍流上进行了基准测试，展示了少量样本适应性，证明了对比确定性对照的长期稳定性，并呈现了基于不确定性分层的集成结果，突显了生成式PDE基础模型在实际应用中的重要性。

更新时间: 2025-09-23 04:00:41

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18611v1

IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence

Map applications are still largely point-and-click, making it difficult to ask map-centric questions or connect what a camera sees to the surrounding geospatial context with view-conditioned inputs. We introduce IMAIA, an interactive Maps AI Assistant that enables natural-language interaction with both vector (street) maps and satellite imagery, and augments camera inputs with geospatial intelligence to help users understand the world. IMAIA comprises two complementary components. Maps Plus treats the map as first-class context by parsing tiled vector/satellite views into a grid-aligned representation that a language model can query to resolve deictic references (e.g., ``the flower-shaped building next to the park in the top-right''). Places AI Smart Assistant (PAISA) performs camera-aware place understanding by fusing image--place embeddings with geospatial signals (location, heading, proximity) to ground a scene, surface salient attributes, and generate concise explanations. A lightweight multi-agent design keeps latency low and exposes interpretable intermediate decisions. Across map-centric QA and camera-to-place grounding tasks, IMAIA improves accuracy and responsiveness over strong baselines while remaining practical for user-facing deployments. By unifying language, maps, and geospatial cues, IMAIA moves beyond scripted tools toward conversational mapping that is both spatially grounded and broadly usable.

Updated: 2025-09-23 03:59:02

标题: IMAIA：用于旅行规划和地理空间智能的交互式地图人工智能助手

摘要: 地图应用程序仍然主要是点与点击，这使得很难提出地图中心的问题或将相机所看到的内容与周围的地理空间背景连接起来。我们引入了IMAIA，这是一个交互式地图人工智能助手，可以实现自然语言交互，同时可以与矢量（街道）地图和卫星图像进行交互，并通过地理空间智能增强相机输入，帮助用户了解世界。IMAIA包括两个互补的组件。Maps Plus将地图视为第一类上下文，通过将分块的矢量/卫星视图解析为一个网格对齐的表示形式，语言模型可以查询以解决指示性引用（例如，“右上角的公园旁边的花朵形状建筑物”）。Places AI Smart Assistant（PAISA）通过融合图像-地点嵌入和地理空间信号（位置、朝向、接近度）来执行相机感知的地点理解，以确立场景，展示显著属性，并生成简明的解释。轻量级的多代理设计使延迟低，并暴露可解释的中间决策。在以地图为中心的问答和相机到地点的接地任务中，IMAIA提高了准确性和响应性，同时保持了对用户面向部署的实用性。通过统一语言、地图和地理空间线索，IMAIA超越了脚本工具，向具有空间基础和广泛可用性的会话式地图制作迈进。

更新时间: 2025-09-23 03:59:02

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.06993v3

Robust DNN Partitioning and Resource Allocation Under Uncertain Inference Time

In edge intelligence systems, deep neural network (DNN) partitioning and data offloading can provide real-time task inference for resource-constrained mobile devices. However, the inference time of DNNs is typically uncertain and cannot be precisely determined in advance, presenting significant challenges in ensuring timely task processing within deadlines. To address the uncertain inference time, we propose a robust optimization scheme to minimize the total energy consumption of mobile devices while meeting task probabilistic deadlines. The scheme only requires the mean and variance information of the inference time, without any prediction methods or distribution functions. The problem is formulated as a mixed-integer nonlinear programming (MINLP) that involves jointly optimizing the DNN model partitioning and the allocation of local CPU/GPU frequencies and uplink bandwidth. To tackle the problem, we first decompose the original problem into two subproblems: resource allocation and DNN model partitioning. Subsequently, the two subproblems with probability constraints are equivalently transformed into deterministic optimization problems using the chance-constrained programming (CCP) method. Finally, the convex optimization technique and the penalty convex-concave procedure (PCCP) technique are employed to obtain the optimal solution of the resource allocation subproblem and a stationary point of the DNN model partitioning subproblem, respectively. The proposed algorithm leverages real-world data from popular hardware platforms and is evaluated on widely used DNN models. Extensive simulations show that our proposed algorithm effectively addresses the inference time uncertainty with probabilistic deadline guarantees while minimizing the energy consumption of mobile devices.

Updated: 2025-09-23 03:56:58

标题: 稳健的DNN分区和资源分配在不确定推理时间下

摘要: 在边缘智能系统中，深度神经网络（DNN）的分区和数据卸载可以为资源受限的移动设备提供实时任务推断。然而，DNN的推断时间通常是不确定的，无法事先准确确定，这给确保任务在截止日期内及时处理带来了重大挑战。为了解决推断时间的不确定性，我们提出了一种鲁棒优化方案，以最小化移动设备的总能耗，同时满足任务概率性截止日期。该方案只需要推断时间的均值和方差信息，无需任何预测方法或分布函数。该问题被表述为一个混合整数非线性规划（MINLP），涉及联合优化DNN模型分区和本地CPU/GPU频率以及上行带宽的分配。为了解决问题，我们首先将原始问题分解为两个子问题：资源分配和DNN模型分区。随后，具有概率约束的两个子问题通过机会约束编程（CCP）方法等价转化为确定性优化问题。最后，利用凸优化技术和惩罚凸凹程序（PCCP）技术来获得资源分配子问题的最优解以及DNN模型分区子问题的一个稳定点。所提出的算法利用来自流行硬件平台的真实数据，并在广泛使用的DNN模型上进行评估。大量模拟结果表明，我们提出的算法有效地解决了推断时间的不确定性，并保证了概率性截止日期，同时最小化了移动设备的能耗。

更新时间: 2025-09-23 03:56:58

领域: cs.DC,cs.IT,cs.LG,math.IT

下载: http://arxiv.org/abs/2503.21476v2

End-to-End Crop Row Navigation via LiDAR-Based Deep Reinforcement Learning

Reliable navigation in under-canopy agricultural environments remains a challenge due to GNSS unreliability, cluttered rows, and variable lighting. To address these limitations, we present an end-to-end learning-based navigation system that maps raw 3D LiDAR data directly to control commands using a deep reinforcement learning policy trained entirely in simulation. Our method includes a voxel-based downsampling strategy that reduces LiDAR input size by 95.83%, enabling efficient policy learning without relying on labeled datasets or manually designed control interfaces. The policy was validated in simulation, achieving a 100% success rate in straight-row plantations and showing a gradual decline in performance as row curvature increased, tested across varying sinusoidal frequencies and amplitudes.

Updated: 2025-09-23 03:56:10

标题: 利用基于LiDAR的深度强化学习实现的端到端作物行导航

摘要: 在林下农业环境中可靠导航仍然是一个挑战，这是由于GNSS不可靠性、混乱的行间距以及光照变化所致。为了解决这些限制，我们提出了一种基于端到端学习的导航系统，该系统直接将原始3D LiDAR数据映射到控制命令，使用在模拟环境中完全训练的深度强化学习策略。我们的方法包括基于体素的下采样策略，将LiDAR输入大小降低了95.83%，从而实现了高效的策略学习，而无需依赖标记数据集或手动设计的控制界面。该策略在模拟环境中经过验证，在直行植被园中实现了100%的成功率，并在测试中表现出随着行曲率增加而逐渐下降的性能，跨越不同的正弦频率和振幅。

更新时间: 2025-09-23 03:56:10

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18608v1

Self-Alignment Learning to Improve Myocardial Infarction Detection from Single-Lead ECG

Myocardial infarction is a critical manifestation of coronary artery disease, yet detecting it from single-lead electrocardiogram (ECG) remains challenging due to limited spatial information. An intuitive idea is to convert single-lead into multiple-lead ECG for classification by pre-trained models, but generative methods optimized at the signal level in most cases leave a large latent space gap, ultimately degrading diagnostic performance. This naturally raises the question of whether latent space alignment could help. However, most prior ECG alignment methods focus on learning transformation invariance, which mismatches the goal of single-lead detection. To address this issue, we propose SelfMIS, a simple yet effective alignment learning framework to improve myocardial infarction detection from single-lead ECG. Discarding manual data augmentations, SelfMIS employs a self-cutting strategy to pair multiple-lead ECG with their corresponding single-lead segments and directly align them in the latent space. This design shifts the learning objective from pursuing transformation invariance to enriching the single-lead representation, explicitly driving the single-lead ECG encoder to learn a representation capable of inferring global cardiac context from the local signal. Experimentally, SelfMIS achieves superior performance over baseline models across nine myocardial infarction types while maintaining a simpler architecture and lower computational overhead, thereby substantiating the efficacy of direct latent space alignment. Our code and checkpoint will be publicly available after acceptance.

Updated: 2025-09-23 03:54:39

标题: 自我对齐学习以改善单导联心电图对心肌梗死的检测

摘要: 心肌梗死是冠状动脉疾病的一个关键表现，然而从单导联心电图(ECG)中检测它仍然具有挑战性，因为空间信息有限。一个直观的想法是将单导联转换为多导联ECG进行预训练模型分类，但在大多数情况下，在信号级别上优化的生成方法会留下一个巨大的潜在空间差距，最终降低诊断性能。这自然引发了潜在空间对齐是否有助于的问题。然而，大多数先前的ECG对齐方法侧重于学习变换不变性，这与单导联检测的目标不匹配。为了解决这个问题，我们提出了SelfMIS，这是一个简单而有效的对齐学习框架，用于改进从单导联ECG中检测心肌梗死。SelfMIS采用自我切割策略，将多导联ECG与其相应的单导联片段配对，并直接在潜在空间中对齐它们，而不是丢弃手动数据增强。这一设计将学习目标从追求变换不变性转变为丰富单导联表示，明确地推动单导联ECG编码器学习一个能够从局部信号中推断全局心脏上下文的表示。在实验中，SelfMIS在九种心肌梗死类型上实现了优越的性能，同时保持了更简单的架构和更低的计算开销，从而证实了直接潜在空间对齐的有效性。我们的代码和检查点将在接受后公开提供。

更新时间: 2025-09-23 03:54:39

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19397v1

Reflect before Act: Proactive Error Correction in Language Models

Large Language Models (LLMs) have demonstrated remarkable capabilities in interactive decision-making tasks, but existing methods often struggle with error accumulation and lack robust self-correction mechanisms. We introduce "Reflect before Act" (REBACT), a novel approach that enhances LLM-based decision-making by introducing a critical reflect step prior to taking the next action. This approach allows for immediate error correction, ensuring smooth action path and adaptibity to environment feedback. We evaluate REBACT on three diverse interactive environments: ALFWorld, WebShop, and TextCraft. Our results demonstrate that REBACT significantly outperforms strong baselines, improving success rates by up to 24% on WebShop (achieving 61%), 6.72% on ALFWorld (achieving 98.51%), and 0.5% on TextCraft (achieving 99.5%) using Claude3.5-sonnet as the underlying LLM. Further analysis reveals that REBACT's performance improvements are achieved with only a few modification steps, demonstrating its computational efficiency.

Updated: 2025-09-23 03:53:45

标题: 在行动之前先思考：语言模型中的主动错误更正

摘要: 大型语言模型（LLMs）在交互式决策任务中展示了显著的能力，但现有方法常常在错误累积和缺乏强大的自我校正机制方面遇到困难。我们引入了“反思再行动”（REBACT）这一新颖方法，通过在采取下一步行动之前引入关键的反思步骤来增强基于LLM的决策能力。这种方法允许立即进行错误校正，确保平稳的行动路径并适应环境反馈。我们在三个不同的交互式环境：ALFWorld、WebShop和TextCraft上评估了REBACT。我们的结果表明，REBACT明显优于强基线，在WebShop上成功率提高了高达24%（达到61%），在ALFWorld上提高了6.72%（达到98.51%），在TextCraft上提高了0.5%（达到99.5%），使用Claude3.5-sonnet作为基础LLM。进一步分析表明，REBACT的性能改进仅需进行少量修改步骤，展示了其计算效率。

更新时间: 2025-09-23 03:53:45

领域: cs.LG

下载: http://arxiv.org/abs/2509.18607v1

GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding

Programming languages possess rich semantic information - such as data flow - that is represented by graphs and not available from the surface form of source code. Recent code language models have scaled to billions of parameters, but model source code solely as text tokens while ignoring any other structural information. Conversely, models that do encode structural information of code make modifications to the Transformer architecture, limiting their scale and compatibility with pretrained LLMs. In this work, we take the best of both worlds with GALLa - Graph Aligned Large Language Models. GALLa utilizes graph neural networks and cross-modal alignment technologies to inject the structural information of code into LLMs as an auxiliary task during finetuning. This framework is both model-agnostic and task-agnostic, as it can be applied to any code LLM for any code downstream task, and requires the structural graph data only at training time from a corpus unrelated to the finetuning data, while incurring no cost at inference time over the baseline LLM. Experiments on five code tasks with seven different baseline LLMs ranging in size from 350M to 14B validate the effectiveness of GALLa, demonstrating consistent improvement over the baseline, even for powerful models such as LLaMA3 and Qwen2.5-Coder.

Updated: 2025-09-23 03:53:05

标题: GALLa：用于改进源代码理解的图对齐大型语言模型

摘要: 编程语言拥有丰富的语义信息，例如数据流，这些信息通过图表表示，而不是从源代码的表面形式中获得。最近的代码语言模型已经扩展到数十亿个参数，但是这些模型仅将源代码视为文本标记，而忽略了任何其他结构信息。相反，对代码进行结构信息编码的模型对Transformer架构进行修改，限制了它们的规模和与预训练LLM的兼容性。在这项工作中，我们将GALLa - 图表对齐大型语言模型的优点结合起来。GALLa利用图神经网络和跨模态对齐技术，在微调期间将代码的结构信息作为辅助任务注入LLM中。这个框架既不依赖于模型也不依赖于任务，因此可以应用于任何代码LLM以及任何代码下游任务，并且在训练时仅需要来自与微调数据无关的语料库的结构图数据，而在推理时与基线LLM相比没有额外成本。对五个代码任务进行的实验验证了GALLa的有效性，展示了与基线相比的持续改进，甚至对于像LLaMA3和Qwen2.5-Coder这样强大的模型也是如此。

更新时间: 2025-09-23 03:53:05

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2409.04183v4

FlexSED: Towards Open-Vocabulary Sound Event Detection

Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.

Updated: 2025-09-23 03:52:52

标题: FlexSED：面向开放词汇的声音事件检测

摘要: 尽管最近在处理数百种声音类别的大规模声音事件检测（SED）系统方面取得了进展，但现有的多类分类框架仍然存在根本限制。它们无法处理自由文本声音查询，这种查询可以实现更灵活和用户友好的交互，同时它们缺乏零样本能力并且提供较差的少样本适应性。尽管已经探索了基于文本查询的分离方法，但它们主要集中在源分离上，并不适用于需要精确时间定位和在大规模和多样的声音词汇中进行高效检测的SED任务。在本文中，我们提出了FlexSED，一个开放词汇声音事件检测系统。FlexSED基于预训练的音频SSL模型和CLAP文本编码器构建，引入编码器-解码器组合和自适应融合策略，以实现从预训练权重进行有效持续训练。为了确保强大的监督，它还采用大型语言模型（LLMs）在训练期间辅助事件查询选择，解决与缺失标签相关的挑战。因此，与普通SED模型相比，FlexSED在AudioSet-Strong上实现了更优越的性能，同时展示了强大的零样本和少样本能力。我们发布了代码和预训练模型，以支持基于FlexSED的未来研究和应用。

更新时间: 2025-09-23 03:52:52

领域: eess.AS,cs.AI,cs.SD

下载: http://arxiv.org/abs/2509.18606v1

SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering

Data synthesis and augmentation are essential for Sound Event Detection (SED) due to the scarcity of temporally labeled data. While augmentation methods like SpecAugment and Mix-up can enhance model performance, they remain constrained by the diversity of existing samples. Recent generative models offer new opportunities, yet their direct application to SED is challenging due to the lack of precise temporal annotations and the risk of introducing noise through unreliable filtering. To address these challenges and enable generative-based augmentation for SED, we propose SynSonic, a data augmentation method tailored for this task. SynSonic leverages text-to-audio diffusion models guided by an energy-envelope ControlNet to generate temporally coherent sound events. A joint score filtering strategy with dual classifiers ensures sample quality, and we explore its practical integration into training pipelines. Experimental results show that SynSonic improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization and sound class discrimination.

Updated: 2025-09-23 03:48:26

标题: SynSonic：通过文本到音频扩散控制网络和有效样本过滤增强声音事件检测

摘要: 数据综合和增强对于声音事件检测（SED）至关重要，因为时间标记数据稀缺。虽然增强方法如SpecAugment和Mix-up可以提高模型性能，但它们仍受限于现有样本的多样性。最近的生成模型提供了新的机会，然而它们直接应用于SED是具有挑战性的，因为缺乏精确的时间注释并且通过不可靠的过滤引入噪声的风险。为了解决这些挑战并实现基于生成的增强对于SED，我们提出了SynSonic，一个专为这一任务量身定制的数据增强方法。SynSonic利用由能量包络控制网络引导的文本到音频扩散模型来生成时间连贯的声音事件。通过双分类器的联合得分过滤策略确保样本质量，并探索其在训练管线中的实际整合。实验结果显示，SynSonic改善了复音声音检测分数（PSDS1和PSDS2），增强了时间定位和声音类别区分能力。

更新时间: 2025-09-23 03:48:26

领域: eess.AS,cs.AI,cs.SD

下载: http://arxiv.org/abs/2509.18603v1

Fine-Grained Detection of AI-Generated Text Using Sentence-Level Segmentation

Generation of Artificial Intelligence (AI) texts in important works has become a common practice that can be used to misuse and abuse AI at various levels. Traditional AI detectors often rely on document-level classification, which struggles to identify AI content in hybrid or slightly edited texts designed to avoid detection, leading to concerns about the model's efficiency, which makes it hard to distinguish between human-written and AI-generated texts. A sentence-level sequence labeling model proposed to detect transitions between human- and AI-generated text, leveraging nuanced linguistic signals overlooked by document-level classifiers. By this method, detecting and segmenting AI and human-written text within a single document at the token-level granularity is achieved. Our model combines the state-of-the-art pre-trained Transformer models, incorporating Neural Networks (NN) and Conditional Random Fields (CRFs). This approach extends the power of transformers to extract semantic and syntactic patterns, and the neural network component to capture enhanced sequence-level representations, thereby improving the boundary predictions by the CRF layer, which enhances sequence recognition and further identification of the partition between Human- and AI-generated texts. The evaluation is performed on two publicly available benchmark datasets containing collaborative human and AI-generated texts. Our experimental comparisons are with zero-shot detectors and the existing state-of-the-art models, along with rigorous ablation studies to justify that this approach, in particular, can accurately detect the spans of AI texts in a completely collaborative text. All our source code and the processed datasets are available in our GitHub repository.

Updated: 2025-09-23 03:46:06

标题: 使用句级分割进行细粒度检测人工智能生成的文本

摘要: 人工智能（AI）文本在重要作品中的生成已经成为一种常见做法，可以在不同层面上被滥用和误用。传统的AI检测器通常依赖于文档级分类，这在识别旨在避免检测的混合或略微编辑的文本中的AI内容时存在困难，引起了对模型效率的担忧，使得难以区分人类编写和AI生成的文本。提出了一种句子级序列标记模型，用于检测人类和AI生成文本之间的转换，利用文档级分类器忽视的微妙语言信号。通过这种方法，在单个文档中以标记级别的细粒度实现了对AI和人类编写文本的检测和分割。我们的模型结合了最先进的预训练Transformer模型，融合了神经网络（NN）和条件随机场（CRFs）。这种方法扩展了transformers提取语义和句法模式的能力，以及神经网络组件捕获增强的序列级表示，从而通过CRF层改进边界预测，增强序列识别并进一步识别人类和AI生成文本之间的分区。评估是在两个包含协作的人类和AI生成文本的公开基准数据集上进行的。我们的实验比较了零-shot检测器和现有的最先进模型，同时进行了严格的消融研究，以证明这种特定方法可以准确地检测完全协作文本中AI文本的范围。我们的所有源代码和处理后的数据集都可以在我们的GitHub存储库中找到。

更新时间: 2025-09-23 03:46:06

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.17830v2

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.

Updated: 2025-09-23 03:45:42

标题: ReSearch: 通过强化学习学习使用搜索进行LLMs推理

摘要: 大型语言模型（LLMs）在推理方面展现出了令人瞩目的能力，OpenAI-o1和DeepSeek-R1的成功就是一个例证。然而，将推理与外部搜索过程整合仍然具有挑战性，特别是对于需要多个检索步骤的复杂多跳问题。我们提出了一种新颖的框架ReSearch，通过强化学习训练LLMs进行搜索推理，而不使用任何关于推理步骤的监督数据。我们的方法将搜索操作视为推理链的组成部分，搜索的时间和方式由基于文本的思考引导，搜索结果随后影响进一步的推理。我们在Qwen2.5-7B（-Instruct）和Qwen2.5-32B（-Instruct）模型上训练ReSearch并进行了大量实验。尽管只在一个数据集上进行了训练，我们的模型在各种基准测试中展现出了强大的泛化能力。分析表明，ReSearch在强化学习过程中自然引发了高级推理能力，如反思和自我纠正。

更新时间: 2025-09-23 03:45:42

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2503.19470v3

Gender and Political Bias in Large Language Models: A Demonstration Platform

We present ParlAI Vote, an interactive system for exploring European Parliament debates and votes, and for testing LLMs on vote prediction and bias analysis. This platform connects debate topics, speeches, and roll-call outcomes, and includes rich demographic data such as gender, age, country, and political group. Users can browse debates, inspect linked speeches, compare real voting outcomes with predictions from frontier LLMs, and view error breakdowns by demographic group. Visualizing the EuroParlVote benchmark and its core tasks of gender classification and vote prediction, ParlAI Vote highlights systematic performance bias in state-of-the-art LLMs. The system unifies data, models, and visual analytics in a single interface, lowering the barrier for reproducing findings, auditing behavior, and running counterfactual scenarios. It supports research, education, and public engagement with legislative decision-making, while making clear both the strengths and the limitations of current LLMs in political analysis.

Updated: 2025-09-23 03:43:30

标题: 大型语言模型中的性别和政治偏见：一个演示平台

摘要: 我们介绍ParlAI Vote，这是一个交互式系统，用于探索欧洲议会的辩论和投票，并测试LLMs在投票预测和偏见分析方面的表现。该平台连接辩论主题、演讲和点名投票结果，并包括丰富的人口统计数据，如性别、年龄、国家和政治团体。用户可以浏览辩论，检查链接的演讲，比较真实投票结果与前沿LLMs的预测，并按人口群体查看错误分布。通过可视化EuroParlVote基准和其核心任务的性别分类和投票预测，ParlAI Vote突显了最先进LLMs中的系统性能偏见。该系统将数据、模型和视觉分析统一在一个界面中，降低了重现研究结果、审计行为和运行反事实场景的障碍。它支持对立法决策进行研究、教育和公众参与，同时清晰地展示了当前LLMs在政治分析中的优势和局限性。

更新时间: 2025-09-23 03:43:30

领域: cs.CL,cs.AI,cs.HC,cs.LG

下载: http://arxiv.org/abs/2509.16264v2

OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, by multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO {OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2--3 orders of magnitude less training data using a small base VLM on modest hardware.

Updated: 2025-09-23 03:42:26

标题: OraPO：针对数据有效和准确的放射学报告生成的Oracle教育的强化学习

摘要: 放射学报告生成（RRG）旨在从胸部X射线图像自动生成临床上忠实的报告。目前的工作通常遵循一个基于规模的范式，通过在大型配对语料库和超大型骨干上进行多阶段训练，使流程高度依赖于数据和计算资源。在本文中，我们提出了基于Oracle教育的GRPO（OraPO），并采用基于FactScore的奖励（FactS）来解决在受限制的预算下进行RRG任务。OraPO通过将对罕见或困难研究的失败GRPO探索转化为通过轻量级Oracle步骤的直接偏好监督，实现了单阶段、仅基于RL的训练。FactS通过提取原子临床事实并检查是否符合基准标签，将学习基础置于诊断证据中，并产生密集、可解释的句子级奖励。OraPO和FactS共同创建了一个紧凑而强大的框架，显著提高了在临床挑战性案例上的学习效率，使用小型基础VLM在适度硬件上，使用2-3个数量级更少的训练数据，在CheXpert Plus数据集上取得了新的SOTA性能（F1值为0.341）。

更新时间: 2025-09-23 03:42:26

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.18600v1

OmniFed: A Modular Framework for Configurable Federated Learning from Edge to HPC

Federated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL deployment across heterogeneous environments. Github repository is available at https://github.com/at-aaims/OmniFed.

Updated: 2025-09-23 03:40:22

标题: OmniFed：一种从边缘到高性能计算的可配置联邦学习的模块化框架

摘要: 联邦学习（FL）对于边缘和高性能计算（HPC）至关重要，其中数据不集中且隐私至关重要。我们提出了OmniFed，一个围绕解耦和明确分离关注点的模块化框架，用于配置、编排、通信和训练逻辑。其架构支持基于配置的原型设计和代码级别的覆盖你所需的定制化。我们还支持不同的拓扑结构，单个部署内的混合通信协议，以及流行的训练算法。它还提供了包括差分隐私（DP）、同态加密（HE）和安全聚合（SA）在内的可选隐私机制，以及压缩策略。这些功能通过明确定义的扩展点公开，允许用户自定义拓扑和编排、学习逻辑，以及隐私/压缩插件，同时保持核心系统的完整性。我们评估了多个模型和算法以测量各种性能指标。通过在一个堆栈中统一拓扑配置、混合协议通信和可插拔模块，OmniFed简化了在异构环境中部署FL。Github仓库可在https://github.com/at-aaims/OmniFed找到。

更新时间: 2025-09-23 03:40:22

领域: cs.LG,cs.AI,cs.CR,cs.DC

下载: http://arxiv.org/abs/2509.19396v1

Machine-Learning Interatomic Potentials for Long-Range Systems

Machine-learning interatomic potentials have emerged as a revolutionary class of force-field models in molecular simulations, delivering quantum-mechanical accuracy at a fraction of the computational cost and enabling the simulation of large-scale systems over extended timescales. However, they often focus on modeling local environments, neglecting crucial long-range interactions. We propose a Sum-of-Gaussians Neural Network (SOG-Net), a lightweight and versatile framework for integrating long-range interactions into machine learning force field. The SOG-Net employs a latent-variable learning network that seamlessly bridges short-range and long-range components, coupled with an efficient Fourier convolution layer that incorporates long-range effects. By learning sum-of-Gaussians multipliers across different convolution layers, the SOG-Net adaptively captures diverse long-range decay behaviors while maintaining close-to-linear computational complexity during training and simulation via non-uniform fast Fourier transforms. The method is demonstrated effective for a broad range of long-range systems.

Updated: 2025-09-23 03:36:18

标题: 长程系统的机器学习原子间势能

摘要: 机器学习间原子势能已经成为分子模拟中的一种革命性的力场模型类别，能够以计算成本的一小部分提供量子力学精度，并能够模拟大规模系统在长时间尺度上。然而，它们通常集中于对局部环境建模，忽视了关键的长程相互作用。我们提出了一个高效且多功能的框架，用于将长程相互作用整合到机器学习力场中，即高斯和网络（SOG-Net）。SOG-Net采用了一个潜变量学习网络，能够无缝地连接短程和长程组件，同时结合了一个高效的傅里叶卷积层，可以整合长程效应。通过在不同卷积层之间学习高斯加法器，SOG-Net能够自适应地捕捉不同长程衰减行为，同时通过非均匀快速傅里叶变换在训练和模拟过程中保持接近线性的计算复杂度。该方法已被证明在广泛的长程系统中是有效的。

更新时间: 2025-09-23 03:36:18

领域: physics.chem-ph,cs.LG

下载: http://arxiv.org/abs/2502.04668v2

Q-DPTS: Quantum Differentially Private Time Series Forecasting via Variational Quantum Circuits

Time series forecasting is vital in domains where data sensitivity is paramount, such as finance and energy systems. While Differential Privacy (DP) provides theoretical guarantees to protect individual data contributions, its integration especially via DP-SGD often impairs model performance due to injected noise. In this paper, we propose Q-DPTS, a hybrid quantum-classical framework for Quantum Differentially Private Time Series Forecasting. Q-DPTS combines Variational Quantum Circuits (VQCs) with per-sample gradient clipping and Gaussian noise injection, ensuring rigorous $(\epsilon, \delta)$-differential privacy. The expressiveness of quantum models enables improved robustness against the utility loss induced by DP mechanisms. We evaluate Q-DPTS on the ETT (Electricity Transformer Temperature) dataset, a standard benchmark for long-term time series forecasting. Our approach is compared against both classical and quantum baselines, including LSTM, QASA, QRWKV, and QLSTM. Results demonstrate that Q-DPTS consistently achieves lower prediction error under the same privacy budget, indicating a favorable privacy-utility trade-off. This work presents one of the first explorations into quantum-enhanced differentially private forecasting, offering promising directions for secure and accurate time series modeling in privacy-critical scenarios.

Updated: 2025-09-23 03:31:38

标题: Q-DPTS: 通过变分量子电路实现的量子差分隐私时间序列预测

摘要: 时间序列预测在数据敏感性至关重要的领域中至关重要，例如金融和能源系统。虽然差分隐私（DP）提供了理论保证来保护个体数据贡献，但其特别是通过DP-SGD的集成常常会由于注入的噪声而损害模型性能。在本文中，我们提出了Q-DPTS，这是一个用于量子差分隐私时间序列预测的混合量子-经典框架。Q-DPTS将变分量子电路（VQCs）与逐样本梯度剪裁和高斯噪声注入相结合，确保严格的（ε，δ）-差分隐私。量子模型的表达能力使其对DP机制引起的实用性损失具有改进的鲁棒性。我们在ETT（电力变压器温度）数据集上评估了Q-DPTS，这是一个用于长期时间序列预测的标准基准。我们的方法与经典和量子基线进行了比较，包括LSTM、QASA、QRWKV和QLSTM。结果表明，在相同的隐私预算下，Q-DPTS始终实现更低的预测误差，表明有利的隐私-实用性权衡。这项工作是对量子增强差分隐私预测的首次探索之一，为隐私关键场景中安全准确的时间序列建模提供了有希望的方向。

更新时间: 2025-09-23 03:31:38

领域: quant-ph,cs.CR,cs.LG,eess.SP

下载: http://arxiv.org/abs/2508.05036v2

Training Language Model Agents to Find Vulnerabilities with CTF-Dojo

Large language models (LLMs) have demonstrated exceptional capabilities when trained within executable runtime environments, notably excelling at software engineering tasks through verified feedback loops. Yet, scalable and generalizable execution-grounded environments remain scarce, limiting progress in training more capable ML agents. We introduce CTF-Dojo, the first large-scale executable runtime tailored for training LLMs with verifiable feedback, featuring 658 fully functional Capture-The-Flag (CTF)-style challenges containerized in Docker with guaranteed reproducibility. To enable rapid scaling without manual intervention, we develop CTF-Forge, an automated pipeline that transforms publicly available artifacts into ready-to-use execution environments in minutes, eliminating weeks of expert configuration traditionally required. We trained LLM-based agents on just 486 high-quality, execution-verified trajectories from CTF-Dojo, achieving up to 11.6% absolute gains over strong baselines across three competitive benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best-performing 32B model reaches 31.9% Pass@1, establishing a new open-weight state-of-the-art that rivals frontier models like DeepSeek-V3-0324 and Gemini-2.5-Flash. By framing CTF-style tasks as a benchmark for executable-agent learning, CTF-Dojo demonstrates that execution-grounded training signals are not only effective but pivotal in advancing high-performance ML agents without dependence on costly proprietary systems.

Updated: 2025-09-23 03:30:33

标题: 训练语言模型代理以使用CTF-Dojo找到漏洞

摘要: 大型语言模型（LLMs）在可执行运行环境内训练时表现出非凡的能力，尤其在软件工程任务中通过经过验证的反馈循环表现出色。然而，可扩展且具有普遍适用性的执行基础环境仍然稀缺，限制了训练更有能力的ML代理的进展。我们介绍了CTF-Dojo，这是第一个专门用于训练LLMs的大规模可执行运行时环境，具有可验证反馈，其中包含了658个完全功能的Docker容器化的Capture-The-Flag（CTF）风格挑战，具有保证的可重现性。为了在没有手动干预的情况下实现快速扩展，我们开发了CTF-Forge，这是一个自动化流水线，可以将公开可用的工件转化为几分钟内可用的执行环境，消除了传统上需要数周的专家配置。我们在CTF-Dojo中只使用了486个高质量的、经过验证的轨迹对基于LLM的代理进行训练，在三个竞争基准测试中实现了高达11.6%的绝对增益，超过了强基线。我们的表现最佳的32B模型达到了31.9%的Pass@1，确立了一种新的开放权重最新技术，可以与DeepSeek-V3-0324和Gemini-2.5-Flash等前沿模型相媲美。通过将CTF风格任务作为可执行代理学习的基准，CTF-Dojo展示了执行基础训练信号不仅有效而且至关重要，可以在不依赖昂贵专有系统的情况下推动高性能ML代理的发展。

更新时间: 2025-09-23 03:30:33

领域: cs.SE,cs.CL,cs.CR,cs.LG

下载: http://arxiv.org/abs/2508.18370v2

Benchmarking Quantum and Classical Sequential Models for Urban Telecommunication Forecasting

In this study, we evaluate the performance of classical and quantum-inspired sequential models in forecasting univariate time series of incoming SMS activity (SMS-in) using the Milan Telecommunication Activity Dataset. Due to data completeness limitations, we focus exclusively on the SMS-in signal for each spatial grid cell. We compare five models, LSTM (baseline), Quantum LSTM (QLSTM), Quantum Adaptive Self-Attention (QASA), Quantum Receptance Weighted Key-Value (QRWKV), and Quantum Fast Weight Programmers (QFWP), under varying input sequence lengths (4, 8, 12, 16, 32 and 64). All models are trained to predict the next 10-minute SMS-in value based solely on historical values within a given sequence window. Our findings indicate that different models exhibit varying sensitivities to sequence length, suggesting that quantum enhancements are not universally advantageous. Rather, the effectiveness of quantum modules is highly dependent on the specific task and architectural design, reflecting inherent trade-offs among model size, parameterization strategies, and temporal modeling capabilities.

Updated: 2025-09-23 03:27:39

标题: 基准量子和经典顺序模型用于城市电信预测

摘要: 在这项研究中，我们评估了经典和受量子启发的顺序模型在使用米兰电信活动数据集预测单变量时间序列的短信活动（短信进）表现。由于数据完整性的限制，我们专注于每个空间网格单元的短信进信号。我们比较了五种模型，即LSTM（基准）、量子LSTM（QLSTM）、量子自适应自注意力（QASA）、量子接受加权键值（QRWKV）和量子快速权重编程器（QFWP），在不同的输入序列长度（4、8、12、16、32和64）下进行比较。所有模型都是根据给定序列窗口内的历史值来训练，预测下一个10分钟的短信进值。我们的研究结果表明，不同的模型对序列长度具有不同的敏感性，表明量子增强并非普遍有利。相反，量子模块的有效性高度依赖于特定任务和架构设计，反映了模型大小、参数化策略和时间建模能力之间固有的权衡。

更新时间: 2025-09-23 03:27:39

领域: quant-ph,cs.AI

下载: http://arxiv.org/abs/2508.04488v2

VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation

Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/.

Updated: 2025-09-23 03:23:03

标题: VLN-Zero：零样本转移机器人导航中的快速探索和缓存启用的神经符号视觉-语言规划

摘要: 在未知环境中的快速适应对于可扩展的现实世界自主性至关重要，然而现有方法依赖于详尽的探索或刚性导航策略，无法泛化。我们提出了VLN-Zero，这是一个利用视觉-语言模型构建符号场景图并实现零阶神经符号导航的两阶段视觉-语言导航框架。在探索阶段，结构化提示引导VLM基于搜索朝向信息丰富且多样的轨迹，产生紧凑的场景图表示。在部署阶段，一个神经符号规划器根据场景图和环境观察生成可执行计划，同时一个启用缓存的执行模块通过重复使用先前计算的任务位置轨迹加速适应。通过结合快速探索、符号推理和启用缓存的执行，提出的框架克服了先前视觉-语言导航方法的计算效率低和泛化能力差的问题，实现了在未知环境中健壮且可扩展的决策制定。与最先进的零阶模型相比，VLN-Zero的成功率提高了2倍，优于大多数微调基线，并在相对于最先进模型跨多种环境平均减少了55%的VLM调用次数的情况下，将目标位置达到时间减半。VLN-Zero的代码库、数据集和视频可在https://vln-zero.github.io/上找到。

更新时间: 2025-09-23 03:23:03

领域: cs.RO,cs.AI,cs.CV,cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2509.18592v1

Spectraformer: A Unified Random Feature Framework for Transformer

Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods used a subset of combinations of component functions and weight matrices within the random feature paradigm. We identify the need for a systematic comparison of different combinations of weight matrices and component functions for attention learning in Transformer. Hence, we introduce Spectraformer, a unified framework for approximating and learning the kernel function in the attention mechanism of the Transformer. Our empirical results demonstrate, for the first time, that a random feature-based approach can achieve performance comparable to top-performing sparse and low-rank methods on the challenging Long Range Arena benchmark. Thus, we establish a new state-of-the-art for random feature-based efficient Transformers. The framework also produces many variants that offer different advantages in accuracy, training time, and memory consumption. Our code is available at: https://github.com/cruiseresearchgroup/spectraformer .

Updated: 2025-09-23 03:21:45

标题: Spectraformer：Transformer的统一随机特征框架

摘要: 使用各种核逼近和核学习技术对注意力进行线性化显示出潜力。过去的方法在随机特征范式内使用了一部分组件函数和权重矩阵的组合。我们确定需要对Transformer中注意力学习的不同权重矩阵和组件函数的组合进行系统比较。因此，我们介绍了Spectraformer，一个统一的框架，用于在Transformer的注意机制中逼近和学习核函数。我们的实证结果首次表明，基于随机特征的方法可以在具有挑战性的长距离竞技场基准上实现与表现最佳的稀疏和低秩方法可比较的性能。因此，我们为基于随机特征的高效Transformer建立了新的最新技术水平。该框架还产生许多变体，提供了不同的优势，包括准确性、训练时间和内存消耗。我们的代码可在以下网址找到：https://github.com/cruiseresearchgroup/spectraformer。

更新时间: 2025-09-23 03:21:45

领域: cs.LG

下载: http://arxiv.org/abs/2405.15310v5

Class-wise Balancing Data Replay for Federated Class-Incremental Learning

Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, their performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior task in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) Subsequently, to handle class imbalance across tasks, the task aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model's overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verified that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.

Updated: 2025-09-23 03:19:50

标题: 基于类别平衡的联邦类增量学习数据重放

摘要: 联邦式分类增量学习（FCIL）旨在跨多个客户端协作处理持续增加的任务。在各种方法中，数据重放已成为一种有前途的解决方案，可以通过重新引入先前任务的代表性样本来减轻遗忘。然而，它们的性能通常受到类别不平衡的限制，既在回放缓冲区内由于有限的全局感知，也在回放和新到达的类别之间。为了解决这个问题，我们提出了一种适用于FCIL的基于类别平衡的数据重放方法（FedCBDR），它采用全局协调机制进行类别级内存构建，并重新调整学习目标以减轻前述不平衡。具体而言，FedCBDR有两个关键组成部分：1）全局视角数据重放模块以隐私保护方式重建先前任务的全局表示，然后引导一个类别感知和重要性敏感的采样策略以实现平衡回放；2）随后，为了处理跨任务的类别不平衡，任务感知温度缩放模块根据任务动态自适应地调整类别和实例级别的logits温度，从而减少模型在多数类别中的过度自信，同时增强其对少数类别的敏感性。实验结果验证了FedCBDR在异构数据分布下实现了平衡的类别级采样，并在先前任务和最近任务之间的任务不平衡下提高了泛化能力，使得其在六种最先进方法中的Top-1准确率提高了2%-15%。

更新时间: 2025-09-23 03:19:50

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2507.07712v2

Automating Steering for Safe Multimodal Large Language Models

Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.

Updated: 2025-09-23 03:15:44

标题: 自动化转向以确保安全的多模态大型语言模型

摘要: 最近在多模大型语言模型（MLLMs）方面取得了进展，解锁了强大的跨模态推理能力，但也引发了新的安全关注，特别是面对对抗性多模态输入时。为了提高MLLMs在推断过程中的安全性，我们引入了一种模块化和自适应的推断时干预技术AutoSteer，而无需对基础模型进行任何微调。AutoSteer包含三个核心组件：（1）一种新颖的安全意识评分（SAS），自动识别模型内部层中最具安全相关性的区别；（2）一个经过训练的自适应安全探测器，用于估计中间表示中有毒输出的可能性；以及（3）一个轻量级的拒绝头，当检测到安全风险时选择性地干预以调节生成。在LLaVA-OV和Chameleon上进行的实验跨越多样的安全关键基准，表明AutoSteer显著降低了文本、视觉和跨模态威胁的攻击成功率（ASR），同时保持了一般能力。这些发现将AutoSteer定位为一个实用、可解释和有效的框架，用于更安全地部署多模态人工智能系统。

更新时间: 2025-09-23 03:15:44

领域: cs.CL,cs.AI,cs.IR,cs.LG,cs.MM

下载: http://arxiv.org/abs/2507.13255v3

Compressed Permutation Oracles

The analysis of quantum algorithms which query random, invertible permutations has been a long-standing challenge in cryptography. Many techniques which apply to random oracles fail, or are not known to generalize to this setting. As a result, foundational cryptographic constructions involving permutations often lack quantum security proofs. With the aim of closing this gap, we develop and prove soundness of a compressed permutation oracle. Our construction shares many of the attractive features of Zhandry's original compressed function oracle: the purification is a small list of input-output pairs which meaningfully reflect an algorithm's knowledge of the oracle. We then apply this framework to show that the Feistel construction with seven rounds is a strong quantum PRP, resolving an open question of (Zhandry, 2012). We further re-prove essentially all known quantum query lower bounds in the random permutation model, notably the collision and preimage resistance of both Sponge and Davies-Meyer, hardness of double-sided zero search and sparse predicate search, and give new lower bounds for cycle finding and the one-more problem.

Updated: 2025-09-23 03:13:48

标题: 压缩排列神谕

摘要: 对查询随机、可逆排列的量子算法进行分析一直是密码学领域的一项长期挑战。许多适用于随机预言机的技术失败了，或者不适用于这种情况。因此，涉及排列的基础密码构造通常缺乏量子安全性证明。为了弥补这一差距，我们开发并证明了一种压缩排列预言机的正确性。我们的构造与Zhandry的原始压缩函数预言机具有许多吸引人的特点：净化是一小组有意义地反映算法对预言机知识的输入-输出对。然后，我们应用这一框架表明，七轮费斯特尔构造是一个强量子PRP，解决了(Zhandry, 2012)的一个悬而未决的问题。我们进一步重新证明了随机排列模型中几乎所有已知的量子查询下界，特别是Sponge和Davies-Meyer的碰撞和预像抗性，双面零搜索和稀疏谓词搜索的困难性，并为周期查找和一个问题提供了新的下界。

更新时间: 2025-09-23 03:13:48

领域: quant-ph,cs.CR

下载: http://arxiv.org/abs/2509.18586v1

TsqLoRA: Towards Sensitivity and Quality Low-Rank Adaptation for Efficient Fine-Tuning

Fine-tuning large pre-trained models for downstream tasks has become a fundamental approach in natural language processing. Fully fine-tuning all model parameters is computationally expensive and memory-intensive, especially in resource-constrained environments. Existing parameter-efficient fine-tuning methods reduce the number of trainable parameters but typically overlook the varying sensitivity of different model layers and the importance of training data. In this work, we propose TsqLoRA, a novel method that integrates data-quality-driven selection with sensitivity-aware low-rank adaptation, consisted of two main components: a quality-aware sampling mechanism for selecting the most informative training data, and a dynamic rank allocation module that adjusts the rank of each layer based on its sensitivity to parameter updates. The experimental results demonstrate that TsqLoRA improves fine-tuning efficiency while maintaining or even improving performance on a variety of NLP tasks. Our code will be available at https://github.com/Benjamin-Ricky/TsqLoRA.

Updated: 2025-09-23 03:10:41

标题: TsqLoRA: 为了有效微调的敏感性和质量低秩适应

摘要: 将大型预训练模型微调用于下游任务已成为自然语言处理中的一种基本方法。完全微调所有模型参数在计算上昂贵且占用内存，尤其在资源受限的环境中。现有的参数高效微调方法减少了可训练参数的数量，但通常忽视了不同模型层的敏感性差异和训练数据的重要性。在本研究中，我们提出了TsqLoRA，这是一种新颖的方法，将数据质量驱动的选择与敏感性感知的低秩适应相结合，包括两个主要组件：用于选择最具信息量的训练数据的质量感知采样机制，以及根据每个层对参数更新的敏感性调整其秩的动态秩分配模块。实验结果表明，TsqLoRA提高了微调效率，同时在各种自然语言处理任务中保持或甚至改善了性能。我们的代码将在https://github.com/Benjamin-Ricky/TsqLoRA上提供。

更新时间: 2025-09-23 03:10:41

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.18585v1

Multiscale Hodge Scattering Networks for Data Analysis

We propose new scattering networks for signals measured on simplicial complexes, which we call \emph{Multiscale Hodge Scattering Networks} (MHSNs). Our construction builds on multiscale basis dictionaries on simplicial complexes -- namely, the $\kappa$-GHWT and $\kappa$-HGLET -- which we recently developed for simplices of dimension $\kappa \in \mathbb{N}$ in a given simplicial complex by generalizing the node-based Generalized Haar--Walsh Transform (GHWT) and Hierarchical Graph Laplacian Eigen Transform (HGLET). Both the $\kappa$-GHWT and the $\kappa$-HGLET form redundant sets (i.e., dictionaries) of multiscale basis vectors and the corresponding expansion coefficients of a given signal. Our MHSNs adopt a layered structure analogous to a convolutional neural network (CNN), cascading the moments of the modulus of the dictionary coefficients. The resulting features are invariant to reordering of the simplices (i.e., node permutation of the underlying graphs). Importantly, the use of multiscale basis dictionaries in our MHSNs admits a natural pooling operation -- akin to local pooling in CNNs -- that can be performed either locally or per scale. Such pooling operations are more difficult to define in traditional scattering networks based on Morlet wavelets and in geometric scattering networks based on Diffusion Wavelets. As a result, our approach extracts a rich set of descriptive yet robust features that can be combined with simple machine learning models (e.g., logistic regression or support vector machines) to achieve high-accuracy classification with far fewer trainable parameters than most modern graph neural networks require. Finally, we demonstrate the effectiveness of MHSNs on three distinct problem types: signal classification, domain (i.e., graph/simplex) classification, and molecular dynamics prediction.

Updated: 2025-09-23 03:08:43

标题: 多尺度霍奇散射网络在数据分析中的应用

摘要: 我们提出了一种新的散射网络，用于在单纯复合物上测量的信号，我们称之为多尺度霍奇散射网络（MHSNs）。我们的构建基于在单纯复合物上的多尺度基础词典，即$\kappa$-GHWT和$\kappa$-HGLET，我们最近为给定维度$\kappa \in \mathbb{N}$的单纯形发展了这些词典，通过推广基于节点的广义哈尔-瓦尔什变换（GHWT）和分层图拉普拉斯特征变换（HGLET）。$\kappa$-GHWT和$\kappa$-HGLET都形成冗余集合（即词典）的多尺度基础向量和给定信号的相应展开系数。我们的MHSNs采用类似卷积神经网络（CNN）的分层结构，串联词典系数的模的矩。结果特征不受单纯形的重新排序（即底层图的节点排列）的影响。重要的是，在我们的MHSNs中使用多尺度基础词典引入了一种自然的池化操作，类似于CNN中的局部池化，可以在本地或每个尺度上执行。这种池化操作在基于Morlet小波的传统散射网络和基于扩散小波的几何散射网络中更难定义。因此，我们的方法提取了一组描述性而又稳健的特征，可以与简单的机器学习模型（例如逻辑回归或支持向量机）结合，以比大多数现代图神经网络需要的可训练参数少得多的数量实现高精度分类。最后，我们展示了MHSNs在三种不同问题类型上的有效性：信号分类、领域（即图/单纯形）分类和分子动力学预测。

更新时间: 2025-09-23 03:08:43

领域: cs.LG,cs.NA,cs.SI,eess.SP,math.NA,stat.ML

下载: http://arxiv.org/abs/2311.10270v6

SilentStriker:Toward Stealthy Bit-Flip Attacks on Large Language Models

The rapid adoption of large language models (LLMs) in critical domains has spurred extensive research into their security issues. While input manipulation attacks (e.g., prompt injection) have been well studied, Bit-Flip Attacks (BFAs) -- which exploit hardware vulnerabilities to corrupt model parameters and cause severe performance degradation -- have received far less attention. Existing BFA methods suffer from key limitations: they fail to balance performance degradation and output naturalness, making them prone to discovery. In this paper, we introduce SilentStriker, the first stealthy bit-flip attack against LLMs that effectively degrades task performance while maintaining output naturalness. Our core contribution lies in addressing the challenge of designing effective loss functions for LLMs with variable output length and the vast output space. Unlike prior approaches that rely on output perplexity for attack loss formulation, which inevitably degrade output naturalness, we reformulate the attack objective by leveraging key output tokens as targets for suppression, enabling effective joint optimization of attack effectiveness and stealthiness. Additionally, we employ an iterative, progressive search strategy to maximize attack efficacy. Experiments show that SilentStriker significantly outperforms existing baselines, achieving successful attacks without compromising the naturalness of generated text.

Updated: 2025-09-23 03:08:31

标题: SilentStriker：针对大型语言模型的悄无声息的比特翻转攻击

摘要: 大型语言模型（LLMs）在关键领域的快速采用已经激发了对它们安全问题的广泛研究。虽然输入操纵攻击（例如，提示注入）已经得到深入研究，但利用硬件漏洞来破坏模型参数并导致严重性能下降的比特翻转攻击（BFAs）却受到了远远较少的关注。现有的BFA方法存在关键限制：它们无法平衡性能下降和输出自然度，使它们容易被发现。在本文中，我们介绍了SilentStriker，这是针对LLMs的第一个隐秘比特翻转攻击，可以有效地降低任务性能同时保持输出自然度。我们的核心贡献在于解决为具有可变输出长度和广泛输出空间的LLMs设计有效损失函数的挑战。与以往依赖输出困惑度进行攻击损失制定的方法不同，这种方法不可避免地会降低输出自然度，我们通过利用关键输出令牌作为抑制目标，重新制定攻击目标，从而实现攻击效果和隐秘性的有效联合优化。此外，我们采用迭代的渐进搜索策略来最大化攻击效能。实验证明，SilentStriker明显优于现有基线，能够成功进行攻击而不损害生成文本的自然度。

更新时间: 2025-09-23 03:08:31

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2509.17371v2

DS-Diffusion: Data Style-Guided Diffusion Model for Time-Series Generation

Diffusion models are the mainstream approach for time series generation tasks. However, existing diffusion models for time series generation require retraining the entire framework to introduce specific conditional guidance. There also exists a certain degree of distributional bias between the generated data and the real data, which leads to potential model biases in downstream tasks. Additionally, the complexity of diffusion models and the latent spaces leads to an uninterpretable inference process. To address these issues, we propose the data style-guided diffusion model (DS-Diffusion). In the DS-Diffusion, a diffusion framework based on style-guided kernels is developed to avoid retraining for specific conditions. The time-information based hierarchical denoising mechanism (THD) is developed to reduce the distributional bias between the generated data and the real data. Furthermore, the generated samples can clearly indicate the data style from which they originate. We conduct comprehensive evaluations using multiple public datasets to validate our approach. Experimental results show that, compared to the state-of-the-art model such as ImagenTime, the predictive score and the discriminative score decrease by 5.56% and 61.55%, respectively. The distributional bias between the generated data and the real data is further reduced, the inference process is also more interpretable. Moreover, by eliminating the need to retrain the diffusion model, the flexibility and adaptability of the model to specific conditions are also enhanced.

Updated: 2025-09-23 03:06:39

标题: DS-Diffusion：数据样式引导的时间序列生成扩散模型

摘要: 扩散模型是时间序列生成任务的主流方法。然而，现有的时间序列生成扩散模型需要重新训练整个框架来引入特定的条件指导。此外，生成数据与真实数据之间存在一定程度的分布偏差，这导致下游任务中潜在的模型偏差。此外，扩散模型的复杂性和潜在空间导致了一个不可解释的推理过程。为了解决这些问题，我们提出了数据风格引导的扩散模型（DS-Diffusion）。在DS-Diffusion中，基于风格引导核的扩散框架被开发出来，以避免为特定条件重新训练。基于时间信息的分层去噪机制（THD）被开发出来，以减少生成数据与真实数据之间的分布偏差。此外，生成的样本可以清楚地表明它们来自哪种数据风格。我们使用多个公共数据集进行全面评估以验证我们的方法。实验结果显示，与像ImagenTime这样的最先进模型相比，预测得分和区分得分分别减少了5.56%和61.55%。生成数据与真实数据之间的分布偏差进一步减少，推理过程也更易解释。此外，通过消除重新训练扩散模型的必要性，模型对特定条件的灵活性和适应性也得到增强。

更新时间: 2025-09-23 03:06:39

领域: cs.LG

下载: http://arxiv.org/abs/2509.18584v1

Large Language Models Implicitly Learn to See and Hear Just By Reading

This paper presents a fascinating find: By training an auto-regressive LLM model on text tokens, the text model inherently develops internally an ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLM models fine-tune text LLM models to give text output conditioned on images and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.

Updated: 2025-09-23 03:02:04

标题: 大型语言模型通过阅读隐性学会看和听

摘要: 这篇论文提出了一个令人着迷的发现：通过在文本标记上训练自回归LLM模型，文本模型在内部自然地发展出理解图像和音频的能力，从而通过阅读来看和听。流行的音频和视觉LLM模型对文本LLM模型进行微调，以提供基于图像和音频嵌入的文本输出。另一方面，我们的架构接受图像补丁、音频波形或标记作为输入。它为我们提供了典型分类管道的嵌入或类别标签。我们展示了文本权重在帮助FSD-50K和GTZAN数据集的音频分类方面的普遍性。此外，我们展示了这在CIFAR-10和时尚MNIST的图像分类上的工作，以及在图像补丁上的工作。这推动了文本LLM学习强大内部电路的概念，可以通过激活不同应用程序所需的连接来利用，而不是每次都从头开始训练模型。

更新时间: 2025-09-23 03:02:04

领域: cs.CL,cs.AI,cs.CV,cs.LG,cs.SD,eess.AS

下载: http://arxiv.org/abs/2505.17091v2

Joint Memory Frequency and Computing Frequency Scaling for Energy-efficient DNN Inference

Deep neural networks (DNNs) have been widely applied in diverse applications, but the problems of high latency and energy overhead are inevitable on resource-constrained devices. To address this challenge, most researchers focus on the dynamic voltage and frequency scaling (DVFS) technique to balance the latency and energy consumption by changing the computing frequency of processors. However, the adjustment of memory frequency is usually ignored and not fully utilized to achieve efficient DNN inference, which also plays a significant role in the inference time and energy consumption. In this paper, we first investigate the impact of joint memory frequency and computing frequency scaling on the inference time and energy consumption with a model-based and data-driven method. Then by combining with the fitting parameters of different DNN models, we give a preliminary analysis for the proposed model to see the effects of adjusting memory frequency and computing frequency simultaneously. Finally, simulation results in local inference and cooperative inference cases further validate the effectiveness of jointly scaling the memory frequency and computing frequency to reduce the energy consumption of devices.

Updated: 2025-09-23 02:59:09

标题: 联合内存频率和计算频率缩放对于高能效的深度神经网络推理的影响

摘要: 深度神经网络（DNNs）已广泛应用于各种应用程序中，但在资源受限设备上，高延迟和能量开销问题不可避免。为解决这一挑战，大多数研究者关注动态电压和频率调节（DVFS）技术，通过改变处理器的计算频率来平衡延迟和能量消耗。然而，通常忽视调整内存频率，并未充分利用以实现高效的DNN推理，内存频率调整在推理时间和能量消耗中也起着重要作用。本文首先用基于模型和数据驱动方法研究了联合内存频率和计算频率调节对推理时间和能量消耗的影响。然后结合不同DNN模型的拟合参数，对提出的模型进行初步分析，以查看同时调整内存频率和计算频率的效果。最后，本地推理和协作推理情况下的模拟结果进一步验证了联合调整内存频率和计算频率以减少设备能耗的有效性。

更新时间: 2025-09-23 02:59:09

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2509.17970v2

MER-Inspector: Assessing model extraction risks from an attack-agnostic perspective

Information leakage issues in machine learning-based Web applications have attracted increasing attention. While the risk of data privacy leakage has been rigorously analyzed, the theory of model function leakage, known as Model Extraction Attacks (MEAs), has not been well studied. In this paper, we are the first to understand MEAs theoretically from an attack-agnostic perspective and to propose analytical metrics for evaluating model extraction risks. By using the Neural Tangent Kernel (NTK) theory, we formulate the linearized MEA as a regularized kernel classification problem and then derive the fidelity gap and generalization error bounds of the attack performance. Based on these theoretical analyses, we propose a new theoretical metric called Model Recovery Complexity (MRC), which measures the distance of weight changes between the victim and surrogate models to quantify risk. Additionally, we find that victim model accuracy, which shows a strong positive correlation with model extraction risk, can serve as an empirical metric. By integrating these two metrics, we propose a framework, namely Model Extraction Risk Inspector (MER-Inspector), to compare the extraction risks of models under different model architectures by utilizing relative metric values. We conduct extensive experiments on 16 model architectures and 5 datasets. The experimental results demonstrate that the proposed metrics have a high correlation with model extraction risks, and MER-Inspector can accurately compare the extraction risks of any two models with up to 89.58%.

Updated: 2025-09-23 02:57:57

标题: MER-Inspector：从攻击无关的角度评估模型提取风险

摘要: 机器学习基于Web应用程序中的信息泄漏问题引起了越来越多的关注。尽管数据隐私泄漏的风险已经得到严格分析，但模型功能泄漏理论，即模型提取攻击（MEAs），尚未得到充分研究。在本文中，我们首次从攻击不可知的角度理论上理解MEAs，并提出了用于评估模型提取风险的分析指标。通过使用神经切线核（NTK）理论，我们将线性化的MEA制定为一个正则化的核分类问题，然后推导出攻击性能的保真度差和泛化误差边界。基于这些理论分析，我们提出了一个新的理论指标称为模型恢复复杂度（MRC），用于衡量受害者和替代模型之间的权重变化距离以量化风险。此外，我们发现受害者模型准确性与模型提取风险呈强正相关，可以作为经验指标。通过整合这两个指标，我们提出了一个框架，即模型提取风险检测器（MER-Inspector），通过利用相对度量值比较不同模型架构下模型的提取风险。我们对16种模型架构和5个数据集进行了大量实验。实验结果表明，所提出的指标与模型提取风险具有很高的相关性，MER-Inspector可以准确比较任意两个模型的提取风险，准确率高达89.58%。

更新时间: 2025-09-23 02:57:57

领域: cs.CR

下载: http://arxiv.org/abs/2509.18578v1

LCMF: Lightweight Cross-Modality Mambaformer for Embodied Robotics VQA

Multimodal semantic learning plays a critical role in embodied intelligence, especially when robots perceive their surroundings, understand human instructions, and make intelligent decisions. However, the field faces technical challenges such as effective fusion of heterogeneous data and computational efficiency in resource-constrained environments. To address these challenges, this study proposes the lightweight LCMF cascaded attention framework, introducing a multi-level cross-modal parameter sharing mechanism into the Mamba module. By integrating the advantages of Cross-Attention and Selective parameter-sharing State Space Models (SSMs), the framework achieves efficient fusion of heterogeneous modalities and semantic complementary alignment. Experimental results show that LCMF surpasses existing multimodal baselines with an accuracy of 74.29% in VQA tasks and achieves competitive mid-tier performance within the distribution cluster of Large Language Model Agents (LLM Agents) in EQA video tasks. Its lightweight design achieves a 4.35-fold reduction in FLOPs relative to the average of comparable baselines while using only 166.51M parameters (image-text) and 219M parameters (video-text), providing an efficient solution for Human-Robot Interaction (HRI) applications in resource-constrained scenarios with strong multimodal decision generalization capabilities.

Updated: 2025-09-23 02:57:25

标题: LCMF：轻量级跨模态Mambaformer用于具身机器人VQA

摘要: 多模态语义学习在具身智能中起着关键作用，特别是当机器人感知周围环境、理解人类指令并做出智能决策时。然而，该领域面临技术挑战，如异构数据的有效融合和在资源受限环境中的计算效率。为解决这些挑战，本研究提出了轻量级LCMF级联注意力框架，将多级跨模态参数共享机制引入Mamba模块。通过整合跨注意力和选择性参数共享状态空间模型（SSMs）的优势，该框架实现了异构模态和语义互补对齐的高效融合。实验结果显示，在视觉问答任务中，LCMF在准确率方面超过了现有的多模态基线，达到了74.29％，在EQA视频任务中，在大型语言模型代理（LLM代理）分布集群内实现了有竞争力的中等水平性能。其轻量级设计相对于可比基线的平均值，FLOP减少了4.35倍，仅使用166.51M参数（图像-文本）和219M参数（视频-文本），为资源受限情景中的人机交互（HRI）应用提供了高效的解决方案，具有强大的多模态决策泛化能力。

更新时间: 2025-09-23 02:57:25

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2509.18576v1

Early Prediction of In-Hospital ICU Mortality Using Innovative First-Day Data: A Review

The intensive care unit (ICU) manages critically ill patients, many of whom face a high risk of mortality. Early and accurate prediction of in-hospital mortality within the first 24 hours of ICU admission is crucial for timely clinical interventions, resource optimization, and improved patient outcomes. Traditional scoring systems, while useful, often have limitations in predictive accuracy and adaptability. Objective: This review aims to systematically evaluate and benchmark innovative methodologies that leverage data available within the first day of ICU admission for predicting in-hospital mortality. We focus on advancements in machine learning, novel biomarker applications, and the integration of diverse data types.

Updated: 2025-09-23 02:57:19

标题: 早期使用创新的首日数据预测住院ICU死亡率：一项综述

摘要: 重症监护病房（ICU）管理危重病人，其中许多人面临着高死亡风险。对ICU入院后24小时内对住院死亡进行早期和准确的预测对于及时的临床干预、资源优化和改善患者预后至关重要。传统评分系统虽然有用，但在预测准确性和适应性方面常常存在局限。目标：本综述旨在系统评估和基准测试利用ICU入院首日可获得的数据来预测住院死亡的创新方法。我们关注机器学习、新型生物标志物应用和多样化数据类型的整合方面的进展。

更新时间: 2025-09-23 02:57:19

领域: cs.LG,cs.CY

下载: http://arxiv.org/abs/2505.12344v2

TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning

Table reasoning is crucial for leveraging structured data in domains such as finance, healthcare, and scientific research. While large language models (LLMs) show promise in multi-step reasoning, purely text-based methods often struggle with the complex numerical computations and fine-grained operations inherently required in this task. Tool-integrated reasoning improves computational accuracy via explicit code execution, yet existing systems frequently rely on rigid patterns, supervised imitation, and lack true autonomous adaptability. In this paper, we present TableMind, an LLM-driven table reasoning agent that (i) autonomously performs multi-turn tool invocation, (ii) writes and executes data-analyzing code in a secure sandbox environment for data analysis and precise numerical reasoning, and (iii) exhibits high-level capabilities such as planning and self-reflection to adapt strategies. To realize these capabilities, we adopt a two-stage fine-tuning paradigm built on top of a powerful pre-trained language model: supervised fine-tuning on high-quality reasoning trajectories to establish effective tool usage patterns, followed by reinforcement fine-tuning to optimize multi-objective strategies. In particular, we propose Rank-Aware Policy Optimization (RAPO), which increases the update weight of high-quality trajectories when their output probabilities are lower than those of low-quality ones, thereby guiding the model more consistently toward better and more accurate answers. Extensive experiments on several mainstream benchmarks demonstrate that TableMind achieves superior performance compared to competitive baselines, yielding substantial gains in both reasoning accuracy and computational precision.

Updated: 2025-09-23 02:57:07

标题: TableMind: 一个用于工具增强表格推理的自主程序代理

摘要: 表格推理对于在金融、医疗保健和科学研究等领域利用结构化数据至关重要。尽管大型语言模型(LLMs)在多步推理方面表现出潜力，但纯文本方法通常难以处理这一任务中固有需要的复杂数值计算和细粒度操作。工具集成推理通过显式代码执行提高计算准确性，但现有系统经常依赖于刚性模式、监督模仿，并且缺乏真正的自主适应性。在本文中，我们提出了TableMind，一种由LLM驱动的表格推理代理，能够(i)自主执行多轮工具调用，(ii)在安全的沙箱环境中编写并执行数据分析代码，进行精确的数值推理，以及(iii)展示高级能力，如规划和自我反思，以适应策略。为了实现这些功能，我们采用了建立在强大的预训练语言模型之上的两阶段微调范式：在高质量推理轨迹上进行监督微调，以建立有效的工具使用模式，然后进行强化微调，优化多目标策略。特别地，我们提出了Rank-Aware Policy Optimization (RAPO)，当高质量轨迹的输出概率低于低质量轨迹时，增加高质量轨迹的更新权重，从而更一致地引导模型朝着更好、更准确的答案方向发展。对几个主流基准进行的广泛实验表明，TableMind相对于竞争基准具有更优异的性能，提供了推理准确性和计算精度方面的实质性收益。

更新时间: 2025-09-23 02:57:07

领域: cs.AI

下载: http://arxiv.org/abs/2509.06278v2

The Ranking Blind Spot: Decision Hijacking in LLM-based Text Ranking

Large Language Models (LLMs) have demonstrated strong performance in information retrieval tasks like passage ranking. Our research examines how instruction-following capabilities in LLMs interact with multi-document comparison tasks, identifying what we term the "Ranking Blind Spot", a characteristic of LLM decision processes during comparative evaluation. We analyze how this ranking blind spot affects LLM evaluation systems through two approaches: Decision Objective Hijacking, which alters the evaluation goal in pairwise ranking systems, and Decision Criteria Hijacking, which modifies relevance standards across ranking schemes. These approaches demonstrate how content providers could potentially influence LLM-based ranking systems to affect document positioning. These attacks aim to force the LLM ranker to prefer a specific passage and rank it at the top. Malicious content providers can exploit this weakness, which helps them gain additional exposure by attacking the ranker. In our experiment, We empirically show that the proposed attacks are effective in various LLMs and can be generalized to multiple ranking schemes. We apply these attack to realistic examples to show their effectiveness. We also found stronger LLMs are more vulnerable to these attacks. Our code is available at: https://github.com/blindspotorg/RankingBlindSpot

Updated: 2025-09-23 02:56:38

标题: 排名盲点：基于LLM的文本排名中的决策劫持

摘要: 大型语言模型（LLMs）已经展示了在信息检索任务中，如段落排名中的良好性能。我们的研究考察了LLMs中的指令遵循能力如何与多文档比较任务相互作用，确定了我们所称之为“排名盲点”的特征，即LLM在比较评估过程中的决策过程。我们分析了这个排名盲点如何通过两种方法影响LLM评估系统：决策目标劫持，它改变了配对排名系统中的评估目标，以及决策标准劫持，它修改了在排名方案中的相关性标准。这些方法展示了内容提供者如何潜在地影响LLM基于排名系统以影响文档位置。这些攻击旨在强制LLM排名器偏爱特定段落并将其排名在最顶部。恶意内容提供者可以利用这个弱点，从而通过攻击排名器获得额外的曝光。在我们的实验中，我们经验证明了提出的攻击在各种LLMs中是有效的，并且可以推广到多个排名方案。我们将这些攻击应用于现实示例以展示它们的有效性。我们还发现更强大的LLMs更容易受到这些攻击。我们的代码可以在以下链接找到：https://github.com/blindspotorg/RankingBlindSpot

更新时间: 2025-09-23 02:56:38

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2509.18575v1

Interaction Topological Transformer for Multiscale Learning in Porous Materials

Porous materials exhibit vast structural diversity and support critical applications in gas storage, separations, and catalysis. However, predictive modeling remains challenging due to the multiscale nature of structure-property relationships, where performance is governed by both local chemical environments and global pore-network topology. These complexities, combined with sparse and unevenly distributed labeled data, hinder generalization across material families. We propose the Interaction Topological Transformer (ITT), a unified data-efficient framework that leverages novel interaction topology to capture materials information across multiple scales and multiple levels, including structural, elemental, atomic, and pairwise-elemental organization. ITT extracts scale-aware features that reflect both compositional and relational structure within complex porous frameworks, and integrates them through a built-in Transformer architecture that supports joint reasoning across scales. Trained using a two-stage strategy, i.e., self-supervised pretraining on 0.6 million unlabeled structures followed by supervised fine-tuning, ITT achieves state-of-the-art, accurate, and transferable predictions for adsorption, transport, and stability properties. This framework provides a principled and scalable path for learning-guided discovery in structurally and chemically diverse porous materials.

Updated: 2025-09-23 02:56:05

标题: 多孔材料中用于多尺度学习的交互式拓扑变换器

摘要: 多孔材料展现出巨大的结构多样性，并支持在气体储存、分离和催化等关键应用中的应用。然而，由于结构-性能关系的多尺度性质，预测建模仍然具有挑战性，性能受到局部化学环境和全局孔隙网络拓扑的双重影响。这些复杂性，加上稀疏且不均匀分布的标记数据，阻碍了在材料家族之间的泛化。我们提出了交互拓扑变换器（ITT），这是一个统一的数据高效框架，利用新颖的交互拓扑来捕捉多尺度和多水平的材料信息，包括结构、元素、原子和成对元素组织。ITT提取了反映复杂多孔框架中的组成和关系结构的尺度感知特征，并通过内置的Transformer架构将它们整合在一起，支持跨尺度的联合推理。通过两阶段策略进行训练，即在60万个未标记结构上进行自监督预训练，然后再进行监督微调，ITT实现了吸附、传输和稳定性等性能的最先进、准确和可转移的预测。这种框架为在结构和化学多样性的多孔材料中进行学习引导的发现提供了一个合理和可扩展的路径。

更新时间: 2025-09-23 02:56:05

领域: cs.LG,cond-mat.mtrl-sci,cs.AI

下载: http://arxiv.org/abs/2509.18573v1

Examining I2P Resilience: Effect of Centrality-based Attack

This study examines the robustness of I2P, a well-regarded anonymous and decentralized peer-to-peer network designed to ensure anonymity, confidentiality, and circumvention of censorship. Unlike its more widely researched counterpart, TOR, I2P's resilience has received less scholarly attention. Employing network analysis, this research evaluates I2P's susceptibility to adversarial percolation. By utilizing the degree centrality as a measure of nodes' influence in the network, the finding suggests the network is vulnerable to targeted disruptions. Before percolation, the network exhibited a density of 0.01065443 and an average path length of 6.842194. At the end of the percolation process, the density decreased by approximately 10%, and the average path length increased by 33%, indicating a decline in efficiency and connectivity. These results highlight that even decentralized networks, such as I2P, exhibit structural fragility under targeted attacks, emphasizing the need for improved design strategies to enhance resilience against adversarial disruptions.

Updated: 2025-09-23 02:54:14

标题: 研究I2P的弹性：基于中心性攻击的影响

摘要: 这项研究探讨了I2P的稳健性，I2P是一个备受好评的匿名和去中心化的点对点网络，旨在确保匿名性、保密性和规避审查。与更广为人知的TOR相比，I2P的弹性受到的学术关注较少。通过网络分析，本研究评估了I2P对敌对滲透的敏感性。通过利用节点在网络中的度中心性作为衡量节点影响力的指标，研究结果表明网络容易受到有针对性的干扰。在滲透之前，网络的密度为0.01065443，平均路径长度为6.842194。在滲透过程结束时，密度降低了约10％，平均路径长度增加了33％，表明效率和连通性下降。这些结果突显出，即使像I2P这样的去中心化网络也会在受到有针对性攻击时表现出结构脆弱性，强调了改进设计策略以增强对敌对干扰的弹性的必要性。

更新时间: 2025-09-23 02:54:14

领域: cs.CR

下载: http://arxiv.org/abs/2509.18572v1

Explore the Reinforcement Learning for the LLM based ASR and TTS system

In recent years, large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. While reinforcement learning (RL) has significantly enhanced LLM performance in text-based tasks, its application to ASR and TTS remains underexplored due to the complexity of training audio-based models. In this study, we propose a lightweight RL framework tailored for audio-based LLMs that can process audio inputs and generate audio outputs. Based on this framework, we evaluate the effectiveness of reinforcement learning on both ASR and TTS tasks. For the ASR task, we experiment with different rule-based reward functions within the Group Relative Policy Optimization (GRPO) framework and investigate the impact of RL data construction. For the TTS task, we compare GRPO with Differentiable Reward Optimization (DiffRO) and further combine the two approaches to achieve improved performance. Our experiments demonstrate that RL can significantly enhance the performance of both ASR and TTS systems, even with limited training data and a small number of optimization steps.

Updated: 2025-09-23 02:52:54

标题: 探索基于LLM的ASR和TTS系统的强化学习

摘要: 近年来，大型语言模型（LLMs）在自动语音识别（ASR）和文本转语音（TTS）系统中发挥了重要作用。虽然强化学习（RL）在基于文本的任务中显著提升了LLM的性能，但由于训练基于音频的模型的复杂性，其在ASR和TTS方面的应用仍未得到充分开发。在本研究中，我们提出了一个专门针对基于音频的LLMs的轻量级RL框架，可以处理音频输入并生成音频输出。基于这个框架，我们评估了强化学习在ASR和TTS任务中的有效性。对于ASR任务，我们尝试了在Group Relative Policy Optimization（GRPO）框架中使用不同基于规则的奖励函数，并研究了RL数据构建的影响。对于TTS任务，我们将GRPO与Differentiable Reward Optimization（DiffRO）进行了比较，并进一步结合这两种方法以实现改进的性能。我们的实验表明，即使训练数据有限且优化步骤数量少，RL也可以显著提升ASR和TTS系统的性能。

更新时间: 2025-09-23 02:52:54

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2509.18569v1

Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries

To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets, models, and prompt templates, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both Token Lens, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs' internal components interact with different input tokens to support complex factual recall. Code is available at https://github.com/Lorenayannnnn/how-lms-answer-one-to-many-factual-queries.

Updated: 2025-09-23 02:52:22

标题: 促进、抑制、迭代：语言模型如何回答一对多的事实查询

摘要: 为了回答一对多的事实查询（例如列出一个国家的城市），语言模型（LM）必须同时回忆知识并避免重复先前的答案。这两个子任务是如何在内部实施和集成的？在多个数据集、模型和提示模板中，我们识别出一个先促进后抑制的机制：模型首先回忆所有答案，然后抑制先前生成的答案。具体地，LMs使用主题和先前答案标记来执行知识回忆，注意力传播主题信息，MLPs促进答案。然后，注意力关注和抑制先前的答案标记，而MLPs增强抑制信号。我们的机制得到了大量实验证据的支持：除了使用早期解码和因果追踪，我们还通过引入Token Lens（从指定的标记解码聚合的注意力更新）和一种淘汰方法（分析去除对指定标记的注意后MLP输出的变化）分析了组件如何使用不同标记。总体上，我们提供了关于LMs的内部组件如何与不同的输入标记互动以支持复杂事实回忆的新见解。代码可在https://github.com/Lorenayannnnn/how-lms-answer-one-to-many-factual-queries找到。

更新时间: 2025-09-23 02:52:22

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.20475v3

Explainable Graph Neural Networks: Understanding Brain Connectivity and Biomarkers in Dementia

Dementia is a progressive neurodegenerative disorder with multiple etiologies, including Alzheimer's disease, Parkinson's disease, frontotemporal dementia, and vascular dementia. Its clinical and biological heterogeneity makes diagnosis and subtype differentiation highly challenging. Graph Neural Networks (GNNs) have recently shown strong potential in modeling brain connectivity, but their limited robustness, data scarcity, and lack of interpretability constrain clinical adoption. Explainable Graph Neural Networks (XGNNs) have emerged to address these barriers by combining graph-based learning with interpretability, enabling the identification of disease-relevant biomarkers, analysis of brain network disruptions, and provision of transparent insights for clinicians. This paper presents the first comprehensive review dedicated to XGNNs in dementia research. We examine their applications across Alzheimer's disease, Parkinson's disease, mild cognitive impairment, and multi-disease diagnosis. A taxonomy of explainability methods tailored for dementia-related tasks is introduced, alongside comparisons of existing models in clinical scenarios. We also highlight challenges such as limited generalizability, underexplored domains, and the integration of Large Language Models (LLMs) for early detection. By outlining both progress and open problems, this review aims to guide future work toward trustworthy, clinically meaningful, and scalable use of XGNNs in dementia research.

Updated: 2025-09-23 02:52:00

标题: 可解释的图神经网络：理解痴呆症中的脑连接和生物标志物

摘要: 痴呆症是一种渐进性神经退行性疾病，具有多种病因，包括阿尔茨海默病、帕金森病、额颞叶痴呆和血管性痴呆。其临床和生物学的异质性使得诊断和亚型区分极具挑战性。图神经网络（GNNs）最近展现出在模拟脑连接方面的强大潜力，但其有限的稳健性、数据稀缺性和缺乏可解释性限制了临床应用。可解释的图神经网络（XGNNs）已经出现，以解决这些障碍，通过将基于图的学习与可解释性相结合，实现疾病相关生物标志物的识别、脑网络紊乱的分析，并为临床医生提供透明的见解。本文是专门针对痴呆研究中的XGNNs的首次全面回顾。我们研究了它们在阿尔茨海默病、帕金森病、轻度认知障碍和多疾病诊断中的应用。介绍了针对与痴呆相关任务定制的可解释方法分类，以及在临床场景中现有模型的比较。我们还强调了一些挑战，如有限的泛化能力、未开发的领域以及用于早期检测的大型语言模型（LLMs）的整合。通过概述进展和存在的问题，本文旨在引导未来的工作朝着在痴呆研究中信任、临床意义和可扩展性的使用方面的方向。

更新时间: 2025-09-23 02:52:00

领域: cs.LG

下载: http://arxiv.org/abs/2509.18568v1

Structure-prior Informed Diffusion Model for Graph Source Localization with Limited Data

Source localization in graph information propagation is essential for mitigating network disruptions, including misinformation spread, cyber threats, and infrastructure failures. Existing deep generative approaches face significant challenges in real-world applications due to limited propagation data availability. We present SIDSL (\textbf{S}tructure-prior \textbf{I}nformed \textbf{D}iffusion model for \textbf{S}ource \textbf{L}ocalization), a generative diffusion framework that leverages topology-aware priors to enable robust source localization with limited data. SIDSL addresses three key challenges: unknown propagation patterns through structure-based source estimations via graph label propagation, complex topology-propagation relationships via a propagation-enhanced conditional denoiser with GNN-parameterized label propagation module, and class imbalance through structure-prior biased diffusion initialization. By learning pattern-invariant features from synthetic data generated by established propagation models, SIDSL enables effective knowledge transfer to real-world scenarios. Experimental evaluation on four real-world datasets demonstrates superior performance with 7.5-13.3\% F1 score improvements over baselines, including over 19\% improvement in few-shot and 40\% in zero-shot settings, validating the framework's effectiveness for practical source localization. Our code can be found \href{https://github.com/tsinghua-fib-lab/SIDSL}{here}.

Updated: 2025-09-23 02:50:13

标题: 有限数据情况下的图源定位结构先验信息扩散模型

摘要: 在图信息传播中进行源定位对于减轻网络中断，包括误信息传播、网络威胁和基础设施故障至关重要。由于传播数据的有限性，现有的深度生成方法在现实世界应用中面临重大挑战。我们提出了SIDSL（\textbf{S}tructure-prior \textbf{I}nformed \textbf{D}iffusion model for \textbf{S}ource \textbf{L}ocalization），这是一个生成扩散框架，利用拓扑感知先验来实现具有限数据的鲁棒源定位。SIDSL解决了三个关键挑战：通过基于结构的图标签传播估计来处理未知的传播模式，通过具有传播增强条件去噪器和GNN参数化标签传播模块来处理复杂的拓扑-传播关系，以及通过结构先验偏置扩散初始化来处理类别不平衡。通过从已建立的传播模型生成的合成数据学习模式不变特征，SIDSL能够有效地将知识转移至现实世界场景。对四个真实数据集进行的实验评估表明，与基线相比，表现更为出色，F1分数提高了7.5-13.3％，包括在少样本和零样本设置中提高了超过19％和40％，验证了该框架在实际源定位中的有效性。我们的代码可以在\href{https://github.com/tsinghua-fib-lab/SIDSL}{这里}找到。

更新时间: 2025-09-23 02:50:13

领域: cs.SI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.17928v3

MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion

Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. On MIntRec and MIntRec2.0, MVCL-DAF++ achieves new state-of-the-art results, improving rare-class recognition by +1.05\% and +4.18\% WF1, respectively. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. The source code is available at https://github.com/chr1s623/MVCL-DAF-PlusPlus.

Updated: 2025-09-23 02:50:01

标题: MVCL-DAF++: 通过基于原型意图对比对齐和粗到细动态注意力融合增强多模态意图识别

摘要: 多模态意图识别（MMIR）在噪声或稀有类别条件下存在语义基础薄弱和鲁棒性差的问题。我们提出了MVCL-DAF++，它在MVCL-DAF的基础上引入了两个关键模块：（1）基于原型的对比对齐，将实例与类别级原型对齐以增强语义一致性；（2）粗到细的注意力融合，将全局模态摘要与标记级特征进行融合，用于分层跨模态交互。在MIntRec和MIntRec2.0上，MVCL-DAF++实现了新的最先进结果，分别将稀有类别识别提高了+1.05\%和+4.18% WF1。这些结果表明原型引导学习和粗到细融合对于强大的多模态理解具有有效性。源代码可在https://github.com/chr1s623/MVCL-DAF-PlusPlus找到。

更新时间: 2025-09-23 02:50:01

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.17446v2

Solving Math Word Problems Using Estimation Verification and Equation Generation

Large Language Models (LLMs) excel at various tasks, including problem-solving and question-answering. However, LLMs often find Math Word Problems (MWPs) challenging because solving them requires a range of reasoning and mathematical abilities with which LLMs seem to struggle. Recent efforts have helped LLMs solve more complex MWPs with improved prompts. This study proposes a novel method that initially prompts an LLM to create equations from a decomposition of the question, followed by using an external symbolic equation solver to produce an answer. To ensure the accuracy of the obtained answer, inspired by an established recommendation of math teachers, the LLM is instructed to solve the MWP a second time, but this time with the objective of estimating the correct answer instead of solving it exactly. The estimation is then compared to the generated answer to verify. If verification fails, an iterative rectification process is employed to ensure the correct answer is eventually found. This approach achieves new state-of-the-art results on datasets used by prior published research on numeric and algebraic MWPs, improving the previous best results by nearly two percent on average. In addition, the approach obtains satisfactory results on trigonometric MWPs, a task not previously attempted to the authors' best knowledge. This study also introduces two new datasets, SVAMPClean and Trig300, to further advance the testing of LLMs' reasoning abilities.

Updated: 2025-09-23 02:41:39

标题: 使用估算验证和方程生成解决数学应用题

摘要: 大型语言模型（LLMs）在各种任务中表现出色，包括问题解决和问答。然而，LLMs通常发现数学应用问题（MWPs）具有挑战性，因为解决它们需要一系列推理和数学能力，LLMs似乎很难应对。最近的努力帮助LLMs通过改进提示来解决更复杂的MWPs。本研究提出了一种新颖方法，该方法首先促使LLM从问题的分解中创建方程式，然后使用外部符号方程求解器生成答案。为确保获得的答案准确性，受数学教师的建议启发，LLM被指示再次解决MWP，但这次的目标是估计正确答案而不是精确解决。然后将估计值与生成的答案进行比较以进行验证。如果验证失败，将采用迭代校正过程以确保最终找到正确答案。这种方法在以前已发表的数值和代数MWPs研究中使用的数据集上取得了新的最新成果，平均改进了近两个百分点的最佳结果。此外，该方法在三角函数MWPs上获得了令人满意的结果，这是作者们迄今为止尚未尝试过的任务。本研究还引入了两个新数据集，SVAMPClean和Trig300，以进一步推动LLMs推理能力的测试。

更新时间: 2025-09-23 02:41:39

领域: cs.AI

下载: http://arxiv.org/abs/2509.18565v1

Privacy-Aware In-Context Learning for Large Language Models

Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility.

Updated: 2025-09-23 02:40:24

标题: 隐私感知的大规模语言模型上下文学习

摘要: 大型语言模型（LLMs）显著改变了自然语言理解和生成，但由于潜在敏感信息的曝露，引起了隐私担忧。研究已经强调了信息泄露的风险，即对手可以从提示中提取嵌入的敏感信息。在这项工作中，我们介绍了一个新颖的私密预测框架，用于生成具有强隐私保证的高质量合成文本。我们的方法利用差分隐私（DP）框架，确保信息泄露的最坏情况理论界限，而无需对基础模型进行任何微调。所提出的方法对私密记录进行推理，并聚合生成的每个令牌输出分布。这使得生成更长、更连贯的合成文本同时保持隐私保证。此外，我们提出了一个简单的混合操作，将私密和公共推理结合起来，进一步增强效用。经验评估表明，我们的方法在上下文学习（ICL）任务上优于先前的最先进方法，使其成为隐私保护文本生成的有前途的方向，同时保持高效用。

更新时间: 2025-09-23 02:40:24

领域: cs.LG,cs.CL,cs.CR

下载: http://arxiv.org/abs/2509.13625v3

CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection

Chinese Patronizing and Condescending Language (CPCL) is an implicitly discriminatory toxic speech targeting vulnerable groups on Chinese video platforms. The existing dataset lacks user comments, which are a direct reflection of video content. This undermines the model's understanding of video content and results in the failure to detect some CPLC videos. To make up for this loss, this research reconstructs a new dataset PCLMMPLUS that includes 103k comment entries and expands the dataset size. We also propose the CPCLDetector model with alignment selection and knowledge-enhanced comment content modules. Extensive experiments show the proposed CPCLDetector outperforms the SOTA on PCLMM and achieves higher performance on PCLMMPLUS . CPLC videos are detected more accurately, supporting content governance and protecting vulnerable groups. Code and dataset are available at https://github.com/jiaxunyang256/PCLD.

Updated: 2025-09-23 02:38:49

标题: CPCLDETECTOR：用于中文居高临下语言检测的知识增强和对齐选择

摘要: 中国的居高临下和傲慢语言（CPCL）是一种针对中国视频平台上脆弱群体的隐性歧视性有害言论。现有数据集缺乏用户评论，这些评论直接反映了视频内容。这削弱了模型对视频内容的理解，导致无法检测到一些CPLC视频。为弥补这一损失，本研究重建了一个新的数据集PCLMMPLUS，包括103k条评论条目，并扩大了数据集大小。我们还提出了CPCLDetector模型，包括对齐选择和知识增强评论内容模块。大量实验证明所提出的CPCLDetector在PCLMM上表现优于SOTA，并在PCLMMPLUS上实现了更高的性能。CPLC视频被更准确地检测出来，支持内容治理并保护脆弱群体。代码和数据集可在https://github.com/jiaxunyang256/PCLD获得。

更新时间: 2025-09-23 02:38:49

领域: cs.MM,cs.AI

下载: http://arxiv.org/abs/2509.18562v1

SoundCompass: Navigating Target Sound Extraction With Effective Directional Clue Integration In Complex Acoustic Scenes

Recent advances in target sound extraction (TSE) utilize directional clues derived from direction of arrival (DoA), which represent an inherent spatial property of sound available in any acoustic scene. However, previous DoA-based methods rely on hand-crafted features or discrete encodings, which lose fine-grained spatial information and limit adaptability. We propose SoundCompass, an effective directional clue integration framework centered on a Spectral Pairwise INteraction (SPIN) module that captures cross-channel spatial correlations in the complex spectrogram domain to preserve full spatial information in multichannel signals. The input feature expressed in terms of spatial correlations is fused with a DoA clue represented as spherical harmonics (SH) encoding. The fusion is carried out across overlapping frequency subbands, inheriting the benefits reported in the previous band-split architectures. We also incorporate the iterative refinement strategy, chain-of-inference (CoI), in the TSE framework, which recursively fuses DoA with sound event activation estimated from the previous inference stage. Experiments demonstrate that SoundCompass, combining SPIN, SH embedding, and CoI, robustly extracts target sources across diverse signal classes and spatial configurations.

Updated: 2025-09-23 02:36:39

标题: 声音指南针：在复杂声学场景中通过有效方向线索整合导航目标声音提取

摘要: 最近在目标声音提取（TSE）方面取得了进展，利用从到达方向（DoA）派生的定向线索，这些线索代表声音在任何声学场景中可用的固有空间特性。然而，先前基于DoA的方法依赖于手工制作的特征或离散编码，这些方法会丢失细粒度的空间信息并限制适应性。我们提出了SoundCompass，这是一个有效的定向线索集成框架，以Spectral Pairwise Interaction（SPIN）模块为中心，该模块在复杂的频谱图域中捕获跨通道的空间相关性，以保留多通道信号中的完整空间信息。以空间相关性表示的输入特征与表示为球谐编码（SH）的DoA线索进行融合。在重叠频率子带上进行融合，继承了先前频带分割架构中报道的优点。我们还在TSE框架中加入了迭代精细化策略，即推理链（CoI），该策略递归地将DoA与从上一推理阶段估计的声音事件激活进行融合。实验证明，结合SPIN、SH嵌入和CoI的SoundCompass能够稳健地提取不同信号类别和空间配置中的目标源。

更新时间: 2025-09-23 02:36:39

领域: eess.AS,cs.AI,cs.SD

下载: http://arxiv.org/abs/2509.18561v1

LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs

Compared to traditional models, agentic AI represents a highly valuable target for potential attackers as they possess privileged access to data sources and API tools, which are traditionally not incorporated into classical agents. Unlike a typical software application residing in a Demilitarized Zone (DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI (only defining a final goal, leaving the path selection to LLM). This characteristic introduces substantial security risk to both operational security and information security. Most common existing defense mechanism rely on detection of malicious intent and preventing it from reaching the LLM agent, thus protecting against jailbreak attacks such as prompt injection. In this paper, we present an alternative approach, LLMZ+, which moves beyond traditional detection-based approaches by implementing prompt whitelisting. Through this method, only contextually appropriate and safe messages are permitted to interact with the agentic LLM. By leveraging the specificity of context, LLMZ+ guarantees that all exchanges between external users and the LLM conform to predefined use cases and operational boundaries. Our approach streamlines the security framework, enhances its long-term resilience, and reduces the resources required for sustaining LLM information security. Our empirical evaluation demonstrates that LLMZ+ provides strong resilience against the most common jailbreak prompts. At the same time, legitimate business communications are not disrupted, and authorized traffic flows seamlessly between users and the agentic LLM. We measure the effectiveness of approach using false positive and false negative rates, both of which can be reduced to 0 in our experimental setting.

Updated: 2025-09-23 02:30:14

标题: LLMZ+：自主LLM的上下文提示白名单原则

摘要: 与传统模型相比，代理式AI代表着对潜在攻击者来说是一个极具价值的目标，因为它们拥有对数据源和API工具的特权访问，这些通常不包含在经典代理中。与传统软件应用程序驻留在非军事区（DMZ）不同，代理式LLMs有意地依赖于AI的非确定性行为（仅定义最终目标，将路径选择留给LLM）。这种特征给操作安全和信息安全引入了实质性的安全风险。大多数现有的防御机制依赖于检测恶意意图并阻止其达到LLM代理，从而保护免受类似提示注入的越狱攻击。在本文中，我们提出了一种替代方法LLMZ+，它超越了传统的基于检测的方法，通过实施提示白名单。通过这种方法，只有符合上下文适当和安全的消息才被允许与代理式LLM互动。通过利用上下文的特定性，LLMZ+保证外部用户与LLM之间的所有交换都符合预定义的用例和操作边界。我们的方法简化了安全框架，增强了其长期的韧性，并减少了维持LLM信息安全所需的资源。我们的实证评估表明，LLMZ+对最常见的越狱提示提供了强大的韧性。与此同时，合法的业务通信不会受到干扰，授权的流量在用户和代理式LLM之间无缝流动。我们使用误报率和漏报率来衡量该方法的有效性，在我们的实验设置中，这两者都可以降至0。

更新时间: 2025-09-23 02:30:14

领域: cs.AI

下载: http://arxiv.org/abs/2509.18557v1

OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization's e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.

Updated: 2025-09-23 02:28:48

标题: OpenWHO：面向低资源语言健康翻译的文档级平行语料库

摘要: 在机器翻译（MT）中，健康是一个高风险领域，特点是广泛部署和领域特定的词汇。然而，在这个领域缺乏低资源语言的MT评估数据集。为了解决这一问题，我们介绍了OpenWHO，这是一个由世界卫生组织的电子学习平台中的2,978份文档和26,824个句子构成的文档级平行语料库。OpenWHO的来源是专家撰写、专业翻译的材料，免受网络爬虫的干扰，涵盖了超过20种语言，其中九种是低资源语言。利用这一新资源，我们对现代大型语言模型（LLMs）进行了评估，与传统的MT模型进行了比较。我们的研究结果显示，LLMs始终优于传统的MT模型，Gemini 2.5 Flash在我们的低资源测试集上比NLLB-54B提高了+4.79 ChrF点。此外，我们研究了LLM上下文的利用如何影响准确性，发现文档级翻译的好处在健康等专业领域最为显著。我们发布OpenWHO语料库，以鼓励对健康领域低资源MT的进一步研究。

更新时间: 2025-09-23 02:28:48

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.16048v4

Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value

The proliferation of large models has intensified the need for efficient data valuation methods to quantify the contribution of individual data providers. Traditional approaches, such as game-theory-based Shapley value and influence-function-based techniques, face prohibitive computational costs or require access to full data and model training details, making them hardly achieve partial data valuation. To address this, we propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently. By unlearning target data from a pretrained model and measuring performance shifts on a reachable test set, our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data. Crucially, Unlearning Shapley supports both full and partial data valuation, making it scalable for large models (e.g., LLMs) and practical for data markets. Experiments on benchmark datasets and large-scale text corpora demonstrate that our approach matches the accuracy of state-of-the-art methods while reducing computational overhead by orders of magnitude. Further analysis confirms a strong correlation between estimated values and the true impact of data subsets, validating its reliability in real-world scenarios. This work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.

Updated: 2025-09-23 02:26:53

标题: 失去是值得珍惜的：基于机器遗忘和Shapley值的数据估值

摘要: 大型模型的增多加剧了对高效数据估值方法的需求，以量化各个数据提供者的贡献。传统方法，如基于博弈论的 Shapley 值和基于影响函数的技术，面临着计算成本高昂或需要访问完整数据和模型训练细节的困难，使得它们难以实现部分数据估值。为了解决这个问题，我们提出了一种新颖的框架 Unlearning Shapley，利用机器去学习来高效估算数据价值。通过从预训练模型中去学习目标数据，并在可达测试集上测量性能变化，我们的方法通过蒙特卡洛采样计算 Shapley 值，避免了重新训练并消除了对完整数据的依赖。关键是，Unlearning Shapley 支持完整和部分数据估值，使其在大型模型（例如 LLMs）中可扩展，并在数据市场中实用。在基准数据集和大规模文本语料库上的实验表明，我们的方法在减少计算开销的同时与最先进方法的准确性相匹配。进一步的分析证实了估计值与数据子集的真实影响之间的强相关性，验证了其在实际场景中的可靠性。这项工作弥合了数据估值理论和实际部署之间的差距，为现代人工智能生态系统提供了一种可扩展、符合隐私的解决方案。

更新时间: 2025-09-23 02:26:53

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2505.16147v2

LLM-Guided Co-Training for Text Classification

In this paper, we introduce a novel weighted co-training approach that is guided by Large Language Models (LLMs). Namely, in our co-training approach, we use LLM labels on unlabeled data as target labels and co-train two encoder-only based networks that train each other over multiple iterations: first, all samples are forwarded through each network and historical estimates of each network's confidence in the LLM label are recorded; second, a dynamic importance weight is derived for each sample according to each network's belief in the quality of the LLM label for that sample; finally, the two networks exchange importance weights with each other -- each network back-propagates all samples weighted with the importance weights coming from its peer network and updates its own parameters. By strategically utilizing LLM-generated guidance, our approach significantly outperforms conventional SSL methods, particularly in settings with abundant unlabeled data. Empirical results show that it achieves state-of-the-art performance on 4 out of 5 benchmark datasets and ranks first among 14 compared methods according to the Friedman test. Our results highlight a new direction in semi-supervised learning -- where LLMs serve as knowledge amplifiers, enabling backbone co-training models to achieve state-of-the-art performance efficiently.

Updated: 2025-09-23 02:26:35

标题: 基于LLM的文本分类的协同训练

摘要: 在本文中，我们介绍了一种新颖的加权共训练方法，该方法由大型语言模型（LLMs）指导。换句话说，在我们的共训练方法中，我们使用LLM标签在未标记数据上作为目标标签，并共同训练两个仅编码器网络，它们在多次迭代中相互训练：首先，所有样本都通过每个网络转发，并记录每个网络对LLM标签的信心的历史估计；其次，根据每个网络对LLM标签在该样本上质量的信念得出每个样本的动态重要性权重；最后，两个网络互相交换重要性权重--每个网络将来自其对等网络的重要性权重加权反向传播所有样本，并更新自己的参数。通过策略性地利用LLM生成的指导，我们的方法在大量未标记数据的情况下明显优于传统的半监督学习方法。实证结果显示，它在5个基准数据集中有4个取得了最先进的性能，并根据Friedman检验在14个比较方法中排名第一。我们的结果突显了半监督学习的新方向--LLMs作为知识放大器，使骨干共训练模型能够高效地实现最先进的性能。

更新时间: 2025-09-23 02:26:35

领域: cs.LG

下载: http://arxiv.org/abs/2509.16516v2

Efficient Breast and Ovarian Cancer Classification via ViT-Based Preprocessing and Transfer Learning

Cancer is one of the leading health challenges for women, specifically breast and ovarian cancer. Early detection can help improve the survival rate through timely intervention and treatment. Traditional methods of detecting cancer involve manually examining mammograms, CT scans, ultrasounds, and other imaging types. However, this makes the process labor-intensive and requires the expertise of trained pathologists. Hence, making it both time-consuming and resource-intensive. In this paper, we introduce a novel vision transformer (ViT)-based method for detecting and classifying breast and ovarian cancer. We use a pre-trained ViT-Base-Patch16-224 model, which is fine-tuned for both binary and multi-class classification tasks using publicly available histopathological image datasets. Further, we use a preprocessing pipeline that converts raw histophological images into standardized PyTorch tensors, which are compatible with the ViT architecture and also help improve the model performance. We evaluated the performance of our model on two benchmark datasets: the BreakHis dataset for binary classification and the UBC-OCEAN dataset for five-class classification without any data augmentation. Our model surpasses existing CNN, ViT, and topological data analysis-based approaches in binary classification. For multi-class classification, it is evaluated against recent topological methods and demonstrates superior performance. Our study highlights the effectiveness of Vision Transformer-based transfer learning combined with efficient preprocessing in oncological diagnostics.

Updated: 2025-09-23 02:25:44

标题: 基于ViT的预处理和迁移学习的高效乳腺和卵巢癌分类

摘要: 癌症是妇女面临的主要健康挑战之一，特别是乳腺癌和卵巢癌。早期检测可以通过及时干预和治疗来提高生存率。传统的癌症检测方法涉及手动检查乳房X线摄影、CT扫描、超声波和其他成像类型。然而，这使得该过程繁重并且需要经过训练的病理学家的专业知识。因此，这既耗时又需要大量资源。在本文中，我们介绍了一种基于视觉变换器（ViT）的新方法，用于检测和分类乳腺和卵巢癌。我们使用经过预训练的ViT-Base-Patch16-224模型，对公开可用的组织病理图像数据集进行了二元和多类分类任务的微调。此外，我们使用了一种预处理流程，将原始组织病理图像转换为标准化的PyTorch张量，这些张量与ViT架构兼容，同时也有助于提高模型性能。我们在两个基准数据集上评估了我们模型的性能：BreakHis数据集用于二元分类，UBC-OCEAN数据集用于无任何数据增强的五类分类。我们的模型在二元分类中超过了现有的CNN、ViT和基于拓扑数据分析的方法。对于多类分类，它与最近的拓扑方法进行了评估，并展现了更优越的性能。我们的研究突出了基于视觉变换器的迁移学习结合有效的预处理在肿瘤诊断中的有效性。

更新时间: 2025-09-23 02:25:44

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.18553v1

Global Minimizers of Sigmoid Contrastive Loss

The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations. $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin $\mathsf{m}$ and relative bias $\mathsf{b}_{\mathsf{rel}}$. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.

Updated: 2025-09-23 02:24:23

标题: Sigmoid对比损失的全局最小化者

摘要: 通过对比预训练获取和对齐表示的元任务自引入CLIP和ALIGN以来逐渐变得重要。在本文中，我们从理论上解释了在最近的Google DeepMind的SigLIP和SigLIP2模型中以可训练的反温度和偏置同步实现的sigmoid损失的优势。温度和偏置可以驱动损失函数对我们称之为$(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-星座的丰富类配置的收敛至零。$(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-星座是与球面码相关的新颖组合对象，由边际$\mathsf{m}$和相对偏置$\mathsf{b}_{\mathsf{rel}}$参数化。我们利用星座的特征化来从理论上证明SigLIP在检索中的成功，解释SigLIP中存在的模态差距，并确定产生高质量表示所需的必要维度。最后，我们提出了一个带有明确相对偏置的sigmoid损失的重新参数化，该损失改进了对合成数据实验中的训练动态。

更新时间: 2025-09-23 02:24:23

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18552v1

SPRINT: Stochastic Performative Prediction With Variance Reduction

Performative prediction (PP) is an algorithmic framework for optimizing machine learning (ML) models where the model's deployment affects the distribution of the data it is trained on. Compared to traditional ML with fixed data, designing algorithms in PP converging to a stable point -- known as a stationary performative stable (SPS) solution -- is more challenging than the counterpart in conventional ML tasks due to the model-induced distribution shifts. While considerable efforts have been made to find SPS solutions using methods such as repeated gradient descent (RGD) and greedy stochastic gradient descent (SGD-GD), most prior studies assumed a strongly convex loss until a recent work established $O(1/\sqrt{T})$ convergence of SGD-GD to SPS solutions under smooth, non-convex losses. However, this latest progress is still based on the restricted bounded variance assumption in stochastic gradient estimates and yields convergence bounds with a non-vanishing error neighborhood that scales with the variance. This limitation motivates us to improve convergence rates and reduce error in stochastic optimization for PP, particularly in non-convex settings. Thus, we propose a new algorithm called stochastic performative prediction with variance reduction (SPRINT) and establish its convergence to an SPS solution at a rate of $O(1/T)$. Notably, the resulting error neighborhood is independent of the variance of the stochastic gradients. Experiments on multiple real datasets with non-convex models demonstrate that SPRINT outperforms SGD-GD in both convergence rate and stability.

Updated: 2025-09-23 02:15:11

标题: SPRINT：具有方差减少的随机表现性预测

摘要: 表现性预测（PP）是一种用于优化机器学习（ML）模型的算法框架，其中模型的部署会影响其所训练数据的分布。与固定数据的传统ML相比，在PP中设计收敛到稳定点（称为稳定表现稳定（SPS）解）的算法比传统ML任务中的对应部分更具挑战性，这是由于模型导致的分布转移。尽管已经做出了大量努力，使用诸如重复梯度下降（RGD）和贪婪随机梯度下降（SGD-GD）等方法寻找SPS解，但大多数先前研究假定了强凸损失，直到最近一项工作建立了SGD-GD对于光滑、非凸损失的$O(1/\sqrt{T})$收敛性到SPS解。然而，这一最新进展仍是基于随机梯度估计中受限的有界方差假设，并且得出的收敛界具有与方差成比例的非消失误差邻域。这一限制激励我们改进PP中随机优化的收敛速率并减少误差，特别是在非凸设置中。因此，我们提出了一种名为随机表现性预测与方差缩减（SPRINT）的新算法，并建立其以$O(1/T)$的速率收敛到SPS解。值得注意的是，所得到的误差邻域与随机梯度的方差无关。对多个真实数据集和非凸模型进行的实验证明，SPRINT在收敛速度和稳定性方面均优于SGD-GD。

更新时间: 2025-09-23 02:15:11

领域: cs.LG

下载: http://arxiv.org/abs/2509.17304v2

ToMA: Token Merge with Attention for Diffusion Models

Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers' quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that negate theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose Token Merge with Attention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency, with three key contributions: 1) a reformulation of token merge as a submodular optimization problem to select diverse tokens; 2) merge/unmerge as an attention-like linear transformation via GPU-friendly matrix operations; and 3) exploiting latent locality and sequential redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23%, respectively (with DINO $\Delta < 0.07$), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for transformers in diffusion.

Updated: 2025-09-23 02:10:29

标题: ToMA：使用注意力进行令牌合并以用于扩散模型

摘要: 扩散模型在高保真图像生成方面表现出色，但由于变压器的二次注意力复杂度而面临扩展性限制。像ToMeSD和ToFu这样的即插即用令牌减少方法通过合并生成图像中的冗余令牌来减少FLOPs，但依赖于GPU效率低下的操作（例如，排序，分散写入），引入了增加开销的问题，使得与优化的注意力实现（例如FlashAttention）配对时理论加速失去意义。为了弥合这一差距，我们提出了Token Merge with Attention（ToMA），这是一种即插即用方法，重新设计了令牌减少方法，以实现与GPU对齐的效率，具有三个关键贡献：1）将令牌合并重新定义为选择多样化令牌的子模块优化问题；2）通过GPU友好的矩阵运算将合并/取消合并作为类似于注意力的线性变换；和3）利用潜在的局部性和顺序冗余（模式重用）来最小化开销。ToMA分别将SDXL/Flux生成延迟降低了24％/23％（与DINO $\Delta < 0.07$），优于先前的方法。这项工作弥合了扩散变压器在理论和实际效率之间的差距。

更新时间: 2025-09-23 02:10:29

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.10918v2

Retrieval Enhanced Feedback via In-context Neural Error-book

Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE's potential for enhancing multimodal reasoning.

Updated: 2025-09-23 02:08:36

标题: 通过上下文神经错误书增强的检索反馈

摘要: 最近大型语言模型（LLMs）的进展显著提高了推理能力，其中上下文学习（ICL）作为无需重新训练的关键技术应运而生。虽然先前的研究侧重于利用正确的示例，但最近的研究强调了从错误中学习以提高性能的重要性。然而，现有方法缺乏用于分析和减轻错误的结构化框架，特别是在多模态大型语言模型（MLLMs）中，其中整合视觉和文本输入增加了复杂性。为解决这一问题，我们提出了REFINE：通过上下文神经错误书增强检索反馈的师生框架，该框架系统地构建错误并提供有针对性的反馈。REFINE引入了三个系统化查询来构建结构化反馈-- Feed-Target、Feed-Check和Feed-Path-- 通过优先考虑相关的视觉信息、诊断关键失误点和制定纠正措施，以增强多模态推理。与依赖冗余检索的先前方法不同，REFINE优化了结构化反馈检索，提高了推理效率、令牌使用和可扩展性。我们的结果表明，REFINE显著加速，降低计算成本，并成功推广，突显了其增强多模态推理的潜力。

更新时间: 2025-09-23 02:08:36

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2508.16313v4

Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To circumvent the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Llama2-Chat and Code Llama). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a single lightweight stage of router training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.

Updated: 2025-09-23 02:07:14

标题: Symphony-MoE：将不同的预训练模型和谐地融合成一种专家混合模型

摘要: 混合专家（MoE）模型通过稀疏激活大型参数集，最小化计算开销，实现可扩展性能。为了避免从头开始训练MoE的巨大成本，最近的工作采用了升级循环利用单个预训练密集模型，通过复制其前馈网络（FFN）层到专家中。然而，这限制了专家的多样性，因为所有专家都源自单个预训练的密集模型。本文通过使用来自多个具有相同架构但不同的预训练模型（例如Llama2-Chat和Code Llama）的专家来构建功能强大的MoE模型来解决这个限制。一个关键挑战在于这些源模型占据参数空间中不同的、不和谐的区域，直接升级容易导致性能严重下降。为了克服这一挑战，我们提出了Symphony-MoE，这是一个新颖的两阶段框架，旨在将这些模型协调到一个统一的专家混合体中。首先，我们以无需训练的方式建立这种和谐：通过层感知融合策略构建一个共享骨干，并且通过基于激活的功能对齐减轻专家之间的参数错位。随后，通过一个轻量级的路由器训练阶段协调整个架构。实验证明，我们的方法成功地整合了来自异构来源的专家，在多领域任务和分布外泛化方面显著超越了基线。

更新时间: 2025-09-23 02:07:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18542v1

A Chain-of-thought Reasoning Breast Ultrasound Dataset Covering All Histopathology Categories

Breast ultrasound (BUS) is an essential tool for diagnosing breast lesions, with millions of examinations per year. However, publicly available high-quality BUS benchmarks for AI development are limited in data scale and annotation richness. In this work, we present BUS-CoT, a BUS dataset for chain-of-thought (CoT) reasoning analysis, which contains 11,439 images of 10,019 lesions from 4,838 patients and covers all 99 histopathology types. To facilitate research on incentivizing CoT reasoning, we construct the reasoning processes based on observation, feature, diagnosis and pathology labels, annotated and verified by experienced experts. Moreover, by covering lesions of all histopathology types, we aim to facilitate robust AI systems in rare cases, which can be error-prone in clinical practice.

Updated: 2025-09-23 02:04:18

标题: 一个涵盖所有组织病理学类别的思维链推理乳腺超声数据集

摘要: 乳腺超声（BUS）是诊断乳腺病变的必要工具，每年有数百万次检查。然而，用于人工智能发展的高质量BUS基准数据在数据规模和注释丰富度方面受限。在这项工作中，我们提出了BUS-CoT，一个用于链式思维（CoT）推理分析的BUS数据集，包含来自4,838名患者的10,019个病变的11,439张图像，涵盖了所有99种组织病理学类型。为了促进对CoT推理的研究，我们基于观察、特征、诊断和病理学标签构建推理过程，由经验丰富的专家进行注释和验证。此外，通过涵盖所有组织病理学类型的病变，我们旨在促进罕见病例中的稳健人工智能系统，这些病例在临床实践中可能容易出现错误。

更新时间: 2025-09-23 02:04:18

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2509.17046v2

CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs

Recently, inference-time reasoning strategies have further improved the accuracy of large language models (LLMs), but their effectiveness on smaller models remains unclear. Based on the observation that conventional approaches often fail to improve performance in this context, we propose \textbf{C}ycle-\textbf{C}onsistency in \textbf{Q}uestion \textbf{A}nswering (CCQA), a novel reasoning method that can be effectively applied to SLMs. Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. Since conventional SLMs struggle to generate accurate questions from their own reasoning paths and answers, we employ a lightweight Flan-T5 model specialized for question generation to support this process efficiently. From the experimental results, it is verified that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks. Furthermore, our method establishes a new practical baseline for efficient reasoning in SLMs. Source code can be found at https://github.com/scai-research/ccqa_official.

Updated: 2025-09-23 02:01:03

标题: CCQA：从解决方案生成问题可以改善SLM中的推理时间推理

摘要: 最近，推理时间推理策略进一步提高了大型语言模型（LLMs）的准确性，但它们对较小模型的有效性尚不清楚。根据传统方法在这种情况下通常无法改善性能的观察，我们提出了循环一致性问答（CCQA）方法，这是一种新颖的推理方法，可以有效应用于SLMs。受循环一致性的启发，CCQA从每个推理路径和答案生成一个问题，通过与原始问题的相似性评估每个问题，然后选择具有最高相似性得分的候选解作为最终回应。由于传统SLMs难以从它们自己的推理路径和答案中生成准确的问题，我们采用了一个专门用于支持这一过程的轻量级Flan-T5模型。从实验结果可以验证，CCQA在数学和常识推理基准测试中始终优于现有的最先进方法。此外，我们的方法为SLMs中的高效推理建立了一个新的实用基准。源代码可在https://github.com/scai-research/ccqa_official找到。

更新时间: 2025-09-23 02:01:03

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.18536v1

DOTA: Distributional Test-Time Adaptation of Vision-Language Models

Vision-language foundation models (VLMs), such as CLIP, exhibit remarkable performance across a wide range of tasks. However, deploying these models can be unreliable when significant distribution gaps exist between training and test data, while fine-tuning for diverse scenarios is often costly. Cache-based test-time adapters offer an efficient alternative by storing representative test samples to guide subsequent classifications. Yet, these methods typically employ naive cache management with limited capacity, leading to severe catastrophic forgetting when samples are inevitably dropped during updates. In this paper, we propose DOTA (DistributiOnal Test-time Adaptation), a simple yet effective method addressing this limitation. Crucially, instead of merely memorizing individual test samples, DOTA continuously estimates the underlying distribution of the test data stream. Test-time posterior probabilities are then computed using these dynamically estimated distributions via Bayes' theorem for adaptation. This distribution-centric approach enables the model to continually learn and adapt to the deployment environment. Extensive experiments validate that DOTA significantly mitigates forgetting and achieves state-of-the-art performance compared to existing methods.

Updated: 2025-09-23 01:58:07

标题: DOTA：视觉-语言模型的分布式测试时间适应

摘要: 视觉-语言基础模型（VLMs），如CLIP，展现出在广泛任务范围内卓越的性能。然而，当训练和测试数据之间存在重大分布差距时，部署这些模型可能不可靠，而为各种场景进行微调通常代价高昂。基于缓存的测试时适配器提供了一种有效的替代方法，通过存储代表性的测试样本来引导后续分类。然而，这些方法通常采用有限容量的天真缓存管理，导致在更新期间不可避免地丢弃样本时严重的灾难性遗忘。在本文中，我们提出了一种简单而有效的方法DOTA（分布式测试时适应），以解决这一限制。关键是，DOTA不仅仅是记忆单个测试样本，而是持续估计测试数据流的基础分布。然后，使用这些动态估计的分布通过贝叶斯定理计算适应的测试时后验概率。这种以分布为中心的方法使模型能够持续学习和适应部署环境。大量实验证明，与现有方法相比，DOTA显著减轻了遗忘并实现了最先进的性能。

更新时间: 2025-09-23 01:58:07

领域: cs.LG,cs.AI,cs.CL,cs.CV,cs.HC

下载: http://arxiv.org/abs/2409.19375v2

No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for \textit{prosody}, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an \textit{iterative Direct Preference Optimization (DPO)} scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On \textbf{KoCC-TTS}, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, \textit{human preference optimization} offers a practical and data-efficient path to natural and robust TTS. The demo page is available at \href{https://tts.ch.dev}

Updated: 2025-09-23 01:51:38

标题: 无法验证的韵律奖励：走向TTS中的偏好引导韵律学习

摘要: 最近的工作报告显示，利用Group Relative Policy Optimization（GRPO）在神经文本到语音（TTS）方面取得了进展。然而，在缺乏可验证的奖励用于\textit{韵律}的情况下，基于转录导向信号（CER/NLL）训练的GRPO降低了错误率，但将韵律降至单调、不自然的语音；添加说话者相似性进一步不稳定了训练并降低了CER。我们通过一种\textit{迭代直接偏好优化（DPO）}方案来解决这个问题，每轮只使用几百个人工标记的偏好对来直接优化韵律自然度，同时对当前模型进行正则化。在一个经过策划的数据集\textbf{KoCC-TTS}上，该数据集包含捕捉任务导向对话的真实韩语呼叫中心互动，我们的方法获得了最高的人类偏好（ELO），并具有竞争性的CER，胜过了GRPO和强大的商业基线。这些结果表明，当无法自动奖励韵律时，\textit{人类偏好优化}提供了一条实用且数据高效的路径，实现自然和稳健的TTS。演示页面可在\href{https://tts.ch.dev} 上找到。

更新时间: 2025-09-23 01:51:38

领域: eess.AS,cs.AI,cs.CL,cs.SD

下载: http://arxiv.org/abs/2509.18531v1

Re-uploading quantum data: A universal function approximator for quantum inputs

Quantum data re-uploading has proved powerful for classical inputs, where repeatedly encoding features into a small circuit yields universal function approximation. Extending this idea to quantum inputs remains underexplored, as the information contained in a quantum state is not directly accessible in classical form. We propose and analyze a quantum data re-uploading architecture in which a qubit interacts sequentially with fresh copies of an arbitrary input state. The circuit can approximate any bounded continuous function using only one ancilla qubit and single-qubit measurements. By alternating entangling unitaries with mid-circuit resets of the input register, the architecture realizes a discrete cascade of completely positive and trace-preserving maps, analogous to collision models in open quantum system dynamics. Our framework provides a qubit-efficient and expressive approach to designing quantum machine learning models that operate directly on quantum data.

Updated: 2025-09-23 01:50:37

标题: 重新上传量子数据：一种用于量子输入的通用函数逼近器

摘要: 量子数据重新上传已经被证明在经典输入中非常有效，通过将特征重复编码到一个小电路中可以实现通用函数逼近。将这个想法扩展到量子输入仍然是未被充分探索的，因为量子态中包含的信息在经典形式下不是直接可访问的。我们提出并分析了一个量子数据重新上传架构，在这个架构中，一个量子比特依次与任意输入态的新副本进行交互。该电路可以使用仅一个辅助量子比特和单比特测量来逼近任何有界连续函数。通过交替使用纠缠酉算符和输入寄存器中间的重置，该架构实现了一个离散级联的完全正和保持迹的映射，类似于开放量子系统动力学中的碰撞模型。我们的框架提供了一种节省量子比特且富有表现力的方法，用于设计直接在量子数据上运行的量子机器学习模型。

更新时间: 2025-09-23 01:50:37

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2509.18530v1

Reverse-Complement Consistency for DNA Language Models

A fundamental property of DNA is that the reverse complement (RC) of a sequence often carries identical biological meaning. However, state-of-the-art DNA language models frequently fail to capture this symmetry, producing inconsistent predictions for a sequence and its RC counterpart, which undermines their reliability. In this work, we introduce Reverse-Complement Consistency Regularization (RCCR), a simple and model-agnostic fine-tuning objective that directly penalizes the divergence between a model's prediction on a sequence and the aligned prediction on its reverse complement. We evaluate RCCR across three diverse backbones (Nucleotide Transformer, HyenaDNA, DNABERT-2) on a wide range of genomic tasks, including sequence classification, scalar regression, and profile prediction. Our experiments show that RCCR substantially improves RC robustness by dramatically reducing prediction flips and errors, all while maintaining or improving task accuracy compared to baselines such as RC data augmentation and test-time averaging. By integrating a key biological prior directly into the learning process, RCCR produces a single, intrinsically robust, and computationally efficient model fine-tuning recipe for diverse biology tasks.

Updated: 2025-09-23 01:50:20

标题: DNA语言模型的逆补一致性

摘要: DNA的一个基本属性是序列的反向互补（RC）经常具有相同的生物学意义。然而，最先进的DNA语言模型经常无法捕捉到这种对称性，产生了对序列及其RC对应物的不一致预测，这削弱了它们的可靠性。在这项工作中，我们引入了反向互补一致性正则化（RCCR），这是一种简单且与模型无关的微调目标，直接惩罚模型对序列的预测与其反向互补的对齐预测之间的差异。我们在三种不同的基础模型（核苷酸变换器、HyenaDNA、DNABERT-2）上评估RCCR在广泛的基因组任务中的表现，包括序列分类、标量回归和概要预测。我们的实验表明，RCCR通过显著减少预测翻转和错误，大幅提高了RC的稳健性，同时与RC数据增强和测试时间平均等基准相比，保持或提高了任务准确性。通过直接将关键的生物学先验整合到学习过程中，RCCR为多样的生物学任务提供了一个单一、固有稳健且计算效率高的模型微调配方。

更新时间: 2025-09-23 01:50:20

领域: cs.LG,q-bio.GN

下载: http://arxiv.org/abs/2509.18529v1

FERA: Foil Fencing Referee Assistant Using Pose-Based Multi-Label Move Recognition and Rule Reasoning

The sport of fencing, like many other sports, faces challenges in refereeing: subjective calls, human errors, bias, and limited availability in practice environments. We present FERA (Fencing Referee Assistant), a prototype AI referee for foil fencing which integrates pose-based multi-label action recognition and rule-based reasoning. FERA extracts 2D joint positions from video, normalizes them, computes a 101-dimensional kinematic feature set, and applies a Transformer for multi-label move and blade classification. To determine priority and scoring, FERA applies a distilled language model with encoded right-of-way rules, producing both a decision and an explanation for each exchange. With limited hand-labeled data, a 5-fold cross-validation achieves an average macro-F1 score of 0.549, outperforming multiple baselines, including a Temporal Convolutional Network (TCN), BiLSTM, and a vanilla Transformer. While not ready for deployment, these results demonstrate a promising path towards automated referee assistance in foil fencing and new opportunities for AI applications, such as coaching in the field of fencing.

Updated: 2025-09-23 01:47:44

标题: FERA：使用基于姿势的多标签移动识别和规则推理的击剑裁判助手

摘要: 击剑运动，像许多其他运动一样，面临着裁判方面的挑战：主观判断、人为错误、偏见以及练习环境的有限可用性。我们提出了一种名为FERA（击剑裁判助手）的原型AI裁判，用于佩剑击剑，它整合了基于姿势的多标签动作识别和基于规则的推理。FERA从视频中提取2D关节位置，对其进行标准化，计算出一个101维的运动特征集，并应用Transformer进行多标签移动和刀片分类。为了确定优先权和得分，FERA应用了一个经过精炼的语言模型，其中编码了优先权规则，为每次交换产生决策和解释。在有限的手动标记数据的情况下，进行了5折交叉验证，平均宏F1得分为0.549，优于多个基线模型，包括时间卷积网络（TCN），BiLSTM和原始Transformer。尽管尚未准备部署，这些结果表明了在佩剑击剑中实现自动裁判辅助的有希望的路径，以及AI应用的新机会，例如在击剑领域进行教练。

更新时间: 2025-09-23 01:47:44

领域: cs.AI

下载: http://arxiv.org/abs/2509.18527v1

Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation

Climate change communication on social media increasingly employs microtargeting strategies to effectively reach and influence specific demographic groups. This study presents a post-hoc analysis of microtargeting practices within climate campaigns by leveraging large language models (LLMs) to examine Meta (previously known as Facebook) advertisements. Our analysis focuses on two key aspects: demographic targeting and fairness. We evaluate the ability of LLMs to accurately predict the intended demographic targets, such as gender and age group. Furthermore, we instruct the LLMs to generate explanations for their classifications, providing transparent reasoning behind each decision. These explanations reveal the specific thematic elements used to engage different demographic segments, highlighting distinct strategies tailored to various audiences. Our findings show that young adults are primarily targeted through messages emphasizing activism and environmental consciousness, while women are engaged through themes related to caregiving roles and social advocacy. Additionally, we conduct a comprehensive fairness analysis to uncover biases in model predictions. We assess disparities in accuracy and error rates across demographic groups using established fairness metrics such as Demographic Parity, Equal Opportunity, and Predictive Equality. Our findings indicate that while LLMs perform well overall, certain biases exist, particularly in the classification of male audiences. The analysis of thematic explanations uncovers recurring patterns in messaging strategies tailored to various demographic groups, while the fairness analysis underscores the need for more inclusive targeting methods. This study provides a valuable framework for future research aimed at enhancing transparency, accountability, and inclusivity in social media-driven climate campaigns.

Updated: 2025-09-23 01:45:34

标题: 社交媒体广告气候微定位的后续研究：LLMs的主题洞察和公平评估

摘要: 社交媒体上的气候变化传播越来越多地采用微定位战略，以有效地接触和影响特定的人群。本研究通过利用大型语言模型（LLMs）对Meta（以前称为Facebook）广告进行事后分析，展示了气候活动中微定位实践的情况。我们的分析侧重于两个关键方面：人口定位和公平性。我们评估LLMs准确预测意图的人口目标（如性别和年龄组）的能力。此外，我们指导LLMs生成其分类的解释，提供每个决定背后透明的推理。这些解释揭示了用于吸引不同人口群体的特定主题元素，突出了针对不同受众量身定制的不同策略。我们的研究结果表明，年轻人主要通过强调行动主义和环境意识的信息进行定位，而妇女则通过与照顾角色和社会倡导相关的主题进行参与。此外，我们进行了全面的公平性分析，以揭示模型预测中的偏见。我们使用已建立的公平性指标（如人口平等、平等机会和预测平等）评估跨人口群体的准确性和错误率的差距。我们的研究结果表明，虽然LLMs整体表现良好，但存在某些偏见，特别是在对男性受众的分类中。主题解释的分析揭示了针对不同人口群体量身定制的信息传播策略中的反复出现的模式，而公平性分析强调了对更具包容性的定位方法的需求。这项研究为未来旨在增强社交媒体驱动的气候活动的透明度、问责制和包容性的研究提供了宝贵的框架。

更新时间: 2025-09-23 01:45:34

领域: cs.CL,cs.AI,cs.CY,cs.SI

下载: http://arxiv.org/abs/2410.05401v4

Multimodal Medical Image Classification via Synergistic Learning Pre-training

Multimodal pathological images are usually in clinical diagnosis, but computer vision-based multimodal image-assisted diagnosis faces challenges with modality fusion, especially in the absence of expert-annotated data. To achieve the modality fusion in multimodal images with label scarcity, we propose a novel ``pretraining + fine-tuning" framework for multimodal semi-supervised medical image classification. Specifically, we propose a synergistic learning pretraining framework of consistency, reconstructive, and aligned learning. By treating one modality as an augmented sample of another modality, we implement a self-supervised learning pre-train, enhancing the baseline model's feature representation capability. Then, we design a fine-tuning method for multimodal fusion. During the fine-tuning stage, we set different encoders to extract features from the original modalities and provide a multimodal fusion encoder for fusion modality. In addition, we propose a distribution shift method for multimodal fusion features, which alleviates the prediction uncertainty and overfitting risks caused by the lack of labeled samples. We conduct extensive experiments on the publicly available gastroscopy image datasets Kvasir and Kvasirv2. Quantitative and qualitative results demonstrate that the proposed method outperforms the current state-of-the-art classification methods. The code will be released at: https://github.com/LQH89757/MICS.

Updated: 2025-09-23 01:40:38

标题: 多模态医学图像分类的协同学习预训练

摘要: 多模态病理图像通常在临床诊断中使用，但基于计算机视觉的多模态图像辅助诊断在模态融合方面面临挑战，尤其是在缺乏专家标注数据的情况下。为了实现多模态图像中的模态融合，我们提出了一种新颖的“预训练+微调”框架，用于多模态半监督医学图像分类。具体来说，我们提出了一种协同学习预训练框架，包括一致性、重构和对齐学习。通过将一种模态视为另一种模态的增强样本，我们实现了一种自监督学习预训练，增强了基线模型的特征表示能力。然后，我们设计了一种用于多模态融合的微调方法。在微调阶段，我们设置不同的编码器来从原始模态中提取特征，并为融合模态提供一个多模态融合编码器。此外，我们提出了一种用于多模态融合特征的分布转移方法，缓解由于缺乏标记样本而引起的预测不确定性和过拟合风险。我们在公开可用的胃镜图像数据集Kvasir和Kvasirv2上进行了广泛实验。定量和定性结果表明，所提出的方法优于当前最先进的分类方法。代码将在以下网址发布：https://github.com/LQH89757/MICS。

更新时间: 2025-09-23 01:40:38

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.17492v2

Automatic coherence-driven inference on arguments

Inconsistencies are ubiquitous in law, administration, and jurisprudence. Though a cure is too much to hope for, we propose a technological remedy. Large language models (LLMs) can accurately extract propositions from arguments and compile them into natural data structures that enable coherence-driven inference (CDI) via combinatorial optimization. This neurosymbolic architecture naturally separates concerns and enables meaningful judgments about the coherence of arguments that can inform legislative and policy analysis and legal reasoning.

Updated: 2025-09-23 01:40:14

标题: 自动的基于连贯性的论点推理

摘要: 法律、行政和法学中普遍存在不一致性。虽然希望找到一种解决方法过于奢望，但我们提出了一种技术疗法。大型语言模型（LLMs）可以准确地从论点中提取命题，并将它们编译成自然数据结构，从而通过组合优化实现基于连贯性推理（CDI）。这种神经符号体系结构自然地分离了关注点，并使对论点连贯性的有意义判断成为可能，这可以为立法和政策分析以及法律推理提供信息。

更新时间: 2025-09-23 01:40:14

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2509.18523v1

OpenLens AI: Fully Autonomous Research Agent for Health Infomatics

Health informatics research is characterized by diverse data modalities, rapid knowledge expansion, and the need to integrate insights across biomedical science, data analytics, and clinical practice. These characteristics make it particularly well-suited for agent-based approaches that can automate knowledge exploration, manage complex workflows, and generate clinically meaningful outputs. Recent progress in large language model (LLM)-based agents has demonstrated promising capabilities in literature synthesis, data analysis, and even end-to-end research execution. However, existing systems remain limited for health informatics because they lack mechanisms to interpret medical visualizations and often overlook domain-specific quality requirements. To address these gaps, we introduce OpenLens AI, a fully automated framework tailored to health informatics. OpenLens AI integrates specialized agents for literature review, data analysis, code generation, and manuscript preparation, enhanced by vision-language feedback for medical visualization and quality control for reproducibility. The framework automates the entire research pipeline, producing publication-ready LaTeX manuscripts with transparent and traceable workflows, thereby offering a domain-adapted solution for advancing health informatics research.

Updated: 2025-09-23 01:37:30

标题: OpenLens AI：用于健康信息学的全自主研究代理

摘要: 健康信息学研究以多样化的数据模态、快速的知识扩展以及需要整合生物医学科学、数据分析和临床实践的见解而闻名。这些特征使得代理基于方法尤为适合，可以自动化知识探索，管理复杂工作流程，并生成临床有意义的结果。基于大型语言模型(LLM)的代理最近取得了令人期待的进展，展现出在文献综述、数据分析甚至端到端研究执行方面的能力。然而，现有系统对健康信息学仍存在限制，因为它们缺乏解释医学可视化的机制，并经常忽视领域特定的质量要求。为了解决这些差距，我们引入了OpenLens AI，这是一个专门针对健康信息学定制的完全自动化框架。OpenLens AI整合了专门的代理用于文献综述、数据分析、代码生成和手稿准备，通过视觉语言反馈增强医学可视化和质量控制以实现可重现性。该框架自动化整个研究流程，生成适用于出版的LaTeX手稿，并提供透明和可追溯的工作流程，从而为推进健康信息学研究提供了领域适应的解决方案。

更新时间: 2025-09-23 01:37:30

领域: cs.AI,cs.MA

下载: http://arxiv.org/abs/2509.14778v2

APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation

Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community's growing RL needs, numerous RL frameworks have been proposed. Most of these frameworks primarily rely on inference engines for rollout generation and training engines for policy updates. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by at most 44% across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves at most 8% higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems.

Updated: 2025-09-23 01:32:36

标题: APRIL：在强化学习中使用主动部分展开来驯服长尾生成

摘要: 强化学习（RL）已经成为推动大规模预训练语言模型（LLMs）发展的基石。连续几代，包括GPT-o系列、DeepSeek-R1、Kimi-K1.5、Grok 4和GLM-4.5，都依赖于大规模RL训练来增强推理和编码能力。为了满足社区对RL的不断增长需求，提出了许多RL框架。这些框架大多主要依赖于推理引擎生成rollout和训练引擎更新策略。然而，RL训练仍然计算昂贵，rollout生成占总运行时间的90%以上。此外，其效率通常受到rollout响应长度长尾分布的限制，其中一些长时间响应会使整个批次停滞，导致GPU空闲和低利用率。随着模型和rollout大小的持续增长，这个瓶颈越来越限制了可扩展性。为了解决这一挑战，我们提出了强化学习中的主动部分rollouts（APRIL），以减轻长尾效率不佳问题。在rollout阶段，APRIL过度提供rollout请求，并在达到目标响应数量后终止，并将不完整的响应回收以在未来步骤中继续。这一策略确保没有rollout被丢弃，同时大大减少了GPU空闲时间。实验证明，APRIL可以在常用RL算法（GRPO、DAPO、GSPO）中将rollout吞吐量提高多达44%，加速收敛，并在任务中实现高达8%的更高最终准确度。此外，APRIL既是框架又是硬件无关的，已经集成到slime RL框架中，并且可以在NVIDIA和AMD GPU上部署。综上所述，本工作在提出APRIL时统一了系统级和算法考虑，旨在提高RL训练效率，激发RL系统中的进一步优化。

更新时间: 2025-09-23 01:32:36

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2509.18521v1

Coherence-driven inference for cybersecurity

Large language models (LLMs) can compile weighted graphs on natural language data to enable automatic coherence-driven inference (CDI) relevant to red and blue team operations in cybersecurity. This represents an early application of automatic CDI that holds near- to medium-term promise for decision-making in cybersecurity and eventually also for autonomous blue team operations.

Updated: 2025-09-23 01:32:28

标题: 基于一致性的网络安全推理

摘要: 大型语言模型（LLMs）可以编译自然语言数据上的加权图，以实现与网络安全中红队和蓝队操作相关的自动连贯推理（CDI）。这代表了自动CDI在网络安全决策中具有近期到中期前景的早期应用，并最终也适用于自主蓝队操作。

更新时间: 2025-09-23 01:32:28

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2509.18520v1

Program Synthesis via Test-Time Transduction

We introduce transductive program synthesis, a new formulation of the program synthesis task that explicitly leverages test inputs during synthesis. While prior approaches to program synthesis--whether based on natural language descriptions or input-output examples--typically aim to generalize from training examples, they often struggle with robustness, especially in real-world settings where training examples are limited and test inputs involve various edge cases. To address this, we propose a novel framework that improves robustness by treating synthesis as an active learning over a finite hypothesis class defined by programs' outputs. We use an LLM to predict outputs for selected test inputs and eliminate inconsistent hypotheses, where the inputs are chosen via a greedy maximin algorithm to minimize the number of LLM queries required. We evaluate our approach on four benchmarks: Playgol, MBPP+, 1D-ARC, and programmatic world modeling on MiniGrid. We demonstrate that our method significantly improves program synthesis in both accuracy and efficiency. We release our code at https://github.com/klee972/SYNTRA.

Updated: 2025-09-23 01:29:22

标题: 通过测试时间传导进行程序合成

摘要: 我们介绍了一种新的程序合成任务的表述形式——传导式程序合成，该形式在合成过程中明确利用测试输入。尽管先前的程序合成方法——无论是基于自然语言描述还是输入输出示例——通常旨在从训练示例中泛化，但它们在鲁棒性方面经常存在困难，特别是在训练示例有限且测试输入涉及各种边缘情况的实际环境中。为了解决这个问题，我们提出了一个新颖的框架，通过将合成视为一种有限假设类上的主动学习来提高鲁棒性，该假设类由程序的输出定义。我们使用一个LLM来预测所选测试输入的输出，并消除不一致的假设，其中输入是通过贪婪最大最小算法选择的，以最小化所需的LLM查询数量。我们在四个基准测试上评估了我们的方法：Playgol、MBPP+、1D-ARC和MiniGrid上的程序化世界建模。我们证明了我们的方法在准确性和效率方面显著提高了程序合成。我们在https://github.com/klee972/SYNTRA上发布了我们的代码。

更新时间: 2025-09-23 01:29:22

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.17393v2

HyperEvent: A Strong Baseline for Dynamic Link Prediction via Relative Structural Encoding

Learning representations for continuous-time dynamic graphs is critical for dynamic link prediction. While recent methods have become increasingly complex, the field lacks a strong and informative baseline to reliably gauge progress. This paper proposes HyperEvent, a simple approach that captures relative structural patterns in event sequences through an intuitive encoding mechanism. As a straightforward baseline, HyperEvent leverages relative structural encoding to identify meaningful event sequences without complex parameterization. By combining these interpretable features with a lightweight transformer classifier, HyperEvent reframes link prediction as event structure recognition. Despite its simplicity, HyperEvent achieves competitive results across multiple benchmarks, often matching the performance of more complex models. This work demonstrates that effective modeling can be achieved through simple structural encoding, providing a clear reference point for evaluating future advancements.

Updated: 2025-09-23 01:24:14

标题: HyperEvent: 通过相对结构编码实现动态链接预测的强基线

摘要: 为连续时间动态图学习表示对于动态链接预测至关重要。尽管最近的方法变得越来越复杂，但该领域缺乏一个强大且信息丰富的基准来可靠地衡量进展。本文提出了一种简单的方法HyperEvent，通过直观的编码机制捕捉事件序列中的相对结构模式。作为一个简单的基准线，HyperEvent利用相对结构编码来识别有意义的事件序列，而无需复杂的参数化。通过将这些可解释特征与轻量级的transformer分类器相结合，HyperEvent将链接预测重新构建为事件结构识别。尽管简单，HyperEvent在多个基准测试中取得了竞争性的结果，通常与更复杂的模型性能相匹配。这项工作表明，通过简单的结构编码可以实现有效的建模，为评估未来进展提供了明确的参考点。

更新时间: 2025-09-23 01:24:14

领域: cs.LG

下载: http://arxiv.org/abs/2507.11836v2

A Rhythm-Aware Phrase Insertion for Classical Arabic Poetry Composition

This paper presents a methodology for inserting phrases in Arabic poems to conform to a specific rhythm using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications in the process of composing classical Arabic poems.

Updated: 2025-09-23 01:22:15

标题: 一种面向古典阿拉伯诗歌创作的节奏感知短语插入方法

摘要: 这篇论文提出了一种方法，通过使用ByT5，一个基于字节级多语言变换器的模型，将短语插入阿拉伯诗歌中，以符合特定的节奏。我们的工作讨论了一种基于规则的字形到节拍转换，专门用于从完全带音符的阿拉伯脚本中提取节奏。我们的方法采用条件去噪目标来微调ByT5，模型重建掩盖的单词以匹配目标节奏。我们采用了课程学习策略，在通用阿拉伯数据集上进行预训练，然后在诗歌数据集上进行微调，并探讨从英语到阿拉伯语的跨语言转移。实验结果表明，我们的模型在保持语义连贯性的同时实现了高度的节奏对齐。所提出的模型具有潜力在创作古典阿拉伯诗歌过程中用于共创应用。

更新时间: 2025-09-23 01:22:15

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.18514v1

Dynamical Modeling of Behaviorally Relevant Spatiotemporal Patterns in Neural Imaging Data

High-dimensional imaging of neural activity, such as widefield calcium and functional ultrasound imaging, provide a rich source of information for understanding the relationship between brain activity and behavior. Accurately modeling neural dynamics in these modalities is crucial for understanding this relationship but is hindered by the high-dimensionality, complex spatiotemporal dependencies, and prevalent behaviorally irrelevant dynamics in these modalities. Existing dynamical models often employ preprocessing steps to obtain low-dimensional representations from neural image modalities. However, this process can discard behaviorally relevant information and miss spatiotemporal structure. We propose SBIND, a novel data-driven deep learning framework to model spatiotemporal dependencies in neural images and disentangle their behaviorally relevant dynamics from other neural dynamics. We validate SBIND on widefield imaging datasets, and show its extension to functional ultrasound imaging, a recent modality whose dynamical modeling has largely remained unexplored. We find that our model effectively identifies both local and long-range spatial dependencies across the brain while also dissociating behaviorally relevant neural dynamics. Doing so, SBIND outperforms existing models in neural-behavioral prediction. Overall, SBIND provides a versatile tool for investigating the neural mechanisms underlying behavior using imaging modalities.

Updated: 2025-09-23 01:16:23

标题: 行为相关的神经成像数据中的时空模式动态建模

摘要: 高维度神经活动成像，如广域钙和功能性超声成像，为理解大脑活动与行为之间的关系提供了丰富的信息源。准确地建模这些模态下的神经动态对于理解这种关系至关重要，但受到高维度、复杂的时空依赖关系以及这些模态中普遍存在的与行为无关的动态的阻碍。现有的动态模型通常采用预处理步骤从神经图像模态中获得低维表示。然而，这一过程可能会丢弃与行为相关的信息并错过时空结构。我们提出了SBIND，这是一个新颖的数据驱动深度学习框架，用于模拟神经图像中的时空依赖关系，并将其行为相关动态与其他神经动态分离开来。我们在广域成像数据集上验证了SBIND，并展示了其扩展到功能性超声成像，这是一个最近基本上未被探索的模态。我们发现，我们的模型有效地识别了大脑中的局部和远距离空间依赖关系，同时也分离出了与行为相关的神经动态。通过这样做，SBIND在神经行为预测方面表现优于现有模型。总的来说，SBIND为使用成像模态研究行为背后的神经机制提供了一种多功能工具。

更新时间: 2025-09-23 01:16:23

领域: q-bio.NC,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.18507v1

Hyperbolic Coarse-to-Fine Few-Shot Class-Incremental Learning

In the field of machine learning, hyperbolic space demonstrates superior representation capabilities for hierarchical data compared to conventional Euclidean space. This work focuses on the Coarse-To-Fine Few-Shot Class-Incremental Learning (C2FSCIL) task. Our study follows the Knowe approach, which contrastively learns coarse class labels and subsequently normalizes and freezes the classifier weights of learned fine classes in the embedding space. To better interpret the "coarse-to-fine" paradigm, we propose embedding the feature extractor into hyperbolic space. Specifically, we employ the Poincar\'e ball model of hyperbolic space, enabling the feature extractor to transform input images into feature vectors within the Poincar\'e ball instead of Euclidean space. We further introduce hyperbolic contrastive loss and hyperbolic fully-connected layers to facilitate model optimization and classification in hyperbolic space. Additionally, to enhance performance under few-shot conditions, we implement maximum entropy distribution in hyperbolic space to estimate the probability distribution of fine-class feature vectors. This allows generation of augmented features from the distribution to mitigate overfitting during training with limited samples. Experiments on C2FSCIL benchmarks show that our method effectively improves both coarse and fine class accuracies.

Updated: 2025-09-23 01:12:21

标题: 双曲型粗到精细逐类增量学习

摘要: 在机器学习领域，与传统的欧几里得空间相比，双曲空间展现出更优越的层次数据表示能力。本研究关注粗到细的少样本类别增量学习（C2FSCIL）任务。我们的研究遵循Knowe方法，该方法对比学习粗糙类别标签，随后在嵌入空间中归一化并冻结学习到的精细类别的分类器权重。为了更好地解释“粗到细”范式，我们提出将特征提取器嵌入到双曲空间中。具体来说，我们采用双曲空间的Poincaré球模型，使特征提取器能够将输入图像转换为Poincaré球内的特征向量，而不是欧几里得空间。我们进一步引入双曲对比损失和双曲全连接层，以促进在双曲空间中的模型优化和分类。此外，为了在少样本条件下提高性能，我们在双曲空间中实现最大熵分布，以估计精细类别特征向量的概率分布。这允许从分布中生成增强特征，以在训练中减少过拟合。在C2FSCIL基准测试中的实验证明，我们的方法有效地提高了粗糙和精细类别的准确性。

更新时间: 2025-09-23 01:12:21

领域: cs.CV,cs.AI,cs.LG,stat.ML

下载: http://arxiv.org/abs/2509.18504v1

Chypnosis: Undervolting-based Static Side-channel Attacks

Static side-channel analysis attacks, which rely on a stopped clock to extract sensitive information, pose a growing threat to embedded systems' security. To protect against such attacks, several proposed defenses aim to detect unexpected variations in the clock signal and clear sensitive states. In this work, we present \emph{Chypnosis}, an undervolting attack technique that indirectly stops the target circuit clock, while retaining stored data. Crucially, Chypnosis also blocks the state clearing stage of prior defenses, allowing recovery of secret information even in their presence. However, basic undervolting is not sufficient in the presence of voltage sensors designed to handle fault injection via voltage tampering. To overcome such defenses, we observe that rapidly dropping the supply voltage can disable the response mechanism of voltage sensor systems. We implement Chypnosis on various FPGAs, demonstrating the successful bypass of their sensors, both in the form of soft and hard IPs. To highlight the real-world applicability of Chypnosis, we show that the alert handler of the OpenTitan root-of-trust, responsible for providing hardware responses to threats, can be bypassed. Furthermore, we demonstrate that by combining Chypnosis with static side-channel analysis techniques, namely laser logic state imaging (LLSI) and impedance analysis (IA), we can extract sensitive information from a side-channel protected cryptographic module used in OpenTitan, even in the presence of established clock and voltage sensors. Finally, we propose and implement an improvement to an established FPGA-compatible clock detection countermeasure, and we validate its resilience against Chypnosis.

Updated: 2025-09-23 01:05:27

标题: Chypnosis: 基于降压的静态侧信道攻击

摘要: 静态侧信道分析攻击依赖于停止时钟来提取敏感信息，对嵌入式系统的安全性构成日益严重的威胁。为了防范此类攻击，提出了几种防御方法旨在检测时钟信号中的意外变化并清除敏感状态。在本研究中，我们提出了一种名为"Chypnosis"的欠电压攻击技术，间接停止目标电路时钟，同时保留存储的数据。关键是，Chypnosis还阻止了先前防御措施中的状态清除阶段，即使在它们存在的情况下也能恢复秘密信息。然而，基本的欠电压在存在设计用于处理电压篡改的故障注入的电压传感器时并不足够。为了克服这些防御措施，我们观察到快速降低供电电压可以禁用电压传感器系统的响应机制。我们在各种FPGA上实现了Chypnosis，展示了成功绕过它们的传感器，无论是软件还是硬件IP。为了突显Chypnosis的真实世界适用性，我们展示了OpenTitan根信任的警报处理程序，负责对威胁提供硬件响应，可以被绕过。此外，我们展示了通过将Chypnosis与静态侧信道分析技术（即激光逻辑状态成像（LLSI）和阻抗分析（IA））相结合，即使在已建立的时钟和电压传感器存在的情况下，也可以从OpenTitan中使用的侧信道保护的密码模块中提取敏感信息。最后，我们提出并实现了一种改进的适用于FPGA的时钟检测对抗措施，并验证了它对Chypnosis的弹性。

更新时间: 2025-09-23 01:05:27

领域: cs.CR

下载: http://arxiv.org/abs/2504.11633v4

Hybrid Data can Enhance the Utility of Synthetic Data for Training Anti-Money Laundering Models

Money laundering is a critical global issue for financial institutions. Automated Anti-money laundering (AML) models, like Graph Neural Networks (GNN), can be trained to identify illicit transactions in real time. A major issue for developing such models is the lack of access to training data due to privacy and confidentiality concerns. Synthetically generated data that mimics the statistical properties of real data but preserves privacy and confidentiality has been proposed as a solution. However, training AML models on purely synthetic datasets presents its own set of challenges. This article proposes the use of hybrid datasets to augment the utility of synthetic datasets by incorporating publicly available, easily accessible, and real-world features. These additions demonstrate that hybrid datasets not only preserve privacy but also improve model utility, offering a practical pathway for financial institutions to enhance AML systems.

Updated: 2025-09-23 01:03:23

标题: 混合数据可以增强合成数据用于训练反洗钱模型的效用

摘要: 资金洗钱是金融机构面临的一个关键的全球性问题。自动反洗钱（AML）模型，比如图神经网络（GNN），可以被训练用于实时识别非法交易。开发这种模型的一个主要问题是由于隐私和保密问题导致缺乏访问训练数据。已经提出了合成生成的数据，模拟真实数据的统计特性，但同时保护隐私和保密，作为解决方案。然而，仅仅在纯合成数据集上训练AML模型也存在一系列挑战。本文提出了使用混合数据集来增强合成数据集的实用性，通过结合公开可获取和真实世界的特征。这些添加显示出混合数据集不仅保护隐私，而且提高模型的实用性，为金融机构增强AML系统提供了一条实用的路径。

更新时间: 2025-09-23 01:03:23

领域: cs.LG

下载: http://arxiv.org/abs/2509.18499v1

Agentic Software Engineering: Foundational Pillars and a Research Roadmap

Agentic Software Engineering (SE 3.0) represents a new era where intelligent agents are tasked not with simple code generation, but with achieving complex, goal-oriented SE objectives. To harness these new capabilities while ensuring trustworthiness, we must recognize a fundamental duality within the SE field in the Agentic SE era, comprising two symbiotic modalities: SE for Humans and SE for Agents. This duality demands a radical reimagining of the foundational pillars of SE (actors, processes, tools, and artifacts) which manifest differently across each modality. We propose two purpose-built workbenches to support this vision. The Agent Command Environment (ACE) serves as a command center where humans orchestrate and mentor agent teams, handling outputs such as Merge-Readiness Packs (MRPs) and Consultation Request Packs (CRPs). The Agent Execution Environment (AEE) is a digital workspace where agents perform tasks while invoking human expertise when facing ambiguity or complex trade-offs. This bi-directional partnership, which supports agent-initiated human callbacks and handovers, gives rise to new, structured engineering activities (i.e., processes) that redefine human-AI collaboration, elevating the practice from agentic coding to true agentic software engineering. This paper presents the Structured Agentic Software Engineering (SASE) vision, outlining several of the foundational pillars for the future of SE. The paper culminates in a research roadmap that identifies a few key challenges and opportunities while briefly discussing the resulting impact of this future on SE education. Our goal is not to offer a definitive solution, but to provide a conceptual scaffold with structured vocabulary to catalyze a community-wide dialogue, pushing the SE community to think beyond its classic, human-centric tenets toward a disciplined, scalable, and trustworthy agentic future.

Updated: 2025-09-23 01:01:15

标题: 主动型软件工程：基础支柱和研究路线图

摘要: Agentic Software Engineering (SE 3.0)代表了一个新时代，智能代理不再仅仅负责简单的代码生成，而是致力于实现复杂的、目标导向的软件工程目标。为了利用这些新的能力并确保可信度，我们必须认识到在Agentic SE时代软件工程领域内一个基本的二元性，包括两种共生的模式：SE for Humans和SE for Agents。这种二元性要求对SE的基本支柱（演员、流程、工具和产品）进行彻底的重新想象，这些支柱在每种模式下表现不同。我们提出了两个专门设计的工作台来支持这一愿景。Agent Command Environment (ACE)作为一个指挥中心，人类可以在这里编排和指导代理团队，处理输出，如Merge-Readiness Packs (MRPs)和Consultation Request Packs (CRPs)。Agent Execution Environment (AEE)是一个数字工作空间，代理执行任务时可以在面对模糊或复杂的抉择时调用人类专业知识。这种双向合作支持代理发起的人类回调和移交，产生了新的、结构化的工程活动（即流程），重新定义了人工智能协作，将实践从代理编码提升到真正的代理软件工程。本文提出了结构化Agentic软件工程（SASE）愿景，概述了未来软件工程的几个基本支柱。本文最终提出了一个研究路线图，指出了一些关键挑战和机遇，同时简要讨论了这一未来对软件工程教育的影响。我们的目标不是提供一个明确的解决方案，而是提供一个有结构化词汇的概念支架，以激发全社会范围内的对话，推动软件工程社区超越其经典的、以人为中心的原则，走向一个纪律严明、可扩展和可信的代理未来。

更新时间: 2025-09-23 01:01:15

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2509.06216v2

RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo generates repositories averaging 36K Code Lines, roughly 3.9$\times$ the strongest baseline (Claude Code) and about 64$\times$ other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.

Updated: 2025-09-23 01:00:38

标题: RPG：一个用于统一和可扩展代码库生成的存储库规划图

摘要: 大型语言模型在函数和文件级代码生成方面表现出色，但从头开始生成完整的代码库仍然是一个基本挑战。这个过程需要在提案和实施级别阶段进行一致和可靠的规划，而自然语言由于其模糊性和冗长性，不适合忠实地表示复杂的软件结构。为了解决这个问题，我们引入了仓库规划图（RPG），这是一个持久性表示，通过在一个图中编码能力、文件结构、数据流和功能，统一了提案和实施级别的规划。RPG用一个明确的蓝图取代了模糊的自然语言，从而实现了长期规划和可扩展的代码库生成。基于RPG，我们开发了ZeroRepo，这是一个从头开始生成代码库的基于图的框架。它分为三个阶段：提案级别规划和实施级别细化以构建图，然后进行图引导的代码生成和测试验证。为了评估这一设置，我们构建了RepoCraft，一个包含1,052个任务的六个真实项目的基准测试。在RepoCraft上，ZeroRepo生成的代码库平均有36K行代码，大约是最强基线（Claude Code）的3.9倍，是其他基线的64倍。它达到了81.5%的功能覆盖率和69.7%的通过率，分别比Claude Code高出27.3和35.8个百分点。进一步分析显示，RPG模型复杂的依赖关系，通过近线性扩展实现了逐渐更复杂的规划，并增强了LLM对代码库的理解，从而加快了代理程序的本地化。

更新时间: 2025-09-23 01:00:38

领域: cs.CL,cs.AI,cs.SE

下载: http://arxiv.org/abs/2509.16198v2

A Comparative Analysis of Ensemble-Based Machine Learning Approaches with Explainable AI for Multi-Class Intrusion Detection in Drone Networks

The growing integration of drones into civilian, commercial, and defense sectors introduces significant cybersecurity concerns, particularly with the increased risk of network-based intrusions targeting drone communication protocols. Detecting and classifying these intrusions is inherently challenging due to the dynamic nature of drone traffic and the presence of multiple sophisticated attack vectors such as spoofing, injection, replay, and man-in-the-middle (MITM) attacks. This research aims to develop a robust and interpretable intrusion detection framework tailored for drone networks, with a focus on handling multi-class classification and model explainability. We present a comparative analysis of ensemble-based machine learning models, namely Random Forest, Extra Trees, AdaBoost, CatBoost, and XGBoost, trained on a labeled dataset comprising benign traffic and nine distinct intrusion types. Comprehensive data preprocessing was performed, including missing value imputation, scaling, and categorical encoding, followed by model training and extensive evaluation using metrics such as macro F1-score, ROC AUC, Matthews Correlation Coefficient, and Log Loss. Random Forest achieved the highest performance with a macro F1-score of 0.9998 and ROC AUC of 1.0000. To validate the superiority of the models, statistical tests, including Friedmans test, the Wilcoxon signed-rank test with Holm correction, and bootstrapped confidence intervals, were applied. Furthermore, explainable AI methods, SHAP and LIME, were integrated to interpret both global and local feature importance, enhancing model transparency and decision trustworthiness. The proposed approach not only delivers near-perfect accuracy but also ensures interpretability, making it highly suitable for real-time and safety-critical drone operations.

Updated: 2025-09-23 00:59:21

标题: 一个基于可解释AI的集成机器学习方法的比较分析，用于多类别入侵检测在无人机网络中

摘要: 随着军用、商业和国防领域对无人机的日益整合，引入了重要的网络安全问题，尤其是增加了网络入侵针对无人机通讯协议的风险。由于无人机流量的动态性和存在多种复杂的攻击向量，如欺骗、注入、重放和中间人（MITM）攻击，检测和分类这些入侵在本质上具有挑战性。本研究旨在开发一个针对无人机网络量身定制的强大且可解释的入侵检测框架，重点关注多类分类和模型可解释性。我们提出了一项集成机器学习模型的比较分析，包括随机森林、额外树、AdaBoost、CatBoost和XGBoost，这些模型在一个包含良性流量和九种不同入侵类型的标记数据集上进行训练。进行了全面的数据预处理，包括缺失值插补、缩放和分类编码，然后进行模型训练和广泛评估，使用宏F1分数、ROC AUC、马修斯相关系数和对数损失等指标。随机森林获得了最高的性能，宏F1分数为0.9998，ROC AUC为1.0000。为验证模型的优越性，应用了统计测试，包括Friedman测试、带Holm校正的Wilcoxon符号秩检验以及自举置信区间。此外，可解释的人工智能方法SHAP和LIME被整合以解释全局和局部特征重要性，增强模型的透明度和决策的可信度。该方法不仅提供了接近完美的准确性，还确保了可解释性，使其非常适合实时和安全关键的无人机操作。

更新时间: 2025-09-23 00:59:21

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2509.20391v1

AHA - Predicting What Matters Next: Online Highlight Detection Without Looking Ahead

Real-time understanding of continuous video streams is essential for intelligent agents operating in high-stakes environments, including autonomous vehicles, surveillance drones, and disaster response robots. Yet, most existing video understanding and highlight detection methods assume access to the entire video during inference, making them unsuitable for online or streaming scenarios. In particular, current models optimize for offline summarization, failing to support step-by-step reasoning needed for real-time decision-making. We introduce Aha, an autoregressive highlight detection framework that predicts the relevance of each video frame against a task described in natural language. Without accessing future video frames, Aha utilizes a multimodal vision-language model and lightweight, decoupled heads trained on a large, curated dataset of human-centric video labels. To enable scalability, we introduce the Dynamic SinkCache mechanism that achieves constant memory usage across infinite-length streams without degrading performance on standard benchmarks. This encourages the hidden representation to capture high-level task objectives, enabling effective frame-level rankings for informativeness, relevance, and uncertainty with respect to the natural language task. Aha achieves state-of-the-art (SOTA) performance on highlight detection benchmarks, surpassing even prior offline, full-context approaches and video-language models by +5.9% on TVSum and +8.3% on Mr. Hisum in mAP (mean Average Precision). We explore Aha's potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video. Both experiments demonstrate Aha's potential effectiveness as a real-time reasoning module for downstream planning and long-horizon understanding.

Updated: 2025-09-23 00:52:32

标题: AHA- 预测下一个重要事件：在线高亮检测而无需提前查看

摘要: 实时理解连续视频流对于操作于高风险环境中的智能代理至关重要，这些环境包括自动驾驶车辆、监控无人机和灾难应对机器人。然而，大多数现有的视频理解和重点检测方法在推断时假设可以访问整个视频，因此不适合在线或流式场景。特别是，当前模型优化于离线摘要，无法支持实时决策所需的逐步推理。我们介绍了Aha，一种自回归的重点检测框架，根据自然语言描述的任务预测每个视频帧的相关性。在不访问未来视频帧的情况下，Aha利用多模态视觉-语言模型和轻量级、解耦的头部，在一个大型、策划精良的人类中心视频标签数据集上进行训练。为了实现可扩展性，我们引入了动态SinkCache机制，实现在无限长度的流中恒定的内存使用，而不会在标准基准测试中降低性能。这鼓励隐藏表示捕获高级任务目标，从而实现针对自然语言任务的信息性、相关性和不确定性的有效帧级排名。Aha在重点检测基准测试中实现了最新技术(SOTA)表现，在TVSum上超过先前的离线、全文上下文方法和视频-语言模型5.9%，在Mr. Hisum上超过8.3%的mAP(平均精度)。我们探讨了Aha在面向任务的自然语言输入和连续的、以机器人为中心的视频中的实际机器人应用潜力。这两个实验都展示了Aha作为下游规划和长期理解的实时推理模块的潜在有效性。

更新时间: 2025-09-23 00:52:32

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.16421v2

Meta-Semantics Augmented Few-Shot Relational Learning

Few-shot relational learning on knowledge graph (KGs) aims to perform reasoning over relations with only a few training examples. While current methods have focused primarily on leveraging specific relational information, rich semantics inherent in KGs have been largely overlooked. To bridge this gap, we propose PromptMeta, a novel prompted meta-learning framework that seamlessly integrates meta-semantics with relational information for few-shot relational learning. PromptMeta introduces two core innovations: (1) a Meta-Semantic Prompt (MSP) pool that learns and consolidates high-level meta-semantics shared across tasks, enabling effective knowledge transfer and adaptation to newly emerging relations; and (2) a learnable fusion mechanism that dynamically combines meta-semantics with task-specific relational information tailored to different few-shot tasks. Both components are optimized jointly with model parameters within a meta-learning framework. Extensive experiments and analyses on two real-world KG benchmarks validate the effectiveness of PromptMeta in adapting to new relations with limited supervision.

Updated: 2025-09-23 00:50:57

标题: 元语义增强的少样本关系学习

摘要: 在知识图谱（KGs）上进行少样本关系学习的目标是仅使用少量训练示例进行关系推理。尽管当前方法主要集中在利用特定的关系信息上，但KGs中固有的丰富语义却大多被忽视。为了弥合这一差距，我们提出了PromptMeta，这是一个新颖的提示元学习框架，无缝地将元语义与关系信息集成在一起，用于少样本关系学习。PromptMeta引入了两个核心创新：（1）一个元语义提示（MSP）池，学习并巩固跨任务共享的高级元语义，实现有效的知识传递和适应新出现的关系；（2）一个可学习的融合机制，动态地将元语义与针对不同少样本任务定制的任务特定关系信息相结合。这两个组件与模型参数一起在元学习框架内进行优化。在两个真实世界的KG基准上进行了大量实验和分析，验证了PromptMeta在有限监督下适应新关系的有效性。

更新时间: 2025-09-23 00:50:57

领域: cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2505.05684v3

Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment

Automated pain assessment from facial expressions is crucial for non-communicative patients, such as those with dementia. Progress has been limited by two challenges: (i) existing datasets exhibit severe demographic and label imbalance due to ethical constraints, and (ii) current generative models cannot precisely control facial action units (AUs), facial structure, or clinically validated pain levels. We present 3DPain, a large-scale synthetic dataset specifically designed for automated pain assessment, featuring unprecedented annotation richness and demographic diversity. Our three-stage framework generates diverse 3D meshes, textures them with diffusion models, and applies AU-driven face rigging to synthesize multi-view faces with paired neutral and pain images, AU configurations, PSPI scores, and the first dataset-level annotations of pain-region heatmaps. The dataset comprises 82,500 samples across 25,000 pain expression heatmaps and 2,500 synthetic identities balanced by age, gender, and ethnicity. We further introduce ViTPain, a Vision Transformer based cross-modal distillation framework in which a heatmap-trained teacher guides a student trained on RGB images, enhancing accuracy, interpretability, and clinical reliability. Together, 3DPain and ViTPain establish a controllable, diverse, and clinically grounded foundation for generalizable automated pain assessment.

Updated: 2025-09-23 00:50:14

标题: 在3D中的疼痛：生成可控制的合成面孔以进行自动化疼痛评估

摘要: 面部表情自动疼痛评估对于非交流患者，如患有痴呆症的患者至关重要。目前存在两个挑战限制了进展：(i)由于道德约束，现有数据集存在严重的人口统计和标签不平衡，(ii)当前生成模型无法精确控制面部动作单元（AUs）、面部结构或临床验证的疼痛水平。我们提出了3DPain，一个专门设计用于自动疼痛评估的大规模合成数据集，具有前所未有的注释丰富性和人口统计多样性。我们的三阶段框架生成多样化的3D网格，用扩散模型对其进行纹理处理，并应用AU驱动的面部绑定来合成具有中性和疼痛图像、AU配置、PSPI分数和首个数据集级别注释的疼痛区域热图的多视图面部。该数据集包含82,500个样本，涵盖25,000个疼痛表达热图和2,500个根据年龄、性别和种族平衡的合成身份。我们进一步介绍了ViTPain，一种基于视觉变换器的跨模态蒸馏框架，在这种框架中，一个热图训练的教师指导一个在RGB图像上训练的学生，提高了准确性、可解释性和临床可靠性。3DPain和ViTPain共同建立了一个可控、多样化且临床立足的基础，用于可推广的自动疼痛评估。

更新时间: 2025-09-23 00:50:14

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2509.16727v2

Estimating Heterogeneous Causal Effect on Networks via Orthogonal Learning

Estimating causal effects on networks is important for both scientific research and practical applications. Unlike traditional settings that assume the Stable Unit Treatment Value Assumption (SUTVA), interference allows an intervention/treatment on one unit to affect the outcomes of others. Understanding both direct and spillover effects is critical in fields such as epidemiology, political science, and economics. Causal inference on networks faces two main challenges. First, causal effects are typically heterogeneous, varying with unit features and local network structure. Second, connected units often exhibit dependence due to network homophily, creating confounding between structural correlations and causal effects. In this paper, we propose a two-stage method to estimate heterogeneous direct and spillover effects on networks. The first stage uses graph neural networks to estimate nuisance components that depend on the complex network topology. In the second stage, we adjust for network confounding using these estimates and infer causal effects through a novel attention-based interference model. Our approach balances expressiveness and interpretability, enabling downstream tasks such as identifying influential neighborhoods and recovering the sign of spillover effects. We integrate the two stages using Neyman orthogonalization and cross-fitting, which ensures that errors from nuisance estimation contribute only at higher order. As a result, our causal effect estimates are robust to bias and misspecification in modeling causal effects under network dependencies.

Updated: 2025-09-23 00:41:04

标题: 通过正交学习在网络中估计异质因果效应

摘要: 在科学研究和实际应用中，估计网络上的因果效应至关重要。与传统假设“稳定单元治疗价值假设”（SUTVA）的设置不同，干扰允许对一个单位进行干预/治疗以影响其他单位的结果。在流行病学、政治科学和经济学等领域，理解直接和溢出效应都至关重要。网络上的因果推断面临两个主要挑战。首先，因果效应通常是异质的，随着单位特征和局部网络结构的变化而变化。其次，由于网络同质性，连接的单位通常表现出依赖性，从而在结构相关性和因果效应之间造成混淆。在本文中，我们提出了一种两阶段方法来估计网络上异质的直接和溢出效应。第一阶段使用图神经网络来估计依赖于复杂网络拓扑的无关分量。在第二阶段，我们利用这些估计值调整网络混淆，并通过一种新颖的基于注意力的干扰模型推断因果效应。我们的方法平衡了表达性和可解释性，使得诸如识别具有影响力的邻域和恢复溢出效应的符号等下游任务成为可能。我们使用Neyman正交化和交叉拟合来集成这两个阶段，确保干扰估计的错误仅在更高阶上起作用。因此，我们的因果效应估计对于在网络依赖性下建模因果效应时的偏差和规范错误是稳健的。

更新时间: 2025-09-23 00:41:04

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2509.18484v1

Physics-informed time series analysis with Kolmogorov-Arnold Networks under Ehrenfest constraints

The prediction of quantum dynamical responses lies at the heart of modern physics. Yet, modeling these time-dependent behaviors remains a formidable challenge because quantum systems evolve in high-dimensional Hilbert spaces, often rendering traditional numerical methods computationally prohibitive. While large language models have achieved remarkable success in sequential prediction, quantum dynamics presents a fundamentally different challenge: forecasting the entire temporal evolution of quantum systems rather than merely the next element in a sequence. Existing neural architectures such as recurrent and convolutional networks often require vast training datasets and suffer from spurious oscillations that compromise physical interpretability. In this work, we introduce a fundamentally new approach: Kolmogorov Arnold Networks (KANs) augmented with physics-informed loss functions that enforce the Ehrenfest theorems. Our method achieves superior accuracy with significantly less training data: it requires only 5.4 percent of the samples (200) compared to Temporal Convolution Networks (3,700). We further introduce the Chain of KANs, a novel architecture that embeds temporal causality directly into the model design, making it particularly well-suited for time series modeling. Our results demonstrate that physics-informed KANs offer a compelling advantage over conventional black-box models, maintaining both mathematical rigor and physical consistency while dramatically reducing data requirements.

Updated: 2025-09-23 00:37:04

标题: Kolmogorov-Arnold网络在Ehrenfest约束下的物理知识时间序列分析

摘要: 量子动力学响应的预测是现代物理学的核心。然而，对这些时间依赖行为进行建模仍然是一个艰巨的挑战，因为量子系统在高维希尔伯特空间中演化，通常使传统的数值方法在计算上变得困难。虽然大型语言模型在序列预测方面取得了显著成功，量子动力学却提出了一种根本不同的挑战：预测量子系统的整个时间演化，而不仅仅是序列中的下一个元素。现有的神经架构，如循环和卷积网络，通常需要庞大的训练数据集，并且容易受到破坏物理可解释性的虚假振荡的影响。在这项工作中，我们引入了一种基本全新的方法：科尔莫戈洛夫·阿诺德网络（KANs），通过物理信息的损失函数增强，强制执行埃伦费斯特定理。我们的方法在显著减少训练数据的同时实现了更高的准确性：与时间卷积网络（3,700）相比，我们仅需要 5.4％的样本（200）。我们进一步引入了KANs链，一种新颖的架构，直接将时间因果关系嵌入模型设计中，使其特别适用于时间序列建模。我们的结果表明，基于物理信息的KANs相对于传统的黑盒模型具有明显优势，既保持了数学严谨性又大大降低了数据需求。

更新时间: 2025-09-23 00:37:04

领域: cs.LG,quant-ph

下载: http://arxiv.org/abs/2509.18483v1

SimpleFold: Folding Proteins is Simpler than You Think

Protein folding models have achieved groundbreaking results typically via a combination of integrating domain knowledge into the architectural blocks and training pipelines. Nonetheless, given the success of generative models across different but related problems, it is natural to question whether these architectural designs are a necessary condition to build performant models. In this paper, we introduce SimpleFold, the first flow-matching based protein folding model that solely uses general purpose transformer blocks. Protein folding models typically employ computationally expensive modules involving triangular updates, explicit pair representations or multiple training objectives curated for this specific domain. Instead, SimpleFold employs standard transformer blocks with adaptive layers and is trained via a generative flow-matching objective with an additional structural term. We scale SimpleFold to 3B parameters and train it on approximately 9M distilled protein structures together with experimental PDB data. On standard folding benchmarks, SimpleFold-3B achieves competitive performance compared to state-of-the-art baselines, in addition SimpleFold demonstrates strong performance in ensemble prediction which is typically difficult for models trained via deterministic reconstruction objectives. Due to its general-purpose architecture, SimpleFold shows efficiency in deployment and inference on consumer-level hardware. SimpleFold challenges the reliance on complex domain-specific architectures designs in protein folding, opening up an alternative design space for future progress.

Updated: 2025-09-23 00:33:32

标题: SimpleFold：蛋白质折叠比你想象的更简单

摘要: 蛋白质折叠模型通常通过将领域知识整合到架构块和训练管道中的组合取得了突破性的成果。然而，鉴于生成模型在不同但相关问题上取得的成功，自然而然地会质疑这些架构设计是否是构建高性能模型的必要条件。在本文中，我们介绍了SimpleFold，这是第一个基于流匹配的蛋白质折叠模型，它仅使用通用的transformer块。蛋白质折叠模型通常采用涉及三角更新、显式配对表示或多个专门为该领域策划的训练目标的计算昂贵的模块。相反，SimpleFold采用带有自适应层的标准transformer块，并通过生成流匹配目标进行训练，并附加了额外的结构项。我们将SimpleFold扩展到了30亿参数，并在约900万个精炼蛋白质结构和实验性PDB数据上进行训练。在标准折叠基准测试中，SimpleFold-3B与最先进的基线相比取得了竞争性的表现，此外，SimpleFold在集成预测中表现出良好的性能，这通常对通过确定性重建目标进行训练的模型来说是困难的。由于其通用架构，SimpleFold在消费者级硬件上的部署和推理方面表现出高效性。SimpleFold挑战了蛋白质折叠中对复杂领域特定架构设计的依赖，为未来的进展开辟了另一种设计空间。

更新时间: 2025-09-23 00:33:32

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2509.18480v1

Variational decision diagrams for quantum-inspired machine learning applications

Decision diagrams (DDs) have emerged as an efficient tool for simulating quantum circuits due to their capacity to exploit data redundancies in quantum states and quantum operations, enabling the efficient computation of probability amplitudes. However, their application in quantum machine learning (QML) has remained unexplored. This paper introduces variational decision diagrams (VDDs), a novel graph structure that combines the structural benefits of DDs with the adaptability of variational methods for efficiently representing quantum states. We investigate the trainability of VDDs by applying them to the ground state estimation problem for transverse-field Ising and Heisenberg Hamiltonians. Analysis of gradient variance suggests that training VDDs is possible, as no signs of vanishing gradients--also known as barren plateaus--are observed. This work provides new insights into the use of decision diagrams in QML as an alternative to design and train variational ans\"atze.

Updated: 2025-09-23 00:29:03

标题: 量子启发机器学习应用的变分决策图

摘要: 决策图（DDs）已经成为模拟量子电路的有效工具，因为它们能够利用量子状态和量子操作中的数据冗余，从而实现概率振幅的高效计算。然而，它们在量子机器学习（QML）中的应用尚未被探索。本文介绍了变分决策图（VDDs），这是一种结合了DDs的结构优势和变分方法的适应性的新型图结构，用于高效表示量子状态。我们通过将其应用于横场伊辛和海森堡哈密顿量的基态估计问题来研究VDDs的可训练性。梯度方差的分析表明，训练VDDs是可能的，因为没有观察到梯度消失的迹象，也就是所谓的贫瘠高原。这项工作为在QML中使用决策图作为设计和训练变分ansätze的替代方法提供了新的见解。

更新时间: 2025-09-23 00:29:03

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2502.04271v2