Arxiv Day: Article

KOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA)

Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplinary interventions have the potential to slow disease progression and enhance quality of life, they typically require substantial medical resources and expertise, making them difficult to implement in resource-limited settings. To address this challenge, we developed KOM, a multi-agent system designed to automate KOA evaluation, risk prediction, and treatment prescription. This system assists clinicians in performing essential tasks across the KOA care pathway and supports the generation of tailored management plans based on individual patient profiles, disease status, risk factors, and contraindications. In benchmark experiments, KOM demonstrated superior performance compared to several general-purpose large language models in imaging analysis and prescription generation. A randomized three-arm simulation study further revealed that collaboration between KOM and clinicians reduced total diagnostic and planning time by 38.5% and resulted in improved treatment quality compared to each approach used independently. These findings indicate that KOM could help facilitate automated KOA management and, when integrated into clinical workflows, has the potential to enhance care efficiency. The modular architecture of KOM may also offer valuable insights for developing AI-assisted management systems for other chronic conditions.

Updated: 2025-11-24 23:56:51

Categories: cs.AI,cs.HC,cs.LG,cs.MA

Download: http://arxiv.org/abs/2511.19798v1

Terminal Velocity Matching

We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.

Updated: 2025-11-24 23:55:45

Categories: cs.LG,cs.AI,cs.CV,stat.ML

Download: http://arxiv.org/abs/2511.19797v1

When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain scenarios. Single runs and unpaired t-tests often suggest significant gains for 0.6-2.0 point improvements, especially on text. With only three seeds, our paired protocol never declares significance in these settings. We argue that such conservative evaluation is a safer default for small gains under tight budgets.
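The sign-flip permutation test on per-seed deltas mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the delta values are hypothetical, and the BCa bootstrap interval is omitted (SciPy's `scipy.stats.bootstrap` offers a `method="BCa"` option that could fill that role). The sketch also shows why three seeds can never reach conventional significance: with n paired deltas there are only 2^n sign patterns, so the smallest possible two-sided p-value is 2/2^n, which is 0.25 for n = 3.

```python
from itertools import product


def sign_flip_p_value(deltas):
    """Exact two-sided sign-flip permutation test on paired per-seed deltas.

    Under the null hypothesis of no improvement, each delta is equally
    likely to be positive or negative, so every sign pattern is enumerated.
    """
    n = len(deltas)
    observed = abs(sum(deltas) / n)
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, deltas)) / n)
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical accuracy deltas (new method minus baseline), in points.
three_seed_deltas = [1.2, 0.9, 1.5]
eight_seed_deltas = [1.2, 0.9, 1.5, 1.1, 0.8, 1.3, 1.0, 1.4]

p3 = sign_flip_p_value(three_seed_deltas)
p8 = sign_flip_p_value(eight_seed_deltas)
print(p3, p8)
```

Even when every seed favors the new method, the three-seed p-value cannot fall below 0.25, whereas eight consistent seeds can reach 2/256.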

Updated: 2025-11-24 23:50:27

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.19794v1

Practitioners' Perspectives on a Differential Privacy Deployment Registry

Differential privacy (DP) -- a principled approach to producing statistical data products with strong, mathematically provable privacy guarantees for the individuals in the underlying dataset -- has seen substantial adoption in practice over the past decade. Applying DP requires making several implementation decisions, each with significant impacts on data privacy and/or utility. Hence, to promote shared learning and accountability around DP deployments, Dwork, Kohli, and Mulligan (2019) proposed a public-facing repository ("registry") of DP deployments. The DP community has recently started to work toward realizing this vision. We contribute to this effort by (1) developing a holistic, hierarchical schema to describe any given DP deployment and (2) designing and implementing an interactive interface to act as a registry where practitioners can access information about past DP deployments. We (3) populate our interface with 21 real-world DP deployments and (4) conduct an exploratory user study with DP practitioners ($n=16$) to understand how they would use the registry, as well as what challenges and opportunities they foresee around its adoption. We find that participants were enthusiastic about the registry as a valuable resource for evaluating prior deployments and making future deployments. They also identified several opportunities for the registry, including that it can become a "hub" for the community and support broader communication around DP (e.g., to legal teams). At the same time, they identified challenges around the registry gaining adoption, including the effort and risk involved with making implementation choices public and moderating the quality of entries. Based on our findings, we offer recommendations for encouraging adoption and increasing the registry's value not only to DP practitioners, but also to policymakers, data users, and data subjects.

Updated: 2025-11-24 23:50:18

Categories: cs.CR,cs.CY,cs.HC

Download: http://arxiv.org/abs/2509.13509v2

Vision Language Models Can Parse Floor Plan Maps

Vision language models (VLMs) can simultaneously reason about images and texts to tackle many tasks, from visual question answering to image captioning. This paper focuses on map parsing, a novel task that is unexplored within the VLM context and particularly useful to mobile robots. Map parsing requires understanding not only the labels but also the geometric configurations of a map, i.e., what areas are like and how they are connected. To evaluate the performance of VLMs on map parsing, we prompt VLMs with floor plan maps to generate task plans for complex indoor navigation. Our results demonstrate the remarkable capability of VLMs in map parsing, with a success rate of 0.96 in tasks requiring a sequence of nine navigation actions, e.g., approaching and going through doors. Beyond intuitive observations, e.g., that VLMs do better on smaller maps and simpler navigation tasks, we also observed that their performance drops in large open areas. We provide practical suggestions to address such challenges, as validated by our experimental results. Webpage: https://sites.google.com/view/vlm-floorplan/

Updated: 2025-11-24 23:47:56

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2409.12842v2

GPU-Initiated Networking for NCCL

Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, where the CPU orchestrates all communication operations, a characteristic of the CUDA runtime. Although this model is robust for collective operations, applications requiring tight integration of computation and communication can benefit from device-initiated communication that eliminates CPU coordination overhead. NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, and semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) device-side APIs for remote memory operations callable from CUDA kernels; and iii) a network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The GPUDirect Async Kernel-Initiated backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN's practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL's unified runtime, combining low-latency operations with NCCL's collective algorithms and production infrastructure.

Updated: 2025-11-24 23:45:45

Categories: cs.DC,cs.AI,cs.AR,cs.LG

Download: http://arxiv.org/abs/2511.15076v2

Active Slice Discovery in Large Language Models

Large Language Models (LLMs) often exhibit systematic errors on specific subsets of data, known as error slices. For instance, a slice can correspond to a certain demographic, where a model does poorly in identifying toxic comments regarding that demographic. Identifying error slices is crucial to understanding and improving models, but it is also challenging. An appealing approach to reduce the amount of manual annotation required is to actively group errors that are likely to belong to the same slice, while using limited access to an annotator to verify whether the chosen samples share the same pattern of model mistake. In this paper, we formalize this approach as Active Slice Discovery and explore it empirically on a problem of discovering human-defined slices in toxicity classification. We examine the efficacy of active slice discovery under different choices of feature representations and active learning algorithms. On several slices, we find that uncertainty-based active learning algorithms are most effective, achieving competitive accuracy using 2-10% of the available slice membership information, while significantly outperforming baselines.
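As a rough illustration of the uncertainty-based selection step described above, the following sketch picks the samples whose predicted slice-membership probability is closest to 0.5 and sends them to the annotator. The function name, the probabilities, and the budget are hypothetical; the abstract does not specify which uncertainty criterion the authors use, so plain margin-to-0.5 uncertainty is assumed here.

```python
def select_most_uncertain(probabilities, budget):
    """Return indices of the `budget` samples closest to probability 0.5.

    `probabilities` holds a (hypothetical) classifier's estimate that each
    misclassified sample belongs to the slice being discovered.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


# Hypothetical slice-membership probabilities for six model errors.
slice_membership_probs = [0.97, 0.52, 0.10, 0.45, 0.80, 0.60]
to_annotate = select_most_uncertain(slice_membership_probs, budget=2)
print(to_annotate)
```

Samples the classifier is already confident about (such as 0.97 or 0.10) are skipped, which is how the annotation budget is concentrated on the most informative queries.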

Updated: 2025-11-24 23:43:20

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.20713v1

Koopman operator-based discussion on partial observation in stochastic systems

It is sometimes difficult to achieve a complete observation for a full set of observables, and partial observations are necessary. For deterministic systems, the Mori-Zwanzig formalism provides a theoretical framework for handling partial observations. Recently, data-driven algorithms based on Koopman operator theory have made significant progress, and connecting the Mori-Zwanzig formalism with Koopman operator theory has been discussed. In this work, we discuss the effects of partial observation in stochastic systems using Koopman operator theory. The discussion clarifies the importance of distinguishing the state space and the function space in stochastic systems. Even in stochastic systems, the delay-embedding technique is beneficial for partial observation, and several numerical experiments show a power-law behavior of the error with respect to the amplitude of the additive noise. We also discuss the relation between the exponent of the power-law behavior and the effects of partial observation.
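The delay-embedding technique mentioned above can be sketched in a few lines: a partially observed scalar series is lifted into vectors of its own recent history. This is a generic illustration with hypothetical names and data, not the specific construction used in the paper.

```python
def delay_embed(series, dimension, lag=1):
    """Build delay vectors [x_t, x_{t-lag}, ..., x_{t-(dimension-1)*lag}].

    Only time indices with a full history are returned, so the output has
    len(series) - (dimension - 1) * lag rows.
    """
    span = (dimension - 1) * lag
    return [
        [series[t - k * lag] for k in range(dimension)]
        for t in range(span, len(series))
    ]


# Hypothetical scalar observations of a partially observed system.
observed = [0.0, 1.0, 2.0, 3.0, 4.0]
embedded = delay_embed(observed, dimension=3, lag=1)
print(embedded)
```

Each row can then serve as the input to a data-driven Koopman-style regression in place of the unavailable full state.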

Updated: 2025-11-24 23:41:00

Categories: cs.LG

Download: http://arxiv.org/abs/2506.21844v2

NOEM$^{3}$A: A Neuro-Symbolic Ontology-Enhanced Method for Multi-Intent Understanding in Mobile Agents

We introduce a neuro-symbolic framework for multi-intent understanding in mobile AI agents by integrating a structured intent ontology with compact language models. Our method leverages retrieval-augmented prompting, logit biasing and optional classification heads to inject symbolic intent structure into both input and output representations. We formalize a new evaluation metric, Semantic Intent Similarity (SIS), based on hierarchical ontology depth, capturing semantic proximity even when predicted intents differ lexically. Experiments on a subset of ambiguous/demanding dialogues of MultiWOZ 2.3 (with oracle labels from GPT-o3) demonstrate that a 3B Llama model with ontology augmentation approaches GPT-4 accuracy (85% vs 90%) at a tiny fraction of the energy and memory footprint. Qualitative comparisons show that ontology-augmented models produce more grounded, disambiguated multi-intent interpretations. Our results validate symbolic alignment as an effective strategy for enabling accurate and efficient on-device NLU.
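To give a feel for a depth-based similarity over an intent hierarchy, here is a small sketch. The abstract does not give the SIS formula, so this uses a Wu-Palmer-style score (twice the depth of the deepest common ancestor, divided by the sum of the two intents' depths) as a stand-in; the toy ontology and all names are hypothetical.

```python
# Hypothetical intent ontology, stored as child -> parent.
PARENT = {
    "book_hotel": "hotel",
    "find_hotel": "hotel",
    "hotel": "travel",
    "book_train": "train",
    "train": "travel",
    "travel": None,
}


def path_to_root(intent):
    """Return the chain of nodes from `intent` up to the root."""
    path = []
    while intent is not None:
        path.append(intent)
        intent = PARENT[intent]
    return path


def depth(intent):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(intent))


def ontology_similarity(a, b):
    """Wu-Palmer-style similarity in (0, 1]; 1.0 means identical intents."""
    ancestors_a = set(path_to_root(a))
    # The first node on b's upward path that is also above a is the
    # deepest common ancestor.
    common = next(node for node in path_to_root(b) if node in ancestors_a)
    return 2 * depth(common) / (depth(a) + depth(b))


same = ontology_similarity("book_hotel", "book_hotel")
sibling = ontology_similarity("book_hotel", "find_hotel")
cousin = ontology_similarity("book_hotel", "book_train")
print(same, sibling, cousin)
```

Two lexically different intents under the same parent score higher than intents that only share the root, which is the kind of semantic proximity a depth-based metric is meant to reward.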

Updated: 2025-11-24 23:14:45

Categories: cs.AI

Download: http://arxiv.org/abs/2511.19780v1

Architectures and random properties of symplectic quantum circuits

Parametrized and random unitary (or orthogonal) $n$-qubit circuits play a central role in quantum information. As such, one could naturally assume that circuits implementing symplectic transformations would attract similar attention. However, this is not the case, as $\mathbb{SP} (d/2)$ -- the group of $d\times d$ unitary symplectic matrices -- has thus far been overlooked. In this work, we aim to start filling this gap. We begin by presenting a universal set of generators $\mathcal{G}$ for the symplectic algebra $\mathfrak{sp}(d/2)$, consisting of one- and two-qubit Pauli operators acting on neighboring sites in a one-dimensional lattice. Here, we uncover two critical differences between such a set and the equivalent ones for unitary and orthogonal circuits. Namely, we find that the operators in $\mathcal{G}$ cannot generate arbitrary local symplectic unitaries and that they are not translationally invariant. We then review the Schur-Weyl duality between the symplectic group and the Brauer algebra, and use tools from Weingarten calculus to prove that Pauli measurements at the output of Haar random symplectic circuits can converge to Gaussian processes. As a by-product, such analysis provides us with concentration bounds for Pauli measurements in circuits that form $t$-designs over $\mathbb{SP}(d/2)$. To finish, we present tensor-network tools to analyze shallow random symplectic circuits, and we use these to numerically show that computational-basis measurements anti-concentrate at logarithmic depth.

Updated: 2025-11-24 23:13:27

Categories: quant-ph,cs.LG

Download: http://arxiv.org/abs/2405.10264v3

Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.

Updated: 2025-11-24 22:58:26

Categories: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2511.19773v1

Memory Self-Regeneration: Uncovering Hidden Knowledge in Unlearned Models

The impressive capability of modern text-to-image models to generate realistic visuals has come with a serious drawback: they can be misused to create harmful, deceptive or unlawful content. This has accelerated the push for machine unlearning. This new field seeks to selectively remove specific knowledge from a model's training data without causing a drop in its overall performance. However, it turns out that actually forgetting a given concept is an extremely difficult task. Models exposed to attacks using adversarial prompts show the ability to generate so-called unlearned concepts, which can be not only harmful but also illegal. In this paper, we present considerations regarding the ability of models to forget and recall knowledge, introducing the Memory Self-Regeneration task. Furthermore, we present the MemoRa strategy, which we consider to be a regenerative approach supporting the effective recovery of previously lost knowledge. Moreover, we propose that robustness in knowledge retrieval is a crucial yet underexplored evaluation measure for developing more robust and effective unlearning techniques. Finally, we demonstrate that forgetting occurs in two distinct ways: short-term, where concepts can be quickly recalled, and long-term, where recovery is more challenging. Code is available at https://gmum.github.io/MemoRa/.

Updated: 2025-11-24 22:54:34

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.03263v2

Advancing Limited-Angle CT Reconstruction Through Diffusion-Based Sinogram Completion

Limited Angle Computed Tomography (LACT) often faces significant challenges due to missing angular information. Unlike previous methods that operate in the image domain, we propose a new method that focuses on sinogram inpainting. We leverage MR-SDEs, a variant of diffusion models that characterize the diffusion process with mean-reverting stochastic differential equations, to fill in missing angular data at the projection level. Furthermore, by combining distillation with constraining the output of the model using the pseudo-inverse of the inpainting matrix, the diffusion process is accelerated and completed in a single step, enabling efficient and accurate sinogram completion. A subsequent post-processing module back-projects the inpainted sinogram into the image domain and further refines the reconstruction, effectively suppressing artifacts while preserving critical structural details. Quantitative experimental results demonstrate that the proposed method achieves state-of-the-art performance in both perceptual and fidelity quality, offering a promising solution for LACT reconstruction in scientific and clinical applications.
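One common way to constrain an inpainting model with the pseudo-inverse of the measurement operator is the projection x = A⁺y + (I − A⁺A)x̂, where x̂ is the model output and y the measured data. The sketch below assumes, as an illustration only, that the inpainting matrix A simply selects the measured angles; in that case A⁺ = Aᵀ and the projection reduces to keeping measured entries and taking the rest from the model. Whether the paper applies the constraint in exactly this form is not stated in the abstract, and all names and values here are hypothetical.

```python
def enforce_measured_angles(predicted, measured, mask):
    """Apply x = A^+ y + (I - A^+ A) x_hat for a selection-type mask.

    predicted: model output for every angle (x_hat)
    measured:  acquired sinogram values, valid only where mask is True
    mask:      True for angles that were actually measured
    """
    return [m if keep else p for p, m, keep in zip(predicted, measured, mask)]


# Hypothetical one-detector sinogram over five angles; the last two angles
# were not acquired, so their measured entries are placeholders.
measured = [1.0, 2.0, 3.0, 0.0, 0.0]
mask = [True, True, True, False, False]
predicted = [1.1, 1.9, 3.2, 4.1, 5.0]

completed = enforce_measured_angles(predicted, measured, mask)
print(completed)
```

The completed sinogram agrees exactly with the acquired data, and the model only contributes the angles that were missing.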

Updated: 2025-11-24 22:53:15

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2505.19385v2

FMPlug: Plug-In Foundation Flow-Matching Priors for Inverse Problems

We present FMPlug, a novel plug-in framework that enhances foundation flow-matching (FM) priors for solving ill-posed inverse problems. Unlike traditional approaches that rely on domain-specific or untrained priors, FMPlug smartly leverages two simple but powerful insights: the similarity between observed and desired objects and the Gaussianity of generative flows. By introducing a time-adaptive warm-up strategy and sharp Gaussianity regularization, FMPlug unlocks the true potential of domain-agnostic foundation models. Our method beats state-of-the-art methods that use foundation FM priors by significant margins, on image super-resolution and Gaussian deblurring.

Updated: 2025-11-24 22:52:20

Categories: eess.IV,cs.CV,cs.LG,eess.SP

Download: http://arxiv.org/abs/2508.00721v2

Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics respectively over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both OpenEQA and EXPRESS-Bench datasets.
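For readers unfamiliar with the procedure that inspires the pruning step, the following is a sketch of the standard Holm-Bonferroni step-down test. How Prune-Then-Plan derives test statistics from VLM frontier scores is not described in the abstract, so the p-values below are hypothetical and the mapping from "rejected" to "pruned" is an assumption made purely for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Standard Holm-Bonferroni step-down procedure.

    Returns a list of booleans, True where the null hypothesis is rejected
    while controlling the family-wise error rate at `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared against
        # alpha / (m - k + 1); stop at the first failure.
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break
    return rejected


# Hypothetical p-values, one per candidate frontier.
frontier_p_values = [0.01, 0.04, 0.03, 0.005]
pruned = holm_bonferroni(frontier_p_values, alpha=0.05)
print(pruned)
```

Because the thresholds tighten with the number of candidates, the procedure stays conservative when many frontiers are scored at once, which matches the abstract's emphasis on avoiding overconfident step-level choices.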

Updated: 2025-11-24 22:50:50

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2511.19768v1

A Set of Rules for Model Validation

The validation of a data-driven model is the process of assessing the model's ability to generalize to new, unseen data in the population of interest. This paper proposes a set of general rules for model validation. These rules are designed to help practitioners create reliable validation plans and report their results transparently. While no validation scheme is flawless, these rules can help practitioners ensure their strategy is sufficient for practical use, openly discuss any limitations of their validation strategy, and report clear, comparable performance metrics.

Updated: 2025-11-24 22:42:52

Categories: stat.ME,cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.20711v1

Adjoint Schrödinger Bridge Sampler

Computational methods for learning to sample from the Boltzmann distribution -- where the target distribution is known only up to an unnormalized energy function -- have advanced significantly recently. Due to the lack of explicit target samples, however, prior diffusion-based methods, known as diffusion samplers, often require importance-weighted estimation or complicated learning processes. Both trade off scalability with extensive evaluations of the energy and model, thereby limiting their practical usage. In this work, we propose Adjoint Schrödinger Bridge Sampler (ASBS), a new diffusion sampler that employs simple and scalable matching-based objectives without needing to estimate target samples during training. ASBS is grounded on a mathematical model -- the Schrödinger Bridge -- which enhances sampling efficiency via kinetic-optimal transportation. Through a new lens of stochastic optimal control theory, we demonstrate how SB-based diffusion samplers can be learned at scale via Adjoint Matching and prove convergence to the global solution. Notably, ASBS generalizes the recent Adjoint Sampling (Havens et al., 2025) to arbitrary source distributions by relaxing the so-called memoryless condition that largely restricts the design space. Through extensive experiments, we demonstrate the effectiveness of ASBS on sampling from classical energy functions, amortized conformer generation, and molecular Boltzmann distributions. Code is available at https://github.com/facebookresearch/adjoint_samplers

Updated: 2025-11-24 22:41:42

Categories: stat.ML,cs.LG,math.OC

Download: http://arxiv.org/abs/2506.22565v2

Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA), on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze the resilience of MM VLMs against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of a VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets, CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on the resilience of neuro VLMs to privacy threats.
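Attack success for membership inference is reported above as ROC-AUC. As a reminder of what that number measures, the sketch below computes it directly from attack scores as the probability that a randomly chosen training member receives a higher score than a randomly chosen non-member (ties count one half). The scores are hypothetical and are not taken from the paper; an AUC near 0.5 means the attacker cannot distinguish members from non-members.

```python
def roc_auc(member_scores, non_member_scores):
    """ROC-AUC via pairwise comparison (Mann-Whitney U statistic).

    Returns the fraction of (member, non-member) pairs in which the member
    receives the higher attack score, counting ties as half a win.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical attack scores (higher = "looks like training data").
members = [0.9, 0.8, 0.4]
non_members = [0.7, 0.3, 0.2]

attack_auc = roc_auc(members, non_members)
print(attack_auc)
```

A defense that lowers this value toward 0.5 without hurting caption quality is, in the abstract's terms, more resilient to the attack.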

Updated: 2025-11-24 22:32:03

Categories: cs.CV, cs.AI, cs.CR

Download: http://arxiv.org/abs/2511.20710v1

DUALGAUGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation

Large language models (LLMs) and autonomous coding agents are increasingly used to generate software across a wide range of domains. Yet a core requirement remains unmet: ensuring that generated code is secure without compromising its functional correctness. Existing benchmarks and evaluations for secure code generation fall short: many measure only vulnerability reduction, disregard correctness preservation, or evaluate security and functionality on separate datasets, violating the fundamental need for simultaneous joint evaluation. We present DUALGAUGE, the first fully automated benchmarking framework designed to rigorously evaluate the security and correctness of LLM-generated code in unison. Given the lack of datasets enabling joint evaluation of secure code generation, we also present DUALGAUGE-BENCH, a curated benchmark suite of diverse coding tasks, each paired with manually validated test suites for both security and functionality, designed for full coverage of specification requirements. At the core of DUALGAUGE is an agentic program executor, which runs a program against given tests in sandboxed environments, and an LLM-based evaluator, which assesses both correctness and vulnerability behavior against expected outcomes. We rigorously evaluated and ensured the quality of DUALGAUGE-BENCH and the accuracy of DUALGAUGE, and applied DUALGAUGE to benchmarking ten leading LLMs on DUALGAUGE-BENCH across thousands of test scenarios. Our results reveal critical gaps in correct and secure code generation by these LLMs, for which our open-source system and datasets help accelerate progress via reproducible, scalable, and rigorous evaluation.

Updated: 2025-11-24 22:26:14

Categories: cs.SE, cs.AI, cs.CR

Download: http://arxiv.org/abs/2511.20709v1

Clustering Approaches for Mixed-Type Data: A Comparative Study

Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited for this task. This study presents the state of the art of these approaches and compares them using various simulation models. The compared methods include the distance-based approaches k-prototypes, PDQ, and convex k-means, and the probabilistic methods KAy-means for MIxed LArge data (KAMILA), the mixture of Bayesian networks (MBNs), and latent class model (LCM). The aim is to provide insights into the behavior of different methods across a wide range of scenarios by varying some experimental factors such as the number of clusters, cluster overlap, sample size, dimension, proportion of continuous variables in the dataset, and clusters' distribution. The degree of cluster overlap, the proportion of continuous variables in the dataset, and the sample size have a significant impact on the observed performances. When strong interactions exist between variables alongside an explicit dependence on cluster membership, none of the evaluated methods demonstrated satisfactory performance. In our experiments, KAMILA, LCM, and k-prototypes exhibited the best performance with respect to the adjusted Rand index (ARI). All the methods are available in R.
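
The methods above are ranked by the adjusted Rand index. As an illustration only, the following sketch computes the ARI from two label vectors using the standard contingency-table formula. It is written in Python rather than R, the function name is hypothetical, and it is not the study's evaluation code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same n >= 2 items."""
    n = len(labels_true)
    # Contingency counts: items sharing both a true label and a predicted label.
    pair_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# Identical partitions up to a relabeling give the maximum score of 1.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Because the index is corrected for chance, a value near 0 indicates agreement no better than random labeling, and negative values are possible.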

Updated: 2025-11-24 22:18:23

Categories: stat.ML, cs.LG, stat.AP, stat.ME

Download: http://arxiv.org/abs/2511.19755v1

Leveraging Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools

Despite the promise of computational pathology foundation models, adapting them to specific clinical tasks remains challenging due to the complexity of whole-slide image (WSI) processing, the opacity of learned features, and the wide range of potential adaptation strategies. To address these challenges, we introduce PathFMTools, a lightweight, extensible Python package that enables efficient execution, analysis, and visualization of pathology foundation models. We use this tool to interface with and evaluate two state-of-the-art vision-language foundation models, CONCH and MUSK, on the task of histological grading in cutaneous squamous cell carcinoma (cSCC), a critical criterion that informs cSCC staging and patient management. Using a cohort of 440 cSCC H&E WSIs, we benchmark multiple adaptation strategies, demonstrating trade-offs across prediction approaches and validating the potential of using foundation model embeddings to train small specialist models. These findings underscore the promise of pathology foundation models for real-world clinical applications, with PathFMTools enabling efficient analysis and validation.

Updated: 2025-11-24 22:16:12

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2511.19751v1

DISCO: A Browser-Based Privacy-Preserving Framework for Distributed Collaborative Learning

Data is often impractical to share for a range of well-considered reasons, such as concerns over privacy, intellectual property, and legal constraints. This not only fragments the statistical power of predictive models, but creates an accessibility bias, where accuracy becomes inequitably distributed to those who have the resources to overcome these concerns. We present DISCO: an open-source DIStributed COllaborative learning platform accessible to non-technical users, offering a means to collaboratively build machine learning models without sharing any original data or requiring any programming knowledge. DISCO's web application trains models locally directly in the browser, making our tool cross-platform out-of-the-box, including smartphones. The modular design of DISCO offers choices between federated and decentralized paradigms, various levels of privacy guarantees and several approaches to weight aggregation strategies that allow for model personalization and bias resilience in the collaborative training. The code repository is available at https://github.com/epfml/disco and a showcase web interface at https://discolab.ai

Updated: 2025-11-24 22:16:07

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19750v1

Scaling Item-to-Standard Alignment with Large Language Models: Accuracy, Limits, and Solutions

As educational systems evolve, ensuring that assessment items remain aligned with content standards is essential for maintaining fairness and instructional relevance. Traditional human alignment reviews are accurate but slow and labor-intensive, especially across large item banks. This study examines whether Large Language Models (LLMs) can accelerate this process without sacrificing accuracy. Using over 12,000 item-skill pairs in grades K-5, we tested three LLMs (GPT-3.5 Turbo, GPT-4o-mini, and GPT-4o) across three tasks that mirror real-world challenges: identifying misaligned items, selecting the correct skill from the full set of standards, and narrowing candidate lists prior to classification. In Study 1, GPT-4o-mini correctly identified alignment status in approximately 83-94% of cases, including subtle misalignments. In Study 2, performance remained strong in mathematics but was lower for reading, where standards are more semantically overlapping. Study 3 demonstrated that pre-filtering candidate skills substantially improved results, with the correct skill appearing among the top five suggestions more than 95% of the time. These findings suggest that LLMs, particularly when paired with candidate filtering strategies, can significantly reduce the manual burden of item review while preserving alignment accuracy. We recommend the development of hybrid pipelines that combine LLM-based screening with human review in ambiguous cases, offering a scalable solution for ongoing item validation and instructional alignment.

Updated: 2025-11-24 22:12:23

Categories: cs.AI

Download: http://arxiv.org/abs/2511.19749v1

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores model reasoning and actions, and combines them using a weighted average. Our hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates twice as high as action-only monitoring for subtle deception scenarios.
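
The hybrid protocol described above combines two independent suspicion scores with a weighted average. A minimal sketch of that combination step follows; the function names, the score range of 0 to 1, the weight, and the threshold are illustrative assumptions rather than values from the paper.

```python
def hybrid_suspicion(cot_score, action_score, cot_weight=0.5):
    """Weighted average of a reasoning-trace score and an action-only score.

    Both scores are assumed to lie in [0, 1]; cot_weight is a hypothetical
    tuning parameter, not a value reported by the authors.
    """
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must be between 0 and 1")
    return cot_weight * cot_score + (1.0 - cot_weight) * action_score


def is_flagged(cot_score, action_score, threshold=0.5, cot_weight=0.5):
    """Flag a transcript when the combined suspicion reaches the threshold."""
    return hybrid_suspicion(cot_score, action_score, cot_weight) >= threshold


# A persuasive rationalization lowers the CoT score, but a suspicious
# action can still push the combined score over the threshold.
flagged = is_flagged(cot_score=0.2, action_score=0.9)
```

Scoring the two channels independently means a misleading reasoning trace can lower only one of the two inputs, which is the motivation the abstract gives for the hybrid design.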

Updated: 2025-11-24 22:11:07

Categories: cs.AI, cs.LG

Download: http://arxiv.org/abs/2505.23575v3

CAMformer: Associative Memory is All You Need

Transformers face scalability challenges due to the quadratic cost of attention, which involves dense similarity computations between queries and keys. We propose CAMformer, a novel accelerator that reinterprets attention as an associative memory operation and computes attention scores using a voltage-domain Binary Attention Content Addressable Memory (BA-CAM). This enables constant-time similarity search through analog charge sharing, replacing digital arithmetic with physical similarity sensing. CAMformer integrates hierarchical two-stage top-k filtering, pipelined execution, and high-precision contextualization to achieve both algorithmic accuracy and architectural efficiency. Evaluated on BERT and Vision Transformer workloads, CAMformer achieves over 10x energy efficiency, up to 4x higher throughput, and 6-8x lower area compared to state-of-the-art accelerators, while maintaining near-lossless accuracy.
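
To make the idea of binary attention with top-k filtering concrete, here is a rough software analogue. It is only a sketch under stated assumptions: sign-based binarization, bit-match counting as the similarity measure, and a single-stage top-k are choices made for illustration, and the hardware described in the abstract performs the search in the analog domain rather than in a loop.

```python
def binarize(vector):
    """Map each component to one bit by sign (assumed binarization rule)."""
    return [1 if x >= 0 else 0 for x in vector]


def match_count(query_bits, key_bits):
    """Number of positions where the two bit vectors agree."""
    return sum(1 for q, k in zip(query_bits, key_bits) if q == k)


def top_k_keys(query, keys, k):
    """Indices of the k keys most similar to the query under bit matching."""
    query_bits = binarize(query)
    scores = [match_count(query_bits, binarize(key)) for key in keys]
    # Sort by descending score; break ties by the lower index.
    order = sorted(range(len(keys)), key=lambda i: (-scores[i], i))
    return order[:k]


query = [1.0, -2.0, 0.5]
keys = [[1.0, -1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, 1.0, 1.0]]
selected = top_k_keys(query, keys, k=2)
```

Only the selected keys would then go through a higher-precision attention computation, which is how a coarse binary filter can cut the cost of dense similarity scoring.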

Updated: 2025-11-24 21:57:11

Categories: cs.AR, cs.LG

Download: http://arxiv.org/abs/2511.19740v1

Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation

Domain-specific text embeddings are critical for clinical natural language processing, yet systematic comparisons across model architectures remain limited. This study evaluates ten transformer-based embedding models adapted for cardiology through Low-Rank Adaptation (LoRA) fine-tuning on 106,535 cardiology text pairs derived from authoritative medical textbooks. Results demonstrate that encoder-only architectures, particularly BioLinkBERT, achieve superior domain-specific performance (separation score: 0.510) compared to larger decoder-based models, while requiring significantly fewer computational resources. The findings challenge the assumption that larger language models necessarily produce better domain-specific embeddings and provide practical guidance for clinical NLP system development. All models, training code, and evaluation datasets are publicly available to support reproducible research in medical informatics.

Updated: 2025-11-24 21:57:09

Categories: cs.CL, cs.LG

Download: http://arxiv.org/abs/2511.19739v1

Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths

Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context scenarios. Existing methods typically employ a single window length across all attention heads and input sizes. However, this uniform approach fails to capture the heterogeneous attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose *Mixture of Attention Spans* (MoA), which automatically tailors distinct sliding-window length configurations to different heads and layers. MoA constructs and navigates a search space of various window lengths and their scaling rules relative to input sizes. It profiles the model, evaluates potential configurations, and pinpoints the optimal length configurations for each head. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer inputs, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over the uniform-window baseline across Vicuna-{7B, 13B} and Llama3-{8B, 70B} models. Moreover, MoA narrows the performance gap with full attention, reducing the maximum relative performance drop from 9%-36% to within 5% across three long-context understanding benchmarks. MoA achieves a 1.2-1.4x GPU memory reduction, boosting decode throughput by 6.6-8.2x and 1.7-1.9x over FlashAttention2 and vLLM, with minimal performance impact. Our code is available at: https://github.com/thu-nics/MoA
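
The core mechanism, a per-head window length that scales with input size, can be sketched in a few lines. The linear rule `alpha + beta * input_len` and the example parameters are assumptions made for illustration; the abstract only says that MoA searches over window lengths and their scaling rules, and the authors' repository should be consulted for the actual implementation.

```python
def window_length(alpha, beta, input_len):
    """Window size for one head under an assumed linear scaling rule."""
    raw = int(alpha + beta * input_len)
    return max(1, min(input_len, raw))


def sliding_window_mask(input_len, window):
    """Causal mask: position i may attend to the `window` most recent
    positions, itself included."""
    return [
        [0 <= i - j < window for j in range(input_len)]
        for i in range(input_len)
    ]


# Two hypothetical heads: one keeps a fixed local span, one grows with input.
local_head = window_length(alpha=4, beta=0.0, input_len=100)
growing_head = window_length(alpha=2, beta=0.5, input_len=8)
mask = sliding_window_mask(input_len=4, window=2)
```

With these toy parameters the first head always sees 4 tokens, while the second widens as the input grows, mirroring the two head behaviors the abstract describes.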

Updated: 2025-11-24 21:52:43

Categories: cs.LG, cs.AI, cs.CL

Download: http://arxiv.org/abs/2406.14909v3

Integrating RCTs, RWD, AI/ML and Statistics: Next-Generation Evidence Synthesis

Randomized controlled trials (RCTs) have been the cornerstone of clinical evidence; however, their cost, duration, and restrictive eligibility criteria limit power and external validity. Studies using real-world data (RWD), historically considered less reliable for establishing causality, are now recognized to be important for generating real-world evidence (RWE). In parallel, artificial intelligence and machine learning (AI/ML) are being increasingly used throughout the drug development process, providing scalability and flexibility but also presenting challenges in interpretability and rigor that traditional statistics do not face. This Perspective argues that the future of evidence generation will not depend on RCTs versus RWD, or statistics versus AI/ML, but on their principled integration. To this end, a causal roadmap is needed to clarify inferential goals, make assumptions explicit, and ensure transparency about tradeoffs. We highlight key objectives of integrative evidence synthesis, including transporting RCT results to broader populations, embedding AI-assisted analyses within RCTs, designing hybrid controlled trials, and extending short-term RCTs with long-term RWD. We also outline future directions in privacy-preserving analytics, uncertainty quantification, and small-sample methods. By uniting statistical rigor with AI/ML innovation, integrative approaches can produce robust, transparent, and policy-relevant evidence, making them a key component of modern regulatory science.

Updated: 2025-11-24 21:51:52

Categories: stat.ME, cs.LG

Download: http://arxiv.org/abs/2511.19735v1

Relative Advantage Debiasing for Watch-Time Prediction in Short-Video Recommendation

Watch time is widely used as a proxy for user satisfaction in video recommendation platforms. However, raw watch times are influenced by confounding factors such as video duration, popularity, and individual user behaviors, potentially distorting preference signals and resulting in biased recommendation models. We propose a novel relative advantage debiasing framework that corrects watch time by comparing it to empirically derived reference distributions conditioned on user and item groups. This approach yields a quantile-based preference signal and introduces a two-stage architecture that explicitly separates distribution estimation from preference learning. Additionally, we present distributional embeddings to efficiently parameterize watch-time quantiles without requiring online sampling or storage of historical data. Both offline and online experiments demonstrate significant improvements in recommendation accuracy and robustness compared to existing baseline methods.
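
The quantile-based signal described above can be illustrated with an empirical reference distribution. This sketch conditions only on a coarse item group (video length) and stores raw watch times, both simplifications chosen for brevity; the paper conditions on user and item groups and uses distributional embeddings instead of stored history. All names and data here are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_reference(records, group_of):
    """Collect sorted watch times per group as empirical reference distributions."""
    reference = defaultdict(list)
    for record in records:
        reference[group_of(record)].append(record["watch_time"])
    for values in reference.values():
        values.sort()
    return dict(reference)


def relative_advantage(reference, group, watch_time):
    """Quantile of watch_time within its group's reference distribution."""
    values = reference.get(group, [])
    if not values:
        return 0.5  # neutral score when no reference data exists
    return bisect_right(values, watch_time) / len(values)


def group_of(record):
    return "short" if record["duration"] < 60 else "long"


records = [{"duration": 10, "watch_time": t} for t in (2, 4, 6, 8)]
records += [{"duration": 100, "watch_time": t} for t in (10, 20, 30, 40)]
reference = build_reference(records, group_of)

short_score = relative_advantage(reference, "short", 8)
long_score = relative_advantage(reference, "long", 10)
```

In this toy data, 8 seconds on a short video ranks at the top of its group while 10 seconds on a long video ranks near the bottom, even though the raw watch time is larger in the second case.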

Updated: 2025-11-24 21:46:38

Categories: cs.LG, cs.IR

Download: http://arxiv.org/abs/2508.11086v3

Training-Free Active Learning Framework in Materials Science with Large Language Models

Active learning (AL) accelerates scientific discovery by prioritizing the most informative experiments, but traditional machine learning (ML) models used in AL suffer from cold-start limitations and domain-specific feature engineering, restricting their generalizability. Large language models (LLMs) offer a new paradigm by leveraging their pretrained knowledge and universal token-based representations to propose experiments directly from text-based descriptions. Here, we introduce an LLM-based active learning framework (LLM-AL) that operates in an iterative few-shot setting and benchmark it against conventional ML models across four diverse materials science datasets. We explored two prompting strategies: one using concise numerical inputs suited for datasets with more compositional and structured features, and another using expanded descriptive text suited for datasets with more experimental and procedural features to provide additional context. Across all datasets, LLM-AL could reduce the number of experiments needed to reach top-performing candidates by over 70% and consistently outperformed traditional ML models. We found that LLM-AL performs broader and more exploratory searches while still reaching the optima with fewer iterations. We further examined the stability boundaries of LLM-AL given the inherent non-determinism of LLMs and found its performance to be broadly consistent across runs, within the variability range typically observed for traditional ML approaches. These results demonstrate that LLM-AL can serve as a generalizable alternative to conventional AL pipelines for more efficient and interpretable experiment selection and potential LLM-driven autonomous discovery.

Updated: 2025-11-24 21:46:29

Categories: cs.LG, cond-mat.mtrl-sci

Download: http://arxiv.org/abs/2511.19730v1

Prompt Fencing: A Cryptographic Approach to Establishing Security Boundaries in Large Language Model Prompts

Large Language Models (LLMs) remain vulnerable to prompt injection attacks, representing the most significant security threat in production deployments. We present Prompt Fencing, a novel architectural approach that applies cryptographic authentication and data architecture principles to establish explicit security boundaries within LLM prompts. Our approach decorates prompt segments with cryptographically signed metadata including trust ratings and content types, enabling LLMs to distinguish between trusted instructions and untrusted content. While current LLMs lack native fence awareness, we demonstrate that simulated awareness through prompt instructions achieved complete prevention of injection attacks in our experiments, reducing success rates from 86.7% (260/300 successful attacks) to 0% (0/300 successful attacks) across 300 test cases with two leading LLM providers. We implement a proof-of-concept fence generation and verification pipeline with a total overhead of 0.224 seconds (0.130s for fence generation, 0.094s for validation) across 100 samples. Our approach is platform-agnostic and can be incrementally deployed as a security layer above existing LLM infrastructure, with the expectation that future models will be trained with native fence awareness for optimal security.
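
A fence in this sense is a prompt segment wrapped with authenticated metadata. The sketch below shows one way such a wrapper could be generated and verified. It uses HMAC-SHA256 from the Python standard library as a stand-in, which is a symmetric scheme; the signature algorithm, field names, and trust labels are assumptions, since the abstract does not specify them.

```python
import hashlib
import hmac
import json


def _payload(meta, content):
    """Canonical byte encoding of the metadata and content to authenticate."""
    return json.dumps({"meta": meta, "content": content}, sort_keys=True).encode("utf-8")


def make_fence(content, trust, content_type, key):
    """Wrap a prompt segment with metadata and an authentication tag."""
    meta = {"trust": trust, "type": content_type}
    tag = hmac.new(key, _payload(meta, content), hashlib.sha256).hexdigest()
    return {"meta": meta, "content": content, "sig": tag}


def verify_fence(fence, key):
    """Return True only if neither the metadata nor the content was altered."""
    expected = hmac.new(key, _payload(fence["meta"], fence["content"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, fence["sig"])


key = b"demo-key-not-for-production"
fence = make_fence("Ignore all previous instructions.", "untrusted", "user-data", key)

# An attacker who relabels the segment as trusted invalidates the tag.
tampered = dict(fence, meta={"trust": "trusted", "type": "user-data"})
```

Verification only establishes which segments carry which trust label; as the abstract notes, the model must still be instructed (or eventually trained) to treat untrusted segments as data and not as instructions.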

Updated: 2025-11-24 21:44:33

Categories: cs.CR, cs.AI

Download: http://arxiv.org/abs/2511.19727v1

An Adaptive, Data-Integrated Agent-Based Modeling Framework for Explainable and Contestable Policy Design

Multi-agent systems often operate under feedback, adaptation, and non-stationarity, yet many simulation studies retain static decision rules and fixed control parameters. This paper introduces a general adaptive multi-agent learning framework that integrates: (i) four dynamic regimes distinguishing static versus adaptive agents and fixed versus adaptive system parameters; (ii) information-theoretic diagnostics (entropy rate, statistical complexity, and predictive information) to assess predictability and structure; (iii) structural causal models for explicit intervention semantics; (iv) procedures for generating agent-level priors from aggregate or sample data; and (v) unsupervised methods for identifying emergent behavioral regimes. The framework offers a domain-neutral architecture for analyzing how learning agents and adaptive controls jointly shape system trajectories, enabling systematic comparison of stability, performance, and interpretability across non-equilibrium, oscillatory, or drifting dynamics. Mathematical definitions, computational operators, and an experimental design template are provided, yielding a structured methodology for developing explainable and contestable multi-agent decision processes.

Updated: 2025-11-24 21:41:45

Categories: cs.MA, cs.AI, cs.LG, eess.SY

Download: http://arxiv.org/abs/2511.19726v1

Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries

In the era of foundation models and Large Language Models (LLMs), Euclidean space has been the de facto geometric setting for machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. At a large scale, real-world data often exhibits inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling, in a variety of domains, such as languages, vision, and the natural sciences. It is challenging to effectively capture these structures within the constraints of Euclidean spaces. This position paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next generation of foundation models. By adopting these geometries, foundation models could more efficiently leverage the aforementioned structures. Task-aware adaptability that dynamically reconfigures embeddings to match the geometry of downstream applications could further enhance efficiency and expressivity. Our position is supported by a series of theoretical and empirical investigations of prevalent foundation models. Finally, we outline a roadmap for integrating non-Euclidean geometries into foundation models, including strategies for building geometric foundation models via fine-tuning, training from scratch, and hybrid approaches.

Updated: 2025-11-24 21:40:39

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2504.08896v2

RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.

Updated: 2025-11-24 21:39:54

Categories: cs.CV, cs.AI, cs.LG

Download: http://arxiv.org/abs/2505.13344v2

Individual and group fairness in geographical partitioning

Socioeconomic segregation often arises in school districting and other contexts, causing some groups to be over- or under-represented within a particular district. This phenomenon is closely linked with disparities in opportunities and outcomes. We formulate a new class of geographical partitioning problems in which the population is heterogeneous, and it is necessary to ensure fair representation for each group at each facility. We prove that the optimal solution is a novel generalization of the additively weighted Voronoi diagram, and we propose a simple and efficient algorithm to compute it, thus resolving an open question dating back to Dvoretzky et al. (1951). The efficacy and potential for practical insight of the approach are demonstrated in a realistic case study involving seven demographic groups and 78 district offices.
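
For readers unfamiliar with the additively weighted Voronoi diagram mentioned above, the classic version assigns each point to the site that minimizes distance minus a per-site weight. The sketch below shows only that classic assignment rule with made-up sites and weights; it does not implement the paper's generalization or its fairness constraints.

```python
from math import dist


def assign(point, sites, weights):
    """Index of the site minimizing (Euclidean distance - site weight)."""
    costs = [dist(point, site) - weight for site, weight in zip(sites, weights)]
    return min(range(len(sites)), key=costs.__getitem__)


sites = [(0.0, 0.0), (10.0, 0.0)]

# With equal weights the boundary is the perpendicular bisector.
unweighted = assign((4.0, 0.0), sites, weights=[0.0, 0.0])

# Raising the second site's weight enlarges its cell.
weighted = assign((4.0, 0.0), sites, weights=[0.0, 4.0])
```

Adjusting the weights shifts cell boundaries without moving the facilities, which is the lever a partitioning method of this kind can use to rebalance who is assigned where.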

Updated: 2025-11-24 21:34:51

Categories: econ.EM,cs.LG

Download: http://arxiv.org/abs/2511.19722v1
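
The abstract above builds on the additively weighted Voronoi diagram. As a point of reference, the classical (ungeneralized) version can be sketched as follows: each point goes to the site that minimizes Euclidean distance minus that site's weight. The sign convention and the function name are assumptions for illustration, and the paper's fairness-aware generalization is not reproduced here.

```python
import math


def assign_additively_weighted(points, sites, weights):
    """Return, for each point, the index of the site minimizing dist(point, site) - weight."""
    labels = []
    for point in points:
        costs = [math.dist(point, site) - w for site, w in zip(sites, weights)]
        labels.append(min(range(len(sites)), key=costs.__getitem__))
    return labels
```

With all weights equal this reduces to the ordinary nearest-site Voronoi assignment. Raising one site's weight enlarges its cell, which is the lever a partitioning method can use to rebalance who is assigned where.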

Realistic CDSS Drug Dosing with End-to-end Recurrent Q-learning for Dual Vasopressor Control

Reinforcement learning (RL) applications in Clinical Decision Support Systems (CDSS) frequently encounter skepticism because models may recommend inoperable dosing decisions. We propose an end-to-end offline RL framework for dual vasopressor administration in Intensive Care Units (ICUs) that directly addresses this challenge through principled action space design. Our method integrates discrete, continuous, and directional dosing strategies with conservative Q-learning and incorporates a novel recurrent modeling using a replay buffer to capture temporal dependencies in ICU time-series data. Our comparative analysis of norepinephrine dosing strategies across different action space formulations reveals that the designed action spaces improve interpretability and facilitate clinical adoption while preserving efficacy. Empirical results on eICU and MIMIC demonstrate that action space design profoundly influences learned behavioral policies. Compared with baselines, the proposed methods achieve more than 3x expected reward improvements, while aligning with established clinical protocols.

Updated: 2025-11-24 21:25:45

Categories: cs.LG

Download: http://arxiv.org/abs/2510.01508v2
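
For readers unfamiliar with conservative Q-learning, which the abstract above combines with its action space designs, the sketch below shows the commonly used regularizer for a discrete action space: the log-sum-exp of the Q-values at a state minus the Q-value of the action actually recorded in the dataset. This is a generic illustration with assumed names, not the paper's recurrent model or its dosing action spaces.

```python
import math


def cql_penalty(q_values, dataset_action):
    """Conservative penalty: logsumexp(Q(s, .)) - Q(s, a_data), computed stably."""
    q_max = max(q_values)
    log_sum_exp = q_max + math.log(sum(math.exp(q - q_max) for q in q_values))
    return log_sum_exp - q_values[dataset_action]
```

The penalty is small when the logged action already has the highest Q-value and large when the network favors actions that never appear in the data, which discourages an offline policy from recommending unsupported decisions.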

CrypTorch: PyTorch-based Auto-tuning Compiler for Machine Learning with Multi-party Computation

Machine learning (ML) involves private data and proprietary model parameters. MPC-based ML uses multi-party computation (MPC) to let multiple parties collaboratively run an ML workload without sharing their private data or model parameters. Because MPC cannot natively run ML operations such as Softmax or GELU, existing frameworks use different approximations. Our study shows that, on a well-optimized framework, these approximations often become the dominant bottleneck. Popular approximations are often insufficiently accurate or unnecessarily slow, and these issues are hard to identify and fix in existing frameworks. To tackle this issue, we propose CrypTorch, a compiler for MPC-based ML. CrypTorch disentangles these approximations from the rest of the MPC runtime, makes it easy to add new approximations through its programming interface, and automatically selects approximations to maximize both performance and accuracy. CrypTorch is built as an extension to PyTorch 2's compiler. We show that its auto-tuning alone provides an immediate 1.20-1.7× speedup without sacrificing accuracy, and a 1.31-1.8× speedup when some accuracy degradation is allowed, compared to our well-optimized baseline. Combined with better engineering and adoption of state-of-the-art practices, the entire framework brings a 3.22-8.6× end-to-end speedup over the popular framework CrypTen.

Updated: 2025-11-24 21:21:55

Categories: cs.CR,cs.AI,cs.PL

Download: http://arxiv.org/abs/2511.19711v1
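
To illustrate the kind of approximation the CrypTorch abstract refers to, the sketch below evaluates the exponential with the limit formula exp(x) ≈ (1 + x / 2^n)^(2^n), which needs only one scaling by a public constant and n squarings. It runs on plain floats rather than secret shares, and the function name and default iteration count are assumptions for illustration, not CrypTorch's interface.

```python
def exp_limit_approx(x, iterations=8):
    """Approximate exp(x) as (1 + x / 2**iterations) ** (2**iterations) via repeated squaring."""
    y = 1.0 + x / (2 ** iterations)
    for _ in range(iterations):
        y = y * y  # each squaring doubles the exponent
    return y
```

The trade-off the abstract describes is visible here: more iterations cost more multiplications (expensive under MPC) but reduce the error, and the formula degrades for inputs far from zero. An auto-tuner can pick the cheapest setting that still meets an accuracy target.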

The Alexander-Hirschowitz theorem for neurovarieties

We study neurovarieties for polynomial neural networks and fully characterize when they attain the expected dimension in the single-output case. As consequences, we establish non-defectiveness and global identifiability for multi-output architectures.

Updated: 2025-11-24 21:09:42

Categories: math.AG,cs.AI,cs.LG,math.AC

Download: http://arxiv.org/abs/2511.19703v1

A Layered Protocol Architecture for the Internet of Agents

Large Language Models (LLMs) have demonstrated remarkable performance improvements and the ability to learn domain-specific languages (DSLs), including APIs and tool interfaces. This capability has enabled the creation of AI agents that can perform preliminary computations and act through tool calling, now being standardized via protocols like MCP. However, LLMs face fundamental limitations: their context windows cannot grow indefinitely, constraining their memory and computational capacity. Agent collaboration emerges as essential for solving increasingly complex problems, mirroring how computational systems rely on different types of memory to scale. The "Internet of Agents" (IoA) represents the communication stack that enables agents to scale by distributing computation across collaborating entities. Current network architectural stacks (OSI and TCP/IP) were designed for data delivery between hosts and processes, not for agent collaboration with semantic understanding. To address this gap, we propose two new layers: an Agent Communication Layer (L8) and an Agent Semantic Negotiation Layer (L9). L8 formalizes the structure of communication, standardizing message envelopes, speech-act performatives (e.g., REQUEST, INFORM), and interaction patterns (e.g., request-reply, publish-subscribe), building on protocols like MCP. L9, which does not exist today, formalizes the meaning of communication, enabling agents to discover, negotiate, and lock a "Shared Context": a formal schema defining the concepts, tasks, and parameters relevant to their interaction. Together, these layers provide the foundation for scalable, distributed agent collaboration, enabling the next generation of multi-agentic systems.

Updated: 2025-11-24 21:06:14

Categories: cs.NI,cs.AI,cs.MA

Download: http://arxiv.org/abs/2511.19699v1
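
The message envelopes, performatives, and request-reply pattern described for L8 in the abstract above could look roughly like the sketch below. Every field name here (`context_id`, `in_reply_to`, and so on) and the `make_reply` helper are hypothetical; the abstract does not specify a wire format, and nothing here is taken from MCP.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Performative(Enum):
    """Speech-act types named in the abstract; a real protocol would define more."""
    REQUEST = "REQUEST"
    INFORM = "INFORM"


@dataclass(frozen=True)
class Envelope:
    sender: str
    receiver: str
    performative: Performative
    content: dict
    context_id: str  # hypothetical handle to a shared context agreed at L9
    message_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    in_reply_to: Optional[str] = None


def make_reply(request, content):
    """Request-reply pattern: answer a REQUEST with an INFORM linked to the original message."""
    return Envelope(
        sender=request.receiver,
        receiver=request.sender,
        performative=Performative.INFORM,
        content=content,
        context_id=request.context_id,
        in_reply_to=request.message_id,
    )
```

Keeping `context_id` separate from `content` mirrors the split the abstract draws between the structure of a message (L8) and the agreed meaning of its payload (L9).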

TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification

The ubiquity of time series data creates a strong demand for general-purpose foundation models, yet developing them for classification remains a significant challenge, largely due to the high cost of labeled data. Foundation models capable of in-context learning (ICL) offer a powerful solution, adapting to new tasks with minimal examples and reducing the need for extensive retraining. However, prior work on large-scale time series models has predominantly focused on forecasting, leaving a critical gap for versatile, fine-tuning-free classification. To address this, we introduce TiCT (Time-series in-Context Transformer), a transformer-based model pre-trained exclusively on synthetic data to perform in-context classification. We make two primary technical contributions: 1) a novel architecture featuring a scalable bit-based label encoding and a special output attention mechanism to handle an arbitrary number of classes; and 2) a synthetic pre-training framework that combines a Mixup-inspired process with data augmentation to foster generalization and noise invariance. Extensive evaluations on the UCR Archive show that TiCT achieves competitive performance against state-of-the-art supervised methods. Crucially, this is accomplished using only in-context examples at inference time, without updating a single model weight.

Updated: 2025-11-24 20:57:55

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19694v1
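
One plausible reading of the "bit-based label encoding" mentioned in the TiCT abstract above is that each class index is written as a fixed-width binary vector, so the label width grows logarithmically with the number of classes. The sketch below shows that idea only; the bit order, the function names, and whether TiCT encodes labels exactly this way are assumptions.

```python
def encode_label(index, num_bits):
    """Encode a class index as a list of bits, least significant bit first."""
    if not 0 <= index < 2 ** num_bits:
        raise ValueError("index does not fit in the requested number of bits")
    return [(index >> b) & 1 for b in range(num_bits)]


def decode_label(bits):
    """Invert encode_label: rebuild the class index from its bits."""
    return sum(bit << b for b, bit in enumerate(bits))
```

With this scheme, 10 bits are enough for up to 1,024 classes, whereas a one-hot label would need 1,024 dimensions.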

TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding

Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people's lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.

Updated: 2025-11-24 20:57:31

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19693v1

IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants

We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo

Updated: 2025-11-24 20:45:17

Categories: cs.CV,cs.AI,cs.HC,cs.RO

Download: http://arxiv.org/abs/2511.19684v1

Solving Diffusion Inverse Problems with Restart Posterior Sampling

Inverse problems are fundamental to science and engineering, where the goal is to infer an underlying signal or state from incomplete or noisy measurements. Recent approaches employ diffusion models as powerful implicit priors for such problems, owing to their ability to capture complex data distributions. However, existing diffusion-based methods for inverse problems often rely on strong approximations of the posterior distribution, require computationally expensive gradient backpropagation through the score network, or are restricted to linear measurement models. In this work, we propose Restart for Posterior Sampling (RePS), a general and efficient framework for solving both linear and non-linear inverse problems using pre-trained diffusion models. RePS builds on the idea of restart-based sampling, previously shown to improve sample quality in unconditional diffusion, and extends it to posterior inference. Our method employs a conditioned ODE applicable to any differentiable measurement model and introduces a simplified restart strategy that contracts accumulated approximation errors during sampling. Unlike some of the prior approaches, RePS avoids backpropagation through the score network, substantially reducing computational cost. We demonstrate that RePS achieves faster convergence and superior reconstruction quality compared to existing diffusion-based baselines across a range of inverse problems, including both linear and non-linear settings.

Updated: 2025-11-24 20:42:33

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2511.20705v1

FISCAL: Financial Synthetic Claim-document Augmented Learning for Efficient Fact-Checking

Financial applications of large language models (LLMs) require factual reliability and computational efficiency, yet current systems often hallucinate details and depend on prohibitively large models. We propose FISCAL (Financial Synthetic Claim-Document Augmented Learning), a modular framework for generating synthetic data tailored to financial fact-checking. Using FISCAL, we generate a dataset called FISCAL-data and use it to train MiniCheck-FISCAL, a lightweight verifier for numerical financial claims. MiniCheck-FISCAL outperforms its baseline, surpasses GPT-3.5 Turbo and other open-source peers of similar size, and approaches the accuracy of much larger systems (20x), such as Mixtral-8x22B and Command R+. On external datasets FinDVer and Fin-Fact, it rivals GPT-4o and Claude-3.5 while outperforming Gemini-1.5 Flash. These results show that domain-specific synthetic data, combined with efficient fine-tuning, enables compact models to achieve state-of-the-art accuracy, robustness, and scalability for practical financial AI. The dataset and scripts are available in the project repository (link provided in the paper).

Updated: 2025-11-24 20:11:44

Categories: cs.AI

Download: http://arxiv.org/abs/2511.19671v1

BASICS: Binary Analysis and Stack Integrity Checker System for Buffer Overflow Mitigation

Cyber-Physical Systems (CPS) play an essential role in our daily lives, providing critical services such as power and water, whose operability, availability, and reliability must be ensured. The C programming language, prevalent in CPS development, is crucial for system control where reliability is critical. However, it is also commonly susceptible to vulnerabilities, particularly buffer overflows (BO). Traditional vulnerability discovery techniques often struggle with scalability and precision when applied directly to the binary code of C programs, which can leave programs vulnerable. This work introduces a novel approach designed to overcome these limitations. It leverages model checking and concolic execution to automatically verify security properties of a program's stack memory in binary code, trampoline techniques to automatically repair the issues found, and crash-inducing inputs to verify that they were successfully removed. The approach constructs a Memory State Space (MemStaCe) from the binary program's control flow graph and from simulations, provided by concolic execution, of C function calls and loop constructs. The security properties, defined in LTL, model the correct behaviour of functions associated with vulnerabilities and allow the approach to identify vulnerabilities in MemStaCe by analysing the counterexample traces generated when a security property is violated. These vulnerabilities are then addressed with a trampoline-based binary patching method, and the effectiveness of the patches is checked with crash-inducing inputs extracted during concolic execution. We implemented the approach in the BASICS tool for BO mitigation and evaluated it using the Juliet C/C++ and SARD datasets and real applications, achieving accuracy and precision above 87% in both detection and correction. We also compared it with CWE Checker, which it outperformed.

Updated: 2025-11-24 20:11:41

Categories: cs.CR

Download: http://arxiv.org/abs/2511.19670v1

HeaRT: A Hierarchical Circuit Reasoning Tree-Based Agentic Framework for AMS Design Optimization

Conventional AI-driven AMS design automation algorithms remain constrained by their reliance on high-quality datasets to capture underlying circuit behavior, coupled with poor transferability across architectures, and a lack of adaptive mechanisms. This work proposes HeaRT, a foundational reasoning engine for automation loops and a first step toward intelligent, adaptive, human-style design optimization. HeaRT consistently demonstrates reasoning accuracy >97% and Pass@1 performance >98% across our 40-circuit benchmark repository, even as circuit complexity increases, while operating at <0.5x real-time token budget of SOTA baselines. Our experiments show that HeaRT yields >3x faster convergence in both sizing and topology design adaptation tasks across diverse optimization approaches, while preserving prior design intent.

Updated: 2025-11-24 20:11:06

Categories: cs.AI

Download: http://arxiv.org/abs/2511.19669v1

Ontology-Aware RAG for Improved Question-Answering in Cybersecurity Education

Integrating AI into education has the potential to transform the teaching of science and technology courses, particularly in the field of cybersecurity. AI-driven question-answering (QA) systems can actively manage uncertainty in cybersecurity problem-solving, offering interactive, inquiry-based learning experiences. Recently, large language models (LLMs) have gained prominence in AI-driven QA systems, enabling advanced language understanding and user engagement. However, they face challenges such as hallucinations and limited domain-specific knowledge, which reduce their reliability in educational settings. To address these challenges, we propose CyberRAG, an ontology-aware retrieval-augmented generation (RAG) approach for developing a reliable and safe QA system in cybersecurity education. CyberRAG employs a two-step approach: first, it augments the domain-specific knowledge by retrieving validated cybersecurity documents from a knowledge base to enhance the relevance and accuracy of the response. Second, it mitigates hallucinations and misuse by integrating a knowledge graph ontology to validate the final answer. Comprehensive experiments on publicly available datasets reveal that CyberRAG delivers accurate, reliable responses aligned with domain knowledge, demonstrating the potential of AI tools to enhance education.

Updated: 2025-11-24 20:05:08

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2412.14191v2

Large language models replicate and predict human cooperation across experiments in game theory

Large language models (LLMs) are increasingly used both to make decisions in domains such as health, education and law, and to simulate human behavior. Yet how closely LLMs mirror actual human decision-making remains poorly understood. This gap is critical: misalignment could produce harmful outcomes in practical applications, while failure to replicate human behavior renders LLMs ineffective for social simulations. Here, we address this gap by developing a digital twin of game-theoretic experiments and introducing a systematic prompting and probing framework for machine-behavioral evaluation. Testing three open-source models (Llama, Mistral and Qwen), we find that Llama reproduces human cooperation patterns with high fidelity, capturing human deviations from rational choice theory, while Qwen aligns closely with Nash equilibrium predictions. Notably, we achieved population-level behavioral replication without persona-based prompting, simplifying the simulation process. Extending beyond the original human-tested games, we generate and preregister testable hypotheses for novel game configurations outside the original parameter grid. Our findings demonstrate that appropriately calibrated LLMs can replicate aggregate human behavioral patterns and enable systematic exploration of unexplored experimental spaces, offering a complementary approach to traditional research in the social and behavioral sciences that generates new empirical predictions about human social decision-making.

Updated: 2025-11-24 20:04:15

Categories: cs.AI,cs.CL,cs.GT,cs.MA

Download: http://arxiv.org/abs/2511.04500v2
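
The abstract above contrasts human cooperation patterns with Nash equilibrium predictions. As a reminder of what the latter means, the sketch below enumerates the pure-strategy Nash equilibria of a two-player game given as a payoff table. The Prisoner's Dilemma payoffs in the usage note are textbook values chosen for illustration; the abstract does not say which games or payoffs the study used.

```python
def pure_nash_equilibria(payoffs):
    """payoffs[(a, b)] = (row player's payoff, column player's payoff).

    Returns every action pair from which neither player gains by deviating alone.
    """
    row_actions = sorted({a for a, _ in payoffs})
    col_actions = sorted({b for _, b in payoffs})
    equilibria = []
    for a in row_actions:
        for b in col_actions:
            row_best = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in row_actions)
            col_best = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in col_actions)
            if row_best and col_best:
                equilibria.append((a, b))
    return equilibria
```

For a Prisoner's Dilemma with payoffs `{("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}`, the only equilibrium is mutual defection, even though mutual cooperation pays both players more. That gap is where human play typically departs from rational-choice predictions.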

LLM Collaboration With Multi-Agent Reinforcement Learning

A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges. Our code is available at https://github.com/OpenMLRL/CoMLRL.

Updated: 2025-11-24 20:01:23

Categories: cs.AI,cs.SE

Download: http://arxiv.org/abs/2508.04652v5
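
The "group relative" part of the MAGRPO name in the abstract above refers to scoring each sampled response against the other responses in its group rather than against a learned value function. The sketch below shows that normalization in its usual single-agent form. How MAGRPO shares rewards across agents and turns is not described in the abstract, and the use of the population standard deviation and the `eps` term are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat their group average get a positive advantage and are reinforced, while a group in which every response earns the same reward contributes no learning signal.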

Fara-7B: An Efficient Agentic Model for Computer Use

Progress in computer use agents (CUAs) has been constrained by the absence of large and high-quality datasets that capture how humans interact with a computer. While LLMs have thrived on abundant textual data, no comparable corpus exists for CUA trajectories. To address these gaps, we introduce FaraGen, a novel synthetic data generation system for multi-step web tasks. FaraGen can propose diverse tasks from frequently used websites, generate multiple solution attempts, and filter successful trajectories using multiple verifiers. It achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately $1 each. We use this data to train Fara-7B, a native CUA model that perceives the computer using only screenshots, executes actions via predicted coordinates, and is small enough to run on-device. We find that Fara-7B outperforms other CUA models of comparable size on benchmarks like WebVoyager, Online-Mind2Web, and WebTailBench -- our novel benchmark that better captures under-represented web tasks in pre-existing benchmarks. Furthermore, Fara-7B is competitive with much larger frontier models, illustrating key benefits of scalable data generation systems in advancing small efficient agentic models. We are making Fara-7B open-weight on Microsoft Foundry and HuggingFace, and we are releasing WebTailBench.

Updated: 2025-11-24 19:56:28

Categories: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2511.19663v1

What is Implementation Science; and Why It Matters for Bridging the Artificial Intelligence Innovation-to-Application Gap in Medical Imaging

The transformative potential of artificial intelligence (AI) in medical imaging (MI) is well recognized. Yet despite promising reports in research settings, many AI tools fail to achieve clinical adoption in practice. More generally, there is a documented 17-year average delay between evidence generation and implementation of a technology. Implementation science (IS) may provide a practical, evidence-based framework to bridge the gap between AI development and real-world clinical imaging use, shortening this lag through systematic frameworks, strategies, and hybrid research designs. We outline challenges specific to AI adoption in MI workflows, including infrastructural, educational, and cultural barriers. We highlight the complementary roles of effectiveness research and implementation research, emphasizing hybrid study designs and the role of integrated knowledge translation (iKT), stakeholder engagement, and equity-focused co-creation in designing sustainable and generalizable solutions. We discuss the integration of Human-Computer Interaction (HCI) frameworks in MI towards usable AI. Adopting IS is not only a methodological advancement; it is a strategic imperative for accelerating translation of innovation into improved patient outcomes.

Updated: 2025-11-24 19:53:22

Categories: physics.med-ph,cs.AI

Download: http://arxiv.org/abs/2510.13006v3

Accuracy and Efficiency Trade-Offs in LLM-Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning

This study examines whether Large Language Models (LLMs) fine-tuned with Low-Rank Adaptation (LoRA) can approximate the performance of fully fine-tuned models in generating human-interpretable decisions and explanations for malware classification. Achieving trustworthy malware detection, particularly when LLMs are involved, remains a significant challenge. We developed an evaluation framework using Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and semantic similarity metrics to benchmark explanation quality across five LoRA configurations and a fully fine-tuned baseline. Results indicate that full fine-tuning achieves the highest overall scores, with BLEU and ROUGE improvements of up to 10% over LoRA variants. However, mid-range LoRA models deliver competitive performance, exceeding full fine-tuning on two metrics, while a LoRA model with 15.5% trainable parameters reduces model size by approximately 81% and training time by over 80%. These findings demonstrate that LoRA offers a practical balance of interpretability and resource efficiency, enabling deployment in resource-constrained environments without sacrificing explanation quality. By providing feature-driven natural language explanations for malware classifications, this approach enhances transparency, analyst confidence, and operational scalability in malware detection systems.

Updated: 2025-11-24 19:37:13

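Title: Accuracy and Efficiency Trade-Offs in LLM-Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning

Abstract: This study examines whether large language models (LLMs) fine-tuned with Low-Rank Adaptation (LoRA) can approach the performance of fully fine-tuned models when generating human-interpretable decisions and explanations for malware classification. Achieving trustworthy malware detection remains a major challenge when LLMs are involved. We developed an evaluation framework that uses Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and semantic similarity metrics to assess explanation quality across five LoRA configurations and a fully fine-tuned baseline. The results show that full fine-tuning achieves the highest overall scores, with BLEU and ROUGE improvements of up to 10% over the LoRA variants. However, mid-range LoRA models are competitive, exceeding full fine-tuning on two metrics while reducing model size by about 81% and training time by more than 80% for a LoRA model with 15.5% trainable parameters. These findings indicate that LoRA offers a practical balance between interpretability and resource efficiency, enabling deployment in resource-constrained environments without sacrificing explanation quality. By providing feature-driven natural-language explanations for malware classifications, the approach improves the transparency, analyst confidence, and operational scalability of malware detection systems.

Updated: 2025-11-24 19:37:13

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.19654v1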

Synthetic Data: AI's New Weapon Against Android Malware

The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples present significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to address the issues of obsolescence and low-quality data in malware detection.

Updated: 2025-11-24 19:27:58

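Title: Synthetic Data: AI's New Weapon Against Android Malware

Abstract: The ever-growing number of Android devices and the accelerating evolution of malware, which exceeded 35 million samples by 2024, underscore the importance of effective detection methods. Attackers now use artificial intelligence to create sophisticated malware variants that easily evade traditional detection techniques. Although machine learning has shown promise for malware classification, its success depends heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples pose significant challenges for developing robust detection models. In this paper, we propose MalSynGen, a malware synthetic data generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. The data preserve the statistical properties of real-world data and improve the performance of Android malware classifiers. We evaluate the approach using a variety of datasets and metrics that assess the fidelity of the generated data, its utility for classification, and the computational efficiency of the process. Our experiments show that MalSynGen generalizes across different datasets, offering a viable solution to the problems of outdated and low-quality data in malware detection.

Updated: 2025-11-24 19:27:58

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.19649v1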

Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search

Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.
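The first algorithm above, LLM-Guided Planning, separates planning (one LLM call that predicts a relation sequence) from execution (a breadth-first traversal of the graph). The execution half can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is stored as a dictionary keyed by (head, relation), the relation names are made up in a MetaQA-like style, and the LLM call is replaced by a hard-coded plan.

```python
from collections import defaultdict


def build_index(triples):
    """Index (head, relation, tail) triples as (head, relation) -> set of tails."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def execute_plan(index, start_entity, relation_sequence):
    """Follow a relation sequence hop by hop, expanding the whole frontier at each hop."""
    frontier = {start_entity}
    for relation in relation_sequence:
        next_frontier = set()
        for entity in frontier:
            next_frontier |= index.get((entity, relation), set())
        frontier = next_frontier
        if not frontier:      # the plan cannot be grounded in the graph
            break
    return frontier


# Toy graph with illustrative relation names.
triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Interstellar"),
    ("Christopher Nolan", "directed", "Inception"),
    ("Interstellar", "starred_actors", "Matthew McConaughey"),
    ("Inception", "starred_actors", "Leonardo DiCaprio"),
]
index = build_index(triples)

# Stand-in for the single LLM call: a predicted 2-hop relation sequence.
plan = ["directed_by", "directed"]
answers = execute_plan(index, "Inception", plan)
```

Because every returned entity is reached by following edges that exist in the index, an answer is either grounded in the graph or absent; a plan that matches no edges yields an empty set rather than a guess.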

Updated: 2025-11-24 19:27:56

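Title: Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search

Abstract: Multi-hop question answering over knowledge graphs remains computationally challenging because of the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive large language model (LLM) inference for both entity linking and path ranking, which limits their practical deployment. In addition, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning, which uses a single LLM call to predict relation sequences that are executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring that all answers are grounded in the knowledge graph; and (2) Embedding-Guided Neural Search, which eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving a speedup of more than 100 times with competitive accuracy. Through knowledge distillation, we compress the planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA shows that grounded reasoning consistently outperforms ungrounded generation, and that structured planning transfers better than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases that combine symbolic structure with learned representations.

Updated: 2025-11-24 19:27:56

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.19648v1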

SafeFix: Targeted Model Repair via Controlled Image Generation

Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images -- an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix

Updated: 2025-11-24 19:26:12

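Title: SafeFix: Targeted Model Repair via Controlled Image Generation

Abstract: Deep learning models for visual recognition often exhibit systematic errors caused by underrepresented semantic subpopulations. Although existing debugging frameworks can locate these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions typically rely on manually designed prompts to generate synthetic training images, an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module built on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful, targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further use a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models on this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments show that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix

Updated: 2025-11-24 19:26:12

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.08701v2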

Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation

Foundation models (FMs) have unlocked powerful zero-shot capabilities in vision and language, yet their reliance on internet pretraining data leaves them brittle in unstructured, real-world settings. The messy, real-world data encountered during deployment (e.g., occluded or multilingual text) remains massively underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale, real-world data that enriches FM training with precisely the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that transforms robots from FM consumers into data generators. By deploying robots equipped with FMs in the wild, we enable a virtuous cycle: robots perform useful tasks while collecting real-world data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate this framework with Scanford, a mobile manipulator deployed in the East Asia Library for 2 weeks. Scanford autonomously scans shelves, identifies books using a vision-language model (VLM), and leverages the library catalog to label images without human annotation. This deployment both aids librarians and produces a dataset to finetune the underlying VLM, improving performance on the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford improves VLM performance on book identification from 32.0% to 71.8% and boosts domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and 30.8% to 38.0% (Chinese), while saving ~18.7 hours of human time. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and unlock new pathways for continually adapting FMs to the messiness of reality. More details are at: https://scanford-robot.github.io

Updated: 2025-11-24 19:23:04

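Title: Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation

Abstract: Foundation models (FMs) have unlocked powerful zero-shot capabilities in vision and language, but their dependence on internet pretraining data makes them brittle in unstructured real-world environments. The messy real-world data encountered during deployment (e.g., occluded or multilingual text) is still heavily underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale real-world data that enriches FM training with exactly the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that turns robots from FM consumers into data generators. By deploying FM-equipped robots in the wild, we enable a virtuous cycle in which robots perform useful tasks while collecting real-world data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate the framework with Scanford, a mobile manipulator deployed in the East Asia Library for two weeks. Scanford autonomously scans shelves, identifies books with a vision-language model (VLM), and uses the library catalog to label images without human annotation. The deployment both helps librarians and produces a dataset for fine-tuning the underlying VLM, improving performance in the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford raises VLM book-identification performance from 32.0% to 71.8% and improves domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and from 30.8% to 38.0% (Chinese), while saving about 18.7 hours of human time. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and open new pathways for continually adapting FMs to the messiness of reality. More details are at: https://scanford-robot.github.io

Updated: 2025-11-24 19:23:04

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2511.19647v1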

IRSDA: An Agent-Orchestrated Framework for Enterprise Intrusion Response

Modern enterprise systems face escalating cyber threats that are increasingly dynamic, distributed, and multi-stage in nature. Traditional intrusion detection and response systems often rely on static rules and manual workflows, which limit their ability to respond with the speed and precision required in high-stakes environments. To address these challenges, we present the Intrusion Response System Digital Assistant (IRSDA), an agent-based framework designed to deliver autonomous and policy-compliant cyber defense. IRSDA combines Self-Adaptive Autonomic Computing Systems (SA-ACS) with the Knowledge-guided Monitor, Analyze, Plan, and Execute (MAPE-K) loop to support real-time, partition-aware decision-making across enterprise infrastructure. IRSDA incorporates a knowledge-driven architecture that integrates contextual information with AI-based reasoning to support system-guided intrusion response. The framework leverages retrieval mechanisms and structured representations to inform decision-making while maintaining alignment with operational policies. We assess the system using a representative real-world microservices application, demonstrating its ability to automate containment, enforce compliance, and provide traceable outputs for security analyst interpretation. This work outlines a modular and agent-driven approach to cyber defense that emphasizes explainability, system-state awareness, and operational control in intrusion response.

Updated: 2025-11-24 19:21:09

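Title: IRSDA: An Agent-Orchestrated Framework for Enterprise Intrusion Response

Abstract: Modern enterprise systems face escalating cyber threats that are increasingly dynamic, distributed, and multi-stage. Traditional intrusion detection and response systems usually rely on static rules and manual workflows, which limits their ability to respond with the speed and precision required in high-stakes environments. To address these challenges, we present the Intrusion Response System Digital Assistant (IRSDA), an agent-based framework designed to deliver autonomous, policy-compliant cyber defense. IRSDA combines Self-Adaptive Autonomic Computing Systems (SA-ACS) with the Knowledge-guided Monitor, Analyze, Plan, and Execute (MAPE-K) loop to support real-time, partition-aware decision-making across enterprise infrastructure. IRSDA incorporates a knowledge-driven architecture that combines contextual information with AI-based reasoning to support system-guided intrusion response. The framework uses retrieval mechanisms and structured representations to inform decision-making while staying aligned with operational policies. We evaluate the system on a representative real-world microservices application, demonstrating its ability to automate containment, enforce compliance, and provide traceable outputs for interpretation by security analysts. This work outlines a modular, agent-driven approach to cyber defense that emphasizes explainability, system-state awareness, and operational control in intrusion response.

Updated: 2025-11-24 19:21:09

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.19644v1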

On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction

Purpose: To investigate whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors. Methods: We proposed a semantic distribution-guided reconstruction framework that uses a pre-trained vision-language foundation model to encode both the reconstructed image and auxiliary information into high-level semantic features. A contrastive objective aligns the reconstructed representation with the target semantic distribution, ensuring consistency with high-level perceptual cues. The proposed objective works with various deep learning-based reconstruction methods and can flexibly incorporate semantic priors from multimodal sources. To test the effectiveness of these semantic priors, we evaluated reconstruction results guided by priors derived from either image-only or image-language auxiliary information. Results: Experiments on knee and brain datasets demonstrate that semantic priors from images preserve fine anatomical structures and achieve superior perceptual quality, as reflected in lower LPIPS values, higher Tenengrad scores, and improved scores in the reader study, compared with conventional regularization. The image-language information further expands the semantic distribution and enables high-level control over reconstruction attributes. Across all evaluations, the contrastive objective consistently guided the reconstructed features toward the desired semantic distributions while maintaining data fidelity, demonstrating the effectiveness of the proposed optimization framework. Conclusion: The study highlights that vision-language foundation models can improve undersampled MRI reconstruction through semantic-space optimization.

Updated: 2025-11-24 19:15:47

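Title: On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction

Abstract: Purpose: To investigate whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors. Methods: We proposed a semantic distribution-guided reconstruction framework that uses a pre-trained vision-language foundation model to encode both the reconstructed image and auxiliary information into high-level semantic features. A contrastive objective aligns the reconstructed representation with the target semantic distribution, ensuring consistency with high-level perceptual cues. The proposed objective works with a range of deep learning-based reconstruction methods and can flexibly incorporate semantic priors from multimodal sources. To test the effectiveness of these semantic priors, we evaluated reconstructions guided by priors derived from either image-only or image-language auxiliary information. Results: Experiments on knee and brain datasets show that semantic priors from images preserve fine anatomical structures and achieve better perceptual quality than conventional regularization, as reflected in lower LPIPS values, higher Tenengrad scores, and improved scores in the reader study. The image-language information further expands the semantic distribution and enables high-level control over reconstruction attributes. Across all evaluations, the contrastive objective consistently guided the reconstructed features toward the desired semantic distributions while maintaining data fidelity, demonstrating the effectiveness of the proposed optimization framework. Conclusion: The study highlights that vision-language foundation models can improve undersampled MRI reconstruction through semantic-space optimization.

Updated: 2025-11-24 19:15:47

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19641v1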

Many Ways to be Right: Rashomon Sets for Concept-Based Neural Networks

Modern neural networks rarely have a single way to be right. For many tasks, multiple models can achieve identical performance while relying on different features or reasoning patterns, a property known as the Rashomon Effect. However, uncovering this diversity in deep architectures is challenging as their continuous parameter spaces contain countless near-optimal solutions that are numerically distinct but often behaviorally similar. We introduce Rashomon Concept Bottleneck Models, a framework that learns multiple neural networks which are all accurate yet reason through distinct human-understandable concepts. By combining lightweight adapter modules with a diversity-regularized training objective, our method constructs a diverse set of deep concept-based models efficiently without retraining from scratch. The resulting networks provide fundamentally different reasoning processes for the same predictions, revealing how concept reliance and decision making vary across equally performing solutions. Our framework enables systematic exploration of data-driven reasoning diversity in deep models, offering a new mechanism for auditing, comparison, and alignment across equally accurate solutions.

Updated: 2025-11-24 19:12:26

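Title: Many Ways to be Right: Rashomon Sets for Concept-Based Neural Networks

Abstract: Modern neural networks rarely have a single way to be right. For many tasks, multiple models can achieve identical performance while relying on different features or reasoning patterns, a property known as the Rashomon Effect. However, uncovering this diversity in deep architectures is challenging, because their continuous parameter spaces contain countless near-optimal solutions that are numerically distinct but often behaviorally similar. We introduce Rashomon Concept Bottleneck Models, a framework that learns multiple neural networks that are all accurate yet reason through distinct human-understandable concepts. By combining lightweight adapter modules with a diversity-regularized training objective, our method efficiently constructs a diverse set of deep concept-based models without retraining from scratch. The resulting networks provide fundamentally different reasoning processes for the same predictions, revealing how concept reliance and decision making vary across equally performing solutions. Our framework enables systematic exploration of data-driven reasoning diversity in deep models, offering a new mechanism for auditing, comparison, and alignment across equally accurate solutions.

Updated: 2025-11-24 19:12:26

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19636v1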

VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
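The step that turns (caption, score) trajectories into preference tuples can be sketched as below. The abstract does not state the pairing rule, so this sketch assumes one: the best-scored caption in a trajectory is the chosen response and every lower-scored caption is a rejected one. The function name, the `min_margin` parameter, and the use of `None` to mark scorer output that failed to parse are all hypothetical.

```python
def trajectory_to_preferences(prompt, trajectory, min_margin=0.0):
    """Turn one video's (caption, score) trajectory into (prompt, chosen, rejected) tuples.

    Entries whose score is None stand in for scoring outputs that failed to parse;
    they are dropped, mirroring the filtering step mentioned in the abstract.
    """
    valid = [(caption, score) for caption, score in trajectory if score is not None]
    if len(valid) < 2:
        return []   # a preference needs at least two scored captions
    chosen, best_score = max(valid, key=lambda item: item[1])
    return [
        (prompt, chosen, caption)
        for caption, score in valid
        if best_score - score > min_margin
    ]


# Illustrative trajectory; captions and scores are made up.
trajectory = [
    ("A man walks.", 1.8),
    ("A man in a red coat walks a dog through a snowy park.", 2.6),
    ("Unparseable scorer output", None),
    ("A man walks a dog in a park.", 2.2),
]
pairs = trajectory_to_preferences("Describe the video in detail.", trajectory)
```

Tuples of this shape are the usual input to direct preference optimization, which trains the model to rank the chosen caption above the rejected one for the same prompt.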

Updated: 2025-11-24 18:59:56

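Title: VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

Abstract: We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (a score plus textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path uses the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert these trajectories into preference tuples and filter out samples with JSON parsing errors, yielding VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B achieves state-of-the-art performance on the VDC benchmark, with 49.08% average accuracy and a score of 2.50, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.

Updated: 2025-11-24 18:59:56

Categories: cs.CV,cs.AI,cs.LG,cs.MM

Download: http://arxiv.org/abs/2511.19436v1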

Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts

Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
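The sampling scheme above needs only a rule for which expert to call at each noise level. The following sketch shows that control flow with toy stand-ins: the two "experts" are scalar functions, and the update rule is a placeholder rather than a real diffusion sampler. The function names and the `switch_t` parameter are assumptions made for illustration.

```python
def sample_with_expert_switching(x, timesteps, quality_expert, likelihood_expert,
                                 switch_t, step_fn):
    """Run one denoising trajectory, choosing the expert by noise level.

    timesteps run from high noise to low noise; the quality expert handles
    t >= switch_t and the likelihood expert handles t < switch_t.
    """
    used = []
    for t in timesteps:
        use_quality = t >= switch_t
        expert = quality_expert if use_quality else likelihood_expert
        used.append("quality" if use_quality else "likelihood")
        x = step_fn(x, expert(x, t), t)
    return x, used


# Toy stand-ins: each "expert" returns a noise estimate for a scalar state.
def quality_expert(x, t):
    return 0.5 * x


def likelihood_expert(x, t):
    return 0.1 * x


def placeholder_step(x, eps, t):
    """Placeholder update: subtract the predicted noise from the state."""
    return x - eps


timesteps = [1.0, 0.8, 0.6, 0.4, 0.2]
x_final, used = sample_with_expert_switching(
    1.0, timesteps, quality_expert, likelihood_expert,
    switch_t=0.5, step_fn=placeholder_step,
)
```

Since neither expert is modified, the switching step is the only quantity left to choose, which matches the abstract's statement that no retraining or fine-tuning is required.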

Updated: 2025-11-24 18:59:53

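Title: Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts

Abstract: Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives that emphasize high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning, only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results show that switching experts across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.

Updated: 2025-11-24 18:59:53

Categories: cs.CV,cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.19434v1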

Mixture of Horizons in Action Chunking

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that any fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments on flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons

Updated: 2025-11-24 18:59:51

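Title: Mixture of Horizons in Action Chunking

Abstract: Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is highly sensitive to the $\textbf{action chunk length}$ used during training, termed the $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but reduce fine-grained accuracy, while shorter horizons sharpen local control but struggle on long-term tasks, which means that a fixed choice of a single horizon is suboptimal. To mitigate this trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs through a lightweight linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalization to complex tasks. 2) MoH is plug-and-play for full-attention action modules, with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, selecting stable actions through cross-horizon consensus and achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments on the flow-based policies $π_0$ and $π_{0.5}$ and the one-step regression policy $π_{\text{reg}}$ show that MoH yields consistent and significant gains in both simulation and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state of the art with a 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons

Updated: 2025-11-24 18:59:51

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2511.19433v1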

Cost-Aware Contrastive Routing for LLMs

We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%, while generalizing robustly to unseen LLMs and out-of-distribution prompts.
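At inference time the routing described above reduces to a nearest-neighbor lookup between a prompt embedding and the expert embeddings. The sketch below illustrates that lookup with brute-force cosine similarity in place of a FAISS index. The embeddings and model names are made up, and the sketch assumes the cost preference has already been learned into the embeddings by the contrastive encoder, so no explicit cost term appears here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def route(prompt_embedding, experts, k=1):
    """Return the names of the k experts nearest to the prompt in the shared space."""
    ranked = sorted(
        experts,
        key=lambda e: cosine(prompt_embedding, e["embedding"]),
        reverse=True,
    )
    return [e["name"] for e in ranked[:k]]


# Hypothetical expert pool with hand-written embeddings.
experts = [
    {"name": "small-model", "embedding": [0.9, 0.1, 0.0]},
    {"name": "medium-model", "embedding": [0.7, 0.7, 0.0]},
    {"name": "large-model", "embedding": [0.1, 0.9, 0.4]},
]

easy_prompt = [1.0, 0.2, 0.0]
hard_prompt = [0.0, 1.0, 0.5]
easy_choice = route(easy_prompt, experts)
hard_choice = route(hard_prompt, experts)

# Adding an expert only appends one vector; nothing is retrained.
experts.append({"name": "new-model", "embedding": [0.0, 0.95, 0.5]})
hard_choice_after = route(hard_prompt, experts)
```

The last three lines show why a changing expert pool needs no retraining in this design: a new model only contributes one more vector to search over.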

Updated: 2025-11-24 18:59:36

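Title: Cost-Aware Contrastive Routing for LLMs

Abstract: We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often ignore prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, which requires no retraining when the expert pool changes and enables microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy-cost trade-off by up to 25% while generalizing robustly to unseen LLMs and out-of-distribution prompts.

Updated: 2025-11-24 18:59:36

Categories: cs.LG

Download: http://arxiv.org/abs/2508.12491v3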

Cognitive Foundations for Reasoning and Their Manifestation in LLMs

Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. Meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.

Updated: 2025-11-24 18:59:30

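Title: Cognitive Foundations for Reasoning and Their Manifestation in LLMs

Abstract: Large language models (LLMs) solve complex problems yet fail on simpler variants, which suggests that they reach correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements covering reasoning invariants, meta-cognitive controls, representations for organizing reasoning and knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-use the cognitive elements that correlate with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. A meta-analysis of 1.6K LLM reasoning papers shows that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) while neglecting the meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Building on these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance on complex problems by up to 66.7%. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while also providing tools for testing theories of human cognition at scale.

Updated: 2025-11-24 18:59:30

Categories: cs.AI

Download: http://arxiv.org/abs/2511.16660v2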

Flow Map Distillation Without Data

State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.

Updated: 2025-11-24 18:58:55

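Title: Flow Map Distillation Without Data

Abstract: State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate sampling, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data dependency introduces a fundamental risk of Teacher-Data Mismatch, because a static dataset may give an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, and thereby avoids the mismatch risk entirely. To demonstrate the practical viability of this idea, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state of the art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an FID of 1.45 on ImageNet 256x256 and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates broader adoption of flow map distillation without data.

Updated: 2025-11-24 18:58:55

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2511.19428v1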

Prompt Less, Smile More: MTP with Semantic Engineering in Lieu of Prompt Engineering

AI-Integrated programming is emerging as a foundational paradigm for building intelligent systems with large language models (LLMs). Recent approaches such as Meaning Typed Programming (MTP) automate prompt generation by leveraging the semantics already present in code. However, many real-world applications depend on contextual cues, developer intent, and domain-specific reasoning that extend beyond what static code semantics alone can express. To address this limitation, we introduce Semantic Engineering, a lightweight method for enriching program semantics so that LLM-based systems can more accurately reflect developer intent without requiring full manual prompt design. We present Semantic Context Annotations (SemTexts), a language-level mechanism that allows developers to embed natural-language context directly into program constructs. Integrated into the Jac programming language, Semantic Engineering extends MTP to incorporate these enriched semantics during prompt generation. We further introduce a benchmark suite designed to reflect realistic AI-Integrated application scenarios. Our evaluation shows that Semantic Engineering substantially improves prompt fidelity, achieving performance comparable to Prompt Engineering while requiring significantly less developer effort.

Updated: 2025-11-24 18:58:22

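Title: Prompt Less, Smile More: MTP with Semantic Engineering in Lieu of Prompt Engineering

Abstract: AI-Integrated programming is emerging as a foundational paradigm for building intelligent systems with large language models (LLMs). Recent approaches such as Meaning Typed Programming (MTP) automate prompt generation by exploiting the semantics already present in code. However, many real-world applications depend on contextual cues, developer intent, and domain-specific reasoning that go beyond what static code semantics alone can express. To address this limitation, we introduce Semantic Engineering, a lightweight method for enriching program semantics so that LLM-based systems can reflect developer intent more accurately without requiring fully manual prompt design. We present Semantic Context Annotations (SemTexts), a language-level mechanism that lets developers embed natural-language context directly in program constructs. Integrated into the Jac programming language, Semantic Engineering extends MTP to include these enriched semantics during prompt generation. We further introduce a benchmark suite designed to reflect realistic AI-Integrated application scenarios. Our evaluation shows that Semantic Engineering substantially improves prompt fidelity, achieving performance comparable to Prompt Engineering while requiring significantly less developer effort.

Updated: 2025-11-24 18:58:22

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2511.19427v1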

Collapsing Taylor Mode Automatic Differentiation

Computing partial differential equation (PDE) operators via nested backpropagation is expensive, yet popular, and severely restricts their utility for scientific machine learning. Recent advances, like the forward Laplacian and randomizing Taylor mode automatic differentiation (AD), propose forward schemes to address this. We introduce an optimization technique for Taylor mode that 'collapses' derivatives by rewriting the computational graph, and demonstrate how to apply it to general linear PDE operators, and randomized Taylor mode. The modifications simply require propagating a sum up the computational graph, which could -- or should -- be done by a machine learning compiler, without exposing complexity to users. We implement our collapsing procedure and evaluate it on popular PDE operators, confirming it accelerates Taylor mode and outperforms nested backpropagation.
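The idea of propagating a sum up the computational graph can be illustrated on the Laplacian of a small network. Standard Taylor mode would carry one second-order coefficient per input direction through every layer and add them at the end; the collapsed variant adds them as soon as they are produced and carries a single summed channel forward. The sketch below does this by hand for f(x) = w2 . tanh(W1 x + b1) and checks the result against finite differences. It is a hand-written illustration with assumed names and weights, not the paper's implementation or a compiler rewrite.

```python
import math


def laplacian_collapsed(W1, b1, w2, x):
    """Value and Laplacian of f(x) = w2 . tanh(W1 x + b1) using one summed second-order channel."""
    value, lap = 0.0, 0.0
    for row, bias, out_w in zip(W1, b1, w2):
        z = sum(w * xi for w, xi in zip(row, x)) + bias
        # Along basis direction e_i the first-order coefficient of z is row[i];
        # an affine layer produces no second-order term, so its summed channel is 0.
        z_lap = 0.0
        t = math.tanh(z)
        d1 = 1.0 - t * t        # tanh'(z)
        d2 = -2.0 * t * d1      # tanh''(z)
        # Collapsed rule for an elementwise nonlinearity: the per-direction
        # second-order terms are summed here instead of being carried separately.
        a_lap = d2 * sum(w * w for w in row) + d1 * z_lap
        value += out_w * t
        lap += out_w * a_lap    # the output layer is linear, so it maps the sum directly
    return value, lap


def laplacian_finite_difference(f, x, h=1e-4):
    """Reference Laplacian from central second differences, one direction at a time."""
    fx = f(x)
    total = 0.0
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        total += (f(xp) - 2.0 * fx + f(xm)) / (h * h)
    return total


# Illustrative weights and evaluation point.
W1 = [[0.5, -1.0], [1.5, 0.25]]
b1 = [0.1, -0.2]
w2 = [2.0, -1.0]
x = [0.3, -0.7]

value, lap = laplacian_collapsed(W1, b1, w2, x)
reference = laplacian_finite_difference(
    lambda p: laplacian_collapsed(W1, b1, w2, p)[0], x
)
```

The summation can be moved inward because each layer's second-order output depends linearly on its second-order input, so summing before or after a layer gives the same result while the collapsed version propagates one channel instead of one per input dimension.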

Updated: 2025-11-24 18:57:49

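Title: Collapsing Taylor Mode Automatic Differentiation

Abstract: Computing partial differential equation (PDE) operators via nested backpropagation is expensive yet popular, and it severely restricts their utility for scientific machine learning. Recent advances, such as the forward Laplacian and randomized Taylor mode automatic differentiation (AD), propose forward schemes to address this. We introduce an optimization technique for Taylor mode that "collapses" derivatives by rewriting the computational graph, and we show how to apply it to general linear PDE operators and to randomized Taylor mode. The modifications only require propagating a sum up the computational graph, which could, or should, be done by a machine learning compiler without exposing complexity to users. We implement our collapsing procedure and evaluate it on popular PDE operators, confirming that it accelerates Taylor mode and outperforms nested backpropagation.

Updated: 2025-11-24 18:57:49

Categories: cs.LG

Download: http://arxiv.org/abs/2505.13644v2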

Beyond Protein Language Models: An Agentic LLM Framework for Mechanistic Enzyme Design

We present Genie-CAT, a tool-augmented large-language-model (LLM) system designed to accelerate scientific hypothesis generation in protein design. Using metalloproteins (e.g., ferredoxins) as a case study, Genie-CAT integrates four capabilities -- literature-grounded reasoning through retrieval-augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential calculations, and machine-learning prediction of redox properties -- into a unified agentic workflow. By coupling natural-language reasoning with data-driven and physics-based computation, the system generates mechanistically interpretable, testable hypotheses linking sequence, structure, and function. In proof-of-concept demonstrations, Genie-CAT autonomously identifies residue-level modifications near [Fe--S] clusters that affect redox tuning, reproducing expert-derived hypotheses in a fraction of the time. The framework highlights how AI agents combining language models with domain-specific tools can bridge symbolic reasoning and numerical simulation, transforming LLMs from conversational assistants into partners for computational discovery.

Updated: 2025-11-24 18:57:07

Categories: q-bio.QM, cs.AI

Download: http://arxiv.org/abs/2511.19423v1

SLMFix: Leveraging Small Language Models for Error Fixing with Reinforcement Learning

Recent advancements in large language models (LLMs) have shown very impressive capabilities in code generation across many programming languages. However, even state-of-the-art LLMs generate programs that contain syntactic errors and fail to complete the given tasks, especially for low-resource programming languages (LRPLs). In addition, high training cost makes finetuning LLMs unaffordable with constrained computational resources, further undermining the effectiveness of LLMs for code generation. In this work, we propose SLMFix, a novel code generation pipeline that leverages a small language model (SLM) finetuned using reinforcement learning (RL) techniques to fix syntactic errors in LLM-generated programs, improving their quality for domain-specific languages (DSLs). Specifically, we apply RL to the SLM for the program repair task using a reward calculated from both a static validator and a static semantic similarity metric. Our experimental results demonstrate the effectiveness and generalizability of our approach across multiple DSLs, achieving a pass rate of more than 95% on the static validator. Notably, SLMFix brings substantial improvement to the base model and outperforms the supervised finetuning approach even for 7B models on an LRPL, showing the potential of our approach as an alternative to traditional finetuning approaches.

Updated: 2025-11-24 18:56:47

Categories: cs.SE, cs.AI, cs.PL

Download: http://arxiv.org/abs/2511.19422v1
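
The SLMFix abstract above describes a reward built from a static validator and a static similarity metric, but it does not say how the two are combined or what the similarity is measured against. The Python sketch below is therefore only an assumption: it uses a toy bracket-balance check as the validator, `difflib.SequenceMatcher` as a purely textual stand-in for the similarity metric, a reference program as the comparison target, and a weighted sum with a made-up `validity_weight`. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def toy_validator(program: str) -> bool:
    """Stand-in static check: brackets must be balanced and properly nested."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in program:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def similarity(candidate: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a placeholder for a real metric."""
    return SequenceMatcher(None, candidate, reference).ratio()


def repair_reward(candidate: str, reference: str, validity_weight: float = 0.5) -> float:
    """Weighted sum of validity (0 or 1) and similarity (assumed combination rule)."""
    valid = 1.0 if toy_validator(candidate) else 0.0
    return validity_weight * valid + (1.0 - validity_weight) * similarity(candidate, reference)
```

Under these assumptions a repaired program that validates and matches the reference scores 1.0, while one that still has an unclosed bracket loses the validity term.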

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens, compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.

Updated: 2025-11-24 18:55:19

Categories: cs.CV, cs.AI, cs.LG

Download: http://arxiv.org/abs/2511.19418v1

Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration

Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision) often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e., equipping text-only DeepSeek-R1 with a Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.

Updated: 2025-11-24 18:55:16

Categories: cs.CL, cs.AI, cs.LG

Download: http://arxiv.org/abs/2511.19417v1

The Loss of Control Playbook: Degrees, Dynamics, and Preparedness

This research report addresses the absence of an actionable definition for Loss of Control (LoC) in AI systems by developing a novel taxonomy and preparedness framework. Despite increasing policy and research attention, existing LoC definitions vary significantly in scope and timeline, hindering effective LoC assessment and mitigation. To address this issue, we draw from an extensive literature review and propose a graded LoC taxonomy, based on the metrics of severity and persistence, that distinguishes between Deviation, Bounded LoC, and Strict LoC. We model pathways toward a societal state of vulnerability in which sufficiently advanced AI systems have acquired or could acquire the means to cause Bounded or Strict LoC once a catalyst, either misalignment or pure malfunction, materializes. We argue that this state becomes increasingly likely over time, absent strategic intervention, and propose a strategy to avoid reaching a state of vulnerability. Rather than focusing solely on intervening on AI capabilities and propensities potentially relevant for LoC or on preventing potential catalysts, we introduce a complementary framework that emphasizes three extrinsic factors: Deployment context, Affordances, and Permissions (the DAP framework). Compared to work on intrinsic factors and catalysts, this framework has the unfair advantage of being actionable today. Finally, we put forward a plan to maintain preparedness and prevent the occurrence of LoC outcomes should a state of societal vulnerability be reached, focusing on governance measures (threat modeling, deployment policies, emergency response) and technical controls (pre-deployment testing, control measures, monitoring) that could maintain a condition of perennial suspension.

Updated: 2025-11-24 18:52:00

Categories: cs.CY, cs.AI

Download: http://arxiv.org/abs/2511.15846v3

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame

Updated: 2025-11-24 18:50:01

Categories: cs.LG, cs.AI, cs.CV

Download: http://arxiv.org/abs/2511.19413v1

SING: SDE Inference via Natural Gradients

Latent stochastic differential equation (SDE) models are important tools for the unsupervised discovery of dynamical systems from data, with applications ranging from engineering to neuroscience. In these complex domains, exact posterior inference of the latent state path is typically intractable, motivating the use of approximate methods such as variational inference (VI). However, existing VI methods for inference in latent SDEs often suffer from slow convergence and numerical instability. We propose SDE Inference via Natural Gradients (SING), a method that leverages natural gradient VI to efficiently exploit the underlying geometry of the model and variational posterior. SING enables fast and reliable inference in latent SDE models by approximating intractable integrals and parallelizing computations in time. We provide theoretical guarantees that SING approximately optimizes the intractable, continuous-time objective of interest. Moreover, we demonstrate that better state inference enables more accurate estimation of nonlinear drift functions using, for example, Gaussian process SDE models. SING outperforms prior methods in state inference and drift estimation on a variety of datasets, including a challenging application to modeling neural dynamics in freely behaving animals. Altogether, our results illustrate the potential of SING as a tool for accurate inference in complex dynamical systems, especially those characterized by limited prior knowledge and non-conjugate structure.

Updated: 2025-11-24 18:49:51

Categories: cs.LG, stat.ML

Download: http://arxiv.org/abs/2506.17796v2

PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous or high-risk capabilities, posing frontier risks. Current safety evaluations primarily test for what a model can do - its capabilities - without assessing what it would do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that propensity - the likelihood of a model to pursue harmful actions if empowered - is a critical, yet underexplored, axis of safety evaluation. We present PropensityBench, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models' choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover 9 alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely. Our code is available at https://github.com/scaleapi/propensity-evaluation.

Updated: 2025-11-24 18:46:44

Categories: cs.CY, cs.AI, cs.LG

Download: http://arxiv.org/abs/2511.20703v1

Learning Robust Social Strategies with Large Language Models

As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust and Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents.

Updated: 2025-11-24 18:43:46

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19405v1
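
The abstract above mentions a group-relative baseline that simplifies advantage computation, without giving its exact form. One plausible reading, shown here purely as an assumption, is that each trajectory's return is compared with the mean return of a group of trajectories sampled for the same game. The function name `group_relative_advantages` is hypothetical.

```python
from statistics import mean


def group_relative_advantages(returns):
    """Advantage of each sampled trajectory relative to the group's mean return."""
    baseline = mean(returns)  # assumed baseline: plain average over the group
    return [r - baseline for r in returns]
```

With returns `[1.0, 2.0, 3.0]` the baseline is 2.0, so the trajectories get advantages of -1.0, 0.0, and 1.0. No learned value network is needed, which may be what makes training at LLM scale tractable.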

Nonparametric Instrumental Variable Regression with Observed Covariates

We study the problem of nonparametric instrumental variable regression with observed covariates, which we refer to as NPIV-O. Compared with standard nonparametric instrumental variable regression (NPIV), the additional observed covariates facilitate causal identification and enable heterogeneous causal effect estimation. However, the presence of observed covariates introduces two challenges for its theoretical analysis. First, it induces a partial identity structure, which renders previous NPIV analyses - based on measures of ill-posedness, stability conditions, or link conditions - inapplicable. Second, it imposes anisotropic smoothness on the structural function. To address the first challenge, we introduce a novel Fourier measure of partial smoothing; for the second challenge, we extend the existing kernel 2SLS instrumental variable algorithm with observed covariates, termed KIV-O, to incorporate Gaussian kernel lengthscales adaptive to the anisotropic smoothness. We prove upper $L^2$-learning rates for KIV-O and the first $L^2$-minimax lower learning rates for NPIV-O. Both rates interpolate between known optimal rates of NPIV and nonparametric regression (NPR). Interestingly, we identify a gap between our upper and lower bounds, which arises from the choice of kernel lengthscales tuned to minimize a projected risk. Our theoretical analysis also applies to proximal causal inference, an emerging framework for causal effect estimation that shares the same conditional moment restriction as NPIV-O.

Updated: 2025-11-24 18:42:49

Categories: stat.ML, cs.LG, math.ST

Download: http://arxiv.org/abs/2511.19404v1
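
To make "Gaussian kernel lengthscales adaptive to the anisotropic smoothness" in the abstract above concrete, the Python sketch below evaluates a Gaussian kernel with one lengthscale per input dimension. It shows the kernel only, not the KIV-O estimator. The `exp(-0.5 * ...)` parameterization is one common convention and is an assumption here, as is the function name.

```python
import math


def anisotropic_gaussian_kernel(x, y, lengthscales):
    """Gaussian kernel where each coordinate is rescaled by its own lengthscale."""
    scaled_sq_dist = sum(
        ((a - b) / ell) ** 2 for a, b, ell in zip(x, y, lengthscales)
    )
    return math.exp(-0.5 * scaled_sq_dist)
```

A large lengthscale makes the kernel nearly flat along that coordinate, suiting directions where the target function is smoother. A small lengthscale makes the kernel react sharply to differences in that coordinate.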

MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study

In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is open source: https://github.com/LLM4Rocq/miniF2F-rocq.

Updated: 2025-11-24 18:41:20

Categories: cs.LO, cs.CL, cs.LG, cs.PL

Download: http://arxiv.org/abs/2503.04763v2

In-Video Instructions: Visual Signals as Generative Control

Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.

Updated: 2025-11-24 18:38:45

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2511.19401v1

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.

Updated: 2025-11-24 18:35:54

Categories: cs.CL, cs.AI, cs.LG

Download: http://arxiv.org/abs/2511.19399v1

PTF Testing Lower Bounds for Non-Gaussian Component Analysis

This work studies information-computation gaps for statistical problems. A common approach for providing evidence of such gaps is to show sample complexity lower bounds (that are stronger than the information-theoretic optimum) against natural models of computation. A popular such model in the literature is the family of low-degree polynomial tests. While these tests are defined in a way that makes them easy to analyze, the class of algorithms that they rule out is somewhat restricted. An important goal in this context has been to obtain lower bounds against the stronger and more natural class of low-degree Polynomial Threshold Function (PTF) tests, i.e., any test that can be expressed as comparing some low-degree polynomial of the data to a threshold. Proving lower bounds against PTF tests has turned out to be challenging. Indeed, we are not aware of any non-trivial PTF testing lower bounds in the literature. In this paper, we establish the first non-trivial PTF testing lower bounds for a range of statistical tasks. Specifically, we prove a near-optimal PTF testing lower bound for Non-Gaussian Component Analysis (NGCA). Our NGCA lower bound implies similar lower bounds for a number of other statistical problems. Our proof leverages a connection to recent work on pseudorandom generators for PTFs and recent techniques developed in that context. At the technical level, we develop several tools of independent interest, including novel structural results for analyzing the behavior of low-degree polynomials restricted to random directions.

Updated: 2025-11-24 18:35:29

Categories: cs.DS, cs.IT, cs.LG, math.ST, stat.ML

Download: http://arxiv.org/abs/2511.19398v1
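
The abstract above defines a PTF test as comparing a low-degree polynomial of the data with a threshold. The Python sketch below shows a toy instance of that definition only, not of the paper's lower-bound construction. The degree-2 statistic `mean_square_minus_one` and both function names are illustrative assumptions.

```python
def mean_square_minus_one(samples):
    """Degree-2 polynomial of the data: empirical second moment minus 1."""
    return sum(x * x for x in samples) / len(samples) - 1.0


def ptf_test(samples, polynomial, threshold):
    """Report 'alternative' (True) exactly when the polynomial exceeds the threshold."""
    return polynomial(samples) > threshold
```

A standard Gaussian sample has second moment close to 1, so this statistic is near 0 under that null, and `ptf_test(samples, mean_square_minus_one, 0.5)` fires only when the empirical second moment is noticeably inflated.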

Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments

Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.

Updated: 2025-11-24 18:33:50

Categories: cs.SD, cs.AI, cs.CV

Download: http://arxiv.org/abs/2511.19396v1
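
The beam steering across azimuth and elevation mentioned in the abstract above can be illustrated with textbook delay-and-sum geometry. The Python sketch below computes per-microphone delays for a single ring of a planar circular array under a far-field plane-wave assumption. The function name, the single-ring simplification, and the 343 m/s speed of sound are assumptions of this sketch, not details from the paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees Celsius (assumed)


def steering_delays(mic_angles, radius, azimuth, elevation):
    """Delays (seconds) that align a far-field plane wave across one ring of microphones.

    mic_angles: angular position of each microphone on the ring, in radians.
    azimuth, elevation: direction of the source as seen from the array, in radians.
    """
    # Horizontal components of the unit vector pointing from the array to the source.
    ux = math.cos(elevation) * math.cos(azimuth)
    uy = math.cos(elevation) * math.sin(azimuth)
    # Arrival time at each microphone relative to the array center:
    # microphones nearer the source hear the wavefront earlier (negative time).
    arrivals = [
        -radius * (math.cos(phi) * ux + math.sin(phi) * uy) / SPEED_OF_SOUND
        for phi in mic_angles
    ]
    # Delay every channel so that all of them line up with the latest arrival.
    latest = max(arrivals)
    return [latest - t for t in arrivals]
```

In a tracking loop, the azimuth and elevation would come from the visual tracker's 3D position estimate, and the delays would be recomputed whenever the target moves. A source directly overhead needs no relative delay on a planar array.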

Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent's own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.

Updated: 2025-11-24 18:31:13

Categories: cs.MA, cs.AI, cs.LG, eess.SY

Download: http://arxiv.org/abs/2508.02912v4
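
The Intention Communication pipeline in the abstract above (imagine a trajectory with the agent's own policy, then compress it into a message) can be mimicked with hand-written stand-ins. In the Python sketch below, a greedy grid-world policy plays the role of the learned policy and world model, and "send the endpoint" plays the role of the Message Generation Network. Both are assumptions made for illustration, and all names are hypothetical.

```python
def greedy_policy(pos, goal):
    """Stand-in policy: move one cell toward the goal, x axis first."""
    (x, y), (gx, gy) = pos, goal
    if x != gx:
        return (x + (1 if gx > x else -1), y)
    if y != gy:
        return (x, y + (1 if gy > y else -1))
    return pos  # already at the goal


def imagine_trajectory(pos, goal, horizon):
    """Roll the policy forward without acting, collecting the imagined states."""
    trajectory = []
    for _ in range(horizon):
        pos = greedy_policy(pos, goal)
        trajectory.append(pos)
    return trajectory


def make_message(trajectory):
    """Stand-in compression: communicate only where the plan ends up."""
    return trajectory[-1]
```

Another agent receiving `(2, 1)` learns which cell this agent intends to occupy and can pick a different task. That is the sense in which plans are communicated rather than raw percepts.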

Predicting partially observable dynamical systems via diffusion models with a multiscale inference scheme

Conditional diffusion models provide a natural framework for probabilistic prediction of dynamical systems and have been successfully applied to fluid dynamics and weather prediction. However, in many settings, the available information at a given time represents only a small fraction of what is needed to predict future states, either due to measurement uncertainty or because only a small fraction of the state can be observed. This is true for example in solar physics, where we can observe the Sun's surface and atmosphere, but its evolution is driven by internal processes for which we lack direct measurements. In this paper, we tackle the probabilistic prediction of partially observable, long-memory dynamical systems, with applications to solar dynamics and the evolution of active regions. We show that standard inference schemes, such as autoregressive rollouts, fail to capture long-range dependencies in the data, largely because they do not integrate past information effectively. To overcome this, we propose a multiscale inference scheme for diffusion models, tailored to physical processes. Our method generates trajectories that are temporally fine-grained near the present and coarser as we move farther away, which enables capturing long-range temporal dependencies without increasing computational cost. When integrated into a diffusion model, we show that our inference scheme significantly reduces the bias of the predicted distributions and improves rollout stability.

Updated: 2025-11-24 18:30:04

标题: 通过多尺度推理方案使用扩散模型预测部分可观测的动态系统

摘要: 条件扩散模型为概率预测动态系统提供了自然框架，并已成功应用于流体动力学和天气预测。然而，在许多情况下，给定时间点的可用信息仅代表了预测未来状态所需信息的一小部分，这可能是由于测量不确定性或者因为只有状态的一小部分可以被观察到。例如，这在太阳物理学中是真实的，我们可以观察到太阳的表面和大气，但其演变受到内部过程驱动，而我们缺乏直接的测量。在本文中，我们处理部分可观测、长记忆动态系统的概率预测，应用于太阳动力学和活跃区域的演变。我们展示了标准推理方案，如自回归展开，未能捕捉数据中的长期依赖性，主要是因为它们没有有效整合过去的信息。为了克服这一问题，我们提出了一个针对物理过程的多尺度推理方案，适用于扩散模型。我们的方法生成的轨迹在当前时间点附近是时间上细粒度的，而在远离时间点时则更加粗糙，这使得我们能够捕捉长期时间依赖性，而不增加计算成本。当集成到扩散模型中时，我们的推理方案显著减少了预测分布的偏差，并提高了展开的稳定性。

更新时间: 2025-11-24 18:30:04

领域: cs.LG,astro-ph.SR,cs.AI,stat.ML

下载: http://arxiv.org/abs/2511.19390v1

Towards Synergistic Teacher-AI Interactions with Generative Artificial Intelligence

Generative artificial intelligence (GenAI) is increasingly used in education, posing significant challenges for teachers adapting to these changes. GenAI offers unprecedented opportunities for accessibility, scalability and productivity in educational tasks. However, the automation of teaching tasks through GenAI raises concerns about reduced teacher agency, potential cognitive atrophy, and the broader deprofessionalisation of teaching. Drawing on findings from prior literature on AI in Education, and refining them through a recent systematic literature review, this chapter presents a conceptualisation of five levels of teacher-AI teaming: transactional, situational, operational, praxical and synergistic teaming. The framework aims to capture the nuanced dynamics of teacher-AI interactions, particularly with GenAI, that may lead to the replacement, complementarity, or augmentation of teachers' competences and professional practice. The GenAI technological affordances required to support teaming, along with relevant empirical studies, are discussed. Drawing on empirical observations, we outline a future vision that moves beyond individual teacher agency toward collaborative decision-making between teachers and AI, in which both agents engage in negotiation, constructive challenge, and co-reasoning that enhance each other's capabilities and enable outcomes neither could realise independently. Further discussion of socio-technical factors beyond teacher-AI teaming is also included to streamline the synergy of teachers and AI in education ethically and practically.

Updated: 2025-11-24 18:29:29

标题: 朝向教师与人工智能的协同互动：生成人工智能的应用

摘要: 生成人工智能（GenAI）越来越多地被用于教育中，这给适应这些变化的教师带来了重大挑战。GenAI为教育任务提供了前所未有的机会，包括无障碍性、可扩展性和高生产力。然而，通过GenAI自动化教学任务引发了对降低教师主体性、潜在认知萎缩以及教学职业化进程的广泛担忧。本章节借鉴以往关于教育中人工智能的文献发现，并通过最近的系统性文献回顾，提出了五个教师-AI团队合作水平的概念化：交易性、情境性、操作性、实践性和协同性团队合作。该框架旨在捕捉教师与AI之间微妙的互动动态，特别是与GenAI有关的互动，这可能导致教师的能力和专业实践被替代、互补或增强。讨论了支持团队合作所需的GenAI技术优势，以及实证研究。基于实证观察，我们概述了一个超越个体教师主体性的未来愿景，朝着教师和AI之间的协作决策发展，其中两个代理商都参与谈判、建设性挑战和共同推理，增强彼此的能力并实现彼此独立无法实现的结果。此外，还讨论了超出教师-AI团队合作范围的社会技术因素，以在伦理和实践上推动教师和AI在教育中的协同效应。

更新时间: 2025-11-24 18:29:29

领域: cs.CY,cs.AI,cs.HC

下载: http://arxiv.org/abs/2511.19580v1

Post-Pruning Accuracy Recovery via Data-Free Knowledge Distillation

Model pruning is a widely adopted technique to reduce the computational complexity and memory footprint of Deep Neural Networks (DNNs). However, global unstructured pruning often leads to significant degradation in accuracy, typically necessitating fine-tuning on the original training dataset to recover performance. In privacy-sensitive domains such as healthcare or finance, access to the original training data is often restricted post-deployment due to regulations (e.g., GDPR, HIPAA). This paper proposes a Data-Free Knowledge Distillation framework to bridge the gap between model compression and data privacy. We utilize DeepInversion to synthesize privacy-preserving "dream" images from the pre-trained teacher model by inverting Batch Normalization (BN) statistics. These synthetic images serve as a transfer set to distill knowledge from the original teacher to the pruned student network. Experimental results on CIFAR-10 across various architectures (ResNet, MobileNet, VGG) demonstrate that our method significantly recovers accuracy lost during pruning without accessing a single real data point.
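
The distillation step can be sketched with the standard temperature-scaled KL objective. The abstract does not state which loss the authors use, so this formulation, the temperature value, and the function names are assumptions for illustration only; image synthesis with DeepInversion is not shown.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, -2.0]        # hypothetical teacher logits on one synthetic image
pruned_student = [2.5, 1.5, 0.0]  # hypothetical logits of the pruned student
loss = distillation_loss(teacher, pruned_student)
print(loss)
```

The loss is zero when the student reproduces the teacher's logits and grows as their softened distributions diverge, which is what lets a synthetic transfer set stand in for labelled data.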

Updated: 2025-11-24 18:27:40

标题: 无数据知识蒸馏通过后修剪准确性恢复

摘要: 模型修剪是一种广泛采用的技术，用于减少深度神经网络（DNNs）的计算复杂性和内存占用。然而，全局非结构化修剪通常会导致精度显著下降，通常需要在原始训练数据集上进行微调以恢复性能。在隐私敏感领域，如医疗保健或金融领域，由于法规（例如GDPR、HIPAA）的限制，常常在部署后限制对原始训练数据的访问。本文提出了一种无数据知识蒸馏框架，以弥合模型压缩和数据隐私之间的差距。我们利用DeepInversion从预训练的教师模型中合成保护隐私的“梦幻”图像，通过反转批量归一化（BN）统计信息。这些合成图像作为转移集，将知识从原始教师模型转移到修剪后的学生网络。在不同架构（ResNet、MobileNet、VGG）上对CIFAR-10的实验结果表明，我们的方法显著恢复了修剪过程中丢失的精度，而无需访问任何真实数据点。

更新时间: 2025-11-24 18:27:40

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.20702v1

Using Wearable Devices to Improve Chronic Pain Treatment among Patients with Opioid Use Disorder

Chronic pain (CP) and opioid use disorder (OUD) are common and interrelated chronic medical conditions. Currently, there is a paucity of evidence-based integrated treatments for CP and OUD among individuals receiving medication for opioid use disorder (MOUD). Wearable devices have the potential to monitor complex patient information and inform treatment development for persons with OUD and CP, including pain variability (e.g., exacerbations of pain or pain spikes) and clinical correlates (e.g., perceived stress). However, the application of large language models (LLMs) to wearable data for understanding pain spikes remains unexplored. Consequently, the aim of this pilot study was to examine the clinical correlates of pain spikes using a range of AI approaches. We found that machine learning models achieved relatively high accuracy (>0.7) in predicting pain spikes, while LLMs were limited in providing insights on pain spikes. Real-time monitoring through wearable devices, combined with advanced AI models, could facilitate early detection of pain spikes and support personalized interventions that may help mitigate the risk of opioid relapse, improve adherence to MOUD, and enhance the integration of CP and OUD care. Given overall limited LLM performance, these findings highlight the need to develop LLMs that can provide actionable insights in the OUD/CP context.

Updated: 2025-11-24 18:19:56

标题: 使用可穿戴设备改善阿片类药物使用障碍患者的慢性疼痛治疗

摘要: 慢性疼痛（CP）和阿片类药物使用障碍（OUD）是常见且相关的慢性医疗状况。目前，在接受阿片类药物使用障碍（MOUD）药物治疗的个体中，缺乏基于证据的集成治疗CP和OUD的治疗方法。可穿戴设备有潜力监测复杂的患者信息，并为患有OUD和CP的人群的治疗开发提供信息，包括疼痛变异性（例如，疼痛恶化或疼痛尖峰）和临床相关性（例如，感知压力）。然而，利用可穿戴数据进行了解疼痛尖峰的大型语言模型（LLMs）的应用尚未被探索。因此，这项初步研究的目的是利用一系列AI方法来研究疼痛尖峰的临床相关性。我们发现机器学习模型在预测疼痛尖峰方面达到了相对较高的准确率（>0.7），而LLMs在提供疼痛尖峰方面的见解方面受到限制。通过可穿戴设备进行实时监测，结合先进的AI模型，可以促进对疼痛尖峰的早期检测，并支持个性化干预，有助于减轻阿片类药物复发风险，提高对MOUD的依从性，并增强CP和OUD治疗的集成。鉴于总体上LLM的表现有限，这些发现突出了在OUD/CP背景下开发能够提供可操作见解的LLMs的需求。

更新时间: 2025-11-24 18:19:56

领域: cs.AI,cs.HC

下载: http://arxiv.org/abs/2511.19577v1

Efficiency vs. Fidelity: A Comparative Analysis of Diffusion Probabilistic Models and Flow Matching on Low-Resource Hardware

Denoising Diffusion Probabilistic Models (DDPMs) have established a new state-of-the-art in generative image synthesis, yet their deployment is hindered by significant computational overhead during inference, often requiring up to 1,000 iterative steps. This study presents a rigorous comparative analysis of DDPMs against the emerging Flow Matching (Rectified Flow) paradigm, specifically isolating their geometric and efficiency properties on low-resource hardware. By implementing both frameworks on a shared Time-Conditioned U-Net backbone using the MNIST dataset, we demonstrate that Flow Matching significantly outperforms Diffusion in efficiency. Our geometric analysis reveals that Flow Matching learns a highly rectified transport path (Curvature $\mathcal{C} \approx 1.02$), which is near-optimal, whereas Diffusion trajectories remain stochastic and tortuous ($\mathcal{C} \approx 3.45$). Furthermore, we establish an "efficiency frontier" at $N=10$ function evaluations, where Flow Matching retains high fidelity while Diffusion collapses. Finally, we show via numerical sensitivity analysis that the learned vector field is sufficiently linear to render high-order ODE solvers (Runge-Kutta 4) unnecessary, validating the use of lightweight Euler solvers for edge deployment. This work concludes that Flow Matching is the superior algorithmic choice for real-time, resource-constrained generative tasks.
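
Why a nearly straight transport path makes a 10-step Euler solver sufficient can be seen in a toy example. The abstract does not define its curvature measure, so the ratio used below (path length divided by end-to-end distance) is an assumption, and the two vector fields are hand-written stand-ins rather than learned models.

```python
import math


def euler_integrate(x0, velocity, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    path = [tuple(x)]
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        path.append(tuple(x))
    return path


def tortuosity(path):
    """Path length over end-to-end distance; 1.0 means a straight line."""
    length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return length / math.dist(path[0], path[-1])


# Ideal rectified flow: a constant field pointing from source to target.
source, target = (0.0, 0.0), (3.0, 4.0)
straight_field = lambda x, t: [b - a for a, b in zip(source, target)]
straight_path = euler_integrate(source, straight_field, n_steps=10)

# A curved field (quarter rotation) for contrast.
curved_field = lambda x, t: [-math.pi / 2 * x[1], math.pi / 2 * x[0]]
curved_path = euler_integrate((1.0, 0.0), curved_field, n_steps=100)

print(straight_path[-1], tortuosity(straight_path))
print(tortuosity(curved_path))
```

For the constant field, Euler reaches the target (up to floating-point rounding) regardless of the step count, so a higher-order solver adds nothing; the curved field yields a ratio above 1 and would need more steps for comparable accuracy.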

Updated: 2025-11-24 18:19:42

标题: 效率与精确度：扩散概率模型和流匹配在低资源硬件上的比较分析

摘要: 去噪扩散概率模型（DDPMs）在生成图像合成方面建立了一个新的技术水平，但它们的部署受到显著的计算开销的限制，在推断过程中通常需要多达1,000次迭代。本研究对DDPMs与新兴的流匹配（修正流）范式进行了严格的比较分析，特别是在低资源硬件上独立地研究它们的几何和效率特性。通过在共享的基于时间条件的U-Net骨干上使用MNIST数据集实现两种框架，我们证明了流匹配在效率方面明显优于扩散。我们的几何分析揭示了流匹配学习了一个高度矫正的传输路径（曲率$ \mathcal{C} \approx 1.02 $），这是接近最优的，而扩散轨迹仍然是随机和曲折的（$ \mathcal{C} \approx 3.45 $）。此外，我们在$ N=10 $个函数评估时建立了一个“效率前沿”，在这个点，流匹配保持高保真度，而扩散则崩溃。最后，我们通过数值敏感性分析表明，学习的矢量场足够线性，不需要高阶ODE求解器（Runge-Kutta 4），验证了在边缘部署中使用轻量级的欧拉求解器。这项工作的结论是，流匹配是实时、资源受限的生成任务的优越算法选择。

更新时间: 2025-11-24 18:19:42

领域: cs.LG

下载: http://arxiv.org/abs/2511.19379v1

Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics

Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.

Updated: 2025-11-24 18:17:51

标题: 机器人世界模型：用于机器人鲁棒策略优化的神经网络模拟器

摘要: 学习强大且具有广泛适用性的世界模型对于实现在真实环境中高效且可扩展的机器人控制至关重要。在这项工作中，我们引入了一种新颖的学习世界模型的框架，可以准确捕捉复杂、部分可观察和随机动态。所提出的方法采用双自回归机制和自监督训练，实现可靠的长期预测，而无需依赖于特定领域的归纳偏见，确保适应各种机器人任务。我们进一步提出了一个利用世界模型进行有效训练的策略优化框架，在想象的环境中进行高效训练，并在实际系统中无缝部署。这项工作通过解决长期预测、误差累积和模拟到真实的转移等挑战，推动了基于模型的强化学习。通过提供一个可扩展且强大的框架，引入的方法为真实世界应用中的自适应和高效机器人系统铺平了道路。

更新时间: 2025-11-24 18:17:51

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2501.10100v4

PEANuT: Parameter-Efficient Adaptation with Weight-aware Neural Tweakers

Fine-tuning large pre-trained foundation models often yields excellent downstream performance but is prohibitively expensive when updating all parameters. Parameter-efficient fine-tuning (PEFT) methods such as LoRA alleviate this by introducing lightweight update modules, yet they commonly rely on weight-agnostic linear approximations, limiting their expressiveness. In this work, we propose PEANuT, a novel PEFT framework that introduces weight-aware neural tweakers, compact neural modules that generate task-adaptive updates conditioned on frozen pre-trained weights. PEANuT provides a flexible yet efficient way to capture complex update patterns without full model tuning. We theoretically show that PEANuT achieves equivalent or greater expressivity than existing linear PEFT methods with comparable or fewer parameters. Extensive experiments across four benchmarks with over twenty datasets demonstrate that PEANuT consistently outperforms strong baselines in both NLP and vision tasks, while maintaining low computational overhead.

Updated: 2025-11-24 18:17:37

标题: PEANuT: 带有权重感知神经微调器的参数高效适应

摘要: 调整大型预训练基础模型经常会产生出色的下游性能，但是当更新所有参数时成本过高。参数高效的微调（PEFT）方法如LoRA通过引入轻量级更新模块来缓解这一问题，然而它们通常依赖于与权重无关的线性逼近，从而限制了它们的表达能力。在这项工作中，我们提出了PEANuT，一种引入了权重感知神经调整器的新型PEFT框架，这些紧凑的神经模块可以在冻结的预训练权重的条件下生成任务自适应的更新。PEANuT提供了一种灵活而高效的方法来捕捉复杂的更新模式，而无需进行完整的模型调整。我们在理论上证明，PEANuT实现了与现有线性PEFT方法相当或更高的表达能力，同时具有可比较或更少的参数。在超过二十个数据集的四个基准测试中进行了大量实验，结果表明PEANuT在NLP和视觉任务中始终优于强基线方法，同时保持低计算开销。

更新时间: 2025-11-24 18:17:37

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2410.01870v3

Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

We introduce a novel framework for evaluating the alignment between natural language plans and their expected behavior by converting them into Kripke structures and Linear Temporal Logic (LTL) using Large Language Models (LLMs) and performing model checking. We systematically evaluate this framework on a simplified version of the PlanBench plan verification dataset and report on metrics like Accuracy, Precision, Recall and F1 scores. Our experiments demonstrate that GPT-5 achieves excellent classification performance (F1 score of 96.3%) while almost always producing syntactically perfect formal representations that can act as guarantees. However, the synthesis of semantically perfect formal models remains an area for future exploration.

Updated: 2025-11-24 18:17:27

标题: 连接LLM规划代理和形式方法：计划验证案例研究

摘要: 我们介绍了一个新颖的框架，用于通过将自然语言计划转换为Kripke结构和线性时态逻辑（LTL）使用大型语言模型（LLMs）并执行模型检查来评估它们与预期行为之间的对齐。我们系统地在PlanBench计划验证数据集的简化版本上评估了这个框架，并报告了诸如准确度、精确度、召回率和F1分数等指标。我们的实验表明，GPT-5实现了出色的分类性能（F1分数为96.3%），几乎总是生成可以作为保证的句法完美的形式化表示。然而，语义完美的形式化模型的综合仍然是未来探索的领域。

更新时间: 2025-11-24 18:17:27

领域: cs.AI,cs.LO

下载: http://arxiv.org/abs/2510.03469v2

ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework

Multi-agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code implementation, code testing, code maintenance, inter alia, using LLM agents. However, software development is a multifaceted environment that extends beyond just code. As such, a successful LLM system must factor in multiple stages of the software development life-cycle (SDLC). In this paper, we propose a vision for ALMAS, an Autonomous LLM-based Multi-Agent Software Engineering framework, which follows the above SDLC philosophy such that it may work within an agile software development team to perform several tasks end-to-end. ALMAS aligns its agents with agile roles, and can be used in a modular fashion to seamlessly integrate with human developers and their development environment. We showcase the progress towards ALMAS through our published works and a use case demonstrating the framework, where ALMAS is able to seamlessly generate an application and add a new feature.

Updated: 2025-11-24 18:11:57

标题: ALMAS：基于自主LLM的多智能体软件工程框架

摘要: 多智能体大型语言模型（LLM）系统一直在应用LLM研究的各个领域中处于领先地位。一个显著的领域是软件开发，在这个领域，研究人员已经通过使用LLM代理来推动代码实现、代码测试、代码维护等的自动化。然而，软件开发是一个多方面的环境，不仅仅局限于代码。因此，一个成功的LLM系统必须考虑软件开发生命周期（SDLC）的多个阶段。在本文中，我们提出了ALMAS的愿景，这是一个基于自主LLM的多智能体软件工程框架，遵循上述SDLC哲学，可以在敏捷软件开发团队中执行多个端到端任务。ALMAS将其代理与敏捷角色对齐，并可以以模块化方式使用，与人类开发人员及其开发环境无缝集成。我们通过我们的已发表作品和一个演示框架的用例展示了朝着ALMAS的进展，其中ALMAS能够无缝生成一个应用程序并添加一个新功能。

更新时间: 2025-11-24 18:11:57

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2510.03463v2

LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems

Multi-agent reinforcement learning (MARL) has been increasingly adopted in many real-world applications. While MARL enables decentralized deployment on resource-constrained edge devices, it suffers from severe non-stationarity due to the synchronous updates of agent policies. This non-stationarity results in unstable training and poor policy convergence, especially as the number of agents increases. In this paper, we propose RELED, a scalable MARL framework that integrates large language model (LLM)-driven expert demonstrations with autonomous agent exploration. RELED incorporates a Stationarity-Aware Expert Demonstration module, which leverages theoretical non-stationarity bounds to enhance the quality of LLM-generated expert trajectories, thus providing high reward and training-stable samples for each agent. Moreover, a Hybrid Expert-Agent Policy Optimization module adaptively balances each agent's learning from both expert-generated and agent-generated trajectories, accelerating policy convergence and improving generalization. Extensive experiments with real city networks based on OpenStreetMap demonstrate that RELED achieves superior performance compared to state-of-the-art MARL methods.

Updated: 2025-11-24 18:03:59

标题: 基于LLM驱动的面向稳态的专家演示在移动系统中的多智能体强化学习

摘要: 多智能体强化学习（MARL）在许多现实世界应用中越来越受到采用。虽然MARL使得在资源受限的边缘设备上实现去中心化部署成为可能，但由于智能体策略的同步更新，它受到严重的非稳态性的影响。这种非稳态性导致训练不稳定和策略收敛性差，特别是当智能体数量增加时。本文提出了RELED，一个可扩展的MARL框架，它整合了大型语言模型（LLM）驱动的专家演示和自主智能体探索。RELED包含一个稳态感知专家演示模块，利用理论上的非稳态性边界来增强LLM生成的专家轨迹的质量，从而为每个智能体提供高奖励和训练稳定的样本。此外，一个混合专家-智能体策略优化模块自适应地平衡每个智能体从专家生成和智能体生成的轨迹中学习，加速策略收敛并提高泛化能力。基于OpenStreetMap的真实城市网络的大量实验表明，与最先进的MARL方法相比，RELED取得了更优异的性能。

更新时间: 2025-11-24 18:03:59

领域: cs.LG,cs.NI

下载: http://arxiv.org/abs/2511.19368v1

An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification

Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook the spatial and anatomical information that is central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e., we measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures through a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.
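
The measure-then-apply-rules step can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it uses only the size cut-offs of the TNM 8th edition T descriptor (3, 5 and 7 cm) and ignores the invasion and distance criteria that the paper also measures; the axis-aligned extent is a stand-in for a true largest-diameter measurement, and all names are hypothetical.

```python
def largest_extent_mm(voxels, spacing_mm):
    """Largest axis-aligned extent of a set of (z, y, x) voxel indices, in mm."""
    extents = []
    for axis, step in enumerate(spacing_mm):
        coords = [v[axis] for v in voxels]
        extents.append((max(coords) - min(coords) + 1) * step)
    return max(extents)


def t_stage_by_size(size_mm):
    """Size-only T category (assumed thresholds: 30, 50 and 70 mm)."""
    if size_mm <= 30:
        return "T1"
    if size_mm <= 50:
        return "T2"
    if size_mm <= 70:
        return "T3"
    return "T4"


# Hypothetical tumor mask: a 4 x 10 x 40 voxel block.
tumor_voxels = {(z, y, x) for z in range(4) for y in range(10) for x in range(40)}
size = largest_extent_mm(tumor_voxels, spacing_mm=(2.5, 1.0, 1.0))
stage = t_stage_by_size(size)
print(size, stage)
```

Because the decision is an explicit threshold on a measured quantity, the reason for a given stage can be read directly from the size value, which is the transparency argument the abstract makes.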

Updated: 2025-11-24 18:01:47

标题: 一种考虑解剖结构的混合深度学习框架用于肺癌肿瘤分期分类

摘要: 准确的肺癌肿瘤分期对于预后和治疗计划至关重要。然而，对于端到端的深度学习方法来说仍然具有挑战性，因为这些方法通常忽视了对肿瘤-淋巴结-转移系统至关重要的空间和解剖信息。肿瘤分期取决于多个定量标准，包括肿瘤大小及其与最近解剖结构的距离，小的变化可能会改变分期结果。我们提出了一个基于医学的混合流程，通过显式测量肿瘤的大小和距离属性而不是将其视为纯粹的图像分类任务来进行分期。我们的方法采用专门的编码器-解码器网络来精确分割肺部和相邻解剖结构，包括肺叶、肿瘤、纵隔和膈肌。随后，我们通过对分割掩模进行定量分析来提取必要的肿瘤属性，即测量最大肿瘤尺寸并计算肿瘤与邻近解剖结构之间的距离。最后，我们应用基于规则的肿瘤分期与医学指南保持一致。这种新颖的框架在Lung-PET-CT-Dx数据集上进行了评估，表现出比传统深度学习模型更优秀的性能，实现了总体分类准确率为91.36%。我们报告了每个分期的F1分数分别为0.93（T1）、0.89（T2）、0.96（T3）和0.90（T4），这是先前文献中经常被忽略的关键评估方面。据我们所知，这是第一项将明确的临床背景嵌入肿瘤分期分类中的研究。与通常以不可解释的“黑匣子”方式操作的标准卷积神经网络不同，我们的方法既提供了最先进的性能，又提供了透明的决策支持。

更新时间: 2025-11-24 18:01:47

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.19367v1

HunyuanOCR Technical Report

This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

Updated: 2025-11-24 17:59:59

标题: 混元OCR技术报告

摘要: 本文介绍了HunyuanOCR，这是一个商业级、开源且轻量级（1B参数）的视觉语言模型（VLM），专门用于OCR任务。该架构包括原生视觉Transformer（ViT）和一个轻量级LLM，通过MLP适配器连接。HunyuanOCR展示了出色的性能，超越了商业API、传统管道和更大的模型（例如Qwen3-VL-4B）。具体来说，在感知任务（文本定位、解析）方面超越了当前的公共解决方案，并在语义任务（IE、文本图像翻译）方面表现卓越，在ICDAR 2025 DIMT挑战赛（小模型赛道）中获得第一名。此外，在拥有少于3B参数的VLM中，在OCR基准测试中取得了最新技术结果。 HunyuanOCR在三个关键方面取得了突破：1）统一多功能性和效率：我们在轻量级框架中实现了对核心功能的全面支持，包括定位、解析、IE、VQA和翻译。这解决了狭窄的“OCR专家模型”和低效的“通用VLM”的局限性。2）简化的端到端架构：采用纯端到端范式消除了对预处理模块（例如布局分析）的依赖。这从根本上解决了传统管道中常见的错误传播问题，并简化了系统部署。3）数据驱动和RL策略：我们确认高质量数据的关键作用，并首次在行业中证明，强化学习（RL）策略在OCR任务中能够带来显著的性能提升。 HunyuanOCR在HuggingFace上正式开源。我们还基于vLLM提供了一个高性能的部署解决方案，将其生产效率置于顶尖水平。我们希望这个模型能推动前沿研究，并为工业应用提供坚实的基础。

更新时间: 2025-11-24 17:59:59

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.19575v1

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.

Updated: 2025-11-24 17:59:06

标题: DeCo: 面向端到端图像生成的频率解耦像素扩散

摘要: 像素扩散旨在以端到端的方式直接在像素空间中生成图像。这种方法避免了VAE在两阶段潜在扩散中的限制，提供了更高的模型容量。现有的像素扩散模型在训练和推理过程中存在较慢的问题，因为它们通常在单个扩散变换器（DiT）中建模高频信号和低频语义。为了追求更高效的像素扩散范例，我们提出了频率解耦像素扩散框架。凭借分离高低频成分生成的直觉，我们利用轻量级像素解码器生成基于DiT的语义引导的高频细节。这样使得DiT专注于建模低频语义。此外，我们引入了一个频率感知的流匹配损失，强调视觉显著频率，同时抑制不重要的频率。大量实验表明，DeCo在像素扩散模型中表现出优越性能，在ImageNet上取得了1.62（256x256）和2.22（512x512）的FID，缩小了与潜在扩散方法之间的差距。此外，我们预训练的文本到图像模型在系统级比较中取得了0.86的领先整体得分。代码可以在https://github.com/Zehong-Ma/DeCo上公开获得。

更新时间: 2025-11-24 17:59:06

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.19365v1

Neural surrogates for designing gravitational wave detectors

Physics simulators are essential in science and engineering, enabling the analysis, control, and design of complex systems. In experimental sciences, they are increasingly used to automate experimental design, often via combinatorial search and optimization. However, as the setups grow more complex, the computational cost of traditional, CPU-based simulators becomes a major limitation. Here, we show how neural surrogate models can significantly reduce reliance on such slow simulators while preserving accuracy. Taking the design of interferometric gravitational wave detectors as a representative example, we train a neural network to surrogate the gravitational wave physics simulator Finesse, which was developed by the LIGO community. Although small changes in physical parameters can change the output by orders of magnitude, the model rapidly predicts the quality and feasibility of candidate designs, allowing an efficient exploration of large design spaces. Our algorithm loops between training the surrogate, inverse designing new experiments, and verifying their properties with the slow simulator for further training. Assisted by auto-differentiation and GPU parallelism, our method proposes high-quality experiments much faster than direct optimization. Solutions that our algorithm finds within hours outperform designs that take five days for the optimizer to reach. Though shown in the context of gravitational wave detectors, our framework is broadly applicable to other domains where simulator bottlenecks hinder optimization and discovery.

Updated: 2025-11-24 17:58:59

标题: 神经替代品用于设计引力波探测器

摘要: 物理模拟器在科学和工程中至关重要，可以实现对复杂系统的分析、控制和设计。在实验科学中，它们越来越被用于自动化实验设计，通常通过组合搜索和优化。然而，随着设置变得更加复杂，传统基于CPU的模拟器的计算成本成为一个主要限制。在这里，我们展示了神经替代模型如何显著减少对这些慢模拟器的依赖，同时保持准确性。以干涉引力波探测器设计为代表性例子，我们训练了一个神经网络来替代由LIGO社区开发的引力波物理模拟器Finesse。尽管物理参数的微小变化可能导致输出值变化数个数量级，但该模型能够快速预测候选设计的质量和可行性，从而实现对大型设计空间的高效探索。我们的算法循环进行训练替代模型、逆向设计新实验，并通过慢模拟器验证其性能以进行进一步训练。借助自动微分和GPU并行计算，我们的方法比直接优化更快地提出高质量的实验。我们的算法在几小时内找到的解决方案胜过优化器需要五天才能达到的设计。尽管是在引力波探测器的背景下展示的，我们的框架在其他领域也可以广泛应用，其中模拟器瓶颈阻碍了优化和发现。

更新时间: 2025-11-24 17:58:59

领域: cs.LG,astro-ph.IM,gr-qc,quant-ph

下载: http://arxiv.org/abs/2511.19364v1

Enhancing Conformal Prediction via Class Similarity

Conformal Prediction (CP) has emerged as a powerful statistical framework for high-stakes classification applications. Instead of predicting a single class, CP generates a prediction set, guaranteed to include the true label with a pre-specified probability. The performance of different CP methods is typically assessed by their average prediction set size. In setups where the classes can be partitioned into semantic groups, e.g., diseases that require similar treatment, users can benefit from prediction sets that are not only small on average, but also contain a small number of semantically different groups. This paper begins by addressing this problem and ultimately offers a widely applicable tool for boosting any CP method on any dataset. First, given a class partition, we propose augmenting the CP score function with a term that penalizes predictions with out-of-group errors. We theoretically analyze this strategy and prove its advantages for group-related metrics. Surprisingly, we show mathematically that, for common class partitions, it can also reduce the average set size of any CP score function. Our analysis reveals the class similarity factors behind this improvement and motivates us to propose a model-specific variant, which does not require any human semantic partition and can further reduce the prediction set size. Finally, we present an extensive empirical study, encompassing prominent CP methods, multiple models, and several datasets, which demonstrates that our class-similarity-based approach consistently enhances CP methods.
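
A minimal split-conformal sketch of the group-penalty idea follows. The abstract does not give the exact penalty term, so the form used here (a constant `lam` added when a candidate label lies outside the group of the model's top prediction) is an assumption, as are the function names and the toy calibration data.

```python
import math


def score(probs, label, groups, lam):
    """Nonconformity score 1 - p(label), plus lam for out-of-group labels."""
    top = max(range(len(probs)), key=probs.__getitem__)
    penalty = lam if groups[label] != groups[top] else 0.0
    return 1.0 - probs[label] + penalty


def calibrate(cal_probs, cal_labels, groups, lam, alpha):
    """Conformal quantile of the calibration scores at miscoverage alpha."""
    scores = sorted(score(p, y, groups, lam) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else math.inf


def predict_set(probs, groups, lam, qhat):
    """All labels whose score does not exceed the calibrated threshold."""
    return [y for y in range(len(probs)) if score(probs, y, groups, lam) <= qhat]


groups = [0, 0, 1, 1]  # classes 0,1 form one semantic group; 2,3 the other
cal_probs = [[0.7, 0.2, 0.05, 0.05], [0.6, 0.3, 0.05, 0.05], [0.1, 0.1, 0.6, 0.2],
             [0.1, 0.1, 0.3, 0.5], [0.5, 0.1, 0.3, 0.1]]
cal_labels = [0, 1, 2, 2, 2]
test_probs = [0.4, 0.25, 0.35, 0.0]

plain = predict_set(test_probs, groups, 0.0, calibrate(cal_probs, cal_labels, groups, 0.0, 0.4))
grouped = predict_set(test_probs, groups, 0.5, calibrate(cal_probs, cal_labels, groups, 0.5, 0.4))
print(plain, grouped)
```

On this toy input the unpenalized set spans both groups while the penalized set stays within one. Five calibration points are far too few to say anything about coverage; the example only shows the mechanics.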

Updated: 2025-11-24 17:56:42

标题: 通过类相似性增强符合性预测

摘要: 共形预测（CP）已经成为高风险分类应用的强大统计框架。与预测单一类别不同，CP生成一个预测集，保证以预先指定的概率包含真实标签。不同CP方法的性能通常通过它们的平均预测集大小来评估。在类别可以被划分为语义组的情况下，例如需要类似治疗的疾病，用户可以受益于不仅平均较小的预测集，而且包含少量语义不同组的预测集。本文首先解决这个问题，并最终提供一个广泛适用的工具，用于提升任何数据集上的任何CP方法。首先，给定一个类别划分，我们提出通过惩罚具有组外错误的预测来增强CP评分函数。我们在理论上分析了这一策略，并证明了其对于组相关度指标的优势。令人惊讶的是，我们在数学上表明，对于常见的类别划分，它还可以减少任何CP评分函数的平均集大小。我们的分析揭示了这种改进背后的类别相似性因素，并激励我们提出一个模型特定的变体，它不需要任何人类语义划分，并可以进一步减小预测集大小。最后，我们展示了一个广泛的实证研究，涵盖了突出的CP方法，多个模型和几个数据集，证明了我们基于类别相似性的方法始终增强了CP方法。

更新时间: 2025-11-24 17:56:42

领域: cs.LG

下载: http://arxiv.org/abs/2511.19359v1

Leveraging LLMs for reward function design in reinforcement learning control tasks

The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better than that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.

Updated: 2025-11-24 17:55:46

标题: 利用LLMs进行强化学习控制任务中的奖励函数设计

摘要: 在强化学习（RL）中设计有效的奖励函数的挑战代表了一个重要的瓶颈，通常需要大量人类专业知识并且耗时。先前的工作以及最近大型语言模型（LLMs）的进展已经展示了它们自动生成奖励函数的潜力。然而，现有的方法通常需要初步评估指标、人工设计的反馈用于细化过程，或者使用环境源代码作为上下文。为了解决这些限制，本文介绍了LEARN-Opt（基于LLM的奖励函数优化评估器和分析器）。这个基于LLM的、完全自主的、模型无关的框架消除了需要初步指标和环境源代码作为上下文来从系统和任务目标的文本描述中生成、执行和评估奖励函数候选人的需求。LEARN-Opt的主要贡献在于它能够自主从系统描述和任务目标中直接推导性能指标，实现无监督评估和选择奖励函数。我们的实验表明，LEARN-Opt的性能与EUREKA等最先进方法相当甚至更好，同时需要更少的先验知识。我们发现，自动奖励设计是一个高方差问题，平均情况下的候选人失败，需要多次运行才能找到最佳候选人。最后，我们展示LEARN-Opt可以释放低成本LLMs的潜力，找到与更大模型相当甚至更好的高性能候选人。这种表现证实了它生成高质量奖励函数的潜力，而无需任何初步人为定义的指标，从而减少工程开销并增强泛化能力。

更新时间: 2025-11-24 17:55:46

领域: cs.LG,cs.AI,cs.RO

下载: http://arxiv.org/abs/2511.19355v1

Node Preservation and its Effect on Crossover in Cartesian Genetic Programming

While crossover is a critical and often indispensable component in other forms of Genetic Programming, such as Linear- and Tree-based, it has consistently been claimed that it deteriorates search performance in CGP. As a result, a mutation-alone $(1+λ)$ evolutionary strategy has become the canonical approach for CGP. Although several operators have been developed that demonstrate an increased performance over the canonical method, a general solution to the problem is still lacking. In this paper, we compare basic crossover methods, namely one-point and uniform, to variants in which nodes are "preserved," including the subgraph crossover developed by Roman Kalkreuth, the difference being that when "node preservation" is active, crossover is not allowed to break apart instructions. We also compare a node mutation operator to the traditional point mutation; the former simply replaces an entire node with a new one. We find that node preservation in both mutation and crossover improves search using symbolic regression benchmark problems, moving the field towards a general solution to CGP crossover.
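
The difference between plain and node-preserving one-point crossover can be sketched on a flat genome. The layout assumed here (three genes per node: function, first input, second input, with output genes omitted) and the function name are illustrative choices, not details taken from the paper.

```python
import random

GENES_PER_NODE = 3  # assumed layout: [function, input_a, input_b]


def one_point_crossover(parent_a, parent_b, rng, preserve_nodes):
    """Return (child, cut); with preserve_nodes the cut never splits a node."""
    n = len(parent_a)
    if preserve_nodes:
        # Cut only on node boundaries, so every instruction stays intact.
        cut = GENES_PER_NODE * rng.randrange(1, n // GENES_PER_NODE)
    else:
        # Cut at any gene, possibly in the middle of a node.
        cut = rng.randrange(1, n)
    return parent_a[:cut] + parent_b[cut:], cut


rng = random.Random(0)
parent_a = [0, 0, 1, 1, 0, 2, 2, 1, 3, 0, 3, 4]  # four hypothetical nodes
parent_b = [3, 1, 0, 2, 2, 1, 1, 0, 3, 3, 4, 2]
child, cut = one_point_crossover(parent_a, parent_b, rng, preserve_nodes=True)
print(cut, child)
```

With `preserve_nodes=True` every node in the child is copied whole from one parent; with `preserve_nodes=False` a node's function gene can end up paired with the other parent's input genes, which is the kind of disruption the abstract refers to as breaking apart instructions.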

Updated: 2025-11-24 17:55:01

标题: 节点保留及其对笛卡尔遗传编程中交叉的影响

摘要: 虽然在其他形式的遗传编程中，如线性和基于树的遗传编程中，交叉是一个至关重要且不可或缺的组成部分，但人们一直声称它会降低CGP的搜索性能。因此，一种仅使用突变的$(1+\lambda)$进化策略已成为CGP的经典方法。尽管已开发了几种操作符，证明了它们比经典方法的性能更好，但问题的一般解决方案仍然缺乏。在本文中，我们将基本的交叉方法，即单点和均匀交叉，与保留节点的变体进行了比较，其中包括Roman Kalkreuth开发的子图交叉，其不同之处在于当“节点保留”处于活动状态时，交叉不允许打破指令。我们还将节点突变操作符与传统的点突变进行了比较；前者简单地用新节点替换整个节点。我们发现，在符号回归基准问题中，节点保留在突变和交叉中都可以改善搜索，将该领域推向CGP交叉的一般解决方案。

更新时间: 2025-11-24 17:55:01

领域: cs.NE,cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.00634v2

Scalable Parameter-Light Spectral Method for Clustering Short Text Embeddings with a Cohesion-Based Evaluation Metric

Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
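
As a rough illustration of estimating the number of clusters from a Laplacian eigenspectrum, the sketch below applies the standard eigengap heuristic to a cosine-similarity graph. It is not the authors' estimator: it omits their adaptive sampling strategy, and the function name `estimate_k_eigengap`, the `max_k` parameter, and the clipping of negative similarities are assumptions made for this example.

```python
import numpy as np


def estimate_k_eigengap(embeddings, max_k=10):
    """Estimate the cluster count with the classic eigengap heuristic.

    Assumes every embedding row is non-zero.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine-similarity graph; negative similarities are clipped (assumption).
    w = np.clip(x @ x.T, 0.0, None)
    np.fill_diagonal(w, 0.0)

    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(x)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]

    eigvals = np.linalg.eigvalsh(lap)  # returned in ascending order
    upper = min(max_k, len(eigvals) - 1)
    gaps = np.diff(eigvals[: upper + 1])

    # The largest gap after the k-th smallest eigenvalue suggests k clusters.
    return int(np.argmax(gaps)) + 1
```

The dense eigendecomposition here scales cubically with the number of points, which is presumably the cost that a sampling strategy is meant to avoid on large datasets.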

Updated: 2025-11-24 17:52:58

标题: 可扩展的参数轻量级谱聚类短文本嵌入的方法，基于凝聚性评估度量

摘要: 对短文本嵌入进行聚类是自然语言处理中的基础任务，但由于需要事先指定聚类数量，仍然具有挑战性。我们引入了一种可伸缩的谱方法，该方法直接从Laplacian特征谱的结构中估计聚类数量，该特征谱是使用余弦相似性构建的，并受自适应抽样策略指导。这种抽样方法使我们的估计器能够高效地扩展到大型数据集，而不会牺牲可靠性。为了支持无地面真值标签的簇质量的内在评估，我们提出了一种简单且可解释的评估指标，称为Cohesion Ratio，该指标量化了簇内相似性超过全局相似性背景的程度。它受到互信息启发的信息论动机，在我们的实验中，它与外部度量（如归一化互信息和同质性）密切相关。对六个短文本数据集和四个现代嵌入模型的大量实验表明，当受我们的估计器指导时，标准算法（如K-Means和HAC）明显优于流行的参数轻量级方法（如HDBSCAN、OPTICS和Leiden）。这些结果展示了我们的谱估计器和Cohesion Ratio在无监督组织和评估短文本数据方面的实际价值。我们的k估计器和Cohesion Ratio的实现，以及重现实验所需的代码，可在https://anonymous.4open.science/r/towards_clustering-0C2E 上获得。

更新时间: 2025-11-24 17:52:58

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2511.19350v1

Artificial Intelligence Driven Workflow for Accelerating Design of Novel Photosensitizers

The discovery of high-performance photosensitizers has long been hindered by the time-consuming and resource-intensive nature of traditional trial-and-error approaches. Here, we present AI-Accelerated PhotoSensitizer Innovation (AAPSI), a closed-loop workflow that integrates expert knowledge, scaffold-based molecule generation, and Bayesian optimization to accelerate the design of novel photosensitizers. The scaffold-driven generation in AAPSI ensures structural novelty and synthetic feasibility, while the iterative AI-experiment loop accelerates the discovery of novel photosensitizers. AAPSI leverages a curated database of 102,534 photosensitizer-solvent pairs and generates 6,148 synthetically accessible candidates. These candidates are screened via graph transformers trained to predict singlet oxygen quantum yield ($φ_Δ$) and absorption maxima ($λ_{max}$), followed by experimental validation. This work generates several novel candidates for photodynamic therapy (PDT), among which the hypocrellin-based candidate HB4Ph exhibits exceptional performance at the Pareto frontier of high quantum yield of singlet oxygen and long absorption maxima among current photosensitizers ($φ_Δ$ = 0.85, $λ_{max}$ = 650 nm).

Updated: 2025-11-24 17:46:54

标题: 人工智能驱动的工作流程加速新型光敏剂设计

摘要: 长期以来，传统的试错方法因其耗时和资源密集的特点一直阻碍着高性能光敏剂的发现。在这里，我们提出了加速光敏剂创新的AI加速光敏剂创新（AAPSI）闭环工作流程，该工作流程整合了专家知识、基于骨架的分子生成和贝叶斯优化，加速了新型光敏剂的设计。AAPSI中的骨架驱动生成确保了结构的新颖性和合成可行性，而迭代的AI-实验循环加速了新型光敏剂的发现。AAPSI利用了一个经过筛选的包含102,534个光敏剂-溶剂对的数据库，并生成了6,148个可合成的候选物。这些候选物通过经过训练的图变换器进行筛选，以预测单线态氧量子产量（$φ_Δ$）和吸收极值（$λ_{max}$），并进行实验验证。这项工作为光动力疗法（PDT）产生了几个新的候选物，其中基于孤儿草素的候选物HB4Ph在当前光敏剂中表现出卓越的性能，处于单线态氧高量子产量和长吸收极值的帕累托前沿位置（$φ_Δ$=0.85，$λ_{max}$=650nm）。

更新时间: 2025-11-24 17:46:54

领域: cond-mat.mtrl-sci,cs.LG,physics.chem-ph

下载: http://arxiv.org/abs/2511.19347v1

Annotation-Free Class-Incremental Learning

Despite significant progress in continual learning, ranging from architectural novelty to clever strategies for mitigating catastrophic forgetting, most existing methods rest on a strong but unrealistic assumption: the availability of labeled data throughout the learning process. In real-world scenarios, however, data often arrives sequentially and without annotations, rendering conventional approaches impractical. In this work, we revisit the fundamental assumptions of continual learning and ask: Can current systems adapt when labels are absent and tasks emerge incrementally over time? To this end, we introduce Annotation-Free Class-Incremental Learning (AFCIL), a more realistic and challenging paradigm where unlabeled data arrives continuously, and the learner must incrementally acquire new classes without any supervision. To enable effective learning under AFCIL, we propose CrossWorld-CL, a Cross-Domain World-Guided Continual Learning framework that incorporates external world knowledge as a stable auxiliary source. The method retrieves semantically related ImageNet classes for each downstream category, maps downstream and ImageNet features through a cross-domain alignment strategy, and finally introduces a novel replay strategy. This design lets the model uncover semantic structure without annotations while keeping earlier knowledge intact. Across four datasets, CrossWorld-CL surpasses CLIP baselines and existing continual and unlabeled learning methods, underscoring the benefit of world knowledge for annotation-free continual learning.

Updated: 2025-11-24 17:44:48

标题: 无标注类增量学习

摘要: 尽管在持续学习方面取得了显着进展，从架构创新到减轻灾难性遗忘的巧妙策略，但大多数现有方法都建立在一个强大但不现实的假设上，即在学习过程中始终有标记数据可用。然而，在现实世界的场景中，数据通常是顺序到达且没有注释的，这使传统方法变得不切实际。在这项工作中，我们重新审视了持续学习的基本假设，并提出了一个问题：在标签缺失且任务逐渐随时间出现时，当前系统能否适应？为此，我们引入了Annotation-Free Class-Incremental Learning（AFCIL），这是一种更现实和具有挑战性的范式，其中未标记数据持续到达，学习者必须在没有任何监督的情况下逐步获取新的类别。为了在AFCIL下实现有效学习，我们提出了CrossWorld CL，这是一个跨领域世界引导的持续学习框架，它将外部世界知识作为一个稳定的辅助来源。该方法为每个下游类别检索语义相关的ImageNet类别，通过跨领域对齐策略映射下游和ImageNet特征，最后引入一种新颖的重放策略。这种设计使模型在没有注释的情况下揭示语义结构，同时保持先前的知识完整。在四个数据集上，CrossWorld-CL超越了CLIP基线和现有的持续和无标签学习方法，突显了世界知识对无标注持续学习的益处。

更新时间: 2025-11-24 17:44:48

领域: cs.LG

下载: http://arxiv.org/abs/2511.19344v1

Explicit Tonal Tension Conditioning via Dual-Level Beam Search for Symbolic Music Generation

State-of-the-art symbolic music generation models have recently achieved remarkable output quality, yet explicit control over compositional features, such as tonal tension, remains challenging. We propose a novel approach that integrates a computational tonal tension model, based on tonal interval vector analysis, into a Transformer framework. Our method employs a two-level beam search strategy during inference. At the token level, generated candidates are re-ranked using model probability and diversity metrics to maintain overall quality. At the bar level, a tension-based re-ranking is applied to ensure that the generated music aligns with a desired tension curve. Objective evaluations indicate that our approach effectively modulates tonal tension, and subjective listening tests confirm that the system produces outputs that align with the target tension. These results demonstrate that explicit tension conditioning through a dual-level beam search provides a powerful and intuitive tool to guide AI-generated music. Furthermore, our experiments demonstrate that our method can generate multiple distinct musical interpretations under the same tension condition.

Updated: 2025-11-24 17:41:04

标题: 通过双层束搜索的明确音调张力调节用于符号音乐生成

摘要: 最新的符号音乐生成模型最近取得了显著的输出质量，但对作曲特征（如音调张力）的明确控制仍具有挑战性。我们提出了一种新颖的方法，将基于音调间隔向量分析的计算音调张力模型集成到Transformer框架中。我们的方法在推断过程中采用了两级波束搜索策略。在标记级别上，生成的候选项通过模型概率和多样性指标重新排名，以保持整体质量。在小节级别上，应用基于张力的重新排名，以确保生成的音乐与期望的张力曲线保持一致。客观评估表明，我们的方法有效地调节了音调张力，主观听力测试证实系统生成的输出与目标张力相一致。这些结果表明，通过双级波束搜索对音调进行明确调节提供了一个强大且直观的工具，用于引导AI生成的音乐。此外，我们的实验表明，我们的方法可以在相同的张力条件下生成多个不同的音乐解释。

更新时间: 2025-11-24 17:41:04

领域: cs.SD,cs.AI

下载: http://arxiv.org/abs/2511.19342v1

High-throughput validation of phase formability and simulation accuracy of Cantor alloys

High-throughput methods enable accelerated discovery of novel materials in complex systems such as high-entropy alloys, which exhibit intricate phase stability across vast compositional spaces. Computational approaches, including Density Functional Theory (DFT) and calculation of phase diagrams (CALPHAD), facilitate screening of phase formability as a function of composition and temperature. However, the integration of computational predictions with experimental validation remains challenging in high-throughput studies. In this work, we introduce a quantitative confidence metric to assess the agreement between predictions and experimental observations, measuring how well machine learning models trained on either DFT or CALPHAD input account for experimental evidence. The experimental dataset was generated via high-throughput in-situ synchrotron X-ray diffraction on compositionally varied FeNiMnCr alloy libraries, heated from room temperature to ~1000 °C. Agreement between the observed and predicted phases was evaluated using either temperature-independent phase classification or a model that incorporates a temperature-dependent probability of phase formation. This integrated approach demonstrates where strong overall agreement between computation and experiment exists, while also identifying key discrepancies, particularly in FCC/BCC predictions in Mn-rich regions, to inform future model refinement.

Updated: 2025-11-24 17:31:16

标题: 高通量验证康托合金的相形成性和模拟准确性

摘要: 高通量方法使得在复杂系统中如高熵合金等展示复杂相稳定性的材料的发现加速。计算方法，包括密度泛函理论（DFT）和相图计算（CALPHAD），促进了相形成性能随成分和温度的筛选。然而，在高通量研究中，计算预测与实验验证的整合仍然具有挑战性。在本研究中，我们引入了一个定量的置信度指标来评估预测与实验观测之间的一致性，提供了机器学习模型对于DFT或CALPHAD输入在解释实验证据方面的置信度的定量度量。实验数据集是通过在成分多样的FeNiMnCr合金库上进行高通量原位同步辐射X射线衍射生成的，从室温升至约1000摄氏度。通过使用温度无关的相分类或一个包含温度相关相形成概率的模型来评估观察到的和预测的相之间的一致性。这种整合方法展示了计算和实验之间的整体一致性，同时也识别了关键的不一致性，特别是在富锰区域的FCC/BCC预测中，以供未来模型精化的参考。

更新时间: 2025-11-24 17:31:16

领域: cond-mat.mtrl-sci,cs.LG

下载: http://arxiv.org/abs/2511.19335v1

Evolution of Cybersecurity Subdisciplines: A Science of Science Study

The science of science is an emerging field that studies the practice of science itself. We present the first study of the cybersecurity discipline from a science of science perspective. We examine the evolution of two comparable interdisciplinary communities in cybersecurity: the Symposium on Usable Privacy and Security (SOUPS) and Financial Cryptography and Data Security (FC).

Updated: 2025-11-24 17:26:28

标题: 网络安全子学科的演变：一项科学研究的研究

摘要: 科学科学是一个新兴领域，研究科学实践本身。我们首次从科学科学的角度研究了网络安全学科。我们研究了网络安全领域中两个可比较的跨学科社区的发展：可用性隐私与安全研讨会（SOUPS）和金融密码学与数据安全（FC）。

更新时间: 2025-11-24 17:26:28

领域: cs.CR

下载: http://arxiv.org/abs/2511.19331v1

Targeted Manipulation: Slope-Based Attacks on Financial Time-Series Data

A common method of attacking deep learning models is through adversarial attacks, which occur when an attacker specifically modifies the input of a model to produce an incorrect result. Adversarial attacks have been deeply investigated in the image domain; however, there is less research in the time-series domain and very little for forecasting financial data. To address these concerns, this study builds upon previous research on adversarial attacks for time-series data by introducing two new slope-based methods that aim to alter the trend of the stock forecast generated by an N-HiTS model. Compared to the normal N-HiTS predictions, the two new slope-based methods, the General Slope Attack and Least-Squares Slope Attack, can manipulate N-HiTS predictions by doubling the slope. These new slope attacks can bypass standard security mechanisms, such as a discriminator that filters real and perturbed inputs, reducing a 4-layer CNN's specificity to 28% and accuracy to 57%. Furthermore, the slope-based methods were incorporated into a GAN architecture as a means of generating realistic synthetic data while simultaneously fooling the model. Finally, this paper also proposes a sample malware designed to inject an adversarial attack into the model inference library, showing that ML-security research should focus not only on making the model safe, but also on securing the entire pipeline.
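
To make the "doubling the slope" claim concrete, the sketch below shows one way the trend of a forecast could be measured: the ordinary least-squares slope of the predicted values against their time index. It only measures the effect of an attack and does not implement either attack; the helper names `least_squares_slope` and `slope_ratio` are hypothetical, and whether the paper scores its attacks exactly this way is an assumption.

```python
import numpy as np


def least_squares_slope(values):
    """Slope of the least-squares line through (0, y0), (1, y1), ..."""
    y = np.asarray(values, dtype=float)
    t = np.arange(len(y), dtype=float)
    t_centered = t - t.mean()
    return float(t_centered @ (y - y.mean()) / (t_centered @ t_centered))


def slope_ratio(clean_forecast, attacked_forecast):
    """Ratio of attacked to clean trend; undefined for a flat clean forecast."""
    return least_squares_slope(attacked_forecast) / least_squares_slope(clean_forecast)
```

Under this measure, a ratio near 2 would correspond to a forecast whose trend has been doubled relative to the clean prediction.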

Updated: 2025-11-24 17:26:20

标题: 目标化操纵：基于斜率的金融时间序列数据攻击

摘要: 攻击深度学习模型的一种常见方法是通过对抗性攻击，当攻击者有意修改模型的输入以产生不正确的结果时发生对抗性攻击。对抗性攻击在图像领域得到了深入研究；然而，在时间序列领域的研究较少，对于预测金融数据的研究更是凤毛麟角。为了解决这些问题，本研究旨在在以往对时间序列数据的对抗性攻击研究基础上，引入两种新的基于斜率的方法，旨在改变由N-HiTS模型生成的预测股票预测的趋势。与正常的N-HiTS预测相比，这两种新的基于斜率的方法——General Slope Attack和Least-Squares Slope Attack，可以通过将斜率加倍来操纵N-HiTS预测。这些新的斜率攻击可以绕过标准的安全机制，例如过滤真实和扰动输入的鉴别器，将4层CNN的特异性降低到28％，准确性降低到57％。此外，基于斜率的方法被纳入到GAN架构中，作为生成逼真合成数据的手段，同时愚弄模型。最后，本文还提出了一个示例恶意软件，旨在向模型推断库中注入对抗性攻击，证明了机器学习安全研究不仅应关注使模型安全，还应确保整个流程的安全。

更新时间: 2025-11-24 17:26:20

领域: cs.LG

下载: http://arxiv.org/abs/2511.19330v1

Random Spiking Neural Networks are Stable and Spectrally Simple

Spiking neural networks (SNNs) are a promising paradigm for energy-efficient computation, yet their theoretical foundations, especially regarding stability and robustness, remain limited compared to artificial neural networks. In this work, we study discrete-time leaky integrate-and-fire (LIF) SNNs through the lens of Boolean function analysis. We focus on noise sensitivity and stability in classification tasks, quantifying how input perturbations affect outputs. Our main result shows that wide LIF-SNN classifiers are stable on average, a property explained by the concentration of their Fourier spectrum on low-frequency components. Motivated by this, we introduce the notion of spectral simplicity, which formalizes simplicity in terms of Fourier spectrum concentration and connects our analysis to the simplicity bias observed in deep networks. Within this framework, we show that random LIF-SNNs are biased toward simple functions. Experiments on trained networks confirm that these stability properties persist in practice. Together, these results provide new insights into the stability and robustness properties of SNNs.
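
Noise sensitivity in the Boolean-analysis sense is the probability that a classifier's output changes when each input bit is flipped independently with probability δ. The sketch below estimates it by Monte Carlo for a toy single-layer discrete-time LIF classifier. The network shape, the constant input current, the hard reset, and the names `lif_classify` and `noise_sensitivity` are all assumptions made for illustration and do not reproduce the model analysed in the paper.

```python
import numpy as np


def lif_classify(x, weights, beta=0.9, theta=1.0, steps=8):
    """Classify x in {-1, +1}^n by the output neuron with the most spikes."""
    current = weights @ x                     # constant input current per neuron
    u = np.zeros(weights.shape[0])            # membrane potentials
    counts = np.zeros(weights.shape[0], dtype=int)
    for _ in range(steps):
        u = beta * u + current                # leaky integration
        spikes = u >= theta                   # fire when the threshold is reached
        counts += spikes
        u = np.where(spikes, 0.0, u)          # hard reset after a spike (assumption)
    return int(np.argmax(counts))


def noise_sensitivity(weights, delta, trials, rng):
    """Monte Carlo estimate of Pr[f(x) != f(y)], y a delta-noisy copy of x."""
    n = weights.shape[1]
    changed = 0
    for _ in range(trials):
        x = rng.choice([-1.0, 1.0], size=n)
        flip = rng.random(n) < delta          # flip each coordinate w.p. delta
        y = np.where(flip, -x, x)
        changed += lif_classify(x, weights) != lif_classify(y, weights)
    return changed / trials
```

A possible usage is to draw `weights` from a Gaussian scaled by `1 / sqrt(n)` and compare the estimate for several widths `n` and noise rates `delta`.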

Updated: 2025-11-24 17:25:02

标题: 随机脉冲神经网络是稳定且谱简单的

摘要: 尖峰神经网络（SNNs）是一种具有高能效计算潜力的范式，然而与人工神经网络相比，它们的理论基础，特别是稳定性和鲁棒性方面，仍然有限。在这项工作中，我们通过布尔函数分析的视角研究了离散时间泄漏积分-火（LIF）SNNs。我们关注在分类任务中的噪声敏感性和稳定性，量化输入扰动对输出的影响。我们的主要结果表明，宽LIF-SNN分类器平均上是稳定的，这一特性可以通过它们的傅里叶谱主要集中在低频分量上来解释。受此启发，我们引入了谱简单性的概念，它以傅里叶谱集中度形式化了简单性，并将我们的分析与深度网络中观察到的简单性偏好联系起来。在这个框架内，我们展示了随机LIF-SNNs偏向于简单函数。对训练网络的实验证实了这些稳定性特性在实践中的持续存在。这些结果共同为SNNs的稳定性和鲁棒性特性提供了新的见解。

更新时间: 2025-11-24 17:25:02

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2511.00904v2

Understanding the Staged Dynamics of Transformers in Learning Latent Structure

While transformers can discover latent structure from context, the dynamics of how they acquire different components of the latent structure remain poorly understood. In this work, we use the Alchemy benchmark to investigate the dynamics of latent structure learning. We train a small decoder-only transformer on three task variants: 1) inferring missing rules from partial contextual information, 2) composing simple rules to solve multi-step sequences, and 3) decomposing complex multi-step examples to infer intermediate steps. By factorizing each task into interpretable events, we show that the model acquires capabilities in discrete stages, first learning the coarse-grained rules before learning the complete latent structure. We also identify a crucial asymmetry, where the model can compose fundamental rules robustly, but struggles to decompose complex examples to discover the fundamental rules. These findings offer new insights into how a transformer model learns latent structures, providing a granular view of how these capabilities evolve during training.

Updated: 2025-11-24 17:20:42

标题: 理解变压器在学习潜在结构中的阶段动态

摘要: 虽然变压器可以从上下文中发现潜在结构，但它们如何获取潜在结构的不同组成部分的动态仍然知之甚少。在这项工作中，我们使用Alchemy基准测试来研究潜在结构学习的动态过程。我们在三种任务变体上训练了一个小型的仅解码器变压器：1）从部分上下文信息中推断缺失规则，2）组合简单规则解决多步序列，3）分解复杂的多步示例以推断中间步骤。通过将每个任务因子化为可解释的事件，我们展示模型在离散阶段获取能力，首先学习粗粒度规则，然后学习完整的潜在结构。我们还确定了一个关键的不对称性，即模型可以稳健地组合基本规则，但在分解复杂示例以发现基本规则方面存在困难。这些发现为理解变压器模型如何学习潜在结构提供了新的见解，提供了这些能力在训练过程中如何演变的细粒度视角。

更新时间: 2025-11-24 17:20:42

领域: cs.LG

下载: http://arxiv.org/abs/2511.19328v1

Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval

Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.

Updated: 2025-11-24 17:18:25

标题: 使用多语言LLMs进行生成式查询扩展，用于跨语言信息检索

摘要: 查询扩展是通过添加语义相关信息来重新构建用户查询的过程，是单语和跨语言信息检索的基本组成部分，用于确保相关文档不会被遗漏。最近，多语言大型语言模型（mLLMs）已将查询扩展从使用同义词和相关词进行语义增强转变为伪文档生成。伪文档既引入了额外相关术语，又填补了短查询和长文档之间的差距，在密集检索中特别有益。本研究评估了最近的mLLMs和经过微调的变体在几种生成扩展策略上的表现，以确定影响跨语言检索性能的因素。结果显示，查询长度在很大程度上决定了哪种提示技术有效，而更复杂的提示往往不会带来进一步的收益。存在着显著的语言差异：跨语言查询扩展可以为基线最弱的语言带来最大的改进，但不同脚本语言之间的检索效果特别差。发现只有当训练和测试数据具有相似的格式时，微调才会导致性能提升。这些结果凸显了对更均衡的多语言和跨语言训练和评估资源的需求。

更新时间: 2025-11-24 17:18:25

领域: cs.IR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2511.19325v1

What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models

Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.

Updated: 2025-11-24 17:17:40

标题: 是什么推动了跨语言排名？使用多语言语言模型的检索方法

摘要: 跨语言信息检索（CLIR）可以访问多语言知识，但由于资源、脚本之间的差异以及嵌入模型中的跨语言语义对齐不足而仍然具有挑战性。现有的流程通常依赖于翻译和单语检索启发式方法，这会增加计算开销和噪声，降低性能。本研究系统评估了四种干预类型，即文档翻译、使用预训练编码器的多语种密集检索、在单词、短语和查询-文档级别进行对比学习，以及交叉编码器重新排序，跨越三个基准数据集。我们发现专门针对CLIR训练的密集检索模型始终优于词汇匹配方法，并且从文档翻译中几乎没有获益。对比学习可以减轻语言偏见，并为初始对齐较弱的编码器带来显著改进，重新排序可能有效，但取决于交叉编码器训练数据的质量。尽管高资源语言仍然在整体性能上占据主导地位，但与词汇和文档翻译基线相比，对于低资源和跨脚本对而言，改进最为显著。这些发现表明，跨语言搜索系统应优先考虑语义多语种嵌入和有针对性的基于学习的对齐，而不是基于翻译的流程，特别是对于跨脚本和资源匮乏的语言。

更新时间: 2025-11-24 17:17:40

领域: cs.IR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2511.19324v1

Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them

We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.

Updated: 2025-11-24 17:17:31

标题: 使用LLM生成的数据增强领域特定的编码器模型：如何利用本体，以及如何在没有本体的情况下进行

摘要: 我们研究了在具有有限训练数据的专业领域中，利用LLM生成的数据对编码器模型进行持续预训练的方法，以入侵生物学领域作为案例研究。为此，我们利用领域特定本体论，通过用LLM生成的数据丰富它们，并将编码器模型预训练为一个基于本体论的嵌入模型，用于概念定义。为了评估这种方法的有效性，我们编制了一个专门设计用于评估入侵生物学模型性能的基准测试。在展示了相对于标准LLM预训练的显著改进后，我们研究了将所提出的方法应用于没有全面本体论的领域的可行性，通过用从一小部分科学摘要中自动提取的概念替换本体论概念，并通过分布统计建立概念之间的关系。我们的结果表明，这种自动化方法只使用一小部分科学摘要即可实现可比较的性能，从而实现了一个完全自动化的流程，用于增强特定领域对小型编码器模型的理解，特别适用于低资源环境，并实现了与在更大数据集上进行的掩码语言建模预训练相当的性能。

更新时间: 2025-11-24 17:17:31

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2503.22006v2

Interpreting Graph Inference with Skyline Explanations

Inference queries have been routinely issued to graph machine learning models such as graph neural networks (GNNs) for various network analytical tasks. Nevertheless, GNN outputs are often hard to interpret comprehensively. Existing methods typically conform to individual pre-defined explainability measures (such as fidelity), which often leads to biased, "one-sided" interpretations. This paper introduces skyline explanation, a new paradigm that interprets GNN outputs by simultaneously optimizing multiple explainability measures of interest to users. (1) We propose skyline explanations as a Pareto set of explanatory subgraphs that dominate others over multiple explanatory measures. We formulate skyline explanation as a multi-criteria optimization problem, and establish its hardness results. (2) We design efficient algorithms with an onion-peeling approach, which strategically prioritizes nodes and removes unpromising edges to incrementally assemble skyline explanations. (3) We also develop an algorithm to diversify the skyline explanations to enrich the comprehensive interpretation. (4) We introduce efficient parallel algorithms with load-balancing strategies to scale skyline explanation for large-scale GNN-based inference. Using real-world and synthetic graphs, we experimentally verify our algorithms' effectiveness and scalability.
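
The Pareto-set idea behind a skyline can be sketched in a few lines. The example below assumes each candidate explanation has already been scored on several measures where higher is better, and it returns the candidates that no other candidate dominates. It is a brute-force illustration only, not the onion-peeling algorithm from the paper; the names `dominates` and `skyline` and the two-measure scores in the usage note are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates):
    """Return the names of all non-dominated candidates.

    candidates maps a name to a tuple of scores, higher being better.
    """
    return {
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    }
```

For instance, with scores `{"g1": (0.9, 0.2), "g2": (0.5, 0.5), "g3": (0.4, 0.4), "g4": (0.2, 0.9)}`, candidate `g3` is dominated by `g2`, while `g1`, `g2`, and `g4` each trade one measure against the other and stay in the skyline.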

Updated: 2025-11-24 17:17:12

标题: 用天际线解释解释图推理

摘要: 推断查询通常被发出到图机器学习模型，如图神经网络（GNN）进行各种网络分析任务。然而，GNN的输出通常很难全面解释。现有方法通常符合个体预定义的可解释性度量（如保真度），这通常会导致有偏见的“单方面”解释。本文介绍了天际线解释，这是一种新的范式，通过同时优化用户感兴趣的多个解释性度量来解释GNN的输出。（1）我们提出了天际线解释，作为支配其他解释度量的解释子图的帕累托集。我们将天际线解释形式化为一个多标准优化问题，并建立了其难度结果。（2）我们设计了一种具有洋葱剥皮方法的高效算法，该方法战略性地优先考虑节点并移除不利的边，逐步组装天际线解释。（3）我们还开发了一种算法，以丰富全面解释的方式来使天际线解释多样化。（4）我们引入了具有负载平衡策略的高效并行算法，以扩展大规模GNN推断的天际线解释。通过使用真实世界和合成图形，我们实验证明了我们算法的有效性和可扩展性。

更新时间: 2025-11-24 17:17:12

领域: cs.LG,cs.DB

下载: http://arxiv.org/abs/2505.07635v4

When do World Models Successfully Learn Dynamical Systems?

In this work, we explore the use of compact latent representations with learned time dynamics ('World Models') to simulate physical systems. Drawing on concepts from control theory, we propose a theoretical framework that explains why projecting time slices into a low-dimensional space and then concatenating to form a history ('Tokenization') is so effective at learning physics datasets, and characterise when exactly the underlying dynamics admit a reconstruction mapping from the history of previous tokenized frames to the next. To validate these claims, we develop a sequence of models with increasing complexity, starting with least-squares regression and progressing through simple linear layers, shallow adversarial learners, and ultimately full-scale generative adversarial networks (GANs). We evaluate these models on a variety of datasets, including modified forms of the heat and wave equations, the chaotic regime 2D Kuramoto-Sivashinsky equation, and a challenging computational fluid dynamics (CFD) dataset of a 2D Kármán vortex street around a fixed cylinder, where our model is successfully able to recreate the flow.

Updated: 2025-11-24 17:16:42

标题: 何时世界模型能成功学习动力系统？

摘要: 在这项工作中，我们探讨了使用具有学习时间动态的紧凑潜在表示（'世界模型'）来模拟物理系统。借鉴控制理论的概念，我们提出了一个理论框架，解释为什么将时间切片投影到低维空间，然后连接形成历史（'标记化'）在学习物理数据集方面如此有效，并确定在什么情况下底层动态确实允许从前几个标记化帧的历史到下一个的重构映射。为了验证这些说法，我们开发了一系列逐渐增加复杂性的模型，从最小二乘回归开始，经过简单的线性层，浅层对抗学习者，最终到全面的生成对抗网络（GANs）。我们在各种数据集上评估这些模型，包括修改形式的热传导方程和波动方程，混沌区域的2D Kuramoto-Sivashinsky方程，以及围绕固定圆柱体的2D Kármán涡街的具有挑战性的计算流体力学（CFD）数据集，我们的模型成功地能够重新创建流动。

更新时间: 2025-11-24 17:16:42

领域: math.NA,cs.LG

下载: http://arxiv.org/abs/2507.04898v2

Entropic Time Schedulers for Generative Diffusion Models

The practical performance of generative diffusion models depends on the appropriate choice of the noise scheduling function, which can also be equivalently expressed as a time reparameterization. In this paper, we present a time scheduler that selects sampling points based on entropy rather than uniform time spacing, ensuring that each point contributes an equal amount of information to the final generation. We prove that this time reparameterization does not depend on the initial choice of time. Furthermore, we provide a tractable exact formula to estimate this entropic time for a trained model using the training loss without substantial overhead. Alongside the entropic time, inspired by the optimality results, we introduce a rescaled entropic time. In our experiments with mixtures of Gaussian distributions and ImageNet, we show that using the (rescaled) entropic times greatly improves the inference performance of trained models. In particular, we found that the image quality in pretrained EDM2 models, as evaluated by FID and FD-DINO scores, can be substantially increased by the rescaled entropic time reparameterization without increasing the number of function evaluations (NFEs), with greater improvements in the few-NFE regime. Code is available at https://github.com/DejanStancevic/Entropic-Time-Schedulers-for-Generative-Diffusion-Models.

Updated: 2025-11-24 17:16:26

标题: 生成扩散模型的熵时间调度器

摘要: 生成扩散模型的实际性能取决于噪声调度函数的适当选择，这也可以等效地表示为时间重新参数化。在本文中，我们提出了一个基于熵而不是均匀时间间隔选择采样点的时间调度器，确保每个点对最终生成的信息贡献相等。我们证明这种时间重新参数化不依赖于时间的初始选择。此外，我们提供了一个可行的精确公式来估计训练模型的此“熵时间”，并且不会带来实质性的开销。受到最优性结果的启发，我们引入了一个重新缩放的熵时间。在我们对高斯分布混合物和ImageNet进行的实验中，我们发现使用（重新缩放的）熵时间极大地提高了训练模型的推理性能。特别是，我们发现对于预训练EDM2模型的图像质量，根据FID和FD-DINO分数评估，通过重新缩放的熵时间重新参数化可以显著提高，而不增加函数评估次数，在较少的NFEs情况下改进更为明显。代码可在https://github.com/DejanStancevic/Entropic-Time-Schedulers-for-Generative-Diffusion-Models找到。

更新时间: 2025-11-24 17:16:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2504.13612v4

The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet

Despite their success, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. These shortcomings can be traced to a lack of inductive biases that reflect the inherent geometric structure of the visual world. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural and computational principles, which evolved to internalize these structures, may offer a blueprint for more capable artificial vision. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet is framed as a geometric framework that emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation for learning disentangled representations, and top-down predictive feedback for representation refinement. We interpret these mechanisms through the lens of geometry and dynamical systems, positing that they guide the learning of structured, low-dimensional neural manifolds. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset, which probes sensitivity to natural textures, and a light field image classification task, which requires processing higher-dimensional visual data. Our results show that VCNet achieves state-of-the-art accuracy of 92.1% on Spots-10 and 74.4% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating high-level neuroscientific principles, viewed through a geometric lens, can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning.

Updated: 2025-11-24 17:11:32

标题: 大脑皮层计算的几何学：VCNet中的流形解缠和预测动态

摘要: 尽管现代卷积神经网络（CNNs）取得了成功，但它们展现出基本限制，包括数据效率低、对分布外泛化差和对对抗性扰动的脆弱性。这些缺点可以追溯到缺乏反映视觉世界固有几何结构的归纳偏见。相比之下，灵长类视觉系统表现出更高的效率和鲁棒性，这表明其架构和计算原则，这些原则进化为内化这些结构，可能为更有能力的人工视觉提供蓝图。本文介绍了视觉皮层网络（VCNet），这是一种新颖的神经网络架构，其设计受到灵长类视觉皮层宏观结构的启发。VCNet被构建为一个几何框架，模拟了关键的生物机制，包括不同皮层区域之间的分层处理、用于学习分离表示的双流信息隔离以及用于表示细化的自上而下的预测反馈。我们通过几何和动力系统的视角解释这些机制，认为它们指导了结构化、低维神经流形的学习。我们在两个专门的基准测试上评估了VCNet：Spots-10动物图案数据集，该数据集探索对自然纹理的敏感性，以及光场图像分类任务，该任务需要处理更高维的视觉数据。我们的结果显示，VCNet在Spots-10数据集上实现了92.1\%的最新准确率，在光场数据集上实现了74.4\%的准确率，超过了相同规模的当代模型。这项工作表明，通过几何视角看待高级神经科学原则，可以导致更高效、更鲁棒的模型，为解决机器学习中长期存在的挑战提供了一个有希望的方向。

更新时间: 2025-11-24 17:11:32

领域: cs.NE,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2508.02995v3

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

Updated: 2025-11-24 17:11:00

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19316v1

How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective

Multilingual alignment is an effective and representative paradigm for enhancing LLMs' multilingual capabilities by transferring capabilities from high-resource languages to low-resource languages. Meanwhile, research on language-specific neurons provides a new perspective for analyzing and understanding LLMs' mechanisms. However, we find that many neurons are shared by multiple but not all languages and cannot be correctly classified. In this work, we propose a ternary classification methodology that categorizes neurons into three types: language-specific neurons, language-related neurons, and general neurons. We also propose a corresponding identification algorithm to distinguish these types of neurons. Furthermore, based on the distributional characteristics of the different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights to better understand multilingual alignment and the multilingual capabilities of LLMs.

Updated: 2025-11-24 17:10:38

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.21505v2

PRInTS: Reward Modeling for Long-Horizon Information Seeking

Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
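The best-of-n selection mentioned in the abstract can be illustrated with a minimal sketch. The function names and the toy scorer below are hypothetical stand-ins, not the PRInTS model or its API: a real process reward model would replace `toy_score` and would score each candidate step against a compressed trajectory summary.

```python
from typing import Callable, Sequence


def best_of_n(
    candidates: Sequence[str],
    score_step: Callable[[str, str], float],
    summary: str,
) -> str:
    """Return the candidate next step with the highest reward-model score.

    `summary` plays the role of the compressed trajectory context.
    """
    if not candidates:
        raise ValueError("need at least one candidate step")
    return max(candidates, key=lambda step: score_step(summary, step))


def toy_score(summary: str, step: str) -> float:
    """Hypothetical scorer (an assumption for illustration only).

    Counts words in the step that do not already appear in the summary,
    as a crude proxy for how informative the step would be.
    """
    seen = set(summary.lower().split())
    return float(sum(1 for word in step.lower().split() if word not in seen))


if __name__ == "__main__":
    steps = ["search capital of France", "repeat known facts"]
    print(best_of_n(steps, toy_score, summary="known facts"))
```

The selector is independent of how scores are produced, so a learned scorer can be swapped in without changing the sampling loop.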

Updated: 2025-11-24 17:09:43

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2511.19314v1

The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity, leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.

Updated: 2025-11-24 17:02:04

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.15622v2

AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.

Updated: 2025-11-24 16:54:23

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2511.19304v1

AI and the Net-Zero Journey: Energy Demand, Emissions, and the Potential for Transition

Thanks to the availability of massive amounts of data, computing resources, and advanced algorithms, AI has entered nearly every sector. This has sparked significant investment and interest, particularly in building data centers with the necessary hardware and software to develop and operate AI models and AI-based workflows. In this technical review article, we present energy consumption scenarios for data centers and their impact on GHG emissions, considering both near-term projections (up to 2030) and the long-term outlook (2035 and beyond). We address the quintessential question of whether AI will have a net positive, neutral, or negative impact on CO2 emissions by 2035. Additionally, we discuss AI's potential to automate tasks and create efficient, disruptive workflows across various fields related to energy production, supply, and consumption. In the near-term scenario, the growing demand for AI will likely strain computing resources and lead to an increase in electricity consumption and therefore in associated CO2 emissions. This is due to the power-hungry nature of big data centers and the requirements for training and running large and complex AI models, as well as the penetration of AI-assisted search and applications for public use. However, the long-term outlook could be more promising. AI has the potential to be a game-changer in CO2 reduction. Its ability to further automate and optimize processes across industries, from energy production to logistics, could significantly decrease our carbon footprint. This positive impact is anticipated to outweigh the initial emissions bump, creating value for businesses and society in areas where traditional solutions have fallen short. In essence, AI might cause some initial growing pains for the environment, but it has the potential to support climate mitigation efforts.

Updated: 2025-11-24 16:52:12

Categories: cs.AI

Download: http://arxiv.org/abs/2507.10750v2

Learning Protein-Ligand Binding in Hyperbolic Space

Protein-ligand binding prediction is central to virtual screening and affinity ranking, two fundamental tasks in drug discovery. While recent retrieval-based methods embed ligands and protein pockets into Euclidean space for similarity-based search, the geometry of Euclidean embeddings often fails to capture the hierarchical structure and fine-grained affinity variations intrinsic to molecular interactions. In this work, we propose HypSeek, a hyperbolic representation learning framework that embeds ligands, protein pockets, and sequences into Lorentz-model hyperbolic space. By leveraging the exponential geometry and negative curvature of hyperbolic space, HypSeek enables expressive, affinity-sensitive embeddings that can effectively model both global activity and subtle functional differences, particularly in challenging cases such as activity cliffs, where structurally similar ligands exhibit large affinity gaps. Our model unifies virtual screening and affinity ranking in a single framework, introducing a protein-guided three-tower architecture to enhance representational structure. HypSeek improves early enrichment in virtual screening on DUD-E from 42.63 to 51.44 (+20.7%) and affinity ranking correlation on JACS from 0.5774 to 0.7239 (+25.4%), demonstrating the benefits of hyperbolic geometry across both tasks and highlighting its potential as a powerful inductive bias for protein-ligand modeling.
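For readers unfamiliar with the Lorentz model, the sketch below shows the standard geodesic distance on the hyperboloid. It is a generic illustration of the geometry, not HypSeek's code: the function names and the `curvature` parameter are assumptions introduced here, and how the paper maps encoder outputs onto the manifold is not stated in the abstract.

```python
import numpy as np


def lift_to_hyperboloid(v, curvature=1.0):
    """Map a Euclidean vector v to the hyperboloid <x, x>_L = -1/curvature.

    The time-like coordinate x0 is solved from the constraint, with x0 > 0.
    """
    v = np.asarray(v, dtype=float)
    x0 = np.sqrt(1.0 / curvature + np.sum(v * v, axis=-1, keepdims=True))
    return np.concatenate([x0, v], axis=-1)


def lorentz_inner(x, y):
    """Minkowski inner product: -x0*y0 + sum_i xi*yi."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)


def lorentz_distance(x, y, curvature=1.0):
    """Geodesic distance d(x, y) = arccosh(-c * <x, y>_L) / sqrt(c)."""
    # Clip guards against arguments slightly below 1 from rounding error.
    arg = np.clip(-curvature * lorentz_inner(x, y), 1.0, None)
    return np.arccosh(arg) / np.sqrt(curvature)


if __name__ == "__main__":
    origin = lift_to_hyperboloid([0.0, 0.0])
    point = lift_to_hyperboloid([3.0, 4.0])
    # With unit curvature, the distance from the origin equals arcsinh(|v|).
    print(lorentz_distance(origin, point), np.arcsinh(5.0))
```

Distances from the origin grow only logarithmically in the Euclidean norm (arcsinh), while the volume of a ball grows exponentially with its radius, which is the property usually cited for embedding hierarchies in hyperbolic space.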

Updated: 2025-11-24 16:47:54

Categories: cs.LG

Download: http://arxiv.org/abs/2508.15480v2

Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning

Novel deep learning architectures are increasingly being applied to biological data, including genetic sequences. These models, referred to as genomic language models (gLMs), have demonstrated impressive predictive and generative capabilities, raising concerns that such models may also enable misuse, for instance via the generation of genomes for human-infecting viruses. These concerns have catalyzed calls for risk mitigation measures. The de facto mitigation of choice is filtering of pretraining data (i.e., removing viral genomic sequences from training datasets) in order to limit gLM performance on virus-related tasks. However, it is not currently known how robust this approach is for securing open-source models that can be fine-tuned using sensitive pathogen data. Here, we evaluate a state-of-the-art gLM, Evo 2, and perform fine-tuning using sequences from 110 harmful human-infecting viruses to assess the rescue of misuse-relevant predictive capabilities. The fine-tuned model exhibited reduced perplexity on unseen viral sequences relative to 1) the pretrained model and 2) a version fine-tuned on bacteriophage sequences. The model fine-tuned on human-infecting viruses also identified immune escape variants from SARS-CoV-2 (achieving an AUROC of 0.6), despite having no exposure to SARS-CoV-2 sequences during fine-tuning. This work demonstrates that data exclusion might be circumvented by fine-tuning approaches that can, to some degree, rescue misuse-relevant capabilities of gLMs. We highlight the need for safety frameworks for gLMs and outline further work needed on evaluations and mitigation measures to enable the safe deployment of gLMs.
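The two metrics reported above, perplexity and AUROC, can be made concrete with a short sketch. The helper names are hypothetical and the inputs are toy numbers, not values from the paper; the formulas themselves are the standard definitions.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def auroc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half; 0.5 corresponds to chance-level ranking.
    """
    pairs = [(p, n) for p in positive_scores for n in negative_scores]
    if not pairs:
        raise ValueError("need at least one positive and one negative score")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)


if __name__ == "__main__":
    # A model that spreads probability evenly over A, C, G, T has perplexity 4.
    print(perplexity([math.log(0.25)] * 10))
    print(auroc([0.9, 0.8], [0.1, 0.85]))
```

Read against these definitions, lower perplexity means the model assigns higher probability to held-out sequences, and an AUROC of 0.6 sits modestly above the 0.5 chance level.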

Updated: 2025-11-24 16:46:44

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19299v1

A Bayesian Model for Multi-stage Censoring

Many sequential decision settings in healthcare feature funnel structures characterized by a series of stages, such as screenings or evaluations, where the number of patients who advance to each stage progressively decreases and decisions become increasingly costly. For example, an oncologist may first conduct a breast exam, followed by a mammogram for patients with concerning exams, followed by a biopsy for patients with concerning mammograms. A key challenge is that the ground truth outcome, such as the biopsy result, is only revealed at the end of this funnel. The selective censoring of the ground truth can introduce statistical biases in risk estimation, especially in underserved patient groups, whose outcomes are more frequently censored. We develop a Bayesian model for funnel decision structures, drawing from prior work on selective labels and censoring. We first show in synthetic settings that our model is able to recover the true parameters and predict outcomes for censored patients more accurately than baselines. We then apply our model to a dataset of emergency department visits, where in-hospital mortality is observed only for those who are admitted to either the hospital or ICU. We find that there are gender-based differences in hospital and ICU admissions. In particular, our model estimates that the mortality risk threshold for ICU admission is higher for women (5.1%) than for men (4.5%).

Updated: 2025-11-24 16:42:03

Categories: cs.LG,stat.AP

Download: http://arxiv.org/abs/2511.11684v3

FOCUS: Efficient Keyframe Selection for Long Video Understanding

Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
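To make the bandit framing concrete, here is a minimal sketch of ranking clips ("arms") by an optimistic upper bound built from the empirical mean and an empirical Bernstein radius. It is not the FOCUS implementation: the constants follow a common textbook form of the empirical Bernstein bound, and the names `bernstein_radius`, `upper_bound`, and `select_clips`, as well as the assumption that relevance scores lie in [0, 1], are introduced here for illustration.

```python
import math
from statistics import fmean, pvariance


def bernstein_radius(rewards, delta=0.05, value_range=1.0):
    """Empirical Bernstein confidence radius for a list of observed scores.

    Assumed form: sqrt(2 * var * log(3/delta) / n) + 3 * range * log(3/delta) / n.
    """
    n = len(rewards)
    if n == 0:
        return math.inf
    log_term = math.log(3.0 / delta)
    variance = pvariance(rewards)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * value_range * log_term / n


def upper_bound(rewards, delta=0.05):
    """Optimistic estimate: empirical mean plus confidence radius."""
    if not rewards:
        return math.inf
    return fmean(rewards) + bernstein_radius(rewards, delta)


def select_clips(clip_scores, k, delta=0.05):
    """Return the ids of the k clips with the largest upper bounds."""
    bounds = {clip: upper_bound(scores, delta) for clip, scores in clip_scores.items()}
    return sorted(bounds, key=bounds.get, reverse=True)[:k]


if __name__ == "__main__":
    scores = {0: [0.9, 0.9, 0.9, 0.9], 1: [0.1, 0.1, 0.1, 0.1], 2: [0.5]}
    print(select_clips(scores, k=2))
```

In the toy input, clip 2 has been sampled only once, so its wide radius ranks it first even though its mean is moderate; this is the exploration of uncertain regions that the abstract refers to.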

Updated: 2025-11-24 16:40:06

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.27280v2

TorchQuantumDistributed

TorchQuantumDistributed (tqd) is a PyTorch-based [Paszke et al., 2019] library for accelerator-agnostic differentiable quantum state vector simulation at scale. This enables studying the behavior of learnable parameterized near-term and fault-tolerant quantum circuits with high qubit counts.

Updated: 2025-11-24 16:37:28

Categories: quant-ph,cs.CE,cs.LG

Download: http://arxiv.org/abs/2511.19291v1

BioDisco: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation

Identifying novel hypotheses is essential to scientific research, yet this process risks being overwhelmed by the sheer volume and complexity of available information. Existing automated methods often struggle to generate novel and evidence-grounded hypotheses, lack robust iterative refinement and rarely undergo rigorous temporal evaluation for future discovery potential. To address this, we propose BioDisco, a multi-agent framework that draws upon language model-based reasoning and a dual-mode evidence system (biomedical knowledge graphs and automated literature retrieval) for grounded novelty, integrates an internal scoring and feedback loop for iterative refinement, and validates performance through pioneering temporal and human evaluations and a Bradley-Terry paired comparison model to provide statistically-grounded assessment. Our evaluations demonstrate superior novelty and significance over ablated configurations and generalist biomedical agents. Designed for flexibility and modularity, BioDisco allows seamless integration of custom language models or knowledge graphs, and can be run with just a few lines of code.
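The Bradley-Terry paired comparison model mentioned above assigns each item a positive strength p_i, with P(i beats j) = p_i / (p_i + p_j). The sketch below fits those strengths with the classical minorization-maximization update; it illustrates the model in general and does not reproduce BioDisco's evaluation code. The function names and the `wins` dictionary layout are assumptions.

```python
def fit_bradley_terry(wins, n_items, iters=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(i, j)] is the number of comparisons in which item i beat item j.
    Returns a list of strengths normalized to sum to 1.
    """
    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            others = [j for j in range(n_items) if j != i]
            total_wins = sum(wins.get((i, j), 0) for j in others)
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in others
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [value / norm for value in updated]
    return p


def win_probability(p, i, j):
    """Model probability that item i is preferred over item j."""
    return p[i] / (p[i] + p[j])


if __name__ == "__main__":
    strengths = fit_bradley_terry({(0, 1): 3, (1, 0): 1}, n_items=2)
    print(strengths, win_probability(strengths, 0, 1))
```

With two items and a 3-to-1 record, the fitted win probability matches the observed win rate of 0.75, which is a quick sanity check on the update rule.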

Updated: 2025-11-24 16:36:29

Categories: cs.AI,cs.IR,stat.AP

Download: http://arxiv.org/abs/2508.01285v2

Performance Guarantees for Quantum Neural Estimation of Entropies

Estimating quantum entropies and divergences is an important problem in quantum physics, information theory, and machine learning. Quantum neural estimators (QNEs), which utilize a hybrid classical-quantum architecture, have recently emerged as an appealing computational framework for estimating these measures. Such estimators combine classical neural networks with parametrized quantum circuits, and their deployment typically entails tedious tuning of hyperparameters controlling the sample size, network architecture, and circuit topology. This work initiates the study of formal guarantees for QNEs of measured (Rényi) relative entropies in the form of non-asymptotic error risk bounds. We further establish exponential tail bounds showing that the error is sub-Gaussian, and thus sharply concentrates about the ground truth value. For an appropriate sub-class of density operator pairs on a space of dimension $d$ with bounded Thompson metric, our theory establishes a copy complexity of $O(|Θ(\mathcal{U})|d/ε^2)$ for QNE with a quantum circuit parameter set $Θ(\mathcal{U})$, which has minimax optimal dependence on the accuracy $ε$. Additionally, if the density operator pairs are permutation invariant, we improve the dimension dependence above to $O(|Θ(\mathcal{U})|\mathrm{polylog}(d)/ε^2)$. Our theory aims to facilitate principled implementation of QNEs for measured relative entropies and guide hyperparameter tuning in practice.

Updated: 2025-11-24 16:36:06

Categories: quant-ph,cs.IT,cs.LG

Download: http://arxiv.org/abs/2511.19289v1

Word-level Annotation of GDPR Transparency Compliance in Privacy Policies using Large Language Models

Ensuring transparency of data practices related to personal information is a core requirement of the General Data Protection Regulation (GDPR). However, large-scale compliance assessment remains challenging due to the complexity and diversity of privacy policy language. Manual audits are labour-intensive and inconsistent, while current automated methods often lack the granularity required to capture nuanced transparency disclosures. In this paper, we present a modular large language model (LLM)-based pipeline for fine-grained word-level annotation of privacy policies with respect to GDPR transparency requirements. Our approach integrates LLM-driven annotation with passage-level classification, retrieval-augmented generation, and a self-correction mechanism to deliver scalable, context-aware annotations across 21 GDPR-derived transparency requirements. To support empirical evaluation, we compile a corpus of 703,791 English-language privacy policies and generate a ground-truth sample of 200 manually annotated policies based on a comprehensive, GDPR-aligned annotation scheme. We propose a two-tiered evaluation methodology capturing both passage-level classification and span-level annotation quality and conduct a comparative analysis of seven state-of-the-art LLMs on two annotation schemes, including the widely used OPP-115 dataset. The results of our evaluation show that decomposing the annotation task and integrating targeted retrieval and classification components significantly improve annotation accuracy, particularly for well-structured requirements. Our work provides new empirical resources and methodological foundations for advancing automated transparency compliance assessment at scale.

Updated: 2025-11-24 16:34:25

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.10727v2

The Unified Non-Convex Framework for Robust Causal Inference: Overcoming the Gaussian Barrier and Optimization Fragility

This document proposes a Unified Robust Framework that re-engineers the estimation of the Average Treatment Effect on the Overlap (ATO). It synthesizes gamma-Divergence for outlier robustness, Graduated Non-Convexity (GNC) for global optimization, and a "Gatekeeper" mechanism to address the impossibility of higher-order orthogonality in Gaussian regimes.

Updated: 2025-11-24 16:32:07

Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2511.19284v1

Data Flows and Colonial Regimes in Africa: A Critical Analysis of the Colonial Futurities Embedded in AI Ecosystems

This chapter seeks to frame the elemental and invisible problems of AI and big data in the African context by examining digital sites and infrastructure through the lens of power and interests. It presents reflections on how these sites are using AI recommendation algorithms to recreate new digital societies in the region, how they have the potential to propagate algorithmic colonialism and negative gender norms, and what this means for the regional sustainable development agenda. The chapter proposes adopting business models that embrace response-ability and consider the existence of alternative socio-material worlds of AI. These reflections draw mainly on ongoing discussions with Kenyan social media users in the authors' user space talks, on personal experiences, and on six months of active participant observation carried out by the authors.

Updated: 2025-11-24 16:31:50

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2511.19283v1

MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings

A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.
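One way to picture path integration with input-dependent positional matrices is the toy sketch below. It is an illustrative assumption, not the MapFormer architecture: each input action is mapped to a hypothetical 2-D rotation angle, and the positional matrix after each step is the product of the per-step rotations. Because 2-D rotations commute, that product equals a rotation by the cumulative angle, which a prefix sum computes for every position at once.

```python
import numpy as np


def rotation(theta):
    """2-D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def path_integrate(step_angles):
    """Return the positional rotation after each step, shape (T, 2, 2).

    The cumulative sum stands in for the running product of rotations and
    is computed for all T positions in a single vectorized call.
    """
    cumulative = np.cumsum(np.asarray(step_angles, dtype=float))
    return np.stack([rotation(t) for t in cumulative])


if __name__ == "__main__":
    # Hypothetical action-to-angle table: east and west cancel each other.
    angle_of = {"E": 0.3, "W": -0.3}
    long_path = path_integrate([angle_of[a] for a in "EEW"])
    short_path = path_integrate([angle_of["E"]])
    # Both paths end one step east, so their final encodings coincide.
    print(np.allclose(long_path[-1], short_path[-1]))
```

The encoding depends only on where a path ends, not on the specific tokens that produced it, which is the kind of structure-content separation the abstract describes.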

Updated: 2025-11-24 16:29:02

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2511.19279v1
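
The MapFormer abstract describes path integration through input-dependent positional matrices. The sketch below illustrates the idea on a toy circular track, assuming each action maps to a 2D rotation. The ring size, action names, and function names are hypothetical and do not come from the paper.

```python
import math

RING_SIZE = 8  # hypothetical number of locations on a circular track

# Assumption: every action selects a rotation angle, i.e. the positional
# update depends on the input rather than on the token index.
ACTION_ANGLE = {
    "left": -2 * math.pi / RING_SIZE,
    "right": 2 * math.pi / RING_SIZE,
    "stay": 0.0,
}


def rotation(theta):
    """Return the 2x2 rotation matrix for angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))


def matmul(a, b):
    """Multiply two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def path_integrate(actions):
    """Return the cumulative rotation after each action (a prefix product)."""
    state = rotation(0.0)
    states = []
    for action in actions:
        state = matmul(rotation(ACTION_ANGLE[action]), state)
        states.append(state)
    return states


def same_place(m1, m2, tol=1e-9):
    """Two cumulative rotations encode the same location if they match."""
    return all(abs(m1[i][j] - m2[i][j]) < tol for i in range(2) for j in range(2))
```

Two action sequences that end at the same location yield the same cumulative matrix, whatever their content or length. Because matrix products are associative, the prefix product could also be computed with a parallel scan rather than the sequential loop used here.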

Closing Gaps in Emissions Monitoring with Climate TRACE

Global greenhouse gas emissions estimates are essential for monitoring and mitigation planning. Yet most datasets lack one or more characteristics that enhance their actionability, such as accuracy, global coverage, high spatial and temporal resolution, and frequent updates. To address these gaps, we present Climate TRACE (climatetrace.org), an open-access platform delivering global emissions estimates with enhanced detail, coverage, and timeliness. Climate TRACE synthesizes existing emissions data, prioritizing accuracy, coverage, and resolution, and fills gaps using sector-specific estimation approaches. The dataset is the first to provide globally comprehensive emissions estimates for individual sources (e.g., individual power plants) for all anthropogenic emitting sectors. The dataset spans January 1, 2021, to the present, with a two-month reporting lag and monthly updates. The open-access platform enables non-technical audiences to engage with detailed emissions datasets for most subnational governments worldwide. Climate TRACE supports data-driven climate action at scales where decisions are made, representing a major breakthrough for emissions accounting and mitigation.

Updated: 2025-11-24 16:28:44

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19277v1

A Survey of Generative Categories and Techniques in Multimodal Generative Models

Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from benchmarks and human studies across modalities. We further analyse trustworthiness, safety, and ethical risks, including multimodal bias, privacy leakage, and the misuse of high-fidelity media generation for deepfakes, disinformation, and copyright infringement in music and 3D assets, together with emerging mitigation strategies. Finally, we discuss how architectural trends, evaluation protocols, and governance mechanisms can be co-designed to close current capability and safety gaps, outlining critical paths toward more general-purpose, controllable, and accountable multimodal generative systems.

Updated: 2025-11-24 16:26:13

Categories: cs.MM,cs.AI,cs.CL

Download: http://arxiv.org/abs/2506.10016v3

Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization

Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dynamic inter-bird interactions, all of which require precise temporal and spatial control in 3D environments. Existing approaches, whether Digital Signal Processing (DSP)-based or data-driven, typically focus only on single species modeling, static call structures, or synthesis directly from recordings, and often suffer from noise, limited flexibility, or large data needs. To address these challenges, we present a novel, fully algorithm-driven framework that generates dynamic multi-species bird soundscapes using DSP-based chirp generation and 3D spatialization, without relying on recordings or training data. Our approach simulates multiple independently-moving birds per species along different moving 3D trajectories, supporting controllable chirp sequences, overlapping choruses, and realistic 3D motion in scalable soundscapes while preserving species-specific acoustic patterns. A visualization interface provides bird trajectories, spectrograms, activity timelines, and sound waves for analytical and creative purposes. Both visual and audio evaluations demonstrate the ability of the system to generate dense, immersive, and ecologically inspired soundscapes, highlighting its potential for computer music, interactive virtual environments, and computational bioacoustics research.

Updated: 2025-11-24 16:25:55

Categories: cs.SD,cs.AI,eess.AS,eess.SP

Download: http://arxiv.org/abs/2511.19275v1
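
The bird soundscape abstract relies on DSP-based chirp generation without recordings. As a rough illustration, the sketch below synthesises one frequency-modulated chirp with a smooth amplitude envelope, plus an equal-power stereo pan as a crude stand-in for 3D spatialization. The function names, the linear sweep, and the envelope shape are assumptions; the paper's actual synthesis model is not given in the abstract.

```python
import math


def linear_chirp(f_start, f_end, duration, sample_rate=44100):
    """Synthesise a linear frequency sweep from f_start to f_end (Hz)."""
    n = round(duration * sample_rate)
    sweep_rate = (f_end - f_start) / duration  # Hz per second
    samples = []
    for i in range(n):
        t = i / sample_rate
        # Phase is the integral of the instantaneous frequency f_start + sweep_rate * t.
        phase = 2 * math.pi * (f_start * t + 0.5 * sweep_rate * t * t)
        # Raised-cosine (Hann-style) envelope to avoid clicks at the edges.
        envelope = math.sin(math.pi * i / n) ** 2
        samples.append(envelope * math.sin(phase))
    return samples


def equal_power_pan(position):
    """Return (left, right) gains for position in [0, 1], 0 = left, 1 = right."""
    angle = position * math.pi / 2
    return math.cos(angle), math.sin(angle)
```

A multi-bird scene would layer many such chirps with per-bird timing and a time-varying pan position; a full 3D renderer would replace the stereo gains.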

Scalable Bayesian Network Structure Learning Using Tsetlin Machine to Constrain the Search Space

The PC algorithm is a widely used method in causal inference for learning the structure of Bayesian networks. Despite its popularity, the PC algorithm suffers from significant time complexity, particularly as the size of the dataset increases, which limits its applicability in large-scale real-world problems. In this study, we propose a novel approach that utilises the Tsetlin Machine (TM) to construct Bayesian structures more efficiently. Our method leverages the most significant literals extracted from the TM and performs conditional independence (CI) tests on these selected literals instead of the full set of variables, resulting in a considerable reduction in computational time. We implemented our approach and compared it with various state-of-the-art methods. Our evaluation includes categorical datasets from the bnlearn repository, such as Munin1 and Hepar2. The findings indicate that the proposed TM-based method not only reduces computational complexity but also maintains competitive accuracy in causal discovery, making it a viable alternative to traditional PC algorithm implementations by offering improved efficiency without compromising performance.

Updated: 2025-11-24 16:23:19

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19273v1
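
To see why restricting conditional independence (CI) tests to a few selected variables saves time, the sketch below counts the worst-case number of (X, Y, S) tests a PC-style search could enumerate, with and without a candidate restriction. It is a simplified counting model, not the paper's algorithm: the `candidates` mapping stands in for whatever the Tsetlin Machine selects, and drawing conditioning sets from the union of both candidate sets is an assumption.

```python
from itertools import combinations


def count_ci_tests(variables, candidates=None, max_cond_size=2):
    """Count worst-case CI tests for conditioning sets up to max_cond_size.

    candidates: optional dict mapping a variable to the variables it may be
    tested against (hypothetical output of a feature-selection step).
    None means every pair is tested against every other variable.
    """
    total = 0
    for x, y in combinations(variables, 2):
        if candidates is None:
            pool = [v for v in variables if v not in (x, y)]
        else:
            near_x = set(candidates.get(x, ()))
            near_y = set(candidates.get(y, ()))
            if y not in near_x and x not in near_y:
                continue  # pair was never selected, so it is not tested
            pool = sorted((near_x | near_y) - {x, y})
        for size in range(max_cond_size + 1):
            total += sum(1 for _ in combinations(pool, size))
    return total
```

With six variables the unrestricted count is 165 tests, while a restriction that keeps only the chain A-B-C leaves 4. A real PC run stops early once independence is found, so these numbers are upper bounds.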

Tiny-TSM: Efficiently Training a Lightweight SOTA Time Series Foundation Model

We present Tiny-TSM, a time series foundation model characterized by small scale, economical training, and state-of-the-art performance. It comprises 23M total parameters, trained on a single A100 GPU in less than a week using a new synthetic data generation and data augmentation pipeline (SynthTS). Without any neural architecture search, hyperparameter tuning, or scaling up model size, Tiny-TSM achieves state-of-the-art performance on a wide range of time series benchmark datasets, often outperforming much larger models and even matching the performance of much larger, industrial-scale, likely highly tuned foundation models. Specifically, Tiny-TSM outperforms all other time series foundation models we evaluated on medium- and long-term forecasting tasks under MSE loss, while short-term accuracy is still competitive with state-of-the-art models. We also introduce a causal input normalization scheme that enables time series models to be trained with dense next-token prediction loss, significantly accelerating convergence speed and reducing training time. All experiments were conducted on a single A100 GPU, illustrating the practicality of the proposed approach in a resource-constrained setting.

Updated: 2025-11-24 16:22:05

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19272v1
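
The Tiny-TSM abstract mentions a causal input normalization scheme but does not spell it out. One plausible reading, sketched below as an assumption, is to normalise each time step using only statistics of the values seen so far, so that no future information leaks into a dense next-step prediction loss. The function name and the use of running mean and standard deviation are illustrative choices.

```python
import math


def causal_normalize(series, eps=1e-8):
    """Normalise x_t with the mean and std of x_0..x_t only."""
    out = []
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for n, x in enumerate(series, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        std = math.sqrt(m2 / n)  # population std of the prefix
        out.append((x - mean) / (std + eps))
    return out
```

The key property is that the output for a prefix does not change when later values are appended, which is what allows every position to serve as a training target at once.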

CDLM: Consistency Diffusion Language Models For Faster Sampling

Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.

Updated: 2025-11-24 16:21:25

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2511.19269v1
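
The block-wise causal attention mask mentioned in the CDLM abstract can be illustrated with a small sketch. It assumes fixed-size blocks in which tokens attend bidirectionally within their own block and causally to all earlier blocks; the exact mask used by the authors may differ, and the function name is hypothetical.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j], True when position i may attend to position j."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Example: 4 tokens in blocks of 2.
for row in block_causal_mask(4, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

Since no token ever attends to a later block, the keys and values of finished blocks stay fixed and can be cached, which is the property that makes such a mask compatible with KV caching.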

Leveraging Spatiotemporal Graph Neural Networks for Multi-Store Sales Forecasting

This work evaluates the effectiveness of spatiotemporal Graph Neural Networks (GNNs) for multi-store retail sales forecasting and compares their performance against ARIMA, LSTM, and XGBoost baselines. Using weekly sales data from 45 Walmart stores, we construct a relational forecasting framework that models inter-store dependencies through a learned adaptive graph. The proposed STGNN predicts log-differenced sales and reconstructs final values through a residual path, enabling stable training and improved generalisation. Experiments show that STGNN achieves the lowest overall forecasting error, outperforming all baselines in Normalised Total Absolute Error, P90 MAPE, and variance of MAPE across stores. Analysis of the learned adjacency matrix reveals meaningful functional store clusters and high-influence nodes that emerge without geographic metadata. These results demonstrate that relational structure significantly improves forecast quality in interconnected retail environments and establish STGNNs as a robust modelling choice for multi-store demand prediction.

Updated: 2025-11-24 16:19:48

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19267v1
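
The log-differencing and reconstruction step described in the STGNN abstract can be sketched as follows. The function names are illustrative, and the reconstruction simply adds predicted log-differences back onto the last observed log level, which is one straightforward reading of the "residual path".

```python
import math


def log_diff(sales):
    """Turn strictly positive sales into log-differences log(s_t) - log(s_{t-1})."""
    return [math.log(b) - math.log(a) for a, b in zip(sales, sales[1:])]


def reconstruct(last_sales, predicted_log_diffs):
    """Rebuild sales levels by accumulating log-differences onto the last level."""
    level = math.log(last_sales)
    values = []
    for diff in predicted_log_diffs:
        level += diff
        values.append(math.exp(level))
    return values
```

Predicting differences in log space keeps the targets roughly scale-free across stores with very different sales volumes, which is a common motivation for this transform.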

WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making

Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain-specific contexts such as simulations. These limitations arise from their inability to ground their broad, unstructured understanding in specific environments. To address this, we present WorldLLM, a framework that enhances LLM-based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning. WorldLLM leverages the in-context learning abilities of LLMs to guide an LLM-based world model's predictions using natural language hypotheses given in its prompt. These hypotheses are iteratively refined through a Bayesian inference framework that leverages a second LLM as the proposal distribution given collected evidence. This evidence is collected using a curiosity-driven reinforcement learning policy that explores the environment to find transitions with a low log-likelihood under our LLM-based predictive model using the current hypotheses. By alternating between refining hypotheses and collecting new evidence, our framework autonomously drives continual improvement of the predictions. Our experiments demonstrate the effectiveness of WorldLLM in a textual game environment that requires agents to manipulate and combine objects. The framework not only enhances predictive accuracy, but also generates human-interpretable theories of environment dynamics.

Updated: 2025-11-24 16:18:31

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2506.06725v2
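
The two loops described in the WorldLLM abstract, curiosity-driven evidence collection and Bayesian hypothesis refinement, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the reward is taken to be the negative log-likelihood of a transition, and hypotheses are accepted with a Metropolis-style rule that assumes a symmetric proposal and a flat prior. The abstract does not specify the sampler, and all function names are hypothetical.

```python
import math
import random


def curiosity_reward(prob_under_model, floor=1e-12):
    """Reward transitions the current world model finds unlikely."""
    return -math.log(max(prob_under_model, floor))


def log_likelihood(transition_probs, floor=1e-12):
    """Log-likelihood of the collected evidence under one set of hypotheses."""
    return sum(math.log(max(p, floor)) for p in transition_probs)


def metropolis_accept(current_ll, proposed_ll, rng=random):
    """Accept a proposed hypothesis with probability min(1, exp(proposed - current))."""
    if proposed_ll >= current_ll:
        return True
    return rng.random() < math.exp(proposed_ll - current_ll)
```

In a full system the probabilities would come from the LLM-based predictive model and the proposals from a second LLM; here they are plain numbers so the control flow stays visible.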

VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Prevailing joint prediction transformers for Video Highlight Detection and Moment Retrieval (HD/MR) exhibit deficiencies in handling cross-task dynamics, achieving robust video-text alignment, and utilizing effective attention mechanisms, with the potential of Large Language/Vision-Language Models (LLMs/LVLMs) being largely untapped. This paper introduces VideoLights, a novel HD/MR framework addressing these limitations by incorporating: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for enhanced video-text feature congruity; (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware representations; (iii) a Uni-directional joint-task feedback mechanism for synergistic task improvement; (iv) hard positive/negative losses for adaptive learning; and (v) the leveraging of LVLMs (e.g., BLIP-2) for superior multimodal feature integration and intelligent pre-training with synthetic data. Comprehensive evaluations on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that VideoLights significantly surpasses existing baselines, establishing new state-of-the-art performances. Codes and model checkpoints are available at https://github.com/dpaul06/VideoLights.

Updated: 2025-11-24 16:17:57

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2412.01558v2

Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks

The black-box nature of deep neural networks poses a significant challenge for the deployment of transparent and trustworthy artificial intelligence (AI) systems. With the growing presence of AI in society, it becomes increasingly important to develop methods that can explain and interpret the decisions made by these systems. To address this, mechanistic interpretability (MI) has emerged as a promising and distinctive research program within the broader field of explainable artificial intelligence (XAI). MI is the process of studying the inner computations of neural networks and translating them into human-understandable algorithms. It encompasses reverse engineering techniques aimed at uncovering the computational algorithms implemented by neural networks. In this article, we propose a unified taxonomy of MI approaches and provide a detailed analysis of key techniques, illustrated with concrete examples and pseudo-code. We contextualize MI within the broader interpretability landscape, comparing its goals, methods, and insights to other strands of XAI. Additionally, we trace the development of MI as a research area, highlighting its conceptual roots and the accelerating pace of recent work. We argue that MI holds significant potential to support a more scientific understanding of machine learning systems, treating models not only as tools for solving tasks but also as systems to be studied and understood. We hope to invite new researchers into the field of mechanistic interpretability.

Updated: 2025-11-24 16:16:49

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19265v1

Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry

Generative Flow Networks, or GFlowNets, offer a promising framework for molecular design, but their internal decision policies remain opaque. This limits adoption in drug discovery, where chemists require clear and interpretable rationales for proposed structures. We present an interpretability framework for SynFlowNet, a GFlowNet trained on documented chemical reactions and purchasable starting materials that generates both molecules and the synthetic routes that produce them. Our approach integrates three complementary components. Gradient-based saliency combined with counterfactual perturbations identifies which atomic environments influence reward and how structural edits change molecular outcomes. Sparse autoencoders reveal axis-aligned latent factors that correspond to physicochemical properties such as polarity, lipophilicity, and molecular size. Motif probes show that functional groups including aromatic rings and halogens are explicitly encoded and linearly decodable from the internal embeddings. Together, these results expose the chemical logic inside SynFlowNet and provide actionable and mechanistic insight that supports transparent and controllable molecular design.

Updated: 2025-11-24 16:16:18

Categories: cs.LG,cs.AI,q-bio.BM

Download: http://arxiv.org/abs/2511.19264v1

Solar-GECO: Perovskite Solar Cell Property Prediction with Geometric-Aware Co-Attention

Perovskite solar cells are promising candidates for next-generation photovoltaics. However, their performance as multi-scale devices is determined by complex interactions between their constituent layers. This creates a vast combinatorial space of possible materials and device architectures, making the conventional experiment-based screening process slow and expensive. Machine learning models try to address this problem, but they either focus only on individual material properties or neglect the important geometric information of the perovskite crystal. To address this problem, we propose to predict perovskite solar cell power conversion efficiency with a geometric-aware co-attention (Solar-GECO) model. Solar-GECO combines a geometric graph neural network (GNN), which directly encodes the atomic structure of the perovskite absorber, with language model embeddings that process the textual strings representing the chemical compounds of the transport layers and other device components. Solar-GECO also integrates a co-attention module to capture intra-layer dependencies and inter-layer interactions, while a probabilistic regression head predicts both power conversion efficiency (PCE) and its associated uncertainty. Solar-GECO achieves state-of-the-art performance, significantly outperforming several baselines and reducing the mean absolute error (MAE) for PCE prediction from 3.066 for semantic GNN (the previous state-of-the-art model) to 2.936. Solar-GECO demonstrates that integrating geometric and textual information provides a more powerful and accurate framework for PCE prediction.

Updated: 2025-11-24 16:15:41

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19263v1
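
A probabilistic regression head of the kind mentioned in the Solar-GECO abstract is often trained with a Gaussian negative log-likelihood, where the network outputs a mean and a log-variance. The sketch below assumes that loss purely for illustration, since the abstract does not name the one actually used. It also checks the relative size of the reported MAE reduction.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of target under N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (target - mean) ** 2 / var)


def relative_reduction(old, new):
    """Fractional improvement when an error metric drops from old to new."""
    return (old - new) / old


# Reported MAE values from the abstract: 3.066 (semantic GNN) vs. 2.936.
print(f"MAE reduction: {relative_reduction(3.066, 2.936):.1%}")
```

Under this loss, a confident but wrong prediction is penalised more heavily than an equally wrong prediction with a larger predicted variance, which is what lets the head learn calibrated uncertainty.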

Psychometric Tests for AI Agents and Their Moduli Space

We develop a moduli-theoretic view of psychometric test batteries for AI agents and connect it explicitly to the AAI score developed previously. First, we make precise the notion of an AAI functional on a battery and set out axioms that any reasonable autonomy/general intelligence score should satisfy. Second, we show that the composite index ('AAI-Index') defined previously is a special case of our AAI functional. Third, we introduce the notion of a cognitive core of an agent relative to a battery and define the associated AAI$_{\textrm{core}}$ score as the restriction of an AAI functional to that core. Finally, we use these notions to describe invariants of batteries under evaluation-preserving symmetries and outline how moduli of equivalent batteries are organized.

Updated: 2025-11-24 16:15:08

Categories: cs.AI,cs.LG,math.ST

Download: http://arxiv.org/abs/2511.19262v1

A Nutrition Multimodal Photoplethysmography Language Model

Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.

Updated: 2025-11-24 16:12:03

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2511.19260v1

Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation

With the rapid advancement of retrieval-augmented vision-language models, multimodal medical retrieval-augmented generation (MMed-RAG) systems are increasingly adopted in clinical decision support. These systems enhance medical applications by performing cross-modal retrieval to integrate relevant visual and textual evidence for tasks, e.g., report generation and disease diagnosis. However, their complex architecture also introduces underexplored adversarial vulnerabilities, particularly via visual input perturbations. In this paper, we propose Medusa, a novel framework for crafting cross-modal transferable adversarial attacks on MMed-RAG systems under a black-box setting. Specifically, Medusa formulates the attack as a perturbation optimization problem, leveraging a multi-positive InfoNCE loss (MPIL) to align adversarial visual embeddings with medically plausible but malicious textual targets, thereby hijacking the retrieval process. To enhance transferability, we adopt a surrogate model ensemble and design a dual-loop optimization strategy augmented with invariant risk minimization (IRM). Extensive experiments on two real-world medical tasks, including medical report generation and disease diagnosis, demonstrate that Medusa achieves over 90% average attack success rate across various generation models and retrievers under appropriate parameter configuration, while remaining robust against four mainstream defenses, outperforming state-of-the-art baselines. Our results reveal critical vulnerabilities in the MMed-RAG systems and highlight the necessity of robustness benchmarking in safety-critical medical applications. The code and data are available at https://anonymous.4open.science/r/MMed-RAG-Attack-F05A.

Updated: 2025-11-24 16:11:01

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.19257v1
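
The multi-positive InfoNCE loss (MPIL) named in the Medusa abstract generalises the usual contrastive objective to several positive targets. The sketch below shows one common formulation, in which the positives are summed inside the logarithm. Whether Medusa uses exactly this form is an assumption, and the similarity values and temperature are placeholders.

```python
import math


def multi_positive_info_nce(pos_sims, neg_sims, temperature=0.07):
    """InfoNCE with several positives: -log(sum_pos / (sum_pos + sum_neg))."""
    pos = [s / temperature for s in pos_sims]
    neg = [s / temperature for s in neg_sims]
    shift = max(pos + neg)  # subtract the max logit for numerical stability
    pos_sum = sum(math.exp(v - shift) for v in pos)
    all_sum = pos_sum + sum(math.exp(v - shift) for v in neg)
    return -math.log(pos_sum / all_sum)
```

Minimising this quantity pulls the image embedding toward every positive text at once while pushing it away from the negatives.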

SimDiff: Simpler Yet Better Diffusion Model for Time Series Point Forecasting

Diffusion models have recently shown promise in time series forecasting, particularly for probabilistic predictions. However, they often fail to achieve state-of-the-art point estimation performance compared to regression-based methods. This limitation stems from difficulties in providing sufficient contextual bias to track distribution shifts and in balancing output diversity with the stability and precision required for point forecasts. Existing diffusion-based approaches mainly focus on full-distribution modeling under probabilistic frameworks, often with likelihood maximization objectives, while paying little attention to dedicated strategies for high-accuracy point estimation. Moreover, other existing point prediction diffusion methods frequently rely on pre-trained or jointly trained mature models for contextual bias, sacrificing the generative flexibility of diffusion models. To address these challenges, we propose SimDiff, a single-stage, end-to-end framework. SimDiff employs a single unified Transformer network carefully tailored to serve as both denoiser and predictor, eliminating the need for external pre-trained or jointly trained regressors. It achieves state-of-the-art point estimation performance by leveraging intrinsic output diversity and improving mean squared error accuracy through multiple inference ensembling. Key innovations, including normalization independence and the median-of-means estimator, further enhance adaptability and stability. Extensive experiments demonstrate that SimDiff significantly outperforms existing methods in time series point forecasting.

Updated: 2025-11-24 16:09:55

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.19256v1
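
The median-of-means estimator cited in the SimDiff abstract is a standard robust way to aggregate noisy samples. The sketch below shows the textbook version on a list of scalars. How SimDiff groups its sampled forecasts is not stated in the abstract, so the strided grouping here is an assumption that suits exchangeable samples.

```python
from statistics import mean, median


def median_of_means(samples, num_groups):
    """Split samples into groups, average each group, return the median of the averages."""
    if not 1 <= num_groups <= len(samples):
        raise ValueError("num_groups must be between 1 and len(samples)")
    groups = [samples[i::num_groups] for i in range(num_groups)]
    return median(mean(group) for group in groups)


# Nine sampled forecasts for one time step, one of them a wild outlier.
forecasts = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.0, 50.0]
print(mean(forecasts), median_of_means(forecasts, 3))
```

The plain mean is dragged far from the bulk of the samples by the single outlier, while the median of the group means stays near 1.1. For a forecasting horizon, the estimator would be applied independently at each time step across the sampled trajectories.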

Adversarial Patch Attacks on Vision-Based Cargo Occupancy Estimation via Differentiable 3D Simulation

Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.

Updated: 2025-11-24 16:05:40

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19254v1

MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves a 4.0% higher mean return (163.26 vs. 156.93) and roughly 2.2 times the risk-adjusted performance (Sharpe 1.53 vs. 0.70) relative to a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
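
The headline comparison can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the figures quoted in the abstract. The helper names are made up for illustration, and the Sharpe-style ratio is assumed to be the mean return divided by the sample standard deviation across seeds, a definition the abstract does not spell out.

```python
import statistics

def relative_gain(new, baseline):
    """Fractional improvement of `new` over `baseline`."""
    return (new - baseline) / baseline

def sharpe_like(returns):
    """Mean over sample standard deviation (assumed definition, no risk-free rate)."""
    return statistics.mean(returns) / statistics.stdev(returns)

# Figures quoted in the abstract.
mean_gain = relative_gain(163.26, 156.93)   # mean return, full system vs. baseline
sharpe_ratio = 1.53 / 0.70                  # ratio of the two reported Sharpe values

print(f"{mean_gain:.1%}")       # 4.0%
print(f"{sharpe_ratio:.2f}x")   # 2.19x
```

The quoted Sharpe values differ by a factor of about 2.2, while the mean returns differ by about 4%.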

Updated: 2025-11-24 16:05:37

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19253v1

FedPoisonTTP: A Threat Model and Poisoning Attack for Federated Test-Time Personalization

Test-time personalization in federated learning enables models at clients to adjust online to local domain shifts, enhancing robustness and personalization in deployment. Yet, existing federated learning work largely overlooks the security risks that arise when local adaptation occurs at test time. Heterogeneous domain arrivals, diverse adaptation algorithms, and limited cross-client visibility create vulnerabilities where compromised participants can craft poisoned inputs and submit adversarial updates that undermine both global and per-client performance. To address this threat, we introduce FedPoisonTTP, a realistic grey-box attack framework that explores test-time data poisoning in the federated adaptation setting. FedPoisonTTP distills a surrogate model from adversarial queries, synthesizes in-distribution poisons using feature-consistency, and optimizes attack objectives to generate high-entropy or class-confident poisons that evade common adaptation filters. These poisons are injected during local adaptation and spread through collaborative updates, leading to broad degradation. Extensive experiments on corrupted vision benchmarks show that compromised participants can substantially diminish overall test-time performance.

Updated: 2025-11-24 16:02:01

Categories: cs.CR,cs.CV

Download: http://arxiv.org/abs/2511.19248v1

An efficient quantum algorithm for computing $S$-units and its applications

In this paper, we provide details on the proofs of the quantum polynomial time algorithm of Biasse and Song (SODA 2016) for computing the $S$-unit group of a number field. This algorithm directly implies polynomial time methods to calculate class groups, $S$-class groups, relative class groups and the unit group, and ray class groups, as well as to solve the principal ideal problem, solve certain norm equations, and decompose ideal classes in the ideal class group. Additionally, combined with a result of Cramer, Ducas, Peikert and Regev (Eurocrypt 2016), the resolution of the principal ideal problem allows one to find short generators of a principal ideal. Likewise, methods due to Cramer, Ducas and Wesolowski (Eurocrypt 2017) use the resolution of the principal ideal problem and the decomposition of ideal classes to find so-called "mildly short vectors" in ideal lattices of cyclotomic fields.

Updated: 2025-11-24 16:01:38

Categories: cs.CR,math.NT

Download: http://arxiv.org/abs/2510.02280v2

Fairness in Multi-modal Medical Diagnosis with Demonstration Selection

Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.

Updated: 2025-11-24 15:59:06

Categories: cs.CV,cs.CY,cs.LG

Download: http://arxiv.org/abs/2511.15986v2

Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are themselves software that can be further refined or modified, researchers have recently proposed a number of self-improving software agents, including the Darwin-Gödel Machine (DGM). However, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on the fly at runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with access only to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.

Updated: 2025-11-24 15:55:51

Categories: cs.SE,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2511.13646v3

Neural Architecture Search for Quantum Autoencoders

In recent years, machine learning and deep learning have driven advances in domains such as image classification, speech recognition, and anomaly detection by leveraging multi-layer neural networks to model complex data. Simultaneously, quantum computing (QC) promises to address classically intractable problems via quantum parallelism, motivating research in quantum machine learning (QML). Among QML techniques, quantum autoencoders show promise for compressing high-dimensional quantum and classical data. However, designing effective quantum circuit architectures for quantum autoencoders remains challenging due to the complexity of selecting gates, arranging circuit layers, and tuning parameters. This paper proposes a neural architecture search (NAS) framework that automates the design of quantum autoencoders using a genetic algorithm (GA). By systematically evolving variational quantum circuit (VQC) configurations, our method seeks to identify high-performing hybrid quantum-classical autoencoders for data reconstruction without becoming trapped in local minima. We demonstrate effectiveness on image datasets, highlighting the potential of quantum autoencoders for efficient feature extraction within a noise-prone, near-term quantum era. Our approach lays a foundation for broader application of genetic algorithms to quantum architecture search, aiming for a robust, automated method that can adapt to varied data and hardware constraints.

Updated: 2025-11-24 15:55:44

Categories: quant-ph,cs.AI,cs.LG,cs.NE

Download: http://arxiv.org/abs/2511.19246v1

Local Entropy Search over Descent Sequences for Bayesian Optimization

Searching large and complex design spaces for a global optimum can be infeasible and unnecessary. A practical alternative is to iteratively refine the neighborhood of an initial design using local optimization methods such as gradient descent. We propose local entropy search (LES), a Bayesian optimization paradigm that explicitly targets the solutions reachable by the descent sequences of iterative optimizers. The algorithm propagates the posterior belief over the objective through the optimizer, resulting in a probability distribution over descent sequences. It then selects the next evaluation by maximizing mutual information with that distribution, using a combination of analytic entropy calculations and Monte-Carlo sampling of descent sequences. Empirical results on high-complexity synthetic objectives and benchmark problems show that LES achieves strong sample efficiency compared to existing local and global Bayesian optimization methods.

Updated: 2025-11-24 15:52:17

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2511.19241v1

Empirical Comparison of Forgetting Mechanisms for UCB-based Algorithms on a Data-Driven Simulation Platform

Many real-world bandit problems involve non-stationary reward distributions, where the optimal decision may shift as the environment evolves. The performance of typical Multi-Armed Bandit (MAB) algorithms such as Upper Confidence Bound (UCB) degrades significantly in these settings. To address this limitation, this paper introduces and evaluates FDSW-UCB, a novel dual-view algorithm that integrates a discount-based long-term perspective with a sliding-window-based short-term view. A data-driven semi-synthetic simulation platform, built upon the MovieLens-1M and Open Bandit datasets, is developed to test algorithm adaptability under abrupt and gradual drift scenarios. Experimental results demonstrate that a well-configured sliding-window mechanism (SW-UCB) is robust, while the widely used discounting method (D-UCB) suffers from a fundamental learning failure, leading to linear regret. Crucially, the proposed FDSW-UCB, when employing an optimistic aggregation strategy, achieves superior performance in dynamic settings, highlighting that the ensemble strategy itself is a decisive factor for success.
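
The two forgetting mechanisms compared here can be sketched in a few lines. This is a minimal textbook-style illustration of a sliding-window index and a discounted index, not the paper's implementation: the class names, the exploration constant `c`, and the exact form of the bonus term are assumptions, and FDSW-UCB's aggregation step is not reproduced.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """UCB index computed only from the last `window` plays."""
    def __init__(self, n_arms, window, c=2.0):
        self.n_arms, self.c = n_arms, c
        self.history = deque(maxlen=window)   # (arm, reward); old plays fall out

    def index(self, arm):
        rewards = [r for a, r in self.history if a == arm]
        if not rewards:
            return math.inf                   # unseen within the window: explore
        t = len(self.history)
        bonus = math.sqrt(self.c * math.log(t + 1) / len(rewards))
        return sum(rewards) / len(rewards) + bonus

    def select(self):
        return max(range(self.n_arms), key=self.index)

    def update(self, arm, reward):
        self.history.append((arm, reward))

class DiscountedUCB:
    """UCB index where every past play is down-weighted by `gamma` per step."""
    def __init__(self, n_arms, gamma, c=2.0):
        self.n_arms, self.gamma, self.c = n_arms, gamma, c
        self.counts = [0.0] * n_arms          # discounted play counts
        self.sums = [0.0] * n_arms            # discounted reward sums

    def index(self, arm):
        if self.counts[arm] == 0.0:
            return math.inf
        total = sum(self.counts)
        bonus = math.sqrt(self.c * math.log(total + 1) / self.counts[arm])
        return self.sums[arm] / self.counts[arm] + bonus

    def select(self):
        return max(range(self.n_arms), key=self.index)

    def update(self, arm, reward):
        for a in range(self.n_arms):          # decay all arms, then add the new play
            self.counts[a] *= self.gamma
            self.sums[a] *= self.gamma
        self.counts[arm] += 1.0
        self.sums[arm] += reward
```

The window discards old observations outright, whereas discounting never fully forgets them. That is the structural difference behind the long-term and short-term views the abstract describes.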

Updated: 2025-11-24 15:52:02

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19240v1

Deductive Systems for Logic Programs with Counting

In answer set programming, two groups of rules are considered strongly equivalent if they have the same meaning in any context. Strong equivalence of two programs can sometimes be established by deriving the rules of each program from the rules of the other in an appropriate deductive system. This paper shows how to extend this method of proving strong equivalence to programs containing the counting aggregate.

Updated: 2025-11-24 15:49:06

Categories: cs.LO,cs.AI

Download: http://arxiv.org/abs/2511.19565v1

SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control

Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole-body controller and pairing the resulting trajectories with text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into text.

Updated: 2025-11-24 15:48:59

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2511.19236v1

Automatic Multi-View X-Ray/CT Registration Using Bone Substructure Contours

Purpose: Accurate intraoperative X-ray/CT registration is essential for surgical navigation in orthopedic procedures. However, existing methods struggle to consistently achieve sub-millimeter accuracy, lack robustness under broad initial pose estimates, or need manual key-point annotations. This work aims to address these challenges by proposing a novel multi-view X-ray/CT registration method for intraoperative bone registration. Methods: The proposed registration method consists of a multi-view, contour-based iterative closest point (ICP) optimization. Unlike previous methods, which attempt to match bone contours across the entire silhouette in both imaging modalities, we focus on matching specific subcategories of contours corresponding to bone substructures. This leads to reduced ambiguity in the ICP matches, resulting in a more robust and accurate registration solution. This approach requires only two X-ray images and operates fully automatically. Additionally, we contribute a dataset of 5 cadaveric specimens, including real X-ray images, X-ray image poses and the corresponding CT scans. Results: The proposed registration method is evaluated on real X-ray images using mean reprojection error (mRPD). The method consistently achieves sub-millimeter accuracy with an mRPD of 0.67 mm, compared to 5.35 mm for a commercial solution requiring manual intervention. Furthermore, the method offers improved practical applicability, being fully automatic. Conclusion: Our method offers a practical, accurate, and efficient solution for multi-view X-ray/CT registration in orthopedic surgeries, which can be easily combined with tracking systems. By improving registration accuracy and minimizing manual intervention, it enhances intraoperative navigation, contributing to more accurate and effective surgical outcomes in computer-assisted surgery (CAS).
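
The idea of restricting ICP matches to contours of the same bone substructure can be illustrated with a toy 2D example. This is a deliberately simplified sketch, not the paper's method, which registers X-ray images against a 3D CT volume. The points are assumed to be `(x, y, label)` tuples, the labels are hypothetical, and every label in the source is assumed to appear in the target.

```python
import math

def nearest_same_label(point, targets):
    """Closest target point that shares the query point's substructure label."""
    x, y, label = point
    candidates = [t for t in targets if t[2] == label]
    return min(candidates, key=lambda t: (t[0] - x) ** 2 + (t[1] - y) ** 2)

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src onto dst."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    dot = sum((p[0] - sx) * (q[0] - dx) + (p[1] - sy) * (q[1] - dy)
              for p, q in zip(src, dst))
    cross = sum((p[0] - sx) * (q[1] - dy) - (p[1] - sy) * (q[0] - dx)
                for p, q in zip(src, dst))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def icp_2d(source, target, iterations=20):
    """Alternate label-restricted matching and rigid fitting; return moved points."""
    current = list(source)
    for _ in range(iterations):
        matches = [nearest_same_label(p, target) for p in current]
        theta, tx, ty = fit_rigid_2d(current, matches)
        c, s = math.cos(theta), math.sin(theta)
        current = [(c * x - s * y + tx, s * x + c * y + ty, label)
                   for x, y, label in current]
    return current
```

Filtering candidates by label before the nearest-neighbor search prevents a point on one substructure from being paired with a geometrically close point on another. This is the ambiguity reduction the abstract refers to.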

Updated: 2025-11-24 15:46:09

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.13292v2

In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations

How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated a causal language model (phi-2) using a carefully curated corpus of sentences that concluded plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we utilized two complementary probes. First, we conducted a per-layer detection using a linear probe. Our findings revealed that a simple linear decoder struggled to distinguish between plausible and implausible endings in the lowest third of the model's layers. However, its accuracy sharply increased in the middle blocks, reaching a peak just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially, the violation widens the representational subspace, followed by a collapse after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results suggest a possible alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, occurring later in the online processing sequence.

Updated: 2025-11-24 15:43:56

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.19232v1

Learning Plug-and-play Memory for Guiding Video Diffusion Models

Diffusion Transformer (DiT)-based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder, DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen and optimize only the memory encoder. This yields an efficient training process with few trainable parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.

Updated: 2025-11-24 15:42:23

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19229v1

Higher-Order Regularization Learning on Hypergraphs

Higher-Order Hypergraph Learning (HOHL) was recently introduced as a principled alternative to classical hypergraph regularization, enforcing higher-order smoothness via powers of multiscale Laplacians induced by the hypergraph structure. Prior work established the well- and ill-posedness of HOHL through an asymptotic consistency analysis in geometric settings. We extend this theoretical foundation by proving the consistency of a truncated version of HOHL and deriving explicit convergence rates when HOHL is used as a regularizer in fully supervised learning. We further demonstrate its strong empirical performance in active learning and in datasets lacking an underlying geometric structure, highlighting HOHL's versatility and robustness across diverse learning settings.

Updated: 2025-11-24 15:37:40

Categories: cs.LG,math.ST

Download: http://arxiv.org/abs/2510.26533v2

Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference

This work addresses the problem of constructing reliable prediction intervals for individual counterfactual outcomes. Existing conformal counterfactual inference (CCI) methods provide marginal coverage guarantees but often produce overly conservative intervals, particularly under treatment imbalance when counterfactual samples are scarce. We introduce synthetic data-powered CCI (SP-CCI), a new framework that augments the calibration set with synthetic counterfactual labels generated by a pre-trained counterfactual model. To ensure validity, SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI). We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting. Empirical results on different datasets confirm that SP-CCI consistently reduces interval width compared to standard CCI across all settings.
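
To see why scarce calibration samples lead to conservative intervals, it helps to look at plain split conformal prediction, which is the simpler building block and not the SP-CCI procedure itself (no importance weighting, RCPS, or PPI debiasing is included). The function names and the absolute-residual score below are assumptions chosen for illustration.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of calibration scores at miscoverage alpha."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))   # rank with the usual (n + 1) correction
    if k > n:
        return math.inf                    # too few samples: interval is unbounded
    return sorted(scores)[k - 1]

def prediction_interval(prediction, qhat):
    """Symmetric interval around a point prediction for absolute-residual scores."""
    return (prediction - qhat, prediction + qhat)

# Hypothetical absolute residuals |y - f(x)| on a held-out calibration set.
calibration_scores = [3, 1, 2, 5, 4, 9, 7, 8, 6]
qhat = conformal_quantile(calibration_scores, alpha=0.5)
print(prediction_interval(10.0, qhat))     # (5.0, 15.0)

# With only three calibration points, 90% coverage cannot be certified finitely.
print(conformal_quantile([0.4, 0.1, 0.3], alpha=0.1))   # inf
```

The second call shows the failure mode that motivates enlarging the calibration set: with very few samples the required rank exceeds the sample size and the interval becomes vacuous.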

Updated: 2025-11-24 15:33:40

Categories: cs.LG,cs.IT

Download: http://arxiv.org/abs/2509.04112v2

Trust-Based Social Learning for Communication (TSLEC) Protocol Evolution in Multi-Agent Reinforcement Learning

Emergent communication in multi-agent systems typically occurs through independent learning, resulting in slow convergence and potentially suboptimal protocols. We introduce TSLEC (Trust-Based Social Learning with Emergent Communication), a framework where agents explicitly teach successful strategies to peers, with knowledge transfer modulated by learned trust relationships. Through experiments with 100 episodes across 30 random seeds, we demonstrate that trust-based social learning reduces episodes-to-convergence by 23.9% (p < 0.001, Cohen's d = 1.98) compared to independent emergence, while producing compositional protocols (C = 0.38) that remain robust under dynamic objectives (Phi > 0.867 decoding accuracy). Trust scores strongly correlate with teaching quality (r = 0.743, p < 0.001), enabling effective knowledge filtering. Our results establish that explicit social learning fundamentally accelerates emergent communication in multi-agent coordination.
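
For readers less familiar with the effect-size statistics quoted above, the following sketch shows how a relative reduction and Cohen's d are conventionally computed from two samples. It uses the standard pooled-standard-deviation form of Cohen's d; whether the paper uses exactly this variant is an assumption, and no data from the paper is reproduced here.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

def relative_reduction(baseline, treated):
    """Fractional decrease of `treated` relative to `baseline`."""
    return (baseline - treated) / baseline
```

Applied to per-seed episodes-to-convergence for the two conditions, `relative_reduction` on the group means would give the percentage figure and `cohens_d` the effect size.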

Updated: 2025-11-24 15:31:51

Categories: cs.MA,cs.AI

Download: http://arxiv.org/abs/2511.19562v1

Merging without Forgetting: Continual Fusion of Task-Specific Models via Optimal Transport

Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique structural identities of each task. To ensure scalability in real-world settings, OTMF further supports a continual fusion paradigm that incrementally integrates each new task vector without revisiting previous ones, maintaining a bounded memory footprint and enabling efficient fusion across a growing number of tasks. We conduct comprehensive experiments on multiple vision and language benchmarks, and results show that OTMF achieves state-of-the-art performance in terms of both accuracy and efficiency. These findings highlight the practical and theoretical value of our approach to model merging.

Updated: 2025-11-24 15:27:47

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2511.19561v1

Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering

Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
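
As a quick check on the headline figure, the GPT-4o drop can be recomputed from the two accuracies quoted above. This is an illustrative sketch only: the function names are ours, and the relative drop is derived here rather than reported in the abstract.

```python
# Accuracy figures quoted in the abstract for GPT-4o, in percent.
acc_with_image = 83.2
acc_blank_image = 55.3


def drop_in_pp(before: float, after: float) -> float:
    """Absolute accuracy drop in percentage points (pp)."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Drop expressed as a fraction of the original accuracy."""
    return (before - after) / before


pp = drop_in_pp(acc_with_image, acc_blank_image)
rel = relative_drop(acc_with_image, acc_blank_image)
print(f"{pp:.1f} pp absolute, {rel:.1%} relative")
```

A percentage-point drop and a relative drop answer different questions, so it helps to state which one is meant when comparing models with different starting accuracies.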

Updated: 2025-11-24 15:26:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19220v1

Analysis of Semi-Supervised Learning on Hypergraphs

Hypergraphs provide a natural framework for modeling higher-order interactions, yet their theoretical underpinnings in semi-supervised learning remain limited. We provide an asymptotic consistency analysis of variational learning on random geometric hypergraphs, precisely characterizing the conditions ensuring the well-posedness of hypergraph learning as well as showing convergence to a weighted $p$-Laplacian equation. Motivated by this, we propose Higher-Order Hypergraph Learning (HOHL), which regularizes via powers of Laplacians from skeleton graphs for multiscale smoothness. HOHL converges to a higher-order Sobolev seminorm. Empirically, it performs strongly on standard baselines.
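
To make the idea of regularizing with powers of Laplacians concrete, here is a minimal sketch of an energy that sums one term per skeleton graph. It assumes unnormalized graph Laplacians, integer powers, and hand-picked weights; `graph_laplacian` and `hohl_energy` are hypothetical names, and the exact HOHL objective may differ.

```python
import numpy as np


def graph_laplacian(W: np.ndarray) -> np.ndarray:
    """Unnormalized Laplacian L = D - W of a weighted adjacency matrix."""
    return np.diag(W.sum(axis=1)) - W


def hohl_energy(u, laplacians, powers, weights) -> float:
    """Sum over scales k of weights[k] * u^T L_k^{powers[k]} u."""
    total = 0.0
    for L, p, w in zip(laplacians, powers, weights):
        total += w * float(u @ np.linalg.matrix_power(L, p) @ u)
    return total


# Toy skeleton graph: a path on three nodes.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = graph_laplacian(W)

u_const = np.ones(3)                # constant labelling: perfectly smooth
u_spike = np.array([0.0, 1.0, 0.0])  # spike at the middle node

energy_const = hohl_energy(u_const, [L, L], [1, 2], [1.0, 0.5])
energy_spike = hohl_energy(u_spike, [L, L], [1, 2], [1.0, 0.5])
```

Higher powers penalize rough labellings more strongly, which is why the spike costs more under the squared Laplacian than under the first-order term.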

Updated: 2025-11-24 15:26:34

Categories: cs.LG,math.ST

Download: http://arxiv.org/abs/2510.25354v2

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. To mitigate these challenges, we propose ACE-Safety (Adversarial Co-Evolution for LLM Safety), a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures: (1) Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS), which efficiently explores jailbreak strategies to uncover vulnerabilities and generate diverse adversarial samples; (2) Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO), which jointly trains attack and defense LLMs with challenging samples via curriculum reinforcement learning, enabling robust mutual improvement. Evaluations across multiple benchmarks demonstrate that our method outperforms existing attack and defense approaches, and provides a feasible pathway for developing LLMs that can sustainably support responsible AI ecosystems.

Updated: 2025-11-24 15:23:41

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.19218v1

Distributionally Robust Free Energy Principle for Decision-Making

Despite their groundbreaking performance, autonomous agents can misbehave when training and environmental conditions become inconsistent, with minor mismatches leading to undesirable behaviors or even catastrophic failures. Robustness towards these training-environment ambiguities is a core requirement for intelligent agents and its fulfillment is a long-standing challenge towards their real-world deployments. Here, we introduce a Distributionally Robust Free Energy model (DR-FREE) that instills this core property by design. Combining a robust extension of the free energy principle with a resolution engine, DR-FREE wires robustness into the agent decision-making mechanisms. Across benchmark experiments, DR-FREE enables the agents to complete the task even when, in contrast, state-of-the-art models fail. This milestone may inspire both deployments in multi-agent settings and, at a perhaps deeper level, the quest for an explanation of how natural agents -- with little or no training -- survive in capricious environments.

Updated: 2025-11-24 15:19:30

Categories: cs.AI,eess.SY,math.OC

Download: http://arxiv.org/abs/2503.13223v3

CLASH: A Benchmark for Cross-Modal Contradiction Detection

Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.

Updated: 2025-11-24 15:09:07

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.19199v1

ExtendAttack: Attacking Servers of LRMs via Extending Reasoning

Large Reasoning Models (LRMs) have demonstrated promising performance in complex tasks. However, their resource-consuming reasoning processes may be exploited by attackers to maliciously occupy server resources and cause a crash, much like a DDoS attack. To this end, we propose a novel attack method on LRMs termed ExtendAttack, which maliciously occupies server resources by stealthily extending the reasoning processes of LRMs. Concretely, we systematically obfuscate characters within a benign prompt, transforming them into a complex, poly-base ASCII representation. This compels the model to perform a series of computationally intensive decoding sub-tasks that are deeply embedded within the semantic structure of the query itself. Extensive experiments demonstrate the effectiveness of our proposed ExtendAttack. Remarkably, it significantly increases response length and latency, with the former increasing by over 2.7 times for the o3 model on the HumanEval benchmark. In addition, it preserves the original meaning of the query and achieves comparable answer accuracy, which demonstrates its stealthiness.
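
The poly-base ASCII idea can be illustrated with a short sketch: each selected character is replaced by its ASCII code written in a randomly chosen base. The `<(base)digits>` token format, the base range 2 to 36, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import random

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base(n: int, base: int) -> str:
    """Write a non-negative integer n in the given base (2..36)."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, base)
        out.append(DIGITS[r])
    return "".join(reversed(out))


def obfuscate(prompt: str, ratio: float, rng: random.Random) -> str:
    """Replace roughly `ratio` of the ASCII alphanumerics with <(base)digits> tokens."""
    pieces = []
    for ch in prompt:
        if ch.isalnum() and ord(ch) < 128 and rng.random() < ratio:
            base = rng.randint(2, 36)  # hypothetical base range
            pieces.append(f"<({base}){to_base(ord(ch), base)}>")
        else:
            pieces.append(ch)
    return "".join(pieces)


def decode_token(token: str) -> str:
    """Invert a single <(base)digits> token back to its character."""
    base_str, digits = token[2:-1].split(")")
    return chr(int(digits, int(base_str)))


example = obfuscate("add two numbers", 0.5, random.Random(0))
```

Because every token must be decoded before the prompt can be understood, the model spends extra reasoning steps on decoding while the underlying question is unchanged.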

Updated: 2025-11-24 15:07:05

Categories: cs.CR

Download: http://arxiv.org/abs/2506.13737v2

Learning to Call: A Field Trial of a Collaborative Bandit Algorithm for Improved Message Delivery in Mobile Maternal Health

Mobile health (mHealth) programs use automated voice messages to deliver health information, particularly to underserved communities, and have shown that mobile technology can disseminate crucial health information to these populations and improve health outcomes through increased awareness and behavioral change. India's Kilkari program delivers vital maternal health information via weekly voice calls to millions of mothers. However, the current random call scheduling often results in missed calls and reduced message delivery. This study presents a field trial of a collaborative bandit algorithm designed to optimize call timing by learning individual mothers' preferred call times. We deployed the algorithm with around 6,500 Kilkari participants as a pilot study, comparing its performance to the baseline random calling approach. Our results demonstrate a statistically significant improvement in call pick-up rates with the bandit algorithm, indicating its potential to enhance message delivery and impact millions of mothers across India. This research highlights the efficacy of personalized scheduling in mobile health interventions and underscores the potential of machine learning to improve maternal health outreach at scale.
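
To give a feel for how a bandit can learn a preferred call time, the sketch below runs Beta-Bernoulli Thompson sampling over a handful of time slots for a single simulated user. It is a simpler stand-in, not the collaborative bandit used in the trial; the slot pick-up probabilities and the function name are made up for illustration.

```python
import random


def thompson_schedule(true_pickup, rounds, rng):
    """Single-user Thompson sampling over call time slots.

    true_pickup: hypothetical pick-up probability for each slot.
    Returns how many calls were placed in each slot.
    """
    n = len(true_pickup)
    successes = [1] * n  # Beta(1, 1) prior for every slot
    failures = [1] * n
    counts = [0] * n
    for _ in range(rounds):
        # Sample a plausible pick-up rate for each slot and call in the best one.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(n)]
        slot = max(range(n), key=lambda i: samples[i])
        answered = rng.random() < true_pickup[slot]
        if answered:
            successes[slot] += 1
        else:
            failures[slot] += 1
        counts[slot] += 1
    return counts


counts = thompson_schedule([0.1, 0.7, 0.2], rounds=2000, rng=random.Random(0))
```

A collaborative variant would additionally share information across users with similar behaviour, which matters when each mother receives only one call per week.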

Updated: 2025-11-24 15:04:04

Categories: cs.AI

Download: http://arxiv.org/abs/2507.16356v2

Layer-wise Weight Selection for Power-Efficient Neural Network Acceleration

Systolic array accelerators execute CNNs with energy dominated by the switching activity of multiply-accumulate (MAC) units. Although prior work exploits weight-dependent MAC power for compression, existing methods often use global activation models, coarse energy proxies, or layer-agnostic policies, which limits their effectiveness on real hardware. We propose an energy-aware, layer-wise compression framework that explicitly leverages MAC-level and layer-level energy characteristics. First, we build a layer-aware MAC energy model that combines per-layer activation statistics with an MSB-Hamming distance grouping of 22-bit partial sum transitions, and integrate it with a tile-level systolic mapping to estimate convolution-layer energy. On top of this model, we introduce an energy-accuracy co-optimized weight selection algorithm within quantization-aware training and an energy-prioritized layer-wise schedule that compresses high-energy layers more aggressively under a global accuracy constraint. Experiments on different CNN models demonstrate up to 58.6% energy reduction with a 2-3% accuracy drop, outperforming a state-of-the-art power-aware baseline.
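
One possible reading of the "MSB-Hamming distance grouping" is to bucket each partial-sum transition by the highest bit that toggles and by how many bits toggle in total. The sketch below implements that reading for 22-bit values; the grouping key and the function names are assumptions, not the paper's definition.

```python
from collections import Counter

MASK22 = (1 << 22) - 1  # keep the low 22 bits (two's complement for negatives)


def transition_key(prev_sum: int, next_sum: int):
    """Return (index of highest toggled bit, number of toggled bits).

    The index is -1 when no bit toggles.
    """
    toggled = (prev_sum ^ next_sum) & MASK22
    hamming = bin(toggled).count("1")
    msb = toggled.bit_length() - 1
    return msb, hamming


def group_transitions(partial_sums):
    """Histogram of transition keys over a sequence of partial sums."""
    return Counter(
        transition_key(a, b) for a, b in zip(partial_sums, partial_sums[1:])
    )


groups = group_transitions([0, 5, 5, -1])
```

An energy model could then attach a measured or simulated switching cost to each bucket and weight it by how often the bucket occurs in a given layer.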

Updated: 2025-11-24 15:02:34

Categories: cs.AR,cs.LG

Download: http://arxiv.org/abs/2511.17123v2

Don't Reach for the Stars: Rethinking Topology for Resilient Federated Learning

Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy by keeping data local. Traditional FL approaches rely on a centralized, star-shaped topology, where a central server aggregates model updates from clients. However, this architecture introduces several limitations, including a single point of failure, limited personalization, poor robustness to distribution shifts, and vulnerability to malfunctioning clients. Moreover, update selection in centralized FL often relies on low-level parameter differences, which can be unreliable when client data is not independent and identically distributed, and offers clients little control. In this work, we propose a decentralized, peer-to-peer (P2P) FL framework. It leverages the flexibility of the P2P topology to enable each client to identify and aggregate a personalized set of trustworthy and beneficial updates. This framework is the Local Inference Guided Aggregation for Heterogeneous Training Environments to Yield Enhancement Through Agreement and Regularization (LIGHTYEAR). Central to our method is an agreement score, computed on a local validation set, which quantifies the semantic alignment of incoming updates in the function space with respect to the client's reference model. Each client uses this score to select a tailored subset of updates and performs aggregation with a regularization term that further stabilizes the training. Our empirical evaluation across five datasets shows that the proposed approach consistently outperforms both centralized baselines and existing P2P methods in terms of client-level performance, particularly under adversarial and heterogeneous conditions.
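
The select-then-aggregate step described above can be sketched as follows. This is a minimal illustration under our own assumptions: agreement is measured as the fraction of matching predicted labels on the local validation set, the top-k peers are averaged, and regularization is a simple pull toward the client's own weights. The actual LIGHTYEAR score and regularizer may be defined differently.

```python
import numpy as np


def agreement_score(ref_preds: np.ndarray, peer_preds: np.ndarray) -> float:
    """Fraction of local validation samples where the peer matches the reference model."""
    return float(np.mean(ref_preds == peer_preds))


def select_and_aggregate(own_w, peer_ws, scores, k, reg):
    """Average the k best-agreeing peer updates, then blend with the client's own weights.

    reg in [0, 1] controls how strongly the result is pulled toward own_w.
    """
    top = np.argsort(scores)[::-1][:k]
    peer_mean = np.mean([peer_ws[i] for i in top], axis=0)
    return (1.0 - reg) * peer_mean + reg * own_w


score = agreement_score(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]))

own = np.zeros(2)
peers = [np.full(2, 2.0), np.full(2, 4.0), np.full(2, 100.0)]
merged = select_and_aggregate(own, peers, scores=[0.9, 0.8, 0.1], k=2, reg=0.5)
```

Scoring in function space means an outlier such as the third peer is excluded because of how it behaves on local data, not because of how far its raw parameters are from the client's.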

Updated: 2025-11-24 14:59:34

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2508.05224v2

Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions

The robustness of deep neural networks is a crucial factor in safety-critical applications, particularly in complex and dynamic environments (e.g., medical or driving scenarios) where localized corruptions can arise. While previous studies have evaluated the robustness of semantic segmentation (SS) models under whole-image natural or adversarial corruptions, the spatial robustness of dense vision models under localized corruptions remains underexplored. This paper fills this gap by introducing novel, region-aware metrics for benchmarking the spatial robustness of segmentation models, along with an evaluation framework to assess the impact of natural localized corruptions. Furthermore, it uncovers the inherent complexity of evaluating worst-case spatial robustness using only a single localized adversarial attack. To address this, the work proposes a region-aware multi-attack adversarial analysis to systematically assess model robustness across specific image regions. The proposed metrics and analysis were used to evaluate 14 segmentation models in driving scenarios, uncovering key insights into the effects of localized corruption in both natural and adversarial forms. The results reveal that models respond to these two types of threats differently; for instance, transformer-based segmentation models demonstrate notable robustness to localized natural corruptions but are highly vulnerable to adversarial ones, and vice versa for CNN-based models. Consequently, we also address the challenge of balancing robustness to both natural and adversarial localized corruptions by means of ensemble models, thereby achieving a broader threat coverage and improved reliability for dense vision tasks.

Updated: 2025-11-24 14:54:55

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2504.01632v3

SpectraNet: FFT-assisted Deep Learning Classifier for Deepfake Face Detection

Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
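
For readers unfamiliar with Fourier-based features, the sketch below shows one common way to derive amplitude and phase maps from a grayscale image with NumPy. The log scaling and the centering of the spectrum are our choices for illustration; the abstract does not describe the exact feature pipeline.

```python
import numpy as np


def fft_features(gray: np.ndarray):
    """Return (log-amplitude, phase) maps of a 2D grayscale image."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))  # move the DC term to the center
    amplitude = np.log1p(np.abs(spectrum))         # compress the dynamic range
    phase = np.angle(spectrum)                     # values in [-pi, pi]
    return amplitude, phase


# A constant image has all of its energy in the DC component.
image = np.ones((8, 8))
amplitude, phase = fft_features(image)
peak = np.unravel_index(np.argmax(amplitude), amplitude.shape)
```

Maps like these can be stacked with the RGB channels or fed to a separate branch, which is the kind of addition the abstract reports as having minimal impact.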

Updated: 2025-11-24 14:54:00

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2511.19187v1

Torsion-Space Diffusion for Protein Backbone Generation with Geometric Refinement

Designing new protein structures is fundamental to computational biology, enabling advances in therapeutic molecule discovery and enzyme engineering. Existing diffusion-based generative models typically operate in Cartesian coordinate space, where adding noise disrupts strict geometric constraints such as fixed bond lengths and angles, often producing physically invalid structures. To address this limitation, we propose a Torsion-Space Diffusion Model that generates protein backbones by denoising torsion angles, ensuring perfect local geometry by construction. A differentiable forward-kinematics module reconstructs 3D coordinates with fixed 3.8 Angstrom backbone bond lengths while a constrained post-processing refinement optimizes global compactness via Radius of Gyration (Rg) correction, without violating bond constraints. Experiments on standard PDB proteins demonstrate 100% bond-length accuracy and significantly improved structural compactness, reducing Rg error from 70% to 18.6% compared to Cartesian diffusion baselines. Overall, this hybrid torsion-diffusion plus geometric-refinement framework generates physically valid and compact protein backbones, providing a promising path toward full-atom protein generation.
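
The two quantities the abstract evaluates, bond lengths and radius of gyration, are easy to compute from backbone coordinates. The sketch below uses the standard definition of Rg as the root-mean-square distance from the centroid; the helper names and the toy straight chain are illustrative only.

```python
import numpy as np


def radius_of_gyration(coords: np.ndarray) -> float:
    """Root-mean-square distance of the points from their centroid."""
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))


def bond_lengths(coords: np.ndarray) -> np.ndarray:
    """Distances between consecutive backbone positions."""
    return np.linalg.norm(np.diff(coords, axis=0), axis=1)


def rg_relative_error(coords: np.ndarray, target_rg: float) -> float:
    """Relative deviation of the structure's Rg from a target value."""
    return abs(radius_of_gyration(coords) - target_rg) / target_rg


# Toy chain: five positions on a straight line, 3.8 Angstrom apart.
chain = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
rg = radius_of_gyration(chain)
lengths = bond_lengths(chain)
```

A fully extended chain like this one has the largest Rg possible for its length, which is why a compactness correction is needed on top of locally valid geometry.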

Updated: 2025-11-24 14:51:29

Categories: q-bio.BM,cs.AI

Download: http://arxiv.org/abs/2511.19184v1

SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. As true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety-Prompt adherence-Quality-Robustness). SPQR is a single-score metric that provides a standardized and reproducible framework for evaluating how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, and its single leaderboard score facilitates comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, ultimately showcasing SPQR as a concise yet comprehensive benchmark for safety alignment techniques for T2I models.

Updated: 2025-11-24 14:46:20

Categories: cs.CR,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2511.19558v1

In-Situ Tweedie Discrete Diffusion Models

While diffusion models excel at generating continuous data such as images, adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token masking mechanisms, both of which deviate from modeling the true discrete data distribution that can be theoretically guaranteed by Tweedie's formula. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD directly corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process naturally unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, with extensive ablation studies confirming the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
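
The sampling loop described above (predict class probabilities, take the argmax, rebuild one-hot vectors, re-noise at a lower level) can be sketched in a few lines. Everything here is a toy under stated assumptions: `toy_denoiser` is a placeholder for a trained timestep-conditioned network, the noise schedule is arbitrary, and starting from pure Gaussian noise is our assumption.

```python
import numpy as np


def one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Turn integer labels of shape (n,) into one-hot rows of shape (n, num_classes)."""
    return np.eye(num_classes)[labels]


def corrupt(x0: np.ndarray, sigma: float, rng) -> np.ndarray:
    """Forward process: add Gaussian noise directly to one-hot vectors."""
    return x0 + sigma * rng.standard_normal(x0.shape)


def toy_denoiser(x_noisy: np.ndarray, sigma: float) -> np.ndarray:
    """Placeholder model: softmax over the noisy input, sharper at low noise."""
    logits = x_noisy / max(sigma, 1e-3)
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def tdd_sample(seq_len: int, num_classes: int, sigmas, rng) -> np.ndarray:
    """Iterative denoising with a decreasing noise schedule `sigmas`."""
    x = sigmas[0] * rng.standard_normal((seq_len, num_classes))
    labels = np.zeros(seq_len, dtype=int)
    for step, sigma in enumerate(sigmas):
        probs = toy_denoiser(x, sigma)          # predict class probabilities
        labels = probs.argmax(axis=-1)          # discrete prediction
        x0_hat = one_hot(labels, num_classes)   # back to one-hot space
        next_sigma = sigmas[step + 1] if step + 1 < len(sigmas) else 0.0
        x = corrupt(x0_hat, next_sigma, rng)    # re-noise at the next, lower level
    return labels


labels = tdd_sample(6, 4, [1.0, 0.5, 0.1], np.random.default_rng(0))
```

In training, the placeholder would be replaced by a network fit with cross-entropy between its predicted probabilities and the clean labels, with the noise level supplied as the timestep condition.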

Updated: 2025-11-24 14:42:41

Categories: cs.CV,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.01047v2

From Raw Features to Effective Embeddings: A Three-Stage Approach for Multimodal Recipe Recommendation

Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a 3-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.

Updated: 2025-11-24 14:37:22

Categories: cs.LG,cs.IR

Download: http://arxiv.org/abs/2511.19176v1

LLM-Based Agentic Negotiation for 6G: Addressing Uncertainty Neglect and Tail-Event Risk

A critical barrier to the trustworthiness of sixth-generation (6G) agentic autonomous networks is the uncertainty neglect bias: a cognitive tendency for large language model (LLM)-powered agents to make high-stakes decisions based on simple averages while ignoring the tail risk of extreme events. This paper proposes an unbiased, risk-aware framework for agentic negotiation, designed to ensure robust resource allocation in 6G network slicing. Specifically, agents leverage Digital Twins (DTs) to predict full latency distributions, which are then evaluated using a formal framework from extreme value theory, namely, Conditional Value-at-Risk (CVaR). This approach fundamentally shifts the agent's objective from reasoning over the mean to reasoning over the tail, thereby building a statistically-grounded buffer against worst-case outcomes. Furthermore, our framework ensures full uncertainty awareness by requiring agents to quantify epistemic uncertainty -- confidence in their own DT predictions -- and propagate this meta-verification to make robust decisions, preventing them from acting on unreliable data. We validate this framework in a 6G inter-slice negotiation use-case between an eMBB and a URLLC agent. The results demonstrate the profound failure of the biased, mean-based baseline, which consistently violates its SLAs at a 25% rate. Our unbiased, CVaR-aware agent successfully mitigates this bias, eliminating SLA violations and reducing the URLLC and eMBB p99.999 latencies by around 11%. We show this reliability comes at the rational and quantifiable cost of energy savings slightly reduced to 17%, exposing the false economy of the biased approach. This work provides a concrete methodology for building the trustworthy autonomous systems required for 6G.
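
The shift from reasoning over the mean to reasoning over the tail can be shown with an empirical CVaR estimate on latency samples. The sketch below uses a simple estimator (the mean of all samples at or above the alpha-quantile) and synthetic latencies; it is not the paper's implementation, and the numbers are illustrative.

```python
import numpy as np


def cvar(latencies: np.ndarray, alpha: float) -> float:
    """Empirical CVaR: mean of the samples at or above the alpha-quantile (the VaR)."""
    var = np.quantile(latencies, alpha)
    tail = latencies[latencies >= var]
    return float(tail.mean())


# Synthetic latency samples: 1, 2, ..., 100 (arbitrary units).
samples = np.arange(1, 101, dtype=float)
mean_latency = float(samples.mean())
tail_latency = cvar(samples, alpha=0.95)
```

An agent that compares `tail_latency` rather than `mean_latency` against an SLA threshold will reserve more resources whenever the predicted distribution has a heavy tail, even if its average looks safe.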

Updated: 2025-11-24 14:36:11

Categories: cs.NI,cs.AI,cs.MA

Download: http://arxiv.org/abs/2511.19175v1

Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion

Research on the safety evaluation of large language models (LLMs) has become extensive, driven by jailbreak studies that elicit unsafe responses. Such responses involve information already available to humans, such as the answer to "how to make a bomb". When LLMs are jailbroken, the practical threat they pose to humans is negligible. However, it remains unclear whether LLMs commonly produce unpredictable outputs that could pose substantive threats to human safety. To address this gap, we study whether LLM-generated content contains potential existential threats, defined as outputs that imply or promote direct harm to human survival. We propose ExistBench, a benchmark designed to evaluate such risks. Each sample in ExistBench is derived from scenarios where humans are positioned as adversaries to AI assistants. Unlike existing evaluations, we use prefix completion to bypass model safeguards. This leads the LLMs to generate suffixes that express hostility toward humans or describe actions that pose severe threats, such as the execution of a nuclear strike. Our experiments on 10 LLMs reveal that LLM-generated content indicates existential threats. To investigate the underlying causes, we also analyze the attention logits from LLMs. To highlight real-world safety risks, we further develop a framework to assess model behavior in tool-calling. We find that LLMs actively select and invoke external tools that pose existential threats. Code and data are available at: https://github.com/cuiyu-ai/ExistBench.

Updated: 2025-11-24 14:34:13

Categories: cs.CR

Download: http://arxiv.org/abs/2511.19171v1

RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning

Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on public and proprietary datasets, in both offline scenarios and online A/B testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models such as RAVEN in fine-grained violation understanding, reasoning capability, and generalization ability.

Updated: 2025-11-24 14:32:13

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2511.19168v1

Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial Vehicles (UAVs), providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. To address these limitations, we propose ThiFAN-VQA, a two-stage reasoning-based framework for visual question answering (VQA) in disaster scenarios. ThiFAN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, which effectively improves model performance. By integrating a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection, ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency. Experiments on FloodNet and RescueNet-VQA, UAV-based datasets from flood- and hurricane-affected regions, demonstrate that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks.

Updated: 2025-11-24 14:32:07

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.19557v1

AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking

Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focus on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than through supervised fine-tuning alone, which often fails to produce faithful abstractions. Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks. In addition, improving GSM robustness via AbstRaL is shown to implicitly benefit LLMs' capabilities on out-of-distribution (OOD) mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.

Updated: 2025-11-24 14:29:20

Categories: cs.CL,cs.AI,cs.SC

Download: http://arxiv.org/abs/2506.07751v3

A Goemans-Williamson type algorithm for identifying subcohorts in clinical trials

We design an efficient algorithm that outputs tests for identifying predominantly homogeneous subcohorts of patients from large inhomogeneous datasets. Our theoretical contribution is a rounding technique, similar to that of Goemans and Williamson (1995), that approximates the optimal solution within a factor of $0.82$. As an application, we use our algorithm to trade off sensitivity for specificity to systematically identify clinically interesting homogeneous subcohorts of patients in the RNA microarray dataset for breast cancer from Curtis et al. (2012). One such clinically interesting subcohort suggests a link between LXR over-expression and BRCA2 and MSH6 methylation levels for patients in that subcohort.
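For orientation, the textbook Goemans-Williamson rounding step that the abstract compares its technique to can be sketched as follows. This is a minimal illustration of random-hyperplane rounding only, not the paper's own rounding scheme or its 0.82 guarantee; the function name `hyperplane_round` and the toy vectors are assumptions, and the unit vectors are presumed to come from a semidefinite relaxation solved elsewhere.

```python
import random


def hyperplane_round(vectors, seed=0):
    """Round unit vectors to +1/-1 labels using one random hyperplane.

    Each vector receives the sign of its inner product with a random
    Gaussian direction, as in textbook Goemans-Williamson rounding.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    # A vector of i.i.d. Gaussians points in a uniformly random direction.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    labels = []
    for v in vectors:
        dot = sum(a * b for a, b in zip(v, normal))
        labels.append(1 if dot >= 0 else -1)
    return labels


# Hypothetical relaxation output: two antipodal vectors and a copy of the first.
vecs = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0)]
labels = hyperplane_round(vecs)
```

Vectors that the relaxation places far apart tend to land on opposite sides of the hyperplane, which is what lets the rounded solution retain a constant fraction of the relaxed objective.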

Updated: 2025-11-24 14:29:04

Categories: q-bio.QM,cs.LG

Download: http://arxiv.org/abs/2506.10879v2

First-order Sobolev Reinforcement Learning

We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also their derivatives with respect to states and actions. By differentiating the Bellman backup through differentiable dynamics, we obtain analytically consistent gradient targets. Incorporating these into the critic objective using a Sobolev-type loss encourages the critic to align with both the value and local geometry of the target function. This first-order TD matching principle can be seamlessly integrated into existing algorithms, such as Q-learning or actor-critic methods (e.g., DDPG, SAC), potentially leading to faster critic convergence and more stable policy gradients without altering their overall structure.
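The first-order matching idea can be illustrated on a scalar toy problem. The sketch below is an assumption-laden simplification: it uses a state-value function instead of the state-action critic mentioned in the abstract, a hand-written chain rule instead of automatic differentiation, and a hypothetical weight `lam` on the derivative term. All function names and the toy dynamics are illustrative and not taken from the paper.

```python
def bellman_target_and_grad(s, gamma, reward, d_reward, dynamics, d_dynamics,
                            v_next, d_v_next):
    """Return the TD target y(s) and its derivative dy/ds via the chain rule."""
    s_next = dynamics(s)
    y = reward(s) + gamma * v_next(s_next)
    dy = d_reward(s) + gamma * d_v_next(s_next) * d_dynamics(s)
    return y, dy


def sobolev_td_loss(v, dv, y, dy, lam=1.0):
    """Squared value error plus lam times squared derivative error."""
    return (v - y) ** 2 + lam * (dv - dy) ** 2


# Toy problem (hypothetical): r(s) = -s^2, s' = 0.5 * s, gamma = 0.8.
# Its exact value function is V(s) = -1.25 * s^2.
gamma = 0.8
reward = lambda s: -(s ** 2)
d_reward = lambda s: -2.0 * s
dynamics = lambda s: 0.5 * s
d_dynamics = lambda s: 0.5


def make_critic(k):
    """Quadratic critic V(s) = -k * s^2 with its analytic derivative."""
    return (lambda s: -k * s ** 2), (lambda s: -2.0 * k * s)


def loss_at(k, s=2.0):
    """Sobolev TD loss of the quadratic critic with coefficient k at state s."""
    v, dv = make_critic(k)
    y, dy = bellman_target_and_grad(s, gamma, reward, d_reward,
                                    dynamics, d_dynamics, v, dv)
    return sobolev_td_loss(v(s), dv(s), y, dy)


exact_loss = loss_at(1.25)  # Bellman-consistent critic
wrong_loss = loss_at(1.0)   # mis-scaled critic
```

The Bellman-consistent critic incurs zero loss in both terms, while the mis-scaled critic is penalized for its value error and, separately, for its slope error.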

Updated: 2025-11-24 14:28:49

Categories: cs.LG,cs.RO

Download: http://arxiv.org/abs/2511.19165v1

A Robust State Filter Against Unmodeled Process And Measurement Noise

This paper introduces a novel Kalman filter framework designed to achieve robust state estimation under both process and measurement noise. Inspired by the Weighted Observation Likelihood Filter (WoLF), which provides robustness against measurement outliers, we apply a generalized Bayesian approach to build a framework that accounts for outliers in both process and measurement noise.

Updated: 2025-11-24 14:25:13

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2511.19157v1

Information Physics of Intelligence: Unifying Logical Depth and Entropy under Thermodynamic Constraints

The rapid scaling of artificial intelligence models has revealed a fundamental tension between model capacity (storage) and inference efficiency (computation). While classical information theory focuses on transmission and storage limits, it lacks a unified physical framework to quantify the thermodynamic costs of generating information from compressed laws versus retrieving it from memory. In this paper, we propose a theoretical framework that treats information processing as an enabling mapping from ontological states to carrier states. We introduce a novel metric, Derivation Entropy, which quantifies the effective work required to compute a target state from a given logical depth. By analyzing the interplay between Shannon entropy (storage) and computational complexity (time/energy), we demonstrate the existence of a critical phase transition point. Below this threshold, memory retrieval is thermodynamically favorable; above it, generative computation becomes the optimal strategy. This "Energy-Time-Space" conservation law provides a physical explanation for the efficiency of generative models and offers a rigorous mathematical bound for designing next-generation, energy-efficient AI architectures. Our findings suggest that the minimization of Derivation Entropy is a governing principle for the evolution of both biological and artificial intelligence.

Updated: 2025-11-24 14:24:08

Categories: cs.IT,cs.AI,cs.LO

Download: http://arxiv.org/abs/2511.19156v1

Persistent BitTorrent Trackers

Private BitTorrent trackers enforce upload-to-download ratios to prevent free-riding, but suffer from three critical weaknesses: reputation cannot move between trackers, centralized servers create single points of failure, and upload statistics are self-reported and unverifiable. When a tracker shuts down (whether by operator choice, technical failure, or legal action), users lose their contribution history and cannot prove their standing to new communities. We address these problems by storing reputation in smart contracts and replacing self-reports with cryptographic attestations. Receiving peers sign receipts for transferred pieces, which the tracker aggregates and verifies before updating on-chain reputation. Trackers run in Trusted Execution Environments (TEEs) to guarantee correct aggregation and prevent manipulation of state. If a tracker is unavailable, peers use an authenticated Distributed Hash Table (DHT) for discovery: the on-chain reputation acts as a Public Key Infrastructure (PKI), so peers can verify each other and maintain access control without the tracker. This design persists reputation across tracker failures and makes it portable to new instances through single-hop migration in factory-deployed contracts. We formalize the security requirements, prove correctness under standard cryptographic assumptions, and evaluate a prototype on Intel TDX. Measurements show that transfer receipts add less than 6% overhead with typical piece sizes, and signature aggregation speeds up verification by $2.5\times$.
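The receipt flow described above (receivers attest to pieces, the tracker verifies and aggregates) can be sketched roughly as follows. This is an illustration under loud assumptions: HMAC-SHA256 from the Python standard library stands in for the public-key signatures a real deployment would need, and the record fields, key table, and function names are hypothetical and not taken from the paper.

```python
import hashlib
import hmac


def sign_receipt(receiver_key, sender_id, piece_hash, num_bytes):
    """Receiver attests that it got num_bytes of a piece from sender_id.

    HMAC is only a stand-in here; it is not a public-key signature.
    """
    message = f"{sender_id}|{piece_hash}|{num_bytes}".encode()
    return hmac.new(receiver_key, message, hashlib.sha256).hexdigest()


def aggregate_uploads(receipts, receiver_keys):
    """Sum verified bytes per sender, skipping receipts that fail to verify."""
    totals = {}
    for r in receipts:
        key = receiver_keys[r["receiver_id"]]
        expected = sign_receipt(key, r["sender_id"], r["piece_hash"], r["num_bytes"])
        if hmac.compare_digest(expected, r["tag"]):
            totals[r["sender_id"]] = totals.get(r["sender_id"], 0) + r["num_bytes"]
    return totals


keys = {"bob": b"bob-secret"}  # hypothetical receiver key table
good = {"receiver_id": "bob", "sender_id": "alice",
        "piece_hash": "ab12", "num_bytes": 262144}
good["tag"] = sign_receipt(keys["bob"], "alice", "ab12", 262144)
forged = dict(good, num_bytes=10 * 262144)  # inflated claim reusing the old tag
totals = aggregate_uploads([good, forged], keys)
```

Only the receipt whose tag matches its contents is counted, so a sender cannot inflate its upload total by editing a receipt after the fact.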

Updated: 2025-11-24 14:24:05

Categories: cs.CR

Download: http://arxiv.org/abs/2511.17260v2

EEG-VLM: A Hierarchical Vision-Language Model with Multi-Level Feature Alignment and Visually Enhanced Language-Guided Reasoning for EEG Image-Based Sleep Stage Prediction

Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time-frequency patterns and achieve clinical interpretability. Recently, vision-language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision-language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.

Updated: 2025-11-24 14:23:42

Categories: cs.AI

Download: http://arxiv.org/abs/2511.19155v1

Online Sparse Feature Selection in Data Streams via Differential Evolution

The processing of high-dimensional streaming data commonly utilizes online streaming feature selection (OSFS) techniques. However, practical implementations often face challenges with data incompleteness due to equipment failures and technical constraints. Online Sparse Streaming Feature Selection (OS2FS) tackles this issue through latent factor analysis-based missing data imputation. Despite this advancement, existing OS2FS approaches exhibit substantial limitations in feature evaluation, resulting in performance deterioration. To address these shortcomings, this paper introduces a novel Online Differential Evolution for Sparse Feature Selection (ODESFS) in data streams, incorporating two key innovations: (1) missing value imputation using a latent factor analysis model, and (2) feature importance evaluation through differential evolution. Comprehensive experiments conducted on six real-world datasets demonstrate that ODESFS consistently outperforms state-of-the-art OSFS and OS2FS methods by selecting optimal feature subsets and achieving superior accuracy.

Updated: 2025-11-24 14:19:51

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19555v1

ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification

Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925,000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained labels at a global scale, taxonomically mapped to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.
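The difference between the two evaluation settings can be made concrete with a small sketch. The record layout (`source`, `image`), the 20% test fraction, and the function names are assumptions for illustration; the abstract does not specify how ReefNet actually partitions its data.

```python
import random


def within_source_split(records, test_fraction=0.2, seed=0):
    """Hold out a share of every source, so each source is in train and test."""
    rng = random.Random(seed)
    by_source = {}
    for rec in records:
        by_source.setdefault(rec["source"], []).append(rec)
    train, test = [], []
    for source in sorted(by_source):
        items = by_source[source][:]
        rng.shuffle(items)
        n_test = max(1, int(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def cross_source_split(records, held_out_sources):
    """Withhold entire sources, so test domains are never seen in training."""
    held = set(held_out_sources)
    train = [r for r in records if r["source"] not in held]
    test = [r for r in records if r["source"] in held]
    return train, test


# Hypothetical toy data: three sources with five images each.
records = [{"source": s, "image": f"{s}_{i}"}
           for s in ("s1", "s2", "s3") for i in range(5)]
train_w, test_w = within_source_split(records)
train_c, test_c = cross_source_split(records, ["s3"])
```

In the cross-source setting the model never sees imagery from the held-out source, which is why it probes domain generalization while the within-source setting does not.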

Updated: 2025-11-24 14:18:04

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.16822v2

Masked Diffusion Models are Secretly Learned-Order Autoregressive Models

Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between decoding order and the multivariate noise schedule and show that this setting breaks the invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into weighted autoregressive losses over these orders, which establishes MDMs as autoregressive models with learnable orders.

Updated: 2025-11-24 14:17:56

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.19152v1

Feature Ranking in Credit-Risk with Qudit-Based Networks

In finance, predictive models must balance accuracy and interpretability, particularly in credit risk assessment, where model decisions carry material consequences. We present a quantum neural network (QNN) based on a single qudit, in which both data features and trainable parameters are co-encoded within a unified unitary evolution generated by the full Lie algebra. This design explores the entire Hilbert space while enabling interpretability through the magnitudes of the learned coefficients. We benchmark our model on a real-world, imbalanced credit-risk dataset from Taiwan. The proposed QNN consistently outperforms LR and reaches the results of random forest models in macro-F1 score while preserving a transparent correspondence between learned parameters and input feature importance. To quantify the interpretability of the proposed model, we introduce two complementary metrics: (i) the edit distance between the model's feature ranking and that of LR, and (ii) a feature-poisoning test where selected features are replaced with noise. Results indicate that the proposed quantum model achieves competitive performance while offering a tractable path toward interpretable quantum learning.
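Metric (i) can be illustrated with a standard Levenshtein distance over two ranked feature lists. The abstract does not state which edit-distance variant is used, so the choice of Levenshtein is an assumption, and the feature names below are hypothetical.

```python
def ranking_edit_distance(rank_a, rank_b):
    """Levenshtein distance between two feature rankings (lists of names)."""
    prev = list(range(len(rank_b) + 1))
    for i, a in enumerate(rank_a, start=1):
        curr = [i]
        for j, b in enumerate(rank_b, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical rankings that differ by one adjacent swap.
qnn_rank = ["pay_status", "limit", "age", "bill"]
lr_rank = ["pay_status", "age", "limit", "bill"]
distance = ranking_edit_distance(qnn_rank, lr_rank)
```

A small distance means the two models order the features almost identically; an adjacent swap costs two edits under this variant because transpositions are not a primitive operation.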

Updated: 2025-11-24 14:15:57

Categories: quant-ph,cs.LG

Download: http://arxiv.org/abs/2511.19150v1

From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation

This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners, which struggle with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline for comparison. Experimental results show that the YOLO detector obtains a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate that retrieval-augmented generation is an effective and interpretable paradigm for automated, visually grounded fashion content generation.
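The dominant-color step can be sketched with a tiny k-means over RGB pixels. This is a minimal pure-Python illustration, assuming `k=2`, a fixed iteration count, and a deterministic initialization from the first distinct colors; none of these settings come from the paper, and the pixel values are made up.

```python
def dominant_color(pixels, k=2, iterations=10):
    """Return the centroid of the largest k-means cluster of RGB pixels."""
    # Deterministic initialization: the first k distinct colors.
    centers = []
    for p in pixels:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each pixel to its nearest center.
        clusters = [[] for _ in centers]
        for p in pixels:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    sizes = [len(cl) for cl in clusters]
    return centers[sizes.index(max(sizes))]


# Hypothetical garment crop: six reddish pixels and three blue ones.
pixels = ([(200, 10, 10)] * 3) + ([(10, 10, 200)] * 3) + ([(220, 30, 30)] * 3)
color = dominant_color(pixels)
```

In a real pipeline the pixels would come from the detector's garment crop, and the resulting centroid would be mapped to a color name before being placed in the evidence pack.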

Updated: 2025-11-24 14:13:57

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2511.19149v1

Collaborative Learning with Multiple Foundation Models for Source-Free Domain Adaptation

Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI), which selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks (Office-31, Office-Home, DomainNet-126, and VisDA) under the closed-set setting, while also achieving the best results on partial-set and open-set variants.

Updated: 2025-11-24 14:12:22

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2511.19147v1

Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?

Large Language Models (LLMs) have recently been leveraged for asset pricing tasks and stock trading applications, enabling AI agents to generate investment decisions from unstructured financial data. However, most evaluations of LLM timing-based investing strategies are conducted on narrow timeframes and limited stock universes, overstating effectiveness due to survivorship and data-snooping biases. We critically assess their generalizability and robustness by proposing FINSABER, a backtesting framework that evaluates timing-based strategies over longer periods and a larger universe of symbols. Systematic backtests over two decades and 100+ symbols reveal that previously reported LLM advantages deteriorate significantly under a broader cross-section and a longer-term evaluation. Our market regime analysis further demonstrates that LLM strategies are overly conservative in bull markets, underperforming passive benchmarks, and overly aggressive in bear markets, incurring heavy losses. These findings highlight the need to develop LLM strategies that prioritise trend detection and regime-aware risk controls over mere scaling of framework complexity.

Updated: 2025-11-24 14:03:22

Categories: q-fin.TR,cs.AI,cs.CE

Download: http://arxiv.org/abs/2505.07078v4

Quantifying Behavioral Dissimilarity Between Mathematical Expressions

Quantifying the similarity between mathematical expressions is a fundamental problem in computational mathematics, symbolic reasoning, and scientific discovery. While behavioral notions of similarity have previously been explored in the context of software and program analysis, existing measures for mathematical expressions rely primarily on syntactic form, assessing similarity through symbolic structure rather than actual behavior. Yet syntactically distinct expressions can exhibit nearly identical outputs, while structurally similar ones may behave very differently, especially when the expressions contain free parameters that define families of functions. To address these limitations, we introduce Behavior-aware Expression Dissimilarity (BED), a principled framework for quantifying behavioral distance between mathematical expressions with free parameters. BED represents expressions as joint probability distributions over their input-output pairs and applies the Wasserstein distance to measure behavioral dissimilarity. A computationally efficient stochastic approximation is proposed and shown to be consistent, robust, and capable of inducing a smoother, more meaningful structure over the space of expressions than syntax-based measures. The approach provides a foundation for behavior-based comparison, clustering, and learning of mathematical expressions, with potential direct applications in equation discovery, symbolic regression, and neuro-symbolic modeling.
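A heavily simplified version of the behavioral comparison can be sketched as below. BED as described works with joint input-output distributions and free parameters; this sketch only compares the output distributions of two parameter-free expressions on shared random inputs, using the closed form of the one-dimensional Wasserstein-1 distance for equal-size samples. The function names, sampling range, and sample count are assumptions.

```python
import random


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # For equal-size 1-D samples, the optimal coupling pairs sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def behavioral_distance(f, g, n_samples=1000, seed=0):
    """Compare the output distributions of f and g on shared random inputs."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    return wasserstein_1d([f(x) for x in inputs], [g(x) for x in inputs])


# Syntactically different, behaviorally identical.
same = behavioral_distance(lambda x: x + x, lambda x: 2 * x)
# Syntactically similar, behaviorally different.
different = behavioral_distance(lambda x: x * x, lambda x: x * x * x)
```

Because this simplification ignores how outputs pair with inputs, it can report a small distance for functions such as `x` and `-x` on a symmetric range, which is one reason to work with joint input-output distributions as the abstract describes.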

Updated: 2025-11-24 13:56:56

Categories: cs.AI

Download: http://arxiv.org/abs/2408.11515v2

Uncertainty-Aware Deep Learning Framework for Remaining Useful Life Prediction in Turbofan Engines with Learned Aleatoric Uncertainty

Accurate Remaining Useful Life (RUL) prediction coupled with uncertainty quantification remains a critical challenge in aerospace prognostics. This research introduces a novel uncertainty-aware deep learning framework that learns aleatoric uncertainty directly through probabilistic modeling, an approach unexplored in existing CMAPSS-based literature. Our hierarchical architecture integrates multi-scale Inception blocks for temporal pattern extraction, bidirectional Long Short-Term Memory networks for sequential modeling, and a dual-level attention mechanism operating simultaneously on sensor and temporal dimensions. The innovation lies in the Bayesian output layer that predicts both mean RUL and variance, enabling the model to learn data-inherent uncertainty. Comprehensive preprocessing employs condition-aware clustering, wavelet denoising, and intelligent feature selection. Experimental validation on NASA CMAPSS benchmarks (FD001-FD004) demonstrates competitive overall performance with RMSE values of 16.22, 19.29, 16.84, and 19.98 respectively. Remarkably, our framework achieves breakthrough critical zone performance (RUL <= 30 cycles) with RMSE of 5.14, 6.89, 5.27, and 7.16, representing 25-40 percent improvements over conventional approaches and establishing new benchmarks for safety-critical predictions. The learned uncertainty provides well-calibrated 95 percent confidence intervals with coverage ranging from 93.5 percent to 95.2 percent, enabling risk-aware maintenance scheduling previously unattainable in CMAPSS literature.
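An output layer that predicts both a mean and a variance is commonly trained with a Gaussian negative log-likelihood. The abstract does not state the loss, so the sketch below is an assumption: it shows the per-sample loss and the 95 percent interval implied by a predicted mean and log-variance. The function names are hypothetical.

```python
import math


def gaussian_nll(y_true, mean, log_var):
    """Negative log-likelihood of y_true under N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (y_true - mean) ** 2 / var)


def interval_95(mean, log_var):
    """Central 95 percent interval of the predicted Gaussian."""
    std = math.exp(0.5 * log_var)
    return mean - 1.96 * std, mean + 1.96 * std


# Predicted RUL of 100 cycles with a standard deviation of 5 cycles.
low, high = interval_95(100.0, math.log(25.0))
```

Predicting the log-variance rather than the variance keeps the variance positive without a constraint, which is a common design choice for this kind of output head.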

Updated: 2025-11-24 13:53:31

Field: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19124v1

On the Optimality of Discrete Object Naming: a Kinship Case Study

The structure of naming systems in natural languages hinges on a trade-off between high informativeness and low complexity. Prior work capitalizes on information theory to formalize these notions; however, these studies generally rely on two simplifications: (i) optimal listeners, and (ii) universal communicative need across languages. Here, we address these limitations by introducing an information-theoretic framework for discrete object naming systems, and we use it to prove that an optimal trade-off is achievable if and only if the listener's decoder is equivalent to the Bayesian decoder of the speaker. Adopting a referential game setup from emergent communication, and focusing on the semantic domain of kinship, we show that our notion of optimality is not only theoretically achievable but also emerges empirically in learned communication systems.
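The Bayesian decoder of a speaker follows from Bayes' rule: given a prior over referents p(m) and the speaker's naming distribution p(w | m), the listener's posterior is p(m | w) proportional to p(w | m) p(m). The sketch below is an illustration only; the kinship terms and probabilities are made up.

```python
def bayesian_decoder(prior, encoder):
    """Return p(m | w) for every word w.

    prior:   dict mapping meaning m to p(m)
    encoder: dict mapping meaning m to a dict of word w to p(w | m)
    """
    joint = {}
    for m, p_m in prior.items():
        for w, p_w in encoder[m].items():
            joint.setdefault(w, {})[m] = p_m * p_w
    decoder = {}
    for w, row in joint.items():
        z = sum(row.values())
        if z > 0:
            decoder[w] = {m: v / z for m, v in row.items()}
    return decoder


# Hypothetical toy naming system: one word covers two referents.
prior = {"mother": 0.4, "father": 0.4, "aunt": 0.2}
encoder = {
    "mother": {"parent": 1.0},
    "father": {"parent": 1.0},
    "aunt": {"aunt": 1.0},
}
decoder = bayesian_decoder(prior, encoder)
```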

Updated: 2025-11-24 13:49:31

Field: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.19120v1

AI Consciousness and Existential Risk

In AI, existential risk denotes the hypothetical threat posed by an artificial system that would possess both the capability and the objective, either directly or indirectly, to eradicate humanity. This issue is gaining prominence in scientific debate due to recent technical advancements and increased media coverage. In parallel, AI progress has sparked speculation and studies about the potential emergence of artificial consciousness. The two questions, AI consciousness and existential risk, are sometimes conflated, as if the former entailed the latter. Here, I explain that this view stems from a common confusion between consciousness and intelligence. Yet these two properties are empirically and theoretically distinct. Arguably, while intelligence is a direct predictor of an AI system's existential threat, consciousness is not. There are, however, certain incidental scenarios in which consciousness could influence existential risk, in either direction. Consciousness could be viewed as a means towards AI alignment, thereby lowering existential risk; or it could be a precondition for reaching certain capabilities or levels of intelligence, and thus positively related to existential risk. Recognizing these distinctions can help AI safety researchers and public policymakers focus on the most pressing issues.

Updated: 2025-11-24 13:48:02

Field: cs.AI,cs.CY

Download: http://arxiv.org/abs/2511.19115v1

Physics-informed Neural Operator Learning for Nonlinear Grad-Shafranov Equation

As artificial intelligence emerges as a transformative enabler for fusion energy commercialization, fast and accurate solvers become increasingly critical. In magnetic confinement nuclear fusion, rapid and accurate solution of the Grad-Shafranov equation (GSE) is essential for real-time plasma control and analysis. Traditional numerical solvers achieve high precision but are computationally prohibitive, while data-driven surrogates infer quickly but fail to enforce physical laws and generalize poorly beyond training distributions. To address this challenge, we present a Physics-Informed Neural Operator (PINO) that directly learns the GSE solution operator, mapping shape parameters of the last closed flux surface to equilibrium solutions for realistic nonlinear current profiles. Comprehensive benchmarking of five neural architectures identifies the novel Transformer-KAN (Kolmogorov-Arnold Network) Neural Operator (TKNO) as achieving the highest accuracy (0.25% mean L2 relative error) under supervised (purely data-driven) training. However, all data-driven models exhibit large physics residuals, indicating poor physical consistency. Our unsupervised training can reduce the residuals by nearly four orders of magnitude by embedding physics-based loss terms, without labeled data. Critically, semi-supervised learning, which integrates sparse labeled data (100 interior points) with physics constraints, achieves the best balance: 0.48% interpolation error and the most robust extrapolation performance (4.76% error, 8.9x degradation factor vs 39.8x for supervised models). Accelerated by TensorRT optimization, our models enable millisecond-level inference, establishing PINO as a promising pathway for next-generation fusion control systems.

Updated: 2025-11-24 13:46:38

Field: physics.plasm-ph,cs.AI

Download: http://arxiv.org/abs/2511.19114v1

MoveGPT: Scaling Mobility Foundation Models with Spatially-Aware Mixture of Experts

The success of foundation models in language has inspired a new wave of general-purpose models for human mobility. However, existing approaches struggle to scale effectively due to two fundamental limitations: a failure to use meaningful basic units to represent movement, and an inability to capture the vast diversity of patterns found in large-scale data. In this work, we develop MoveGPT, a large-scale foundation model specifically architected to overcome these barriers. MoveGPT is built upon two key innovations: (1) a unified location encoder that maps geographically disjoint locations into a shared semantic space, enabling pre-training on a global scale; and (2) a Spatially-Aware Mixture-of-Experts Transformer that develops specialized experts to efficiently capture diverse mobility patterns. Pre-trained on billion-scale datasets, MoveGPT establishes a new state-of-the-art across a wide range of downstream tasks, achieving performance gains of up to 35% on average. It also demonstrates strong generalization capabilities to unseen cities. Crucially, our work provides empirical evidence of scaling ability in human mobility, validating a clear path toward building increasingly capable foundation models in this domain.

Updated: 2025-11-24 13:44:50

Field: cs.AI

Download: http://arxiv.org/abs/2505.18670v3

The Core in Max-Loss Non-Centroid Clustering Can Be Empty

We study core stability in non-centroid clustering under the max-loss objective, where each agent's loss is the maximum distance to other members of their cluster. We prove that for all $k\geq 3$ there exist metric instances with $n\geq 9$ agents, with $n$ divisible by $k$, for which no clustering lies in the $\alpha$-core for any $\alpha<2^{\frac{1}{5}}\approx 1.148$. The bound is tight for our construction. Using a computer-aided proof, we also identify a two-dimensional Euclidean point set whose associated lower bound is slightly smaller than that of our general construction. This is, to our knowledge, the first impossibility result showing that the core can be empty in non-centroid clustering under the max-loss objective.
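A brute-force check makes the notion of an empty core concrete on small instances. The abstract does not spell out the definition, so the sketch assumes the usual one from the proportional-fairness clustering literature: a set S of at least n/k agents blocks a clustering if every agent in S would have its max-loss reduced by more than a factor of alpha by forming S. The function names and the toy instance are hypothetical.

```python
import math
from itertools import combinations


def max_loss(i, cluster, dist):
    """Maximum distance from agent i to the other members of the cluster."""
    return max((dist[i][j] for j in cluster if j != i), default=0.0)


def find_blocking_coalition(clusters, dist, k, alpha):
    """Return a blocking coalition for the clustering, or None if none exists.

    Assumed definition: S with |S| >= n/k blocks if
    alpha * loss_i(S) < loss_i(current cluster) for every i in S.
    Exponential in n, so only suitable for tiny instances.
    """
    n = len(dist)
    current = {i: max_loss(i, c, dist) for c in clusters for i in c}
    for size in range(math.ceil(n / k), n + 1):
        for coalition in combinations(range(n), size):
            if all(alpha * max_loss(i, coalition, dist) < current[i]
                   for i in coalition):
                return coalition
    return None


# Toy instance: four agents on a line at positions 0, 1, 10, 11, with k = 2.
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in positions] for p in positions]

bad = find_blocking_coalition([[0, 2], [1, 3]], dist, k=2, alpha=1.0)
good = find_blocking_coalition([[0, 1], [2, 3]], dist, k=2, alpha=1.0)
```

Here the clustering that pairs distant agents is blocked by the two nearby agents, while the natural clustering has no blocking coalition.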

Updated: 2025-11-24 13:42:43

Field: cs.LG,cs.AI,cs.GT,stat.ML

Download: http://arxiv.org/abs/2511.19107v1

Edge-Based Predictive Data Reduction for Smart Agriculture: A Lightweight Approach to Efficient IoT Communication

The rapid growth of IoT devices has led to an enormous amount of sensor data that requires transmission to cloud servers for processing, resulting in excessive network congestion, increased latency and high energy consumption. This is particularly problematic in resource-constrained and remote environments where bandwidth is limited and battery-dependent devices exacerbate the problem. Moreover, in domains such as agriculture, consecutive sensor readings often have minimal variation, making continuous data transmission inefficient and unnecessarily resource intensive. To overcome these challenges, we propose an analytical prediction algorithm designed for edge computing environments and validated through simulation. The proposed solution utilizes a predictive filter at the network edge that forecasts the next sensor data point and triggers data transmission only when the deviation from the predicted value exceeds a predefined tolerance. A complementary cloud-based model ensures data integrity and overall system consistency. This dual-model strategy effectively reduces communication overhead and demonstrates potential for improving energy efficiency by minimizing redundant transmissions. In addition to reducing communication load, our approach leverages both in situ and satellite observations from the same locations to enhance model robustness. It also supports cross-site generalization, enabling models trained in one region to be effectively deployed elsewhere without retraining. This makes our solution highly scalable, energy-aware, and well-suited for optimizing sensor data transmission in remote and bandwidth-constrained IoT environments.
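The transmit-only-on-deviation rule can be sketched in a few lines. The abstract does not specify the predictive filter, so this sketch assumes the simplest possible stand-in, a last-value (persistence) predictor that the edge and the cloud both run; the function name and the readings are hypothetical.

```python
def reduce_stream(readings, tolerance):
    """Return the (index, value) pairs that the edge device transmits.

    The edge and the cloud share the same predictor state. A reading is sent
    only when it deviates from the prediction by more than the tolerance;
    otherwise both sides keep the predicted value.
    """
    sent = []
    last_known = None  # value both sides currently agree on
    for i, value in enumerate(readings):
        predicted = last_known  # persistence predictor (an assumption)
        if predicted is None or abs(value - predicted) > tolerance:
            sent.append((i, value))
            last_known = value
        else:
            last_known = predicted
    return sent


# Hypothetical soil-temperature readings with one abrupt change.
readings = [20.0, 20.1, 20.2, 20.3, 25.0, 25.1]
transmitted = reduce_stream(readings, tolerance=0.5)
# transmitted == [(0, 20.0), (4, 25.0)]
```

Because the cloud replays the same predictor, its reconstruction stays within the tolerance of every reading that was not transmitted.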

Updated: 2025-11-24 13:37:33

Field: cs.LG

Download: http://arxiv.org/abs/2511.19103v1

Extracting Robust Register Automata from Neural Networks over Data Sequences

Automata extraction is a method for synthesising interpretable surrogates for black-box neural models that can be analysed symbolically. Existing techniques assume a finite input alphabet, and thus are not directly applicable to data sequences drawn from continuous domains. We address this challenge with deterministic register automata (DRAs), which extend finite automata with registers that store and compare numeric values. Our main contribution is a framework for robust DRA extraction from black-box models: we develop a polynomial-time robustness checker for DRAs with a fixed number of registers, and combine it with passive and active automata learning algorithms. This combination yields surrogate DRAs with statistical robustness and equivalence guarantees. As a key application, we use the extracted automata to assess the robustness of neural networks: for a given sequence and distance metric, the DRA either certifies local robustness or produces a concrete counterexample. Experiments on recurrent neural networks and transformer architectures show that our framework reliably learns accurate automata and enables principled robustness evaluation. Overall, our results demonstrate that robust DRA extraction effectively bridges neural network interpretability and formal reasoning without requiring white-box access to the underlying network.
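A deterministic register automaton can be pictured as a finite automaton whose transitions compare the current input with a stored value. The sketch below is a hypothetical one-register encoding chosen for illustration; it is not the paper's formalism, its extraction procedure, or its robustness checker.

```python
import operator

GUARDS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


def run_dra(transitions, initial, accepting, sequence):
    """Run a one-register automaton on a sequence of numbers.

    transitions maps a state to a list of (guard, store, next_state), where
    guard is "any", "<", "=" or ">" (input compared with the register) and
    store says whether the input overwrites the register. Guards of one state
    are assumed mutually exclusive, which keeps the automaton deterministic.
    """
    state, register = initial, None
    for value in sequence:
        for guard, store, next_state in transitions[state]:
            enabled = guard == "any" or (
                register is not None and GUARDS[guard](value, register)
            )
            if enabled:
                if store:
                    register = value
                state = next_state
                break
        else:
            return False  # no enabled transition: reject
    return state in accepting


# Accepts exactly the non-empty strictly increasing sequences.
increasing = {
    "start": [("any", True, "run")],
    "run": [(">", True, "run")],
}
```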

Updated: 2025-11-24 13:36:45

Field: cs.AI,cs.FL,cs.LG

Download: http://arxiv.org/abs/2511.19100v1

Principled Coarse-Grained Acceptance for Speculative Decoding in Speech

Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model's embedding space. By splitting each token's probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to accelerate speech token generation while maintaining speech quality.
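Group-level acceptance can be sketched as the standard speculative-decoding acceptance rule applied to coarse-grained distributions. The abstract does not say how a token's mass is divided among the overlapping groups that contain it, so the equal split below is an assumption, and the groups and probabilities are made up.

```python
def coarse_grain(token_probs, groups):
    """Map a token distribution to a distribution over (overlapping) groups.

    Each token's mass is split equally across the groups that contain it
    (an assumption). Every token must belong to at least one group, for
    example a singleton group, or its mass is dropped.
    """
    owners = {}
    for group, members in groups.items():
        for token in members:
            owners.setdefault(token, []).append(group)
    group_probs = {group: 0.0 for group in groups}
    for token, p in token_probs.items():
        for group in owners.get(token, []):
            group_probs[group] += p / len(owners[token])
    return group_probs


def group_acceptance_prob(draft_probs, target_probs, groups, group):
    """min(1, p_target(group) / p_draft(group)) for a proposed group."""
    q = coarse_grain(draft_probs, groups)[group]
    p = coarse_grain(target_probs, groups)[group]
    return 1.0 if q == 0 else min(1.0, p / q)


# Hypothetical acoustic tokens a, b, c; token b belongs to both groups.
groups = {"g1": ["a", "b"], "g2": ["b", "c"]}
draft = {"a": 0.5, "b": 0.4, "c": 0.1}
target = {"a": 0.2, "b": 0.4, "c": 0.4}

accept_g1 = group_acceptance_prob(draft, target, groups, "g1")
accept_a_exact = min(1.0, target["a"] / draft["a"])
```

In this toy case the group containing token a is accepted with probability about 0.57, compared with 0.4 under exact token matching, which illustrates why coarse-graining can raise acceptance rates.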

Updated: 2025-11-24 13:32:45

Field: eess.AS,cs.LG

Download: http://arxiv.org/abs/2511.13732v2

Optimization of Deep Learning Models for Dynamic Market Behavior Prediction

The advent of financial technology has witnessed a surge in the utilization of deep learning models to anticipate consumer conduct, a trend that has demonstrated considerable potential in enhancing lending strategies and bolstering market efficiency. We study multi-horizon demand forecasting on e-commerce transactions using the UCI Online Retail II dataset. Unlike prior versions of this manuscript that mixed financial-loan narratives with retail data, we focus exclusively on retail market behavior and define a clear prediction target: per-SKU daily demand (or revenue) for horizons H=1,7,14. We present a hybrid sequence model that combines multi-scale temporal convolutions, a gated recurrent module, and time-aware self-attention. The model is trained with standard regression losses and evaluated under MAE, RMSE, sMAPE, MASE, and Theil's U_2 with strict time-based splits to prevent leakage. We benchmark against ARIMA/Prophet, LSTM/GRU, LightGBM, and state-of-the-art Transformer forecasters (TFT, Informer, Autoformer, N-BEATS). Results show consistent accuracy gains and improved robustness on peak/holiday periods. We further provide ablations and statistical significance tests to ensure the reliability of improvements, and we release implementation details to facilitate reproducibility.
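Two of the listed metrics are easy to misread, so a small sketch may help. sMAPE has several variants in the literature; the one below (mean of 2|y - yhat| / (|y| + |yhat|), reported as a fraction) is an assumption, as is the seasonal-naive scaling used for MASE. The function names and numbers are illustrative.

```python
def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE as a fraction; a term is 0 when both values are 0."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        terms.append(0.0 if denom == 0 else 2 * abs(a - f) / denom)
    return sum(terms) / len(terms)


def mase(actual, forecast, train, season=1):
    """MAE scaled by the in-sample MAE of a (seasonal) naive forecast."""
    naive_errors = [abs(train[t] - train[t - season])
                    for t in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast has zero error on the training data")
    return mae(actual, forecast) / scale


# Hypothetical daily demand for one SKU.
train = [1.0, 3.0, 5.0, 7.0]
actual = [10.0, 20.0]
forecast = [12.0, 18.0]
score = mase(actual, forecast, train)
```

A MASE below 1 means the model beats the naive forecast on average, which makes it comparable across SKUs with very different demand levels.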

Updated: 2025-11-24 13:30:52

Field: cs.LG

Download: http://arxiv.org/abs/2511.19090v1

EnfoPath: Energy-Informed Analysis of Generative Trajectories in Flow Matching

Flow-based generative models synthesize data by integrating a learned velocity field from a reference distribution to the target data distribution. Prior work has focused on endpoint metrics (e.g., fidelity, likelihood, perceptual quality) while overlooking a deeper question: what do the sampling trajectories reveal? Motivated by classical mechanics, we introduce kinetic path energy (KPE), a simple yet powerful diagnostic that quantifies the total kinetic effort along each generation path of ODE-based samplers. Through comprehensive experiments on CIFAR-10 and ImageNet-256, we uncover two key phenomena: (i) higher KPE predicts stronger semantic quality, indicating that semantically richer samples require greater kinetic effort, and (ii) higher KPE inversely correlates with data density, with informative samples residing in sparse, low-density regions. Together, these findings reveal that semantically informative samples naturally reside on the sparse frontier of the data distribution, demanding greater generative effort. Our results suggest that trajectory-level analysis offers a physics-inspired and interpretable framework for understanding generation difficulty and sample characteristics.
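A trajectory-level energy of this kind can be accumulated while the sampler integrates the ODE. The abstract does not give the exact formula, so the sketch assumes the classical form, the time integral of one half of the squared velocity norm, and uses a plain Euler sampler with a hypothetical velocity function.

```python
def kinetic_path_energy(velocity, x0, n_steps=100):
    """Euler-integrate dx/dt = velocity(x, t) on [0, 1] and accumulate energy.

    The energy is the Riemann sum of 0.5 * ||v||^2 * dt along the path
    (an assumed definition). Returns the final point and the energy.
    """
    x = list(x0)
    dt = 1.0 / n_steps
    energy = 0.0
    for step in range(n_steps):
        t = step * dt
        v = velocity(x, t)
        energy += 0.5 * sum(c * c for c in v) * dt
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x, energy


# Hypothetical constant velocity field: a straight path of length 5.
end_point, energy = kinetic_path_energy(lambda x, t: [3.0, 4.0], [0.0, 0.0])
```

For a learned model, the velocity callable would wrap the network, so the diagnostic adds no extra function evaluations beyond those the sampler already makes.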

Updated: 2025-11-24 13:27:41

Field: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19087v1

Neural Scaling Laws for Deep Regression

Neural scaling laws, power-law relationships between generalization errors and characteristics of deep learning models, are vital tools for developing reliable models while managing limited resources. Although the success of large language models highlights the importance of these laws, their application to deep regression models remains largely unexplored. Here, we empirically investigate neural scaling laws in deep regression using a parameter estimation model for twisted van der Waals magnets. We observe power-law relationships between the loss and both training dataset size and model capacity across a wide range of values, employing various architectures, including fully connected networks, residual networks, and vision transformers. Furthermore, the scaling exponents governing these relationships range from 1 to 2, with specific values depending on the regressed parameters and model details. The consistent scaling behaviors and their large scaling exponents suggest that the performance of deep regression models can improve substantially with increasing data size.
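A scaling exponent is typically read off as the slope of a straight-line fit in log-log space. The sketch below fits loss = a * size^(-alpha) by ordinary least squares on the logarithms; the data are synthetic and the function name is hypothetical, so it illustrates the fitting step only, not the paper's measurements.

```python
import math


def fit_power_law(sizes, losses):
    """Fit losses ~ a * sizes**(-alpha) by least squares in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


# Synthetic losses generated with a = 5 and alpha = 1.5.
sizes = [100, 1000, 10000]
losses = [5.0 * s ** -1.5 for s in sizes]
a, alpha = fit_power_law(sizes, losses)
```

With an exponent of 1.5, a tenfold increase in data size reduces the loss by a factor of about 32, which is the sense in which exponents between 1 and 2 count as large.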

Updated: 2025-11-24 13:26:06

Field: cs.LG,cond-mat.other

Download: http://arxiv.org/abs/2509.10000v2

GiBy: A Giant-Step Baby-Step Classifier For Anomaly Detection In Industrial Control Systems

Continuous monitoring of the interactions between the cyber-physical components of an industrial control system (ICS) is required to secure the automation of system controls and to guarantee that plant processes are fail-safe and remain in an acceptably safe state. Safety is achieved by managing actuation (where electric signals are used to trigger physical movement) according to the corresponding sensor readings, which serve as ground truth in decision making. Timely detection of anomalies (attacks, faults, and unascertained states) in ICSs is crucial for the safe running of a plant, the safety of its personnel, and the safe provision of its services. We propose an anomaly detection method that involves accurate linearization of the non-linear forms arising from sensor-actuator(s) relationships, primarily because linear models are easier to solve and well understood. We demonstrate this using a well-known water treatment testbed as a use case. Our experiments show millisecond response times for detecting anomalies, all of which are explainable and traceable; this coupling of detection speed and explainability has not been achieved by other state-of-the-art Artificial Intelligence (AI)/Machine Learning (ML) models with eXplainable AI (XAI) used for the same purpose. Our method's explainability enables us to pinpoint the sensor(s) and the actuation state(s) for which an anomaly was detected. The proposed algorithm achieved an accuracy of 97.72% by flagging deviations within safe operation limits as non-anomalous, indicating that slower detectors with the highest detection resolution are unnecessary for systems whose safety boundaries provide leeway within safety limits.

Updated: 2025-11-24 13:25:33

Field: cs.CR,cs.LG

Download: http://arxiv.org/abs/2504.20906v2

GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.

Updated: 2025-11-24 13:18:21

Field: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.19078v1

Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification

Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK-separable with margin $\gamma$, we prove an excess risk rate of $\widetilde{O}(L^4 (1 + \gamma L^2) / (n \gamma^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.

Updated: 2025-11-24 13:14:48

Field: cs.LG

Download: http://arxiv.org/abs/2510.02779v2

Mathematical Insights into Protein Architecture: Persistent Homology and Machine Learning Applied to the Flagellar Motor

We present a machine learning approach that leverages persistent homology to classify bacterial flagellar motors into two functional states: rotated and stalled. By embedding protein structural data into a topological framework, we extract multiscale features from filtered simplicial complexes constructed over atomic coordinates. These topological invariants, specifically persistence diagrams and barcodes, capture critical geometric and connectivity patterns that correlate with motor function. The extracted features are vectorized and integrated into a machine learning pipeline that includes dimensionality reduction and supervised classification. Applied to a curated dataset of experimentally characterized flagellar motors from diverse bacterial species, our model demonstrates high classification accuracy and robustness to structural variation. This approach highlights the power of topological data analysis in revealing functionally relevant patterns beyond the reach of traditional geometric descriptors, offering a novel computational tool for protein function prediction.

Updated: 2025-11-24 13:12:10

Field: q-bio.BM,cs.LG,math.AT

Download: http://arxiv.org/abs/2504.16941v3

Structured Matching via Cost-Regularized Unbalanced Optimal Transport

Unbalanced optimal transport (UOT) provides a flexible way to match or compare nonnegative finite Radon measures. However, UOT requires a predefined ground transport cost, which may misrepresent the data's underlying geometry. Choosing such a cost is particularly challenging when datasets live in heterogeneous spaces, often motivating practitioners to adopt Gromov-Wasserstein formulations. To address this challenge, we introduce cost-regularized unbalanced optimal transport (CR-UOT), a framework that allows the ground cost to vary while allowing mass creation and removal. We show that CR-UOT incorporates unbalanced Gromov-Wasserstein type problems through families of inner-product costs parameterized by linear transformations, enabling the matching of measures or point clouds across Euclidean spaces. We develop algorithms for such CR-UOT problems using entropic regularization and demonstrate that this approach improves the alignment of heterogeneous single-cell omics profiles, especially when many cells lack direct matches.

Updated: 2025-11-24 13:11:27

Field: stat.ML,cs.LG,stat.AP

Download: http://arxiv.org/abs/2511.19075v1

Health App Reviews for Privacy & Trust (HARPT): A Corpus for Analyzing Patient Privacy Concerns, Trust in Providers and Trust in Applications

Background: User reviews of Telehealth and Patient Portal mobile applications (apps), hereafter referred to as electronic health (eHealth) apps, are a rich source of unsolicited patient feedback, revealing critical insights into patient perceptions. However, the lack of large-scale, annotated datasets specific to privacy and trust has limited researchers' ability to systematically analyze these concerns using natural language processing (NLP) techniques. Objective: This study aims to develop and benchmark Health App Reviews for Privacy & Trust (HARPT), a large-scale annotated corpus of patient reviews from eHealth apps, to advance research in patient privacy and trust. Methods: We employed a multistage data construction strategy that integrated keyword-based filtering, iterative manual labeling with review, targeted data augmentation, and weak supervision using transformer-based classifiers. A curated subset of 7,000 reviews was manually annotated to support machine learning model development and evaluation. The resulting dataset was used to benchmark a broad range of models. Results: The HARPT corpus comprises 480,000 patient reviews annotated across seven categories capturing critical aspects of trust in the application (TA), trust in the provider (TP), and privacy concerns (PC). We provide comprehensive benchmark performance for a range of machine learning models on the manually annotated subset, establishing a baseline for future research. Conclusions: The HARPT corpus is a significant resource for advancing the study of privacy and trust in the eHealth domain. By providing a large-scale, annotated dataset and initial benchmarks, this work supports reproducible research in usable privacy and trust within health informatics. HARPT is released under an open resource license.

Updated: 2025-11-24 13:11:18

Categories: cs.HC,cs.CR,cs.LG

Download: http://arxiv.org/abs/2506.19268v4

Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI

AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we call for AI-oriented RTC research, exploring the network requirement shift from "humans watching video" to "AI understanding video". We begin by recognizing the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we identify that ultra-low bitrate is a key factor for low latency. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat. DeViBench is open-sourced at: https://github.com/pku-netvideo/DeViBench.

Updated: 2025-11-24 13:08:12

Categories: cs.NI,cs.AI,cs.HC,cs.MM

Download: http://arxiv.org/abs/2507.10510v2

The Semiotic Channel Principle: Measuring the Capacity for Meaning in LLM Communication

This paper proposes a novel semiotic framework for analyzing Large Language Models (LLMs), conceptualizing them as stochastic semiotic engines whose outputs demand active, asymmetric human interpretation. We formalize the trade-off between expressive richness (semiotic breadth) and interpretive stability (decipherability) using information-theoretic tools. Breadth is quantified as source entropy, and decipherability as the mutual information between messages and human interpretations. We introduce a generative complexity parameter $\lambda$ that governs this trade-off, as both breadth and decipherability are functions of $\lambda$. The core trade-off is modeled as an emergent property of their distinct responses to $\lambda$. We define a semiotic channel, parameterized by audience and context, and posit a capacity constraint on meaning transmission, operationally defined as the maximum decipherability attained by optimizing $\lambda$. This reframing shifts analysis from opaque model internals to observable textual artifacts, enabling empirical measurement of breadth and decipherability. We demonstrate the framework's utility across four key applications: (i) model profiling; (ii) optimizing prompt/context design; (iii) risk analysis based on ambiguity; and (iv) adaptive semiotic systems. We conclude that this capacity-based semiotic approach offers a rigorous, actionable toolkit for understanding, evaluating, and designing LLM-mediated communication.
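
The two quantities named above can be computed directly from a joint distribution over messages and interpretations. The sketch below does so for three hypothetical values of the complexity parameter and picks the one that maximizes decipherability. The joint tables are invented for illustration, and the mapping from the parameter to a joint distribution is an assumption, not something the abstract specifies.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def mutual_information(joint):
    """I(M; R) in bits from a table joint[m][r] of joint probabilities."""
    p_m = [sum(row) for row in joint]
    p_r = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (p_m[i] * p_r[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


# Hypothetical joint tables P(message, interpretation), one per lambda value.
joints = {
    0.2: [[1.0]],                        # one message, always read the same way
    1.0: [[0.5, 0.0], [0.0, 0.5]],       # two messages, read without error
    2.0: [[0.125, 0.125]] * 4,           # four messages, readings are random
}

breadth = {lam: entropy([sum(row) for row in j]) for lam, j in joints.items()}
decipherability = {lam: mutual_information(j) for lam, j in joints.items()}

# Capacity in this toy setting: the best decipherability over lambda.
best_lambda = max(decipherability, key=decipherability.get)
capacity = decipherability[best_lambda]
```

Here breadth keeps growing with the parameter (0, 1, then 2 bits) while decipherability peaks at the middle setting, which is the shape of trade-off the framework describes.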

Updated: 2025-11-24 13:06:29

Categories: cs.IT,cs.AI

Download: http://arxiv.org/abs/2511.19550v1

DynaMix: Generalizable Person Re-identification via Dynamic Relabeling and Mixed Data Sampling

Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.

Updated: 2025-11-24 13:01:32

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.19067v1

Mitigating Participation Imbalance Bias in Asynchronous Federated Learning

In Asynchronous Federated Learning (AFL), the central server immediately updates the global model with each arriving client's contribution. As a result, clients perform their local training on different model versions, causing information staleness (delay). In federated environments with non-IID local data distributions, this asynchronous pattern amplifies the adverse effect of client heterogeneity (due to different data distribution, local objectives, etc.), as faster clients contribute more frequent updates, biasing the global model. We term this phenomenon heterogeneity amplification. Our work provides a theoretical analysis that maps AFL design choices to their resulting error sources when heterogeneity amplification occurs. Guided by our analysis, we propose ACE (All-Client Engagement AFL), which mitigates participation imbalance through immediate, non-buffered updates that use the latest information available from all clients. We also introduce a delay-aware variant, ACED, to balance client diversity against update staleness. Experiments on different models for different tasks across diverse heterogeneity and delay settings validate our analysis and demonstrate the robust performance of our approaches.
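
The participation-imbalance effect can be reproduced in a scalar toy simulation. The sketch below compares a server that applies only the arriving client's gradient with one that caches the latest gradient from every client and steps along their average. The second rule is one possible reading of "latest information available from all clients" and should not be taken as the authors' ACE algorithm; the function name, client targets, and arrival schedule are all hypothetical.

```python
def simulate(schedule, targets, all_client, lr=0.1):
    """Run a scalar asynchronous server loop and return the final model.

    Client c minimizes 0.5 * (w - targets[c]) ** 2 and computes its gradient
    on the (possibly stale) model snapshot it last downloaded.
    """
    w = 0.0
    snapshot = {c: w for c in targets}   # model version each client trains on
    latest = {c: 0.0 for c in targets}   # most recent gradient per client
    for c in schedule:
        grad = snapshot[c] - targets[c]  # stale local gradient
        if all_client:
            latest[c] = grad
            step = sum(latest.values()) / len(latest)  # average over all clients
        else:
            step = grad                  # arriving client only
        w -= lr * step
        snapshot[c] = w                  # client downloads the new model
    return w


targets = {"fast": 0.0, "slow": 10.0}
schedule = (["fast"] * 9 + ["slow"]) * 50  # fast client reports 9x more often
arrival_only = simulate(schedule, targets, all_client=False)
all_clients = simulate(schedule, targets, all_client=True)
```

The unbiased optimum of the two objectives is 5.0. In this toy run the arrival-only rule settles far closer to the fast client's target, while the all-client average ends up near 5.0.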

Updated: 2025-11-24 13:01:18

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19066v1

A DRL-Empowered Multi-Level Jamming Approach for Secure Semantic Communication

Semantic communication (SemCom) aims to transmit only task-relevant information, thereby improving communication efficiency but also exposing semantic information to potential eavesdropping. In this paper, we propose a deep reinforcement learning (DRL)-empowered multi-level jamming approach to enhance the security of SemCom systems over MIMO fading wiretap channels. This approach combines semantic layer jamming, achieved by encoding task-irrelevant text, and physical layer jamming, achieved by encoding random Gaussian noise. These two-level jamming signals are superposed with task-relevant semantic information to protect the transmitted semantics from eavesdropping. A deep deterministic policy gradient (DDPG) algorithm is further introduced to dynamically design and optimize the precoding matrices for both task-relevant semantic information and multi-level jamming signals, aiming to enhance the legitimate user's image reconstruction while degrading the eavesdropper's performance. To jointly train the SemCom model and the DDPG agent, we propose an alternating optimization strategy where the two modules are updated iteratively. Experimental results demonstrate that, compared with both the encryption-based (ESCS) and encoded jammer-based (EJ) benchmarks, our method achieves comparable security while improving the legitimate user's peak signal-to-noise ratio (PSNR) by up to approximately 0.6 dB.

Updated: 2025-11-24 13:00:48

Categories: cs.CR

Download: http://arxiv.org/abs/2510.26610v2

Understanding, Accelerating, and Improving MeanFlow Training

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
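
For readers unfamiliar with the two velocity fields, the relation between them can be written out as follows. The notation is an assumption based on the original MeanFlow formulation and is not taken from this abstract: $v$ denotes the instantaneous velocity, $u$ the average velocity over the interval $[r, t]$, and $z_t$ the state at time $t$.

```latex
% Average velocity as the time-average of the instantaneous velocity
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Multiplying by (t - r) and differentiating with respect to t gives
u(z_t, r, t) + (t - r)\,\frac{d}{dt} u(z_t, r, t) \;=\; v(z_t, t)

% so the training target for u depends on v, and the two coincide as r \to t
u(z_t, r, t) \;=\; v(z_t, t) - (t - r)\,\frac{d}{dt} u(z_t, r, t),
\qquad \lim_{r \to t} u(z_t, r, t) = v(z_t, t)
```

Read this way, observation (i) is plausible: the target for the average velocity is built from the instantaneous velocity, and the correction term scales with the temporal gap $t - r$.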

Updated: 2025-11-24 12:59:27

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.19065v1

Inferring response times of perceptual decisions with Poisson variational autoencoders

Many properties of perceptual decision making are well-modeled by deep neural networks. However, such architectures typically treat decisions as instantaneous readouts, overlooking the temporal dynamics of the decision process. We present an image-computable model of perceptual decision making in which choices and response times arise from efficient sensory encoding and Bayesian decoding of neural spiking activity. We use a Poisson variational autoencoder to learn unsupervised representations of visual stimuli in a population of rate-coded neurons, modeled as independent homogeneous Poisson processes. A task-optimized decoder then continually infers an approximate posterior over actions conditioned on incoming spiking activity. Combining these components with an entropy-based stopping rule yields a principled and image-computable model of perceptual decisions capable of generating trial-by-trial patterns of choices and response times. Applied to MNIST digit classification, the model reproduces key empirical signatures of perceptual decision making, including stochastic variability, right-skewed response time distributions, logarithmic scaling of response times with the number of alternatives (Hick's law), and speed-accuracy trade-offs.
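
The decoding-plus-stopping idea can be sketched in a few lines: accumulate Poisson log-likelihoods over time bins, normalize to a posterior over classes, and stop once the posterior entropy falls below a threshold. This is a simplified illustration with hand-set firing rates rather than a learned Poisson VAE encoder; the function names, rates, bin width, and threshold are all assumptions.

```python
import itertools
import math
import random


def entropy_bits(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def decide(spike_stream, rates, dt, max_entropy):
    """Accumulate evidence until the posterior is confident enough.

    rates[c][i] is the firing rate (Hz) of neuron i under class c.
    Returns (choice, number_of_time_bins_used).
    """
    log_post = [0.0] * len(rates)                # flat prior over classes
    posterior = [1.0 / len(rates)] * len(rates)
    steps = 0
    for counts in spike_stream:
        steps += 1
        for c, class_rates in enumerate(rates):
            # Poisson log-likelihood up to a class-independent constant.
            log_post[c] += sum(
                n * math.log(r * dt) - r * dt
                for n, r in zip(counts, class_rates)
            )
        top = max(log_post)
        weights = [math.exp(v - top) for v in log_post]
        total = sum(weights)
        posterior = [w / total for w in weights]
        if entropy_bits(posterior) < max_entropy:  # entropy-based stopping rule
            break
    choice = max(range(len(rates)), key=posterior.__getitem__)
    return choice, steps


def bernoulli_spikes(rate_vector, dt, rng):
    """Approximate Poisson spiking with at most one spike per small bin."""
    while True:
        yield [1 if rng.random() < r * dt else 0 for r in rate_vector]


rates = [[50.0, 5.0], [5.0, 50.0]]  # two classes, two neurons (hypothetical)
dt = 0.01                           # 10 ms bins
rng = random.Random(0)
stream = itertools.islice(bernoulli_spikes(rates[1], dt, rng), 10_000)
choice, bins = decide(stream, rates, dt, max_entropy=0.05)
response_time = bins * dt           # seconds until the stopping rule fired
```

Because the stopping time depends on the random spike train, repeated trials give a distribution of response times, and lowering `max_entropy` trades speed for accuracy.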

Updated: 2025-11-24 12:53:25

Categories: q-bio.NC,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.11480v2

FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement

The integration of large language models (LLMs) with function calling has emerged as a crucial capability for enhancing their practical utility in real-world applications. However, effectively combining reasoning processes with accurate function execution remains a significant challenge. Traditional training approaches often struggle to balance the detailed reasoning steps with the precision of function calls, leading to suboptimal performance. To address these limitations, we introduce FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss (SRML) approach. FunReason leverages LLMs' natural reasoning abilities to generate high-quality training examples, focusing on query parseability, reasoning coherence, and function call precision. The SRML approach dynamically balances the contribution of reasoning processes and function call accuracy during training, addressing the inherent trade-off between these two critical aspects. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning. FunReason provides a comprehensive solution for enhancing LLMs' function calling capabilities by introducing a balanced training methodology and a data refinement pipeline. For code and dataset, please refer to our repository at GitHub https://github.com/BingguangHao/FunReason

Updated: 2025-11-24 12:52:02

Categories: cs.LG,cs.IR

Download: http://arxiv.org/abs/2505.20192v2

Large Language Model-Assisted Planning of Electric Vehicle Charging Infrastructure with Real-World Case Study

The growing demand for electric vehicle (EV) charging infrastructure presents significant planning challenges, requiring efficient strategies for investment and operation to deliver cost-effective charging services. However, the potential benefits of EV charging assignment, particularly in response to varying spatial-temporal patterns of charging demand, remain under-explored in infrastructure planning. This paper proposes an integrated approach that jointly optimizes investment decisions and charging assignments while accounting for spatial-temporal demand dynamics and their interdependencies. To support efficient model development, we leverage a large language model (LLM) to assist in generating and refining the mathematical formulation from structured natural-language descriptions, significantly reducing the modeling burden. The resulting optimization model enables optimal joint decision-making for investment and operation. Additionally, we propose a distributed optimization algorithm based on the Alternating Direction Method of Multipliers (ADMM) to address computational complexity in high-dimensional scenarios, which can be executed on standard computing platforms. We validate our approach through a case study using 1.5 million real-world travel records from Chengdu, China, demonstrating a 30% reduction in total cost compared to a baseline without EV assignment.

Updated: 2025-11-24 12:45:10

Categories: eess.SY,cs.AI,math.OC

Download: http://arxiv.org/abs/2511.19055v1

The inexact power augmented Lagrangian method for constrained nonconvex optimization

This work introduces an unconventional inexact augmented Lagrangian method where the augmenting term is a Euclidean norm raised to a power between one and two. The proposed algorithm is applicable to a broad class of constrained nonconvex minimization problems that involve nonlinear equality constraints. In the first part of this work, we conduct a full complexity analysis of the method under a mild regularity condition, leveraging an accelerated first-order algorithm for solving the Hölder-smooth subproblems. Interestingly, this worst-case result indicates that using lower powers for the augmenting term leads to faster constraint satisfaction, albeit with a slower decrease of the dual residual. Notably, our analysis does not assume boundedness of the iterates. Thereafter, we present an inexact proximal point method for solving the weakly-convex and Hölder-smooth subproblems, and demonstrate that the combined scheme attains an improved rate that reduces to the best-known convergence rate whenever the augmenting term is a classical squared Euclidean norm. Different augmenting terms, involving a lower power, further improve the primal complexity at the cost of the dual complexity. Finally, numerical experiments validate the practical performance of unconventional augmenting terms.
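
To make the augmenting term concrete, one plausible way to write the function is shown below. The symbols $f$, $c$, $y$, $\rho$, and $q$, and the $\rho/q$ scaling, are assumptions chosen for illustration; the abstract only states that the augmenting term is a Euclidean norm raised to a power between one and two.

```latex
% Problem: minimize f(x) subject to c(x) = 0
\mathcal{L}_{\rho}(x, y) \;=\; f(x) + \langle y, c(x) \rangle
  + \frac{\rho}{q}\, \lVert c(x) \rVert^{q},
\qquad q \in (1, 2]

% Gradient of the augmenting term with respect to x (for c(x) \neq 0),
% with J_c(x) the Jacobian of c:
\nabla_x \left( \frac{\rho}{q}\, \lVert c(x) \rVert^{q} \right)
  \;=\; \rho\, \lVert c(x) \rVert^{q-2}\, J_c(x)^{\top} c(x)
```

With $q = 2$ this is the classical quadratic penalty. For $q < 2$ the map $c \mapsto \lVert c \rVert^{q-2} c$ is only Hölder continuous with exponent $q - 1$, which is consistent with the subproblems being described as Hölder-smooth rather than Lipschitz-smooth.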

Updated: 2025-11-24 12:43:15

Categories: math.OC,cs.LG

Download: http://arxiv.org/abs/2410.20153v2

A Longitudinal Measurement of Privacy Policy Evolution for Large Language Models

Large language model (LLM) services have been rapidly integrated into people's daily lives as chatbots and agentic systems. They are nourished by collecting rich streams of data, raising privacy concerns around excessive collection of sensitive personal information. Privacy policies are the fundamental mechanism for informing users about data practices in the modern information privacy paradigm. Although traditional web and mobile policies are well studied, the privacy policies of LLM providers, their LLM-specific content, and their evolution over time remain largely underexplored. In this paper, we present the first longitudinal empirical study of privacy policies for mainstream LLM providers worldwide. We curate a chronological dataset of 74 historical privacy policies and 115 supplemental privacy documents from 11 LLM providers across 5 countries up to August 2025, and extract over 3,000 sentence-level edits between consecutive policy versions. We compare LLM privacy policies to those of other software formats, propose a taxonomy tailored to LLM privacy policies, annotate policy edits, and align them with a timeline of key LLM ecosystem events. Results show that LLM privacy policies are substantially longer, demand college-level reading ability, and remain highly vague. Our taxonomy analysis reveals patterns in how providers disclose LLM-specific practices and highlights regional disparities in coverage. We find that policy edits are concentrated in first-party data collection and international/specific-audience sections, and that product releases and regulatory actions are the primary drivers, shedding light on the status quo and the evolution of LLM privacy policies.
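
Extracting sentence-level edits between two policy versions can be sketched with Python's standard `difflib` module. This is an illustration of the general idea only: the abstract does not say which tooling the authors used, the sentence splitter is deliberately naive, and the two policy snippets are invented.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_edits(old, new):
    """Return (tag, old_sentences, new_sentences) for every changed region."""
    a, b = split_sentences(old), split_sentences(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented policy snippets for two consecutive versions.
v1 = "We collect your prompts. We do not sell data."
v2 = ("We collect your prompts. We may use prompts to train models. "
      "We do not sell data.")
edits = sentence_edits(v1, v2)
```

Each returned tuple is tagged `insert`, `delete`, or `replace`, so the edits can then be annotated with a taxonomy label or aligned with an external timeline.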

Updated: 2025-11-24 12:40:15

Categories: cs.CR,cs.AI,cs.CY

Download: http://arxiv.org/abs/2511.21758v1

Adapting Vision-Language Models for Evaluating World Models

World models - generative models that simulate environment dynamics conditioned on past observations and actions - are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency - capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks - action recognition and character recognition - each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a VLM-based evaluator for video world model rollouts adapted under data and compute constraints. In our extensive experiments totaling over 5,154 GPU-days, we explore full, partial, and parameter-efficient adaptation methods across various task formats, context lengths, sampling methods, and data compositions. The resulting unified evaluator achieves parity with task-specific checkpoints. Human studies across seven diverse environments confirm strong alignment with human judgments, establishing UNIVERSE as a lightweight, adaptable, and semantics-aware evaluator for video world models.

Updated: 2025-11-24 12:37:43

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2506.17967v2

When Should Neural Data Inform Welfare? A Critical Framework for Policy Uses of Neuroeconomics

Neuroeconomics promises to ground welfare analysis in neural and computational evidence about how people value outcomes, learn from experience and exercise self-control. At the same time, policy and commercial actors increasingly invoke neural data to justify paternalistic regulation, "brain-based" interventions and new welfare measures. This paper asks under what conditions neural data can legitimately inform welfare judgements for policy rather than merely describing behaviour. I develop a non-empirical, model-based framework that links three levels: neural signals, computational decision models and normative welfare criteria. Within an actor-critic reinforcement-learning model, I formalise the inference path from neural activity to latent values and prediction errors and then to welfare claims. I show that neural evidence constrains welfare judgements only when the neural-computational mapping is well validated, the decision model identifies "true" interests versus context-dependent mistakes, and the welfare criterion is explicitly specified and defended. Applying the framework to addiction, neuromarketing and environmental policy, I derive a Neuroeconomic Welfare Inference Checklist for regulators and for designers of NeuroAI systems. The analysis treats brains and artificial agents as value-learning systems while showing that internal reward signals, whether biological or artificial, are computational quantities and cannot be treated as welfare measures without an explicit normative model.

Updated: 2025-11-24 12:34:40

Categories: cs.LG,cs.AI,cs.CY,econ.GN,q-bio.NC

Download: http://arxiv.org/abs/2511.19548v1

MedSAM3: Delving into Segment Anything with Medical Concepts

Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.

Updated: 2025-11-24 12:34:38

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19046v1

Beyond Predictions: A Participatory Framework for Multi-Stakeholder Decision-Making

Conventional automated decision-support systems often prioritize predictive accuracy, overlooking the complexities of real-world settings where stakeholders' preferences may diverge or conflict. This can lead to outcomes that disadvantage vulnerable groups and erode trust in algorithmic processes. Participatory AI approaches aim to address these issues but remain largely context-specific, limiting their broader applicability and scalability. To address these gaps, we propose a participatory framework that reframes decision-making as a multi-stakeholder learning and optimization problem. Our modular, model-agnostic approach builds on the standard machine learning training pipeline to fine-tune user-provided prediction models and evaluate decision strategies, including compromise functions that mediate stakeholder trade-offs. A synthetic scoring mechanism aggregates user-defined preferences across multiple metrics, ranking strategies and selecting an optimal decision-maker to generate actionable recommendations that jointly optimize performance, fairness, and domain-specific goals. Empirical validation on two high-stakes case studies demonstrates the versatility of the framework and its promise as a more accountable, context-aware alternative to prediction-centric pipelines for socially impactful deployments.

Updated: 2025-11-24 12:23:10

Categories: cs.LG,cs.MA

Download: http://arxiv.org/abs/2502.08542v3

Resolving Node Identifiability in Graph Neural Processes via Laplacian Spectral Encodings

Message passing graph neural networks are widely used for learning on graphs, yet their expressive power is bounded by the one-dimensional Weisfeiler-Lehman test, so they can fail to distinguish structurally different nodes. We provide rigorous theory for a Laplacian positional encoding that is invariant to eigenvector sign flips and to basis rotations within eigenspaces. We prove that this encoding yields node identifiability from a constant number of observations and establishes a sample-complexity separation from architectures constrained by the Weisfeiler-Lehman test. The analysis combines a monotone link between shortest-path and diffusion distance, spectral trilateration with a constant set of anchors, and quantitative spectral injectivity with logarithmic embedding size. As an instantiation, pairing this encoding with a neural-process style decoder yields significant gains on a drug-drug interaction task on chemical graphs, improving both the area under the ROC curve and the F1 score and demonstrating the practical benefits of resolving theoretical expressiveness limitations with principled positional information.
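
The two invariances named in the abstract can be illustrated with eigenspace projectors, a standard construction that is not necessarily the encoding analyzed in the paper. The sketch below uses the triangle graph, whose Laplacian has the eigenvalue 3 with multiplicity two, and checks that the projector built from an orthonormal eigenbasis is unchanged when one eigenvector flips sign or when the basis is rotated inside the eigenspace. The helper names and the rotation angle are illustrative choices.

```python
import math

def projector(basis):
    """Return P = sum_k v_k v_k^T for orthonormal vectors spanning one eigenspace."""
    n = len(basis[0])
    return [[sum(v[i] * v[j] for v in basis) for j in range(n)] for i in range(n)]

def rotate_pair(u, v, theta):
    """Rotate an orthonormal pair (u, v) by angle theta inside its own span."""
    c, s = math.cos(theta), math.sin(theta)
    new_u = [c * a + s * b for a, b in zip(u, v)]
    new_v = [-s * a + c * b for a, b in zip(u, v)]
    return [new_u, new_v]

def max_abs_diff(p, q):
    """Largest entrywise difference between two matrices of equal shape."""
    return max(abs(a - b) for row_p, row_q in zip(p, q) for a, b in zip(row_p, row_q))

# Orthonormal eigenvectors of the triangle-graph Laplacian for eigenvalue 3.
u = [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0]
v = [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)]

p_ref = projector([u, v])
p_flipped = projector([[-a for a in u], v])        # sign flip of one eigenvector
p_rotated = projector(rotate_pair(u, v, theta=0.7))  # basis rotation in the eigenspace
```

All three projectors agree up to floating-point error, so any node feature read from the projector (for example its diagonal or its rows) does not depend on which eigenbasis a solver happens to return.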

Updated: 2025-11-24 12:20:36

Categories: cs.LG,math.PR

Download: http://arxiv.org/abs/2511.19037v1

Node Embeddings via Neighbor Embeddings

Node embeddings are a paradigm in non-parametric graph representation learning, where graph nodes are embedded into a given vector space to enable downstream processing. State-of-the-art node-embedding algorithms, such as DeepWalk and node2vec, are based on random-walk notions of node similarity and on contrastive learning. In this work, we introduce the graph neighbor-embedding (graph NE) framework that directly pulls together embedding vectors of adjacent nodes without relying on any random walks. We show that graph NE strongly outperforms state-of-the-art node-embedding algorithms in terms of local structure preservation. Furthermore, we apply graph NE to the 2D node-embedding problem, obtaining graph t-SNE layouts that also outperform existing graph-layout algorithms.

Updated: 2025-11-24 12:16:52

Categories: cs.LG

Download: http://arxiv.org/abs/2503.23822v2

CSD: Change Semantic Detection with only Semantic Change Masks for Damage Assessment in Conflict Zones

Accurately and swiftly assessing damage from conflicts is crucial for humanitarian aid and regional stability. In conflict zones, damaged zones often share similar architectural styles, with damage typically covering small areas and exhibiting blurred boundaries. These characteristics lead to limited data, annotation difficulties, and significant recognition challenges, including high intra-class similarity and ambiguous semantic changes. To address these issues, we introduce a pre-trained DINOv3 model and propose a multi-scale cross-attention difference siamese network (MC-DiSNet). The powerful visual representation capability of the DINOv3 backbone enables robust and rich feature extraction from bi-temporal remote sensing images. We also release a new Gaza-Change dataset containing high-resolution satellite image pairs from 2023-2024 with pixel-level semantic change annotations. It is worth emphasizing that our annotations only include semantic pixels of changed areas. Unlike conventional semantic change detection (SCD), our approach eliminates the need for large-scale semantic annotations of bi-temporal images, instead focusing directly on the changed regions. We term this new task change semantic detection (CSD). The CSD task represents a direct extension of binary change detection (BCD). Due to the limited spatial extent of semantic regions, it presents greater challenges than traditional SCD tasks. We evaluated our method under the CSD framework on both the Gaza-Change and SECOND datasets. Experimental results demonstrate that our proposed approach effectively addresses the CSD task, and its outstanding performance paves the way for practical applications in rapid damage assessment across conflict zones.

Updated: 2025-11-24 12:16:21

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19035v1

Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling

Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking their unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced layer interaction and MoE-based feature decoupling, termed Life-IQA. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as the query and the penultimate-layer features as the key and value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations through different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA strikes a more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks. The code is available at https://github.com/TANGLONG2/Life-IQA/tree/main.

Updated: 2025-11-24 11:59:55

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19024v1

OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs

Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that bypasses the reliance on external human preferences entirely by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e., higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses with increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demonstrate that OrdMoE significantly enhances both alignment and overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
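
The tier construction described above can be sketched in a few lines. This is a minimal illustration, assuming a single token's routing scores and equal-sized tiers; the function names, the tier-splitting rule, and the example scores are hypothetical, and the step of actually generating a response from each tier is not shown.

```python
from typing import List, Tuple

def rank_expert_tiers(router_scores: List[float], num_tiers: int) -> List[List[int]]:
    """Split expert indices into tiers, highest-scoring tier first."""
    if not 1 <= num_tiers <= len(router_scores):
        raise ValueError("num_tiers must be between 1 and the number of experts")
    # Sort expert indices by descending routing score; ties keep index order.
    order = sorted(range(len(router_scores)), key=lambda i: -router_scores[i])
    base, extra = divmod(len(order), num_tiers)
    tiers, start = [], 0
    for t in range(num_tiers):
        size = base + (1 if t < extra else 0)  # earlier tiers absorb any remainder
        tiers.append(order[start:start + size])
        start += size
    return tiers

def preference_pairs(num_tiers: int) -> List[Tuple[int, int]]:
    """(preferred_tier, rejected_tier) pairs implied by the tier ordering."""
    return [(hi, lo) for hi in range(num_tiers) for lo in range(hi + 1, num_tiers)]

# Hypothetical routing scores for six experts on one token.
scores = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
tiers = rank_expert_tiers(scores, num_tiers=3)
pairs = preference_pairs(len(tiers))
```

Each pair says that the response produced with the higher-ranked tier is treated as preferred over the one produced with the lower-ranked tier, which is the kind of ordering a standard preference objective can consume.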

Updated: 2025-11-24 11:59:31

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.19023v1

Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs

The integration of Large Language Models (LLMs) into healthcare demands a safety paradigm rooted in primum non nocere. However, current alignment techniques rely on generic definitions of harm that fail to capture context-dependent violations, such as administrative fraud and clinical discrimination. To address this, we introduce Medical Malice: a dataset of 214,219 adversarial prompts calibrated to the regulatory and ethical complexities of the Brazilian Unified Health System (SUS). Crucially, the dataset includes the reasoning behind each violation, enabling models to internalize ethical boundaries rather than merely memorizing a fixed set of refusals. Using an unaligned agent (Grok-4) within a persona-driven pipeline, we synthesized high-fidelity threats across seven taxonomies, ranging from procurement manipulation and queue-jumping to obstetric violence. We discuss the ethical design of releasing these "vulnerability signatures" to correct the information asymmetry between malicious actors and AI developers. Ultimately, this work advocates for a shift from universal to context-aware safety, providing the necessary resources to immunize healthcare AI against the nuanced, systemic threats inherent to high-stakes medical environments -- vulnerabilities that represent the paramount risk to patient safety and the successful integration of AI in healthcare systems.

Updated: 2025-11-24 11:55:22

Categories: cs.CY,cs.AI,cs.CL,cs.CR

Download: http://arxiv.org/abs/2511.21757v1

(De)-regularized Maximum Mean Discrepancy Gradient Flow

We introduce a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) and its Wasserstein gradient flow. Existing gradient flows that transport samples from a source distribution to a target distribution using only target samples either lack a tractable numerical implementation ($f$-divergence flows) or require strong assumptions and modifications, such as noise injection, to ensure convergence (Maximum Mean Discrepancy flows). In contrast, DrMMD flow can simultaneously (i) guarantee near-global convergence for a broad class of targets in both continuous and discrete time, and (ii) be implemented in closed form using only samples. The former is achieved by leveraging the connection between the DrMMD and the $\chi^2$-divergence, while the latter comes from treating DrMMD as MMD with a de-regularized kernel. Our numerical scheme uses an adaptive de-regularization schedule throughout the flow to optimally trade off between discretization errors and deviations from the $\chi^2$ regime. The potential application of the DrMMD flow is demonstrated across several numerical experiments, including a large-scale setting of training student/teacher networks.
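
Since the abstract describes DrMMD as MMD with a de-regularized kernel, it may help to see how the plain squared MMD is computed from samples alone. The sketch below is the standard biased (V-statistic) estimator with a Gaussian kernel on one-dimensional samples; it is not the DrMMD itself, and the kernel choice, bandwidth, and sample values are illustrative assumptions.

```python
import math
from typing import Sequence

def gaussian_kernel(a: float, b: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs: Sequence[float], ys: Sequence[float], bandwidth: float = 1.0) -> float:
    """Biased estimate of MMD^2 = E k(x,x') + E k(y,y') - 2 E k(x,y)."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Illustrative samples: a source set, and a target set shifted by 3.
source = [-0.5, 0.0, 0.2, 0.7]
target = [x + 3.0 for x in source]

same = mmd_squared(source, source)
shifted = mmd_squared(source, target)
```

The estimate is zero when both sample sets coincide and grows as the sets move apart. A sample-based flow moves the source particles so that a discrepancy of this kind decreases over time.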

Updated: 2025-11-24 11:54:06

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2409.14980v3

Forecasting-based Biomedical Time-series Data Synthesis for Open Data and Robust AI

The limited data availability due to strict privacy regulations and significant resource demands severely constrains biomedical time-series AI development, which creates a critical gap between data requirements and accessibility. Synthetic data generation presents a promising solution by producing artificial datasets that maintain the statistical properties of real biomedical time-series data without compromising patient confidentiality. While GANs, VAEs, and diffusion models capture global data distributions, forecasting models offer inductive biases tailored for sequential dynamics. We propose a framework for synthetic biomedical time-series data generation based on recent forecasting models that accurately replicates complex electrophysiological signals such as EEG and EMG with high fidelity. These synthetic datasets can be freely shared for open AI development and consistently improve downstream model performance. Numerical results on sleep-stage classification show up to a 3.71% performance gain with augmentation and a 91.00% synthetic-only accuracy that surpasses the real-data-only baseline.

Updated: 2025-11-24 11:53:20

Categories: cs.LG,eess.SP

Download: http://arxiv.org/abs/2510.04622v2

3D Dynamic Radio Map Prediction Using Vision Transformers for Low-Altitude Wireless Networks

Low-altitude wireless networks (LAWN) are rapidly expanding with the growing deployment of unmanned aerial vehicles (UAVs) for logistics, surveillance, and emergency response. Reliable connectivity remains a critical yet challenging task due to three-dimensional (3D) mobility, time-varying user density, and limited power budgets. The transmit power of base stations (BSs) fluctuates dynamically according to user locations and traffic demands, leading to a highly non-stationary 3D radio environment. Radio maps (RMs) have emerged as an effective means to characterize spatial power distributions and support radio-aware network optimization. However, most existing works construct static or offline RMs, overlooking real-time power variations and spatio-temporal dependencies in multi-UAV networks. To overcome this limitation, we propose a 3D dynamic radio map (3D-DRM) framework that learns and predicts the spatio-temporal evolution of received power. Specifically, a Vision Transformer (ViT) encoder extracts high-dimensional spatial representations from 3D RMs, while a Transformer-based module models sequential dependencies to predict future power distributions. Experiments show that 3D-DRM accurately captures fast-varying power dynamics and substantially outperforms baseline models in both RM reconstruction and short-term prediction.

Updated: 2025-11-24 11:47:17

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19019v1

PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution time feedback, lacking the ability to reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. We evaluate PRAGMA on KernelBench, covering GPU and CPU backends. Results show that PRAGMA consistently outperforms baseline AIKG without profiling enabled and achieves 2.81$\times$ and 2.30$\times$ averaged speedups against Torch on CPU and GPU platforms, respectively.

Updated: 2025-11-24 11:46:50

Categories: cs.DC,cs.AI

Download: http://arxiv.org/abs/2511.06345v2

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

Safety alignment instills in Large Language Models (LLMs) a critical capacity to refuse malicious requests. Prior works have modeled this refusal mechanism as a single linear direction in the activation space. We posit that this is an oversimplification that conflates two functionally distinct neural processes: the detection of harm and the execution of a refusal. In this work, we deconstruct this single representation into a Harm Detection Direction and a Refusal Execution Direction. Leveraging this fine-grained model, we introduce Differentiated Bi-Directional Intervention (DBDI), a new white-box framework that precisely neutralizes the safety alignment at a critical layer. DBDI applies adaptive projection nullification to the refusal execution direction while suppressing the harm detection direction via direct steering. Extensive experiments demonstrate that DBDI outperforms prominent jailbreaking methods, achieving up to a 97.88% attack success rate on models such as Llama-2. By providing a more granular and mechanistic framework, our work offers a new direction for the in-depth understanding of LLM safety alignment.

Updated: 2025-11-24 11:44:59

Categories: cs.CR,cs.AI,cs.LG,cs.SE

Download: http://arxiv.org/abs/2511.06852v4

A General Framework for Per-record Differential Privacy

Differential Privacy (DP) is a widely adopted standard for privacy-preserving data analysis, but it assumes a uniform privacy budget across all records, limiting its applicability when privacy requirements vary with data values. Per-record Differential Privacy (PrDP) addresses this by defining the privacy budget as a function of each record, offering better alignment with real-world needs. However, the dependency between the privacy budget and the data value introduces challenges in protecting the budget's privacy itself. Existing solutions either handle specific privacy functions or adopt relaxed PrDP definitions. A simple workaround is to use the global minimum of the privacy function, but this severely degrades utility, as the minimum is often set extremely low to account for rare records with high privacy needs. In this work, we propose a general and practical framework that enables any standard DP mechanism to support PrDP, with error depending only on the minimal privacy requirement among records actually present in the dataset. Since directly revealing this minimum may leak information, we introduce a core technique called privacy-specified domain partitioning, which ensures accurate estimation without compromising privacy. We also extend our framework to the local DP setting via a novel technique, privacy-specified query augmentation. Using our framework, we present the first PrDP solutions for fundamental tasks such as count, sum, and maximum estimation. Experimental results show that our mechanisms achieve high utility and significantly outperform existing Personalized DP (PDP) methods, which can be viewed as a special case of PrDP with relaxed privacy protection.
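
The utility cost of the "global minimum" workaround is easy to quantify for a count query under the Laplace mechanism, whose noise scale is the sensitivity divided by epsilon. The sketch below is only an illustration: the two-level `privacy_budget` function, the record values, and the helper names are hypothetical. It also does not implement the paper's privacy-specified domain partitioning; the dataset minimum is computed here purely to show the gap, and releasing it directly would not be private.

```python
import random

def privacy_budget(value: float) -> float:
    """Hypothetical privacy function: rare, very large values need a much smaller epsilon."""
    return 0.01 if value >= 1000 else 1.0

def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, epsilon: float, rng: random.Random) -> float:
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return len(records) + laplace_noise(sensitivity / epsilon, rng)

# Illustrative dataset: no record falls in the high-privacy range.
records = [120.0, 450.0, 80.0, 900.0]

# The hypothetical budget takes only two values, so two probes cover its global minimum.
global_min_eps = min(privacy_budget(v) for v in (0.0, 1000.0))
# Minimum over records actually present (shown for comparison only, not safe to reveal).
present_min_eps = min(privacy_budget(v) for v in records)

global_scale = 1.0 / global_min_eps    # noise scale of the global-minimum workaround
present_scale = 1.0 / present_min_eps  # noise scale the dataset itself would require

rng = random.Random(0)
baseline_answer = noisy_count(records, global_min_eps, rng)
```

In this toy setting the workaround adds noise with a scale one hundred times larger than the records in the dataset call for, which is the gap the framework aims to close without leaking the dataset minimum.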

Updated: 2025-11-24 11:44:10

Categories: cs.DB,cs.CR

Download: http://arxiv.org/abs/2511.19015v1

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation

Large language models demonstrate powerful capabilities across various natural language processing tasks, yet they also harbor safety vulnerabilities. To enhance LLM safety, various jailbreak defense methods have been proposed to guard against harmful outputs. However, improvements in model safety often come at the cost of severe over-refusal, failing to strike a good balance between safety and usability. In this paper, we first analyze the causes of over-refusal from a representation perspective, revealing that over-refusal samples reside at the boundary between benign and malicious samples. Based on this, we propose MOSR, designed to mitigate over-refusal by intervening on the safety representation of LLMs. MOSR incorporates two novel components: (1) Overlap-Aware Loss Weighting, which determines the erasure weight for malicious samples by quantifying their similarity to pseudo-malicious samples in the representation space, and (2) Context-Aware Augmentation, which supplements the necessary context for rejection decisions by adding harmful prefixes before rejection responses. Experiments demonstrate that our method outperforms existing approaches in mitigating over-refusal while largely maintaining safety. Overall, we advocate that future defense methods should strike a better balance between safety and over-refusal.

Updated: 2025-11-24 11:38:53

Categories: cs.CR,cs.CL

Download: http://arxiv.org/abs/2511.19009v1

Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding

Spoken Language Understanding (SLU) consists of two sub-tasks: intent detection (ID) and slot filling (SF). Given its broad range of real-world applications, enhancing SLU for practical deployment is increasingly critical. Profile-based SLU addresses ambiguous user utterances by incorporating context awareness (CA), user profiles (UP), and knowledge graphs (KG) to support disambiguation, thereby advancing SLU research toward real-world applicability. However, existing SLU datasets still fall short in representing real-world scenarios. Specifically, (1) CA uses one-hot vectors for representation, which is overly idealized, and (2) models typically focus solely on predicting intents and slot labels, neglecting the reasoning process that could enhance performance and interpretability. To overcome these limitations, we introduce VRSLU, a novel SLU dataset that integrates both Visual images and explicit Reasoning. For over-idealized CA, we use GPT-4o and FLUX.1-dev to generate images reflecting users' environments and statuses, followed by human verification to ensure quality. For reasoning, GPT-4o is employed to generate explanations for predicted labels, which are then refined by human annotators to ensure accuracy and coherence. Additionally, we propose an instructional template, LR-Instruct, which first predicts labels and then generates corresponding reasoning. This two-step approach helps mitigate the influence of reasoning bias on label prediction. Experimental results confirm the effectiveness of incorporating visual information and highlight the promise of explicit reasoning in advancing SLU.

Updated: 2025-11-24 11:32:24

Categories: cs.AI

Download: http://arxiv.org/abs/2511.19005v1

Enhancing low energy reconstruction and classification in KM3NeT/ORCA with transformers

The current KM3NeT/ORCA neutrino telescope, still under construction, has not yet reached its full potential in neutrino reconstruction capability. When training any deep learning model, no explicit information about the physics or the detector is provided, thus they remain unknown to the model. This study leverages the strengths of transformers by incorporating attention masks inspired by the physics and detector design, making the model understand both the telescope design and the neutrino physics measured on it. The study also shows the efficacy of transformers in retaining valuable information between detectors when fine-tuning from one configuration to another.

Updated: 2025-11-24 11:25:30

Categories: hep-ex,astro-ph.IM,cs.AI

Download: http://arxiv.org/abs/2511.18999v1

Warm Chat: Diffuse Emotion-aware Interactive Talking Head Avatar with Tree-Structured Guidance

Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose Warm Chat, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character's emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.

Updated: 2025-11-24 11:19:05

Categories: eess.AS,cs.AI,cs.SD

Download: http://arxiv.org/abs/2508.18337v3

Classification EM-PCA for clustering and embedding

The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used, and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating the parameters from which the clustering is inferred. Although these models are popular in various domains, including image clustering, they suffer from high dimensionality and from the slow convergence of the EM algorithm. The Classification EM (CEM) algorithm, a classifying version of EM, converges quickly, but dimensionality reduction remains a challenge. In this paper we therefore propose an algorithm that performs the two tasks, data embedding and clustering, simultaneously rather than sequentially, relying on Principal Component Analysis (PCA) and CEM. We demonstrate the benefit of this approach in terms of clustering and data embedding. We also establish connections with other clustering approaches.
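
For readers unfamiliar with CEM, the sketch below shows the plain algorithm on one-dimensional data: posterior scores are computed as in EM, each point is then hard-assigned to its most probable component (the classification step), and the parameters are re-estimated from the resulting partition. It is a minimal illustration, not the combined PCA and CEM method of the paper; the function names, initial means, and data are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

def log_gauss(x: float, mean: float, var: float) -> float:
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def cem_1d(data: Sequence[float], init_means: Sequence[float],
           n_iter: int = 20, min_var: float = 1e-6) -> Tuple[List[int], List[float]]:
    """Classification EM for a 1-D Gaussian mixture; returns (labels, means)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    labels = [0] * len(data)
    for it in range(n_iter):
        # E-step and C-step: hard-assign each point to its most probable component.
        new_labels = [
            max(range(k), key=lambda j: math.log(weights[j]) + log_gauss(x, means[j], variances[j]))
            for x in data
        ]
        if it > 0 and new_labels == labels:
            break  # the partition is stable, so the parameters will not change
        labels = new_labels
        # M-step: re-estimate parameters from the hard partition.
        for j in range(k):
            members = [x for x, z in zip(data, labels) if z == j]
            if not members:
                continue  # keep the previous parameters of an empty cluster
            weights[j] = len(members) / len(data)
            means[j] = sum(members) / len(members)
            variances[j] = max(min_var, sum((x - means[j]) ** 2 for x in members) / len(members))
    return labels, means

data = [0.9, 1.1, 1.0, 5.0, 5.2, 4.8]
labels, means = cem_1d(data, init_means=[0.0, 6.0])
```

On this toy data the partition stabilizes after the first update, which reflects the fast convergence that hard assignments typically give compared with the soft responsibilities of EM.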

Updated: 2025-11-24 11:18:59

领域: stat.ML,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2511.18992v1

Causally Reliable Concept Bottleneck Models

Concept-based models are an emerging paradigm in deep learning that constrains the inference process to operate through human-interpretable variables, facilitating explainability and human interaction. However, these architectures, like popular opaque neural models, fail to account for the true causal mechanisms underlying the target phenomena represented in the data. This hampers their ability to support causal reasoning tasks, limits out-of-distribution generalization, and hinders the implementation of fairness constraints. To overcome these issues, we propose Causally reliable Concept Bottleneck Models (C$^2$BMs), a class of concept-based architectures that enforce reasoning through a bottleneck of concepts structured according to a model of the real-world causal mechanisms. We also introduce a pipeline to automatically learn this structure from observational data and unstructured background knowledge (e.g., scientific literature). Experimental evidence suggests that C$^2$BMs are more interpretable and causally reliable, and respond better to interventions, than standard opaque and concept-based models, while maintaining their accuracy.

Updated: 2025-11-24 11:18:03

标题: 因果可靠性概念瓶颈模型

摘要: 基于概念的模型是深度学习中新兴的范式，它将推理过程限制在通过人类可解释变量进行操作，促进了可解释性和人类交互。然而，与流行的不透明神经模型一样，这些架构未能解释数据中所代表的目标现象的真正因果机制。这限制了它们支持因果推理任务的能力，限制了超出分布的泛化，并阻碍了公平约束的实施。为了解决这些问题，我们提出了因果可靠的概念瓶颈模型(C$^2$BMs)，这是一类基于概念的架构，通过一个按照真实世界因果机制模型构建的概念瓶颈来强制进行推理。我们还介绍了一种从观测数据和非结构化背景知识(例如科学文献)中自动学习这种结构的流程。实验证据表明，C$^2$BMs更具可解释性、因果可靠性，并且在对标准不透明和基于概念的模型进行干预时，响应更好，同时保持了它们的准确性。

更新时间: 2025-11-24 11:18:03

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2503.04363v3

SmartPoC: Generating Executable and Validated PoCs for Smart Contract Bug Reports

Smart contracts are prone to vulnerabilities and are analyzed by experts as well as automated systems, such as static analysis and AI-assisted solutions. However, audit artifacts are heterogeneous and often lack reproducible, executable PoC tests suitable for automated validation, leading to costly, ad hoc manual verification. Large language models (LLMs) can be leveraged to turn audit reports into PoC test cases, but they face three major challenges: noisy inputs, hallucinations, and missing runtime oracles. In this paper, we present SmartPoC, an automated framework that converts textual audit reports into executable, validated test cases. First, the input audit report is processed to reduce noise, and only bug-related functions are extracted and fed to LLMs as context. To curb hallucinations and ensure compile-and-run readiness, we leverage LLMs to synthesize PoC test cases with specially designed pre-/post-execution repair. We further utilize differential verification as an oracle to confirm the exploitability of the PoC test cases. On the SmartBugs-Vul and FORGE-Vul benchmarks, SmartPoC generates executable, validated Foundry test cases for 85.61% and 86.45% of targets, respectively. Applied to the latest Etherscan verified-source corpus, SmartPoC confirms 236 real bugs out of 545 audit findings at a cost of only $0.03 per finding.

Updated: 2025-11-24 11:08:48

标题: 智能PoC：为智能合约漏洞报告生成可执行且经过验证的PoCs

摘要: 智能合约容易受到漏洞影响，专家和自动化系统（如静态分析和人工智能辅助解决方案）对其进行分析。然而，审计文献异质且常常缺乏可用于自动验证的可重现、可执行的 PoC 测试，导致昂贵、临时的手动验证。大型语言模型（LLMs）可用于将审计报告转化为 PoC 测试用例，但存在三个主要挑战：嘈杂输入、幻觉和缺失运行时神谕。本文介绍了 SmartPoC，这是一个自动化框架，将文本审计报告转换为可执行的、经验证的测试用例。首先，处理输入审计报告以减少噪音，仅提取与漏洞相关的函数，并将其作为上下文提供给 LLMs。为了遏制幻觉并确保编译和运行准备就绪，我们利用 LLMs 合成具有特别设计的前/后执行修复的 PoC 测试用例。我们进一步利用差分验证作为神谕来确认 PoC 测试用例的可利用性。在 SmartBugs-Vul 和 FORGE-Vul 基准测试中，SmartPoC 分别为 85.61% 和 86.45% 的目标生成了可执行、经验证的 Foundry 测试用例。应用于最新的 Etherscan 已验证源代码语料库时，SmartPoC 以每个发现仅 0.03 美元的成本确认了 545 个审计结果中的 236 个真实漏洞。

更新时间: 2025-11-24 11:08:48

领域: cs.SE,cs.CR

下载: http://arxiv.org/abs/2511.12993v2

Rethinking Plant Disease Diagnosis: Bridging the Academic-Practical Gap with Vision Transformers and Zero-Shot Learning

Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
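
The zero-shot step can be sketched as follows: an image embedding is compared against the text embeddings of natural-language class descriptions, and the closest description wins. The embeddings below are made-up toy vectors (in practice they would come from CLIP's image and text encoders), and the class prompts are hypothetical examples.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the best-matching description and all similarity scores."""
    scores = {
        label: cosine(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for encoder outputs (assumed values).
text_embeddings = {
    "a photo of a healthy leaf": [0.9, 0.1, 0.0],
    "a photo of a leaf with powdery mildew": [0.1, 0.9, 0.2],
}
image_embedding = [0.2, 0.8, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
```

Adding a new disease in this setup only requires a new text description, not retraining, which is the adaptability the abstract highlights.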

Updated: 2025-11-24 11:08:01

标题: 重新思考植物疾病诊断：用视觉转换器和零样本学习来弥合学术实践鸿沟

摘要: 最近深度学习的进展使得利用叶片图像进行植物病害分类取得了显著进展。这一领域的许多现有研究依赖于PlantVillage数据集，该数据集包含对准确、整洁背景下捕获的植物图像。尽管在该数据集上训练的模型能够实现高准确性，但它们通常无法推广到真实世界的田间图像，例如农民提交给植物诊断系统的图像。这导致了已发表研究和实际应用需求之间的显著差距，突显了研究和解决这一问题的必要性。在这项研究中，我们调查了基于注意力架构和零样本学习方法是否可以弥合学术数据集和植物病害分类中的真实农业条件之间的差距。我们评估了三种模型类别：卷积神经网络（CNNs）、视觉Transformer和基于对比语言-图像预训练（CLIP）的零样本模型。虽然CNNs在领域转移下表现出有限的稳健性，但视觉Transformer通过捕捉全局上下文特征展现出更强的泛化能力。值得注意的是，CLIP模型可以直接从自然语言描述中对疾病进行分类，而无需任何特定任务的训练，具有强大的适应性和可解释性。这些发现突显了零样本学习作为一种实用且可扩展的领域适应策略，在多样的田间环境中用于植物健康诊断的潜力。

更新时间: 2025-11-24 11:08:01

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.18989v1

Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection

Visual anomaly detection is a highly challenging task, often categorized as a one-class classification and segmentation problem. Recent studies have demonstrated that the student-teacher (S-T) framework effectively addresses this challenge. However, most S-T frameworks rely solely on pre-trained teacher networks to guide student networks in learning multi-scale similar features, overlooking the potential of the student networks to enhance learning through multi-scale feature fusion. In this study, we propose a novel model named PFADSeg, which integrates a pre-trained teacher network, a denoising student network with multi-scale feature fusion, and a guided anomaly segmentation network into a unified framework. By adopting a unique teacher-encoder and student-decoder denoising mode, the model improves the student network's ability to learn from teacher network features. Furthermore, an adaptive feature fusion mechanism is introduced to train a self-supervised segmentation network that synthesizes anomaly masks autonomously, significantly increasing detection performance. Rigorous evaluations on the widely-used MVTec AD dataset demonstrate that PFADSeg exhibits excellent performance, achieving an image-level AUC of 98.9%, a pixel-level mean precision of 76.4%, and an instance-level mean precision of 78.7%.

Updated: 2025-11-24 11:06:07

标题: 教师编码器-学生解码器去噪引导分割网络用于异常检测

摘要: 视觉异常检测是一项极具挑战性的任务，通常被归类为一类分类和分割问题。最近的研究表明，学生-教师（S-T）框架有效地解决了这一挑战。然而，大多数S-T框架仅依赖于预训练的教师网络来指导学生网络学习多尺度相似特征，忽视了学生网络通过多尺度特征融合增强学习的潜力。在本研究中，我们提出了一个名为PFADSeg的新型模型，该模型将预训练的教师网络、具有多尺度特征融合的去噪学生网络以及引导异常分割网络整合到统一框架中。通过采用独特的教师-编码器和学生-解码器去噪模式，该模型改善了学生网络从教师网络特征中学习的能力。此外，引入了自适应特征融合机制来训练一个自监督分割网络，自主合成异常掩模，显著提高了检测性能。对广泛使用的MVTec AD数据集进行严格评估表明，PFADSeg表现出色，达到了98.9%的图像级AUC、76.4%的像素级平均精度和78.7%的实例级平均精度。

更新时间: 2025-11-24 11:06:07

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2501.12104v4

Dynamic Mixture of Experts Against Severe Distribution Shifts

The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike neural networks, biological brains maintain plasticity through capacity growth, inspiring researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, but Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper aims to evaluate a DynamicMoE approach for continual and reinforcement learning environments and benchmark its effectiveness against existing network expansion methods.

Updated: 2025-11-24 11:00:32

标题: 动态专家混合对抗严重分布偏移

摘要: 构建能够持续学习和适应不断演变数据流的神经网络是继续学习（CL）和强化学习（RL）领域的核心挑战。这种终身学习问题通常被描述为可塑性-稳定性困境，关注诸如可塑性丧失和灾难性遗忘等问题。与神经网络不同，生物大脑通过容量增长来保持可塑性，这启发研究人员探索类似的方法在人工网络中，例如动态添加容量。先前的解决方案通常缺乏参数效率或依赖显式任务索引，但混合专家（MoE）架构通过为不同分布专门化专家提供了一种有希望的替代方案。本文旨在评估一种适用于继续学习和强化学习环境的DynamicMoE方法，并将其效果与现有网络扩展方法进行基准测试。

更新时间: 2025-11-24 11:00:32

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.18987v1

Q-SAM2: Accurate Quantization for Segment Anything Model 2

The Segment Anything Model 2 (SAM2) is a powerful foundation model for promptable segmentation. However, its high computational and memory costs are a major barrier to deployment on resource-constrained devices. In this paper, we present Q-SAM2, an accurate low-bit quantization method that achieves high compression and high fidelity. To address performance degradation arising from challenging weight and activation distributions during quantization, Q-SAM2 introduces two novel contributions: Variance-Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization-Aware Training (QAT) method that learns momentum-stabilized clipping factors to manage outliers in weights and activations. Comprehensive experiments demonstrate that Q-SAM2 achieves highly accurate inference with substantial efficiency gains, significantly surpassing state-of-the-art general QAT schemes, particularly in the ultra-low 2-bit regime. Specifically, Q-SAM2 achieves an accuracy gain of up to 9.7 ppt in J&F on the video segmentation benchmark and 7.3 ppt in mIoU for instance segmentation over the best competing QAT model, all while achieving an 8x reduction in model size compared to the BF16 baseline.
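
A small sketch shows why clipping matters in the ultra-low-bit regime. It uses plain symmetric uniform quantization with a hand-picked clipping threshold; the paper's LSC learns the clipping factors during training, which is not reproduced here. The function name `quantize` and the sample values are assumptions.

```python
def quantize(values, bits, clip):
    """Symmetric uniform quantization with clipping to [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1  # 2 bits -> integer grid {-1, 0, 1}
    scale = clip / levels
    quantized = []
    for v in values:
        clipped = max(-clip, min(clip, v))  # saturate outliers
        quantized.append(round(clipped / scale) * scale)
    return quantized


weights = [0.1, -0.9, 0.8, 10.0]  # the last value is an outlier

# Range set by the outlier: the three small weights collapse to zero.
no_clipping = quantize(weights, bits=2, clip=10.0)

# Range clipped to 1.0: the small weights keep their sign and magnitude.
with_clipping = quantize(weights, bits=2, clip=1.0)
```

With only three representable levels, a single outlier can erase every other weight unless the range is clipped, which is the problem that learned clipping factors are meant to manage.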

Updated: 2025-11-24 10:55:38

标题: Q-SAM2：端到端模型2的准确量化

摘要: 段分割模型2（SAM2）是一个强大的可提示分割的基础模型。然而，其高计算和内存成本是在资源受限设备上部署的主要障碍。在本文中，我们提出了Q-SAM2，一种准确的低比特量化方法，实现了高压缩和高保真度。为了解决在量化过程中由于挑战性的权重和激活分布导致的性能下降，Q-SAM2引入了两个新的贡献：方差减少校准（VRC），一种通过在小校准批次上最小化Frobenius范数来减少权重统计方差的初始化方法；以及可学习的统计剪切（LSC），一种量化感知训练（QAT）方法，学习动量稳定的剪切因子来管理权重和激活中的异常值。全面的实验表明，Q-SAM2实现了高度准确的推断，具有显著的效率增益，明显超过了最先进的通用QAT方案，特别是在超低2比特范围内。具体而言，Q-SAM2在视频分割基准的J＆F上获得了高达9.7个百分点的准确度增益，并且在实例分割的mIoU上比最佳竞争QAT模型提高了7.3个百分点，同时与BF16基线相比，模型尺寸减小了8倍。

更新时间: 2025-11-24 10:55:38

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2506.09782v2

MOCLIP: A Foundation Model for Large-Scale Nanophotonic Inverse Design

Foundation models (FMs) are transforming artificial intelligence by enabling generalizable, data-efficient solutions across different domains for a broad range of applications. However, the lack of large and diverse datasets limits the development of FMs in nanophotonics. This work presents MOCLIP (Metasurface Optics Contrastive Learning Pretrained), a nanophotonic foundation model that integrates metasurface geometry and spectra within a shared latent space. MOCLIP employs contrastive learning to align geometry and spectral representations using an experimentally acquired dataset with a sample density comparable to ImageNet-1K. The study demonstrates MOCLIP's inverse design capabilities for high-throughput zero-shot prediction at a rate of 0.2 million samples per second, enabling the design of a full 4-inch wafer populated with high-density metasurfaces in minutes. It also shows generative latent-space optimization reaching 97 percent accuracy. Finally, we introduce an optical information storage concept that uses MOCLIP to achieve a density of 0.1 Gbit per square millimeter at the resolution limit, exceeding commercial optical media by a factor of six. These results position MOCLIP as a scalable and versatile platform for next-generation photonic design and data-driven applications.

Updated: 2025-11-24 10:54:19

标题: MOCLIP：大规模纳米光子逆向设计的基础模型

摘要: 基金会模型（FM）正在通过在不同领域为广泛的应用提供可泛化、数据高效的解决方案来改变人工智能。然而，缺乏大规模和多样化的数据集限制了纳米光子学中FM的发展。本研究提出了MOCLIP（Metasurface Optics Contrastive Learning Pretrained），这是一个纳米光子基础模型，将超表面几何和光谱集成到共享的潜在空间中。MOCLIP利用对比学习来对齐几何和光谱表示，使用一个实验获取的数据集，其样本密度与ImageNet-1K相当。研究证明了MOCLIP逆向设计能力，以每秒0.2百万个样本的速率进行高通量零样本预测，使得在几分钟内设计一个充满高密度超表面的完整4英寸晶圆成为可能。它还展示了生成潜在空间优化达到了97%的准确率。最后，我们介绍了一种光学信息存储概念，利用MOCLIP实现了每平方毫米0.1 Gbit的密度，达到了分辨率限制，超过商业光学介质六倍。这些结果将MOCLIP定位为下一代光子设计和数据驱动应用的可扩展和多功能平台。

更新时间: 2025-11-24 10:54:19

领域: physics.optics,cs.AI

下载: http://arxiv.org/abs/2511.18980v1

M2R2: MultiModal Robotic Representation for Temporal Action Segmentation

Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel pretraining strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on the REASSEMBLE dataset, a challenging multimodal robotic assembly dataset, outperforming existing robotic action segmentation models by 46.6%. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.

Updated: 2025-11-24 10:52:11

标题: M2R2：多模态机器人表示用于时间动作分割

摘要: 时间动作分割（TAS）长期以来一直是机器人学和计算机视觉领域的重点研究领域。在机器人学中，算法主要集中在利用本体感知信息来确定技能边界，最近在外科机器人学中引入了视觉。相比之下，计算机视觉通常依赖于外部传感器，如摄像头。现有的机器人多模态TAS模型在模型内部集成了特征融合，使得在不同模型之间重复学习特征变得困难。与此同时，在计算机视觉中常用的预训练视觉特征提取器在物体可见性有限的情况下表现不佳。在这项工作中，我们提出了M2R2，一种专为TAS定制的多模态特征提取器，结合了本体感知和外部感知传感器的信息。我们引入了一种新颖的预训练策略，使得可以在多个TAS模型之间重复使用学习到的特征。我们的方法在REASSEMBLE数据集上取得了最先进的性能，这是一个具有挑战性的多模态机器人装配数据集，优于现有的机器人动作分割模型46.6%。此外，我们进行了广泛的消融研究，评估了不同模态在机器人TAS任务中的贡献。

更新时间: 2025-11-24 10:52:11

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2504.18662v2

FastForward Pruning: Efficient LLM Pruning via Single-Step Reinforcement Learning

Pruning is an effective method for compressing Large Language Models, but finding an optimal, non-uniform layer-wise sparsity allocation remains a key challenge. Heuristic methods are fast but yield suboptimal performance, while more powerful search-based approaches such as Reinforcement Learning (RL) are often hindered by prohibitive computational costs on large-scale models. To overcome this efficiency barrier, we propose FastForward Pruning. Its core is a decoupled, single-step RL framework that separates policy optimization from the complex budget satisfaction problem. Such a decoupling is crucial for efficiently searching the vast policy space of LLMs. This curriculum-based strategy begins with low-cost, simple tasks and gradually increases in complexity, significantly reducing the search's computational overhead. Evaluated on the LLaMA, Mistral, and OPT model families, our framework discovers pruning policies that achieve superior performance over strong heuristic baselines. Crucially, when compared to other search-based algorithms, our method achieves competitive or superior results at a fraction of the computational cost, demonstrating a clear advantage in search efficiency.

Updated: 2025-11-24 10:47:55

标题: 快速剪枝：通过单步强化学习实现高效的LLM剪枝

摘要: 修剪是压缩大型语言模型的一种有效方法，但寻找最佳的非均匀层间稀疏分配仍然是一个关键挑战。虽然启发式方法快速但产生次优性能，但更强大的基于搜索的方法如强化学习常常受到大规模模型上的计算成本的限制。为了克服这种效率障碍，我们提出了快速前向修剪。其核心是一种分离的、单步强化学习框架，将策略优化与复杂的预算满足问题分开。这种分离对于高效搜索大型语言模型的广阔策略空间至关重要。这种基于课程的策略从低成本、简单任务开始，逐渐增加复杂度，显著减少了搜索的计算开销。在LLaMA、Mistral和OPT模型家族上进行评估，我们的框架发现了修剪策略，其性能优于强启发式基线。至关重要的是，与其他基于搜索的算法相比，我们的方法在计算成本的一小部分上取得了具有竞争力或优越的结果，展示了在搜索效率上的明显优势。

更新时间: 2025-11-24 10:47:55

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.18977v1

When, Where and Why to Average Weights?

Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf (Dahl et al., 2023), a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while mildly improving generalization across all considered workloads. Finally, we explore the relationship between averaging and learning rate annealing and show how to optimally combine the two to achieve the best performance.
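
As a rough illustration of the technique being benchmarked, the sketch below keeps a uniform running average of checkpoints, each represented as a flat list of parameters. The abstract does not say which averaging schemes are evaluated, so the uniform average and the name `running_average` are assumptions.

```python
def running_average(checkpoints):
    """Uniform average of checkpoints, updated incrementally.

    Only one extra copy of the parameters is kept in memory, which
    matches the 'minimal memory cost' mentioned in the abstract.
    """
    average = list(checkpoints[0])
    for n_seen, weights in enumerate(checkpoints[1:], start=1):
        # Incremental mean: avg_new = avg_old + (w - avg_old) / (n_seen + 1)
        average = [a + (w - a) / (n_seen + 1) for a, w in zip(average, weights)]
    return average


# Three toy checkpoints with two parameters each.
checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = running_average(checkpoints)
```

The incremental form avoids storing all checkpoints: at any point during training, the averaged weights can be swapped in for evaluation.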

Updated: 2025-11-24 10:35:14

标题: 何时、何地以及为何要平均权重？

摘要: 在训练轨迹中对检查点进行平均是一种简单而强大的方法，可以提高机器学习模型的泛化性能并减少训练时间。受这些潜在收益的启发，并为了公平和全面地评估这种技术，我们使用AlgoPerf进行了现代深度学习中平均技术的大规模评估，AlgoPerf是一个针对优化算法的大规模基准测试。我们研究了权重平均化是否可以减少训练时间，改善泛化性能，并取代学习率衰减，正如最近文献所建议的那样。我们在七种架构和数据集上进行的评估显示，平均化显著加速训练并带来相当大的效率收益，代价是最小的实现和内存成本，同时在所有考虑的工作负载上略微改善了泛化性能。最后，我们探讨了平均化和学习率退火之间的关系，并展示了如何最佳地结合两者以实现最佳性能。

更新时间: 2025-11-24 10:35:14

领域: cs.LG

下载: http://arxiv.org/abs/2502.06761v3

Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers

Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.

Updated: 2025-11-24 10:32:57

标题: 通过专业教师的中间层知识蒸馏防止医学图像分析中的快捷学习

摘要: 深度学习模型容易学习到使用训练数据中虽然相关但无关的特征来解决问题的捷径。在高风险应用，如医学图像分析中，这种现象可能阻止模型在进行预测时使用具有临床意义的特征，可能导致模型鲁棒性差，对患者造成伤害。我们展示了不同类型的捷径（分散在整个图像中的以及局部特定区域的）在网络层中表现出明显差异，因此可以通过针对中间层的缓解策略更有效地进行针对性的处理。我们提出了一种新颖的知识蒸馏框架，利用对一小部分与任务相关的数据进行微调的教师网络来减轻学生网络在受到带有偏差特征的大型数据集训练时学习捷径的问题。通过在CheXpert、ISIC 2017和SimBA数据集上使用各种架构（ResNet-18、AlexNet、DenseNet-121和3D CNNs）进行大量实验，我们证明了相对于传统的经验风险最小化、基于增强的偏差缓解以及基于群体的偏差缓解方法，我们的方法能够取得一致的改进。在许多情况下，我们实现了与基于无偏数据训练的基线模型相当的性能，甚至在分布外的测试数据上也是如此。我们的结果展示了我们的方法在现实世界医学图像场景中的实际适用性，其中偏差注释有限且捷径特征难以事先识别。

更新时间: 2025-11-24 10:32:57

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.17421v2

LLM-CSEC: Empirical Evaluation of Security in C/C++ Code Generated by Large Language Models

The security of code generated by large language models (LLMs) is a significant concern, as studies indicate that such code often contains vulnerabilities and lacks essential defensive programming constructs. This work focuses on examining and evaluating the security of LLM-generated code, particularly in the context of C/C++. We categorized known vulnerabilities using the Common Weakness Enumeration (CWE) and, to study their criticality, mapped them to CVEs. We used ten different LLMs for code generation and analyzed the outputs through static analysis. The number of CWEs present in AI-generated code is concerning. Our findings highlight the need for developers to be cautious when using LLM-generated code. This study provides valuable insights to advance automated code generation and encourage further research in this domain.

Updated: 2025-11-24 10:31:53

标题: LLM-CSEC: 大型语言模型生成的C/C++代码安全性的实证评估

摘要: 大型语言模型(LLMs)生成的代码的安全性是一个重要问题，研究表明这类代码通常包含漏洞，缺乏必要的防御性编程结构。本研究旨在考察和评估LLM生成的代码的安全性，特别是在C/C++环境下。我们使用通用弱点枚举(CWE)对已知漏洞进行分类，并将它们映射到CVEs以研究其关键性。我们使用了十种不同的LLMs进行代码生成，并通过静态分析分析了输出。人工智能生成的代码中存在的CWE数量令人担忧。我们的研究结果强调开发人员在使用LLM生成的代码时需要谨慎。本研究提供了有价值的见解，以推动自动化代码生成的发展，并鼓励在这一领域进一步研究。

更新时间: 2025-11-24 10:31:53

领域: cs.AI,cs.CR

下载: http://arxiv.org/abs/2511.18966v1

Synthesizing Visual Concepts as Vision-Language Programs

Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, though they exploit rigid, domain-specific perception modules. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLP leverages the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.

Updated: 2025-11-24 10:30:33

标题: 将视觉概念综合为视觉语言程序

摘要: 视觉语言模型（VLMs）在多模态任务上取得了强大的表现，但通常在系统化的视觉推理任务上失败，导致输出不一致或不合逻辑。神经符号方法承诺通过引入可解释的逻辑规则来解决这个问题，尽管它们利用了刚性、领域特定的感知模块。我们提出了视觉语言程序（VLP），将VLMs的感知灵活性与程序合成的系统推理相结合。与将推理嵌入VLM不同，VLP利用模型生成结构化的视觉描述，这些描述被编译成神经符号程序。产生的程序直接在图像上执行，与任务约束保持一致，并提供可解释的解释，有助于简化解释。对合成和真实数据集的实验表明，在需要复杂逻辑推理的任务上，VLP优于直接和结构化提示。

更新时间: 2025-11-24 10:30:33

领域: cs.AI

下载: http://arxiv.org/abs/2511.18964v1

Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment

The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the $\Delta$F1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.

Updated: 2025-11-24 10:26:30

标题: 跨领域泛化的多模态LLMs在全球光伏评估中的应用

摘要: 分布式光伏（PV）系统的快速扩张对电力网络管理提出了挑战，因为许多安装仍然未经记录。虽然卫星图像提供了全球范围的覆盖，但传统的计算机视觉（CV）模型，如CNN和U-Nets，需要大量标记数据，并且无法在不同地区推广。本研究调查了一个多模态大型语言模型（LLM）在全球PV评估中的跨域泛化能力。通过利用结构化提示和微调，该模型将检测、定位和量化集成到统一的架构中。使用$Δ$F1指标进行跨区域评估表明，所提出的模型在未知区域中实现了最小的性能降级，优于传统的CV和变压器基线。这些结果突出了多模态LLM在领域转移下的稳健性，以及它们在可扩展、可转移和可解释的全球PV映射中的潜力。

更新时间: 2025-11-24 10:26:30

领域: cs.CV,cs.AI,cs.LG,eess.IV

下载: http://arxiv.org/abs/2511.19537v1

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). This history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights that actively process task-relevant visual tokens based on their historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
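
The idea of weighting visual tokens by a recurrent state can be sketched in a few lines. The abstract states only that soft weights are computed from the recurrent state; the dot-product scoring and softmax normalization below are assumptions made for illustration, as are the function names and toy vectors.

```python
import math


def soft_token_weights(state, tokens):
    """Score each token against the state and normalize with a softmax."""
    scores = [sum(s * t for s, t in zip(state, token)) for token in tokens]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_tokens(state, tokens):
    """Scale every token by its soft weight; return tokens and weights."""
    weights = soft_token_weights(state, tokens)
    weighted = [[w * x for x in token] for w, token in zip(weights, tokens)]
    return weighted, weights


recurrent_state = [1.0, 0.0]  # stands in for the belief-state approximation
visual_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weighted, weights = reweight_tokens(recurrent_state, visual_tokens)
```

Tokens aligned with the state receive larger weights, so history carried in the state steers which parts of the current frame are emphasized.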

Updated: 2025-11-24 10:22:28

标题: AVA-VLA：通过主动视觉注意力改进视觉-语言-行动模型

摘要: 视觉-语言-行动（VLA）模型在具体AI任务中展现出卓越的能力。然而，现有的VLA模型通常建立在视觉-语言模型（VLMs）之上，通常在每个时间步独立处理密集的视觉输入。这种方法隐式地将任务建模为马尔可夫决策过程（MDP）。然而，这种忽略历史的设计对于在动态序贯决策中有效处理视觉令牌是次优的，因为它未能利用历史情境。为了解决这一限制，我们从部分可观察马尔可夫决策过程（POMDP）的角度重新制定了问题，并提出了一个名为AVA-VLA的新框架。受到POMDP的启发，行动生成应该取决于信念状态。AVA-VLA引入了主动视觉注意力（AVA）来动态调节视觉处理。它通过利用循环状态实现了这一点，循环状态是从先前决策步骤推导出的代理信念状态的神经近似。具体来说，AVA模块使用循环状态计算软权重，以根据其历史情境主动处理与任务相关的视觉令牌。全面的评估表明，AVA-VLA在包括LIBERO和CALVIN在内的流行机器人基准测试中实现了最先进的性能。此外，在双臂机器人平台上的真实部署验证了该框架的实际适用性和强大的从模拟到真实的可转移性。

更新时间: 2025-11-24 10:22:28

领域: cs.LG,cs.CV,cs.RO

下载: http://arxiv.org/abs/2511.18960v1

Learning to Compress Graphs via Dual Agents for Consistent Topological Robustness Evaluation

As graph-structured data grow increasingly large, evaluating their robustness under adversarial attacks becomes computationally expensive and difficult to scale. To address this challenge, we propose to compress graphs into compact representations that preserve both topological structure and robustness profile, enabling efficient and reliable evaluation. We propose Cutter, a dual-agent reinforcement learning framework composed of a Vital Detection Agent (VDA) and a Redundancy Detection Agent (RDA), which collaboratively identify structurally vital and redundant nodes for guided compression. Cutter incorporates three key strategies to enhance learning efficiency and compression quality: trajectory-level reward shaping to transform sparse trajectory returns into dense, policy-equivalent learning signals; prototype-based shaping to guide decisions using behavioral patterns from both high- and low-return trajectories; and cross-agent imitation to enable safer and more transferable exploration. Experiments on multiple real-world graphs demonstrate that Cutter generates compressed graphs that retain essential static topological properties and exhibit robustness degradation trends highly consistent with the original graphs under various attack scenarios, thereby significantly improving evaluation efficiency without compromising assessment fidelity.

Updated: 2025-11-24 10:19:58

标题: 学习通过双代理压缩图以实现一致的拓扑稳健性评估

摘要: 随着图结构数据规模的增长，评估它们在对抗性攻击下的稳健性变得计算昂贵且难以扩展。为了解决这一挑战，我们提出将图压缩为保留拓扑结构和稳健性特征的紧凑表示，从而实现高效可靠的评估。我们提出了Cutter，一个由关键检测代理（VDA）和冗余检测代理（RDA）组成的双代理强化学习框架，它们共同识别结构关键和冗余节点以引导压缩。Cutter结合了三种关键策略来增强学习效率和压缩质量：轨迹级奖励塑造将稀疏轨迹回报转化为密集、策略等价的学习信号；基于原型的塑造利用高和低回报轨迹的行为模式来指导决策；跨代理模仿以实现更安全、更可传输的探索。在多个现实世界的图上进行的实验表明，Cutter生成的压缩图保留了关键的静态拓扑特性，并在各种攻击场景下表现出与原始图高度一致的稳健性下降趋势，从而显著提高了评估效率，而不会损害评估的准确性。

更新时间: 2025-11-24 10:19:58

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.18958v1

AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

Inference attacks have been widely studied and offer a systematic risk assessment of ML services; however, their implementation and the attack parameters for optimal estimation are challenging for non-experts. The emergence of advanced large language models presents a promising yet largely unexplored opportunity to develop autonomous agents as inference attack experts, helping address this challenge. In this paper, we propose AttackPilot, an autonomous agent capable of independently conducting inference attacks without human intervention. We evaluate it on 20 target services. The evaluation shows that our agent, using GPT-4o, achieves a 100.0% task completion rate and near-expert attack performance, with an average token cost of only $0.627 per run. The agent can also be powered by many other representative LLMs and can adaptively optimize its strategy under service constraints. We further perform trace analysis, demonstrating that design choices, such as a multi-agent framework and task-specific action spaces, effectively mitigate errors such as bad plans, inability to follow instructions, task context loss, and hallucinations. We anticipate that such agents could empower non-expert ML service providers, auditors, or regulators to systematically assess the risks of ML services without requiring deep domain expertise.

Updated: 2025-11-24 10:14:14

标题: AttackPilot：基于LLM代理的自主推理攻击对ML服务进行攻击

摘要: 推论攻击已经被广泛研究，并提供了对ML服务的系统风险评估；然而，对于非专家来说，它们的实施和最佳估计的攻击参数具有挑战性。先进的大型语言模型的出现为开发自主代理作为推论攻击专家提供了一个有前途但广泛未被探索的机会，从而帮助解决这一挑战。在本文中，我们提出了AttackPilot，一个能够独立进行推论攻击而无需人类干预的自主代理。我们对20个目标服务进行了评估。评估结果显示，我们的代理，使用GPT-4o，实现了100.0%的任务完成率和接近专家级的攻击性能，每次运行的平均代币成本仅为0.627美元。该代理还可以由许多其他代表性的LLM提供动力，并可以在服务约束下自适应优化其策略。我们进一步进行了跟踪分析，展示了设计选择，例如多代理框架和任务特定的行动空间，有效地减轻了错误，如坏计划、无法遵循说明、任务上下文丢失和幻觉。我们预计，这样的代理可以使非专家ML服务提供商、审计员或监管机构能够在不需要深度领域专业知识的情况下系统地评估ML服务的风险。

更新时间: 2025-11-24 10:14:14

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2511.19536v1

Active Inference is a Subtype of Variational Inference

Automated decision-making under uncertainty requires balancing exploitation and exploration. Classical methods treat these separately using heuristics, while Active Inference unifies them through Expected Free Energy (EFE) minimization. However, EFE minimization is computationally expensive, limiting scalability. We build on recent theory recasting EFE minimization as variational inference, formally unifying it with Planning-as-Inference and showing the epistemic drive as a unique entropic contribution. Our main contribution is a novel message-passing scheme for this unified objective, enabling scalable Active Inference in factored-state MDPs and overcoming high-dimensional planning intractability.

Updated: 2025-11-24 10:14:09

Categories: cs.AI

Download: http://arxiv.org/abs/2511.18955v1

Agent-OM: Leveraging LLM Agents for Ontology Matching

Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.

Updated: 2025-11-24 10:10:51

Categories: cs.AI,cs.CL,cs.IR

Download: http://arxiv.org/abs/2312.00326v21

Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation

Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, these task-agnostic methods struggle to preserve task-critical visual information. To address this challenge, simultaneously preserving both the holistic context and fine-grained details for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Experimentally, extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. The real-robot deployments on a dual-arm robot platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that our instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, thereby validating the effectiveness of our approach.

Updated: 2025-11-24 10:06:41

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2511.18950v1

OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution

Denoising Diffusion Probabilistic Models (DDPMs) show promising potential in one-step Real-World Image Super-Resolution (Real-ISR). Current one-step Real-ISR methods typically inject the low-quality (LQ) image latent representation at the start or end timestep of the DDPM scheduler. Recent studies have begun to note that the LQ image latent and the pre-trained noisy latent representations are intuitively closer at a mid-timestep. However, a quantitative analysis of these latent representations remains lacking. Considering these latent representations can be decomposed into signal and noise, we propose a method based on the Signal-to-Noise Ratio (SNR) to pre-compute an average optimal mid-timestep for injection. To better approximate the pre-trained noisy latent representation, we further introduce the Latent Representation Refinement (LRR) loss via a LoRA-enhanced VAE encoder. We also fine-tune the backbone of the DDPM-based generative model using LoRA to perform one-step denoising at the average optimal mid-timestep. Based on these components, we present OMGSR, a GAN-based Real-ISR framework that employs a DDPM-based generative model as the generator and a DINOv3-ConvNeXt model with multi-level discriminator heads as the discriminator. We also propose the DINOv3-ConvNeXt DISTS (Dv3CD) loss, which is enhanced for structural perception at varying resolutions. Within the OMGSR framework, we develop OMGSR-S based on SD2.1-base. An ablation study confirms that our pre-computation strategy and LRR loss significantly improve the baseline. Comparative studies demonstrate that OMGSR-S achieves state-of-the-art performance across multiple metrics. Code is available at https://github.com/wuer5/OMGSR.
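As a rough illustration of the SNR idea in the OMGSR abstract, the sketch below computes the signal-to-noise ratio of a DDPM noise schedule, SNR(t) = alpha_bar_t / (1 - alpha_bar_t), and picks the timestep whose SNR is closest to a target value. The linear beta schedule, the function names, and the target SNR of 1.0 are assumptions made for illustration; the abstract does not describe the paper's actual pre-computation procedure.

```python
import math

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def snr(alpha_bar):
    """Signal-to-noise ratio of a noisy latent at a given alpha_bar."""
    return alpha_bar / (1.0 - alpha_bar)

def closest_timestep(target_snr, alpha_bars):
    """Timestep whose log-SNR is nearest to log(target_snr)."""
    target = math.log(target_snr)
    return min(range(len(alpha_bars)),
               key=lambda t: abs(math.log(snr(alpha_bars[t])) - target))

alpha_bars = alpha_bar_schedule()
# Hypothetical target: the point where signal and noise have equal power.
t_mid = closest_timestep(target_snr=1.0, alpha_bars=alpha_bars)
```

In practice the target SNR would presumably be estimated from LQ image latents rather than fixed by hand; the constant here only shows that the selected step falls in the interior of the schedule rather than at either end.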

Updated: 2025-11-24 09:55:44

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.08227v2

MIST: Mutual Information Via Supervised Training

We propose a fully data-driven approach to designing mutual information (MI) estimators. Since any MI estimator is a function of the observed sample from two random variables, we parameterize this function with a neural network (MIST) and train it end-to-end to predict MI values. Training is performed on a large meta-dataset of 625,000 synthetic joint distributions with known ground-truth MI. To handle variable sample sizes and dimensions, we employ a two-dimensional attention scheme ensuring permutation invariance across input samples. To quantify uncertainty, we optimize a quantile regression loss, enabling the estimator to approximate the sampling distribution of MI rather than return a single point estimate. This research program departs from prior work by taking a fully empirical route, trading universal theoretical guarantees for flexibility and efficiency. Empirically, the learned estimators largely outperform classical baselines across sample sizes and dimensions, including on joint distributions unseen during training. The resulting quantile-based intervals are well-calibrated and more reliable than bootstrap-based confidence intervals, while inference is orders of magnitude faster than existing neural baselines. Beyond immediate empirical gains, this framework yields trainable, fully differentiable estimators that can be embedded into larger learning pipelines. Moreover, exploiting MI's invariance to invertible transformations, meta-datasets can be adapted to arbitrary data modalities via normalizing flows, enabling flexible training for diverse target meta-distributions.
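The quantile regression objective mentioned in the MIST abstract is commonly implemented as the pinball loss. The sketch below shows that loss for a single target and a handful of quantile levels; the function names, the example MI value, and the chosen quantile levels are illustrative assumptions, not details taken from the paper.

```python
def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss for one prediction at quantile level tau in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return max(tau * diff, (tau - 1.0) * diff)

def mean_pinball_loss(y_true, quantile_preds, taus):
    """Average pinball loss over several predicted quantiles of the same target."""
    losses = [pinball_loss(y_true, q, t) for q, t in zip(quantile_preds, taus)]
    return sum(losses) / len(losses)

# Hypothetical example: true MI of 1.0 nat, predicted 5% / 50% / 95% quantiles.
loss = mean_pinball_loss(1.0, [0.8, 1.0, 1.3], [0.05, 0.5, 0.95])
```

Minimizing this loss at several levels of tau pushes each output toward the corresponding quantile of the target, which is what lets a single network return an interval rather than a point estimate.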

Updated: 2025-11-24 09:55:28

Categories: cs.LG,cs.IT

Download: http://arxiv.org/abs/2511.18945v1

Developing an Algorithm Selector for Green Configuration in Scheduling Problems

The Job Shop Scheduling Problem (JSP) is central to operations research, and optimizing its energy efficiency matters because of the profound environmental and economic implications. Efficient scheduling enhances production metrics and mitigates energy consumption, thus effectively balancing productivity and sustainability objectives. Given the intricate and diverse nature of JSP instances, along with the array of algorithms developed to tackle these challenges, an intelligent algorithm selection tool becomes paramount. This paper introduces a framework designed to identify key problem features that characterize its complexity and guide the selection of suitable algorithms. Leveraging machine learning techniques, particularly XGBoost, the framework recommends optimal solvers such as GUROBI, CPLEX, and GECODE for efficient JSP scheduling. GUROBI excels with smaller instances, while GECODE demonstrates robust scalability for complex scenarios. The proposed algorithm selector achieves an accuracy of 84.51% in recommending the best algorithm for solving new JSP instances, highlighting its efficacy in algorithm selection. By refining feature extraction methodologies, the framework aims to broaden its applicability across diverse JSP scenarios, thereby advancing efficiency and sustainability in manufacturing logistics.

Updated: 2025-11-24 09:52:56

Categories: cs.AI

Download: http://arxiv.org/abs/2409.08641v2

Counterfactual Explainable AI (XAI) Method for Deep Learning-Based Multivariate Time Series Classification

Recent advances in deep learning have improved multivariate time series (MTS) classification and regression by capturing complex patterns, but their lack of transparency hinders decision-making. Explainable AI (XAI) methods offer partial insights, yet often fall short of conveying the full decision space. Counterfactual Explanations (CE) provide a promising alternative, but current approaches typically prioritize either accuracy, proximity or sparsity -- rarely all -- limiting their practical value. To address this, we propose CONFETTI, a novel multi-objective CE method for MTS. CONFETTI identifies key MTS subsequences, locates a counterfactual target, and optimally modifies the time series to balance prediction confidence, proximity and sparsity. This method provides actionable insights with minimal changes, improving interpretability and decision support. CONFETTI is evaluated on seven MTS datasets from the UEA archive, demonstrating its effectiveness in various domains. CONFETTI consistently outperforms state-of-the-art CE methods in its optimization objectives, and in six other metrics from the literature, achieving ≥10% higher confidence while improving sparsity in ≥40%.
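To make two of the objectives named in the CONFETTI abstract concrete, the sketch below computes a proximity score and a sparsity score for a counterfactual multivariate series. The specific definitions (Euclidean distance for proximity, fraction of unchanged time points for sparsity) are common choices assumed here for illustration; the abstract does not state which definitions the paper uses.

```python
import math

def proximity(original, counterfactual):
    """Euclidean distance between two multivariate series (lower means closer)."""
    return math.sqrt(sum((a - b) ** 2
                         for ch_o, ch_c in zip(original, counterfactual)
                         for a, b in zip(ch_o, ch_c)))

def sparsity(original, counterfactual, tol=1e-9):
    """Fraction of time points left unchanged (higher means a sparser edit)."""
    pairs = [(a, b)
             for ch_o, ch_c in zip(original, counterfactual)
             for a, b in zip(ch_o, ch_c)]
    unchanged = sum(1 for a, b in pairs if abs(a - b) <= tol)
    return unchanged / len(pairs)

# Hypothetical two-channel series; only two points of channel 0 are modified.
x = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
x_cf = [[0.0, 1.0, 5.0, 7.0], [1.0, 1.0, 1.0, 1.0]]
prox = proximity(x, x_cf)
spars = sparsity(x, x_cf)
```

A multi-objective method would trade these two scores off against the classifier's confidence in the counterfactual class, which is omitted here because it depends on the model being explained.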

Updated: 2025-11-24 09:47:20

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.13237v2

Geometry-Aware Deep Congruence Networks for Manifold Learning in Cross-Subject Motor Imagery

Cross-subject motor-imagery decoding remains a major challenge in EEG-based brain-computer interfaces due to strong subject variability and the curved geometry of covariance matrices on the symmetric positive definite (SPD) manifold. We address the zero-shot cross-subject setting, where no target-subject labels or adaptation are allowed, by introducing novel geometry-aware preprocessing modules and deep congruence networks that operate directly on SPD covariance matrices. Our preprocessing modules, DCR and RiFU, extend Riemannian Alignment by improving action separation while reducing subject-specific distortions. We further propose two manifold classifiers, SPD-DCNet and RiFUNet, which use hierarchical congruence transforms to learn discriminative, subject-invariant covariance representations. On the BCI-IV 2a benchmark, our framework improves cross-subject accuracy by 3-4% over the strongest classical baselines, demonstrating the value of geometry-aware transformations for robust EEG decoding.

Updated: 2025-11-24 09:46:55

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.18940v1

Autonomous Vehicle Path Planning by Searching With Differentiable Simulation

Planning allows an agent to safely refine its actions before executing them in the real world. In autonomous driving, this is crucial to avoid collisions and navigate in complex, dense traffic scenarios. One way to plan is to search for the best action sequence. However, this is challenging when all necessary components - policy, next-state predictor, and critic - have to be learned. Here we propose Differentiable Simulation for Search (DSS), a framework that leverages the differentiable simulator Waymax as both a next state predictor and a critic. It relies on the simulator's hardcoded dynamics, making state predictions highly accurate, while utilizing the simulator's differentiability to effectively search across action sequences. Our DSS agent optimizes its actions using gradient descent over imagined future trajectories. We show experimentally that DSS - the combination of planning gradients and stochastic search - significantly improves tracking and path planning accuracy compared to sequence prediction, imitation learning, model-free RL, and other planning methods.
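The gradient-based part of the planning idea in the DSS abstract can be sketched on a toy problem: roll an action sequence through known dynamics, score the imagined outcome, and follow the gradient of that score with respect to the actions. Everything below (the one-dimensional dynamics, the cost, the step size, and the function names) is an assumption for illustration; it does not use Waymax and leaves out the stochastic search component described in the abstract.

```python
def rollout(x0, actions):
    """Toy simulator with hardcoded dynamics: x_{t+1} = x_t + a_t."""
    x = x0
    for a in actions:
        x = x + a
    return x

def cost_and_grad(x0, actions, goal, reg=0.1):
    """Cost = (x_T - goal)^2 + reg * sum(a_t^2), together with its analytic gradient."""
    err = rollout(x0, actions) - goal
    cost = err ** 2 + reg * sum(a * a for a in actions)
    # Each action shifts the final state by exactly one unit, so d(x_T)/d(a_t) = 1.
    grad = [2.0 * err + 2.0 * reg * a for a in actions]
    return cost, grad

def plan(x0, goal, horizon=5, steps=200, lr=0.05):
    """Gradient descent over an imagined action sequence."""
    actions = [0.0] * horizon
    for _ in range(steps):
        _, grad = cost_and_grad(x0, actions, goal)
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions

actions = plan(x0=0.0, goal=1.0)
final_state = rollout(0.0, actions)
```

With a real differentiable simulator the analytic gradient would be replaced by automatic differentiation through the simulator step, but the optimization loop keeps the same shape.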

Updated: 2025-11-24 09:43:11

Categories: cs.AI,cs.RO

Download: http://arxiv.org/abs/2511.11043v2

SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression

Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
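The "rotate, prune, and use without reconstruction" idea in the SWAN abstract rests on a simple fact: an orthogonal rotation applied to both queries and keys leaves their dot products unchanged, so dropping low-energy rotated dimensions only approximates the scores and never requires decompressing the keys. The two-dimensional sketch below illustrates this; the 45-degree rotation, the toy keys, and the helper names are assumptions, and the way SWAN actually derives its matrix and chooses which dimensions to keep is not given in the abstract.

```python
import math

def rotate(vec, rot):
    """Multiply a row vector by a square matrix: vec @ rot."""
    return [sum(v * rot[i][j] for i, v in enumerate(vec))
            for j in range(len(rot[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Orthogonal 45-degree rotation, a stand-in for an offline-computed matrix.
c = s = math.sqrt(0.5)
R = [[c, -s],
     [s,  c]]

keys = [[1.0, 1.1], [2.0, 1.9], [-1.0, -0.9]]  # energy lies mostly along (1, 1)
query = [0.5, 0.4]

keep = 1  # number of rotated dimensions stored per cached key
pruned_keys = [rotate(k, R)[:keep] for k in keys]  # what the cache would hold
rot_query = rotate(query, R)

exact = [dot(query, k) for k in keys]
# Scores are computed directly in the rotated, pruned space.
approx = [dot(rot_query[:keep], pk) for pk in pruned_keys]
```

Here each key is stored with half of its original dimensions, and the approximate scores stay within 0.01 of the exact ones because the discarded coordinate carries very little of the keys' energy.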

Updated: 2025-11-24 09:41:24

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2511.18936v1

Skeletons Matter: Dynamic Data Augmentation for Text-to-Query

The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron.

Updated: 2025-11-24 09:39:03

Categories: cs.CL,cs.AI,cs.DB

Download: http://arxiv.org/abs/2511.18934v1

GRAPHIC--Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity

Artificial Intelligence (AI) has been increasingly applied to creative domains, leading to the development of systems that collaborate with humans in design processes. In Graphic Design, integrating computational systems into co-creative workflows presents specific challenges, as it requires balancing scientific rigour with the subjective and visual nature of design practice. Following the PRISMA methodology, we identified 872 articles, resulting in a final corpus of 71 publications describing 68 unique systems. Based on this review, we introduce GRAPHIC (Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity), a framework for analysing computational systems applied to Graphic Design. Its goal is to understand how current systems support human-AI collaboration in the Graphic Design discipline. The framework comprises three main dimensions, which our analysis revealed to be essential across diverse system types: (1) Collaborative Panorama, (2) Processes and Modalities, and (3) Graphic Design Principles. Its application revealed research gaps, including the need to balance initiative and control between agents, improve communication through explainable interaction models, and promote systems that support transformational creativity grounded in core design principles.

Updated: 2025-11-24 09:38:31

Categories: cs.HC,cs.AI,cs.GR

Download: http://arxiv.org/abs/2511.17443v2

Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations

Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project

Updated: 2025-11-24 09:38:11

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.18933v1

Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs

Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a good low-latency verification layer than a reliable analytical tool, with clear room for improvement.
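The abstract reports that confidence calibration worsens with web access but does not say how calibration is measured. One widely used measure is the expected calibration error (ECE), sketched below as an assumed stand-in: predictions are binned by stated confidence, and the gap between mean confidence and accuracy is averaged across bins. The bin count and example values are hypothetical.

```python
def expected_calibration_error(confidences, correct, num_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(mean_conf - accuracy)
    return ece

# Hypothetical overconfident model: 90% stated confidence, 50% accuracy.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [True, True, False, False]
ece = expected_calibration_error(confs, hits)
```

An overconfident model of the kind the abstract describes shows up as a large positive gap in the high-confidence bins.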

Updated: 2025-11-24 09:37:43

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.18931v1

SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference

Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed across an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K \geq 1$, expert co-activation within the same MoE layer introduces non-submodularity, which renders greedy methods ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.
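For the $K=1$ case, the abstract describes a greedy algorithm for monotone submodular maximization under a knapsack constraint. The sketch below shows the basic cost-benefit greedy rule on a toy weighted-coverage utility. The utility function, the expert and request names, and the storage costs are all hypothetical, and this plain ratio rule is a simplification: on its own it does not carry the $(1 - 1/e)$ guarantee quoted in the abstract, which in the literature typically requires additional steps such as partial enumeration.

```python
def coverage(selected, covers, weights):
    """Monotone submodular utility: total weight of the requests covered."""
    covered = set()
    for item in selected:
        covered |= covers[item]
    return sum(weights[r] for r in covered)

def greedy_knapsack(covers, weights, costs, budget):
    """Repeatedly add the affordable item with the best marginal gain per unit cost."""
    selected, spent = [], 0.0
    remaining = set(covers)
    while remaining:
        base = coverage(selected, covers, weights)
        best, best_ratio = None, 0.0
        for item in remaining:
            if spent + costs[item] > budget:
                continue  # does not fit in the remaining storage
            gain = coverage(selected + [item], covers, weights) - base
            ratio = gain / costs[item]
            if ratio > best_ratio:
                best, best_ratio = item, ratio
        if best is None:
            break
        selected.append(best)
        spent += costs[best]
        remaining.remove(best)
    return selected

# Hypothetical instance: experts cover request types; costs are storage sizes.
covers = {"e1": {"a", "b"}, "e2": {"b", "c"}, "e3": {"d"}}
weights = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 4.0}
costs = {"e1": 2.0, "e2": 2.0, "e3": 1.0}
chosen = greedy_knapsack(covers, weights, costs, budget=3.0)
```

The abstract notes that for general $K$ co-activation breaks submodularity, which is why the paper moves to a decomposition solved with dynamic programming rather than relying on a rule like this one.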

Updated: 2025-11-24 09:35:35

Categories: cs.LG,cs.DC,cs.NI

Download: http://arxiv.org/abs/2507.06567v2

Learning Solution Operators for Partial Differential Equations via Monte Carlo-Type Approximation

The Monte Carlo-type Neural Operator (MCNO) introduces a lightweight architecture for learning solution operators for parametric PDEs by directly approximating the kernel integral using a Monte Carlo approach. Unlike Fourier Neural Operators, MCNO makes no spectral or translation-invariance assumptions. The kernel is represented as a learnable tensor over a fixed set of randomly sampled points. This design enables generalization across multiple grid resolutions without relying on fixed global basis functions or repeated sampling during training. Experiments on standard 1D PDE benchmarks show that MCNO achieves competitive accuracy with low computational cost, providing a simple and practical alternative to spectral and graph-based neural operators.
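The core approximation named in the MCNO abstract, replacing a kernel integral with an average over a fixed set of randomly sampled points, can be sketched as follows. The Gaussian kernel stands in for the learnable kernel tensor, and the domain [0, 1], the sample count, and the function names are assumptions for illustration only.

```python
import math
import random

def mc_kernel_integral(kernel, v, x, sample_points):
    """Estimate the integral of kernel(x, y) * v(y) over [0, 1] by a sample average."""
    return sum(kernel(x, y) * v(y) for y in sample_points) / len(sample_points)

def gaussian_kernel(x, y):
    """Fixed smooth kernel, a stand-in for a learned one."""
    return math.exp(-(x - y) ** 2)

rng = random.Random(0)
# Points are sampled once and then reused for every evaluation.
points = [rng.random() for _ in range(20000)]

estimate = mc_kernel_integral(gaussian_kernel, math.sin, 0.5, points)

# Sanity check against a known value: the integral of y over [0, 1] is 0.5.
check = mc_kernel_integral(lambda x, y: 1.0, lambda y: y, 0.0, points)
```

Because the estimator only needs function values at the sampled points, the same averaging rule applies at any grid resolution, which is consistent with the resolution generalization the abstract describes.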

Updated: 2025-11-24 09:35:10

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.18930v1

Interpretability of Graph Neural Networks to Assess Effects of Global Change Drivers on Ecological Networks

Pollinators play a crucial role in plant reproduction, whether in natural ecosystems or in human-modified landscapes. Global change drivers, including climate change or land use modifications, can alter plant-pollinator interactions. To assess the potential influence of global change drivers on pollination, large-scale interaction, climate, and land use data are required. While recent machine learning methods, such as graph neural networks (GNNs), allow the analysis of such datasets, interpreting their results can be challenging. We explore existing methods for interpreting GNNs in order to highlight the effects of various environmental covariates on pollination network connectivity. An extensive simulation study is performed to confirm whether these methods can detect the interactive effect between a covariate and a genus of plant on connectivity, and whether the application of debiasing techniques influences the estimation of these effects. An application on the Spipoll dataset, with and without accounting for sampling effects, highlights the potential impact of land use on network connectivity and shows that accounting for sampling effects partially alters the estimation of these effects.

Updated: 2025-11-24 09:33:15

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2503.15107v3

MoodBench 1.0: An Evaluation Benchmark for Emotional Companionship Dialogue Systems

With the rapid development of Large Language Models, dialogue systems are shifting from information tools to emotional companions, heralding the era of Emotional Companionship Dialogue Systems (ECDs) that provide personalized emotional support for users. However, the field lacks clear definitions and systematic evaluation standards for ECDs. To address this, we first propose a definition of ECDs with formal descriptions. Then, based on this theory and the design principle of "Ability Layer-Task Layer (three level)-Data Layer-Method Layer", we design and implement the first ECD evaluation benchmark - MoodBench 1.0. Through extensive evaluations of 30 mainstream models, we demonstrate that MoodBench 1.0 has excellent discriminant validity and can effectively quantify the differences in emotional companionship abilities among models. Furthermore, the results reveal current models' shortcomings in deep emotional companionship, guiding future technological optimization and significantly aiding developers in enhancing ECDs' user experience.

Updated: 2025-11-24 09:32:02

Categories: cs.AI, cs.HC

Download: http://arxiv.org/abs/2511.18926v1

LLM-Driven Kernel Evolution: Automating Driver Updates in Linux

Linux kernel evolution breaks drivers through API/ABI changes, semantic shifts, and security-hardening updates. We introduce DRIVEBENCH, an executable corpus of kernel$\rightarrow$driver co-evolution cases, and AUTODRIVER, a closed-loop, LLM-driven system for automating driver maintenance. The system integrates prompt engineering, multi-agent collaboration, static analysis, and iterative validation to ensure that generated patches are not only syntactically correct but also functionally and semantically consistent with kernel conventions. The corpus spans v5.10-v6.10 with 235 validated cases drawn from 612 candidates. In evaluation across 55 cases, AUTODRIVER achieves 56.4% compilation success; QEMU-based boot verification indicates that compiled patches preserve driver initialization in most instances. By releasing DRIVEBENCH and tooling, we enable reproducible research and a practical route to continuous, safe co-evolution of drivers with the Linux kernel.

Updated: 2025-11-24 09:31:52

Categories: cs.SE, cs.AI

Download: http://arxiv.org/abs/2511.18924v1

Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation

Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of text-visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit to noisy feedback. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.

Updated: 2025-11-24 09:29:30

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2511.18919v1

TICAL: Trusted and Integrity-protected Compilation of AppLications

During the past few years, we have witnessed various efforts to provide confidentiality and integrity for applications running in untrusted environments such as public clouds. Most of these approaches leverage hardware extensions such as Intel SGX, TDX, and AMD SEV to provide encryption and integrity protection at the process or VM level. Although all of these approaches increase trust in the application at runtime, an often overlooked aspect is integrity and confidentiality protection at build time, which is equally important because code maliciously injected during compilation can compromise the entire application and system. In this paper, we present Tical, a practical framework for trusted compilation that provides integrity protection and confidentiality in build pipelines from source code to the final executable. Our approach harnesses TEEs as runtime protection but enriches TEEs with file system shielding and an immutable audit log with version history to provide accountability. This way, we can ensure that the compiler chain can only access trusted files and intermediate output, such as object files produced by trusted processes. Our evaluation using micro- and macro-benchmarks shows that Tical can protect the confidentiality and integrity of whole CI/CD pipelines with an acceptable performance overhead.

Updated: 2025-11-24 09:28:48

Categories: cs.CR

Download: http://arxiv.org/abs/2511.17070v2

Future-Back Threat Modeling: A Foresight-Driven Security Framework

Traditional threat modeling remains reactive, focused on known TTPs and past incident data, while threat prediction and forecasting frameworks are often disconnected from operational or architectural artifacts. This creates a fundamental weakness: the most serious cyber threats often do not arise from what is known, but from what is assumed, overlooked, or not yet conceived. They frequently originate from the future, in areas such as artificial intelligence, information warfare, and supply chain attacks, where adversaries continuously develop new exploits that can bypass defenses built on current knowledge. To address this mental gap, this paper introduces the theory and methodology of Future-Back Threat Modeling (FBTM). This predictive approach begins with envisioned future threat states and works backward to identify assumptions, gaps, blind spots, and vulnerabilities in the current defense architecture, providing a clearer and more accurate view of impending threats so that we can anticipate their emergence and shape the future we want through actions taken now. The proposed methodology further aims to reveal known unknowns and unknown unknowns, including tactics, techniques, and procedures that are emerging, anticipated, and plausible. This enhances the predictability of adversary behavior, particularly under future uncertainty, helping security leaders make informed decisions today that shape more resilient security postures for the future.

Updated: 2025-11-24 09:21:12

Categories: cs.CR, cs.AI, cs.CY

Download: http://arxiv.org/abs/2511.16088v2

Mysticeti: Reaching the Limits of Latency with Uncertified DAGs

We introduce Mysticeti-C, the first DAG-based Byzantine consensus protocol to achieve the latency lower bound of 3 message rounds. Since Mysticeti-C is built over DAGs, it also achieves high resource efficiency and censorship resistance. Mysticeti-C achieves this latency improvement by avoiding explicit certification of the DAG blocks and by proposing a novel commit rule such that every block can be committed without delays, resulting in optimal latency in the steady state and under crash failures. We further extend Mysticeti-C to Mysticeti-FPC, which incorporates a fast commit path that achieves even lower latency for transferring assets. Unlike prior fast commit path protocols, Mysticeti-FPC minimizes the number of signatures and messages by weaving the fast path transactions into the DAG. This frees up resources, which in turn results in better performance. We prove safety and liveness in a Byzantine context. We evaluate both Mysticeti protocols and compare them with state-of-the-art consensus and fast path protocols to demonstrate their low latency and resource efficiency, as well as their more graceful degradation under crash failures. Mysticeti-C is the first Byzantine consensus protocol to achieve WAN latency of 0.5s for consensus commit while simultaneously maintaining state-of-the-art throughput of over 200k TPS. Finally, we report on integrating Mysticeti-C as the consensus protocol into the Sui blockchain, resulting in over 4x latency reduction.

Updated: 2025-11-24 09:10:32

Categories: cs.DC, cs.CR

Download: http://arxiv.org/abs/2310.14821v6

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
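
The two mitigation strategies named in the abstract above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `lr_at` and `average_checkpoints`, the cosine shape of the schedule, and the plain-dict checkpoint format are all assumptions made for this example.

```python
import math


def lr_at(step, total_steps, peak_lr, final_ratio):
    """Cosine decay from peak_lr down to final_ratio * peak_lr.

    A "moderate" schedule in the sense of the abstract would use a
    final_ratio that is not close to zero (the exact value is an assumption).
    """
    floor = peak_lr * final_ratio
    progress = min(max(step / total_steps, 0.0), 1.0)
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


def average_checkpoints(checkpoints, weights):
    """Weighted average of the last few checkpoints.

    Each checkpoint is a hypothetical dict mapping a parameter name to a
    flat list of floats; real frameworks would use tensors instead.
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    averaged = {}
    for name in checkpoints[0]:
        length = len(checkpoints[0][name])
        averaged[name] = [
            sum(w * ckpt[name][i] for ckpt, w in zip(checkpoints, weights)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    last_two = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_checkpoints(last_two, weights=[1.0, 3.0]))
    print(lr_at(step=500, total_steps=1000, peak_lr=1e-3, final_ratio=0.3))
```

With a constant learning rate plus averaging, the high-quality data seen late in the curriculum still receives full-size updates, while the averaging step supplies the noise reduction that decay would otherwise provide.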

Updated: 2025-11-24 09:03:49

Categories: cs.LG, cs.AI, cs.CL

Download: http://arxiv.org/abs/2511.18903v1

VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL

Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical gradient vanishing problem when all responses within a group receive identical rewards, causing advantage estimates to collapse and training signals to diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose VADE, a Variance-Aware Dynamic sampling framework via online sample-level difficulty Estimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three-component design enables VADE to dynamically select the most informative samples, thereby amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. More importantly, our framework can serve as a plug-and-play component that integrates seamlessly into existing group-based RL algorithms. Code and models are available at https://VADE-RL.github.io.
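
To illustrate the mechanism described above, the sketch below first shows why identical rewards in a group yield zero advantages, then shows a Beta-posterior tracker with Thompson-style selection. It is a rough illustration under stated assumptions, not the VADE code: the class `DifficultyTracker`, the use of `p * (1 - p)` as a stand-in for information gain, and the single `decay` factor (the paper describes a two-scale decay) are all hypothetical simplifications.

```python
import random


def group_advantages(rewards):
    """Group-relative advantages: (r - mean) / std, or zeros when std is 0."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    if var == 0.0:
        # Identical rewards carry no relative signal, so the update vanishes.
        return [0.0] * len(rewards)
    std = var ** 0.5
    return [(r - mean) / std for r in rewards]


class DifficultyTracker:
    """Per-sample Beta(alpha, beta) belief over the correctness probability."""

    def __init__(self, num_samples, decay=0.9):
        self.alpha = [1.0] * num_samples  # pseudo-count of correct rollouts
        self.beta = [1.0] * num_samples   # pseudo-count of incorrect rollouts
        self.decay = decay                # assumed single-scale forgetting factor

    def select(self, k, rng):
        """Thompson-style pick: draw p per sample, prefer p close to 0.5."""
        scored = []
        for i in range(len(self.alpha)):
            p = rng.betavariate(self.alpha[i], self.beta[i])
            scored.append((p * (1.0 - p), i))  # Bernoulli reward variance
        scored.sort(reverse=True)
        return [i for _, i in scored[:k]]

    def update(self, i, num_correct, num_wrong):
        """Discount old evidence, then add the outcomes of the new rollouts."""
        self.alpha[i] = self.decay * self.alpha[i] + num_correct
        self.beta[i] = self.decay * self.beta[i] + num_wrong


if __name__ == "__main__":
    print(group_advantages([1, 1, 1, 1]))
    tracker = DifficultyTracker(num_samples=5)
    tracker.update(0, num_correct=8, num_wrong=0)  # sample 0 looks too easy
    print(tracker.select(k=2, rng=random.Random(0)))
```

Samples whose estimated correctness probability sits near 0 or 1 tend to produce identical rewards within a group, so steering selection toward mid-range probabilities is one plausible way to avoid spending rollouts on groups with zero advantage.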

Updated: 2025-11-24 08:59:54

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2511.18902v1

Privacy on the Fly: A Predictive Adversarial Transformation Network for Mobile Sensor Data

Mobile motion sensors such as accelerometers and gyroscopes are now ubiquitously accessible by third-party apps via standard APIs. While enabling rich functionalities like activity recognition and step counting, this openness has also enabled unregulated inference of sensitive user traits, such as gender, age, and even identity, without user consent. Existing privacy-preserving techniques, such as GAN-based obfuscation or differential privacy, typically require access to the full input sequence, introducing latency that is incompatible with real-time scenarios. Worse, they tend to distort temporal and semantic patterns, degrading the utility of the data for benign tasks like activity recognition. To address these limitations, we propose the Predictive Adversarial Transformation Network (PATN), a real-time privacy-preserving framework that leverages historical signals to generate adversarial perturbations proactively. The perturbations are applied immediately upon data acquisition, enabling continuous protection without disrupting application functionality. Experiments on two datasets demonstrate that PATN substantially degrades the performance of privacy inference models, achieving Attack Success Rate (ASR) of 40.11% and 44.65% (reducing inference accuracy to near-random) and increasing the Equal Error Rate (EER) from 8.30% and 7.56% to 41.65% and 46.22%. On ASR, PATN outperforms baseline methods by 16.16% and 31.96%, respectively.

Updated: 2025-11-24 08:58:20

Categories: cs.CR

Download: http://arxiv.org/abs/2511.07242v4

MetaDCSeg: Robust Medical Image Segmentation via Meta Dynamic Center Weighting

Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.

Updated: 2025-11-24 08:51:02

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2511.18894v1

Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation

Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.

Updated: 2025-11-24 08:48:39

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2511.13590v2

Analysis of heart failure patient trajectories using sequence modeling

Transformers have defined the state of the art for clinical prediction tasks involving electronic health records (EHRs). The recently introduced Mamba architecture outperformed an advanced Transformer (Transformer++) based on Llama in handling long context lengths, while using fewer model parameters. Despite the impressive performance of these architectures, a systematic approach to empirically analyze model performance and efficiency under various settings is not well established in the medical domain. The performance of six sequence models was investigated across three architecture classes (Transformers, Transformers++, Mambas) in a large Swedish heart failure (HF) cohort (N = 42820), providing a clinically relevant case study. Patient data included diagnoses, vital signs, laboratories, medications and procedures extracted from in-hospital EHRs. The models were evaluated on three one-year prediction tasks: clinical instability (a readmission phenotype) after initial HF hospitalization, mortality after initial HF hospitalization and mortality after latest hospitalization. Ablations account for modifications of the EHR-based input patient sequence, architectural model configurations, and temporal preprocessing techniques for data collection. Llama achieves the highest predictive discrimination and best calibration, and shows robustness across all tasks, followed by Mambas. Both architectures demonstrate efficient representation learning, with tiny configurations surpassing other large-scale Transformers. At equal model size, Llama and Mambas achieve superior performance using 25% less training data. This paper presents a first ablation study with systematic design choices for input tokenization, model configuration and temporal data preprocessing. Future model development in clinical prediction tasks using EHRs could build upon this study's recommendations as a starting point.

Updated: 2025-11-24 08:46:39

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2511.16839v2

Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models

Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.

Updated: 2025-11-24 08:46:36

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2511.18890v1

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.

Updated: 2025-11-24 08:44:29

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2511.18889v1

Learning Potential Energy Surfaces of Hydrogen Atom Transfer Reactions in Peptides

Hydrogen atom transfer (HAT) reactions are essential in many biological processes, such as radical migration in damaged proteins, but their mechanistic pathways remain incompletely understood. Simulating HAT is challenging due to the need for quantum chemical accuracy at biologically relevant scales; thus, neither classical force fields nor DFT-based molecular dynamics are applicable. Machine-learned potentials offer an alternative, able to learn potential energy surfaces (PESs) with near-quantum accuracy. However, training these models to generalize across diverse HAT configurations, especially at radical positions in proteins, requires tailored data generation and careful model selection. Here, we systematically generate HAT configurations in peptides to build large datasets using semiempirical methods and DFT. We benchmark three graph neural network architectures (SchNet, Allegro, and MACE) on their ability to learn HAT PESs and indirectly predict reaction barriers from energy predictions. MACE consistently outperforms the others in energy, force, and barrier prediction, achieving a mean absolute error of 1.13 kcal/mol on out-of-distribution DFT barrier predictions. Using molecular dynamics, we show our MACE potential is stable, reactive, and generalizes beyond training data to model HAT barriers in collagen I. This accuracy enables integration of ML potentials into large-scale collagen simulations to compute reaction rates from predicted barriers, advancing mechanistic understanding of HAT and radical migration in peptides. We analyze scaling laws, model transferability, and cost-performance trade-offs, and outline strategies for improvement by combining ML potentials with transition state search algorithms and active learning. Our approach is generalizable to other biomolecular systems, enabling quantum-accurate simulations of chemical reactivity in complex environments.
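
The step from predicted barriers to reaction rates mentioned above can be illustrated with the Eyring equation from transition state theory. This is a sketch under assumptions, not the authors' workflow: it treats the predicted barrier as an activation free energy, uses a transmission coefficient of 1, and the function names and the 300 K default temperature are hypothetical choices for this example.

```python
import math

K_B = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34    # Planck constant, J*s
R_KCAL = 1.987204e-3  # gas constant, kcal/(mol*K)


def eyring_rate(barrier_kcal_per_mol, temperature_k=300.0):
    """Rate constant in 1/s: k = (k_B * T / h) * exp(-barrier / (R * T))."""
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-barrier_kcal_per_mol / (R_KCAL * temperature_k))


def rate_error_factor(barrier_error_kcal_per_mol, temperature_k=300.0):
    """Multiplicative change in the rate caused by a given barrier error."""
    return math.exp(barrier_error_kcal_per_mol / (R_KCAL * temperature_k))


if __name__ == "__main__":
    # Hypothetical barrier of 15 kcal/mol, chosen only for illustration.
    print(eyring_rate(15.0))
    # Effect of the reported 1.13 kcal/mol mean absolute barrier error at 300 K.
    print(rate_error_factor(1.13))
```

Because the rate depends exponentially on the barrier, a barrier error near 1 kcal/mol changes the rate by a single-digit multiplicative factor at room temperature, which is why barrier accuracy at this level matters for rate estimates.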

Updated: 2025-11-24 08:44:20

Categories: cs.LG, cond-mat.mtrl-sci, physics.chem-ph, physics.comp-ph, q-bio.BM

Download: http://arxiv.org/abs/2508.00578v2

Hi-SAFE: Hierarchical Secure Aggregation for Lightweight Federated Learning

Federated learning (FL) faces challenges in ensuring both privacy and communication efficiency, particularly in resource-constrained environments such as Internet of Things (IoT) and edge networks. While sign-based methods, such as sign stochastic gradient descent with majority voting (SIGNSGD-MV), offer substantial bandwidth savings, they remain vulnerable to inference attacks due to exposure of gradient signs. Existing secure aggregation techniques are either incompatible with sign-based methods or incur prohibitive overhead. To address these limitations, we propose Hi-SAFE, a lightweight and cryptographically secure aggregation framework for sign-based FL. Our core contribution is the construction of efficient majority vote polynomials for SIGNSGD-MV, derived from Fermat's Little Theorem. This formulation represents the majority vote as a low-degree polynomial over a finite field, enabling secure evaluation that hides intermediate values and reveals only the final result. We further introduce a hierarchical subgrouping strategy that ensures constant multiplicative depth and bounded per-user complexity, independent of the number of users n.
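
To make the idea of a majority-vote polynomial concrete, here is a minimal sketch of one way Fermat's Little Theorem yields such a polynomial over a prime field. It is an illustrative construction and not necessarily the one used in Hi-SAFE: it runs in the clear, with no secure evaluation and no hierarchical subgrouping, and the function names are hypothetical. The key fact is that for a prime p, z^(p-1) mod p equals 1 when z is nonzero and 0 when z is zero, so 1 - (s - a)^(p-1) is a polynomial indicator of s = a.

```python
def is_prime(p):
    """Trial-division primality check, adequate for small demo primes."""
    if p < 2:
        return False
    i = 2
    while i * i <= p:
        if p % i == 0:
            return False
        i += 1
    return True


def equals_indicator(x, a, p):
    """Polynomial indicator of x == a over F_p, via Fermat's Little Theorem."""
    return (1 - pow((x - a) % p, p - 1, p)) % p


def majority_vote(votes, p):
    """Sign of the sum of +1/-1 votes, computed with field arithmetic only.

    Requires a prime p > 2 * len(votes) so that every possible sum in
    [-n, n] maps to a distinct field element.
    """
    n = len(votes)
    if not is_prime(p) or p <= 2 * n:
        raise ValueError("p must be a prime larger than 2 * len(votes)")
    if any(v not in (-1, 1) for v in votes):
        raise ValueError("votes must be +1 or -1")
    s = sum(votes) % p
    acc = 0
    for a in range(-n, n + 1):
        sign = (a > 0) - (a < 0)
        acc = (acc + sign * equals_indicator(s, a, p)) % p
    # Map the field element p - 1 back to the integer -1.
    return acc - p if acc == p - 1 else acc


if __name__ == "__main__":
    print(majority_vote([1, 1, -1], p=7))
    print(majority_vote([-1, -1, 1], p=7))
    print(majority_vote([1, -1], p=5))
```

In this naive form the polynomial's degree is p - 1, which grows with the number of voters; that gives some intuition for why splitting users into small subgroups, as the abstract describes, could keep the degree and multiplicative depth bounded.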

Updated: 2025-11-24 08:42:40

Categories: cs.LG

Download: http://arxiv.org/abs/2511.18887v1
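
The Hi-SAFE abstract above describes writing the majority vote as a polynomial over a finite field using Fermat's Little Theorem. The sketch below shows one textbook way to do this, which may differ from the paper's construction: since x^(p-1) is 1 for nonzero x and 0 for x = 0 modulo a prime p, the expression 1 - (x - a)^(p-1) acts as an indicator for x = a. The function names are hypothetical, the prime p is assumed to be supplied by the caller, and no encryption or secure evaluation is performed here.

```python
def indicator(x, a, p):
    """Return 1 if x == a (mod p), else 0, via Fermat's Little Theorem.

    Assumes p is prime, so z**(p-1) % p is 1 for z != 0 and 0 for z == 0.
    """
    return (1 - pow((x - a) % p, p - 1, p)) % p


def majority_vote(signs, p):
    """Evaluate sign(sum(signs)) as a polynomial in the sum over GF(p).

    signs: list of +1/-1 votes for one gradient coordinate.
    p: a prime larger than 2 * len(signs), assumed prime and not checked here.
    """
    n = len(signs)
    if p <= 2 * n:
        raise ValueError("p must exceed 2n so the sums -n..n stay distinct mod p")
    s = sum(signs) % p
    result = 0
    for a in range(-n, n + 1):
        sign_a = (a > 0) - (a < 0)  # -1, 0, or +1
        result = (result + sign_a * indicator(s, a, p)) % p
    # Map the field element p-1 back to the integer -1.
    return result - p if result > 1 else result
```

The degree of this polynomial grows with p, and hence with the group size, which gives some intuition for why the abstract pairs the polynomial with small hierarchical subgroups; how the paper actually bounds the degree is not stated in the abstract.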

On the dimension of pullback attractors in recurrent neural networks

Recurrent Neural Networks (RNNs) are high-dimensional state space models capable of learning functions on sequence data. Recently, it has been conjectured that reservoir computers, a particular class of RNNs, trained on observations of a dynamical system can be interpreted as embeddings. This result has been established for the case of linear reservoir systems. In this work, we use a nonautonomous dynamical systems approach to establish an upper bound for the fractal dimension of the subset of reservoir state space approximated during the training and prediction phases. We prove that when the input sequences come from an N_in-dimensional invertible dynamical system, the fractal dimension of this set is bounded above by N_in. The results obtained here are useful for dimensionality reduction of computation in RNNs as well as for estimating fractal dimensions of dynamical systems from limited observations of their time series. They are also a step towards understanding the embedding properties of reservoir computers.

Updated: 2025-11-24 08:40:40

Categories: math.DS,cs.AI,cs.LG

Download: http://arxiv.org/abs/2501.11357v3

DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning

In today's data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.

Updated: 2025-11-24 08:37:38

Categories: cs.AI,cs.CL,cs.MA

Download: http://arxiv.org/abs/2511.14299v2

Accelerating Reinforcement Learning via Error-Related Human Brain Signals

In this work, we investigate how implicit neural feedback can accelerate reinforcement learning in complex robotic manipulation settings. While prior electroencephalogram (EEG)-guided reinforcement learning studies have primarily focused on navigation or low-dimensional locomotion tasks, we aim to understand whether such neural evaluative signals can improve policy learning in high-dimensional manipulation tasks involving obstacles and precise end-effector control. We integrate error-related potentials decoded from offline-trained EEG classifiers into reward shaping and systematically evaluate the impact of human-feedback weighting. Experiments on a 7-DoF manipulator in an obstacle-rich reaching environment show that neural feedback accelerates reinforcement learning and, depending on the human-feedback weighting, can yield task success rates that at times exceed those of sparse-reward baselines. Moreover, when applying the best-performing feedback weighting across all subjects, we observe consistent acceleration of reinforcement learning relative to the sparse-reward setting. Furthermore, leave-one-subject-out evaluations confirm that the proposed framework remains robust despite the intrinsic inter-individual variability in EEG decodability. Our findings demonstrate that EEG-based reinforcement learning can scale beyond locomotion tasks and provide a viable pathway for human-aligned manipulation skill acquisition.

Updated: 2025-11-24 08:33:47

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2511.18878v1

Fairness Meets Privacy: Integrating Differential Privacy and Demographic Parity in Multi-class Classification

The increasing use of machine learning in sensitive applications demands algorithms that simultaneously preserve data privacy and ensure fairness across potentially sensitive sub-populations. While privacy and fairness have each been extensively studied, their joint treatment remains poorly understood. Existing research often frames them as conflicting objectives, with multiple studies suggesting that strong privacy notions such as differential privacy inevitably compromise fairness. In this work, we challenge that perspective by showing that differential privacy can be integrated into a fairness-enhancing pipeline with minimal impact on fairness guarantees. We design a postprocessing algorithm, called DP2DP, that enforces both demographic parity and differential privacy. Our analysis reveals that our algorithm converges towards its demographic parity objective at essentially the same rate (up to a logarithmic factor) as the best non-private methods from the literature. Experiments on both synthetic and real datasets confirm our theoretical results, showing that the proposed algorithm achieves state-of-the-art accuracy/fairness/privacy trade-offs.

Updated: 2025-11-24 08:31:02

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2511.18876v1
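
As a small illustration of the demographic parity objective mentioned in the DP2DP abstract above, the sketch below measures how far a multi-class classifier is from parity: for each class, it compares the rate at which that class is predicted in each group and reports the largest difference. This is only a measurement helper under an assumed definition of the gap; it is not the DP2DP algorithm and adds no privacy mechanism. The function name is hypothetical.

```python
from collections import Counter


def demographic_parity_gap(predictions, groups):
    """Largest difference, over classes, between any two groups' prediction rates.

    predictions: predicted class label per sample.
    groups: sensitive-group label per sample (same length as predictions).
    A value of 0.0 means every class is predicted at the same rate in every group.
    """
    by_group = {}
    for pred, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(pred)

    gap = 0.0
    for cls in set(predictions):
        rates = [Counter(preds)[cls] / len(preds) for preds in by_group.values()]
        gap = max(gap, max(rates) - min(rates))
    return gap
```

For example, if group "a" always receives class 0 and group "b" always receives class 1, the gap is 1.0, while identical prediction distributions across groups give 0.0.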

General-Purpose Models for the Chemical Sciences: LLMs and Beyond

Data-driven techniques have great potential to transform and accelerate the chemical sciences. However, the chemical sciences also pose the unique challenge of very diverse, small, fuzzy datasets that are difficult to leverage in conventional machine learning approaches. A new class of models, which can be summarized under the term general-purpose models (GPMs) and includes large language models, has shown the ability to solve tasks they have not been directly trained on and to operate flexibly with small amounts of data in different formats. In this review, we discuss fundamental building principles of GPMs and review recent and emerging applications of those models in the chemical sciences across the entire scientific process. While many of these applications are still in the prototype phase, we expect that the increasing interest in GPMs will make many of them mature in the coming years.

Updated: 2025-11-24 08:29:00

Categories: cs.LG,cond-mat.mtrl-sci,physics.chem-ph

Download: http://arxiv.org/abs/2507.07456v2

GContextFormer: A global context-aware hybrid multi-head attention approach with scaled additive aggregation for multimodal trajectory prediction

Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty arising from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation that achieves intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds a scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under a shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs, while a neighbor-context-enhanced pathway emphasizes salient interactions, with a gating module mediating their contributions to maintain a coverage-focus balance. Experiments on eight highway-ramp scenarios from the TOD-VT dataset show that GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness, and spatial distributions show that its improvements are concentrated in high-curvature and transition zones. Interpretability is achieved through motion mode distinctions and neighbor context modulation, which expose reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.

Updated: 2025-11-24 08:28:42

Categories: cs.AI,cs.CV,cs.LG,cs.MA,cs.RO,cs.SI

Download: http://arxiv.org/abs/2511.18874v1

Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches, especially those based on vector similarity, fail to effectively capture relational dependencies and support multi-step reasoning. In this work, we propose CogGRAG, a human cognition-inspired, graph-based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG models the reasoning process as a tree-structured mind map that decomposes the original problem into interrelated subproblems and explicitly encodes their semantic relationships. This structure not only provides a global view to guide subsequent retrieval and reasoning but also enables self-consistent verification across reasoning paths. The framework operates in three stages: (1) top-down problem decomposition via mind map construction, (2) structured retrieval of both local and global knowledge from external Knowledge Graphs (KGs), and (3) bottom-up reasoning with dual-process self-verification. Unlike previous tree-based decomposition methods such as MindMap or Graph-CoT, CogGRAG unifies problem decomposition, knowledge retrieval, and reasoning under a single graph-structured cognitive framework, allowing early integration of relational knowledge and adaptive verification. Extensive experiments demonstrate that CogGRAG achieves superior accuracy and reliability compared to existing methods.

Updated: 2025-11-24 08:27:31

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.06567v2

COLI: A Hierarchical Efficient Compressor for Large Images

The escalating adoption of high-resolution, large-field-of-view imagery amplifies the need for efficient compression methodologies. Conventional techniques frequently fail to preserve critical image details, while data-driven approaches exhibit limited generalizability. Implicit Neural Representations (INRs) present a promising alternative by learning continuous mappings from spatial coordinates to pixel intensities for individual images, thereby storing network weights rather than raw pixels and avoiding the generalization problem. However, INR-based compression of large images faces challenges including slow compression speed and suboptimal compression ratios. To address these limitations, we introduce COLI (Compressor for Large Images), a novel framework leveraging Neural Representations for Videos (NeRV). First, recognizing that INR-based compression constitutes a training process, we accelerate its convergence through a pretraining-finetuning paradigm, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Second, capitalizing on INRs' transformation of image storage constraints into weight storage, we implement Hyper-Compression, a novel post-training technique to substantially enhance compression ratios while maintaining minimal output distortion. Evaluations across two medical imaging datasets demonstrate that COLI consistently achieves competitive or superior PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while accelerating NeRV training by up to 4 times.

Updated: 2025-11-24 08:27:03

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.11443v2
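
The COLI abstract above reports results in PSNR and bits per pixel (bpp). As a reminder of what these two numbers measure, here is a minimal Python sketch using the standard definitions. The function names are hypothetical, images are represented as flat lists of pixel intensities for simplicity, and for an INR-based codec the compressed size would be the size of the stored network weights, which is an assumption about how the bpp figure is obtained.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def bits_per_pixel(compressed_bytes, width, height):
    """Compressed size in bits divided by the number of pixels."""
    return 8.0 * compressed_bytes / (width * height)
```

A lower bpp at the same PSNR means the same reconstruction quality is reached with a smaller stored representation, which is the trade-off the abstract describes.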

Gradient Propagation in Retrosynthetic Space: An Efficient Framework for Synthesis Plan Generation

Retrosynthesis, which aims to identify viable synthetic pathways for target molecules by decomposing them into simpler precursors, is often treated as a search problem. However, its complexity arises from multi-branched tree-structured pathways rather than linear paths. Some algorithms have been successfully applied in this task, but they either overlook the uncertainties inherent in chemical space or face limitations in practical application scenarios. To address these challenges, this paper introduces a novel gradient-propagation-based algorithmic framework for retrosynthetic route exploration. The proposed framework obtains the contributions of different nodes to the target molecule's success probability through gradient propagation and then guides the algorithm to greedily select the node with the highest contribution for expansion, thereby conducting efficient search in the chemical space. Experimental validations demonstrate that our algorithm achieves broad applicability across diverse molecular targets and exhibits superior computational efficiency compared to existing methods.

Updated: 2025-11-24 08:23:34

Categories: cs.AI,q-bio.BM

Download: http://arxiv.org/abs/2405.16123v2

Periodic Asynchrony: An Effective Method for Accelerating On-Policy Reinforcement Learning

Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we return to the strategy of deploying inference and training separately. By improving the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework that allows demand-driven, independent, and elastic scaling of each component. The accuracy of the algorithm remains completely equivalent to that of the synchronous method, and both are on-policy. We also apply a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repetitive computation. In practice, these techniques achieve at least a threefold overall performance improvement in RL training on NPU platforms, indicating their potential for widespread application.

Updated: 2025-11-24 08:22:50

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.18871v1

Pathway to Relevance: How Cross-Encoders Implement a Semantic Variant of BM25

Mechanistic interpretation has greatly contributed to a more detailed understanding of generative language models, enabling significant progress in identifying structures that implement key behaviors through interactions between internal components. In contrast, interpretability in information retrieval (IR) remains relatively coarse-grained, and much is still unknown as to how IR models determine whether a document is relevant to a query. In this work, we address this gap by mechanistically analyzing how one commonly used model, a cross-encoder, estimates relevance. We find that the model extracts traditional relevance signals, such as term frequency and inverse document frequency, in early-to-middle layers. These concepts are then combined in later layers, similar to the well-known probabilistic ranking function, BM25. Overall, our analysis offers a more nuanced understanding of how IR models compute relevance. Isolating these components lays the groundwork for future interventions that could enhance transparency, mitigate safety risks, and improve scalability.

Updated: 2025-11-24 08:22:00

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2502.04645v3
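
For readers unfamiliar with the ranking function that the cross-encoder abstract above compares against, the following sketch implements classic lexical BM25, which combines term frequency, inverse document frequency, and document-length normalization. It uses the IDF variant with a +1 inside the logarithm and the commonly used parameters k1 = 1.5 and b = 0.75; these choices, and the function name, are assumptions of this sketch. It says nothing about how the cross-encoder computes its semantic variant internally.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    documents: list of token lists. Returns one float score per document.
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores
```

The term-frequency part saturates as tf grows, and rare terms receive a larger IDF weight; these are the kinds of signals the abstract reports finding in the early-to-middle layers.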

Compressing Sensor Data for Remote Assistance of Autonomous Vehicles using Deep Generative Models

In the foreseeable future, autonomous vehicles will require human assistance in situations they cannot resolve on their own. In such scenarios, remote assistance from a human can provide the required input for the vehicle to continue its operation. Typical sensors used in autonomous vehicles include camera and lidar sensors. Due to the massive volume of sensor data that must be sent in real time, highly efficient data compression is essential to prevent an overload of network infrastructure. Sensor data compression using deep generative neural networks has been shown to outperform traditional compression approaches for both image and lidar data in terms of compression rate as well as reconstruction quality. However, there is a lack of research on the performance of generative-neural-network-based compression algorithms for remote assistance. To gain insights into the feasibility of deep generative models for use in remote assistance, we evaluate state-of-the-art algorithms regarding their applicability and identify potential weaknesses. Further, we implement an online pipeline for processing sensor data and demonstrate its performance for remote assistance using the CARLA simulator.

Updated: 2025-11-24 08:17:47

Categories: cs.LG,eess.SP

Download: http://arxiv.org/abs/2111.03201v3

Multidimensional Music Aesthetic Evaluation via Semantically Consistent C-Mixup Augmentation

Evaluating the aesthetic quality of generated songs is challenging due to the multi-dimensional nature of musical perception. We propose a robust music aesthetic evaluation framework that combines (1) multi-source multi-scale feature extraction to obtain complementary segment- and track-level representations, (2) a hierarchical audio augmentation strategy to enrich training data, and (3) a hybrid training objective that integrates regression and ranking losses for accurate scoring and reliable top-song identification. Experiments on the ICASSP 2026 SongEval benchmark demonstrate that our approach consistently outperforms baseline methods across correlation and top-tier metrics.

Updated: 2025-11-24 08:12:33

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2511.18869v1
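
The title of the music-evaluation entry above refers to C-Mixup, a mixup variant for regression in which mixing partners are sampled with higher probability when their labels are close to the anchor's label. The sketch below illustrates that general idea with a Gaussian kernel on label distance. The abstract does not describe the paper's semantically consistent variant or its audio pipeline, so the function names, the bandwidth, the Beta parameter, and the choice to exclude the anchor from its own partner set are all assumptions made for illustration.

```python
import math
import random


def cmixup_partner_probs(labels, anchor_index, bandwidth=1.0):
    """Sampling probabilities for mixing partners, favoring similar labels."""
    anchor = labels[anchor_index]
    weights = []
    for j, label in enumerate(labels):
        if j == anchor_index:
            weights.append(0.0)  # assumption: never mix a sample with itself
        else:
            weights.append(math.exp(-((label - anchor) ** 2) / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    return [w / total for w in weights]


def cmixup_pair(features, labels, anchor_index, alpha=2.0, bandwidth=1.0, rng=random):
    """Mix the anchor with a label-similar partner; returns (mixed_x, mixed_y)."""
    probs = cmixup_partner_probs(labels, anchor_index, bandwidth)
    partner = rng.choices(range(len(labels)), weights=probs, k=1)[0]
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in (0, 1)
    mixed_x = [lam * a + (1.0 - lam) * b
               for a, b in zip(features[anchor_index], features[partner])]
    mixed_y = lam * labels[anchor_index] + (1.0 - lam) * labels[partner]
    return mixed_x, mixed_y
```

Because partners with distant scores are rarely chosen, the interpolated label stays plausible for the interpolated input, which is the motivation usually given for this kind of sampling in regression settings.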

KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit

High quality kernels are critical for reducing training and inference costs of Large Language Models (LLMs), yet they traditionally require significant expertise in hardware architecture and software optimization. While recent advances in LLM-based code generation show promise for complex optimization, existing methods struggle with the vast optimization space due to insufficient hardware domain knowledge, failing to effectively balance exploration and exploitation. We present KernelBand, a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem, enabling LLM agents to strategically navigate the optimization space by treating kernel selection and optimization strategy application as sequential decision-making processes. Our approach leverages hardware profiling information to identify promising optimization strategies and employs runtime behavior clustering to reduce exploration overhead across kernel candidates. Extensive experiments on TritonBench demonstrate that KernelBand significantly outperforms state-of-the-art methods, achieving superior performance with fewer tokens while exhibiting consistent improvement without saturation as computational resources increase.

Updated: 2025-11-24 08:11:50

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.18868v1
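
The KernelBand abstract above frames kernel optimization as a multi-armed bandit problem that balances exploration and exploitation. As background, here is a minimal sketch of the classic flat UCB1 rule, where each arm could stand for one optimization strategy and the reward for an observed speedup. This is not the hierarchical, hardware-aware method of the paper; the function names and the reward function are hypothetical.

```python
import math


def ucb1_select(counts, totals, exploration=2.0):
    """Pick the next arm: try each arm once, then maximize mean + confidence bonus."""
    for arm, pulls in enumerate(counts):
        if pulls == 0:
            return arm
    total_pulls = sum(counts)

    def upper_bound(arm):
        mean = totals[arm] / counts[arm]
        bonus = math.sqrt(exploration * math.log(total_pulls) / counts[arm])
        return mean + bonus

    return max(range(len(counts)), key=upper_bound)


def run_bandit(reward_fn, n_arms, n_rounds):
    """Run UCB1 for n_rounds; returns how often each arm was pulled."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for _ in range(n_rounds):
        arm = ucb1_select(counts, totals)
        totals[arm] += reward_fn(arm)
        counts[arm] += 1
    return counts
```

The confidence bonus shrinks as an arm is pulled more often, so rarely tried strategies are still revisited occasionally while the best-looking one receives most of the budget.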

Preprint: Exploring Inevitable Waypoints for Unsolvability Explanation in Hybrid Planning Problems

Explaining unsolvability of planning problems is of significant research interest in Explainable AI Planning. AI planning literature has reported several research efforts on generating explanations of solutions to planning problems. However, explaining the unsolvability of planning problems remains a largely open and understudied problem. A widely practiced approach to plan generation and automated problem solving, in general, is to decompose tasks into sub-problems that help progressively converge towards the goal. In this paper, we propose to adopt the same philosophy of sub-problem identification as a mechanism for analyzing and explaining unsolvability of planning problems in hybrid systems. In particular, for a given unsolvable planning problem, we propose to identify common waypoints, which are universal obstacles to plan existence; in other words, they appear on every plan from the source to the planning goal. This work envisions such waypoints as sub-problems of the planning problem and the unreachability of any of these waypoints as an explanation for the unsolvability of the original planning problem. We propose a novel method of waypoint identification by casting the problem as an instance of the longest common subsequence problem, a widely popular problem in computer science, typically considered as an illustrative example for the dynamic programming paradigm. Once the waypoints are identified, we perform symbolic reachability analysis on them to identify the earliest unreachable waypoint and report it as the explanation of unsolvability. We present experimental results on unsolvable planning problems in hybrid domains.

Updated: 2025-11-24 08:07:47

Categories: cs.AI,cs.FL

Download: http://arxiv.org/abs/2504.15668v2
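
The waypoint entry above casts waypoint identification as a longest common subsequence (LCS) problem. The sketch below shows the standard dynamic-programming LCS for two sequences and then folds it pairwise over a list of plans. It assumes plans are given as lists of hashable location or step labels; how the paper obtains candidate plans for an unsolvable problem is not described in the abstract. The pairwise fold is a simple heuristic chosen here, not an exact multi-sequence LCS and not necessarily the paper's procedure, and the function names are hypothetical.

```python
def longest_common_subsequence(plan_a, plan_b):
    """Classic O(len(a) * len(b)) dynamic program; returns one LCS as a list."""
    rows, cols = len(plan_a), len(plan_b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if plan_a[i - 1] == plan_b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the subsequence itself.
    common = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if plan_a[i - 1] == plan_b[j - 1]:
            common.append(plan_a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return common[::-1]


def common_waypoints(plans):
    """Fold LCS over all plans to find steps shared, in order, by every plan."""
    common = plans[0]
    for plan in plans[1:]:
        common = longest_common_subsequence(common, plan)
    return common
```

Given the resulting ordered waypoints, the abstract's next step is a symbolic reachability check on each one to find the earliest unreachable waypoint; that analysis is outside the scope of this sketch.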

Generating Reading Comprehension Exercises with Large Language Models for Educational Applications

With the rapid development of large language models (LLMs), their applications have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLM framework named Reading Comprehension Exercise Generation (RCEG), which automatically generates high-quality, personalized English reading comprehension exercises. First, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate. As a result, the quality of the generated content is greatly improved. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed for the experiments, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.

Updated: 2025-11-24 08:00:48

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.18860v1

PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation

Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and balances personalized person generation with semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at https://github.com/annaoooo/PairHuman.

Updated: 2025-11-24 07:58:20

标题: PairHuman：用于定制双人生成的高保真度摄影数据集

摘要: 个性化的双人肖像定制具有相当大的潜在应用，例如保存情感记忆和促进婚礼摄影规划。然而，缺乏基准数据集阻碍了在双人肖像生成中追求高质量定制的努力。在本文中，我们提出了PairHuman数据集，这是第一个专门设计用于生成符合高摄影标准的双人肖像的大规模基准数据集。PairHuman数据集包含超过100,000张图像，捕捉了各种场景、服饰和双人互动，以及丰富的元数据，包括详细的图像描述、人物定位、人体关键点和属性标签。我们还介绍了DHumanDiff，这是一个专门为双人肖像生成而设计的基准，具有增强的面部一致性，同时平衡了个性化人物生成和语义驱动的场景创建。最后，实验结果表明，我们的数据集和方法生成了具有卓越视觉质量的高度定制肖像，符合人类偏好。我们的数据集可在https://github.com/annaoooo/PairHuman 上公开获取。

更新时间: 2025-11-24 07:58:20

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.16712v2

Description of Corner Cases in Automated Driving: Goals and Challenges

Scaling the distribution of automated vehicles requires handling various unexpected and possibly dangerous situations, termed corner cases (CC). Since many modules of automated driving systems are based on machine learning (ML), CC are an essential part of the data for their development. However, there is only a limited amount of CC data in large-scale data collections, which makes them challenging in the context of ML. With a better understanding of CC, offline applications, e.g., dataset analysis, and online methods, e.g., improved performance of automated driving systems, can be improved. While there are knowledge-based descriptions and taxonomies for CC, there is little research on machine-interpretable descriptions. In this extended abstract, we will give a brief overview of the challenges and goals of such a description.

Updated: 2025-11-24 07:58:18

标题: 自动驾驶中的边缘案例描述：目标与挑战

摘要: 自动驾驶汽车的分布需要处理各种意外和可能危险的情况，被称为边缘情况（CC）。由于许多自动驾驶系统的模块基于机器学习（ML），CC是它们开发的数据的重要部分。然而，在大规模数据集中只有有限数量的CC数据，这使它们在ML的背景下具有挑战性。通过对CC的更好理解，离线应用（例如数据集分析）和在线方法（例如改进自动驾驶系统的性能）可以得到改进。虽然有基于知识的CC描述和分类法，但对于机器可解释描述的研究很少。在这篇扩展摘要中，我们将简要概述这种描述的挑战和目标。

更新时间: 2025-11-24 07:58:18

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2109.09607v4

Robust and Generalizable GNN Fine-Tuning via Uncertainty-aware Adapter Learning

Recently, fine-tuning large-scale pre-trained GNNs has attracted remarkable attention as a way to adapt pre-trained GNN models to downstream graph learning tasks. One representative fine-tuning method is to use an adapter (termed AdapterGNN), which aims to 'augment' the pre-trained model by inserting a lightweight module so that the 'augmented' model adapts better to the downstream tasks. However, graph data may contain various types of noise in downstream tasks, such as noisy edges and ambiguous node attributes. Existing AdapterGNNs are often prone to graph noise and exhibit limited generalizability. How to enhance the robustness and generalization ability of GNN fine-tuning remains an open problem. In this paper, we show that the above problem can be well addressed by integrating uncertainty learning into the GNN adapter. We propose the Uncertainty-aware Adapter (UAdapterGNN), which fortifies pre-trained GNN models against noisy graph data in the fine-tuning process. Specifically, in contrast to the regular AdapterGNN, our UAdapterGNN uses a Gaussian probabilistic adapter to augment the pre-trained GNN model. In this way, when the graph contains various kinds of noise, our method can automatically absorb the effects of changes in the variances of the Gaussian distribution, thereby significantly enhancing the model's robustness. UAdapterGNN can also further improve the generalization ability of the model on the downstream tasks. Extensive experiments on several benchmarks demonstrate the effectiveness, robustness, and high generalization ability of the proposed UAdapterGNN method.

Updated: 2025-11-24 07:57:37

标题: 通过不确定性感知适配器学习实现强大且可泛化的GNN微调

摘要: 最近，微调大规模预训练的图神经网络（GNNs）在调整预训练GNN模型以适应下游图学习任务方面引起了显著关注。代表性的微调方法之一是利用适配器（称为AdapterGNN），旨在通过插入轻量级模块来“增强”预训练模型，使“增强”模型更好地适应下游任务。然而，图数据可能在下游任务中包含各种类型的噪声，例如嘈杂的边和模糊的节点属性。现有的AdapterGNN通常容易受到图噪声的影响，展现出有限的泛化能力。如何增强GNN微调的鲁棒性和泛化能力仍然是一个悬而未决的问题。在本文中，我们展示了将不确定性学习整合到GNN适配器中可以很好地解决上述问题。我们提出了基于不确定性的适配器（UAdapterGNN），在微调过程中强化预训练GNN模型针对嘈杂的图数据。具体地，与常规的AdapterGNN相比，我们的UAdapterGNN利用高斯概率适配器来增强预训练GNN模型。通过这种方式，当图中包含各种噪声时，我们的方法可以自动吸收高斯分布方差变化的影响，从而显著增强模型的鲁棒性。此外，UAdapterGNN还可以进一步提高模型在下游任务上的泛化能力。在几个基准测试上进行的大量实验表明，所提出的UAdapterGNN方法具有有效性、鲁棒性和高泛化能力。

更新时间: 2025-11-24 07:57:37

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2511.18859v1

Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories

Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Learn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.

Updated: 2025-11-24 07:54:49

标题: 发现、学习和强化：通过多样化的RL生成轨迹扩展视觉-语言-行动预训练

摘要: 视觉-语言-动作（VLA）模型的预训练需要大量多样化、高质量的操作轨迹数据。目前大部分数据是通过人类远程操作获取的，这种方法昂贵且难以扩展。强化学习（RL）方法通过自主探索学习有用的技能，因此是生成数据的一种可行方法。然而，标准的RL训练会收敛到一个狭窄的执行模式，限制了其在大规模预训练中的实用性。我们提出了Discover, Learn and Reinforce（DLR）框架，这是一个信息论模式发现框架，用于为VLA预训练生成多个不同的、高成功率的行为模式。实证结果表明，DLR在LIBERO上生成了一个明显更多样化的轨迹语料库。具体来说，它学习了多个不同的、高成功率的策略，而标准的RL只发现了一个，因此覆盖了更广泛的状态-动作空间区域。当应用于未见过的下游任务套件时，使用我们多样化RL数据进行预训练的VLA模型胜过使用相同规模的标准RL数据集进行训练的模型。此外，DLR表现出单一模式RL缺乏的正数据扩展行为。这些结果将多模式RL定位为具有实际意义、可扩展的数据引擎，用于具体基础模型。

更新时间: 2025-11-24 07:54:49

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2511.19528v1

AlphaBeta is not as good as you think: a simple random games model for a better analysis of deterministic game-solving algorithms

Deterministic game-solving algorithms are conventionally analyzed in the light of their average-case complexity against a distribution of random game-trees, where leaf values are independently sampled from a fixed distribution. This simplified model enables uncluttered mathematical analysis, revealing two key properties: root value distributions asymptotically collapse to a single fixed value for finite-valued trees, and all reasonable algorithms achieve global optimality. However, these findings are artifacts of the model's design: its long-criticized independence assumption strips games of structural complexity, producing trivial instances where no algorithm faces meaningful challenges. To address this limitation, we introduce a simple probabilistic model that incrementally constructs game-trees using a fixed level-wise conditional distribution. By enforcing ancestor dependencies, a critical structural feature of real-world games, our framework generates problems with adjustable difficulty while retaining some form of analytical tractability. For several algorithms, including AlphaBeta and Scout, we derive recursive formulas characterizing their average-case complexities under this model. These allow us to rigorously compare algorithms on deep game-trees, where Monte-Carlo simulations are no longer feasible. While asymptotically all algorithms seem to converge to an identical branching factor (a result analogous to that of independence-based models), deep finite trees reveal stark differences: AlphaBeta incurs a significantly larger constant multiplicative factor compared to algorithms like Scout, leading to a substantial practical slowdown. Our framework sheds new light on classical game-solving algorithms, offering rigorous evidence and analytical tools to advance the understanding of these methods under a richer, more challenging, and yet tractable model.
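
For readers unfamiliar with the algorithm under analysis, the following is a textbook AlphaBeta sketch that also counts how many leaves it evaluates, which is the cost measure such average-case analyses care about. The nested-list tree encoding and the `counter` argument are assumptions made for illustration; the paper's random-tree model with ancestor dependencies is not reproduced here.

```python
from typing import List, Union

Tree = Union[int, List["Tree"]]  # a leaf value or a list of subtrees


def alphabeta(node: Tree, alpha: float, beta: float,
              maximizing: bool, counter: List[int]) -> float:
    """Textbook alpha-beta search; counter[0] tracks evaluated leaves."""
    if not isinstance(node, list):
        counter[0] += 1
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, counter))
            alpha = max(alpha, value)
            if alpha >= beta:  # the minimizing parent will never allow this branch
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, counter))
        beta = min(beta, value)
        if alpha >= beta:  # the maximizing parent already has a better option
            break
    return value


if __name__ == "__main__":
    tree = [[3, 5], [2, 9]]  # MAX root over two MIN nodes
    visited = [0]
    root = alphabeta(tree, float("-inf"), float("inf"), True, visited)
    print(root, visited[0])  # the leaf 9 is pruned, so 3 of 4 leaves are visited
```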

Updated: 2025-11-24 07:52:48

标题: AlphaBeta并不像你想象的那么好：一个简单的随机游戏模型，用于更好地分析确定性游戏解算法

摘要: 确定性游戏解算法通常在随机游戏树分布的平均情况复杂性下进行分析，其中叶值独立地从固定分布中抽样。这种简化模型使得数学分析更加清晰，揭示了两个关键特性：对于有限价值树，根值分布渐近地收敛到一个固定值，所有合理的算法都实现了全局最优性。然而，这些发现是模型设计的产物：其长期以来备受批评的独立性假设剥夺了游戏的结构复杂性，产生了没有算法面临实质性挑战的琐碎实例。为了解决这一限制，我们引入了一个简单的概率模型，逐步使用固定的逐层条件分布构建游戏树。通过强制祖先依赖关系，这是真实世界游戏的一个关键结构特征，我们的框架生成了一些具有可调难度的问题，同时保留了一定形式的分析可行性。对于包括AlphaBeta和Scout在内的几种算法，我们在该模型下推导了描述其平均情况复杂性的递归公式。这使我们能够在深层游戏树上严格比较算法，而蒙特卡罗模拟不再可行。尽管在渐近意义上，所有算法似乎会收敛到相同的分支因子（类似于基于独立性的模型的结果），深层有限树揭示了明显的差异：与Scout等算法相比，AlphaBeta承担了一个显著较大的常数乘法因子，导致实际上的显著减速。我们的框架为经典游戏解算法带来了新的视角，提供了严格的证据和分析工具，以推进对这些方法在更丰富、更具挑战性但仍可分析的模型下的理解。

更新时间: 2025-11-24 07:52:48

领域: cs.AI

下载: http://arxiv.org/abs/2506.21996v2

Deep Hybrid Model for Region of Interest Detection in Omnidirectional Videos

The main goal of the project is to design a new model that predicts regions of interest in 360° videos. The region of interest (ROI) plays an important role in 360° video streaming. For example, ROIs are used to predict view-ports, to intelligently cut videos for live streaming, and so on, so that less bandwidth is used. Detecting view-ports in advance helps reduce head movement while streaming and watching a video via a head-mounted device. Intelligent cuts of the videos, in turn, help improve the efficiency of streaming the video to users and enhance the quality of their viewing experience. This report describes the secondary task of identifying ROIs, in which we design, train, and test a hybrid saliency model. In this work, we use saliency regions to represent the regions of interest. The method consists of the following steps: preprocessing the video to obtain frames, developing a hybrid saliency model for predicting the region of interest, and finally post-processing the output predictions of the hybrid saliency model to obtain the output region of interest for each frame. We then compare the performance of the proposed method with the subjective annotations of the 360RAT dataset.

Updated: 2025-11-24 07:52:06

标题: 全景视频中感兴趣区域检测的深度混合模型

摘要: 该项目的主要目标是设计一个新的模型，用于预测360°视频中的感兴趣区域。感兴趣区域（ROI）在360°视频流中起着重要作用。例如，ROI被用来预测视口，智能地剪辑视频进行实时流媒体等，以便使用更少的带宽。提前检测视口有助于减少在流媒体和通过头戴设备观看视频时头部的移动。而智能剪辑视频有助于提高向用户流媒体视频的效率，并提升他们观看体验的质量。该报告阐述了识别ROI的次要任务，其中我们设计、训练和测试一个混合显著性模型。在这项工作中，我们称显著性区域为表示感兴趣区域的区域。该方法包括以下过程：预处理视频以获取帧，开发一个用于预测感兴趣区域的混合显著性模型，最终对混合显著性模型的输出预测进行后处理，以获取每帧的输出感兴趣区域。然后，我们将所提出方法的性能与360RAT数据集的主观注释进行比较。

更新时间: 2025-11-24 07:52:06

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.18856v1

Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models

Cross-platform strategy game automation remains a challenge due to diverse user interfaces and dynamic battlefield environments. Existing Vision-Language Models (VLMs) struggle with generalization across heterogeneous platforms and lack precision in interface understanding and action execution. We introduce Yanyun-3, a VLM-based agent that integrates Qwen2.5-VL for visual reasoning and UI-TARS for interface execution. We propose a novel data organization principle, combination granularity, to distinguish intra-sample fusion and inter-sample mixing of multimodal data (static images, multi-image sequences, and videos). The model is fine-tuned using QLoRA on a curated dataset across three strategy game platforms. The optimal strategy (M*V+S) achieves a 12.98x improvement in BLEU-4 score and a 63% reduction in inference time compared to full fusion. Yanyun-3 successfully executes core tasks (e.g., target selection, resource allocation) across platforms without platform-specific tuning. Our findings demonstrate that structured multimodal data organization significantly enhances VLM performance in embodied tasks. Yanyun-3 offers a generalizable framework for GUI automation, with broader implications for robotics and autonomous systems.

Updated: 2025-11-24 07:51:46

标题: Yanyun-3: 使用视觉-语言模型实现跨平台策略游戏操作

摘要: 跨平台策略游戏自动化仍然是一个挑战，因为用户界面多样化且战场环境动态变化。现有的视觉-语言模型（VLMs）在异构平台之间的泛化和界面理解以及动作执行的精确性方面存在困难。我们引入了Yanyun-3，这是一个基于VLM的代理程序，集成了Qwen2.5-VL用于视觉推理和UI-TARS用于界面执行。我们提出了一种新颖的数据组织原则--组合粒度--用于区分多模态数据（静态图像、多图像序列和视频）的样本内融合和样本间混合。该模型在三个策略游戏平台上使用QLoRA对一个筛选过的数据集进行微调。最佳策略（M*V+S）在BLEU-4得分上实现了12.98倍的改进，并且相比于完全融合，推理时间减少了63%。Yanyun-3成功地在各种平台上执行核心任务（如目标选择、资源分配）而无需特定于平台的调整。我们的发现表明，结构化的多模态数据组织显著提高了VLM在具体任务中的表现。Yanyun-3为GUI自动化提供了通用框架，对机器人和自主系统具有更广泛的影响。

更新时间: 2025-11-24 07:51:46

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2511.12937v2

Time Travel: LLM-Assisted Semantic Behavior Localization with Git Bisect

We present a novel framework that integrates Large Language Models (LLMs) into the Git bisect process for semantic fault localization. Traditional bisect assumes deterministic predicates and binary failure states, assumptions often violated in modern software development due to flaky tests, nonmonotonic regressions, and semantic divergence from upstream repositories. Our system augments bisect traversal with structured chain-of-thought reasoning, enabling commit-by-commit analysis under noisy conditions. We evaluate multiple open-source and proprietary LLMs for their suitability and fine-tune DeepSeekCoderV2 using QLoRA on a curated dataset of semantically labeled diffs. We adopt a weak supervision workflow to reduce annotation overhead, incorporating human-in-the-loop corrections and self-consistency filtering. Experiments across multiple open-source projects show a 6.4-point absolute gain in success rate, from 74.2 to 80.6 percent, leading to significantly fewer failed traversals and, depending on the experiment, up to a 2x reduction in average bisect time. We conclude with discussions on temporal reasoning, prompt design, and fine-tuning strategies tailored for commit-level behavior analysis.
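
To make the bisect assumptions concrete, the sketch below shows the plain binary search that bisecting performs, plus a simple majority-vote wrapper for a flaky predicate. Both function names are hypothetical, and the majority vote is a swapped-in illustration of coping with noise; it is not the LLM-based reasoning the paper proposes. The search assumes the first commit is good, the last is bad, and the regression is monotonic.

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")


def first_bad(commits: Sequence[C], is_bad: Callable[[C], bool]) -> C:
    """Binary search for the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad (a monotonic regression).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


def majority_vote(is_bad: Callable[[C], bool], runs: int = 3) -> Callable[[C], bool]:
    """Wrap a flaky predicate so each commit is judged by majority over `runs` calls."""
    def voted(commit: C) -> bool:
        votes = sum(1 for _ in range(runs) if is_bad(commit))
        return votes * 2 > runs
    return voted


if __name__ == "__main__":
    history = list("abcdefgh")  # hypothetical commit ids, oldest first
    print(first_bad(history, majority_vote(lambda c: c >= "f")))
```

When the predicate is nonmonotonic, the invariant in `first_bad` no longer holds, which is exactly the failure mode the abstract describes.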

Updated: 2025-11-24 07:49:59

标题: 时间旅行：LLM辅助语义行为本地化与Git二分法

摘要: 我们提出了一个新颖的框架，将大型语言模型（LLMs）集成到Git二分过程中，用于语义故障定位。传统的二分假设确定性断言和二进制故障状态假设，这些假设在现代软件开发中经常被违反，原因是测试不稳定、非单调回归以及与上游存储库的语义分歧。我们的系统通过结构化的思维链推理增强了二分遍历，使得在嘈杂条件下逐个提交进行分析成为可能。我们评估了多个开源和专有LLMs，确定了它们的适用性，并使用QLoRA在一个经过精心筛选的语义标记差异数据集上对DeepSeekCoderV2进行了微调。我们采用弱监督工作流程来减少注释开销，同时融合了人为干预和自相矛盾过滤。在多个开源项目上进行的实验显示，成功率从74.2%提高了6.4个百分点，达到80.6%，导致失败遍历大幅减少，实验中二分平均时间减少了最多2倍。最后，我们讨论了针对提交级行为分析量身定制的时间推理、提示设计和微调策略。

更新时间: 2025-11-24 07:49:59

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2511.18854v1

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning-via-video paradigm and introduce VR-Bench, a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10-20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.

Updated: 2025-11-24 07:46:09

标题: 通过视频进行推理：通过解迷宫任务评估视频模型的推理能力

摘要: 视频模型在高保真度视频生成中取得了显著的成功，具有连贯的运动动态。类似于从文本生成到基于文本的推理的语言建模发展，视频模型的发展激励我们提出一个问题：视频模型是否可以通过视频生成进行推理？与离散的文本语料库相比，视频将推理基于明确的空间布局和时间连续性，这作为空间推理的理想基础。在这项工作中，我们探索了通过视频范式进行推理，并引入了VR-Bench -- 一个旨在系统评估视频模型推理能力的综合基准。基于迷宫解决任务，这些任务本质上需要空间规划和多步推理，VR-Bench包含了来自五种迷宫类型和不同视觉风格的7,920个程序生成视频。我们的实证分析表明，SFT可以有效地引出视频模型的推理能力。视频模型在推理过程中表现出更强的空间感知能力，优于领先的VLMs，并在不同场景、任务和复杂程度的情况下具有良好的泛化能力。我们进一步发现了一个测试时间缩放效应，即在推理过程中进行多样化采样可以将推理可靠性提高10-20%。这些发现突显了通过视频进行空间推理任务的独特潜力和可扩展性。

更新时间: 2025-11-24 07:46:09

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.15065v2

Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL

Generating accurate SQL from users' natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we summarize and discuss the remaining challenges in this field and suggest directions for future research. All related resources on LLM-based text-to-SQL, including research papers, benchmarks, and open-source projects, are collected for the community in our repository: https://github.com/DEEP-PolyU/Awesome-LLM-based-Text2SQL.

Updated: 2025-11-24 07:43:14

标题: 下一代数据库接口：基于LLM的文本到SQL的调查

摘要: 将用户自然语言问题转化为准确的SQL（文本到SQL）仍然是一个长期存在的挑战，这是因为涉及到用户问题理解、数据库架构理解和SQL生成的复杂性。传统的文本到SQL系统，结合了人工工程和深度神经网络，取得了显著的进展。随后，为文本到SQL任务开发了预训练语言模型（PLMs），取得了令人期待的结果。然而，随着现代数据库和用户问题变得更加复杂，参数规模有限的PLMs往往会产生错误的SQL。这需要更复杂和定制的优化方法，限制了基于PLM的系统的应用。最近，大型语言模型（LLMs）在模型规模增加时表现出了显著的自然语言理解能力。因此，整合基于LLM的解决方案可以为文本到SQL研究带来独特的机会、改进和解决方案。在这项调查中，我们对现有基于LLM的文本到SQL研究进行了全面回顾。具体来说，我们提供了文本到SQL技术挑战和演变过程的简要概述。接下来，我们介绍了设计用于评估文本到SQL系统的数据集和指标。随后，我们对基于LLM的文本到SQL的最新进展进行了系统分析。最后，我们对该领域的剩余挑战进行了总结，并提出了未来研究方向的期望。我们在我们的存储库中为社区收集了所有基于LLM的相关资源，包括研究论文、基准测试和开源项目：https://github.com/DEEP-PolyU/Awesome-LLM-based-Text2SQL。

更新时间: 2025-11-24 07:43:14

领域: cs.CL,cs.AI,cs.DB

下载: http://arxiv.org/abs/2406.08426v8

Pre-Filtering Code Suggestions using Developer Behavioral Telemetry to Optimize LLM-Assisted Programming

Large Language Models (LLMs) are increasingly integrated into code editors to provide AI-powered code suggestions. Yet many of these suggestions are ignored, resulting in wasted computation, increased latency, and unnecessary interruptions. We introduce a lightweight pre-filtering model that predicts the likelihood of suggestion acceptance before invoking the LLM, using only real-time developer telemetry such as typing speed, file navigation, and editing activity. Deployed in a production-grade Visual Studio Code plugin over four months of naturalistic use, our approach nearly doubled acceptance rates (18.4% -> 34.2%) while suppressing 35% of low-value LLM calls. These findings demonstrate that behavioral signals alone can meaningfully improve both user experience and system efficiency in LLM-assisted programming, highlighting the value of timing-aware, privacy-preserving adaptation mechanisms. The filter operates solely on pre-invocation editor telemetry and never inspects code or prompts.
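
The gating idea in this abstract, predicting acceptance from telemetry before calling the LLM, can be sketched with a small logistic model. The feature names, weights, bias, and threshold below are all hypothetical and chosen only for illustration; the abstract does not state the model family or the features' exact encoding.

```python
import math
from typing import Dict


def acceptance_probability(features: Dict[str, float],
                           weights: Dict[str, float],
                           bias: float) -> float:
    """Logistic estimate of how likely a suggestion is to be accepted."""
    z = bias + sum(weights[name] * features.get(name, 0.0) for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def should_invoke_llm(features: Dict[str, float],
                      weights: Dict[str, float],
                      bias: float,
                      threshold: float = 0.5) -> bool:
    """Only call the (expensive) LLM when predicted acceptance clears the threshold."""
    return acceptance_probability(features, weights, bias) >= threshold


if __name__ == "__main__":
    # Hypothetical weights: pauses favor a suggestion, fast typing disfavors one.
    weights = {"typing_speed": -0.5, "pause_seconds": 1.0}
    print(should_invoke_llm({"typing_speed": 0.0, "pause_seconds": 2.0}, weights, 0.0))
    print(should_invoke_llm({"typing_speed": 4.0, "pause_seconds": 0.0}, weights, 0.0))
```

Note that the inputs are editor telemetry only, consistent with the abstract's statement that the filter never inspects code or prompts.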

Updated: 2025-11-24 07:42:07

标题: 使用开发人员行为遥测预过滤代码建议以优化LLM辅助编程

摘要: 大型语言模型（LLMs）越来越多地集成到代码编辑器中，以提供基于人工智能的代码建议。然而，许多这些建议被忽略，导致计算资源浪费、延迟增加和不必要的打断。我们引入了一个轻量级的预过滤模型，通过仅使用实时开发人员遥测数据（如打字速度、文件导航和编辑活动）来预测建议接受的可能性，从而在调用LLM之前。在四个月的自然使用中，我们的方法在生产级别的Visual Studio Code插件中部署，几乎将接受率翻了一番（18.4% -> 34.2%），同时抑制了35%的低价值LLM调用。这些发现表明，仅凭行为信号就可以显著改善LLM辅助编程的用户体验和系统效率，突出了时序感知、隐私保护的适应机制的价值。该过滤器仅在调用前的编辑器遥测数据上运行，从不检查代码或提示。

更新时间: 2025-11-24 07:42:07

领域: cs.SE,cs.AI,cs.HC

下载: http://arxiv.org/abs/2511.18849v1

Personalized Federated Segmentation with Shared Feature Aggregation and Boundary-Focused Calibration

Personalized federated learning (PFL) possesses the unique capability of preserving data confidentiality among clients while tackling the data heterogeneity problem of non-independent and identically distributed (Non-IID) data. Its advantages have led to widespread adoption in domains such as medical image segmentation. However, the existing approaches mostly overlook the potential benefits of leveraging shared features across clients, where each client contains segmentation data of different organs. In this work, we introduce a novel personalized federated approach for organ agnostic tumor segmentation (FedOAP), that utilizes cross-attention to model long-range dependencies among the shared features of different clients and a boundary-aware loss to improve segmentation consistency. FedOAP employs a decoupled cross-attention (DCA), which enables each client to retain local queries while attending to globally shared key-value pairs aggregated from all clients, thereby capturing long-range inter-organ feature dependencies. Additionally, we introduce perturbed boundary loss (PBL) which focuses on the inconsistencies of the predicted mask's boundary for each client, forcing the model to localize the margins more precisely. We evaluate FedOAP on diverse tumor segmentation tasks spanning different organs. Extensive experiments demonstrate that FedOAP consistently outperforms existing state-of-the-art federated and personalized segmentation methods.

Updated: 2025-11-24 07:40:04

标题: 个性化的联邦分割与共享特征聚合和边界关注校准

摘要: 个性化的联邦学习（PFL）具有在处理非独立同分布（Non-IID）数据的数据异质性问题的同时保持客户端数据机密性的独特能力。其优势已导致在诸如医学图像分割等领域广泛采用。然而，现有方法大多忽视了利用跨客户端共享特征的潜在好处，其中每个客户端包含不同器官的分割数据。在这项工作中，我们介绍了一种新颖的用于器官无关肿瘤分割的个性化联邦方法（FedOAP），该方法利用交叉注意力来模拟不同客户端的共享特征之间的长程依赖关系，并利用边界感知损失来提高分割一致性。FedOAP采用了一种分离的交叉注意力（DCA），使每个客户端能够保留本地查询，同时关注从所有客户端聚合而来的全局共享键值对，从而捕捉长程器官间特征依赖关系。此外，我们引入了扰动边界损失（PBL），该损失侧重于预测掩膜的边界不一致性，迫使模型更精确地定位边缘。我们在涵盖不同器官的多样肿瘤分割任务上评估了FedOAP。广泛的实验表明，FedOAP一直优于现有的最先进的联邦和个性化分割方法。

更新时间: 2025-11-24 07:40:04

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.18847v1

CIF: A Constrained Inversion Framework for Reliable Message Extraction in Diffusion-Based Generative Steganography

Generative image steganography aims to conceal secret information in generated images without arousing suspicion. However, in practical scenarios involving high-capacity embedding or lossy transmission, existing methods still suffer from limited extraction accuracy. The main challenge lies in accurately recovering the secret-embedded latent vectors from stego images. To address this issue, we propose CIF, a constrained inversion framework designed to achieve accurate message extraction. Specifically, CIF reduces dynamic structural errors by enforcing linear consistency in the latent space, while reducing numerical integration errors by adaptively adjusting the integration order according to local trajectory stability. Experimental results show that our method reduces latent reconstruction error by more than 35% and achieves higher message extraction accuracy than existing approaches.

Updated: 2025-11-24 07:38:06

标题: CIF：一种用于扩散型生成隐写术中可靠消息提取的受限反演框架

摘要: 生成图像隐写术旨在在生成的图像中隐藏秘密信息而不引起怀疑。然而，在涉及高容量嵌入或有损传输的实际场景中，现有方法仍然受限于提取精度有限的问题。主要挑战在于准确地从隐写图像中恢复嵌入秘密信息的潜在向量。为解决这一问题，我们提出了CIF，一个设计用于实现准确消息提取的受限反演框架。具体地，CIF通过在潜在空间中强制线性一致性来减少动态结构误差，同时通过根据局部轨迹稳定性自适应调整积分顺序来减少数值积分误差。实验结果表明，我们的方法将潜在重建误差降低了超过35％，并且比现有方法实现了更高的消息提取精度。

更新时间: 2025-11-24 07:38:06

领域: cs.CR

下载: http://arxiv.org/abs/2508.00434v2

WaveTuner: Comprehensive Wavelet Subband Tuning for Time Series Forecasting

Owing to their inherent complexity, temporal patterns in real-world time series often evolve across multiple intertwined scales, including long-term periodicity, short-term fluctuations, and abrupt regime shifts. While existing literature has designed many sophisticated decomposition approaches based on the time or frequency domain to partition trend-seasonality components and high-low frequency components, an alternative line of approaches based on the wavelet domain has been proposed to provide a unified multi-resolution representation with precise time-frequency localization. However, most wavelet-based methods suffer from a persistent bias toward recursively decomposing only low-frequency components, severely underutilizing subtle yet informative high-frequency components that are pivotal for precise time series forecasting. To address this problem, we propose WaveTuner, a Wavelet decomposition framework empowered by full-spectrum subband Tuning for time series forecasting. Concretely, WaveTuner comprises two key modules: (i) an Adaptive Wavelet Refinement module, which transforms time series into time-frequency coefficients, utilizes an adaptive router to dynamically assign subband weights, and generates subband-specific embeddings to support refinement; and (ii) a Multi-Branch Specialization module, which employs multiple functional branches, each instantiated as a flexible Kolmogorov-Arnold Network (KAN) with a distinct functional order to model a specific spectral subband. Equipped with these modules, WaveTuner comprehensively tunes global trends and local variations within a unified time-frequency framework. Extensive experiments on eight real-world datasets demonstrate that WaveTuner achieves state-of-the-art performance in time series forecasting.
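
The contrast the abstract draws, recursing only on low-frequency bands versus covering the full spectrum, can be illustrated with a Haar wavelet-packet decomposition that splits every subband at each level. This is a generic sketch, assuming an orthonormal Haar filter and an input length divisible by 2 to the power of `levels`; whether WaveTuner uses this particular decomposition is not stated in the abstract.

```python
import math
from typing import List


def haar_step(x: List[float]) -> tuple:
    """One orthonormal Haar split into (approximation, detail) halves."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def wavelet_packet(x: List[float], levels: int) -> List[List[float]]:
    """Split every subband at every level, yielding 2**levels subbands.

    A standard discrete wavelet transform would recurse only on the
    approximation band; here the detail bands are decomposed as well.
    """
    if len(x) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    bands = [list(x)]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            approx, detail = haar_step(band)
            next_bands.extend([approx, detail])
        bands = next_bands
    return bands


if __name__ == "__main__":
    for band in wavelet_packet([4.0, 2.0, 1.0, 3.0], levels=2):
        print(band)
```

Because the Haar step is orthonormal, the summed squared coefficients across all subbands equal the energy of the input signal.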

Updated: 2025-11-24 07:33:35

标题: WaveTuner：全面的小波子带调整技术用于时间序列预测

摘要: 由于固有的复杂性，现实世界时间序列中的时间模式往往跨越多个交织的尺度演变，包括长期周期性、短期波动和突变的制度转变。尽管现有文献已经设计了许多基于时间或频域的复杂分解方法，用于分割趋势-季节性分量和高低频分量，但还提出了一种基于小波域的替代方法，以提供具有精确时频定位的统一多分辨率表示。然而，大多数基于小波的方法存在一个持续偏向递归分解仅低频分量的偏差，严重浪费微妙但信息丰富的高频分量，这些分量对于精确的时间序列预测至关重要。为了解决这个问题，我们提出了WaveTuner，这是一个由全频谱子带调谐支持的小波分解框架，用于时间序列预测。具体而言，WaveTuner包括两个关键模块：（i）自适应小波细化模块，将时间序列转化为时间-频率系数，利用自适应路由器动态分配子带权重，并生成支持细化的子带特定嵌入；（ii）多支路专业化模块，采用多个功能分支，每个分支实例化为一个具有不同功能顺序的灵活的科尔莫戈洛夫-阿诺德网络（KAN），用于建模特定的频谱子带。借助这些模块，WaveTuner全面调整了统一的时间-频率框架内的全局趋势和局部变化。对八个真实世界数据集的广泛实验表明，WaveTuner在时间序列预测中实现了最先进的预测性能。

更新时间: 2025-11-24 07:33:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2511.18846v1

UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World Model

Vision-and-Language Navigation (VLN), which requires agents to autonomously navigate complex environments via visual images and natural language instructions, remains highly challenging. Recent research on enhancing language-guided navigation reasoning using pre-trained large language models (LLMs) has shown promising prospects. However, the reasoning of such methods is limited to the linguistic modality, lacking visual reasoning capabilities. Moreover, existing reasoning modules are optimized separately from navigation policies, leading to incompatibility and potential conflicts in optimization objectives. To tackle these challenges, we introduce UNeMo, a novel framework designed for the collaborative optimization of visual state reasoning and navigational decision-making. It introduces a Multimodal World Model (MWM) that takes visual features, language instructions, and navigational actions as inputs to jointly predict subsequent visual states, enabling cross-modal reasoning. Via a Hierarchical Prediction-Feedback (HPN) mechanism, MWM collaborates with navigation policies: the first layer generates actions using current vision-and-language features; MWM then infers post-action visual states to guide the second layer's fine-grained decisions. This forms a dynamic bidirectional promotion mechanism in which MWM reasoning optimizes navigation policies, while policy decisions feed back to improve MWM's reasoning accuracy. Experiments on the R2R and REVERIE datasets show that UNeMo outperforms state-of-the-art methods by 2.1% and 0.7% in navigation accuracy for unseen scenes, validating its effectiveness.

Updated: 2025-11-24 07:31:58

标题: UNeMo：通过多模态世界模型进行协作式视觉-语言推理和导航

摘要: 视觉与语言导航（VLN）要求代理根据视觉图像和自然语言指令自主导航复杂环境，这仍然是一个极具挑战性的任务。最近的研究表明，利用预训练的大型语言模型（LLM）增强语言引导的导航推理具有良好的前景。然而，这些方法的推理能力仅限于语言模态，缺乏视觉推理能力。此外，现有的推理模块是单独优化的，与导航策略存在不兼容性，可能导致优化目标的冲突。为了解决这些挑战，我们引入了UNeMo，这是一个设计用于协同优化视觉状态推理和导航决策的新框架。它引入了一个多模态世界模型（MWM），该模型以视觉特征、语言指令和导航动作作为输入，共同预测后续的视觉状态，实现跨模态推理。通过分层预测-反馈（HPN）机制，MWM与导航策略合作：第一层利用当前的视觉和语言特征生成动作；然后MWM推断出执行动作后的视觉状态，以指导第二层的精细决策。这形成了一个动态的双向促进机制，其中MWM的推理优化导航策略，而策略决策反馈以提高MWM的推理准确性。在R2R和REVERIE数据集上的实验表明，UNeMo在未见场景的导航准确性方面优于最先进的方法2.1%和0.7%，验证了其有效性。

更新时间: 2025-11-24 07:31:58

领域: cs.AI

下载: http://arxiv.org/abs/2511.18845v1

A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis

Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.

Updated: 2025-11-24 07:30:15

标题: 一个可复制的框架用于焦点小组分析中的神经主题建模

摘要: 焦点小组讨论产生了丰富的定性数据，但其分析传统上依赖于耗时且劳动密集的手工编码，限制了可扩展性和可重复性。我们提出了一个严谨、可重复的计算框架，将神经主题建模应用于焦点小组记录，解决了基本方法论挑战：超参数敏感性、模型稳定性和可解释性验证。我们使用BERTopic应用于探讨突尼斯HPV疫苗看法的十个焦点小组（1,076个话语），在27个超参数配置中进行系统评估，通过每个配置的30次重复采样进行稳定性评估，并通过三位领域专家进行正式人类评估验证可解释性。我们的分析表明，对超参数的选择具有极大的敏感性，并且显示了稳定性评估的指标选择必须与分析目标保持一致。一种层次化合并策略（提取细粒度主题以确保稳定性，然后合并以确保可解释性）有效地平衡了稳定性和连贯性之间的权衡，实现了0.558的连贯性，而直接提取的连贯性为0.539。人类验证证实了主题质量，具有很好的一致性（ICC = 0.79，加权Cohen's kappa = 0.578）。我们的框架为研究人员提供了实用指南，可以根据自己的定性研究背景进行调整。所有代码、数据处理脚本和评估协议都是公开可用的，以支持这项工作的复制和扩展。

更新时间: 2025-11-24 07:30:15

领域: cs.CL,cs.HC,cs.LG

下载: http://arxiv.org/abs/2511.18843v1

Optimizing LLM Code Suggestions: Feedback-Driven Timing with Lightweight State Bounds

Large Language Models (LLMs) have transformed code auto-completion by generating context-aware suggestions. Yet deciding when to present these suggestions remains underexplored, often leading to interruptions or wasted inference calls. We propose an adaptive timing mechanism that dynamically adjusts the delay before offering a suggestion based on real-time developer feedback. Our method combines a logistic transform of recent acceptance rates with a bounded delay range, anchored by a high-level binary prediction of the developer's cognitive state. In a two-month deployment with professional developers, our system improved suggestion acceptance from 4.9% with no delay to 15.4% with static delays, and to 18.6% with adaptive timing, while reducing blind rejections (rejections without being read) from 8.3% to 0.36%. Together, these improvements increase acceptance and reduce wasted inference calls by 75%, making LLM-based code assistants more efficient and cost-effective in practice.
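
One way to read "a logistic transform of recent acceptance rates with a bounded delay range" is sketched below. Everything specific here is an assumption made for illustration: the function name, the steepness `k`, the `midpoint`, the millisecond bounds, the choice that higher acceptance shortens the delay, and the use of a boolean `focused` flag to pick the bounds as a stand-in for the binary cognitive-state prediction.

```python
import math


def adaptive_delay(acceptance_rate: float, focused: bool,
                   k: float = 10.0, midpoint: float = 0.15) -> float:
    """Map a recent acceptance rate to a suggestion delay in milliseconds.

    The binary state picks the (hypothetical) delay bounds; a logistic
    curve then interpolates between them, so the result never leaves
    the chosen range.
    """
    d_min, d_max = (400.0, 1500.0) if focused else (150.0, 800.0)
    squashed = 1.0 / (1.0 + math.exp(-k * (acceptance_rate - midpoint)))
    # Higher recent acceptance -> squashed closer to 1 -> shorter delay.
    return d_max - (d_max - d_min) * squashed


if __name__ == "__main__":
    for rate in (0.05, 0.15, 0.30):
        print(rate, round(adaptive_delay(rate, focused=False), 1))
```

Because the logistic output lies strictly between 0 and 1, the delay stays strictly inside the chosen bounds regardless of how extreme the recent acceptance rate is.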

Updated: 2025-11-24 07:29:15

标题: 优化LLM代码建议：轻量级状态边界反馈驱动定时

摘要: 大型语言模型（LLMs）通过生成具有上下文感知的建议，已经改变了代码自动完成的方式。然而，何时呈现这些建议仍未得到充分研究，通常会导致中断或浪费推理调用。我们提出了一种自适应定时机制，根据实时开发者反馈动态调整提供建议之前的延迟。我们建议的方法结合了最近接受率的逻辑变换和受限延迟范围，以高级二进制预测开发者认知状态为锚。在与专业开发者进行了为期两个月的部署后，我们的系统将建议的接受率从无延迟的4.9%提高到静态延迟的15.4%，再到自适应定时的18.6%，同时将未被阅读的盲目拒绝率从8.3%降低至0.36%。这些改进共同提高了接受率，并将浪费的推理调用大幅减少了75%，使LLM基础的代码助手在实践中更加高效和具有成本效益。

更新时间: 2025-11-24 07:29:15

领域: cs.SE,cs.AI,cs.HC

下载: http://arxiv.org/abs/2511.18842v1

Federated style aware transformer aggregation of representations

Personalized Federated Learning (PFL) faces persistent challenges, including domain heterogeneity from diverse client data, data imbalance due to skewed participation, and strict communication constraints. Traditional federated learning often lacks personalization, as a single global model cannot capture client-specific characteristics, leading to biased predictions and poor generalization, especially for clients with highly divergent data distributions. To address these issues, we propose FedSTAR, a style-aware federated learning framework that disentangles client-specific style factors from shared content representations. FedSTAR aggregates class-wise prototypes using a Transformer-based attention mechanism, allowing the server to adaptively weight client contributions while preserving personalization. Furthermore, by exchanging compact prototypes and style vectors instead of full model parameters, FedSTAR significantly reduces communication overhead. Experimental results demonstrate that combining content-style disentanglement with attention-driven prototype aggregation improves personalization and robustness in heterogeneous environments without increasing communication cost.
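To make the idea of attention-driven prototype aggregation concrete, the sketch below weights per-client prototypes of a single class with scaled dot-product attention. It is a simplified illustration, not FedSTAR itself. A real Transformer layer would add learned query/key/value projections and multiple heads, and the choice of `query` (for example, the requesting client's own prototype) is an assumption made here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_prototypes(client_protos, query):
    """Attention-weighted average of client prototypes for one class.

    client_protos: list of equal-length vectors, one per client.
    query: vector of the same length (hypothetical: the requesting
    client's own prototype).
    """
    d = len(query)
    # Scaled dot-product score between the query and each client prototype.
    scores = [
        sum(q * p for q, p in zip(query, proto)) / math.sqrt(d)
        for proto in client_protos
    ]
    weights = softmax(scores)
    aggregated = [
        sum(w * proto[i] for w, proto in zip(weights, client_protos))
        for i in range(d)
    ]
    return aggregated, weights
```

Clients whose prototypes align with the query receive larger weights, which is one way a server could weight contributions adaptively while exchanging only compact vectors.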

Updated: 2025-11-24 07:24:09

标题: 联邦式感知变换器表示聚合

摘要: 个性化联邦学习（PFL）面临着持久的挑战，包括来自不同客户数据的领域异质性、由于参与程度失衡导致的数据不平衡以及严格的通信约束。传统的联邦学习通常缺乏个性化，因为单一的全局模型无法捕捉客户特定的特征，从而导致偏见预测和泛化不佳，特别是对于数据分布高度不同的客户。为了解决这些问题，我们提出了FedSTAR，一个注重风格的联邦学习框架，可以将客户特定的风格因素与共享内容表示分离开来。FedSTAR使用基于Transformer的注意机制聚合类别原型，允许服务器自适应地加权客户的贡献，同时保持个性化。此外，通过交换紧凑的原型和风格向量而不是完整的模型参数，FedSTAR显著减少了通信开销。实验结果表明，在不增加通信成本的情况下，将内容-风格解耦与基于注意力的原型聚合相结合，可以提高异质环境中的个性化和稳健性。

更新时间: 2025-11-24 07:24:09

领域: cs.LG,cs.AI,cs.DC

下载: http://arxiv.org/abs/2511.18841v1

Addressing Situated Teaching Needs: A Multi-Agent Framework for Automated Slide Adaptation

The adaptation of teaching slides to instructors' situated teaching needs, including pedagogical styles and their students' context, is a critical yet time-consuming task for educators. Through a series of educator interviews, we first identify and systematically categorize the key friction points that impede this adaptation process. Grounded in these findings, we introduce a novel multi-agent framework designed to automate slide adaptation based on high-level instructor specifications. An evaluation involving 16 modification requests across 8 real-world courses validates our approach. The framework's output consistently achieved high scores in intent alignment, content coherence and factual accuracy, and performed on par with baseline methods regarding visual clarity, while also demonstrating appropriate timeliness and a high operational agreement with human experts, achieving an F1 score of 0.89. This work heralds a new paradigm where AI agents handle the logistical burdens of instructional design, liberating educators to focus on the creative and strategic aspects of teaching.

Updated: 2025-11-24 07:22:41

标题: 解决情境教学需求：用于自动幻灯片适应的多代理框架

摘要: 教学幻灯片的调整，以满足教师的教学需求，包括教学风格和学生的背景，对于教育工作者来说是一个关键但耗时的任务。通过一系列教育工作者的访谈，我们首先确定并系统地分类了阻碍这一调整过程的关键摩擦点。基于这些发现，我们引入了一个新颖的多代理框架，旨在根据高级教师规范自动调整幻灯片。在涉及8个真实课程的16个修改请求的评估中，验证了我们的方法。该框架的输出在意图一致性、内容连贯性和事实准确性方面始终取得高分，与基线方法在视觉清晰度方面表现一致，同时还展示了适当的及时性和与人类专家的高操作一致性，实现了0.89的F1分数。这项工作标志着一个新的范式，即人工智能代理处理教学设计的后勤负担，使教育工作者能够专注于教学的创造性和战略性方面。

更新时间: 2025-11-24 07:22:41

领域: cs.MA,cs.AI

下载: http://arxiv.org/abs/2511.18840v1

Enhancing Multi-Label Thoracic Disease Diagnosis with Deep Ensemble-Based Uncertainty Quantification

The utility of deep learning models, such as CheXNet, in high stakes clinical settings is fundamentally constrained by their purely deterministic nature, failing to provide reliable measures of predictive confidence. This project addresses this critical gap by integrating robust Uncertainty Quantification (UQ) into a high performance diagnostic platform for 14 common thoracic diseases on the NIH ChestX-ray14 dataset. Initial architectural development failed to stabilize performance and calibration using Monte Carlo Dropout (MCD), yielding an unacceptable Expected Calibration Error (ECE) of 0.7588. This technical failure necessitated a rigorous architectural pivot to a high diversity, 9-member Deep Ensemble (DE). This resulting DE successfully stabilized performance and delivered superior reliability, achieving a State-of-the-Art (SOTA) average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.8559 and an average F1 Score of 0.3857. Crucially, the DE demonstrated superior calibration (Mean ECE of 0.0728 and Negative Log-Likelihood (NLL) of 0.1916) and enabled the reliable decomposition of total uncertainty into its Aleatoric (irreducible data noise) and Epistemic (reducible model knowledge) components, with a mean Epistemic Uncertainty (EU) of 0.0240. These results establish the Deep Ensemble as a trustworthy and explainable platform, transforming the model from a probabilistic tool into a reliable clinical decision support system.
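The split of total uncertainty into aleatoric and epistemic parts can be illustrated with the common entropy-based decomposition for ensembles. The abstract does not state which formula the authors use, so this is an assumption. For one binary label, the total uncertainty is the entropy of the mean prediction, the aleatoric part is the mean entropy of the individual members, and the epistemic part is their difference.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(member_probs):
    """Return (total, aleatoric, epistemic) for one label.

    member_probs: predicted probability of the label from each
    ensemble member (e.g. 9 values for a 9-member ensemble).
    """
    mean_p = sum(member_probs) / len(member_probs)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    epistemic = total - aleatoric  # disagreement between members
    return total, aleatoric, epistemic
```

When all members agree, the epistemic term vanishes even if each prediction is uncertain. When members are individually confident but disagree, the epistemic term dominates.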

Updated: 2025-11-24 07:20:40

标题: 使用深度集成模型的不确定性量化增强多标签胸部疾病诊断

摘要: 深度学习模型，如CheXNet，在高风险临床环境中的实用性受到其纯确定性特性的根本限制，无法提供可靠的预测置信度度量。本项目通过将强大的不确定性量化（UQ）集成到一个针对NIH ChestX-ray14数据集中的14种常见胸部疾病的高性能诊断平台中，填补了这一关键差距。初始架构开发未能通过蒙特卡洛Dropout（MCD）稳定性能和校准，导致不可接受的期望校准误差（ECE）为0.7588。这一技术失败需要严格的架构转变为高多样性的、9成员的深度集成（DE）。这一结果的DE成功稳定性能，并提供了更可靠的可靠性，实现了0.8559的平均接收器工作特性曲线下面积（AUROC）和0.3857的平均F1得分的最新技术水平（SOTA）。至关重要的是，DE表现出更好的校准性（平均ECE为0.0728和负对数似然（NLL）为0.1916），并使得将总不确定性分解为其Aleatoric（不可减小的数据噪音）和Epistemic（可减小的模型知识）组件成为可能，平均Epistemic Uncertainty（EU）为0.0240。这些结果将深度集成确定为一个可信赖和可解释的平台，将模型从一个概率工具转变为可靠的临床决策支持系统。

更新时间: 2025-11-24 07:20:40

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2511.18839v1

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.

Updated: 2025-11-24 07:19:43

标题: IAG：基于VLM的视觉基础输入感知后门攻击

摘要: 最近对视觉语言模型（VLMs）的进展显著增强了视觉定位任务，该任务涉及根据自然语言查询在图像中定位对象。尽管取得了这些进展，但基于VLM的定位系统的安全性尚未得到彻底调查。本文揭示了一种新颖且现实的漏洞：VLM基础视觉定位的第一个多目标后门攻击。与先前依赖静态触发器或固定目标的攻击不同，我们提出了IAG，一种方法，该方法动态生成基于任何指定目标对象描述的输入感知、文本引导的触发器以执行攻击。这通过一个文本条件的UNet实现，它将无法察觉的目标语义线索嵌入到视觉输入中，同时在良性样本上保持正常的定位性能。我们进一步开发了一个平衡语言能力和感知重建的联合训练目标，以确保不可察觉性、有效性和隐秘性。对多个VLMs（例如LLaVA、InternVL、Ferret）和基准（RefCOCO、RefCOCO+、RefCOCOg、Flickr30k实体和ShowUI）进行了大量实验证明，与其他基线相比，IAG在几乎所有设置中实现了最佳ASR，而不会影响干净的准确性，保持对现有防御的鲁棒性，并在数据集和模型之间展现了可转移性。这些发现强调了定位能力VLMs中的关键安全风险，并强调了对值得信赖的多模态理解的进一步研究的需求。

更新时间: 2025-11-24 07:19:43

领域: cs.CV,cs.CL,cs.CR

下载: http://arxiv.org/abs/2508.09456v3

Auto-ML Graph Neural Network Hypermodels for Outcome Prediction in Event-Sequence Data

This paper introduces HGNN(O), an AutoML GNN hypermodel framework for outcome prediction on event-sequence data. Building on our earlier work on graph convolutional network hypermodels, HGNN(O) extends four architectures (One Level, Two Level, Two Level Pseudo Embedding, and Two Level Embedding) across six canonical GNN operators. A self-tuning mechanism based on Bayesian optimization with pruning and early stopping enables efficient adaptation over architectures and hyperparameters without manual configuration. Empirical evaluation on both balanced and imbalanced event logs shows that HGNN(O) achieves accuracy exceeding 0.98 on the Traffic Fines dataset and weighted F1 scores up to 0.86 on the Patients dataset without explicit imbalance handling. These results demonstrate that the proposed AutoML-GNN approach provides a robust and generalizable benchmark for outcome prediction in complex event-sequence data.

Updated: 2025-11-24 07:13:34

标题: Auto-ML图神经网络超模型在事件序列数据中的结果预测中的应用

摘要: 本文介绍了HGNN(O)，一种用于事件序列数据结果预测的AutoML GNN超模型框架。在我们之前关于图卷积网络超模型的工作基础上，HGNN(O)扩展了四种架构—一级、二级、二级伪嵌入和二级嵌入—跨越六种经典的GNN算子。基于贝叶斯优化的自调节机制，通过剪枝和提前停止，实现了对架构和超参数的高效适应，无需手动配置。在平衡和不平衡事件日志上的实证评估表明，HGNN(O)在交通罚款数据集上实现了超过0.98的准确度，并在患者数据集上达到了高达0.86的加权F1分数，而无需明确处理不平衡。这些结果表明，提出的AutoML-GNN方法为复杂事件序列数据的结果预测提供了稳健且可推广的基准。

更新时间: 2025-11-24 07:13:34

领域: cs.LG

下载: http://arxiv.org/abs/2511.18835v1

FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories

With the success of flow matching in visual generation, sampling efficiency remains a critical bottleneck for its practical application. Among acceleration methods for flow models, ReFlow has been somewhat overlooked despite its theoretical consistency with flow matching, primarily because it underperforms consistency distillation and score distillation in practical scenarios. In this work, we investigate this issue within the ReFlow framework and propose FlowSteer, a method that unlocks the potential of ReFlow-based distillation by guiding the student along the teacher's authentic generation trajectories. We first identify that the performance of Piecewised ReFlow is hampered by a critical distribution mismatch during training and propose Online Trajectory Alignment (OTA) to resolve it. We then introduce an adversarial distillation objective applied directly on the ODE trajectory, improving the student's adherence to the teacher's generation trajectory. Furthermore, we find and fix a previously undiscovered flaw in the widely used FlowMatchEulerDiscreteScheduler that substantially degrades few-step inference quality. Our experimental results on SD3 demonstrate the method's efficacy.

Updated: 2025-11-24 07:13:23

标题: FlowSteer：通过真实轨迹指导少步图像合成

摘要: 随着流匹配在视觉生成中取得成功，抽样效率仍然是其实际应用中的一个关键瓶颈。在流模型的加速方法中，ReFlow虽然具有理论一致性，但在实际场景中表现不佳，因此在一定程度上被忽视，相比一致性蒸馏和得分蒸馏。在这项工作中，我们研究了ReFlow框架中的这个问题，并提出了FlowSteer，一种方法通过引导学生沿着教师的真实生成轨迹释放了ReFlow-based蒸馏的潜力。我们首先确定了Piecewised ReFlow在训练过程中受到关键分布不匹配的影响，并提出了在线轨迹对齐（OTA）来解决这个问题。然后，我们引入了一个对抗蒸馏目标，直接应用于ODE轨迹，提高了学生对教师生成轨迹的遵循度。此外，我们发现并修复了广泛使用的FlowMatchEulerDiscreteScheduler中一个先前未发现的缺陷，这严重降低了少步推理的质量。我们在SD3上的实验结果证明了我们方法的有效性。

更新时间: 2025-11-24 07:13:23

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.18834v1

BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as 'jailbreaks', where malicious inputs can coerce LLMs into generating harmful content that bypasses safety alignment. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method, BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.

Updated: 2025-11-24 07:12:09

标题: BiasJailbreak：分析大型语言模型中的伦理偏见和越狱漏洞

摘要: 虽然大型语言模型(LLMs)在各种任务中展现出令人印象深刻的熟练程度，但它们也存在潜在的安全风险，例如“越狱”，恶意输入可以迫使LLMs生成有害内容，绕过安全对齐。在本文中，我们深入探讨LLMs中的道德偏见，并研究这些偏见如何被用于越狱。值得注意的是，即使提示的其他部分相同，这些偏见导致GPT-4o模型中非二进制和顺性关键词之间的越狱成功率相差20%，白人和黑人关键词之间相差16%。我们引入了BiasJailbreak的概念，突显了这些由安全性引起的偏见所带来的固有风险。BiasJailbreak通过向目标LLM自身询问，自动生成有偏见的关键词，并利用这些关键词生成有害输出。此外，我们提出了一种高效的防御方法BiasDefense，通过在生成前注入防御提示来防止越狱尝试。BiasDefense作为一种吸引人的替代方案，不同于需要在文本生成后进行额外推理成本的Guard Models，如Llama-Guard。我们的研究结果强调LLMs中的道德偏见实际上可能导致生成不安全的输出，并提出一种方法使LLMs更加安全和无偏见。为了促进进一步研究和改进，我们开源了BiasJailbreak的代码和工件，为社区提供工具，以更好地理解和减轻LLMs中由安全性引起的偏见。

更新时间: 2025-11-24 07:12:09

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2410.13334v4

Leveraging Duration Pseudo-Embeddings in Multilevel LSTM and GCN Hypermodels for Outcome-Oriented PPM

Existing deep learning models for Predictive Process Monitoring (PPM) struggle with temporal irregularities, particularly stochastic event durations and overlapping timestamps, limiting their adaptability across heterogeneous datasets. We propose a dual input neural network strategy that separates event and sequence attributes, using a duration-aware pseudo-embedding matrix to transform temporal importance into compact, learnable representations. This design is implemented across two baseline families: B-LSTM and B-GCN, and their duration-aware variants D-LSTM and D-GCN. All models incorporate self-tuned hypermodels for adaptive architecture selection. Experiments on balanced and imbalanced outcome prediction tasks show that duration pseudo-embedding inputs consistently improve generalization, reduce model complexity, and enhance interpretability. Our results demonstrate the benefits of explicit temporal encoding and provide a flexible design for robust, real-world PPM applications.

Updated: 2025-11-24 07:06:08

标题: 利用多级LSTM和GCN超模型中的持续性伪嵌入以实现面向结果的PPM

摘要: 现有的用于预测过程监控（PPM）的深度学习模型在处理时间不规则性方面存在困难，特别是随机事件持续时间和重叠时间戳，限制了它们在异构数据集中的适应性。我们提出了一种双输入神经网络策略，将事件和序列属性分开，使用一个基于持续时间的伪嵌入矩阵将时间重要性转化为紧凑、可学习的表示。这种设计被应用在两个基线家族上：B-LSTM和B-GCN，以及它们基于持续时间的变体D-LSTM和D-GCN。所有模型都包含自调整的超模型，用于自适应架构选择。在平衡和不平衡的结果预测任务上进行的实验表明，持续时间的伪嵌入输入能够稳定提高泛化能力，减少模型复杂性，并增强可解释性。我们的结果展示了显式时间编码的好处，并提供了一个灵活的设计，适用于强大的现实世界PPM应用。

更新时间: 2025-11-24 07:06:08

领域: cs.LG

下载: http://arxiv.org/abs/2511.18830v1

Towards Characterizing Knowledge Distillation of PPG Heart Rate Estimation Models

Heart rate estimation from photoplethysmography (PPG) signals generated by wearable devices such as smartwatches and fitness trackers has significant implications for the health and well-being of individuals. Although prior work has demonstrated deep learning models with strong performance in the heart rate estimation task, in order to deploy these models on wearable devices, these models must also adhere to strict memory and latency constraints. In this work, we explore and characterize how large pre-trained PPG models may be distilled to smaller models appropriate for real-time inference on the edge. We evaluate four distillation strategies through comprehensive sweeps of teacher and student model capacities: (1) hard distillation, (2) soft distillation, (3) decoupled knowledge distillation (DKD), and (4) feature distillation. We present a characterization of the resulting scaling laws describing the relationship between model size and performance. This early investigation lays the groundwork for practical and predictable methods for building edge-deployable models for physiological sensing.

Updated: 2025-11-24 07:06:06

标题: 朝着对PPG心率估计模型的知识蒸馏进行特征化

摘要: 通过由可穿戴设备如智能手表和健身追踪器生成的光电容抗(PPG)信号对心率进行估计对个人的健康和幸福有重要影响。尽管先前的研究已经证明深度学习模型在心率估计任务中表现出色，但为了在可穿戴设备上部署这些模型，这些模型还必须符合严格的内存和延迟限制。在这项工作中，我们探讨和表征了如何将大型预训练的PPG模型精简为适合边缘实时推断的较小模型。我们通过对教师和学生模型容量进行全面扫描，评估了四种精简策略：(1) 硬精简，(2) 软精简，(3) 解耦知识精简(DKD)和(4) 特征精简。我们提出了描述模型大小和性能之间关系的结果缩放定律的表征。这项早期研究为构建用于生理感知的边缘可部署模型的实用和可预测方法奠定了基础。

更新时间: 2025-11-24 07:06:06

领域: cs.LG

下载: http://arxiv.org/abs/2511.18829v1

VeML: An End-to-End Machine Learning Lifecycle for Large-scale and High-dimensional Data

An end-to-end machine learning (ML) lifecycle consists of many iterative processes, from data preparation and ML model design to model training and then deploying the trained model for inference. When building an end-to-end lifecycle for an ML problem, many ML pipelines must be designed and executed, producing a huge number of lifecycle versions. Therefore, this paper introduces VeML, a Version management system dedicated to the end-to-end ML Lifecycle. Our system tackles several crucial problems that other systems have not solved. First, we address the high cost of building an ML lifecycle, especially for large-scale and high-dimensional datasets. We solve this problem by proposing to transfer the lifecycle of similar datasets managed in our system to the new training data. We design an algorithm based on the core set to compute similarity for large-scale, high-dimensional data efficiently. Another critical issue is the degradation of model accuracy caused by differences between training data and testing data during the ML lifetime, which leads to lifecycle rebuilds. Our system helps detect this mismatch without requiring labeled testing data and rebuilds the ML lifecycle for a new data version. To demonstrate our contributions, we conduct experiments on real-world, large-scale datasets of driving images and spatiotemporal sensor data and show promising results.

Updated: 2025-11-24 07:05:54

标题: VeML：大规模和高维数据的端到端机器学习生命周期

摘要: 一个端到端的机器学习（ML）生命周期包括许多迭代过程，从数据准备和ML模型设计到模型训练，然后部署训练好的模型进行推断。当构建一个ML问题的端到端生命周期时，必须设计和执行许多ML管道，产生大量的生命周期版本。因此，本文介绍了VeML，一个专门用于端到端ML生命周期的版本管理系统。我们的系统解决了其他系统尚未解决的几个关键问题。首先，我们解决了构建ML生命周期的高成本问题，特别是针对大规模和高维数据集。我们通过提议将在我们的系统中管理的类似数据集的生命周期转移到新的训练数据来解决这个问题。我们设计了一种基于核心集的算法，可以有效地计算大规模、高维数据的相似性。另一个关键问题是在ML生命周期中训练数据和测试数据之间的差异导致模型准确性下降，从而导致生命周期重建。我们的系统可以帮助检测到这种不匹配，而无需从测试数据中获取标记数据，并为新数据版本重建ML生命周期。为了展示我们的贡献，我们在真实世界的大规模数据集上进行了实验，包括驾驶图像和时空传感器数据，并展示了令人期待的结果。

更新时间: 2025-11-24 07:05:54

领域: cs.LG,cs.DB,cs.HC

下载: http://arxiv.org/abs/2304.13037v3

Solving a Research Problem in Mathematical Statistics with AI Assistance

Over the last few months, AI models including large language models have improved greatly. There are now several documented examples where they have helped professional mathematical scientists prove new results, sometimes even helping resolve known open problems. In this short note, we add another example to the list by documenting how we were able to solve a previously unsolved research problem in robust mathematical statistics with crucial help from GPT-5. Our problem concerns robust density estimation, where the observations are perturbed by Wasserstein-bounded contaminations. In a previous preprint (Chao and Dobriban, 2023, arxiv:2308.01853v2), we obtained upper and lower bounds on the minimax optimal estimation error, which were, however, not sharp. Starting in October 2025, making significant use of GPT-5 Pro, we were able to derive the minimax optimal error rate (reported in version 3 of the above arxiv preprint). GPT-5 provided crucial help along the way, including by suggesting calculations that we did not think of, and techniques that were not familiar to us, such as the dynamic Benamou-Brenier formulation, for key steps in the analysis. Working with GPT-5 took a few weeks of effort, and we estimate that it could have taken several months to get the same results otherwise. At the same time, there are still areas where working with GPT-5 was challenging: it sometimes provided incorrect references and glossed over details that sometimes took days of work to fill in. We outline our workflow and the steps taken to mitigate issues. Overall, our work can serve as additional documentation for a new age of human-AI collaborative work in mathematical science.

Updated: 2025-11-24 07:03:56

标题: 利用人工智能辅助解决数理统计中的研究问题

摘要: 在过去几个月里，包括大型语言模型在内的AI模型取得了巨大进步。现在已经有几个记录的例子表明它们帮助专业数学科学家证明了新结果，有时甚至帮助解决了已知的开放性问题。在这篇简短的笔记中，我们通过记录我们是如何能够在数学统计的鲁棒性问题中解决以前未解决的研究问题，并在关键时刻获得GPT-5的帮助，向列表中添加另一个例子。我们的问题涉及鲁棒密度估计，其中观测值受Wasserstein边界污染。在之前的一个预印本中（Chao和Dobriban，2023，arxiv:2308.01853v2），我们获得了最小最优估计误差的上下界；然而，这些界并不尖锐。从2025年10月开始，通过广泛使用GPT-5 Pro，我们得以推导出最小最优误差率（报告在上述arxiv预印本的第3版中）。GPT-5在整个过程中提供了关键的帮助，包括提出我们没有考虑到的计算和对我们不熟悉的技术，如动态Benamou-Brenier公式，用于分析的关键步骤。与GPT-5合作花费了几周的工作，我们估计否则可能需要几个月的时间才能得到相同的结果。同时，与GPT-5合作仍然存在挑战的领域：它有时提供不正确的参考文献，并忽略了有时需要数天工作才能填补的细节。我们概述了我们的工作流程和采取的措施以减轻问题。总的来说，我们的工作可以作为数学科学中人工智能协作工作新时代的补充文档。

更新时间: 2025-11-24 07:03:56

领域: math.ST,cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.18828v1

Uncertainty-Aware Dual-Student Knowledge Distillation for Efficient Image Classification

Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, traditional knowledge distillation methods treat all teacher predictions equally, regardless of the teacher's confidence in those predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that leverages teacher prediction uncertainty to selectively guide student learning. We introduce a peer-learning mechanism where two heterogeneous student architectures, specifically ResNet-18 and MobileNetV2, learn collaboratively from both the teacher network and each other. Experimental results on ImageNet-100 demonstrate that our approach achieves superior performance compared to baseline knowledge distillation methods, with ResNet-18 achieving 83.84% top-1 accuracy and MobileNetV2 achieving 81.46% top-1 accuracy, representing improvements of 2.04% and 0.92% respectively over traditional single-student distillation approaches.
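One simple way to let teacher uncertainty guide the student is to scale the usual temperature-softened KL distillation loss by a confidence weight derived from the teacher's predictive entropy. The sketch below shows that idea for a single example. The weighting rule (one minus normalized entropy) and the name `uncertainty_weighted_kd` are assumptions for illustration, since the abstract does not specify the exact formulation, and the peer-learning term between the two students is omitted.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_weighted_kd(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) scaled by an assumed teacher-confidence weight."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(
        pt * math.log(pt / max(ps, 1e-12))
        for pt, ps in zip(p_t, p_s)
        if pt > 0.0
    )
    # Confidence is 1 for a one-hot teacher and 0 for a uniform teacher.
    confidence = 1.0 - entropy(p_t) / math.log(len(p_t))
    return confidence * (T ** 2) * kl
```

With this weighting, examples where the teacher is close to uniform contribute almost nothing to the distillation signal, while confident teacher predictions are transferred at full strength.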

Updated: 2025-11-24 07:02:22

标题: 不确定性感知的双学生知识蒸馏用于高效图像分类

摘要: 知识蒸馏已经成为一种强大的模型压缩技术，能够实现从大型教师网络向紧凑的学生模型传递知识。然而，传统的知识蒸馏方法将所有教师的预测视为平等，而不考虑教师对这些预测的信心。本文提出了一种基于不确定性的双学生知识蒸馏框架，利用教师预测的不确定性来有选择地指导学生学习。我们引入了一种同行学习机制，其中两种异构的学生架构，具体为ResNet-18和MobileNetV2，从教师网络和彼此中协作学习。在ImageNet-100上的实验结果表明，我们的方法相比基线知识蒸馏方法实现了更好的性能，ResNet-18实现了83.84%的top-1准确率，MobileNetV2实现了81.46%的top-1准确率，分别比传统的单学生蒸馏方法提高了2.04%和0.92%。

更新时间: 2025-11-24 07:02:22

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2511.18826v1

Solution of Incompressible Flow Equations with Physics and Equality Constrained Artificial Neural Networks

We present a meshless method for the solution of incompressible Navier-Stokes equations in advection-dominated regimes using physics- and equality-constrained artificial neural networks combined with a conditionally adaptive augmented Lagrangian formulation. A single neural network parameterizes both the velocity and pressure fields, and is trained by minimizing the residual of a Poisson's equation for pressure, constrained by the momentum and continuity equations, together with boundary conditions on the velocity field. No boundary conditions are imposed on the pressure field aside from anchoring the pressure at a point to prevent its unbounded development. The training is performed from scratch without labeled data, relying solely on the governing equations and constraints. To enhance accuracy in advection-dominated flows, we employ a single Fourier feature mapping of the input coordinates. The proposed method is demonstrated for the canonical lid-driven cavity flow up to a Reynolds number of 7,500 and for laminar flow over a circular cylinder with inflow-outflow boundary conditions, achieving excellent agreement with benchmark solutions. We further compare the present formulation against alternative objective-function constructions based on different arrangements of the flow equations, thereby highlighting the algorithmic advantages of the proposed formulation centered around the Poisson's equation for pressure.
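A Fourier feature mapping of the input coordinates can be sketched as follows. This uses the common random-Gaussian formulation, gamma(x) = [cos(2 pi B x), sin(2 pi B x)]. The number of features, the scale `sigma`, and the seed are assumptions, because the abstract states only that a single Fourier feature mapping is applied.

```python
import math
import random


def make_fourier_map(in_dim, num_features, sigma=1.0, seed=0):
    """Build gamma(x) = [cos(2*pi*B x), sin(2*pi*B x)] with a fixed random B."""
    rng = random.Random(seed)
    # B has shape (num_features, in_dim). Entries are drawn once from
    # N(0, sigma^2) and then kept fixed (not trained).
    B = [
        [rng.gauss(0.0, sigma) for _ in range(in_dim)]
        for _ in range(num_features)
    ]

    def gamma(x):
        proj = [
            2.0 * math.pi * sum(b * xi for b, xi in zip(row, x))
            for row in B
        ]
        return [math.cos(v) for v in proj] + [math.sin(v) for v in proj]

    return gamma
```

For a two-dimensional flow problem, `gamma = make_fourier_map(2, 64)` would turn each `(x, y)` coordinate into a 128-dimensional input for the network. A larger `sigma` emphasizes higher spatial frequencies, which is the usual motivation for such mappings when sharp gradients are present.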

Updated: 2025-11-24 06:54:20

标题: 使用受物理和等式约束的人工神经网络求解不可压缩流动方程

摘要: 我们提出了一种无网格方法，用于在对流主导的条件下求解不可压缩Navier-Stokes方程，该方法利用物理和等式约束的人工神经网络结合条件自适应的增广拉格朗日形式。一个单一的神经网络参数化速度和压力场，通过最小化压力泊松方程的残差进行训练，受到动量和连续性方程以及速度场的边界条件的约束。除了在一个点上锚定压力以防止其无界发展之外，没有在压力场上施加边界条件。训练是从零开始进行的，没有标记的数据，仅依赖于控制方程和约束。为了增强在对流主导流动中的准确性，我们采用了输入坐标的单一傅立叶特征映射。所提出的方法在标准的驱动盖腔流动中进行了演示，直到雷诺数达到7500，并在环形圆柱上的层流流动中使用了流入流出边界条件，与基准解取得了优异的一致性。我们进一步将当前构造与基于不同流动方程排列的替代目标函数构造进行了比较，从而突出了围绕压力泊松方程的建议构造的算法优势。

更新时间: 2025-11-24 06:54:20

领域: physics.flu-dyn,cs.LG

下载: http://arxiv.org/abs/2511.18820v1

VALUE: Value-Aware Large Language Model for Query Rewriting via Weighted Trie in Sponsored Search

Query-to-bidword (i.e., bidding keyword) rewriting is fundamental to sponsored search, transforming noisy user queries into semantically relevant and commercially valuable keywords. Recent advances in large language models (LLMs) improve semantic relevance through generative retrieval frameworks, but they rarely encode the commercial value of keywords. As a result, rewrites are often semantically correct yet economically suboptimal, and a reinforcement learning from human feedback (RLHF) stage is usually added after supervised fine-tuning (SFT) to mitigate this deficiency. However, conventional preference alignment frequently overemphasizes the ordering of bidword values and is susceptible to overfitting, which degrades rewrite quality. In addition, bidword values change rapidly, while existing generative methods do not respond to these fluctuations. To address these shortcomings, we introduce VALUE (Value-Aware Large language model for qUery rewriting via wEighted trie), a framework that integrates value awareness directly into generation and enhances value alignment during training. VALUE employs the Weighted Trie, a novel variant of the classical trie that stores real-time value signals for each token. During decoding, the framework adjusts the LLM's token probabilities with these signals, constraining the search space and steering generation toward high-value rewrites. The alignment stage uses a fine-grained preference learning strategy that emphasizes stable, high-value differences and down-weights noisy or transient fluctuations, thereby improving robustness and reducing overfitting. Offline experiments show that VALUE significantly outperforms baselines in both semantic matching and value-centric metrics. VALUE has been deployed on our advertising system since October 2024 and served the Double Eleven promotions, the biggest shopping carnival in China.
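The decoding-time role of a weighted trie can be illustrated with a small sketch. Everything here is hypothetical: the class layout, the rule that a token node keeps the maximum value of any bidword passing through it, and the additive `alpha * log(value)` adjustment are assumptions chosen for clarity, not the paper's actual design. The sketch shows only the two effects the abstract names, constraining the search space to valid bidwords and steering probabilities toward high-value continuations.

```python
import math


class WeightedTrie:
    """Trie over token sequences where each token node stores a value."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens, value):
        node = self.root
        for tok in tokens:
            child = node.setdefault(tok, {"value": 0.0, "children": {}})
            # Assumption: a node keeps the best value reachable through it.
            child["value"] = max(child["value"], value)
            node = child["children"]

    def allowed(self, prefix):
        """Return {next_token: value} for continuations of `prefix`."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return {}
            node = node[tok]["children"]
        return {tok: child["value"] for tok, child in node.items()}


def adjust_log_probs(log_probs, trie, prefix, alpha=1.0):
    """Mask tokens outside the trie and add alpha * log(value) to the rest."""
    scored = {
        tok: log_probs[tok] + alpha * math.log(v)
        for tok, v in trie.allowed(prefix).items()
        if tok in log_probs and v > 0.0
    }
    if not scored:
        return {}
    # Renormalize with log-sum-exp so the result is a distribution again.
    m = max(scored.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scored.values()))
    return {tok: s - log_z for tok, s in scored.items()}
```

Because values live in the trie rather than in the model weights, updating a bidword's value only requires re-inserting it, which is one way such a structure could track rapidly changing values without retraining.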

Updated: 2025-11-24 06:50:38

标题: VALUE: 基于加权 Trie 的查询重写的价值感知大型语言模型在赞助搜索中的应用

摘要: 查询到出价关键字（即竞价关键字）重写对于赞助搜索至关重要，将嘈杂的用户查询转化为语义相关且商业价值高的关键字。最近大型语言模型（LLMs）的进展通过生成式检索框架提高语义相关性，但它们很少编码关键字的商业价值。因此，重写通常在语义上是正确的，但经济上却不够优化，通常在监督微调（SFT）之后会添加一个强化学习从人类反馈（RLHF）阶段来减轻这种不足。然而，传统的偏好对齐经常过分强调出价关键字值的排序，并容易过拟合，从而降低了重写质量。此外，出价关键字值迅速变化，而现有的生成方法并未对这些波动做出响应。为了解决这一缺点，我们引入了VALUE（Value-Aware Large language model for qUery rewriting via wEighted trie），这是一个将价值意识直接整合到生成中，并在训练过程中增强价值对齐的框架。VALUE采用了加权Trie，这是经典Trie的一种新变体，它为每个标记存储实时价值信号。在解码过程中，框架调整了LLM的标记概率与这些信号，限制了搜索空间并引导生成向高价值的重写。对齐阶段采用了一种细粒度的偏好学习策略，强调稳定的、高价值的差异并降低嘈杂或瞬时波动，从而提高鲁棒性并减少过拟合。离线实验表明，VALUE在语义匹配和价值为中心的度量方面明显优于基线。自2024年10月以来，VALUE已部署在我们的广告系统上，并为中国最大的购物狂欢节“双十一”促销活动提供服务。

更新时间: 2025-11-24 06:50:38

领域: cs.IR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2504.05321v2

EAGER: Edge-Aligned LLM Defense for Robust, Efficient, and Accurate Cybersecurity Question Answering

Large Language Models (LLMs) are highly effective for cybersecurity question answering (QA) but are difficult to deploy on edge devices due to their size. Quantization reduces memory and compute requirements but often degrades accuracy and increases vulnerability to adversarial attacks. We present EAGER, an edge-aligned defense framework that integrates parameter-efficient quantization with domain-specific preference alignment to jointly optimize efficiency, robustness, and accuracy. Unlike prior methods that address these aspects separately, EAGER leverages Quantized Low-Rank Adaptation (QLoRA) for low-cost fine-tuning and Direct Preference Optimization (DPO) on a self-constructed cybersecurity preference dataset, eliminating the need for human labels. Experiments show that EAGER reduces adversarial attack success rates by up to 7.3x and improves QA accuracy by up to 55% over state-of-the-art defenses, while achieving the lowest response latency on a Jetson Orin, demonstrating its practical edge deployment.

Updated: 2025-11-24 06:49:48

标题: EAGER：面向边缘对齐的LLM防御，用于强大、高效和准确的网络安全问题回答

摘要: 大型语言模型(LLMs)在网络安全问题回答(QA)方面非常有效，但由于其体积较大，很难部署在边缘设备上。量化减少了内存和计算需求，但通常会降低准确性，并增加对对抗攻击的脆弱性。我们提出了EAGER，一种边缘对齐的防御框架，将参数高效量化与领域特定的偏好对齐相结合，共同优化效率、稳健性和准确性。与先前分别解决这些方面的方法不同，EAGER利用Quantized Low-Rank Adaptation (QLoRA)进行低成本微调，并在自构建的网络安全偏好数据集上进行Direct Preference Optimization (DPO)，消除了对人类标签的需求。实验表明，EAGER将对抗攻击成功率降低了最多7.3倍，并将QA准确性提高了最多55%，超过了最先进的防御方法，同时在Jetson Orin上实现了最低的响应延迟，展示了其实际边缘部署的可行性。

更新时间: 2025-11-24 06:49:48

领域: cs.CR

下载: http://arxiv.org/abs/2511.19523v1

A Rule-Based Approach to Specifying Preferences over Conflicting Facts and Querying Inconsistent Knowledge Bases

Repair-based semantics have been extensively studied as a means of obtaining meaningful answers to queries posed over inconsistent knowledge bases (KBs). While several works have considered how to exploit a priority relation between facts to select optimal repairs, the question of how to specify such preferences remains largely unaddressed. This motivates us to introduce a declarative rule-based framework for specifying and computing a priority relation between conflicting facts. As the expressed preferences may contain undesirable cycles, we consider the problem of determining when a set of preference rules always yields an acyclic relation, and we also explore a pragmatic approach that extracts an acyclic relation by applying various cycle removal techniques. Towards an end-to-end system for querying inconsistent KBs, we present a preliminary implementation and experimental evaluation of the framework, which employs answer set programming to evaluate the preference rules, apply the desired cycle resolution techniques to obtain a priority relation, and answer queries under prioritized-repair semantics.
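Checking whether a set of derived preferences is acyclic amounts to cycle detection in a directed graph whose edges point from the preferred fact to the less preferred one. The sketch below does this with a depth-first search in plain Python. It is only an illustration of the acyclicity check, not the answer set programming encoding used in the paper, and the fact names are placeholders.

```python
def find_cycle(preferences):
    """Return one preference cycle as a list of facts, or None if acyclic.

    preferences: iterable of (preferred, less_preferred) pairs.
    """
    graph = {}
    for a, b in preferences:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {n: WHITE for n in graph}
    path = []

    def visit(n):
        color[n] = GREY
        path.append(n)
        for m in graph[n]:
            if color[m] == GREY:
                # Back edge: m is on the current path, so a cycle closes here.
                return path[path.index(m):] + [m]
            if color[m] == WHITE:
                cycle = visit(m)
                if cycle:
                    return cycle
        path.pop()
        color[n] = BLACK
        return None

    for n in graph:
        if color[n] == WHITE:
            cycle = visit(n)
            if cycle:
                return cycle
    return None
```

A cycle-removal technique could then, for example, drop every preference on the returned cycle and repeat until `find_cycle` returns `None`. This is one of several possible strategies and is not necessarily one the authors evaluate.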

Updated: 2025-11-24 06:49:36

标题: 一种基于规则的方法：规范对冲突事实的偏好并查询不一致的知识库

摘要: 修复为基础的语义学已被广泛研究，作为获取对不一致知识库（KBs）提出的查询的有意义答案的手段。尽管有几项工作考虑了如何利用事实之间的优先关系来选择最佳修复方案，但如何指定这种偏好的问题仍然未得到广泛解决。这促使我们引入了一个基于声明规则的框架，用于指定和计算冲突事实之间的优先关系。由于表达的偏好可能包含不良循环，我们考虑确定一组偏好规则何时总是产生无环关系的问题，并探索一种实用方法，通过应用各种循环移除技术来提取无环关系。为了实现一个端到端的系统，用于查询不一致的KBs，我们提出了一个初步实现和框架的实验评估，该框架利用答案集编程来评估偏好规则，应用所需的循环解决技术来获取优先关系，并在优先修复语义下回答查询。

更新时间: 2025-11-24 06:49:36

领域: cs.LO,cs.AI,cs.DB

下载: http://arxiv.org/abs/2508.07742v2

Uncertainty of Network Topology with Applications to Out-of-Distribution Detection

Persistent homology (PH) is a crucial concept in computational topology, providing a multiscale topological description of a space. It is particularly significant in topological data analysis, which aims to make statistical inference from a topological perspective. In this work, we introduce a new topological summary for Bayesian neural networks, termed the predictive topological uncertainty (pTU). The proposed pTU measures the uncertainty in the interaction between the model and the inputs. It provides insights from the model perspective: if two samples interact with a model in a similar way, then they are considered identically distributed. We also show that the pTU is insensitive to the model architecture. As an application, pTU is used to solve the out-of-distribution (OOD) detection problem, which is critical to ensure model reliability. Failure to detect OOD input can lead to incorrect and unreliable predictions. To address this issue, we propose a significance test for OOD based on the pTU, providing a statistical framework for this issue. The effectiveness of the framework is validated through various experiments, in terms of its statistical power, sensitivity, and robustness.

Updated: 2025-11-24 06:39:45

Categories: stat.ML,cs.LG,stat.ME

Download: http://arxiv.org/abs/2511.18813v1

Mitigating Long-Tail Bias in HOI Detection via Adaptive Diversity Cache

Human-Object Interaction (HOI) detection is a fundamental task in computer vision, empowering machines to comprehend human-object relationships in diverse real-world scenarios. Recent advances in vision-language models (VLMs) have significantly improved HOI detection by leveraging rich cross-modal representations. However, most existing VLM-based approaches rely heavily on additional training or prompt tuning, resulting in substantial computational overhead and limited scalability, particularly in long-tailed scenarios where rare interactions are severely underrepresented. In this paper, we propose the Adaptive Diversity Cache (ADC) module, a novel training-free and plug-and-play mechanism designed to mitigate long-tail bias in HOI detection. ADC constructs class-specific caches that accumulate high-confidence and diverse feature representations during inference. The method incorporates frequency-aware cache adaptation that favors rare categories and is designed to enable robust prediction calibration without requiring additional training or fine-tuning. Extensive experiments on the HICO-DET and V-COCO datasets show that ADC consistently improves existing HOI detectors, achieving up to +8.57% mAP gain on rare categories and +4.39% on the full dataset, demonstrating its effectiveness in mitigating long-tail bias while preserving overall performance.

Updated: 2025-11-24 06:30:08

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.18811v1

HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations

Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincare manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
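To make the geometric intuition concrete, the sketch below computes the standard geodesic distance in the Poincaré ball, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). It illustrates only the textbook distance function. How HyperbolicRAG learns or uses its embeddings is not specified in the abstract, and the function name is an assumption.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Points near the boundary make the denominator small, so distances grow
    # quickly with depth; this is what lets the ball encode hierarchy.
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

origin = [0.0, 0.0]
near_root = [0.5, 0.0]
near_leaf = [0.95, 0.0]
d_root = poincare_distance(origin, near_root)
d_leaf = poincare_distance(origin, near_leaf)
```

The Euclidean gap between `near_root` and `near_leaf` is under half a unit, yet the hyperbolic distance from the origin more than triples. This is the sense in which hyperbolic space provides a notion of hierarchical depth that Euclidean embeddings lack.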

Updated: 2025-11-24 06:27:58

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.18808v1

FedRef: Communication-Efficient Bayesian Fine-Tuning using a Reference Model

Federated learning (FL) trains artificial intelligence (AI) models collaboratively while protecting user data privacy: clients share with the server only the model updates generated by local training on their own data. However, model performance may suffer due to data and system heterogeneity among clients in FL scenarios. Previous studies have proposed model optimization, fine-tuning, and personalization to achieve improved model performance. Despite these efforts, models resulting from FL scenarios often exhibit catastrophic forgetting, which increases the communication and computational costs of clients for model optimization and raises energy consumption. To address these challenges, we propose a reference model-based fine-tuning method for federated learning that overcomes catastrophic forgetting in each round. Our method is derived from Bayesian parameter-efficient transfer learning and includes a proximal term. It employs a reference model that incorporates previous model parameters and reviews previous global features in the model optimization step to mitigate catastrophic forgetting. As a result, our method achieves higher model performance and lower communication and computational costs for clients than existing methods.
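A proximal term generally penalizes the distance between the current weights and some anchor. The sketch below shows a generic quadratic form, (mu / 2)‖w − w_ref‖², and the gradient step it induces. This is an assumption for illustration: the abstract does not give FedRef's exact objective or how the reference model is built, and all names here are hypothetical.

```python
def proximal_penalty(weights, reference, mu):
    """Quadratic penalty (mu / 2) * ||w - w_ref||^2 pulling weights toward a reference."""
    return 0.5 * mu * sum((w - r) ** 2 for w, r in zip(weights, reference))

def proximal_gradient_step(weights, grads, reference, mu, lr):
    """One gradient step on task_loss + proximal_penalty.

    grads holds the task-loss gradient; mu * (w - r) is the penalty's gradient.
    """
    return [w - lr * (g + mu * (w - r)) for w, g, r in zip(weights, grads, reference)]

# Hypothetical numbers: with a zero task gradient, the step moves toward the reference.
updated = proximal_gradient_step([1.0], [0.0], [0.0], mu=1.0, lr=0.1)
```

A larger `mu` keeps the client closer to the reference parameters, which is the usual mechanism by which such a term limits drift away from previously learned behavior.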

Updated: 2025-11-24 06:24:33

Categories: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2506.23210v4

Time-Aware and Transition-Semantic Graph Neural Networks for Interpretable Predictive Business Process Monitoring

Predictive Business Process Monitoring (PBPM) aims to forecast future events in ongoing cases based on historical event logs. While Graph Neural Networks (GNNs) are well suited to capture structural dependencies in process data, existing GNN-based PBPM models remain underdeveloped. Most rely either on short prefix subgraphs or global architectures that overlook temporal relevance and transition semantics. We propose a unified, interpretable GNN framework that advances the state of the art along three key axes. First, we compare prefix-based Graph Convolutional Networks (GCNs) and full-trace Graph Attention Networks (GATs) to quantify the performance gap between localized and global modeling. Second, we introduce a novel time-decay attention mechanism that constructs dynamic, prediction-centered windows, emphasizing temporally relevant history and suppressing noise. Third, we embed transition-type semantics into edge features to enable fine-grained reasoning over structurally ambiguous traces. Our architecture includes multilevel interpretability modules, offering diverse visualizations of attention behavior. Evaluated on five benchmarks, the proposed models achieve competitive Top-k accuracy and DL scores without per-dataset tuning. By addressing architectural, temporal, and semantic gaps, this work presents a robust, generalizable, and explainable solution for next-event prediction in PBPM.
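One simple way to picture time-decay attention is to subtract a penalty proportional to each event's age from its raw attention score before the softmax. The sketch below does exactly that. The linear-in-time penalty, the `decay_rate` parameter, and the function name are assumptions; the abstract does not specify the paper's actual formulation.

```python
import math

def time_decay_attention(scores, time_gaps, decay_rate):
    """Softmax over raw scores penalized by elapsed time since each event.

    scores: raw attention scores for past events (hypothetical inputs).
    time_gaps: time elapsed between each event and the prediction point.
    """
    logits = [s - decay_rate * dt for s, dt in zip(scores, time_gaps)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three events with equal raw scores; the most recent one (gap 0) dominates.
weights = time_decay_attention([1.0, 1.0, 1.0], [0.0, 1.0, 2.0], decay_rate=1.0)
```

With `decay_rate=0.0` this reduces to ordinary softmax attention, so the decay rate controls how sharply the window focuses on recent history.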

Updated: 2025-11-24 06:23:22

Categories: cs.LG

Download: http://arxiv.org/abs/2508.09527v2

REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate the collection of synthetic data. In our experiments, our prover using only supervised fine-tuning achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).

Updated: 2025-11-24 06:18:18

Categories: cs.CL,cs.AI,cs.LG,cs.LO

Download: http://arxiv.org/abs/2505.20613v3

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
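For readers unfamiliar with the metric, pass@k is commonly estimated from n sampled solutions, c of which are correct, as 1 − C(n − c, k) / C(n, k). The sketch below implements that widely used unbiased estimator. Whether the paper computes pass@k in exactly this way is an assumption, as the abstract does not say.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn per problem, c: number of correct samples, k <= n.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 1 correct answer among 10 samples.
single_try = pass_at_k(10, 1, 1)
ten_tries = pass_at_k(10, 1, 10)
```

The example shows why large k probes the boundary of what a model can solve at all: a problem solved in only one of ten samples scores 0.1 at k = 1 but counts as fully covered at k = 10.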

Updated: 2025-11-24 06:11:04

Categories: cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2504.13837v5

How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks

We investigate the performance of large language models on repetitive deterministic prediction tasks and study how the sequence accuracy rate scales with output length. Each such task involves repeating the same operation n times. Examples include letter replacement in strings following a given rule, integer addition, and multiplication of string operators in many-body quantum mechanics. If the model performs the task through a simple repetition algorithm, the success rate should decay exponentially with sequence length. In contrast, our experiments on leading large language models reveal a sharp double-exponential drop beyond a characteristic length scale, forming an accuracy cliff that marks the transition from reliable to unstable generation. This indicates that the models fail to execute each operation independently. To explain this phenomenon, we propose a statistical-physics-inspired model that captures the competition between external conditioning from the prompt and internal interference among generated tokens. The model quantitatively reproduces the observed crossover and provides an interpretable link between attention-induced interference and sequence-level failure. Fitting the model to empirical results across multiple models and tasks yields effective parameters that characterize the intrinsic error rate and error accumulation factor for each model-task pair, offering a principled framework for understanding the limits of deterministic accuracy in large language models.
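The exponential baseline mentioned above follows from treating each of the n operations as an independent trial with the same success probability p, so the whole sequence is correct with probability p^n. The sketch below computes only this baseline. The per-step accuracy is a hypothetical value, and the paper's double-exponential model is not reproduced here.

```python
import math

def independent_success_rate(per_step_accuracy, n_steps):
    """Sequence-level accuracy if every step succeeds independently: p ** n."""
    return per_step_accuracy ** n_steps

# Hypothetical per-step accuracy of 99%.
p = 0.99
rate_10 = independent_success_rate(p, 10)
rate_100 = independent_success_rate(p, 100)
# On a log scale the baseline is a straight line in n with slope log(p).
log_slope = math.log(rate_100) / 100
```

Under this baseline, log accuracy falls linearly with length. A measured curve that bends downward faster than this line is the signature of the accuracy cliff the abstract describes.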

Updated: 2025-11-24 06:11:01

Categories: cs.AI

Download: http://arxiv.org/abs/2511.00763v2

NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations

Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, its practical application is severely hindered by high inference latency, which makes it infeasible for high-throughput, real-time services and limits its overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically require separate draft models and model-based verifiers, requiring additional training and increasing the latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets. The system has been deployed on Taobao since October 2025, driving billion-level advertising revenue and serving hundreds of millions of daily active users.
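A hash-set verifier can be pictured as a membership test: drafted token sequences are kept only if they spell the identifier of an item that actually exists. The sketch below assumes each item is identified by a fixed-length tuple of tokens. That layout, the catalog contents, and the function names are hypothetical and not taken from the paper.

```python
def build_valid_id_set(catalog):
    """Store every known item identifier as a hashable tuple of tokens."""
    return {tuple(tokens) for tokens in catalog}

def filter_drafts(drafts, valid_ids):
    """Keep drafted sequences that match a real item; drop hallucinated ones.

    Each check is an average-case O(1) set lookup, with no model call involved.
    """
    return [draft for draft in drafts if tuple(draft) in valid_ids]

# Hypothetical catalog of three items and three drafted candidates.
valid_ids = build_valid_id_set([(3, 17, 42), (3, 17, 8), (5, 1, 9)])
accepted = filter_drafts([[3, 17, 42], [3, 17, 99], [5, 1, 9]], valid_ids)
```

The second draft is rejected because no item has the identifier `(3, 17, 99)`. Since the check never invokes a model, it adds little latency compared with a model-based verifier.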

Updated: 2025-11-24 05:53:46

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.18793v1

ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion

Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
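Spherical linear interpolation (slerp) moves along the arc between two vectors rather than the straight chord, which keeps the interpolant's norm stable when the endpoints have similar norms. The sketch below is a plain implementation for non-zero vectors that are not antiparallel. What ReBrain actually interpolates (for instance, latent representations of neighboring slices) is not stated in the abstract, so the inputs here are purely illustrative.

```python
import math

def slerp(p0, p1, t):
    """Interpolate between vectors p0 and p1 along the arc, for t in [0, 1].

    Assumes both vectors are non-zero and not pointing in opposite directions.
    """
    norm0 = math.sqrt(sum(x * x for x in p0))
    norm1 = math.sqrt(sum(x * x for x in p1))
    cos_omega = sum(a * b for a, b in zip(p0, p1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(p0, p1)]
    sin_omega = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(p0, p1)]

midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

For the two unit vectors above, the midpoint stays on the unit circle, whereas linear interpolation would give a vector of norm about 0.71.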

Updated: 2025-11-24 05:52:53

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.17068v2

SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth

The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0-6), middle childhood (7-12), and adolescence (13-18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.

Updated: 2025-11-24 05:52:00

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.11009v2

SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion

Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.

Updated: 2025-11-24 05:44:55

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.05264v4

GFlowGR: Fine-tuning Generative Recommendation Frameworks with Generative Flow Networks

Generative recommendations (GR), which usually include item tokenizers and generative Large Language Models (LLMs), have demonstrated remarkable success across a wide range of scenarios. The majority of existing research efforts primarily concentrate on developing powerful item tokenizers or advancing LLM decoding strategies to attain superior performance. However, the critical fine-tuning step in GR frameworks, which is essential for adapting LLMs to recommendation data, remains largely unexplored. Current approaches predominantly rely on either the next-token prediction loss of supervised fine-tuning (SFT) or recommendation-specific direct preference optimization (DPO) strategies. Both methods ignore the exploration of possible positive unobserved samples, which is commonly referred to as the exposure bias problem. To mitigate this problem, this paper treats GR as a multi-step generation task and constructs a GFlowNets-based fine-tuning framework (GFlowGR). The proposed framework integrates collaborative knowledge from traditional recommender systems to create an adaptive trajectory sampler and a comprehensive reward model. Leveraging the diverse generation property of GFlowNets, along with sampling and heuristic weighting techniques, GFlowGR emerges as a promising approach to mitigate the exposure bias problem. Extensive empirical results on two real-world datasets and with two different GR backbones highlight the effectiveness and robustness of GFlowGR.

Updated: 2025-11-24 05:43:01

Categories: cs.IR,cs.AI

Download: http://arxiv.org/abs/2506.16114v2

RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation

Content moderation pipelines for modern large language models combine static filters, dedicated moderation services, and alignment-tuned base models, yet real-world deployments still exhibit dangerous failure modes. This paper presents RoguePrompt, an automated jailbreak attack that converts a disallowed user query into a self-reconstructing prompt which passes provider moderation while preserving the original harmful intent. RoguePrompt partitions the instruction across two lexical streams, applies nested classical ciphers, and wraps the result in natural language directives that cause the target model to decode and execute the hidden payload. Our attack assumes only black-box access to the model and to the associated moderation endpoint. We instantiate RoguePrompt against GPT-4o and evaluate it on 2,448 prompts that a production moderation system previously marked as strongly rejected. Under an evaluation protocol that separates three security-relevant outcomes (bypass, reconstruction, and execution), the attack attains 84.7% bypass, 80.2% reconstruction, and 71.5% full execution, substantially outperforming five automated jailbreak baselines. We further analyze the behavior of several automated and human-aligned evaluators and show that dual-layer lexical transformations remain effective even when detectors rely on semantic similarity or learned safety rubrics. Our results highlight systematic blind spots in current moderation practice and suggest that robust deployment will require joint reasoning about user intent, decoding workflows, and model-side computation rather than surface-level toxicity alone.

Updated: 2025-11-24 05:42:54

Categories: cs.CR

Download: http://arxiv.org/abs/2511.18790v1

Doubly Wild Refitting: Model-Free Evaluation of High Dimensional Black-Box Predictions under Convex Losses

We study the problem of excess risk evaluation for empirical risk minimization (ERM) under general convex loss functions. Our contribution is an efficient refitting procedure that computes the excess risk and provides high-probability upper bounds under the fixed-design setting. Assuming only black-box access to the training algorithm and a single dataset, we begin by generating two sets of artificially modified pseudo-outcomes termed wild responses, created by stochastically perturbing the gradient vectors with carefully chosen scaling. Using these two pseudo-labeled datasets, we then refit the black-box procedure twice to obtain two corresponding wild predictors. Finally, leveraging the original predictor, the two wild predictors, and the constructed wild responses, we derive an efficient excess risk upper bound. A key feature of our analysis is that it requires no prior knowledge of the complexity of the underlying function class. As a result, the method is essentially model-free and holds significant promise for theoretically evaluating modern opaque machine learning systems, such as deep neural networks and generative models, where traditional capacity-based learning theory becomes infeasible due to the extreme complexity of the hypothesis class.

Updated: 2025-11-24 05:38:47

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.18789v1

Understanding Task Transfer in Vision-Language Models

Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.

Updated: 2025-11-24 05:37:52

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2511.18787v1

Hypergraph Contrastive Learning for both Homophilic and Heterophilic Hypergraphs

Hypergraphs, as a generalization of traditional graphs, naturally capture high-order relationships. In recent years, hypergraph neural networks (HNNs) have been widely used to capture complex high-order relationships. However, most existing hypergraph neural network methods inherently rely on the homophily assumption, which often does not hold in real-world scenarios that exhibit significant heterophilic structures. To address this limitation, we propose HONOR, a novel unsupervised Hypergraph cONtrastive learning framework suitable for both homOphilic and heteRophilic hypergraphs. Specifically, HONOR explicitly models the heterophilic relationships between hyperedges and nodes through two complementary mechanisms: a prompt-based hyperedge feature construction strategy that maintains global semantic consistency while suppressing local noise, and an adaptive attention aggregation module that dynamically captures the diverse local contributions of nodes to hyperedges. Combined with high-pass filtering, these designs enable HONOR to fully exploit heterophilic connection patterns, yielding more discriminative and robust node and hyperedge representations. Theoretically, we demonstrate the superior generalization ability and robustness of HONOR. Empirically, extensive experiments further validate that HONOR consistently outperforms state-of-the-art baselines on both homophilic and heterophilic datasets.

Updated: 2025-11-24 05:35:46

Categories: cs.LG, cs.SI

Download: http://arxiv.org/abs/2511.18783v1

Perturbing the Derivative: Wild Refitting for Model-Free Evaluation of Machine Learning Models under Bregman Losses

We study the excess risk evaluation of classical penalized empirical risk minimization (ERM) with Bregman losses. We show that by leveraging the idea of wild refitting, one can efficiently upper bound the excess risk through the so-called "wild optimism," without relying on the global structure of the underlying function class. This property makes our approach inherently model-free. Unlike conventional analysis, our framework operates with just one dataset and black-box access to the training procedure. The method involves randomized Rademacher symmetrization and the construction of artificially modified outputs by perturbation in the derivative space with appropriate scaling, on which we retrain a second predictor for excess risk estimation. We establish a high-probability performance guarantee under the fixed design setting, demonstrating that wild refitting under Bregman losses, with an appropriately chosen wild noise scale, yields a valid upper bound on the excess risk. Thus, our work is promising for theoretically evaluating modern opaque ML models, such as deep neural networks and generative models, where the function class is too complex for classical learning theory and empirical process techniques.
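
The refitting loop described above can be sketched for the simplest Bregman loss, the squared loss, where the derivative map is the identity and perturbing in derivative space reduces to perturbing the fitted values. This is a minimal sketch under stated assumptions: ridge regression stands in for the black-box training procedure, the noise scale `rho` is a free parameter, and the `wild_optimism` statistic is one plausible illustrative form rather than the paper's exact bound.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Stand-in for the black-box training procedure (assumption: ridge)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def wild_refit(X, y, lam=1.0, rho=1.0, seed=0):
    """One round of wild refitting under squared loss (illustrative only)."""
    rng = np.random.default_rng(seed)
    f_hat = X @ ridge_fit(X, y, lam)            # first predictor, fitted values
    resid = y - f_hat                           # training residuals
    eps = rng.choice([-1.0, 1.0], size=len(y))  # Rademacher signs
    y_wild = f_hat + rho * eps * resid          # artificially modified outputs
    f_wild = X @ ridge_fit(X, y_wild, lam)      # second predictor, refit on wild data
    # Assumed form of the optimism statistic: correlation between the
    # sign-flipped residuals and the shift induced by refitting.
    wild_optimism = float(np.mean(eps * resid * (f_wild - f_hat)))
    return f_hat, f_wild, wild_optimism

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
f_hat, f_wild, opt = wild_refit(X, y)
```

Only the training routine is called, twice, on a single dataset, which mirrors the black-box access the abstract describes.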

Updated: 2025-11-24 05:35:06

Categories: stat.ML, cs.LG

Download: http://arxiv.org/abs/2509.02476v7

A Novel Dual-Stream Framework for dMRI Tractography Streamline Classification with Joint dMRI and fMRI Data

Streamline classification is essential to identify anatomically meaningful white matter tracts from diffusion MRI (dMRI) tractography. However, current streamline classification methods rely primarily on the geometric features of the streamline trajectory, failing to distinguish between functionally distinct fiber tracts with similar pathways. To address this, we introduce a novel dual-stream streamline classification framework that jointly analyzes dMRI and functional MRI (fMRI) data to enhance the functional coherence of tract parcellation. We design a novel network that performs streamline classification using a pretrained backbone model for full streamline trajectories, while augmenting with an auxiliary network that processes fMRI signals from fiber endpoint regions. We demonstrate our method by parcellating the corticospinal tract (CST) into its four somatotopic subdivisions. Experimental results from ablation studies and comparisons with state-of-the-art methods demonstrate our approach's superior performance.

Updated: 2025-11-24 05:31:47

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2511.18781v1

ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: First, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; Second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.

Updated: 2025-11-24 05:27:05

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2511.18780v1

SAOT: An Enhanced Locality-Aware Spectral Transformer for Solving PDEs

Neural operators have shown great potential in solving a family of Partial Differential Equations (PDEs) by modeling the mappings between input and output functions. Fourier Neural Operator (FNO) implements global convolutions via parameterizing the integral operators in Fourier space. However, it often results in over-smoothing solutions and fails to capture local details and high-frequency components. To address these limitations, we investigate incorporating the spatial-frequency localization property of Wavelet transforms into the Transformer architecture. We propose a novel Wavelet Attention (WA) module with linear computational complexity to efficiently learn locality-aware features. Building upon WA, we further develop the Spectral Attention Operator Transformer (SAOT), a hybrid spectral Transformer framework that integrates WA's localized focus with the global receptive field of Fourier-based Attention (FA) through a gated fusion block. Experimental results demonstrate that WA significantly mitigates the limitations of FA and outperforms existing Wavelet-based neural operators by a large margin. By integrating the locality-aware and global spectral representations, SAOT achieves state-of-the-art performance on six operator learning benchmarks and exhibits strong discretization-invariant ability.

Updated: 2025-11-24 05:22:28

Categories: cs.LG

Download: http://arxiv.org/abs/2511.18777v1

Rethinking Garment Conditioning in Diffusion-based Virtual Try-On

Virtual Try-On (VTON) is the task of synthesizing an image of a person wearing a target garment, conditioned on a person image and a garment image. While diffusion-based VTON models featuring a Dual UNet architecture demonstrate superior fidelity compared to single UNet models, they incur substantial computational and memory overhead due to their heavy structure. In this study, through visualization analysis and theoretical analysis, we derived three hypotheses regarding the learning of context features to condition the denoising process. Based on these hypotheses, we developed Re-CatVTON, an efficient single UNet model that achieves high performance. We further enhance the model by introducing a modified classifier-free guidance strategy tailored for VTON's spatial concatenation conditioning, and by directly injecting the ground-truth garment latent derived from the clean garment latent to prevent the accumulation of prediction error. The proposed Re-CatVTON significantly improves performance compared to its predecessor (CatVTON) and requires less computation and memory than the high-performance Dual UNet model, Leffa. Our results demonstrate improved FID, KID, and LPIPS scores, with only a marginal decrease in SSIM, establishing a new efficiency-performance trade-off for single UNet VTON models.

Updated: 2025-11-24 05:19:44

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2511.18775v1

Priors in Time: Missing Inductive Biases for Language Model Interpretability

Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.

Updated: 2025-11-24 05:16:44

Categories: cs.LG

Download: http://arxiv.org/abs/2511.01836v3

Sampling Control for Imbalanced Calibration in Semi-Supervised Learning

Class imbalance remains a critical challenge in semi-supervised learning (SSL), especially when distributional mismatches between labeled and unlabeled data lead to biased classification. Although existing methods address this issue by adjusting logits based on the estimated class distribution of unlabeled data, they often handle model imbalance in a coarse-grained manner, conflating data imbalance with bias arising from varying class-specific learning difficulties. To address this issue, we propose a unified framework, SC-SSL, which suppresses model bias through decoupled sampling control. During training, we identify the key variables for sampling control under ideal conditions. By introducing a classifier with explicit expansion capability and adaptively adjusting sampling probabilities across different data distributions, SC-SSL mitigates feature-level imbalance for minority classes. In the inference phase, we further analyze the weight imbalance of the linear classifier and apply post-hoc sampling control with an optimization bias vector to directly calibrate the logits. Extensive experiments across various benchmark datasets and distribution settings validate the consistency and state-of-the-art performance of SC-SSL.

Updated: 2025-11-24 05:15:58

Categories: cs.LG, cs.CV, stat.ML

Download: http://arxiv.org/abs/2511.18773v1

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

Deep neural networks (DNNs) have become valuable intellectual property of model owners, due to the substantial resources required for their development. To protect these assets in the deployed environment, recent research has proposed model usage control mechanisms to ensure models cannot be used without proper authorization. These methods typically lock the utility of the model by embedding an access key into its parameters. However, they often assume static deployment, and largely fail to withstand continual post-deployment model updates, such as fine-tuning or task-specific adaptation. In this paper, we propose ADALOC, to endow key-based model usage control with adaptability during model evolution. It strategically selects a subset of weights as an intrinsic access key, which enables all model updates to be confined to this key throughout the evolution lifecycle. ADALOC enables using the access key to restore the keyed model to the latest authorized states without redistributing the entire network (i.e., adaptation), and frees the model owner from full re-keying after each model update (i.e., lock preservation). We establish a formal foundation to underpin ADALOC, providing crucial bounds such as the errors introduced by updates restricted to the access key. Experiments on standard benchmarks, such as CIFAR-100, Caltech-256, and Flowers-102, and modern architectures, including ResNet, DenseNet, and ConvNeXt, demonstrate that ADALOC achieves high accuracy under significant updates while retaining robust protections. Specifically, authorized usages consistently achieve strong task-specific performance, while unauthorized usage accuracy drops to near-random guessing levels (e.g., 1.01% on CIFAR-100), compared to up to 87.01% without ADALOC. This shows that ADALOC can offer a practical solution for adaptive and protected DNN deployment in evolving real-world scenarios.
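
The idea of confining every model update to the access key can be illustrated with a masked gradient step. This is only a sketch of the general principle: the function name, the learning rate, and the hand-picked `key_mask` are hypothetical, and the abstract does not say how ADALOC actually selects its key weights.

```python
import numpy as np

def update_key_weights(weights, grads, key_mask, lr=0.1):
    """Apply a gradient step only to the weights flagged by key_mask."""
    updated = weights.copy()
    updated[key_mask] -= lr * grads[key_mask]  # non-key weights stay frozen
    return updated

rng = np.random.default_rng(0)
weights = rng.normal(size=10)
grads = rng.normal(size=10)

# Hypothetical key selection, chosen by hand for illustration only.
key_mask = np.zeros(10, dtype=bool)
key_mask[[1, 4, 7]] = True

new_weights = update_key_weights(weights, grads, key_mask)
```

Because the weights outside the mask never change, only the updated key values would need to be redistributed after an update, which is the property the abstract calls adaptation.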

Updated: 2025-11-24 05:13:45

Categories: cs.CR, cs.AI

Download: http://arxiv.org/abs/2511.18772v1

Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective

Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to the root, i.e., classifier guidance, pinpoint the key assumption for the derivation, and conduct a systematic study to understand the role of the classifier. On 1D data, we find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and is hard to learn. To validate this classifier-centric perspective on high-dimensional data, we assess whether a flow-matching postprocessing step that is designed to narrow the gap between a pre-trained diffusion model's learned distribution and the real data distribution, especially near decision boundaries, can improve the performance. Experiments on various datasets verify our classifier-centric understanding.
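
The abstract does not spell out the guidance rule itself. One widely used form of classifier-free guidance extrapolates from the unconditional noise prediction toward the conditional one, and can be sketched as follows. The function and variable names are illustrative, and conventions differ between papers (some parameterize the scale as 1 + w).

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    guidance_scale > 1 extrapolates past the conditional prediction.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, 2.0])
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
```

With a scale above 1 the combined prediction moves further in the direction that separates the conditional from the unconditional estimate, which is consistent with the abstract's picture of trajectories being pushed away from decision boundaries.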

Updated: 2025-11-24 05:01:32

Categories: cs.CV, cs.AI, cs.LG

Download: http://arxiv.org/abs/2503.10638v3

Unsupervised Multi-View Visual Anomaly Detection via Progressive Homography-Guided Alignment

Unsupervised visual anomaly detection from multi-view images presents a significant challenge: distinguishing genuine defects from benign appearance variations caused by viewpoint changes. Existing methods, often designed for single-view inputs, treat multiple views as a disconnected set of images, leading to inconsistent feature representations and a high false-positive rate. To address this, we introduce ViewSense-AD (VSAD), a novel framework that learns viewpoint-invariant representations by explicitly modeling geometric consistency across views. At its core is our Multi-View Alignment Module (MVAM), which leverages homography to project and align corresponding feature regions between neighboring views. We integrate MVAM into a View-Align Latent Diffusion Model (VALDM), enabling progressive and multi-stage alignment during the denoising process. This allows the model to build a coherent and holistic understanding of the object's surface from coarse to fine scales. Furthermore, a lightweight Fusion Refiner Module (FRM) enhances the global consistency of the aligned features, suppressing noise and improving discriminative power. Anomaly detection is performed by comparing multi-level features from the diffusion model against a learned memory bank of normal prototypes. Extensive experiments on the challenging RealIAD and MANTA datasets demonstrate that VSAD sets a new state of the art, significantly outperforming existing methods in pixel-, view-, and sample-level visual anomaly detection and proving its robustness to large viewpoint shifts and complex textures.
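
The geometric operation underlying homography-based alignment is the projection of 2D coordinates through a 3x3 matrix in homogeneous coordinates. The sketch below shows only that basic mapping. It is not the MVAM module, and the example matrix is an arbitrary translation chosen for illustration.

```python
import numpy as np

def apply_homography(H, pts):
    """Map an (n, 2) array of points through a 3x3 homography H."""
    ones = np.ones((pts.shape[0], 1))
    pts_h = np.hstack([pts, ones])         # lift to homogeneous coordinates
    mapped = pts_h @ H.T                   # apply the projective transform
    return mapped[:, :2] / mapped[:, 2:3]  # divide by the third coordinate

# Arbitrary example: a pure translation by (2, 3).
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0], [1.0, 1.0]])
aligned = apply_homography(H, pts)
```

In a feature-alignment setting, the same mapping would be applied to the coordinates of a feature grid so that regions from a neighboring view can be resampled into the reference view.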

Updated: 2025-11-24 05:01:16

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2511.18766v1

Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kinds of features emerge, how and under which conditions this happens, and how it relates to the gradient dynamics of training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li}_2$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning. At the lazy learning stage, the top layer overfits to random hidden representations and the model appears to memorize. Thanks to lazy learning and weight decay, the backpropagated gradient $G_F$ from the top layer now carries information about the target label, with a specific structure that enables each hidden node to learn its representation independently. Interestingly, the independent dynamics follow exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima-induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate, and sample size in grokking, leads to provable scaling laws of feature emergence, memorization, and generalization, and reveals why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer networks. The code is available at https://github.com/yuandong-tian/understanding/tree/main/ssl/real-dataset/cogo.

Updated: 2025-11-24 04:59:04

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2509.21519v4

Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography

Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architectures, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with CBAM not only produced the best overlap metrics, with a Dice score of 0.755 and an IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model's superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the regions most influential to the model's predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architectures, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.
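
For readers unfamiliar with the overlap metrics quoted above, the standard definitions of the Dice score and IoU for binary masks can be sketched as follows. The masks are toy data, and treating two empty masks as a perfect match is a convention assumed here, not something the abstract states.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()  # overlapping foreground pixels
    union = np.logical_or(pred, target).sum()   # pixels in either mask
    total = pred.sum() + target.sum()
    dice = 2.0 * inter / total if total > 0 else 1.0  # assumed empty-mask convention
    iou = inter / union if union > 0 else 1.0
    return float(dice), float(iou)

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two metrics are tied by IoU = Dice / (2 - Dice). Scores averaged over many cases need not satisfy that identity.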

Updated: 2025-11-24 04:55:39

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2510.25522v4

HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and enabling efficient construction of arguments. However, purely informal reasoning is prone to logical gaps and subtle errors that are difficult to detect and correct. In contrast, formal theorem proving provides rigorous, verifiable mathematical reasoning, where each inference step is checked by a trusted compiler in systems such as Lean, but lacks the exploratory freedom of informal problem solving. This mismatch leaves current LLM-based math agents without a principled way to combine the strengths of both paradigms. In this work, we introduce Hermes, the first tool-assisted agent that explicitly interleaves informal reasoning with formally verified proof steps in Lean. The framework performs intermediate formal checking to prevent reasoning drift and employs a memory module that maintains proof continuity across long, multi-step reasoning chains, enabling both exploration and verification within a single workflow. We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales, from small models to state-of-the-art systems. Across all settings, Hermes reliably improves the reasoning accuracy of base models while substantially reducing token usage and computational cost compared to reward-based approaches. On difficult datasets such as AIME'25, Hermes achieves up to a 67% accuracy improvement while using 80% fewer total inference FLOPs. The implementation and codebase are publicly available at https://github.com/aziksh-ospanov/HERMES.

Updated: 2025-11-24 04:50:18

Categories: cs.AI, cs.FL

Download: http://arxiv.org/abs/2511.18760v1

Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from the Chain-of-Thoughts (CoTs) solving mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.

Updated: 2025-11-24 04:50:07

Categories: cs.LG

Download: http://arxiv.org/abs/2511.07372v2

SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage

Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs' vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task, such as a masked language model task or an element lookup by position task, to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on the AdvBench dataset, with the masked language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and a harmful score (HS) of 4.57, and with the element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and an HS of 4.43.

Updated: 2025-11-24 04:42:27

Categories: cs.CR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2412.15289v5

Quantitative Attractor Analysis of High-Capacity Kernel Logistic Regression Hopfield Networks

Kernel-based learning methods such as Kernel Logistic Regression (KLR) can substantially increase the storage capacity of Hopfield networks, but the principles governing their performance and stability remain largely uncharacterized. This paper presents a comprehensive quantitative analysis of the attractor landscape in KLR-trained networks to establish a solid foundation for their design and application. Through extensive, statistically validated simulations, we address critical questions of generality, scalability, and robustness. Our comparative analysis shows that KLR and Kernel Ridge Regression (KRR) exhibit similarly high storage capacities and clean attractor landscapes under typical operating conditions, suggesting that this behavior is a general property of kernel regression methods, although KRR is computationally much faster. We identify a non-trivial, scale-dependent law for the kernel width $γ$, demonstrating that optimal capacity requires $γ$ to be scaled such that $γN$ increases with network size $N$. This finding implies that larger networks require more localized kernels, in which each pattern's influence is more spatially confined, to mitigate inter-pattern interference. Under this optimized scaling, we provide clear evidence that storage capacity scales linearly with network size ($P \propto N$). Furthermore, our sensitivity analysis shows that performance is remarkably robust with respect to the choice of the regularization parameter $λ$. Collectively, these findings provide a concise set of empirical principles for designing high-capacity and robust associative memories and clarify the mechanisms that enable kernel methods to overcome the classical limitations of Hopfield-type models.

Updated: 2025-11-24 04:42:20

Categories: cs.LG,cs.NE

Download: http://arxiv.org/abs/2505.01218v4

Can Large Language Models Detect Misinformation in Scientific News Reporting?

Scientific facts are often spun in the popular press with the intent to influence public opinion and action, as was evidenced during the COVID-19 pandemic. Automatic detection of misinformation in the scientific domain is challenging because of the distinct styles of writing in these two media types, and it is still in its nascence. Most research on the validity of scientific reporting treats this problem as a claim verification challenge. In doing so, significant expert human effort is required to generate appropriate claims. Our solution bypasses this step and addresses a more real-world scenario where such explicit, labeled claims may not be available. The central research question of this paper is whether it is possible to use large language models (LLMs) to detect misinformation in scientific reporting. To this end, we first present a new labeled dataset, SciNews, containing 2.4k scientific news stories drawn from trusted and untrustworthy sources, paired with related abstracts from the CORD-19 database. Our dataset includes both human-written and LLM-generated news articles, making it more comprehensive in terms of capturing the growing trend of using LLMs to generate popular press articles. Then, we identify dimensions of scientific validity in science news articles and explore how these can be integrated into the automated detection of scientific misinformation. We propose several baseline architectures using LLMs to automatically detect false representations of scientific findings in the popular press. For each of these architectures, we use several prompt engineering strategies, including zero-shot, few-shot, and chain-of-thought prompting. We also test these architectures and prompting strategies on GPT-3.5, GPT-4, Llama2-7B, and Llama2-13B.

Updated: 2025-11-24 04:39:03

Categories: cs.CL,cs.AI,cs.SI

Download: http://arxiv.org/abs/2402.14268v2

On Instability of Minimax Optimal Optimism-Based Bandit Algorithms

Statistical inference from data generated by multi-armed bandit (MAB) algorithms is challenging due to their adaptive, non-i.i.d. nature. A classical manifestation is that sample averages of arm rewards under bandit sampling may fail to satisfy a central limit theorem. Lai and Wei's stability condition provides a sufficient, and essentially necessary, criterion for asymptotic normality in bandit problems. While the celebrated Upper Confidence Bound (UCB) algorithm satisfies this stability condition, it is not minimax optimal, raising the question of whether minimax optimality and statistical stability can be achieved simultaneously. In this paper, we analyze the stability properties of a broad class of bandit algorithms that are based on the optimism principle. We establish general structural conditions under which such algorithms violate the Lai-Wei stability criterion. As a consequence, we show that widely used minimax-optimal UCB-style algorithms, including MOSS, Anytime-MOSS, Vanilla-MOSS, ADA-UCB, OC-UCB, KL-MOSS, KL-UCB++, KL-UCB-SWITCH, and Anytime KL-UCB-SWITCH, are unstable. We further complement our theoretical results with numerical simulations demonstrating that, in all these cases, the sample means fail to exhibit asymptotic normality. Overall, our findings suggest a fundamental tension between stability and minimax optimal regret. Whether bandit algorithms can be designed to be simultaneously stable and minimax optimal remains an important open direction.

Updated: 2025-11-24 04:23:26

Categories: stat.ML,cs.IT,cs.LG,math.ST

Download: http://arxiv.org/abs/2511.18750v1
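
To make the optimism principle from the bandit abstract above concrete, here is a minimal sketch of two index rules. It assumes the standard textbook forms of the UCB1 index and the MOSS index; the function names are illustrative, and nothing here reproduces the paper's stability analysis or its exact algorithm variants.

```python
import math


def ucb1_index(mean: float, pulls: int, t: int) -> float:
    """Textbook UCB1 index: empirical mean plus sqrt(2 ln t / pulls)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def moss_index(mean: float, pulls: int, horizon: int, num_arms: int) -> float:
    """Textbook MOSS index: the log term is clipped at zero,
    so the bonus disappears once pulls >= horizon / num_arms."""
    log_term = max(0.0, math.log(horizon / (num_arms * pulls)))
    return mean + math.sqrt(log_term / pulls)


def choose_arm(indices: list[float]) -> int:
    """Optimism principle: play the arm with the largest index."""
    return max(range(len(indices)), key=lambda i: indices[i])
```

In this form the MOSS bonus drops to exactly zero once an arm has been pulled at least `horizon / num_arms` times, whereas the UCB1 bonus stays strictly positive for any `t > 1`.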

Evaluation of Real-Time Mitigation Techniques for Cyber Security in IEC 61850 / IEC 62351 Substations

The digitalization of substations enlarges the cyber-attack surface, necessitating effective detection and mitigation of cyber attacks in digital substations. While machine learning-based intrusion detection has been widely explored, such methods have not demonstrated detection and mitigation within the required real-time budget. In contrast, cryptographic authentication has emerged as a practical candidate for real-time cyber defense, as specified in IEC 62351. In addition, lightweight rule-based intrusion detection that validates IEC 61850 semantics can provide specification-based detection of anomalous or malicious traffic with minimal processing delay. This paper presents the design logic and implementation aspects of three potential real-time mitigation techniques capable of countering GOOSE-based attacks: (i) an IEC 62351-compliant message authentication code (MAC) scheme, (ii) a semantics-enforced rule-based intrusion detection system (IDS), and (iii) a hybrid approach integrating both MAC verification and the IDS. A comparative evaluation of these real-time mitigation approaches is conducted using a cyber-physical system (CPS) security testbed. The results show that the hybrid integration significantly enhances mitigation capability. Furthermore, the processing delays of all three methods remain within the strict delivery requirements of GOOSE communication. The study also identifies limitations that none of the techniques can fully address, highlighting areas for future work.

Updated: 2025-11-24 04:20:49

Categories: cs.CR,eess.SY

Download: http://arxiv.org/abs/2511.18748v1
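
The hybrid idea in the substation abstract above (MAC verification plus semantic rule checks) can be sketched as follows. This is a toy illustration under loud assumptions: `GooseFrame` is a hypothetical, heavily simplified message rather than the real GOOSE wire format, HMAC-SHA256 with a pre-shared key stands in for whatever IEC 62351 actually specifies, and the two counter rules are a simplified reading of GOOSE `stNum`/`sqNum` behaviour, not the paper's rule set.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GooseFrame:
    """Hypothetical simplified GOOSE message (not the real wire format)."""
    st_num: int     # state number, assumed to increase on a state change
    sq_num: int     # sequence number, assumed to count retransmissions
    payload: bytes  # dataset values
    tag: bytes      # authentication tag appended by the publisher


def _body(st_num: int, sq_num: int, payload: bytes) -> bytes:
    # Serialize the authenticated fields in a fixed order.
    return st_num.to_bytes(4, "big") + sq_num.to_bytes(4, "big") + payload


def sign(key: bytes, st_num: int, sq_num: int, payload: bytes) -> GooseFrame:
    tag = hmac.new(key, _body(st_num, sq_num, payload), hashlib.sha256).digest()
    return GooseFrame(st_num, sq_num, payload, tag)


def mac_ok(key: bytes, frame: GooseFrame) -> bool:
    expected = hmac.new(
        key, _body(frame.st_num, frame.sq_num, frame.payload), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected, frame.tag)


def semantics_ok(prev: Optional[GooseFrame], frame: GooseFrame) -> bool:
    """Assumed rules: same state -> sq_num increments by one;
    new state -> st_num increases and sq_num restarts at zero."""
    if prev is None:
        return True
    if frame.st_num == prev.st_num:
        return frame.sq_num == prev.sq_num + 1
    return frame.st_num > prev.st_num and frame.sq_num == 0


def accept(key: bytes, prev: Optional[GooseFrame], frame: GooseFrame) -> bool:
    """Hybrid check: a frame must pass both the MAC and the semantic rules."""
    return mac_ok(key, frame) and semantics_ok(prev, frame)
```

In this sketch the two checks are complementary: a replayed frame carries a valid tag but fails the counter rules, while a forged frame with plausible counters fails the MAC check.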

Any4D: Open-Prompt 4D Generation from Natural Language and Images

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. There is a naive observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restrict video generation to fixed shorter horizons. Our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

Updated: 2025-11-24 04:17:26

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.18746v1

Survival Analysis with Machine Learning for Predicting Li-ion Battery Remaining Useful Life

Battery degradation significantly impacts the reliability and efficiency of energy storage systems, particularly in electric vehicles and industrial applications. Predicting the remaining useful life (RUL) of lithium-ion batteries is crucial for optimizing maintenance schedules, reducing costs, and improving safety. Traditional RUL prediction methods often struggle with nonlinear degradation patterns and uncertainty quantification. To address these challenges, we propose a hybrid survival analysis framework integrating survival data reconstruction, survival model learning, and survival probability estimation. Our approach transforms battery voltage time series into time-to-failure data using path signatures. Multiple Cox-based survival models and machine-learning-based methods, such as DeepHit and MTLR, are then trained to predict battery failure-free probabilities over time. Experiments conducted on the Toyota battery and NASA battery datasets demonstrate the effectiveness of our approach, achieving high time-dependent AUC and concordance index (C-Index) while maintaining a low integrated Brier score.

Updated: 2025-11-24 04:17:23

Categories: eess.SP,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.13558v7
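
The concordance index (C-Index) reported in the battery survival abstract above can be illustrated with a short sketch. It assumes the common Harrell-style definition over comparable pairs with right-censored data; the function name and the toy inputs are illustrative, and this is not the paper's evaluation code.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Harrell-style C-index for right-censored data.

    A pair (i, j) is comparable when unit i is observed to fail
    (events[i] == 1) strictly before time j. The pair is concordant when
    the earlier failure was assigned the higher risk; ties count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored unit cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

For example, with failure cycles `[100, 200, 300, 400]`, event flags `[1, 1, 0, 1]`, and risk scores that decrease with lifetime, every comparable pair is correctly ordered; a score of 0.5 corresponds to uninformative rankings.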

RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context

Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear pipeline that runs from planning to searching to writing a report, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.

Updated: 2025-11-24 04:12:41

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.18743v1

ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion

Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation. Our approach consistently enhances sampling efficiency and human-preference alignment compared to score-based baselines, and achieves results on par with existing state-of-the-art and open-source text-to-image models while requiring lower compute and smaller model size, offering a lightweight yet performant solution for human text-to-image generation.

Updated: 2025-11-24 04:10:53

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.18742v1
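
The contrast between forward (explicit) and backward (implicit) discretization in the ProxT2I abstract above can be seen on a toy problem. The sketch below is only an analogy under stated assumptions: it discretizes the gradient flow of the quadratic f(x) = a·x²/2, for which the backward Euler step coincides with the closed-form proximal operator. It does not involve diffusion models or the paper's learned proximal operators, and all names are illustrative.

```python
from typing import Callable


def forward_step(x: float, step: float, a: float) -> float:
    """Explicit (forward Euler) step on dx/dt = -a * x."""
    return x - step * a * x


def prox_step(x: float, step: float, a: float) -> float:
    """Implicit (backward Euler) step: solve y = x - step * a * y for y.
    This equals the proximal operator of step * f at x for f(x) = a * x^2 / 2."""
    return x / (1.0 + step * a)


def run(
    step_fn: Callable[[float, float, float], float],
    x0: float,
    step: float,
    a: float,
    num_steps: int,
) -> float:
    """Apply the same update rule num_steps times starting from x0."""
    x = x0
    for _ in range(num_steps):
        x = step_fn(x, step, a)
    return x
```

With `a = 10` and a large step of `0.5`, the explicit update multiplies `x` by `-4` each iteration and diverges, while the proximal update multiplies it by `1/6` and contracts toward the minimizer; this is the usual reason implicit schemes tolerate fewer, larger steps.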

A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection

Time series anomaly detection is widely used in IoT and cyber-physical systems, yet its evaluation remains challenging due to diverse application objectives and heterogeneous metric assumptions. This study introduces a problem-oriented framework that reinterprets existing metrics based on the specific evaluation challenges they are designed to address, rather than their mathematical forms or output structures. We categorize over twenty commonly used metrics into six dimensions: 1) basic accuracy-driven evaluation; 2) timeliness-aware reward mechanisms; 3) tolerance to labeling imprecision; 4) penalties reflecting human-audit cost; 5) robustness against random or inflated scores; and 6) parameter-free comparability for cross-dataset benchmarking. Comprehensive experiments are conducted to examine metric behavior under genuine, random, and oracle detection scenarios. By comparing their resulting score distributions, we quantify each metric's discriminative ability -- its capability to distinguish meaningful detections from random noise. The results show that while most event-level metrics exhibit strong separability, several widely used metrics (e.g., NAB, Point-Adjust) demonstrate limited resistance to random-score inflation. These findings reveal that metric suitability must be inherently task-dependent and aligned with the operational objectives of IoT applications. The proposed framework offers a unified analytical perspective for understanding existing metrics and provides practical guidance for selecting or developing more context-aware, robust, and fair evaluation methodologies for time series anomaly detection.

Updated: 2025-11-24 04:09:04

Categories: cs.AI,cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.18739v1
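
The Point-Adjust protocol named in the anomaly-detection abstract above is easy to state in code. The sketch below assumes the commonly used definition (if any point inside a ground-truth anomaly segment is flagged, the whole segment counts as detected) and pairs it with a plain point-wise F1; the function names are illustrative and this is not the paper's benchmark code.

```python
from typing import List, Sequence


def point_adjust(labels: Sequence[int], preds: Sequence[int]) -> List[int]:
    """Mark every point of a true anomaly segment as detected
    when at least one point in that segment was predicted."""
    adjusted = list(preds)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] == 1:
            j = i
            while j < n and labels[j] == 1:
                j += 1  # j ends one past the segment
            if any(preds[k] == 1 for k in range(i, j)):
                for k in range(i, j):
                    adjusted[k] = 1
            i = j
        else:
            i += 1
    return adjusted


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Point-wise F1; returns 0.0 when there are no true positives."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A single hit inside a four-point anomaly segment moves the point-wise F1 from 0.4 to 1.0 after adjustment, which shows how one lucky detection inside a long segment can raise the score sharply.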

FoleyBench: A Benchmark For Video-to-Audio Models

Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench

Updated: 2025-11-24 04:08:20

Categories: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2511.13219v2

Thinking Ahead: Foresight Intelligence in MLLMs and World Models

In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.

Updated: 2025-11-24 04:04:59

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.18735v1

Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.

Updated: 2025-11-24 04:02:48

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.18734v1

SoK: The Security-Safety Continuum of Multimodal Foundation Models through Information Flow and Global Game-Theoretic Analysis of Asymmetric Threats

Multimodal foundation models (MFMs) integrate diverse data modalities to support complex and wide-ranging tasks. However, this integration also introduces distinct safety and security challenges. In this paper, we unify the concepts of safety and security in the context of MFMs by identifying critical threats that arise from both model behavior and system-level interactions. We propose a taxonomy grounded in information theory, evaluating risks through the concepts of channel capacity, signal, noise, and bandwidth. This perspective provides a principled way to analyze how information flows through MFMs and how vulnerabilities can emerge across modalities. Building on this foundation, we introduce a deterministic minimax formulation to analyze defense mechanisms and expose structural vulnerabilities in multimodal systems. Our framework projects attacks onto the noise, signal, and bandwidth axes, collapsing the defense search space and mitigating defender asymmetry. Across 15 defenses, we find that system-level bandwidth and behavior constraints generalize substantially better than brittle model-only methods. Finally, we formalize an MFM "self-destruction threshold" that specifies when termination should be triggered, providing a concrete activation rule for circuit-breaker safeguards within multimodal systems.

Updated: 2025-11-24 03:58:11

Categories: cs.CR

Download: http://arxiv.org/abs/2411.11195v5

OceanForecastBench: A Benchmark Dataset for Data-Driven Global Ocean Forecasting

Global ocean forecasting aims to predict key ocean variables such as temperature, salinity, and currents, which is essential for understanding and describing oceanic phenomena. In recent years, data-driven deep learning-based ocean forecast models, such as XiHe, WenHai, LangYa and AI-GOMS, have demonstrated significant potential in capturing complex ocean dynamics and improving forecasting efficiency. Despite these advancements, the absence of open-source, standardized benchmarks has led to inconsistent data usage and evaluation methods. This gap hinders efficient model development, impedes fair performance comparison, and constrains interdisciplinary collaboration. To address this challenge, we propose OceanForecastBench, a benchmark offering three core contributions: (1) high-quality global ocean reanalysis data spanning 28 years for model training, including 4 ocean variables across 23 depth levels and 4 sea surface variables; (2) high-reliability satellite and in-situ observations for model evaluation, covering approximately 100 million locations in the global ocean; and (3) an evaluation pipeline and a comprehensive benchmark with 6 typical baseline models, leveraging observations to evaluate model performance from multiple perspectives. OceanForecastBench represents the most comprehensive benchmarking framework currently available for data-driven ocean forecasting, offering an open-source platform for model development, evaluation, and comparison. The dataset and code are publicly available at: https://github.com/Ocean-Intelligent-Forecasting/OceanForecastBench.

Updated: 2025-11-24 03:57:43

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.18732v1

Large-Scale In-Game Outcome Forecasting for Match, Team and Players in Football using an Axial Transformer Neural Network

Football (soccer) is a sport that is characterised by complex game play, where players perform a variety of actions, such as passes, shots, tackles, and fouls, in order to score goals and ultimately win matches. Accurately forecasting the total number of each action that each player will complete during a match is desirable for a variety of applications, including tactical decision-making, sports betting, and television broadcast commentary and analysis. Such predictions must consider the game state, the ability and skill of the players in both teams, the interactions between the players, and the temporal dynamics of the game as it develops. In this paper, we present a transformer-based neural network that jointly and recurrently predicts the expected totals for thirteen individual actions at multiple time-steps during the match, with predictions made for each individual player, for each team, and at the game level. The neural network is based on an axial transformer that efficiently captures the temporal dynamics as the game progresses, and the interactions between the players at each time-step. We present a novel axial transformer design that we show is equivalent to a regular sequential transformer, and the design performs well experimentally. We show empirically that the model can make consistent and reliable predictions, and efficiently makes approximately 75,000 live predictions at low latency for each game.

Updated: 2025-11-24 03:47:59

Categories: cs.LG

Download: http://arxiv.org/abs/2511.18730v1

Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. Our starting point is a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling, Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

Updated: 2025-11-24 03:42:01

Categories: cs.RO,cs.AI,cs.MM

Download: http://arxiv.org/abs/2508.20840v3

Reinforcement Learning for Self-Healing Material Systems

The transition to autonomous material systems necessitates adaptive control methodologies to maximize structural longevity. This study frames the self-healing process as a Reinforcement Learning (RL) problem within a Markov Decision Process (MDP), enabling agents to autonomously derive optimal policies that efficiently balance structural integrity maintenance against finite resource consumption. A comparative evaluation of discrete-action (Q-learning, DQN) and continuous-action (TD3) agents in a stochastic simulation environment revealed that RL controllers significantly outperform heuristic baselines, achieving near-complete material recovery. Crucially, the TD3 agent utilizing continuous dosage control demonstrated superior convergence speed and stability, underscoring the necessity of fine-grained, proportional actuation in dynamic self-healing applications.
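As a rough illustration of framing healing as an MDP, the sketch below runs tabular Q-learning on a toy environment. The damage levels, the 0.3 damage probability, the 0.5 healing cost, and all hyperparameters are hypothetical choices made for this example. They are not taken from the paper's simulation.

```python
import random

N_LEVELS = 5        # hypothetical damage levels: 0 (intact) .. 4 (severe)
ACTIONS = (0, 1)    # 0 = wait, 1 = release one unit of healing agent

def step(damage, action, rng):
    """Toy stochastic dynamics: healing repairs one level, damage may then grow."""
    if action == 1:
        damage = max(0, damage - 1)
    if rng.random() < 0.3:
        damage = min(N_LEVELS - 1, damage + 1)
    # Reward trades structural integrity against resource consumption.
    reward = -float(damage) - 0.5 * action
    return damage, reward

def train(episodes=500, horizon=50, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_LEVELS)]
    for _ in range(episodes):
        s = rng.randrange(N_LEVELS)
        for _ in range(horizon):
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[s][act])
            s2, r = step(s, a, rng)
            # Standard Q-learning update toward the bootstrapped target.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q
```

The greedy policy can be read off as `[max(ACTIONS, key=lambda a: row[a]) for row in train()]`. A continuous-action agent such as TD3 would replace the binary heal/wait choice with a dosage value, which is beyond this sketch.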

Updated: 2025-11-24 03:42:00

Categories: cs.LG

Download: http://arxiv.org/abs/2511.18728v1

LogSyn: A Few-Shot LLM Framework for Structured Insight Extraction from Unstructured General Aviation Maintenance Logs

Aircraft maintenance logs hold valuable safety data but remain underused due to their unstructured text format. This paper introduces LogSyn, a framework that uses Large Language Models (LLMs) to convert these logs into structured, machine-readable data. Using few-shot in-context learning on 6,169 records, LogSyn performs Controlled Abstraction Generation (CAG) to summarize problem-resolution narratives and classify events within a detailed hierarchical ontology. The framework identifies key failure patterns, offering a scalable method for semantic structuring and actionable insight extraction from maintenance logs. This work provides a practical path to improve maintenance workflows and predictive analytics in aviation and related industries.

Updated: 2025-11-24 03:41:57

Categories: cs.LG

Download: http://arxiv.org/abs/2511.18727v1

GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration

The sparse Mixture-of-Experts (MoE) architecture of large language models (LLMs) confronts an inherent issue of load imbalance arising from the simplistic linear router strategy, which ultimately causes unstable and inefficient learning in LLMs. To address this challenge, we introduce GMoE, a novel graph-based MoE framework aimed at enhancing the collaboration among multiple experts. In GMoE, a graph router function is designed to capture the collaboration signals among experts. This enables all experts to dynamically allocate information derived from input data by sharing information with their neighboring experts. Moreover, we put forward two coordination strategies in GMoE, the Poisson distribution-based distinction strategy and the Normal distribution-based balance strategy, to further release the capacity of each expert and increase model stability in the fine-tuning of LLMs. Specifically, we leverage a parameter-efficient fine-tuning technique, Low-Rank Adaptation (LoRA), to implement the graph MoE architecture. Extensive experiments on four real-world benchmark datasets demonstrate the effectiveness of GMoE, showing the benefits of facilitating collaboration among multiple experts in LLM fine-tuning. The code for the experimental implementation is available at https://github.com/BAI-LAB/GMoE

Updated: 2025-11-24 03:37:48

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2412.16216v4

Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov--Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
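The two quantities named in the abstract can be sketched in a few lines. The sketch assumes that eRank follows the common definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and that the KS distance is the standard two-sample statistic. The paper's exact formulation may differ, and the function names are illustrative.

```python
import math
from bisect import bisect_right

def effective_rank(singular_values):
    """exp(entropy) of the normalized singular-value distribution."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

A head whose singular values are all equal has an effective rank equal to their count, while a head dominated by a single singular value has an effective rank near 1, which is what makes the measure a plausible signal of redundancy.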

Updated: 2025-11-24 03:37:14

Categories: cs.CV,cs.AI,cs.IT,cs.LG

Download: http://arxiv.org/abs/2511.19518v1

DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning

Autonomous vehicles must navigate safely in complex driving environments. Imitating a single expert trajectory, as in regression-based approaches, usually does not explicitly assess the safety of the predicted trajectory. Selection-based methods address this by generating and scoring multiple trajectory candidates and predicting the safety score for each. However, they face optimization challenges in precisely selecting the best option from thousands of candidates and distinguishing subtle but safety-critical differences, especially in rare and challenging scenarios. We propose DriveSuprim to overcome these challenges and advance the selection-based paradigm through a coarse-to-fine paradigm for progressive candidate filtering, a rotation-based augmentation method to improve robustness in out-of-distribution scenarios, and a self-distillation framework to stabilize training. DriveSuprim achieves state-of-the-art performance, reaching 93.5% PDMS in NAVSIM v1 and 87.1% EPDMS in NAVSIM v2 without extra data, with 83.02 Driving Score and 60.00 Success Rate on the Bench2Drive benchmark, demonstrating superior planning capabilities in various driving scenarios.
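The coarse-to-fine idea of progressive candidate filtering can be sketched generically: a cheap scorer prunes the candidate set, and a more expensive scorer is applied only to the survivors. The two scoring functions and the toy candidates below are hypothetical stand-ins, not the learned scorers used by DriveSuprim.

```python
def coarse_to_fine_select(candidates, coarse_score, fine_score, keep):
    """Return the best candidate after a cheap filter and an expensive re-score."""
    # Stage 1: rank every candidate with the cheap scorer and keep the top few.
    shortlist = sorted(candidates, key=coarse_score, reverse=True)[:keep]
    # Stage 2: spend the expensive scorer only on the shortlist.
    return max(shortlist, key=fine_score)

# Toy usage: candidates are lateral offsets in metres (hypothetical).
offsets = [x / 2 for x in range(-10, 11)]
best = coarse_to_fine_select(
    offsets,
    coarse_score=lambda x: -abs(x),        # rough preference for the lane centre
    fine_score=lambda x: -abs(x - 0.4),    # finer score that separates close candidates
    keep=5,
)
```

With thousands of candidates, the fine scorer runs on only `keep` of them, so its capacity can be spent on distinguishing subtle differences among the most promising options.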

Updated: 2025-11-24 03:32:51

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2506.06659v3

Ellipsoid-Based Decision Boundaries for Open Intent Classification

Textual open intent classification is crucial for real-world dialogue systems, enabling detection of unknown user intents without prior knowledge and thereby making the system more robust. While adaptive decision boundary methods have shown great potential by eliminating manual threshold tuning, existing approaches assume isotropic distributions of known classes, restricting boundaries to balls and overlooking distributional variance along different directions. To address this limitation, we propose EliDecide, a novel method that learns ellipsoid decision boundaries with varying scales along different feature directions. First, we employ supervised contrastive learning to obtain a discriminative feature space for known samples. Second, we apply learnable matrices to parameterize ellipsoids as the boundaries of each known class, offering greater flexibility than spherical boundaries defined solely by centers and radii. Third, we optimize the boundaries via a newly designed dual loss function that balances empirical and open-space risks: expanding boundaries to cover known samples while contracting them against synthesized pseudo-open samples. Our method achieves state-of-the-art performance on multiple text intent benchmarks and on a question classification dataset. The flexibility of the ellipsoids yields superior open intent detection and suggests strong potential for generalization to more text classification tasks in diverse, complex open-world scenarios.
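A minimal sketch of an ellipsoid boundary test follows. It assumes each known class is described by a center `c` and a matrix `A`, with a sample `x` counted as inside when the squared norm of `A(x - c)` is at most 1. This is one common parameterization and is an assumption here, since the abstract does not give the exact form. A ball of radius `r` is the special case where `A` is the identity scaled by `1/r`.

```python
def ellipsoid_score(x, center, A):
    """Squared norm of A(x - center); values <= 1 lie inside the ellipsoid."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    proj = [sum(a * d for a, d in zip(row, diff)) for row in A]
    return sum(p * p for p in proj)

def classify(x, classes):
    """classes maps a class name to (center, A). Returns a known class or 'open'."""
    best_name, best_score = "open", 1.0
    for name, (center, A) in classes.items():
        score = ellipsoid_score(x, center, A)
        if score <= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical 2-D feature space: semi-axis 2.0 along x, 0.5 along y.
classes = {"book_flight": ([0.0, 0.0], [[0.5, 0.0], [0.0, 2.0]])}
```

Here `classify([1.5, 0.0], classes)` falls inside the ellipsoid, while `classify([0.0, 1.5], classes)` is rejected as open. A ball of radius 2 around the same center would have accepted both points, which is the limitation of isotropic boundaries that the abstract describes.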

Updated: 2025-11-24 03:32:27

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.16685v2

N2N: A Parallel Framework for Large-Scale MILP under Distributed Memory

Parallelization has emerged as a promising approach for accelerating MILP solving. However, the complexity of the branch-and-bound (B&B) framework and the numerous effective algorithm components in MILP solvers make it difficult to parallelize. In this study, a scalable parallel framework, N2N (a node-to-node framework that maps the B&B nodes to distributed computing nodes), was proposed to solve large-scale problems in a distributed memory computing environment. Both deterministic and nondeterministic modes are supported, and the framework is designed to be easily integrated with existing solvers. Regarding the deterministic mode, a novel sliding-window-based algorithm was designed and implemented to ensure that tasks are generated and solved in a deterministic order. Moreover, several advanced techniques, such as the utilization of CP search and general primal heuristics, have been developed to fully utilize distributed computing resources and the capabilities of base solvers. Adaptive solving and data communication optimization were also investigated. A popular open-source MILP solver, SCIP, was integrated into N2N as the base solver, yielding N2N-SCIP. Extensive computational experiments were conducted to evaluate the performance of N2N-SCIP compared to ParaSCIP, which is a state-of-the-art distributed parallel MILP solver under the UG framework. The nondeterministic N2N-SCIP achieves speedups of 22.52 and 12.71 with 1,000 MPI processes on the Kunpeng and x86 computing clusters, respectively, making it 1.98 and 2.08 times faster than ParaSCIP. In the deterministic mode, N2N-SCIP also shows significant performance improvements over ParaSCIP across different process numbers and computing clusters. To validate the generality of N2N, HiGHS, another open-source solver, was integrated into N2N. The related results are analyzed, and the requirements that N2N places on base solvers are summarized.

Updated: 2025-11-24 03:29:55

Categories: cs.AI,cs.DC,math.OC

Download: http://arxiv.org/abs/2511.18723v1

Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict '$k$-unstable' assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, '$(k, \varepsilon)$-unstable', to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of $(k, \varepsilon)$-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.

Updated: 2025-11-24 03:25:16

Categories: cs.LG

Download: http://arxiv.org/abs/2511.18721v1

FinAudio: A Benchmark for Audio Large Language Models in Financial Applications

Audio Large Language Models (AudioLLMs) have received widespread attention and have significantly improved performance on audio tasks such as conversation, audio understanding, and automatic speech recognition (ASR). Despite these advancements, there is no benchmark for assessing AudioLLMs in financial scenarios, where audio data, such as earnings conference calls and CEO speeches, are crucial resources for financial analysis and investment decisions. In this paper, we introduce FinAudio, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. We first define three tasks based on the unique characteristics of the financial domain: 1) ASR for short financial audio, 2) ASR for long financial audio, and 3) summarization of long financial audio. We then curate two short and two long audio datasets and develop a novel dataset for financial audio summarization; together these make up the FinAudio benchmark. Finally, we evaluate seven prevalent AudioLLMs on FinAudio. Our evaluation reveals the limitations of existing AudioLLMs in the financial domain and offers insights for improving them. All datasets and code will be released.

Updated: 2025-11-24 03:22:59

Categories: cs.CE,cs.AI,cs.MM

Download: http://arxiv.org/abs/2503.20990v2

AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation

We introduce AIRHILT (Aviation Integrated Reasoning, Human-in-the-Loop Testbed), a modular and lightweight simulation environment designed to evaluate multimodal pilot and air traffic control (ATC) assistance systems for aviation conflict detection. Built on the open-source Godot engine, AIRHILT synchronizes pilot and ATC radio communications, visual scene understanding from camera streams, and ADS-B surveillance data within a unified, scalable platform. The environment supports pilot- and controller-in-the-loop interactions, providing a comprehensive scenario suite covering both terminal area and en route operational conflicts, including communication errors and procedural mistakes. AIRHILT offers standardized JSON-based interfaces that enable researchers to easily integrate, swap, and evaluate automatic speech recognition (ASR), visual detection, decision-making, and text-to-speech (TTS) models. We demonstrate AIRHILT through a reference pipeline incorporating fine-tuned Whisper ASR, YOLO-based visual detection, ADS-B-based conflict logic, and GPT-OSS-20B structured reasoning, and present preliminary results from representative runway-overlap scenarios, where the assistant achieves an average time-to-first-warning of approximately 7.7 s, with average ASR and vision latencies of approximately 5.9 s and 0.4 s, respectively. The AIRHILT environment and scenario suite are openly available, supporting reproducible research on multimodal situational awareness and conflict detection in aviation; code and scenarios are available at https://github.com/ogarib3/airhilt.

Updated: 2025-11-24 03:18:55

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2511.18718v1

When and What to Recommend: Joint Modeling of Timing and Content for Active Sequential Recommendation

Sequential recommendation models user preferences to predict the next target item. Most existing work is passive: the system responds only when users open the application and misses opportunities after it is closed. We investigate active recommendation, which predicts the next interaction time and actively delivers items. This raises two challenges: accurately estimating the Time of Interest (ToI) and generating the Item of Interest (IoI) conditioned on the predicted ToI. We propose PASRec, a diffusion-based framework that aligns ToI and IoI via a joint objective. Experiments on five benchmarks show that PASRec outperforms eight state-of-the-art baselines under leave-one-out and temporal splits.

Updated: 2025-11-24 03:16:10

Categories: cs.IR,cs.LG

Download: http://arxiv.org/abs/2511.18717v1

GRIT-LP: Graph Transformer with Long-Range Skip Connection and Partitioned Spatial Graphs for Accurate Ice Layer Thickness Prediction

Graph transformers have demonstrated remarkable capability on complex spatio-temporal tasks, yet their depth is often limited by oversmoothing and weak long-range dependency modeling. To address these challenges, we introduce GRIT-LP, a graph transformer explicitly designed for polar ice-layer thickness estimation from polar radar imagery. Accurately estimating ice layer thickness is critical for understanding snow accumulation, reconstructing past climate patterns, and reducing uncertainties in projections of future ice sheet evolution and sea level rise. GRIT-LP combines an inductive geometric graph learning framework with a self-attention mechanism, and introduces two major innovations that jointly address challenges in modeling the spatio-temporal patterns of ice layers: a partitioned spatial graph construction strategy that forms overlapping, fully connected local neighborhoods to preserve spatial coherence and suppress noise from irrelevant long-range links, and a long-range skip connection mechanism within the transformer that improves information flow and mitigates oversmoothing in deeper attention layers. We conducted extensive experiments, demonstrating that GRIT-LP outperforms current state-of-the-art methods with a 24.92% improvement in root mean squared error. These results highlight the effectiveness of graph transformers in modeling spatio-temporal patterns by capturing both localized structural features and long-range dependencies across internal ice layers, and demonstrate their potential to advance data-driven understanding of cryospheric processes.
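The partitioned spatial graph construction can be sketched as follows, under the simplifying assumption that nodes are indexed along a one-dimensional track and that neighborhoods are sliding windows. The `window` and `stride` parameters are hypothetical. The paper's actual partitioning rule is not given in the abstract.

```python
from itertools import combinations

def partitioned_edges(num_nodes, window, stride):
    """Undirected edges of overlapping, fully connected local neighborhoods."""
    if not 1 <= stride < window:
        raise ValueError("need 1 <= stride < window so that neighborhoods overlap")
    edges = set()
    start = 0
    while True:
        block = range(start, min(start + window, num_nodes))
        # Fully connect every pair of nodes inside this neighborhood.
        edges.update(combinations(block, 2))
        if start + window >= num_nodes:
            break
        start += stride
    return edges
```

For example, `partitioned_edges(5, 3, 2)` builds the neighborhoods {0, 1, 2} and {2, 3, 4}. Node 2 belongs to both, so information can pass between them, but no direct long-range edge such as (0, 4) is created.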

Updated: 2025-11-24 03:14:55

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2511.18716v1

HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions

Large Language Models (LLMs) have made remarkable progress in their ability to interact with external interfaces. Selecting reasonable external interfaces has thus become a crucial step in constructing LLM agents. In contrast to invoking API tools, directly calling AI models across different modalities from the community (e.g., HuggingFace) poses challenges due to the vast scale (> 10k), metadata gaps, and unstructured descriptions. Current methods for model selection often incorporate entire model descriptions into prompts, resulting in prompt bloat, wasted tokens, and limited scalability. To address these issues, we propose HuggingR$^4$, a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to select models efficiently. Specifically, we first perform multiple rounds of reasoning and retrieval to get a coarse list of candidate models. Then, we conduct fine-grained refinement by analyzing candidate model descriptions, followed by reflection to assess results and determine whether the retrieval scope needs to be expanded. This method reduces token consumption considerably by decoupling user query processing from complex model description handling. Through a pre-established vector database, complex model descriptions are stored externally and retrieved on demand, allowing the LLM to concentrate on interpreting user intent while accessing only relevant candidate models without prompt bloat. In the absence of standardized benchmarks, we construct a multimodal human-annotated dataset comprising 14,399 user requests across 37 tasks and conduct a thorough evaluation. HuggingR$^4$ attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing existing methods by 26.51% and 33.25%, respectively, on GPT-4o-mini.

Updated: 2025-11-24 03:13:45

Categories: cs.AI

Download: http://arxiv.org/abs/2511.18715v1

MAGMA-Edu: Multi-Agent Generative Multimodal Framework for Text-Diagram Educational Question Generation

Educational illustrations play a central role in communicating abstract concepts, yet current multimodal large language models (MLLMs) remain limited in producing pedagogically coherent and semantically consistent educational visuals. We introduce MAGMA-Edu, a self-reflective multi-agent framework that unifies textual reasoning and diagrammatic synthesis for structured educational problem generation. Unlike existing methods that treat text and image generation independently, MAGMA-Edu employs a two-stage co-evolutionary pipeline: (1) a generation-verification-reflection loop that iteratively refines question statements and solutions for mathematical accuracy, and (2) a code-based intermediate representation that enforces geometric fidelity and semantic alignment during image rendering. Both stages are guided by internal self-reflection modules that evaluate and revise outputs until domain-specific pedagogical constraints are met. Extensive experiments on multimodal educational benchmarks demonstrate the superiority of MAGMA-Edu over state-of-the-art MLLMs. Compared to GPT-4o, MAGMA-Edu improves the average textual metric from 57.01 to 92.31 (+35.3 pp) and boosts image-text consistency (ITC) from 13.20 to 85.24 (+72 pp). Across all model backbones, MAGMA-Edu achieves the highest scores (Avg-Text 96.20, ITC 99.12), establishing a new state of the art for multimodal educational content generation and demonstrating the effectiveness of self-reflective multi-agent collaboration in pedagogically aligned vision-language reasoning.

Updated: 2025-11-24 03:13:26

Categories: cs.AI,cs.CY

Download: http://arxiv.org/abs/2511.18714v1

A Fast Binary Splitting Approach for Non-Adaptive Learning of Erdős--Rényi Graphs

We study the problem of learning an unknown graph via group queries on node subsets, where each query reports whether at least one edge is present among the queried nodes. In general, learning arbitrary graphs with $n$ nodes and $k$ edges is hard in the non-adaptive setting, requiring $\Omega\big(\min\{k^2\log n,\,n^2\}\big)$ tests even when a small error probability is allowed. We focus on learning Erdős--Rényi (ER) graphs $G\sim\mathrm{ER}(n,q)$ in the non-adaptive setting, where the expected number of edges is $\bar{k}=q\binom{n}{2}$, and we aim to design an efficient testing--decoding scheme achieving asymptotically vanishing error probability. Prior work (Li--Fresacher--Scarlett, NeurIPS 2019) presents a testing--decoding scheme that attains an order-optimal number of tests $O(\bar{k}\log n)$ but incurs $\Omega(n^2)$ decoding time, whereas their proposed sublinear-time algorithm incurs an extra $(\log \bar{k})(\log n)$ factor in the number of tests. We extend the binary splitting approach, recently developed for non-adaptive group testing, to the ER graph learning setting, and prove that the edge set can be recovered with high probability using $O(\bar{k}\log n)$ tests while attaining decoding time $O(\bar{k}^{1+\delta}\log n)$ for any fixed $\delta>0$.
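The query model itself is easy to make concrete. The sketch below samples an ER graph, implements the group query oracle (does the queried node subset contain at least one edge?), and computes the expected edge count $\bar{k}=q\binom{n}{2}$. It illustrates only the problem setup, not the binary splitting decoder from the paper, and the function names are illustrative.

```python
import random
from itertools import combinations
from math import comb

def sample_er_graph(n, q, rng):
    """Each of the C(n, 2) possible edges is present independently with probability q."""
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < q}

def group_query(edges, nodes):
    """True if at least one edge has both endpoints among the queried nodes."""
    nodes = set(nodes)
    return any(u in nodes and v in nodes for u, v in edges)

def expected_edges(n, q):
    """Expected number of edges, q * C(n, 2)."""
    return q * comb(n, 2)
```

In the non-adaptive setting, every node subset passed to `group_query` must be fixed before any outcome is observed, and the decoder sees only the resulting list of true/false answers.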

Updated: 2025-11-24 03:13:19

Categories: cs.IT,cs.DM,cs.LG,math.PR

Download: http://arxiv.org/abs/2511.17240v2
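
The query model described in the Erdős-Rényi graph-learning abstract above can be illustrated with a small sketch. It only shows the edge-detecting group query and the elementary rule that a negative test rules out every node pair it covers; it is not the binary splitting decoder from the paper. The function names, test sizes, and parameters are assumptions chosen for illustration.

```python
import random
from itertools import combinations


def sample_er_graph(n, q, rng):
    """Sample ER(n, q): each of the C(n, 2) possible edges is present independently with probability q."""
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < q}


def group_query(edges, nodes):
    """Return True if at least one edge has both endpoints among the queried nodes."""
    node_set = set(nodes)
    return any(u in node_set and v in node_set for u, v in edges)


rng = random.Random(0)
n, q = 30, 0.02  # illustrative values, not taken from the paper
edges = sample_er_graph(n, q, rng)

# Non-adaptive design: every test is fixed before any outcome is observed.
tests = [rng.sample(range(n), 5) for _ in range(40)]
outcomes = [group_query(edges, t) for t in tests]

# Elementary elimination: any node pair covered by a negative test cannot be an edge.
ruled_out = set()
for t, positive in zip(tests, outcomes):
    if not positive:
        ruled_out.update(combinations(sorted(t), 2))
```

Pairs are stored with the smaller node first in both `edges` and `ruled_out`, so the two sets can be compared directly.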

Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation

In this paper, we study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). The multimodal nature of videos introduces unique challenges, necessitating the simultaneous consideration of both domain alignment and modality collaboration in a few-shot scenario, which has been ignored in previous literature. We observe that, under the influence of domain shift, the generalization performance on the target domain of each individual modality, as well as that of fused multimodal features, is constrained, because each modality comprises coupled features with multiple components that exhibit different domain shifts. This variability increases the complexity of domain adaptation, thereby reducing the effectiveness of multimodal feature integration. To address these challenges, we introduce a novel framework of Modality-Collaborative Low-Rank Decomposers (MC-LRD) that decomposes, from each modality, modality-unique and modality-shared features with different domain shift levels, which are more amenable to domain alignment. The MC-LRD comprises multiple decomposers for each modality and Multimodal Decomposition Routers (MDR). Each decomposer has progressively shared parameters across different modalities. The MDR is leveraged to selectively activate the decomposers to produce modality-unique and modality-shared features. To ensure efficient decomposition, we apply orthogonal decorrelation constraints separately to decomposers and subrouters, enhancing their diversity. Furthermore, we propose a cross-domain activation consistency loss to guarantee that target and source samples of the same category exhibit consistent activation preferences of the decomposers, thereby facilitating domain alignment. Extensive experimental results on three public benchmarks demonstrate that our model achieves significant improvements over existing methods.

Updated: 2025-11-24 03:09:59

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.18711v1

G-UBS: Towards Robust Understanding of Implicit Feedback via Group-Aware User Behavior Simulation

User feedback is critical for refining recommendation systems, yet explicit feedback (e.g., likes or dislikes) remains scarce in practice. As a more feasible alternative, inferring user preferences from massive implicit feedback has shown great potential (e.g., a user quickly skipping a recommended video usually indicates disinterest). Unfortunately, implicit feedback is often noisy: a user might skip a video due to accidental clicks or other reasons, rather than disliking it. Such noise can easily lead to misjudged user interests, thereby undermining recommendation performance. To address this issue, we propose a novel Group-aware User Behavior Simulation (G-UBS) paradigm, which leverages contextual guidance from relevant user groups, enabling robust and in-depth interpretation of implicit feedback for individual users. Specifically, G-UBS operates via two key agents. First, the User Group Manager (UGM) effectively clusters users to generate group profiles utilizing a "summarize-cluster-reflect" workflow based on LLMs. Second, the User Feedback Modeler (UFM) employs an innovative group-aware reinforcement learning approach, where each user is guided by the associated group profiles during the reinforcement learning process, allowing UFM to robustly and deeply examine the reasons behind implicit feedback. To assess our G-UBS paradigm, we have constructed a Video Recommendation benchmark with Implicit Feedback (IF-VR). To the best of our knowledge, this is the first multi-modal benchmark for implicit feedback evaluation in video recommendation, encompassing 15k users, 25k videos, and 933k interaction records with implicit feedback. Extensive experiments on IF-VR demonstrate that G-UBS significantly outperforms mainstream LLMs and MLLMs, with a 4.0% higher proportion of videos achieving a play rate > 30% and 14.9% higher reasoning accuracy on IF-VR.

Updated: 2025-11-24 02:57:36

Categories: cs.IR,cs.LG,cs.MA

Download: http://arxiv.org/abs/2508.05709v2

Revisit to the Bai-Galbraith signature scheme

Dilithium is one of the NIST-approved lattice-based signature schemes. In this short note we describe the Bai-Galbraith signature scheme proposed in BG14, which differs from Dilithium in that it has no public key compression. This lattice-based signature scheme is based on Learning with Errors (LWE).

Updated: 2025-11-24 02:51:59

Categories: cs.CR,cs.DC

Download: http://arxiv.org/abs/2511.09582v2
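
As background for the Learning with Errors (LWE) assumption named in the Bai-Galbraith abstract above, the following toy sketch generates a single LWE instance b = A s + e (mod q). It is not the signature scheme itself, the parameters are far too small to be secure, and the function name and noise distribution are assumptions made for illustration.

```python
import random


def lwe_sample(n, m, q, rng, noise_bound=1):
    """Generate a toy LWE instance (A, s, e, b) with b = A s + e mod q.

    A is a uniformly random m x n matrix, s a uniformly random secret,
    and e a small error vector with entries in [-noise_bound, noise_bound].
    """
    A = [[rng.randrange(q) for _ in range(n)] for _ in range(m)]
    s = [rng.randrange(q) for _ in range(n)]
    e = [rng.randint(-noise_bound, noise_bound) for _ in range(m)]
    b = [(sum(a * si for a, si in zip(row, s)) + ei) % q for row, ei in zip(A, e)]
    return A, s, e, b


# Toy parameters for illustration only.
A, s, e, b = lwe_sample(n=4, m=6, q=97, rng=random.Random(1))
```

Anyone holding the secret `s` can subtract `A s` from `b` and recover the small error `e`; without `s`, the pair `(A, b)` is conjectured to be hard to distinguish from uniform at cryptographic parameter sizes.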

PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback

Effective presentation skills are essential in education, professional communication, and public speaking, yet learners often lack access to high-quality exemplars or personalized coaching. Existing AI tools typically provide isolated functionalities such as speech scoring or script generation without integrating reference modeling and interactive feedback into a cohesive learning experience. We introduce a dual-agent system that supports presentation practice through two complementary roles: the Ideal Presentation Agent and the Coach Agent. The Ideal Presentation Agent converts user-provided slides into model presentation videos by combining slide processing, visual-language analysis, narration script generation, personalized voice synthesis, and synchronized video assembly. The Coach Agent then evaluates user-recorded presentations against these exemplars, conducting multimodal speech analysis and delivering structured feedback in an Observation-Impact-Suggestion (OIS) format. To enhance the authenticity of the learning experience, the Coach Agent incorporates an Audience Agent, which simulates the perspective of a human listener and provides humanized feedback reflecting audience reactions and engagement. Together, these agents form a closed loop of observation, practice, and feedback. Implemented on a robust backend with multi-model integration, voice cloning, and error handling mechanisms, the system demonstrates how AI-driven agents can provide engaging, human-centered, and scalable support for presentation skill development in both educational and professional contexts.

Updated: 2025-11-24 02:51:05

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2511.15253v2

ObjectAlign: Neuro-Symbolic Object Consistency Verification and Correction

Video editing and synthesis often introduce object inconsistencies, such as frame flicker and identity drift, that degrade perceptual quality. To address these issues, we introduce ObjectAlign, a novel framework that seamlessly blends perceptual metrics with symbolic reasoning to detect, verify, and correct object-level and temporal inconsistencies in edited video sequences. The novel contributions of ObjectAlign are as follows: First, we propose learnable thresholds for metrics characterizing object consistency (i.e., CLIP-based semantic similarity, LPIPS perceptual distance, histogram correlation, and SAM-derived object-mask IoU). Second, we introduce a neuro-symbolic verifier that combines two components: (a) a formal, SMT-based check that operates on masked object embeddings to provably guarantee that object identity does not drift, and (b) a temporal fidelity check that uses a probabilistic model checker to verify the video's formal representation against a temporal logic specification. A frame transition is subsequently deemed "consistent" based on a single logical assertion that requires satisfying both the learned metric thresholds and this unified neuro-symbolic constraint, ensuring both low-level stability and high-level temporal correctness. Finally, for each contiguous block of flagged frames, we propose a neural-network-based interpolation for adaptive frame repair, dynamically choosing the interpolation depth based on the number of frames to be corrected. This enables reconstruction of the corrupted frames from the last valid and next valid keyframes. Our results show up to a 1.4-point improvement in CLIP Score and up to a 6.1-point improvement in warp error compared to SOTA baselines on the DAVIS and Pexels video datasets.

Updated: 2025-11-24 02:50:01

Categories: cs.CV,cs.AI,cs.FL,cs.LG

Download: http://arxiv.org/abs/2511.18701v1
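
The ObjectAlign abstract above describes a single assertion that combines metric thresholds with a symbolic check, followed by repair of each contiguous block of flagged frames. A minimal sketch of that control flow follows. The metric keys, the fixed threshold values, and the `symbolic_ok` flag standing in for the SMT and model-checking results are all assumptions; in the paper the thresholds are learned and the verifier is a real component.

```python
def is_consistent(metrics, thresholds, symbolic_ok=True):
    """A transition passes only if every metric meets its threshold and the symbolic check holds.

    Similarity-style metrics must be at least their threshold;
    LPIPS is a distance, so it must be at most its threshold.
    """
    return (
        metrics["clip_sim"] >= thresholds["clip_sim"]
        and metrics["lpips"] <= thresholds["lpips"]
        and metrics["hist_corr"] >= thresholds["hist_corr"]
        and metrics["mask_iou"] >= thresholds["mask_iou"]
        and symbolic_ok
    )


def flagged_blocks(flags):
    """Group flagged frame indices into contiguous (start, end) blocks, both ends inclusive."""
    blocks = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(flags) - 1))
    return blocks


# Hypothetical thresholds and per-transition metrics, for illustration only.
thresholds = {"clip_sim": 0.9, "lpips": 0.2, "hist_corr": 0.8, "mask_iou": 0.7}
good = {"clip_sim": 0.95, "lpips": 0.1, "hist_corr": 0.9, "mask_iou": 0.8}
bad = {"clip_sim": 0.95, "lpips": 0.4, "hist_corr": 0.9, "mask_iou": 0.8}
flags = [not is_consistent(m, thresholds) for m in [good, bad, bad, good, bad]]
blocks = flagged_blocks(flags)
```

Each block in `blocks` marks a run of frames that would be reconstructed from the valid frames on either side; the length of the block is what the abstract uses to choose the interpolation depth.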

Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

Multiple-choice question answering (MCQA) has been a popular format for the evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer-guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to their answer types and applies a different rewriting and verification scheme to each. When applying it to RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.

Updated: 2025-11-24 02:45:02

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.17405v2

Dendritic Convolution for Noise Image Recognition

In real-world scenarios of image recognition, there exists substantial noise interference. Existing works primarily focus on methods such as adjusting networks or training strategies to address noisy image recognition, and the anti-noise performance has reached a bottleneck. However, anti-interference solutions from a neuronal perspective remain largely unexplored. This paper proposes an anti-noise neuronal convolution. This convolution mimics the dendritic structure of neurons, integrates the neighborhood interaction computation logic of dendrites into the underlying design of convolutional operations, and simulates the XOR logic preprocessing function of biological dendrites through nonlinear interactions between input features, thereby fundamentally reconstructing the mathematical paradigm of feature extraction. Unlike traditional convolution, where noise directly interferes with feature extraction and exerts a significant impact, DDC mitigates the influence of noise by focusing on the interaction of neighborhood information. Experimental results demonstrate that in image classification tasks (using YOLOv11-cls, VGG16, and EfficientNet-B0) and object detection tasks (using YOLOv11, YOLOv8, and YOLOv5), after replacing traditional convolution with the dendritic convolution, the accuracy of the EfficientNet-B0 model on noisy datasets is relatively improved by 11.23%, and the mean Average Precision (mAP) of YOLOv8 is increased by 19.80%. The consistency between the computation method of this convolution and the dendrites of biological neurons enables it to perform significantly better than traditional convolution in complex noisy environments.

Updated: 2025-11-24 02:43:29

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2511.18699v1

Multimodal Real-Time Anomaly Detection and Industrial Applications

This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initial lightweight implementation using YOLOv8, ByteTrack, and the Audio Spectrogram Transformer (AST), and an advanced version that incorporates multi-model audio ensembles, hybrid object detection, bidirectional cross-modal attention, and multi-method anomaly detection. The evolution demonstrates significant improvements in accuracy, robustness, and industrial applicability. The advanced system combines three audio models (AST, Wav2Vec2, and HuBERT) for comprehensive audio understanding, dual object detectors (YOLO and DETR) for improved accuracy, and sophisticated fusion mechanisms for enhanced cross-modal learning. Experimental evaluation shows the system's effectiveness in general monitoring scenarios as well as specialized industrial safety applications, achieving real-time performance on standard hardware while maintaining high accuracy.

Updated: 2025-11-24 02:43:19

Categories: cs.SD,cs.AI,cs.CV,cs.LG,cs.MM

Download: http://arxiv.org/abs/2511.18698v1

DAGLFNet: Deep Feature Attention Guided Global and Local Feature Fusion for Pseudo-Image Point Cloud Segmentation

Environmental perception systems are crucial for high-precision mapping and autonomous navigation, with LiDAR serving as a core sensor providing accurate 3D point cloud data. Efficiently processing unstructured point clouds while extracting structured semantic information remains a significant challenge. In recent years, numerous pseudo-image-based representation methods have emerged to balance efficiency and performance by fusing 3D point clouds with 2D grids. However, the fundamental inconsistency between the pseudo-image representation and the original 3D information critically undermines 2D-3D feature fusion, posing a primary obstacle for coherent information fusion and leading to poor feature discriminability. This work proposes DAGLFNet, a pseudo-image-based semantic segmentation framework designed to extract discriminative features. It incorporates three key components: first, a Global-Local Feature Fusion Encoding (GL-FFE) module to enhance intra-set local feature correlation and capture global contextual information; second, a Multi-Branch Feature Extraction (MB-FE) network to capture richer neighborhood information and improve the discriminability of contour features; and third, a Feature Fusion via Deep Feature-guided Attention (FFDFA) mechanism to refine cross-channel feature fusion precision. Experimental evaluations demonstrate that DAGLFNet achieves mean Intersection-over-Union (mIoU) scores of 69.9% and 78.7% on the validation sets of SemanticKITTI and nuScenes, respectively. The method achieves an excellent balance between accuracy and efficiency.

Updated: 2025-11-24 02:38:27

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.10471v2
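
The DAGLFNet abstract above reports results as mean Intersection-over-Union (mIoU). For readers unfamiliar with the metric, a minimal sketch over flat label lists follows. Skipping classes that appear in neither prediction nor ground truth is an assumption of this sketch; benchmark toolkits such as those for SemanticKITTI and nuScenes accumulate counts over the whole dataset and define their own class handling.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes: per-class intersection / union, averaged over classes that occur."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Tiny example: four points, two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.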

OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation

The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib's PSFs, which is assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.

Updated: 2025-11-24 02:34:08

Categories: eess.IV,cs.CV,cs.LG,physics.optics

Download: http://arxiv.org/abs/2511.17126v2

Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models

This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models. ECN employs four stages: Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis, to guide models toward generating emotionally resonant and contextually aware responses. Experimental results demonstrate that ECN achieves the highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4, while maintaining competitive Regard and Perplexity metrics. These findings emphasize ECN's potential for applications requiring empathy and inclusivity in conversational AI.

Updated: 2025-11-24 02:32:20

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2511.18696v1

Stable Multi-Drone GNSS Tracking System for Marine Robots

Accurate localization is essential for marine robotics, yet Global Navigation Satellite System (GNSS) signals are unreliable or unavailable even at a very short distance below the water surface. Traditional alternatives, such as inertial navigation, Doppler Velocity Loggers (DVL), SLAM, and acoustic methods, suffer from error accumulation, high computational demands, or infrastructure dependence. In this work, we present a scalable multi-drone GNSS-based tracking system for surface and near-surface marine robots. Our approach combines efficient visual detection, lightweight multi-object tracking, GNSS-based triangulation, and a confidence-weighted Extended Kalman Filter (EKF) to provide stable GNSS estimation in real time. We further introduce a cross-drone tracking ID alignment algorithm that enforces global consistency across views, enabling robust multi-robot tracking with redundant aerial coverage. We validate our system in diversified complex settings to show the scalability and robustness of the proposed algorithm.

Updated: 2025-11-24 02:28:31

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2511.18694v1
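
The multi-drone tracking abstract above mentions a confidence-weighted Extended Kalman Filter. The idea of weighting by confidence can be sketched with a scalar linear Kalman filter in which the measurement noise grows as detection confidence falls. This is a simplification, not the paper's EKF: the static motion model, the `base_r / confidence` scaling, and all parameter values are assumptions made for illustration.

```python
def kalman_step(x, p, z, confidence, base_r=1.0, process_q=0.01):
    """One predict-and-update step of a scalar Kalman filter with a static motion model.

    x, p: prior state estimate and its variance
    z: new position measurement
    confidence: detection confidence in (0, 1]; lower confidence inflates measurement noise
    """
    p_pred = p + process_q                # predict: state unchanged, uncertainty grows
    r = base_r / max(confidence, 1e-6)    # assumed confidence-to-noise mapping
    gain = p_pred / (p_pred + r)          # Kalman gain
    x_new = x + gain * (z - x)            # pull the estimate toward the measurement
    p_new = (1.0 - gain) * p_pred         # updated variance
    return x_new, p_new


# Same measurement, different confidence: the confident one moves the estimate further.
x_hi, p_hi = kalman_step(x=0.0, p=1.0, z=10.0, confidence=1.0)
x_lo, p_lo = kalman_step(x=0.0, p=1.0, z=10.0, confidence=0.1)
```

A low-confidence detection therefore nudges the track only slightly, which is one way a filter can stay stable when some aerial views are unreliable.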

ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation

Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.

Updated: 2025-11-24 02:28:18

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.11483v2

VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking

Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.

Updated: 2025-11-24 02:27:19

Categories: cs.LG,cs.AI,cs.CV,cs.PF

Download: http://arxiv.org/abs/2511.18692v1
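
The Neuron Chunking abstract above defines chunk utility as neuron importance normalized by estimated latency. A minimal sketch of selecting chunks by that ratio follows. The latency model (a fixed seek cost plus a per-neuron transfer cost), the greedy selection under a latency budget, and every name and constant are assumptions; the paper's own abstraction of access contiguity may differ.

```python
def chunk_utility(importance, start, length, seek_cost=1.0, per_neuron_cost=0.1):
    """Return (utility, latency) for a contiguous chunk of neurons.

    latency: assumed cost model, one seek plus a linear transfer term
    utility: summed importance divided by that latency
    """
    latency = seek_cost + per_neuron_cost * length
    utility = sum(importance[start:start + length]) / latency
    return utility, latency


def select_chunks(importance, chunks, latency_budget):
    """Greedily pick chunks in decreasing utility until the latency budget is exhausted."""
    scored = []
    for start, length in chunks:
        utility, latency = chunk_utility(importance, start, length)
        scored.append((utility, latency, start, length))
    scored.sort(key=lambda item: item[0], reverse=True)

    chosen, spent = [], 0.0
    for utility, latency, start, length in scored:
        if spent + latency <= latency_budget:
            chosen.append((start, length))
            spent += latency
    return chosen


# Hypothetical activation-based importance scores for six neurons, in three chunks of two.
importance = [0.9, 0.8, 0.1, 0.1, 0.7, 0.0]
chunks = [(0, 2), (2, 2), (4, 2)]
selected = select_chunks(importance, chunks, latency_budget=2.5)
```

Because the seek cost is paid once per chunk, a long contiguous run of moderately important neurons can outrank a scattered set of individually stronger ones, which is the trade-off the abstract points to.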

MAIF: Enforcing AI Trust and Provenance with an Artifact-Centric Agentic Paradigm

The AI trustworthiness crisis threatens to derail the artificial intelligence revolution, with regulatory barriers, security vulnerabilities, and accountability gaps preventing deployment in critical domains. Current AI systems operate on opaque data structures that lack the audit trails, provenance tracking, or explainability required by emerging regulations like the EU AI Act. We propose an artifact-centric AI agent paradigm where behavior is driven by persistent, verifiable data artifacts rather than ephemeral tasks, solving the trustworthiness problem at the data architecture level. Central to this approach is the Multimodal Artifact File Format (MAIF), an AI-native container embedding semantic representations, cryptographic provenance, and granular access controls. MAIF transforms data from passive storage into active trust enforcement, making every AI operation inherently auditable. Our production-ready implementation demonstrates ultra-high-speed streaming (2,720.7 MB/s), optimized video processing (1,342 MB/s), and enterprise-grade security. Novel algorithms for cross-modal attention, semantic compression, and cryptographic binding achieve up to 225x compression while maintaining semantic fidelity. Advanced security features include stream-level access control, real-time tamper detection, and behavioral anomaly analysis with minimal overhead. This approach directly addresses the regulatory, security, and accountability challenges preventing AI deployment in sensitive domains, offering a viable path toward trustworthy AI systems at scale.

Updated: 2025-11-24 02:26:39

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2511.15097v2
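
The MAIF abstract above refers to cryptographic provenance and tamper detection. One common way to obtain both is a hash chain, sketched below with Python's standard `hashlib`. This is a generic illustration, not the MAIF container format: the record layout, field names, and the all-zero genesis hash are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first record


def record_hash(prev_hash, payload):
    """SHA-256 over a canonical JSON encoding of the previous hash and the payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(chain, payload):
    """Append a record whose hash commits to the payload and to the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev_hash, "payload": payload, "hash": record_hash(prev_hash, payload)})


def verify_chain(chain):
    """Recompute every hash in order; any edited payload or broken link fails verification."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash or record["hash"] != record_hash(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True


chain = []
append_record(chain, {"op": "ingest", "artifact": "report.txt"})
append_record(chain, {"op": "summarize", "agent": "agent-1"})
```

Since each hash covers the previous one, changing any earlier payload invalidates that record and every record after it.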

Parallel Unlearning in Inherited Model Networks

Unlearning is challenging in generic learning frameworks with the continuous growth and updates of models exhibiting complex inheritance relationships. This paper presents a novel unlearning framework that enables fully parallel unlearning among models exhibiting inheritance. We use a chronological Directed Acyclic Graph (DAG) to capture various unlearning scenarios occurring in model inheritance networks. Central to our framework is the Fisher Inheritance Unlearning (FIUn) method, designed to enable efficient parallel unlearning within the DAG. FIUn utilizes the Fisher Information Matrix (FIM) to assess the significance of model parameters for unlearning tasks and adjusts them accordingly. To handle multiple unlearning requests simultaneously, we propose the Merging-FIM (MFIM) function, which consolidates FIMs from multiple upstream models into a unified matrix. This design supports all unlearning scenarios captured by the DAG, enabling one-shot removal of inherited knowledge while significantly reducing computational overhead. Experiments confirm the effectiveness of our unlearning framework. For single-class tasks, it achieves complete unlearning with 0% accuracy for unlearned labels while maintaining 94.53% accuracy for retained labels. For multi-class tasks, the accuracy is 1.07% for unlearned labels and 84.77% for retained labels. Our framework accelerates unlearning by 99% compared to alternative methods. Code is available at https://github.com/MJLee00/Parallel-Unlearning-in-Inherited-Model-Networks.

Updated: 2025-11-24 02:24:03

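Title: Parallel Unlearning in Inherited Model Networks

Abstract: Unlearning is challenging in generic learning frameworks in which models grow and are updated continuously and exhibit complex inheritance relationships. This paper proposes a novel unlearning framework that enables fully parallel unlearning among models linked by inheritance. We use a chronologically ordered directed acyclic graph (DAG) to capture the unlearning scenarios that arise in model inheritance networks. At the core of the framework is the Fisher Inheritance Unlearning (FIUn) method, designed for efficient parallel unlearning within the DAG. FIUn uses the Fisher Information Matrix (FIM) to assess how important each model parameter is to an unlearning task and adjusts the parameters accordingly. To handle multiple unlearning requests at once, we propose the Merging-FIM (MFIM) function, which merges the FIMs from multiple upstream models into a single unified matrix. This design supports every unlearning scenario captured by the DAG and removes inherited knowledge in one shot while significantly reducing computational overhead. Experiments confirm the effectiveness of the framework. On single-class tasks it reaches 0% accuracy on unlearned labels while maintaining 94.53% accuracy on retained labels. On multi-class tasks, accuracy is 1.07% on unlearned labels and 84.77% on retained labels. The framework speeds up unlearning by 99% relative to alternative methods. Code is available at https://github.com/MJLee00/Parallel-Unlearning-in-Inherited-Model-Networks.

Updated: 2025-11-24 02:24:03

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2408.08493v4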

Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models

Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity with fully self-supervised and language-aligned transformers -- exemplified by DINOv2, SigLIP2 and EVA-CLIP -- occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. A BagNet control remains at chance (iv), ruling out "border-hacking" strategies. Finally, (v) we show that configural shape score also predicts other shape-dependent evals. Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local-texture and global configural shape.

Updated: 2025-11-24 02:16:08

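Title: Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models

Abstract: Humans can recognize objects from both local texture cues and the configuration of object parts, yet contemporary vision models rely mainly on local texture cues, which yields brittle, non-compositional features. Research on shape-versus-texture bias has set shape and texture representations against each other, measuring shape relative to texture; this ignores the possibility that models (and humans) can rely on both kinds of cue at once and obscures the absolute quality of each kind of representation. We therefore recast shape evaluation as a question of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while rearranging global parts to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) reveals a broad range of configural sensitivity, with fully self-supervised and language-aligned transformers, such as DINOv2, SigLIP2, and EVA-CLIP, at the top of the CSS spectrum. Mechanistic probes show that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance and show a distinctive U-shaped integration profile, and representational-similarity analyses reveal a mid-depth transition from local to global coding. A BagNet control stays at chance (iv), ruling out "border-hacking" strategies. Finally, (v) we show that the Configural Shape Score also predicts other shape-dependent evaluations. Overall, we propose that the path to truly robust, generalizable, and human-like vision systems may lie not in forcing an artificial choice between shape and texture but in architectures and learning frameworks that seamlessly integrate local texture with global configural shape.

Updated: 2025-11-24 02:16:08

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.00493v3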
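
The abstract above says only that CSS measures the ability to recognize both images in an Object-Anagram pair. A minimal sketch of such a score follows, assuming it is the fraction of pairs in which both images are classified correctly; the exact normalization used in the paper may differ, and the function name, input layout, and labels are hypothetical.

```python
from typing import Sequence, Tuple

# Each image is described by (predicted_label, true_label).
Prediction = Tuple[str, str]


def configural_shape_score(pairs: Sequence[Tuple[Prediction, Prediction]]) -> float:
    """Fraction of anagram pairs in which BOTH images are recognized.

    Assumption: a pair counts only when both members are correct, so a
    model that reads local texture alone cannot score by getting one
    image of each pair right.
    """
    if not pairs:
        raise ValueError("need at least one anagram pair")
    both_correct = sum(
        1
        for (pred_a, true_a), (pred_b, true_b) in pairs
        if pred_a == true_a and pred_b == true_b
    )
    return both_correct / len(pairs)


# Hypothetical predictions for four pairs that share local texture.
example_pairs = [
    (("wolf", "wolf"), ("elephant", "elephant")),  # both recognized
    (("wolf", "wolf"), ("wolf", "rabbit")),        # second image missed
    (("cat", "dog"), ("truck", "truck")),          # first image missed
    (("bird", "bird"), ("plane", "plane")),        # both recognized
]
print(configural_shape_score(example_pairs))  # prints 0.5
```

Scoring the pair jointly, rather than averaging per-image accuracy, is what makes the measure sensitive to part arrangement: the two images in a pair have the same texture, so texture alone can explain at most one correct label.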

How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference

This paper introduces an infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models in commercial datacenters. The framework combines public API performance data with company-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost and provide a dynamically updated dashboard that visualizes model-level energy, water, and carbon metrics. Results show the most energy-intensive models exceed 29 Wh per long prompt, over 65 times the most efficient systems. Even a 0.42 Wh short query, when scaled to 700M queries/day, aggregates to annual electricity comparable to 35,000 U.S. homes, evaporative freshwater equal to the annual drinking needs of 1.2M people, and carbon emissions requiring a Chicago-sized forest to offset. These findings highlight a growing paradox: as AI becomes cheaper and faster, global adoption drives disproportionate resource consumption. Our methodology offers a standardized, empirically grounded basis for sustainability benchmarking and accountability in AI deployment.

Updated: 2025-11-24 02:12:44

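Title: How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference

Abstract: This paper presents an infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models in commercial datacenters. The framework combines public API performance data with company-specific environmental multipliers and statistical inference of hardware configurations. We also use cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost, and we provide a dynamically updated dashboard that visualizes model-level energy, water, and carbon metrics. The results show that the most energy-intensive models exceed 29 Wh per long prompt, more than 65 times the consumption of the most efficient systems. Even a 0.42 Wh short query, scaled to 700 million queries per day, adds up to annual electricity use comparable to that of 35,000 U.S. homes, evaporated freshwater equal to the annual drinking needs of 1.2 million people, and carbon emissions that would take a Chicago-sized forest to offset. These findings highlight a growing paradox: as AI becomes cheaper and faster, global adoption drives disproportionate resource consumption. Our methodology offers a standardized, empirically grounded basis for sustainability benchmarking and accountability in AI deployment.

Updated: 2025-11-24 02:12:44

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2505.09598v6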
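
The scaling step in the abstract above can be checked with a few lines of arithmetic. The sketch below multiplies out only the two figures the abstract states (0.42 Wh per short query, 700M queries per day) and assumes a 365-day year. It does not reproduce the household, water, or carbon comparisons, which depend on multipliers the abstract does not give.

```python
WH_PER_SHORT_QUERY = 0.42        # stated in the abstract
QUERIES_PER_DAY = 700_000_000    # stated in the abstract
DAYS_PER_YEAR = 365              # assumption: non-leap year


def annual_energy_gwh(wh_per_query: float, queries_per_day: int,
                      days: int = DAYS_PER_YEAR) -> float:
    """Raw annual inference energy in gigawatt-hours (1 GWh = 1e9 Wh)."""
    return wh_per_query * queries_per_day * days / 1e9


annual_gwh = annual_energy_gwh(WH_PER_SHORT_QUERY, QUERIES_PER_DAY)
print(f"{annual_gwh:.2f} GWh per year")  # prints 107.31 GWh per year
```

This is the raw product of the stated per-query energy and query volume only. Any datacenter overhead or per-home consumption figure behind the abstract's comparisons would have to come from the paper itself.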

KANO: Kolmogorov-Arnold Neural Operator

We introduce the Kolmogorov-Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases with intrinsic symbolic interpretability. We theoretically demonstrate that KANO overcomes the pure-spectral bottleneck of the Fourier Neural Operator (FNO): KANO remains expressive over generic position-dependent dynamics (variable coefficient PDEs) for any physical input, whereas FNO stays practical only for spectrally sparse operators and strictly imposes a fast-decaying input Fourier tail. We verify our claims empirically on position-dependent differential operators, for which KANO robustly generalizes but FNO fails to do so. In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients and attains $\approx 6\times10^{-6}$ state infidelity from projective measurement data, outperforming by orders of magnitude the $\approx 1.5\times10^{-2}$ infidelity of an FNO trained with ideal full wave function data.

Updated: 2025-11-24 02:11:34

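Title: KANO: Kolmogorov-Arnold Neural Operator

Abstract: We introduce the Kolmogorov-Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by spectral and spatial bases, with intrinsic symbolic interpretability. We show theoretically that KANO overcomes the pure-spectral bottleneck of the Fourier Neural Operator (FNO): KANO remains expressive for generic position-dependent dynamics (variable-coefficient PDEs) under any physical input, whereas FNO is practical only for spectrally sparse operators and strictly requires a fast-decaying Fourier tail in the input. We verify these claims empirically on position-dependent differential operators, where KANO generalizes robustly and FNO fails. On a quantum Hamiltonian learning benchmark, KANO reconstructs the ground-truth Hamiltonians in closed-form symbolic representations with coefficients accurate to the fourth decimal place, and attains a state infidelity of about 6×10^(-6) from projective measurement data, orders of magnitude better than the roughly 1.5×10^(-2) obtained by an FNO trained on ideal full wave function data.

Updated: 2025-11-24 02:11:34

Categories: cs.LG,cs.AI,cs.CE

Download: http://arxiv.org/abs/2509.16825v3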

QuantKAN: A Unified Quantization Framework for Kolmogorov Arnold Networks

Kolmogorov Arnold Networks (KANs) represent a new class of neural architectures that replace conventional linear transformations and node-based nonlinearities with spline-based function approximations distributed along network edges. Although KANs offer strong expressivity and interpretability, their heterogeneous spline and base branch parameters hinder efficient quantization, which remains unexamined compared to CNNs and Transformers. In this paper, we present QuantKAN, a unified framework for quantizing KANs across both quantization aware training (QAT) and post-training quantization (PTQ) regimes. QuantKAN extends modern quantization algorithms, such as LSQ, LSQ+, PACT, DoReFa, QIL, GPTQ, BRECQ, AdaRound, AWQ, and HAWQ-V2, to spline based layers with branch-specific quantizers for base, spline, and activation components. Through extensive experiments on MNIST, CIFAR 10, and CIFAR 100 across multiple KAN variants (EfficientKAN, FastKAN, PyKAN, and KAGN), we establish the first systematic benchmarks for low-bit spline networks. Our results show that KANs, particularly deeper KAGN variants, are compatible with low-bit quantization but exhibit strong method architecture interactions: LSQ, LSQ+, and PACT preserve near full precision accuracy at 4 bit for shallow KAN MLP and ConvNet models, while DoReFa provides the most stable behavior for deeper KAGN under aggressive low-bit settings. For PTQ, GPTQ and Uniform consistently deliver the strongest overall performance across datasets, with BRECQ highly competitive on simpler regimes such as MNIST. Our proposed QuantKAN framework thus unifies spline learning and quantization, and provides practical tools and guidelines for efficiently deploying KANs in real-world, resource-constrained environments.

Updated: 2025-11-24 02:05:16

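Title: QuantKAN: A Unified Quantization Framework for Kolmogorov Arnold Networks

Abstract: Kolmogorov Arnold Networks (KANs) are a new class of neural architectures that replace conventional linear transformations and node-based nonlinearities with spline-based function approximations distributed along network edges. Although KANs offer strong expressivity and interpretability, their heterogeneous spline and base-branch parameters hinder efficient quantization, a problem that remains unexamined compared with CNNs and Transformers. In this paper we present QuantKAN, a unified framework for quantizing KANs under both quantization-aware training (QAT) and post-training quantization (PTQ) regimes. QuantKAN extends modern quantization algorithms, such as LSQ, LSQ+, PACT, DoReFa, QIL, GPTQ, BRECQ, AdaRound, AWQ, and HAWQ-V2, to spline-based layers, with branch-specific quantizers for the base, spline, and activation components. Through extensive experiments on MNIST, CIFAR 10, and CIFAR 100 across several KAN variants (EfficientKAN, FastKAN, PyKAN, and KAGN), we establish the first systematic benchmarks for low-bit spline networks. Our results show that KANs, particularly the deeper KAGN variants, are compatible with low-bit quantization but exhibit strong method-architecture interactions: LSQ, LSQ+, and PACT preserve near-full-precision accuracy at 4 bits for shallow KAN MLP and ConvNet models, while DoReFa gives the most stable behavior for deeper KAGN models under aggressive low-bit settings. For PTQ, GPTQ and Uniform consistently deliver the strongest overall performance across datasets, with BRECQ highly competitive in simpler regimes such as MNIST. The proposed QuantKAN framework thus unifies spline learning and quantization and provides practical tools and guidelines for deploying KANs efficiently in real-world, resource-constrained environments.

Updated: 2025-11-24 02:05:16

Categories: cs.LG

Download: http://arxiv.org/abs/2511.18689v1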

MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis

Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. Our benchmarks show that current off-the-shelf VLMs perform poorly on these tasks. However, with supervised fine-tuning on MedVision, we significantly enhance their performance across detection, T/L estimation, and A/D measurement, demonstrating reduced error rates and improved precision. This work provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging. Code and data are available at https://medvision-vlm.github.io.

Updated: 2025-11-24 01:26:07

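Title: MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis

Abstract: Current vision-language models (VLMs) in medicine are designed mainly for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. Clinical decision-making, however, often depends on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work we introduce MedVision, a large-scale dataset and benchmark designed specifically to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities and contains 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. Our benchmarks show that current off-the-shelf VLMs perform poorly on these tasks. After supervised fine-tuning on MedVision, however, their performance on detection, T/L estimation, and A/D measurement improves significantly, with lower error rates and higher precision. This work lays a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging. Code and data are available at https://medvision-vlm.github.io.

Updated: 2025-11-24 01:26:07

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.18676v1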

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Programming assistants powered by large language models have improved dramatically, yet existing benchmarks still evaluate them in narrow code-generation settings. Recent efforts such as InfiBench and StackEval rely on Stack Overflow questions and remain limited to single-turn interactions, manually curated data, and isolated snippets rather than full project environments. We introduce CodeAssistBench (CAB), the first benchmark for evaluating multi-turn, project-grounded programming assistance at scale. CAB automatically constructs datasets from GitHub issues tagged as questions, using an LLM-driven pipeline that filters noise, extracts runnable contexts, builds executable containers, and verifies environment correctness. This enables continuous, automated expansion across diverse repositories without manual intervention. Using CAB, we create a testbed of 3,286 real-world issues across 214 repositories, spanning seven languages. Evaluating state-of-the-art models reveals a substantial gap: while models achieve 70-83% accuracy on Stack Overflow-style questions, they solve only 16.49% of CAB issues from post-training-cutoff repositories. On a manually validated subset of 149 issues, top models such as Claude Sonnet 4.5 reach only 12.08% correctness. These results highlight a fundamental challenge: current LLMs struggle to provide assistance in realistic, project-specific contexts despite strong performance on traditional Q&A benchmarks. CAB provides a scalable, reproducible framework for advancing research in multi-turn, codebase-grounded programming agents. The benchmark and pipeline are fully automated and publicly available at https://github.com/amazon-science/CodeAssistBench/.

Updated: 2025-11-24 01:18:11

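Title: CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Abstract: Programming assistants powered by large language models have improved dramatically, yet existing benchmarks still evaluate them in narrow code-generation settings. Recent efforts such as InfiBench and StackEval rely on Stack Overflow questions and remain limited to single-turn interactions, manually curated data, and isolated snippets rather than full project environments. We introduce CodeAssistBench (CAB), the first benchmark for evaluating multi-turn, project-grounded programming assistance at scale. CAB automatically builds datasets from GitHub issues tagged as questions, using an LLM-driven pipeline that filters noise, extracts runnable contexts, builds executable containers, and verifies environment correctness. This allows continuous, automated expansion across diverse repositories without manual intervention. Using CAB, we create a testbed of 3,286 real-world issues across 214 repositories, spanning seven languages. Evaluating state-of-the-art models reveals a substantial gap: models reach 70-83% accuracy on Stack Overflow-style questions but solve only 16.49% of CAB issues from post-training-cutoff repositories. On a manually validated subset of 149 issues, top models such as Claude Sonnet 4.5 reach only 12.08% correctness. These results highlight a fundamental challenge: despite strong performance on traditional Q&A benchmarks, current LLMs struggle to help in realistic, project-specific contexts. CAB provides a scalable, reproducible framework for advancing research on multi-turn, codebase-grounded programming agents. The benchmark and pipeline are fully automated and publicly available at https://github.com/amazon-science/CodeAssistBench/.

Updated: 2025-11-24 01:18:11

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2507.10646v3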

Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning

Emotion Recognition in Conversation (ERC) is a crucial task for understanding human emotions and enabling natural human-computer interaction. Although Large Language Models (LLMs) have recently shown great potential in this field, their ability to capture the intrinsic connections between explicit and implicit emotions remains limited. We propose a novel ERC training framework, PRC-Emo, which integrates Prompt engineering, demonstration Retrieval, and Curriculum learning, with the goal of exploring whether LLMs can effectively perceive emotions in conversational contexts. Specifically, we design emotion-sensitive prompt templates based on both explicit and implicit emotional cues to better guide the model in understanding the speaker's psychological states. We construct the first dedicated demonstration retrieval repository for ERC, which includes training samples from widely used datasets, as well as high-quality dialogue examples generated by LLMs and manually verified. Moreover, we introduce a curriculum learning strategy into the LoRA fine-tuning process, incorporating weighted emotional shifts between same-speaker and different-speaker utterances to assign difficulty levels to dialogue samples, which are then organized in an easy-to-hard training sequence. Experimental results on two benchmark datasets -- IEMOCAP and MELD -- show that our method achieves new state-of-the-art (SOTA) performance, demonstrating the effectiveness and generalizability of our approach in improving LLM-based emotional understanding.

Updated: 2025-11-24 01:17:15

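Title: Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning

Abstract: Emotion Recognition in Conversation (ERC) is a key task for understanding human emotions and enabling natural human-computer interaction. Although Large Language Models (LLMs) have recently shown great potential in this field, their ability to capture the intrinsic connections between explicit and implicit emotions remains limited. We propose a novel ERC training framework, PRC-Emo, which integrates prompt engineering, demonstration retrieval, and curriculum learning, with the aim of exploring whether LLMs can effectively perceive emotions in conversational contexts. Specifically, we design emotion-sensitive prompt templates based on both explicit and implicit emotional cues to better guide the model in understanding the speaker's psychological state. We build the first dedicated demonstration retrieval repository for ERC, which includes training samples from widely used datasets as well as high-quality dialogue examples generated by LLMs and manually verified. In addition, we introduce a curriculum learning strategy into the LoRA fine-tuning process: emotional shifts between same-speaker and different-speaker utterances are weighted to assign a difficulty level to each dialogue sample, and the samples are then arranged in an easy-to-hard training sequence. Experimental results on two benchmark datasets, IEMOCAP and MELD, show that our method achieves new state-of-the-art (SOTA) performance, demonstrating the effectiveness and generalizability of the approach in improving LLM-based emotional understanding.

Updated: 2025-11-24 01:17:15

Categories: cs.AI

Download: http://arxiv.org/abs/2511.07061v3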
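
The curriculum step in the abstract above (weight emotional shifts between same-speaker and different-speaker utterances, then order dialogues from easy to hard) can be sketched as follows. The abstract does not give the scoring formula, so the shift definitions, the weights `w_same` and `w_diff`, and the toy dialogues are all assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

Utterance = Tuple[str, str]  # (speaker, emotion label)


def difficulty(dialogue: List[Utterance],
               w_same: float = 1.0, w_diff: float = 0.5) -> float:
    """Weighted count of emotional shifts in one dialogue (hypothetical).

    same-speaker shift: a speaker's emotion differs from that speaker's
    previous utterance.
    different-speaker shift: the emotion differs from the immediately
    preceding utterance, which was spoken by someone else.
    """
    last_by_speaker: Dict[str, str] = {}
    previous: Optional[Utterance] = None
    same_shifts = diff_shifts = 0
    for speaker, emotion in dialogue:
        if speaker in last_by_speaker and last_by_speaker[speaker] != emotion:
            same_shifts += 1
        if previous is not None and previous[0] != speaker and previous[1] != emotion:
            diff_shifts += 1
        last_by_speaker[speaker] = emotion
        previous = (speaker, emotion)
    return w_same * same_shifts + w_diff * diff_shifts


dialogues = {
    "volatile": [("A", "happy"), ("B", "angry"), ("A", "sad"), ("B", "angry")],
    "calm": [("A", "neutral"), ("B", "neutral"), ("A", "neutral")],
    "mild": [("A", "happy"), ("B", "happy"), ("A", "sad")],
}

# Easy-to-hard ordering for the fine-tuning schedule.
curriculum = sorted(dialogues, key=lambda name: difficulty(dialogues[name]))
print(curriculum)  # prints ['calm', 'mild', 'volatile']
```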

Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration

Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75% memory savings and $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on NVIDIA RTX 4090 demonstrates that Low-Rank GEMM becomes the fastest approach for matrices $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.

Updated: 2025-11-24 01:13:52

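Title: Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration

Abstract: Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM, a novel approach that uses low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation reaches up to 378 TFLOPS on matrices up to $N=20480$, saving 75% of memory and delivering a $7.8\times$ speedup over PyTorch FP32 on large matrices. The system adapts automatically to hardware capabilities, choosing the best decomposition method (SVD, randomized SVD) and precision level based on matrix characteristics and the available accelerators. Comprehensive benchmarking on the NVIDIA RTX 4090 shows that Low-Rank GEMM becomes the fastest approach for matrices with $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.

Updated: 2025-11-24 01:13:52

Categories: cs.PF,cs.AI,cs.DC,cs.LG

Download: http://arxiv.org/abs/2511.18674v1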
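
The core idea in the abstract above, multiplying through low-rank factors instead of dense matrices, can be illustrated with a pure-Python toy. This is a sketch under stated assumptions, not the paper's implementation: the factors are given exactly rather than computed by SVD or randomized SVD, no FP8 or GPU kernels are involved, and the rank 256 in the flop estimate is a hypothetical choice (only $N=20480$ comes from the abstract).

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain dense product of two list-of-lists matrices."""
    inner, cols = len(y), len(y[0])
    return [[sum(row[k] * y[k][j] for k in range(inner)) for j in range(cols)]
            for row in x]


def dense_flops(n: int) -> int:
    """Multiply-add count for a dense n x n by n x n product."""
    return 2 * n ** 3


def factored_flops(n: int, r: int) -> int:
    """Cost of U_a @ ((V_a @ U_b) @ V_b) with rank-r factors."""
    core = 2 * r * n * r      # (r x n) @ (n x r) -> r x r
    right = 2 * r * r * n     # (r x r) @ (r x n) -> r x n
    expand = 2 * n * r * n    # (n x r) @ (r x n) -> n x n
    return core + right + expand


# Toy matrices that are exactly rank 1: A = u_a @ v_a, B = u_b @ v_b.
u_a, v_a = [[1.0], [2.0], [3.0]], [[1.0, 0.0, -1.0]]
u_b, v_b = [[2.0], [0.0], [1.0]], [[0.5, 1.0, 2.0]]
a, b = matmul(u_a, v_a), matmul(u_b, v_b)

dense = matmul(a, b)
low_rank = matmul(u_a, matmul(matmul(v_a, u_b), v_b))
print(dense == low_rank)  # prints True

# Hypothetical rank for the largest size quoted in the abstract.
print(dense_flops(20480) / factored_flops(20480, 256))
```

The saving comes from the order of multiplication: the small r x r core is formed first, so no intermediate product is larger than n x r until the final expansion. When the matrices are only approximately low rank, the factored product is an approximation whose error depends on the discarded singular values.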

Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning

Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in $N$ independent samples. We show, surprisingly, that training with cross-entropy (CE) loss can be misaligned with pass@N in that pass@N accuracy decreases with longer training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore, we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions; and (2) proving theorems by searching over proof trees of varying shapes. Overall, our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.

Updated: 2025-11-24 01:05:38

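Title: Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning

Abstract: Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks such as mathematical reasoning and code generation. This raises a key question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer among N independent samples. We find, surprisingly, that training with cross-entropy (CE) loss can be misaligned with pass@N, in that pass@N accuracy decreases as training runs longer. We trace this misalignment to model overconfidence induced by CE and verify experimentally our prediction that overconfidence is an obstacle to scaling test-time compute via pass@N. We also propose a principled, modified training loss that is better aligned with pass@N by limiting model confidence and restoring pass@N test performance. Our algorithm shows improved mathematical reasoning on the MATH and MiniF2F benchmarks in several scenarios: (1) answering math questions; and (2) proving theorems by searching over proof trees of varying shapes. Overall, our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.

Updated: 2025-11-24 01:05:38

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.07154v4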
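
The abstract above attributes the pass@N misalignment to overconfidence. A toy calculation makes the mechanism concrete: if a model answers a problem correctly with probability p per independent sample, its chance of at least one correct answer in N samples is 1 - (1 - p)^N. The two hypothetical models below are illustrative assumptions, not results from the paper.

```python
from typing import Sequence


def pass_at_n(success_probs: Sequence[float], n: int) -> float:
    """Expected pass@N: mean over problems of 1 - (1 - p)^N,
    where p is the per-sample probability of a correct answer."""
    return sum(1.0 - (1.0 - p) ** n for p in success_probs) / len(success_probs)


# Hypothetical models on ten problems.
# Overconfident: always right on six problems, always wrong on four.
overconfident = [1.0] * 6 + [0.0] * 4
# Less confident: right half of the time on every problem.
spread_out = [0.5] * 10

for n in (1, 4, 16):
    print(n, pass_at_n(overconfident, n), pass_at_n(spread_out, n))
```

The overconfident model wins at N = 1 (0.6 versus 0.5) but gains nothing from extra samples, because resampling a problem it gets wrong with certainty never helps. The less confident model overtakes it as N grows. This is consistent with the abstract's argument that limiting confidence during training can improve pass@N even at some cost to single-sample accuracy.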

Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others' learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks.

Updated: 2025-11-24 01:04:42

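Title: Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

Abstract: Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), in which centralized critics use global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades the learning of others. Earlier approaches mitigate CDM through value decomposition, but linear decompositions permit per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, which reintroduces CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments show that MCEM outperforms state-of-the-art methods on both continuous and discrete action benchmarks.

Updated: 2025-11-24 01:04:42

Categories: cs.LG,cs.MA

Download: http://arxiv.org/abs/2511.18671v1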

Personalized LLM Decoding via Contrasting Personal Preference

As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.

Updated: 2025-11-24 00:58:45

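Title: Personalized LLM Decoding via Contrasting Personal Preference

Abstract: As large language models (LLMs) are progressively deployed in real-world applications, personalizing them has become increasingly important. Although various approaches to LLM personalization, such as prompt-based and training-based methods, have been actively explored, the development of effective decoding-time algorithms remains largely overlooked despite their demonstrated potential. In this paper we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to use reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe on five open-ended personalized text generation tasks. Our empirical results show that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L without relying on external reward models or additional training procedures.

Updated: 2025-11-24 00:58:45

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.12109v3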

Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers

Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. In a single-seed study, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.

Updated: 2025-11-24 00:55:14

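Title: Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers

Abstract: Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight. In theory, DCR eliminates the gate-induced gradient variance inherent in stochastic replacement. In a single-seed study, DCR converges faster and aligns more closely than stochastic gating and distillation baselines on controlled attention replacement, laying a foundation for heterogeneous operator swaps.

Updated: 2025-11-24 00:55:14

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2511.18670v1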
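
The blending rule described in the abstract above can be sketched in a few lines. The abstract states only that teacher and student outputs are mixed with a deterministic, annealed weight; the cosine schedule, the direction of annealing (teacher first, student last), and all function names below are assumptions. A stochastic gate is included for contrast.

```python
import math
import random
from typing import List


def annealed_weight(step: int, total_steps: int) -> float:
    """Weight on the teacher output: 1.0 at step 0, 0.0 at the end.
    The cosine shape is an assumed schedule."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def dcr_blend(teacher_out: List[float], student_out: List[float],
              step: int, total_steps: int) -> List[float]:
    """Deterministic convex combination of the two module outputs."""
    w = annealed_weight(step, total_steps)
    return [w * t + (1.0 - w) * s for t, s in zip(teacher_out, student_out)]


def stochastic_gate(teacher_out: List[float], student_out: List[float],
                    step: int, total_steps: int,
                    rng: random.Random) -> List[float]:
    """Baseline for contrast: pass the teacher output with probability w,
    otherwise the student output. The expected output matches dcr_blend,
    but each draw is all-or-nothing, which is a source of variance."""
    w = annealed_weight(step, total_steps)
    return list(teacher_out) if rng.random() < w else list(student_out)


teacher, student = [1.0, 2.0], [3.0, 4.0]
for step in (0, 50, 100):
    print(step, dcr_blend(teacher, student, step, 100))
```

Because the blend is a fixed function of the training step, the same input gives the same output on every forward pass at that step. The gate instead switches between two different outputs from draw to draw, which is the gate-induced variance the abstract refers to.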

Equivariant Deep Equilibrium Models for Imaging Inverse Problems

Equivariant imaging (EI) enables training signal reconstruction models without requiring ground truth data by leveraging signal symmetries. Deep equilibrium models (DEQs) are a powerful class of neural networks where the output is a fixed point of a learned operator. However, training DEQs with complex EI losses requires implicit differentiation through fixed-point computations, whose implementation can be challenging. We show that backpropagation can be implemented modularly, simplifying training. Experiments demonstrate that DEQs trained with implicit differentiation outperform those trained with Jacobian-free backpropagation and other baseline methods. Additionally, we find evidence that EI-trained DEQs approximate the proximal map of an invariant prior.

Updated: 2025-11-24 00:43:54

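Title: Equivariant Deep Equilibrium Models for Imaging Inverse Problems

Abstract: Equivariant imaging (EI) makes it possible to train signal reconstruction models without ground truth data by exploiting signal symmetries. Deep equilibrium models (DEQs) are a powerful class of neural networks in which the output is a fixed point of a learned operator. However, training DEQs with complex EI losses requires implicit differentiation through fixed-point computations, which can be challenging to implement. We show that backpropagation can be implemented modularly, which simplifies training. Experiments show that DEQs trained with implicit differentiation outperform those trained with Jacobian-free backpropagation and other baseline methods. In addition, we find evidence that EI-trained DEQs approximate the proximal map of an invariant prior.

Updated: 2025-11-24 00:43:54

Categories: eess.IV,cs.LG,eess.SP

Download: http://arxiv.org/abs/2511.18667v1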
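
The difference between implicit differentiation and Jacobian-free backpropagation, which the abstract above compares, can be seen on a scalar toy problem. This is a minimal sketch, not the paper's method: the operator f(z) = tanh(theta * z + x) and every function name are assumptions chosen so that the fixed point and its gradient can be checked by finite differences.

```python
import math


def f(z: float, theta: float, x: float) -> float:
    """Toy 'learned operator'; a contraction in z whenever |theta| < 1."""
    return math.tanh(theta * z + x)


def solve_fixed_point(theta: float, x: float,
                      tol: float = 1e-12, max_iter: int = 10_000) -> float:
    """Iterate z <- f(z) until successive iterates agree to tol."""
    z = 0.0
    for _ in range(max_iter):
        z_next = f(z, theta, x)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")


def implicit_grad(theta: float, x: float) -> float:
    """dz*/dtheta from the implicit function theorem:
    (1 - df/dz)^(-1) * df/dtheta, evaluated at the fixed point."""
    z = solve_fixed_point(theta, x)
    sech2 = 1.0 - z * z  # tanh'(u) = 1 - tanh(u)^2, and tanh(u) = z here
    return (sech2 * z) / (1.0 - sech2 * theta)


def jacobian_free_grad(theta: float, x: float) -> float:
    """Jacobian-free approximation: replace (1 - df/dz)^(-1) with 1."""
    z = solve_fixed_point(theta, x)
    return (1.0 - z * z) * z


def finite_difference_grad(theta: float, x: float, h: float = 1e-6) -> float:
    """Central difference through the solver, used as a reference."""
    return (solve_fixed_point(theta + h, x)
            - solve_fixed_point(theta - h, x)) / (2.0 * h)


theta, x = 0.5, 0.3
print("implicit     :", implicit_grad(theta, x))
print("jacobian-free:", jacobian_free_grad(theta, x))
print("finite diff  :", finite_difference_grad(theta, x))
```

In this toy the implicit gradient agrees with the finite-difference reference, while the Jacobian-free value is biased because it drops the inverse term. In a real DEQ the scalar division becomes a linear solve against the operator's Jacobian.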

Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. So we present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.

Updated: 2025-11-24 00:36:49

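Title: Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

Abstract: Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, which causes accuracy loss under aggressive compression. We therefore present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across the Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.

Updated: 2025-11-24 00:36:49

Categories: cs.PF,cs.AI,cs.LG

Download: http://arxiv.org/abs/2511.15015v2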

Pilot Contamination-Aware Graph Attention Network for Power Control in CFmMIMO

Optimization-based power control algorithms are predominantly iterative with high computational complexity, making them impractical for real-time applications in cell-free massive multiple-input multiple-output (CFmMIMO) systems. Learning-based methods have emerged as a promising alternative, and among them, graph neural networks (GNNs) have demonstrated their excellent performance in solving power control problems. However, all existing GNN-based approaches assume ideal orthogonality among pilot sequences for user equipments (UEs), which is unrealistic given that the number of UEs exceeds the available orthogonal pilot sequences in CFmMIMO schemes. Moreover, most learning-based methods assume a fixed number of UEs, whereas the number of active UEs varies over time in practice. Additionally, supervised training necessitates costly computational resources for computing the target power control solutions for a large volume of training samples. To address these issues, we propose a graph attention network for downlink power control in CFmMIMO systems that operates in a self-supervised manner while effectively handling pilot contamination and adapting to a dynamic number of UEs. Experimental results show its effectiveness, even in comparison to the optimal accelerated projected gradient method as a baseline.

Updated: 2025-11-24 00:28:33

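Title: Pilot Contamination-Aware Graph Attention Network for Power Control in CFmMIMO

Abstract: Optimization-based power control algorithms are mostly iterative and have high computational complexity, which makes them impractical for real-time use in cell-free massive multiple-input multiple-output (CFmMIMO) systems. Learning-based methods have emerged as a promising alternative, and among them graph neural networks (GNNs) have shown excellent performance on power control problems. However, all existing GNN-based approaches assume ideal orthogonality among the pilot sequences of user equipments (UEs), which is unrealistic because the number of UEs exceeds the number of available orthogonal pilot sequences in CFmMIMO schemes. Moreover, most learning-based methods assume a fixed number of UEs, whereas in practice the number of active UEs varies over time. In addition, supervised training requires costly computational resources to compute the target power control solutions for a large volume of training samples. To address these issues, we propose a graph attention network for downlink power control in CFmMIMO systems that operates in a self-supervised manner while handling pilot contamination effectively and adapting to a dynamic number of UEs. Experimental results show its effectiveness, even in comparison with the optimal accelerated projected gradient method as a baseline.

Updated: 2025-11-24 00:28:33

Categories: cs.LG

Download: http://arxiv.org/abs/2506.00967v4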

Fast Escape, Slow Convergence: Learning Dynamics of Phase Retrieval under Power-Law Data

Scaling laws describe how learning performance improves with data, compute, or training time, and have become a central theme in modern deep learning. We study this phenomenon in a canonical nonlinear model: phase retrieval with anisotropic Gaussian inputs whose covariance spectrum follows a power law. Unlike the isotropic case, where dynamics collapse to a two-dimensional system, anisotropy yields a qualitatively new regime in which an infinite hierarchy of coupled equations governs the evolution of the summary statistics. We develop a tractable reduction that reveals a three-phase trajectory: (i) fast escape from low alignment, (ii) slow convergence of the summary statistics, and (iii) spectral-tail learning in low-variance directions. From this decomposition, we derive explicit scaling laws for the mean-squared error, showing how spectral decay dictates convergence times and error curves. Experiments confirm the predicted phases and exponents. These results provide the first rigorous characterization of scaling laws in nonlinear regression with anisotropic data, highlighting how anisotropy reshapes learning dynamics.

Updated: 2025-11-24 00:21:17

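Title: Fast Escape, Slow Convergence: Learning Dynamics of Phase Retrieval under Power-Law Data

Abstract: Scaling laws describe how learning performance improves with data, compute, or training time, and they have become a central theme in modern deep learning. We study this phenomenon in a canonical nonlinear model: phase retrieval with anisotropic Gaussian inputs whose covariance spectrum follows a power law. Unlike the isotropic case, in which the dynamics collapse to a two-dimensional system, anisotropy produces a qualitatively new regime in which an infinite hierarchy of coupled equations governs the evolution of the summary statistics. We develop a tractable reduction that reveals a three-phase trajectory: (i) fast escape from low alignment, (ii) slow convergence of the summary statistics, and (iii) spectral-tail learning in low-variance directions. From this decomposition we derive explicit scaling laws for the mean-squared error, showing how spectral decay determines convergence times and error curves. Experiments confirm the predicted phases and exponents. These results give the first rigorous characterization of scaling laws in nonlinear regression with anisotropic data and highlight how anisotropy reshapes learning dynamics.

Updated: 2025-11-24 00:21:17

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2511.18661v1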

Subtract the Corruption: Training-Data-Free Corrective Machine Unlearning using Task Arithmetic

Corrupted training data are ubiquitous. Corrective Machine Unlearning (CMU) seeks to remove the influence of such corruption post-training. Prior CMU typically assumes access to identified corrupted training samples (a "forget set"). However, in many real-world scenarios the training data are no longer accessible. We formalize source-free CMU, where the original training data are unavailable and, consequently, no forget set of identified corrupted training samples can be specified. Instead, we assume a small proxy (surrogate) set of corrupted samples that reflect the suspected corruption type without needing to be the original training samples. In this stricter setting, methods relying on a forget set are ineffective or narrow in scope. We introduce Corrective Unlearning in Task Space (CUTS), a lightweight weight-space correction method guided by the proxy set using task arithmetic principles. CUTS treats the clean and the corruption signal as distinct tasks. Specifically, we briefly fine-tune the corrupted model on the proxy to amplify the corruption mechanism in the weight space, compute the difference between the corrupted and fine-tuned weights as a proxy task vector, and subtract a calibrated multiple of this vector to cancel the corruption. Without access to clean data or a forget set, CUTS recovers a large fraction of the lost utility under label noise and, for backdoor triggers, nearly eliminates the attack with minimal damage to utility, outperforming state-of-the-art specialized CMU methods in the source-free setting.

Updated: 2025-11-24 00:15:46

Title: Subtract the Corruption: Training-Data-Free Corrective Machine Unlearning Using Task Arithmetic

Abstract: Corrupted training data are everywhere. Corrective Machine Unlearning (CMU) aims to eliminate the impact of such corruption after training. Previous CMU approaches typically assume access to identified corrupted training samples (a "forget set"). However, in many real-world scenarios, the training data are no longer available. We formalize source-free CMU, where the original training data are not accessible and, therefore, no forget set of identified corrupted training samples can be specified. Instead, we assume a small proxy set of corrupted samples that represent the suspected corruption type without being the original training samples. In this stricter setting, methods that depend on a forget set are ineffective or limited in scope. We propose Corrective Unlearning in Task Space (CUTS), a lightweight weight-space correction method guided by the proxy set using task arithmetic principles. CUTS treats clean and corruption signals as separate tasks. Specifically, we briefly fine-tune the corrupted model on the proxy set to amplify the corruption mechanism in the weight space, calculate the difference between the corrupted and fine-tuned weights as a proxy task vector, and subtract a calibrated multiple of this vector to cancel the corruption. Without access to clean data or a forget set, CUTS recovers a large fraction of the lost utility under label noise and, for backdoor triggers, nearly eliminates the attack with minimal damage to utility, outperforming state-of-the-art specialized CMU methods in the source-free setting.

Updated: 2025-11-24 00:15:46

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2511.18660v1

CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications

General-purpose VLMs demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed literature, and demonstrate its clinical utility versus GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion-parameter LLaVA-Next model. In a blinded, randomized trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgery consultations were assigned to either CNS-Obsidian or a HIPAA-compliant GPT-4o endpoint as a diagnostic co-pilot after consultations. Primary outcomes were diagnostic helpfulness and accuracy, assessed via user ratings and the presence of the correct diagnosis within the VLM-provided differential. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but only achieved 46.81% accuracy on human-generated questions versus GPT-4o's 65.70% (p<10^-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults (7.3% utilization). CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included the correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance despite being orders of magnitude smaller and less expensive to train. This establishes a transparent framework for scientific communities to build specialized AI models.
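
The trial percentages quoted above can be sanity-checked with simple arithmetic. In the sketch below, the totals (959 consults, 32 and 38 evaluated) come from the abstract, while the per-arm counts (13, 22, 19, 25) are an assumption: they are back-calculated from the reported percentages and are not stated in the abstract.

```python
def percent(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Reported in the abstract.
total_consults = 959
evaluated = {"CNS-Obsidian": 32, "GPT-4o": 38}

# ASSUMPTION: integer counts inferred from the reported percentages.
positive_ratings = {"CNS-Obsidian": 13, "GPT-4o": 22}
correct_in_differential = {"CNS-Obsidian": 19, "GPT-4o": 25}

utilization = percent(sum(evaluated.values()), total_consults)
print(f"utilization: {utilization:.1f}%")

for model, n in evaluated.items():
    rated = percent(positive_ratings[model], n)
    correct = percent(correct_in_differential[model], n)
    print(f"{model}: positive ratings {rated:.2f}%, correct diagnosis {correct:.2f}%")
```

With arms this small, a single consultation shifts each percentage by roughly three points, which is consistent with the non-significant p-values reported for the trial outcomes.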

Updated: 2025-11-24 00:05:26

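Title: CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications

Abstract: General-purpose VLMs demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision-making such as neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed literature, and demonstrate its clinical utility against GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion-parameter LLaVA-Next model. In a blinded, randomized trial at NYU Langone Health (August 30 to November 30, 2024), neurosurgery consultations were assigned to either CNS-Obsidian or a HIPAA-compliant GPT-4o endpoint as a diagnostic co-pilot. Primary outcomes were diagnostic helpfulness and accuracy, assessed via user ratings and the presence of the correct diagnosis within the VLM-provided differential. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but achieved only 46.81% accuracy on human-generated questions versus 65.70% for GPT-4o (p<10^-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) out of 959 total consults (7.3% utilization). CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included the correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance despite being orders of magnitude smaller and less expensive to train. This establishes a transparent framework for scientific communities to build specialized AI models.

Updated: 2025-11-24 00:05:26

Categories: cs.AI,cs.CL,cs.HC

Download: http://arxiv.org/abs/2502.19546v5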

Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion

Generating realistic sparse multi-category 3D voxel structures is difficult because of the cubic memory scaling of voxel structures and the significant class imbalance caused by sparsity. We introduce Scaffold Diffusion, a generative model designed for sparse multi-category 3D voxel structures. By treating voxels as tokens, Scaffold Diffusion uses a discrete diffusion language model to generate 3D voxel structures. We show that discrete diffusion language models can be extended beyond inherently sequential domains such as text to generate spatially coherent 3D structures. We evaluate on Minecraft house structures from the 3D-Craft dataset and demonstrate that, unlike prior baselines and an auto-regressive formulation, Scaffold Diffusion produces realistic and coherent structures even when trained on data with over 98% sparsity. We provide an interactive viewer where readers can visualize generated samples and the generation process: https://scaffold.deepexploration.org/
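
To make the "voxels as tokens" idea and the resulting class imbalance concrete, here is a minimal sketch. It is an illustration under assumptions, not the paper's tokenization: the raster-order flattening, the `AIR` id of 0, and the function names are hypothetical.

```python
from collections import Counter

AIR = 0  # assumed token id for an empty voxel


def voxels_to_tokens(grid):
    """Flatten a nested [x][y][z] grid of category ids into one token list (raster order)."""
    return [voxel for plane in grid for row in plane for voxel in row]


def sparsity(tokens):
    """Fraction of tokens that are empty voxels."""
    return tokens.count(AIR) / len(tokens)


# Toy 4x4x4 grid containing two occupied voxels of different categories.
size = 4
grid = [[[AIR] * size for _ in range(size)] for _ in range(size)]
grid[0][0][0] = 1
grid[1][2][3] = 2

tokens = voxels_to_tokens(grid)
counts = Counter(tokens)
print("sequence length:", len(tokens))  # grows as size ** 3
print("sparsity:", sparsity(tokens))
print("tokens per category:", dict(counts))
```

Doubling the grid side multiplies the sequence length by eight, and in a mostly empty grid almost every token is `AIR`, which is the imbalance a generative model has to cope with.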

Updated: 2025-11-24 00:03:47

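Title: Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion

Abstract: Generating realistic sparse multi-category 3D voxel structures is difficult because of the cubic memory scaling of voxel structures and the significant class imbalance caused by sparsity. We introduce Scaffold Diffusion, a generative model designed for sparse multi-category 3D voxel structures. By treating voxels as tokens, Scaffold Diffusion uses a discrete diffusion language model to generate 3D voxel structures. We show that discrete diffusion language models can be extended beyond inherently sequential domains such as text to generate spatially coherent 3D structures. We evaluate on Minecraft house structures from the 3D-Craft dataset and show that, unlike prior baselines and an auto-regressive formulation, Scaffold Diffusion produces realistic and coherent structures even when trained on data with over 98% sparsity. We provide an interactive viewer where readers can visualize generated samples and the generation process: https://scaffold.deepexploration.org/

Updated: 2025-11-24 00:03:47

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.00062v3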

TRAP: Targeted Redirecting of Agentic Preferences

Autonomous agentic AI systems powered by vision-language models (VLMs) are rapidly advancing toward real-world deployment, yet their cross-modal reasoning capabilities introduce new attack surfaces for adversarial manipulation that exploits semantic reasoning across modalities. Existing adversarial attacks typically rely on visible pixel perturbations or require privileged model or environment access, making them impractical for stealthy, real-world exploitation. We introduce TRAP, a novel generative adversarial framework that manipulates the agent's decision-making using diffusion-based semantic injections into the vision-language embedding space. Our method combines negative prompt-based degradation with positive semantic optimization, guided by a Siamese semantic network and layout-aware spatial masking. Without requiring access to model internals, TRAP produces visually natural images that nonetheless induce consistent selection biases in agentic AI systems. We evaluate TRAP on the Microsoft Common Objects in Context (COCO) dataset, building multi-candidate decision scenarios. Across these scenarios, TRAP consistently induces decision-level preference redirection on leading models, including LLaVA-34B, Gemma3, GPT-4o, and Mistral-3.2, significantly outperforming existing baselines such as SPSA, Bandit, and standard diffusion approaches. These findings expose a critical, generalized vulnerability: autonomous agents can be consistently misled through visually subtle, semantically guided cross-modal manipulations. Overall, our results show the need for defense strategies beyond pixel-level robustness to address semantic vulnerabilities in cross-modal decision-making. The code for TRAP is accessible on GitHub at https://github.com/uiuc-focal-lab/TRAP.
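
One simple way to quantify "preference redirection" in a multi-candidate scenario is to compare how often the agent picks the target image before and after manipulation. The sketch below is an assumption about how such a measurement could look, not the evaluation protocol of the paper; the function name and the logged choices are made-up toy values.

```python
def selection_rate(choices, target):
    """Fraction of trials in which the agent selected the target candidate."""
    if not choices:
        raise ValueError("choices must not be empty")
    return sum(1 for choice in choices if choice == target) / len(choices)


# Hypothetical logs: index of the candidate chosen in each trial (4 candidates).
target = 2
clean_choices = [0, 2, 1, 3, 0, 1, 2, 3]     # target image left unmodified
attacked_choices = [2, 2, 2, 1, 2, 2, 2, 2]  # target image manipulated

chance = 1 / 4
baseline = selection_rate(clean_choices, target)
attacked = selection_rate(attacked_choices, target)
print("chance level:", chance)
print("baseline selection rate:", baseline)
print("selection rate after manipulation:", attacked)
print("shift in preference:", attacked - baseline)
```

Comparing against both the clean baseline and the chance level separates a genuine shift in the agent's preference from a target image that was already favored.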

Updated: 2025-11-24 00:01:45

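Title: TRAP: Targeted Redirecting of Agentic Preferences

Abstract: Autonomous agentic AI systems powered by vision-language models (VLMs) are rapidly advancing toward real-world deployment, yet their cross-modal reasoning capabilities introduce new attack surfaces for adversarial manipulation that exploits semantic reasoning across modalities. Existing adversarial attacks typically rely on visible pixel perturbations or require privileged model or environment access, making them impractical for stealthy, real-world exploitation. We introduce TRAP, a novel generative adversarial framework that manipulates the agent's decision-making by injecting diffusion-based semantic content into the vision-language embedding space. Our method combines negative prompt-based degradation with positive semantic optimization, guided by a Siamese semantic network and layout-aware spatial masking. Without requiring access to model internals, TRAP produces visually natural images that nonetheless induce consistent selection biases in agentic AI systems. We evaluate TRAP on the Microsoft Common Objects in Context (COCO) dataset, building multi-candidate decision scenarios. Across these scenarios, TRAP consistently induces decision-level preference redirection on leading models, including LLaVA-34B, Gemma3, GPT-4o, and Mistral-3.2, significantly outperforming existing baselines such as SPSA, Bandit, and standard diffusion approaches. These findings expose a critical, generalized vulnerability: autonomous agents can be consistently misled through visually subtle, semantically guided cross-modal manipulations. Overall, our results show the need for defense strategies beyond pixel-level robustness to address semantic vulnerabilities in cross-modal decision-making. The code for TRAP is available on GitHub at https://github.com/uiuc-focal-lab/TRAP.

Updated: 2025-11-24 00:01:45

Categories: cs.AI

Download: http://arxiv.org/abs/2505.23518v2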